Regression techniques on a bike-sharing dataset

Source: Deep Learning on Medium

Dataset link: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

About the dataset

Bike-sharing systems are the new generation of traditional bike rentals, where the whole process of membership, rental, and return has become automatic. Through these systems, a user can easily rent a bike at one position and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic, environmental, and health issues.

Apart from their interesting real-world applications, the characteristics of the data generated by bike-sharing systems make them attractive for research. Unlike other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This turns a bike-sharing system into a virtual sensor network that can be used for sensing mobility in the city; hence, it is expected that most of the important events in the city could be detected by monitoring these data.

About the columns in the original dataset:

  • instant: record index
  • dteday: date
  • season: season (1: spring, 2: summer, 3: fall, 4: winter)
  • yr: year (0: 2011, 1: 2012)
  • mnth: month (1 to 12)
  • holiday: whether the day is a holiday or not
  • weekday: day of the week
  • workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
  • hum: normalized humidity; the values are divided by 100 (max)
  • windspeed: normalized wind speed; the values are divided by 67 (max)
  • casual: count of casual users
  • registered: count of registered users
  • cnt: count of total rental bikes, including both casual and registered
A preview of the dataset in CSV format
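The snippets that follow assume the data has been read into a pandas DataFrame named df. A minimal loading sketch (assuming the day-level file day.csv from the downloaded UCI zip sits in the working directory):

# Load the day-level data into a DataFrame (file name assumed from the UCI archive)
import pandas as pd

df = pd.read_csv('day.csv')
print(df.head())   # first few rows, matching the CSV preview above
print(df.shape)    # (number of rows, number of columns)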

Pre-Processing

Feature Selection

A dataset often contains attributes that the model does not need, so we begin with feature selection.

We will drop the ‘dteday’ and ‘instant’ columns from our dataset: the record index carries no signal, and the date is already encoded by the year, month, and weekday columns.

# Feature selection
df = df.drop(columns=['dteday', 'instant'])

Normalizing the data using the sklearn package in Python

Normalization reference image
# Normalizing the data
from sklearn import preprocessing

x = df.drop(['cnt'], axis=1)
y = df['cnt']
x = preprocessing.normalize(x)  # rescales each sample (row) to unit L2 norm
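Note that preprocessing.normalize rescales each row (each sample) to unit length rather than scaling each column to a fixed range. A quick toy check makes this visible:

# normalize() uses the L2 norm per row by default: each row is divided by its length
import numpy as np

row = np.array([[3.0, 4.0]])
print(preprocessing.normalize(row))   # [[0.6 0.8]] because ||(3, 4)|| = 5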

We then split the data into training and testing sets so that it is ready for any model or algorithm.

# Splitting the dataset
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
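A quick sanity check that the 70/30 split came out as expected:

# With test_size=0.3, roughly 30% of the rows land in the test set
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)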

Regression Techniques

Now we will apply different regression techniques using existing libraries in Python.

Linear Regression

Linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables.
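In symbols, with p explanatory variables, ordinary least squares fits

ŷ = b0 + b1·x1 + b2·x2 + … + bp·xp

by choosing the coefficients b0, …, bp that minimize the sum of squared residuals Σ(y − ŷ)².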

# Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

linearRegressor = LinearRegression()
linearRegressor.fit(x_train, y_train)
y_predicted = linearRegressor.predict(x_test)

mse = mean_squared_error(y_test, y_predicted)
r = r2_score(y_test, y_predicted)
mae = mean_absolute_error(y_test, y_predicted)
print("Mean Squared Error:", mse)
print("R² score:", r)
print("Mean Absolute Error:", mae)
Output for linear regression

Polynomial Regression

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth degree polynomial in x.
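For a single variable x and degree n, the fitted model takes the form

y = b0 + b1·x + b2·x² + … + bn·xⁿ + ε

PolynomialFeatures generates these powers (and, with several inputs, the cross terms between them), after which an ordinary LinearRegression fits the coefficients, as in the code below.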

# Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures

polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x_train)
x_poly_test = polynomial_features.transform(x_test)  # reuse the transformer fitted on the training set

model = LinearRegression()
model.fit(x_poly, y_train)
y_predicted_p = model.predict(x_poly_test)

mse = mean_squared_error(y_test, y_predicted_p)
r = r2_score(y_test, y_predicted_p)
mae = mean_absolute_error(y_test, y_predicted_p)
print("Mean Squared Error:", mse)
print("R² score:", r)
print("Mean Absolute Error:", mae)
Output for Polynomial Regression

Decision Tree

A decision tree builds regression or classification models in the form of a tree structure: it breaks the dataset down into smaller and smaller subsets while an associated tree of decision rules is incrementally developed.

# Decision Tree
from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(x_train, y_train)
y_predicted_d = regressor.predict(x_test)

mse = mean_squared_error(y_test, y_predicted_d)
r = r2_score(y_test, y_predicted_d)
mae = mean_absolute_error(y_test, y_predicted_d)
print("Mean Squared Error:", mse)
print("R² score:", r)
print("Mean Absolute Error:", mae)
Output for Decision Tree Regressor
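One advantage of tree models is that the fitted regressor exposes how much each feature contributed to its splits. A short optional inspection (feature names are taken from df again, since normalize() returned a plain array):

# Rank features by their importance in the fitted tree
import numpy as np

feature_names = df.drop(['cnt'], axis=1).columns
importances = regressor.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(feature_names[idx], format(importances[idx], '.3f'))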

Random Forest

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction of the individual trees (regression).

# Random Forest
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=0)  # cnt is a continuous count, so we fit a regressor, not a classifier
rf.fit(x_train, y_train)
y_predicted_r = rf.predict(x_test)

mse = mean_squared_error(y_test, y_predicted_r)
r = r2_score(y_test, y_predicted_r)
mae = mean_absolute_error(y_test, y_predicted_r)
print("Mean Squared Error:", mse)
print("R² score:", r)
print("Mean Absolute Error:", mae)
Output for Random Forest

Rather than running each remaining algorithm separately, we can loop over a list of models and display all the results at once using PrettyTable.

Code:

# Comparing several regressors in one loop
from sklearn.linear_model import Lasso, ElasticNet, Ridge, SGDRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Model", "Mean Squared Error", "R² score", "Mean Absolute Error"]

models = [
    LinearRegression(),
    DecisionTreeRegressor(random_state=0),
    RandomForestRegressor(random_state=0, n_estimators=300),
    SGDRegressor(max_iter=1000, tol=1e-3),
    Lasso(alpha=0.1),
    ElasticNet(random_state=0),
    Ridge(alpha=0.5),
    BaggingRegressor(),
    BaggingRegressor(KNeighborsRegressor(), max_samples=0.5, max_features=0.5),  # a regression base estimator for bagging
]

for model in models:
    model.fit(x_train, y_train)
    y_res = model.predict(x_test)
    mse = mean_squared_error(y_test, y_res)
    score = model.score(x_test, y_test)  # score() returns R² for regressors
    mae = mean_absolute_error(y_test, y_res)
    table.add_row([type(model).__name__, format(mse, '.2f'), format(score, '.2f'), format(mae, '.2f')])

print(table)
Final Output of all the algorithms
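As a small optional tweak, PrettyTable can sort the rows so the best model is easier to spot, for example by the R² column before printing:

# Optional: sort rows by the R² column, best first
# (the values were added as formatted strings, so for strictly numeric
# sorting, add the raw floats to the table instead)
table.sortby = "R² score"
table.reversesort = True
print(table)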

Conclusion

From the above output, we see that ElasticNet works best for our dataset.
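ElasticNet combines the L1 penalty of Lasso with the L2 penalty of Ridge, and its two main hyperparameters (alpha and l1_ratio) were left at fixed values above. As a possible follow-up (a sketch, not part of the original experiment), scikit-learn's ElasticNetCV can pick them by cross-validation:

# Let cross-validation choose alpha and l1_ratio for ElasticNet
from sklearn.linear_model import ElasticNetCV

enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0)
enet_cv.fit(x_train, y_train)
print("Best alpha:", enet_cv.alpha_)
print("Best l1_ratio:", enet_cv.l1_ratio_)
print("Test R²:", enet_cv.score(x_test, y_test))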


And if you want to get the whole code for the algorithms tested above, you can find it here 🙂

Blog by:

Naman Bansal

GitHub | LinkedIn | Twitter