This blog post is a sequel to our previous ** part** where we have visualized our data to a greater extent and got hang of it. In this part, we will see how we can leverage machine learning algorithms like

*Linear Regression*,

*Random Forest*and

*Gradient Boost*to get into top 10 percentile in Kaggle leaderboard. The model building part is split into following topics:

- Missing values analysis in windspeed.
*RMSLE*scorer.- Linear Regression.
- Regularization.
- Ensemble Models.
- Further Improvements

Only key code snippets are displayed as part of this blog and for complete analysis refer to the Ipython notebook in the ** Github** link.

**Missing Values Analysis in Windspeed**

As I mentioned earlier in the previous part, luckily we have no missing values in our dataset. But “windspeed” attribute has lot of 0 entries which is bit suspicious. Following is the simple visualization to depict frequency of “windspeed” values in data.

A ** discussion** in kaggle gives lot of information on this particular topic. At a high level we can make two or three simple inference about 0 entries in “windspeed” as follows:

- It can actually be 0 at these points.
- It is too low to be measured, for example varying from 0 to 5.
- All zeros or part of them are nothing but NAs.

For this blog we will consider the 0 entries as missing values and fill them using simple *Random Forest* *Classifier *model. Following is the code block for the same.

`from sklearn.ensemble import RandomForestClassifier`

wCol= ["season","weather","humidity","month","temp","year","atemp"]

dataWind0 = data[data["windspeed"]==0]

dataWindNot0 = data[data["windspeed"]!=0]

dataWindNot0["windspeed"] = dataWindNot0["windspeed"].astype("str")

rfModel_wind = RandomForestClassifier()

rfModel_wind.fit(dataWindNot0[wCol], dataWindNot0["windspeed"])

wind0Values = rfModel_wind.predict(X= dataWind0[wCol])

dataWind0["windspeed"] = wind0Values

data = dataWindNot0.append(dataWind0)

data["windspeed"] = data["windspeed"].astype("float")

data.reset_index(inplace=True)

data.drop('index',inplace=True,axis=1)

Let us see how the distribution of windspeed values look after the missing values are imputed.

**RMSLE Scorer**

One common way to evaluate regression model is through calculating *MSE* or *RMSE*. In this particular competition , the metric to evaluate our model is *Root Mean Square Logarithmic Error (RMSLE)*. *RMSLE* is particularly helpful when you want to penalize an under-predicted estimate greater than an over-predicted estimate.

Most of the Kaggle competition where we predict sales and inventory demand especially use *RMSLE *as their metric to evaluate. For example competition such as ** grupo-bimbo-inventory-demand **and

**use**

*sberbank-russian-housing-market**RMSLE*as metric.

Unfortunately sklearn metrics does not have direct implementation to calculate *RMSLE*. So let us construct custom function to perform *RMSLE* calculation.

`def rmsle(y, y_):`

log1 = np.nan_to_num(np.array([np.log(v + 1) for v in y]))

log2 = np.nan_to_num(np.array([np.log(v + 1) for v in y_]))

calc = (log1 - log2) ** 2

return np.sqrt(np.mean(calc))

We have prepared our data by filling missing values and constructed our custom RMSLE scorer. So we are now good to go for our model building experiment.

**Linear Regression**

As a first step, let us start with simple statistical technique like linear regression. It is always good to start from simple model than to try complex *machine learning* algorithms at first. Because at times features will have smooth, nearly linear dependence on the *covariates*. Then linear regression will model the dependence better than random forests, which will basically approximate a linear curve with an ugly irregular step function. A ** stackexchange** discussion gives loads of information about it.

Following is the simple code snippet that fits linear regression on our bike sharing dataset. Codes are self explanatory and the focus of this article is to cover the concepts than coding. Please feel free to drop a note in the comments if you find any challenges in understanding any part of it.

`from sklearn.linear_model import LinearRegression`

# Initialize logistic regression model

lModel = LinearRegression()

# Train the model

lModel.fit(X = X_train,y = np.log1p(y_train))

# Make predictions

preds = lModel.predict(X= X_validate)

print ("RMSLE Value: ",rmsle(Y_validate,np.exp(preds)))

Before submitting our test results we will visualize the distribution of train and test results. Kaggle has a limit on number of submission per day. (in our case it is 5 submission/day). So visualizing the distribution gives good clue on how close we have predicted our test based on our training set. From the figure it visible that the distribution of the train and test set vary considerable.

The *RMSLE* value on test set is around 1.05 and it is definitely not on par with best score(0.33) in *Kaggle* leaderboard. We can improve this score substantially by a number of ways.

- Feature Engineering
- Regularization (L1 & L2)
- Ensemble Models

We have already created few features such as “weekday”, “month”, “hour” from “datetime” attribute. And there are innumerable number of ways one can come with feature engineering steps. As a part of this blog I am not taking that in consideration and I will leave that to the imagination of users.

**Regularization**

Regularization is extremely useful in any of the following cases. We do not face all the cases mentioned below but *overfitting* and *multicollinearity* may pose some issues for us.

*Overfitting*.- Large number of variables.
- Low ratio of no of observations to no of variables.
*Multicollinearity*.

*Overfitting* refers to a model that performs well on training set by learning the detail and noise in the training data but does not generalize well on new data. Let us take our example , *RMSLE* value on training data is around 0.98 and there is no big difference from test set results. So far we do not have any *overfitting* problem in our model but at times it will be a nightmare when we fit models.

Having really large number of variable may again result in overfitting. This is because when we have more variables model becomes more complex and sometimes lowers its predicting and generalization power.* L1 regularization (Lasso Regression)* comes handy in these situation by reducing the coefficients to zero there by producing simpler models.

*L2 regularization (Ridge Regression)* is extremely helpful for third case where we have more number of attributes than observation. But we are doing fine with that as of now since we have just 12 attributes and dataset has 10886 records. *Ridge Regression* is also quite useful when there is high *collinearity* between predictor variable. It may happen in our dataset since we have highly correlated variables like temp-atemp and month-season.

So it looks like we do not have much of threat from above issues as of now. Still nothing stops us building simple regularization models and see how far we can improve our score.

`from sklearn.linear_model import Lasso`

from sklearn.model_selection import GridSearchCV

from sklearn import metrics

lasso_m_ = Lasso()

alpha = [0.001,0.005,0.01,0.3,0.1,0.3,0.5,0.7,1]

lasso_params_ = { 'max_iter':[500],'alpha':alpha}

rmsle_scorer = metrics.make_scorer(rmsle, greater_is_better=False)

grid_lasso_m = GridSearchCV( lasso_m_,

lasso_params_,

scoring = rmsle_scorer,

cv=5)

grid_lasso_m.fit(X = X_train,y = np.log1p(y_train))

preds = grid_lasso_m.predict(X= X_validate)

print (grid_lasso_m.best_params_)

print ("RMSLE Value: ",rmsle(Y_validate,np.exp(preds)))

The optimum value of regularization parameter (alpha-0.005) is obtained through grid search. The chart below visualizes RMSLE values for different alpha parameters. RMSLE value on test set is around 1.04 and has not improved from our previous. So regularization has not given any boost to our score. But let us not loose hope because when nothing goes right ensemble model always produce something out of the box for us.

**Ensemble Models**

*Ensemble models* are nothing but art of combining diverse set of individual *weak learners*(models) together to improve stability and predictive capacity of the model. *Ensemble Models *improves the performance of the model by

- Averaging out biases.
- Reducing the variance.
- Avoiding
*overfitting*.

If you are still wondering what ensemble model is all about then this ** series of articles** can get you started with it. So that’s enough introduction about ensemble model and here is a snippet on how we fit naive

*Random Forest*model on our dataset with default parameters.

`from sklearn.ensemble import RandomForestRegressor`

rfModel = RandomForestRegressor(n_estimators=100)

rfModel.fit(X = X_train,y = np.log1p(y_train))

preds = rfModel.predict(X= X_validate)

print ("RMSLE Value: ",rmsle(Y_validate,np.exp(preds)))

RMSLE value on test set is around 0.439 and now we have got something to really cheer about. Random forest has produced a significant improvement from our previous scores. Random forest also gives importance of features as a by product of model fitting. It is especially useful in understanding how features contribute in model building and also in feature selection when we have lot of features to deal with. It is very obvious from figure below that “hour” attribute plays significant role in model prediction followed by temp, weekday, year etc.

It is generally believed that *boosting* based algorithms like *Gradient Boosting *and *XGBoost *performs better than algorithms based on *bagging*. So its time to test how boosting algorithms can help us in improving our test score

`from sklearn.ensemble import GradientBoostingRegressor`

gbm = GradientBoostingRegressor(n_estimators=4000,alpha=0.01)

gbm.fit(X = X_train,y = np.log1p(y_train))

preds = gbm.predict(X= X_validate)

print ("RMSLE Value: ",rmsle(Y_validate,np.exp(preds)))

As we believed it does improve our score and the RMSLE value for Gradient Boosting is around 0.418. Though it looks like a marginal improvement from our Random Forest results, it will improve your position in Kaggle leaderboard from top 15 percentile to top 10 percentile and that’s a significant improvement.

We will again visualize how close we predicted the test result by comparing the train and test result. When we compare the chart below with the chart we created for linear regression. We can understand how close the gradient boost has captured the distribution of test set.

Now we will complete the cycle by submitting the our best fitting model results to kaggle. Kaggle requires the output file to be submitted in a specific format where the first field corresponds to datetime and the second field should be our predicted count.

`submission = pd.DataFrame({`

"datetime": datetimecol,

"count": [max(0, x) for x in np.exp(predsTest)]

})

submission.to_csv('bike_predictions_gbm_results.csv', index=False)

**Further Improvements**

The best score in kaggle leaderboard is around 0.33 which means there is lot of room for improvement in the model building activity. Following are few ways through which we can improve our score.

- Tuning parameter through grid search.
- Predicting casual and registered user count separately.
- Time Series Modeling.

One common way to improve ensemble model results is through tuning parameters. Sklearn provides a method called *GridSearchCV *which helps us perform exhaustive grid search. The following code may run several hours depending on the configuration of the system.

`from sklearn.ensemble import GradientBoostingRegressor`

from sklearn.model_selection import GridSearchCV

gbm = GradientBoostingRegressor()

param_grid = {

'n_estimators': [1000,2000,3000,4000],

'max_features': ["auto","sqrt","log2",0.6,0.8],

'min_samples_leaf':[30,40,50,60,70],

'min_samples_split':[150,200,250,300],

'max_depth' : [10,15,20,25],

'subsample': [0.4,0.6,0.8],

'learning_rate':[0.1,0.01,0.001]

}

CV_gbm=GridSearchCV(estimator=gbm,param_grid=param_grid,cv=5)

CV_gbm.fit(X = X_train,y = np.log1p(y_train))

print (CV_gbm.best_params_)

preds = CV_gbm.predict(X= X_validate)

print ("RMSLE Value: ",rmsle(Y_validate,np.exp(preds)))

Sklearn provides alternative to GridSearchCV called RandomizedSearchCV. It takes additional parameter called n_iter which gives flexibility in choosing how many combination of parameters to try out. Greater the number of n_iter better the quality of the solution (trade-off between quality and runtime ).

When we performed our bivariate analysis between count and hour in first part of our blog. It is quite obvious the usage pattern of casual and registered users varied significantly across hour. And it is understood from our feature importance result (Random Forest) that hour is the most important feature in model building. These two results explain us instead of modeling count as a dependent variable we can build two individual model considering casual and registered users as dependent variable. An article from analytics vidhya explains how we can approach the problem on those lines.

Another completely orthogonal way of approaching the problem is to do time series analysis on the data. It can visualized from the charts below that there is cyclic trend across the hour of the day and seasonal trend across the months of the year.

Approaching the problem through time series models may or may not improve our score. But it is definitely useful to put some efforts on those lines since most of the top attribute that contributes to the model are derived from “datetime” attribute. But the only disadvantage in running a time series model is that we do not have enough data to feed since the training set involve just two years of data from 2011–2012.

**End Notes**

So this article is just a humble introduction to get started with kaggle competition and also gives a head-start on how we can approach a new dataset. To know more about kaggle competitions and to understand different approaches of solving a problem have a look at kernels section in kaggle.

Got questions? **Reach out** to us.