This blog post is a sequel to the previous part, where we visualized our data in depth and got a feel for it. In this part, we will see how we can leverage machine learning algorithms like Linear Regression, Random Forest and Gradient Boosting to reach the top 10 percentile of the Kaggle leaderboard. The model building part is split into the following topics:

  • Missing values analysis in windspeed.
  • RMSLE scorer.
  • Linear Regression.
  • Regularization.
  • Ensemble Models.
  • Further Improvements.

Only key code snippets are shown in this blog; for the complete analysis, refer to the IPython notebook in the GitHub link.

Missing Values Analysis in Windspeed

As mentioned in the previous part, we are fortunate to have no missing values in our dataset. However, the “windspeed” attribute has a lot of 0 entries, which is a bit suspicious. The following simple visualization depicts the frequency of “windspeed” values in the data.

[Figure: Frequency of windspeed values before imputation]

A discussion on Kaggle gives a lot of information on this particular topic. At a high level, we can make a few simple inferences about the 0 entries in “windspeed”:

  • It can actually be 0 at these points.
  • It is too low to be measured, for example varying from 0 to 5.
  • Some or all of the zeros are really NAs.

For this blog we will treat the 0 entries as missing values and fill them using a simple Random Forest classifier. Following is the code block for the same.

from sklearn.ensemble import RandomForestClassifier

# Predictors used to impute windspeed
wCol = ["season","weather","humidity","month","temp","year","atemp"]

# Split the data into rows with and without a windspeed reading
dataWind0 = data[data["windspeed"]==0]
dataWindNot0 = data[data["windspeed"]!=0]
# Treat windspeed as a class label so the classifier can predict it
dataWindNot0["windspeed"] = dataWindNot0["windspeed"].astype("str")

# Fit on the non-zero rows and predict windspeed for the zero rows
rfModel_wind = RandomForestClassifier()
rfModel_wind.fit(dataWindNot0[wCol], dataWindNot0["windspeed"])
wind0Values = rfModel_wind.predict(X=dataWind0[wCol])

# Replace the zeros with the imputed values and stitch the data back together
dataWind0["windspeed"] = wind0Values
data = dataWindNot0.append(dataWind0)
data["windspeed"] = data["windspeed"].astype("float")
data.reset_index(inplace=True)
data.drop('index', inplace=True, axis=1)

Let us see how the distribution of windspeed values looks after the missing values are imputed.

[Figure: Frequency of windspeed values after imputation]

RMSLE Scorer

One common way to evaluate a regression model is to calculate MSE or RMSE. In this particular competition, the metric used to evaluate our model is the Root Mean Squared Logarithmic Error (RMSLE). RMSLE is particularly helpful when you want to penalize an under-predicted estimate more heavily than an over-predicted one.

Kaggle competitions where we predict sales or inventory demand typically use RMSLE as their evaluation metric; for example, grupo-bimbo-inventory-demand and sberbank-russian-housing-market both use it.

RMSLE = sqrt( (1/n) * Σᵢ ( log(pᵢ + 1) − log(aᵢ + 1) )² ), where pᵢ is the predicted count, aᵢ the actual count and n the number of observations.

Unfortunately, sklearn's metrics module does not provide RMSLE directly, so let us construct a custom function to perform the RMSLE calculation.

import numpy as np

def rmsle(y, y_):
    # Compare predictions and actuals on the log(x + 1) scale
    log1 = np.nan_to_num(np.log1p(np.array(y)))
    log2 = np.nan_to_num(np.log1p(np.array(y_)))
    calc = (log1 - log2) ** 2
    return np.sqrt(np.mean(calc))
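
As a quick sanity check (the numbers below are purely illustrative and not from the competition data), the function also makes the asymmetry mentioned above visible: under-predicting by 400 costs more than over-predicting by 400.

actual = np.array([1000.0])
print(rmsle(actual, np.array([600.0])))   # under-prediction: ~0.51
print(rmsle(actual, np.array([1400.0])))  # over-prediction:  ~0.34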

We have prepared our data by filling in the missing values and constructed our custom RMSLE scorer, so we are now good to go for our model building experiments.

Linear Regression

As a first step, let us start with a simple statistical technique like linear regression. It is always good to start with a simple model before trying complex machine learning algorithms, because at times the target has a smooth, nearly linear dependence on the covariates. In that case linear regression will model the dependence better than a random forest, which essentially approximates a linear curve with an irregular step function. A StackExchange discussion gives loads of information about this.

Following is a simple code snippet that fits linear regression on our bike sharing dataset. The code is self-explanatory, and the focus of this article is on the concepts rather than the coding. Please feel free to drop a note in the comments if you find any part of it hard to follow.

from sklearn.linear_model import LinearRegression

# Initialize linear regression model
lModel = LinearRegression()

# Train the model on the log-transformed count
lModel.fit(X=X_train, y=np.log1p(y_train))

# Make predictions on the validation set and score them on the original scale
preds = lModel.predict(X=X_validate)
print("RMSLE Value: ", rmsle(Y_validate, np.exp(preds)))

Before submitting our test results, we will visualize the distribution of the train and test predictions. Kaggle limits the number of submissions per day (in our case, 5 submissions/day), so visualizing the distribution gives a good clue about how well we have predicted the test set based on our training set. From the figure it is visible that the distributions of the train and test sets vary considerably.

[Figure: Distribution of train counts vs. test predictions for linear regression]
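
A comparison like the one above can be produced along these lines (a sketch, not the exact plotting code from the notebook; predsTest is assumed to hold the linear model's predictions on the Kaggle test set, which are produced in the notebook rather than in the snippet above):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 4))

# Distribution of the actual counts in the training data
ax1.hist(y_train, bins=50)
ax1.set_title("Train: actual count")

# Distribution of the model's predictions on the test set (back on the original scale)
ax2.hist(np.exp(predsTest), bins=50)
ax2.set_title("Test: predicted count")

plt.show()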

The RMSLE value on the test set is around 1.05, which is definitely not on par with the best score (0.33) on the Kaggle leaderboard. We can improve this score substantially in a number of ways:

  • Feature Engineering
  • Regularization (L1 & L2)
  • Ensemble Models

We have already created a few features such as “weekday”, “month” and “hour” from the “datetime” attribute, and there are innumerable ways one can come up with feature engineering steps. As part of this blog I am not taking that into consideration and will leave it to the imagination of the readers.
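
For reference, features like these can be derived with pandas' datetime accessors (a sketch, assuming the raw “datetime” column is still stored as a string):

import pandas as pd

# Parse the timestamp and pull out calendar-based features
data["datetime"] = pd.to_datetime(data["datetime"])
data["hour"] = data["datetime"].dt.hour
data["weekday"] = data["datetime"].dt.dayofweek
data["month"] = data["datetime"].dt.month
data["year"] = data["datetime"].dt.year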

Regularization

Regularization is extremely useful in any of the following cases. We do not face all of them here, but overfitting and multicollinearity may pose some issues for us.

  • Overfitting.
  • Large number of variables.
  • Low ratio of observations to variables.
  • Multicollinearity.

Overfitting refers to a model that performs well on the training set by learning the detail and noise in the training data but does not generalize well to new data. In our case, the RMSLE value on the training data is around 0.98, which is not very different from the test set result. So far we do not have an overfitting problem in our model, but at times it can be a nightmare when fitting models.
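
That train-versus-validation comparison can be made directly with the fitted linear model from above and our RMSLE scorer (a sketch):

# Compare training and validation error to check for overfitting
train_preds = lModel.predict(X=X_train)
print("Train RMSLE:", rmsle(y_train, np.exp(train_preds)))

valid_preds = lModel.predict(X=X_validate)
print("Validation RMSLE:", rmsle(Y_validate, np.exp(valid_preds)))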

Having a really large number of variables may also result in overfitting, because with more variables the model becomes more complex and can lose predictive and generalization power. L1 regularization (Lasso regression) comes in handy in these situations by shrinking some coefficients all the way to zero, thereby producing simpler models.

L2 regularization (Ridge regression) is extremely helpful in the third case, where we have more attributes than observations. We are doing fine on that front, since we have just 12 attributes and the dataset has 10886 records. Ridge regression is also quite useful when there is high collinearity between predictor variables, which may happen in our dataset since we have highly correlated variables like temp-atemp and month-season.

So it looks like we do not face much of a threat from the above issues as of now. Still, nothing stops us from building simple regularization models and seeing how far we can improve our score.

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

# Grid-search the Lasso regularization strength (alpha) using the RMSLE scorer
lasso_m_ = Lasso()
alpha = [0.001, 0.005, 0.01, 0.03, 0.1, 0.3, 0.5, 0.7, 1]
lasso_params_ = {'max_iter': [500], 'alpha': alpha}
rmsle_scorer = metrics.make_scorer(rmsle, greater_is_better=False)

grid_lasso_m = GridSearchCV(lasso_m_,
                            lasso_params_,
                            scoring=rmsle_scorer,
                            cv=5)
grid_lasso_m.fit(X=X_train, y=np.log1p(y_train))
preds = grid_lasso_m.predict(X=X_validate)
print(grid_lasso_m.best_params_)
print("RMSLE Value: ", rmsle(Y_validate, np.exp(preds)))

The optimum value of the regularization parameter (alpha = 0.005) is obtained through grid search, and the chart below visualizes RMSLE values for different alpha values. The RMSLE value on the test set is around 1.04, which is no improvement over our previous result, so regularization has not given our score any boost. But let us not lose hope, because when nothing goes right, ensemble models always produce something out of the box for us.

[Figure: RMSLE values for different alpha values in the Lasso grid search]
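
Before moving on to ensembles, note for completeness that the L2 counterpart can be fitted with exactly the same grid-search pattern; the sketch below mirrors the Lasso code above (the alpha grid here is only an illustration, not tuned values from the notebook):

from sklearn.linear_model import Ridge

# Same grid-search pattern as for Lasso, but with Ridge (L2) regularization
ridge_m_ = Ridge()
ridge_params_ = {'max_iter': [500], 'alpha': [0.01, 0.1, 1, 10, 100]}

grid_ridge_m = GridSearchCV(ridge_m_,
                            ridge_params_,
                            scoring=rmsle_scorer,
                            cv=5)
grid_ridge_m.fit(X=X_train, y=np.log1p(y_train))
preds = grid_ridge_m.predict(X=X_validate)
print("RMSLE Value: ", rmsle(Y_validate, np.exp(preds)))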

Ensemble Models

Ensemble modeling is the art of combining a diverse set of individual weak learners (models) to improve the stability and predictive capacity of the final model. Ensemble models improve performance by:

  • Averaging out biases.
  • Reducing the variance.
  • Avoiding overfitting.

If you are still wondering what ensemble models are all about, then this series of articles can get you started. That is enough introduction to ensemble models; here is a snippet showing how we fit a naive Random Forest model on our dataset with mostly default parameters.

from sklearn.ensemble import RandomForestRegressor

# Fit a Random Forest with 100 trees on the log-transformed count
rfModel = RandomForestRegressor(n_estimators=100)
rfModel.fit(X=X_train, y=np.log1p(y_train))

# Score the predictions on the original scale
preds = rfModel.predict(X=X_validate)
print("RMSLE Value: ", rmsle(Y_validate, np.exp(preds)))

The RMSLE value on the test set is around 0.439, so now we have something to really cheer about: Random Forest has produced a significant improvement over our previous scores. Random Forest also gives the importance of features as a by-product of model fitting. This is especially useful for understanding how features contribute to the model, and for feature selection when we have a lot of features to deal with. It is very obvious from the figure below that the “hour” attribute plays a significant role in the prediction, followed by temp, weekday, year, etc.
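
For reference, the importances behind such a chart can be pulled straight from the fitted model (a sketch; it assumes X_train is a pandas DataFrame so the column names are available):

import pandas as pd

# Pair each feature with its importance and sort from most to least important
featureImportance = pd.Series(rfModel.feature_importances_,
                              index=X_train.columns).sort_values(ascending=False)
print(featureImportance)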

[Figure: Feature importances from the Random Forest model]

It is generally believed that boosting-based algorithms like Gradient Boosting and XGBoost perform better than bagging-based algorithms, so it is time to test how boosting can help us improve our test score.

from sklearn.ensemble import GradientBoostingRegressor

# Fit a Gradient Boosting model with 4000 trees on the log-transformed count
gbm = GradientBoostingRegressor(n_estimators=4000, alpha=0.01)
gbm.fit(X=X_train, y=np.log1p(y_train))

# Score the predictions on the original scale
preds = gbm.predict(X=X_validate)
print("RMSLE Value: ", rmsle(Y_validate, np.exp(preds)))

As expected, it does improve our score: the RMSLE value for Gradient Boosting is around 0.418. Though this looks like a marginal improvement over our Random Forest results, it moves us from roughly the top 15 percentile to the top 10 percentile of the Kaggle leaderboard, and that is a significant improvement.

We will again visualize how closely we predicted the test results by comparing the train and test distributions. When we compare the chart below with the one we created for linear regression, we can see how well Gradient Boosting has captured the distribution of the test set.

[Figure: Distribution of train counts vs. test predictions for Gradient Boosting]

Now we will complete the cycle by submitting our best fitting model's results to Kaggle. Kaggle requires the output file to be in a specific format, where the first field corresponds to datetime and the second field is our predicted count.

import pandas as pd

# Build the submission file: datetime plus predicted count (clipped at zero)
submission = pd.DataFrame({
    "datetime": datetimecol,
    "count": [max(0, x) for x in np.exp(predsTest)]
})
submission.to_csv('bike_predictions_gbm_results.csv', index=False)

Further Improvements

The best score on the Kaggle leaderboard is around 0.33, which means there is a lot of room for improvement in the model building activity. Following are a few ways through which we can improve our score.

  • Tuning parameter through grid search.
  • Predicting casual and registered user count separately.
  • Time Series Modeling.

One common way to improve ensemble model results is by tuning the parameters. Sklearn provides GridSearchCV, which helps us perform an exhaustive grid search. The following code may run for several hours depending on the configuration of the system.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Exhaustive grid search over the main Gradient Boosting parameters
gbm = GradientBoostingRegressor()
param_grid = {
    'n_estimators': [1000, 2000, 3000, 4000],
    'max_features': ["auto", "sqrt", "log2", 0.6, 0.8],
    'min_samples_leaf': [30, 40, 50, 60, 70],
    'min_samples_split': [150, 200, 250, 300],
    'max_depth': [10, 15, 20, 25],
    'subsample': [0.4, 0.6, 0.8],
    'learning_rate': [0.1, 0.01, 0.001]
}

CV_gbm = GridSearchCV(estimator=gbm, param_grid=param_grid, cv=5)
CV_gbm.fit(X=X_train, y=np.log1p(y_train))
print(CV_gbm.best_params_)

preds = CV_gbm.predict(X=X_validate)
print("RMSLE Value: ", rmsle(Y_validate, np.exp(preds)))

Sklearn provides an alternative to GridSearchCV called RandomizedSearchCV. It takes an additional parameter called n_iter, which gives the flexibility to choose how many parameter combinations to try out; the greater the value of n_iter, the better the quality of the solution (a trade-off between quality and runtime).
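
A minimal sketch of the randomized variant, reusing the param_grid defined above (n_iter=20 is an arbitrary illustrative choice, not a tuned value):

from sklearn.model_selection import RandomizedSearchCV

# Sample 20 random parameter combinations instead of trying the full grid
random_gbm = RandomizedSearchCV(estimator=GradientBoostingRegressor(),
                                param_distributions=param_grid,
                                n_iter=20,
                                cv=5,
                                random_state=42)
random_gbm.fit(X=X_train, y=np.log1p(y_train))
print(random_gbm.best_params_)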

When we performed the bivariate analysis between count and hour in the first part of this blog, it was quite obvious that the usage patterns of casual and registered users vary significantly across the hour of the day. And we understood from our feature importance results (Random Forest) that hour is the most important feature in the model. Together, these two results suggest that instead of modeling count as the dependent variable, we can build two individual models with casual and registered users as the dependent variables, as sketched below. An article from Analytics Vidhya explains how to approach the problem along those lines.
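
A minimal sketch of that idea, assuming y_train_casual and y_train_registered hold the casual and registered counts for the training rows (these names are hypothetical, not from the notebook):

# Fit one model per user type on the log-transformed counts
gbm_casual = GradientBoostingRegressor(n_estimators=4000)
gbm_casual.fit(X=X_train, y=np.log1p(y_train_casual))

gbm_registered = GradientBoostingRegressor(n_estimators=4000)
gbm_registered.fit(X=X_train, y=np.log1p(y_train_registered))

# Total count is the sum of the two back-transformed predictions
preds = np.exp(gbm_casual.predict(X_validate)) + np.exp(gbm_registered.predict(X_validate))
print("RMSLE Value: ", rmsle(Y_validate, preds))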

Another, completely orthogonal way of approaching the problem is to do a time series analysis of the data. It can be seen from the charts below that there is a cyclic trend across the hours of the day and a seasonal trend across the months of the year.

[Figure: Hourly and monthly trends in rental count]

Approaching the problem with time series models may or may not improve our score, but it is definitely worth putting some effort in that direction, since most of the top contributing attributes are derived from the “datetime” attribute. The only disadvantage of running a time series model is that we do not have much data to feed it, since the training set covers just two years, 2011–2012.
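
As a starting point for such an analysis, a seasonal decomposition of the hourly counts can make the daily cycle explicit. This is a sketch only: `train` is assumed to be the training frame with a parsed “datetime” column and the “count” target, which is not how the combined `data` frame above is organized.

from statsmodels.tsa.seasonal import seasonal_decompose

# Hourly count series; gaps in the training period are filled by interpolation
hourly = train.set_index("datetime")["count"].resample("H").mean().interpolate()

# period=24 targets the daily cycle visible in the hourly usage pattern
decomposition = seasonal_decompose(hourly, model="additive", period=24)
decomposition.plot()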

End Notes

This article is just a humble introduction to getting started with Kaggle competitions, and it also gives a head start on how to approach a new dataset. To learn more about Kaggle competitions and to understand different approaches to solving a problem, have a look at the Kernels section on Kaggle.
