Project completed in week 3 (12.10.-16.10.20) of the Data Science Bootcamp at Spiced Academy in Berlin.
I really enjoyed this week’s project for two main reasons: it involved time series analysis and it solved a real-world practical business problem.
Exploratory analysis
The dataset includes 12 variables, of which 3 are dependent, and 1 is actually the one I had to predict: bike count. I started by extracting the month, day, weekday, and hour from the datetime feature, so I ended up with 15 variables to help me predict the bike count.
It’s a common assumption that the more variables you have in a dataset and include in a model, the better the predictions. But usually (as in this case) it’s exactly the contrary. The more variables you have, the higher the chance that some of them will be correlated with each other (multicollinearity) or just be irrelevant. Also, some variables that seem useless independently might be useful combined. Selecting what features to include in the model can be quite tricky, but a look at the correlation matrix or the sklearn.feature_selection
module can help!
In the bike dataset, I noticed a high correlation between temperature, season, windspeed, season, and month, which was not surprising. I used this information, as well as the Recursive Feature Elimination (RFE), to select the most relevant variables for the bike count prediction. In the end, I included only five features: temperature, hour, weather, weekday, and holiday.
Feature engineering
After selecting the features, I had to do a bit of feature engineering to prepare them for Linear Regression. I tried both with scaling and polynomial expansion. For the Random Forest models, this was not necessary, since the models do feature engineering themselves. The feature engineering part was very important also for fitting the evaluation score used in the Kaggle competition, which in this case was Root Mean Squared Log Error (RMSLE). In order to optimize my models against the RMSLE, I took the logarithm of the target variable(bike count).
Regression models
Once the dirty work was done, I trained and tested four different regression models:
- RFR1: Random Forest Regressor with default hyperparameters
- RFR2: Random Forest Regressor with the best hyperparameters found with GridSearchCV
- RFR3: Random Forest Regressor with hyperparamenters recommended by TPOT
- GBR: Gradient Boost Regressor
- SVR: Support Vector Regressor
- LR: Linear Regression
RFR1 | RFR2 | RFR3 | GBR | SVR | LR | |
RMSLE | 0.49 | 0.47 | 0.48 | 0.10 | 0.20 | 0.24 |
kaggle | – | 0.49 | – | – | – | – |
To get an idea, the best kaggle score on this competition is 0.33, so I think my 0.49 is not that bad. But of course, as always, there’s room for improvement and I might try to do just that over the weekend.
Friday Lightning Talk
At our weekly project review, I talked about the Random Forest models that I implemented, explained the choice of hyperparameters and the different results I got. As usual, listening to other people’s talks gave me more ideas on how to improve my models or other things to try out! That’s the beauty and art of Machine Learning – there are many ways to look at data, combine information, and solve a problem!