With the modeling framework settled, the next step is to carry out the analysis.

As the previous schemas suggested, to build predictive models for the daily number of new subscribers and pass users and for the number of trips, we combine several data sources into one data frame. We then fit a simple random forest and a GAM (generalized additive model) through the caret package, a machine learning framework for R.

Here are the required libraries:
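
The exact package list is not shown in this text. Judging only from the tools named in this post, a setup along the following lines would probably be enough; treat the package set as an assumption rather than the original list.

```r
# Assumed package set: caret for the training interface, plus the
# back ends that caret calls for method = "rf" and method = "gam".
library(caret)         # train(), trainControl(), varImp()
library(randomForest)  # random forest implementation behind method = "rf"
library(mgcv)          # GAM implementation behind method = "gam"
```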

## New Subscriber / pass user

### Combining data

The base dataset comes from CitiBike System Data. For this part, I am using the “Citi Bike Daily Ridership and Membership Data”. The data is aggregated at the daily level, which should be sufficient. The script for retrieving and loading the data can be found here.

Let us begin the data wrangling:
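
As an illustration of the kind of wrangling involved, here is a minimal sketch in base R. The toy values, the column names (`new_subscriber`, `max_temp`, `cloud_cover`, and so on), the holiday date and the season mapping are all assumptions made for the example, not the real dataset.

```r
# Toy daily membership data (hypothetical column names and values).
membership <- data.frame(
  date = as.Date("2016-07-01") + 0:6,
  new_subscriber = c(120, 95, 80, 60, 130, 140, 150)
)

# Toy daily weather data (hypothetical column names and values).
weather <- data.frame(
  date = as.Date("2016-07-01") + 0:6,
  max_temp = c(29, 31, 30, 27, 28, 32, 33),
  avg_temp = c(24, 26, 25, 23, 23, 27, 28),
  precipitation = c(0, 0, 5.2, 12.1, 0, 0, 0),
  cloud_cover = c(2, 1, 6, 8, 3, 1, 0)
)

# Hypothetical holiday calendar.
holidays <- as.Date("2016-07-04")

# Join on date so that every row describes one day.
daily <- merge(membership, weather, by = "date", all.x = TRUE)

# Calendar features derived from the date.
lt <- as.POSIXlt(daily$date)
daily$month <- lt$mon + 1        # 1 = January
daily$day_of_week <- lt$wday     # 0 = Sunday, 6 = Saturday
daily$day_of_month <- lt$mday
daily$holiday <- as.integer(daily$date %in% holidays)

# Simple season lookup indexed by month number (an assumed mapping).
season_by_month <- c("winter", "winter", "spring", "spring", "spring",
                     "summer", "summer", "summer", "autumn", "autumn",
                     "autumn", "winter")
daily$season <- factor(season_by_month[daily$month])

str(daily)
```

Keeping one row per day makes it easy to add further sources later with another `merge()` on `date`.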

### Correlation exploration

Next, let’s visualize the correlation between the features:

From the plots, we can see some clear correlations between the predictors and the outcome variable. Based on the plots, the following variables will go into the model: month, season, day of the week, cloud cover, max temperature, average temperature, holiday and precipitation. By contrast, day of the month and min temperature are dropped.
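
One simple way to do this, sketched here on simulated data with hypothetical column names, is a correlation matrix plus a scatterplot matrix in base R. The original plots may well have been produced differently.

```r
set.seed(42)
n <- 200

# Simulated predictors and outcome (not the real data).
max_temp <- runif(n, 0, 35)
precipitation <- rexp(n, rate = 0.5)
day_of_month <- sample(1:28, n, replace = TRUE)
new_subscriber <- 50 + 4 * max_temp - 6 * precipitation + rnorm(n, sd = 15)

toy <- data.frame(new_subscriber, max_temp, precipitation, day_of_month)

# Pearson correlation of every variable with every other one.
cor_mat <- cor(toy)
print(round(cor_mat["new_subscriber", ], 2))

# Scatterplot matrix for a visual check of the same relationships.
pairs(toy, main = "Outcome vs. candidate predictors (simulated)")
```

A predictor whose correlation with the outcome stays close to zero, like `day_of_month` in this simulation, is a natural candidate to drop.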

### Modeling

Now we are going to model the outcome variable with the selected predictors.

Some tips for modeling with the caret package:

1. Given the model object, the caret::varImp() function prints the importance of each predictor.
2. With caret::trainControl(), we can fine-tune the training process, for example by changing the parameter search method or setting prediction bounds to avoid unexpected outcomes.
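
Both tips can be tried on a small example. The sketch below uses simulated data and a plain linear model (`method = "lm"`) only to keep it fast; the data, the column names and the choice of a lower bound of 0 are assumptions for illustration.

```r
library(caret)

set.seed(1)
n <- 150

# Simulated data with hypothetical column names.
toy <- data.frame(
  max_temp = runif(n, 0, 35),
  precipitation = rexp(n, rate = 0.5),
  noise = rnorm(n)
)
toy$new_subscriber <- 50 + 4 * toy$max_temp -
  6 * toy$precipitation + rnorm(n, sd = 10)

# Tip 2: control resampling, parameter search and prediction bounds.
ctrl <- trainControl(
  method = "cv", number = 5,     # 5-fold cross-validation
  search = "grid",               # "random" is the other option
  predictionBounds = c(0, NA)    # lower bound of 0, no upper bound
)

fit <- train(new_subscriber ~ ., data = toy, method = "lm", trControl = ctrl)

# Tip 1: rank the predictors by importance.
imp <- varImp(fit)
print(imp)
```

A lower bound of 0 makes sense for counts such as daily sign-ups, which can never be negative.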

First, the GAM model for subscribers:
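
A minimal sketch of such a fit through caret could look like this. It runs on simulated data with hypothetical column names, so the formula and settings are illustrative rather than the ones used for the real model.

```r
library(caret)
library(mgcv)   # back end used by caret for method = "gam"

set.seed(2)
n <- 200

# Simulated data: sign-ups peak at mild temperatures (nonlinear effect).
toy <- data.frame(
  max_temp = runif(n, 0, 35),
  precipitation = rexp(n, rate = 0.5)
)
toy$new_subscriber <- 100 - 0.3 * (toy$max_temp - 24)^2 -
  5 * toy$precipitation + rnorm(n, sd = 8)

# caret builds the smooth terms itself and cross-validates the fit.
gam_fit <- train(
  new_subscriber ~ max_temp + precipitation,
  data = toy,
  method = "gam",
  trControl = trainControl(method = "cv", number = 5)
)

print(gam_fit$results)
```

The GAM is a reasonable choice here because effects such as temperature are rarely linear: demand rises with warmth and then drops again on very hot days.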

Next, the random forest model for subscribers:

The following code models the number of new 24-hour passes and new 7-day passes per day, in the same way as for subscribers.
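
The random forest can be fitted through the same caret interface. The sketch below again uses simulated data and hypothetical column names; the `mtry` grid and `ntree = 200` are arbitrary choices for the example.

```r
library(caret)
library(randomForest)   # back end used by caret for method = "rf"

set.seed(3)
n <- 200

# Simulated data with hypothetical column names.
toy <- data.frame(
  max_temp = runif(n, 0, 35),
  precipitation = rexp(n, rate = 0.5),
  holiday = rbinom(n, 1, 0.05)
)
toy$new_subscriber <- 100 - 0.3 * (toy$max_temp - 24)^2 -
  5 * toy$precipitation - 20 * toy$holiday + rnorm(n, sd = 8)

rf_fit <- train(
  new_subscriber ~ .,
  data = toy,
  method = "rf",
  tuneGrid = data.frame(mtry = 1:3),   # candidate values for mtry
  ntree = 200,                         # passed on to randomForest()
  trControl = trainControl(method = "cv", number = 5)
)

print(rf_fit$bestTune)
print(varImp(rf_fit))
```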

## Number of Trips

In the post, we created a new feature: active membership.

With the predictions for new subscribers and pass users, we can build the active membership feature for the target period.

So we begin by combining the data:
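
To make the idea concrete, here is one possible way to derive such a feature. It assumes that active membership on a given day is the number of sign-ups still within their validity window (7 days for a 7-day pass, 365 days for an annual subscription). That definition, the helper name `rolling_sum` and the toy numbers are assumptions for this sketch.

```r
# Sum of the current value and the (window - 1) values before it.
rolling_sum <- function(x, window) {
  cs <- cumsum(x)
  # Cumulative sum as it stood `window` days earlier (0 before the start).
  lagged <- c(rep(0, window), cs)[seq_along(x)]
  cs - lagged
}

# Toy daily counts of newly sold 7-day passes (observed or predicted).
new_7day_pass <- c(3, 5, 2, 0, 4, 1, 6, 2, 3)

# Passes still valid on each day under the 7-day assumption.
active_7day <- rolling_sum(new_7day_pass, window = 7)
print(active_7day)
```

The same helper with `window = 365` would give the active annual subscribers, and feeding it predicted sign-ups extends the feature into the forecast period.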

### Combining data

In this part, we query the data from the previously created SQLite database. We keep only the daily-level data.

Now we prepare the data frame for modeling.
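
The aggregation to daily level can be done directly in SQL. The sketch below builds a tiny in-memory database so that it runs on its own; the table name `trips` and the columns `starttime` and `usertype` are assumptions about the schema, and the DBI and RSQLite packages are assumed to be installed.

```r
library(DBI)

# In-memory stand-in for the real SQLite file.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

trips <- data.frame(
  starttime = c("2016-07-01 08:15:00", "2016-07-01 17:40:00",
                "2016-07-01 12:05:00", "2016-07-02 09:30:00",
                "2016-07-02 10:00:00"),
  usertype = c("Subscriber", "Subscriber", "Customer",
               "Customer", "Subscriber"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "trips", trips)

# One row per day, with trips split by user type.
daily_trips <- dbGetQuery(con, "
  SELECT date(starttime)              AS day,
         SUM(usertype = 'Subscriber') AS subscriber_trips,
         SUM(usertype = 'Customer')   AS customer_trips,
         COUNT(*)                     AS total_trips
  FROM trips
  GROUP BY day
  ORDER BY day
")

dbDisconnect(con)
print(daily_trips)
```

Letting the database do the grouping means only a few rows per day reach R, instead of the full trip table.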

### Correlation exploration

Now let’s visualize the correlation between the trip counts and the predictors. The plots show trips made by pass users (customers), trips made by annual subscribers, and the sum of both.

From the plots, we can see some clear correlations between the predictors and the outcome variable. Based on the plots, the following variables will go into the model: month, season, day of the week, cloud cover, max temperature, average temperature, holiday, precipitation and active membership.

### Modeling

We need one model for annual subscriber trips and one for customer trips, although it is also possible to build a single model for total trips only.

## Summary

In this post, we built models with random forest and the generalized additive model, and did some feature engineering to create useful features: active membership, holiday, day of week, etc.

With the current models (around 80% to 90% accuracy), we can begin building the Shiny application.