New York City's Citi Bike program publishes its trip data as open data. All files are zipped and can be found on the site.
The objective is to use the dataset to explore some interesting insights into the usage patterns of NYC Citi Bike users.
This document aims to make the data retrieval reproducible.
The data retrieval pipeline is:
- Scrape the page that lists all the zip files
- Get the URLs of all the zip files
- Download the zip files and unzip them
- Construct a data frame
- Because of the large volume of data, store the data in SQLite and load only a subset (chunk) at a time
The required packages:
library(plyr)
Data system: extract, transform, load
Data extracting
The data is downloaded from citibikenyc.com. Detailed trip information is available from July 2013 to December 2014. The data includes:
- Trip Duration (seconds)
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID
- Station Lat/Long
- Bike ID
- User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
- Gender (Zero=unknown; 1=male; 2=female)
- Year of Birth
# install.packages("rvest")
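The link-extraction step of the pipeline can be sketched in base R as follows. This is a minimal illustration, not the original script: the helper name `extract_zip_urls` is hypothetical, the sample HTML and its `example.com` address are made up, and the real listing page may structure its links differently.

```r
# Sketch only: `extract_zip_urls` is a hypothetical helper, not part of any package.
# It pulls every href that ends in ".zip" out of raw HTML text.
extract_zip_urls <- function(html) {
  # Find all href="....zip" attributes in the HTML text
  matches <- gregexpr('href="[^"]+\\.zip"', html, perl = TRUE)
  hrefs <- unlist(regmatches(html, matches))
  # Strip the leading href=" and the trailing quote
  urls <- sub('"$', "", sub('^href="', "", hrefs))
  unique(urls)
}

# Made-up sample page, for illustration only
sample_html <- paste0(
  '<a href="https://example.com/201307-citibike-tripdata.zip">July 2013</a>',
  '<a href="https://example.com/about.html">About</a>',
  '<a href="https://example.com/201308-citibike-tripdata.zip">August 2013</a>'
)
zip_urls <- extract_zip_urls(sample_html)
print(zip_urls)
```

With rvest, the same idea would typically be expressed by reading the page with `read_html()` and collecting the `href` attribute of the `a` elements before filtering for `.zip`.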
Data transforming
A script retrieves the data from the site and unzips the zipped CSV files. Once the files are downloaded and unzipped, the CSV files are read with ldply
and combined into a single aggregated data frame.
Because of the size of the data, a SQLite database is created from an R script using the dplyr package
(this operation could also be done with the SQLite command-line tool).
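The download-and-combine step described above might look roughly like this in base R. The function names and the directory layout are assumptions made for illustration, and the sketch assumes every CSV file has the same columns.

```r
# Sketch only: function names and directory layout are illustrative assumptions.

# Download one zip archive and extract it into `dest_dir` (needs network access).
download_and_unzip <- function(url, dest_dir) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_path <- file.path(dest_dir, basename(url))
  download.file(url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = dest_dir)
}

# Read every CSV file in `dir` and stack the results into one data frame.
# Assumes all files share identical column names.
read_trip_csvs <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, tables)
}
```

Calling `read_trip_csvs()` on the folder of unzipped files returns one combined data frame. `plyr::ldply(files, read.csv)` performs the same read-and-stack in a single call, which is the approach the text refers to.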
Some variables are also created in order to extract as much information as possible:
- convert the start/end datetimes to POSIXct and extract “year”, “month”, “month day”, “week day”, “year day”, “hour”, “minute”, “second”, and “isoweek”
- calculate the age of the rider
- calculate the direct distance of each ride
##----------------------------------------------------
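The derived variables listed above can be sketched in base R as follows. Several things here are assumptions rather than facts from the original script: the helper names, the `"%Y-%m-%d %H:%M:%S"` timestamp format, the time zone, and the reading of "direct distance" as great-circle (haversine) distance between the start and end stations.

```r
# Sketch only: helper names, timestamp format and time zone are assumptions.

# Parse timestamps to POSIXct and split them into calendar components.
add_time_features <- function(x, tz = "America/New_York") {
  t <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = tz)
  data.frame(
    year    = as.integer(format(t, "%Y")),
    month   = as.integer(format(t, "%m")),
    mday    = as.integer(format(t, "%d")),
    wday    = as.integer(format(t, "%u")),  # ISO weekday, 1 = Monday
    yday    = as.integer(format(t, "%j")),
    hour    = as.integer(format(t, "%H")),
    minute  = as.integer(format(t, "%M")),
    second  = as.integer(format(t, "%S")),
    isoweek = as.integer(format(t, "%V"))   # ISO 8601 week number
  )
}

# Age of the rider in the year of the trip (NA when the birth year is missing).
rider_age <- function(birth_year, trip_year) {
  trip_year - suppressWarnings(as.integer(birth_year))
}

# Great-circle distance in kilometres between two lat/long points
# (haversine formula, assuming a spherical Earth of radius 6371 km).
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

features <- add_time_features("2014-07-04 08:15:30")
print(features)
```

All three helpers are vectorised, so they can be applied to whole columns of the trip data frame at once.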
Data loading & saving
Once the data is clean, we use the following code to build a SQLite database that stores the data so that it can be easily subset via dplyr.
##----------------------------------------------------
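As a rough illustration of the store-then-subset idea, the sketch below uses the DBI and RSQLite packages directly rather than the dplyr-based code of the original script. The table name, the column names (`start_hour` in particular), the helper names and the toy data are all assumptions.

```r
# Sketch only: table, column and helper names are illustrative assumptions.
library(DBI)

# Write a data frame to a SQLite table, appending if the table already exists.
store_trips <- function(con, trips, table = "trips") {
  dbWriteTable(con, table, trips, append = dbExistsTable(con, table))
  invisible(con)
}

# Pull back only the rows for one start hour instead of the full table.
trips_in_hour <- function(con, hour, table = "trips") {
  sql <- sprintf("SELECT * FROM %s WHERE start_hour = ?", table)
  dbGetQuery(con, sql, params = list(hour))
}

# In-memory database for the example; pass a file path to persist the data.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
trips <- data.frame(
  bikeid       = c(1L, 2L, 3L),
  start_hour   = c(8L, 8L, 17L),
  tripduration = c(300L, 650L, 1200L)
)
store_trips(con, trips)
morning <- trips_in_hour(con, 8L)
dbDisconnect(con)
print(morning)
```

Keeping the full table on disk and querying one slice at a time is what makes the large dataset workable in memory. With dplyr (and dbplyr) installed, `dplyr::tbl(con, "trips")` gives a lazy table on the same connection that can be filtered with the usual verbs.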
Getting weather data
To make the data more explorable, we need more information about the surrounding environment. The information used here is the weather.
Using the weatherData package, a whole year of weather data can be easily obtained.
Here is the code to construct a data frame with weather information for New York City:
# get new york city weather data for a year: KNYC (central park)