Yi's Notes

⟨Data Scientist|Photographer⟩


Currently, I am working on a project that requires visualizing polygons on a map. The problem I encountered is that, in order to visualize the result in leafletjs, I need to convert the shapefile to GeoJSON format.

The solution is to use the rgdal package to do the conversion.

The input is a shapefile, read into an sp::SpatialPolygonsDataFrame, and the output is a GeoJSON file.

# load libraries
library(rgdal)     # writeOGR
library(sp)        # CRS
library(maptools)  # readShapePoly
# read the shapefile as a SpatialPolygonsDataFrame in WGS84 lon/lat
P4S.latlon <- CRS("+proj=longlat +datum=WGS84")
shp <- readShapePoly("input/uga2/UGA_WGS_region", verbose = TRUE, proj4string = P4S.latlon)
# write GeoJSON
geojson_file <- "path/to/file.json"
writeOGR(shp, geojson_file, layer = "geojson", driver = "GeoJSON")

The ultimate use case is to render the polygons live in a shiny app (although this only works well for a limited number of polygons). Here is a function to do so:

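spToGeoJSON <- function(sp_obj) {
  # write the sp object to a temporary GeoJSON file
  temp_file <- tempfile()
  writeOGR(sp_obj, temp_file, layer = "geojson", driver = "GeoJSON")
  # read the file back as a single string, then clean up
  geojs <- paste(readLines(temp_file), collapse = " ")
  file.remove(temp_file)
  return(geojs)
}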

The function returns a single character string containing the GeoJSON content.
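
To make the shape of that string concrete, here is a minimal sketch in base R that builds a comparable GeoJSON string by hand. It does not call rgdal: the helper polygon_to_geojson and the square coordinates are made up for illustration, and the real writeOGR output will also carry the attributes of each feature.

```r
# Hypothetical helper: build a GeoJSON FeatureCollection string for one
# polygon from a two-column matrix of (longitude, latitude) pairs.
polygon_to_geojson <- function(coords) {
  # GeoJSON rings must be closed: repeat the first point at the end if needed
  if (!all(coords[1, ] == coords[nrow(coords), ])) {
    coords <- rbind(coords, coords[1, ])
  }
  # Format every vertex as "[lon,lat]" and join them into one ring
  ring <- paste(sprintf("[%s,%s]", coords[, 1], coords[, 2]), collapse = ",")
  paste0(
    '{"type":"FeatureCollection","features":[',
    '{"type":"Feature","properties":{},',
    '"geometry":{"type":"Polygon","coordinates":[[', ring, ']]}}]}'
  )
}

# Made-up unit square, just to have something to serialize
square <- cbind(lon = c(0, 1, 1, 0), lat = c(0, 0, 1, 1))
geojs <- polygon_to_geojson(square)
cat(geojs, "\n")
```

Like the result of spToGeoJSON, geojs is one character string, which is the form that is easy to hand over from the R side of a shiny app to the browser.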

What is geocoding

Geocoding is the process of turning an address or place name into geographic coordinates; reverse geocoding does the opposite.

More detailed definition: Wikipedia

Simple illustration:

Address --> Latitude,Longitude: Geocoding
Latitude,Longitude --> Address: Reverse geocoding
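
The two directions can be sketched with a toy lookup table in base R. This is only an illustration of the idea, not how a real provider works: the table places, the functions geocode_toy and reverse_geocode_toy, and the approximate coordinates are all assumptions made for the example.

```r
# Toy gazetteer: a few well-known places with approximate coordinates.
places <- data.frame(
  address = c("Eiffel Tower, Paris", "Statue of Liberty, New York"),
  lat     = c(48.8584, 40.6892),
  lon     = c(2.2945, -74.0445),
  stringsAsFactors = FALSE
)

# Geocoding: address -> (latitude, longitude)
geocode_toy <- function(address) {
  hit <- places[places$address == address, ]
  c(lat = hit$lat, lon = hit$lon)
}

# Reverse geocoding: (latitude, longitude) -> nearest known address.
# Squared difference in degrees is enough to rank such distant places.
reverse_geocode_toy <- function(lat, lon) {
  d2 <- (places$lat - lat)^2 + (places$lon - lon)^2
  places$address[which.min(d2)]
}

geocode_toy("Eiffel Tower, Paris")
reverse_geocode_toy(40.7, -74.0)
```

A real service replaces the exact string match with fuzzy address parsing and the two-row table with a worldwide database, which is why the work is usually delegated to a provider.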

Implementation

Geocoding is implemented by various providers. Here is a list:

Global: Google, Bing, OpenStreetMap, HERE, TomTom, MapQuest, OpenCage, Yahoo, ArcGIS
Country: Yandex, Geocoder.ca, Baidu
Specialized: GeoOttawa, FreeGeoIP, MaxMind, What3Words, CanadaPost, GeoNames

With these services available, the next step is to write a script that accesses a service’s API.
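
As a rough sketch of what such a script might start with, the snippet below builds a request URL for the OpenStreetMap Nominatim search endpoint using base R only. The function name build_geocode_url is made up for this example, and the choice of Nominatim is an assumption, not necessarily the provider used in the full post. Check the provider's usage policy before sending real requests.

```r
# Build a search URL for the public OpenStreetMap Nominatim endpoint.
# build_geocode_url is a hypothetical helper name.
build_geocode_url <- function(address) {
  paste0(
    "https://nominatim.openstreetmap.org/search",
    "?q=", utils::URLencode(address, reserved = TRUE),
    "&format=json&limit=1"
  )
}

url <- build_geocode_url("Eiffel Tower, Paris")
print(url)

# Fetching is left commented out because it needs network access and the
# jsonlite package; the result would contain `lat` and `lon` fields.
# res <- jsonlite::fromJSON(url)
# res[, c("lat", "lon", "display_name")]
```

Encoding the address with reserved = TRUE matters here: spaces and commas in a free-text address would otherwise produce an invalid query string.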

Read more »

Application available via StarryLab

This is the last post of the CitiBike Analysis series. Following the 5 previous posts, this one creates a data product using shiny, a web framework for R.

By using shiny, we can easily implement the modeling and visualization in a web page.

Before creating the application, a minimal design is required. In the 4th post, we sketched the first design. I have now enhanced and modified the design as follows:

I will go into detail about some shiny techniques next time, since there have been requests to cover back-end and front-end communication and shiny programming techniques.

After considering the modeling framework, the next step is to realize the analysis.

As suggested in the previous schemas, in order to build a predictive model for the daily number of trips by subscribers and pass users, we are going to combine several sources of data into one data frame. We then run a simple randomForest algorithm and a GAM via the caret package (machine learning for R).
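
The combining step can be sketched in base R as a chain of merges on a shared date column. The three small tables and their column names below are invented for illustration; the real ones come from the earlier posts. The modeling call is left as a comment, since it depends on caret and randomForest being installed.

```r
# Hypothetical daily tables; the real ones come from the earlier posts.
dates <- as.Date("2015-01-01") + 0:3
trips    <- data.frame(date = dates, n_trips = c(120, 95, 210, 180))
weather  <- data.frame(date = dates, temp_c = c(-2, 0, 5, 3),
                       rain_mm = c(0, 4, 0, 1))
holidays <- data.frame(date = dates, is_holiday = c(TRUE, FALSE, FALSE, FALSE))

# Join every source on the shared `date` key, one merge at a time
daily <- Reduce(function(a, b) merge(a, b, by = "date"),
                list(trips, weather, holidays))
print(daily)

# With caret installed, a model could then be fitted along these lines
# (left commented out; method = "rf" also needs the randomForest package):
# fit <- caret::train(n_trips ~ temp_c + rain_mm + is_holiday,
#                     data = daily, method = "rf")
```

Using Reduce keeps the code the same length no matter how many sources are added to the list.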

Here are the required libraries:

library(plyr)  # require rbind.fill
library(stringr)
library(dplyr)
library(dygraphs)
library(xts)
library(mgcv)
library(lubridate)
library(caret)
library(ggfortify)
Read more »

Recently I discovered a low-cost VPS service, so I decided to create a small lab (not so powerful) to demonstrate and use the power of the R/Python/Spark (not yet) toolbox.

RunAbove, from http://www.ovh.com/, offers VPS instances for as little as $0.004/hr, that is $2.50/mo, which is even cheaper than DigitalOcean’s cheapest VPS.

I set up a Cloud Sandbox Large instance running Ubuntu 14.04, which has:

  • 1 core
  • 4 GB RAM
  • 30 GB SSD
  • 1 TB Bandwidth*

This is sufficient to run R/Python.

Once the instance is set up, log in via ssh. Before configuring anything, update and upgrade the system first.

sudo apt-get update
sudo apt-get upgrade

PS. Server configuration and DNS redirection are not documented here.

Read more »

After constructing the hardware, the next step is to install an operating system.

Raspberry OS

There are quite a lot of operating systems to choose from for the Raspberry Pi, as we can see on the official download page:

  • NOOBS: Raspbian, with the option to net-install other OSes
  • Raspbian: Debian distribution
  • Snappy Ubuntu Core
  • PIDORA: Fedora distribution
  • OPENELEC: Media Center
  • OSMC: Media Center
  • Raspbmc: another media center

There are also other personal/3rd-party distributions:

Source: http://elinux.org/RPi_Distributions

Read more »

Motivation

Having worked on a Raspberry Pi B+, I have found that the real limitations are the memory shortage and the CPU performance.

The new Raspberry Pi 2 comes with a quad-core 900MHz ARMv7 CPU (ARM Cortex-A7) and 1GB of memory (shared with the GPU), as documented in its specification. With a microSDHC UHS-I card, we can get a read speed of 40MB/s and a write speed of around 15-20MB/s. That comes close to a low-end computer, at least.

Therefore, it is possible to build a 3-node cluster at low cost (around 200€). The real motivation is to learn how a cluster works and to do some PoC (Proof of Concept) work.

Read more »

The previous two posts illustrated the extraction and exploration of the trip data and the weather data. In this section, I will illustrate importing the data on new membership subscriptions and on public holidays.

With the information on new membership subscriptions, I can recalculate the number of “active” users each day. This variable could explain the number of trips well, especially for customer users. Public holidays could also explain some behavior that looks abnormal for a weekday.
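
One possible way to derive such an “active” user count is a rolling sum of new subscriptions over the membership length. The sketch below is an assumption about how this could be done, not the method used in the full post: the data, the helper active_users, and the 3-day membership length are all made up to keep the example readable.

```r
# Hypothetical input: new subscriptions per day
new_subs <- data.frame(
  date  = as.Date("2015-01-01") + 0:5,
  count = c(10, 5, 0, 8, 2, 4)
)

# Assumed membership length in days (short here so the effect is visible;
# an annual membership would use 365)
membership_days <- 3

# Active users on day i = subscriptions started within the last
# `window` days, day i included
active_users <- function(count, window) {
  sapply(seq_along(count), function(i) {
    sum(count[max(1, i - window + 1):i])
  })
}

new_subs$active <- active_users(new_subs$count, membership_days)
print(new_subs)
```

The count only covers subscriptions seen in the data, so the first days of the series underestimate the true number of active users.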

The libraries required are:

library(plyr)  # require rbind.fill
library(stringr)
library(dplyr)
library(xts)
library(lubridate)
library(rvest)
library(ggplot2)
library(ggfortify)
Read more »