Yi's Notes

⟨Data Scientist|Photographer⟩


In the last post, we set up JupyterHub on CentOS; a similar installation guide is available for Ubuntu-like systems.

In this post, we will discuss how to make JupyterHub work with the PySpark shell.

Step 1: create a kernel directory in the user’s home folder

mkdir ~/.ipython/kernels/pyspark

Step 2: create and edit kernel file

Create and edit the file: nano ~/.ipython/kernels/pyspark/kernel.json

Put the script below into the file:

{
  "display_name": "pySpark (Spark 1.4.1)",
  "language": "python",
  "argv": [
    "/usr/bin/python2",
    "-m",
    "IPython.kernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "<spark_dir>",
    "PYTHONPATH": "<spark_dir>/python/:<spark_dir>/python/lib/py4j-0.8.2.1-src.zip",
    "PYTHONSTARTUP": "<spark_dir>/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master spark://127.0.0.1:7077 pyspark-shell"
  }
}

Please replace <spark_dir> with the location of Spark: /opt/mapr/spark/spark-1.4.1 in my example.

Put the launch arguments in PYSPARK_SUBMIT_ARGS. The value shown here targets a standalone master; in my case, for a MapR cluster, it is --master yarn-client pyspark-shell instead.
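Before launching, the kernel spec can be sanity-checked with a few lines of Python (a sketch: check_kernel_spec is my own helper, and the embedded JSON mirrors the file from Step 2 with placeholder paths):

```python
import json

# The kernel spec from Step 2, embedded as a string for illustration
kernel_json = """
{
  "display_name": "pySpark (Spark 1.4.1)",
  "language": "python",
  "argv": ["/usr/bin/python2", "-m", "IPython.kernel", "-f", "{connection_file}"],
  "env": {"SPARK_HOME": "/opt/mapr/spark/spark-1.4.1"}
}
"""

spec = json.loads(kernel_json)  # fails loudly on malformed JSON

def check_kernel_spec(spec):
    """Return True if the spec has the fields a Jupyter kernel needs."""
    for key in ("display_name", "language", "argv"):
        if key not in spec:
            return False
    # Jupyter substitutes this placeholder with the real connection file
    return "{connection_file}" in spec["argv"]

print(check_kernel_spec(spec))  # True
```

A malformed file is the most common reason a kernel silently fails to appear in the notebook's kernel list, so parsing it once before launching saves a round of guessing.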

Step 3: launch jupyterhub and create a spark notebook with kernel spark-*.*.*

Optional Step: copy the file to every spark user

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text, much like RStudio’s rmarkdown. The project grew out of IPython, which now serves as one of Jupyter’s kernels.

Jupyterhub is a multi-user server that manages and proxies multiple instances of the single-user Jupyter notebook server.

The Ubuntu-like installation guide can be found here; the installation steps for CentOS 7 follow.


CentOS (the Community Enterprise Operating System) is based on the Red Hat distribution of Linux. Its reputation for stability makes it a first choice for enterprise-grade servers.

Here is some basic configuration to do before compiling software and other installations.

Add EPEL repo

Check here for the latest rpm.

su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm'

Install Dev Tools

yum group install "Development Tools"

R and its friends

According to this, the installation of R is as simple as yum install -y R. For RStudio and Shiny Server, consult this (selecting RedHat/CentOS) and this (selecting RedHat/CentOS), respectively.

Having switched to working with R on Mac OS X for a while, here is some documentation for configuring R on a Mac.

Install R with homebrew

Install Homebrew, the missing package manager for OS X:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Then retrieve the caskroom for R binary installation:

brew install caskroom/cask/brew-cask

Then install R:

brew install Caskroom/cask/r

R configuration

Use a browser for R help document viewing:

  1. Create a .Rprofile file in the home directory
  2. Add options(help_type = "html") to the .Rprofile

For the internationalization of R (forcing an English locale), run this code in the terminal:

defaults write org.R-project.R force.LANG en_US.UTF-8

Brief introduction

D3.js is the most widely used library for artist-level data visualization. For a basic/easy way of doing data visualization with JavaScript, see highcharts or ECharts.

Notes

jQuery-like manipulation

<html>
  <head>
    <meta charset="utf-8">
    <title>HelloWorld</title>
  </head>
  <body>
    <p>Hello World 1</p>
    <p>Hello World 2</p>
    <script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
    <script>
      d3.select("body").selectAll("p").text("http://jinyi.me");
    </script>
  </body>
</html>

Choose elements and bind the data

var body = d3.select("body"); //selection
var p1 = body.select("p"); //selection
var p = body.selectAll("p"); //selection
var svg = body.select("svg"); //selection
var rects = svg.selectAll("rect"); //selection
  • datum(): bind a single datum to the selected elements
  • data(): bind an array to the selected elements, one item per element
var dataset = ["I like dog", "I like cat", "I like snake"];
var body = d3.select("body");
var p = body.selectAll("p");

p.data(dataset)
  .text(function(d, i) {
    return d;
  });

SVG Canvas & Scales & Axis

D3.js uses SVG as its canvas to draw on.

var width = 300;  // canvas width
var height = 300; // canvas height

var svg = d3.select("body")
  .append("svg")           // add an svg element
  .attr("width", width)    // set width
  .attr("height", height); // set height

var dataset = [2.5, 2.1, 1.7, 1.3, 0.9]; // data set for the histogram

var linear = d3.scale.linear()
  .domain([0, d3.max(dataset)])
  .range([0, 250]); // linear scale to map data values onto pixels

var axis = d3.svg.axis() // axis
  .scale(linear)         // scale for the axis
  .orient("bottom")      // the axis' orientation
  .ticks(7);             // number of ticks

svg.append("g") // draw the axis inside a 'g' tag
  .attr("class", "axis")
  .attr("transform", "translate(20,130)")
  .call(axis); // call() accepts a function and passes the selection itself as the argument

var rectHeight = 25; // height of each rectangle

svg.selectAll("rect")
  .data(dataset)
  .enter()
  .append("rect") // enter() creates the missing "rect" tags
  .attr("x", 20)
  .attr("y", function(d, i) {
    return i * rectHeight;
  }) // position (x, y) of the upper-left corner
  .attr("width", function(d) {
    return linear(d); // the scale is used here
  })
  .attr("height", rectHeight - 2)
  .attr("fill", "steelblue");
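The d3.scale.linear() above is just an affine map from the data domain onto a pixel range; the same arithmetic can be sketched in Python (make_linear_scale is an illustrative helper, not part of D3):

```python
def make_linear_scale(domain, range_):
    """Mimic d3.scale.linear(): map [d0, d1] onto [r0, r1] linearly."""
    d0, d1 = domain
    r0, r1 = range_
    def scale(x):
        return r0 + (x - d0) / (d1 - d0) * (r1 - r0)
    return scale

dataset = [2.5, 2.1, 1.7, 1.3, 0.9]
linear = make_linear_scale([0, max(dataset)], [0, 250])

print(linear(2.5))  # 250.0 -> the widest bar fills the full pixel range
print(linear(0))    # 0.0
```

This is why the widest bar in the histogram spans exactly 250 pixels: the domain maximum is sent to the range maximum, and everything else scales proportionally.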

Animation

var svg = d3.select('body')
  .append('svg');
var circle1 = svg.append('circle')
  .attr("cx", 100)
  .attr("cy", 100)
  .attr("r", 45)
  .style("fill", "green"); // create a circle

circle1.transition() // start the transition; what follows describes the new state
  .duration(2000)    // duration of the transition
  .ease('linear')    // easing: linear, circle, elastic, bounce
  .attr("cx", 300)
  .style('fill', 'red')
  .attr('r', 25);

There is also delay(), which postpones the start of a transition.

User interaction

Use the function .on('eventname', function() { ... }), e.g. with 'click' as the event name.

Layout

A list of built-in layouts:

  • Bundle - apply Holten’s hierarchical bundling algorithm to edges.
  • Chord - produce a chord diagram from a matrix of relationships.
  • Cluster - cluster entities into a dendrogram.
  • Force - position linked nodes using physical simulation.
  • Hierarchy - derive a custom hierarchical layout implementation.
  • Histogram - compute the distribution of data using quantized bins.
  • Pack - produce a hierarchical layout using recursive circle-packing.
  • Partition - recursively partition a node tree into a sunburst or icicle.
  • Pie - compute the start and end angles for arcs in a pie or donut chart.
  • Stack - compute the baseline for each series in a stacked bar or area chart.
  • Tree - position a tree of nodes tidily.
  • Treemap - use recursive spatial subdivision to display a tree of nodes.


Apache Spark is a distributed in-memory cluster computing system. Many people, including me, like to use Spark from Python with IPython for data analysis.

Unfortunately, the configuration is still a little tricky at the moment.

For the complicated way, you can try this link. Otherwise, use the following Python library: https://github.com/minrk/findspark.

Steps to follow:

  • Download Spark and unzip it: wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz && tar -zxvf spark-1.5.1-bin-hadoop2.6.tgz
  • Set the environment variable SPARK_HOME to the unzipped folder; don’t forget to source your .bashrc or .zshrc.
  • Install the library with pip install findspark.
  • Get into IPython and play:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")

That’s it, go play with the SparkContext.
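Under the hood, findspark.init() essentially does what the kernel.json env variables did earlier: put Spark's Python sources and the bundled py4j archive on sys.path. A rough, illustrative sketch (init_spark_path is my own name, not findspark's actual API):

```python
import glob
import os
import sys

def init_spark_path(spark_home):
    """Roughly what findspark.init() does: expose pyspark to this interpreter."""
    # Spark ships its Python bindings under $SPARK_HOME/python
    sys.path.insert(0, os.path.join(spark_home, "python"))
    # py4j (the Java bridge) is bundled as a zip whose version varies per release
    for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")):
        sys.path.insert(0, zip_path)

init_spark_path("/tmp/fake-spark")  # after this, `import pyspark` would resolve
```

Knowing this makes the earlier kernel.json less magical: PYTHONPATH there hard-codes exactly these two entries, while findspark discovers them at runtime.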

I recently purchased a new add-on for the Raspberry Pi. After the successful publicity of sending a Sense HAT to the International Space Station, the widget is now ready for purchase: link, online shopping, french store.

To begin with the little board, all you have to do is run the following commands to install the driver:

sudo apt-get update
sudo apt-get install sense-hat
sudo pip-3.2 install pillow # used for python 3.2

A reboot of the Raspberry Pi will give you access to the board.

Troubleshooting

Cannot get sense-hat installed

This problem seems to be related to the operating system. I had HypriotOS installed, and its repository does not provide sense-hat, so I reinstalled the original Raspbian image.

Still not working

Make sure you have rebooted your Raspberry Pi. If the Sense HAT shows no LEDs after the reboot, it has been successfully installed.

Make sure you have enabled I2C via sudo raspi-config: select 8. Advanced Options - I2C - Enable.

Heart Program

With the provided examples, you can already play with the Sense HAT.

Here is a quick start with Hello World:

from sense_hat import SenseHat # import sense_hat object
sense = SenseHat()
sense.show_message("Hello world!") # show hello world

So I took it a step further, showing a heart in order to convince my GF (lol) that the purchase was worth it:

from sense_hat import SenseHat

X = (255, 0, 0)
O = (0, 0, 0)

heart = [
O, O, O, O, O, O, O, O,
O, O, X, O, O, X, O, O,
O, X, X, X, X, X, X, O,
O, X, X, X, X, X, X, O,
O, X, X, X, X, X, X, O,
O, O, X, X, X, X, O, O,
O, O, O, X, X, O, O, O,
O, O, O, O, O, O, O, O
]

sense = SenseHat()
sense.set_pixels(heart)
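Without a Sense HAT attached, the same 8×8 pixel list can still be previewed by rendering it as ASCII (a sketch: render_ascii is my own helper, not part of the sense_hat library):

```python
X = (255, 0, 0)  # lit pixel (red)
O = (0, 0, 0)    # unlit pixel

heart = [
    O, O, O, O, O, O, O, O,
    O, O, X, O, O, X, O, O,
    O, X, X, X, X, X, X, O,
    O, X, X, X, X, X, X, O,
    O, X, X, X, X, X, X, O,
    O, O, X, X, X, X, O, O,
    O, O, O, X, X, O, O, O,
    O, O, O, O, O, O, O, O,
]

def render_ascii(pixels, width=8):
    """Turn a flat list of (r, g, b) pixels into an ASCII grid preview."""
    rows = []
    for i in range(0, len(pixels), width):
        row = pixels[i:i + width]
        rows.append("".join("#" if p != (0, 0, 0) else "." for p in row))
    return "\n".join(rows)

print(render_ascii(heart))
```

This is handy for designing new 64-pixel images at a desk before pushing them to the board with set_pixels().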