This is a note from Spark Summit 2014 - Intro to Spark.
The course materials can be downloaded from here, and the slides can be found here.
The video can be found on YouTube.
Spark Setup
Download Spark and choose a pre-built version; any of them should work on Windows, Linux, and macOS.
Once unzipped, use the following commands to start spark-shell from the root of the unzipped folder.
```bash
cd spark
./bin/spark-shell
```
Once unzipped, use the following commands to start pyspark from the root of the unzipped folder.
```bash
cd spark
./bin/pyspark
```
Simple Spark App
Spark has two basic kinds of operations: transformations (lazily evaluated) and actions (which trigger the transformations to run).
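As a minimal illustration (the file name and filter condition are just examples, not from the course), a transformation builds a lazy RDD while an action forces it to be computed:

```scala
val lines  = sc.textFile("README.md")             // transformation: nothing is read yet
val errors = lines.filter(_.contains("ERROR"))    // transformation: still lazy
val n      = errors.count()                       // action: triggers the actual computation
```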
Word count
The “Hello World” of MapReduce programming is word count. The code in Scala is simple:
```scala
val f = sc.textFile("README.md") // read text file
```
Steps:
- `sc` is the Spark context, which is used to initiate all Spark transformations and actions.
- `textFile` is a method that loads a text file into Spark.
- `flatMap` is a method that maps the data line by line and returns the flattened result.
- `map` is a method that maps the data line by line and returns the corresponding result.
- `reduceByKey` is a method that performs the reduce transformation by key.
- `saveAsTextFile` is a method that performs the write-out action.
The result is in wc_out; by default Spark creates a folder containing the output split across several part files.
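Putting the steps above together, a fuller sketch of the word-count pipeline looks like this (only the first line appears in the note; the remaining lines are reconstructed from the steps, so treat them as a sketch rather than the exact course code):

```scala
val f  = sc.textFile("README.md")                 // read text file into an RDD of lines
val wc = f.flatMap(l => l.split(" "))             // split each line into words (flattened)
          .map(word => (word, 1))                 // map each word to a (word, 1) pair
          .reduceByKey(_ + _)                     // sum the counts for each word
wc.saveAsTextFile("wc_out")                       // action: write the result to wc_out
```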
Join
The join operation in Spark is done by the `join` function, and other functions exist for the inner/left/right/outer join variants.
```scala
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
```
Steps:
- create a date parser from java.text
- create the classes `Register` and `Click`
- read the data “reg.tsv” & “clk.tsv” and parse them; the key is the element `r(1)`, and the map function creates a (key, value) tuple pair
- simply apply the `join` method to `reg` with `clk` (see the sketch after this list)
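A minimal sketch of these steps, assuming a TSV layout where column 0 is the date, column 1 is the join key (`r(1)`), and the remaining columns are payload; the case class fields here are hypothetical and not taken from the course data:

```scala
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")

// Hypothetical record classes; adjust the fields to match reg.tsv / clk.tsv
case class Register(d: java.util.Date, uuid: String, custId: String)
case class Click(d: java.util.Date, uuid: String, landingPage: Int)

// Parse each TSV line into a (key, value) pair keyed on r(1)
val reg = sc.textFile("reg.tsv").map(_.split("\t")).map(
  r => (r(1), Register(format.parse(r(0)), r(1), r(2))))
val clk = sc.textFile("clk.tsv").map(_.split("\t")).map(
  c => (c(1), Click(format.parse(c(0)), c(1), c(2).trim.toInt)))

// Inner join on the shared key: an RDD of (key, (Register, Click))
val joined = reg.join(clk)
joined.take(2).foreach(println)
```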
Essentials
Note that an RDD can be persisted in memory in order to speed up repeated computations.
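For example, caching an RDD that is reused across several actions (a minimal sketch; the input file is just an example):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("README.md").flatMap(_.split(" "))

words.cache()                              // shorthand for persist(StorageLevel.MEMORY_ONLY)
// or explicitly: words.persist(StorageLevel.MEMORY_ONLY)

println(words.count())                     // first action computes and caches the RDD
println(words.distinct().count())          // later actions reuse the in-memory data
```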