This is a note on Spark Summit 2014 - Intro to Spark. The video can be found on YouTube.
Download Spark and choose a pre-built version; any of them should work on Windows/Linux/Mac.
Once unzipped, use the following command to start spark-shell from the root of the unzipped folder:
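```bash
./bin/spark-shell
```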
Similarly, use the following command to start pyspark from the root of the unzipped folder:
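```bash
./bin/pyspark
```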
Spark has two basic kinds of operations: transformations (lazily evaluated) and actions (which trigger the transformations).
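A minimal sketch to illustrate the difference (the RDD names here are just illustrative):

```scala
val nums = sc.parallelize(1 to 5) // create an RDD from a local collection
val doubled = nums.map(_ * 2)     // transformation: lazy, only builds the lineage
doubled.count()                   // action: triggers the actual computation
```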
The “Hello World” of MapReduce programming: word count. The code in Scala is simple:
val f = sc.textFile("README.md") // read text file
- `sc` is the SparkContext, which is used to initiate all Spark transformations and actions.
- `textFile` is a method to load a text file into Spark.
- `flatMap` is a method that maps the data per line and returns the flattened result.
- `map` is a method that maps the data per line and returns the corresponding result.
- `reduceByKey` is a method that performs the reduce transformation, grouped by key.
- `saveAsTextFile` is a method that performs the write-out action.
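Putting the pieces together, the full word count can be sketched as:

```scala
val f = sc.textFile("README.md")             // load the text file as an RDD of lines
val wc = f.flatMap(line => line.split(" "))  // split each line into words, flattened
          .map(word => (word, 1))            // pair each word with a count of 1
          .reduceByKey(_ + _)                // sum the counts for each word
wc.saveAsTextFile("wc_out")                  // action: triggers the job and writes to wc_out
```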
The result is in wc_out; by default Spark creates a folder that contains the results split across several part files.
The join operation in Spark is done with the `join` function (an inner join); other functions exist for the left/right/full outer join variants.
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
- create a date parser from `java.text` (the `format` line above)
- create a case class for each record type
- read the data “reg.tsv” & “clk.tsv” and parse them; the key is the element `r(1)`, and the map function creates a tuple pair of (key, value)
- simply apply the `join` function on the two keyed RDDs (a sketch of the whole example follows this list)
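A sketch of these steps; the case class names and the field layout of the two TSV files are assumptions based on the tutorial's sample data:

```scala
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")

// hypothetical case classes; the field layout is assumed from the sample data
case class Register(d: java.util.Date, uuid: String, custId: String, lat: Float, lng: Float)
case class Click(d: java.util.Date, uuid: String, landingPage: Int)

// parse each tab-separated line and key the records by the uuid in r(1)/c(1)
val reg = sc.textFile("reg.tsv").map(_.split("\t")).map(
  r => (r(1), Register(format.parse(r(0)), r(1), r(2), r(3).toFloat, r(4).toFloat)))
val clk = sc.textFile("clk.tsv").map(_.split("\t")).map(
  c => (c(1), Click(format.parse(c(0)), c(1), c(2).trim.toInt)))

// inner join on the shared key; leftOuterJoin / rightOuterJoin give the other variants
val joined = reg.join(clk)
joined.take(2).foreach(println)
```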
Note that an RDD can be persisted in memory in order to speed up repeated computations.
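For example, caching the `reg` RDD from the join sketch above before reusing it in several actions:

```scala
reg.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)
reg.count()  // first action computes the RDD and caches its partitions
reg.count()  // later actions read from the in-memory cache
```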