Configure PySpark in IPython the simple way

Apache Spark is a distributed in-memory cluster computing system. Many people, including me, like to use Spark in Python with IPython for data analysis.

Unfortunately, the configuration is still a bit tricky at the moment.

For the complicated way, you can try this link. Otherwise, use the findspark Python library as follows.

Steps to follow:

  • Download Spark and unzip it: wget <download URL> && tar -zxvf spark-1.5.1-bin-hadoop2.6.tgz
  • Set the environment variable SPARK_HOME to the unzipped folder; don’t forget to source your .bashrc or .zshrc afterwards.
  • Install the library: pip install findspark.
  • Get into IPython and play:
import findspark
findspark.init()  # locates Spark using SPARK_HOME and adds it to sys.path

import pyspark
sc = pyspark.SparkContext(appName="myAppName")
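The download and environment steps above might look like this in a shell. The download URL, Spark version, and paths are illustrative; adjust them to your setup, and use .zshrc instead of .bashrc if that is your shell:

```shell
# Download and unpack Spark (version and mirror are illustrative)
wget https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
tar -zxvf spark-1.5.1-bin-hadoop2.6.tgz

# Point SPARK_HOME at the unpacked folder and reload the shell config
echo 'export SPARK_HOME="$HOME/spark-1.5.1-bin-hadoop2.6"' >> ~/.bashrc
source ~/.bashrc

# Install the helper library
pip install findspark
```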

That’s it, go play with the SparkContext.