Running PySpark on Jupyter Notebook
Prerequisites:
1. Anaconda installation
2. Java 1.8
3. Spark 2.3
You can see jupyter_notebook_config.py has been created inside the /root/.jupyter directory.
To save and exit the editor
> Press 'Esc' > Type :wq! > Hit Enter
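For reference, a remote-access setup in jupyter_notebook_config.py commonly sets lines like the following. These values are illustrative assumptions, not taken from the original document; adjust them to your environment:

```python
# Illustrative jupyter_notebook_config.py settings (assumed values).
c.NotebookApp.ip = '0.0.0.0'        # listen on all interfaces so the EC2 public IP works
c.NotebookApp.port = 7861           # matches the port shown in the URL later in this guide
c.NotebookApp.open_browser = False  # no browser on a headless server
```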
3. Now, use the following command to run the Jupyter Notebook
Note: Don't kill this process; keep it running until you are done using your Jupyter notebook.
To stop/close your Jupyter Notebook, open this terminal and press Ctrl + C.
You can see it has given you a URL where your Jupyter notebook is running. In my case the
URL is
http://(ip-10-0-0-228.ec2.internal or 127.0.0.1):7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005
In order to open the Jupyter notebook in your browser, replace the hostname part of the URL (ip-10-0-0-228.ec2.internal or 127.0.0.1 in the example above) with the
Public IP of your EC2 Instance.
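As a quick sketch of that substitution (the public IP below is a hypothetical placeholder), the host part of the URL can be swapped programmatically while keeping the port and token intact:

```python
from urllib.parse import urlsplit, urlunsplit

def with_public_ip(url, public_ip):
    """Replace the host in a Jupyter URL, keeping port, path, and token."""
    parts = urlsplit(url)
    netloc = f"{public_ip}:{parts.port}" if parts.port else public_ip
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

url = "http://127.0.0.1:7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005"
print(with_public_ip(url, "54.0.0.1"))  # "54.0.0.1" is a hypothetical public IP
```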
4. PySpark on Jupyter
In order to run PySpark on a Jupyter notebook, you need to paste and run the following
lines of code at the top of every PySpark notebook.
import os
import sys

# Point PySpark at the Anaconda Python interpreter
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"
# Java 1.8 installation
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
# Spark 2.3 parcel installed under /opt/cloudera/parcels/
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# Make the bundled py4j and pyspark libraries importable
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
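Because each of these paths varies between installations, a small helper (my own sketch, not part of the original setup code) can flag any values that do not exist on disk before you try to start Spark:

```python
import os

def missing_paths(env_vars):
    """Return the env-var settings whose paths do not exist on this machine."""
    return {name: path for name, path in env_vars.items()
            if not os.path.exists(path)}

# Example: check the values before exporting them
settings = {
    "PYSPARK_PYTHON": "/opt/cloudera/parcels/Anaconda/bin/python",
    "JAVA_HOME": "/usr/java/jdk1.8.0_161/jre",
}
problems = missing_paths(settings)
if problems:
    print("Fix these before starting Spark:", problems)
```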
Note: The values of the above environment variables may be different in your case; we suggest
you verify them before use. Please follow the below-mentioned steps to do so.
i. Anaconda and Python
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"
Make sure Anaconda is installed in the /opt/cloudera/parcels/ directory. To verify, check
whether the Anaconda directory is present under the /opt/cloudera/parcels/ path:
ls /opt/cloudera/parcels/
ii. Java home
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
Make sure the JDK directory name matches the one installed on your machine:
ls /usr/java/
iii. Spark home
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"
The name of this Spark parcel directory (SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101) must be exactly the same as the one present under your
/opt/cloudera/parcels/ path.
Run the following command to check/verify, and replace it in case you have a different
version/distribution installed.
ls /opt/cloudera/parcels/
iv. Version of the py4j and pyspark.zip files
After you have identified your correct Spark home path, this step verifies the version of the
py4j file.
For that, run the following command (make sure to modify it according to the Spark
home identified in point iii):
ls /opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib
Once the environment variables are set, create the SparkContext:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("jupyter_Spark").setMaster("yarn-client")
sc = SparkContext(conf=conf)
sc
To stop the notebook, open the terminal/PuTTY window where the Jupyter process is running and
press Ctrl + C.
Enter 'y' to shut down the notebook server.