
Prerequisites:

1. Anaconda installation
2. Java 1.8
3. Spark 2.3

1. Verify the installations


Log in to your EC2 instance as the root user

a. Run the following command to verify the Anaconda installation


conda --version

b. Run the following command to verify the Java 8 installation


java -version

c. Run the following command to verify the Spark 2 installation


spark2-shell --version
2. Configure Jupyter (one-time process)
Jupyter Notebook is already installed on our CDH instance along with Anaconda. However, we need to
configure it before we can actually run it.

a. Run the following command to generate the Jupyter configuration file.


jupyter notebook --generate-config

You can see jupyter_notebook_config.py has been created inside the /root/.jupyter directory.

b. Allow access to your remote Jupyter server

Open the jupyter_notebook_config.py file


vi .jupyter/jupyter_notebook_config.py

Press ‘i’ to get into insert mode


Copy and paste the following two lines into the file:

c.NotebookApp.allow_origin = '*'  # allow all origins
c.NotebookApp.ip = '0.0.0.0'  # listen on all IPs

To save and exit
> Press ‘Esc’ > Type :wq! > Hit Enter
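Optionally, while you are still in the config file, you can also pin the port and disable the automatic browser launch. These are standard NotebookApp options, but verify them against your Jupyter version before relying on them; a minimal sketch:

c.NotebookApp.port = 7861  # fix the port used in the run command below
c.NotebookApp.open_browser = False  # the EC2 instance has no local browser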
3. Now, use the following command to run the Jupyter Notebook

jupyter notebook --port 7861 --allow-root

You will see that the Jupyter server has started.

Note: Don’t kill this process; keep it running until you are done using your Jupyter notebook.
To stop/close your Jupyter Notebook, open this terminal and press Ctrl + C

You can see that it has printed a URL where your Jupyter notebook is running. In my case, the
URL is

http://(ip-10-0-0-228.ec2.internal or 127.0.0.1):7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005

In order to open the Jupyter notebook in your browser, replace the hostname part in parentheses
with the Public IP of your EC2 instance.

In my case, the final URL will be-


http://3.89.129.54:7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005
Now, you can open your web browser and copy-paste your final URL to open the Jupyter
notebook. You should be able to see the Jupyter home page.

4. PySpark on Jupyter

In order to run PySpark in a Jupyter notebook, you need to paste and run the following
lines of code at the top of every PySpark notebook.

import os
import sys

os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
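Before moving on, a quick sanity check can save debugging time. The following sketch, run in the same cell (it reuses the os import above), simply confirms that each configured path exists on your instance; the paths above are examples and yours may differ, as explained below:

# Print each configured path and whether it actually exists on this instance
for var in ("PYSPARK_PYTHON", "JAVA_HOME", "SPARK_HOME", "PYLIB"):
    print(var, "->", os.environ[var], "| exists:", os.path.exists(os.environ[var]))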

Note: The values of the above environment variables may be different in your case; we suggest
you verify them before use. Please follow the steps below to do so.
i. Anaconda and Python

os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"

Make sure Anaconda is installed in the /opt/cloudera/parcels/ directory. To verify, please check
whether the Anaconda directory is present under the /opt/cloudera/parcels/ path.

ls /opt/cloudera/parcels/
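If you prefer to check from within the notebook itself, a minimal equivalent sketch (the parcel path is the same assumption as above):

import os
# True if the Anaconda parcel directory is present; adjust the path if yours differs
print(os.path.isdir("/opt/cloudera/parcels/Anaconda"))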

ii. Java Home

os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"

Run the following command to get your JAVA_HOME path


echo $JAVA_HOME

Hence /usr/java/jdk1.8.0_161/jre will be the value of our JAVA_HOME variable
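You can also read it from inside a notebook cell; note that the notebook process may not have JAVA_HOME set yet, which is exactly why we export it in the code above:

import os
# Prints the JAVA_HOME visible to this notebook process, if any
print(os.environ.get("JAVA_HOME", "JAVA_HOME not set in this environment"))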

iii. Spark Home


For the Spark home path in the following line

os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"

the name of the Spark parcel directory (SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101) must be
exactly the same as the one present under your /opt/cloudera/parcels/ directory.

Run the following command to check this, and replace the name in case you have a different
version/distribution installed.

ls /opt/cloudera/parcels/
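Alternatively, you can discover the exact parcel name from within the notebook; a minimal sketch, assuming Cloudera's usual SPARK2-* parcel naming:

import glob
# Each match is a candidate parcel; append /lib/spark2/ to build SPARK_HOME
print(glob.glob("/opt/cloudera/parcels/SPARK2-*"))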
iv. Versions of the py4j and pyspark.zip files

sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")


sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

After you have identified your correct Spark home path, this step verifies the version of the
py4j file.

For that, run the following command (make sure to modify it according to the Spark home you
identified in point iii)

ls /opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib

You should be able to see the py4j and pyspark files. Make sure you modify the code above
according to your instance.
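As with the Spark home, you can also resolve the py4j filename programmatically; a minimal sketch, assuming SPARK_HOME and PYLIB have been set as in point 4:

import glob
import os
# Find the py4j source zip shipped with this Spark parcel
print(glob.glob(os.environ["PYLIB"] + "/py4j-*-src.zip"))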
5. Test your PySpark setup

a. Open a new Jupyter Notebook


b. Copy-paste the environment variable setup that you finalized in point 4 into a cell.
(You need to do this for every PySpark notebook you create.)

c. Run the cell; you should not see any errors.


d. Now, let’s initialize the SparkContext object. Copy-paste the following code into a new cell
> Run the cell
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("jupyter_Spark").setMaster("yarn-client")
sc = SparkContext(conf=conf)
sc

You should see the SparkContext details printed as output.

This means your PySpark setup is working fine.
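As an optional further check, you can run a tiny job through the new context; a minimal sketch, assuming the sc object created above:

# Sum the numbers 0..99; expected output: 4950
rdd = sc.parallelize(range(100))
print(rdd.sum())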


6. Closing Jupyter Notebook

To stop the notebook, open the terminal/PuTTY window where the Jupyter process is running and
press Ctrl + C.
Enter ‘y’ to shut down the notebook server.
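Before shutting the server down, it is good practice to first release the YARN resources held by your notebook; a one-line sketch, assuming the sc object from point 5:

# Stop the SparkContext so YARN can reclaim the application's resources
sc.stop()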
