
Prerequisites:

1. Anaconda installation
2. Java 1.8
3. Spark 2.3

1. Verify the installations


Log in to your EC2 instance as the root user

a. Run the following command to verify the Anaconda installation


conda --version

b. Run the following command to verify the Java 8 installation


java -version

c. Run the following command to verify the Spark 2 installation


spark2-shell --version
2. Configure Jupyter (one-time process)
Jupyter Notebook is already installed on our CDH instance along with Anaconda. However, we need to
configure it before we can actually run it.

a. Run the following command to generate the Jupyter configuration file.


jupyter notebook --generate-config

You can see jupyter_notebook_config.py has been created inside the /root/.jupyter directory.

b. Allow access to your remote Jupyter server

Open the jupyter_notebook_config.py file


vi .jupyter/jupyter_notebook_config.py

Press ‘i’ to get into insert mode


Copy and paste the following two lines into the file:

c.NotebookApp.allow_origin = '*'  # allow all origins
c.NotebookApp.ip = '0.0.0.0'  # listen on all IPs

To save and exit
> Press ‘Esc’ > Type :wq! > Hit Enter
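Optionally, while you are still in the config file, you can also pin the port and disable the automatic browser launch. These are standard NotebookApp options, but verify them against your Jupyter version before relying on them; a minimal sketch:

c.NotebookApp.port = 7861  # fix the port used in the run command below
c.NotebookApp.open_browser = False  # the EC2 instance has no local browser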
3. Now, use the following command to run the Jupyter Notebook

jupyter notebook --port 7861 --allow-root

You will see that the Jupyter server has started.

Note: Don’t kill this process; keep it running until you are done using your Jupyter notebook.
To stop/close your Jupyter Notebook, open this terminal and press Ctrl + C

You can see that it has printed a URL where your Jupyter notebook is running. In my case, the
URL is

http://(ip-10-0-0-228.ec2.internal or 127.0.0.1):7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005

In order to open the Jupyter notebook in your browser, replace the hostname part in parentheses
with the Public IP of your EC2 instance.

In my case, the final URL will be-


http://3.89.129.54:7861/?token=5935c2bdf10c2212013af0b4e9e7cd98cd22ca03eeaf4005
Now, you can open your web browser and copy-paste your final URL to open the Jupyter
notebook. You should be able to see the Jupyter home page.

4. PySpark on Jupyter

In order to run PySpark in a Jupyter notebook, you need to paste and run the following
lines of code at the top of every PySpark notebook.

import os
import sys

os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
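Before moving on, a quick sanity check can save debugging time. The following sketch, run in the same cell (it reuses the os import above), simply confirms that each configured path exists on your instance; the paths above are examples and yours may differ, as explained below:

# Print each configured path and whether it actually exists on this instance
for var in ("PYSPARK_PYTHON", "JAVA_HOME", "SPARK_HOME", "PYLIB"):
    print(var, "->", os.environ[var], "| exists:", os.path.exists(os.environ[var]))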

Note: The values of the above environment variables may be different in your case; we suggest
you verify them before use. Please follow the steps below to do so.
i. Anaconda and Python

os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda/bin/python"

Make sure Anaconda is installed in the /opt/cloudera/parcels/ directory. To verify, please check
whether the Anaconda directory is present under the /opt/cloudera/parcels/ path.

ls /opt/cloudera/parcels/
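If you prefer to check from within the notebook itself, a minimal equivalent sketch (the parcel path is the same assumption as above):

import os
# True if the Anaconda parcel directory is present; adjust the path if yours differs
print(os.path.isdir("/opt/cloudera/parcels/Anaconda"))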

ii. Java Home

os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"

Run the following command to get your JAVA_HOME path


echo $JAVA_HOME

Hence /usr/java/jdk1.8.0_161/jre will be the value of our JAVA_HOME variable
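You can also read it from inside a notebook cell; note that the notebook process may not have JAVA_HOME set yet, which is exactly why we export it in the code above:

import os
# Prints the JAVA_HOME visible to this notebook process, if any
print(os.environ.get("JAVA_HOME", "JAVA_HOME not set in this environment"))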

iii. Spark Home


For the Spark home path in the following line

os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/"

the name of the Spark parcel directory (SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101) must be
exactly the same as the one present under your /opt/cloudera/parcels/ directory.

Run the following command to check this, and replace the name in case you have a different
version/distribution installed.

ls /opt/cloudera/parcels/
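Alternatively, you can discover the exact parcel name from within the notebook; a minimal sketch, assuming Cloudera's usual SPARK2-* parcel naming:

import glob
# Each match is a candidate parcel; append /lib/spark2/ to build SPARK_HOME
print(glob.glob("/opt/cloudera/parcels/SPARK2-*"))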
iv. Versions of the py4j and pyspark.zip files

sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")


sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

After you have identified your correct Spark home path, this step verifies the version of the
py4j file.

For that, run the following command (make sure to modify it according to the Spark home you
identified in point iii)

ls /opt/cloudera/parcels/SPARK2-2.3.0.cloudera2-1.cdh5.13.3.p0.316101/lib/spark2/python/lib

You should be able to see the py4j and pyspark files. Make sure you modify the code above
according to your instance.
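As with the Spark home, you can also resolve the py4j filename programmatically; a minimal sketch, assuming SPARK_HOME and PYLIB have been set as in point 4:

import glob
import os
# Find the py4j source zip shipped with this Spark parcel
print(glob.glob(os.environ["PYLIB"] + "/py4j-*-src.zip"))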
5. Test your PySpark setup

a. Open a new Jupyter Notebook


b. Copy-paste the environment variable setup that you finalized in point 4 into a cell.
(You need to do this for every PySpark notebook you create.)

c. Run the cell; you should not see any errors.


d. Now, let’s initialize the SparkContext object. Copy-paste the following code into a new cell
> Run the cell
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("jupyter_Spark").setMaster("yarn-client")
sc = SparkContext(conf=conf)
sc

You should see the SparkContext details printed as output.

This means your PySpark setup is working fine.
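As an optional further check, you can run a tiny job through the new context; a minimal sketch, assuming the sc object created above:

# Sum the numbers 0..99; expected output: 4950
rdd = sc.parallelize(range(100))
print(rdd.sum())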


6. Closing Jupyter Notebook

To stop the notebook, open the terminal/PuTTY window where the Jupyter process is running and
press Ctrl + C.
Enter ‘y’ to shut down the notebook server.
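Before shutting the server down, it is good practice to first release the YARN resources held by your notebook; a one-line sketch, assuming the sc object from point 5:

# Stop the SparkContext so YARN can reclaim the application's resources
sc.stop()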
