

Installing Apache Spark 2 on AWS EC2

Prerequisite: JDK 1.8 on your EC2 instance.

1. Download the Spark 2 CSD file from the link below.

wget http://archive.cloudera.com/spark2/csd/SPARK2_ON_YARN-2.3.0.cloudera2.jar

2. Copy the Spark 2 CSD jar to /opt/cloudera/csd/

cp SPARK2_ON_YARN-2.3.0.cloudera2.jar /opt/cloudera/csd/

3. Go to the CSD jar location


cd /opt/cloudera/csd/

ls -ltrh

4. Change the owner using the following command

chown cloudera-scm:cloudera-scm SPARK2_ON_YARN-2.3.0.cloudera2.jar

Verify using the following command.

ls -ltrh

5. Now change the permissions using the chmod command.

chmod 644 SPARK2_ON_YARN-2.3.0.cloudera2.jar
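
Steps 2 to 5 can also be collected into one small helper. This is only a minimal sketch, not a Cloudera tool: the function name install_csd and the order of its arguments are assumptions made for illustration. To set the owner to cloudera-scm it has to be run as root.

```shell
# Hypothetical helper (not part of Cloudera Manager): copy a CSD jar into
# place, then apply the owner and mode from steps 4 and 5.
install_csd() {
    jar="$1"     # path to the downloaded CSD jar
    dest="$2"    # CSD directory (/opt/cloudera/csd in this guide)
    owner="$3"   # owner:group (cloudera-scm:cloudera-scm in this guide)

    [ -f "$jar" ]  || { echo "jar not found: $jar" >&2; return 1; }
    [ -d "$dest" ] || { echo "directory not found: $dest" >&2; return 1; }

    target="$dest/$(basename "$jar")"
    cp "$jar" "$target"      || return 1
    chown "$owner" "$target" || return 1
    chmod 644 "$target"      || return 1
    ls -l "$target"          # same kind of check as ls -ltrh in step 4
}
```

For this guide the call would be: install_csd SPARK2_ON_YARN-2.3.0.cloudera2.jar /opt/cloudera/csd cloudera-scm:cloudera-scm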


6. Now restart the cloudera-scm-server and cloudera-scm-agent services using the
systemctl command, then check the status of the server.

systemctl restart cloudera-scm-server


systemctl restart cloudera-scm-agent
systemctl status cloudera-scm-server
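
The server can take a while to come back after a restart, so polling can be easier than running the status command by hand. The sketch below is a generic retry loop. The helper name wait_until and the retry counts in the usage line are illustrative choices, not values recommended by Cloudera.

```shell
# Hypothetical retry helper: run a command until it succeeds or the
# attempts run out. Usage: wait_until <attempts> <delay-seconds> <command...>
wait_until() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0                              # command succeeded
        [ "$n" -lt "$attempts" ] && sleep "$delay"    # pause between attempts
        n=$((n + 1))
    done
    return 1                                          # never succeeded
}
```

For example: wait_until 30 10 systemctl is-active --quiet cloudera-scm-server
Note that an active service only means the process is running. The web UI may still need more time before it accepts logins.
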
7. Now refresh the Cloudera Manager web UI and log in (this may take
some time).

● Click on the Parcels tab in the right-hand corner.


● Scroll down and verify that a Spark2 parcel is present.
● Click on the Download button next to Spark2.

● Now click on Distribute.

● Now click on Activate and then click on OK.


● The Spark2 parcel is now distributed and activated.

8. Now scroll up and go back to the Cloudera Manager home page.
● Click on the Stale Configuration icon under the Cloudera Management Services tab.

● Click on Restart Cloudera Management Services.

● The restart takes some time. Once it is done, click on Finish.


9. Now go to the Cloudera Manager home page and click on Add
Service.
● Select Spark2 and then click on Continue.

● Choose the services to integrate with, for example S3, HBase, HDFS, Hive, YARN (MR2 Included), and ZooKeeper.

Your screen may look slightly different at this step.


● Select the History Server host by clicking the text box just below the History Server label.

● A pop-up window appears. Select the checkbox, then click OK to go back.

● Click on ​Continue
● Click on ​Continue

● Click on ​Continue
● Click on ​Finish​.
10. Now restart YARN by clicking the Stale Configuration icon.

● Then click on Restart Stale Services.


● Click on Restart Now.

● Once all steps are complete, click on Finish.


10. Go back to the Cloudera Manager home page. If you see a
health issue there, follow the step below to resolve it.

● Click on the Issue icon, then click on Suppress.

11. Create the Spark Lineage directory


Open PuTTY or a terminal, log in to your EC2 instance, switch to the root user, and run the
following commands:

mkdir /var/log/spark2/lineage
chown spark:spark /var/log/spark2/lineage
chmod 777 /var/log/spark2/lineage
12. Spark2 jobs read data from HDFS by default, so now let's
create an ec2-user directory in HDFS (skip this step if it already exists).
In PuTTY, run the following commands as the root user:
sudo su hdfs
hadoop fs -mkdir -p /user/ec2-user
hadoop fs -chown -R ec2-user:ec2-user /user/ec2-user
hadoop fs -chmod -R 777 /user/ec2-user
For your Spark jobs, keep your data files in the /user/ec2-user/ directory (you can create
subdirectories there and store your data files in them).

Now, return to ec2-user by running the command: exit
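
Because step 12 can be skipped when the directory already exists, the check can be scripted. The following is a minimal sketch, assuming it is run as the hdfs user. The helper name ensure_hdfs_home is made up for illustration. It only uses the hadoop fs subcommands from this step plus -test -d, which returns success when the path is an existing directory.

```shell
# Hypothetical helper: create /user/<name> in HDFS only if it is missing.
# Run it as the hdfs user, as in step 12.
ensure_hdfs_home() {
    name="$1"
    home="/user/$name"
    if hadoop fs -test -d "$home"; then        # directory already there
        echo "$home already exists"
        return 0
    fi
    hadoop fs -mkdir -p "$home"                || return 1
    hadoop fs -chown -R "$name:$name" "$home"  || return 1
    hadoop fs -chmod -R 777 "$home"            || return 1
    echo "$home created"
}
```

For this guide the call would be: ensure_hdfs_home ec2-user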

13. To verify the Spark version, run the following command

spark2-submit --version

The output should show version 2.3.0.
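
The version check can also be done by a script instead of by eye. This is a small sketch under the assumption that the version banner contains the word "version" followed by the version number. The helper name spark_version_matches is hypothetical.

```shell
# Hypothetical helper: read version output on standard input and succeed
# only if it mentions the expected Spark version.
spark_version_matches() {
    expected="$1"
    grep -F -q "version $expected"
}
```

For example: spark2-submit --version 2>&1 | spark_version_matches 2.3.0 && echo "Spark 2.3.0 found"
The 2>&1 is there because the banner is usually written to standard error rather than standard output.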


14. Open Cloudera Manager, click on YARN in the cluster
services, then click on Configuration in the upper menu bar.
NOTE: The values of these two parameters should match our configuration.
1. Search for "yarn.nodemanager.resource.memory-mb" and set it to 10 GB, then click Save
Changes.

2. Search for "yarn.scheduler.maximum-allocation-mb" and set it to 8 GB, then click Save Changes.
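
Both property names end in -mb, so the underlying values are in megabytes. The short sketch below works out what 10 GB and 8 GB correspond to and checks that the largest single container still fits on one NodeManager. The helper name gb_to_mb is only illustrative.

```shell
# Convert whole gigabytes to megabytes (1 GB = 1024 MB).
gb_to_mb() {
    echo $(( $1 * 1024 ))
}

node_mb=$(gb_to_mb 10)   # yarn.nodemanager.resource.memory-mb
max_mb=$(gb_to_mb 8)     # yarn.scheduler.maximum-allocation-mb

echo "yarn.nodemanager.resource.memory-mb=$node_mb"
echo "yarn.scheduler.maximum-allocation-mb=$max_mb"

# The largest single container should fit on one NodeManager.
if [ "$max_mb" -le "$node_mb" ]; then
    echo "ok: largest container fits on one node"
else
    echo "warning: largest container exceeds node memory"
fi
```

With the values from this step, the two properties come out as 10240 and 8192.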


15. How to open the History Server web UI:

Note: Before opening the History Server web UI, you first need to configure the security group of
your instance:

I. Go to the EC2 dashboard and select your instance.

II. Scroll down to the Description tab and, under your security group, click on avdmap.
III. Then go to Inbound > Edit > Add Rule.

IV. A new rule is added to the list. Configure it as follows:
A. Type - "All TCP"
B. Protocol - TCP
C. Port range - 0-65535
D. Source - My IP
E. Click on Save.
V. Copy the public IP address from the EC2 dashboard.

VI. Now paste the public IP address into the browser, followed by port 18089. The format is

<your EC2 public ip>:18089

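The address can also be checked from a terminal before opening the browser. This is a minimal sketch: the helper names are hypothetical, 203.0.113.10 in the usage line is a placeholder address, and port 18089 is the one used in this guide.

```shell
# Hypothetical helpers for step 15. Port 18089 is the one this guide uses.
history_server_url() {
    echo "http://$1:${2:-18089}"
}

# Print the HTTP status code returned by the History Server, or fail if
# no response arrives within 5 seconds.
probe_history_server() {
    curl -s -o /dev/null -m 5 -w '%{http_code}\n' "$(history_server_url "$1")"
}
```

For example: probe_history_server 203.0.113.10
If the command times out, the security group rule may not cover the IP address you are connecting from.
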