BDALab Assn5
Aim: Set up a Spark multi-node cluster. Write a word-count program in Scala
and execute it on the Spark cluster.
Theory:
What is Spark?
Apache Spark is an open-source distributed computing engine for large-scale
data processing. It keeps intermediate data in memory rather than writing it
to disk after every stage, which makes iterative workloads much faster than
classic Hadoop MapReduce. Spark speeds development and operations in a
variety of ways: it offers concise APIs in Scala, Java and Python, an
interactive shell for exploration, and fault-tolerant Resilient Distributed
Datasets (RDDs) that are partitioned across the nodes of a cluster.
Prerequisites:
1. Java: version 8 or later (I have used 8)
2. Hadoop multi-node cluster: As setup from experiment 2
Step 1: Download the Spark binary tarball (.tgz) from the Apache Spark
downloads page on the master machine.
Step 2: Extract the tgz file. Add SPARK_HOME and the Spark bin path to the
PATH variable in the .bashrc file as shown in the images below.
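The .bashrc entries could look like the following sketch (the extraction path /home/hadoop/spark is an assumption; substitute wherever you extracted the tarball):

```shell
# Hypothetical .bashrc entries; the actual path depends on where
# the Spark tarball was extracted on your machine.
export SPARK_HOME=/home/hadoop/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```

Run `source ~/.bashrc` (or open a new terminal) for the change to take effect.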
Step 3: Go to the conf folder inside the Spark folder. Here, copy
‘spark-env.sh.template’ to ‘spark-env.sh’. Edit spark-env.sh as shown. Do the
same for the workers file as shown in the images.
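As an illustration, a minimal spark-env.sh and workers file might contain entries like the following. The Java path and the hostnames hadoop-slave1 and hadoop-slave2 are assumptions; use the paths and hostnames of your own cluster:

```shell
# --- spark-env.sh (illustrative; path and hostname are assumptions) ---
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_MASTER_HOST=hadoop-master

# --- workers (one worker hostname per line; hostnames are assumptions) ---
# hadoop-slave1
# hadoop-slave2
```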
Step 4: Fire up the Spark shell using the ‘spark-shell’ command to check
that Spark is running. Use Ctrl+D to exit the shell.
Step 5: On each slave, create a directory for Spark at the same path as on
the master.
Step 6: Copy the files from the Spark folder of the master to the Spark
folder of each slave. You can check the files on the slaves after copying,
as shown in the images of the step above.
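A sketch of the copy, assuming the slaves are reachable as hadoop-slave1 and hadoop-slave2 under the user hadoop and that Spark lives in /home/hadoop/spark (all of these are assumptions):

```shell
# Copy the Spark installation from the master to each slave
# (hostnames, user and paths are assumptions; substitute your own).
scp -r /home/hadoop/spark/* hadoop@hadoop-slave1:/home/hadoop/spark/
scp -r /home/hadoop/spark/* hadoop@hadoop-slave2:/home/hadoop/spark/
```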
Step 7: On the slaves, add SPARK_HOME and the Spark bin path to the PATH
variable through .bashrc, as shown below. Also add JAVA_HOME and
HADOOP_HOME if they were not previously added to .bashrc.
Step 8: Go to the sbin folder of Spark on the master. Start the cluster
using ‘./start-all.sh’. Run jps on the master and slaves; you should see a
‘Master’ process on the master and ‘Worker’ processes on the slaves, as
shown below. You can open the Spark master web UI at ‘localhost:8080’ in
the browser to see the slave workers active.
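The start-up and check for this step could be sketched as follows (assuming SPARK_HOME is set as in Step 2):

```shell
# Start the Spark master here and the workers listed in conf/workers
cd $SPARK_HOME/sbin
./start-all.sh

# Verify the daemons: on the master, jps should list a 'Master' entry;
# on each slave, jps should list a 'Worker' entry.
jps
```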
Step 9: Start the Hadoop multi-node cluster using start-dfs.sh. Check that
it is running properly on the master and slaves using jps.
Step 10: Create a new directory on HDFS for WordCountScala and, inside it,
an Input directory to store the input. Create an input file ‘input.txt’
locally and push it to HDFS through the command given in the image below.
Check through the GUI that it was created properly.
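The HDFS commands for this step could look like the following (the local location of input.txt is an assumption):

```shell
# Create the directories on HDFS and upload the input file
hdfs dfs -mkdir -p /WordCountScala/Input
hdfs dfs -put ~/input.txt /WordCountScala/Input/

# Verify the upload
hdfs dfs -ls /WordCountScala/Input
```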
Step 11: Create a file WordCountScala.sc, write the code below in it, and
save it.
Code:
// Read the input file from HDFS
val textFile = sc.textFile("hdfs://hadoop-master:9000/WordCountScala/Input/input.txt")

// Split each line into words, pair each word with 1, and sum the counts per word
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)

// Write the (word, count) pairs back to HDFS
counts.saveAsTextFile("hdfs://hadoop-master:9000/WordCountScala/Output/output-spark")
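The same flatMap/map/reduceByKey logic can be sanity-checked locally with plain Scala collections, without a cluster. The helper object below is illustrative only and is not part of the submitted file:

```scala
// Local sanity check of the word-count logic using ordinary Scala
// collections instead of an RDD (illustrative helper, not in the lab file).
object WordCountCheck {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(line => line.split(" "))          // split lines into words
         .map(word => (word, 1))                    // pair each word with 1
         .groupBy(_._1)                             // group pairs by word
         .map { case (w, ps) => (w, ps.map(_._2).sum) } // sum counts per word

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("hello spark", "hello hadoop")))
}
```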
Step 12: Run the Scala file using the command below. Check on the GUI that
the output has been created in /WordCountScala/Output/output-spark.
spark-shell -i <path to scala file>
Step 13: To get the output from HDFS to local, use the command below. We
can check the output by running nano or cat on the files. The output is
shown below.
hdfs dfs -get <Path to output on HDFS> <Directory name in local>
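For this cluster the concrete command could look like the following (the local directory name is an assumption):

```shell
# Fetch the result directory from HDFS and inspect the part files
hdfs dfs -get /WordCountScala/Output/output-spark ~/output-spark
cat ~/output-spark/part-*
```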
Step 14: Stop the Hadoop cluster as well as the Spark cluster using the
commands given in the image below.
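The shutdown could be sketched as follows (assuming SPARK_HOME is set as in Step 2):

```shell
# Stop the Spark master and workers
$SPARK_HOME/sbin/stop-all.sh

# Stop the Hadoop HDFS daemons
stop-dfs.sh
```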
Conclusion:
In this experiment, I learned how to set up a multi-node cluster for Spark,
and I learned the advantages of using Spark over Hadoop. Spark lets you
write code in Scala, Java and Python, while Hadoop MapReduce is written
primarily in Java. Spark uses RDDs (Resilient Distributed Datasets), which
are fault tolerant and can be distributed among multiple nodes in a cluster;
they support parallel processing and fast in-memory access. I also learned
the basics of the Scala language and how to write a program in it.