
BDA Lab Assignment 5

Name: Rushabh Gala


Reg No.: 201080072
Branch: IT
Batch: B

Aim: Set up a Spark multi-node cluster. Write a word count program in Scala and
execute it on the Spark cluster.

Theory:

What is Spark?

Apache Spark is a lightning-fast, open-source data-processing engine for
machine learning and AI applications, backed by the largest open-source
community in big data. Apache Spark (Spark) easily handles large-scale data
sets and is a fast, general-purpose cluster computing system that can also be
used from Python through PySpark. It is designed to deliver the computational
speed, scalability, and programmability required for big data, specifically for
streaming data, graph data, analytics, machine learning, large-scale data
processing, and artificial intelligence (AI) applications. Spark's analytics
engine can process data 10 to 100 times faster than alternatives such as Hadoop
MapReduce for smaller workloads. It scales by distributing processing workflows
across large clusters of computers, with built-in parallelism and fault tolerance.
It also includes APIs for programming languages that are popular among data
analysts and data scientists, including Scala, Java, Python, and R.

Spark is often compared to Apache Hadoop, and specifically to Hadoop
MapReduce, Hadoop's native data-processing component. The chief difference
between Spark and MapReduce is that Spark processes and keeps the data in
memory for subsequent steps, without writing to or reading from disk, which
results in dramatically faster processing speeds.

How Does Apache Spark Work?

Apache Spark has a hierarchical primary/secondary architecture. The Spark
Driver is the primary node; it works with the cluster manager, which manages
the secondary (worker) nodes, and delivers data results to the application
client.

Resilient Distributed Dataset (RDD):

Resilient Distributed Datasets (RDDs) are fault-tolerant collections of elements
that can be distributed among multiple nodes in a cluster and worked on in
parallel. RDDs are a fundamental data structure in Apache Spark.

Spark loads data by referencing a data source or by parallelizing an existing
collection with the SparkContext parallelize method, caching the data into an
RDD for processing. Once data is loaded into an RDD, Spark performs
transformations and actions on RDDs in memory, which is the key to Spark's
speed. Spark keeps the data in memory unless the system runs out of memory or
the user decides to write the data to disk for persistence.
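
For example, inside the spark-shell (where the SparkContext is already available
as sc), an RDD can be created either from a local collection or from an external
file; the collection values and the file path below are only illustrative:

// Parallelize an existing in-memory collection into an RDD
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
// Or reference an external data source, e.g. a file stored on HDFS
val lines = sc.textFile("hdfs://hadoop-master:9000/some/input.txt")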

Each dataset in an RDD is divided into logical partitions, which may be
computed on different nodes of the cluster. Users can perform two types of
RDD operations: transformations and actions. Transformations are operations
applied to create a new RDD. Actions instruct Apache Spark to apply the
computation and pass the result back to the driver.
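
Continuing the sketch above, map is a transformation (it only defines a new
RDD, nothing runs yet), while collect and reduce are actions that trigger the
computation and return results to the driver:

// Transformation: lazily defines a new RDD of squared values
val squares = nums.map(n => n * n)
// Actions: trigger the computation and return results to the driver
squares.collect()        // Array(1, 4, 9, 16, 25)
squares.reduce(_ + _)    // 55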

Spark speeds development and operations in a variety of ways. Spark will help
teams:

• Accelerate app development: Apache Spark's Streaming and SQL
  programming models, backed by MLlib and GraphX, make it easier
  to build apps that exploit machine learning and graph analytics.
• Innovate faster: APIs provide ease of use when manipulating semi-
  structured data and transforming data.
• Manage with ease: A unified engine supports SQL queries, streaming
  data, machine learning (ML), and graph processing.
• Process faster: Spark can be 100x faster than Hadoop for smaller
  workloads because of its advanced in-memory computing engine and
  disk data storage.
• Speed memory access: Spark can be used to create one large memory
  space for data processing, enabling more advanced users to access data
  via interfaces using Python, R, and Spark SQL.

Method and Output:

Prerequisites:
1. Java: any version (Java 8 is used here)
2. Hadoop multi-node cluster: as set up in Experiment 2

Step 1: Switch to hadoopuser. Install Scala. Download the Spark tarball
through wget or directly from the site.
Scala: sudo apt install scala
Spark: wget https://dlcdn.apache.org/spark/spark-3.4.2/spark-3.4.2-bin-without-hadoop.tgz
Note: Spark 3.4.2 is downloaded here, but you can download whichever version
is available.

Step 2: Extract the tgz file. Add SPARK_HOME and the Spark bin path to the
PATH variable in the .bashrc file as shown in the images below.
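The screenshots show the exact lines; a typical sketch, assuming the tarball
was extracted to the hadoopuser home directory (the path is an assumption here),
is:
export SPARK_HOME=/home/hadoopuser/spark-3.4.2-bin-without-hadoop
export PATH=$PATH:$SPARK_HOME/bin
After editing, run 'source ~/.bashrc' so the changes take effect in the current
shell.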
Step 3: Go to the conf folder inside the Spark folder. Here, copy
'spark-env.sh.template' to 'spark-env.sh'. Edit spark-env.sh as shown. Do the
same for the workers file as shown in the images.
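The exact entries are in the screenshots; a minimal sketch, assuming the Java
path from the Experiment 2 setup and two hypothetical slave hostnames
(hadoop-slave1, hadoop-slave2), could look like this in spark-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_MASTER_HOST=hadoop-master
# Required for the "without-hadoop" build: point Spark at the Hadoop jars
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
and in the workers file, one slave hostname per line:
hadoop-slave1
hadoop-slave2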
Step 4: Fire up the Spark shell using the 'spark-shell' command to check whether
Spark is running. Use Ctrl+D to exit the shell.
Step 5: On the slaves, make directories for Spark with the same name as on the
master (a command sketch for this and the next step follows Step 6).
Step 6: Copy the files from the Spark folder of the master to the Spark folder
of the slaves. You can check the files on the slaves after copying, as shown in
the images for the step above.
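A command sketch for these two steps, reusing the hypothetical slave hostname
and the install path assumed earlier, could be:
ssh hadoop-slave1 "mkdir -p /home/hadoopuser/spark-3.4.2-bin-without-hadoop"
scp -r /home/hadoopuser/spark-3.4.2-bin-without-hadoop/* hadoopuser@hadoop-slave1:/home/hadoopuser/spark-3.4.2-bin-without-hadoop/
The same is repeated for every other slave in the cluster.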
Step 7: On the slaves, add SPARK_HOME and the Spark bin path to the PATH
variable through .bashrc as shown below. Also add JAVA_HOME and HADOOP_HOME if
they were not previously added to .bashrc.
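The slave .bashrc gets the same SPARK_HOME lines as in Step 2; a sketch of the
full set of exports, with the Java and Hadoop paths being assumptions carried
over from the Experiment 2 setup, is:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoopuser/hadoop
export SPARK_HOME=/home/hadoopuser/spark-3.4.2-bin-without-hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin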
Step 8: Go to the sbin folder of Spark on the master. Start the cluster using
'./start-all.sh'. Use jps on the master and slaves; you should see the 'Master'
and 'Worker' processes as shown below. You can check the Spark GUI at
'localhost:8080' in the browser to see the slave workers active.
Step 9: Start the Hadoop multi-node cluster using start-dfs.sh. Check with jps
that it is running properly on the master and slaves.

Step 10: Create a new directory on HDFS for WordCountScala and, inside it, an
Input directory to store the input. Create an input file 'input.txt' locally
and push it to HDFS through the command given in the image below. Check through
the GUI that it was created properly.
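The exact command is in the screenshot; the typical sequence, assuming input.txt
is in the current local directory, is:
hdfs dfs -mkdir -p /WordCountScala/Input
hdfs dfs -put input.txt /WordCountScala/Input/
hdfs dfs -ls /WordCountScala/Input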
Step 11: Create a file WordCountScala.sc, write the code in that file, and
save it.

Code:
// Read the input file from HDFS into an RDD of lines
val textFile = sc.textFile("hdfs://hadoop-master:9000/WordCountScala/Input/input.txt")
// Split each line into words, map each word to (word, 1), and sum the counts per word
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Write the (word, count) pairs back to HDFS
counts.saveAsTextFile("hdfs://hadoop-master:9000/WordCountScala/Output/output-spark")

Step 12: Run the Scala file using the command below. Check on the GUI if the
output has been created in /WordCountScala/Output/output-spark.
spark-shell -i <path to scala file>
Step 13: To get the output from HDFS to local, use the command below. We can
check the output with nano or cat on the files. The output is shown below.
hdfs dfs -get <Path to output on HDFS> <Directory name in local>

Step 14: Stop the Hadoop cluster as well as the Spark cluster using the
commands given in the image below.
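The commands in the screenshot typically amount to the following, assuming
SPARK_HOME is set as in Step 2 (Spark's stop-all.sh is called with its full
path so it is not confused with Hadoop's script of the same name):
$SPARK_HOME/sbin/stop-all.sh   # stops the Spark Master and Workers
stop-dfs.sh                    # stops the Hadoop HDFS daemons started in Step 9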

Conclusion:
In this experiment, I learned how to set up a multi-node cluster for Spark. I
learned the advantages of using Spark over Hadoop MapReduce. Spark makes it easy
to write code in Scala, Java, and Python as well, while Hadoop MapReduce
natively supports only Java. Spark uses RDDs (Resilient Distributed Datasets),
which are fault tolerant and can be distributed among multiple nodes in a
cluster. It supports parallelization and faster in-memory access. I also learned
the basics of the Scala language and how to write a program in it.
