Spark


On the master node you have the driver program, which drives your application. The code you write behaves as the driver program, or, if you are using the interactive shell, the shell itself acts as the driver program.

(The master node in a Hadoop cluster is responsible for storing data in HDFS and executing parallel computation on the stored data using MapReduce.)

1. Driver Program - When the client submits the application code, the driver program on the master converts it into a DAG of transformations and actions.
2. The DAG is then converted into a physical execution plan, i.e. broken into tasks, and the task bundles are sent to the cluster.
3. The cluster manager now comes into the picture and arranges the resources (executors) for executing the tasks. The driver program keeps a full view of the executors and monitors them; after the tasks complete it releases the resources (a minimal driver sketch follows below).
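To make the flow concrete, here is a minimal driver-program sketch (the object name DriverSketch, the local master and the input file data.txt are illustrative assumptions, not from these notes). The transformations only build up the DAG; the action at the end is what ships tasks to the executors.

// Minimal driver program sketch
import org.apache.spark.sql.SparkSession

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DriverSketch").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    // Transformations: only build up the DAG, nothing runs yet
    val words = sc.textFile("data.txt").flatMap(_.split(" "))

    // Action: triggers the physical plan; tasks are sent to the executors
    println(words.count())

    // Stop the application and release executors and other cluster resources
    spark.stop()
  }
}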

Datasets - A distributed collection of items.


RDDs have largely been superseded by Datasets; a Dataset works much like an RDD but benefits from richer optimizations.

We can use transformations and actions on the same Dataset while computing.

// Read a text file as a Dataset[String]
val textFile = spark.read.textFile("README.md")

// Filter lines that mention "Spark"
val linesWithSpark = textFile.filter(line => line.contains("Spark"))

// Word count: split lines into words, group by the word itself, count each group
val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
wordCounts.collect()

Caching

Caching is a technique where we keep data in memory so that it can be reused efficiently when the same data is needed repeatedly.

// Keep the word counts in memory for repeated use
wordCounts.cache()

Our application depends on the Spark API, so we'll also include an sbt configuration file, build.sbt, which declares Spark as a dependency.

Once the sbt file is in place, we package our application code into a JAR and run it with the spark-submit command.
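For reference, a minimal build.sbt might look like the following sketch (the Scala and Spark version numbers here are illustrative, not taken from these notes):

// build.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.12.15"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0"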

# Your directory layout should look like this


$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Package a jar containing your application


$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.12/simple-project_2.12-1.0.jar

# Use spark-submit to run your application


$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.12/simple-project_2.12-1.0.jar
...

At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster.

Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only "added" to, such as counters and sums.
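A quick sketch of both shared-variable types (the values here are only for illustration):

// Broadcast variable: a read-only value cached on every node
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value                                   // Array(1, 2, 3)

// Accumulator: tasks only add to it; the driver reads the result
val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
accum.value                                          // 10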

The first thing a Spark program must do is create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.

// Create a SparkConf with the application name and master URL, then a SparkContext
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)

Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program.

// Parallelize an existing collection into an RDD
val array = Array(1, 2, 3, 4, 5)
val data = sc.parallelize(array)

One important parameter for parallel collections is the number of partitions to cut the dataset into.

Normally, Spark sets the number of partitions automatically based on your cluster.

// Cut the dataset into 10 partitions explicitly
val data = sc.parallelize(array, 10)

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
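As a quick sketch (reusing the parallelized data RDD from above), filter is a transformation and count is an action that returns a value to the driver:

// Transformation: lazily describes a new RDD; nothing is computed yet
val evens = data.filter(_ % 2 == 0)

// Action: triggers the computation and returns the result to the driver
val howMany = evens.count()   // 2 for Array(1, 2, 3, 4, 5)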

Each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.

// Transformations only define the RDDs; nothing is computed yet
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)

// Keep lineLengths in memory after the first action computes it
lineLengths.persist()

// Read a JSON file into a DataFrame
val dataframe = spark.read.json("Path of the file.json")

$ spark-submit --class <class name> --master yarn --deploy-mode client --num-executors 3 \
  --driver-memory 4g --executor-memory 2g --executor-cores 2 \
  <jar file path> <input file path> <output location>

:history
284 val a = sc.textFile("/home/ubuntu/india.txt")
285 a.collect()
286 val b = a.map(_.split(" "))
287 b.collect()
288 a.collect()
289 b.collect()
290 val c = a.flatMap(_.split(" "))
291 c.collect()
292 a.collect()
293 val data = sc.textFile("/home/ubuntu/india.txt",4)
294 data.collect()
295 val data2 = data.flatMap(_.split(" "))
296 data2.collect()
297 val data3 = data2.map(e=>(e,1))
298 data3.collect()
299 val data4 = data3.reduceByKey(_+_)
300 data4.foreach(println)
301 data4.collect()
302 data4.foreach(println)
303 :history

// Column positions for the auction data fields
val auctionid = 0
val bid = 1
val bidtime = 2
val bidder = 3
val bidderrate = 4
val openbid = 5
val price = 6
val itemtype = 7
val daystolive = 8


scala> :history


371 data.collect()
372 val data2 = data.flatMap(_.split(" "))
373 data2.collect()
374 val data3 = data2.map(e=>(e.length,e))
375 data3.collect()
376 val data4 = data3.sortByKey()
377 data4.collect()
378 data4.foreach(println)
379 val data5 = data4.filterByRange(4,7).collect
380 val data5 = data4.filterByRange(4,7)
381 data5.foreach(println)
382 val data5 = data4.filterByRange(7,7).collect
383 data5collect()
384 data5.collect()
385 val data5 = data4.filterByRange(7,7)
386 data4.collect()
387 data4.foreach(println)
388 data4.lookup(8)
389 data4.lookup(7)

/FileStore/tables/HeroData.txt

scala> :history


404 val t4 = t3.foldByKey(" ")(_+_)
405 t4.collect()
406 t4.foreach(println)
407 t2.collect()
408 val y = t2.groupBy(e=>e.charAt(0))
409 y.collect()
410 y.foreach(println)
411 t2.collect()
412 val w = t2.keyBy(e=>e.charAt(0))
413 w.collect()
414 val y = t2.groupBy(e=>e.charAt(0))
415 y.foreach(println)
416 t2.partitions.size
417 val b = t2.coalesce(7)
418 b.partitions.size
419 val a = t2.coalesce(2)
420 a.partitions.size
421 val u = t2.repartition(10)
422 u.partitions.size

/FileStore/tables/people.txt

scala> :history


431 data4.collect()
432 data4.foreach(println)
433 val t1 = data3.reduceByKey(_+_)
434 t1.collect()
435 t1.foreach(println)
436 data3.collect()
437 data3.values
438 data3.values.collect()
439 data3.keys.collect
440 data3.countByKey()
441 data3.collect()
442 val t1 = sc.textFile("/home/ubuntu/evarcity.txt")
443 val t2 = t1.flatMap(_.split(" "))
444 t2.collect()
445 val t3 = t2.keyBy(_.length)
446 t3.collect()
447 data3.collect()
448 val b = data3.intersection(t3)
449 b.collect()
