Spark
The code you write behaves as the driver program; if you are using the interactive
shell, the shell itself acts as the driver program.
1. Driver Program - When the client submits the application code, the driver program on
the master converts it into a DAG of operations (transformations and actions).
2. The DAG is then turned into a physical execution plan, i.e. split into stages and tasks,
and the task bundles are sent to the cluster.
3. Now the cluster manager comes into the picture and arranges the resources (executors)
for executing the tasks. The driver program has a full view of, and monitors, the executors;
after the tasks complete it releases the resources. (A small sketch of this flow follows the list.)
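The same flow can be watched from the shell. A minimal sketch, assuming a running
spark-shell where sc is already available (the toy data is only for illustration):
// Build a small lineage of transformations; nothing runs on the cluster yet
val nums = sc.parallelize(1 to 100)
val evens = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)
// toDebugString prints the recorded lineage (the logical DAG)
println(squares.toDebugString)
// An action such as count() makes the driver turn the DAG into stages and tasks
println(squares.count())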
// File Read
val textFile = spark.read.textFile("README.md")
// Word Count
val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
wordCounts.collect()
Caching
Caching is the technique of keeping data in memory when we need to use that data
repeatedly.
//
wordCounts.cache()
Our application depends on the Spark API, so we'll also include an sbt configuration file,
build.sbt, which declares Spark as a dependency. Once the build.sbt file is in place,
we build a JAR from our application code and submit it with spark-submit, as sketched below.
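A minimal sketch of both steps; the project name, main class, and the Scala and Spark
versions are assumptions, so match them to your cluster:
// build.sbt
name := "simple-app"
version := "0.1"
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
// Package with `sbt package`, then submit the resulting JAR, e.g.:
// spark-submit --class SimpleApp --master local[4] target/scala-2.12/simple-app_2.12-0.1.jar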
At a high level, every Spark application consists of a driver program that runs the user's
main function and executes various parallel operations on a cluster.
Sometimes, a variable needs to be shared across tasks, or between tasks and the driver
program. Spark supports two types of shared variables: broadcast variables, which can be
used to cache a value in memory on all nodes, and accumulators, which are variables that
are only "added" to, such as counters and sums.
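A minimal sketch of both kinds of shared variables; the lookup map and the sample numbers
are purely illustrative, and sc is the shell's SparkContext:
// Broadcast variable: a read-only value cached once per executor
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value.getOrElse(k, 0))
mapped.collect()
// Accumulator: tasks only add to it; the driver reads the total back
val negatives = sc.longAccumulator("negatives")
sc.parallelize(Seq(1, -1, 2, -2)).foreach(n => if (n < 0) negatives.add(1))
println(negatives.value)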
The first thing a Spark program must do is create a SparkContext object, which tells
Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf
object that contains information about your application.
//
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
//
val array = Array(1, 2, 3, 4, 5)
val data = sc.parallelize(array)
One important parameter for parallel collections is the number of partitions to cut
the
dataset into.
//
// e.g. sc.parallelize(array, 10) would cut the dataset into 10 partitions
RDDs support two types of operations: transformations, which create a new dataset
from
an existing one, and actions, which return a value to the driver program after
running
a computation on the dataset.
Each transformed RDD may be recomputed each time you run an action on it. However,
you
may also persist an RDD in memory using the persist (or cache) method, in which
case Spark
will keep the elements around on the cluster for much faster access the next time
you
query it.
//
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
//
lineLengths.persist()
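To complete the example, a minimal sketch of an action that actually triggers the
computation; summing the line lengths is just one possible result to compute:
// reduce is an action: it runs the lineage on the cluster and returns the result to the driver
val totalLength = lineLengths.reduce((a, b) => a + b)
println(totalLength)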
:history
284 val a = sc.textFile("/home/ubuntu/india.txt")
285 a.collect()
286 val b = a.map(_.split(" "))
287 b.collect()
288 a.collect()
289 b.collect()
290 val c = a.flatMap(_.split(" "))
291 c.collect()
292 a.collect()
293 val data = sc.textFile("/home/ubuntu/india.txt",4)
294 data.collect()
295 val data2 = data.flatMap(_.split(" "))
296 data2.collect()
297 val data3 = data2.map(e=>(e,1))
298 data3.collect()
299 val data4 = data3.reduceByKey(_+_)
300 data4.foreach(println)
301 data4.collect()
302 data4.foreach(println)
303 :history
// Column index constants, presumably field positions in an auction dataset
val bid = 1
val bidtime = 2
val bidder = 3
val bidderrate = 4
val openbid = 5
val price = 6
val itemtype = 7
val daystolive = 8
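A hypothetical usage sketch: split an auction CSV whose columns match the constants above
and address fields by position (the file path and column layout are assumptions):
// Read and split the (hypothetical) auction file
val auction = sc.textFile("/FileStore/tables/auction.csv").map(_.split(","))
// Highest price seen for each item type, using the itemtype and price positions
val maxPriceByType = auction.map(fields => (fields(itemtype), fields(price).toDouble)).reduceByKey((a, b) => math.max(a, b))
maxPriceByType.collect()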
// Sample data file paths (Databricks FileStore)
/FileStore/tables/HeroData.txt
/FileStore/tables/people.txt