
If tasks can run in parallel without moving data, they belong to the same stage - e.g. map.
If a task needs data shuffled between nodes, a new stage is created - e.g. countByValue.
Each stage is broken into tasks, which may be distributed across the cluster.
Finally, the tasks are scheduled across the cluster and executed.
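A minimal sketch of where the stage boundary appears, assuming a SparkContext named sc (as in the examples below); reduceByKey forces a shuffle, so a new stage starts there, and toDebugString prints the lineage with its stage boundaries:

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
// map is a narrow transformation - it stays in the same stage as the parallelize
val pairs = words.map(w => (w, 1))
// reduceByKey shuffles data between nodes - Spark starts a new stage here
val counts = pairs.reduceByKey((x, y) => x + y)
// Print the RDD lineage; the indentation marks the stage boundaries
println(counts.toDebugString)
counts.collect().foreach(println)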

map - transforms each element of the collection into a new element.


val rdd = sc.parallelize(Seq("Hello World", "Goodbye World"))

// Map each element to its length
val result = rdd.map(s => s.length)

// Output: [11, 13]
result.collect()

Maps can also create key-value pairs - rdd.map(x => (x, 1))

Spark provides special operations on key-value pairs:

1. reduceByKey((x, y) => x + y) - combines values that have the same key.

x = the running (accumulated) value
y = the next value
Both x and y are values, not keys.

MyAvgFriends example -
val mapval = filtereddata.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
val xyz = mapval.mapValues(x => x._1 / x._2)
2. groupByKey() - group values with the same key

3. sortByKey() - sort the RDD by key

4. keys(), values() - create an RDD of just the keys or just the values

5. SQL-style joins - join, rightOuterJoin, leftOuterJoin, cogroup, subtractByKey

(A sketch of these operations follows this list.)
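A small sketch of these pair-RDD operations on made-up data (the sales/prices RDDs and their values are purely illustrative):

val sales  = sc.parallelize(Seq(("apple", 2), ("banana", 5), ("apple", 3), ("cherry", 1)))
val prices = sc.parallelize(Seq(("apple", 100), ("banana", 40)))

sales.reduceByKey((x, y) => x + y).collect()   // sums the values for each key
sales.groupByKey().collect()                   // groups all values per key
sales.sortByKey().collect()                    // sorts by key
sales.keys.collect()                           // just the keys
sales.values.collect()                         // just the values

sales.join(prices).collect()                   // inner join on key
sales.leftOuterJoin(prices).collect()          // keeps every key from sales
sales.rightOuterJoin(prices).collect()         // keeps every key from prices
sales.cogroup(prices).collect()                // groups values from both RDDs per key
sales.subtractByKey(prices).collect()          // keys present in sales but not in prices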

countByValue() is an action in Spark that counts the number of occurrences of each unique value in an RDD and returns the result to the driver as a Map.

Word count program -
val lines = sc.textFile("file.txt") // gives each line from the file
val data2 = lines.flatMap(x => x.split(" "))
// val data3 = data2.map(x => (x,1))
// val data4 = data3.reduceByKey((x,y) => x+y).sortBy(_._2)
val data4 = data2.countByValue()

map vs flatMap -

map - one to one
flatMap - one to many

flatMap - transforms each element of the collection into zero or more new elements.

val rdd = sc.parallelize(Seq("Hello World", "Goodbye World"))

// Split each string into words
val result = rdd.flatMap(s => s.split(" "))

// Output: ["Hello", "World", "Goodbye", "World"]
result.collect()
equals - compares objects for equality. For case classes, the method returns true if the argument is not null, is of the same class as the object being compared, and has the same values for all of its fields as the object being compared.

case class Person(name: String, age: Int)

val person1 = Person("Alice", 30)


val person2 = Person("Alice", 30)
val person3 = Person("Bob", 25)

println(person1.equals(person2)) // true
println(person1.equals(person3)) // false

Broadcast variable - if the dataset is small (i.e. it fits in memory), we can load it in the driver program and Spark will automatically forward it to each executor when needed.
sc.broadcast() to broadcast the object
.value() to get the object back

But if the table is massive, we'd only want to transfer it once to each executor and keep it there - which is what broadcasting gives us.

Refer mybroadcast.scala - com/sundogsoftware/spark/mybroadcast.scala

val broadcastMap = spark.sparkContext.broadcast(nameId())

val momap: Int => String = (movieID: Int) => broadcastMap.value(movieID)
val lookupNameUDF = udf(momap)
val moviesWithNames = movieCounts.withColumn("movieTitle", lookupNameUDF(col("movieID")))
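The snippet above relies on a nameId() helper from the course code; here is a self-contained sketch of the same broadcast + UDF lookup pattern with a hard-coded map (the movie IDs, titles and counts are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, col}

val spark = SparkSession.builder().appName("BroadcastSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Small lookup table built in the driver and broadcast once to every executor
val idToName = Map(1 -> "Toy Story", 2 -> "GoldenEye")
val broadcastNames = spark.sparkContext.broadcast(idToName)

val movieCounts = Seq((1, 100L), (2, 42L)).toDF("movieID", "count")

// .value gives the broadcast object back inside the UDF running on the executors
val lookupNameUDF = udf((movieID: Int) => broadcastNames.value.getOrElse(movieID, "Unknown"))
movieCounts.withColumn("movieTitle", lookupNameUDF(col("movieID"))).show()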

UDF - In Spark Scala, a User-Defined Function (UDF) allows you to define custom functions
that can be applied to DataFrame columns or used in SQL expressions. UDFs provide flexibility
in performing custom operations on data within Spark.

import org.apache.spark.sql.functions.udf

// Define the UDF (InputType is a placeholder for the Scala type of the input column)
val myUDF = udf((input: InputType) => {
  // Custom logic to process the input and return a result
  result
})

val df: DataFrame = ... // Your DataFrame
val resultDF = df.withColumn("newColumn", myUDF(col("inputColumn")))
resultDF.show()
Example -
val squareUDF = udf((x: Int) => x * x)
// Apply the UDF to a column
val resultDF = df.withColumn("squared", squareUDF(df("number")))

Spark-submit
1. Create the jar: Project Structure => Artifacts => include dependencies + code (without the library jars) => Build.
2. This will create an out folder with the .jar file in it.
3. Navigate to the folder where your jar is placed (for Ubuntu) and run the following command -

spark-submit --class com.sundogsoftware.spark.HelloWorld hello2.jar

spark-submit --class com.example.MyApp --master local myapp.jar
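A few other commonly used spark-submit options (the values below are placeholders to tune for your cluster, not recommendations):

spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2g \
  --num-executors 4 \
  myapp.jar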

By SBT -
Remove code that is not suitable for a cluster, e.g. local[*] (can't take advantage of the cluster) and hard-coded file paths.

build.sbt file

name := "MovieSimilarities1MDataset"

version := "1.0"

organization := "com.sundogsoftware"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.5" % "provided", // provided means it is already installed on the server (EMR)
  "org.apache.spark" %% "spark-sql" % "2.4.5" % "provided"
)
// Spark and Scala should have compatible versions
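With this build.sbt in place, a minimal sketch of building and submitting; sbt package drops the jar under target/scala-2.11/ (the exact jar name follows from the project name and version above, and the main class shown is an assumption based on the project name):

sbt package
spark-submit --class com.sundogsoftware.spark.MovieSimilarities1MDataset target/scala-2.11/moviesimilarities1mdataset_2.11-1.0.jar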
Feature             | RDD                                            | DataFrame                                 | Dataset
Data Structure      | Resilient Distributed Dataset                  | Structured data with schema               | Strongly typed structured data with schema
Immutable           | Yes                                            | Yes                                       | Yes
Schema              | No schema enforcement                          | Schema enforcement                        | Schema enforcement
Optimization        | Limited optimization                           | Catalyst optimizer                        | Catalyst optimizer
Performance         | Lower performance compared to others           | Higher performance compared to RDDs       | Comparable to DataFrames
API                 | Lower-level API with more flexibility          | Higher-level API with ease of use         | Higher-level API with type safety
Serialization       | Uses Java serialization by default             | Uses optimized serialization formats      | Uses optimized serialization formats
Interoperability    | Compatible with any JVM language               | Compatible with any JVM language          | Compatible with any JVM language
Integration         | Supports both structured and unstructured data | Best for structured data                  | Supports both structured and unstructured data
Type Safety         | Not type-safe                                  | Partially type-safe (based on DataFrame)  | Type-safe
Compile-time checks | No                                             | No                                        | Yes
Catalyst optimizer  | No                                             | Yes                                       | Yes


RDD creation -

import org.apache.spark.SparkContext

1.
val data = Seq(1, 2, 3, 4, 5)
val rdd = sparkContext.parallelize(data)

2.
val sc = new SparkContext("local[*]","myavgfriend")
val data = sc.textFile("data/fakefriends-noheader.csv")

3. Spark supports reading data from various external data sources, such as HDFS, Amazon S3,
Apache Cassandra, JDBC databases, etc

val rdd = sparkContext.textFile("hdfs://path/to/file.txt")

Dataframe creation -

import org.apache.spark.sql.SparkSession

1. RDD TO Dataframe - RDD to dataframe by toDF() method

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RDD to DataFrame")
  .master("local")
  .getOrCreate()

import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("Alice", 25), ("Bob", 30), ("Charlie",


35)))

val df = rdd.toDF("Name", "Age")


2. createDataFrame
import spark.implicits._
val data = Seq(("Alice", 25), ("Bob", 30), ("Charlie", 35))
val df = spark.createDataFrame(data).toDF("Name", "Age")

3. Read file with header

val df = spark.read.format("csv")
.option("header", "true")
.load("/path/to/file.csv")

4. File without header

import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, LongType}

val moviesSchema = new StructType()
  .add("userID", IntegerType, nullable = true)
  .add("movieID", IntegerType, nullable = true)
  .add("rating", IntegerType, nullable = true)
  .add("timestamp", LongType, nullable = true)

import spark.implicits._

// Load up the movie data as a DataFrame
val moviesDS = spark.read
  .option("sep", "\t")
  .schema(moviesSchema)
  .csv("data/ml-100k/u.data")

Dataset creation -

1. DF to DS

case class UserRatings(userID: Int, movieID: Int, rating: Int, timestamp: Long)

val userRatingsSchema = new StructType()
  .add("userID", IntegerType, nullable = true)
  .add("movieID", IntegerType, nullable = true)
  .add("rating", IntegerType, nullable = true)
  .add("timestamp", LongType, nullable = true)

val ratingsDS = spark.read
  .option("sep", "\t")
  .schema(userRatingsSchema)
  .csv("data/ml-100k/u.data")
  .as[UserRatings]

Use of case class -

Type safety: By using case classes to define the schema, you get compile-time type checking. The compiler ensures that the data you are working with adheres to the specified structure. This reduces the chances of runtime errors caused by incorrect data manipulation or mismatched types.

When you use a case class to represent the structure of your data and convert a DataFrame to a Dataset using the as[] method, Spark can map the columns of the DataFrame to the fields of the case class based on their names and types.
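A small sketch of what that type safety looks like in practice, using the ratingsDS defined above (and assuming spark.implicits._ is imported as in the earlier snippets):

// Field names and types are checked at compile time: r.rating is an Int,
// and a typo such as r.ratng would be a compile error, not a runtime failure
val goodRatings = ratingsDS.filter(r => r.rating >= 4)
val userIDs = ratingsDS.map(r => r.userID)
goodRatings.show(5)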

Use of import spark.implicits._ - used when working with DataFrames or Datasets in Scala to:
- enable convenient conversions between different data types
- provide access to useful functions and operators (filtering, aggregating, joining, etc.)
- support DataFrame (toDF()) / Dataset (as[T]) conversions
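A quick sketch of all three, reusing the Person case class from the equals example above (the data is made up):

import spark.implicits._

val df = Seq(("Alice", 25), ("Bob", 30)).toDF("name", "age") // Seq -> DataFrame via toDF()
df.filter($"age" > 26).show()                                // $"col" column syntax
val ds = df.as[Person]                                       // DataFrame -> Dataset[Person]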

Example 1 - rating counter

// Load up each line of the ratings data into an RDD


val lines = sc.textFile("data/ml-100k/u.data")

// Convert each line to a string, split it out by tabs, and extract the third field.
// (The file format is userID, movieID, rating, timestamp)
val ratings = lines.map(x => x.split("\t")(2))

// Count up how many times each value (rating) occurs


val results = ratings.countByValue()

// Sort the resulting map of (rating, count) tuples


val sortedResults = results.toSeq.sortBy(_._1)

// Print each result on its own line.


sortedResults.foreach(println)

(5,21201)
(4,34174)
(3,27145)
(2,11370)
Example 2 - Avg friends

/** A function that splits a line of input into (age, numFriends) tuples. */
/**
0,Will,33,385
1,Jean-Luc,26,2
*/
def parseLine(line: String): (Int, Int) = {
  // Split by commas
  val fields = line.split(",")
  // Extract the age and numFriends fields, and convert to integers
  val age = fields(2).toInt
  val numFriends = fields(3).toInt
  // Create a tuple that is our result.
  (age, numFriends)
}

// Load each line of the source data into an RDD


val lines = sc.textFile("data/fakefriends-noheader.csv")

// Use our parseLines function to convert to (age, numFriends) tuples


val rdd = lines.map(parseLine)

// Lots going on here...
// We are starting with an RDD of form (age, numFriends) where age is the KEY and numFriends is the VALUE
// We use mapValues to convert each numFriends value to a tuple of (numFriends, 1)
// Then we use reduceByKey to sum up the total numFriends and total instances for each age,
// by adding together all the numFriends values and 1's respectively.
val totalsByAge = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // x and y are two (numFriends, 1) value tuples for the same age key

// So now we have tuples of (age, (totalFriends, totalInstances))
// To compute the average we divide totalFriends / totalInstances for each age.
val averagesByAge = totalsByAge.mapValues(x => x._1 / x._2) // x._1 and x._2 come from the same (totalFriends, totalInstances) tuple

// Collect the results from the RDD (this kicks off computing the DAG and actually executes the job)
val results = averagesByAge.collect()

// Sort and print the final results.


results.sorted.foreach(println)
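One detail worth noting here: totalFriends and totalInstances are both Ints, so x._1 / x._2 is integer division and the fractional part is dropped. If a fractional average is wanted, cast one side to Double, e.g.:

val averagesByAge = totalsByAge.mapValues(x => x._1.toDouble / x._2)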
Example 3 - Min temp

import scala.math.min

def dataExtractor(x: String): (String, Float, String) = {
  val fields = x.split(",")
  val station = fields(0)
  val prop = fields(2)
  val temp = fields(3).toFloat
  (station, temp, prop)
}

val data = sc.textFile("data/1800.csv")
val filteredData = data.map(dataExtractor)
val twocol = filteredData.filter(x => x._3 == "TMIN")
val twocolumnonly = twocol.map(x => (x._1, x._2))
val results = twocolumnonly.reduceByKey((x, y) => min(x, y)) // x, y are two temperature values for the same station key
for (result <- results.sortByKey().collect()) {
  val station = result._1
  val temp = result._2
  val formattedTemp = f"$temp%.2f F"
  println(s"$station minimum temperature: $formattedTemp")
}

Example 4 - Word count
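A complete sketch for this example, based on the word count program shown earlier and the reduceByKey + sortBy variant from its commented-out lines (sorted descending here so the most frequent words come first):

val lines = sc.textFile("file.txt")
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val sortedCounts = wordCounts.sortBy(_._2, ascending = false) // most frequent first
sortedCounts.take(10).foreach(println)

// Or, the shorter countByValue() version, which returns a Map to the driver
val counts = words.countByValue()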
