
If tasks can run in parallel without moving data, they belong to the same stage - e.g. map.
If a task needs data shuffled between nodes, a new stage is created - e.g. countByValue.
Each stage is broken into tasks, which may be distributed across the cluster.
Finally, the tasks are scheduled across the cluster and executed.
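A minimal sketch of where the stage boundary appears, assuming a SparkContext named sc (as in the examples below); reduceByKey forces a shuffle, so a new stage starts there, and toDebugString prints the lineage with its stage boundaries:

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
// map is a narrow transformation - it stays in the same stage as the parallelize
val pairs = words.map(w => (w, 1))
// reduceByKey shuffles data between nodes - Spark starts a new stage here
val counts = pairs.reduceByKey((x, y) => x + y)
// Print the RDD lineage; the indentation marks the stage boundaries
println(counts.toDebugString)
counts.collect().foreach(println)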

map - transforms each element of the collection into a new element.


val rdd = sc.parallelize(Seq("Hello World", "Goodbye World"))

// Map each element to its length
val result = rdd.map(s => s.length)

// Output: [11, 13]
result.collect()

Maps can also create key-value pairs - rdd.map(x => (x, 1))

Spark provides special operations on key-value pairs:

1. reduceByKey((x, y) => x + y) - combines values that have the same key.

x = the running (accumulated) value
y = the next value
Both x and y are values, not keys.

MyAvgFriends example -
val mapval = filtereddata.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
val xyz = mapval.mapValues(x => x._1 / x._2)
2. groupByKey() - group values with the same key

3. sortByKey() - sort the RDD by key

4. keys(), values() - create an RDD of just the keys or just the values

5. SQL-style joins - join, rightOuterJoin, leftOuterJoin, cogroup, subtractByKey

(A sketch of these operations follows this list.)
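A small sketch of these pair-RDD operations on made-up data (the sales/prices RDDs and their values are purely illustrative):

val sales  = sc.parallelize(Seq(("apple", 2), ("banana", 5), ("apple", 3), ("cherry", 1)))
val prices = sc.parallelize(Seq(("apple", 100), ("banana", 40)))

sales.reduceByKey((x, y) => x + y).collect()   // sums the values for each key
sales.groupByKey().collect()                   // groups all values per key
sales.sortByKey().collect()                    // sorts by key
sales.keys.collect()                           // just the keys
sales.values.collect()                         // just the values

sales.join(prices).collect()                   // inner join on key
sales.leftOuterJoin(prices).collect()          // keeps every key from sales
sales.rightOuterJoin(prices).collect()         // keeps every key from prices
sales.cogroup(prices).collect()                // groups values from both RDDs per key
sales.subtractByKey(prices).collect()          // keys present in sales but not in prices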

countByValue() is an action in Spark that counts the number of occurrences of each unique value in an RDD and returns the result to the driver as a Map.

Word count program -
val lines = sc.textFile("file.txt") // gives each line from the file
val data2 = lines.flatMap(x => x.split(" "))
// val data3 = data2.map(x => (x,1))
// val data4 = data3.reduceByKey((x,y) => x+y).sortBy(_._2)
val data4 = data2.countByValue()

map vs flatMap -

map - one to one
flatMap - one to many

flatMap - transforms each element of the collection into zero or more new elements.

val rdd = sc.parallelize(Seq("Hello World", "Goodbye World"))

// Split each string into words
val result = rdd.flatMap(s => s.split(" "))

// Output: ["Hello", "World", "Goodbye", "World"]
result.collect()
equals - compares objects for equality. For case classes, the method returns true if the argument is not null, is of the same class as the object being compared, and has the same values for all of its fields as the object being compared.

case class Person(name: String, age: Int)

val person1 = Person("Alice", 30)


val person2 = Person("Alice", 30)
val person3 = Person("Bob", 25)

println(person1.equals(person2)) // true
println(person1.equals(person3)) // false

Broadcast variable - if the dataset is small (i.e. it fits in memory), we can load it in the driver program and Spark will automatically forward it to each executor when needed.
sc.broadcast() to broadcast the object
.value() to get the object back

But if the table is massive, we'd only want to transfer it once to each executor and keep it there - which is what broadcasting gives us.

Refer mybroadcast.scala - com/sundogsoftware/spark/mybroadcast.scala

val broadcastMap = spark.sparkContext.broadcast(nameId())

val momap: Int => String = (movieID: Int) => broadcastMap.value(movieID)
val lookupNameUDF = udf(momap)
val moviesWithNames = movieCounts.withColumn("movieTitle", lookupNameUDF(col("movieID")))
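The snippet above relies on a nameId() helper from the course code; here is a self-contained sketch of the same broadcast + UDF lookup pattern with a hard-coded map (the movie IDs, titles and counts are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, col}

val spark = SparkSession.builder().appName("BroadcastSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Small lookup table built in the driver and broadcast once to every executor
val idToName = Map(1 -> "Toy Story", 2 -> "GoldenEye")
val broadcastNames = spark.sparkContext.broadcast(idToName)

val movieCounts = Seq((1, 100L), (2, 42L)).toDF("movieID", "count")

// .value gives the broadcast object back inside the UDF running on the executors
val lookupNameUDF = udf((movieID: Int) => broadcastNames.value.getOrElse(movieID, "Unknown"))
movieCounts.withColumn("movieTitle", lookupNameUDF(col("movieID"))).show()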

UDF - In Spark Scala, a User-Defined Function (UDF) allows you to define custom functions
that can be applied to DataFrame columns or used in SQL expressions. UDFs provide flexibility
in performing custom operations on data within Spark.

import org.apache.spark.sql.functions.udf

// Define the UDF (InputType is a placeholder for the Scala type of the input column)
val myUDF = udf((input: InputType) => {
  // Custom logic to process the input and return a result
  result
})

val df: DataFrame = ... // Your DataFrame
val resultDF = df.withColumn("newColumn", myUDF(col("inputColumn")))
resultDF.show()
Example -
val squareUDF = udf((x: Int) => x * x)
// Apply the UDF to a column
val resultDF = df.withColumn("squared", squareUDF(df("number")))

Spark-submit
1. Create the jar: Project Structure => Artifacts => include dependencies + code (without the library jars) => Build.
2. This will create an out folder with the .jar file in it.
3. Navigate to the folder where your jar is placed (for Ubuntu) and run the following command -

spark-submit --class com.sundogsoftware.spark.HelloWorld hello2.jar

spark-submit --class com.example.MyApp --master local myapp.jar
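A few other commonly used spark-submit options (the values below are placeholders to tune for your cluster, not recommendations):

spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2g \
  --num-executors 4 \
  myapp.jar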

By SBT -
Remove code that is not suitable for a cluster, e.g. local[*] (can't take advantage of the cluster) and hard-coded file paths.

build.sbt file

name := "MovieSimilarities1MDataset"

version := "1.0"

organization := "com.sundogsoftware"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.5" % "provided", // provided means it is already installed on the server (EMR)
  "org.apache.spark" %% "spark-sql" % "2.4.5" % "provided"
)
// Spark and Scala should have compatible versions
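With this build.sbt in place, a minimal sketch of building and submitting; sbt package drops the jar under target/scala-2.11/ (the exact jar name follows from the project name and version above, and the main class shown is an assumption based on the project name):

sbt package
spark-submit --class com.sundogsoftware.spark.MovieSimilarities1MDataset target/scala-2.11/moviesimilarities1mdataset_2.11-1.0.jar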
Feature             | RDD                                            | DataFrame                                 | Dataset
Data Structure      | Resilient Distributed Dataset                  | Structured data with schema               | Strongly typed structured data with schema
Immutable           | Yes                                            | Yes                                       | Yes
Schema              | No schema enforcement                          | Schema enforcement                        | Schema enforcement
Optimization        | Limited optimization                           | Catalyst optimizer                        | Catalyst optimizer
Performance         | Lower performance compared to others           | Higher performance compared to RDDs       | Comparable to DataFrames
API                 | Lower-level API with more flexibility          | Higher-level API with ease of use         | Higher-level API with type safety
Serialization       | Uses Java serialization by default             | Uses optimized serialization formats      | Uses optimized serialization formats
Interoperability    | Compatible with any JVM language               | Compatible with any JVM language          | Compatible with any JVM language
Integration         | Supports both structured and unstructured data | Best for structured data                  | Supports both structured and unstructured data
Type Safety         | Not type-safe                                  | Partially type-safe (based on DataFrame)  | Type-safe
Compile-time checks | No                                             | No                                        | Yes
Catalyst optimizer  | No                                             | Yes                                       | Yes


RDD creation -

import org.apache.spark.SparkContext

1.
val data = Seq(1, 2, 3, 4, 5)
val rdd = sparkContext.parallelize(data)

2.
val sc = new SparkContext("local[*]","myavgfriend")
val data = sc.textFile("data/fakefriends-noheader.csv")

3. Spark supports reading data from various external data sources, such as HDFS, Amazon S3,
Apache Cassandra, JDBC databases, etc

val rdd = sparkContext.textFile("hdfs://path/to/file.txt")

Dataframe creation -

import org.apache.spark.sql.SparkSession

1. RDD TO Dataframe - RDD to dataframe by toDF() method

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RDD to DataFrame")
  .master("local")
  .getOrCreate()

import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(("Alice", 25), ("Bob", 30), ("Charlie",


35)))

val df = rdd.toDF("Name", "Age")


2. createDataFrame
import spark.implicits._
val data = Seq(("Alice", 25), ("Bob", 30), ("Charlie", 35))
val df = spark.createDataFrame(data).toDF("Name", "Age")

3. Read file with header

val df = spark.read.format("csv")
.option("header", "true")
.load("/path/to/file.csv")

4. File without header

import org.apache.spark.sql.{SparkSession, Row}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, LongType}

val moviesSchema = new StructType()
  .add("userID", IntegerType, nullable = true)
  .add("movieID", IntegerType, nullable = true)
  .add("rating", IntegerType, nullable = true)
  .add("timestamp", LongType, nullable = true)

import spark.implicits._

// Load up the movie data as a DataFrame
val moviesDS = spark.read
  .option("sep", "\t")
  .schema(moviesSchema)
  .csv("data/ml-100k/u.data")

Dataset creation -

1. DF to DS

case class UserRatings(userID: Int, movieID: Int, rating: Int, timestamp: Long)

val userRatingsSchema = new StructType()
  .add("userID", IntegerType, nullable = true)
  .add("movieID", IntegerType, nullable = true)
  .add("rating", IntegerType, nullable = true)
  .add("timestamp", LongType, nullable = true)

val ratingsDS = spark.read
  .option("sep", "\t")
  .schema(userRatingsSchema)
  .csv("data/ml-100k/u.data")
  .as[UserRatings]

Use of case class -

Type safety: By using case classes to define the schema, you get compile-time type checking. The compiler ensures that the data you are working with adheres to the specified structure. This reduces the chances of runtime errors caused by incorrect data manipulation or mismatched types.

When you use a case class to represent the structure of your data and convert a DataFrame to a Dataset using the as[] method, Spark can map the columns of the DataFrame to the fields of the case class based on their names and types.
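A small sketch of what that type safety looks like in practice, using the ratingsDS defined above (and assuming spark.implicits._ is imported as in the earlier snippets):

// Field names and types are checked at compile time: r.rating is an Int,
// and a typo such as r.ratng would be a compile error, not a runtime failure
val goodRatings = ratingsDS.filter(r => r.rating >= 4)
val userIDs = ratingsDS.map(r => r.userID)
goodRatings.show(5)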

Use of import spark.implicits._ - used when working with DataFrames or Datasets in Scala to:
- enable convenient conversions between different data types
- provide access to useful functions and operators (filtering, aggregating, joining, etc.)
- support DataFrame (toDF()) / Dataset (as[T]) conversions
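A quick sketch of all three, reusing the Person case class from the equals example above (the data is made up):

import spark.implicits._

val df = Seq(("Alice", 25), ("Bob", 30)).toDF("name", "age") // Seq -> DataFrame via toDF()
df.filter($"age" > 26).show()                                // $"col" column syntax
val ds = df.as[Person]                                       // DataFrame -> Dataset[Person]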

Example 1 - rating counter

// Load up each line of the ratings data into an RDD


val lines = sc.textFile("data/ml-100k/u.data")

// Convert each line to a string, split it out by tabs, and extract the third field.
// (The file format is userID, movieID, rating, timestamp)
val ratings = lines.map(x => x.split("\t")(2))

// Count up how many times each value (rating) occurs


val results = ratings.countByValue()

// Sort the resulting map of (rating, count) tuples


val sortedResults = results.toSeq.sortBy(_._1)

// Print each result on its own line.


sortedResults.foreach(println)

(5,21201)
(4,34174)
(3,27145)
(2,11370)
Example 2 - Avg friends

/** A function that splits a line of input into (age, numFriends) tuples. */
/**
0,Will,33,385
1,Jean-Luc,26,2
*/
def parseLine(line: String): (Int, Int) = {
  // Split by commas
  val fields = line.split(",")
  // Extract the age and numFriends fields, and convert to integers
  val age = fields(2).toInt
  val numFriends = fields(3).toInt
  // Create a tuple that is our result.
  (age, numFriends)
}

// Load each line of the source data into an RDD


val lines = sc.textFile("data/fakefriends-noheader.csv")

// Use our parseLines function to convert to (age, numFriends) tuples


val rdd = lines.map(parseLine)

// Lots going on here...
// We are starting with an RDD of form (age, numFriends) where age is the KEY and numFriends is the VALUE
// We use mapValues to convert each numFriends value to a tuple of (numFriends, 1)
// Then we use reduceByKey to sum up the total numFriends and total instances for each age,
// by adding together all the numFriends values and 1's respectively.
val totalsByAge = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) // x and y are two (numFriends, 1) value tuples for the same age key

// So now we have tuples of (age, (totalFriends, totalInstances))
// To compute the average we divide totalFriends / totalInstances for each age.
val averagesByAge = totalsByAge.mapValues(x => x._1 / x._2) // x._1 and x._2 come from the same (totalFriends, totalInstances) tuple

// Collect the results from the RDD (this kicks off computing the DAG and actually executes the job)
val results = averagesByAge.collect()

// Sort and print the final results.


results.sorted.foreach(println)
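One detail worth noting here: totalFriends and totalInstances are both Ints, so x._1 / x._2 is integer division and the fractional part is dropped. If a fractional average is wanted, cast one side to Double, e.g.:

val averagesByAge = totalsByAge.mapValues(x => x._1.toDouble / x._2)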
Example 3 - Min temp

import scala.math.min

def dataExtractor(x: String): (String, Float, String) = {
  val fields = x.split(",")
  val station = fields(0)
  val prop = fields(2)
  val temp = fields(3).toFloat
  (station, temp, prop)
}

val data = sc.textFile("data/1800.csv")
val filteredData = data.map(dataExtractor)
val twocol = filteredData.filter(x => x._3 == "TMIN")
val twocolumnonly = twocol.map(x => (x._1, x._2))
val results = twocolumnonly.reduceByKey((x, y) => min(x, y)) // x, y are two temperature values for the same station key
for (result <- results.sortByKey().collect()) {
  val station = result._1
  val temp = result._2
  val formattedTemp = f"$temp%.2f F"
  println(s"$station minimum temperature: $formattedTemp")
}

Example 4 - Word count
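A complete sketch for this example, based on the word count program shown earlier and the reduceByKey + sortBy variant from its commented-out lines (sorted descending here so the most frequent words come first):

val lines = sc.textFile("file.txt")
val words = lines.flatMap(x => x.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val sortedCounts = wordCounts.sortBy(_._2, ascending = false) // most frequent first
sortedCounts.take(10).foreach(println)

// Or, the shorter countByValue() version, which returns a Map to the driver
val counts = words.countByValue()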
