Unit Testing of Spark Applications

Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
Agenda

● What is Spark?
● What is Unit Testing?
● Why do we need Unit Testing?
● Unit Testing of Spark Applications
● Demo
What is Spark?
● A distributed compute engine for large-scale data processing.
● Up to 100x faster than Hadoop MapReduce (in memory).
● Provides APIs in Python, Scala, Java and R (as of Spark 1.4).
● Combines SQL, streaming and complex analytics.
● Runs on Hadoop, Mesos, or in the cloud.

src: http://spark.apache.org/
What is Unit Testing?

● Unit Testing is a software testing method by which individual units of
source code are tested to determine whether they are fit for use.

● Unit tests ensure that code meets its design specifications and behaves
as intended.

● Its goal is to isolate each part of the program and show that the
individual parts are correct (see the minimal example below).

src: https://en.wikipedia.org/wiki/Unit_testing
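
To make this concrete, here is a minimal, hypothetical ScalaTest example
(MathUtil and its square method are illustrative, not part of the demo
project): one small unit of code, exercised in isolation, with an assertion
on its behaviour.

import org.scalatest.FunSuite

// A hypothetical unit under test.
object MathUtil {
  def square(x: Int): Int = x * x
}

class MathUtilTest extends FunSuite {
  // One unit test: a single behaviour, verified in isolation.
  test("square of 3 is 9") {
    assert(MathUtil.square(3) === 9)
  }
}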
Why do we need Unit Testing?
● Find problems early
- Finds bugs or missing parts of the specification early in the development cycle.

● Facilitates change
- Helps in refactoring and upgrading without the fear of breaking existing functionality.

● Simplifies integration
- Makes integration tests easier to write.

● Documentation
- Provides living documentation of the system.

● Design
- Can act as the formal design of a project.

src: https://en.wikipedia.org/wiki/Unit_testing
Unit Testing of Spark Applications
Unit to Test

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class WordCount {
  // Reads the text file at the given path and returns an RDD of
  // (word, count) pairs.
  def get(url: String, sc: SparkContext): RDD[(String, Int)] = {
    val lines = sc.textFile(url)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  }
}
Method 1
import org.scalatest.{ BeforeAndAfterAll, FunSuite }
import org.apache.spark.{ SparkConf, SparkContext }

class WordCountTest extends FunSuite with BeforeAndAfterAll {

  private var sparkConf: SparkConf = _
  private var sc: SparkContext = _

  // Create a local SparkContext before any test in this suite runs.
  override def beforeAll() {
    sparkConf = new SparkConf().setAppName("unit-testing").setMaster("local")
    sc = new SparkContext(sparkConf)
  }

  private val wordCount = new WordCount

  test("get word count rdd") {
    val result = wordCount.get("file.txt", sc)

    assert(result.take(10).length === 10)
  }

  // Stop the SparkContext after all tests have run.
  override def afterAll() {
    sc.stop()
  }
}
Cons of Method 1

● Explicit management of SparkContext creation and destruction.

● The developer has to write more lines of code for testing.

● Code duplication, as the Before and After steps have to be repeated
in every test suite (see the sketch below).
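
To see the duplication problem, here is a sketch of the hand-rolled base
trait each project would otherwise end up writing (the trait and its name
are illustrative); packaging up exactly this kind of boilerplate is what
Method 2 below does for you.

import org.scalatest.{ BeforeAndAfterAll, Suite }
import org.apache.spark.{ SparkConf, SparkContext }

// Hypothetical reusable base trait: every suite mixes this in instead of
// repeating beforeAll/afterAll.
trait LocalSparkContext extends BeforeAndAfterAll { self: Suite =>

  @transient protected var sc: SparkContext = _

  override def beforeAll() {
    super.beforeAll()
    sc = new SparkContext(
      new SparkConf().setAppName("unit-testing").setMaster("local"))
  }

  override def afterAll() {
    if (sc != null) sc.stop()
    super.afterAll()
  }
}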
Method 2 (Better Way)

Spark Testing Base

A Spark package containing base classes to use when writing tests with Spark.

How?
"com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.2"
Method 2 (Better Way) contd...
Example 1
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext

class WordCountTest extends FunSuite with SharedSparkContext {

  private val wordCount = new WordCount

  test("get word count rdd") {
    val result = wordCount.get("file.txt", sc)

    assert(result.take(10).length === 10)
  }
}

SharedSparkContext sets up a local SparkContext before the suite runs,
exposes it as sc, and stops it afterwards, so no beforeAll/afterAll
boilerplate is needed.
Method 2 (Better Way) contd...
Example 2
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.{ RDDComparisons, SharedSparkContext }

class WordCountTest extends FunSuite with SharedSparkContext {

  private val wordCount = new WordCount

  test("get word count rdd with comparison") {
    // Build the expected RDD with the same logic, then compare: the
    // comparison is empty when the two RDDs contain the same elements.
    val expected =
      sc.textFile("file.txt")
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)

    val result = wordCount.get("file.txt", sc)

    assert(RDDComparisons.compare(expected, result).isEmpty)
  }
}
Pros of Method 2
● Succinct code.

● Rich test API.

● Supports Scala, Java and Python.

● Provides an API for testing Streaming applications too (see the
sketch below).

● Has in-built RDD comparators.

● Supports both local & cluster mode testing.
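
As a sketch of the Streaming support, assuming the StreamingSuiteBase
trait and its testOperation helper from spark-testing-base and the
spark-streaming dependency on the classpath (check the exact API of the
version you use): each inner List below is fed as one micro-batch, and the
output is compared batch by batch.

import org.scalatest.FunSuite
import org.apache.spark.streaming.dstream.DStream
import com.holdenkarau.spark.testing.StreamingSuiteBase

class StreamingWordCountTest extends FunSuite with StreamingSuiteBase {

  // The operation under test: word count expressed over DStreams.
  def count(lines: DStream[String]): DStream[(String, Int)] =
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

  test("count words in a stream") {
    val input = List(List("hello world", "hello"))
    val expected = List(List(("hello", 2), ("world", 1)))

    // ordered = false ignores the ordering of records within a batch.
    testOperation(input, count _, expected, ordered = false)
  }
}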


When to use What?

Method 1
● For small-scale Spark applications.
● When the extended capabilities of spark-testing-base are not needed.
● For sample applications.

Method 2
● For large-scale Spark applications.
● When cluster mode or performance testing is required.
● For production applications.
Demo
Questions & Option[A]
References

● https://github.com/holdenk/spark-testing-base

● Effective Testing for Spark Programs, Strata NY 2015

● Testing Spark: Best Practices

Thank you
