Unit Testing of Spark Applications
Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
Agenda
● What is Spark?
● What is Unit Testing?
● Why do we need Unit Testing?
● Unit Testing of Spark Applications
● Demo
What is Spark?
● Distributed compute engine for
large-scale data processing.
● Up to 100x faster than Hadoop MapReduce for in-memory workloads.
● Provides APIs in Python, Scala, Java
and R (R since Spark 1.4).
● Combines SQL, streaming and
complex analytics.
src: http://spark.apache.org/
What is Unit Testing?
● Unit tests ensure that code meets its design specification and behaves as
intended.
● The goal of unit testing is to isolate each part of the program and show that the
individual parts are correct.
src: https://en.wikipedia.org/wiki/Unit_testing
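To make "isolate each part" concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names WordCounter and WordCounterCheck are hypothetical and do not come from the slides. The example assumes the unit is a small pure function, so it can be checked on its own with hand-written inputs.

```scala
// Hypothetical unit under test (not from the slides): a pure word-count function.
object WordCounter {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))  // split every line on runs of whitespace
      .filter(_.nonEmpty)         // drop empty tokens produced by blank or indented lines
      .groupBy(identity)          // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }
}

// The check exercises only WordCounter.count, with no files or cluster involved.
object WordCounterCheck extends App {
  val result = WordCounter.count(Seq("spark is fast", "spark is fun"))
  assert(result == Map("spark" -> 2, "is" -> 2, "fast" -> 1, "fun" -> 1))
  assert(WordCounter.count(Seq.empty).isEmpty)
}
```

Because the function takes a Seq and returns a Map, a failing assertion points directly at the counting logic and not at I/O or cluster setup.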
Why do we need Unit Testing?
● Find problems early
- Finds bugs or missing parts of the specification early in the development cycle.
● Facilitates change
- Helps with refactoring and upgrades without worrying about breaking functionality.
● Simplifies integration
- Makes Integration Tests easier to write.
● Documentation
- Provides living documentation of the system.
● Design
- Can act as a formal design of the project.
src: https://en.wikipedia.org/wiki/Unit_testing
Unit Testing of Spark Applications
Unit to Test
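import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class WordCount {
  // Counts how often each word occurs in the text file at `url`.
  def get(url: String, sc: SparkContext): RDD[(String, Int)] = {
    val lines = sc.textFile(url)
    // Assumed completion (the slide is cut off here): a standard word count.
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  }
}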
How?
"com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.2"
Method 2 (Better Way) contd...
Example 1
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.{RDDComparisons, SharedSparkContext}
assert(RDDComparisons.compare(expected, result).isEmpty)
}
}
Pros of Method 2
● Succinct code.
Method 1 vs. Method 2
● Method 1: for small-scale Spark applications.
● Method 2: for large-scale Spark applications.
● https://github.com/holdenk/spark-testing-base