A Tale of Spark Session and Spark Context
Context
I have SparkContext, SQLContext, and HiveContext already!
Some History…
Prior to Spark 2.0, SparkContext was the entry point of any Spark
application: it was used to access all Spark features, and creating
one required a SparkConf holding all the cluster configuration and
parameters. SparkContext primarily let us create RDDs, and for any
other kind of interaction we had to create a dedicated context:
SQLContext for SQL, HiveContext for Hive, StreamingContext for
streaming applications. In a nutshell, SparkSession is a combination
of all these different contexts. Internally, SparkSession creates a
SparkContext for all its operations, and all the above-mentioned
contexts can be accessed through the SparkSession object.
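For comparison, here is a minimal sketch of the pre-2.0 pattern (the class names are the real pre-2.0 APIs; the app name and master are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Pre-2.0: build a SparkConf, then a SparkContext, then a separate SQLContext on top of it
val conf = new SparkConf().setAppName("legacy-app").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)   // superseded by SparkSession in 2.x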
The SparkSession builder will return an existing SparkSession if one
has already been created, or create a new one and assign the newly
created SparkSession as the global default. Note
that enableHiveSupport is similar to creating a HiveContext: all it
does is enable access to the Hive metastore, Hive SerDes, and
Hive UDFs.
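A minimal sketch of that builder call (the application name is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-session-demo")   // placeholder application name
  .enableHiveSupport()             // enables Hive metastore, SerDes and UDF access
  .getOrCreate()                   // reuses an existing session or creates a new one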
scala> spark
res1: org.apache.spark.sql.SparkSession =
org.apache.spark.sql.SparkSession@2bd158ea
We can access the SparkContext and the other contexts through the
SparkSession object.
scala> spark.sparkContext
res2: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6803b02d

scala> spark.sqlContext
res3: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@74037f9b
We can also verify that SparkSession gives a unified view of all the
contexts while keeping configuration and environment isolated between
sessions. We can run SQL queries directly, without creating a
SQLContext the way we used to. Say we register a temporary view
called people_session1: it is visible only in the session spark. If
we create a second session, session2, the view is not visible from
it, and session2 can even create a view with the same name without
affecting the one in the spark session. A setup sketch and the
resulting output follow below.
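A minimal setup sketch, assuming people is read from a JSON file (the path is a placeholder); newSession() creates an isolated session that shares the same underlying SparkContext:

val people = spark.read.json("examples/src/main/resources/people.json")   // placeholder path
val session2 = spark.newSession()   // isolated session, same underlying SparkContext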
scala> people.createOrReplaceTempView("people_session1")

scala> spark.sql("show tables").show()
+--------+---------------+-----------+
|database|      tableName|isTemporary|
+--------+---------------+-----------+
|        |people_session1|       true|
+--------+---------------+-----------+

scala> spark.catalog.listTables.show()
+---------------+--------+-----------+---------+-----------+
|           name|database|description|tableType|isTemporary|
+---------------+--------+-----------+---------+-----------+
|people_session1|    null|       null|TEMPORARY|       true|
+---------------+--------+-----------+---------+-----------+

scala> session2.sql("show tables").show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

scala> session2.catalog.listTables.show()
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
+----+--------+-----------+---------+-----------+
Set configurations:
To set configurations, we can either pass them when we create the
SparkSession, using the builder's .config option, or use the set
method on spark.conf. The set method is shown below, followed by a
sketch of the builder option.
scala> spark.conf.get("spark.sql.crossJoin.enabled")
res4: String = false

scala> spark.conf.set("spark.sql.crossJoin.enabled", "true")

scala> spark.conf.get("spark.sql.crossJoin.enabled")
res6: String = true
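And a minimal sketch of setting the same config at build time (the app name is a placeholder):

val sparkWithConf = SparkSession.builder()
  .appName("config-demo")                          // placeholder application name
  .config("spark.sql.crossJoin.enabled", "true")   // set at build time instead of spark.conf.set
  .getOrCreate()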
5) Spark implicits: importing spark.implicits._ brings in the
implicit encoders and conversions used to turn local collections and
RDDs into Datasets and DataFrames, as sketched below.
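A brief sketch (the sample data is made up):

import spark.implicits._   // implicits provided by the SparkSession instance named spark

// toDF on a local collection comes from the implicits imported above
val peopleDF = Seq(("Andy", 32), ("Justin", 19)).toDF("name", "age")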
6) Create UDFs
// The capAndRemoveSpaces implementation is an assumption; the original only shows the registration
def capAndRemoveSpaces(s: String): String = s.toUpperCase.replaceAll("\\s+", "")

val capitalizeAndRemoveSpaces =
  spark.udf.register("capitalizeAndRemoveSpaces", capAndRemoveSpaces _)
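Once registered, the UDF can be used both from SQL and on DataFrame columns (assuming the name column from the earlier examples):

spark.sql("SELECT capitalizeAndRemoveSpaces(name) FROM people_session1").show()
peopleDF.select(capitalizeAndRemoveSpaces($"name")).show()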
10) The close and stop methods are provided to stop the underlying
SparkContext.
scala> session2.close
scala> session2.sparkContext.isStopped
res5: Boolean = true
Also, note that closing or stopping session2 kills the spark session
as well, because the underlying SparkContext is the same.
scala> spark.sparkContext.isStopped
res6: Boolean = true