PySpark Questions


1) Q- Reading a CSV file

You can read a CSV file with sqlContext.read.csv. Set inferSchema to True to tell Spark to infer the data types automatically; by default it is set to False.

df = sqlContext.read.csv(SparkFiles.get("adult_data.csv"), header=True, inferSchema=True)
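
A slightly fuller sketch, assuming Spark 2.x or later with a SparkSession named spark and that the file is first distributed with sc.addFile (the URL below is only an illustration):

from pyspark import SparkFiles

# Distribute the file so SparkFiles.get() can resolve a local path to it (illustrative URL).
sc.addFile("https://example.com/adult_data.csv")

df = spark.read.csv(SparkFiles.get("adult_data.csv"), header=True, inferSchema=True)
df.printSchema()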

2) What do you mean by Broadcast variables?


Ans. Broadcast variables are used to cache a read-only copy of data on all nodes, instead of shipping that data with every task. A broadcast variable is created with SparkContext.broadcast().

For example:
>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> b.unpersist()
>>> large_broadcast = sc.broadcast(range(10000))
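
A common use case is broadcasting a small lookup table so every task can read it locally rather than pulling it over the network; a minimal sketch with made-up lookup data:

# Broadcast a small dictionary to all executors (illustrative data).
codes = sc.broadcast({"US": "United States", "IN": "India"})

rdd = sc.parallelize(["US", "IN", "US"])
# Each task looks values up in the locally cached broadcast copy.
print(rdd.map(lambda c: codes.value[c]).collect())
# ['United States', 'India', 'United States']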

What do you mean by Status Tracker?


Ans. Status trackers are low-level status reporting APIs that help monitor job and stage progress.

# Constructor of PySpark's StatusTracker: it simply wraps the JVM-side tracker object.
def __init__(self, jtracker):
    self._jtracker = jtracker
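
A minimal usage sketch, assuming an existing SparkContext named sc; sc.statusTracker() returns the StatusTracker instance described above:

tracker = sc.statusTracker()

# IDs of jobs and stages that are currently running.
print(tracker.getActiveJobsIds())
print(tracker.getActiveStageIds())

# Detailed information for each active job (getJobInfo returns None for unknown IDs).
for job_id in tracker.getActiveJobsIds():
    info = tracker.getJobInfo(job_id)
    print(info.jobId, info.status)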

How do we create RDDs in Spark?


Ans: Spark provides two methods to create an RDD:

• By parallelizing a collection in your Driver program. This makes use of SparkContext’s ‘parallelize’ method (a PySpark sketch follows this list):

val DataArray = Array(2, 4, 6, 8, 10)
val DataRDD = sc.parallelize(DataArray)

• By loading an external dataset from external storage such as HDFS, HBase, or a shared file system.
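
Since this is a PySpark question set, here is a minimal PySpark sketch of both approaches, assuming an existing SparkContext named sc (the HDFS path is only an illustration):

# 1) Parallelize an in-memory collection from the driver program.
data_rdd = sc.parallelize([2, 4, 6, 8, 10])
print(data_rdd.collect())

# 2) Load an external dataset from external storage (illustrative path).
lines_rdd = sc.textFile("hdfs:///path/to/data.txt")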
What do you understand by "joins" in PySpark
DataFrame? What are the different types of joins available
in PySpark?
In PySpark, a join merges two DataFrames. It lets us link two or more DataFrames on one or more common columns.

INNER Join, LEFT OUTER Join, RIGHT OUTER Join, LEFT ANTI Join, LEFT SEMI Join, CROSS Join, and SELF Join are among the SQL join types PySpark supports. Following is the syntax of PySpark's join() method.

Syntax:

join(self, other, on=None, how=None)

Parameter Explanation:

The join() method accepts the following parameters and returns a DataFrame:

o "other": It specifies the join's right side.


o "on": It specifies the join column's name.
o "how": It is used to specify an option. Options are inner, cross, outer, full, full outer,
left, left outer, right, right outer, left semi, and left anti. The default is inner.

Types of Join in PySpark DataFrame

Join String                           Equivalent SQL Join
inner                                 INNER JOIN
outer, full, fullouter, full_outer    FULL OUTER JOIN
left, leftouter, left_outer           LEFT JOIN
right, rightouter, right_outer        RIGHT JOIN
cross                                 CROSS JOIN
anti, leftanti, left_anti             LEFT ANTI JOIN
semi, leftsemi, left_semi             LEFT SEMI JOIN
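
A minimal PySpark join sketch, assuming a SparkSession named spark; the DataFrames and column names are made up for illustration:

emp = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20)], ["id", "name", "dept_id"])
dept = spark.createDataFrame([(10, "Sales"), (30, "HR")], ["dept_id", "dept_name"])

# Inner join on the common column; swap `how` for "left", "right", "full", etc.
emp.join(dept, on="dept_id", how="inner").show()

# Left anti join: rows from emp that have no match in dept.
emp.join(dept, on="dept_id", how="left_anti").show()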

What is Parquet file in PySpark?


In PySpark, Parquet is a columnar file format supported by several data processing systems. Spark SQL can perform both read and write operations on Parquet files.

Parquet's column-oriented storage provides the following advantages:

o It is compact and consumes less space.
o It lets us fetch only the specific columns we need.
o It uses type-specific encoding.
o It offers better-summarized data.
o It requires very limited I/O operations.
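
A minimal sketch of writing and reading Parquet, assuming a SparkSession named spark, an existing DataFrame df, and an illustrative output path:

# Write the DataFrame in Parquet format (the path is only an illustration).
df.write.mode("overwrite").parquet("/tmp/people.parquet")

# Read it back; thanks to the columnar layout, selecting a subset of columns scans less data.
parquet_df = spark.read.parquet("/tmp/people.parquet")
parquet_df.printSchema()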


Which one do you prefer: groupByKey() or reduceByKey()?

groupByKey() can cause out-of-disk problems, because all the values for a key are sent over the network and collected on the reducer workers. See the example below.

sparkContext.textFile("hdfs://")
.flatMap(line => line.split(" ") )
.map(word => (word,1))
.groupByKey()
.map((x,y) => (x,sum(y)))

With reduceByKey(), by contrast, values are combined within each partition first, so only one output per key per partition is sent over the network. reduceByKey() requires that the values be combined into a value of the exact same type.

sparkContext.textFile("hdfs://")
.flatMap(line => line.split(" "))
.map(word => (word,1))
.reduceByKey((x,y)=> (x+y))
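
The same word count in PySpark, as a minimal sketch assuming an existing SparkContext named sc (the HDFS path is only an illustration):

counts = (sc.textFile("hdfs:///path/to/input.txt")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda x, y: x + y))
print(counts.take(5))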

Q - Group DataFrame by hour of the Day:

https://stackoverflow.com/questions/66647892/group-df-by-hour-of-day/66649736#66649736
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions.Window

val df = Seq(
(75, "2019-01-19 02:13:00", 5 , "Brooklyn", "Williamsburg"),
(255, "2019-01-19 12:05:00", 8 , "Brooklyn", "Williamsburg"),
(99, "2019-01-20 12:05:00", 3 , "Brooklyn", "DUMBO"),
(102, "2019-01-01 02:05:00", 1 , "Brooklyn", "DUBMO"),
(10, "2019-01-07 11:05:00", 13, "Brooklyn", "Park Slope"),
(75, "2019-01-01 11:05:00", 2 , "Brooklyn", "Williamsburg"),
(12, "2019-01-11 01:05:00", 1 , "Brooklyn", "Park Slope"),
(98, "2019-01-28 01:05:00", 8 , "Brooklyn", "DUMBO"),
(75, "2019-01-10 00:05:00", 8 , "Brooklyn", "Williamsburg"),
(255, "2019-01-11 12:05:00", 12, "Brooklyn", "DUMBO"),
).toDF("PULocationID", "pickup_datetime", "number_of_pickups",
"Borough", "Zone")
df.show()

val df1 = df.
  withColumn("pickup_datetime",
    F.to_timestamp(F.col("pickup_datetime"), "yyyy-MM-dd HH:mm:ss")).
  withColumn("hour", F.hour(F.col("pickup_datetime")))
df1.show()
df1.printSchema()

val windowSpec =
  Window.partitionBy("hour").orderBy(F.desc("number_of_pickups"))
val df2 = df1.withColumn("rn", F.row_number().over(windowSpec))

df2.filter(F.col("rn") === 1).drop(F.col("rn"))
  .select("hour", "Zone", "number_of_pickups").show()

Q- Drop column
There are two intuitive APIs to drop columns:

• drop(): Drop a column


• dropna(): Drop NAs (see the sketch after the example below)

Below, you drop the column education_num:


df.drop('education_num').columns

['age',
'workclass',
'fnlwgt',
'education',
'marital',
'occupation',
'relationship',
'race',
'sex',
'capital_gain',
'capital_loss',
'hours_week',
'native_country',
'label']
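
For dropna(), a minimal sketch using the same df (how and subset are standard parameters of DataFrame.dropna()):

# Drop rows where any column is null.
df.dropna(how="any").count()

# Consider only the listed columns when deciding which rows to drop.
df.dropna(subset=["age", "workclass"]).count()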
