PySpark Questions


1) Q- Reading a CSV file

You can read a CSV file with sqlContext.read.csv. Set inferSchema to True to tell Spark to infer the data types automatically; by default it is set to False.

df = sqlContext.read.csv(SparkFiles.get("adult_data.csv"), header=True, inferSchema=True)
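
A slightly fuller sketch, assuming Spark 2.x or later with a SparkSession named spark and that the file is first distributed with sc.addFile (the URL below is only an illustration):

from pyspark import SparkFiles

# Distribute the file so SparkFiles.get() can resolve a local path to it (illustrative URL).
sc.addFile("https://example.com/adult_data.csv")

df = spark.read.csv(SparkFiles.get("adult_data.csv"), header=True, inferSchema=True)
df.printSchema()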

2) What do you mean by Broadcast variables?


Ans. Broadcast variables are used to cache a read-only copy of data on all nodes, instead of shipping that data with every task. A broadcast variable is created with SparkContext.broadcast().

For example:
>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> b.unpersist()
>>> large_broadcast = sc.broadcast(range(10000))
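
A common use case is broadcasting a small lookup table so every task can read it locally rather than pulling it over the network; a minimal sketch with made-up lookup data:

# Broadcast a small dictionary to all executors (illustrative data).
codes = sc.broadcast({"US": "United States", "IN": "India"})

rdd = sc.parallelize(["US", "IN", "US"])
# Each task looks values up in the locally cached broadcast copy.
print(rdd.map(lambda c: codes.value[c]).collect())
# ['United States', 'India', 'United States']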

What do you mean by Status Tracker?


Ans. Status trackers are low-level status reporting APIs that help monitor job and stage progress.

# Constructor of PySpark's StatusTracker: it simply wraps the JVM-side tracker object.
def __init__(self, jtracker):
    self._jtracker = jtracker
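
A minimal usage sketch, assuming an existing SparkContext named sc; sc.statusTracker() returns the StatusTracker instance described above:

tracker = sc.statusTracker()

# IDs of jobs and stages that are currently running.
print(tracker.getActiveJobsIds())
print(tracker.getActiveStageIds())

# Detailed information for each active job (getJobInfo returns None for unknown IDs).
for job_id in tracker.getActiveJobsIds():
    info = tracker.getJobInfo(job_id)
    print(info.jobId, info.status)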

How do we create RDDs in Spark?


Ans: Spark provides two methods to create an RDD:

• By parallelizing a collection in your Driver program. This makes use of SparkContext’s ‘parallelize’ method (a PySpark sketch follows this list):

val DataArray = Array(2, 4, 6, 8, 10)
val DataRDD = sc.parallelize(DataArray)

• By loading an external dataset from external storage such as HDFS, HBase, or a shared file system.
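
Since this is a PySpark question set, here is a minimal PySpark sketch of both approaches, assuming an existing SparkContext named sc (the HDFS path is only an illustration):

# 1) Parallelize an in-memory collection from the driver program.
data_rdd = sc.parallelize([2, 4, 6, 8, 10])
print(data_rdd.collect())

# 2) Load an external dataset from external storage (illustrative path).
lines_rdd = sc.textFile("hdfs:///path/to/data.txt")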
What do you understand by "joins" in PySpark
DataFrame? What are the different types of joins available
in PySpark?
In PySpark, a join merges two DataFrames. It lets us link two or more DataFrames on one or more common columns.

INNER Join, LEFT OUTER Join, RIGHT OUTER Join, LEFT ANTI Join, LEFT SEMI Join, CROSS Join, and SELF Join are among the SQL join types PySpark supports. Following is the syntax of PySpark's join() method.

Syntax:

join(self, other, on=None, how=None)

Parameter Explanation:

The join() method accepts the following parameters and returns a DataFrame:

o "other": It specifies the join's right side.


o "on": It specifies the join column's name.
o "how": It is used to specify an option. Options are inner, cross, outer, full, full outer,
left, left outer, right, right outer, left semi, and left anti. The default is inner.

Types of Join in PySpark DataFrame

Join String                           Equivalent SQL Join
inner                                 INNER JOIN
outer, full, fullouter, full_outer    FULL OUTER JOIN
left, leftouter, left_outer           LEFT JOIN
right, rightouter, right_outer        RIGHT JOIN
cross                                 CROSS JOIN
anti, leftanti, left_anti             LEFT ANTI JOIN
semi, leftsemi, left_semi             LEFT SEMI JOIN
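
A minimal PySpark join sketch, assuming a SparkSession named spark; the DataFrames and column names are made up for illustration:

emp = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20)], ["id", "name", "dept_id"])
dept = spark.createDataFrame([(10, "Sales"), (30, "HR")], ["dept_id", "dept_name"])

# Inner join on the common column; swap `how` for "left", "right", "full", etc.
emp.join(dept, on="dept_id", how="inner").show()

# Left anti join: rows from emp that have no match in dept.
emp.join(dept, on="dept_id", how="left_anti").show()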

What is Parquet file in PySpark?


In PySpark, Parquet is a columnar file format supported by several data processing systems. Spark SQL can perform both read and write operations on Parquet files.

Parquet's column-oriented storage provides the following advantages:

o It is compact and consumes less space.
o It lets us fetch only the specific columns we need.
o It uses type-specific encoding.
o It offers better-summarized data.
o It requires very limited I/O operations.
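
A minimal sketch of writing and reading Parquet, assuming a SparkSession named spark, an existing DataFrame df, and an illustrative output path:

# Write the DataFrame in Parquet format (the path is only an illustration).
df.write.mode("overwrite").parquet("/tmp/people.parquet")

# Read it back; thanks to the columnar layout, selecting a subset of columns scans less data.
parquet_df = spark.read.parquet("/tmp/people.parquet")
parquet_df.printSchema()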


Which one do you prefer: groupByKey() or reduceByKey()?

groupByKey() can cause out-of-disk problems, because all the values for a key are sent over the network and collected on the reducer workers. See the example below.

sparkContext.textFile("hdfs://")
.flatMap(line => line.split(" ") )
.map(word => (word,1))
.groupByKey()
.map((x,y) => (x,sum(y)))

With reduceByKey(), by contrast, values are combined within each partition first, so only one output per key per partition is sent over the network. reduceByKey() requires that the values be combined into a value of the exact same type.

sparkContext.textFile("hdfs://")
.flatMap(line => line.split(" "))
.map(word => (word,1))
.reduceByKey((x,y)=> (x+y))
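
The same word count in PySpark, as a minimal sketch assuming an existing SparkContext named sc (the HDFS path is only an illustration):

counts = (sc.textFile("hdfs:///path/to/input.txt")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda x, y: x + y))
print(counts.take(5))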

Q - Group DataFrame by hour of the Day:

https://stackoverflow.com/questions/66647892/group-df-by-hour-of-day/66649736#66649736
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions.Window

val df = Seq(
(75, "2019-01-19 02:13:00", 5 , "Brooklyn", "Williamsburg"),
(255, "2019-01-19 12:05:00", 8 , "Brooklyn", "Williamsburg"),
(99, "2019-01-20 12:05:00", 3 , "Brooklyn", "DUMBO"),
(102, "2019-01-01 02:05:00", 1 , "Brooklyn", "DUBMO"),
(10, "2019-01-07 11:05:00", 13, "Brooklyn", "Park Slope"),
(75, "2019-01-01 11:05:00", 2 , "Brooklyn", "Williamsburg"),
(12, "2019-01-11 01:05:00", 1 , "Brooklyn", "Park Slope"),
(98, "2019-01-28 01:05:00", 8 , "Brooklyn", "DUMBO"),
(75, "2019-01-10 00:05:00", 8 , "Brooklyn", "Williamsburg"),
(255, "2019-01-11 12:05:00", 12, "Brooklyn", "DUMBO"),
).toDF("PULocationID", "pickup_datetime", "number_of_pickups",
"Borough", "Zone")
df.show()

val df1 = df.
  withColumn("pickup_datetime",
    F.to_timestamp(F.col("pickup_datetime"), "yyyy-MM-dd HH:mm:ss")).
  withColumn("hour", F.hour(F.col("pickup_datetime")))
df1.show()
df1.printSchema()

val windowSpec =
  Window.partitionBy("hour").orderBy(F.desc("number_of_pickups"))
val df2 = df1.withColumn("rn", F.row_number().over(windowSpec))

df2.filter(F.col("rn") === 1).drop(F.col("rn"))
  .select("hour", "Zone", "number_of_pickups").show()

Q- Drop column
There are two intuitive APIs to drop columns:

• drop(): Drop a column


• dropna(): Drop NAs (see the sketch after the example below)

Below, you drop the column education_num:


df.drop('education_num').columns

['age',
'workclass',
'fnlwgt',
'education',
'marital',
'occupation',
'relationship',
'race',
'sex',
'capital_gain',
'capital_loss',
'hours_week',
'native_country',
'label']
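
For dropna(), a minimal sketch using the same df (how and subset are standard parameters of DataFrame.dropna()):

# Drop rows where any column is null.
df.dropna(how="any").count()

# Consider only the listed columns when deciding which rows to drop.
df.dropna(subset=["age", "workclass"]).count()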
