PySpark Questions
You can read the CSV file with sqlContext.read.csv. Set inferSchema to
True to tell Spark to guess the type of each column automatically; by
default it is set to False.
from pyspark import SparkFiles  # needed for SparkFiles.get; the file is shipped to workers via sc.addFile(...)

df = sqlContext.read.csv(SparkFiles.get("adult_data.csv"),
                         header=True, inferSchema=True)
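To check that the types were really inferred, one can print the schema (a quick sketch; with inferSchema=False every column would instead be read as a string):

df.printSchema()  # e.g. age is inferred as integer rather than string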
For example:
>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> b.unpersist()
>>> large_broadcast = sc.broadcast(range(10000))
What do you understand by "joins" in PySpark
DataFrame? What are the different types of joins available
in PySpark?
In PySpark, joins merge two DataFrames together; they let us link two or
more DataFrames on one or more common columns.
INNER Join, LEFT OUTER Join, RIGHT OUTER Join, LEFT ANTI Join, LEFT SEMI Join,
CROSS Join, and SELF Join are among the SQL join types PySpark supports. Following is
the syntax of PySpark Join.
Syntax:
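In current PySpark this is DataFrame.join, whose signature is:

join(other, on=None, how=None)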
Parameter Explanation:
The join() method accepts the following parameters and returns a DataFrame:
• other: the DataFrame on the right side of the join.
• on: a column name, a list of column names, or a join expression to join on.
• how: a string naming the join type; the default is "inner", and accepted
values include "inner", "cross", "outer", "full", "left", "right", "semi",
and "anti".
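For illustration, here is a minimal sketch of an inner and a left join, using the sqlContext from earlier (the sample data and the column names emp_id, name, and dept are made up for this example):

emp = sqlContext.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Cara")],
                                 ["emp_id", "name"])
dept = sqlContext.createDataFrame([(1, "Sales"), (2, "Engineering")],
                                  ["emp_id", "dept"])

# Inner join keeps only the emp_id values present in both DataFrames.
emp.join(dept, on="emp_id", how="inner").show()

# Left join keeps every row of emp; unmatched rows get null in dept.
emp.join(dept, on="emp_id", how="left").show()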
The Parquet file is a columnar storage format, which provides the following
advantages:
• It limits I/O, since a query can read only the columns it needs instead of
whole rows.
• Column-wise compression and encoding make it consume less space than
row-oriented formats.
• The schema is stored with the data, so column types are preserved.
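As a quick sketch of the format in use (the output path adult_data.parquet is hypothetical; df is the DataFrame read from the CSV above):

df.write.parquet("adult_data.parquet")           # stored column by column
df2 = sqlContext.read.parquet("adult_data.parquet")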
1. By parallelizing an existing collection using the parallelize method:
val DataArray = Array(2, 4, 6, 8, 10)
val DataRDD = sc.parallelize(DataArray)
2. By loading an external dataset from external storage like HDFS, HBase, or
a shared file system (see the PySpark sketch below).
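The same two approaches in PySpark (a minimal sketch, assuming a SparkContext sc; the HDFS path is hypothetical):

data_rdd = sc.parallelize([2, 4, 6, 8, 10])                 # 1. from an in-memory collection
lines_rdd = sc.textFile("hdfs://namenode/data/input.txt")   # 2. from external storage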
groupByKey can cause out-of-disk problems because all the values for a key
are sent over the network and collected on the reducer workers, as in the
example below.
sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .groupByKey()
  .map { case (word, counts) => (word, counts.sum) }
With reduceByKey, by contrast, data are combined within each partition, so
only one output per key per partition is sent over the network. reduceByKey
requires combining all your values into another value of the exact same type.
sparkContext.textFile("hdfs://")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)
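For completeness, the reduceByKey word count in PySpark (a sketch; the truncated "hdfs://" path is kept from the original):

counts = (sc.textFile("hdfs://")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda x, y: x + y))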
https://stackoverflow.com/questions/66647892/group-df-by-hour-of-day/66649736#66649736
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.expressions.Window
val df = Seq(
(75, "2019-01-19 02:13:00", 5 , "Brooklyn", "Williamsburg"),
(255, "2019-01-19 12:05:00", 8 , "Brooklyn", "Williamsburg"),
(99, "2019-01-20 12:05:00", 3 , "Brooklyn", "DUMBO"),
(102, "2019-01-01 02:05:00", 1 , "Brooklyn", "DUBMO"),
(10, "2019-01-07 11:05:00", 13, "Brooklyn", "Park Slope"),
(75, "2019-01-01 11:05:00", 2 , "Brooklyn", "Williamsburg"),
(12, "2019-01-11 01:05:00", 1 , "Brooklyn", "Park Slope"),
(98, "2019-01-28 01:05:00", 8 , "Brooklyn", "DUMBO"),
(75, "2019-01-10 00:05:00", 8 , "Brooklyn", "Williamsburg"),
(255, "2019-01-11 12:05:00", 12, "Brooklyn", "DUMBO"),
).toDF("PULocationID", "pickup_datetime", "number_of_pickups",
"Borough", "Zone")
df.show()

// Derive the hour-of-day column that the window below partitions on.
val df1 = df.withColumn("hour", F.hour(F.col("pickup_datetime")))

val windowSpec =
  Window.partitionBy("hour").orderBy(F.desc("number_of_pickups"))
val df2 = df1.withColumn("rn", F.row_number().over(windowSpec))
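The same logic in PySpark, ending with the row the linked question is after, namely the busiest zone per hour (a sketch, assuming a DataFrame df with the columns above and an active session):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df1 = df.withColumn("hour", F.hour("pickup_datetime"))
w = Window.partitionBy("hour").orderBy(F.desc("number_of_pickups"))
df2 = df1.withColumn("rn", F.row_number().over(w))
df2.filter(F.col("rn") == 1).show()   # top zone for each hour of day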
Q- Drop column
There are two intuitive APIs to drop columns: drop(), which removes the
named columns, and select(), which keeps only the columns you list.
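A minimal sketch of both, assuming the adult-data DataFrame df from earlier (the dropped column name education_num is an assumption; the list that follows is df.columns after the drop):

# Option 1: drop the unwanted column by name (assumed: education_num).
df = df.drop('education_num')

# Option 2: select only the columns to keep.
df = df.select([c for c in df.columns if c != 'education_num'])

df.columns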
['age',
'workclass',
'fnlwgt',
'education',
'marital',
'occupation',
'relationship',
'race',
'sex',
'capital_gain',
'capital_loss',
'hours_week',
'native_country',
'label']