Here's a concise list of commonly used functions in PySpark:

Transformation Functions:

map(func): Applies a function to each element of an RDD.
flatMap(func): Similar to map, but flattens the result into a single RDD.
filter(func): Keeps only the elements for which the function returns True.
join(otherRDD): Joins two key-value RDDs on a common key.

(Note that reduce() belongs with the actions below, since it returns a value to the driver rather than a new RDD; groupByKey() and sortByKey() are listed once, under Pair RDD Functions.)
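As a quick illustration, here is a minimal sketch of these transformations. The local SparkContext, app name, and in-memory data are assumptions made for the demo, not part of the list above:

from pyspark import SparkContext

# Assumed setup, reused by the sketches that follow: a local SparkContext.
sc = SparkContext("local[*]", "pyspark-functions-demo")

nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)                        # 1, 4, 9, 16
words = sc.parallelize(["a b", "c d"]).flatMap(lambda s: s.split())  # "a", "b", "c", "d"
evens = nums.filter(lambda x: x % 2 == 0)                  # 2, 4

left = sc.parallelize([("k1", 1), ("k2", 2)])
right = sc.parallelize([("k1", "x"), ("k2", "y")])
joined = left.join(right)                                  # ("k1", (1, "x")), ("k2", (2, "y"))

# Transformations are lazy; collect() (an action) triggers execution.
print(squares.collect(), evens.collect(), joined.collect())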
Action Functions:

reduce(func): Aggregates the elements of an RDD using a specified function and returns the result to the driver (an action, since it does not produce a new RDD).
collect(): Retrieves all elements of an RDD as a list in the driver program.
count(): Returns the number of elements in an RDD.
take(n): Returns the first n elements of an RDD as a list.
first(): Returns the first element of an RDD.
foreach(func): Applies a function to each element of an RDD, usually for side effects.
saveAsTextFile(path): Saves the RDD as text files at the specified path.
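A minimal sketch of these actions, reusing the SparkContext from the previous snippet; the data and the output path are illustrative assumptions:

nums = sc.parallelize([5, 3, 1, 4, 2])

print(nums.collect())                    # [5, 3, 1, 4, 2], all elements on the driver
print(nums.count())                      # 5
print(nums.take(3))                      # [5, 3, 1]
print(nums.first())                      # 5
print(nums.reduce(lambda a, b: a + b))   # 15
nums.foreach(lambda x: None)             # runs on the executors, side effects only
nums.saveAsTextFile("/tmp/nums_out")     # hypothetical path; the directory must not already exist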
Pair RDD Functions:

reduceByKey(func): Reduces the values of each key using the specified function.
groupByKey(): Groups the values of each key in an RDD.
mapValues(func): Applies a function to each value of a key-value pair RDD.
flatMapValues(func): Similar to mapValues, but flattens the result.
sortByKey(): Sorts the RDD by key.
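A minimal sketch of the pair RDD functions above; the key-value data is invented for illustration and the SparkContext is reused from the first snippet:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

sums = pairs.reduceByKey(lambda a, b: a + b)      # ("a", 4), ("b", 2)
groups = pairs.groupByKey().mapValues(list)       # ("a", [1, 3]), ("b", [2])
doubled = pairs.mapValues(lambda v: v * 2)        # ("a", 2), ("b", 4), ("a", 6)
spread = pairs.flatMapValues(lambda v: range(v))  # ("a", 0), ("b", 0), ("b", 1), ("a", 0), ...
ordered = pairs.sortByKey()                       # keys in ascending order

print(sums.collect(), ordered.collect())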
Other Functions:

broadcast(variable): Broadcasts a read-only variable to all worker nodes (created with SparkContext.broadcast).
accumulator(initial_value): Creates a shared variable that tasks can add to across the cluster (created with SparkContext.accumulator).
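A minimal sketch of both shared-variable types, again reusing the same SparkContext; the lookup table and keys are made up for the example:

lookup = sc.broadcast({"a": 1, "b": 2})   # read-only dict shipped once to each worker
total = sc.accumulator(0)                 # written to by tasks, readable on the driver

def tally(key):
    total.add(lookup.value.get(key, 0))   # tasks read the broadcast and add to the accumulator

sc.parallelize(["a", "b", "a"]).foreach(tally)
print(total.value)                        # 4, visible on the driver once the action completes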
These are just some of the most commonly used functions in PySpark; the full Spark API is far more extensive, offering many more functions for data manipulation, transformation, and analysis.
