Here's a concise list of commonly used functions in PySpark:
Transformation Functions:
map(func): Applies a function to each element of an RDD.
flatMap(func): Similar to map, but flattens the result.
filter(func): Filters elements based on a predicate function.
groupByKey(): Groups the values of each key in a pair RDD.
sortByKey(): Sorts a pair RDD by key.
join(otherRDD): Joins two pair RDDs on a common key.
Action Functions:
reduce(func): Aggregates the elements of an RDD using a specified commutative and associative function (an action, not a transformation, since it returns a value to the driver).
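The semantics of map, flatMap, filter, and reduce can be sketched in plain Python without a Spark installation; the equivalent PySpark calls appear in the comments, and `data` is an illustrative sample list:

```python
from functools import reduce

data = [1, 2, 3, 4]

# map: rdd.map(lambda x: x * 2) -> each element transformed
mapped = [x * 2 for x in data]                   # [2, 4, 6, 8]

# flatMap: rdd.flatMap(lambda x: [x, x]) -> results flattened one level
flat_mapped = [y for x in data for y in [x, x]]  # [1, 1, 2, 2, 3, 3, 4, 4]

# filter: rdd.filter(lambda x: x % 2 == 0) -> keep only matching elements
filtered = [x for x in data if x % 2 == 0]       # [2, 4]

# reduce: rdd.reduce(lambda a, b: a + b) -> single aggregated value
total = reduce(lambda a, b: a + b, data)         # 10
```

In real PySpark these are methods on an RDD (e.g. `sc.parallelize(data).map(...)`), and transformations are lazy: nothing executes until an action such as reduce or collect is called.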
collect(): Retrieves all elements of an RDD as a list in the driver program.
count(): Returns the number of elements in an RDD.
take(n): Returns the first n elements of an RDD as a list.
first(): Returns the first element of an RDD.
foreach(func): Applies a function to each element of an RDD (usually for side effects).
saveAsTextFile(path): Saves the RDD as a text file at the specified path.
Pair RDD Functions:
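These actions map directly onto plain-Python operations, which makes their semantics easy to sketch without a cluster; the PySpark calls are in the comments, and the sample data is illustrative:

```python
data = [5, 3, 1, 4]

# collect(): rdd.collect() -> all elements as a list in the driver
collected = list(data)          # [5, 3, 1, 4]

# count(): rdd.count() -> number of elements
n = len(data)                   # 4

# take(2): rdd.take(2) -> the first two elements as a list
first_two = data[:2]            # [5, 3]

# first(): rdd.first() -> the first element
head = data[0]                  # 5

# foreach(func): rdd.foreach(func) runs func on each element for its side effect
results = []
for x in data:
    results.append(x * x)       # side effect per element -> [25, 9, 1, 16]
```

A practical caution: collect() pulls the entire dataset into the driver's memory, so on large RDDs prefer take(n) or saveAsTextFile(path).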
reduceByKey(func): Merges the values for each key using the specified function.
groupByKey(): Groups the values of each key in a pair RDD.
mapValues(func): Applies a function to each value of a key-value pair RDD, leaving the keys unchanged.
flatMapValues(func): Similar to mapValues, but flattens the result.
sortByKey(): Sorts a pair RDD by key.
Other Functions:
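Pair RDD operations work on (key, value) tuples. Their semantics can be sketched in plain Python over a small illustrative list of pairs; the equivalent PySpark calls are shown in the comments:

```python
from itertools import groupby
from operator import itemgetter

pairs = [("a", 1), ("b", 2), ("a", 3)]

# reduceByKey(lambda a, b: a + b): merge values per key
reduced = {}
for k, v in pairs:
    reduced[k] = reduced.get(k, 0) + v          # {"a": 4, "b": 2}

# groupByKey(): collect all values per key
grouped = {k: [v for _, v in g]
           for k, g in groupby(sorted(pairs), key=itemgetter(0))}
# {"a": [1, 3], "b": [2]}

# mapValues(lambda v: v * 10): transform values, keys unchanged
map_values = [(k, v * 10) for k, v in pairs]    # [("a", 10), ("b", 20), ("a", 30)]
```

In a real cluster, reduceByKey is usually preferred over groupByKey followed by a reduction, because it combines values on each partition before shuffling data across the network.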
broadcast(variable): Broadcasts a read-only variable to all worker nodes (available as SparkContext.broadcast).
accumulator(initial_value): Creates a shared variable that tasks can only add to (available as SparkContext.accumulator).
These are just some of the functions available in PySpark. The Spark API is extensive, with many more functions for data manipulation, transformation, and analysis.
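The roles of these two shared variables can be sketched in plain Python; the PySpark calls are in the comments, and the lookup table and records are illustrative samples:

```python
# In PySpark: bc = sc.broadcast(lookup); acc = sc.accumulator(0)

lookup = {"a": 1, "b": 2}      # read-only data shipped once to every worker
broadcast_value = lookup       # each task reads bc.value

counter = 0                    # accumulator starting at its initial value
for record in ["a", "b", "a"]:
    counter += broadcast_value[record]   # each task does acc += ...
# counter is now 4
```

The key contrast: a broadcast variable flows one way from the driver to the workers and is never modified, while an accumulator flows the other way, with tasks adding to it and only the driver reading the final total.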