2. RDDs in Spark
def sum_partition(iterator):
    yield sum(iterator)

nums_rdd = sc.parallelize([1, 2, 3, 4, 5], 2)  # assumes an existing SparkContext sc; 2 partitions
sums_rdd = nums_rdd.mapPartitions(sum_partition)  # one partial sum per partition
sums_list = sums_rdd.collect()
print(sums_list)  # e.g. [3, 12]
reduceByKey()
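A minimal reduceByKey() sketch, assuming an existing SparkContext sc; the (key, 1) pairs below are illustrative only. reduceByKey() merges the values of each key with the supplied function:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda x, y: x + y)  # merge values per key
print(counts.collect())  # e.g. [('a', 3), ('b', 1)]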
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("RDDBasics").setMaster("local[*]")  # local mode for illustration
sc = SparkContext(conf=conf)

# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
rdd1 = sc.parallelize([1, 2, 3, 4])
rdd2 = sc.parallelize([3, 4, 5, 6, 6, 4])
print(rdd1.subtract(rdd2).collect())      # elements of rdd1 not in rdd2, e.g. [1, 2]
print(rdd1.intersection(rdd2).collect())  # common elements, e.g. [3, 4]
print(rdd1.distinct().collect())          # duplicates removed: [1, 2, 3, 4]
reduce() on RDDs
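A minimal reduce() sketch, reusing the rdd of [1, 2, 3, 4, 5] created above. reduce() is an action that aggregates all elements with a commutative and associative function and returns the result to the driver:

total = rdd.reduce(lambda x, y: x + y)  # aggregate all elements
print(total)  # 15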
6. Which transformation is used to create a new RDD by applying a function to each element of the source RDD?
a) `map()`
b) `filter()`
c) `reduce()`
d) `flatMap()`
7. Which transformation is used to create a new RDD by applying a function that returns an iterator to each element of the source RDD, and then flattening the results?
a) `map()`
b) `filter()`
c) `reduce()`
d) `flatMap()`
8. Which transformation is used to create a new RDD containing only the elements that satisfy a given condition?
a) `map()`
b) `filter()`
c) `reduce()`
d) `distinct()`
9. Which transformation is used to create a new RDD by combining elements from two RDDs with the same key?
a) `join()`
b) `union()`
c) `groupByKey()`
d) `reduceByKey()`
10. What is the difference between `map()` and `flatMap()` transformations in Spark? (See the code sketch after question 11.)
a) `map()` applies a function to each element, while `flatMap()` applies a function returning an iterator to each element.
b) `map()` creates a new RDD with the same number of elements, while `flatMap()` creates an RDD with fewer elements.
c) `map()` can only be used with numerical data, while `flatMap()` works with any data type.
d) `map()` is used for parallel execution, while `flatMap()` is used for sequential execution.
11. Which transformation should be used to create a new RDD containing the distinct elements of the source RDD?
a) `distinct()`
b) `unique()`
c) `removeDuplicates()`
d) `filter()`
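To illustrate the map()/flatMap() distinction asked about in question 10, here is a minimal sketch, assuming an existing SparkContext sc; the sample sentences are illustrative only:

lines = sc.parallelize(["hello world", "hi spark"])
print(lines.map(lambda s: s.split(" ")).collect())      # [['hello', 'world'], ['hi', 'spark']] - one output per input
print(lines.flatMap(lambda s: s.split(" ")).collect())  # ['hello', 'world', 'hi', 'spark'] - iterators flattened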