CS 2018 042
G.D.Vindika shehan
CS/2018/042
Question 01
(a) Consider the data file ‘Words.txt’ and write PySpark code
segments to do the following.
(b) Read the data in the file and count how many times each
word appears using RDDs.
Code
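from pyspark import SparkContext

# Reuse the running SparkContext, or start one if none exists
sc = SparkContext.getOrCreate()
# Read the file as an RDD of lines
input_rdd = sc.textFile("sample_data/Words.txt")
# Split each line on whitespace; split() with no argument skips empty strings
words_rdd = input_rdd.flatMap(lambda line: line.split())
# countByValue() is an action that returns a dict of word -> count to the driver
word_counts = words_rdd.countByValue()
for value, count in word_counts.items():
    print(f"{value}: {count}")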
Output
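countByValue() brings the whole result back to the driver as a Python dictionary, which is fine for a small file. As a sketch of an alternative that keeps the counts distributed as an RDD, the usual map and reduceByKey pattern could look like the following. The helper names tokenize and count_words_rdd are hypothetical, and the function assumes it is handed an existing SparkContext.

```python
from operator import add


def tokenize(line):
    """Split a line on any run of whitespace, dropping empty strings."""
    return line.split()


def count_words_rdd(sc, path):
    """Return an RDD of (word, count) pairs for the text file at `path`.

    `sc` is assumed to be an existing SparkContext.
    """
    return (
        sc.textFile(path)               # RDD of lines
        .flatMap(tokenize)              # RDD of words
        .map(lambda word: (word, 1))    # pair each word with a count of 1
        .reduceByKey(add)               # sum the counts per word
    )
```

Calling count_words_rdd(sc, "sample_data/Words.txt").collect() would bring the pairs to the driver as a list; until an action such as collect() runs, the result stays a lazy RDD.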
Question 02
(a) Create a data frame from the following employee data.
FName   LName    Age  Salary   Country
Jane    Doe      34   123900   USA
Harvey  Spectur  28   234590   USA
Jaing   Xu       32   1090890  China
Won     Gu       25   1903490  China
# Create a DataFrame
dataFrame = spark.createDataFrame(employeeData, tableStructure)
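The snippet above uses employeeData and tableStructure without showing how they are defined. A minimal sketch of one possible definition follows, assuming the rows are plain tuples and the schema is given as a DDL string (createDataFrame also accepts a StructType). The column types and the helper build_employee_dataframe are assumptions, not taken from the original answer.

```python
# Rows copied from the employee table in the question.
employeeData = [
    ("Jane", "Doe", 34, 123900, "USA"),
    ("Harvey", "Spectur", 28, 234590, "USA"),
    ("Jaing", "Xu", 32, 1090890, "China"),
    ("Won", "Gu", 25, 1903490, "China"),
]

# Assumed schema, written as a DDL string: names and types per column.
tableStructure = "FName string, LName string, Age int, Salary int, Country string"


def build_employee_dataframe(spark):
    """Create the employee DataFrame; `spark` is an existing SparkSession."""
    return spark.createDataFrame(employeeData, tableStructure)
```

With a session obtained from SparkSession.builder.getOrCreate(), dataFrame = build_employee_dataframe(spark) would give the DataFrame that the later printSchema() call inspects.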
Code
dataFrame.printSchema()
Output
result1.show()
Output
result2.show()
Output
Full Code
# Create a DataFrame
dataFrame = spark.createDataFrame(employeeData, tableStructure)
dataFrame.printSchema()
dataFrame.createOrReplaceTempView("people")
result1.show()
result2.show()
Output
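The queries that produce result1 and result2 are not shown in the text. Purely as an illustration of how a temporary view such as "people" can be queried with Spark SQL, a sketch might look like this. The query text, the EXAMPLE_QUERY constant, and the run_example_query helper are all hypothetical and are not the queries used in the original answer.

```python
# Hypothetical query; the real queries for result1 and result2 are not shown.
EXAMPLE_QUERY = (
    "SELECT FName, LName, Salary "
    "FROM people "
    "WHERE Country = 'China'"
)


def run_example_query(spark, data_frame):
    """Register `data_frame` as the view "people" and run EXAMPLE_QUERY.

    `spark` is assumed to be an existing SparkSession.
    """
    data_frame.createOrReplaceTempView("people")
    return spark.sql(EXAMPLE_QUERY)
```

The returned object is itself a DataFrame, so calling .show() on it prints the matching rows in the same way as result1.show() and result2.show().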