
CS/2018/042

44033 Big Data Analysis


Assignment 01

G.D. Vindika Shehan

CS/2018/042

Question 01
(a) Consider the data file ‘Words.txt’ and write PySpark code
segments to do the following.
(b) Read the data in the file and count how many times each
word appears using RDDs.

Code
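from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# Each RDD element is one line of the file
input_rdd = sc.textFile("sample_data/Words.txt")
# split() with no argument splits on any whitespace and drops empty strings
words_rdd = input_rdd.flatMap(lambda line: line.split())
# countByValue() is an action that returns a dict of word -> count to the driver
word_counts = words_rdd.countByValue()
for word, count in word_counts.items():
    print(f"{word}: {count}")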
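
The flatMap and countByValue steps above can be cross-checked on a small sample without a Spark cluster. The following is a minimal plain-Python sketch, not part of the original answer. The helper name count_words and the sample lines are hypothetical, and the sketch assumes words are separated by whitespace, so repeated spaces do not produce empty-string "words".

```python
from collections import Counter


def count_words(lines):
    """Count word occurrences across an iterable of text lines.

    Splitting each line and pooling the pieces corresponds to flatMap;
    tallying the pooled words corresponds to countByValue.
    """
    counts = Counter()
    for line in lines:
        # str.split() with no argument splits on any run of whitespace
        counts.update(line.split())
    return dict(counts)


if __name__ == "__main__":
    sample = ["big data big ideas", "data pipelines"]  # made-up sample lines
    for word, count in count_words(sample).items():
        print(f"{word}: {count}")
```

Comparing this result with the Spark output for the first few lines of the file is a quick way to spot tokenization differences.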

Output

Question 02
(a) Create a data frame from the following employee data.
FName LName Age Salary Country
Jane Doe 34 123900 USA
Harvey Spectur 28 234590 USA
Jaing Xu 32 1090890 China
Won Gu 25 1903490 China

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

spark = SparkSession.builder.appName("DataFrame to Table Example").getOrCreate()

employeeData = [("Jane", "Doe", 34, 123900, "USA"),
                ("Harvey", "Spectur", 28, 234590, "USA"),
                ("Jaing", "Xu", 32, 1090890, "China"),
                ("Won", "Gu", 25, 1903490, "China")]

# Define the schema
tableStructure = StructType([
    StructField("FName", StringType(), True),
    StructField("LName", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Country", StringType(), True)
])

# Create a DataFrame
dataFrame = spark.createDataFrame(employeeData, tableStructure)
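
Before rows are passed to createDataFrame, it can help to confirm that each tuple matches the declared schema. The sketch below is a plain-Python check and not part of the original answer. The names check_rows and EXPECTED_TYPES are hypothetical. It also checks that integer values fit in a 32-bit signed range, which is what Spark's IntegerType stores.

```python
# Python types assumed for FName, LName, Age, Salary, Country
EXPECTED_TYPES = (str, str, int, int, str)
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1  # range of a 32-bit signed integer


def check_rows(rows):
    """Return a list of problems found; an empty list means every row looks valid."""
    problems = []
    for i, row in enumerate(rows):
        if len(row) != len(EXPECTED_TYPES):
            problems.append(f"row {i}: expected {len(EXPECTED_TYPES)} fields, got {len(row)}")
            continue
        for j, (value, expected) in enumerate(zip(row, EXPECTED_TYPES)):
            if not isinstance(value, expected):
                problems.append(f"row {i}, field {j}: expected {expected.__name__}")
            elif expected is int and not INT32_MIN <= value <= INT32_MAX:
                problems.append(f"row {i}, field {j}: {value} does not fit in 32 bits")
    return problems


employee_rows = [
    ("Jane", "Doe", 34, 123900, "USA"),
    ("Harvey", "Spectur", 28, 234590, "USA"),
    ("Jaing", "Xu", 32, 1090890, "China"),
    ("Won", "Gu", 25, 1903490, "China"),
]

if __name__ == "__main__":
    print(check_rows(employee_rows))
```

If a salary ever exceeded the 32-bit range, LongType would be the appropriate field type for that column.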

(b) Write PySpark code to retrieve the following.

(i) Display the schema of the data frame created in part (a).

Code
dataFrame.printSchema()

Output

(ii) Filter employees whose age is above 30 years.

Code
# Uses the temporary view "people", which must already be registered (see part (iii))
result1 = spark.sql("SELECT * FROM people WHERE Age > 30")
result1.show()

Output

(iii) Create a temporary SQL table from the data frame created.

Code
dataFrame.createOrReplaceTempView("people")

(iv) Write SparkSQL to retrieve employees who earn the highest salary among the ones who live in the USA.

Code
# Sort the USA employees by salary, highest first, and keep the top row
result2 = spark.sql(
    "SELECT * FROM people WHERE Country = 'USA' ORDER BY Salary DESC"
).limit(1)
result2.show()
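
The queries in parts (ii) and (iv) use only standard SQL, so they can be sanity-checked without Spark. The sketch below loads the same four rows into an in-memory SQLite table named people and runs equivalent queries. SQLite is a stand-in here, not Spark SQL, and LIMIT 1 inside the query replaces the DataFrame limit(1) call. The helper name run_queries is hypothetical.

```python
import sqlite3

EMPLOYEES = [
    ("Jane", "Doe", 34, 123900, "USA"),
    ("Harvey", "Spectur", 28, 234590, "USA"),
    ("Jaing", "Xu", 32, 1090890, "China"),
    ("Won", "Gu", 25, 1903490, "China"),
]


def run_queries():
    """Run the age filter and the top-USA-salary query against SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people "
        "(FName TEXT, LName TEXT, Age INTEGER, Salary INTEGER, Country TEXT)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?, ?)", EMPLOYEES)
    # Part (ii): employees older than 30
    over_30 = conn.execute("SELECT * FROM people WHERE Age > 30").fetchall()
    # Part (iv): highest-paid employee in the USA
    top_usa = conn.execute(
        "SELECT * FROM people WHERE Country = 'USA' ORDER BY Salary DESC LIMIT 1"
    ).fetchall()
    conn.close()
    return over_30, top_usa


if __name__ == "__main__":
    over_30, top_usa = run_queries()
    print(over_30)
    print(top_usa)
```

Note that sorting and taking one row returns a single employee even if several share the top salary. A subquery on MAX(Salary) would return all of them.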

Output

Full Code

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

spark = SparkSession.builder.appName("DataFrame to Table Example").getOrCreate()

employeeData = [("Jane", "Doe", 34, 123900, "USA"),
                ("Harvey", "Spectur", 28, 234590, "USA"),
                ("Jaing", "Xu", 32, 1090890, "China"),
                ("Won", "Gu", 25, 1903490, "China")]

# Define the schema
tableStructure = StructType([
    StructField("FName", StringType(), True),
    StructField("LName", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Country", StringType(), True)
])

# Create a DataFrame
dataFrame = spark.createDataFrame(employeeData, tableStructure)

dataFrame.printSchema()

dataFrame.createOrReplaceTempView("people")

result1 = spark.sql("SELECT * FROM people WHERE Age > 30")
result1.show()

result2 = spark.sql(
    "SELECT * FROM people WHERE Country = 'USA' ORDER BY Salary DESC"
).limit(1)
result2.show()

Output
