This document contains code to analyze employee data using PySpark. It reads employee data from a file into a DataFrame and performs operations like filtering employees over 30, creating a SQL table from the DataFrame, and retrieving the highest paid USA employee. The code shows the schema, filters for age over 30, and selects the top USA earner.

CS/2018/042

44033 Big Data Analysis

Assignment 01

G.D.Vindika shehan


Question 01
(a) Consider the data file ‘Words.txt’ and write PySpark code
segments to do the following.
(b) Read the data in the file and count how many times each
word appears using RDDs.

Code
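from pyspark import SparkContext
# Reuse the running SparkContext or start a new one
sc = SparkContext.getOrCreate()
# Read the file as an RDD of lines
input_rdd = sc.textFile("sample_data/Words.txt")
# Split each line on whitespace (split() with no argument skips empty strings)
words_rdd = input_rdd.flatMap(lambda line: line.split())
# countByValue() is an action: it returns a dict of word -> count to the driver
word_counts = words_rdd.countByValue()
for value, count in word_counts.items():
    print(f"{value}: {count}")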
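The same count is often written as a map step that emits (word, 1) pairs followed by a reduce step that sums the values for each word. The sketch below mimics that flow in plain Python, without Spark, so each intermediate value can be inspected. The sample lines and the helper reduce_by_key are made up for illustration, since the contents of Words.txt are not shown in this document.

```python
from collections import defaultdict

# Hypothetical sample input; the real contents of Words.txt are not shown.
sample_lines = ["big data big", "data analysis"]

# flatMap step: split every line into words and flatten into one list.
words = [word for line in sample_lines for word in line.split()]

# map step: pair each word with an initial count of 1.
pairs = [(word, 1) for word in words]


def reduce_by_key(pairs):
    """Sum the values that share the same key (illustrative helper)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# reduce step: add up the 1s for each distinct word.
word_counts = reduce_by_key(pairs)
for word, count in word_counts.items():
    print(f"{word}: {count}")
```

On a cluster, a reduce step of this kind combines counts per key across partitions, whereas countByValue() returns the whole result to the driver as a dictionary, which is fine for a small vocabulary.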

Output

Question 02
(a) Create a data frame from the following employee data.

FName    LName     Age   Salary    Country
Jane     Doe       34    123900    USA
Harvey   Spectur   28    234590    USA
Jaing    Xu        32    1090890   China
Won      Gu        25    1903490   China

Code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DataFrame to Table Example").getOrCreate()

# Employee records: (FName, LName, Age, Salary, Country)
employeeData = [("Jane", "Doe", 34, 123900, "USA"),
                ("Harvey", "Spectur", 28, 234590, "USA"),
                ("Jaing", "Xu", 32, 1090890, "China"),
                ("Won", "Gu", 25, 1903490, "China")]

# Define the schema
tableStructure = StructType([
    StructField("FName", StringType(), True),
    StructField("LName", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Country", StringType(), True)
])

# Create a DataFrame
dataFrame = spark.createDataFrame(employeeData, tableStructure)
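
The schema stores Salary as IntegerType, which in Spark is a 4-byte signed integer. A quick way to check that choice is to confirm that every salary in the table fits in that range. The sketch below does this in plain Python; the helper name fits_in_int32 is only illustrative.

```python
# Range of a 32-bit signed integer, which is what Spark's IntegerType holds.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

# Salary values copied from the employee table.
salaries = [123900, 234590, 1090890, 1903490]


def fits_in_int32(value):
    """Return True if value can be stored in a 32-bit signed integer."""
    return INT32_MIN <= value <= INT32_MAX


all_fit = all(fits_in_int32(salary) for salary in salaries)
print(all_fit)
```

If a salary ever exceeded this range, LongType would be the safer choice for that column.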

(b) Write PySpark code to retrieve the following.

(i) Display the schema of the DataFrame created in part (a).

Code
dataFrame.printSchema()

Output

(ii) Filter employees whose age is above 30 years.

Code
# Note: this query needs the "people" view from part (iii) to be registered first
result1 = spark.sql("SELECT * FROM people WHERE Age > 30")
result1.show()
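
To make the expected rows concrete, the filter can be traced by hand on the four records from part (a). The plain-Python sketch below applies the same condition as the WHERE clause; it is derived from the table and is not the author's recorded output.

```python
# Employee records from part (a): (FName, LName, Age, Salary, Country)
employees = [
    ("Jane", "Doe", 34, 123900, "USA"),
    ("Harvey", "Spectur", 28, 234590, "USA"),
    ("Jaing", "Xu", 32, 1090890, "China"),
    ("Won", "Gu", 25, 1903490, "China"),
]

AGE = 2  # position of the Age field in each tuple

# Same condition as: WHERE Age > 30
over_30 = [row for row in employees if row[AGE] > 30]

for row in over_30:
    print(row)
```

Only Jane Doe (34) and Jaing Xu (32) satisfy the condition, so the query should return those two rows.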

Output

(iii) Create a temporary SQL table from the DataFrame created.

Code
dataFrame.createOrReplaceTempView("people")

(iv) Write Spark SQL to retrieve the employees who earn the highest salary among those who lived in the USA.

Code
# Sort the USA employees by salary, highest first, and keep only the top row
result2 = spark.sql("SELECT * FROM people WHERE Country = 'USA' ORDER BY Salary DESC").limit(1)
result2.show()
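
The query combines three steps: filter by country, sort by salary in descending order, and keep the first row. The plain-Python sketch below walks through those steps on the records from part (a); it is derived from the table and is not the author's recorded output.

```python
# Employee records from part (a): (FName, LName, Age, Salary, Country)
employees = [
    ("Jane", "Doe", 34, 123900, "USA"),
    ("Harvey", "Spectur", 28, 234590, "USA"),
    ("Jaing", "Xu", 32, 1090890, "China"),
    ("Won", "Gu", 25, 1903490, "China"),
]

SALARY, COUNTRY = 3, 4  # positions of the fields in each tuple

# WHERE Country = 'USA'
usa_rows = [row for row in employees if row[COUNTRY] == "USA"]

# ORDER BY Salary DESC
usa_sorted = sorted(usa_rows, key=lambda row: row[SALARY], reverse=True)

# .limit(1)
top_row = usa_sorted[:1]
print(top_row)
```

For this data the result is Harvey Spectur with a salary of 234590. Note that limit(1) keeps a single row, so if two USA employees shared the top salary only one of them would be returned.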

Output

Full Code

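from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DataFrame to Table Example").getOrCreate()

# Employee records: (FName, LName, Age, Salary, Country)
employeeData = [("Jane", "Doe", 34, 123900, "USA"),
                ("Harvey", "Spectur", 28, 234590, "USA"),
                ("Jaing", "Xu", 32, 1090890, "China"),
                ("Won", "Gu", 25, 1903490, "China")]

# Define the schema (same definition as in part (a))
tableStructure = StructType([
    StructField("FName", StringType(), True),
    StructField("LName", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Country", StringType(), True)
])

# Create a DataFrame
dataFrame = spark.createDataFrame(employeeData, tableStructure)

# (i) Display the schema
dataFrame.printSchema()

# (iii) Register the DataFrame as a temporary view so it can be queried with SQL
dataFrame.createOrReplaceTempView("people")

# (ii) Employees older than 30
result1 = spark.sql("SELECT * FROM people WHERE Age > 30")
result1.show()

# (iv) Highest-paid employee in the USA
result2 = spark.sql("SELECT * FROM people WHERE Country = 'USA' ORDER BY Salary DESC").limit(1)
result2.show()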

Output
