
Assignment

1. Write PySpark code to create a new DataFrame from the data given below,
having two columns: (‘season’) and (‘windspeed’).
[ Datatypes of the columns can be inferred ]
# Data
[("Spring", 12.3),
("Summer", 10.5),
("Autumn", 8.2),
("Winter", 15.1)]

2. Consider the library management dataset located at the following path
(/public/trendytech/datasets/library_data.json). Using PySpark, load the
data into a DataFrame and enforce a schema using StructType.
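
A sketch of the approach, reusing the SparkSession `spark` from the sketch above. The field names and types in the StructType below are hypothetical, since the assignment does not list the columns of library_data.json:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical columns -- replace with the actual fields of library_data.json.
library_schema = StructType([
    StructField("book_id", IntegerType()),
    StructField("title", StringType()),
    StructField("author", StringType()),
    StructField("member_id", IntegerType())
])

# .schema() enforces the declared schema instead of letting Spark infer one.
library_df = (spark.read
              .schema(library_schema)
              .json("/public/trendytech/datasets/library_data.json"))
library_df.printSchema()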

3. Given the dataset (/public/trendytech/datasets/train.csv), create a
DataFrame using PySpark and perform the following operations:
a) Drop the columns passenger_name and age from the dataset.
b) Count the number of rows after removing duplicates of columns
train_number and ticket_number.
c) Count the number of unique train names.
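
One possible sketch, assuming train.csv has a header row and that the train-name column is literally called train_name (an assumption, since the question only says "train names"):

train_df = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("/public/trendytech/datasets/train.csv"))

# a) Drop passenger_name and age.
trimmed_df = train_df.drop("passenger_name", "age")

# b) Row count after dropping duplicates on train_number and ticket_number.
dedup_count = train_df.dropDuplicates(["train_number", "ticket_number"]).count()

# c) Number of unique train names (column name assumed to be train_name).
unique_train_count = train_df.select("train_name").distinct().count()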

4. You are working as a Data Engineer in a large retail company. The
company has a dataset named "sales_data.json" that contains sales records
from various stores. The dataset is stored in JSON format and may have
some corrupt or malformed records due to occasional data quality issues.
Your task is to read the "sales_data.json" dataset
(/public/trendytech/datasets/sales_data.json) using PySpark, utilizing
different read modes to handle corrupt records. You need to create a
DataFrame using PySpark and perform the following operations:
1. Read the dataset using the "permissive" mode and count the number of
records read.
2. Read the dataset using the "dropmalformed" mode and display the
number of malformed records.
3. Read the dataset using the "failfast" mode.
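
A sketch of the three read modes, reusing the SparkSession `spark` from above. Under PERMISSIVE the malformed rows are still counted (they surface as rows of nulls and, if declared, a _corrupt_record column), so the number of malformed records can be estimated as the difference between the PERMISSIVE and DROPMALFORMED counts:

sales_path = "/public/trendytech/datasets/sales_data.json"

# 1. PERMISSIVE (the default): corrupt records are kept rather than dropped.
permissive_df = spark.read.option("mode", "PERMISSIVE").json(sales_path)
permissive_count = permissive_df.count()

# 2. DROPMALFORMED: corrupt records are silently dropped at read time.
dropped_df = spark.read.option("mode", "DROPMALFORMED").json(sales_path)
malformed_count = permissive_count - dropped_df.count()

# 3. FAILFAST: the read raises an exception on the first corrupt record
#    (the error surfaces when an action such as .count() runs).
failfast_df = spark.read.option("mode", "FAILFAST").json(sales_path)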

5. You have a hospital dataset with the following fields:
● patient_id (integer): Unique identifier for each patient.
● admission_date (date): The date the patient was admitted to the
hospital. (MM-dd-yyyy)
● discharge_date (date): The date the patient was discharged from the
hospital. (yyyy-MM-dd)
● diagnosis (string): The diagnosed medical condition of the patient.
● doctor_id (integer): The identifier of the doctor responsible for the
patient's care.
● total_cost (float): The total cost of the hospital stay for the patient.
Using PySpark, load the data into a Dataframe and perform the following
operations on the hospital dataset
(/public/trendytech/datasets/hospital.csv):
1. Drop the "doctor_id" column from the dataset.
2. Rename the "total_cost" column to "hospital_bill".
3. Add a new column called "duration_of_stay" that represents the number
of days a patient stayed in the hospital. (hint: The duration should be
calculated as the difference between the "discharge_date" and
"admission_date" columns.)
4. Create a new column called "adjusted_total_cost" that calculates the
adjusted total cost based on the diagnosis as follows:
If the diagnosis is "Heart Attack", multiply the hospital_bill by 1.5.
If the diagnosis is "Appendicitis", multiply the hospital_bill by 1.2.
For any other diagnosis, keep the hospital_bill as it is.

5. Select the "patient_id", "diagnosis", "hospital_bill", and
"adjusted_total_cost" columns.
