
Assignment 2

Hive:
Loaded the data into Hive, dropped the 'ID' column (it is not an important feature for classification), and saved the result for further processing in Spark. A rough code sketch of the equivalent steps is shown after item (b) below.

a) Create and load table

b) Drop ID column and dump table to local for further processing
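
The original queries were run directly in Hive and are not reproduced here; the following is a minimal sketch of the same two steps using PySpark's Hive support. The table name, input CSV path, and output directory are assumptions, not the exact queries used.

```python
# A minimal PySpark-with-Hive sketch of the two Hive steps above.
# Table name, CSV path, and output directory are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("assignment2-hive-prep")
         .enableHiveSupport()
         .getOrCreate())

# a) Create and load the table: read the raw CSV and register it as a Hive table
raw = spark.read.csv("file:///path/to/credit_default.csv",
                     header=True, inferSchema=True)
raw.write.mode("overwrite").saveAsTable("credit_default_raw")

# b) Drop the ID column and dump the table to local disk for further processing
cols = [c for c in raw.columns if c != "ID"]
(spark.table("credit_default_raw")
      .select(*cols)
      .coalesce(1)
      .write.mode("overwrite")
      .csv("file:///tmp/credit_default_no_id"))
```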


PySpark:

Steps:

1) Load Data
a) Load the dataset and infer schema

b) Change the column names for better interpretation
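
A minimal sketch of step 1. The input path and the renamed column list are assumptions, reconstructed from the column names mentioned later in this report (Balance_limit, Sex, Education, Marriage, Age, Pay_*, and a binary label).

```python
# Sketch of step 1, assuming the Hive dump was written as a header-less CSV.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("assignment2").getOrCreate()

# a) Load the dataset and let Spark infer the schema
df = spark.read.csv("file:///tmp/credit_default_no_id",
                    header=False, inferSchema=True)

# b) Rename the columns for better interpretation (assumed names)
new_names = (["Balance_limit", "Sex", "Education", "Marriage", "Age"]
             + ["Pay_{}".format(i) for i in range(1, 7)]
             + ["Bill_amt{}".format(i) for i in range(1, 7)]
             + ["Pay_amt{}".format(i) for i in range(1, 7)]
             + ["label"])
df = df.toDF(*new_names)
df.printSchema()
```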

2) Preprocessing and Understanding Data


a) None of the columns contains null values. The 'ID' column was already dropped in Hive, as it was not useful for
classification.

b) Get the number of examples which belong to each class.

We can see that there are more examples belonging to class 0, so the dataset is not balanced.
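
A sketch of the checks in (a) and (b), assuming the default-payment column is named "label" (1 = defaulter, 0 = non-defaulter):

```python
from pyspark.sql import functions as F

# a) Count nulls in every column (all counts come out as 0 for this dataset)
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
           for c in df.columns]).show()

# b) Number of examples per class; class 0 dominates, so the data is unbalanced
df.groupBy("label").count().show()
```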

c) As the dataset is not balanced, add a column "weight" which tells how important an example is during
training. We give a higher weight of 2.7 to examples that belong to class 1 and a weight of 1 to examples
that belong to class 0.
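
A sketch of the weighting in (c); the 2.7 value is taken from the text above and roughly compensates for the class ratio:

```python
from pyspark.sql import functions as F

# c) Per-example weights to compensate for the class imbalance
df = df.withColumn(
    "weight",
    F.when(F.col("label") == 1, 2.7).otherwise(1.0)
)
```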

d) Look at the number of Defaulters/Non-Defaulters for each sex (a combined code sketch for items d-h follows item h below).


e) Distribution of values for the Pay_2 column for the dataset. The values 0 and -2 are not defined for this
dataset, yet a lot of examples have these values.

f) Distribution of values for the Age column for this dataset. We can see that most of the data comes from
people in the 24-40 age group.
g) Distribution of values for the Marriage column for this dataset. A value of 0 is not defined for this
dataset, but some examples have it.

h) Distribution of values for the Balance_limit column for this dataset.
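
A combined sketch for (d)-(h); all of these are simple groupBy counts over the renamed columns:

```python
from pyspark.sql import functions as F

# d) Defaulters / non-defaulters per sex
df.groupBy("Sex", "label").count().orderBy("Sex", "label").show()

# e) Pay_2 values (0 and -2 appear even though they are undefined)
df.groupBy("Pay_2").count().orderBy("Pay_2").show()

# f) Age distribution (most examples fall in the 24-40 range)
df.groupBy("Age").count().orderBy("Age").show(100)

# g) Marriage values (a few rows carry the undefined value 0)
df.groupBy("Marriage").count().orderBy("Marriage").show()

# h) Balance_limit distribution
df.groupBy("Balance_limit").count().orderBy("Balance_limit").show()
```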


i) Randomly split the dataset into training and testing in 60:40 ratio.
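
A sketch of the split in (i); the seed value is an assumption, used only to make the split reproducible:

```python
# i) 60/40 random train/test split
train, test = df.randomSplit([0.6, 0.4], seed=42)
print(train.count(), test.count())
```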

j) One-hot encode the categorical variables Marriage and Education.

k) Assemble all the features into one column using VectorAssembler.
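
A sketch of (j) and (k), assuming the Spark 3.x ML API (in Spark 2.x, OneHotEncoderEstimator plays the same role). Marriage and Education are already small non-negative integer codes, so they are encoded directly; the exact feature list is an assumption based on the columns used above.

```python
from pyspark.ml.feature import OneHotEncoder, VectorAssembler

# j) One-hot encode the categorical variables Marriage and Education
encoder = OneHotEncoder(
    inputCols=["Marriage", "Education"],
    outputCols=["Marriage_vec", "Education_vec"],
)

# k) Assemble all features into a single vector column
feature_cols = (["Balance_limit", "Sex", "Age",
                 "Marriage_vec", "Education_vec"]
                + ["Pay_{}".format(i) for i in range(1, 7)]
                + ["Bill_amt{}".format(i) for i in range(1, 7)]
                + ["Pay_amt{}".format(i) for i in range(1, 7)])

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
```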
3) Define Model and Pipeline
a) We use a Logistic Regression model and provide the input (features) and output (label) columns for training. We
also provide the weight column that tells the model how much weight to give each example during training.

b) Define a Pipeline to chain the multiple transformations and the model into a single machine learning workflow.
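
A sketch of step 3, reusing the encoder and assembler stages defined in step 2; maxIter is an assumed setting.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# a) Logistic regression with the per-example weight column from step 2c
lr = LogisticRegression(
    featuresCol="features",
    labelCol="label",
    weightCol="weight",
    maxIter=100,   # assumed value
)

# b) Chain the feature stages and the model into one workflow
pipeline = Pipeline(stages=[encoder, assembler, lr])
```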

4) Get Model Predictions


a) Fit the model and get predictions on the test data.

b) Get the number of examples where prediction is the same as true label.
c) Print the number of examples which have true label 0/1 and the number of examples which have predicted
label 0/1. Print the output for some test examples.
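
A sketch of step 4 over the test split:

```python
from pyspark.sql import functions as F

# a) Fit the pipeline on the training data and predict on the test data
model = pipeline.fit(train)
predictions = model.transform(test)

# b) Number of test examples where the prediction matches the true label
correct = predictions.filter(F.col("prediction") == F.col("label")).count()
print("Correct predictions:", correct, "out of", predictions.count())

# c) Class counts for true and predicted labels, plus a few example rows
predictions.groupBy("label").count().show()
predictions.groupBy("prediction").count().show()
predictions.select("label", "prediction", "probability").show(10, truncate=False)
```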

5) Evaluate Model
a) Evaluate the model using the AUC metric via the BinaryClassificationEvaluator class. Give the predicted
probabilities and the labels as input to the evaluator.
b) Use the F1 score to evaluate the performance of the classifier. The metric to use depends on the end goal
we want to achieve, but in most scenarios the F1 score is a good metric for evaluating a classifier when the
dataset is unbalanced.
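
A sketch of the evaluation; both evaluators live in pyspark.ml.evaluation. Feeding the probability column to the binary evaluator is one common choice (the rawPrediction column works as well).

```python
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# a) Area under the ROC curve, computed from the model's scores and true labels
auc_eval = BinaryClassificationEvaluator(
    rawPredictionCol="probability",
    labelCol="label",
    metricName="areaUnderROC",
)
print("AUC:", auc_eval.evaluate(predictions))

# b) F1 score, a better summary than accuracy on this unbalanced dataset
f1_eval = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1",
)
print("F1:", f1_eval.evaluate(predictions))
```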

Spark WebUI

Most of the time is taken by the model.fit() paragraph. This task started many jobs, from job id 1564 to 1636, on my
laptop; combined, they took around 3 seconds, which is the total duration. To calculate the Executor
Computing Time, we have to go inside each job and sum the Executor Computing Time of each of its
stages. We can see from the following figure that job ids 1564 to 1636 all have the same job group.
The most time taken by a single job was 0.3 s, and only a few jobs had a duration that long. One such
job is job id 1543 in the figure below.
