
Assignment 2

Hive:
Loaded the data into Hive, dropped the 'ID' column (it is not an important feature for classification), and saved the result for further processing in Spark. A rough code sketch of the equivalent steps is shown after item (b) below.

a) Create and load table

b) Drop ID column and dump table to local for further processing
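
The original queries were run directly in Hive and are not reproduced here; the following is a minimal sketch of the same two steps using PySpark's Hive support. The table name, input CSV path, and output directory are assumptions, not the exact queries used.

```python
# A minimal PySpark-with-Hive sketch of the two Hive steps above.
# Table name, CSV path, and output directory are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("assignment2-hive-prep")
         .enableHiveSupport()
         .getOrCreate())

# a) Create and load the table: read the raw CSV and register it as a Hive table
raw = spark.read.csv("file:///path/to/credit_default.csv",
                     header=True, inferSchema=True)
raw.write.mode("overwrite").saveAsTable("credit_default_raw")

# b) Drop the ID column and dump the table to local disk for further processing
cols = [c for c in raw.columns if c != "ID"]
(spark.table("credit_default_raw")
      .select(*cols)
      .coalesce(1)
      .write.mode("overwrite")
      .csv("file:///tmp/credit_default_no_id"))
```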


PySpark:

Steps:

1) Load Data
a) Load the dataset and infer schema

b) Change the column names for better interpretation
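
A minimal sketch of step 1. The input path and the renamed column list are assumptions, reconstructed from the column names mentioned later in this report (Balance_limit, Sex, Education, Marriage, Age, Pay_*, and a binary label).

```python
# Sketch of step 1, assuming the Hive dump was written as a header-less CSV.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("assignment2").getOrCreate()

# a) Load the dataset and let Spark infer the schema
df = spark.read.csv("file:///tmp/credit_default_no_id",
                    header=False, inferSchema=True)

# b) Rename the columns for better interpretation (assumed names)
new_names = (["Balance_limit", "Sex", "Education", "Marriage", "Age"]
             + ["Pay_{}".format(i) for i in range(1, 7)]
             + ["Bill_amt{}".format(i) for i in range(1, 7)]
             + ["Pay_amt{}".format(i) for i in range(1, 7)]
             + ["label"])
df = df.toDF(*new_names)
df.printSchema()
```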

2) Preprocessing and Understanding Data


a) None of the columns contains null values. The 'ID' column was already dropped in Hive, as it was not useful for
classification.

b) Get the number of examples which belong to each class.

We can see that there are more examples belonging to class 0, so the dataset is not balanced.
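
A sketch of the checks in (a) and (b), assuming the default-payment column is named "label" (1 = defaulter, 0 = non-defaulter):

```python
from pyspark.sql import functions as F

# a) Count nulls in every column (all counts come out as 0 for this dataset)
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c)
           for c in df.columns]).show()

# b) Number of examples per class; class 0 dominates, so the data is unbalanced
df.groupBy("label").count().show()
```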

c) As the dataset is not balanced, add a column "weight" which tells how important an example is during
training. We give a higher weight of 2.7 to examples that belong to class 1 and a weight of 1 to examples
that belong to class 0.
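
A sketch of the weighting in (c); the 2.7 value is taken from the text above and roughly compensates for the class ratio:

```python
from pyspark.sql import functions as F

# c) Per-example weights to compensate for the class imbalance
df = df.withColumn(
    "weight",
    F.when(F.col("label") == 1, 2.7).otherwise(1.0)
)
```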

d) Look at the number of Defaulters/Non-Defaulters for each sex (a combined code sketch for items d-h follows item h below).


e) Distribution of values for the Pay_2 column for the dataset. The values 0 and -2 are not defined for this
dataset, yet a lot of examples have these values.

f) Distribution of values for the Age column for this dataset. We can see that most of the data comes from
people in the 24-40 age group.
g) Distribution of values for the Marriage column for this dataset. A value of 0 is not defined for this
dataset, but some examples have it.

h) Distribution of values for the Balance_limit column for this dataset.
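
A combined sketch for (d)-(h); all of these are simple groupBy counts over the renamed columns:

```python
from pyspark.sql import functions as F

# d) Defaulters / non-defaulters per sex
df.groupBy("Sex", "label").count().orderBy("Sex", "label").show()

# e) Pay_2 values (0 and -2 appear even though they are undefined)
df.groupBy("Pay_2").count().orderBy("Pay_2").show()

# f) Age distribution (most examples fall in the 24-40 range)
df.groupBy("Age").count().orderBy("Age").show(100)

# g) Marriage values (a few rows carry the undefined value 0)
df.groupBy("Marriage").count().orderBy("Marriage").show()

# h) Balance_limit distribution
df.groupBy("Balance_limit").count().orderBy("Balance_limit").show()
```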


i) Randomly split the dataset into training and testing in 60:40 ratio.
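
A sketch of the split in (i); the seed value is an assumption, used only to make the split reproducible:

```python
# i) 60/40 random train/test split
train, test = df.randomSplit([0.6, 0.4], seed=42)
print(train.count(), test.count())
```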

j) One-hot encode the categorical variables Marriage and Education.

k) Assemble all the features into one column using VectorAssembler.
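
A sketch of (j) and (k), assuming the Spark 3.x ML API (in Spark 2.x, OneHotEncoderEstimator plays the same role). Marriage and Education are already small non-negative integer codes, so they are encoded directly; the exact feature list is an assumption based on the columns used above.

```python
from pyspark.ml.feature import OneHotEncoder, VectorAssembler

# j) One-hot encode the categorical variables Marriage and Education
encoder = OneHotEncoder(
    inputCols=["Marriage", "Education"],
    outputCols=["Marriage_vec", "Education_vec"],
)

# k) Assemble all features into a single vector column
feature_cols = (["Balance_limit", "Sex", "Age",
                 "Marriage_vec", "Education_vec"]
                + ["Pay_{}".format(i) for i in range(1, 7)]
                + ["Bill_amt{}".format(i) for i in range(1, 7)]
                + ["Pay_amt{}".format(i) for i in range(1, 7)])

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
```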
3) Define Model and Pipeline
a) We use a Logistic Regression model and provide the input (features) and output (label) columns for training. We
also provide the weight column that tells the model how much weight to give each example during training.

b) Define a Pipeline to chain the multiple transformations and the model into a single machine learning workflow.
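
A sketch of step 3, reusing the encoder and assembler stages defined in step 2; maxIter is an assumed setting.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# a) Logistic regression with the per-example weight column from step 2c
lr = LogisticRegression(
    featuresCol="features",
    labelCol="label",
    weightCol="weight",
    maxIter=100,   # assumed value
)

# b) Chain the feature stages and the model into one workflow
pipeline = Pipeline(stages=[encoder, assembler, lr])
```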

4) Get Model Predictions


a) Fit the model and get predictions on the test data.

b) Get the number of examples where prediction is the same as true label.
c) Print the number of examples which have true label 0/1 and the number of examples which have predicted
label 0/1. Print the output for some test examples.
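
A sketch of step 4 over the test split:

```python
from pyspark.sql import functions as F

# a) Fit the pipeline on the training data and predict on the test data
model = pipeline.fit(train)
predictions = model.transform(test)

# b) Number of test examples where the prediction matches the true label
correct = predictions.filter(F.col("prediction") == F.col("label")).count()
print("Correct predictions:", correct, "out of", predictions.count())

# c) Class counts for true and predicted labels, plus a few example rows
predictions.groupBy("label").count().show()
predictions.groupBy("prediction").count().show()
predictions.select("label", "prediction", "probability").show(10, truncate=False)
```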

5) Evaluate Model
a) Evaluate the model using the AUC metric via the BinaryClassificationEvaluator class. Give the predicted
probabilities and the labels as input to the evaluator.
b) Use the F1 score to evaluate the performance of the classifier. The metric to use depends on the end goal
we want to achieve, but in most scenarios the F1 score is a good metric for evaluating a classifier when the
dataset is unbalanced.
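
A sketch of the evaluation; both evaluators live in pyspark.ml.evaluation. Feeding the probability column to the binary evaluator is one common choice (the rawPrediction column works as well).

```python
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# a) Area under the ROC curve, computed from the model's scores and true labels
auc_eval = BinaryClassificationEvaluator(
    rawPredictionCol="probability",
    labelCol="label",
    metricName="areaUnderROC",
)
print("AUC:", auc_eval.evaluate(predictions))

# b) F1 score, a better summary than accuracy on this unbalanced dataset
f1_eval = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1",
)
print("F1:", f1_eval.evaluate(predictions))
```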

Spark WebUI

Most of the time is taken by the model.fit() paragraph. This task started many jobs, from job id 1564 to 1636, on my
laptop; combined, they took around 3 seconds, which is the total duration. To calculate the Executor
Computing Time, we have to go inside each job and sum the Executor Computing Time of each of its
stages. We can see from the following figure that job ids 1564 to 1636 all have the same job group.
The most time taken by a single job was 0.3 s, and only a few jobs had a duration that long. One such
job is job id 1543 in the figure below.
