Assignment 2: Hive
Hive:
Loaded the data into Hive. Dropped the ‘ID’ column, as it is not a useful feature for classification, and saved the data for further processing in Spark.
Steps:
1) Load Data
a) Load the dataset and infer the schema.
b) There are more examples belonging to class 0 than to class 1, so the dataset is not balanced.
c) As the dataset is not balanced, add a column "weight" that tells the model how important each example is
during training: examples that belong to class 1 get a weight of 2.7, and examples that belong to class 0
get a weight of 1.
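Such a weight is typically the class-imbalance ratio, so that both classes contribute roughly equally to the training loss. A minimal sketch of the calculation, using hypothetical class counts (not the actual dataset's):

```python
# Hypothetical class counts (illustrative only, not the real dataset's).
n_class0 = 27000
n_class1 = 10000

# Weight each minority-class (class 1) example by the imbalance ratio;
# majority-class examples keep weight 1.
weights = {0: 1.0, 1: n_class0 / n_class1}
print(weights)  # {0: 1.0, 1: 2.7}
```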
f) Distribution of values for the Age column. Most of the data comes from people in the 24-40 age group.
g) Distribution of values for the Marriage column. A value of 0 is not defined for this dataset, yet some
examples have it.
k) Assemble all the features into one column using VectorAssembler.
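In Spark this assembly is done by `VectorAssembler`; the pyspark-free sketch below shows what it amounts to, using hypothetical column names:

```python
# Each row is a dict of named feature columns (names are hypothetical).
rows = [
    {"LIMIT_BAL": 20000.0, "AGE": 24.0, "PAY_0": 2.0},
    {"LIMIT_BAL": 90000.0, "AGE": 34.0, "PAY_0": 0.0},
]

feature_cols = ["LIMIT_BAL", "AGE", "PAY_0"]

# Concatenate the listed columns, in order, into one feature vector per
# row -- the same content VectorAssembler puts in its output column.
def assemble(row, cols):
    return [row[c] for c in cols]

features = [assemble(r, feature_cols) for r in rows]
print(features[0])  # [20000.0, 24.0, 2.0]
```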
3) Define Model and Pipeline
a) We use a Logistic Regression model and provide the input and output columns for training. We also
provide the weight column so the model knows how much weight to give each example during training.
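Per-example weights scale each example's contribution to the training loss. A minimal sketch of a weighted logistic log-loss, with illustrative values (not the model's actual outputs):

```python
import math

# (true label, predicted probability of class 1, weight) -- illustrative.
examples = [
    (1, 0.8, 2.7),  # class-1 examples count 2.7x in the loss
    (0, 0.3, 1.0),
    (0, 0.1, 1.0),
]

# Weighted negative log-likelihood: each example's log-loss is
# multiplied by its weight before normalizing by the total weight.
def weighted_log_loss(examples):
    total_w = sum(w for _, _, w in examples)
    loss = 0.0
    for y, p, w in examples:
        loss += w * -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / total_w

print(round(weighted_log_loss(examples), 4))
```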
b) Get the number of examples where the prediction is the same as the true label.
c) Print the number of examples that have true label 0/1 and the number that have predicted label 0/1.
Print the output for some test examples.
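The bookkeeping in steps b) and c) can be sketched without Spark, using made-up labels and predictions:

```python
# Illustrative true labels and model predictions.
labels      = [0, 0, 1, 1, 0, 1]
predictions = [0, 1, 1, 0, 0, 1]

# b) number of examples where the prediction equals the true label
correct = sum(1 for y, p in zip(labels, predictions) if y == p)

# c) per-class counts for true and for predicted labels
true_counts = {c: labels.count(c) for c in (0, 1)}
pred_counts = {c: predictions.count(c) for c in (0, 1)}

print(correct)      # 4
print(true_counts)  # {0: 3, 1: 3}
print(pred_counts)  # {0: 3, 1: 3}
```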
5) Evaluate Model
a) Evaluate the model using the AUC metric, via the BinaryClassificationEvaluator class provided by Spark
ML. Give probabilities and labels as input to the evaluator.
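AUC has a simple ranking interpretation: the probability that a random positive example is scored higher than a random negative one. A minimal sketch with illustrative scores:

```python
# Illustrative (label, probability-of-class-1) pairs.
scored = [(0, 0.1), (0, 0.4), (1, 0.35), (1, 0.8)]

# AUC = fraction of positive/negative pairs ranked correctly
# (ties count half).
def auc(scored):
    pos = [s for y, s in scored if y == 1]
    neg = [s for y, s in scored if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

print(auc(scored))  # 0.75
```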
b) Use the F1 score to evaluate the performance of the classifier. The right metric depends on the end goal
we want to achieve, but in most scenarios the F1 score is a good metric when the dataset is unbalanced.
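The F1 score is the harmonic mean of precision and recall; a minimal sketch with illustrative labels (class 1 as the positive class):

```python
# Illustrative true labels and predictions.
labels      = [1, 0, 1, 1, 0, 0, 1]
predictions = [1, 0, 0, 1, 0, 1, 1]

tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)

# F1 = harmonic mean of precision and recall.
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.75
```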
Spark WebUI
Most of the time is taken by the model.fit() paragraph. This task started many jobs, from job id 1564 to
1636, on my laptop; together they took around 3 seconds, which is the total duration. To calculate the
Executor Computing Time, we have to go inside each job and sum the Executor Computing Time of each of
its stages. We can see from the following figure that jobs 1564 to 1636 have the same job group.
The longest single job took 0.3 s, and a few jobs shared that duration; one such job is job id 1543 in the
figure below.
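The summation described above can be sketched as follows; the job ids and per-stage timings here are illustrative, not taken from the actual WebUI:

```python
# Hypothetical per-job stage timings in milliseconds, keyed by job id
# (ids and values are illustrative only).
jobs = {
    1564: [120, 80],
    1565: [60],
    1566: [90, 40, 10],
}

# Executor Computing Time of a job = sum over its stages; the task's
# total is then the sum over all jobs in the job group.
per_job = {jid: sum(stages) for jid, stages in jobs.items()}
total_ms = sum(per_job.values())
print(per_job)   # {1564: 200, 1565: 60, 1566: 140}
print(total_ms)  # 400
```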