
Topic: Income Groups

Muhammad Hammad Chaudhary


Contents
• Problem Setting
• Data Exploration & Preprocessing
• Modelling
• Evaluation
• Summary
Problem Setting
- Dataset: Income Prediction
- The dataset contains 14 features, 1 label (Income), and 29,999 (~30k) instances
- There are 6 numerical features and 8 categorical features:
• Age, Employment type, Weighting factor, Level of education, Schooling period, Marital status, Employment area, Partnership, Ethnicity, Gender, Gains on financial assets, Losses on financial assets, Weekly working time, Country of birth
- The label has only 2 unique values (<=50K, >50K), which are discrete, making this a classification problem, specifically binary classification
Data Exploration & Preprocessing
- From the summary statistics, we see that the minimum and maximum values of all the numeric features are sensible.
Data Exploration & Preprocessing
- Data types of all the features are checked and are already correct
- Detected and dropped 43 duplicate instances
- Numerical and categorical columns are separated
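A minimal sketch of these checks with pandas, assuming the data has been loaded into a DataFrame named df (the file name is a placeholder):

```python
import pandas as pd

# Load the dataset (file name is a placeholder)
df = pd.read_csv("income.csv")

# Summary statistics: min/max of the numeric features should be sensible
print(df.describe())

# Check the data types of all features
print(df.dtypes)

# Detect and drop duplicate instances
print("duplicates:", df.duplicated().sum())
df = df.drop_duplicates()

# Separate numerical and categorical columns
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```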
Data Exploration & Preprocessing
- Missing Values
We see that 3 features have missing values: Employment type, Employment area, and Country of birth.
Data Exploration & Preprocessing
- Techniques for handling missing values are tested with a Decision Tree on the 5,000 instances of the dataset that have an income value
- The 5 techniques are:
- Technique 1: Removing the instances
- Technique 2: Estimating missing values by replacing them with the most frequent value
- Technique 3: Extending the range of values with the special value “missing”
- Technique 4: Introducing new binary attributes “Attribute_XY_missing”
- Technique 5: Not handling missing values
- The results were similar in all cases, so the global constant “missing” was assigned to all “?” entries, being the simplest approach
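A minimal sketch of that replacement, assuming the missing entries are stored as the string "?" and the column names follow the feature list above (both are assumptions about the actual file):

```python
# Columns reported to contain missing values
missing_cols = ["Employment type", "Employment area", "Country of birth"]

# Technique 3: extend the value range with the global constant "missing"
for col in missing_cols:
    df[col] = df[col].replace("?", "missing")
```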
Data Exploration & Preprocessing
- Correlation check for numerical columns -> it turns out that the numerical features are not correlated and are independent of each other
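A sketch of that correlation check, assuming numeric_cols holds the numerical column names from the earlier separation step (the 0.7 threshold is an arbitrary illustrative choice):

```python
# Pairwise Pearson correlation between the numerical features
corr = df[numeric_cols].corr()
print(corr)

# Flag any strongly correlated pairs, excluding the diagonal
strong = (corr.abs() > 0.7) & (corr.abs() < 1.0)
print(strong.any().any())  # False -> the features look largely independent
```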
Data Exploration & Preprocessing
- Structural errors are checked for the categorical columns; based on the unique values, there were no structural errors
- Box-and-whisker plots and histograms (binning) are used to check for outliers in the numerical columns, but it turns out there are no clear outliers; the values are within a reasonable range, just in the minority

[Histograms of Schooling period and Age against the number of instances]
Data Exploration & Preprocessing
- The categorical columns are then one-hot encoded, so the number of features grows from 14 to 108 (see the sketch below)
- The numerical columns are normalized with a min-max scaler
- Dimensionality reduction is done with the help of the models and feature importance
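A sketch of the encoding and scaling step, assuming categorical_cols and numeric_cols from the column separation above and a label column named "Income" (the column names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# One-hot encode the categorical feature columns, keeping the label out
feature_cat_cols = [c for c in categorical_cols if c != "Income"]
df_encoded = pd.get_dummies(df, columns=feature_cat_cols)

# Normalize the numerical columns to the [0, 1] range
scaler = MinMaxScaler()
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])
```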
Modelling
- One important point is that, since we have no test labels in this problem, the train data is further split in a 70/30 ratio for testing before filling in the missing income values
- Since decision trees work well with mixed numerical and categorical columns and are faster at inference, they make more sense here, and a random forest is even better; the accuracy achieved is ~84.6%
- Feature importance is used to determine the most important features for the model, and it turns out that 30 features account for more than 96% of the total importance
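A sketch of this modelling step, assuming df_encoded from the preprocessing sketch and a label column named "Income" (hyperparameter values are illustrative, not the ones actually used):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = df_encoded.drop(columns=["Income"])
y = df_encoded["Income"]

# 70/30 split of the labelled data, since no separate test labels exist
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))  # ~84.6% reported

# Rank features by importance; the slides report that the top 30
# carry more than 96% of the total importance
importances = pd.Series(rf.feature_importances_, index=X.columns)
top30 = importances.sort_values(ascending=False).head(30)
print(top30.sum())
```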
Modelling
- Feature Importance Score Visualization
Modelling & Evaluation
- The impact of feature importance here is that the model is now trained on only the 30 most important features, and the results were similar or even better, ranging from 84.5% to 86% accuracy
- As an overfitting check, k-fold cross-validation is also used with 4, 5, and 10 folds; the results of the model vary from 80% to 87%
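A sketch of the k-fold check on the reduced feature set, assuming X, y, and top30 from the previous sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Restrict the data to the 30 most important features
X_top = X[top30.index]

for k in (4, 5, 10):
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X_top, y, cv=k, scoring="accuracy",
    )
    print(k, "folds:", scores.mean())  # reported range: roughly 80% to 87%
```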
Modelling & Evaluation
- Another model is tested in search of better performance: logistic regression is used and gives similar but slightly better results, around 0.5% higher
- Confusion matrix
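A sketch of the logistic regression comparison and the confusion matrix, under the same split and feature selection as before (max_iter is an illustrative value):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train[top30.index], y_train)

y_pred = lr.predict(X_test[top30.index])
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```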
Evaluation
- In this problem, since we have an imbalanced class distribution, precision and recall are both important to us, hence the focus when evaluating the model is on the F1-score
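A sketch of that F1-based evaluation on the held-out split, assuming y_test and y_pred from the previous sketch and that the positive class is labelled ">50K" (an assumption about the label encoding):

```python
from sklearn.metrics import classification_report, f1_score

print("f1:", f1_score(y_test, y_pred, pos_label=">50K"))
print(classification_report(y_test, y_pred))  # precision, recall, and F1 per class
```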
Summary
- Each step was equally important in this problem, be it data preprocessing, modelling, or evaluation
- Different hyperparameters were tuned to fine-tune the models, such as n_estimators and max_depth in the random forest, while penalty and max_iter were used in logistic regression (see the sketch after this list)
- The train data was not big enough to use neural networks
- An SVM would likely have performed similarly to logistic regression or even better, but it is a better fit for large applications such as recognition and detection, and for unsupervised or semi-supervised data, which was not the case here
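A sketch of how such tuning could be done with a grid search, assuming the random forest setup from the modelling step (the grid values are illustrative, not the ones actually searched):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200, 400], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="f1_macro",
)
search.fit(X_train[top30.index], y_train)
print(search.best_params_, search.best_score_)
```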
Any Questions?
