
Topic: Income Groups

Muhammad Hammad Chaudhary


Contents
• Problem Setting
• Data Exploration & Preprocessing
• Modelling
• Evaluation
• Summary
Problem Setting
- Dataset: Income Prediction
- The dataset contains 14 features, 1 label (Income), and 29,999 (~30k) instances
- There are 6 numerical features and 8 categorical features:
• Age, Employment type, Weighting factor, Level of education, Schooling period, Marital status, Employment area, Partnership, Ethnicity, Gender, Gains on financial assets, Losses on financial assets, Weekly working time, Country of birth
- The label has only 2 unique values (<=50K, >50K), which are discrete, making this a classification problem, specifically binary classification
Data Exploration & Preprocessing
- From the summary statistics, we see that the minimum and maximum values of all the numeric features are sensible.
Data Exploration & Preprocessing
- Data types of all the features are checked and are already correct
- Detected and dropped 43 duplicate instances
- Numerical and categorical columns are separated
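A minimal sketch of these checks with pandas, assuming the data has been loaded into a DataFrame named df (the file name is a placeholder):

```python
import pandas as pd

# Load the dataset (file name is a placeholder)
df = pd.read_csv("income.csv")

# Summary statistics: min/max of the numeric features should be sensible
print(df.describe())

# Check the data types of all features
print(df.dtypes)

# Detect and drop duplicate instances
print("duplicates:", df.duplicated().sum())
df = df.drop_duplicates()

# Separate numerical and categorical columns
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```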
Data Exploration & Preprocessing
- Missing Values
We see that 3 features have missing values: Employment type, Employment area, and Country of birth.
Data Exploration & Preprocessing
- Techniques for handling missing values are tested with a Decision Tree on the 5,000 instances of the dataset that have an income value
- The 5 techniques are:
- Technique 1: Removing the instances
- Technique 2: Estimating missing values by replacing them with the most frequent value
- Technique 3: Extending the range of values with the special value “missing”
- Technique 4: Introducing new binary attributes “Attribute_XY_missing”
- Technique 5: Not handling missing values
- The results were similar in all cases, so the global constant “missing” was assigned to all “?” entries, being the simplest approach
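A minimal sketch of that replacement, assuming the missing entries are stored as the string "?" and the column names follow the feature list above (both are assumptions about the actual file):

```python
# Columns reported to contain missing values
missing_cols = ["Employment type", "Employment area", "Country of birth"]

# Technique 3: extend the value range with the global constant "missing"
for col in missing_cols:
    df[col] = df[col].replace("?", "missing")
```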
Data Exploration & Preprocessing
- Correlation check for numerical columns -> it turns out that the numerical features are not correlated and are independent of each other
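A sketch of that correlation check, assuming numeric_cols holds the numerical column names from the earlier separation step (the 0.7 threshold is an arbitrary illustrative choice):

```python
# Pairwise Pearson correlation between the numerical features
corr = df[numeric_cols].corr()
print(corr)

# Flag any strongly correlated pairs, excluding the diagonal
strong = (corr.abs() > 0.7) & (corr.abs() < 1.0)
print(strong.any().any())  # False -> the features look largely independent
```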
Data Exploration & Preprocessing
- Structural errors are checked for the categorical columns; based on the unique values, there were no structural errors
- Box-and-whisker plots and histograms (binning) are used to check for outliers in the numerical columns, but it turns out there are no clear outliers; the values are within a reasonable range, just in the minority

[Histograms of Schooling period and Age against the number of instances]
Data Exploration & Preprocessing
- The categorical columns are then one-hot encoded, so the number of features grows from 14 to 108 (see the sketch below)
- The numerical columns are normalized with a min-max scaler
- Dimensionality reduction is done with the help of the models and feature importance
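A sketch of the encoding and scaling step, assuming categorical_cols and numeric_cols from the column separation above and a label column named "Income" (the column names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# One-hot encode the categorical feature columns, keeping the label out
feature_cat_cols = [c for c in categorical_cols if c != "Income"]
df_encoded = pd.get_dummies(df, columns=feature_cat_cols)

# Normalize the numerical columns to the [0, 1] range
scaler = MinMaxScaler()
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])
```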
Modelling
- One important point is that, since we have no test labels in this problem, the train data is further split in a 70/30 ratio for testing before filling in the missing income values
- Since decision trees work well with mixed numerical and categorical columns and are faster at inference, they make more sense here, and a random forest is even better; the accuracy achieved is ~84.6%
- Feature importance is used to determine the most important features for the model, and it turns out that 30 features account for more than 96% of the total importance
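A sketch of this modelling step, assuming df_encoded from the preprocessing sketch and a label column named "Income" (hyperparameter values are illustrative, not the ones actually used):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = df_encoded.drop(columns=["Income"])
y = df_encoded["Income"]

# 70/30 split of the labelled data, since no separate test labels exist
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))  # ~84.6% reported

# Rank features by importance; the slides report that the top 30
# carry more than 96% of the total importance
importances = pd.Series(rf.feature_importances_, index=X.columns)
top30 = importances.sort_values(ascending=False).head(30)
print(top30.sum())
```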
Modelling
- Feature Importance Score Visualization
Modelling & Evaluation
- The impact of feature importance here is that the model is now trained on only the 30 most important features, and the results were similar or even better, ranging from 84.5% to 86% accuracy
- As an overfitting check, k-fold cross-validation is also used with 4, 5, and 10 folds; the results of the model vary from 80% to 87%
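A sketch of the k-fold check on the reduced feature set, assuming X, y, and top30 from the previous sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Restrict the data to the 30 most important features
X_top = X[top30.index]

for k in (4, 5, 10):
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X_top, y, cv=k, scoring="accuracy",
    )
    print(k, "folds:", scores.mean())  # reported range: roughly 80% to 87%
```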
Modelling & Evaluation
- Another model is tested in search of better performance: logistic regression is used and gives similar but slightly better results, around 0.5% higher
- Confusion matrix
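A sketch of the logistic regression comparison and the confusion matrix, under the same split and feature selection as before (max_iter is an illustrative value):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train[top30.index], y_train)

y_pred = lr.predict(X_test[top30.index])
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```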
Evaluation
- In this problem, since we have an imbalanced class distribution, precision and recall are both important to us, hence the focus when evaluating the model is on the F1-score
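A sketch of that F1-based evaluation on the held-out split, assuming y_test and y_pred from the previous sketch and that the positive class is labelled ">50K" (an assumption about the label encoding):

```python
from sklearn.metrics import classification_report, f1_score

print("f1:", f1_score(y_test, y_pred, pos_label=">50K"))
print(classification_report(y_test, y_pred))  # precision, recall, and F1 per class
```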
Summary
- Each step was equally important in this problem, be it data preprocessing, modelling, or evaluation
- Different hyperparameters were tuned to fine-tune the models, such as n_estimators and max_depth in the random forest, while penalty and max_iter were used in logistic regression (see the sketch after this list)
- The train data was not big enough to use neural networks
- An SVM would likely have performed similarly to logistic regression or even better, but it is a better fit for large applications such as recognition and detection, and for unsupervised or semi-supervised data, which was not the case here
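A sketch of how such tuning could be done with a grid search, assuming the random forest setup from the modelling step (the grid values are illustrative, not the ones actually searched):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200, 400], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="f1_macro",
)
search.fit(X_train[top30.index], y_train)
print(search.best_params_, search.best_score_)
```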
Any Questions?
