Data Exploration & Preprocessing
- Categorical columns are one-hot encoded, expanding the feature count from 14 to 108
- Numerical columns are normalized with a min-max scaler
- Dimensionality reduction is performed using model-based feature importance
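A minimal sketch of the two preprocessing steps above, in pure Python; the column names and categories are illustrative placeholders, not the actual dataset schema:

```python
# Illustrative sketch: one-hot encoding a categorical column and
# min-max scaling a numerical column. Column names are hypothetical.

def one_hot_encode(rows, column, categories):
    """Replace a categorical column with one binary column per category."""
    for row in rows:
        value = row.pop(column)
        for cat in categories:
            row[f"{column}_{cat}"] = 1 if value == cat else 0
    return rows

def min_max_scale(rows, column):
    """Scale a numerical column into the [0, 1] range."""
    values = [row[column] for row in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # guard against constant columns
    for row in rows:
        row[column] = (row[column] - lo) / span
    return rows

data = [
    {"age": 25, "workclass": "Private"},
    {"age": 45, "workclass": "Self-emp"},
    {"age": 35, "workclass": "Private"},
]
data = one_hot_encode(data, "workclass", ["Private", "Self-emp"])
data = min_max_scale(data, "age")
```

Applied across all categorical columns, this per-category expansion is what grows the feature count (here 2 columns become 3; in the dataset 14 become 108).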
Modelling
- One important point: since no test labels are available in this problem, the training data is further split in a 70/30 ratio to create a test set before filling in the missing income values
- Decision trees handle a mix of numerical and categorical columns well and are fast at inference, so they are a natural fit here; a random forest does even better, achieving ~84.6% accuracy
- Feature importance is used to identify the most influential features for the model; it turns out that the top 30 features account for more than 96% of the total importance
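A sketch of this modelling step, assuming scikit-learn; synthetic data stands in for the actual income dataset, so the numbers here are illustrative only:

```python
# Sketch: 70/30 split of the labelled training data, then a random
# forest classifier. make_classification is a stand-in for the real data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # 70/30 split as described

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))

# feature_importances_ is the per-feature score used for the
# dimensionality reduction described above.
importances = rf.feature_importances_
```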
Modelling
- Feature Importance Score Visualization
Modelling & Evaluation
- The impact of feature importance here is that the model is retrained on only the 30 most important features, and the results are similar or even better, with accuracy ranging from 84.5% to 86%
- As a check for overfitting, k-fold cross-validation is also run with k = 4, 5, and 10; the model's accuracy varies from 80% to 87% across folds
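The feature selection and cross-validation check can be sketched as follows; the data and the top-k cutoff are toy stand-ins (the slides keep the top 30 of 108 features, here 10 of 40):

```python
# Sketch: keep the top-k features by importance, then run k-fold
# cross-validation on the reduced feature set as an overfitting check.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=40,
                           n_informative=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Indices of the k most important features, highest first.
top = np.argsort(rf.feature_importances_)[::-1][:10]
X_top = X[:, top]

# 5-fold cross-validation (the slides also try k = 4 and k = 10).
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_top, y, cv=5)
```

The spread of `scores` across folds is what indicates how stable the model is; a wide spread, as seen in the 80%–87% range above, suggests sensitivity to the particular split.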
Modelling & Evaluation
- Another model is tested in search of better performance: logistic regression gives comparable results, about 0.5% better
- Confusion matrix
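A sketch of the logistic regression comparison and its confusion matrix, again on synthetic stand-in data:

```python
# Sketch: logistic regression on the same 70/30 split, with a
# confusion matrix for evaluation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, lr.predict(X_te))  # rows: true, cols: predicted
```

The off-diagonal cells of `cm` count the two error types (false positives and false negatives), which is what makes the confusion matrix useful under class imbalance.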
Evaluation
- Since this problem has an imbalanced class distribution, precision and recall are both important, so model evaluation focuses on the F1-score
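A small worked example of why the F1-score fits here: it is the harmonic mean of precision and recall, so it punishes a model that trades one for the other on an imbalanced label set (labels below are made up for illustration):

```python
# F1 combines precision and recall into one number.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # imbalanced: 4 positives of 10
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]   # 3 true positives, 1 false each way

p = precision_score(y_true, y_pred)   # 3 TP / 4 predicted positive = 0.75
r = recall_score(y_true, y_pred)      # 3 TP / 4 actual positive = 0.75
f1 = f1_score(y_true, y_pred)         # harmonic mean of p and r = 0.75
```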
Summary
- Each step was equally important in this problem, be it data preprocessing, modelling, or evaluation
- Different hyperparameter-tuning techniques were used to fine-tune the models: n_estimators and max_depth for the random forest, and penalty and max_iter for logistic regression
- The training data was not big enough to warrant a neural network
- An SVM might have performed similarly to logistic regression or even better, but it is a better fit for large-scale applications such as recognition and detection, and for unsupervised or semi-supervised settings, which was not the case here
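The hyperparameter tuning summarized above can be sketched with a grid search over the parameters the slides name; the grid values themselves are illustrative choices:

```python
# Sketch: grid search over the random-forest hyperparameters named in
# the slides (n_estimators, max_depth). Grid values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [5, None]},
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_  # best combination found by cross-validation
```

The same pattern applies to logistic regression by swapping in a grid over `penalty` and `max_iter`.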
Any Questions?