Midterm Data Mining
BUSINESS ANALYTICS
COURSE: INS2061
MIDTERM PROJECT
Lecturer: Assoc. Prof. Tran Thi Oanh
PREPARED BY: NGUYEN THAO VI - 20070011, DANG THUY NGAN - 20070762
PREPARED DATE: Dec 11, 2022
Table of Contents
1. Feature Discovery
2. Feature Selection
3. Input Preparation
5. Hyperparameter Tuning
6. Conclusion
Link to our Google Colab notebook:
https://colab.research.google.com/drive/1TkUQdGP-YyC0cUe6vhjqbjBo69ugTR_x?usp=sharing
The libraries we use for data mining are primarily numpy (for working with arrays), pandas (for loading and manipulating the extracted data), and seaborn (for visualizing the data). We also use the %matplotlib inline magic command so that the visualizations render inside the notebook.
1. Feature Discovery
We notice that the sex_upon_outcome column denotes both the cat's sex and whether it was intact or spayed/neutered. This variable is split into two features: sex and Spay/Neuter.
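The split described above can be sketched with pandas string operations — a minimal sketch, assuming the raw values look like "Neutered Male" or "Intact Female" (the toy values below are illustrative):

```python
import pandas as pd

# Toy values standing in for the real sex_upon_outcome column.
df = pd.DataFrame({"sex_upon_outcome": ["Neutered Male", "Intact Female", "Spayed Female"]})

# The last word is the sex; the first word says whether the cat was fixed.
df["sex"] = df["sex_upon_outcome"].str.split().str[-1]
df["Spay/Neuter"] = (df["sex_upon_outcome"].str.split().str[0]
                     .isin(["Neutered", "Spayed"])
                     .map({True: "Yes", False: "No"}))
```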
Date features record the month and year the cat was born and the time of the outcome. We observe that they are extracted and stored in the dob_year, dob_month, dob_monthyear, outcome_month, outcome_year, outcome_weekday, and outcome_hour columns, which makes the data easier to analyze and to work with in pandas.
The cat's age_upon_outcome is given in a messy format: a number plus a unit such as weeks, months, or years. This data is transformed into numeric day and year values in the outcome_age_(days) and outcome_age_(years) columns.
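The age conversion can be sketched as follows — a hedged sketch, where the unit-to-days mapping is our assumption about the raw format:

```python
import pandas as pd

# Assumed conversion factors from age-string units to days.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(age):
    """Convert a string like '2 weeks' or '1 year' to a day count."""
    value, unit = age.split()
    return int(value) * UNIT_DAYS[unit.rstrip("s")]  # "weeks" -> "week"

ages = pd.Series(["2 weeks", "3 months", "1 year"])
days = ages.map(age_to_days)   # outcome_age_(days)
years = days / 365             # outcome_age_(years)
```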
We also notice that age_group consists of bins of outcome_age_(years).
The cat's breeds are classified by origin and hair length. However, some of them may be indistinguishable, since several breed values contain two variables at once. The breed1 and breed2 columns are extracted from the breed feature, which helps clarify whether a cat is a cfa_breed or a domestic_breed.
Another feature we are concerned with is the cat's color. Cats come in a variety of colors, and one cat may have not only multiple colors but also different coat patterns. The color feature is split into the coat_pattern, color1, and color2 columns, which together describe all of a cat's colors.
The coat feature can then be created from the extracted coat colors and patterns by taking the value of color1, or of coat_pattern when color1 is null.
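This fallback rule is a one-liner with pandas — a minimal sketch using toy values in place of the real columns:

```python
import pandas as pd

# Toy rows: color1 is missing in the second row.
df = pd.DataFrame({"color1": ["black", None, "white"],
                   "coat_pattern": ["tabby", "calico", None]})

# coat = color1 where present, otherwise fall back to coat_pattern.
df["coat"] = df["color1"].fillna(df["coat_pattern"])
```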
2. Feature Selection
Because of the similarities between the attributes described above, we decided to select only a few as inputs for our models, both to reduce training time and to avoid multicollinearity between the leading attributes, which would affect the final result.
11. outcome_subtype
3. Input Preparation
The remaining columns (age_group, coat_pattern, color2, coat, outcome_subtype) each contain more than two unique values, so we opted to perform one-hot encoding (get_dummies) on these features.
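One-hot encoding with get_dummies turns each unique value into its own 0/1 indicator column — a minimal sketch on a toy column standing in for the real ones:

```python
import pandas as pd

# Toy multi-valued column (the real ones are age_group, coat_pattern, etc.).
df = pd.DataFrame({"coat": ["black", "tabby", "black"]})

# One indicator column per unique value, prefixed with the feature name.
encoded = pd.get_dummies(df["coat"], prefix="coat")
```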
We then concatenated everything to create our input table, which now has 23,534 records and 90 columns.
We divided the data into training and testing sets with a ratio of 7:3.
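The 7:3 split can be sketched with scikit-learn's train_test_split; the stratify argument (which keeps class proportions similar in both sets) is our assumption, not stated in the report:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature table
y = np.array([0, 1] * 5)           # toy outcome labels

# test_size=0.3 gives the 7:3 ratio described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```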
The outcome_type labels are encoded as:
0 = Transfer
1 = Adoption
2 = Return to Owner
3 = Died
4 = Euthanasia
5 = Missing
6 = Disposal
7 = Rto-Adopt
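The label encoding above can be sketched as a plain dictionary mapping (a sketch; the original notebook may use a LabelEncoder instead):

```python
import pandas as pd

# Integer codes for each outcome type, matching the list above.
codes = {"Transfer": 0, "Adoption": 1, "Return to Owner": 2, "Died": 3,
         "Euthanasia": 4, "Missing": 5, "Disposal": 6, "Rto-Adopt": 7}

y = pd.Series(["Adoption", "Transfer", "Euthanasia"]).map(codes)
```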
The two models that we chose were Random Forest and K-nearest neighbors. These are models we have been taught and that are claimed to natively support missing values (our data includes numerous NaN values).
Our procedure:
1. Train the models in their default state (no parameter setup) on the split training and testing datasets and display the results.
2. Run Repeated Stratified KFold cross-validation on the whole dataset and compare the results of the two models.
3. Choose one model and perform hyperparameter tuning on both the split datasets and the whole dataset.
We chose Repeated Stratified KFold because it is very effective for classification problems with severe class imbalance, as in our case: it splits the dataset so that each fold preserves approximately the same class distribution. Repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs improves the estimate of a model's performance, so this method gives us a better assessment of model performance on the whole dataset.
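The cross-validation described above can be sketched as follows, on a synthetic imbalanced dataset standing in for ours (fold and repeat counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced classification data (80/20 class split).
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# Each of the 3 repeats re-shuffles and re-splits into 5 stratified folds.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
mean_score = scores.mean()  # mean across all folds from all repeats
```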
Accuracy, Precision, Recall, F1-score, the confusion matrix, and the ROC curve are our evaluation metrics.
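These metrics can all be computed with scikit-learn — a sketch on toy predictions (the ROC curve additionally needs per-class predicted probabilities, e.g. from predict_proba):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 1]

acc = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred)  # per-class precision/recall/F1
cm = confusion_matrix(y_true, y_pred)           # rows: true, columns: predicted
```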
Random Forest Report
An accuracy of 0.94 is quite high; however, the fundamental downside of accuracy is that it obscures the problem of class imbalance, so we also have to examine other metrics.
No label is more important than another, so there is no need to check each individual score. We can instead look at the macro and weighted averages of the three metrics: the macro average is the plain mean over all labels, while the weighted average is the sum of each label's metric multiplied by its number of samples, divided by the total number of samples.
The weighted average takes into account how many instances of each class there were, so a class with fewer samples has less impact on the weighted precision, recall, and F1 scores.
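The difference between the two averages can be seen in a small worked example with hypothetical per-class scores and supports:

```python
# Hypothetical per-class F1 scores and supports (sample counts per class).
f1 = [0.9, 0.8, 0.0]
support = [90, 8, 2]

macro = sum(f1) / len(f1)                                        # plain mean
weighted = sum(f * s for f, s in zip(f1, support)) / sum(support)
# The rare class with F1 = 0 drags the macro average down far more
# than the weighted average, since it has only 2 of 100 samples.
```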
For example, in this case, the weighted average precision was high because classes 5, 6, and 7 (where the precision scores were 0) contain far fewer samples.
The confusion matrix shows that the model classified transferred, adopted, and euthanized cats very well, which is understandable since these three types have the most samples. On the contrary, Missing, Disposal, and RTO-Adopt cats are harder to detect because of the lack of records.
Random Forest Confusion Matrix
Random Forest ROC Curve
The ROC curve gives a clearer picture of Random Forest's performance for each class and supports the same conclusion as above. We also looked at Random Forest's feature importance: according to the figure below, outcome_subtype, name, and Cat/Kitten (outcome) are the three features with the most impact on the outcome.
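Reading importances off a fitted Random Forest can be sketched as follows (toy synthetic features; the real ones include outcome_subtype, name, and Cat/Kitten):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 5 toy features.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ is normalized to sum to 1; sort for ranking.
importances = pd.Series(model.feature_importances_,
                        index=[f"f{i}" for i in range(5)]).sort_values(ascending=False)
```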
4.2. K-nearest neighbors
KNN Report
KNN Confusion Matrix
KNN ROC Curve
As the results suggest, Random Forest performed better than K-nearest neighbors at predicting the outcome type of shelter cats on the dataset with the chosen features, so we chose to optimize Random Forest.
5. Hyperparameter Tuning
We used GridSearchCV for this purpose due to limited computational power. However, GridSearchCV only tries combinations from a predefined list of hyperparameter values and chooses the best combination based on the cross-validation score, so we cannot guarantee the best possible result.
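A grid search over a Random Forest can be sketched as follows; the parameter grid below is illustrative, not the one used in the report:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Illustrative grid: every combination is tried and scored by 3-fold CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_  # best combination by cross-validation score
```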
GridSearchCV Result
Overall, the result after GridSearchCV is no better than the default, since we only searched the parameters within a limited scope. Therefore, we will temporarily keep Random Forest at its default settings as our choice for now.
6. Conclusion
After examining the shelter cats dataset, we picked out 10 features that we believe are not interdependent, are convenient to encode, and are significant in determining the outcome type. We also trained two models we had learned, Random Forest and K-nearest neighbors, and compared their final results. For now, the default Random Forest classifier performed best. In the future, we will try other models and hyperparameter tuning methods for better results.