Midterm Data Mining
BUSINESS ANALYTICS
COURSE: INS2061
MIDTERM PROJECT
Lecturer: Assoc. Prof. Tran Thi Oanh
PREPARED BY: NGUYEN THAO VI - 20070011, DANG THUY NGAN - 20070762
PREPARED DATE: Dec 11, 2022
Table of Contents
1. Feature Discovery
2. Feature Selection
3. Input Preparation
5. Hyperparameter Tuning
6. Conclusion
Link to our Google Colab notebook:
https://colab.research.google.com/drive/1TkUQdGP-YyC0cUe6vhjqbjBo69ugTR_x?usp=sharing
The libraries we use for data mining are primarily numpy (for working with arrays), pandas (for loading and manipulating the extracted data), and seaborn (for visualizing the data). We also use the %matplotlib inline magic command so that the visualizations render inside the notebook.
1. Feature Discovery
We notice that the sex_upon_outcome column denotes both the cat's sex and whether it was intact or spayed/neutered. This variable is split into two features: sex and Spay/Neuter.
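The split described above can be sketched with pandas string operations — a minimal sketch, assuming the raw values look like "Neutered Male" or "Intact Female" (the toy values below are illustrative):

```python
import pandas as pd

# Toy values standing in for the real sex_upon_outcome column.
df = pd.DataFrame({"sex_upon_outcome": ["Neutered Male", "Intact Female", "Spayed Female"]})

# The last word is the sex; the first word says whether the cat was fixed.
df["sex"] = df["sex_upon_outcome"].str.split().str[-1]
df["Spay/Neuter"] = (df["sex_upon_outcome"].str.split().str[0]
                     .isin(["Neutered", "Spayed"])
                     .map({True: "Yes", False: "No"}))
```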
Date features record the month and year the cat was born and the time of the outcome. We observe that they are extracted and stored in the dob_year, dob_month, dob_monthyear, outcome_month, outcome_year, outcome_weekday, and outcome_hour columns, which makes the data easier to analyze and to work with in pandas.
The cat's age_upon_outcome is given in a messy format: a number plus a unit such as weeks, months, or years. This data is transformed into numeric day and year values in the outcome_age_(days) and outcome_age_(years) columns.
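The age conversion can be sketched as follows — a hedged sketch, where the unit-to-days mapping is our assumption about the raw format:

```python
import pandas as pd

# Assumed conversion factors from age-string units to days.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_days(age):
    """Convert a string like '2 weeks' or '1 year' to a day count."""
    value, unit = age.split()
    return int(value) * UNIT_DAYS[unit.rstrip("s")]  # "weeks" -> "week"

ages = pd.Series(["2 weeks", "3 months", "1 year"])
days = ages.map(age_to_days)   # outcome_age_(days)
years = days / 365             # outcome_age_(years)
```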
We also notice that age_group consists of bins of outcome_age_(years).
The cat's breeds are classified by origin and hair length. However, some of them may be indistinguishable, since several breed values contain two variables at once. The breed1 and breed2 columns are extracted from the breed feature, which helps clarify whether a cat is a cfa_breed or a domestic_breed.
Another feature we are concerned with is the cat's color. Cats come in a variety of colors, and one cat may have not only multiple colors but also different coat patterns. The color feature is split into the coat_pattern, color1, and color2 columns, which together describe all of a cat's colors.
The coat feature can then be created from the extracted coat colors and patterns by taking the value of color1, or of coat_pattern when color1 is null.
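This fallback rule is a one-liner with pandas — a minimal sketch using toy values in place of the real columns:

```python
import pandas as pd

# Toy rows: color1 is missing in the second row.
df = pd.DataFrame({"color1": ["black", None, "white"],
                   "coat_pattern": ["tabby", "calico", None]})

# coat = color1 where present, otherwise fall back to coat_pattern.
df["coat"] = df["color1"].fillna(df["coat_pattern"])
```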
2. Feature Selection
Because of the similarities between the attributes described above, we decided to select only a few as inputs for our models, both to reduce training time and to avoid multicollinearity between the leading attributes, which would affect the final result.
11. outcome_subtype
3. Input Preparation
The remaining columns (age_group, coat_pattern, color2, coat, outcome_subtype) each contain more than two unique values, so we opted to perform one-hot encoding (get_dummies) on these features.
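One-hot encoding with get_dummies turns each unique value into its own 0/1 indicator column — a minimal sketch on a toy column standing in for the real ones:

```python
import pandas as pd

# Toy multi-valued column (the real ones are age_group, coat_pattern, etc.).
df = pd.DataFrame({"coat": ["black", "tabby", "black"]})

# One indicator column per unique value, prefixed with the feature name.
encoded = pd.get_dummies(df["coat"], prefix="coat")
```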
We then concatenated everything to create our input table, which now has 23,534 records and 90 columns.
We divided the data into training and testing sets with a ratio of 7:3.
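The 7:3 split can be sketched with scikit-learn's train_test_split; the stratify argument (which keeps class proportions similar in both sets) is our assumption, not stated in the report:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature table
y = np.array([0, 1] * 5)           # toy outcome labels

# test_size=0.3 gives the 7:3 ratio described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```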
The outcome_type labels are encoded as:
0 = Transfer
1 = Adoption
2 = Return to Owner
3 = Died
4 = Euthanasia
5 = Missing
6 = Disposal
7 = Rto-Adopt
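The label encoding above can be sketched as a plain dictionary mapping (a sketch; the original notebook may use a LabelEncoder instead):

```python
import pandas as pd

# Integer codes for each outcome type, matching the list above.
codes = {"Transfer": 0, "Adoption": 1, "Return to Owner": 2, "Died": 3,
         "Euthanasia": 4, "Missing": 5, "Disposal": 6, "Rto-Adopt": 7}

y = pd.Series(["Adoption", "Transfer", "Euthanasia"]).map(codes)
```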
The two models that we chose were Random Forest and K-nearest neighbors. These are models we have been taught and that are claimed to natively support missing values (our data includes numerous NaN values).
Our procedure:
1. Train the models in their default state (no parameter setup) on the split training and testing datasets and display the results.
2. Run Repeated Stratified KFold cross-validation on the whole dataset and compare the results of the two models.
3. Choose one model and perform hyperparameter tuning on both the split datasets and the whole dataset.
We chose Repeated Stratified KFold because it is very effective for classification problems with severe class imbalance, as in our case: it splits the dataset so that each fold preserves approximately the same class distribution. Repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs improves the estimate of a model's performance, so this method gives us a better assessment of model performance on the whole dataset.
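The cross-validation described above can be sketched as follows, on a synthetic imbalanced dataset standing in for ours (fold and repeat counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced classification data (80/20 class split).
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# Each of the 3 repeats re-shuffles and re-splits into 5 stratified folds.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
mean_score = scores.mean()  # mean across all folds from all repeats
```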
Accuracy, Precision, Recall, F1-score, the confusion matrix, and the ROC curve are our evaluation metrics.
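These metrics can all be computed with scikit-learn — a sketch on toy predictions (the ROC curve additionally needs per-class predicted probabilities, e.g. from predict_proba):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 1]

acc = accuracy_score(y_true, y_pred)
report = classification_report(y_true, y_pred)  # per-class precision/recall/F1
cm = confusion_matrix(y_true, y_pred)           # rows: true, columns: predicted
```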
Random Forest Report
An accuracy of 0.94 is quite high; however, the fundamental downside of accuracy is that it obscures the problem of class imbalance, so we also have to examine other metrics.
No label is more important than another, so there is no need to check each individual score. We can instead look at the macro and weighted averages of the three metrics: the macro average is the plain mean over all labels, while the weighted average is the sum of each label's metric multiplied by its number of samples, divided by the total number of samples.
The weighted average takes into account how many instances of each class there were, so a class with fewer samples has less impact on the weighted precision, recall, and F1 scores.
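The difference between the two averages can be seen in a small worked example with hypothetical per-class scores and supports:

```python
# Hypothetical per-class F1 scores and supports (sample counts per class).
f1 = [0.9, 0.8, 0.0]
support = [90, 8, 2]

macro = sum(f1) / len(f1)                                        # plain mean
weighted = sum(f * s for f, s in zip(f1, support)) / sum(support)
# The rare class with F1 = 0 drags the macro average down far more
# than the weighted average, since it has only 2 of 100 samples.
```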
For example, in this case, the weighted average precision was high because classes 5, 6, and 7 (where the precision scores were 0) contain far fewer samples.
The confusion matrix shows that the model classified transferred, adopted, and euthanized cats very well, which is understandable since these three types have the most samples. On the contrary, Missing, Disposal, and RTO-Adopt cats are harder to detect because of the lack of records.
Random Forest Confusion Matrix
Random Forest ROC Curve
The ROC curve gives a clearer picture of Random Forest's performance for each class and supports the same conclusion as above. We also looked at Random Forest's feature importance: according to the figure below, outcome_subtype, name, and Cat/Kitten (outcome) are the three features with the most impact on the outcome.
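Reading importances off a fitted Random Forest can be sketched as follows (toy synthetic features; the real ones include outcome_subtype, name, and Cat/Kitten):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 5 toy features.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ is normalized to sum to 1; sort for ranking.
importances = pd.Series(model.feature_importances_,
                        index=[f"f{i}" for i in range(5)]).sort_values(ascending=False)
```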
4.2. K-nearest neighbors
KNN Report
KNN Confusion Matrix
KNN ROC Curve
As the results suggest, Random Forest performed better than K-nearest neighbors at predicting the outcome type of shelter cats on the dataset with the chosen features, so we chose to optimize Random Forest.
5. Hyperparameter Tuning
We used GridSearchCV for this purpose due to limited computational power. However, GridSearchCV only tries combinations from a predefined list of hyperparameter values and chooses the best combination based on the cross-validation score, so we cannot guarantee the best possible result.
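A grid search over a Random Forest can be sketched as follows; the parameter grid below is illustrative, not the one used in the report:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Illustrative grid: every combination is tried and scored by 3-fold CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_  # best combination by cross-validation score
```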
GridSearchCV Result
Overall, the result after GridSearchCV is no better than the default, since we only searched the parameters within a limited scope. Therefore, we will temporarily keep Random Forest at its default settings as our choice for now.
6. Conclusion
After examining the shelter cats dataset, we picked out 10 features that we believe are not interdependent, are convenient to encode, and are significant in determining the outcome type. We also trained two models we had learned, Random Forest and K-nearest neighbors, and compared their final results. For now, the default Random Forest classifier performed best. In the future, we will try other models and hyperparameter tuning methods for better results.