
INSE 6220 -- Week 11


Advanced Statistical Approaches to Quality

• Machine Learning
• Classification Algorithms

Machine Learning
• Supervised Learning: Data and corresponding labels are given.
• Unsupervised Learning: Only data is given, no labels provided.
• Semi-supervised Learning: Some (if not all) labels are present.
• Reinforcement Learning: An agent interacting with the world makes observations, takes
actions, and is rewarded or punished; it should learn to choose actions so as to maximize
its cumulative reward.

Classification vs. Clustering


• Classification and clustering are two methods of pattern identification used in
machine learning.
• Although the two techniques have certain similarities, they differ in that
classification assigns objects to predefined classes, while clustering identifies
similarities between objects and groups them according to the characteristics they have
in common, which differentiate them from other groups of objects. These groups are
known as "clusters".

What is Classification in Machine Learning


• Classification is the process of categorizing a given set of data into classes. It
can be performed on both structured and unstructured data. The process starts
with predicting the class of given data points. The classes are often referred to
as targets, labels, or categories.

• Classification predictive modeling is the task of approximating the mapping
function from input variables to discrete output variables. The main goal is to
identify which class/category new data will fall into.

Classification Terminology in Machine Learning


• Classifier – It is an algorithm that is used to map the input data to a specific category.
• Classification Model – The model trained on the input data; it predicts the class or
category for new data.
• Feature – A feature is an individual measurable property of the phenomenon being
observed.
• Binary Classification – It is a type of classification with two outcomes, e.g. either true or
false.
• Multi-Class Classification – The classification with more than two classes, in multi-class
classification each sample is assigned to one and only one label or target.
• Multi-label Classification – This is a type of classification where each sample is
assigned to a set of labels or targets.
• Initialize – Instantiate the classifier to be used.
• Train the Classifier – Each classifier in scikit-learn uses the fit(X, y) method to fit the
model on the training data X and training labels y.
• Predict the Target – For an unlabeled observation X, the predict(X) method returns the
predicted label y.
• Evaluate – This means evaluating the model, e.g. with a classification report,
accuracy score, etc. A minimal end-to-end sketch of this workflow is shown below.
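A minimal sketch of the initialize / train / predict / evaluate workflow with scikit-learn; the synthetic dataset and the choice of classifier are illustrative assumptions, not part of the original slides:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Illustrative synthetic data, split into training and test sets
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)      # Initialize
clf.fit(X_train, y_train)                      # Train the classifier
y_pred = clf.predict(X_test)                   # Predict the target
print(accuracy_score(y_test, y_pred))          # Evaluate
print(classification_report(y_test, y_pred))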

Machine Learning Classification Framework


• Machine learning is one of the most prominent technologies in the field
of artificial intelligence. It involves the use of algorithms that allow
machines to learn by imitating the way humans learn.
• Apply a prediction function to a feature representation of the image, for
example, to get the desired output:

f([image of an apple]) = “apple”
f([image of a tomato]) = “tomato”
f([image of a cow]) = “cow”

Machine Learning Classification Framework

y = f(x), where x is the input feature, f is the prediction function, and y is the output (the prediction).

• Training: given a training set of labeled examples {(x1, y1), …, (xN, yN)},
estimate the prediction function f by minimizing the prediction error on
the training set.
• Testing: apply f to an unseen test example x and output the predicted
value y = f(x) to assess the model accuracy:

Accuracy = Number of correct classifications / Total number of test cases

Example: Image Classification

[Figure: Training — training images and training labels feed image feature extraction, which trains a learned model. Testing — a test image goes through the same image feature extraction and the learned model to produce a prediction.]

Popular Classification Algorithms


There are many classification algorithms, but it is not possible to conclude that one
always performs better than the others. It depends on the application and the nature of
the available data set. Popular classification algorithms include:

• Decision Tree Classifier
• Naive Bayes
• K-Nearest Neighbors
• Random Forest Classifier
• Logistic Regression
• Support Vector Machines
• Linear Discriminant Analysis
• Ridge Regression
• Quadratic Discriminant Analysis
• Extra Trees Classifier
• Light Gradient Boosting Machine

Decision Tree Classifier


A decision tree is a non-parametric supervised learning method, which builds classification or
regression models in the form of a tree structure. The goal is to create a model that predicts
the value of a target variable by learning simple decision rules inferred from the data features.
It utilizes an if-then rule set that is mutually exclusive and exhaustive for classification. The
rules are learned sequentially, one at a time, using the training data. Each time a rule is
learned, the tuples covered by that rule are removed. This process continues on the
training set until a termination condition is met.

Decision Tree Classifier


The basic idea behind any decision tree algorithm is as follows (a minimal sketch follows below):
1. Select the best attribute to split the records, using an attribute selection measure (heuristic).
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Build the tree by repeating this process recursively for each child until one of these
conditions is met:
a. All of the tuples belong to the same attribute value.
b. There are no more remaining attributes.
c. There are no more instances.
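A minimal scikit-learn sketch; the built-in iris data and the hyperparameter values are illustrative assumptions, not part of the original slides:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
dt = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
dt.fit(X, y)            # recursively selects the best split at each node
print(export_text(dt))  # inspect the learned if-then rules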

Random Forest Classifier


• As the name implies, a random forest consists of a large number of individual
decision trees that operate as an ensemble, usually trained with the “bagging”
method. Each individual tree in the random forest gives a class prediction, and the
class with the most votes becomes our model’s prediction.
• Put simply: random forest builds multiple decision trees and merges them
together to get a more accurate and stable prediction.

Random Forest Classifier


Random Forest is one of the most popular and widely used non-parametric machine
learning algorithms for classification problems. It can also be used for regression
problems, but it performs especially well on classification tasks. A sketch of the
procedure follows below.
➢ Step 1: Pick K random records from the dataset of N total records.
➢ Step 2: Build and train a decision tree model on these K records.
➢ Step 3: Choose the number of trees you want in your ensemble and repeat Steps 1 and 2 for each tree.
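A minimal scikit-learn sketch; here n_estimators plays the role of the number of trees in Step 3, bootstrap sampling supplies the K random records per tree, and the iris data is an illustrative assumption:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
rf.fit(X, y)              # each tree is trained on a bootstrap sample
print(rf.predict(X[:3]))  # prediction is the majority vote across the trees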

Random Forest Classifier: Questions


1) Why would we use a random forest instead of a decision tree?
a) For lower training error.
b) To reduce the variance of the model.
c) To better approximate posterior probabilities.
d) For a model that is easier for a human to interpret.
Solution: b) and c).

2) In random forest, you can generate hundreds of trees (say T1, T2, …, Tn) and then aggregate the
results of these trees. Which of the following is true about an individual tree (Tk) in the random forest?
1. Individual tree is built on a subset of the features
2. Individual tree is built on all the features
3. Individual tree is built on a subset of observations
4. Individual tree is built on full set of observations
a) 1 and 3
b) 1 and 4
c) 2 and 3
d) 2 and 4

Solution: a).

Naïve Bayes Classifier

• The naive Bayes classifier is generally a parametric model, which assumes that the
presence of a particular feature in a class is unrelated to the presence of any other
feature.
• For example, a fruit may be considered an apple if it is red, round, and about 3
inches in diameter. Even if these features depend on each other or on the existence
of other features, all of these properties are treated as contributing independently to
the probability that the fruit is an apple, which is why the method is known as ‘naïve’.
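A minimal scikit-learn sketch of a Gaussian naive Bayes classifier; the iris data is an illustrative assumption:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB()
nb.fit(X, y)                    # fits per-class feature distributions
print(nb.predict_proba(X[:2]))  # class probabilities under the independence assumption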

K-Nearest Neighbors Classifier


K-Nearest Neighbors is a non-parametric, lazy learning algorithm that stores all instances
corresponding to the training data points in n-dimensional space. When an unseen sample
is received, it examines the k closest stored instances (the nearest neighbors) and
returns the most common class among them as the prediction; for real-valued data it returns the
mean of the k nearest neighbors.
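A minimal scikit-learn sketch; the iris data and k = 5 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)              # lazy learner: fit simply stores X and y
print(knn.predict(X[:3]))  # majority class among the 5 nearest neighbors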

K-Nearest Neighbors Classifier: Questions


1) The K-NN algorithm does more computation at test time than at training time.

a) TRUE
b) FALSE

Solution: a). The training phase of the algorithm consists only of storing the feature vectors and class
labels of the training samples. In the testing phase, a test point is classified by assigning the label that
is most frequent among the K training samples nearest to that query point, hence the higher
computation at test time.

2) Which of the following is true about the K-NN algorithm?
a) It can be used for classification
b) It can be used for regression
c) It can be used for both classification and regression

Solution: c). K-NN can also be used for regression problems.



Logistic Regression Classifier


• Logistic Regression is a popular parametric statistical learning model used for binary
classification, i.e. for predictions of the type this or that, yes or no, A or B, etc.
Logistic regression can, however, also be used for multiclass classification.
• The logistic regression hypothesis generalizes linear regression by passing the linear
combination of the inputs through the logistic function (sigmoid curve):

σ(z) = 1 / (1 + e^(−z)), so that h(x) = σ(θᵀx) is interpreted as P(y = 1 | x)
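A minimal scikit-learn sketch; the synthetic data and default solver settings are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
lr = LogisticRegression()
lr.fit(X, y)
print(lr.predict_proba(X[:2]))  # sigmoid outputs, read as P(y=0|x) and P(y=1|x)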

Support Vector Machine Classifier


• Support Vector Machine (SVM) is a parametric supervised machine learning algorithm
that is mostly used for classification problems. In the SVM algorithm, we plot each
data item as a point in n-dimensional space (where n is the number of features), with
the value of each feature being the value of a particular coordinate.
• We perform classification by finding the hyperplane that best separates the classes,
which then helps classify new data points.
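A minimal scikit-learn sketch with a linear kernel; the synthetic data is an illustrative assumption:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
svm = SVC(kernel='linear')         # search for the maximum-margin hyperplane
svm.fit(X, y)
print(svm.support_vectors_.shape)  # the points that define the margin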

K-fold Cross-Validation
• To evaluate the performance of a model on a dataset, we need to measure how well the
predictions made by the model match the observed data. K-fold cross-validation is one
of the most commonly used model evaluation methods and is often used for
hyperparameter tuning. The idea is to randomly divide the dataset into k groups, or
“folds”, of roughly equal size.

• Let’s say that we have 100 rows of data. We randomly divide them into ten folds, so
each fold consists of around 10 rows of data. The first fold is used as the validation
set, and the rest form the training set. We train our model on the training set and
calculate the accuracy or loss on the validation fold. We then repeat this process
using a different fold as the validation set each time.
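A minimal scikit-learn sketch of 10-fold cross-validation; the iris data and the decision tree model are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Each fold serves once as the validation set; the other nine train the model
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())  # average accuracy across the 10 folds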

Classification with PyCaret


• PyCaret is an open source, low-code machine learning library in Python that allows you to
go from preparing your data to deploying your model within minutes.
• PyCaret’s Classification Module is a supervised machine learning module which is used
for classifying elements into groups. The goal is to predict the class labels which are
discrete and unordered. Some common use cases include predicting customer default (Yes
or No), and predicting customer churn (customer will leave or stay).

• This module can be used for binary or multiclass problems. It provides several pre-
processing features that prepare the data for modeling through the setup function. It has over
18 ready-to-use algorithms and several plots to analyze the performance of trained models.

Import Python Libraries


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
import pandas as pd
plt.rcParams['figure.figsize'] = (12,8)

Load Dataset
df = pd.read_csv('https://raw.githubusercontent.com/myconcordia/INSE6220/main/seeds.csv')
df.head(10)

To construct the data, seven geometric parameters of wheat kernels were measured:
area A; perimeter P; compactness C = 4πA/P²; length of kernel; width of kernel;
asymmetry coefficient; and length of kernel groove.

Exploratory Data Analysis


corr_data = df.corr()
sns.clustermap(corr_data, annot=True, fmt='.2f')

Exploratory Data Analysis


sns.pairplot(df, hue='class')
plt.show()

Install PyCaret
# install slim version (default)
!pip install pycaret

• PyCaret is an open-source machine learning library in Python that supports you from
data preparation to model deployment. It is easy to use, and you can do almost
every data science project task with just one line of code.

• PyCaret, being a low-code library, makes you more productive. You can spend
less time on coding, and you can do more experiments.

• It is an easy-to-use machine learning library that will help you perform end-to-end
machine learning experiments, whether that’s imputing missing values, encoding
categorical data, feature engineering, hyperparameter tuning, or building
ensemble models.

Setting up the Environment in PyCaret


from pycaret.classification import *
clf = setup(data=df, target='class', train_size=0.7, session_id=123)

• The setup() function initializes the environment in PyCaret and creates the transformation
pipeline to prepare the data for modeling and deployment. setup() must be called before
executing any other function in PyCaret. It takes two mandatory parameters: a pandas
dataframe and the name of the target column. All other parameters are optional and are used to
customize the pre-processing pipeline (we will see them in later tutorials).
• When setup() is executed, PyCaret's inference algorithm automatically infers the data types
of all features based on certain properties. The data types should be inferred correctly, but this is
not always the case. To account for this, PyCaret displays a table containing the features and
their inferred data types after setup() is executed. If all of the data types are identified correctly,
enter can be pressed to continue, or quit can be typed to end the experiment.

• Ensuring that the data types are correct is of fundamental importance in PyCaret as it
automatically performs a few pre-processing tasks which are imperative to any machine learning
experiment. These tasks are performed differently for each data type which means it is very
important for them to be correctly configured.

Comparing all Models


# show the best models and their statistics
best_model = compare_models()

Performance Evaluation Metrics

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where TP, FP, TN and FN denote the number of true positives, false positives, true
negatives and false negatives, respectively.
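A minimal scikit-learn sketch of these metrics; the label vectors here are illustrative:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # illustrative ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # illustrative predictions
print(accuracy_score(y_true, y_pred))   # (TP + TN) / (TP + TN + FP + FN)
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall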

Question
From the ROC curve below, which of the following algorithms would you consider for your
final model on the basis of performance?

A) Random Forest
B) Logistic Regression
C) Both of the above
D) None of these
Solution: A). Random forest has the largest AUC.

Create a Decision Tree Classifier


dt = create_model('dt')

create_model is the most granular function in PyCaret and is often the foundation behind most
PyCaret functionality. As the name suggests, this function trains and evaluates a model using
cross-validation, which can be set with the fold parameter. The output prints a score grid that shows
Accuracy, Recall, Precision, F1, Kappa and MCC by fold.

Tune a Decision Tree Classifier


tuned_dt = tune_model(dt)

The tune_model() function performs a random grid search of hyperparameters over a pre-defined
search space. By default it is set to optimize Accuracy, but this can be changed using the optimize
parameter. The function automatically tunes the hyperparameters of the model and scores it using
stratified cross-validation. An example with a custom optimization metric is sketched below.
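A sketch of changing the optimization metric; the metric name and fold count here are illustrative assumptions:

# optimize the F1 score instead of Accuracy, using 10-fold cross-validation
tuned_dt = tune_model(dt, optimize='F1', fold=10)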

Evaluate a Decision Tree Classifier


evaluate_model(tuned_dt)

The evaluate_model() function displays a user interface for analyzing model performance using various plots.

Create a K-Nearest Neighbors Classifier


knn = create_model('knn')

Tune a K-Nearest Neighbors Classifier


tuned_knn = tune_model(knn, custom_grid={'n_neighbors': np.arange(1, 51)})  # n_neighbors must be >= 1

Evaluate a K-Nearest Neighbors Classifier


evaluate_model(tuned_knn)

Create a Logistic Regression Classifier


lr = create_model('lr')

Tune a Logistic Regression Classifier


tuned_lr = tune_model(lr)

Evaluate a Logistic Regression Classifier


evaluate_model(tuned_lr)

From the plot above, we can see that LG (length of kernel groove) is the most important feature when making predictions.

Classification + Principal Component Analysis


clf_pca = setup(data=df, target='class', train_size=0.7, session_id=123, normalize=True, pca=True, pca_components=3)

best_model_pca = compare_models()

Tune the Best Classifier


# Tune hyperparameters with scikit-learn (default)
tuned_best_model_pca = tune_model(best_model_pca)

Evaluate the Best Classifier


One way to analyze the performance of models is to use the evaluate_model() function
which displays a user interface for all of the available plots for a given model. It internally
uses the plot_model() function.
evaluate_model(tuned_best_model_pca)
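Individual plots can also be generated directly with plot_model(); the plot names shown here are standard PyCaret options, used as an illustrative sketch:

plot_model(tuned_best_model_pca, plot='confusion_matrix')  # confusion matrix on the hold-out set
plot_model(tuned_best_model_pca, plot='auc')               # ROC curves and AUC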

Explainable AI with Shapley Values


Shapley values are a widely used approach from cooperative game theory for fairly
distributing a model's prediction among its input features, and they come with desirable
theoretical properties.

!pip install shap


rf_pca = create_model('rf')
tuned_rf_pca = tune_model(rf_pca)
interpret_model(tuned_rf_pca, plot='summary')

Rather than using a typical feature importance bar chart, we use a density scatter plot of
SHAP values for each feature to identify how much impact each feature has on the model
output for individuals in the validation dataset. Features are sorted by the sum of the
SHAP value magnitudes across all samples.
