W11 INSE6220 - Fall 2023 - Zeng
• Machine Learning
• Classification Algorithms
Machine Learning
• Supervised Learning: Data and corresponding labels are given.
• Unsupervised Learning: Only data is given; no labels are provided.
• Semi-supervised Learning: Some (but not all) labels are present.
• Reinforcement Learning: An agent interacting with the world makes observations, takes actions, and is rewarded or punished; it should learn to choose actions so as to maximize its cumulative reward.
f(image of an apple) = “apple”
f(image of a tomato) = “tomato”
f(image of a cow) = “cow”
y = f(x)
where y is the output (prediction), f is the learned function (model), and x is the input (features).

[Figure: the supervised learning pipeline. Training: training images → image features → learned model, fit using training labels. Testing: test image → image features → learned model → prediction.]
2) In a random forest, you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the results of these trees. Which of the following is true about an individual tree (Tk) in a Random Forest?
1. An individual tree is built on a subset of the features
2. An individual tree is built on all the features
3. An individual tree is built on a subset of observations
4. An individual tree is built on the full set of observations
a) 1 and 3
b) 1 and 4
c) 2 and 3
d) 2 and 4
Solution: a). Each tree is trained on a bootstrap sample of the observations, and only a random subset of the features is considered at each split.
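As an illustrative sketch (scikit-learn, not part of the original slides), the bootstrap and max_features parameters control exactly these two kinds of subsampling:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# bootstrap=True: each tree is fit on a bootstrap sample (subset) of the observations;
# max_features='sqrt': each split considers only a random subset of the features.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features='sqrt', random_state=0)
rf.fit(X, y)
print(rf.score(X, y))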
• The Naive Bayes classifier is generally a parametric model, which assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
• For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as ‘naïve’.
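As a minimal sketch of this idea (scikit-learn with numeric features; the dataset is illustrative, not from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# GaussianNB models each feature independently within each class,
# which is exactly the 'naive' conditional-independence assumption.
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))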
Question
The k-NN algorithm does more computation at test time than at training time.
a) TRUE
b) FALSE
Solution: a). The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the testing phase, a test point is classified by assigning the label that is most frequent among the K training samples nearest to that query point, hence the higher computation at test time.
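A quick sketch (scikit-learn; synthetic data, not from the slides) showing that fitting k-NN is essentially free while prediction does the real work:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10000, 5))
y_train = rng.integers(0, 2, size=10000)

# 'Training' just stores the samples (plus an index structure).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Prediction searches the stored samples for the 5 nearest neighbors of every query point.
X_test = rng.normal(size=(100, 5))
print(knn.predict(X_test)[:10])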
K-fold Cross-Validation
• To evaluate the performance of a model on a dataset, we need to measure how well the predictions made by the model match the observed data. K-fold cross-validation is one of the most commonly used model evaluation methods and is mainly used for hyperparameter tuning. Randomly divide a dataset into k groups, or “folds”, of roughly equal size.
• Let’s say that we have 100 rows of data. We randomly divide them into ten folds, each consisting of around 10 rows of data. The first fold is used as the validation set, and the rest form the training set. We then train our model on the training set and calculate the accuracy or loss on the validation fold. We repeat this process, using a different fold as the validation set each time.
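As a quick illustration (scikit-learn; the model and dataset are illustrative, not from the slides):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: each fold serves exactly once as the validation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean(), scores.std())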
• This module can be used for binary or multiclass classification problems. It provides several pre-processing features that prepare the data for modeling through the setup() function. It has over 18 ready-to-use algorithms and several plots to analyze the performance of trained models.
Load Dataset
import pandas as pd

# Load the wheat seeds dataset directly from GitHub
df = pd.read_csv('https://raw.githubusercontent.com/myconcordia/INSE6220/main/seeds.csv')
df.head(10)
To construct the data, seven geometric parameters of wheat kernels were measured: area A; perimeter P; compactness C = 4πA/P²; length of kernel; width of kernel; asymmetry coefficient; length of kernel groove.
Install PyCaret
# install slim version (default)
!pip install pycaret
• PyCaret is an open-source machine learning library in Python that supports you from data preparation to model deployment. It is easy to use, and you can do almost every data science project task with just one line of code.
• PyCaret, being a low-code library, makes you more productive. You can spend less time on coding and do more experiments.
• It is an easy-to-use machine learning library that helps you perform end-to-end machine learning experiments, whether that’s imputing missing values, encoding categorical data, feature engineering, hyperparameter tuning, or building ensemble models.
• The setup() function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in PyCaret. It takes two mandatory parameters: a pandas dataframe and the name of the target column. All other parameters are optional and are used to customize the pre-processing pipeline (we will see them in later tutorials).
• When setup() is executed, PyCaret's inference algorithm automatically infers the data types of all features based on certain properties. The data types should be inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all of the data types are correctly identified, Enter can be pressed to continue, or quit can be typed to end the experiment.
• Ensuring that the data types are correct is of fundamental importance in PyCaret, as it automatically performs a few pre-processing tasks that are imperative to any machine learning experiment. These tasks are performed differently for each data type, which means it is very important for them to be correctly configured.
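A minimal sketch of the call (the target column name 'Type' is an assumption about the seeds dataset, not confirmed by the slides):

from pycaret.classification import setup

# Initialize the PyCaret environment; 'Type' is assumed to be the label column.
clf = setup(data=df, target='Type', session_id=123)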
Accuracy = (TP + TN)/(TP + TN + FP + FN), Precision = TP/(TP + FP), Recall = TP/(TP + FN), F1 = 2·Precision·Recall/(Precision + Recall),
where TP, FP, TN and FN denote the number of true positives, false positives, true negatives and false negatives, respectively.
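A small sketch (scikit-learn; the labels are illustrative) computing these metrics from a confusion matrix:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# For a binary problem, ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)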
Question
From the ROC curve below, which of the following algorithms would you consider for your final model on the basis of performance?
[Figure: ROC curves for Random Forest and Logistic Regression]
A) Random Forest
B) Logistic Regression
C) Both of the above
D) None of these
Solution: A). Random Forest has the largest AUC (area under the ROC curve).
create_model is the most granular function in PyCaret and is often the foundation behind most of the PyCaret functionality. As the name suggests, this function trains and evaluates a model using cross-validation, whose number of folds can be set with the fold parameter. The output prints a score grid that shows Accuracy, Recall, Precision, F1, Kappa and MCC by fold.
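For example (the 'dt' decision-tree ID exists in PyCaret; the fold count here is illustrative):

from pycaret.classification import create_model

# Train a decision tree with 10-fold cross-validation; the per-fold score grid is printed.
dt = create_model('dt', fold=10)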
The tune_model() function performs a random grid search of hyperparameters over a pre-defined search space and scores the model using stratified cross-validation. By default, it is set to optimize Accuracy, but this can be changed using the optimize parameter.
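For example (continuing the decision-tree model from the sketch above; Accuracy is the default optimize target):

from pycaret.classification import tune_model

# Random grid search over the pre-defined search space, scored with stratified CV.
tuned_dt = tune_model(dt, optimize='Accuracy')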
From the plot above, we can see that LG (length of kernel groove) is the most important feature when making predictions.
# Compare all available models using cross-validation and return the best-performing one
best_model_pca = compare_models()
Rather than using a typical feature importance bar chart, we use a density scatter plot of
SHAP values for each feature to identify how much impact each feature has on the model
output for individuals in the validation dataset. Features are sorted by the sum of the
SHAP value magnitudes across all samples.
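A sketch of how such a plot is typically produced (using the shap library with a tree-based model; the names model and X_val are assumptions, not from the slides):

import shap

# Explain a tree-based model on the validation features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)

# Density scatter of SHAP values per feature, sorted by total impact magnitude.
shap.summary_plot(shap_values, X_val)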