BS Thesis
By
Abdul Raheem
CIIT/SP20-BSM-003/LHR
A thesis Submitted to
COMSATS University Islamabad, Lahore Campus
In partial fulfillment
of the requirement for the degree of
Bachelor of Science in Mathematics
By
Sardar Abdul Wahab CH
CIIT/FA20-BSM-012/LHR
Abdul Raheem
CIIT/SP20-BSM-003/LHR
Department of Mathematics
Faculty of Science
Machine Learning: Modern Techniques
and Mathematical Approach to Neural Networks
This thesis is submitted to the Department of Mathematics in partial fulfillment of the requirements for
the award of the degree of Bachelor of Science in Mathematics
Supervisor
Certificate of Approval
This report titled
Machine Learning: Modern Techniques and Mathematical Approach to Neural Networks
By
Abdul Raheem
CIIT/SP20-BSM-003/LHR
has been approved
for the Degree of Bachelor of Science in Mathematics
at COMSATS University Islamabad, Lahore Campus
External Examiner:
Supervisor:
Head of Department:
Author’s Declaration
We, Sardar Abdul Wahab Chaudary (CIIT/FA20-BSM-012/LHR) and Abdul Raheem
(CIIT/SP20-BSM-003/LHR), hereby declare that we have produced the work presented in this thesis
during the scheduled period of study. We also declare that we have not taken any material from any
source except where due reference is made, and that the amount of plagiarism is within an acceptable
range. If a violation of HEC rules on research has occurred in this thesis, we shall be liable to
punishable action under the plagiarism rules of HEC.
Date:
Sardar Abdul Wahab CH
CIIT/FA20-BSM-012/LHR
Abdul-Raheem
CIIT/SP20-BSM-003/LHR
Certificate
It is certified that Sardar Abdul Wahab Chaudary (CIIT/FA20-BSM-012/LHR) and Abdul Raheem
(CIIT/SP20-BSM-003/LHR) have carried out all the work related to this thesis under my supervision at
the Department of Mathematics, COMSATS University Islamabad, Lahore Campus, and that the work
fulfills the requirements for the award of the BS degree.
Date: Supervisor
Dedication
We dedicate this thesis to our parents and honourable teachers.
Acknowledgements
All praise is due to ALLAH, the Cherisher and Lord of the Worlds,
Most Gracious and Most Merciful.
First and foremost, I want to express my deepest gratitude to ALLAH Almighty (the Most
Beneficent and Most Merciful) for providing me with the strength, knowledge, and opportunity to
undertake and complete this research. Without His countless blessings, this achievement would not
have been possible. May peace and blessings be upon His messenger, Hazrat Muhammad (PBUH),
his family, companions, and all who follow him. My sincere thanks to Hazrat Muhammad (PBUH),
who continues to be a beacon of guidance and knowledge for all humanity. On this journey towards
my degree, I have found in him a teacher, an inspiration, a role model, and a constant source of
support.
Abstract
Machine Learning: Modern Techniques
and Mathematical Approach to Neural Networks
by
Sardar Abdul Wahab CH and Abdul Raheem
Table of Contents
Chapter 01 - Machine Learning: A Gentle Introduction ................................................................. 1
1.1 Machine Learning ................................................................................................................ 1
1.2 Process of Machine Learning............................................................................................... 2
1.3 Applications .......................................................................................................... 4
1.3.1 Predictive Analytics ............................................................................................. 4
1.3.2 Recommendation Systems ................................................................................... 5
1.4 Flow of Machine Learning...................................................................................................5
Chapter 02 - Machine Learning: Types and Applications .............................................................. 8
2.1 Supervised Learning ........................................................................................................... 8
2.1.1 For Regression ................................................................................................... 11
2.1.2 For Classification ............................................................................................... 12
2.2 Unsupervised Machine Learning ...................................................................................... 13
2.2.1 Key Characteristics ............................................................................................ 13
2.2.2 Common Techniques ......................................................................................... 13
2.2.3 Applications ....................................................................................................... 13
2.2.4 K-Means Clustering.............................................................14
2.3 Reinforcement Learning ................................................................................................... 18
2.3.1 Working Model .................................................................................................. 19
Chapter 03 - Supervised Learning: Regression and Classification .............................................. 20
3.1 Overview of Machine Learning ........................................................................................ 20
3.1.1 Types of Supervised Machine Learning .............................................. 21
3.2 Linear Regression .............................................................................................................. 21
3.3 Classification.....................................................................................................................23
3.4 Logistic Regression: Classification Algorithm ................................................................ 24
Chapter 04 - Supervised Learning: SVM and Decision Tree ...................................................... 28
4.1 Support Vector Machine .................................................................................................. 28
4.1.1 Anatomy and Mathematics ............................................................................... 29
4.2 Decision Tree: Anatomy and its Mathematics ................................................................. 33
Chapter 05 - Neural Networks – Perceptron: Concepts and Applications...................................38
5.1 The Perceptron .................................................................................................................. 40
5.1.1 Perceptron Training ...........................................................................................43
5.2 Forward Propagation ........................................................................................................ 45
5.3 Backward Propagation ...................................................................................................... 46
Chapter 01
Machine Learning: A Gentle Introduction
1.1 Machine Learning
In simple terms, machine learning is like teaching a child through examples. Instead of telling the
child all the rules about what makes an animal a cat or a dog, you show them many pictures, and
over time, they start to understand the difference based on the examples they've seen.
Img-no-1.1
Machine learning is used in many everyday applications. For example, when Netflix
recommends movies you might like, that's machine learning at work. The system has learned
your preferences based on the movies you've watched and rated in the past. Similarly, when your
email program filters out spam, it's using machine learning to identify what unwanted emails
look like based on patterns it has learned from millions of examples.
The beauty of machine learning is that it enables computers to handle new situations without
human intervention, making processes more efficient and uncovering insights that might take
humans much longer to realize. It's a tool that, when used wisely, can significantly enhance our
decision-making and automate routine tasks.
Example:
Consider a set of data points. Each data point has some features (input) and an associated
outcome (output). In machine learning, we try to find a function ƒ that maps inputs to outputs as
accurately as possible, based on the data we have. This function ƒ is determined by a set of
parameters (which could be weights in a neural network, coefficients in a linear regression, etc.).
1. Defining a Loss Function: This is a mathematical function that measures the difference
between the actual outcome and the predicted outcome by the model. For example, in regression
tasks, a common loss function is the Mean Squared Error (MSE), which calculates the average of
the squares of the errors between the actual and predicted values.
2. Optimization: This involves finding the set of parameters that minimizes the loss function.
This process usually requires an optimization algorithm like gradient descent. The algorithm
starts with random values for the parameters and iteratively updates them in a direction that
reduces the loss.
3. Iteration: The optimization is usually done iteratively over many cycles (or "epochs"), where
the model learns from a subset of the data (a "batch") at a time, gradually reducing the loss and
improving the model's predictions.
Mathematically, if our dataset consists of pairs (xᵢ, yᵢ), where the xᵢ are the inputs and the yᵢ are the
actual outcomes, and our model makes predictions f(xᵢ), then the goal of machine learning is to
find the parameters θ of f that minimize the total loss over all data points:

$$\min_{\theta} \sum_{i=1}^{n} L\big(f(x_i;\theta),\, y_i\big)$$

Here, the exact form of the loss function L and of the function f depends on the specific type of
machine learning task (e.g., classification, regression). But fundamentally, all machine learning
follows this basic mathematical principle of minimizing some measure of error between the
model's predictions and the actual data.
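To make the three steps above concrete, here is a minimal from-scratch sketch (with assumed toy data) that fits a simple linear model f(x) = w·x + b by gradient descent on the MSE loss:

# A minimal sketch (assumed toy data): MSE loss, gradient descent, and
# iteration over epochs, for a model f(x) = w*x + b
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 6.1, 8.3])   # roughly y = 2x

w, b, lr = 0.0, 0.0, 0.01
for epoch in range(1000):
    pred = w * x + b
    loss = np.mean((pred - y) ** 2)       # 1. MSE loss
    dw = np.mean(2 * (pred - y) * x)      # 2. gradients of the loss
    db = np.mean(2 * (pred - y))
    w, b = w - lr * dw, b - lr * db       # 3. iterative parameter update
print(w, b, loss)  # w approaches ~2 and the loss shrinks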
1.2 Process of Machine Learning
1.2.1 Data Collection:
Data can be collected from various sources such as files, databases, the internet, sensors, or
experiments, and can be in different formats such as text, images, videos, or tables.
Challenges: Ensuring data relevance, dealing with privacy issues, and collecting a large
and diverse dataset.
Best Practices: Collect as much relevant and diversified data as possible while respecting
privacy and ethical guidelines.
Handling Missing Values: Depending on the context, you might fill in missing values
with the mean, median, mode, or even predict them with another machine learning model.
Data Normalization: Transforming data to a common scale without distorting
differences in the ranges of values.
Feature Encoding: Converting categorical data into numerical data so that it can be
processed by the algorithm.
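A minimal preprocessing sketch of these three steps; the column names are hypothetical:

# Imputation, normalization, and one-hot encoding on a tiny assumed frame
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'age': [25, None, 40],
                   'income': [30000, 52000, 61000],
                   'city': ['Lahore', 'Karachi', 'Lahore']})

df['age'] = df['age'].fillna(df['age'].median())                              # handle missing values
df[['age', 'income']] = MinMaxScaler().fit_transform(df[['age', 'income']])  # normalize to [0, 1]
df = pd.get_dummies(df, columns=['city'])                                    # encode categories
print(df)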
Img-no-1.2
1.2.4 Training:
Training involves using a dataset to adjust the parameters of the machine learning model. The data used
for training must be representative of the real-world scenario for the model to learn effectively.
Process: The data is usually split into training and validation sets, where the training set is used to train the
model and the validation set is used to tune the hyperparameters.
Techniques: Batch learning, online learning, or reinforcement learning depending on the problem and data
size.
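A minimal sketch of the split described above, using the Iris dataset as stand-in data:

# Split the data: 80% for training, 20% held out for validation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_val.shape)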
1.2.5 Evaluation:
After training, the model is evaluated using a different set of data called the test set. The evaluation
metrics depend on the type of machine learning task.
1.2.6 Hyperparameter Tuning:
Objective: To find the set of hyperparameters that results in the best performance of the model on
the validation set.
Challenges: Balancing the trade-off between model complexity and overfitting, and dealing with
the computational cost of testing many different hyperparameter combinations.
1.2.7 Deployment:
Considerations: Monitoring the model's performance over time, updating it with new data, and
ensuring it remains relevant and accurate.
Each step in the machine learning process is crucial and requires careful consideration and expertise. By
meticulously following these steps, one can develop models that are not only accurate and efficient but
also robust and scalable.
1.3 Applications
Machine learning applications span various sectors, impacting our daily lives and the way
industries operate. Below are detailed discussions on three specific applications: Predictive Analytics,
Medical Diagnosis, and Recommendation Systems.
1.3.1 Predictive Analytics (e.g., forecasting stock market trends):
Predictive analytics involves using historical data, statistical algorithms, and machine learning techniques
to predict future outcomes. In the context of the stock market, predictive analytics can be used to forecast
market trends, stock prices, and economic shifts based on a plethora of factors, including past market data,
financial news, company performance indicators, and global economic trends.
How it Works: Machine learning models such as time series forecasting, regression analysis, and neural
networks are trained on historical stock market data. They learn patterns and relationships between
various factors influencing the markets.
Applications: Traders and investment firms use predictive models to make informed decisions, hedge
risks, and identify investment opportunities. Algorithmic trading systems use these predictions to execute
trades at optimal times, maximizing profits and minimizing losses.
Challenges: The stock market is influenced by unpredictable factors like political events, natural
disasters, and changes in government policies, making it inherently volatile and difficult to predict with
high accuracy.
Img-no-1.3
1.4 Flow of Machine Learning
The process of machine learning is a structured approach to developing, training, and deploying
algorithms that learn from data. Below, we detail each step in the machine learning process,
highlighting its importance and the common practices involved:
Practices: Data can be collected from various sources, including public datasets, company
databases, online repositories, or through sensors and real-time data feeds.
Challenges: Ensuring data diversity to avoid bias, respecting privacy and ethical standards, and
dealing with large volumes of data.
1.4.3 Cleaning:
Involves removing or correcting inconsistent, incomplete, or erroneous data.
Preprocessing: Includes normalization (scaling all numeric attributes in the dataset to a common scale
without distorting differences in the ranges of values), handling missing values, and encoding categorical
variables into a format that algorithms can understand.
Feature Selection and Engineering: Identifying the most relevant features to the prediction task
and creating new features from the existing ones to improve model performance.
1.4.4 Model Selection:
Considerations: The complexity of the model, the interpretability of the results, and
computational efficiency.
Approaches: It's common to start with simpler models for initial benchmarks and gradually move
towards more complex models as needed.
1.4.5 Training:
In this step, the chosen machine learning model is applied to the prepared dataset. The model learns to
make predictions or decisions based on the data. This involves adjusting the model's parameters so that it
can accurately map the input data to the correct output.
Process: The data is usually split into a training set and a test set. The training set is used to train
the model, while the test set is used to evaluate its performance.
Techniques: Depending on the algorithm, different techniques like gradient descent might be
used to optimize the model's parameters.
1.4.6 Evaluation:
After training, the model's performance is assessed using the test set. This step is crucial to determine how
well the model has learned from the training data and how well it generalizes to new, unseen data.
Metrics: Vary depending on the type of machine learning task (accuracy, precision, recall for
classification tasks; mean squared error for regression tasks).
Validation: Involves using various techniques like cross-validation to ensure that the model
performs well on different subsets of the data.
1.4.7 Hyperparameter Tuning:
Techniques: Grid search, random search, and Bayesian optimization are common methods for
exploring different hyperparameter combinations to find the most effective ones. A small example
follows this list.
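As an illustration of these techniques, the sketch below runs a small grid search with 5-fold cross-validation; the model and parameter grid are assumptions chosen for brevity:

# Exhaustive search over a tiny hyperparameter grid, scored by cross-validation
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))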
Deployment: Involves integrating the model into an existing production environment where it
can provide ongoing predictions or analyses.
Monitoring: Continuous monitoring is essential to ensure the model remains effective over time,
adjusting and retraining as necessary to account for new data or changing conditions.
Each of these steps is critical to the success of a machine learning project. They ensure that the final
model is not only accurate and efficient but also robust and capable of adapting to new data and evolving
requirements.
Chapter 02
Machine Learning: Types and Applications
2.1 Supervised Learning
2.1.1 Process:
In supervised learning, the algorithm makes predictions or decisions based on input data.
After each prediction, it receives feedback on its accuracy: correct or incorrect. This feedback
helps the algorithm to adjust and improve its future predictions during the training process. The
main goal is to map input data to the correct output labels, minimizing the errors, and improving
the model's accuracy.
Example: Email Spam Classification
Data Collection: Gather a dataset consisting of emails, each labeled as 'Spam' or 'Not Spam'.
Features: Extract features from each email, which could include the frequency of certain words
or phrases, the sender's details, the time the email was sent, etc.
Training Data: The training data might look something like this:
Prediction:
Once trained, the model can take a new email without a label and predict whether it is Spam
or Not Spam based on the learned associations.
In this supervised learning scenario, the algorithm learns from the training data, which acts as a
teacher providing answers (labels) for the emails (input data). Over time, by minimizing the
difference between its predictions and the actual labels, the algorithm can learn to accurately
classify new, unseen emails.
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample data
emails = ["Win a free phone now",
          "Meeting at 10 am tomorrow",
          "Congratulations, you've won!",
          "Could we discuss the report?"]
labels = ['Spam', 'Not Spam', 'Spam', 'Not Spam']

# Convert the email text into word-count feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train a Naive Bayes classifier on the labeled examples
# (with such a tiny sample we train on all of it, for illustration only)
model = MultinomialNB()
model.fit(X, labels)

# Predict the label of a new, unseen email
print(model.predict(vectorizer.transform(["You have won a free prize"])))
In machine learning, various metrics are used to evaluate the performance of models. These
metrics help in understanding how well a model is performing and are crucial for comparing
different models. The choice of metric depends on the type of machine learning task (e.g.,
regression, classification, clustering). Here are some common metrics used in machine learning:
Img-no-2.2 Regression Analysis
Mean Absolute Error (MAE): This is the average of the absolute differences between the
predicted and actual values.
Mean Squared Error (MSE): This is the average of the squared differences between the
predicted and actual values. It penalizes larger errors more than MAE.
Root Mean Squared Error (RMSE): This is the square root of the MSE. It is more sensitive to
outliers than MAE and is in the same units as the target variable.
R-squared (R²): This is the coefficient of determination, which measures the proportion of the
variance in the dependent variable that is predictable from the independent variable(s). It
provides a measure of how well observed outcomes are replicated by the model.
Adjusted R-squared: This adjusts the R² for the number of predictors in the model. It's used in
multiple regression and provides a better measure of goodness of fit as it adjusts for the number
of terms in the model.
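These regression metrics are straightforward to compute; below is a minimal sketch on assumed toy values using scikit-learn:

# Compute MSE, RMSE, and R^2 for assumed predictions
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.3, 7.0, 9.4])

mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse), r2_score(y_true, y_pred))  # MSE, RMSE, R^2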
For classification tasks, common metrics include:
Precision (Positive Predictive Value): This is the ratio of true positives to the sum of true and
false positives. It indicates the quality of the positive class predictions.
Recall (Sensitivity or True Positive Rate): This is the ratio of true positives to the sum of true
positives and false negatives. It indicates how well the model is identifying the positive class.
F1 Score: This is the harmonic mean of precision and recall and provides a balance between the
two. It is useful when you need to balance precision and recall.
Area Under the Receiver Operating Characteristic Curve (AUC-ROC): This metric is used
to evaluate the performance of a binary classification system and is a plot of the true positive rate
against the false positive rate at various threshold settings.
Log Loss (Logarithmic Loss): This measures the performance of a classification model where
the prediction input is a probability value between 0 and 1. It penalizes false classifications.
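The classification metrics above can likewise be computed with scikit-learn; the labels and probabilities below are assumed for illustration:

# Precision, recall, F1, AUC-ROC, and log loss on assumed labels
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, log_loss

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]               # hard class predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]   # predicted probabilities for class 1

print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob), log_loss(y_true, y_prob))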
For clustering tasks, common metrics include:
Silhouette Score: This measures how similar a point is to its own cluster compared with other
clusters; higher scores indicate better-defined clusters.
Davies-Bouldin Index: The lower the score, the better the separation between the clusters.
Calinski-Harabasz Index: Also known as the Variance Ratio Criterion, a higher score indicates
better defined clusters.
Each of these metrics provides different insights into the performance of a machine learning
model, and the best metric to use will depend on the specific requirements of your project and the
nature of your data.
2.2 Unsupervised Machine Learning
Unsupervised machine learning is a type of machine learning where models are trained using
data that has not been labeled, categorized, or classified. In other words, the learning algorithm is
given data without explicit instructions on what to do with it. Instead, the algorithm must find
patterns and relationships within the data on its own. Unsupervised learning is primarily used to
discover underlying patterns, groupings, or structures in data.
Here are some key points and techniques associated with unsupervised machine learning:
No Labels: The training data is not labeled, meaning the outcome or category of the data is not
provided. The algorithm tries to learn the patterns without any reference to known or labeled
outcomes.
Pattern Discovery: The main goal is often to identify patterns or inherent structures within the
data.
Self-Organization: The model organizes or describes the data using a set of rules or features
discovered through the learning process.
Dimensionality Reduction: This technique is used to reduce the number of variables under
consideration and is often used for data visualization. Principal Component Analysis (PCA) and
t-Distributed Stochastic Neighbor Embedding (t-SNE) are popular dimensionality reduction
techniques.
Association Rule Learning: This method is used to discover interesting relationships between
variables in large databases. A common example is market basket analysis, where the goal is to
find sets of products that frequently co-occur in transactions.
Anomaly Detection: This involves identifying rare items, events, or observations which raise
suspicions by differing significantly from the majority of the data. It is widely used in fraud
detection, network security, and fault detection.
Neural Networks and Deep Learning: Certain neural network architectures, like autoencoders,
can be used for unsupervised learning tasks such as feature learning and representation learning.
2.2.3 Applications:
Unsupervised learning is used in various domains, including:
o Anomaly detection: Detecting fraudulent transactions or defective items in
manufacturing.
o Image and speech recognition: Learning features or patterns without labeled data.
o Natural language processing: Topic modeling and sentiment analysis.
o Recommendation systems: Recommending products or content based on user
behavior patterns.
Unsupervised machine learning is valuable for exploring data when you are not sure what to look
for, or when you want to reduce the complexity of data for further analysis. While unsupervised
learning can reveal hidden patterns and structures, the interpretations of these findings are
typically more subjective and require more domain expertise compared to supervised learning
outcomes.
2.2.4 K-Means Clustering
K-Means is an unsupervised algorithm that partitions a dataset into k clusters. It proceeds as follows:
Initialization: The algorithm starts by initializing k centroids. These can be chosen randomly
or by other methods. Each centroid represents the initial center of a cluster.
Assignment step: Each data point is assigned to the closest centroid, and thus partitions the data
into clusters based on the current centroids. The "closeness" is typically measured using
Euclidean distance, though other distance measures can be used.
Update step: Once all points have been assigned to clusters, the positions of the centroids are
recalculated as the mean of all points in the cluster.
Repeat: The assignment and update steps are repeated until the centroids no longer change
significantly, meaning the algorithm has converged. This usually means that the within-cluster
sum of squares (WCSS) cannot be reduced any further.
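The four steps above can be written down directly; a minimal from-scratch sketch on assumed synthetic 2-D data:

# From-scratch K-Means on two well-separated synthetic blobs
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]   # 1. initialization

for _ in range(10):                                   # 4. repeat until stable
    dists = np.linalg.norm(X[:, None] - centroids, axis=2)
    labels = dists.argmin(axis=1)                     # 2. assignment step
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # 3. update step

print(centroids)  # close to the true cluster centres (0, 0) and (4, 4)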
The Iris dataset is a collection of data from three different types of iris flowers: Setosa,
Versicolour, and Virginica. Each sample has four features: the lengths and the widths of the
sepals and petals. This dataset is often used for testing out machine learning algorithms because it
is small and has well-defined clusters.
In our example, we apply K-Means clustering to the Iris dataset with the number of clusters set to
three, corresponding to the three species of Iris flowers.
Data Preparation: Although K-Means is not highly sensitive to feature scaling, the Iris dataset
features are all of similar scales, so we didn't perform scaling. In different scenarios, feature
scaling might be necessary.
Model Training: We use `scikit-learn`'s KMeans class to fit the model. By default, `scikit-learn`
uses a refined starting condition (the K-Means++ algorithm) to choose initial centroids, which
helps in achieving better clustering.
Cluster Assignment: Once the algorithm has been run and the centroids have been determined,
each data point in the dataset is assigned to its nearest centroid, resulting in a partitioning of the
data into three clusters.
Evaluation: While we don't have an explicit accuracy metric like in supervised learning, we can
use intrinsic metrics like silhouette scores to judge the clustering quality. However, in this
educational example, we skipped this part. But in a real-world scenario, evaluating the clustering
result is crucial.
Visualization: Finally, we visualize the clusters and their centroids. This isn't just for aesthetics;
visual inspection can provide immediate insights into the effectiveness of the clustering, though
it's more feasible with lower-dimensional data.
Through this process, we aim to see whether the K-Means algorithm can effectively group the
iris flowers into clusters that correspond to their actual species. This is a simplified example of
how unsupervised learning can reveal hidden patterns in the data without needing any labels.
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the Iris data and fit K-Means with three clusters (one per species)
iris = load_iris()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(iris.data)
# Plot the first two features coloured by cluster; mark centroids with red crosses
plt.scatter(iris.data[:, 0], iris.data[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x')
plt.show()
2.3 Reinforcement Learning (RL)
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make
decisions by taking actions in an environment to achieve some goal. The learner or agent is not
told which actions to take, but instead must discover which actions yield the most reward by
trying them out. This method contrasts with supervised learning, where the agent is given correct
actions to take in various situations.
Key components of RL include:
Agent: The learner and decision maker that takes actions.
Environment: Everything the agent interacts with, and which provides the agent with states and
rewards.
Reward: A feedback from the environment to assess the last action taken by the agent. It can be
positive (reinforcing the action) or negative (discouraging the action).
Policy: A strategy used by the agent, mapping states of the environment to actions to be taken
when in those states.
Value Function: It predicts the expected return (sum of rewards) for an agent starting at a given
state and acting according to a particular policy.
Q-value or Action-Value Function: It predicts the expected return of taking a given action in a
given state and following a specified policy thereafter.
Model-Free RL: Here, the agent learns to make decisions based solely on the observed rewards
and states, without constructing a model of the environment. This approach includes methods
like Q-learning and SARSA.
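As a concrete illustration of model-free learning, here is a minimal Q-learning sketch on an assumed toy environment, a five-state corridor with a reward only at the right end:

# Tabular Q-learning on a toy 1-D corridor: states 0..4, actions left/right
import numpy as np

n_states, actions = 5, [-1, +1]
Q = np.zeros((n_states, len(actions)))
alpha, gamma, eps = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Move Q(s, a) toward reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))  # learned policy for states 0-3: action 1 (move right)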
Applications of Reinforcement Learning:
Reinforcement learning has been successfully applied in various domains, such as:
- Gaming: From board games like Chess and Go to video games, RL agents have been
developed that can learn to play and win at complex games.
- Robotics: RL can be used for teaching robots to perform tasks through trial and error.
- Autonomous Vehicles: Reinforcement learning helps in developing policies for self-driving
cars.
- Personalized Recommendations: RL can be used to personalize content recommendations in
apps and websites based on user interactions.
- Finance: For portfolio management and algorithmic trading by learning optimal trading
strategies.
Reinforcement learning presents a different paradigm of learning than other machine learning
methods, emphasizing learning by interaction and adapting to changing conditions to achieve the
best possible results.
Chapter 03
Supervised Learning: Regression and Classification
3.1 Overview of Machine Learning
Key Components:
3.1.1 Labeled Training Data:
- Supervised learning relies on datasets where each example is paired with a known outcome. This labeled
data serves as the foundation for training the model.
3.1.2 Regression:
- Regression involves predicting continuous values. Applications range from predicting stock prices and
housing values to estimating sales figures based on input features.
3.2 Linear Regression
Linear regression is a fundamental statistical method and a popular machine learning algorithm used
for modeling the relationship between a dependent variable and one or more independent variables.
The goal of linear regression is to find the best-fitting linear relationship that minimizes the
difference between the observed and predicted values of the dependent variable.
In the context of a simple linear regression, which involves one independent variable, the
relationship between the dependent variable 𝑦 and the independent variable 𝗑 is represented by the
equation of a straight line:
$$y = b_0 + b_1 x + c$$
Where:
- 𝑦 is the dependent variable (the variable we want to predict).
- 𝗑 is the independent variable (the variable used to make predictions).
- 𝑏0 is the y-intercept, the point where the line crosses the y-axis when 𝗑 = 0
- 𝑏1 is the slope of the line, representing the change in 𝑦 for a unit change in 𝑥.
- c is the error term, representing the difference between the observed and predicted values.
The objective of linear regression is to estimate the values of b₀ and b₁ that minimize the sum of
squared differences between the observed values yᵢ and the predicted values ŷᵢ. This is typically done
using the method of least squares.
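For simple linear regression, the least-squares estimates have a closed form: b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄. A minimal sketch with assumed data:

# Closed-form least-squares estimates for simple linear regression
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # close to 0 and 2 for this nearly linear data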
In the case of multiple linear regression, where there are multiple independent variables, the equation
is extended to:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + c$$

Here, b₀ is still the y-intercept, and b₁, b₂, …, bₙ are the slopes associated with each independent
variable x₁, x₂, …, xₙ.
Linear regression is widely used for predicting numeric outcomes and understanding the relationship
between variables. It is important to note that linear regression makes certain assumptions about the
data, such as linearity, independence of errors, homoscedasticity (constant variance of errors), and
normality of errors. When these assumptions are met, linear regression can provide interpretable and
valuable insights into the data.
Code Example
Medical Cost Personal Datasets – Insurance Forecast by using Linear Regression¹
¹ https://www.kaggle.com/datasets/mirichoi0218/insurance
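A minimal sketch of such a pipeline, assuming the Kaggle CSV is saved as insurance.csv with columns age, sex, bmi, children, smoker, region, and charges:

# Linear regression on the insurance data (file name and columns assumed as above)
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('insurance.csv')

# One-hot encode the categorical features so the model can use them
X = pd.get_dummies(df.drop(columns='charges'), drop_first=True)
y = df['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))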
Model Interpretation
We calculated the Mean Squared Error (MSE) and R-squared (R²) values for the insurance
dataset. Let's interpret these results:
R-squared (R²):
- The obtained R² value (≈ 0.783) is between 0 and 1.
- R² measures the proportion of the variance in the dependent variable (insurance charges) that is
explained by the independent variables in the model.
- An R² value of 0.78 indicates that approximately 78% of the variability in insurance charges can
be explained by the features included in the model.
- Higher R² values suggest that the model provides a good fit to the data, explaining a larger
proportion of the variability.
Interpretation:
- The relatively high R² value suggests that the linear regression model explains a substantial
portion of the variance in insurance charges based on the provided features.
- The MSE, while not easily interpretable in absolute terms, complements the R² value by indicating
the average squared difference between predicted and actual values. It is worth checking whether the
magnitude of the MSE is reasonable given the scale of the insurance charges.
3.3 Classification
Supervised learning classification is a type of machine learning algorithm used to categorize input
data into distinct classes or categories. The goal is to learn a mapping from input variables to output labels
based on a set of labeled training data.
Here's how it generally works:
Input Data: You start with a dataset consisting of input features and corresponding labels. Each data point
has a set of features (or attributes) and is assigned a class label.
Training Phase: In this phase, the algorithm learns from the labeled data. It tries to find patterns or
relationships between the input features and the labels. Various classification algorithms use different
techniques to accomplish this, such as decision trees, support vector machines, k-nearest neighbors, logistic
regression, or neural networks.
Model Creation: After training, the algorithm creates a model that represents these patterns. This model is
essentially a mathematical function that takes input features and predicts the corresponding class label.
Evaluation: Once the model is created, it needs to be evaluated to assess its performance. This is typically
done using a separate dataset called the validation set or test set. The model's predictions are compared with
the actual labels in the test set to measure its accuracy, precision, recall, F1 score, or other evaluation metrics.
Prediction: After evaluation, the trained model can be used to make predictions on new, unseen data. The
model takes the input features of a new data point and predicts its class label based on the patterns learned
during the training phase.
Iterative Improvement: Depending on the evaluation results, the model may undergo further refinement or
optimization. This could involve tuning hyperparameters, selecting different algorithms, or gathering more
labeled data to improve performance.
Classification is widely used in various applications such as spam detection, sentiment analysis, medical
diagnosis, image recognition, and more. It's an essential tool in the field of machine learning and data mining
for automating decision-making processes based on historical data.
Representation of Data:
- Let's denote the input features by X = {x₁, x₂, x₃, …, xₙ}, where each xᵢ represents a feature of the
input data.
- The corresponding labels are denoted by y ∈ {0, 1}, where 0 and 1 represent the two classes.
Model Representation:
- In binary classification, the model learns a decision boundary that separates the data points belonging to
different classes.
- One common way to represent this decision boundary is using a hypothesis function denoted h_θ(x),
where θ represents the parameters of the model.
- For logistic regression, a popular algorithm for binary classification, the hypothesis function is
represented as:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
- Here 𝑒 is the base of the natural logarithm and 𝑥 represents the input features.
- During the training phase, the algorithm aims to learn the optimal parameters 𝜃 that best fit the training
data.
- This is typically done by minimizing a cost function that measures the error between the predicted outputs
and the actual labels.
- For logistic regression, the cost function J(θ) is often defined as the logistic loss function or cross-
entropy loss:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log h_\theta\!\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\!\big(1 - h_\theta(x^{(i)})\big) \right]$$
- Here m is the number of training examples, and x⁽ⁱ⁾ and y⁽ⁱ⁾ are the features and label of the i-th
training example, respectively.
Optimization:
- The goal is to find the parameters 𝜃 that minimize the cost function 𝐽(𝜃).
- This is typically done using optimization algorithms like gradient descent or more advanced variants such
as stochastic gradient descent or mini-batch gradient descent.
Prediction:
- Once the model is trained and the optimal parameters are obtained, it can be used to make predictions on
new, unseen data.
- Given an input x, the model predicts the class label by evaluating the hypothesis function h_θ(x)
and typically thresholding the output probability at 0.5.
Binary classification is fundamental in various applications such as spam detection, fraud detection, medical
diagnosis, and more, where the task involves making a decision between two distinct classes.
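Tying these pieces together, a minimal from-scratch sketch (assumed toy data) of the sigmoid hypothesis, the cross-entropy cost minimized by gradient descent, and 0.5-thresholded predictions:

# Logistic regression from scratch on a tiny 1-D dataset
import numpy as np

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend 1s so theta[0] is the intercept

theta = np.zeros(2)
for _ in range(5000):
    h = 1.0 / (1.0 + np.exp(-Xb @ theta))   # hypothesis h_theta(x)
    grad = Xb.T @ (h - y) / len(y)          # gradient of the cross-entropy cost
    theta -= 0.1 * grad

print((h >= 0.5).astype(int))  # thresholded predictions match the labels after training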
Logistic Regression Model Trained on a Dataset:
Student Mental Health – a statistical research on the effects of mental health on students' CGPA²
² https://www.kaggle.com/datasets/shariful07/student-mental-health
Evaluation Metrics
Confusion Matrix
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('./Student_Mental_health.csv')
df.rename(columns={'What is your CGPA?': 'CGPA'}, inplace=True)
df.head()

# Fill missing ages with the median age
df['Age'].fillna(df['Age'].median(), inplace=True)

# CGPA is recorded as a range such as "3.00 - 3.49"; keep the lower bound
df['CGPA'] = df['CGPA'].apply(lambda x: float(x.split(' - ')[0]) if '-' in x else float(x))
df.isnull().sum()

# The feature/target columns chosen here are assumptions for illustration
features = ['Age', 'CGPA']
target = 'Do you have Depression?'
X = df[features]
y = df[target]

# Replace any remaining missing feature values with the column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
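Continuing the sketch, the model can then be trained and its confusion matrix computed; the Yes/No label encoding below is an assumption about the dataset:

# Train a logistic regression and report the confusion matrix on held-out data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

y_binary = (y == 'Yes').astype(int)   # assumes the target column holds 'Yes'/'No'
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y_binary,
                                                    test_size=0.2, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))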
Chapter 04
Supervised Learning: SVM and Decision Tree
4.1 Support Vector Machine
A Support Vector Machine (SVM) is a supervised learning algorithm that finds the hyperplane
which separates the classes with the maximum margin.
4.1.1 Concepts:
Support Vectors:
Support vectors are the data points that are closest to the hyperplane. These points are
critical in defining the position and orientation of the hyperplane.
Only these points are used to determine the hyperplane, making the algorithm
computationally efficient.
Kernel Trick:
When data is not linearly separable in its original feature space, SVM uses kernel
functions to transform the data into a higher-dimensional space where a linear
separation is possible.
Common kernel functions include linear, polynomial, radial basis function (RBF),
and sigmoid.
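A minimal sketch of the kernel trick in action: data arranged in concentric circles is not linearly separable, but an RBF-kernel SVM handles it; the dataset here is synthetic.

# Concentric circles: linear kernel struggles, RBF kernel separates cleanly
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)
print(SVC(kernel='linear').fit(X, y).score(X, y))  # close to chance
print(SVC(kernel='rbf').fit(X, y).score(X, y))     # near-perfect fit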
For data that is not perfectly separable, SVM introduces slack variables γᵢ to
allow some misclassifications. The optimization problem becomes:

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\gamma_i$$

In the regression setting (support vector regression), slack variables γᵢ and γᵢ* are used on
either side of an ε-insensitive tube:

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left(\gamma_i + \gamma_i^*\right)$$

Subject to:

$$y_i - (w \cdot x_i + b) \le \varepsilon + \gamma_i, \qquad (w \cdot x_i + b) - y_i \le \varepsilon + \gamma_i^*, \qquad \gamma_i,\, \gamma_i^* \ge 0$$
Advantages of SVM:
o Effective in High-Dimensional Spaces: SVM is particularly effective when the
number of dimensions exceeds the number of samples.
o Memory Efficiency: Only the support vectors are used to define the hyperplane.
o Versatility: Different kernel functions can be specified for the decision function. It is
possible to use custom kernels.
Disadvantages of SVM:
Computationally Intensive: Training can be time-consuming, especially for
large datasets.
Choice of Kernel: The performance of SVM depends significantly on the
choice of the kernel and the kernel parameters.
Hard to Interpret: SVM models are often seen as black boxes since the
decision boundary in a high-dimensional space is hard to interpret.
Applications of SVM:
Text and Hypertext Categorization: SVMs are used for classifying texts into different
categories.
Image Classification: SVMs can classify images into different categories based on the
features extracted.
Bioinformatics: Used for protein classification, cancer classification based on gene
expression data, etc.
Handwriting Recognition: SVMs can be used to recognize handwritten characters.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Load the red wine quality dataset from the UCI repository
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine_data = pd.read_csv(url, delimiter=';')

# Binary target (illustrative threshold): "good" wine if quality >= 7
X = StandardScaler().fit_transform(wine_data.drop(columns='quality'))
y = (wine_data['quality'] >= 7).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = SVC(kernel='rbf').fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
print(classification_report(y_test, clf.predict(X_test)))
4.2 Decision Tree: Anatomy and its Mathematics
A Decision Tree is a flowchart-like structure where an internal node represents a feature (or
attribute), a branch represents a decision rule, and each leaf node represents an outcome (or class
label). The paths from the root to the leaf represent classification rules.
The basic idea of building a decision tree is to divide the dataset into subsets that contain instances
with similar values (homogeneous). This is done recursively, starting from the root node and moving
down the tree.
Steps to Build a Decision Tree
Select the Best Feature to Split: The feature that best separates the data into homogeneous
subsets is chosen. This is based on criteria like Information Gain, Gini Index, or others.
Split the Data: The dataset is split into subsets based on the selected feature.
Create Sub-nodes: A decision node is created for each split, and the process is repeated
recursively for each subset.
Stopping Criteria: The recursion is stopped when a stopping criterion is met, such as a
maximum tree depth, minimum number of samples per node, or if all instances in a node
belong to the same class.
1. Entropy:
Entropy measures the impurity of a set S. For a binary classification problem it is defined as:

$$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$

where p₊ is the proportion of positive examples in the set S, and p₋ is the proportion of negative
examples.
Information Gain is used to select the feature that best splits the data. It is defined as the
difference in entropy before and after a split:
$$IG(S, A) = H(S) - \sum_{v \,\in\, \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
Where 𝐼𝐺(𝑆, 𝐴) is the information gain from splitting the set 𝑆 on feature 𝐴, and 𝑆𝑣 are subsets of 𝑆
for each value 𝑣 𝑜𝑓 𝐴.
2. Gini Index:
The Gini Index measures the impurity of a dataset. It is defined as:
$$Gini(S) = 1 - \sum_{i=1}^{C} p_i^2$$
where pᵢ is the probability of an element being classified into class i, and C is the number of classes.
The feature with the smallest Gini Index is chosen for the split.
3. Splitting Criteria:
For each feature, the dataset is split and the resulting subsets' impurity is calculated. The
feature that results in the lowest impurity (or highest information gain) is chosen.
4. Recursive Partitioning:
The process is repeated for each subset recursively until a stopping criterion is met (e.g.,
maximum depth, minimum samples per node, etc.).
Step-by-Step Process
Consider the classic weather ("play tennis") dataset of 14 examples (9 positive, 5 negative), where
the Outlook feature takes the values Sunny, Overcast, and Rain.
1. Compute the entropy of the full set:

$$H(S) = -\left(\frac{9}{14}\log_2\frac{9}{14} + \frac{5}{14}\log_2\frac{5}{14}\right) \approx 0.940$$

2. Compute the entropy of each Outlook subset and the resulting information gain:

$$H(\mathrm{Sunny}) = -\left(\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right) \approx 0.971$$

$$H(\mathrm{Overcast}) = -\left(\frac{4}{4}\log_2\frac{4}{4}\right) = 0$$

$$H(\mathrm{Rain}) = -\left(\frac{3}{5}\log_2\frac{3}{5} + \frac{2}{5}\log_2\frac{2}{5}\right) \approx 0.971$$

$$IG(S, \mathrm{Outlook}) = 0.940 - \left(\frac{5}{14}\cdot 0.971 + \frac{4}{14}\cdot 0 + \frac{5}{14}\cdot 0.971\right) \approx 0.247$$
Temperature, Humidity, Windy: Similar calculations are performed for other features to determine
their information gain.
3. Choose the Best Feature to Split: Suppose Outlook has the highest information gain.
4. Split the Dataset Based on Outlook:
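The entropy and information-gain arithmetic above can be verified in a few lines; the label counts are taken from the worked example:

# Verify H(S) ~ 0.940 and IG(S, Outlook) ~ 0.247
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

S = ['yes'] * 9 + ['no'] * 5
sunny, overcast, rain = ['yes'] * 2 + ['no'] * 3, ['yes'] * 4, ['yes'] * 3 + ['no'] * 2

ig = entropy(S) - (5/14 * entropy(sunny) + 4/14 * entropy(overcast) + 5/14 * entropy(rain))
print(round(entropy(S), 3), round(ig, 3))  # 0.94 and 0.247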
Disadvantages
Overfitting: Decision trees can easily overfit the training data, especially if they are not
pruned.
Instability: Small changes in the data can result in significantly different trees.
Bias: They can be biased towards features with more levels (i.e., those that have more
possible values).
Pruning:
Pruning is a technique used to reduce the size of the tree and prevent overfitting. It involves
removing branches that have little importance.
There are two types of pruning:
Pre-pruning (Early Stopping): The tree is stopped from growing beyond a certain depth, or
splits are stopped if the number of samples in a node is less than a specified number.
Post-pruning: After the tree is fully grown, branches are removed if they do not provide
significant information gain.
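A minimal sketch of pre-pruning with scikit-learn, capping tree depth and the minimum samples per split; the Iris dataset and the chosen limits are illustrative:

# Compare an unpruned tree with a pre-pruned one on held-out data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=42)

unpruned = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42).fit(X_train, y_train)

print(unpruned.score(X_test, y_test), pruned.score(X_test, y_test))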
Conclusion
Decision Trees are a powerful and intuitive method for both classification and regression
tasks. Their ease of interpretation and ability to handle various types of data make them a popular
choice. However, they require careful tuning and pruning to avoid overfitting and ensure that the
model generalizes well to new data.
Part 2
Deep Learning
Chapter 05
Neural Networks – Perceptron: Concepts and Applications
Deep Learning is about neural networks: their structures and designs, and how those networks are
trained. Even the most complex neural network is built from vectors and matrices; it uses the
concept of a cost function and algorithms like gradient descent to reduce that cost, and then
propagates the cost back to all constituents of the network proportionally via a method called back-
propagation.
Have you ever held an integrated circuit or chip in hand, or seen one? It looks overwhelmingly complex,
but its basis is the humble transistor and Boolean logic. To understand something complex, we need to
understand its simpler constituents.
The earliest neural network, Rosenblatt's Perceptron, was the first to introduce the concept of
using vectors and the properties of the dot product to split the space of input feature vectors with a
hyperplane.
Background:
As students of mathematics, we are all familiar with the concept of vectors and their applications.
Matrices mean many things to many areas of mathematics, but in the case of neural networks they
are simply a way to represent vectors (and tensors). A vector is essentially a one-dimensional
matrix, and a matrix is defined to be a rectangular array of numbers. An example is a Euclidean
vector in three-dimensional Euclidean space (R³), with some magnitude and direction (measured
from the origin (0, 0, 0) in this case).
Multi-dimensional matrices can be thought of as one-dimensional vectors stacked on top of each
other. The Neural Network Weights or weight vectors are stacked together as matrices. This intuition
is especially helpful when we use dot products on neural network weight matrices.
Dot Product and Splitting the hyper-plane - the crux of the Perceptron
Let's take a simple example. Assume that a vector like [x, y] is some feature of a leaf, i.e. a
feature vector of a leaf. Healthy leaves will have some values of this feature vector. Unhealthy leaves
will have some other value ranges of this feature vector. Let's collect features for 10 leaves and
assume that 2 are unhealthy ones. Imagine we have a problem classifying if a leaf is healthy or not
based on the features of the leaf. For each leaf, we have some feature vector set. Now, if we have a
weight vector, whose dot product with the feature vector of the set of input vectors of a certain class
(say leaf is healthy) is positive, and with the other set is negative, then that weight vector splits the
feature vector hyper-plane into two areas.
In the above diagram, the weight vector 𝑤 (shown as a dotted line) splits the 2D feature space of
positive and negative samples.
For any new leaf, if we extract the same features into a feature vector, we can take its dot product
with the trained weight vector and find out whether it falls in the healthy or diseased class.
Img-no-5.1 leaves synthetic data using numpy
Note that here we have hand-coded the weight matrix. We will see how this code can be adapted to
learn the weight matrix
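A minimal sketch of that idea, with synthetic leaf data and a hand-coded (not learned) weight vector:

# Synthetic leaves: healthy near (2, 2), unhealthy near (-2, -2)
import numpy as np

rng = np.random.default_rng(0)
healthy = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(8, 2))      # 8 healthy leaves
unhealthy = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(2, 2))  # 2 unhealthy leaves

w = np.array([1.0, 1.0])  # hand-coded weight vector splitting the feature plane

# Dot product > 0 falls on the healthy side of the hyperplane, otherwise unhealthy
for leaf in np.vstack([healthy, unhealthy]):
    print(leaf, "healthy" if np.dot(w, leaf) > 0 else "unhealthy")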
Img-no-5.3 perceptron anatomy
Inputs are x₁ to xₙ. Weights are learned values w₁ to wₙ. There is also a bias b, which in the figure
above is θ. The bias can be modelled as a weight w₀ connected to a dummy input x₀ set to 1.
If we ignore the bias term, the output y can be written as the sum of all inputs times their weights,
thresholded by zero: the Perceptron will fire if the sum of its inputs is greater than zero;
otherwise, it will not.
The Activation Function of the Neural Network
The big blue circle is the primitive brain of the primitive neural network: the perceptron
brain. This is what is called an Activation Function in neural networks. In the Perceptron, the
activation function is a simple step function; the output is non-continuous (and hence
non-differentiable) and is either 1 or 0. If the inputs are arranged as a column matrix and the
weights are arranged likewise, then both can be treated as vectors, and the weighted sum

$$\sum_i w_i x_i$$

is the same as the dot product

$$w \cdot x$$

Note that the dot product of two matrices (representing vectors) can be written as the
transpose of one multiplied by the other:

$$w \cdot x = w^T x$$
All three equations (Eq. 1, 2 and 3) are the same; different references simply write them in one of
these forms.

$$w \cdot x > b$$

defines all the points on one side of the hyperplane, and

$$w \cdot x \le b$$

defines all the points on the other side of the hyperplane and on the hyperplane itself.
This happens to be the very definition of "linear separability".
Thus, the Perceptron allows us to separate our feature space in two convex half-spaces. If we
can get the weight matrix that has this property, then this weight vector splits the input feature
vectors into two regions by a hyperplane.
This is the essence of the Perceptron, the initial artificial neuron.
In simple terms, it means that an unknown feature vector of an input set belonging to say
Dogs and Cats, when a dot product is applied with a trained weight vector, will fall into either the
Dog space of the hyperplane or the Cat space of the hyperplane. This is how neural networks do
classifications.
5.1.1 Perceptron Training
1. Take an input from the training data and compute its dot product with the initial weight vector;
this will give you either a value greater than 0 or less than 0.
2. This tells you on which side of the hyperplane the feature vector lies: either in the positive
region (P) or the negative region (N).
3. If this is as expected, then do nothing.
4. If the dot product comes out wrong, that is, if the input feature vector x was x ∈ P but the dot
product w · x < 0, we need to drag/rotate the weight vector towards x:

$$w_{new} = w + x$$

which is vector addition; that is, w is moved towards x.
5. Alternatively, if x ∈ N but the dot product w · x > 0, then we need to do the reverse:

$$w_{new} = w - x$$

This is also called the delta rule. Note that some articles refer to this as simplified gradient descent,
but gradient descent depends on the activation function being differentiable. The step function, which
is the activation function of the perceptron, is non-continuous and hence non-differentiable.
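A minimal sketch of this training loop, using labels y ∈ {+1, −1} so that both update cases collapse into the single rule w ← w + y·x; the data is assumed and linearly separable:

# Perceptron training via the delta rule on tiny assumed data
import numpy as np

def train_perceptron(X, y, epochs=20):
    """X: feature vectors; y: labels in {+1, -1}. Returns a learned weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Misclassified when the sign of the dot product disagrees with the label
            if yi * np.dot(w, xi) <= 0:
                w += yi * xi  # move w towards x (y = +1) or away from it (y = -1)
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(w, np.sign(X @ w))  # learned weights and the resulting predictions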
5.2 Forward Propagation:
Forward propagation is the process of passing input data through the layers of a neural network to
generate an output. Each layer consists of neurons that apply weights to the inputs, pass them through an
activation function, and produce outputs for the next layer.
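A minimal sketch of one forward pass through a single hidden layer with sigmoid activations; all weights and inputs are assumed values:

# One forward pass: input -> hidden layer -> output
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input vector (assumed values)
W1 = np.ones((4, 3)) * 0.1       # hidden-layer weights: 4 neurons, 3 inputs
b1 = np.zeros(4)
W2 = np.ones((1, 4)) * 0.1       # output-layer weights
b2 = np.zeros(1)

h = sigmoid(W1 @ x + b1)   # hidden activations
y = sigmoid(W2 @ h + b2)   # network output
print(y)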
5.3 Backward Propagation
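Backward propagation computes the gradient of the cost with respect to every weight by applying the chain rule layer by layer, from the output back towards the input; gradient descent then updates each weight in proportion to its contribution to the error. Below is a minimal from-scratch sketch for the one-hidden-layer network of the previous section, using a squared-error loss; all values are assumed for illustration:

# Train the tiny network above by backpropagation and gradient descent
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = np.array([0.5, -1.2, 3.0]), np.array([1.0])  # input and target
W1, b1 = np.ones((4, 3)) * 0.1, np.zeros(4)
W2, b2 = np.ones((1, 4)) * 0.1, np.zeros(1)
lr = 0.5

for _ in range(100):
    # Forward pass
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    # Backward pass: chain rule, output layer first
    delta2 = (y - t) * y * (1 - y)           # dLoss/d(pre-activation of output)
    delta1 = (W2.T @ delta2) * h * (1 - h)   # error propagated back to the hidden layer
    # Gradient-descent updates
    W2 -= lr * np.outer(delta2, h); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x); b1 -= lr * delta1

print(y)  # the output approaches the target as the loss is driven down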