BS Thesis
By
Abdul Raheem
CIIT/SP20-BSM-003/LHR
A thesis Submitted to
COMSATS University Islamabad, Lahore Campus
In partial fulfillment
of the requirement for the degree of
Bachelor of Science in Mathematics
By
Sardar Abdul Wahab CH
CIIT/FA20-BSM-012/LHR
Abdul Raheem
CIIT/SP20-BSM-003/LHR
Department of Mathematics
Faculty of Science
Machine Learning: Modern Techniques
and Mathematical Approach to Neural Networks
This thesis is submitted to the Department of Mathematics in partial fulfillment of the requirements for
the award of the degree of Bachelor of Science in Mathematics
Supervisor
Certificate of Approval
This report titled
Machine Learning: Modern Techniques and Mathematical Approach to Neural Networks
By
Abdul Raheem
CIIT/SP20-BSM-003/LHR
has been approved
for the Degree of Bachelor of Science in Mathematics
at COMSATS University Islamabad, Lahore Campus
External Examiner:
Supervisor:
Head of Department:
Author’s Declaration
We, Sardar Abdul Wahab Chaudary (CIIT/FA20-BSM-012/LHR) and Abdul Raheem
(CIIT/SP20-BSM-003/LHR), hereby declare that we have produced the work presented in this thesis
during the scheduled period of study. We also declare that we have not taken any material from any
source except where due reference is made, and that the amount of plagiarism is within an acceptable
range. If a violation of HEC rules on research has occurred in this thesis, we shall be liable to
punishable action under the plagiarism rules of HEC.
Date:
Sardar Abdul Wahab CH
CIIT/FA20-BSM-012/LHR
Abdul-Raheem
CIIT/SP20-BSM-003/LHR
Certificate
It is certified that Sardar Abdul Wahab Chaudary (CIIT/FA20-BSM-012/LHR) and Abdul Raheem
(CIIT/SP20-BSM-003/LHR) have carried out all the work related to this thesis under my supervision at
the Department of Mathematics, COMSATS University Islamabad, Lahore Campus, and that the work
fulfills the requirements for the award of the BS degree.
Date: Supervisor
Dedication
We dedicate this thesis to our parents and honourable teachers.
Acknowledgements
All praise is due to ALLAH, the Cherisher and Lord of the Worlds,
Most Gracious and Most Merciful.
First and foremost, I want to express my deepest gratitude to ALLAH Almighty (the Most
Beneficent and Most Merciful) for providing me with the strength, knowledge, and opportunity to
undertake and complete this research. Without His countless blessings, this achievement would not
have been possible. May peace and blessings be upon His messenger, Hazrat Muhammad (PBUH),
his family, companions, and all who follow him. My sincere thanks to Hazrat Muhammad (PBUH),
who continues to be a beacon of guidance and knowledge for all humanity. On this journey towards
my degree, I have found in him a teacher, an inspiration, a role model, and a constant source of
support.
Abstract
Machine Learning: Modern Techniques
and Mathematical Approach to Neural Networks
by
Sardar Abdul Wahab CH and Abdul Raheem
Table of Contents
Chapter 01 - Machine Learning: A Gentle Introduction ................................................................. 1
1.1 Machine Learning ................................................................................................................ 1
1.2 Process of Machine Learning............................................................................................... 2
1.3 Applications .......................................................................................................... 4
1.3.1 Predictive Analytics ............................................................................................. 4
1.3.2 Recommendation Systems ................................................................................... 5
1.4 Flow of Machine Learning...................................................................................................5
Chapter 02 - Machine Learning: Types and Applications .............................................................. 8
2.1 Supervised Learning ........................................................................................................... 8
2.1.1 For Regression ................................................................................................... 11
2.1.2 For Classification ............................................................................................... 12
2.2 Unsupervised Machine Learning ...................................................................................... 13
2.2.1 Key Characteristics ............................................................................................ 13
2.2.2 Common Techniques ......................................................................................... 13
2.2.3 Applications ....................................................................................................... 13
2.2.4 K-Means Clustering.............................................................14
2.3 Reinforcement Learning ................................................................................................... 18
2.3.1 Working Model .................................................................................................. 19
Chapter 03 - Supervised Learning: Regression and Classification .............................................. 20
3.1 Overview of Machine Learning ........................................................................................ 20
3.1.1 Types of Supervised Machine Learning .............................................. 21
3.2 Linear Regression .............................................................................................................. 21
3.3 Classification.....................................................................................................................23
3.4 Logistic Regression: Classification Algorithm ................................................................ 24
Chapter 04 - Supervised Learning: SVM and Decision Tree ...................................................... 28
4.1 Support Vector Machine .................................................................................................. 28
4.1.1 Anatomy and Mathematics ............................................................................... 29
4.2 Decision Tree: Anatomy and its Mathematics ................................................................. 33
Chapter 05 - Neural Networks – Perceptron: Concepts and Applications...................................38
5.1 The Perceptron .................................................................................................................. 40
5.1.1 Perceptron Training ...........................................................................................43
5.2 Forward Propagation ........................................................................................................ 45
5.3 Backward Propagation ...................................................................................................... 46
Chapter 01
Machine Learning: A Gentle Introduction
1.1 Machine Learning
In simple terms, machine learning is like teaching a child through examples. Instead of telling the
child all the rules about what makes an animal a cat or a dog, you show them many pictures, and
over time, they start to understand the difference based on the examples they've seen.
Img-no-1.1
Machine learning is used in many everyday applications. For example, when Netflix
recommends movies you might like, that's machine learning at work. The system has learned
your preferences based on the movies you've watched and rated in the past. Similarly, when your
email program filters out spam, it's using machine learning to identify what unwanted emails
look like based on patterns it has learned from millions of examples.
The beauty of machine learning is that it enables computers to handle new situations without
human intervention, making processes more efficient and uncovering insights that might take
humans much longer to realize. It's a tool that, when used wisely, can significantly enhance our
decision-making and automate routine tasks.
Example:
Consider a set of data points. Each data point has some features (input) and an associated
outcome (output). In machine learning, we try to find a function ƒ that maps inputs to outputs as
accurately as possible, based on the data we have. This function ƒ is determined by a set of
parameters (which could be weights in a neural network, coefficients in a linear regression, etc.).
1. Defining a Loss Function: This is a mathematical function that measures the difference
between the actual outcome and the predicted outcome by the model. For example, in regression
tasks, a common loss function is the Mean Squared Error (MSE), which calculates the average of
the squares of the errors between the actual and predicted values.
2. Optimization: This involves finding the set of parameters that minimizes the loss function.
This process usually requires an optimization algorithm like gradient descent. The algorithm
starts with random values for the parameters and iteratively updates them in a direction that
reduces the loss.
3. Iteration: The optimization is usually done iteratively over many cycles (or "epochs"), where
the model learns from a subset of the data (a "batch") at a time, gradually reducing the loss and
improving the model's predictions.
Mathematically, if our dataset consists of pairs (xᵢ, yᵢ), where the xᵢ are the inputs and the yᵢ are the
actual outcomes, and our model makes predictions f(xᵢ), then the goal of machine learning is to
find the parameters θ of f that minimize the total loss over all data points:

$$\min_{\theta} \sum_{i=1}^{n} L\big(f(x_i;\theta),\, y_i\big)$$

Here, the exact form of the loss function L and of the function f depends on the specific type of
machine learning task (e.g., classification, regression). But fundamentally, all machine learning
follows this basic mathematical principle of minimizing some measure of error between the
model's predictions and the actual data.
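To make the three steps above concrete, here is a minimal from-scratch sketch (with assumed toy data) that fits a simple linear model f(x) = w·x + b by gradient descent on the MSE loss:

# A minimal sketch (assumed toy data): MSE loss, gradient descent, and
# iteration over epochs, for a model f(x) = w*x + b
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 6.1, 8.3])   # roughly y = 2x

w, b, lr = 0.0, 0.0, 0.01
for epoch in range(1000):
    pred = w * x + b
    loss = np.mean((pred - y) ** 2)       # 1. MSE loss
    dw = np.mean(2 * (pred - y) * x)      # 2. gradients of the loss
    db = np.mean(2 * (pred - y))
    w, b = w - lr * dw, b - lr * db       # 3. iterative parameter update
print(w, b, loss)  # w approaches ~2 and the loss shrinks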
1.2 Process of Machine Learning
1.2.1 Data Collection:
Data can be collected from various sources such as files, databases, the internet, sensors, or
experiments, and can be in different formats such as text, images, videos, or tables.
Challenges: Ensuring data relevance, dealing with privacy issues, and collecting a large
and diverse dataset.
Best Practices: Collect as much relevant and diversified data as possible while respecting
privacy and ethical guidelines.
Handling Missing Values: Depending on the context, you might fill in missing values
with the mean, median, mode, or even predict them with another machine learning model.
Data Normalization: Transforming data to a common scale without distorting
differences in the ranges of values.
Feature Encoding: Converting categorical data into numerical data so that it can be
processed by the algorithm.
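A minimal preprocessing sketch of these three steps; the column names are hypothetical:

# Imputation, normalization, and one-hot encoding on a tiny assumed frame
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'age': [25, None, 40],
                   'income': [30000, 52000, 61000],
                   'city': ['Lahore', 'Karachi', 'Lahore']})

df['age'] = df['age'].fillna(df['age'].median())                              # handle missing values
df[['age', 'income']] = MinMaxScaler().fit_transform(df[['age', 'income']])  # normalize to [0, 1]
df = pd.get_dummies(df, columns=['city'])                                    # encode categories
print(df)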
Img-no-1.2
1.2.4 Training:
Training involves using a dataset to adjust the parameters of the machine learning model. The data used
for training must be representative of the real-world scenario for the model to learn effectively.
Process: The data is usually split into training and validation sets, where the training set is used to train the
model and the validation set is used to tune the hyperparameters.
Techniques: Batch learning, online learning, or reinforcement learning depending on the problem and data
size.
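A minimal sketch of the split described above, using the Iris dataset as stand-in data:

# Split the data: 80% for training, 20% held out for validation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_val.shape)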
1.2.5 Evaluation:
After training, the model is evaluated using a different set of data called the test set. The evaluation
metrics depend on the type of machine learning task.
1.2.6 Hyperparameter Tuning:
Objective: To find the set of hyperparameters that results in the best performance of the model on
the validation set.
Challenges: Balancing the trade-off between model complexity and overfitting, and dealing with
the computational cost of testing many different hyperparameter combinations.
1.2.7 Deployment:
Considerations: Monitoring the model's performance over time, updating it with new data, and
ensuring it remains relevant and accurate.
Each step in the machine learning process is crucial and requires careful consideration and expertise. By
meticulously following these steps, one can develop models that are not only accurate and efficient but
also robust and scalable.
1.3 Applications
Machine learning applications span various sectors, impacting our daily lives and the way
industries operate. Below are detailed discussions on three specific applications: Predictive Analytics,
Medical Diagnosis, and Recommendation Systems.
1.3.1 Predictive Analytics (e.g., forecasting stock market trends):
Predictive analytics involves using historical data, statistical algorithms, and machine learning techniques
to predict future outcomes. In the context of the stock market, predictive analytics can be used to forecast
market trends, stock prices, and economic shifts based on a plethora of factors, including past market data,
financial news, company performance indicators, and global economic trends.
How it Works: Machine learning models such as time series forecasting, regression analysis, and neural
networks are trained on historical stock market data. They learn patterns and relationships between
various factors influencing the markets.
Applications: Traders and investment firms use predictive models to make informed decisions, hedge
risks, and identify investment opportunities. Algorithmic trading systems use these predictions to execute
trades at optimal times, maximizing profits and minimizing losses.
Challenges: The stock market is influenced by unpredictable factors like political events, natural
disasters, and changes in government policies, making it inherently volatile and difficult to predict with
high accuracy.
Img-no-1.3
1.4 Flow of Machine Learning
The process of machine learning is a structured approach to developing, training, and deploying
algorithms that learn from data. Below, we detail each step in the machine learning process,
highlighting its importance and the common practices involved:
Practices: Data can be collected from various sources, including public datasets, company
databases, online repositories, or through sensors and real-time data feeds.
Challenges: Ensuring data diversity to avoid bias, respecting privacy and ethical standards, and
dealing with large volumes of data.
1.4.3 Cleaning:
Involves removing or correcting inconsistent, incomplete, or erroneous data.
Preprocessing: Includes normalization (scaling all numeric attributes in the dataset to a common scale
without distorting differences in the ranges of values), handling missing values, and encoding categorical
variables into a format that algorithms can understand.
Feature Selection and Engineering: Identifying the most relevant features to the prediction task
and creating new features from the existing ones to improve model performance.
1.4.4 Model Selection:
Considerations: The complexity of the model, the interpretability of the results, and
computational efficiency.
Approaches: It's common to start with simpler models for initial benchmarks and gradually move
towards more complex models as needed.
1.4.5 Training:
In this step, the chosen machine learning model is applied to the prepared dataset. The model learns to
make predictions or decisions based on the data. This involves adjusting the model's parameters so that it
can accurately map the input data to the correct output.
Process: The data is usually split into a training set and a test set. The training set is used to train
the model, while the test set is used to evaluate its performance.
Techniques: Depending on the algorithm, different techniques like gradient descent might be
used to optimize the model's parameters.
1.4.6 Evaluation:
After training, the model's performance is assessed using the test set. This step is crucial to determine how
well the model has learned from the training data and how well it generalizes to new, unseen data.
Metrics: Vary depending on the type of machine learning task (accuracy, precision, recall for
classification tasks; mean squared error for regression tasks).
Validation: Involves using various techniques like cross-validation to ensure that the model
performs well on different subsets of the data.
1.4.7 Hyperparameter Tuning:
Techniques: Grid search, random search, and Bayesian optimization are common methods for
exploring different hyperparameter combinations to find the most effective ones. A small example
follows this list.
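As an illustration of these techniques, the sketch below runs a small grid search with 5-fold cross-validation; the model and parameter grid are assumptions chosen for brevity:

# Exhaustive search over a tiny hyperparameter grid, scored by cross-validation
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))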
Deployment: Involves integrating the model into an existing production environment where it
can provide ongoing predictions or analyses.
Monitoring: Continuous monitoring is essential to ensure the model remains effective over time,
adjusting and retraining as necessary to account for new data or changing conditions.
Each of these steps is critical to the success of a machine learning project. They ensure that the final
model is not only accurate and efficient but also robust and capable of adapting to new data and evolving
requirements.
Chapter 02
Machine Learning: Types and Applications
2.1 Supervised Learning
2.1.1 Process:
In supervised learning, the algorithm makes predictions or decisions based on input data.
After each prediction, it receives feedback on its accuracy: correct or incorrect. This feedback
helps the algorithm to adjust and improve its future predictions during the training process. The
main goal is to map input data to the correct output labels, minimizing the errors, and improving
the model's accuracy.
Example: Email Spam Classification
Data Collection: Gather a dataset consisting of emails, each labeled as 'Spam' or 'Not Spam'.
Features: Extract features from each email, which could include the frequency of certain words
or phrases, the sender's details, the time the email was sent, etc.
Training Data: The training data might look something like this:
Prediction:
Once trained, the model can take a new email without a label and predict whether it is Spam
or Not Spam based on the learned associations.
In this supervised learning scenario, the algorithm learns from the training data, which acts as a
teacher providing answers (labels) for the emails (input data). Over time, by minimizing the
difference between its predictions and the actual labels, the algorithm can learn to accurately
classify new, unseen emails.
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sample data
emails = ["Win a free phone now",
          "Meeting at 10 am tomorrow",
          "Congratulations, you've won!",
          "Could we discuss the report?"]
labels = ['Spam', 'Not Spam', 'Spam', 'Not Spam']

# Convert the email text into word-count feature vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train a Naive Bayes classifier on the labeled examples
# (with such a tiny sample we train on all of it, for illustration only)
model = MultinomialNB()
model.fit(X, labels)

# Predict the label of a new, unseen email
print(model.predict(vectorizer.transform(["You have won a free prize"])))
In machine learning, various metrics are used to evaluate the performance of models. These
metrics help in understanding how well a model is performing and are crucial for comparing
different models. The choice of metric depends on the type of machine learning task (e.g.,
regression, classification, clustering). Here are some common metrics used in machine learning:
Img-no-2.2 Regression Analysis
Mean Absolute Error (MAE): This is the average of the absolute differences between the
predicted and actual values.
Mean Squared Error (MSE): This is the average of the squared differences between the
predicted and actual values. It penalizes larger errors more than MAE.
Root Mean Squared Error (RMSE): This is the square root of the MSE. It is more sensitive to
outliers than MAE and is in the same units as the target variable.
R-squared (R²): This is the coefficient of determination, which measures the proportion of the
variance in the dependent variable that is predictable from the independent variable(s). It
provides a measure of how well observed outcomes are replicated by the model.
Adjusted R-squared: This adjusts the R² for the number of predictors in the model. It's used in
multiple regression and provides a better measure of goodness of fit as it adjusts for the number
of terms in the model.
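These regression metrics are straightforward to compute; below is a minimal sketch on assumed toy values using scikit-learn:

# Compute MSE, RMSE, and R^2 for assumed predictions
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.3, 7.0, 9.4])

mse = mean_squared_error(y_true, y_pred)
print(mse, np.sqrt(mse), r2_score(y_true, y_pred))  # MSE, RMSE, R^2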
For classification tasks, common metrics include:
Precision (Positive Predictive Value): This is the ratio of true positives to the sum of true and
false positives. It indicates the quality of the positive class predictions.
Recall (Sensitivity or True Positive Rate): This is the ratio of true positives to the sum of true
positives and false negatives. It indicates how well the model is identifying the positive class.
F1 Score: This is the harmonic mean of precision and recall and provides a balance between the
two. It is useful when you need to balance precision and recall.
Area Under the Receiver Operating Characteristic Curve (AUC-ROC): This metric is used
to evaluate the performance of a binary classification system and is a plot of the true positive rate
against the false positive rate at various threshold settings.
Log Loss (Logarithmic Loss): This measures the performance of a classification model where
the prediction input is a probability value between 0 and 1. It penalizes false classifications.
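The classification metrics above can likewise be computed with scikit-learn; the labels and probabilities below are assumed for illustration:

# Precision, recall, F1, AUC-ROC, and log loss on assumed labels
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, log_loss

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]               # hard class predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]   # predicted probabilities for class 1

print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob), log_loss(y_true, y_prob))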
For clustering tasks, common metrics include:
Silhouette Score: This measures how similar a point is to its own cluster compared with other
clusters; higher scores indicate better-defined clusters.
Davies-Bouldin Index: The lower the score, the better the separation between the clusters.
Calinski-Harabasz Index: Also known as the Variance Ratio Criterion, a higher score indicates
better defined clusters.
Each of these metrics provides different insights into the performance of a machine learning
model, and the best metric to use will depend on the specific requirements of your project and the
nature of your data.
2.2 Unsupervised Machine Learning
Unsupervised machine learning is a type of machine learning where models are trained using
data that has not been labeled, categorized, or classified. In other words, the learning algorithm is
given data without explicit instructions on what to do with it. Instead, the algorithm must find
patterns and relationships within the data on its own. Unsupervised learning is primarily used to
discover underlying patterns, groupings, or structures in data.
Here are some key points and techniques associated with unsupervised machine learning:
No Labels: The training data is not labeled, meaning the outcome or category of the data is not
provided. The algorithm tries to learn the patterns without any reference to known or labeled
outcomes.
Pattern Discovery: The main goal is often to identify patterns or inherent structures within the
data.
Self-Organization: The model organizes or describes the data using a set of rules or features
discovered through the learning process.
Dimensionality Reduction: This technique is used to reduce the number of variables under
consideration and is often used for data visualization. Principal Component Analysis (PCA) and
t-Distributed Stochastic Neighbor Embedding (t-SNE) are popular dimensionality reduction
techniques.
Association Rule Learning: This method is used to discover interesting relationships between
variables in large databases. A common example is market basket analysis, where the goal is to
find sets of products that frequently co-occur in transactions.
Anomaly Detection: This involves identifying rare items, events, or observations which raise
suspicions by differing significantly from the majority of the data. It is widely used in fraud
detection, network security, and fault detection.
Neural Networks and Deep Learning: Certain neural network architectures, like autoencoders,
can be used for unsupervised learning tasks such as feature learning and representation learning.
2.2.3 Applications:
Unsupervised learning is used in various domains, including:
o Anomaly detection: Detecting fraudulent transactions or defective items in
manufacturing.
o Image and speech recognition: Learning features or patterns without labeled data.
o Natural language processing: Topic modeling and sentiment analysis.
o Recommendation systems: Recommending products or content based on user
behavior patterns.
Unsupervised machine learning is valuable for exploring data when you are not sure what to look
for, or when you want to reduce the complexity of data for further analysis. While unsupervised
learning can reveal hidden patterns and structures, the interpretations of these findings are
typically more subjective and require more domain expertise compared to supervised learning
outcomes.
2.2.4 K-Means Clustering
K-Means is an unsupervised algorithm that partitions a dataset into k clusters. It proceeds as follows:
Initialization: The algorithm starts by initializing k centroids. These can be chosen randomly
or by other methods. Each centroid represents the initial center of a cluster.
Assignment step: Each data point is assigned to the closest centroid, and thus partitions the data
into clusters based on the current centroids. The "closeness" is typically measured using
Euclidean distance, though other distance measures can be used.
Update step: Once all points have been assigned to clusters, the positions of the centroids are
recalculated as the mean of all points in the cluster.
Repeat: The assignment and update steps are repeated until the centroids no longer change
significantly, meaning the algorithm has converged. This usually means that the within-cluster
sum of squares (WCSS) cannot be reduced any further.
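The four steps above can be written down directly; a minimal from-scratch sketch on assumed synthetic 2-D data:

# From-scratch K-Means on two well-separated synthetic blobs
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]   # 1. initialization

for _ in range(10):                                   # 4. repeat until stable
    dists = np.linalg.norm(X[:, None] - centroids, axis=2)
    labels = dists.argmin(axis=1)                     # 2. assignment step
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # 3. update step

print(centroids)  # close to the true cluster centres (0, 0) and (4, 4)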
The Iris dataset is a collection of data from three different types of iris flowers: Setosa,
Versicolour, and Virginica. Each sample has four features: the lengths and the widths of the
sepals and petals. This dataset is often used for testing out machine learning algorithms because it
is small and has well-defined clusters.
In our example, we apply K-Means clustering to the Iris dataset with the number of clusters set to
three, corresponding to the three species of Iris flowers.
Data Preparation: Although K-Means is not highly sensitive to feature scaling, the Iris dataset
features are all of similar scales, so we didn't perform scaling. In different scenarios, feature
scaling might be necessary.
Model Training: We use `scikit-learn`'s KMeans class to fit the model. By default, `scikit-learn`
uses a refined starting condition (the K-Means++ algorithm) to choose initial centroids, which
helps in achieving better clustering.
Cluster Assignment: Once the algorithm has been run and the centroids have been determined,
each data point in the dataset is assigned to its nearest centroid, resulting in a partitioning of the
data into three clusters.
Evaluation: While we don't have an explicit accuracy metric like in supervised learning, we can
use intrinsic metrics like silhouette scores to judge the clustering quality. However, in this
educational example, we skipped this part. But in a real-world scenario, evaluating the clustering
result is crucial.
Visualization: Finally, we visualize the clusters and their centroids. This isn't just for aesthetics;
visual inspection can provide immediate insights into the effectiveness of the clustering, though
it's more feasible with lower-dimensional data.
Through this process, we aim to see whether the K-Means algorithm can effectively group the
iris flowers into clusters that correspond to their actual species. This is a simplified example of
how unsupervised learning can reveal hidden patterns in the data without needing any labels.
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the Iris data and fit K-Means with three clusters (one per species)
iris = load_iris()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(iris.data)
# Plot the first two features coloured by cluster; mark centroids with red crosses
plt.scatter(iris.data[:, 0], iris.data[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x')
plt.show()
2.3 Reinforcement Learning (RL)
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make
decisions by taking actions in an environment to achieve some goal. The learner or agent is not
told which actions to take, but instead must discover which actions yield the most reward by
trying them out. This method contrasts with supervised learning, where the agent is given correct
actions to take in various situations.
Key components of RL include:
Agent: The learner and decision maker that takes actions.
Environment: Everything the agent interacts with, and which provides the agent with states and
rewards.
Reward: A feedback from the environment to assess the last action taken by the agent. It can be
positive (reinforcing the action) or negative (discouraging the action).
Policy: A strategy used by the agent, mapping states of the environment to actions to be taken
when in those states.
Value Function: It predicts the expected return (sum of rewards) for an agent starting at a given
state and acting according to a particular policy.
Q-value or Action-Value Function: It predicts the expected return of taking a given action in a
given state and following a specified policy thereafter.
Model-Free RL: Here, the agent learns to make decisions based solely on the observed rewards
and states, without constructing a model of the environment. This approach includes methods
like Q-learning and SARSA.
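As a concrete illustration of model-free learning, here is a minimal Q-learning sketch on an assumed toy environment, a five-state corridor with a reward only at the right end:

# Tabular Q-learning on a toy 1-D corridor: states 0..4, actions left/right
import numpy as np

n_states, actions = 5, [-1, +1]
Q = np.zeros((n_states, len(actions)))
alpha, gamma, eps = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Move Q(s, a) toward reward + discounted best future value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))  # learned policy for states 0-3: action 1 (move right)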
Applications of Reinforcement Learning:
Reinforcement learning has been successfully applied in various domains, such as:
- Gaming: From board games like Chess and Go to video games, RL agents have been
developed that can learn to play and win at complex games.
- Robotics: RL can be used for teaching robots to perform tasks through trial and error.
- Autonomous Vehicles: Reinforcement learning helps in developing policies for self-driving
cars.
- Personalized Recommendations: RL can be used to personalize content recommendations in
apps and websites based on user interactions.
- Finance: For portfolio management and algorithmic trading by learning optimal trading
strategies.
Reinforcement learning presents a different paradigm of learning than other machine learning
methods, emphasizing learning by interaction and adapting to changing conditions to achieve the
best possible results.
Chapter 03
Supervised Learning: Regression and Classification
3.1 Overview of Machine Learning
Key Components:
3.1.1 Labeled Training Data:
- Supervised learning relies on datasets where each example is paired with a known outcome. This labeled
data serves as the foundation for training the model.
3.1.2 Regression:
- Regression involves predicting continuous values. Applications range from predicting stock prices and
housing values to estimating sales figures based on input features.
3.2 Linear Regression
Linear regression is a fundamental statistical method and a popular machine learning algorithm used
for modeling the relationship between a dependent variable and one or more independent variables.
The goal of linear regression is to find the best-fitting linear relationship that minimizes the
difference between the observed and predicted values of the dependent variable.
In the context of a simple linear regression, which involves one independent variable, the
relationship between the dependent variable 𝑦 and the independent variable 𝗑 is represented by the
equation of a straight line:
$$y = b_0 + b_1 x + c$$
Where:
- 𝑦 is the dependent variable (the variable we want to predict).
- 𝗑 is the independent variable (the variable used to make predictions).
- 𝑏0 is the y-intercept, the point where the line crosses the y-axis when 𝗑 = 0
- 𝑏1 is the slope of the line, representing the change in 𝑦 for a unit change in 𝑥.
- c is the error term, representing the difference between the observed and predicted values.
The objective of linear regression is to estimate the values of b₀ and b₁ that minimize the sum of
squared differences between the observed values yᵢ and the predicted values ŷᵢ. This is typically done
using the method of least squares.
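For simple linear regression, the least-squares estimates have a closed form: b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄. A minimal sketch with assumed data:

# Closed-form least-squares estimates for simple linear regression
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # close to 0 and 2 for this nearly linear data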
In the case of multiple linear regression, where there are multiple independent variables, the equation
is extended to:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + c$$

Here, b₀ is still the y-intercept, and b₁, b₂, …, bₙ are the slopes associated with each independent
variable x₁, x₂, …, xₙ.
Linear regression is widely used for predicting numeric outcomes and understanding the relationship
between variables. It is important to note that linear regression makes certain assumptions about the
data, such as linearity, independence of errors, homoscedasticity (constant variance of errors), and
normality of errors. When these assumptions are met, linear regression can provide interpretable and
valuable insights into the data.
Code Example
Medical Cost Personal Datasets – Insurance Forecast by using Linear Regression¹
¹ https://www.kaggle.com/datasets/mirichoi0218/insurance
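A minimal sketch of such a pipeline, assuming the Kaggle CSV is saved as insurance.csv with columns age, sex, bmi, children, smoker, region, and charges:

# Linear regression on the insurance data (file name and columns assumed as above)
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('insurance.csv')

# One-hot encode the categorical features so the model can use them
X = pd.get_dummies(df.drop(columns='charges'), drop_first=True)
y = df['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))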
Model Interpretation
We calculated the Mean Squared Error (MSE) and R-squared (R²) values for the insurance
dataset. Let's interpret these results:
R-squared (R²):
- The obtained R² value (≈ 0.783) is between 0 and 1.
- R² measures the proportion of the variance in the dependent variable (insurance charges) that is
explained by the independent variables in the model.
- An R² value of 0.78 indicates that approximately 78% of the variability in insurance charges can
be explained by the features included in the model.
- Higher R² values suggest that the model provides a good fit to the data, explaining a larger
proportion of the variability.
Interpretation:
- The relatively high R² value suggests that the linear regression model explains a substantial
portion of the variance in insurance charges based on the provided features.
- The MSE, while not easily interpretable in absolute terms, complements the R² value by indicating
the average squared difference between predicted and actual values. It is worth checking whether the
magnitude of the MSE is reasonable given the scale of the insurance charges.
3.3 Classification
Supervised learning classification is a type of machine learning algorithm used to categorize input
data into distinct classes or categories. The goal is to learn a mapping from input variables to output labels
based on a set of labeled training data.
Here's how it generally works:
Input Data: You start with a dataset consisting of input features and corresponding labels. Each data point
has a set of features (or attributes) and is assigned a class label.
Training Phase: In this phase, the algorithm learns from the labeled data. It tries to find patterns or
relationships between the input features and the labels. Various classification algorithms use different
techniques to accomplish this, such as decision trees, support vector machines, k-nearest neighbors, logistic
regression, or neural networks.
Model Creation: After training, the algorithm creates a model that represents these patterns. This model is
essentially a mathematical function that takes input features and predicts the corresponding class label.
Evaluation: Once the model is created, it needs to be evaluated to assess its performance. This is typically
done using a separate dataset called the validation set or test set. The model's predictions are compared with
the actual labels in the test set to measure its accuracy, precision, recall, F1 score, or other evaluation metrics.
Prediction: After evaluation, the trained model can be used to make predictions on new, unseen data. The
model takes the input features of a new data point and predicts its class label based on the patterns learned
during the training phase.
Iterative Improvement: Depending on the evaluation results, the model may undergo further refinement or
optimization. This could involve tuning hyperparameters, selecting different algorithms, or gathering more
labeled data to improve performance.
Classification is widely used in various applications such as spam detection, sentiment analysis, medical
diagnosis, image recognition, and more. It's an essential tool in the field of machine learning and data mining
for automating decision-making processes based on historical data.
Representation of Data:
- Let's denote the input features by X = {x₁, x₂, x₃, …, xₙ}, where each xᵢ represents a feature of the
input data.
- The corresponding labels are denoted by y ∈ {0, 1}, where 0 and 1 represent the two classes.
Model Representation:
- In binary classification, the model learns a decision boundary that separates the data points belonging to
different classes.
- One common way to represent this decision boundary is using a hypothesis function denoted h_θ(x),
where θ represents the parameters of the model.
- For logistic regression, a popular algorithm for binary classification, the hypothesis function is
represented as:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
- Here 𝑒 is the base of the natural logarithm and 𝑥 represents the input features.
- During the training phase, the algorithm aims to learn the optimal parameters 𝜃 that best fit the training
data.
- This is typically done by minimizing a cost function that measures the error between the predicted outputs
and the actual labels.
- For logistic regression, the cost function J(θ) is often defined as the logistic loss function or cross-
entropy loss:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log h_\theta\!\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\!\big(1 - h_\theta(x^{(i)})\big) \right]$$
- Here m is the number of training examples, and x⁽ⁱ⁾ and y⁽ⁱ⁾ are the features and label of the i-th
training example, respectively.
Optimization:
- The goal is to find the parameters 𝜃 that minimize the cost function 𝐽(𝜃).
- This is typically done using optimization algorithms like gradient descent or more advanced variants such
as stochastic gradient descent or mini-batch gradient descent.
Prediction:
- Once the model is trained and the optimal parameters are obtained, it can be used to make predictions on
new, unseen data.
- Given an input x, the model predicts the class label by evaluating the hypothesis function h_θ(x)
and typically thresholding the output probability at 0.5.
Binary classification is fundamental in various applications such as spam detection, fraud detection, medical
diagnosis, and more, where the task involves making a decision between two distinct classes.
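Tying these pieces together, a minimal from-scratch sketch (assumed toy data) of the sigmoid hypothesis, the cross-entropy cost minimized by gradient descent, and 0.5-thresholded predictions:

# Logistic regression from scratch on a tiny 1-D dataset
import numpy as np

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend 1s so theta[0] is the intercept

theta = np.zeros(2)
for _ in range(5000):
    h = 1.0 / (1.0 + np.exp(-Xb @ theta))   # hypothesis h_theta(x)
    grad = Xb.T @ (h - y) / len(y)          # gradient of the cross-entropy cost
    theta -= 0.1 * grad

print((h >= 0.5).astype(int))  # thresholded predictions match the labels after training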
Logistic Regression Model Trained on a Dataset:
Student Mental Health – a statistical research on the effects of mental health on students' CGPA²
² https://www.kaggle.com/datasets/shariful07/student-mental-health
Evaluation Metrics
Confusion Matrix
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv('./Student_Mental_health.csv')
df.rename(columns={'What is your CGPA?': 'CGPA'}, inplace=True)
df.head()

# Fill missing ages with the median age
df['Age'].fillna(df['Age'].median(), inplace=True)

# CGPA is recorded as a range such as "3.00 - 3.49"; keep the lower bound
df['CGPA'] = df['CGPA'].apply(lambda x: float(x.split(' - ')[0]) if '-' in x else float(x))
df.isnull().sum()

# The feature/target columns chosen here are assumptions for illustration
features = ['Age', 'CGPA']
target = 'Do you have Depression?'
X = df[features]
y = df[target]

# Replace any remaining missing feature values with the column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
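Continuing the sketch, the model can then be trained and its confusion matrix computed; the Yes/No label encoding below is an assumption about the dataset:

# Train a logistic regression and report the confusion matrix on held-out data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

y_binary = (y == 'Yes').astype(int)   # assumes the target column holds 'Yes'/'No'
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y_binary,
                                                    test_size=0.2, random_state=42)
clf = LogisticRegression().fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))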
Chapter 04
Supervised Learning: SVM and Decision Tree
4.1 Support Vector Machine
A Support Vector Machine (SVM) is a supervised learning algorithm that finds the hyperplane
which separates the classes with the maximum margin.
4.1.1 Concepts:
Support Vectors:
Support vectors are the data points that are closest to the hyperplane. These points are
critical in defining the position and orientation of the hyperplane.
Only these points are used to determine the hyperplane, making the algorithm
computationally efficient.
Kernel Trick:
When data is not linearly separable in its original feature space, SVM uses kernel
functions to transform the data into a higher-dimensional space where a linear
separation is possible.
Common kernel functions include linear, polynomial, radial basis function (RBF),
and sigmoid.
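A minimal sketch of the kernel trick in action: data arranged in concentric circles is not linearly separable, but an RBF-kernel SVM handles it; the dataset here is synthetic.

# Concentric circles: linear kernel struggles, RBF kernel separates cleanly
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)
print(SVC(kernel='linear').fit(X, y).score(X, y))  # close to chance
print(SVC(kernel='rbf').fit(X, y).score(X, y))     # near-perfect fit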
For data that is not perfectly separable, SVM introduces slack variables γᵢ to
allow some misclassifications. The optimization problem becomes:

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\gamma_i$$

In the regression setting (support vector regression), slack variables γᵢ and γᵢ* are used on
either side of an ε-insensitive tube:

$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left(\gamma_i + \gamma_i^*\right)$$

Subject to:

$$y_i - (w \cdot x_i + b) \le \varepsilon + \gamma_i, \qquad (w \cdot x_i + b) - y_i \le \varepsilon + \gamma_i^*, \qquad \gamma_i,\, \gamma_i^* \ge 0$$
Advantages of SVM:
o Effective in High-Dimensional Spaces: SVM is particularly effective when the
number of dimensions exceeds the number of samples.
o Memory Efficiency: Only the support vectors are used to define the hyperplane.
o Versatility: Different kernel functions can be specified for the decision function. It is
possible to use custom kernels.
Disadvantages of SVM:
Computationally Intensive: Training can be time-consuming, especially for
large datasets.
Choice of Kernel: The performance of SVM depends significantly on the
choice of the kernel and the kernel parameters.
Hard to Interpret: SVM models are often seen as black boxes since the
decision boundary in a high-dimensional space is hard to interpret.
Applications of SVM:
Text and Hypertext Categorization: SVMs are used for classifying texts into different
categories.
Image Classification: SVMs can classify images into different categories based on the
features extracted.
Bioinformatics: Used for protein classification, cancer classification based on gene
expression data, etc.
Handwriting Recognition: SVMs can be used to recognize handwritten characters.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Load the red wine quality dataset from the UCI repository
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
wine_data = pd.read_csv(url, delimiter=';')

# Binary target (illustrative threshold): "good" wine if quality >= 7
X = StandardScaler().fit_transform(wine_data.drop(columns='quality'))
y = (wine_data['quality'] >= 7).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = SVC(kernel='rbf').fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
print(classification_report(y_test, clf.predict(X_test)))
4.2 Decision Tree: Anatomy and its Mathematics
A Decision Tree is a flowchart-like structure where an internal node represents a feature (or
attribute), a branch represents a decision rule, and each leaf node represents an outcome (or class
label). The paths from the root to the leaf represent classification rules.
The basic idea of building a decision tree is to divide the dataset into subsets that contain instances
with similar values (homogeneous). This is done recursively, starting from the root node and moving
down the tree.
Steps to Build a Decision Tree
Select the Best Feature to Split: The feature that best separates the data into homogeneous
subsets is chosen. This is based on criteria like Information Gain, Gini Index, or others.
Split the Data: The dataset is split into subsets based on the selected feature.
Create Sub-nodes: A decision node is created for each split, and the process is repeated
recursively for each subset.
Stopping Criteria: The recursion is stopped when a stopping criterion is met, such as a
maximum tree depth, minimum number of samples per node, or if all instances in a node
belong to the same class.
1. Entropy:
Entropy measures the impurity of a set S. For a binary classification problem it is defined as:

$$H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$

where p₊ is the proportion of positive examples in the set S, and p₋ is the proportion of negative
examples.
Information Gain is used to select the feature that best splits the data. It is defined as the
difference in entropy before and after a split:
$$IG(S, A) = H(S) - \sum_{v \,\in\, \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
Where 𝐼𝐺(𝑆, 𝐴) is the information gain from splitting the set 𝑆 on feature 𝐴, and 𝑆𝑣 are subsets of 𝑆
for each value 𝑣 𝑜𝑓 𝐴.
2. Gini Index:
The Gini Index measures the impurity of a dataset. It is defined as:
$$Gini(S) = 1 - \sum_{i=1}^{C} p_i^2$$
where pᵢ is the probability of an element being classified into class i, and C is the number of classes.
The feature with the smallest Gini Index is chosen for the split.
3. Splitting Criteria:
For each feature, the dataset is split and the resulting subsets' impurity is calculated. The
feature that results in the lowest impurity (or highest information gain) is chosen.
4. Recursive Partitioning:
The process is repeated for each subset recursively until a stopping criterion is met (e.g.,
maximum depth, minimum samples per node, etc.).
Step-by-Step Process
Consider the classic weather ("play tennis") dataset of 14 examples (9 positive, 5 negative), where
the Outlook feature takes the values Sunny, Overcast, and Rain.
1. Compute the entropy of the full set:

$$H(S) = -\left(\frac{9}{14}\log_2\frac{9}{14} + \frac{5}{14}\log_2\frac{5}{14}\right) \approx 0.940$$

2. Compute the entropy of each Outlook subset and the resulting information gain:

$$H(\mathrm{Sunny}) = -\left(\frac{2}{5}\log_2\frac{2}{5} + \frac{3}{5}\log_2\frac{3}{5}\right) \approx 0.971$$

$$H(\mathrm{Overcast}) = -\left(\frac{4}{4}\log_2\frac{4}{4}\right) = 0$$

$$H(\mathrm{Rain}) = -\left(\frac{3}{5}\log_2\frac{3}{5} + \frac{2}{5}\log_2\frac{2}{5}\right) \approx 0.971$$

$$IG(S, \mathrm{Outlook}) = 0.940 - \left(\frac{5}{14}\cdot 0.971 + \frac{4}{14}\cdot 0 + \frac{5}{14}\cdot 0.971\right) \approx 0.247$$
Temperature, Humidity, Windy: Similar calculations are performed for other features to determine
their information gain.
3. Choose the Best Feature to Split: Suppose Outlook has the highest information gain.
4. Split the Dataset Based on Outlook:
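The entropy and information-gain arithmetic above can be verified in a few lines; the label counts are taken from the worked example:

# Verify H(S) ~ 0.940 and IG(S, Outlook) ~ 0.247
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

S = ['yes'] * 9 + ['no'] * 5
sunny, overcast, rain = ['yes'] * 2 + ['no'] * 3, ['yes'] * 4, ['yes'] * 3 + ['no'] * 2

ig = entropy(S) - (5/14 * entropy(sunny) + 4/14 * entropy(overcast) + 5/14 * entropy(rain))
print(round(entropy(S), 3), round(ig, 3))  # 0.94 and 0.247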
Disadvantages
Overfitting: Decision trees can easily overfit the training data, especially if they are not
pruned.
Instability: Small changes in the data can result in significantly different trees.
Bias: They can be biased towards features with more levels (i.e., those that have more
possible values).
Pruning:
Pruning is a technique used to reduce the size of the tree and prevent overfitting. It involves
removing branches that have little importance.
There are two types of pruning:
Pre-pruning (Early Stopping): The tree is stopped from growing beyond a certain depth, or
splits are stopped if the number of samples in a node is less than a specified number.
Post-pruning: After the tree is fully grown, branches are removed if they do not provide
significant information gain.
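A minimal sketch of pre-pruning with scikit-learn, capping tree depth and the minimum samples per split; the Iris dataset and the chosen limits are illustrative:

# Compare an unpruned tree with a pre-pruned one on held-out data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=42)

unpruned = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42).fit(X_train, y_train)

print(unpruned.score(X_test, y_test), pruned.score(X_test, y_test))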
Conclusion
Decision Trees are a powerful and intuitive method for both classification and regression
tasks. Their ease of interpretation and ability to handle various types of data make them a popular
choice. However, they require careful tuning and pruning to avoid overfitting and ensure that the
model generalizes well to new data.
Part 2
Deep Learning
Chapter 05
Neural Networks – Perceptron: Concepts and Applications
Deep Learning is about neural networks: their structures and designs, and how those networks are
trained. Even the most complex neural network is built from vectors and matrices; it uses the
concept of a cost function and algorithms like gradient descent to reduce that cost, and then
propagates the cost back to all constituents of the network proportionally via a method called back-
propagation.
Have you ever held an integrated circuit or chip in hand, or seen one? It looks overwhelmingly complex,
but its basis is the humble transistor and Boolean logic. To understand something complex, we need to
understand its simpler constituents.
The earliest neural network, Rosenblatt's Perceptron, was the first to introduce the concept of
using vectors and the properties of the dot product to split the space of input feature vectors with a
hyperplane.
Background:
As students of mathematics, we are all familiar with the concept of vectors and their applications.
Matrices mean many things to many areas of mathematics, but in the case of neural networks they
are simply a way to represent vectors (and tensors). A vector is essentially a one-dimensional
matrix, and a matrix is defined to be a rectangular array of numbers. An example is a Euclidean
vector in three-dimensional Euclidean space (R³), with some magnitude and direction (measured
from the origin (0, 0, 0) in this case).
Multi-dimensional matrices can be thought of as one-dimensional vectors stacked on top of each
other. The Neural Network Weights or weight vectors are stacked together as matrices. This intuition
is especially helpful when we use dot products on neural network weight matrices.
Dot Product and Splitting the hyper-plane - the crux of the Perceptron
Let's take a simple example. Assume that a vector like [x, y] is some feature of a leaf, i.e. a
feature vector of a leaf. Healthy leaves will have some values of this feature vector. Unhealthy leaves
will have some other value ranges of this feature vector. Let's collect features for 10 leaves and
assume that 2 are unhealthy ones. Imagine we have a problem classifying if a leaf is healthy or not
based on the features of the leaf. For each leaf, we have some feature vector set. Now, if we have a
weight vector, whose dot product with the feature vector of the set of input vectors of a certain class
(say leaf is healthy) is positive, and with the other set is negative, then that weight vector splits the
feature vector hyper-plane into two areas.
In the above diagram, the weight vector 𝑤 (shown as a dotted line) splits the 2D feature space of
positive and negative samples.
For any new leaf, if we extract the same features into a feature vector, we can take its dot product
with the trained weight vector and find out whether it falls in the healthy or diseased class.
Img-no-5.1 leaves synthetic data using numpy
Note that here we have hand-coded the weight matrix. We will see how this code can be adapted to
learn the weight matrix
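A minimal sketch of that idea, with synthetic leaf data and a hand-coded (not learned) weight vector:

# Synthetic leaves: healthy near (2, 2), unhealthy near (-2, -2)
import numpy as np

rng = np.random.default_rng(0)
healthy = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(8, 2))      # 8 healthy leaves
unhealthy = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(2, 2))  # 2 unhealthy leaves

w = np.array([1.0, 1.0])  # hand-coded weight vector splitting the feature plane

# Dot product > 0 falls on the healthy side of the hyperplane, otherwise unhealthy
for leaf in np.vstack([healthy, unhealthy]):
    print(leaf, "healthy" if np.dot(w, leaf) > 0 else "unhealthy")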
Img-no-5.3 perceptron anatomy
Inputs are x₁ to xₙ. Weights are learned values w₁ to wₙ. There is also a bias b, which in the figure
above is θ. The bias can be modelled as a weight w₀ connected to a dummy input x₀ set to 1.
If we ignore the bias term, the output y can be written as the sum of all inputs times their weights,
thresholded by zero: the Perceptron will fire if the sum of its inputs is greater than zero;
otherwise, it will not.
The Activation Function of the Neural Network
The big blue circle is the primitive brain of the primitive neural network: the perceptron
brain. This is what is called an Activation Function in neural networks. In the Perceptron, the
activation function is a simple step function; the output is non-continuous (and hence
non-differentiable) and is either 1 or 0. If the inputs are arranged as a column matrix and the
weights are arranged likewise, then both can be treated as vectors, and the weighted sum

$$\sum_i w_i x_i$$

is the same as the dot product

$$w \cdot x$$

Note that the dot product of two matrices (representing vectors) can be written as the
transpose of one multiplied by the other:

$$w \cdot x = w^T x$$
All three equations (Eq. 1, 2 and 3) are the same; different references simply write them in one of
these forms.

$$w \cdot x > b$$

defines all the points on one side of the hyperplane, and

$$w \cdot x \le b$$

defines all the points on the other side of the hyperplane and on the hyperplane itself.
This happens to be the very definition of "linear separability".
Thus, the Perceptron allows us to separate our feature space in two convex half-spaces. If we
can get the weight matrix that has this property, then this weight vector splits the input feature
vectors into two regions by a hyperplane.
This is the essence of the Perceptron, the initial artificial neuron.
In simple terms, it means that an unknown feature vector of an input set belonging to say
Dogs and Cats, when a dot product is applied with a trained weight vector, will fall into either the
Dog space of the hyperplane or the Cat space of the hyperplane. This is how neural networks do
classifications.
5.1.1 Perceptron Training
1. Take an input from the training data and compute its dot product with the initial weight vector;
this will give you either a value greater than 0 or less than 0.
2. This tells you on which side of the hyperplane the feature vector lies: either in the positive
region (P) or the negative region (N).
3. If this is as expected, then do nothing.
4. If the dot product comes out wrong, that is, if the input feature vector x was x ∈ P but the dot
product w · x < 0, we need to drag/rotate the weight vector towards x:

$$w_{new} = w + x$$

which is vector addition; that is, w is moved towards x.
5. Alternatively, if x ∈ N but the dot product w · x > 0, then we need to do the reverse:

$$w_{new} = w - x$$

This is also called the delta rule. Note that some articles refer to this as simplified gradient descent,
but gradient descent depends on the activation function being differentiable. The step function, which
is the activation function of the perceptron, is non-continuous and hence non-differentiable.
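A minimal sketch of this training loop, using labels y ∈ {+1, −1} so that both update cases collapse into the single rule w ← w + y·x; the data is assumed and linearly separable:

# Perceptron training via the delta rule on tiny assumed data
import numpy as np

def train_perceptron(X, y, epochs=20):
    """X: feature vectors; y: labels in {+1, -1}. Returns a learned weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Misclassified when the sign of the dot product disagrees with the label
            if yi * np.dot(w, xi) <= 0:
                w += yi * xi  # move w towards x (y = +1) or away from it (y = -1)
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(w, np.sign(X @ w))  # learned weights and the resulting predictions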
5.2 Forward Propagation:
Forward propagation is the process of passing input data through the layers of a neural network to
generate an output. Each layer consists of neurons that apply weights to the inputs, pass them through an
activation function, and produce outputs for the next layer.
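A minimal sketch of one forward pass through a single hidden layer with sigmoid activations; all weights and inputs are assumed values:

# One forward pass: input -> hidden layer -> output
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input vector (assumed values)
W1 = np.ones((4, 3)) * 0.1       # hidden-layer weights: 4 neurons, 3 inputs
b1 = np.zeros(4)
W2 = np.ones((1, 4)) * 0.1       # output-layer weights
b2 = np.zeros(1)

h = sigmoid(W1 @ x + b1)   # hidden activations
y = sigmoid(W2 @ h + b2)   # network output
print(y)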
5.3 Backward Propagation
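Backward propagation computes the gradient of the cost with respect to every weight by applying the chain rule layer by layer, from the output back towards the input; gradient descent then updates each weight in proportion to its contribution to the error. Below is a minimal from-scratch sketch for the one-hidden-layer network of the previous section, using a squared-error loss; all values are assumed for illustration:

# Train the tiny network above by backpropagation and gradient descent
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, t = np.array([0.5, -1.2, 3.0]), np.array([1.0])  # input and target
W1, b1 = np.ones((4, 3)) * 0.1, np.zeros(4)
W2, b2 = np.ones((1, 4)) * 0.1, np.zeros(1)
lr = 0.5

for _ in range(100):
    # Forward pass
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    # Backward pass: chain rule, output layer first
    delta2 = (y - t) * y * (1 - y)           # dLoss/d(pre-activation of output)
    delta1 = (W2.T @ delta2) * h * (1 - h)   # error propagated back to the hidden layer
    # Gradient-descent updates
    W2 -= lr * np.outer(delta2, h); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x); b1 -= lr * delta1

print(y)  # the output approaches the target as the loss is driven down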