
UNIT-2

Introduction to Artificial Neural Networks


Artificial Neural Networks (ANNs) are computational models inspired by the structure and
functionality of biological neural networks in the human brain. They are a fundamental
component of machine learning and artificial intelligence, used for tasks such as pattern
recognition, classification, regression, and clustering.

The basic building blocks of artificial neural networks are artificial neurons, or nodes, which are
interconnected in layers.

Representation of artificial neural network

An Artificial Neural Network has an input layer, an output layer, and one or more hidden layers.
The input layer receives data from the outside world which the neural network needs to analyze or
learn about. This data then passes through one or more hidden layers that transform the input into
a representation that is useful for the output layer. Finally, the output layer produces the network's
response to the input data provided.

In the majority of neural networks, units are connected from one layer to the next. Each of these
connections has a weight that determines the influence of one unit on another. As the data passes
from one unit to the next, the neural network learns more and more about the data, which
eventually results in an output from the output layer.

1. Input Layer: This layer receives the raw input data. Each node in the input layer
represents a feature or attribute of the input data.
2. Hidden Layers: These are intermediate layers between the input and output
layers. Each node in a hidden layer performs a weighted sum of the inputs from
the previous layer, applies an activation function to the result, and passes the
output to the next layer. Multiple hidden layers allow neural networks to learn
complex relationships in the data.
3. Output Layer: This layer produces the final output of the neural network. The
number of nodes in the output layer depends on the type of problem being
solved. For example, in a binary classification problem, there might be one node
representing the probability of belonging to one class and another node
representing the probability of belonging to the other class.
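As a concrete illustration of how a node in a hidden layer processes data, here is a minimal sketch in Python; the weights, bias, input values, and sigmoid activation are illustrative assumptions, not values from the text:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.2, 0.7, -1.5])     # outputs received from the previous layer
weights = np.array([0.4, -0.1, 0.25])   # one weight per incoming connection
bias = 0.05

weighted_sum = np.dot(weights, inputs) + bias   # weighted sum of the inputs
output = sigmoid(weighted_sum)                  # activation applied to the sum
print(output)                                   # value passed on to the next layer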
Appropriate Problems for Learning Neural Networks

*Instances have many attribute-value pairs: The target function to be learned is defined
over instances that can be described by a vector of predefined features.

*Target function output may be discrete-valued, real-valued, or a vector of several real-
or discrete-valued attributes.

*Training examples may contain errors: ANN learning methods are quite robust to noise
in the training data.

*Long training times are acceptable: Network training algorithms typically require longer
training times than, say, decision tree learning algorithms. Training times can range from
a few seconds to many hours, depending on factors such as the number of weights in
the network, the number of training examples considered, and the settings of various
learning algorithm parameters.

*Fast evaluation of the learned target function may be required. Although ANN learning
times are relatively long, evaluating the learned network, in order to apply it to a
subsequent instance, is typically very fast.
*The ability for humans to understand the learned target function is not important: The
weights learned by neural networks are often difficult for humans to interpret, and learned
neural networks are less easily communicated to humans than learned rules.

Perceptrons

"perceptrons," which are a type of artificial neuron used in artificial neural networks

A perceptron is the simplest form of an artificial neural network, conceptualized by Frank


Rosenblatt in 1957. It's a binary classifier that makes decisions based on a linear combination of
input features, similar to the way a biological neuron processes signals from its dendrites .

Basic Components of Perceptron


A perceptron, the basic unit of a neural network, comprises essential components that
collaborate in information processing.
 Input Features: The perceptron takes multiple input features, each input
feature represents a characteristic or attribute of the input data.
 Weights: Each input feature is associated with a weight, determining the
significance of each input feature in influencing the perceptron’s output.
During training, these weights are adjusted to learn the optimal values.
 Summation Function: The perceptron calculates the weighted sum of its
inputs using the summation function. The summation function combines the
inputs with their respective weights to produce a weighted sum.
 Activation Function: The weighted sum is then passed through an activation
function. The perceptron uses the Heaviside step function, which takes the
summed value as input, compares it with a threshold, and outputs 0 or 1.
 Output: The final output of the perceptron is determined by the activation
function's result. For example, in binary classification problems, the output
might represent a predicted class (0 or 1).
 Bias: A bias term is often included in the perceptron model. The bias allows
the model to make adjustments that are independent of the input. It is an
additional parameter that is learned during training.
 Learning Algorithm (Weight Update Rule): During training, the perceptron
learns by adjusting its weights and bias based on a learning algorithm. A
common approach is the perceptron learning algorithm, which updates
weights based on the difference between the predicted output and the true
output.
These components work together to enable a perceptron to learn and make predictions.
While a single perceptron can perform binary classification, more complex tasks require
the use of multiple perceptrons organized into layers, forming a neural network.
Types of Perceptron
 Single-Layer Perceptron: This type of perceptron is limited to learning
linearly separable patterns. It is effective for tasks where the data can be
divided into distinct categories by a straight line (a linear decision boundary).
 Multilayer Perceptron: Multilayer perceptrons possess enhanced processing
capabilities as they consist of two or more layers, adept at handling more
complex patterns and relationships within the data.

How does Perceptron work?


A weight is assigned to each input node of a perceptron, indicating the significance of
that input to the output. The perceptron's output is the weighted sum of the inputs
passed through an activation function, which decides whether or not the perceptron
will fire. It computes the weighted sum of its inputs as:
z = w1x1 + w2x2 + ... + wnxn = x^T w
The activation function most frequently used by perceptrons is the step function, which
compares this weighted sum to a threshold and outputs 1 if the sum is larger than the
threshold value and 0 otherwise. The most common step function used in the
perceptron is the Heaviside step function:
step(z) = 1 if z >= 0, and 0 otherwise (with the threshold absorbed into the bias term).
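To make this concrete, here is a minimal Python sketch of a perceptron using the Heaviside step activation and the perceptron learning rule; the AND dataset, learning rate, and epoch count are illustrative assumptions rather than values from the text:

import numpy as np

def heaviside(z):
    # Outputs 1 when the weighted sum reaches the threshold (here 0), else 0.
    return np.where(z >= 0.0, 1, 0)

class Perceptron:
    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)   # one weight per input feature
        self.b = 0.0                    # bias term
        self.lr = lr                    # learning rate

    def predict(self, x):
        z = np.dot(x, self.w) + self.b  # weighted sum z = w1x1 + ... + wnxn + b
        return heaviside(z)

    def fit(self, X, y, epochs=10):
        # Perceptron learning rule: adjust weights by the prediction error.
        for _ in range(epochs):
            for xi, target in zip(X, y):
                error = target - self.predict(xi)
                self.w += self.lr * error * xi
                self.b += self.lr * error

# Example: learning the linearly separable AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
p = Perceptron(n_features=2)
p.fit(X, y)
print(p.predict(X))   # expected output: [0 0 0 1]

Because a single perceptron draws only a linear decision boundary, this works for AND but would fail for a non-linearly-separable function such as XOR, which motivates the limitations listed below.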

*Limitations of Perceptron
The perceptron model has some limitations that can make it unsuitable for certain types
of problems:
 Limited to linearly separable problems.
 Convergence issues with non-separable data
 Requires labeled data
 Sensitivity to input scaling
 Lack of hidden layers
Multi-Layer Neural Network
To be precise, a fully connected multi-layered neural network is known as a Multi-Layer
Perceptron (MLP). A multi-layered neural network consists of multiple layers of artificial
neurons or nodes. Unlike single-layer networks, most networks in recent times are
multi-layered. The layers of a multi-layer perceptron are described below.
1. Input Layer: The input layer consists of neurons that receive input signals from
the external environment or other systems. Each neuron in the input layer
represents a feature or attribute of the input data.
2. Hidden Layers: Hidden layers are intermediate layers between the input and
output layers. Each neuron in a hidden layer receives input from neurons in the
previous layer, performs a weighted sum of the inputs, applies an activation
function, and then passes the output to neurons in the next layer. The number of
hidden layers and the number of neurons in each hidden layer are configurable
parameters of the network architecture.
3. Output Layer: The output layer produces the final output of the neural network.
The number of neurons in the output layer depends on the nature of the task
being solved. For example, in binary classification, there may be one neuron
representing the probability of belonging to one class and another neuron
representing the probability of belonging to the other class. In multi-class
classification, there will be one neuron for each class.
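As an illustration of how data flows forward through these layers, here is a minimal Python sketch of a forward pass for a small MLP; the layer sizes, random weights, and sigmoid activation are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Network shape: 3 input features -> 4 hidden neurons -> 2 output neurons.
W1 = rng.normal(size=(3, 4)); b1 = np.zeros(4)   # input layer -> hidden layer
W2 = rng.normal(size=(4, 2)); b2 = np.zeros(2)   # hidden layer -> output layer

x = np.array([0.5, -1.2, 3.0])                   # one input example (3 features)

hidden = sigmoid(x @ W1 + b1)                    # weighted sums + activation
output = sigmoid(hidden @ W2 + b2)               # final outputs of the network
print(output)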
Backpropagation algorithm

Backpropagation is an algorithm that propagates the errors from the output nodes back
to the input nodes; therefore, it is simply referred to as the backward propagation of
errors.

Backpropagation, short for "backward propagation of errors," is an algorithm used to
train artificial neural networks, specifically multi-layer perceptrons (MLPs), by efficiently
computing the gradients of the loss function with respect to the network's parameters.
It allows for the optimization of these parameters using gradient-based optimization
algorithms such as gradient descent.

Here's how the backpropagation algorithm works:


1. Forward Pass: Start by performing a forward pass through the network. Input
data is fed into the network, and the activations of each neuron are computed
sequentially layer by layer until the output is obtained. This involves applying the
weights and biases of each neuron and passing the result through an activation
function to obtain the output of the neuron.
2. Compute Loss: Compare the output of the network with the ground truth (i.e.,
the true labels or targets) to compute the loss function, which quantifies the
difference between the predicted output and the true output. Common loss
functions include mean squared error (MSE) for regression tasks and cross-
entropy loss for classification tasks.
3. Backward Pass (Backpropagation): This is the key step of the algorithm. It
involves computing the gradients of the loss function with respect to the
parameters of the network (weights and biases) using the chain rule of calculus.
The gradients are computed layer by layer, starting from the output layer and
moving backward towards the input layer.
a. Output Layer: Compute the gradient of the loss function with respect to the
activations of the output layer neurons.
b. Hidden Layers: Propagate the gradients backward through the network,
computing the gradients of the loss function with respect to the activations of
neurons in each hidden layer using the gradients from the subsequent layer.
c. Weights and Biases: Use the gradients of the loss function with respect to the
activations of neurons to compute the gradients of the loss function with respect
to the weights and biases of each neuron in the network.
4. Parameter Update: Finally, use the computed gradients to update the
parameters (weights and biases) of the network using an optimization algorithm
such as gradient descent or one of its variants. The parameters are updated in the
direction that minimizes the loss function.
5. Repeat: Iterate through steps 1-4 for multiple epochs (passes through the entire
dataset) until the network converges, i.e., until the loss function stops decreasing
or reaches a satisfactory level.
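The steps above can be made concrete with a small sketch. The following is a minimal, hedged Python implementation of backpropagation for a network with one hidden layer, sigmoid activations, and a (halved) mean squared error loss; the XOR dataset, layer sizes, learning rate, and epoch count are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)      # XOR targets

W1 = rng.normal(size=(2, 4)); b1 = np.zeros((1, 4))  # input -> hidden
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))  # hidden -> output
lr = 0.5

for epoch in range(5000):
    # 1. Forward pass: compute activations layer by layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # 2. Compute loss (halved MSE, so the factor of 2 cancels in the gradient).
    loss = 0.5 * np.mean((out - y) ** 2)

    # 3. Backward pass: gradients via the chain rule, output layer first.
    d_out = (out - y) * out * (1 - out) / len(X)     # dLoss/dz at the output layer
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)

    d_h = (d_out @ W2.T) * h * (1 - h)               # propagate gradients to the hidden layer
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0, keepdims=True)

    # 4. Parameter update: one gradient descent step.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    # 5. Repeat: the loop iterates for multiple epochs over the dataset.

print(round(loss, 4))   # the loss should have decreased over training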

Backpropagation allows neural networks to efficiently learn from training data by
adjusting their parameters to minimize the error between the predicted output and the
true output. It is a fundamental algorithm in the training of neural networks and has
enabled the widespread success of deep learning in various domains.
Remarks On Back Propagation Algorithm
The backpropagation algorithm is a cornerstone of modern deep learning and has significantly
contributed to the success and widespread adoption of neural networks. Here are some key remarks
on the backpropagation algorithm:
1. Efficient Gradient Computation: Backpropagation provides an efficient way to compute the
gradients of the loss function with respect to the parameters of the neural network. By
leveraging the chain rule of calculus, gradients can be propagated backward through the
network in an organized and systematic manner, enabling efficient parameter updates during
training.
2. Training Deep Neural Networks: Backpropagation is particularly well-suited for training
deep neural networks with multiple layers of neurons. Its ability to compute gradients layer
by layer allows for effective learning of hierarchical representations in data, leading to
superior performance in tasks such as image classification, natural language processing, and
reinforcement learning.
3. Overcoming Vanishing and Exploding Gradients: In deep neural networks, gradients can
either diminish exponentially (vanishing gradients) or explode exponentially (exploding
gradients) as they propagate backward through many layers. Techniques such as careful
weight initialization, batch normalization, and gradient clipping help mitigate these issues
and enable stable training of deep networks with backpropagation.
4. Non-Convex Optimization: The optimization problem posed by neural network training is
highly non-convex, meaning it contains multiple local minima and saddle points.
Backpropagation, combined with stochastic gradient descent and its variants, effectively
navigates this non-convex landscape and converges to good solutions, although not
necessarily globally optimal ones.
5. Implementation Challenges: While backpropagation is conceptually straightforward,
implementing it efficiently and correctly can be challenging. Issues such as numerical
stability, memory consumption, and computational complexity must be carefully addressed
to train deep neural networks effectively. Fortunately, modern deep learning frameworks
provide high-level abstractions and optimized implementations of backpropagation, making
it accessible to practitioners.
6. Generalization and Regularization: Backpropagation allows neural networks to learn
complex patterns from data, but there's a risk of overfitting—where the model performs well
on training data but poorly on unseen data. Techniques such as dropout, weight decay, and
early stopping help regularize the training process and improve the generalization ability of
neural networks trained with backpropagation.
7. Continued Research and Advancements: Despite its effectiveness, backpropagation is not
without limitations. Ongoing research aims to address challenges such as catastrophic
forgetting, efficient training of very deep networks, and incorporating structured priors into
learning. Alternatives to backpropagation, such as evolutionary algorithms and
neuroevolution, are also being explored to complement traditional gradient-based
optimization methods.

In summary, the backpropagation algorithm has revolutionized the field of deep learning and
remains a fundamental tool for training neural networks. Its ability to efficiently compute gradients
enables the training of deep, complex models on large-scale datasets, leading to state-of-the-art
performance in various machine learning tasks.
Face Recognition

Advanced Topics in Neural Networks


Advanced topics in neural networks encompass various techniques, architectures, and
methodologies that push the boundaries of what neural networks can achieve. Here are
some advanced topics in neural networks:


1. Deep Learning Architectures: Deep learning encompasses neural network
architectures with many layers, allowing them to learn hierarchical
representations of data. Advanced architectures include Convolutional Neural
Networks (CNNs) for image processing, Recurrent Neural Networks (RNNs) for
sequential data, Long Short-Term Memory networks (LSTMs) and Gated
Recurrent Units (GRUs) for handling long-term dependencies, and Transformers
for natural language processing tasks.
2. Transfer Learning and Fine-Tuning: Transfer learning involves leveraging pre-
trained neural network models trained on large datasets for a specific task and
fine-tuning them on a smaller, task-specific dataset. This approach can
significantly reduce the amount of labeled data required to train a model and
improve generalization performance (see the sketch after this list).
3. Generative Adversarial Networks (GANs): GANs are a class of neural networks
that consist of two networks—a generator and a discriminator—that are trained
simultaneously through a min-max game. GANs can generate realistic synthetic
data, such as images, text, and audio, and have applications in image generation,
data augmentation, and domain adaptation.
4. Reinforcement Learning (RL): RL is a branch of machine learning where an
agent learns to make decisions by interacting with an environment to maximize
cumulative rewards. Deep RL combines deep neural networks with RL algorithms,
enabling the learning of complex policies for tasks such as game playing,
robotics, and autonomous driving.
5. Meta-Learning: Meta-learning, or learning to learn, involves training models that
can learn new tasks or adapt to new environments with minimal data or training
samples. Meta-learning algorithms aim to discover common patterns across
different tasks and leverage this knowledge to facilitate rapid learning of new
tasks.
6. Neuroevolution: Neuroevolution combines neural networks with evolutionary
algorithms to optimize network architectures or parameters through genetic
algorithms, evolutionary strategies, or other evolutionary computation
techniques. It is particularly useful for training neural networks in environments
with limited data or when manual design is challenging.
7. Adversarial Robustness: Adversarial attacks involve intentionally perturbing
input data to mislead neural network models into making incorrect predictions.
Adversarial training methods aim to improve the robustness of neural networks
against such attacks by augmenting training data with adversarial examples or
incorporating adversarial perturbations directly into the training process.
8. Capsule Networks: Capsule networks are a novel type of neural network
architecture designed to better capture hierarchical relationships and spatial
hierarchies in data. They use groups of neurons called capsules to represent
entities or parts of objects and have shown promise in tasks such as object
recognition and pose estimation.
9. Explainable AI (XAI): XAI techniques aim to provide insights into the decision-
making process of neural networks, making them more interpretable and
transparent to users. Methods include attention mechanisms, saliency maps, and
model distillation, which help understand which features or parts of input data
are relevant for making predictions.
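The transfer-learning item above can be illustrated with a short, hedged sketch, assuming the PyTorch and torchvision libraries are available; the stand-in image batch, class count, and hyperparameters below are illustrative assumptions, not part of the original text:

import torch
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on a large dataset (ImageNet weights).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer to match the smaller, task-specific dataset.
num_classes = 5                                  # hypothetical number of target classes
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One fine-tuning step on a (hypothetical) batch of images and labels.
images = torch.randn(8, 3, 224, 224)             # stand-in batch of 8 RGB images
labels = torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()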

 Evaluating hypotheses:
Whenever you form a hypothesis for a given training data set (for example, a hypothesis
for the EnjoySport example, where the attributes of the instances decide whether a
person will be able to enjoy their favorite sport or not), you need to test or evaluate
how accurate that hypothesis is using different statistical measures. Evaluating
hypotheses is an important step in training the model.
*To evaluate hypotheses precisely, focus on these points when statistical methods are
applied to estimate hypothesis accuracy:
 First, how well does this estimate the accuracy of a hypothesis across additional
examples, given the observed accuracy of a hypothesis over a limited sample of
data?

 Second, how likely is it that a hypothesis which outperforms another over a set of data
is also more accurate in general?

 Third, what is the best strategy to use limited data to both learn and measure the
accuracy of a hypothesis?
Motivation:
There are instances where the accuracy of the entire model plays a huge role in
whether the model is adopted or not. For example, consider using a trained model
for medical treatment. We need high accuracy in order to depend on the
information the model provides.

When we need to learn a hypothesis and estimate its future accuracy based
on a small collection of data, we face two major challenges:

Bias in the estimation

First, the observed accuracy of the learned hypothesis over the training instances is
often a poor predictor of its accuracy over future cases. Because the learned
hypothesis was derived from these same instances, the estimate of its accuracy over
future examples is likely to be optimistically biased.
Estimation variability

Second, even if the hypothesis accuracy is measured over an unbiased set of test
instances independent of the training examples, the measured accuracy can still
differ from the true accuracy, depending on the makeup of the particular set of test
examples. The expected variance increases as the number of test examples
decreases.

When evaluating a learned hypothesis, we want to know how accurate it will be at
classifying future instances.

We also want to know the probable error in this accuracy estimate. There is some
space X of possible instances, and we presume that different instances of X may be
encountered with different frequencies.

A convenient way to model this is to assume there is some unknown probability
distribution D that describes the likelihood of encountering each instance in X.

A trainer draws each instance independently, according to the distribution D, and
then passes the instance x together with its correct target value f(x) to the learner
as a training example of the target function f.

The following two questions are of particular relevance to us in this context:

1. What is the best estimate of the accuracy of h over future instances taken from the
same distribution, given a hypothesis h and a data sample containing n examples
picked at random according to the distribution D?
2. What is the margin of error in this estimate of accuracy?

True Error and Sample Error:


We must distinguish between two notions of accuracy or, to put it another way,
error. One is the hypothesis's error rate over the available sample of data.

The other is the hypothesis's error rate over the complete, unknown distribution D
of examples. These will be referred to as the sample error and the true error,
respectively.

The sample error of a hypothesis with respect to some sample S of examples drawn
from X is the fraction of S that the hypothesis misclassifies.

Sample Error:
The sample error, denoted errorS(h), of hypothesis h with respect to target function f
and data sample S is

errorS(h) = (1/n) * Σx∈S δ(f(x), h(x))

where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1
if f(x) != h(x), and 0 otherwise.

True Error:
It is denoted by errorD(h) of hypothesis h with respect to target function f
and distribution D, which is the probability that h will misclassify an
instance drawn at random according to D.
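A minimal sketch of computing the sample error defined above, assuming a hypothetical hypothesis h, target function f, and sample S; note that the true error errorD(h) cannot be computed directly, since it would require the full distribution D, and is instead estimated from errorS(h):

import numpy as np

def sample_error(h, f, S):
    # Fraction of the examples in S that hypothesis h misclassifies.
    mistakes = sum(1 for x in S if h(x) != f(x))
    return mistakes / len(S)

# Example: a threshold hypothesis evaluated against the true target function.
f = lambda x: int(x > 0.5)            # true target function (assumed known here)
h = lambda x: int(x > 0.4)            # learned hypothesis
S = np.linspace(0, 1, 20)             # a sample of 20 instances drawn from X
print(sample_error(h, f, S))          # fraction of S that h misclassifies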
 Basics of Sampling Theory

For estimating hypothesis accuracy, statistical methods are applied. This section looks at
evaluating hypotheses and the basics of sampling theory.
Let's have a look at the terminologies involved and what they mean.

1)*Random Variable:
A random variable may be thought of as the name of a probabilistic experiment; its value
is the outcome of that experiment. Whenever the outcome of the experiment is not known
for certain, it is modelled as a random variable.

The outcome of a coin flip is a good illustration of a random variable. Note that the
outcomes of a random event need not all be equally likely to occur.

2)*Probability Distribution:
A probability distribution is a statistical function that specifies all possible values of a
random variable and their probabilities over a given range.

This range is bounded by the minimum and maximum possible values, and where a
particular value falls within the distribution is determined by a number of factors,
including the mean (average), standard deviation, skewness, and kurtosis of the
distribution.

3)*Expected Value:
The expected value (EV) of a random variable is the long-run average value it is
predicted to take over many repetitions of the experiment.
In statistics and probability analysis, the expected value is computed by multiplying each
conceivable outcome by the likelihood that it will occur and then summing all of those
values.
By assessing expected values, one can choose the scenario that is most likely to provide
the desired result.
4)*The variance of a Random Variable:
In statistics, variance refers to the spread of a data collection around its mean value. It is
calculated as the probability-weighted average of squared deviations from the expected
value.
As a result, the greater the variance, the greater the difference between the values in the
set and the mean. A smaller variance, on the other hand, indicates that the values in the
collection are closer to the mean.
The variance of a random variable Y is defined as
Var(Y) = E[(Y - E[Y])^2].

5)*Standard Deviation:
The standard deviation measures the dispersion of a dataset relative to its mean and is
calculated as the square root of the variance.
When data points are further from the mean, there is more variation within the data set;
as a result, the larger the standard deviation, the more spread out the data is.

The standard deviation of Y is sqrt(Var(Y)) and is usually represented using the
symbol σY.

6)*The Binomial Distribution:

Under a given set of assumptions, the binomial distribution expresses the likelihood
that a variable will take one of two possible values.
The binomial distribution is based on the premise that each trial has only one of two
outcomes, every trial has the same probability of success, and the trials are independent
of one another.
It gives the probability of observing r heads in a series of n independent coin tosses if
the probability of heads in a single toss is p.
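As a small worked example, this probability can be computed directly; the sketch below assumes the standard binomial formula, and the specific numbers are illustrative:

from math import comb

def binomial_prob(r, n, p):
    # P(r) = C(n, r) * p^r * (1 - p)^(n - r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(binomial_prob(r=3, n=10, p=0.5))   # probability of exactly 3 heads in 10 fair tosses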

7)*Normal Distribution:
The normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean, indicating that data near the mean occur
more frequently than data far from the mean. On a graph, the normal distribution
appears as a bell curve.
It is a bell-shaped probability distribution that describes many natural
phenomena.

8)*Central Limit Theorem:

The Central Limit Theorem states that, given a sufficiently large sample size drawn from
a population with finite variance, the mean of the sampled variables will be
approximately equal to the mean of the entire population.
Furthermore, as the sample size grows, the distribution of these sample means
approaches a normal distribution, with a variance approximately equal to the variance of
the population divided by the sample size.

9)Estimator:
It is a random variable Y used to estimate some parameter p of an underlying population.

The estimand is the quantity being estimated (i.e., the one you wish to know). For
example, suppose you needed to discover the average height of pupils at a 1000-student
school.
You measure a group of 30 children and find that their average height is 56 inches. This
sample mean is your estimator. Using it, you estimate the population mean (your
estimand) to be around 56 inches.

10)The Estimation Bias:


The estimation bias of Y as an estimator for p is the quantity (E[Y] – p). An unbiased
estimator is one for which the bias is zero.

11)N % confidence interval:


An N% confidence interval estimate for parameter p is an interval that includes p with
probability N%.
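Tying these ideas together, a common approximate N% confidence interval for the true error, based on an observed sample error over n test examples, uses the normal approximation to the binomial distribution. A minimal sketch follows; the observed error, sample size, and table of z values are illustrative assumptions:

from math import sqrt

Z_N = {0.90: 1.64, 0.95: 1.96, 0.99: 2.58}       # common two-sided z values

def error_confidence_interval(error_s, n, confidence=0.95):
    # Interval: errorS(h) +/- z_N * sqrt(errorS(h) * (1 - errorS(h)) / n)
    z = Z_N[confidence]
    margin = z * sqrt(error_s * (1.0 - error_s) / n)
    return error_s - margin, error_s + margin

# Suppose h misclassifies 12 of n = 40 test examples, so errorS(h) = 0.30.
print(error_confidence_interval(0.30, 40))       # approximate 95% interval for errorD(h)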

Comparing Learning Algorithms in Machine Learning
There are countless machine learning (ML) algorithms that humans have invented. Of
course, most of the time only a small subset is used in research and in industry. Yet it
is still a bit overwhelming for a person to understand and remember all the nitty-gritty
details of these ML models. Some people might also have the wrong impression that
all these algorithms are totally unrelated. More importantly, how might one choose
algorithm A over algorithm B when both seem to be effective?

This article aims to provide the readers with different angles to view
the ML algorithms. With these perspectives, algorithms can be
compared on common grounds and they can be analysed easily. The
article is written with two major ML tasks in mind — regression and
classification.

Time complexity

Under the RAM model [1], the "time" an algorithm takes is measured
by the elementary operations of the algorithm. While users and
developers may be more concerned about the wall-clock time an algorithm
takes to train a model, it is fairer to use the standard worst-case
computational time complexity to compare the time the models
take to train. Using computational complexity has the benefit of
ignoring differences such as the computing power and architecture used
at runtime and the underlying programming language, allowing users
to focus on the fundamental differences in the elementary operations
of the algorithms.

Note that the time complexity can be very different during training and
testing. For example, parametric models like linear regression could
have long training times but they are efficient at test time.

Space complexity

Space complexity measures how much memory an algorithm needs to
run in terms of the input size. An ML program cannot run successfully
if the algorithm loads too much data into the working memory of
the machine.

Sample complexity

Sample complexity measures the number of training examples needed
to train a model in order to guarantee valid generalisation. For
example, deep neural networks have high sample complexity, since lots of
training data are needed to train them.

Bias-variance tradeoff

Different ML algorithms make different bias-variance tradeoffs. The bias
errors come from the fact that a model is biased towards a specific
solution or assumption. For example, if a linear decision boundary is fit
to nonlinear data, the bias will be high. Variance, on the other hand,
measures the errors coming from the variance of the model: it is the
average squared difference between the prediction of a model and the
prediction of the expected (average) model [2].

(Figure: Bias-variance tradeoff, extracted from [2].)

Different models make different bias-variance tradeoffs. For example,
naive Bayes is considered a high-bias, low-variance model because of
the over-simplistic assumptions it makes.
Online and Offline

Online and offline learning refer to the way a machine learning system
updates its model. Online learning means training data can be presented
one example at a time, so that the parameters can be updated immediately
when new data are available. Offline learning, however, requires the
training to start over again (re-training the whole model) when new data
are presented in order to update the parameters. If an algorithm is an
online one, it is efficient, since the parameters used in production can be
updated in real time to reflect the effect of new data.

The ID3 decision tree algorithm is an example of offline learning. The way
ID3 works is to look at the data globally and do a greedy search to
maximize information gain. When new data points come in, the whole
model needs to be re-trained. In contrast, stochastic gradient descent
(SGD) is an online algorithm, since you can always use it to update the
parameters of a trained model when new data arrive.
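A minimal sketch of the online behaviour described above, assuming SGD-style updates for a simple linear model with a (halved) squared-error loss; the learning rate and incoming examples are illustrative assumptions:

import numpy as np

w = np.zeros(3)                      # model parameters for 3 input features
b = 0.0
lr = 0.01                            # learning rate

def sgd_update(x, y):
    # Update the parameters immediately from a single new example (x, y).
    global w, b
    prediction = np.dot(w, x) + b
    error = prediction - y
    w -= lr * error * x              # gradient of the halved squared error w.r.t. w
    b -= lr * error                  # gradient of the halved squared error w.r.t. b

# Each incoming example updates the model in place; no full re-training is needed.
sgd_update(np.array([1.0, 2.0, 3.0]), y=10.0)
sgd_update(np.array([0.5, -1.0, 2.0]), y=3.0)
print(w, b)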

Parallelizability

A parallel algorithm is one that can perform multiple operations at the
same time. This can be done by distributing the workload across different
workers, such as processors in a single machine or across multiple
machines. Sequential algorithms like gradient boosted decision trees
(GBDT) are tricky to parallelize, since the next decision tree is built
based on the errors the previous decision tree made.
The nature of the k-nearest neighbors (k-NN) model, on the other hand,
allows it to be easily run on multiple machines at the same time. It is a
classic example of using MapReduce in machine learning.

Parametricity

The concept of parametricity is widely used in the field of statistical
learning. Simply speaking, a parametric model is one whose number of
parameters is fixed, while the number of parameters of a non-parametric
model grows as more data become available [3]. Another way of defining
a parametric model is based on its underlying assumptions about the
shape of the probability distribution of the data: if no such assumption is
made, then it is a non-parametric model [4].

Parametric models are very common in machine learning. Examples are
linear regression, neural networks and many other ML models. k-NN and
SVM (support vector machine), on the other hand, are nonparametric
models [5].

Methodology, Assumptions and Objectives

In essence, all machine learning problems are optimization problems.
There is always a methodology behind a machine learning model, or an
underlying objective function to be optimized, and comparing the main
ideas behind the algorithms can enhance reasoning about them. For
instance, the objective of a linear regression model is to minimize the
squared loss between the predictions and the actual values (mean squared
error, MSE), while Lasso regression aims to minimize the MSE while
restricting the learned parameters by adding an extra regularization
term to prevent overfitting.
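A brief sketch of these two objectives, assuming numpy and a hypothetical weight vector; lam is an illustrative regularization strength:

import numpy as np

def mse_loss(y_true, y_pred):
    # Mean squared error: the linear regression objective.
    return np.mean((y_true - y_pred) ** 2)

def lasso_loss(y_true, y_pred, weights, lam=0.1):
    # MSE plus an L1 penalty that restricts the learned parameters.
    return mse_loss(y_true, y_pred) + lam * np.sum(np.abs(weights))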

Some taxonomies of machine learning models include a) generative vs
discriminative, b) probabilistic vs non-probabilistic, c) tree-based vs
non-tree-based, etc.
