Practice Q Machine Learning Ans
Introduction

Decision Trees are a type of Supervised Machine Learning (that is, you specify in the training data what the input is and what the corresponding output is) in which the data is continuously split according to a certain parameter. The tree can be explained by two entities: decision nodes and leaves. The leaves are the final outcomes, and the decision nodes are where the data is split.
An example of a decision tree can be explained using a binary tree. Let's say you want to predict whether a person is fit given information like their age, eating habits, and physical activity. The decision nodes here are questions like 'What is the age?', 'Does he exercise?', or 'Does he eat a lot of pizza?', and the leaves are outcomes like 'fit' or 'unfit'. In this case it was a binary classification problem (a yes/no type problem). There are two main types of Decision Trees:
What we’ve seen above is an example of classification tree, where the outcome was a variable like ‘fit’ or
‘unfit’. Here the decision variable is Categorical.
Here the decision or the outcome variable is Continuous, e.g. a number like 123. Working Now that we
know what a Decision Tree is, we’ll see how it works internally. There are many algorithms out there
which construct Decision Trees, but one of the best is called as ID3 Algorithm. ID3 Stands for Iterative
Dichotomiser 3. Before discussing the ID3 algorithm, we’ll go through few definitions.
Entropy:

Entropy, also called Shannon Entropy, denoted H(S) for a finite set S, is the measure of the amount of uncertainty or randomness in the data:

H(S) = - sum over x of p(x) * log2 p(x)

where p(x) is the proportion of examples in S belonging to class x.
Information Gain:
nformation gain is also called as Kullback-Leibler divergence denoted by IG(S,A) for a set S is the
effective change in entropy after deciding on a particular attribute A. It measures the relative change in
entropy with respect to the independent variables.
Let's understand this with the help of an example. Consider a piece of data collected over the course of 14 days, where the features are Outlook, Temperature, Humidity, and Wind, and the outcome variable is whether Golf was played on the day. Our job is to build a predictive model that takes in the above four parameters and predicts whether Golf will be played on the day. We'll build a decision tree to do that using the ID3 algorithm.
Yes    No    Total
 9      5     14
Remember that the Entropy is 0 if all members belong to the same class, and 1 when half of them belong to one class and the other half to the other class, i.e. perfect randomness. Here it's 0.94, which means the distribution is fairly random. The next step is to choose the attribute that gives us the highest possible Information Gain; we'll choose it as the root node. Let's start with 'Wind':
where x ranges over the possible values of the attribute. Here, the attribute 'Wind' takes two possible values in the sample data, hence x = {Weak, Strong}. We'll have to calculate the entropy of each subset:

Wind = Weak    Wind = Strong    Total
8              6                14
Now, out of the 8 Weak examples, 6 of them were 'Yes' for Play Golf and 2 of them were 'No' for Play Golf.
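Plugging the counts into the formulas confirms the numbers above. The split for Wind = Strong (3 'Yes', 3 'No') is not stated explicitly in the text, but these are the standard values for this well-known play-golf data set, so treat them as an assumption:

```python
from math import log2

def H(pos, neg):
    """Binary entropy computed from class counts."""
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

h_all = H(9, 5)         # entropy of the full 14-day sample
h_weak = H(6, 2)        # Wind = Weak: 6 'Yes', 2 'No'
h_strong = H(3, 3)      # Wind = Strong: 3 'Yes', 3 'No' (assumed standard values)
ig_wind = h_all - (8 / 14) * h_weak - (6 / 14) * h_strong
print(round(h_all, 2), round(ig_wind, 3))  # 0.94 0.048
```

The small gain (0.048) is why Wind does not end up as the root; repeating this calculation for the other attributes gives Outlook the highest gain.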
Here we observe that whenever the Outlook is Overcast, Play Golf is always 'Yes'. This is no coincidence: the simple tree results because the highest information gain is given by the attribute Outlook. How do we proceed from this point? We can simply apply recursion; you might want to look at the algorithm steps described earlier. Now that we've used Outlook, we have three attributes remaining: Humidity, Temperature, and Wind. And we had three possible values of Outlook: Sunny, Overcast, and Rain. The Overcast node already ended in the leaf node 'Yes', so we're left with two subtrees to compute: Sunny and Rain. The table where the value of Outlook is Sunny looks like:
The complete example in Python using scikit-learn:

import pydotplus
from sklearn.datasets import load_iris
from sklearn import tree
from IPython.display import Image, display

__author__ = "Mayur Kulkarni <mayur.kulkarni@xoriant.com>"

def load_data_set():
    """Loads the iris data set."""
    return load_iris()

def train_model(iris):
    """Fits a decision tree classifier to the iris data."""
    return tree.DecisionTreeClassifier().fit(iris.data, iris.target)

def display_image(clf, iris):
    """Exports the fitted tree to DOT, renders it as a PNG, and displays it."""
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names)
    graph = pydotplus.graph_from_dot_data(dot_data)
    display(Image(data=graph.create_png()))

if __name__ == '__main__':
    iris_data = load_data_set()
    decision_tree_classifier = train_model(iris_data)
    display_image(clf=decision_tree_classifier, iris=iris_data)
Conclusion: below is a summary of what we've studied in this blog:
1. Entropy measures the discriminatory power of an attribute for a classification task. It quantifies the amount of randomness in an attribute; minimal entropy means the attribute is close to one class and has good discriminatory power for classification.
2. Information Gain ranks attributes for filtering at a given node in the tree, in decreasing order of information gain.
3. The recursive ID3 algorithm creates a decision tree.
Inductive Inference
Inductive inference is the process of reaching a general conclusion from specific examples.
Inductive Learning Hypothesis: any hypothesis found to approximate the target function well over a
sufficiently large set of training examples will also approximate the target function well over other
unobserved examples.
Example:
x y z
2 3 5
4 6 10
5 2 7
Model 1: x + y = z

Prediction: if x = 0 and z = 0, then y = 0.

Model 2:

if x = 2 and z = 5, then y = 3.
if x = 4 and z = 10, then y = 6.
if x = 5 and z = 7, then y = 2.
otherwise y = 1.

Good: completely consistent with the data.

Bad: no justification in the data for the prediction that y = 1 in all other cases; not in the class of algebraic functions (but nothing was said about the class of descriptions).
Inductive bias: explicit or implicit assumption(s) about what kind of model is wanted.
Example:
The decision tree ID3 algorithm searches the complete hypothesis space, and there is no restriction on the number of hypotheses that could eventually be enumerated. However, this algorithm searches incompletely through the set of possible hypotheses and preferentially selects those hypotheses that lead to a smaller decision tree. This type of bias is called a preference (or search) bias.
In contrast, the version space candidate-elimination algorithm searches through only a subset of the possible hypotheses (an incomplete hypothesis space), yet searches this space completely. This type of bias is called a restriction (or language) bias, because the number of possible hypotheses considered is restricted.
A preference bias is generally more desirable than a restriction bias, because an algorithm with
this bias is allowed to search through the complete hypothesis space, which is guaranteed to
contain the target function.
o Restricting the hypothesis space being searched (a restriction bias) is less desirable
because the target function may not be within the set of hypotheses considered.
Positive Examples
x y z
2 3 5
2 5 7
4 6 10
general model: x, y, z in I (the integers)
more specific: x, y, z in I+ (the positive integers)
more specific than the first two: 1 < x, y, z < 11; x, y, z in I
even more specific model: x + y = z
Negative Examples
x y z Decision
2 3 5 Y
2 5 7 Y
4 6 10 Y
2 2 5 N
How long does the search continue? Forever; until out of memory; or until a final answer is reached.
Success Criterion

Look for a description l in L such that l is consistent with all observed examples.

Example: L = {x op y = z}, op in {+, -, *, /}

Given a precise specification of the language and data, write a program to test descriptions one by one against the examples.
It is very difficult to specify a small finite language that contains a description of the examples.
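For the small language above, the enumerate-and-test procedure is only a few lines. This is a sketch (the names are my own) using the example table from earlier:

```python
# Enumerate-and-test over the language L = {x op y = z}, op in {+, -, *, /}.
import operator

EXAMPLES = [(2, 3, 5), (4, 6, 10), (5, 2, 7)]  # (x, y, z) rows from the table
OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def consistent(op):
    """True iff 'x op y = z' holds for every observed example."""
    return all(OPS[op](x, y) == z for x, y, z in EXAMPLES)

matches = [op for op in OPS if consistent(op)]
print(matches)  # ['+'] -- the only description consistent with all examples
```

The difficulty mentioned above is choosing L in the first place; once L is fixed and finite, testing is mechanical.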
The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems.
Introduction of Terminology:

I'm not going into all the details of the terminology, assuming you know them at this point; instead, an image is shared below for reference.
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation
of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say that the purity
of the node increases with respect to the target variable. The decision tree splits the nodes on all available
variables and then selects the split which results in most homogeneous sub-nodes.
The algorithm selection is also based on the type of target variables. Let us look at some algorithms used
in Decision Trees:
CHAID → Chi-square Automatic Interaction Detection (performs multi-level splits when computing classification trees)
1. Select the root node (S) based on the lowest Entropy and the highest Information Gain.
2. On each iteration, the algorithm calculates the Entropy and Information Gain of every attribute not yet used.
3. The algorithm continues to recurse on each subset, making sure the attributes it considers are fresh, and so creates the decision tree.
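The steps above translate into a short recursive implementation. The sketch below (function names and toy data are my own) follows the ID3 recipe of re-computing entropy and information gain at every node and removing each attribute once it has been used:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Steps 1-2: pick the attribute with the highest information gain."""
    def gain(a):
        n = len(labels)
        groups = {}
        for r, y in zip(rows, labels):
            groups.setdefault(r[a], []).append(y)
        return entropy(labels) - sum(len(g) / n * entropy(g)
                                     for g in groups.values())
    return max(attributes, key=gain)

def id3(rows, labels, attributes):
    """Step 3: recurse on each subset with the chosen attribute removed."""
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not attributes:                   # nothing left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    rest = [x for x in attributes if x != a]
    branches = {}
    for value in {r[a] for r in rows}:
        sub = [(r, y) for r, y in zip(rows, labels) if r[a] == value]
        branches[value] = id3([r for r, _ in sub], [y for _, y in sub], rest)
    return {a: branches}

# Hypothetical toy data: outlook alone decides the outcome.
rows = [{'outlook': 'sunny', 'windy': 'no'}, {'outlook': 'sunny', 'windy': 'yes'},
        {'outlook': 'rain', 'windy': 'no'}, {'outlook': 'rain', 'windy': 'yes'}]
labels = ['yes', 'yes', 'no', 'no']
print(id3(rows, labels, ['outlook', 'windy']))
```

On this toy data the algorithm correctly picks 'outlook' as the root (its gain is 1 bit, versus 0 for 'windy') and both children are pure leaves, so the recursion stops after one split.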
Before going deep into the algorithm, understand some statistical terms involved in it:

1. Entropy:

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the entropy E(S) = 0, the node is completely homogeneous; we can say it is a leaf node of the tree, so it can't be divided further. ID3 splits on the attribute that yields the lowest entropy.
Formula:

E(S) = - sum over classes of p * log2 p

Here, p is the probability of each class. Some useful relations between probability and entropy (consider binary classification):

1. Probability = 0.5 for both classes → Entropy = 1, the maximum randomness.
2. Probability of either class = 0 (or 1) → Entropy = 0; the node is a leaf node, and splitting stops.

We can see this relation in the graph: if p = 0.5 on the x-axis, the entropy is 1; as the probability rises above 0.5, the entropy starts decreasing, and at entropy = 0 we have reached a leaf node.
2. Information Gain:

Information gain is based on the decrease in entropy after a data set is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
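The probability-entropy relations listed above can be checked numerically with the binary entropy function H(p) = -p log2 p - (1-p) log2(1-p):

```python
from math import log2

def binary_entropy(p):
    """H(p) for a two-class node whose positive-class probability is p."""
    if p in (0.0, 1.0):
        return 0.0        # a pure (leaf) node has zero entropy
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))   # 1.0 -> maximum uncertainty
print(binary_entropy(0.9))   # entropy falls as p moves away from 0.5
print(binary_entropy(1.0))   # 0.0 -> pure leaf node, stop splitting
```

This matches the graph described in the text: the curve peaks at p = 0.5 and falls to zero at both ends.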
Example:

Now that we understand all the concepts used in the ID3 algorithm, it's time to apply it to an example. Let's look at one.

From the above example you can see there are 4 features, and the tree derived from them decides whether the animal is a Mammal or a Reptile.

1. Find the Information Gain of all the features and select as the root node the one that is largest among them. This is how the I.G. is calculated for each feature in our example, e.g. for 'toothed'.
2. Based on the data, ID3 then finds the information gain of all the remaining features, excluding Hair, within the subset where Hair = Not hair, shown below.
3. From the information gains obtained, it selects Legs as the child node of the parent Hair and splits the tree accordingly, as we see in the image below.
4. Now, considering the subset where Legs = Legs, it finds the information gain and splits again; Hair is not considered here again.

Entropy of the current node (Legs):
Bayesian classification uses Bayes' theorem to predict the occurrence of an event. Bayesian classifiers are statistical classifiers grounded in Bayesian probability. The theorem expresses how a level of belief, expressed as a probability, should change to account for evidence.

Bayes' theorem is named after Thomas Bayes, who first used conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter.
Bayes' theorem is expressed mathematically by the following equation:

P(X|Y) = P(Y|X) P(X) / P(Y)

P(X|Y) is a conditional probability: the probability of event X occurring given that Y is true. P(Y|X) is a conditional probability: the probability of event Y occurring given that X is true. P(X) and P(Y) are the probabilities of observing X and Y independently of each other; these are known as the marginal probabilities.
Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief," and Bayes' theorem connects the degree of belief in a hypothesis before and after accounting for evidence. For example, consider a coin. If we toss a coin, we get either heads or tails, and the chance of either heads or tails is 50%. If the coin is flipped a number of times and the outcomes observed, the degree of belief may rise, fall, or remain the same depending on the outcomes.

The theorem follows from the joint probability P(X ⋂ Y) of both X and Y being true, because P(X ⋂ Y) = P(X|Y) P(Y) = P(Y|X) P(X).
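Bayes' theorem is easy to verify with concrete numbers. The probabilities below are hypothetical, chosen only to show that both factorisations of the joint probability agree:

```python
# Hypothetical probabilities for two events X and Y.
p_x = 0.30            # P(X)
p_y = 0.20            # P(Y)
p_y_given_x = 0.50    # P(Y|X)

p_joint = p_y_given_x * p_x        # P(X ⋂ Y) = P(Y|X) * P(X)
p_x_given_y = p_joint / p_y        # Bayes: P(X|Y) = P(Y|X) * P(X) / P(Y)

print(round(p_x_given_y, 2))       # 0.75
# Consistency check: both factorisations give the same joint probability.
assert abs(p_x_given_y * p_y - p_y_given_x * p_x) < 1e-12
```

Observing Y raises the belief in X from 0.30 to 0.75, which is exactly the "degree of belief" update described above.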
Bayesian network:

A Bayesian Network is a Probabilistic Graphical Model (PGM) used to compute uncertainties by means of probability. Also known as Belief Networks, Bayesian Networks represent uncertainties using Directed Acyclic Graphs (DAGs).

A Directed Acyclic Graph is used to represent a Bayesian Network and, like any other statistical graph, a DAG consists of a set of nodes and links, where the links signify the connections between the nodes.
The nodes here represent random variables, and the edges define the relationship between these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of each variable in the network.
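A minimal two-node network makes the CPT idea concrete. In this hypothetical sketch, Rain is a root node and Sprinkler depends on it; every joint probability is the product of one entry from each node's CPT:

```python
# CPT for the root node Rain (made-up numbers).
p_rain = {True: 0.2, False: 0.8}

# CPT for Sprinkler, conditioned on its parent Rain (made-up numbers).
p_sprinkler_given_rain = {True: {True: 0.01, False: 0.99},
                          False: {True: 0.40, False: 0.60}}

def joint(rain, sprinkler):
    """P(Rain, Sprinkler) = P(Rain) * P(Sprinkler | Rain)."""
    return p_rain[rain] * p_sprinkler_given_rain[rain][sprinkler]

total = sum(joint(r, s) for r in (True, False) for s in (True, False))
print(joint(False, True))   # 0.8 * 0.40
print(round(total, 10))     # the joint distribution sums to 1
```

In a larger network the same product rule extends over every node given its parents, which is exactly what the DAG structure encodes.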
What is a kernel?
The kernel is the essential foundation of a computer's operating system (OS). It is the core that provides
basic services for all other parts of the OS. It is the main layer between the OS and underlying computer
hardware, and it helps with tasks such as process and memory management, file systems, device control
and networking.
During normal system startup, a computer's basic input/output system, or BIOS, completes a hardware
bootstrap or initialization. It then runs a bootloader which loads the kernel from a storage device -- such
as a hard drive -- into a protected memory space. Once the kernel is loaded into computer memory, the
BIOS transfers control to the kernel. It then loads other OS components to complete the system startup
and make control available to users through a desktop or other user interface.
If the kernel is damaged or cannot load successfully, the computer will be unable to start completely -- if
at all. This will require service to correct hardware damage or restore the operating system kernel to a
working version.
1. It provides the interfaces needed for users and applications to interact with the computer.
2. It launches and manages applications.
3. It manages the underlying system hardware devices.

In more granular terms, accomplishing these three kernel functions involves a range of computer tasks, including the following:
organizing and managing threads and the various processes spawned by running applications;
scheduling which applications can access and use the kernel, and supervising that use when the
scheduled time occurs;
deciding which nonprotected user memory space each application process uses;
managing and optimizing hardware resources and dependencies, such as central processing unit (CPU)
and cache use, file system operation and network transport mechanisms;
managing and accessing input/output devices such as keyboards, mice, disk drives, USB ports, network
adapters and displays; and
handling device and application system calls using various mechanisms such as hardware interrupts or
device drivers.
Scheduling and management are central to the kernel's operation. Computer hardware can only do one
thing at a time. However, a computer's OS components and applications can spawn dozens and even
hundreds of processes that the computer must host. It's impossible for all of those processes to use the
computer's hardware -- such as a memory address or CPU instruction pipeline -- at the same time. The
kernel is the central manager of these processes. It knows which hardware resources are available and
which processes need them. It then allocates time for each process to use those resources.
The kernel is critical to a computer's operation, and it requires careful protection within the system's
memory. The kernel space it loads into is a protected area of memory. That protected memory space
ensures other applications and data don't overwrite or impair the kernel, causing performance problems,
instability or other negative consequences. Instead, applications are loaded and executed in a generally
available user memory space.
A kernel is often contrasted with a shell, which is the outermost part of an OS that interacts with user
commands. Kernel and shell are terms used more frequently in Unix OSes than in IBM mainframe and
Microsoft Windows systems.
A kernel is not to be confused with a BIOS, which is an independent program stored on a chip within a
computer's circuit board.
Device drivers
A key part of kernel operation is communication with hardware devices inside and outside of the physical
computer. However, it is impractical to write an OS capable of interacting with every possible device in
existence. Instead, kernels rely on the ability of device drivers to add kernel support for specialized
devices, such as printers and graphics adapters.
When an OS is installed on a computer, the installation adds device drivers for any specific devices
detected within the computer. This helps tailor the OS installation to the specific system with just enough
components to support the devices present. When a new or better device replaces an existing device,
the device driver is updated or replaced.
There are several types of device drivers. Each addresses a different data transfer type. The following are
some of the main driver types:
Character device drivers implement open, close, read, and write operations on data, and grant data-stream access for the user space.
Block device drivers provide device access for hardware that transfers randomly accessible data in
fixed blocks.
Network device drivers transmit data packets for hardware interfaces that connect to external systems.
Device drivers are classified as kernel or user. A kernel mode device driver is a generic driver that is
loaded along with the OS. These drivers are often suited to small categories of major hardware devices,
such as CPU and motherboard device drivers.
User mode device drivers encompass an array of ad hoc drivers used for aftermarket, user-added devices,
such as printers, graphics adapters, mice, advanced sound systems and other plug-and-play devices.
The OS needs the code that makes up the kernel. Consequently, the kernel code is usually loaded into an
area in the computer storage that is protected so that it will not be overlaid with less frequently used parts
of the OS.
Computer designers have long understood the importance of security and the need to protect critical
aspects of the computer's behavior. Long before the internet, or even the emergence of networks,
designers carefully managed how software components accessed system hardware and resources.
Processors were developed to support two operating modes: kernel mode and user mode.
Kernel mode
Kernel mode refers to the processor mode that enables software to have full and unrestricted access to the
system and its resources. The OS kernel and kernel drivers, such as the file system driver, are loaded into
protected memory space and operate in this highly privileged kernel mode.
User mode
User mode refers to the processor mode that enables user-based applications, such as a word processor or
video game, to load and execute. The kernel prepares the memory space and resources for that
application's use and launches the application within that user memory space.
User mode applications are less privileged and cannot access system resources directly. Instead, an
application running in user mode must make system calls to the kernel to access system resources. The
kernel then acts as a manager, scheduler and gatekeeper for those resources and works to prevent
conflicting resource requests.
The processor switches to kernel mode as the kernel processes its system calls and then switches back to
user mode to continue operating the application(s).
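This user-mode/kernel-mode round trip is visible even from a high-level language. Each `os.pipe`, `os.write`, and `os.read` call below is a thin wrapper over a system call, so the process traps into kernel mode for the I/O and returns to user mode afterwards:

```python
import os

# pipe(2) system call: ask the kernel for two file descriptors
# backed by a kernel-managed buffer.
read_fd, write_fd = os.pipe()

# write(2) system call: the kernel copies the bytes into its pipe buffer.
os.write(write_fd, b"hello, kernel")

# read(2) system call: the kernel copies the bytes back into user space.
data = os.read(read_fd, 1024)
print(data)  # b'hello, kernel'

os.close(read_fd)
os.close(write_fd)
```

The user-mode program never touches the pipe buffer directly; it only holds descriptors, and the kernel acts as the gatekeeper for every transfer, exactly as described above.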
It's worth noting that kernel and user modes are processor states and have nothing to do with actual solid-
state memory. There is nothing intrinsically safe or protected about the memory used for kernel mode.
Kernel driver crashes and memory failures within the kernel memory space can still crash the OS and the
computer.
Types of kernels
Kernels fall into three architectures: monolithic, microkernel and hybrid. The main difference between these types is the number of address spaces they support.

A monolithic kernel runs all OS services in a single address space shared with the kernel itself. A microkernel places user processes and services and kernel services in different address spaces.
A hybrid kernel, such as the Microsoft Windows NT and Apple XNU kernels, attempts to combine the
behaviors and benefits of microkernel and monolithic kernel architectures.
Overall, these kernel implementations present a tradeoff -- admins get the flexibility of more source code
with microkernels or they get increased security without customization options with the monolithic
kernel.
Some specific differences among the three kernel types include the following:
Microkernels
Microkernels keep only essential services in the kernel address space and run the rest in user space. For their communication protocol, microkernels use message passing, which sends data packets, signals and functions to the correct processes. Microkernels also provide greater flexibility than monolithic kernels: to add a new service, admins modify only the user address space.
Because of their isolated nature, microkernels are more secure than monolithic kernels. They remain
unaffected if one service within the address space fails.
Monolithic kernels
Monolithic kernels are larger than microkernels, because they house both kernel and user services in the
same address space. Monolithic kernels use a faster system call communication protocol than
microkernels to execute processes between the hardware and software. They are less flexible than
microkernels and require more work; admins must reconstruct the entire kernel to support a new service.
Monolithic kernels pose a greater security risk to systems than microkernels because, if a service fails,
then the entire system shuts down. Monolithic kernels also don't require as much source code as a
microkernel, which means they are less susceptible to bugs and need less debugging.
The Linux kernel is a monolithic kernel that is constantly growing; it had 20 million lines of code in
2018. From a foundational level, it is layered into a variety of subsystems. These main groups include a
system call interface, process management, network stack, memory management, virtual file system and
device drivers.
Administrators can port the Linux kernel into their OSes and run live updates. These features, along with
the fact that Linux is open source, make it more suitable for server systems and environments that require
real-time maintenance.
Hybrid kernels
Apple developed the XNU OS kernel in 1996 as a hybrid of the Mach and Berkeley Software Distribution
(BSD) kernels and paired it with an Objective-C application programming interface or API. Because it is
a combination of the monolithic kernel and microkernel, it has increased modularity, and parts of the OS
gain memory protection.
Before the kernel, developers coded actions directly to the processor, instead of relying on an OS to
complete interactions between hardware and software.
The first attempt to create an OS that used a kernel to pass messages was in 1969 with the RC 4000
Multiprogramming System. Programmer Per Brinch Hansen discovered it was easier to create a nucleus
and then build up an OS, instead of converting existing OSes to be compatible with new hardware. This
nucleus -- or kernel -- contained all source code to facilitate communications and support systems,
eliminating the need to directly program on the CPU.
After RC 4000, Bell Labs researchers started work on Unix, which radically changed OS development
and kernel development and integration. The goal of Unix was to create smaller utilities that do specific
tasks well instead of having system utilities try to multitask. From a user standpoint, this simplifies
creating shell scripts that combine simple tools.
As Unix adoption increased, the market started to see a variety of Unix-like computer OSes, including
BSD, NeXTSTEP and Linux. Unix's structure perpetuated the idea that it was easier to build a kernel on
top of an OS that reused software and had consistent hardware, instead of relying on a time-shared system
that didn't require an OS.
Unix brought OSes to more individual systems, but researchers at Carnegie Mellon expanded kernel
technology. From 1985 to 1994, they expanded work on the Mach kernel. Unlike BSD, the Mach kernel
is OS-agnostic and supports multiple processor architectures. Researchers made it binary-compatible with
existing BSD software, enabling it to be available for immediate use and continued experimentation.
The Mach kernel's original goal was to be a cleaner version of Unix and a more portable version of
Carnegie Mellon's Accent interprocessor communications (IPC) kernel. Over time, the kernel brought
new features, such as ports and IPC-based programs, and ultimately evolved into a microkernel.
Shortly after the Mach kernel, in 1986, Vrije Universiteit Amsterdam developer Andrew Tanenbaum
released MINIX (mini-Unix) for educational and research uses. This distribution contained a microkernel-
based structure, multitasking, protected mode, extended memory support and an American National
Standards Institute C compiler.
The next major advancement in kernel technology came in the early 1990s with the release of the Linux kernel. Linus Torvalds developed it as a hobby, but he licensed the kernel under the General Public License (GPL), making it open source. It was first released with 176,250 lines of code.
The majority of OSes, and their kernels, can be traced back to Unix, but there is one outlier: Windows. With the popularity of DOS- and IBM-compatible PCs, Microsoft built its early OSes on DOS and later developed the NT kernel independently of Unix. That is why writing commands for Windows differs from Unix-based systems.
Bias is a systematic error that occurs in the machine learning model itself due to incorrect assumptions in the ML process. Technically, we can define bias as the error between the average model prediction and the ground truth. Moreover, it describes how well the model matches the training data set:

A model with higher bias will not match the data set closely.
A model with low bias will closely match the training data set.

Both contribute to the flexibility of the model. For instance, a high-bias model that does not match the data set well is an inflexible model with low variance, which results in a suboptimal machine learning model.
Underfitting occurs when the model is unable to match the input data to the target data. This happens
when the model is not complex enough to match all the available data and performs poorly with the
training dataset.
Overfitting relates to instances where the model matches the data too closely, including its noise. This occurs with highly complex models, where the model matches almost all the given data points and performs well on the training data set. However, such a model cannot generalize to the test data set to predict outcomes accurately.
When a data engineer modifies the ML algorithm to better fit a given data set, it leads to low bias but increased variance. The model will fit that data set closely while the chance of inaccurate predictions on new data rises.
The same applies when creating a low variance model with a higher bias. While it will reduce the risk of
inaccurate predictions, the model will not properly match the data set.
It's a delicate balance between bias and variance. Importantly, however, having higher variance does not by itself indicate a bad ML algorithm. Machine learning algorithms should be able to handle some variance.
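The trade-off can be demonstrated with two extreme models on the same noisy data (the data and both models below are hypothetical): a constant "predict the training mean" model (high bias, low variance) and a 1-nearest-neighbour model that memorises the training set (low bias, high variance):

```python
import random

random.seed(0)

def noise():
    return random.uniform(-0.5, 0.5)

# Noisy samples of the true relation y = 2x.
train = [(x, 2 * x + noise()) for x in range(20)]
test = [(x + 0.5, 2 * (x + 0.5) + noise()) for x in range(20)]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# High bias, low variance: ignore x and always predict the training mean.
mean_y = sum(y for _, y in train) / len(train)

def mean_model(x):
    return mean_y

# Low bias, high variance: memorise the training set (1-nearest neighbour).
def nn_model(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

print(mse(mean_model, train), mse(nn_model, train))  # 1-NN is perfect on train
print(mse(mean_model, test), mse(nn_model, test))    # but imperfect on new data
```

The memorising model has zero training error but nonzero test error (it reproduces the training noise), while the mean model misses badly everywhere: the underfitting and overfitting behaviours described above.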
Increasing the complexity of the model to account for bias and variance decreases the overall bias while keeping the variance at an acceptable level. This aligns the model with the training data set without incurring significant variance errors.
Increasing the training data set can also help to balance this trade-off, to some extent. This is the preferred method when dealing with overfitting models, as it allows users to increase the model's complexity without the variance errors that pollute the model when the data set is small.

A large data set offers more data points for the algorithm to generalize from. However, underfitting (high-bias) models are not very sensitive to the size of the training data set. Therefore, increasing the data is the preferred solution when dealing with high-variance, low-bias models.
This table lists common algorithms and their expected behavior regarding bias and variance:

Algorithm            Bias         Variance
Linear Regression    High Bias    Less Variance
Decision Tree        Low Bias     High Variance
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques
in machine learning to solve more than two-class classification problems. It is also known as Normal
Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA).
LDA can be used to project the features of a higher-dimensional space into a lower-dimensional space in order to reduce resource and dimensional costs. In this topic, "Linear Discriminant Analysis (LDA) in machine learning", we will discuss the LDA algorithm for classification predictive modeling problems, the limitations of logistic regression, the representation of the linear discriminant analysis model, how to make a prediction using LDA, how to prepare data for LDA, extensions to LDA, and much more. So, let's start with a quick introduction to Linear Discriminant Analysis (LDA) in machine learning.
Note: Before starting this topic, it is recommended to learn the basics of the logistic regression algorithm and to have a basic understanding of classification problems in machine learning as a prerequisite.
Whereas the logistic regression algorithm is limited to two-class problems, linear discriminant analysis is applicable to classification problems with more than two classes.
Linear Discriminant analysis is one of the most popular dimensionality reduction techniques used for
supervised classification problems in machine learning. It is also considered a pre-processing step for
modeling differences in ML and applications of pattern classification.
Whenever there is a requirement to separate two or more classes with multiple features efficiently, the Linear Discriminant Analysis model is the most common technique used to solve such classification problems. For example, suppose we have two classes with multiple features and need to separate them efficiently. If we classify them using a single feature, the classes may overlap. To overcome this overlapping issue, we must keep increasing the number of features used.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the image below. It may be impossible to draw a straight line in the 2-D plane that separates these data points efficiently, but using linear discriminant analysis we can reduce the 2-D plane to a 1-D plane. Using this technique, we can also maximize the separability between multiple classes.
Linear Discriminant analysis is used as a dimensionality reduction technique in machine learning, using
which we can easily transform a 2-D and 3-D graph into a 1-dimensional plane.
Let's consider an example where we have two classes in a 2-D plane with an X-Y axis, and we need to classify them efficiently. As we saw in the example above, LDA enables us to draw a straight line that can completely separate the two classes of data points. Here, LDA uses the X-Y data to create a new axis, separating the classes with a straight line and projecting the data onto this new axis. Hence, we can maximize the separation between these classes and reduce the 2-D plane to 1-D.
To create the new axis, Linear Discriminant Analysis uses the following two criteria:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation (scatter) within each class.
Using these two conditions, LDA generates a new axis that maximizes the distance between the means of the two classes while minimizing the variation within each class.
In other words, the new axis increases the separation between the data points of the two classes once they are projected onto it.
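As a hedged sketch of this idea, scikit-learn's `LinearDiscriminantAnalysis` can project 2-D data onto a single new axis as described above. The synthetic data, class centers, and random seed below are illustrative assumptions, not from the text:

```python
# A minimal sketch of LDA as a dimensionality reducer, using scikit-learn.
# The two-class, two-feature data here is synthetic and purely illustrative.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two Gaussian blobs in a 2-D plane (X-Y axis), one per class
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([4, 4], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# With 2 classes, LDA can produce at most 1 discriminant axis (n_classes - 1)
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)  # 2-D points projected onto the new 1-D axis

print(X.shape, "->", X_1d.shape)  # (100, 2) -> (100, 1)
```

On the projected axis, the two class means end up well separated while each class stays tightly clustered, which is exactly the two-criteria behavior described above.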
Why LDA?
o Logistic Regression is one of the most popular classification algorithms and performs well for binary classification, but it falls short on multi-class problems with well-separated classes. LDA handles these quite efficiently.
o Like PCA, LDA can also be used in data pre-processing to reduce the number of features, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract useful data
from different faces. Coupled with eigenfaces, it produces effective results.
LDA is specifically used to solve supervised classification problems with two or more classes, which is not possible using logistic regression. However, LDA fails in cases where the means of the class distributions are shared: it then cannot create a new axis that makes the classes linearly separable.
To overcome such problems, non-linear discriminant analysis is used in machine learning.
Linear Discriminant Analysis is one of the simplest and most effective methods for solving classification problems in machine learning. It has many extensions and variations, including the following:
1. Quadratic Discriminant Analysis (QDA): With multiple input variables, each class uses its own estimate of variance (or covariance).
2. Flexible Discriminant Analysis (FDA): Used when non-linear combinations of inputs, such as splines, are required.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.
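To illustrate the QDA variant listed above, the following sketch compares scikit-learn's LDA and QDA on two classes whose variances differ, which is the situation QDA's per-class variance estimate addresses. The synthetic data and all parameter values are assumptions for illustration:

```python
# Illustrative comparison of LDA and QDA on synthetic data. QDA lets each
# class keep its own covariance estimate, while LDA assumes a shared one.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(1)
# Two classes with clearly different variances
X = np.vstack([rng.normal(0, 0.5, (100, 2)),   # tight cluster
               rng.normal(3, 2.0, (100, 2))])  # wide cluster
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print("LDA accuracy:", lda.score(X, y))
print("QDA accuracy:", qda.score(X, y))
```

When the equal-covariance assumption is badly violated, QDA's quadratic decision boundary typically fits the data better than LDA's linear one.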
Some of the common real-world applications of Linear Discriminant Analysis are given below:
o Face Recognition
Face recognition is a popular application of computer vision in which each face is represented as a combination of a large number of pixel values. Here, LDA is used to reduce the number of features to a manageable number before the classification process. It generates a new template in which each dimension is a linear combination of pixel values. When the linear combination is generated using Fisher's linear discriminant, the result is called a Fisher face.
o Medical
In the medical field, LDA is widely applied to classify a patient's disease on the basis of various health parameters and the ongoing medical treatment. On such parameters, it classifies the disease as mild, moderate, or severe. This classification helps doctors decide whether to increase or decrease the pace of treatment.
o Customer Identification
LDA is also applied in customer identification: it helps identify and select the features that characterize the group of customers most likely to purchase a specific product in a shopping mall.
o For Predictions
LDA can also be used for making predictions and thus in decision-making. For example, "Will you buy this product?" yields a prediction of one of two possible classes: buying or not buying.
o In Learning
Nowadays, robots are being trained to learn and talk in order to simulate human behavior, which can also be treated as a classification problem. In this case, LDA builds similar groups on the basis of parameters such as pitch, frequency, sound, and tune.
Below are some suggestions that one should always consider while preparing the data to build the LDA
model:
o Classification Problems: LDA is mainly applied for classification problems to classify the
categorical output variable. It is suitable for both binary and multi-class classification problems.
o Gaussian Distribution: The standard LDA model assumes a Gaussian distribution of the input variables. Review the univariate distribution of each attribute and transform any that deviate into more Gaussian-looking distributions; for example, use log or root transforms for exponential distributions and Box-Cox for skewed distributions.
o Remove Outliers: It is good practice to remove outliers from your data first, because outliers can skew the basic statistics used to separate classes in LDA, such as the mean and the standard deviation.
o Same Variance: Since LDA assumes that all input variables have the same variance, it is best to standardize the data before fitting an LDA model, giving each feature a mean of 0 and a standard deviation of 1.
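The "Same Variance" tip above can be sketched with scikit-learn's `StandardScaler`. The feature names and scales below are hypothetical:

```python
# Sketch of the "same variance" tip: standardize features before LDA so each
# input has mean 0 and standard deviation 1. The data here is synthetic.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two features on very different scales (e.g. age vs. income, hypothetically)
X = np.column_stack([rng.normal(40, 10, 200),
                     rng.normal(50_000, 15_000, 200)])

X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0).round(6))  # ~[0, 0]
print(X_std.std(axis=0).round(6))   # ~[1, 1]
```

In a pipeline, the scaler would be fitted on training data only and reused to transform test data, so the LDA model never sees unstandardized inputs.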
Introduction
Pruning is a technique in machine learning that reduces the size of a trained model by eliminating some of its parameters. The objective of pruning is to produce a smaller, faster, and more efficient model while maintaining its accuracy. Pruning can be especially useful for large and complex models, where reducing their size can lead to significant improvements in speed and efficiency.
There are two principal types of pruning techniques: unstructured and structured pruning. Unstructured
pruning involves eliminating individual parameters or connections from the model, resulting in a smaller
and sparser model. Structured pruning involves eliminating groups of parameters, such as whole filters,
channels, or neurons.
Structured Pruning:
Structured pruning involves eliminating whole structures or groups of parameters from the model, such as entire neurons, channels, or filters. This kind of pruning preserves the underlying structure of the model, meaning that the pruned model has the same overall architecture as the original, but with fewer parameters.
Structured pruning suits models with a structured architecture, such as convolutional neural networks (CNNs), where the parameters are organized into filters, channels, and layers. It is also easier to implement than unstructured pruning, since it preserves the structure of the model.
Unstructured Pruning:
Unstructured pruning involves eliminating individual parameters from the model regardless of their location. This kind of pruning does not preserve the underlying structure of the model, meaning that the pruned model has a different architecture from the original. Unstructured pruning suits models without a structured architecture, such as fully connected neural networks, where the parameters are organized into a single matrix. It tends to be more effective than structured pruning because it allows for more fine-grained pruning, but it can also be more difficult to implement.
The choice of pruning technique depends on several factors, such as the type of model, the availability of computational resources, and the degree of accuracy required. For instance, structured pruning is more suitable for convolutional neural networks, while unstructured pruning is more relevant for fully connected networks. The decision to prune should also consider the trade-off between model size and accuracy. Other factors include the complexity of the model, the size of the training data, and the model's performance metrics.
Neural networks are a kind of machine learning model that can benefit greatly from pruning. The objective of pruning a neural network is to reduce the number of parameters in the network, producing a smaller and faster model without sacrificing accuracy.
There are several types of pruning techniques that can be applied to neural networks, including weight
pruning, neuron pruning, channel pruning, and filter pruning.
1. Weight Pruning
Weight pruning is the most common pruning technique used in neural networks. It involves setting some of the weights in the network to zero or eliminating them entirely, resulting in a sparser network that is faster and more efficient than the original. Weight pruning can be done in several ways, including magnitude-based pruning, which removes the smallest-magnitude weights, and iterative pruning, which removes weights gradually during training.
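Magnitude-based weight pruning, mentioned above, can be sketched on a single weight matrix with NumPy. The matrix size and the 50% pruning rate are arbitrary illustrative choices:

```python
# A minimal sketch of magnitude-based weight pruning on one weight matrix.
import numpy as np

rng = np.random.default_rng(3)
weights = rng.normal(0, 1, (8, 8))  # stand-in for one layer's weights

def magnitude_prune(w, rate):
    """Zero out the `rate` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), rate)
    mask = np.abs(w) >= threshold  # keep only large-magnitude weights
    return w * mask, mask

pruned, mask = magnitude_prune(weights, rate=0.5)
sparsity = 1 - mask.mean()
print(f"sparsity after pruning: {sparsity:.2f}")  # roughly 0.50
```

In practice, frameworks keep the mask around so that the zeroed weights stay zero during any subsequent fine-tuning; this sketch only shows the selection step.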
2. Neuron Pruning
Neuron pruning involves eliminating whole neurons from the network, which can be useful for reducing the size of the network and improving its speed and efficiency. Neuron pruning can be done in several ways, including threshold-based pruning, which removes neurons with small activation values, and sensitivity-based pruning, which removes neurons that only slightly affect the output.
3. Channel Pruning
Channel pruning is a technique used in convolutional neural networks (CNNs) that involves eliminating whole channels from the network. A channel in a CNN corresponds to a group of filters that learn to detect a particular feature. Eliminating unnecessary channels can reduce the size of the network and improve its speed and efficiency without sacrificing accuracy.
4. Filter Pruning
Filter pruning involves eliminating whole filters from the network. A filter in a CNN corresponds to a set of weights that learn to detect a particular feature. Eliminating unnecessary filters can reduce the size of the network and improve its speed and efficiency without sacrificing accuracy.
Pruning can also be applied to decision trees, a kind of machine learning model that learns a series of binary decisions based on the input features. Decision trees can become very large and complex, leading to overfitting and reduced generalization ability. Pruning removes unnecessary branches and nodes from the tree, resulting in a smaller and simpler model that is less likely to overfit.
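As one concrete, hedged instance of decision-tree pruning, scikit-learn supports cost-complexity post-pruning via the `ccp_alpha` parameter of `DecisionTreeClassifier`. The value 0.02 below is an illustrative choice, not a recommendation:

```python
# Sketch of cost-complexity (post-)pruning for a decision tree in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unpruned tree grows until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)
# ccp_alpha > 0 prunes branches whose complexity outweighs their benefit
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print("unpruned nodes:", full.tree_.node_count)
print("pruned nodes:  ", pruned.tree_.node_count)
```

In a real workflow, `ccp_alpha` would be tuned with cross-validation (scikit-learn's `cost_complexity_pruning_path` enumerates the candidate values) so the pruned tree trades a little training accuracy for better generalization.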
Pruning can also be applied to support vector machines (SVMs), a kind of machine learning model that separates data points into classes using a hyperplane. SVMs can become very large and complicated, resulting in slow and inefficient predictions. Pruning removes unnecessary support vectors from the model, resulting in a smaller and faster model that remains accurate.
Advantages
o Decreased model size and complexity. Pruning can significantly reduce the number of parameters in a machine learning model, producing a smaller and simpler model that is easier to train and deploy.
o Faster inference. Pruning can reduce the computational cost of making predictions, leading to faster and more efficient inference.
o Improved generalization. Pruning can prevent overfitting and improve the generalization ability of the model by reducing its complexity.
o Increased interpretability. Pruning can result in a simpler and more interpretable model, making it easier to understand and explain the model's decisions.
Disadvantages
o Possible loss of accuracy. Pruning can sometimes result in a loss of accuracy, especially if too many parameters are pruned or pruning is not done carefully.
o Increased training time. Pruning can increase the training time of the model, especially when it is done iteratively during training.
o Difficulty in choosing the right pruning technique. Choosing the right pruning technique can be challenging and may require domain expertise and experimentation.
o Risk of over-pruning. Over-pruning can lead to an overly simplified model that is not accurate enough for the task.
The choice of pruning technique depends on the specific characteristics of the model and the task at hand. Structured pruning is suitable for models with a structured architecture, while unstructured pruning is suitable for models without one.
The pruning rate determines the proportion of parameters to be pruned. It should be chosen carefully to balance the reduction in model size against the loss of accuracy.
The effect of pruning on the model's accuracy should be evaluated using appropriate metrics, such as validation accuracy or test accuracy.
Iterative pruning involves pruning the model multiple times during training, which can yield better results than a single pruning pass at the end of training.
Pruning can be combined with other regularization techniques, such as L1 and L2 regularization or dropout, to improve the model's performance further.
o Beware of over-pruning:
Over-pruning can lead to an overly simplified model that is not accurate enough for the task. Careful attention should be given to choosing the right pruning rate and to assessing the effect on the model's accuracy.
Conclusion:
Pruning is a useful technique in machine learning for reducing the size and complexity of trained models. There are various types of pruning techniques, and selecting the right one depends on several factors. Pruning should be done carefully to achieve the desired balance between model size and accuracy, and its effect should be evaluated using suitable metrics. Overall, pruning can be an effective way to build smaller, faster, and more efficient models without sacrificing accuracy.