
MACHINE LEARNING

 “Machine Learning is a subset of artificial intelligence. It focuses
mainly on the design of systems that can learn from data and make
predictions based on it”.

 The machine learning algorithm is trained using a labelled or unlabelled
training data set to produce a model. When new input data is introduced to the
ML algorithm, the model makes a prediction; the prediction is then evaluated
for accuracy, and if the accuracy is acceptable, the model is deployed.

1. Supervised Machine Learning
 
In supervised learning, you train your model on a
labelled dataset, which means we have both the raw input
data and its corresponding results. We split our data into a
training dataset and a test dataset: the training dataset is
used to train our model, whereas the test dataset acts as
new data for predicting results and for measuring the
accuracy of our model.

Some algorithms for supervised learning
 
1. Linear Regression
2. Random Forest
3. Support Vector Machines (SVM)
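As an illustration of this workflow (not from the original slides), here is a minimal scikit-learn sketch that trains a Random Forest on a labelled synthetic dataset, using a train/test split to measure accuracy; the dataset and parameter values are arbitrary:

```python
# Minimal supervised-learning sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Labelled data: X holds the raw inputs, y holds the known results.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)            # train on the labelled training set
y_pred = model.predict(X_test)         # the test set acts as new, unseen data
print("Test accuracy:", accuracy_score(y_test, y_pred))
```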

Applications of Supervised Learning
 
•Sentiment Analysis: a natural language processing technique in
which we analyze text data and categorize the meaning or sentiment
it expresses.

•Recommendations: e-commerce and media sites use recommendation
systems to suggest products and new releases to their customers or
users on the basis of their activities.

•Spam Filtration: detecting spam emails is a very helpful application;
these filtering techniques can also detect viruses, malware, and
harmful URLs.

2. Unsupervised Machine Learning
 
Unsupervised learning studies how systems can
infer a function that describes hidden structure in
unlabeled data. The system does not predict a known
output; instead, it explores the data and draws
inferences from datasets to describe the hidden
structures it finds in the unlabeled data.
Some examples of models that belong to this family
are PCA, K-means, DBSCAN, and mixture models.
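As a sketch (not from the slides), K-means, one of the models listed above, can be run with scikit-learn on unlabeled data; the blob dataset and the cluster count are arbitrary choices:

```python
# Minimal unsupervised-learning sketch (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the hidden structure K-means inferred
print(kmeans.labels_[:10])       # cluster assigned to the first 10 points
```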
Applications:

Association rules allow you to establish
associations amongst data objects inside large
datasets by identifying relationships between
variables in a given dataset, as in market basket
analysis and recommendation engines.

Anomaly detection is an unsupervised technique
for identifying anomalous data in a dataset. It is
used for fault diagnosis, network intrusion
detection, and fraud detection.
Reinforcement learning

In reinforcement learning, an agent interacts with
the environment by sensing its state and learns to
take actions in order to maximize long-term
reward. As the agent takes actions, it needs to
maintain a balance between exploration and
exploitation: it tries a variety of actions through
trial and error while favoring the actions that yield
the maximum reward in the future.
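To make the exploration/exploitation balance concrete, here is an illustrative epsilon-greedy sketch on a toy multi-armed bandit (plain Python; the reward probabilities are made up for the example):

```python
import random

true_rewards = [0.2, 0.5, 0.8]   # hypothetical success rate of each action
estimates = [0.0, 0.0, 0.0]      # the agent's running value estimates
counts = [0, 0, 0]
epsilon = 0.1                    # fraction of steps spent exploring

for step in range(1000):
    if random.random() < epsilon:
        action = random.randrange(3)              # explore: try a random action
    else:
        action = estimates.index(max(estimates))  # exploit: best action so far
    reward = 1 if random.random() < true_rewards[action] else 0
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

print("estimated action values:", [round(e, 2) for e in estimates])
```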
Deep learning
Deep learning, on the other hand, is a type of
machine learning inspired by the structure of the
human brain. Deep learning algorithms attempt
to draw conclusions similar to those a human
would by continually analyzing data with a
given logical structure. To achieve this, deep
learning uses a multi-layered structure of
algorithms called neural networks.
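As a small sketch of that multi-layered structure (assuming scikit-learn; the dataset and layer sizes are arbitrary), a two-hidden-layer network can be trained in a few lines:

```python
# Minimal neural-network sketch (assumes scikit-learn).
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=1)
net = MLPClassifier(hidden_layer_sizes=(16, 16),  # two hidden layers
                    max_iter=2000, random_state=1)
net.fit(X, y)
print("training accuracy:", net.score(X, y))
```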
Feature selection

Feature selection means selecting and retaining only
the most important features in the model.

Wrapper Method
•The main idea behind a wrapper method is to select
which set of features works best for a machine learning
model.
•It typically follows a greedy search approach, evaluating
candidate combinations of features against an evaluation
criterion.

1. Forward Selection
Forward selection is an iterative method: in each iteration, we add the feature
that best improves our model, until the addition of a new variable no longer
improves the performance of the model.
2. Backward Selection
In backward elimination, we start with all the features and remove the least
significant feature at each iteration, as long as this improves the performance of
the model. We repeat this until no improvement is observed on removal of a feature.
3. Exhaustive Feature Selection
This is the most robust feature selection method covered so far: a brute-force
evaluation of each feature subset. It tries every possible combination of the
variables and returns the best-performing subset.
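A hedged sketch of forward and backward selection, using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24 and later; the estimator and the number of features to keep are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
est = LogisticRegression(max_iter=5000)

# direction="forward" adds the best feature at each iteration;
# direction="backward" starts from all features and removes the weakest.
sfs = SequentialFeatureSelector(est, n_features_to_select=5,
                                direction="forward")
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```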

Filter Method

Filter Method Types

1. Basic Statistical Filter Methods
2. Correlation & Ranking based Statistical Filter Methods
3. Statistical Test based Methods
1. Basic Statistical Filter Methods
Variance Threshold
Removing numerical features with low variance
•We simply compute the variance of each feature, and we
select the subset of features based on a user-specified
threshold.
•We assume that features with a higher variance may contain
more useful information.
•This feature selection algorithm looks only at the
features (X), not the desired outputs (y), and can thus be
used for unsupervised learning.
•It does not take the relationships between feature variables,
or between features and the target, into account, which is
one of the drawbacks of the Variance Threshold filter method.
•It is applicable only to numerical features.
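A minimal sketch of the variance threshold filter (assumes scikit-learn; the toy matrix and the threshold value are made up for illustration):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 1.0],
              [0.0, 1.0, 4.0],
              [0.1, 1.5, 0.0]])        # the first column barely varies

selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)  # looks only at X, never at y
print("kept columns:", selector.get_support(indices=True))  # drops column 0
```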
2. Correlation & Ranking based Statistical Filter Methods
•Pearson's correlation coefficient (linear data)
•Spearman's rank coefficient (linear and nonlinear)
Covariance
•Covariance measures the directional relationship between two or more variables.
•The sign of the covariance can be interpreted as whether the two variables change in the same direction
(positive) or change in different directions (negative).
•The magnitude of the covariance is not easily interpreted. A covariance of zero indicates no linear
relationship between the two variables (it does not by itself mean they are independent).
•Example: when two stocks tend to move together, they are seen as having a positive covariance; when
they move inversely, the covariance is negative.
Correlation
1. Correlation states how the features are related to each other or to the target variable.
2. Correlation is defined as a measure of the linear relationship between two quantitative variables, like
height and weight. You could also define correlation as a measure of how strongly one variable depends on
another.
3. Correlation can be positive (the values of one variable increase as the values of another increase) or
negative (the values of one variable decrease as the values of another increase).
4. A high correlation is often a useful property: if two variables are highly correlated, we can predict one
from the other. Therefore, we generally look for features that are highly correlated with the target,
especially for linear machine learning models.
5. However, if two variables are highly correlated among themselves, they provide redundant information
with regard to the target. Essentially, we can make an accurate prediction of the target with just one of the
redundant variables. In these cases, the second variable does not add additional information, so removing it
helps reduce the number of features without hurting performance.
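A small sketch of inspecting correlations with pandas (the toy data is invented; swap in your own DataFrame):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 66, 74, 82],
    "shoe":   [36, 38, 40, 43, 45],
})
# Pearson for linear data; pass method="spearman" for rank correlation.
corr = df.corr(method="pearson")
print(corr.round(2))   # highly correlated pairs are candidates for removal
```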
3. Statistical Test based Methods

ANOVA or F-Test
A univariate test: a linear model is used for testing the individual effect of
each feature on the target.
•ANOVA assumes a linear relationship between the features and the target, and
also that the variables are normally distributed.
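A hedged sketch of the ANOVA F-test filter using scikit-learn's f_classif scorer (the Iris dataset and k=2 are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
# f_classif scores each feature's individual (univariate) effect on y.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("F-scores:", selector.scores_.round(1))
print("kept feature indices:", selector.get_support(indices=True))
```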
Feature Selection using Mutual Information (MI)

•Mutual information is a measure of dependence or “mutual dependence” between
two random variables (X and Y).

•It measures the amount of information obtained about one variable through
observing the other variable. In other words, it determines how much we can know
about one variable by understanding another; it is a little like correlation, but
mutual information is more general.

•In machine learning, mutual information measures how much information the
presence/absence of a feature contributes to making the correct prediction on Y.
•The mutual information between two random variables X and Y can be stated
formally as follows:
•I(X ; Y) = H(X) – H(X | Y)
• where I(X ; Y) is the mutual information for X and Y,
• H(X) is the entropy for X, and H(X | Y) is the conditional entropy for X
given Y.
•Because it measures mutual dependence, the measure is symmetrical, meaning that
I(X ; Y) = I(Y ; X).
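A minimal sketch of mutual-information feature scoring (assumes scikit-learn; the Iris dataset is an arbitrary example):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
# Unlike correlation, MI also captures non-linear dependence on the target.
mi = mutual_info_classif(X, y, random_state=0)
print("MI of each feature with the target:", mi.round(2))
```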
Categorical Feature Selection via Chi-Square Test of
Independence

• In our everyday data science work, we often encounter categorical features.
It can be unclear how to handle these features, especially when we want to create
a prediction model, since such models are basically equations that accept numbers,
not categories.
• One way is to encode each categorical variable using the OneHotEncoding
method (encode each categorical class as the numerical values 0 and 1, where 0
means absent and 1 means present).
• This method is preferred by many because the information is preserved and the
concept is easy to understand. The downside is that when we possess many
categorical features with high cardinality, the number of features after the
OneHotEncoding process becomes massive.
• While adding features can decrease the error of our prediction model, it
does so only up to a certain number of features; after that, the error will
increase again. This is the concept of the curse of dimensionality.
• There are many ways to alleviate this problem; here we will use feature
selection via the Chi-Square test of independence.
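A hedged sketch of the Chi-Square test of independence (assumes pandas and SciPy; the tiny click dataset is invented). A small p-value suggests the categorical feature and the target are dependent, so the feature is worth keeping:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "browser": ["chrome", "chrome", "firefox", "safari", "firefox", "chrome"],
    "clicked": ["yes", "yes", "no", "no", "no", "yes"],
})
table = pd.crosstab(df["browser"], df["clicked"])  # contingency table
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p-value={p:.3f}")
```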
Embedded Method

Wrapper methods provide better results in terms of
performance, but they also cost us a lot of computation
time and resources. So what if we could include the feature
selection process in the ML model training itself? That
could lead us to even better features for that model, in a
shorter amount of time. This is where embedded
methods come into play.
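One common embedded method (named here as an example, not taken from the slides) is L1-regularized regression: Lasso drives the coefficients of weak features to zero during training itself. A minimal scikit-learn sketch, with an arbitrary dataset and penalty strength:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
# Selection happens inside the model fit: features whose Lasso
# coefficients are driven to zero are discarded.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print("kept feature indices:", selector.get_support(indices=True))
```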

Normalization

• min-max normalization

• z-score normalization

• normalization by decimal scaling.


• Min-max normalization performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values of an attribute A.
Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] by
computing

v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

For example,

suppose that the maximum and minimum values for the attribute income are $98,000
and $12,000, respectively. We would like to map income to the range [0, 1].

By min-max normalization, a value of $73,600 for income is transformed to
((73,600 − 12,000) / (98,000 − 12,000)) × (1 − 0) + 0 = 0.716

z-score normalization
In z-score normalization, the values for an attribute A are normalized based on the
mean and standard deviation of A. A value v of A is normalized to v' by computing

v' = (v − meanA) / std_devA

For example,
suppose that the mean and standard deviation of the values for the attribute
income are $54,000 and $16,000, respectively. With z-score normalization, a value
of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225.
Normalization by decimal scaling

Decimal scaling normalizes by moving the decimal point of values of attribute A. The
number of decimal points moved depends on the maximum absolute value of A. A value
v of A is normalized to v' by computing

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

Suppose that the recorded values of A range from −986 to 917. The maximum absolute
value of A is 986. To normalize by decimal scaling, we therefore divide each value by
1,000 (i.e., j = 3), so that −986 normalizes to −0.986.
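The three normalizations can be written as small Python functions and checked against the worked income examples above:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Linear transformation into [new_min, new_max].
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Normalize using the attribute's mean and standard deviation.
    return (v - mean) / std

def decimal_scaling(v, j):
    # Move the decimal point j places.
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(round(z_score(73600, 54000, 16000), 3))  # 1.225
print(decimal_scaling(-986, 3))                # -0.986
```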
Dimensionality reduction

Dimensionality reduction can be applied to obtain a
reduced representation of the data set that is much
smaller in volume, yet closely maintains the
integrity of the original data. That is, mining on the
reduced data set should be more efficient yet
produce the same (or almost the same) analytical
results.
Principal components analysis
The basic procedure is as follows.

1.The input data are normalized, so that each attribute falls within the same range. This step helps
ensure that attributes with large domains will not dominate attributes with smaller domains.

2. PCA computes N orthonormal vectors which provide a basis for the normalized input data. These
are unit vectors that each point in a direction perpendicular to the others. These vectors are
referred to as the principal components. The input data are a linear combination of the principal
components.

3. The principal components are sorted in order of decreasing strength. The principal components
essentially serve as a new set of axes for the data, providing important information about variance.
That is, the sorted axes are such that the first axis shows the most variance among the data, the
second axis shows the next highest variance, and so on. This information helps identify groups or
patterns within the data.

4. Since the components are sorted according to decreasing order of significance, the size of the
data can be reduced by eliminating the weaker components, i.e., those with low variance. Using the
strongest principal components, it should be possible to reconstruct a good approximation of the
original data.
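A minimal PCA sketch following these steps (assumes scikit-learn; the Iris dataset and keeping two components are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # step 1: normalize the attributes
pca = PCA(n_components=2)                   # keep the 2 strongest components
X_reduced = pca.fit_transform(X_std)        # steps 2-4
print("variance explained:", pca.explained_variance_ratio_.round(2))
```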
Linear Discriminant Analysis
Linear Discriminant Analysis (also called Normal Discriminant
Analysis or Discriminant Function Analysis) is a dimensionality
reduction technique that is commonly used for supervised
classification problems. It is used for modelling differences in
groups, i.e., separating two or more classes. It is used to project
the features in a higher-dimensional space into a lower-dimensional
space.
 
For example, suppose we have two classes and we need to separate them
efficiently. Classes can have multiple features. Using only a
single feature to classify them may result in some overlap
between the classes, so we keep increasing the number of
features until the classes can be separated properly.
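A minimal LDA sketch (assumes scikit-learn; Iris is an arbitrary labelled dataset):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)          # supervised: uses the class labels y
print("projected shape:", X_proj.shape)   # (150, 2)
```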