
UNIT - 1

WHAT IS MACHINE LEARNING?


Machine learning is a branch of artificial intelligence (AI) and computer science
which focuses on the use of data and algorithms to make computers imitate the way
that humans learn, gradually improving their accuracy.

Why is Machine Learning Important?


● Machine learning is growing in importance due to increasingly enormous
volumes and variety of data, the access and affordability of computational
power, and the availability of high-speed Internet.
● There are a multitude of use cases that machine learning can be applied to in
order to cut costs, mitigate risks, and improve overall quality of life, including
recommending products/services, detecting cybersecurity breaches, and
enabling self-driving cars.
● With greater access to data and computational power, machine learning is
becoming more ubiquitous every day and will soon be integrated into many
facets of human life.

BIG DATA IN CONTEXT OF MACHINE LEARNING


Big data refers to extremely large sets of structured and unstructured data that
cannot be handled with traditional methods. Big data analytics can make sense of
the data by uncovering trends and patterns. Machine learning can accelerate this
process with the help of decision-making algorithms. It can categorize the incoming
data, recognize patterns and translate the data into insights helpful for business
operations.
Types of Machine Learning
Machine learning is sub-categorized into three types:

● Supervised Learning – Train Me!


● Unsupervised Learning – I am self-sufficient in learning
● Reinforcement Learning – My life My rules! (Hit & Trial)

What is Supervised Learning?

● A supervised machine learning algorithm learns from labeled training data to
help you predict outcomes for unforeseen data. (Supervised learning is when
you train a machine learning model using labelled data.) It means some data
is already tagged with the correct answers.
● It can be compared to learning in the presence of a supervisor or a teacher.
We have a dataset which acts as a teacher and its role is to train the model or
the machine.
● Once the model gets trained it can start making a prediction or decision when
new data is given to it.

How it works –

● The Machine Learning algorithm here is provided with a small training
dataset to work with, which is a smaller part of the bigger dataset.
● It serves to give the algorithm an idea of the problem, solution, and various
data points to be dealt with.
● The training dataset here is also very similar to the final dataset in its
characteristics and provides the algorithm with the labeled parameters
required for the problem.
● The Machine Learning algorithm then finds relationships between the given
parameters, establishing a cause-and-effect relationship between the
variables in the dataset.

Example: House prices

One practical example of supervised learning problems is predicting house prices.


How is this achieved?

First, we need data about the houses: square footage, number of rooms, features,
whether a house has a garden or not, and so on. We then need to know the prices of
these houses, i.e. the corresponding labels. By leveraging data coming from
thousands of houses, their features and prices, we can now train a supervised
machine learning model to predict a new house’s price based on the examples
observed by the model.
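A minimal sketch of this in code, assuming scikit-learn and using made-up feature
values and prices purely for illustration:

# Hypothetical supervised learning on invented house data.
from sklearn.linear_model import LinearRegression

# Each row: [square footage, number of rooms, has_garden (1 or 0)]
X = [[1400, 3, 1],
     [1600, 3, 0],
     [1700, 4, 1],
     [1875, 4, 1],
     [1100, 2, 0]]
y = [245000, 312000, 279000, 308000, 199000]  # labels: known prices

model = LinearRegression()
model.fit(X, y)                      # learn from the labeled examples

# Predict the price of a new, unseen house
print(model.predict([[1500, 3, 1]]))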

Example: Is it a cat or a dog?

Image classification is a popular problem in the computer vision field. Here, the goal
is to predict what class an image belongs to. In this set of problems, we are
interested in finding the class label of an image. More precisely: is the image of a car
or a plane? A cat or a dog?

Example: How’s the weather today?

One particularly interesting problem, which requires considering a lot of different
parameters, is predicting weather conditions in a particular location. To make correct
predictions for the weather, we need to take into account various parameters,
including historical temperature data, precipitation, wind, humidity, and so on.
What is Unsupervised Learning?

● The model learns through observation and finds structures in the data.
Unsupervised learning, as the name suggests, uses no data labels; the
machine looks for patterns on its own.
● Once the model is given a dataset, it automatically finds patterns and
relationships in the dataset by creating clusters in it. What it cannot do is add
labels to the clusters: it cannot say this is a group of apples or mangoes,
but it will separate all the apples from the mangoes.
● Suppose we present images of apples, bananas and mangoes to the model.
Based on patterns and relationships, it creates clusters and divides the
dataset into them. Now if new data is fed to the model, it adds it to one of the
created clusters (see the sketch after the examples below).
● EXAMPLES:
○ Customer segmentation, or understanding different customer
groups around which to build marketing or other business
strategies.
○ Genetics, for example clustering DNA patterns to analyze
evolutionary biology.
○ Recommender systems, which involve grouping together users
with similar viewing patterns in order to recommend similar
content.
○ Anomaly detection, including fraud detection or detecting
defective mechanical parts (i.e., predictive maintenance).
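As referenced above, here is a minimal sketch of clustering in code; the two-feature
points and the choice of scikit-learn's KMeans are assumptions made for illustration:

# Unsupervised clustering: no labels are given to the model.
from sklearn.cluster import KMeans

X = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],   # one natural group
     [8.0, 8.5], [7.8, 9.0], [8.2, 8.1]]   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)                         # finds structure without labels

print(kmeans.labels_)                 # cluster assignment for each point
print(kmeans.predict([[1.1, 2.0]]))   # new data goes to an existing cluster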

What is Reinforcement Learning?

● Reinforcement learning is the ability of an agent to interact with the
environment and find out what the best outcome is.
● It follows the concept of the trial-and-error method (a minimal sketch of this
loop appears after the examples below).
● The agent is rewarded or penalized with a point for a correct or a wrong
answer, and on the basis of the positive reward points gained, the model
trains itself.
● Once trained, it is ready to predict new data presented to it.
● EXAMPLES

○ Robotics for industrial automation.


○ Business strategy planning
○ It helps you to create training systems that provide custom
instruction and materials according to the requirements of
students.
○ Aircraft control and robot motion control
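As mentioned above, here is a minimal sketch of the reward-driven trial-and-error
loop, using a made-up two-armed bandit (the simplest reinforcement learning
setting); the reward probabilities are invented:

# An agent learns which of two actions pays off more often.
import random

reward_prob = [0.3, 0.7]      # hidden chance each action is rewarded
values = [0.0, 0.0]           # the agent's running estimate per action
counts = [0, 0]

for step in range(1000):
    # Explore occasionally; otherwise exploit the best-looking action
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = values.index(max(values))
    reward = 1 if random.random() < reward_prob[action] else 0
    counts[action] += 1
    # Nudge the estimate toward the observed reward
    values[action] += (reward - values[action]) / counts[action]

print(values)   # estimates should approach [0.3, 0.7]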
BOOLEAN ALGEBRA: A system of algebra in which there are only two
possible values for a variable (often expressed as true and false, or as 1 and 0) and
in which the basic operations are the logical operations AND, OR, and NOT.
UNIT 2

Classification Algorithm in Machine Learning


As we know, Supervised Machine Learning algorithms can be broadly classified
into Regression and Classification algorithms. In Regression algorithms, we
predict the output for continuous values, but to predict categorical values, we
need Classification algorithms.

What is the Classification Algorithm?

● The Classification algorithm is a Supervised Learning technique that is used
to identify the category of new observations on the basis of training data. In
Classification, a program learns from the given dataset or observations and
then classifies new observations into a number of classes or groups, such as
Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called
targets, labels, or categories.
● Unlike regression, the output variable of Classification is a category, not a
value, such as "Green or Blue", "fruit or animal", etc. Since the Classification
algorithm is a Supervised learning technique, it takes labeled input data,
which means it contains input with the corresponding output.
● The best example of an ML classification algorithm is Email Spam Detector.
● The main goal of the Classification algorithm is to identify the category of a
given dataset, and these algorithms are mainly used to predict the output for
the categorical data.
● The algorithm which implements the classification on a dataset is known as a
classifier. There are two types of Classifications:
● Binary Classifier: If the classification problem has only two possible
outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or
DOG, etc.

● Multi-class Classifier: If a classification problem has more than two
outcomes, then it is called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of
music.

Types of ML Classification Algorithms:

Classification algorithms can be further divided into two main categories:

● Linear Models

○ Logistic Regression

○ Support Vector Machines

● Non-linear Models

○ K-Nearest Neighbours

○ Kernel SVM

○ Naïve Bayes

○ Decision Tree Classification

○ Random Forest Classification

Use cases of Classification Algorithms


Classification algorithms can be used in different places. Below are some popular

use cases of Classification Algorithms:

● Email Spam Detection

● Speech Recognition

● Identification of cancer tumor cells.

● Drugs Classification

● Biometric Identification, etc.

Linear Regression in Machine Learning


Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such as
sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y)
variable and one or more independent (x) variables, hence it is called linear
regression. Since linear regression shows a linear relationship, it finds how the value
of the dependent variable changes according to the value of the independent
variable.

Types of Linear Regression


Linear regression can be further divided into two types of algorithm:

● Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple
Linear Regression.

● Multiple Linear regression:


If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is
called Multiple Linear Regression.
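In equation form (a standard formulation, with a0 as the intercept and a1…an as the
coefficients the algorithm learns):

Simple linear regression: y = a0 + a1*x
Multiple linear regression: y = a0 + a1*x1 + a2*x2 + … + an*xn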

Finding the best fit line:


When working with linear regression, our main goal is to find the best fit line, which
means that the error between the predicted values and the actual values should be
minimized. The best fit line will have the least error.

Model Performance:
The "Goodness of fit" term is taken from the statistics, and the goal of the machine

learning models to achieve the goodness of fit. In statistics modeling, it defines how

closely the result or predicted values match the true values of the dataset.

The Goodness of fit determines how the line of regression fits the set of

observations. The process of finding the best model out of various models is called

optimization. It can be achieved by below method:

1. R-squared method:

● R-squared is a statistical method that determines the goodness of fit.
● It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
● A high value of R-squared indicates a small difference between the predicted
values and the actual values, and hence represents a good model.
● It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
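The R-squared calculation can be sketched directly from its definition; the actual
and predicted values below are made up for illustration:

# R-squared = 1 - (residual sum of squares / total sum of squares)
actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.3, 6.9, 9.2]

mean_actual = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(r_squared)   # close to 1 means a good fit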
Linear Regression vs. Logistic Regression

● Linear regression is used to predict a continuous dependent variable using a
given set of independent variables; logistic regression is used to predict a
categorical dependent variable.
● Linear regression is used for solving regression problems; logistic regression
is used for solving classification problems.
● In linear regression, we predict the value of continuous variables; in logistic
regression, we predict the values of categorical variables.
● In linear regression, we find the best fit line, by which we can easily predict
the output; in logistic regression, we find the S-curve by which we can
classify the samples.
● The least squares estimation method is used for estimation in linear
regression; the maximum likelihood estimation method is used in logistic
regression.
● The output of linear regression must be a continuous value, such as price or
age; the output of logistic regression must be a categorical value such as
0 or 1, Yes or No.
● In linear regression, it is required that the relationship between the dependent
variable and the independent variable is linear; in logistic regression, a linear
relationship between the dependent and independent variables is not
required.

Underfitting vs. Overfitting in ML

UNDERFITTING:

● If the model is performing poorly over the test and the train set, then we call
that an underfitting model.

● It usually happens when we have too little data to build an accurate model.

● An example of this situation would be building a linear regression model over
non-linear data.

● Underfitting destroys the accuracy of our machine learning model.

Techniques to reduce underfitting:

● Increase model complexity.
● Increase the number of features, for example through feature engineering
(described next).
● Remove noise from the data.
● Increase the duration of training.

What is Feature Engineering?


Feature Engineering is the process of extracting and organizing the important
features from raw data in such a way that they fit the purpose of the machine
learning model. It can be thought of as the art of selecting the important features
and transforming them into refined and meaningful features that suit the needs of
the model.
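As a small illustration (the pandas usage and the made-up housing columns are
assumptions for this sketch):

# Deriving more meaningful features from raw columns.
import pandas as pd

raw = pd.DataFrame({
    "price": [245000, 312000, 199000],
    "sqft":  [1400, 1600, 1100],
    "built": [1995, 2008, 1973],
})

raw["price_per_sqft"] = raw["price"] / raw["sqft"]  # derived feature
raw["age"] = 2024 - raw["built"]                    # more useful than a raw year
print(raw)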
OVERFITTING:

● The situation where a model performs too well on the training data but its
performance drops significantly over the test set is called an overfitting
model.

● When a model trains too much on the training data, it starts learning from the
noise and inaccurate data entries in our data set.

● For example, non-parametric models like decision trees, KNN, and other
tree-based algorithms are very prone to overfitting.
● A solution to avoid overfitting is using a linear algorithm if we have linear data.

Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine
learning models to achieve the goodness of fit. In statistics modeling, it defines how
closely the result or predicted values match the true values of the dataset.

The model with a good fit is between the underfitted and overfitted models, and
ideally it makes predictions with 0 error, but in practice this is difficult to achieve.

K-Nearest Neighbor (KNN) Algorithm for Machine Learning
● K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.

● The K-NN algorithm assumes the similarity between the new case/data and
available cases and puts the new case into the category that is most similar
to the available categories.

● K in KNN is a parameter that refers to the number of nearest neighbors
included in the majority voting process.
● The KNN algorithm at the training phase just stores the dataset, and when it
gets new data it classifies that data into the category most similar to the new
data.

● K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.

● It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead it stores the dataset, and at the time of
classification it performs an action on the dataset.

● Example: suppose we have an image of a creature that looks similar to both
a cat and a dog, but we want to know whether it is a cat or a dog. For this
identification we can use the KNN algorithm, as it works on a similarity
measure. Our KNN model will find the features of the new image that are
similar to the cat and dog images, and based on the most similar features it
will put the image in either the cat or the dog category.

KNN Algorithm Example


To understand how the KNN algorithm works, let's consider the following
scenario:
● In the above image, we have two classes of data, namely class A (squares)
and Class B (triangles)

● The problem statement is to assign the new input data point to one of the two
classes by using the KNN algorithm

● The first step in the KNN algorithm is to define the value of ‘K’. But what does
the ‘K’ in the KNN algorithm stand for?

● ‘K’ stands for the number of Nearest Neighbors and hence the name K
Nearest Neighbors (KNN).

● In the above image, I’ve defined the value of ‘K’ as 3. This means that the
algorithm will consider the three neighbors that are the closest to the new data
point in order to decide the class of this new data point.

● The closeness between the data points is calculated by using measures such
as Euclidean and Manhattan distance, which I’ll be explaining below.

● At ‘K’ = 3, the neighbors include two squares and 1 triangle. So, if I were to
classify the new data point based on ‘K’ = 3, then it would be assigned to
Class A (squares).
● But what if the ‘K’ value is set to 7? Here, I’m basically telling my algorithm to
look for the seven nearest neighbors and classify the new data point into the
class it is most similar to.

● At ‘K’ = 7, the neighbors include three squares and four triangles. So, if I were
to classify the new data point based on ‘K’ = 7, then it would be assigned to
Class B (triangles) since the majority of its neighbors were of class B.

In practice, there’s a lot more to consider while implementing the KNN algorithm.
This will be discussed in the demo section of the blog.
Earlier I mentioned that KNN uses Euclidean distance as a measure to check the
distance between a new data point and its neighbors, let’s see how.

● Consider two points P1 and P2; we're going to measure the distance
between them using the Euclidean distance measure.

● The coordinates for P1 and P2 are (1,4) and (5,1) respectively.

● The Euclidean distance can be calculated like so:

d(P1, P2) = √((5 - 1)² + (1 - 4)²) = √(16 + 9) = √25 = 5

It is as simple as that! KNN makes use of simple measures in order to solve complex
problems; this is one of the reasons why KNN is such a commonly used algorithm.
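A from-scratch sketch of the whole procedure might look as follows; the labeled
points (class A squares, class B triangles) and the query point are invented for
illustration:

# KNN by hand: store examples, measure distance, take a majority vote.
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

points = [((1, 4), "A"), ((2, 5), "A"), ((2, 3), "A"),
          ((5, 1), "B"), ((6, 2), "B"), ((5, 3), "B")]

def knn_predict(new_point, k):
    # Sort the stored examples by distance to the new point
    nearest = sorted(points, key=lambda p: dist(p[0], new_point))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]   # majority vote

print(knn_predict((3, 3), k=3))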
What is Naive Bayes algorithm?
● It is a classification technique based on Bayes' Theorem with an assumption
of independence among predictors. In simple terms, a Naive Bayes classifier
assumes that the presence of a particular feature in a class is unrelated to the
presence of any other feature.

● For example, a fruit may be considered to be an apple if it is red, round, and
about 3 inches in diameter. Even if these features depend on each other or
upon the existence of the other features, all of these properties independently
contribute to the probability that this fruit is an apple, and that is why it is
known as 'Naive'.

● Bayes' theorem provides a way of calculating the posterior probability P(c|x)
from P(c), P(x) and P(x|c). Look at the equation below:

P(c|x) = P(x|c) * P(c) / P(x)

How does the Naive Bayes algorithm work?


Let’s understand it using an example. Below I have a training data set of weather
and corresponding target variable ‘Play’ (suggesting possibilities of playing). Now,
we need to classify whether players will play or not based on weather conditions.
Let’s follow the below steps to perform it.

Step 1: Convert the data set into a frequency table

Step 2: Create a Likelihood table by finding the probabilities, like Overcast
probability = 0.29 and probability of playing = 0.64.
Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability
for each class. The class with the highest posterior probability is the outcome of
prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the above discussed method of posterior probability.

P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and
P(Yes) = 9/14 = 0.64.

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60. Since this is higher than
P(No | Sunny), the prediction is that players will play, so the statement is correct.
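The same arithmetic as a quick script, using only the counts quoted above:

# Posterior probability from the frequency table's counts.
p_sunny_given_yes = 3 / 9    # P(Sunny | Yes)
p_yes = 9 / 14               # P(Yes)
p_sunny = 5 / 14             # P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6 -> predict "Yes"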

Decision Tree Classification Algorithm


● Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and
each leaf node represents the outcome
● It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like structure.
● In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
● A decision tree simply asks a question, and based on the answer (Yes/No), it
further splits the tree into subtrees.
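A minimal sketch of fitting such a tree with scikit-learn's CART implementation; the
toy data is invented, and criterion="entropy" would use information gain instead of
Gini:

# Build and inspect a small decision tree.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]     # label is 1 only when both features are 1

tree = DecisionTreeClassifier(criterion="gini")
tree.fit(X, y)

print(export_text(tree))       # shows the learned yes/no splits
print(tree.predict([[1, 1]]))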
Attribute Selection Measures

While implementing a Decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure, or ASM. With this measurement,
we can easily select the best attribute for the nodes of the tree. There are two
popular techniques for ASM, which are:

● Information Gain

○ Information Gain: Information gain is used to decide which feature to
split on at each step in building the tree. Simplicity is best, so we want
to keep our tree small. To do so, at each step we should choose the
split that results in the purest daughter nodes. A commonly used
measure of purity is called information.
○ The split with the highest information gain will be taken as the first split,
and the process will continue until all child nodes are pure or until the
information gain is 0.
○ The algorithm calculates the information gain for each possible split,
and the split giving the highest value of information gain is selected.
○ In information gain, we compute the weighted average of the entropy of
the child nodes produced by the specific split and compare it with the
entropy before the split.

● Gini Index

○ The Gini index is a measure of impurity or purity used while creating a
decision tree in the CART (Classification and Regression Tree)
algorithm.
○ It is used when building Decision Trees to determine how the features
of a dataset should split nodes to form the tree.
○ An attribute with a low Gini index should be preferred over one with a
high Gini index.
○ More precisely, the Gini impurity of a dataset is a number between
0 and 0.5.
○ The Gini index can be calculated using the below formula:

Gini Index = 1 - ∑j (Pj)²

● ENTROPY:

○ Entropy is the measure of impurity, disorder, or uncertainty in a bunch
of examples.

○ Purpose of entropy: it controls how a Decision Tree decides to split the
data, and it affects how a Decision Tree draws its boundaries.

○ Entropy values range from 0 to 1 (for two classes); the lower the
entropy, the purer and more trustworthy the node.

Entropy vs Gini Impurity


The maximum value for entropy is 1, whereas the maximum value for Gini impurity
is 0.5.

Because the Gini impurity does not involve a logarithmic function, it takes less
computational time to calculate than entropy.
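Both measures are easy to compute directly from a node's class counts; the counts
below (4 samples of one class, 6 of the other) are a made-up example:

# Entropy and Gini impurity for a single node.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(entropy([4, 6]))   # ~0.971 (maximum 1.0 at a 50/50 node)
print(gini([4, 6]))      # 0.48   (maximum 0.5 at a 50/50 node)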
What is a Confusion Matrix?
● A Confusion matrix is an N x N matrix used for evaluating the performance of
a classification model, where N is the number of target classes.
● Both precision and recall can be interpreted from the confusion matrix, so
we start there. The confusion matrix is used to display how well a model made
its predictions.
● The purpose of the confusion matrix is to show how…well, how confused the
model is.
● For a binary classification problem, we would have a 2 x 2 matrix as shown
below with 4 values:

Let’s decipher the matrix:

● The target variable has two values: Positive or Negative


● The columns represent the actual values of the target variable
● The rows represent the predicted values of the target variable
But wait – what’s TP, FP, FN and TN here? That’s the crucial part of a confusion
matrix. Let’s understand each term below:

True Positive (TP)

● The predicted value matches the actual value


● The actual value was positive and the model predicted a positive value
True Negative (TN)

● The predicted value matches the actual value


● The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error

● The predicted value was falsely predicted


● The actual value was negative but the model predicted a positive value
● Also known as the Type 1 error
False Negative (FN) – Type 2 error

● The predicted value was falsely predicted


● The actual value was positive but the model predicted a negative value
● Also known as the Type 2 error
Let me give you an example to better understand this. Suppose we had a
classification dataset with 1000 data points. We fit a classifier on it and get the below
confusion matrix:

The different values of the Confusion matrix would be as follows:

● True Positive (TP) = 560; meaning 560 positive class data points were
correctly classified by the model
● True Negative (TN) = 330; meaning 330 negative class data points were
correctly classified by the model
● False Positive (FP) = 60; meaning 60 negative class data points were
incorrectly classified as belonging to the positive class by the model
● False Negative (FN) = 50; meaning 50 positive class data points were
incorrectly classified as belonging to the negative class by the model

Remember the Type 1 and Type 2 errors. Interviewers love to ask the difference
between these two!
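For reference, scikit-learn can produce this matrix directly; the small label lists
below are invented, and note that sklearn's layout is the transpose of the one
described above (its rows are actual values, its columns are predicted):

# Confusion matrix from true and predicted labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))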
Precision vs. Recall

Precision tells us how many of the cases predicted as positive actually turned out to
be positive.

Here’s how to calculate Precision:

Recall tells us how many of the actual positive cases we were able to predict
correctly with our model.

And here’s how we can calculate Recall:

F1-Score
In practice, when we try to increase the precision of our model, the recall goes down,
and vice versa. The F1-score captures both trends in a single value:

F1-score = 2 * (Precision * Recall) / (Precision + Recall)

The F1-score is the harmonic mean of Precision and Recall, and so it gives a
combined idea about these two metrics. It is maximum when Precision is equal to
Recall.

ACCURACY:

Accuracy is the fraction of all predictions the model got right:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
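Plugging the earlier example's counts (TP = 560, TN = 330, FP = 60, FN = 50) into
these formulas as a quick check:

# Metrics computed from the confusion-matrix counts above.
tp, tn, fp, fn = 560, 330, 60, 50

precision = tp / (tp + fp)                    # ~0.903
recall    = tp / (tp + fn)                    # ~0.918
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 0.89

print(precision, recall, f1, accuracy)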
Support Vector Machine Algorithm
● Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression
problems. However, primarily, it is used for Classification problems in Machine
Learning.
● The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.

How an SVM works


● A simple linear SVM classifier works by making a straight line between two
classes. That means all of the data points on one side of the line will
represent a category and the data points on the other side of the line will be
put into a different category. This means there can be an infinite number of
lines to choose from.
● What makes the linear SVM algorithm better than some other algorithms,
like k-nearest neighbors, is that it chooses the best line to classify your data
points: the line that separates the data and is as far away from the closest
data points as possible.

A 2-D example helps to make sense of all the machine learning jargon. Basically you
have some data points on a grid. You're trying to separate these data points by the
category they should fit in, but you don't want to have any data in the wrong
category. That means you're trying to find the line between the two closest points
that keeps the other data points separated.

So the two closest data points give you the support vectors you'll use to find that line.
That line is called the decision boundary.

The decision boundary doesn't have to be a line. It's also referred to as a hyperplane
because you can find the decision boundary with any number of features, not just
two.
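A minimal sketch with scikit-learn's SVC; the 2-D points are invented, and swapping
kernel="linear" for kernel="rbf" gives a non-linear SVM:

# A linear SVM picks the boundary farthest from the closest points.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1],     # one class
     [6, 5], [7, 7], [6, 6]]     # the other class
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)      # the support vectors defining the boundary
print(clf.predict([[3, 3]]))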


Types of SVM
SVM can be of two types:

● Linear SVM: Linear SVM is used for linearly separable data, which means
that if a dataset can be classified into two classes by using a single straight
line, then such data is termed linearly separable data, and the classifier used
is called a Linear SVM classifier.
● Non-linear SVM: Non-linear SVM is used for non-linearly separable data,
which means that if a dataset cannot be classified by using a straight line,
then such data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.
Hyperplane

● Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find the best decision
boundary that helps to classify the data points. This best boundary is known
as the hyperplane of SVM.
● The dimensions of the hyperplane depend on the number of features present
in the dataset: if there are 2 features, then the hyperplane will be a straight
line, and if there are 3 features, then the hyperplane will be a 2-dimensional
plane.
● The hyperplane with maximum margin is called the optimal hyperplane.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed Support Vectors. Since these vectors support
the hyperplane, they are called support vectors, and hence the algorithm is termed
a Support Vector Machine.

Applications of SVM in Real World

As we have seen, SVMs are supervised learning algorithms. The aim of using an
SVM is to correctly classify unseen data. SVMs have a number of applications in
several fields. Some common applications of SVM are:

● Face detection
● Text and hypertext categorization
● Classification of images
● Protein fold and remote homology detection
● Handwriting recognition
UNIT 3
