
Unit 2

Regression analysis is a fundamental concept in the field of machine learning.


It falls under supervised learning wherein the algorithm is trained with both input
features and output labels. It helps in establishing a relationship among the variables
by estimating how one variable affects the other.

Imagine you're car shopping and have decided that gas mileage is a deciding
factor in your decision to buy. If you wanted to predict the miles per gallon of some
promising rides, how would you do it? Well, since you know the different features of
the car (weight, horsepower, displacement, etc.) one possible method is regression.
By plotting the average MPG of each car given its features, you can then use
regression techniques to find the relationship between the MPG and the input features. The
regression function here could be represented as $Y = f(X)$, where Y would be the
MPG and X would be the input features like the weight, displacement, horsepower,
etc. The target function is $f$, and this curve helps us predict whether it is beneficial to
buy a particular car or not. This mechanism is called regression.

Regression In Machine Learning


Regression in machine learning consists of mathematical methods that allow data scientists to
predict a continuous outcome (y) based on the value of one or more predictor variables (x). Linear
regression is probably the most popular form of regression analysis because of its ease-of-use in
predicting and forecasting.

Linear Regression
Linear Regression is a machine learning algorithm based on supervised learning. It performs
a regression task. Regression models a target prediction value based on independent
variables. It is mostly used for finding out the relationship between variables and
forecasting.
Different regression models differ based on the kind of relationship they assume between the
dependent and independent variables, and on the number of independent variables being used.

Linear regression predicts a dependent variable value (y) based on a
given independent variable (x). So, this regression technique finds a linear relationship
between x (input) and y (output); hence the name Linear Regression.
For example, if X (input) is the work experience and Y (output) is the salary of a
person, the regression line is the best-fit line for our model.
Hypothesis function for Linear Regression:

y = θ1 + θ2 · x

While training the model we are given:

x: input training data (univariate – one input variable (parameter))
y: labels to data (supervised learning)
When training, the model fits the best line to predict the value of y for a given value of x.
The model gets the best regression fit line by finding the best θ1 and θ2 values.
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best-fit line. So when we finally use
our model for prediction, it will predict the value of y for the input value of x.
How to update θ1 and θ2 values to get the best-fit line?
Cost Function (J):
By achieving the best-fit regression line, the model aims to predict the y value such that the
error difference between the predicted value and the true value is minimum. So, it is very important
to update the θ1 and θ2 values to reach the best values that minimize the error between the
predicted y value (pred) and the true y value (y).
The cost function (J) of Linear Regression is the Root Mean Squared Error (RMSE) between the
predicted y value (pred) and the true y value (y).
Gradient Descent:
To update the θ1 and θ2 values in order to reduce the cost function (minimizing the RMSE value) and
achieve the best-fit line, the model uses Gradient Descent. The idea is to start with random
θ1 and θ2 values and then iteratively update them until the cost reaches its minimum.
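As a rough illustration, the sketch below applies gradient descent to univariate linear regression in Python with NumPy. The tiny work-experience/salary numbers are made up purely for demonstration, and the loop minimizes the mean squared error, whose minimum coincides with that of the RMSE.

```python
import numpy as np

# A minimal sketch of gradient descent for univariate linear regression.
# theta1 is the intercept and theta2 the coefficient of x, matching the text.
# Synthetic data (assumed for illustration): salary vs. years of work experience.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # work experience (years)
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])   # salary (thousands)

theta1, theta2 = 0.0, 0.0   # arbitrary starting values
lr = 0.01                   # learning rate
for _ in range(5000):
    pred = theta1 + theta2 * x          # hypothesis: y_pred = θ1 + θ2·x
    error = pred - y
    # Gradients of the mean squared error with respect to θ1 and θ2.
    grad1 = 2 * error.mean()
    grad2 = 2 * (error * x).mean()
    theta1 -= lr * grad1                # step against the gradient
    theta2 -= lr * grad2

rmse = np.sqrt(np.mean((theta1 + theta2 * x - y) ** 2))
print(f"theta1 = {theta1:.3f}, theta2 = {theta2:.3f}, RMSE = {rmse:.3f}")
```

With a small learning rate and enough iterations, θ1 and θ2 settle close to the values of the least-squares best-fit line.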

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a
tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, which are the Decision Node and the Leaf
Node. Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the
tree into subtrees.
Note: A decision tree can handle categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the
two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so they are easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


Root Node: The root node is where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further after
reaching a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.

Branch/Sub Tree: A tree formed by splitting the tree.

Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.

How does the Decision Tree algorithm Work?


In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. It compares the value of the root attribute with the corresponding attribute of the record (from
the real dataset) and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues this process until it reaches a leaf node of the tree. The complete process
can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot classify the nodes any further;
the final nodes are called leaf nodes.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary
attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further splits into
one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offer and Declined offer).
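As a rough sketch of these steps in practice, the example below fits a CART-style tree with scikit-learn's DecisionTreeClassifier. The small job-offer table and its feature names (salary, distance, cab_facility) are invented only to mirror the example above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical job-offer records: [salary, distance to office (km), cab facility (1 = yes)]
X = [
    [4,  5, 0],
    [9,  3, 1],
    [9, 25, 0],
    [9, 20, 1],
    [5, 10, 1],
    [10, 8, 0],
]
y = ["Decline", "Accept", "Decline", "Accept", "Decline", "Accept"]

# CART chooses, at each node, the split that best reduces impurity (Gini by default).
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(export_text(tree, feature_names=["salary", "distance", "cab_facility"]))
print(tree.predict([[9, 12, 1]]))   # classify a new, unseen offer
```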

Attribute Selection Measures


While implementing a Decision tree, the main issue is how to select the best attribute for the
root node and for sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure, or ASM. Using this measurement, we can easily select the best attribute
for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]


Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:
Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no)
Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no
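The Python sketch below (assuming NumPy is available) computes entropy and information gain exactly as in the formulas above; the toy play/outlook labels are invented for illustration.

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """Entropy(S) minus the weighted average entropy of the subsets of S
    obtained by splitting on feature_values."""
    total = len(labels)
    weighted = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted += (len(subset) / total) * entropy(subset)
    return entropy(labels) - weighted

# Toy example: how much does "outlook" tell us about the "play" label?
play    = np.array(["yes", "yes", "no", "no", "yes", "no"])
outlook = np.array(["sunny", "sunny", "rain", "rain", "sunny", "sunny"])
print(information_gain(play, outlook))
```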

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − ∑j Pj²
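A minimal sketch of the Gini index formula above, again assuming NumPy; the label arrays are illustrative only.

```python
import numpy as np

def gini_index(labels):
    """Gini = 1 - sum of P_j squared, where P_j is the proportion of class j in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index(np.array(["yes", "yes", "no", "no"])))  # 0.5, maximally impure for two classes
print(gini_index(np.array(["yes", "yes", "yes"])))       # 0.0, a pure node
```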

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision
tree.

A too-large tree increases the risk of overfitting, while a small tree may not capture all the important
features of the dataset. Pruning is therefore a technique that decreases the size of the learning tree
without reducing accuracy. There are mainly two types of tree pruning techniques used:
o Cost Complexity Pruning
o Reduced Error Pruning.
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process which a human follows while making
any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

What is Instance-Based and Model-Based Learning?

Most Machine Learning tasks are about making predictions. This means that, given a number of
training examples, the system needs to be able to generalize to (make good predictions for)
examples it has never seen before.

Having a good performance measure on the training data is good, but insufficient; the true goal is to
perform well on new instances.

There are two main approaches to generalization: instance-based learning and model-based
learning

1. Instance-based learning:

(sometimes called memory-based learning) is a family of learning algorithms that, instead of
performing explicit generalization, compare new problem instances with instances seen in
training, which have been stored in memory.

Ex: k-nearest neighbor, decision tree
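A minimal from-scratch sketch of instance-based learning with k-nearest neighbors (assuming NumPy): training is just storing the examples in memory, and prediction compares a new instance against the stored ones.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest stored instances."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # distance to every stored example
    nearest = np.argsort(distances)[:k]                   # indices of the k closest ones
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# "Training" is nothing more than keeping the data around.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))   # -> "A"
```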

2. Model-based learning:

Model-based learning uses machine learning models that are parameterized with a fixed number of
parameters, which does not change as the size of the training data changes.

If you do not assume any distribution with a fixed number of parameters over your data (for example,
in k-nearest neighbor, or in a decision tree, where the number of parameters grows with the size of
the training data), then you are not model-based; such methods are called nonparametric.
Overfitting in Machine Learning
Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new data. This means that the noise or random
fluctuations in the training data are picked up and learned as concepts by the model. The problem is
that these concepts do not apply to new data and negatively impact the model's ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when
learning a target function. As such, many nonparametric machine learning algorithms also include
parameters or techniques to limit and constrain how much detail the model learns.

For example, the decision tree is a nonparametric machine learning algorithm that is very flexible
and is therefore subject to overfitting the training data. This problem can be addressed by pruning a
tree after it has learned, in order to remove some of the detail it has picked up.
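A small illustration of this: the sketch below compares a fully grown decision tree with a depth-limited one on noisy synthetic data from scikit-learn's make_classification (the sample size, noise level, and max_depth=3 are arbitrary choices). The fully grown tree typically scores near-perfectly on the training data but noticeably worse on the test data, which is the signature of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y adds label noise that a flexible model can memorize.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                   # fully grown
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)   # depth-limited

for name, model in [("fully grown", deep), ("max_depth=3", shallow)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 2),
          "test:", round(model.score(X_test, y_test), 2))
```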

Underfitting in Machine Learning


Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model and will be obvious as it will have poor
performance on the training data.

Underfitting is often not discussed as it is easy to detect given a good performance metric. The
remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide
a good contrast to the problem of overfitting.

Overfitting vs. Underfitting

We can understand overfitting better by looking at the opposite problem, underfitting.

Underfitting occurs when a model is too simple – informed by too few features or regularized too
much – which makes it inflexible in learning from the dataset.

Simple learners tend to have less variance in their predictions but more bias towards wrong
outcomes.

On the other hand, complex learners tend to have more variance in their predictions.

Both bias and variance are forms of prediction error in machine learning.

Typically, we can reduce error from bias but might increase error from variance as a result, or vice
versa.

This trade-off between too simple (high bias) vs. too complex (high variance) is a key concept in
statistics and machine learning, and one that affects all supervised learning algorithms.
What is Feature Reduction?
● Feature reduction, also known as dimensionality reduction, is the process of
reducing the number of features in a resource-heavy computation without losing
important information.
● Reducing the number of features means the number of variables is reduced,
making the computer's work easier and faster.
Feature reduction can be divided into two processes:
feature selection and feature extraction.

● There are many techniques by which feature reduction is accomplished. Some of
the most popular are generalized discriminant analysis, autoencoders, non-negative
matrix factorization, and principal component analysis.
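As a brief sketch of one of these techniques, the example below uses scikit-learn's PCA to reduce the four iris measurements to two principal components; the dataset and n_components=2 are purely illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)          # keep the two directions of greatest variance
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)                        # (150, 4) -> (150, 2)
print("variance retained:", pca.explained_variance_ratio_.sum())
```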
Why is this Useful?
● The purpose of using feature reduction is to reduce the number of features (or
variables) that the computer must process to perform its function.
● Feature reduction leads to the need for fewer resources to complete computations
or tasks. Less computation time and less storage capacity needed means the
computer can do more work.
● During machine learning, feature reduction removes multicollinearity, resulting in
improvement of the machine learning model in use.
Another benefit of feature reduction is that it makes data easier to visualize for humans,
particularly when the data is reduced to two or three dimensions which can be easily
displayed graphically. An interesting problem that feature reduction can help with is
called the curse of dimensionality. This refers to a group of phenomena in which a problem
will have so many dimensions that the data becomes sparse. Feature reduction is used to
decrease the number of dimensions, making the data less sparse and more statistically
significant for machine learning applications.
Collaborative Filtering – ML
In Collaborative Filtering, we tend to find similar users and recommend what similar users
like.
In this type of recommendation system, we don't use the features of the item to recommend it;
rather, we classify the users into clusters of similar types and recommend items to each user
according to the preferences of their cluster.
Measuring Similarity:
A simple example of the movie recommendation system will help us in explaining:

In this scenario, we can see that User 1 and User 2 give nearly similar ratings to the
movies, so we can conclude that Movie 3 is also going to be liked reasonably well by User 1,
while Movie 4 will be a good recommendation for User 2. We can also see that there are users
with different tastes: User 1 and User 3, for instance, are opposite to each other.
Similarly, User 3 and User 4 have a common interest in the movies, and on that basis we
can say that Movie 4 is also going to be disliked by User 4. This is Collaborative
Filtering: we recommend to users the items which are liked by users with similar interests.

Cosine Distance:

We can also use the cosine distance between users to find users with similar
interests; a larger cosine implies a smaller angle between two users, hence they
have similar interests.
We can apply the cosine distance between two users in the utility matrix, and we can assign
the value zero to all the unfilled entries to make calculation easy. If we get a smaller cosine,
there is a larger angle (distance) between the users; if the cosine is larger, there is a small
angle between the users, and we can recommend them similar things.
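A minimal sketch of this calculation with NumPy; the rating vectors below form a hypothetical utility matrix in which unfilled entries are set to 0, as described above.

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(angle) = (u . v) / (|u| * |v|); larger values mean a smaller angle, i.e. more similar taste."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical ratings for four movies (0 = not rated).
user1 = np.array([5, 4, 0, 1])
user2 = np.array([5, 5, 4, 0])
user3 = np.array([1, 2, 0, 5])

print("user1 vs user2:", round(cosine_similarity(user1, user2), 2))   # high  -> similar interests
print("user1 vs user3:", round(cosine_similarity(user1, user3), 2))   # lower -> different interests
```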
Rounding the Data:

In collaborative filtering we round off the data to compare it more easily. For example, we can
assign ratings below 3 as 0 and ratings above it as 1:

Taking the previous example and applying this rounding-off process, you can see how much more
readable the data has become: User 1 and User 2 are more similar, and User 3 and User 4 are
more alike.
Normalizing Rating:

In the process of normalizing, we take the average rating of a user and subtract it from each of
that user's ratings, so we get either positive or negative values as ratings, which can then be
classified further into similar groups. By normalizing the data we can form clusters of users
who give similar ratings to similar items, and then we can use these clusters to
recommend items to the users.
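A small sketch of this normalization step, assuming NumPy; the utility matrix is hypothetical, with np.nan marking movies a user has not rated.

```python
import numpy as np

# Rows = users, columns = movies, np.nan = not rated.
ratings = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 4.0, np.nan],
    [1.0, 2.0, np.nan, 5.0],
])

# Subtract each user's average rating from their own ratings: positive values mean
# "liked more than usual", negative values mean "liked less than usual".
user_means = np.nanmean(ratings, axis=1, keepdims=True)
normalized = ratings - user_means
print(normalized)
```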
