Unit 2
Imagine you're shopping for a car and have decided that gas mileage will be a deciding
factor in your purchase. If you wanted to predict the miles per gallon (MPG) of some
promising rides, how would you do it? Since you know each car's features
(weight, horsepower, displacement, etc.), one possible method is regression.
By plotting the average MPG of each car against its features, you can use
regression techniques to find the relationship between the MPG and the input features. The
regression function here can be represented as $Y = f(X)$, where $Y$ is the
MPG and $X$ is the set of input features such as weight, displacement, horsepower,
etc. The target function is $f$, and the fitted curve helps us predict the MPG of a car
we haven't seen, and hence whether it's worth buying. This mechanism is called regression.
Linear Regression
Linear Regression is a machine learning algorithm based on supervised learning. It performs
a regression task: it models a target prediction value based on independent
variables. It is mostly used for finding the relationship between variables and for
forecasting.
Regression models differ in the kind of relationship they assume between the
dependent and independent variables, and in the number of independent variables they use.
Linear regression performs the task of predicting a dependent variable value (y) based on a
given independent variable (x); that is, it finds a linear relationship
between x (input) and y (output). Hence the name Linear Regression.
For example, X (input) could be the work experience and Y (output) the salary of a
person. The regression line is then the best-fit line for our model.
Hypothesis function for Linear Regression:

$$y = \theta_0 + \theta_1 x$$

where $\theta_0$ is the intercept and $\theta_1$ is the coefficient (slope) of $x$; training finds the values of $\theta_0$ and $\theta_1$ that best fit the data.
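As one way to see this in practice, here is a minimal sketch that fits the hypothesis above by ordinary least squares with NumPy; the experience/salary numbers are invented for illustration:

```python
import numpy as np

# Hypothetical data: years of work experience (x) and salary (y),
# mirroring the experience-vs-salary example above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([35000, 42000, 50000, 55000, 63000, 68000])

# Fit the hypothesis y = theta0 + theta1 * x by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
theta, *_ = np.linalg.lstsq(X, y, rcond=None)  # [theta0, theta1]

print(f"theta0 (intercept) = {theta[0]:.1f}, theta1 (slope) = {theta[1]:.1f}")
print("predicted salary at 7 years:", theta[0] + theta[1] * 7)
```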
Decision Tree
o Decision Trees usually mimic the human thinking process while making a decision, so they are easy to
understand.
o The logic behind a decision tree can be easily understood because it shows a tree-like
structure.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further after
a leaf node is reached.
Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes
according to the given conditions.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: A node that splits into sub-nodes is called the parent node, and its
sub-nodes are called the child nodes.
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets, one for each possible value of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified
further; these final nodes are called leaf nodes.
Example: Suppose a candidate has a job offer and wants to decide whether to accept
it or not. To solve this problem, the decision tree starts with the root node (the Salary
attribute, chosen by ASM). The root node splits into a decision node (distance from the office)
and a leaf node, based on the corresponding labels. The next decision node splits further into
a decision node (cab facility) and a leaf node. Finally, that decision node splits into two leaf
nodes (Accept offer and Decline offer), as in the sketch below.
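A minimal sketch of this example with scikit-learn; the dataset below is invented (salary, distance, and cab-facility values are all hypothetical), so the learned tree is only illustrative:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical job-offer data: [salary, distance from office, cab facility];
# labels: 1 = accept, 0 = decline. All values are invented for illustration.
X = [[12, 5, 1], [12, 25, 1], [12, 25, 0], [6, 5, 1],
     [6, 25, 0], [15, 30, 1], [5, 10, 0], [14, 8, 0]]
y = [1, 1, 0, 0, 0, 1, 0, 1]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["salary", "distance", "cab"]))
```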
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the
node/attribute with the highest information gain is split first. It can be calculated using the
formula below:

$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\, \text{Entropy}(S_v)$$

where $\text{Entropy}(S) = -\sum_i p_i \log_2 p_i$ and $p_i$ is the proportion of examples in $S$ that belong to class $i$.
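As a sketch of the computation, assuming NumPy and an invented node with 9 "yes" and 5 "no" examples:

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    """Entropy of the parent node minus the weighted entropy of its subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical node: 9 "yes" / 5 "no", divided by some attribute.
parent = ["yes"] * 9 + ["no"] * 5
left, right = parent[:8], parent[8:]  # an arbitrary illustrative split
print(round(information_gain(parent, [left, right]), 3))
```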
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o The CART algorithm creates only binary splits, and it uses the Gini index to choose
them.
o The Gini index can be calculated using the formula below:

$$\text{Gini}(S) = 1 - \sum_j p_j^2$$

where $p_j$ is the proportion of examples in $S$ that belong to class $j$.
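A matching sketch for the Gini index, again with NumPy and invented labels:

```python
import numpy as np

def gini(labels):
    """Gini(S) = 1 - sum(p_j^2) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["yes"] * 9 + ["no"] * 5))  # impure node -> about 0.459
print(gini(["yes"] * 10))              # pure node -> 0.0
```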
A tree that is too large increases the risk of overfitting, while a small tree may not capture all the important
features of the dataset. Pruning is the technique of decreasing the size of the learned tree without
reducing accuracy. There are two main types of tree pruning techniques:
o Cost Complexity Pruning
o Reduced Error Pruning
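Scikit-learn exposes cost complexity pruning through the ccp_alpha parameter; a minimal sketch on the built-in iris data (the dataset choice is arbitrary) might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Compute the effective alphas, then retrain with increasing penalties;
# a larger ccp_alpha prunes the tree more aggressively.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_te, y_te):.3f}")
```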
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process a human follows when making
a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o It requires less data cleaning than many other algorithms.
Most Machine Learning tasks are about making predictions. This means that, given a number of
training examples, the system needs to be able to make good predictions for (generalize
to) examples it has never seen before.
Having a good performance measure on the training data is good, but insufficient; the true goal is to
perform well on new instances.
There are two main approaches to generalization: instance-based learning and model-based
learning.
1. Instance-based learning: the system learns the training examples by heart and generalizes to
new cases by comparing them to the stored examples using a similarity measure, as in k-nearest
neighbors.
2. Model-based learning: the system builds a model from the training examples and uses that
model to make predictions. Such models are parameterized with a certain number of parameters
that do not change as the size of the training data changes.
If you don't assume a distribution with a fixed number of parameters over your data (for example,
in k-nearest neighbors, or in a decision tree, where the number of parameters grows with the size of
the training data), then you are not model-based; such methods are called nonparametric.
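To make the contrast concrete, here is a minimal sketch using scikit-learn's built-in iris data; k-NN stands in for an instance-based learner and logistic regression for a parametric, model-based one:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Instance-based (nonparametric): k-NN stores the training examples and
# classifies new points by similarity to its nearest stored neighbours.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Model-based (parametric): logistic regression compresses the data into a
# fixed set of coefficients, independent of the training-set size.
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("k-NN accuracy:    ", knn.score(X_te, y_te))
print("logistic accuracy:", logreg.score(X_te, y_te))
```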
Overfitting in Machine Learning
Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it
negatively impacts the model's performance on new data. This means that the noise or random
fluctuations in the training data are picked up and learned as concepts by the model. The problem is
that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when
learning a target function. As such, many nonparametric machine learning algorithms also include
parameters or techniques to limit and constrain how much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible
and is subject to overfitting training data. This problem can be addressed by pruning a tree after it
has learned in order to remove some of the detail it has picked up.
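The train/test gap is easy to demonstrate. In the sketch below (synthetic, deliberately noisy data), an unconstrained tree memorizes the training set while a depth-limited tree generalizes better; the depth limit plays the same role as pruning:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (20% of labels flipped) so a full tree can memorize noise.
X, y = make_classification(n_samples=400, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = grow until leaves are pure (prone to overfit)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```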
An underfit machine learning model is not suitable, and this will be obvious from its poor
performance on the training data.
Underfitting is often not discussed as it is easy to detect given a good performance metric. The
remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide
a good contrast to the problem of overfitting.
Underfitting occurs when a model is too simple (informed by too few features or regularized too
much), which makes it inflexible in learning from the dataset.
Simple learners tend to have less variance in their predictions but more bias towards wrong
outcomes.
On the other hand, complex learners tend to have more variance in their predictions.
Both bias and variance are forms of prediction error in machine learning.
Typically, we can reduce error from bias but might increase error from variance as a result, or vice
versa.
This trade-off between too simple (high bias) vs. too complex (high variance) is a key concept in
statistics and machine learning, and one that affects all supervised learning algorithms.
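One standard way to see the trade-off is to vary model complexity on noisy data, as in the sketch below (synthetic sine-wave data; the specific degrees are arbitrary): a degree-1 fit underfits (high bias), a very high degree overfits (high variance), and an intermediate degree tends to do best:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 30)  # noisy samples
x_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel()                 # noise-free truth

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
    mse = np.mean((model.predict(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: test MSE = {mse:.3f}")
```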
What is Feature Reduction?
● Feature reduction, also known as dimensionality reduction, is the process of
reducing the number of features in a resource-heavy computation without losing
important information.
● Reducing the number of features means the number of variables is reduced,
making the computation easier and faster.
Feature reduction can be divided into two processes:
feature selection and feature extraction.
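As one common feature-extraction technique (not the only option), principal component analysis projects the data onto fewer dimensions; here is a minimal sketch with scikit-learn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 4 original features

# Feature extraction: project onto 2 principal components that
# preserve most of the variance in the data.
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)

print("reduced shape:", X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())
```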
Collaborative Filtering
Consider a utility matrix in which each user rates some of the movies. In such a scenario,
suppose User 1 and User 2 give nearly similar ratings to the movies they have both seen; we can
then conclude that Movie 3 would also be moderately liked by User 1, and that Movie 4 would be
a good recommendation for User 2. We can also see users with opposite tastes, such as User 1 and
User 3. Similarly, if User 3 and User 4 share a common interest in a movie, we can infer on that
basis that Movie 4 is also likely to be disliked by User 4. This is Collaborative Filtering: we
recommend to a user the items liked by other users with similar interests.
Cosine Distance:
We can also use the cosine of the angle between two users' rating vectors to find users with similar
interests: a larger cosine implies a smaller angle between the two users, and hence more similar
interests.
We can apply this cosine measure to any two users in the utility matrix, giving the value zero to
all the unfilled cells to make the calculation easy. If the cosine is small, the users are far apart;
if the cosine is large, the angle between the users is small and we can recommend them similar
items.
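A minimal sketch of the cosine calculation with NumPy; the utility matrix below is invented, with 0 marking an unrated movie as described above:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(angle) = a.b / (|a| * |b|); closer to 1 means more similar users."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical utility matrix: rows = users, columns = movies,
# 0 marks an unrated movie. All values are invented for illustration.
ratings = np.array([
    [5, 4, 0, 1],  # User 1
    [5, 5, 4, 0],  # User 2
    [1, 0, 2, 5],  # User 3
])

print("User 1 vs User 2:", round(cosine_similarity(ratings[0], ratings[1]), 3))
print("User 1 vs User 3:", round(cosine_similarity(ratings[0], ratings[2]), 3))
```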
Rounding the Data:
In collaborative filtering we can round off the data to compare it more easily. For example, we can
treat ratings below 3 as 0 and the rest as 1, which makes the data easier to compare.
Applying this rounding to the previous example makes the data much more readable: we can now
see that User 1 and User 2 are more similar, while User 3 and User 4 are more alike.
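A one-line version of this thresholding with NumPy (here a rating of exactly 3 is mapped to 1, which is one possible reading of the rule above):

```python
import numpy as np

# Hypothetical ratings on a 1-5 scale, one row per user.
ratings = np.array([
    [5, 4, 2, 1],
    [5, 5, 4, 2],
])

# Ratings below 3 become 0 (dislike); 3 and above become 1 (like).
binary = (ratings >= 3).astype(int)
print(binary)
```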
Normalizing Rating:
In the process of normalizing, we take a user's average rating and subtract it from each of that
user's ratings, so we get either positive or negative values, which can then simply be classified
into similar groups. By normalizing the data we can form clusters of users who give similar
ratings to similar items, and then use these clusters to recommend items to the users.
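A minimal mean-centering sketch with NumPy (invented ratings; in practice, unrated cells would need to be excluded from the average):

```python
import numpy as np

# Hypothetical ratings, one row per user (all movies rated, for simplicity).
ratings = np.array([
    [5.0, 4.0, 2.0, 1.0],
    [4.0, 4.0, 3.0, 1.0],
])

# Subtract each user's average rating from their ratings: positive values
# mean above-average liking, negative values below-average.
normalized = ratings - ratings.mean(axis=1, keepdims=True)
print(normalized)
```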
Attention reader! Don’t stop learning now. Get hold of all the important Machine Learning
Concepts with the Machine Learning Foundation Course at a student-friendly price and
become industry ready.