
UNIT -2

Feature Selection and Feature Engineering: Data sets, creating training and test sets,
managing categorical data, missing features, data scaling and normalization, whitening.
Feature selection and filtering, PCA, Visualization of high-dimensional datasets.
Regression Algorithms: Linear regression, Regression types: Ridge, Lasso, ElasticNet,
Robust, Polynomial and Isotonic.

Feature engineering refers to a process of selecting and transforming variables/features in
your dataset when creating a predictive model using machine learning.
"Torture the data, and it will confess to anything." — Ronald Coase

Feature Engineering is the process of creating new features or transforming existing features
to improve the performance of a machine-learning model. It involves selecting relevant
information from raw data and transforming it into a format that can be easily understood by
a model. The goal is to improve model accuracy by providing more meaningful and relevant
information.

Feature engineering is the process of transforming raw data into features that are suitable for
machine learning models. In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build more accurate and
efficient machine learning models.

The success of machine learning models heavily depends on the quality of the features used
to train them. Feature engineering involves a set of techniques that enable us to create new
features by combining or transforming the existing ones. These techniques help to highlight
the most important patterns and relationships in the data, which in turn helps the machine
learning model to learn from the data more effectively.

Feature engineering has two goals:


● Preparing the proper input dataset, compatible with the machine learning algorithm
requirements.
● Improving the performance of machine learning models.
According to a survey of 80 Data Scientists conducted by CrowdFlower, Data Scientists
spend 60% of their time cleaning and organizing data.

What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute) is an
individual measurable property or characteristic of a data point that is used as input for a
machine learning algorithm. Features can be numerical, categorical, or text-based, and they
represent different aspects of the data that are relevant to the problem at hand.
For example, in a dataset of housing prices, features could include the number of bedrooms,
the square footage, the location, and the age of the property. In a dataset of customer
demographics, features could include age, gender, income level, and occupation.

What is a dataset?
A Dataset is a set of data grouped into a collection with which developers can work to meet
their goals. In a dataset, the rows represent the number of data points and the columns
represent the features of the Dataset.
Numerical Dataset: They include numerical data points that can be solved with equations.
These include temperature, humidity, marks and so on.
→ continuous data : Continuous data refers to data that can be measured. This data has values
that are not fixed and have an infinite number of possible values. These measurements can
also be broken down into smaller individual parts.
→ discrete data : Discrete data also referred to as discrete values, is data that only takes
certain values. Commonly in the form of whole numbers or integers, this is data that can be
counted and has a finite number of values. These values must be able to fall within certain
classifications and are unable to be broken down into smaller parts.

Categorical Dataset: These include categories such as colour, gender, occupation, games,
sports and so on.
→ nominal data : Nominal data is a qualitative type of data used to classify and label
variables, and is known as the simplest data type on Stevens’ scale. All nominal data is
categorical (meaning this data can be sorted into groups) and has no specific order or
hierarchy. Variables like name, eye color and nationality are examples of nominal data.
→ ordinal data : Ordinal data is a qualitative data type that classifies variables in a ranked
order. This kind of data is categorical and separates variables into groups based on distinct
ascending or descending hierarchies. Ordinal data encompasses variables like educational
level, economic status or level of customer satisfaction.
Creating training and test sets
What is Training Dataset?
The training data is the biggest (in size) subset of the original dataset, which is used to
train or fit the machine learning model. Firstly, the training data is fed to the ML
algorithms, which lets them learn how to make predictions for the given task.
The training data varies depending on whether we are using Supervised Learning or
Unsupervised Learning Algorithms.
For Unsupervised learning, the training data contains unlabeled data points, i.e., inputs are
not tagged with the corresponding outputs. Models are required to find the patterns from the
given training datasets in order to make predictions.
On the other hand, for supervised learning, the training data contains labels in order to train
the model and make predictions.
The type of training data that we provide to the model is highly responsible for the model's
accuracy and prediction ability. It means that the better the quality of the training data, the
better will be the performance of the model. Training data typically makes up 60% or more of
the total data for an ML project.

What is a Test Dataset?


Once we train the model with the training dataset, it's time to test the model with the test
dataset. This dataset evaluates the performance of the model and ensures that the model can
generalise well to new or unseen data. The test dataset is another subset of the original
data, independent of the training dataset but with a similar distribution of features and
class probabilities, and it is used as a benchmark for model evaluation once training is
completed. Test data is a well-organized dataset that covers each type of scenario the model
would face when used in the real world. Usually, the test dataset is approximately 20-25% of
the total original data for an ML project.
At this stage, we can also check and compare the testing accuracy with the training accuracy,
which means how accurate our model is with the test dataset against the training dataset. If
the accuracy of the model on training data is greater than that on testing data, then the model
is said to have overfitting.
# Hold out 20% of the data for testing; random_state makes the split reproducible
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

What is Cross-Validation?
Cross validation is a technique used in machine learning to evaluate the performance of a
model on unseen data. It involves dividing the available data into multiple folds or subsets,
using one of these folds as a validation set, and training the model on the remaining folds.
This process is repeated multiple times, each time using a different fold as the validation set.
Finally, the results from each validation step are averaged to produce a more robust estimate
of the model’s performance.
→ The main purpose of cross validation is to prevent overfitting, which occurs when a model
is trained too well on the training data and performs poorly on new, unseen data.
1. Holdout Validation
In holdout validation, we train on 50% of the given dataset and use the remaining 50% for
testing. It's a simple and quick way to evaluate a model. The major drawback is that, because
we train on only 50% of the dataset, the held-out 50% may contain important information that
the model never sees, which leads to higher bias.
2. LOOCV (Leave One Out Cross Validation)
In this method, training is performed on all but one data point of the available dataset, and
the process is repeated for each data point. In LOOCV, the model is trained on n-1 samples and
tested on the one omitted sample, repeating this process for each data point in the dataset.
It has advantages as well as disadvantages. An advantage is that all data points are used, so
the estimate has low bias.
The major drawback is that testing against a single data point leads to higher variation in
the evaluation; if that point is an outlier, the variation can be even higher. Another
drawback is execution time, since the procedure iterates as many times as there are data
points.
3. K-Fold Cross Validation:
We split the dataset into k subsets (known as folds), train on k-1 of them, and hold out
the remaining fold for evaluating the trained model. We iterate k times, with a different
fold reserved for testing each time.

from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
svm_classifier = SVC(kernel='linear')

# 5-fold cross validation: each fold is used once as the test set
num_folds = 5
kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)
cross_val_results = cross_val_score(svm_classifier, X, y, cv=kf)
print(cross_val_results)         # accuracy on each fold
print(cross_val_results.mean())  # averaged estimate of performance

Overfitting :
A model can be said as overfitted when it performs quite well with the training dataset but
does not generalize well with the new or unseen dataset. The issue of overfitting occurs when
the model tries to cover all the data points and hence starts capturing the noise present in the data.
Due to this, it can't generalize well to the new dataset. Because of these issues, the accuracy
and efficiency of the model degrade. Generally, the complex model has a high chance of
overfitting. There are various ways by which we can avoid overfitting in the model, such as
Using the Cross-Validation method, early stopping the training, or by regularization, etc.
Underfitting :
On the other hand, the model is said to be under-fitted when it is not able to capture the
underlying trend of the data. It means the model shows poor performance even with the
training dataset. In most cases, underfitting issues occur when the model is not perfectly
suitable for the problem that we are trying to solve.
○ The main difference between training data and testing data is that training data is the
subset of original data that is used to train the machine learning model, whereas
testing data is used to check the accuracy of the model.
○ The training dataset is generally larger in size compared to the testing dataset. The
general ratios of splitting train and test datasets are 80:20, 70:30, or 90:10.
○ Training data is well known to the model as it is used to train the model, whereas
testing data is like unseen/new data to the model.
1. Feed: Firstly, we need to train the model by feeding it with training input data.
2. Define: Now, training data is tagged with the corresponding outputs (in Supervised
Learning), and the model transforms the training data into text vectors or a number of
data features.
3. Test: In the last step, we test the model by feeding it with the test data/unseen dataset.
This step ensures that the model is trained efficiently and can generalize well.

Processes Involved in Feature Engineering


Feature engineering in Machine learning consists of mainly 5 processes: Feature Creation,
Feature Transformation, Feature Extraction, Feature Selection, and Feature Scaling. It is an
iterative process that requires experimentation and testing to find the best combination of
features for a given problem. The success of a machine learning model largely depends on the
quality of the features used in the model.

1. Feature Creation
Feature Creation is the process of generating new features based on domain knowledge or by
observing patterns in the data. It is a form of feature engineering that can significantly
improve the performance of a machine-learning model.
Types of Feature Creation:
Domain-Specific: Creating new features based on domain knowledge, such as creating
features based on business rules or industry standards.
Data-Driven: Creating new features by observing patterns in the data, such as calculating
aggregations or creating interaction features.
Synthetic: Generating new features by combining existing features or synthesizing new data
points.
Why Feature Creation?
→ Improves Model Performance: By providing additional and more relevant information to
the model, feature creation can increase the accuracy and precision of the model.
→ Increases Model Robustness: By adding additional features, the model can become more
robust to outliers and other anomalies.
→ Improves Model Interpretability: By creating new features, it can be easier to understand
the model’s predictions.
→ Increases Model Flexibility: By adding new features, the model can be made more flexible
to handle different types of data.

2. Feature Transformation
Feature Transformation is the process of transforming the features into a more suitable
representation for the machine learning model. This is done to ensure that the model can
effectively learn from the data.
Types of Feature Transformation:
→ Normalisation: Rescaling the features to have a similar range, such as between 0 and 1, to
prevent some features from dominating others.
→ Scaling: Scaling is a technique used to transform numerical variables to have a similar
scale, so that they can be compared more easily. Rescaling the features to have a similar
scale, such as having a standard deviation of 1, to make sure the model considers all features
equally.
→ Encoding: Transforming categorical features into a numerical representation. Examples
are one-hot encoding and label encoding.
→ Transformation: Transforming the features using mathematical operations to change the
distribution or scale of the features. Examples are logarithmic, square root, and reciprocal
transformations.
Why Feature Transformation?
→ Improves Model Performance: By transforming the features into a more suitable
representation, the model can learn more meaningful patterns in the data.
→ Increases Model Robustness: Transforming the features can make the model more robust
to outliers and other anomalies.
→ Improves Computational Efficiency: The transformed features often require fewer
computational resources.
→ Improves Model Interpretability: By transforming the features, it can be easier to
understand the model’s predictions.

3. Feature Extraction
Feature Extraction is the process of creating new features from existing ones to provide more
relevant information to the machine learning model. This is done by transforming,
combining, or aggregating existing features.
Types of Feature Extraction:
→ Dimensionality Reduction: Reducing the number of features by transforming the data into
a lower-dimensional space while retaining important information. Examples are PCA and
t-SNE.
→ Feature Combination: Combining two or more existing features to create a new one. For
example, the interaction between two features.
→ Feature Aggregation: Aggregating features to create a new one. For example, calculating
the mean, sum, or count of a set of features.
→ Feature Transformation: Transforming existing features into a new representation. For
example, log transformation of a feature with a skewed distribution.
Why Feature Extraction?
→ Improves Model Performance: By creating new and more relevant features, the model can
learn more meaningful patterns in the data.
→ Reduces Overfitting: By reducing the dimensionality of the data, the model is less likely to
overfit the training data.
→ Improves Computational Efficiency: The transformed features often require fewer
computational resources.
→ Improves Model Interpretability: By creating new features, it can be easier to understand
the model’s predictions.

4. Feature Selection
Feature Selection is the process of selecting a subset of relevant features from the dataset to
be used in a machine-learning model. It is an important step in the feature engineering
process as it can have a significant impact on the model’s performance.
Types of Feature Selection:
→ Filter Method: Based on the statistical measure of the relationship between the feature and
the target variable. Features with a high correlation are selected.
→ Wrapper Method: Based on the evaluation of the feature subset using a specific machine
learning algorithm. The feature subset that results in the best performance is selected.
→ Embedded Method: Based on the feature selection as part of the training process of the
machine learning algorithm.
Why Feature Selection?
→ Reduces Overfitting: By using only the most relevant features, the model can generalize
better to new data.
→ Improves Model Performance: Selecting the right features can improve the accuracy,
precision, and recall of the model.
→ Decreases Computational Costs: A smaller number of features requires less computation
and storage resources.
→ Improves Interpretability: By reducing the number of features, it is easier to understand
and interpret the results of the model.

5. Feature Scaling
Feature Scaling is the process of transforming the features so that they have a similar scale.
This is important in machine learning because the scale of the features can affect the
performance of the model.
Types of Feature Scaling:
→ Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and 1, by
subtracting the minimum value and dividing by the range.
→ Standard Scaling: Rescaling the features to have a mean of 0 and a standard deviation of
1 by subtracting the mean and dividing by the standard deviation.
Why Feature Scaling?
→ Improves Model Performance: By transforming the features to have a similar scale, the
model can learn from all features equally and avoid being dominated by a few large features.
→ Increases Model Robustness: By transforming the features to be robust to outliers, the
model can become more robust to anomalies.
→ Improves Computational Efficiency: Many machine learning algorithms, such as k-nearest
neighbours, are sensitive to the scale of the features and perform better with scaled features.
→ Improves Model Interpretability: By transforming the features to have a similar scale, it
can be easier to understand the model’s predictions.
What are the Steps in Feature Engineering?

The steps for feature engineering vary across ML engineers and data scientists. Some of
the common steps involved in most machine-learning workflows are:

Data Cleansing
Data cleansing (also known as data cleaning or data scrubbing) involves identifying and
removing or correcting any errors or inconsistencies in the dataset. This step is important to
ensure that the data is accurate and reliable.

Data Transformation
Data transformation involves converting the cleaned data into a representation better suited
to the model, as described under Feature Transformation above.

Feature Extraction
Feature extraction involves deriving new, more informative features from the existing ones, as
described under Feature Extraction above.

Feature Selection
Feature selection involves selecting the most relevant features from the dataset for use in
machine learning. This can include techniques like correlation analysis, mutual information,
and stepwise regression.

Feature Iteration
Feature iteration involves refining and improving the features based on the performance of
the machine learning model. This can include techniques like adding new features, removing
redundant features and transforming features in different ways.

Overall, the goal of feature engineering is to create a set of informative and relevant features
that can be used to train a machine learning model and improve its accuracy and
performance. The specific steps involved in the process may vary depending on the type of
data and the specific machine-learning problem at hand.

How to Handle Missing Data


Handling missing data is very important as many machine learning algorithms do not support
data with missing values. If you have missing values in the dataset, it can cause errors and
poor performance with some machine learning algorithms.
Common placeholders for missing values in a dataset include: N/A, null, Empty, ?, none, -, and NaN.

Data can be missing for many reasons like technical issues, human errors, privacy concerns,
data processing issues, or the nature of the variable itself. Understanding the cause of missing
data helps choose appropriate handling strategies and ensure the quality of your analysis.
1. Deleting the Missing values

Generally, this approach is not recommended. The disadvantage of this method is one might
end up deleting some useful data from the dataset.
There are 2 ways one can delete the missing data values:
→ Deleting the rows :
If a row has many missing values, you can drop the entire row. If every row has some
(column) value missing, you might end up deleting the whole data. The code to drop the
entire row is as follows:
IN:
df = train_df.dropna(axis=0)
df.isnull().sum()

→ Deleting the columns :


If a certain column has many missing values, then you can choose to drop the entire column.
The code to drop the entire column is as follows:
IN:
df = train_df.drop(['Dependents'],axis=1)
df.isnull().sum()

2. Imputing the missing values :


→ Replacing with an arbitrary value
If you can make an educated guess about the missing value, then you can replace it with some
arbitrary value using the following code. E.g., in the following code, we are replacing the
missing values of the ‘Dependents’ column with ‘0’.
IN:
# Replace the missing value with '0' using the 'fillna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()

→ Replacing with the mean


This is the most common method of imputing missing values of numeric columns. If there are
outliers, then the mean will not be appropriate. In such cases, outliers need to be treated first.
You can use the ‘fillna’ method for imputing the columns ‘LoanAmount’ and
‘Credit_History’ with the mean of the respective column values.
IN:
#Replace the missing values for numerical columns with mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())

→ Replacing with the mode


Mode is the most frequently occurring value. It is used in the case of categorical features.
You can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’
and ‘Self_Employed.’
IN:
#Replace the missing values for categorical columns with mode
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])

→ Replacing with the median


The median is the middlemost value. It’s better to use the median value for imputation in the
case of outliers. You can use the ‘fillna’ method for imputing the column
‘Loan_Amount_Term’ with the median value
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())

How to Impute Missing Values for Categorical Features?


→ Impute the most frequent value (or the mean/median for numeric columns)
We will use 'SimpleImputer' in this case. As 'Shape' is a non-numeric column, we can't use the
mean or median strategies, but we can use the most frequent value or a constant.
IN:
import pandas as pd
import numpy as np
X = pd.DataFrame({'Shape':['square', 'square', 'oval', 'circle', np.nan]})
X
OUT:
    Shape
0  square
1  square
2    oval
3  circle
4     NaN

IN:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)
OUT:
array([['square'],
['square'],
['oval'],
['circle'],
['square']], dtype=object)

import numpy as np
from sklearn.impute import SimpleImputer

# For numeric columns the 'mean' (or 'median') strategy can be used instead
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
print(imp.transform([[np.nan, 5]]))   # the missing entry is replaced by the column mean 4.0
Nearest Neighbors Imputations (KNNImputer)
Missing values are imputed using the k-Nearest Neighbors approach, where Euclidean distance is
used to find the nearest neighbours. Consider a small numeric example in the style of the
Titanic dataset to see how it works.
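The original notes refer to a Titanic example that is not shown here, so the small DataFrame
below is an assumed stand-in: the columns 'SibSp', 'Fare', and 'Age' and their values were
chosen to match the output printed after the code, with one missing 'Age' in the last row.
(The earlier 'Shape' DataFrame cannot be used, because KNNImputer works only on numeric data.)

import numpy as np
import pandas as pd

# Hypothetical numeric subset of the Titanic data; the last row has a missing Age
X = pd.DataFrame({'SibSp': [1, 1, 0, 1, 0, 0],
                  'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
                  'Age': [22, 38, 26, 35, 35, np.nan]})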

IN:
from sklearn.impute import KNNImputer
impute_knn = KNNImputer(n_neighbors=2)
impute_knn.fit_transform(X)
OUT:
array([[ 1. , 7.25 , 22. ],
[ 1. , 71.2833, 38. ],
[ 0. , 7.925 , 26. ],
[ 1. , 53.1 , 35. ],
[ 0. , 8.05 , 35. ],
[ 0. , 8.4583, 30.5 ]])
In the above example, n_neighbors=2, so scikit-learn finds the two most similar rows, measured
by how close their 'SibSp' and 'Fare' values are to the row with the missing value. In this
case, the last row has a missing 'Age'. The third and fifth rows have the closest values for
the other two features, so the average of their 'Age' values, 30.5, is taken as the imputed
value.

Managing categorical data :


Encoding categorical variables is a vital step in preparing data for machine learning tasks.
When dealing with categorical data, characterized by non-numeric values such as text or
categories, it becomes necessary to transform them into a numerical format for compatibility
with machine learning algorithms.
1. One-Hot Encoding:
One-hot encoding is the most widely used categorical encoding technique. It is suitable for
nominal categorical variables, where the categories have no inherent order or relationship.
The idea behind one-hot encoding is to represent each category as a binary vector. Here’s
how it works:

For each category in a categorical column, a new binary column is created. The binary column
has a value of 1 if that category is present in the row, else it is 0.
For example, if you have a categorical feature “Color” with values “Red”, “Blue,” and
“Green,” one-hot encoding would convert it into three binary columns:

import pandas as pd
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']}
df = pd.DataFrame(data)
one_hot_encoded = pd.get_dummies(df, columns=['Color'])
2. Label encoding : assigns each category a unique integer label, starting from 0. It is best
suited to ordinal features (or features with only two categories), because the integer values
imply an order that a model may pick up on.

For example, if you have an ordinal categorical feature “Size” with values “Small,”
“Medium,” and “Large,” label encoding would convert it as follows:

Size Size_encoded
0 Small 2
1 Medium 1
2 Large 0
3 Medium 1
4 Small 2
As you can see, each unique category in the ‘Size’ column has been replaced with a unique
integer label, where ‘Small’ is encoded as 2, ‘Medium’ as 1, and ‘Large’ as 0.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}
df = pd.DataFrame(data)
label_encoder = LabelEncoder()
df['Size_encoded'] = label_encoder.fit_transform(df['Size'])
print(df)

Data scaling and normalisation :


Feature Scaling is a technique to standardize the independent features present in the data to a
fixed range. It is performed during data pre-processing to handle highly varying magnitudes,
values, or units. If feature scaling is not done, a machine learning algorithm tends to treat
larger values as more important and smaller values as less important, regardless of the units
of those values.

Scaling guarantees that all features are on a comparable scale and have comparable ranges.
This process is known as feature normalisation. This is significant because the magnitude of
the features has an impact on many machine learning techniques. Larger scale features may
dominate the learning process and have an excessive impact on the outcomes. You can avoid
this problem and make sure that each feature contributes equally to the learning process by
scaling the features.

1. Min-Max scaler :
First, find the minimum and the maximum value of the column. Then subtract the minimum
value from each entry and divide the result by the difference between the maximum and the
minimum value:
x_scaled = (x - min) / (max - min)
Because it relies on the maximum and minimum values, this method is sensitive to outliers,
but after these two steps the data falls in the range 0 to 1.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
columns=df.columns)
scaled_df.head()

2. Normalization (mean normalization)
This method is similar to the previous one, but instead of the minimum value we subtract the
mean of the column from each entry and then divide the result by the difference between the
maximum and the minimum value:
x_scaled = (x - mean) / (max - min)
The resulting values lie roughly in the range [-1, 1]. Note that scikit-learn's Normalizer
(used below) works differently: it rescales each sample (row) to unit norm rather than scaling
each column.

from sklearn.preprocessing import Normalizer


scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
columns=df.columns)
print(scaled_df.head())

3. Standardization
This method of scaling is based on the central tendency and variance of the data. First,
calculate the mean and standard deviation of the data to be scaled, then subtract the mean
from each entry and divide the result by the standard deviation:
x_scaled = (x - mean) / standard deviation
The result has a mean equal to zero and a standard deviation equal to 1; if the data is
approximately normally distributed, most scaled values fall roughly in the range [-3, +3].

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
columns=df.columns)
print(scaled_df.head())
Whitening :
Whitening (or sphering) is a data pre-processing step that removes correlation or dependencies
between the features in a dataset, transforming them so that their covariance matrix becomes
the identity matrix (uncorrelated features with unit variance). This may help to better train
a machine learning model.

https://learndataa.com/2020/09/15/data-preprocessing-whitening-or-sphering-in-python/
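A minimal sketch of one common way to whiten data, using scikit-learn's PCA with whiten=True;
the toy data and variable names here are illustrative, not from the linked article.

import numpy as np
from sklearn.decomposition import PCA

# Toy data with correlated features (placeholder example)
rng = np.random.RandomState(0)
X = rng.randn(200, 2) @ np.array([[2.0, 1.0], [0.0, 0.5]])

# PCA with whiten=True decorrelates the features and rescales them to unit variance
whitener = PCA(whiten=True)
X_white = whitener.fit_transform(X)

print(np.cov(X_white, rowvar=False).round(2))   # approximately the identity matrix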

Feature selection and filtering :


Feature selection is a process that chooses a subset of features from the original features so
that the feature space is optimally reduced according to a certain criterion.

Feature selection is a critical step in the feature construction process. In text categorization
problems, some words simply do not appear very often. Perhaps the word “groovy” appears
in exactly one training document, which is positive. Is it really worth keeping this word
around as a feature? It's a risky call, because with just one training example it is hard to
tell whether the word is really correlated with the positive class or is just noise. You could
hope that your learning algorithm is smart enough to figure it out, or you could simply remove
it. The main reasons for feature selection are:
1. To reduce the dimensionality of the feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.

Filter Methods

These methods are generally used during the pre-processing step. They select features from the
dataset irrespective of any machine learning algorithm. They are computationally fast and
inexpensive and are very good at removing duplicated, correlated, and redundant features,
although they do not remove multicollinearity. Each feature is evaluated individually, which
works well when features are informative in isolation (with no dependency on other features),
but filter methods fall short when it is a combination of features that improves the overall
performance of the model. A short sketch follows.
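A small sketch of a filter method using scikit-learn's SelectKBest with an ANOVA F-test score;
the Iris data and the choice of k=2 are only for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the target independently and keep the best 2
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # per-feature F-scores
print(X_selected.shape)   # (150, 2)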
Wrapper methods:

Wrapper methods, also referred to as greedy algorithms, train a model using a subset of
features in an iterative manner. Based on the conclusions drawn from the previously trained
model, features are added or removed. The stopping criteria for selecting the best subset are
usually pre-defined by the person training the model, such as when the performance of the
model starts to decrease or a specific number of features has been reached. The main advantage
of wrapper methods over filter methods is that they provide an optimal set of features for
training the model, thus resulting in better accuracy than filter methods, but they are
computationally more expensive.

1. Forward selection – This method is an iterative approach where we initially start with
an empty set of features and keep adding a feature which best improves our model
after each iteration. The stopping criterion is till the addition of a new variable does
not improve the performance of the model.
2. Backward elimination – This method is also an iterative approach where we initially
start with all features and after each iteration, we remove the least significant feature.
The stopping criterion is till no improvement in the performance of the model is
observed after the feature is removed.
3. Bi-directional elimination – This method uses both forward selection and backward
elimination technique simultaneously to reach one unique solution.
4. Exhaustive selection – This technique is considered as the brute force approach for
the evaluation of feature subsets. It creates all possible subsets and builds a learning
algorithm for each subset and selects the subset whose model’s performance is best.
5. Recursive elimination – This greedy optimization method selects features by
recursively considering smaller and smaller sets of features. The estimator is trained on
the initial set of features and their importance is obtained from its coef_ or
feature_importances_ attribute. The least important features are then removed from the
current set until we are left with the required number of features (a short sketch follows
this list).
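A sketch of recursive feature elimination (a wrapper method) using scikit-learn's RFE wrapped
around logistic regression; the breast cancer dataset and the target of 5 features are
illustrative choices.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scaling helps the solver converge

# Repeatedly fit the estimator and drop the least important feature
# until only 5 features remain
estimator = LogisticRegression(max_iter=5000)
rfe = RFE(estimator=estimator, n_features_to_select=5)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 marks a selected feature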

Embedded methods:

In embedded methods, the feature selection algorithm is blended into the learning algorithm
itself, which thus has its own built-in feature selection. Embedded methods overcome the
drawbacks of filter and wrapper methods and merge their advantages: they are fast like filter
methods, more accurate than filter methods, and take combinations of features into
consideration as well.

1. Regularization – This method adds a penalty to the parameters of the machine learning
model to avoid over-fitting. This approach to feature selection uses Lasso (L1
regularization) and Elastic Net (L1 and L2 regularization). The penalty is applied to the
coefficients, driving some of them down to zero; the features with zero coefficients can
then be removed from the dataset (a short sketch follows this list).
2. Tree-based methods – Methods such as Random Forest and Gradient Boosting provide feature
importance as a way to select features. Feature importance tells us which features have the
most impact on the target feature.
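A sketch of an embedded approach: L1 regularization (Lasso) drives some coefficients to
exactly zero, and SelectFromModel keeps only the features with non-zero coefficients. The
diabetes dataset and the alpha value are illustrative.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# With a large enough alpha, the L1 penalty shrinks several coefficients to exactly zero
lasso = Lasso(alpha=10.0).fit(X, y)

# Keep only the features whose coefficient is non-zero
selector = SelectFromModel(lasso, prefit=True)
print(lasso.coef_.round(2))
print(selector.get_support())   # True for the retained features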

What Is Principal Component Analysis?

Principal Component Analysis (PCA) is a powerful technique used in data analysis, particularly
for reducing the dimensionality of datasets while preserving crucial information. It does this
by transforming the original variables into a set of new, uncorrelated variables called
principal components. Here's a breakdown of PCA's key aspects:

● Dimensionality Reduction: PCA helps manage high-dimensional datasets by extracting
essential information and discarding less relevant features, simplifying analysis.
● Data Exploration and Visualization: It plays a significant role in data exploration and
visualisation, aiding in uncovering hidden patterns and insights.
● Linear Transformation: PCA performs a linear transformation of the data, seeking the
directions of maximum variance, i.e., it preserves as much of the variance as possible
using the fewest dimensions.
● Feature Selection: Principal components are ranked by the variance they explain,
allowing for effective feature selection.
● Data Compression: PCA can compress data while preserving most of the original
information.
● Clustering and Classification: It finds applications in clustering and classification
tasks by reducing noise and highlighting underlying structure.
● Advantages: PCA offers linearity, computational efficiency, and scalability for large
datasets.
● Limitations: It assumes data normality and linearity and may lead to information loss.
● Matrix Requirements: PCA works with symmetric correlation or covariance matrices
and requires numeric, standardized data.
● Eigenvalues and Eigenvectors: Eigenvalues represent variance magnitude, and
eigenvectors indicate variance direction.
● Number of Components: The number of principal components chosen determines the
number of eigenvectors computed.

What Are Principal Components?


A principal component is a normalized linear combination of the original features in a
data set. Let's say we have a set of predictors X1, X2, ..., Xp.
The first principal component can be written as:

Z1 = φ11·X1 + φ21·X2 + φ31·X3 + ... + φp1·Xp

where,
Z1 is the first principal component
φ11, φ21, ..., φp1 are the loadings of the first principal component, and together they form
its loading vector. The loadings are constrained so that their sum of squares equals 1.
X1, ..., Xp are normalized predictors, i.e., they have mean values equal to zero and
standard deviations equal to one.

First Principal Component


The first principal component is the linear combination of the original predictor variables
that captures the maximum variance in the data set. It determines the direction of highest
variability in the data. The larger the variability captured by the first component, the more
information it captures; no other component can have higher variability than the first
principal component.

The first principal component corresponds to the line that is closest to the data, i.e., it
minimizes the sum of squared distances between the data points and the line.

Second Principal Component (Z²)


The second principal component is also a linear combination of the original predictors, which
captures the remaining variance in the data set and is uncorrelated with Z1. In other words,
the correlation between the first and second components should be zero. It can be represented
as:

Z2 = φ12·X1 + φ22·X2 + φ32·X3 + ... + φp2·Xp

If the two components are uncorrelated, their directions should be orthogonal.

How does Principal Component Analysis (PCA) work?


→ Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a
standard deviation of 1.

→ Step 2: Covariance Matrix Computation

Covariance measures the strength of joint variability between two or more variables,
indicating how much they change in relation to each other. The covariance of two features
x1 and x2 over n samples can be computed as:

cov(x1, x2) = Σ (x1i - mean(x1)) · (x2i - mean(x2)) / (n - 1)

The value of covariance can be positive, negative, or zero.
Positive: as x1 increases, x2 also increases.
Negative: as x1 increases, x2 decreases.
Zero: no direct linear relation.
→ Step 3: Compute Eigenvalues and Eigenvectors of Covariance Matrix to Identify Principal
Components
Let A be a square n×n matrix and X be a non-zero vector for which

A·X = λ·X

for some scalar value λ. Then λ is known as an eigenvalue of matrix A, and X is known as the
eigenvector of matrix A for the corresponding eigenvalue.
It can also be written as:

(A - λI)·X = 0

where I is the identity matrix of the same shape as matrix A. This condition holds for a
non-zero X only if (A - λI) is non-invertible (i.e., a singular matrix). That means,

det(A - λI) = 0

From this equation we can find the eigenvalues λ, and the corresponding eigenvectors can then
be found using the equation A·X = λ·X.
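A short sketch of PCA in scikit-learn that follows these steps (standardize, then let PCA
compute the covariance structure and project onto the leading components); the Iris data and
the choice of two components are illustrative.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Step 1: standardize so every feature has mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Steps 2-3: PCA computes the covariance eigenvectors internally and
# projects the data onto the top 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)   # share of variance captured by each component
print(X_pca.shape)                     # (150, 2), convenient for 2-D visualization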

Advantages of Principal Component Analysis


● Dimensionality Reduction: Principal Component Analysis is a popular technique used
for dimensionality reduction, which is the process of reducing the number of variables
in a dataset. By reducing the number of variables, PCA simplifies data analysis,
improves performance, and makes it easier to visualize data.
● Feature Selection: Principal Component Analysis can be used for feature selection,
which is the process of selecting the most important variables in a dataset. This is
useful in machine learning, where the number of variables can be very large, and it is
difficult to identify the most important variables.
● Data Visualization: Principal Component Analysis can be used for data visualization.
By reducing the number of variables, PCA can plot high-dimensional data in two or
three dimensions, making it easier to interpret.
● Multicollinearity: Principal Component Analysis can be used to deal with
multicollinearity, which is a common problem in a regression analysis where two or
more independent variables are highly correlated. PCA can help identify the
underlying structure in the data and create new, uncorrelated variables that can be
used in the regression model.
● Noise Reduction: Principal Component Analysis can be used to reduce the noise in
data. By removing the principal components with low variance, which are assumed to
represent noise, Principal Component Analysis can improve the signal-to-noise ratio
and make it easier to identify the underlying structure in the data.
● Data Compression: Principal Component Analysis can be used for data compression.
By representing the data using a smaller number of principal components, which
capture most of the variation in the data, PCA can reduce the storage requirements
and speed up processing.
● Outlier Detection: Principal Component Analysis can be used for outlier detection.
Outliers are data points that are significantly different from the other data points in the
dataset. Principal Component Analysis can identify these outliers by looking for data
points that are far from the other points in the principal component space.
Disadvantages of Principal Component Analysis
● Interpretation of Principal Components: The principal components created by
Principal Component Analysis are linear combinations of the original variables, and it
is often difficult to interpret them in terms of the original variables. This can make it
difficult to explain the results of PCA to others.
● Data Scaling: Principal Component Analysis is sensitive to the scale of the data. If the
data is not properly scaled, then PCA may not work well. Therefore, it is important to
scale the data before applying Principal Component Analysis.
● Information Loss: Principal Component Analysis can result in information loss. While
Principal Component Analysis reduces the number of variables, it can also lead to loss
of information. The degree of information loss depends on the number of principal
components selected. Therefore, it is important to carefully select the number of
principal components to retain.
● Non-linear Relationships: Principal Component Analysis assumes that the
relationships between variables are linear. However, if there are non-linear
relationships between variables, Principal Component Analysis may not work well.
● Computational Complexity: Computing Principal Component Analysis can be
computationally expensive for large datasets. This is especially true if the number of
variables in the dataset is large.
● Overfitting: Principal Component Analysis can sometimes result in overfitting, which
is when the model fits the training data too well and performs poorly on new data.
This can happen if too many principal components are used or if the model is trained
on a small dataset.
REGRESSION : Regression is a machine learning technique where the model predicts the output
as a continuous numerical value. It predicts continuous output variables based on independent
input variables, for example predicting house prices from parameters like house age, distance
from the main road, location, area, etc.

LINEAR REGRESSION : Linear regression models the relationship between two variables by
assuming a linear connection between the independent and dependent variables. It seeks the
optimal line that minimizes the sum of squared differences between predicted and actual
values. Applied in various domains like economics and finance, this method is used to analyze
and forecast data trends.

→ Linear regression is a type of supervised machine learning algorithm that computes the
linear relationship between the dependent variable and one or more independent features by
fitting a linear equation to observed data.

→ When there is only one independent feature, it is known as Simple Linear Regression, and
when there are more than one feature, it is known as Multiple Linear Regression.

→ Similarly, when there is only one dependent variable, it is considered Univariate Linear
Regression, while when there is more than one dependent variable, it is known as
Multivariate Regression.

Pros:
→ Linear Regression is simple to implement.
→ Less complexity compared to other algorithms.
→ Linear Regression may lead to over-fitting but it can be avoided using some
dimensionality reduction techniques, regularization techniques, and cross-validation.
Cons:
→ Outliers affect this algorithm badly.
→ It can over-simplify real-world problems by assuming a linear relationship among the
variables, so it may not be adequate for more complex practical use-cases.

Types of Linear Regression


There are two main types of linear regression:
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one independent variable
and one dependent variable. The equation for simple linear regression is:

Y = β0 + β1X

where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
Multiple linear regression :
This involves more than one independent variable and one dependent variable. The equation
for multiple linear regression is:

Y = β0 + β1X1 + β2X2 + ... + βpXp

where:
Y is the dependent variable
X1, X2, …, Xp are the independent variables
β0 is the intercept
β1, β2, …, βp are the slopes

The goal of the algorithm is to find the best Fit Line equation that can predict the values
based on the independent variables.

Random Error (Residuals)
In regression, the difference between the observed value of the dependent variable (yi) and
the predicted value (ŷi) is called the residual:
εi = yi – ŷi, where ŷi = β0 + β1Xi

What is the best Fit Line?


Our primary objective while using linear regression is to locate the best-fit line, which
implies that the error between the predicted and actual values should be kept to a minimum.
There will be the least error in the best-fit line.

The best Fit Line equation provides a straight line that represents the relationship between the
dependent and independent variables. The slope of the line indicates how much the dependent
variable changes for a unit change in the independent variable(s).
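A minimal sketch of fitting a best-fit line with scikit-learn; the data here is a made-up,
roughly linear toy example, so the exact numbers are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 3x + 4 with a little noise
rng = np.random.RandomState(0)
X = rng.rand(50, 1) * 10
y = 3 * X.ravel() + 4 + rng.randn(50)

model = LinearRegression().fit(X, y)
print(model.intercept_)         # estimate of the intercept (β0)
print(model.coef_)              # estimate of the slope (β1)
print(model.predict([[5.0]]))   # prediction for x = 5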
ASSUMPTIONS OF SIMPLE LINEAR REGRESSION :
Linearity: The independent and dependent variables have a linear relationship with one
another. This implies that changes in the dependent variable follow those in the independent
variable(s) in a linear fashion. This means that there should be a straight line that can be
drawn through the data points. If the relationship is not linear, then linear regression will not
be an accurate model.
Independent of residuals: The observations in the dataset are independent of each other.
This means that the value of the dependent variable for one observation does not depend on
the value of the dependent variable for another observation. If the observations are not
independent, then linear regression will not be an accurate model.

Homoscedasticity: The error terms must have constant variance. This phenomenon is known
as Homoscedasticity. The presence of non-constant variance in the error terms is referred to
as Heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or
extreme leverage values.

Normality distribution of variables : The residuals should be normally distributed. This


means that the residuals should follow a bell-shaped curve. If the residuals are not normally
distributed, then linear regression will not be an accurate model.

VARIANCE AND BIAS :


What is Bias?
Bias is the error introduced by a model's inability to capture the true relationship, which
causes a difference between the model's predicted values and the actual values. These
differences between the actual (expected) values and the predicted values are known as bias
error, or error due to bias. Bias is a systematic error that occurs due to wrong assumptions
in the machine learning process.
→ Low Bias: Low bias value means fewer assumptions are taken to build the target function.
In this case, the model will closely match the training dataset.
→ High Bias: High bias value means more assumptions are taken to build the target function.
In this case, the model will not match the training dataset closely.
Ways to reduce high bias in Machine Learning:
● Use a more complex model: One of the main causes of high bias is an overly simplified
model that cannot capture the complexity of the data. In such cases, we can make our model
more complex, for example by increasing the number of hidden layers in a deep neural
network, or by using a more expressive model such as polynomial regression for non-linear
datasets, a CNN for image processing, or an RNN for sequence learning.
● Increase the number of features: Adding more features to the training dataset increases
the complexity of the model and improves its ability to capture the underlying patterns in
the data.
● Reduce regularization of the model: Regularization techniques such as L1 or L2
regularization help prevent overfitting and improve the generalization ability of the model.
If the model has high bias, reducing the strength of regularization, or removing it
altogether, can help improve its performance.
● Increase the size of the training data: Increasing the size of the training data can help
to reduce bias by providing the model with more examples to learn from the dataset.

What is variance?
Variance is the measure of spread in data from its mean position. In machine learning,
variance is the amount by which the performance of a predictive model changes when it is
trained on different subsets of the training data. More specifically, it measures how
sensitive the model is to the particular subset of training data it sees, i.e., how much its
learned function changes when fitted on a new subset of the training data.
→ Low variance: Low variance means that the model is not very sensitive to changes in the
training data and produces consistent estimates of the target function across different
subsets of data from the same distribution. Combined with high bias, this is typical of
underfitting, where the model performs poorly on both training and test data.
→ High variance: High variance means that the model is very sensitive to changes in the
training data, so its estimate of the target function can change significantly when it is
trained on different subsets of data from the same distribution. This is the case of
overfitting, where the model performs well on the training data but poorly on new, unseen
test data: it fits the training data so closely that it fails to generalise.
Ways to reduce variance -
→ Cross-validation: By splitting the data into training and testing sets multiple times,
cross-validation can help identify if a model is overfitting or underfitting and can be used to
tune hyperparameters to reduce variance.
→ Feature selection: Choosing only the relevant features decreases the model's complexity
and can reduce the variance error.
→ Regularization: We can use L1 or L2 regularization to reduce variance in machine learning
models
R², also known as the coefficient of determination or goodness of fit, measures how much of
the variance in the target variable is explained by the model:
R² = 1 − (sum of squared residuals / total sum of squares).

Ridge Regression
Ridge regression is a type of regularized regression model. This means it is a variation of the
standard linear regression model that includes a regularized term in the cost function. The
purpose of this is to prevent Overfitting. Ridge Regression adds an L2 regularization term to
the linear equation. That’s why it is also known as L2 Regularization or L2 Norm.

How does Ridge Regression work?


Ridge Regression works by adding a penalty term to the cost function of a linear regression
model, called the regularization term. This regularization term prevents the model from
overfitting by penalizing the large coefficients. The regularization parameter determines how
much the model should be penalized. By increasing the regularization parameter, the
coefficients are shrunk toward zero.
Lasso Regression - Least Absolute Shrinkage and Selection Operator
Lasso Regression is also a type of regularization linear model. It also adds a penalty term to
the cost function but it adds L1 regularization instead of L2 regularization and hence is also
known as L1 Regularization or L1 Norm. This term is the sum of the absolute values of the
coefficients, multiplied by a constant lambda. The main aim of Lasso Regression is to reduce
the features and hence can be used for Feature Selection.

Difference b/w Ridge and Lasso Regression


The main difference between Ridge and Lasso regression is in how they shrink the
coefficients. Ridge regression shrinks all the coefficients by a small amount but never to
exactly zero, whereas Lasso can shrink some coefficients more than others, all the way to
zero, and hence can completely eliminate those features.
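A sketch comparing how Ridge and Lasso shrink coefficients on the same standardized data;
the diabetes dataset and the alpha values are illustrative choices.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Ridge (L2) shrinks every coefficient a little; none become exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso (L1) can shrink some coefficients all the way to zero
lasso = Lasso(alpha=10.0).fit(X, y)

print(ridge.coef_.round(2))
print(lasso.coef_.round(2))   # with a large enough alpha, some entries are exactly 0.0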
Conclusion
Both Ridge and Lasso regression are quite popular algorithms. Ridge regression is used to
avoid Overfitting while Lasso Regression can be used for feature selection. Both are
regularization techniques. Both methods are useful in their own way and we can choose one
over the other according to our needs.

ELASTICNET :
Elastic net is a combination of the two most popular regularized variants of linear regression:
ridge and lasso. Ridge utilizes an L2 penalty and lasso uses an L1 penalty. With elastic net,
you don't have to choose between these two models, because elastic net uses both the L2 and
the L1 penalty
Instead of one regularization parameter α we now use two parameters, one for each penalty:
α1 controls the L1 penalty and α2 controls the L2 penalty. We can then use elastic net in the
same way that we use ridge or lasso. If α1 = 0, we have ridge regression; if α2 = 0, we have
lasso. Alternatively, instead of using two α parameters, we can use a single α and an
L1-ratio parameter, which determines the fraction of the penalty assigned to L1. So if α = 1
and L1-ratio = 0.4, the L1 penalty is multiplied by 0.4 and the L2 penalty by
1 − L1-ratio = 0.6. One common way to write the resulting cost (with βj the model coefficients
and RSS the residual sum of squares) is:

Loss = RSS + α · L1-ratio · Σ|βj| + α · (1 − L1-ratio) · Σβj²
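A sketch of elastic net in scikit-learn, which uses the single-alpha plus l1_ratio
parametrization described above; the synthetic data and parameter values are illustrative.

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic regression data with only a few truly informative features
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=10.0, random_state=42)

# l1_ratio=0.4 means 40% of the penalty weight goes to L1 and 60% to L2
enet = ElasticNet(alpha=1.0, l1_ratio=0.4).fit(X, y)
print(enet.coef_.round(2))   # coefficients of uninformative features are shrunk toward zero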

What is Robust Regression?


Robust regression is a variation of traditional regression analysis that is less sensitive to
outliers in the data. Outliers are data points that deviate significantly from the majority of the
data points, and they can have a substantial impact on the regression model's performance.
Traditional regression methods, such as ordinary least squares (OLS), treat all data points
equally, regardless of their distance from the central cluster. This makes them highly
influenced by outliers, resulting in biased parameter estimates and poor predictive
performance.

Robust regression techniques, on the other hand, aim to down-weight the impact of outliers
by assigning lower weights to these data points during the model fitting process. By giving
less weight to outliers, robust regression models can provide more accurate parameter
estimates and better predictions.
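A sketch of robust regression using scikit-learn's HuberRegressor, which down-weights
outliers, compared against ordinary least squares on the same made-up toy data.

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Toy data: y = 2x + 1 with noise, plus a few extreme outliers
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10
y = 2 * X.ravel() + 1 + rng.randn(100)
y[:5] += 100          # corrupt five points

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print(ols.coef_, ols.intercept_)      # noticeably biased by the outliers
print(huber.coef_, huber.intercept_)  # much closer to the true slope 2 and intercept 1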

Polynomial regression ?
→ Polynomial Regression is a regression algorithm that models the relationship between a
dependent(y) and independent variable(x) as nth degree polynomial.
→ It is also called the special case of Multiple Linear Regression in ML. Because we add
some polynomial terms to the Multiple Linear regression equation to convert it into
Polynomial Regression.
→ It is a linear model with some modification in order to increase the accuracy.
→ If we apply a linear model to a linear dataset, it gives a good result, as we have seen in
Simple Linear Regression. But if we apply the same model, without any modification, to a
non-linear dataset, it produces poor results: the loss function increases, the error rate is
high, and accuracy decreases.
→ So for such cases, where data points are arranged in a non-linear fashion, we need the
Polynomial Regression model.
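A sketch of polynomial regression: PolynomialFeatures expands x into polynomial terms, and an
ordinary linear model is then fitted on the expanded features. The data and the degree are
illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy non-linear data: y is roughly 0.5x^2 - x + 2 with noise
rng = np.random.RandomState(0)
X = np.sort(rng.rand(60, 1) * 6 - 3, axis=0)
y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + rng.randn(60) * 0.3

# Degree-2 polynomial features followed by a linear model
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)

print(poly_model.predict([[2.0]]))   # close to 0.5*4 - 2 + 2 = 2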

Isotonic regression ?
Isotonic regression is a regression technique in which the predictor variable is monotonically
related to the target variable. This means that as the value of the predictor variable increases,
the value of the target variable either increases or decreases in a consistent, non-oscillating
manner.
Mathematically, isotonic regression can be formulated as an optimization problem in which
the goal is to find a monotonic function that minimizes the sum of the squared errors between
the predicted and observed values of the target variable.
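A sketch of scikit-learn's IsotonicRegression fitting a monotonically non-decreasing function
to noisy data; the data values here are illustrative.

import numpy as np
from sklearn.isotonic import IsotonicRegression

# Noisy data with an overall increasing trend
rng = np.random.RandomState(0)
x = np.arange(20)
y = x + rng.randn(20) * 2

# Fit a non-decreasing function that minimizes the squared error
iso = IsotonicRegression(increasing=True)
y_fit = iso.fit_transform(x, y)

print(np.all(np.diff(y_fit) >= 0))   # True: the fitted values never decrease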
Applications of Isotonic Regression
Isotonic regression has a number of applications, including:
● Calibration of predicted probabilities: Isotonic regression can be used to adjust the
predicted probabilities produced by a classifier so that they are more accurately
calibrated to the true probabilities.
● Ordinal regression: Isotonic regression can be used to model ordinal variables, which
are variables that can be ranked in order (e.g., “low,” “medium,” and “high”).
● Non-parametric regression: Because isotonic regression does not make any
assumptions about the functional form of the relationship between the predictor and
target variables, it can be used as a non-parametric regression method.
● Imputing missing values: Isotonic regression can be used to impute missing values in
a dataset by predicting the missing values based on the surrounding non-missing
values.
● Outlier detection: Isotonic regression can be used to identify outliers in a dataset by
identifying points that are significantly different from the overall trend of the data.
