ml_unit_2
Feature Selection and Feature Engineering: Data sets, creating training and test sets,
managing categorical data, missing features, data scaling and normalization, whitening.
Feature selection and filtering, PCA, Visualization of high-dimensional datasets.
Regression Algorithms: Linear regression, Regression types: Ridge, Lasso, ElasticNet,
Robust, Polynomial and Isotonic.
Feature Engineering is the process of creating new features or transforming existing features
to improve the performance of a machine-learning model. It involves selecting relevant
information from raw data and transforming it into a format that a model can easily use, with
the goal of improving model accuracy by providing more meaningful and relevant information.
In other words, it is the process of selecting, extracting, and transforming the most relevant
features from the available data to build more accurate and efficient machine learning models.
The success of machine learning models heavily depends on the quality of the features used
to train them. Feature engineering involves a set of techniques that enable us to create new
features by combining or transforming the existing ones. These techniques help to highlight
the most important patterns and relationships in the data, which in turn helps the machine
learning model to learn from the data more effectively.
What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute) is an
individual measurable property or characteristic of a data point that is used as input for a
machine learning algorithm. Features can be numerical, categorical, or text-based, and they
represent different aspects of the data that are relevant to the problem at hand.
For example, in a dataset of housing prices, features could include the number of bedrooms,
the square footage, the location, and the age of the property. In a dataset of customer
demographics, features could include age, gender, income level, and occupation.
What is a Dataset?
A Dataset is a set of data grouped into a collection with which developers can work to meet
their goals. In a dataset, the rows represent the number of data points and the columns
represent the features of the Dataset.
Numerical Dataset: They include numerical data points that can be solved with equations.
These include temperature, humidity, marks and so on.
→ continuous data : Continuous data refers to data that can be measured. This data has values
that are not fixed and have an infinite number of possible values. These measurements can
also be broken down into smaller individual parts.
→ discrete data : Discrete data also referred to as discrete values, is data that only takes
certain values. Commonly in the form of whole numbers or integers, this is data that can be
counted and has a finite number of values. These values must be able to fall within certain
classifications and are unable to be broken down into smaller parts.
Categorical Dataset: These include categories such as colour, gender, occupation, games,
sports and so on.
→ nominal data : Nominal data is a qualitative type of data used to classify and label
variables, and is known as the simplest data type on Stevens’ scale. All nominal data is
categorical (meaning this data can be sorted into groups) and has no specific order or
hierarchy. Variables like name, eye color and nationality are examples of nominal data.
→ ordinal data : Ordinal data is a qualitative data type that classifies variables in a ranked
order. This kind of data is categorical and separates variables into groups based on distinct
ascending or descending hierarchies. Ordinal data encompasses variables like educational
level, economic status or level of customer satisfaction.
Creating training and test sets
What is Training Dataset?
The training data is the largest (in size) subset of the original dataset, which is used to
train or fit the machine learning model. The training data is fed to the ML algorithm,
which lets it learn how to make predictions for the given task.
The training data varies depending on whether we are using Supervised Learning or
Unsupervised Learning Algorithms.
For Unsupervised learning, the training data contains unlabeled data points, i.e., inputs are
not tagged with the corresponding outputs. Models are required to find the patterns from the
given training datasets in order to make predictions.
On the other hand, for supervised learning, the training data contains labels in order to train
the model and make predictions.
The type of training data that we provide to the model is highly responsible for the model's
accuracy and prediction ability. It means that the better the quality of the training data, the
better will be the performance of the model. Training data typically makes up 60% or more of
the total data for an ML project.
What is Cross-Validation?
Cross validation is a technique used in machine learning to evaluate the performance of a
model on unseen data. It involves dividing the available data into multiple folds or subsets,
using one of these folds as a validation set, and training the model on the remaining folds.
This process is repeated multiple times, each time using a different fold as the validation set.
Finally, the results from each validation step are averaged to produce a more robust estimate
of the model’s performance.
→ The main purpose of cross validation is to prevent overfitting, which occurs when a model
is trained too well on the training data and performs poorly on new, unseen data.
1. Holdout Validation
In Holdout Validation, we train on 50% of the given dataset and use the remaining 50% for
testing. It’s a simple and quick way to evaluate a model. The major drawback of this method
is that the half of the data held out for testing may contain important information that the
model never sees while training, i.e. higher bias.
2. LOOCV (Leave One Out Cross Validation)
In this method, we train on the whole dataset while leaving out only one data point, and
iterate over each data point. In LOOCV, the model is trained on n-1 samples and tested on
the one omitted sample, repeating this process for each data point in the dataset. It has
both advantages and disadvantages. An advantage is that we make use of all data points, so
the method has low bias.
The major drawback is that it leads to higher variance in the testing estimate, as we test
against a single data point each time. If that data point is an outlier, it can lead to
higher variation. Another drawback is that it takes a lot of execution time, as it iterates
as many times as there are data points.
3. K-Fold Cross Validation:
We split the dataset into k subsets (known as folds), train on k-1 of the subsets, and
leave one subset out for evaluation of the trained model. We iterate k times, with a
different subset reserved for testing each time, as sketched below.
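For illustration, a minimal k-fold cross-validation sketch with scikit-learn; the dataset and model here are placeholders, and any estimator would work the same way:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
# cv=5 splits the data into 5 folds; each fold serves once as the test set
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())  # average accuracy across the 5 folds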
Overfitting :
A model is said to be overfitted when it performs quite well on the training dataset but
does not generalize well to new or unseen data. Overfitting occurs when the model tries to
cover every data point and hence starts capturing the noise present in the data. Because of
this, it cannot generalize well to a new dataset, and the accuracy and efficiency of the
model degrade. Generally, a complex model has a high chance of overfitting. There are
various ways to avoid overfitting, such as using cross-validation, stopping the training
early, or applying regularization.
Underfitting :
On the other hand, the model is said to be under-fitted when it is not able to capture the
underlying trend of the data. It means the model shows poor performance even with the
training dataset. In most cases, underfitting issues occur when the model is not perfectly
suitable for the problem that we are trying to solve.
○ The main difference between training data and testing data is that training data is the
subset of original data that is used to train the machine learning model, whereas
testing data is used to check the accuracy of the model.
○ The training dataset is generally larger in size compared to the testing dataset. The
general ratios of splitting train and test datasets are 80:20, 70:30, or 90:10.
○ Training data is well known to the model as it is used to train the model, whereas
testing data is like unseen/new data to the model.
The overall workflow with training and test data is:
1. Feed: Firstly, we need to train the model by feeding it with training input data.
2. Define: Now, training data is tagged with the corresponding outputs (in Supervised
Learning), and the model transforms the training data into text vectors or a number of
data features.
3. Test: In the last step, we test the model by feeding it with the test data/unseen dataset.
This step ensures that the model is trained efficiently and can generalize well.
1. Feature Creation
Feature Creation is the process of generating new features based on domain knowledge or by
observing patterns in the data. It is a form of feature engineering that can significantly
improve the performance of a machine-learning model.
Types of Feature Creation:
Domain-Specific: Creating new features based on domain knowledge, such as creating
features based on business rules or industry standards.
Data-Driven: Creating new features by observing patterns in the data, such as calculating
aggregations or creating interaction features.
Synthetic: Generating new features by combining existing features or synthesizing new data
points.
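As a small illustration of the data-driven and synthetic types above (the column names here are hypothetical), new features can be created with pandas:

import pandas as pd

df = pd.DataFrame({'length': [2.0, 3.5, 4.0],
                   'width': [1.0, 1.5, 2.0]})
# Data-driven interaction feature combining two existing columns
df['area'] = df['length'] * df['width']
# Domain-specific ratio feature (assumes the ratio is meaningful for the problem)
df['aspect_ratio'] = df['length'] / df['width']
print(df)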
Why Feature Creation?
→ Improves Model Performance: By providing additional and more relevant information to
the model, feature creation can increase the accuracy and precision of the model.
→ Increases Model Robustness: By adding additional features, the model can become more
robust to outliers and other anomalies.
→ Improves Model Interpretability: By creating new features, it can be easier to understand
the model’s predictions.
→ Increases Model Flexibility: By adding new features, the model can be made more flexible
to handle different types of data.
2. Feature Transformation
Feature Transformation is the process of transforming the features into a more suitable
representation for the machine learning model. This is done to ensure that the model can
effectively learn from the data.
Types of Feature Transformation:
→ Normalisation: Rescaling the features to have a similar range, such as between 0 and 1, to
prevent some features from dominating others.
→ Scaling: Scaling is a technique used to transform numerical variables to have a similar
scale, so that they can be compared more easily. Rescaling the features to have a similar
scale, such as having a standard deviation of 1, to make sure the model considers all features
equally.
→ Encoding: Transforming categorical features into a numerical representation. Examples
are one-hot encoding and label encoding.
→ Transformation: Transforming the features using mathematical operations to change the
distribution or scale of the features. Examples are logarithmic, square root, and reciprocal
transformations.
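For example, a log transformation can reduce the skew of a heavy-tailed feature. A minimal sketch (the 'income' column is hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [20000, 45000, 60000, 1200000]})
# log1p computes log(1 + x), which compresses the long right tail
df['income_log'] = np.log1p(df['income'])
print(df)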
Why Feature Transformation?
→ Improves Model Performance: By transforming the features into a more suitable
representation, the model can learn more meaningful patterns in the data.
→ Increases Model Robustness: Transforming the features can make the model more robust
to outliers and other anomalies.
→ Improves Computational Efficiency: The transformed features often require fewer
computational resources.
→ Improves Model Interpretability: By transforming the features, it can be easier to
understand the model’s predictions.
3. Feature Extraction
Feature Extraction is the process of creating new features from existing ones to provide more
relevant information to the machine learning model. This is done by transforming,
combining, or aggregating existing features.
Types of Feature Extraction:
→ Dimensionality Reduction: Reducing the number of features by transforming the data into
a lower-dimensional space while retaining important information. Examples are PCA and
t-SNE.
→ Feature Combination: Combining two or more existing features to create a new one. For
example, the interaction between two features.
→ Feature Aggregation: Aggregating features to create a new one. For example, calculating
the mean, sum, or count of a set of features.
→ Feature Transformation: Transforming existing features into a new representation. For
example, log transformation of a feature with a skewed distribution.
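A minimal dimensionality-reduction sketch using scikit-learn's PCA; the Iris data is used only as a placeholder for any numeric dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA works best on standardized data
X_2d = PCA(n_components=2).fit_transform(X_scaled)
print(X_2d.shape)  # (150, 2): 4 features reduced to 2 components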
Why Feature Extraction?
→ Improves Model Performance: By creating new and more relevant features, the model can
learn more meaningful patterns in the data.
→ Reduces Overfitting: By reducing the dimensionality of the data, the model is less likely to
overfit the training data.
→ Improves Computational Efficiency: The transformed features often require fewer
computational resources.
→ Improves Model Interpretability: By creating new features, it can be easier to understand
the model’s predictions.
4. Feature Selection
Feature Selection is the process of selecting a subset of relevant features from the dataset to
be used in a machine-learning model. It is an important step in the feature engineering
process as it can have a significant impact on the model’s performance.
Types of Feature Selection:
→ Filter Method: Based on the statistical measure of the relationship between the feature and
the target variable. Features with a high correlation are selected.
→ Wrapper Method: Based on the evaluation of the feature subset using a specific machine
learning algorithm. The feature subset that results in the best performance is selected.
→ Embedded Method: Based on the feature selection as part of the training process of the
machine learning algorithm.
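As an illustration of the filter method, scikit-learn's SelectKBest scores each feature against the target with a statistical test (here the ANOVA F-test; the dataset is a placeholder):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 highest-scoring features
X_new = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the selected features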
Why Feature Selection?
→ Reduces Overfitting: By using only the most relevant features, the model can generalize
better to new data.
→ Improves Model Performance: Selecting the right features can improve the accuracy,
precision, and recall of the model.
→ Decreases Computational Costs: A smaller number of features requires less computation
and storage resources.
→ Improves Interpretability: By reducing the number of features, it is easier to understand
and interpret the results of the model.
5. Feature Scaling
Feature Scaling is the process of transforming the features so that they have a similar scale.
This is important in machine learning because the scale of the features can affect the
performance of the model.
Types of Feature Scaling:
→ Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and 1, by
subtracting the minimum value and dividing by the range.
→ Standard Scaling: Rescaling the features to have a mean of 0 and a standard deviation of
1 by subtracting the mean and dividing by the standard deviation.
Why Feature Scaling?
→ Improves Model Performance: By transforming the features to have a similar scale, the
model can learn from all features equally and avoid being dominated by a few large features.
→ Increases Model Robustness: By transforming the features to be robust to outliers, the
model can become more robust to anomalies.
→ Improves Computational Efficiency: Many machine learning algorithms, such as k-nearest
neighbours, are sensitive to the scale of the features and perform better with scaled features.
→ Improves Model Interpretability: By transforming the features to have a similar scale, it
can be easier to understand the model’s predictions.
What are the Steps in Feature Engineering?
The steps for feature engineering vary across ML engineers and data scientists. Some of the
common steps involved in most machine-learning workflows are:
Data Cleansing
Data cleansing (also known as data cleaning or data scrubbing) involves identifying and
removing or correcting any errors or inconsistencies in the dataset. This step is important to
ensure that the data is accurate and reliable.
Data Transformation
Data transformation involves converting the features into a more suitable representation,
for example through scaling, normalization, or encoding (see Feature Transformation above).
Feature Extraction
Feature extraction involves deriving new, more informative features from the existing ones,
for example through dimensionality reduction or aggregation (see Feature Extraction above).
Feature Selection
Feature selection involves selecting the most relevant features from the dataset for use in
machine learning. This can include techniques like correlation analysis, mutual information,
and stepwise regression.
Feature Iteration
Feature iteration involves refining and improving the features based on the performance of
the machine learning model. This can include techniques like adding new features, removing
redundant features and transforming features in different ways.
Overall, the goal of feature engineering is to create a set of informative and relevant features
that can be used to train a machine learning model and improve its accuracy and
performance. The specific steps involved in the process may vary depending on the type of
data and the specific machine-learning problem at hand.
Data can be missing for many reasons like technical issues, human errors, privacy concerns,
data processing issues, or the nature of the variable itself. Understanding the cause of missing
data helps choose appropriate handling strategies and ensure the quality of your analysis.
1. Deleting the Missing values
Generally, this approach is not recommended. The disadvantage of this method is one might
end up deleting some useful data from the dataset.
There are 2 ways one can delete the missing data values:
→ Deleting the rows :
If a row has many missing values, you can drop the entire row. But if every row has some
value missing, you might end up deleting the whole dataset. The code to drop rows with
missing values is as follows:
IN:
df = train_df.dropna(axis=0)
df.isnull().sum()
→ Deleting the columns :
Similarly, if a column has a large share of missing values, the entire column can be dropped
with df.dropna(axis=1).
2. Imputing the Missing Values
Instead of deleting, missing values can be replaced (imputed). scikit-learn's SimpleImputer
can fill missing categorical entries with the most frequent value:
IN:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)
OUT:
array([['square'],
['square'],
['oval'],
['circle'],
['square']], dtype=object)
Similarly, numerical features can be imputed with the mean:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Nearest Neighbors Imputation (KNNImputer)
Missing values are imputed using the k-Nearest Neighbors approach, where Euclidean distance
is used to find the nearest neighbors. Let’s take the Titanic dataset as an example to see
how it works.
IN:
from sklearn.impute import KNNImputer
impute_knn = KNNImputer(n_neighbors=2)
impute_knn.fit_transform(X)
OUT:
array([[ 1. , 7.25 , 22. ],
[ 1. , 71.2833, 38. ],
[ 0. , 7.925 , 26. ],
[ 1. , 53.1 , 35. ],
[ 0. , 8.05 , 35. ],
[ 0. , 8.4583, 30.5 ]])
In the above example, n_neighbors=2. So scikit-learn finds the two rows most similar to the
row with the missing value, measured by how close their ‘SibSp’ and ‘Fare’ values are. In
this case, the last row has a missing value, and the third and fifth rows have the closest
values for the other two features. So the average of the ‘Age’ feature from these two rows
is taken as the imputed value.
Managing categorical data :
1. One-hot encoding : creates a new binary (0/1) column for each category, as in the
following example:
import pandas as pd
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']}
df = pd.DataFrame(data)
one_hot_encoded = pd.get_dummies(df, columns=['Color'])
2. Label encoding : assigns each unique category an integer label, starting from 0. Because
the integers imply an order, it is best suited to ordinal features (or to binary features
with only two distinct categories).
For example, if you have an ordinal categorical feature “Size” with values “Small,”
“Medium,” and “Large,” label encoding would convert it as follows:
Size Size_encoded
0 Small 2
1 Medium 1
2 Large 0
3 Medium 1
4 Small 2
As you can see, each unique category in the ‘Size’ column has been replaced with a unique
integer label, where ‘Small’ is encoded as 2, ‘Medium’ as 1, and ‘Large’ as 0.
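The table above can be reproduced with scikit-learn's LabelEncoder, which assigns integers in alphabetical order of the categories ('Large' = 0, 'Medium' = 1, 'Small' = 2):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})
le = LabelEncoder()  # integer labels follow the sorted order of the categories
df['Size_encoded'] = le.fit_transform(df['Size'])
print(df)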
Scaling guarantees that all features are on a comparable scale and have comparable ranges.
This process is known as feature normalisation. This is significant because the magnitude of
the features has an impact on many machine learning techniques. Larger scale features may
dominate the learning process and have an excessive impact on the outcomes. You can avoid
this problem and make sure that each feature contributes equally to the learning process by
scaling the features.
1. Min-Max scaler :
First, we are supposed to find the minimum and the maximum value of the column.
Then we will subtract the minimum value from the entry and divide the result by the
difference between the maximum and the minimum value.
Because it uses the maximum and the minimum values, this method is also sensitive to
outliers, but after performing the above two steps the data lies in the range 0 to 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
scaled_df.head()
2. Normalization
This method is similar to the previous one, but instead of the minimum value we subtract
the mean of the whole column from each entry, and then divide the result by the difference
between the maximum and the minimum value.
Range - approximately [-1, 1]
3. Standardization
This method of scaling is based on the central tendency and variance of the data. First,
calculate the mean and standard deviation of the data to be scaled. Then subtract the mean
from each entry and divide the result by the standard deviation. The result has a mean equal
to zero and a standard deviation equal to 1; if the data is approximately normally
distributed, most values fall within about [-3, +3].
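A minimal standardization sketch with scikit-learn (toy data assumed):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_std = StandardScaler().fit_transform(X)  # subtract mean, divide by std
print(X_std.mean(), X_std.std())  # approximately 0.0 and 1.0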
Whitening
Whitening (also called sphering) is a transformation that decorrelates the features and
gives each unit variance, so that the covariance matrix of the transformed data becomes the
identity matrix. It is typically performed via the eigendecomposition of the covariance
matrix (as in PCA). For a worked walkthrough, see:
https://learndataa.com/2020/09/15/data-preprocessing-whitening-or-sphering-in-python/
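A minimal whitening sketch in NumPy, assuming a small randomly generated dataset; this implements ZCA whitening via the eigendecomposition of the covariance matrix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])  # correlated features
Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition of the covariance
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T  # ZCA whitening matrix
X_white = Xc @ W
print(np.cov(X_white, rowvar=False).round(2))  # approximately the identity matrix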
Feature selection is a critical step in the feature construction process. In text
categorization problems, some words simply do not appear very often. Perhaps the word
“groovy” appears in exactly one training document, which is positive. Is it really worth
keeping this word around as a feature? It’s a dangerous endeavor because, with just one
training example, it is hard to tell whether the word is really correlated with the positive
class or is just noise. You could hope that your learning algorithm is smart enough to
figure it out. Or you could just remove it. The main reasons to perform feature selection
are:
1. To reduce the dimensionality of the feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.
Filter Methods
These methods are generally used while doing the pre-processing step. These methods select
features from the dataset irrespective of the use of any machine learning algorithm. In terms
of computation, they are very fast and inexpensive and are very good for removing
duplicated, correlated, redundant features but these methods do not remove multicollinearity.
Each feature is evaluated individually, which can help when features are informative in
isolation (have no dependency on other features) but falls short when a combination of
features would increase the overall performance of the model.
Wrapper methods:
Wrapper methods, also referred to as greedy algorithms, train the algorithm using a subset
of features in an iterative manner. Based on conclusions drawn from the previously trained
model, features are added or removed. Stopping criteria for selecting the best subset are
usually pre-defined by the person training the model, such as when the performance of the
model decreases or a specific number of features has been reached. The main advantage of
wrapper methods over filter methods is that they provide an optimal set of features for
training the model, thus resulting in better accuracy than filter methods, but they are
computationally more expensive.
1. Forward selection – This is an iterative approach where we start with an empty set of
features and, in each iteration, add the feature that best improves the model. We stop
when adding a new variable no longer improves the performance of the model.
2. Backward elimination – This is also an iterative approach where we start with all
features and, in each iteration, remove the least significant feature. We stop when
removing a feature yields no further improvement in the performance of the model.
3. Bi-directional elimination – This method uses both forward selection and backward
elimination technique simultaneously to reach one unique solution.
4. Exhaustive selection – This technique is considered as the brute force approach for
the evaluation of feature subsets. It creates all possible subsets and builds a learning
algorithm for each subset and selects the subset whose model’s performance is best.
5. Recursive elimination – This greedy optimization method selects features by
recursively considering smaller and smaller sets of features. The estimator is trained
on the initial set of features and the importance of each feature is obtained (e.g. via
the feature_importances_ or coef_ attribute). The least important features are then
removed from the current set until we are left with the required number of features, as
sketched below.
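A minimal sketch of recursive feature elimination with scikit-learn's RFE; the estimator and dataset are placeholders:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# RFE repeatedly fits the estimator and drops the least important feature
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)  # mask of the 2 features kept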
Embedded methods:
In embedded methods, the feature selection algorithm is blended into the learning algorithm
itself, which thus has its own built-in feature selection. Embedded methods address the
drawbacks of filter and wrapper methods and merge their advantages: they are fast like
filter methods, more accurate than filter methods, and take combinations of features into
consideration as well, for example as sketched below with a lasso model.
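For example, an L1-regularized (lasso) model performs feature selection during training by shrinking some coefficients exactly to zero. A minimal sketch with SelectFromModel; the dataset and alpha value are placeholders:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
# The L1 penalty drives weak coefficients to zero while the model is trained
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
print(selector.get_support())  # features whose coefficients survived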
Principal Component Analysis (PCA)
The first principal component is a normalized linear combination of the original predictors:

Z¹ = Φ¹¹X¹ + Φ²¹X² + … + Φp¹Xp

where,
Z¹ is the first principal component
Φp¹ is the loading vector comprising the loadings (Φ¹¹, Φ²¹, …) of the first principal
component. The loadings are constrained so that their sum of squares equals 1.
X¹…Xp are the normalized predictors. Normalized predictors have mean values equal to zero
and standard deviations equal to one.
The first principal component results in a line that is closest to the data, i.e., it minimizes the
sum of squared distance between a data point and the line.
PCA computes the principal components from the eigenvectors of the covariance matrix of the
data. Recall the definition: for a square matrix A and a non-zero vector X, if

AX = λX

for some scalar value λ, then λ is known as an eigenvalue of matrix A and X is known as the
eigenvector of matrix A for the corresponding eigenvalue. It can also be written as:

(A − λI)X = 0

where I is the identity matrix of the same shape as matrix A. The above condition has a
non-zero solution only if (A − λI) is non-invertible (i.e. a singular matrix). That means:

det(A − λI) = 0

From the above equation, we can find the eigenvalues λ, and the corresponding eigenvectors
can then be found using the equation AX = λX.
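A quick numerical check of this with NumPy (the matrix is arbitrary):

import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)   # solves det(A - lambda*I) = 0
# Verify A X = lambda X for the first eigenpair
print(np.allclose(A @ eigvecs[:, 0], eigvals[0] * eigvecs[:, 0]))  # True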
LINEAR REGRESSION :
→ Linear regression is a type of supervised machine learning algorithm that computes the
linear relationship between the dependent variable and one or more independent features by
fitting a linear equation to observed data.
→ When there is only one independent feature, it is known as Simple Linear Regression, and
when there is more than one feature, it is known as Multiple Linear Regression.
→ Similarly, when there is only one dependent variable, it is considered Univariate Linear
Regression, while when there is more than one dependent variable, it is known as
Multivariate Regression.
Pros:
→ Linear Regression is simple to implement.
→ Less complexity compared to other algorithms.
→ Linear Regression may lead to over-fitting but it can be avoided using some
dimensionality reduction techniques, regularization techniques, and cross-validation.
Cons:
→ Outliers affect this algorithm badly.
→ It over-simplifies real-world problems by assuming a linear relationship among the
variables, hence not recommended for practical use-cases.
Simple linear regression :
This involves one independent variable and one dependent variable. The equation for simple
linear regression is:
Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
Multiple linear regression :
This involves more than one independent variable and one dependent variable. The equation
for multiple linear regression is:
Y = β0 + β1X1 + β2X2 + … + βpXp
where:
Y is the dependent variable
X1, X2, …, Xp are the independent variables
β0 is the intercept
β1, β2, …, βp are the slopes
The goal of the algorithm is to find the best Fit Line equation that can predict the values
based on the independent variables.
Random Error (Residuals)
In regression, the difference between the observed value of the dependent variable (yi) and
the predicted value (ypredicted) is called the residual:
εi = yi – ypredicted
where ypredicted = B0 + B1 Xi
The best Fit Line equation provides a straight line that represents the relationship between the
dependent and independent variables. The slope of the line indicates how much the dependent
variable changes for a unit change in the independent variable(s).
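A minimal sketch fitting a simple linear regression with scikit-learn; the toy data roughly follows y = 1 + 2x:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([3.1, 4.9, 7.2, 8.8])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # estimates of B0 and B1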
ASSUMPTIONS OF SIMPLE LINEAR REGRESSION :
Linearity: The independent and dependent variables have a linear relationship with one
another. This implies that changes in the dependent variable follow those in the independent
variable(s) in a linear fashion. This means that there should be a straight line that can be
drawn through the data points. If the relationship is not linear, then linear regression will not
be an accurate model.
Independent of residuals: The observations in the dataset are independent of each other.
This means that the value of the dependent variable for one observation does not depend on
the value of the dependent variable for another observation. If the observations are not
independent, then linear regression will not be an accurate model.
Homoscedasticity: The error terms must have constant variance. This phenomenon is known
as Homoscedasticity. The presence of non-constant variance in the error terms is referred to
as Heteroscedasticity. Generally, non-constant variance arises in the presence of outliers or
extreme leverage values.
What is variance ?
Variance is the measure of spread in data from its mean position. In machine learning,
variance is the amount by which the performance of a predictive model changes when it is
trained on different subsets of the training data. More specifically, variance is the variability
of the model that determines how much it is sensitive to another subset of the training dataset.
i.e. how much it can adjust on the new subset of the training dataset.
→ Low variance: Low variance means that the model is less sensitive to changes in the
training data and can produce consistent estimates of the target function with different subsets
of data from the same distribution. This is the case of underfitting when the model fails to
generalise on both training and test data.
→ High variance: High variance means that the model is very sensitive to changes in the
training data and can result in significant changes in the estimate of the target function when
trained on different subsets of data from the same distribution. This is the case of overfitting
when the model performs well on the training data but poorly on new, unseen test data. It fits
the training data too closely that it fails on the new training dataset.
Ways to reduce variance -
→ Cross-validation: By splitting the data into training and testing sets multiple times,
cross-validation can help identify if a model is overfitting or underfitting and can be used to
tune hyperparameters to reduce variance.
→ Feature selection: Choosing only the relevant features decreases the model’s complexity
and can reduce the variance error.
→ Regularization: We can use L1 or L2 regularization to reduce variance in machine learning
models.
R-squared (R²) is also known as the Coefficient of Determination, or sometimes as the
Goodness of Fit. It measures the proportion of the variance in the dependent variable that
is explained by the model: R² = 1 − (SSres / SStot).
Ridge Regression
Ridge regression is a type of regularized regression model. This means it is a variation of the
standard linear regression model that includes a regularized term in the cost function. The
purpose of this is to prevent Overfitting. Ridge Regression adds an L2 regularization term to
the linear equation. That’s why it is also known as L2 Regularization or L2 Norm.
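A minimal Ridge sketch with scikit-learn; the dataset is a placeholder, and alpha controls the strength of the L2 penalty:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0).fit(X, y)  # larger alpha shrinks coefficients more
print(model.coef_[:3])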
ELASTICNET :
Elastic net is a combination of the two most popular regularized variants of linear regression:
ridge and lasso. Ridge utilizes an L2 penalty and lasso uses an L1 penalty. With elastic net,
you don't have to choose between these two models, because elastic net uses both the L2 and
the L1 penalty.
And that’s pretty much it! Instead of one regularization parameter α we now use two
parameters, one for each penalty: α1 controls the L1 penalty and α2 controls the L2 penalty.
We can now use elastic net in the same way that we use ridge or lasso. If α1 = 0, we have
ridge regression; if α2 = 0, we have lasso. Alternatively, instead of using two
α-parameters, we can use just one α together with an L1-ratio parameter, which determines
the share of the L1 penalty within α. So if α = 1 and L1-ratio = 0.4, the L1 penalty will be
multiplied by 0.4 and the L2 penalty by 1 − L1-ratio = 0.6. Here’s the cost function:

Loss = RSS + α1 Σ|βj| + α2 Σ βj²
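scikit-learn's ElasticNet uses the single-α plus L1-ratio parameterization described above. A minimal sketch; the dataset and parameter values are placeholders:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet

X, y = load_diabetes(return_X_y=True)
# l1_ratio=0.4 means 40% L1 penalty and 60% L2 penalty
model = ElasticNet(alpha=1.0, l1_ratio=0.4).fit(X, y)
print(model.coef_)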
ROBUST REGRESSION :
Ordinary least squares is strongly affected by outliers. Robust regression techniques, on
the other hand, aim to down-weight the impact of outliers by assigning lower weights to
these data points during the model fitting process. By giving less weight to outliers,
robust regression models can provide more accurate parameter estimates and better
predictions.
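One robust estimator available in scikit-learn is HuberRegressor, which reduces the weight of samples with large residuals. A minimal sketch with one injected outlier:

import numpy as np
from sklearn.linear_model import HuberRegressor

X = np.arange(10).reshape(-1, 1).astype(float)
y = 2.0 * X.ravel() + 1.0
y[8] = 100.0  # one extreme outlier
model = HuberRegressor().fit(X, y)
print(model.coef_, model.intercept_)  # stays close to the true slope 2 and intercept 1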
Polynomial regression ?
→ Polynomial Regression is a regression algorithm that models the relationship between a
dependent(y) and independent variable(x) as nth degree polynomial.
→ It is also called the special case of Multiple Linear Regression in ML. Because we add
some polynomial terms to the Multiple Linear regression equation to convert it into
Polynomial Regression.
→ It is a linear model with some modification in order to increase the accuracy.
→ If we apply a linear model to a linear dataset, it gives a good result, as we have seen in
Simple Linear Regression. But if we apply the same model, without any modification, to a
non-linear dataset, it produces drastically poor output: the loss function increases, the
error rate is high, and accuracy decreases.
→ So for such cases, where data points are arranged in a non-linear fashion, we need the
Polynomial Regression model.
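A minimal polynomial-regression sketch with scikit-learn, generating the polynomial terms and then fitting an ordinary linear model on them; the quadratic toy data is assumed:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2  # non-linear (quadratic) relationship
# degree=2 adds x^2 as an extra feature; the model itself stays linear
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[4.0]]))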
Isotonic regression ?
Isotonic regression is a regression technique in which the predictor variable is monotonically
related to the target variable. This means that as the value of the predictor variable increases,
the value of the target variable either increases or decreases in a consistent, non-oscillating
manner.
Mathematically, isotonic regression can be formulated as an optimization problem in which
the goal is to find a monotonic function that minimizes the sum of the squared errors between
the predicted and observed values of the target variable.
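A minimal sketch with scikit-learn's IsotonicRegression, fitting a non-decreasing function to noisy data; the data here is a placeholder:

import numpy as np
from sklearn.isotonic import IsotonicRegression

x = np.arange(10, dtype=float)
y = np.array([1.0, 2.0, 1.5, 3.0, 3.2, 2.9, 4.0, 5.0, 4.8, 6.0])
iso = IsotonicRegression(increasing=True)
y_fit = iso.fit_transform(x, y)  # closest non-decreasing fit in the least-squares sense
print(y_fit)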
Applications of Isotonic Regression
Isotonic regression has a number of applications, including:
● Calibration of predicted probabilities: Isotonic regression can be used to adjust the
predicted probabilities produced by a classifier so that they are more accurately
calibrated to the true probabilities.
● Ordinal regression: Isotonic regression can be used to model ordinal variables, which
are variables that can be ranked in order (e.g., “low,” “medium,” and “high”).
● Non-parametric regression: Because isotonic regression does not make any
assumptions about the functional form of the relationship between the predictor and
target variables, it can be used as a non-parametric regression method.
● Imputing missing values: Isotonic regression can be used to impute missing values in
a dataset by predicting the missing values based on the surrounding non-missing
values.
● Outlier detection: Isotonic regression can be used to identify outliers in a dataset by
identifying points that are significantly different from the overall trend of the data.