Unit - Iii Data Analysis
REGRESSION
Regression analysis is a statistical method to model the relationship between a dependent
(target) and independent (predictor) variables with one or more independent variables. More
specifically, Regression analysis helps us to understand how the value of the dependent
variable is changing corresponding to an independent variable when other independent
variables are held fixed. It predicts continuous/real values such as temperature, age, salary,
price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which spends on advertising every year and
gets sales in return. The list below shows the advertising spend of the company in the last 5
years and the corresponding sales:
Now, the company wants to spend $200 on advertising in the year 2019 and wants to know the
predicted sales for this year. To solve this type of prediction problem in
machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the cause-and-effect relationship between variables.
In regression, we fit a line or curve to the given datapoints on a graph of the variables;
using this fit, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through the datapoints on the
target-predictor graph in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between the datapoints and the line tells whether a
model has captured a strong relationship or not.
Some examples of regression can be as:
o Prediction of rain using temperature and other factors
o Determining Market trends
o Prediction of road accidents due to rash driving.
TYPES OF REGRESSION
There are various types of regressions which are used in data science and machine learning.
Each type has its own importance on different scenarios, but at the core, all the regression
methods analyze the effect of the independent variable on dependent variables. Here we are
discussing some important types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the very simple and easy algorithms which works on regression and shows
the relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-
axis) and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple
linear regression. And if there is more than one input variable, then such linear
regression is called multiple linear regression.
o The relationship between variables in the linear regression model can be explained
using the below image. Here we are predicting the salary of an employee on the basis
of the year of experience.
o Below is the mathematical equation for Linear regression:
Y = aX + b
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.
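The equation Y = aX + b can be fitted to data by least squares. The following is a minimal sketch; the experience and salary values are made up for illustration and are not the figures from the missing image.

```python
import numpy as np

# Hypothetical data: years of experience (X) and salary in thousands (Y).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([30.0, 35.0, 40.0, 45.0, 50.0])

# Fit Y = a*X + b by least squares; polyfit returns [a, b] for degree 1.
a, b = np.polyfit(X, Y, deg=1)

# Predict the salary for 6 years of experience.
predicted = a * 6.0 + b
print(f"a={a:.2f}, b={b:.2f}, prediction={predicted:.2f}")
```

Because the made-up points lie exactly on a line, the fitted slope and intercept reproduce that line.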
Logistic Regression:
o It uses the concept of threshold levels: values above the threshold level are rounded
up to 1, and values below the threshold level are rounded down to 0.
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
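The threshold idea for the binary case can be sketched as follows. The scores are hypothetical, and the 0.5 threshold and the choice to map a probability of exactly 0.5 to class 1 are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores produced by a fitted model.
scores = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
probabilities = sigmoid(scores)

# Apply the threshold: probabilities at or above 0.5 become class 1.
threshold = 0.5
labels = (probabilities >= threshold).astype(int)
print(labels.tolist())
```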
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-
linear fashion, so for such case, linear regression will not best fit to those datapoints.
To cover such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial
features of given degree and then modeled using a linear model. Which means the
datapoints are best fitted using a polynomial line.
o The equation for polynomial regression is also derived from the linear regression equation:
the linear regression equation $Y = b_0 + b_1 x$ is transformed into the polynomial
regression equation $Y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n$.
o Here Y is the predicted/target output, $b_0, b_1, \dots, b_n$ are the regression
coefficients, and x is our independent/input variable.
o The model is still linear because it is linear in the coefficients, even though the
features are quadratic or of higher degree.
Note: This is different from Multiple Linear regression in such a way that in Polynomial
regression, a single element has different degrees instead of multiple variables with the same
degree.
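The "transform the features, then fit a linear model" idea can be sketched with scikit-learn. The data below is synthetic and generated from a known quadratic, so the recovered coefficients can be checked by eye; the degree of 2 is an assumption for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data generated from y = 1 + 2x + 3x^2.
x = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 1 + 2 * x.ravel() + 3 * x.ravel() ** 2

# Transform x into the polynomial features [x, x^2].
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

# Fit an ordinary linear model on the transformed features.
model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)
```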
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression
as well as classification problems. So if we use it for regression problems, then it is termed as
Support Vector Regression.
Support Vector Regression is a regression algorithm which works for continuous variables.
Below are some keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map a lower-dimensional data into higher dimensional
data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR,
it is a line which helps to predict the continuous variables and cover most of the
datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which
creates a margin for datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the
hyperplane and opposite class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints are covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must
contain a maximum number of datapoints. Consider the below image:
Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
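A minimal SVR sketch with scikit-learn follows. The data is synthetic, and the kernel, C and epsilon values are illustrative assumptions. In scikit-learn, epsilon plays the role of the distance between the hyperplane and each boundary line.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical data: a noisy straight line.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=50)

# epsilon sets the half-width of the margin around the hyperplane;
# points inside it do not contribute to the loss.
model = SVR(kernel="linear", C=10.0, epsilon=0.5).fit(X, y)

# Support vectors are the training points on or outside the margin.
n_support = len(model.support_)
print(n_support, model.predict([[5.0]]))
```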
Ridge Regression:
o A general linear or polynomial regression will perform poorly if there is high collinearity
between the independent variables; to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity
of the model. It is also called L2 regularization.
o It also helps when we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the
model.
o It is similar to the Ridge Regression except that penalty term contains only the
absolute weights instead of a square of weights.
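o Since the penalty uses absolute values, it can shrink a slope exactly to 0, whereas Ridge
Regression can only shrink it close to 0.
o It is also called L1 regularization. The cost function for Lasso regression is:
$\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{n} |b_j|$
where $\lambda$ controls the strength of the penalty and $b_j$ are the regression coefficients.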
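The difference between the L1 and L2 penalties can be checked numerically. This is a sketch on synthetic data in which only one of five features matters; the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: only the first of five features affects y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))
lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
print(ridge_zeros, lasso_zeros)
```

Lasso sets the coefficients of the irrelevant features to exactly zero, while Ridge only makes them small.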
Predictive models are proving to be quite helpful in predicting the future growth of
businesses, as they predict outcomes using data mining and probability, where each model
consists of a number of predictors or variables. A statistical model can, therefore, be created
by collecting the data for relevant variables.
There are two categories of problems that a predictive model can solve depending on the
category of business: classification and regression problems. Classification means
predicting which category a sample should fall into, and regression means
predicting a quantity. These two categories are the starting point for a data science team when
choosing the right metrics and then determining a good working model.
In this article, we will understand the prediction model and performance evaluation from its
core and its importance.
Important Applications Of Predictive Modelling In Business
True-lift Modeling: This is a predictive modelling technique, also known as uplift modelling,
that directly models the effect of a direct marketing action on an individual’s behaviour.
Online Marketing: This technique runs the web surfer’s past data through
algorithms to determine the type of products the user is most likely to click on.
Fraud Detection: This model is used to detect fraud by identifying outliers in a
dataset that indicate fake activity.
Churn Prevention: This technique uses predictive analytics to predict when and why a
customer is most likely to end the relationship with the company. A typical example is
predicting churn from customer account information in telecom.
Sales Forecasting: This can be called the most used application of predictive modelling.
Examining past records and market-moving events, keeping track of sales, etc. results in a
realistic sales prediction for a company.
Performance Evaluation
Performance evaluation plays a dominant role in the technique of predictive modelling. The
performance of a predictive model is calculated and compared by choosing the right metrics.
So, it is very crucial to choose the right metrics for a particular predictive model in order to
get an accurate outcome. It is very important to evaluate proper predictive models because
various kinds of data sets are going to be used for the same predictive model.
Common Metrics That Are Used To Evaluate Predictive Models
Area Under The ROC Curve (AUC-ROC): This is one of the popular metrics used in the
industry. Its biggest advantage is that it is independent of changes in the proportion
of responders. A model that outputs a class label rather than a score is represented as
a single point in the ROC plot.
Confusion Matrix: This is an N×N matrix, where N is the number of classes being
predicted. It is also called an error matrix and plays a dominant role in prediction,
mainly in problems of statistical classification. It is a special table with two dimensions,
namely the actual and the predicted, each with an identical set of classes.
Concordant- Discordant Ratio: This model is used to describe the relationship between
pairs of observations where the data are treated as ordinal. The method of calculating this
ratio compares the classifications for two variables on the same two items.
Cross-Validation: This is a resampling procedure and is important in any type
of data modelling. This metric is used to compare and select a model for a given predictive
modelling problem.
Gain and Lift Chart: In this metric, both the charts are used to measure the effectiveness of
a model and it deals to check the rank ordering of probabilities. This method follows like
calculating the probability for each observation and then ranking them in decreasing order.
After ranking, build deciles with each group and lastly, calculate the response rate at each
decile.
Kolmogorov Smirnov chart: The K-S chart measures the degree of separation between the
positive and negative distributions of a model. In most classification models, the K-S gives
values between 0 and 100, where the higher value is considered as the better model.
Mean Square Error: This metric represents the average of the squared differences
between the actual observation and the prediction. Because the errors are squared, it is
sensitive to outliers.
Median Absolute Error: This metric represents the median of the absolute differences
between the actual observation and the prediction. If the data contains a huge number of
outliers, then this metric is known to be a good one.
Percent Correct Classification (PCC): This metric measures the overall accuracy where every
error has the same weight.
Root Mean Squared Error: This is one of the popular metrics that is mainly used in
regression problems. This metric assumes that the error is unbiased and follows a normal
distribution.
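How the regression error metrics react to a single outlier can be sketched as follows. The values are made up; only the last prediction is far off.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

# Hypothetical targets and predictions; the last prediction is an outlier.
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0, 48.0])

mse = mean_squared_error(y_true, y_pred)
rmse = float(np.sqrt(mse))
medae = median_absolute_error(y_true, y_pred)
print(mse, rmse, medae)
```

The squared-error metrics are dominated by the one large error, while the median absolute error stays at the size of the typical error.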
Classification of Data
In this article, we are going to discuss the classification of data in which we will cover
structured, unstructured data, and semi-structured data. Also, we will cover the features of the
data. Let’s discuss one by one.
Data Classification :
Data classification is the process of classifying data into relevant categories so that it can
be used or applied more efficiently. The classification of data makes it easy for the user to
retrieve it. Data classification is important when it comes to data security and compliance,
and also for meeting different types of business or personal objectives. It is also a major
requirement, as data must be easily retrievable within a specific period of time.
Types of Data Classification :
Data can be broadly classified into 3 types.
1. Structured Data :
Structured data is created using a fixed schema and is maintained in tabular format. The
elements in structured data are addressable for effective analysis. It contains all the data
which can be stored in the SQL database in a tabular format. Today, most of the data is
developed and processed in the simplest way to manage information.
Examples –
Relational data , Geo-location, credit card numbers, addresses, etc.
Consider an example of relational data: you have to maintain a record of students
for a university, such as the name of the student, ID of the student, address, and email of the
student. To store the record of students, the following relational schema and table are used:
S_ID | S_Name | S_Address | S_Email
2. Unstructured Data :
Unstructured data is data that does not follow a pre-defined standard, or in other words
does not follow any organized format. This kind of data is also not fit for the
relational database, because a relational database stores data in a pre-defined,
organized way. Unstructured data is also very important for the big
data domain, and there are many platforms to manage and store unstructured data,
such as NoSQL databases.
Examples –
Word, PDF, text, media logs, etc.
3. Semi-Structured Data :
Semi-structured data is information that does not reside in a relational database but that
has some organizational properties that make it easier to analyze. With some processing,
you can store it in a relational database, although this is very hard for some kinds of
semi-structured data; the semi-structured form exists to ease space.
Example –
XML data.
Features of Data Classification :
The main goal of the organization of data is to arrange the data in such a form that it
becomes readily available to the users. Its basic features are as follows.
Homogeneity – The data items in a particular group should be similar to each other.
Clarity – There must be no confusion in the positioning of any data item in a
particular group.
Stability – The data item set must be stable i.e. any investigation should not affect
the same set of classification.
Elastic – One should be able to change the basis of classification as the purpose of
classification changes.
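Mathematical formula of Precision and Recall using the confusion matrix:
$\text{Precision} = \frac{TP}{TP + FP}$, $\text{Recall} = \frac{TP}{TP + FN}$
where TP, FP and FN are the numbers of true positives, false positives and false negatives.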
For example, consider that a search query results in 30 pages, out of which 20 are relevant.
And the results fail to display 40 other relevant results. So the precision is 20/30 and recall is
20/60.
Precision helps us understand how useful the results are. Recall helps us understand how
complete the results are.
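But to avoid having to check both numbers separately, the F1 score is used. The F1 score is
the harmonic mean of precision and recall. It is given as
$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$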
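The search example can be worked through in a few lines, using the F1 score as the harmonic mean 2PR / (P + R). The variable names are chosen for this sketch; the counts come from the example in the text.

```python
# Counts taken from the search example above.
true_positives = 20    # relevant pages that were returned
false_positives = 10   # returned pages that were not relevant (30 - 20)
false_negatives = 40   # relevant pages that were not returned

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```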
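Log-loss is given as
$\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$
Here ’N’ is the total number of data points in the data set, $y_i$ is the actual value of y and
$p_i$ is the probability of y belonging to the positive class.
The lower the log-loss value, the better the predictions of the model.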
To calculate log-loss, scikit-learn provides a utility function.
from sklearn.metrics import log_loss
# y_true: actual 0/1 labels; y_pred: predicted probabilities of the positive class
log_loss(y_true, y_pred)
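A runnable variant of the call above, with made-up labels and probabilities, shows that a model whose probabilities lean the wrong way gets a higher log-loss.

```python
from sklearn.metrics import log_loss

# Hypothetical labels and two sets of predicted probabilities of class 1.
y_true = [1, 0, 1, 0]
good_pred = [0.9, 0.1, 0.8, 0.3]   # probabilities lean the right way
bad_pred = [0.4, 0.6, 0.3, 0.7]    # probabilities lean the wrong way

good_loss = log_loss(y_true, good_pred)
bad_loss = log_loss(y_true, bad_pred)
print(round(good_loss, 4), round(bad_loss, 4))
```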
ROC AUC
A Receiver Operating Characteristic curve or ROC curve is created by plotting the True
Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. The
ROC curve is generated by plotting the cumulative distribution function of the True Positive
on the y-axis versus the cumulative distribution function of the False Positive on the x-axis.
The dashed curved line is the ROC Curve
The area under the ROC curve (ROC AUC) is the single-valued metric used for evaluating the
performance.
The higher the AUC, the better the performance of the model at distinguishing between the
classes.
In general, an AUC of 0.5 suggests no discrimination, a value between 0.5 and 0.7 is acceptable,
and anything above 0.7 is a good-to-go model. However, for medical diagnosis models, usually
an AUC of 0.95 or more is considered a good-to-go model.
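The ROC curve and its area can be computed with scikit-learn. This sketch uses four made-up observations; each threshold yields one (FPR, TPR) point on the curve.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold traces out the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(fpr, tpr, auc)
```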
Clustering consists of grouping objects that are similar to each other; it can be
used to decide if two items are similar or dissimilar in their properties. In a data
mining sense, the similarity measure is a distance with dimensions describing object
features. That means if the distance between two data points is small, then there is
a high degree of similarity between the objects, and vice versa. Similarity
is subjective and depends heavily on the context and application. For example, similarity
among vegetables can be determined from their taste, size, colour, etc. Most clustering
approaches use distance measures to assess the similarities or differences between a pair of
objects. The most popular distance measures used are:
1. Euclidean Distance: Euclidean
distance is considered the traditional metric for problems with geometry. It can be simply
explained as the ordinary distance between two points. It is one of the most used
distance measures in cluster analysis. One of the algorithms that use this formula is K-means.
Mathematically, it computes the root of the sum of squared differences between the
coordinates of two objects.
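The Euclidean distance between two objects is the square root of the sum of squared coordinate differences. A minimal sketch with two hypothetical two-feature objects:

```python
import numpy as np

# Two hypothetical objects described by two features each.
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Square root of the sum of squared coordinate differences.
distance = float(np.sqrt(np.sum((p - q) ** 2)))
print(distance)
```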
Figure 1: Confusion matrix for classification of 100 cats and dogs. Source: Author.
Let’s focus on the 12 observations where the model predicts a cat while in reality it is a dog. If
the model predicts 51% probability of cat and it turns out to be a dog, for sure that’s possible.
However, if the model predicts 95% probability of cat and it turns out to be a dog? This seems
highly unlikely.
Figure 2: Predicted probability of cat and the classification threshold. Source: Author.
Classifiers use a predicted probability and a threshold to classify the observations. Figure 2
visualizes the classification for a threshold of 50%. It seems intuitive to use a threshold of
50% but there is no restriction on adjusting the threshold. So, in the end the only thing that
matters is the ordering of the observations. Changing the objective to predict probabilities
instead of labels requires a different approach. For this, we enter the field of probabilistic
classification.
Evaluation metric 1: Logloss
Let us generalize from cats and dogs to class labels of 0 and 1. Class probabilities are any real
number between 0 and 1. The model objective is to match predicted probabilities with class
labels, i.e. to maximize the likelihood, given in Eq. 1, of observing class labels given the
predicted probabilities.
$L = \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{1 - y_i}$, where $p_i = P(y_i = 1 \mid x_i)$
Equation 1: Likelihood for class labels y and predicted probabilities based on features x.
A major drawback of the likelihood is that if the number of observations grows, the product of
the individual probabilities becomes increasingly small. So, with enough data, the likelihood
will underflow the numerical precision of any computer. Next to that, a product of parameters
is difficult to differentiate. That’s the reason the logarithm of the likelihood is preferred,
commonly referred to as the loglikelihood. A logarithm is a monotonically increasing function
of its argument. Therefore, maximization of the log of a function is equivalent to maximization
of the function itself.
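The underflow problem is easy to reproduce. In this sketch, the 2,000 observations and the probability of 0.5 per observation are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical case: 2,000 observations, each assigned probability 0.5.
probabilities = np.full(2000, 0.5)

# The raw likelihood is a product of many numbers below 1 and underflows.
likelihood = np.prod(probabilities)

# The loglikelihood is a sum of logs and stays well within float range.
loglikelihood = np.sum(np.log(probabilities))
print(likelihood, loglikelihood)
```

In double precision the product collapses to zero, while the sum of logs remains an ordinary finite number.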
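$\text{Logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$
Equation 2: Logloss for class labels y and predicted probabilities based on features x.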
Nonetheless, the loglikelihood still scales with the number of observations, so an average
loglikelihood is a better metric to explain the observed variation. However, in practice, most
people minimize the negative average loglikelihood instead of maximizing the average
loglikelihood because optimizers normally minimize functions. Data scientists commonly
refer to this metric as Logloss, as given in Eq. 2. For a more elaborate discussion of the
Logloss and its relation to the evaluation metrics normally used in classification model
evaluation, I refer you to this article.
Evaluation metric 2: Brier Score
Next to the Logloss, the Brier Score, as given in Eq. 3, is commonly used as an evaluation
metric for predicted probabilities. In essence, it is a quadratic loss on the predicted
probabilities and the class labels. Note the similarity with the Mean Squared Error
(MSE) used in regression model evaluation.
$\text{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2$
Equation 3: Brier Score for class labels y and predicted probabilities based on features x.
However, a notable difference with the MSE is that the minimum Brier Score is not 0. The
Brier Score is the squared loss on the labels and probabilities, and therefore by definition is
not 0. Simply said, the minimum is not 0 if the underlying process is non-deterministic which
is the reason to use probabilistic classification in the first place. In order to cope with this
problem, the probabilities are commonly evaluated on a relative basis with other probabilistic
classifiers using for instance the Brier Skill Score.
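Both metrics can be computed by hand and compared against scikit-learn. This is a sketch with made-up labels and probabilities; `y` and `p` stand for the class labels and the predicted probabilities of class 1.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Hypothetical class labels and predicted probabilities of class 1.
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.7, 0.4])

# Logloss: negative average loglikelihood.
logloss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Brier Score: mean squared difference between probability and label.
brier = np.mean((p - y) ** 2)

print(logloss, log_loss(y, p))
print(brier, brier_score_loss(y, p))
```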
Example with dummy data
In this section I will show an example of the steps to go from classification to probability
estimation using dummy data. The example will show multiple ML models, ranging from
Logistic Regression to Random Forests. Let us first create dummy data using Sklearn. The
dummy dataset contains both informative as well as redundant features and multiple clusters
per class are introduced.
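A dataset of this kind can be generated with `make_classification`. The author's exact settings are not given, so every size and parameter value below is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# All sizes and settings below are illustrative assumptions.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=4,         # features that carry signal
    n_redundant=2,           # linear combinations of informative features
    n_clusters_per_class=2,  # multiple clusters per class
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
print(X_train.shape, X_test.shape)
```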
The dummy data is classified using the ML model structures:
Logistic Regression (LR),
Support Vector Machine (SVM),
Decision Tree (DT),
Random Forest (RF).
The ML model’s ability to correctly classify is evaluated using the ROC-AUC score. Figure 3
shows that all ML models do a fairly good job at classifying the dummy data, i.e. ROC-AUC
> 0.65, whereas RBF SVM and RF perform best.
Figure 3: ROC-AUC score on out-of-sample data for different ML model structures. Source:
Author.
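The comparison can be sketched as follows. The dataset settings and hyperparameters are assumptions, so the scores will not match the figure; the point is only the workflow of fitting each model and scoring its predicted probabilities out of sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative dummy data; all settings are assumptions.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           n_redundant=2, n_clusters_per_class=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf", probability=True, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and score its predicted probability of class 1.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = roc_auc_score(y_test, proba)
print(scores)
```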
However, recall that the model objective is predicting probabilities. It is nice that the ML models
accurately classify the observations, but how well do the models predict class probabilities?
There are two routes to evaluate the predicted probabilities:
Quantitatively with the Brier Score and Logloss;
Qualitatively with the calibration plot.
Quantitative evaluation of probabilities
Firstly, the ML models are quantitatively evaluated using the Brier Score and Logloss. Figure
4 shows that RBF SVM and RF perform best at probability estimation based on the Brier
Score (left) and the Logloss (right). Note, the Logloss of the DT is relatively high and to
understand the reason for this, I refer you to this article.
Figure 4: Brier Score (left) and Logloss (right) on out-of-sample data for different ML model
structures. Source: Author.
Qualitative evaluation of probabilities
Secondly, the ML models are qualitatively evaluated using the calibration plot. The goal of the
calibration plot is to show and evaluate whether predicted probabilities match the actual
fraction of positives. The plot buckets the predicted probabilities in uniform buckets and
compares the mean predicted to the fraction of positives. Figure 5 shows the calibration plot
for our example. You can see that the LR and RBF SVM are well calibrated, i.e., the mean
predicted probability matches the fraction of positives nicely. However, inspecting the
distribution of predicted probabilities for LR shows that the predicted probabilities are more
centered than for the RBF SVM. Next to that, you see that the DT is ill-calibrated and the
distribution of predicted probabilities seems wrong.
Figure 5: Calibration plot (upper) and distribution of probabilities (under) for different ML
models. Source: Author.
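The numbers behind a calibration plot can be obtained with `calibration_curve`. This sketch uses a logistic regression on synthetic data; the dataset settings and the choice of 10 bins are assumptions.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative dummy data; all settings are assumptions.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Bucket probabilities into 10 uniform bins and compare, per bin,
# the mean predicted probability with the observed fraction of positives.
fraction_pos, mean_pred = calibration_curve(
    y_test, proba, n_bins=10, strategy="uniform")
for pred, frac in zip(mean_pred, fraction_pos):
    print(f"{pred:.2f} -> {frac:.2f}")
```

For a well-calibrated model the two printed columns are close to each other in every bin.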
Why do predicted probabilities not match posterior probabilities?
Niculescu-Mizil & Caruana explain in their 2005 paper “Predicting Good Probabilities With
Supervised Learning” why some ML models produce distorted predicted probabilities in
comparison to the posterior probabilities. Let us start with explaining the root cause. When a
classification model is not trained to decrease the Logloss, the predicted probabilities do not
match the posterior probabilities. A solution to this is to map predicted probabilities after
model training to posterior probabilities, which is known as post-training calibration.
Frequently used probability calibration techniques are:
Platt Scaling (Platt, 1999)
Isotonic Regression (Zadrozny, 2001)
Figure 6: Model performance after post-training probability calibration with Platt Scaling and
Isotonic regression. Source: (Niculescu et al., 2005)
Calibrating the Decision Tree and Random Forest
The ML models are calibrated using Platt scaling and Isotonic regression, which are both
easily coded in Sklearn. Note, the LR is not calibrated because this model structure is trained
to decrease the Logloss and therefore has calibrated probabilities by default.
The only tunable parameter is the number of cross-validation folds for probability calibration.
Niculescu-Mizil & Caruana (2005) show that a small calibration set size can deteriorate
performance and that the performance improvement is largest for Boosted and Bagged Decision
Trees and Support Vector Machines. Our example contains 8,000 observations in the test dataset.
For 5-fold cross-validation, 1,600 observations are reserved for the calibration set.
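In scikit-learn both techniques are available through `CalibratedClassifierCV`, where `method="sigmoid"` corresponds to Platt scaling. The sketch below calibrates a decision tree on synthetic data. The dataset is much smaller than the one in the text and all settings are assumptions, so the scores are only illustrative.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dummy data; all settings are assumptions.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

results = {}
base = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
results["uncalibrated"] = brier_score_loss(
    y_test, base.predict_proba(X_test)[:, 1])

# method="sigmoid" is Platt scaling; cv=5 sets the number of folds.
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(
        DecisionTreeClassifier(random_state=0), method=method, cv=5)
    calibrated.fit(X_train, y_train)
    results[method] = brier_score_loss(
        y_test, calibrated.predict_proba(X_test)[:, 1])
print(results)
```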
Figure 7: Brier Score for Platt Scaling and Isotonic Regression for different ML models and
different calibration set sizes. Source: (Niculescu et al., 2005)
Model evaluation after probability calibration
Let’s see whether probability calibration improves the Brier Score, Logloss and the calibration
plot. Figure 7 shows the Brier Score and Logloss after probability calibration. Isotonic
Regression has equivalent performance compared to Platt Scaling, because the dataset
contains a sufficient number of observations. Given the non-parametric nature of Isotonic
regression, I must warn about cases with small calibration set sizes. However, if you intend to do
probabilistic classification on small data sizes, I suggest you use prior information and explore
the field of Bayesian classification.
Figure 7: Brier Score (left) and Logloss (right) after post-training calibration on out-of-
sample data. Source: Author.
Figure 8 shows the calibration plots after post-training calibration. You see an improvement in
the predicted probabilities for the SVM, DT and RF. Next to that, the distribution of predicted
probabilities covers the range of [0, 1] completely and provides accurate mean predicted
probabilities.
Figure 8: Calibration plot (upper) and probabilities (under) after post-training calibration of
the ML models. Source: Author.
Does probability calibration impact the classification ability?
Calibration does not change the ordering of predicted probabilities. The calibration only
changes the predicted probabilities to better match the observed fraction of positives. Figure 9
shows that after probability calibration, the model’s classification ability, as measured by the
ROC-AUC score is either equal or better.
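Why the ordering, and with it the ROC-AUC, survives calibration can be seen for a single Platt-style map: a sigmoid with positive slope is strictly increasing. The scores and the sigmoid parameters below are made up. Cross-validated calibration averages several such maps, which is presumably why the score can also shift slightly in practice.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and uncalibrated scores.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = y + rng.normal(scale=1.0, size=500)

# A Platt-style sigmoid with positive slope is strictly increasing,
# so it changes the values but not their ordering.
calibrated = 1.0 / (1.0 + np.exp(-(1.5 * scores - 0.75)))

auc_before = roc_auc_score(y, scores)
auc_after = roc_auc_score(y, calibrated)
print(auc_before, auc_after)
```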