Unit - III Data Analysis

Unit-III
Regression: Predictive Performance Estimation - Finding the Parameters of the Model -
Technique and Model Selection - Classification: Binary Classification - Predictive
Performance Measures for Classification - Distance-based Learning Algorithms -
Probabilistic Classification Algorithms
REGRESSION
Regression analysis is a statistical method for modelling the relationship between a dependent
(target) variable and one or more independent (predictor) variables. More
specifically, regression analysis helps us understand how the value of the dependent
variable changes with one independent variable when the other independent
variables are held fixed. It predicts continuous/real values such as temperature, age, salary,
price, etc.
We can understand the concept of regression analysis using the example below:
Example: Suppose there is a marketing company A, which runs various advertisements every year and gets
sales from them. The company has a record of the advertising it did in each of the last 5 years and the
corresponding sales.
Now, the company wants to spend $200 on advertising in the year 2019 and wants to
predict the sales for that year. To solve this type of prediction problem in
machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the relationship between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time series modelling, and
determining cause-and-effect relationships between variables.
In regression, we fit a line or curve to the given datapoints on a graph of the variables;
using this fit, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes as close as possible to the datapoints on the
target-predictor graph, in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between the datapoints and the line tells whether a model
has captured a strong relationship or not.
Some examples of regression are:
o Prediction of rain using temperature and other factors
o Determining Market trends
o Prediction of road accidents due to rash driving.
Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in regression analysis which we want to
predict or understand is called the dependent variable. It is also called the target
variable.
o Independent Variable: The factors which affect the dependent variable, or which
are used to predict the values of the dependent variable, are called independent
variables, also called predictors.
o Outliers: An outlier is an observation with a very low or very high
value in comparison to the other observed values. An outlier may distort the result, so
outliers should be detected and handled before the model is fitted.
o Multicollinearity: If the independent variables are highly correlated with each other,
the condition is called multicollinearity. It should not be
present in the dataset, because it creates problems when ranking the most influential
variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset
but not well with the test dataset, the problem is called overfitting. If our
algorithm does not perform well even with the training dataset, the problem is
called underfitting.
Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous variable.
There are various scenarios in the real world where we need future predictions, such as
weather conditions, sales, and marketing trends. For such cases we need a
technique which can make predictions accurately, and
regression analysis, a statistical method used in machine learning and data
science, is one such technique. Below are some other reasons for using regression analysis:
o Regression estimates the relationship between the target and the independent variables.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important
factor, the least important factor, and how each factor affects the
others.
TYPES OF REGRESSION
There are various types of regression which are used in data science and machine learning.
Each type has its own importance in different scenarios, but at the core, all the regression
methods analyze the effect of the independent variables on the dependent variable. Some
important types of regression are listed below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive
analysis.
o It is one of the simplest regression algorithms and shows
the relationship between continuous variables.
o It is used for solving regression problems in machine learning.
o Linear regression shows the linear relationship between the independent variable
(X-axis) and the dependent variable (Y-axis), hence the name linear regression.
o If there is only one input variable (x), it is called simple
linear regression. If there is more than one input variable, it
is called multiple linear regression.
o The relationship between the variables in a linear regression model can be illustrated
by predicting the salary of an employee on the basis
of years of experience.
o Below is the mathematical equation for linear regression:
$$Y = aX + b$$
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients (a is the slope and b is the intercept).
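The coefficients a and b in the equation above can be estimated from data by ordinary least squares. The following is a minimal sketch for the single-variable case; the function name and the experience/salary figures are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) for Y = a*X + b using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); assumes the xs are not all equal.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

a, b = fit_simple_linear_regression(years, salary)
predicted_salary = a * 6 + b  # prediction for 6 years of experience
print(a, b, predicted_salary)
```

Because the illustrative data lie exactly on a line, the fitted slope and intercept reproduce that line; with real data the line would only approximate the points.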
Some popular applications of linear regression are:

o Analyzing trends and sales estimates
o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes
or No, True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression
algorithm in how it is used: its output is interpreted as a class probability rather than a continuous value.
o Logistic regression uses the sigmoid function, also called the logistic function, which maps
any real-valued input to a value between 0 and 1. This sigmoid function is used to model the data in logistic regression.
The function can be represented as:
$$f(x) = \frac{1}{1 + e^{-x}}$$
o f(x) = output between 0 and 1.
o x = input to the function
o e = base of the natural logarithm.
When we provide the input values (data) to the function, it gives an S-shaped curve.
o It uses the concept of a threshold level: values above the threshold level are rounded
up to 1, and values below the threshold level are rounded down to 0.
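The sigmoid function and the threshold rule described above can be sketched as follows. This is a minimal illustration: the function names and the default threshold of 0.5 are assumptions, and an input exactly at the threshold is assigned to class 1 here by choice.

```python
import math


def sigmoid(x):
    """Logistic function f(x) = 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # for negative x, e^x is small and cannot overflow
    return z / (1.0 + z)


def predict_class(x, threshold=0.5):
    """Map the sigmoid output to class 1 or 0 using a threshold (assumed 0.5)."""
    return 1 if sigmoid(x) >= threshold else 0


for value in (-4, -1, 0, 1, 4):
    print(value, round(sigmoid(value), 3), predict_class(value))
```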
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
Polynomial Regression:
o Polynomial regression is a type of regression which models a non-linear
dataset using a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the
value of x and the corresponding conditional values of y.
o Suppose a dataset consists of datapoints arranged in a non-linear fashion.
In such a case, linear regression will not fit those datapoints well,
and we need polynomial regression.
o In polynomial regression, the original features are transformed into polynomial
features of a given degree and then modelled using a linear model. This means the
datapoints are fitted using a polynomial curve.
o The equation for polynomial regression is derived from the linear regression equation:
the linear regression equation $Y = b_0 + b_1 x$ is transformed into the polynomial
regression equation $Y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n$.
o Here Y is the predicted/target output, $b_0, b_1, \dots, b_n$ are the regression coefficients, and x
is our independent/input variable.
o The model is still linear because it is linear in the coefficients, even though it contains quadratic and higher-order terms of x.
Note: This differs from multiple linear regression in that, in polynomial
regression, a single variable appears with different degrees, instead of multiple variables each appearing with the same
degree.
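The idea that polynomial regression is "a linear model on polynomial features" can be sketched in plain Python. The helper names below are hypothetical, and the fit uses the normal equations solved by Gaussian elimination, which is adequate for small, well-conditioned examples but is not how a production library would necessarily do it.

```python
def polynomial_features(x, degree):
    """Transform a single value x into [1, x, x^2, ..., x^degree]."""
    return [x ** k for k in range(degree + 1)]


def solve_linear_system(a, b):
    """Solve a @ coeffs = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - tail) / m[i][i]
    return coeffs


def fit_polynomial(xs, ys, degree):
    """Least-squares fit of b_0..b_degree via the normal equations."""
    rows = [polynomial_features(x, degree) for x in xs]
    p = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from Y = 1 + 2x + 3x^2 (no noise).
xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
print(coeffs)  # approximately [b_0, b_1, b_2]
```

The fitting step is ordinary linear least squares; only the feature transformation is non-linear in x.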
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression
as well as classification problems. So if we use it for regression problems, then it is termed as
Support Vector Regression.
Support Vector Regression is a regression algorithm which works for continuous variables.
Below are some keywords which are used in Support Vector Regression:
o Kernel: A function used to map lower-dimensional data into a higher-dimensional
space.
o Hyperplane: In a general SVM, it is the separating line between two classes, but in SVR,
it is the line which helps to predict the continuous variable and covers most of the
datapoints.
o Boundary lines: The two lines on either side of the hyperplane, which
create a margin for the datapoints.
o Support vectors: The datapoints which are nearest to the
hyperplane; in classification, these are the closest points of the opposite classes.
In SVR, we try to determine a hyperplane with a margin such that the maximum
number of datapoints is covered within that margin. The main goal of SVR is to include as many
datapoints as possible within the boundary lines, with the hyperplane (best-fit line)
passing as close as possible to them. Consider the below image:
Here, the blue line is called the hyperplane, and the other two lines are known as boundary lines.
Decision Tree Regression:

o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
o It can handle both categorical and numerical data.
o Decision tree regression builds a tree-like structure in which each internal node
represents a "test" on an attribute, each branch represents the result of the test, and
each leaf node represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (the dataset), which
splits into left and right child nodes (subsets of the dataset). These child nodes are further
divided into their own child nodes, and themselves become the parent nodes of those
nodes. Consider the below image:
The above image shows an example of a decision tree; here, the model is trying to
predict a person's choice between a sports car and a luxury car.
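To make the "test at a node, result at a leaf" idea concrete for regression, the sketch below fits the smallest possible tree: a single split (often called a stump) on one numeric feature, where each leaf predicts the mean of its targets. The function names and data are hypothetical, and choosing the split by minimum squared error is one common criterion assumed here, not something the notes specify.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean).

    Assumes xs contains at least two distinct values.
    """
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    """Apply the node test (x <= threshold) and return the leaf value."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Hypothetical data with two clearly separated groups.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
stump = fit_stump(xs, ys)
print(stump, predict_stump(stump, 2.5), predict_stump(stump, 100))
```

A full decision tree repeats this split search recursively on each child subset until a stopping rule is met.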
Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms and is
capable of performing regression as well as classification tasks.
o Random forest regression is an ensemble learning method which combines
multiple decision trees and predicts the final output based on the average of the trees'
outputs. The combined decision trees are called base models, and the combination can be
represented more formally as:
$$g(x) = f_0(x) + f_1(x) + f_2(x) + \dots$$
where each $f_i$ is a base model; for regression, the final prediction is this sum divided by the number of trees.
o Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble
learning, in which the decision trees are trained independently (in parallel) and do not interact with
each other.
o With the help of random forest regression, we can reduce overfitting in the model
by training each tree on a random subset of the dataset.
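The bagging-and-averaging idea can be sketched with very small base models. The code below is a simplified illustration under stated assumptions: each "tree" is a single-split stump, each stump is trained on a bootstrap sample, and the forest prediction is the average of the stump outputs. All names and data are hypothetical, and a real random forest additionally samples features at each split and grows deeper trees.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    values = sorted(set(xs))
    if len(values) < 2:
        # Degenerate bootstrap sample: fall back to predicting the mean.
        mean = sum(ys) / len(ys)
        return (values[0], mean, mean)
    best = None
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all base models."""
    preds = [lm if x <= t else rm for t, lm, rm in forest]
    return sum(preds) / len(preds)


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
forest = fit_forest(xs, ys)
print(predict_forest(forest, 2), predict_forest(forest, 11))
```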
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in which a
small amount of bias is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the ridge regression penalty. We
can compute this penalty term by multiplying lambda by the sum of the squared weights
of the individual features.
o The equation for ridge regression will be:
$$L = \sum_{i} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} w_j^2$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $w_j$ is the weight of feature j, and $\lambda$ controls the strength of the penalty.
o A general linear or polynomial regression gives unstable estimates if there is high collinearity between
the independent variables; ridge regression can be used to solve such problems.
o Ridge regression is a regularization technique, which is used to reduce the complexity
of the model. It is also called L2 regularization.
o It helps to solve problems where we have more parameters than samples.
Lasso Regression:
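o Lasso regression is another regularization technique to reduce the complexity of the
model.
o It is similar to ridge regression except that the penalty term contains the
absolute values of the weights instead of the squares of the weights.
o Since it takes absolute values, it can shrink a slope exactly to 0, whereas ridge
regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for lasso regression will be:
$$L = \sum_{i} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} |w_j|$$
where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, $w_j$ is the weight of feature j, and $\lambda$ controls the strength of the penalty.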
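The difference between the L2 and L1 penalties is easiest to see for a model with a single weight and no intercept, y ≈ w·x, where both penalized problems have a closed-form solution. The sketch below assumes that simplified setting; the function names and data are hypothetical.

```python
def ridge_weight(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def lasso_weight(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * |w| for a single weight w.

    The solution is a soft-thresholded version of the least-squares weight.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


# Hypothetical data on the line y = 2x, so the unpenalized weight is 2.
xs = [1, 2, 3]
ys = [2, 4, 6]

for lam in (0.0, 14.0, 56.0):
    print(lam, ridge_weight(xs, ys, lam), lasso_weight(xs, ys, lam))
```

As lambda grows, the ridge weight keeps shrinking but stays non-zero, while the lasso weight reaches exactly zero once lambda is large enough, which is why lasso can act as a feature selector.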
PREDICTIVE PERFORMANCE ESTIMATION
Predictive models are proving to be quite helpful in predicting the future growth of
businesses. A predictive model predicts outcomes using data mining and probability, and each model
consists of a number of predictors or variables. A statistical model can, therefore, be created
by collecting the data for the relevant variables.
There are two categories of problems that a predictive model can solve, depending on the
kind of business question: classification and regression problems. Classification
means predicting which category a sample should fall into, and regression means
predicting a quantity. These two categories are the starting point for a data science team when
choosing the right metrics and then determining a good working model.
This section looks at predictive models and performance evaluation, and at why
evaluation matters.
Important Applications Of Predictive Modelling In Business
True-lift Modelling: This is a predictive modelling technique, also known as uplift modelling,
that directly models the impact of a direct marketing action on an individual's behaviour.
Online Marketing: This technique runs a web surfer's past data through
algorithms to determine the type of products the user is most likely to click on.
Fraud Detection: This model is used to detect fraudulent activity by identifying outliers in a
dataset that indicate fake activity.
Churn Prevention: This technique uses predictive analytics to predict when and why a
customer is most likely to end the relationship with the company. A typical use is
predicting churn from customer account information in telecom.
Sales Forecasting: This is probably the most widely used application of predictive modelling.
Examining past records and market-moving events, keeping track of sales, etc. results in a
realistic sales prediction for a company.
Performance Evaluation
Performance evaluation plays a dominant role in predictive modelling. The
performance of a predictive model is calculated and compared using metrics,
so it is crucial to choose the right metrics for a particular predictive model in order to
get an accurate picture of its quality. Proper evaluation matters because
the same predictive model is going to be applied to various kinds of data sets.
Common Metrics That Are Used To Evaluate Predictive Models
Area Under The ROC Curve (AUC-ROC): This is one of the most popular metrics
used in the industry. Its biggest advantage is that it is independent of changes in the proportion
of responders. A model that outputs only class labels is represented as
a single point in the ROC plot; a model that outputs scores or probabilities traces a curve as the decision threshold varies.
Confusion Matrix: This is an N x N matrix, where N is the number of classes being
predicted. It is also called an error matrix and plays an important role
in statistical classification problems. It is a table with two dimensions,
actual and predicted, each with the same set of classes.
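A confusion matrix can be built by counting (actual, predicted) pairs. The sketch below uses rows for actual classes and columns for predicted classes, which is a convention assumed here; the labels and predictions are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Return an N x N matrix: rows are actual classes, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of observations on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical binary labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted, labels=[0, 1])
print(matrix)            # [[3, 1], [1, 3]]
print(accuracy(matrix))  # 0.75
```

The off-diagonal cells show which kinds of mistakes the classifier makes, which a single accuracy figure hides.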
Concordant-Discordant Ratio: This measure describes the relationship between
pairs of observations where the data are treated as ordinal. It is calculated by
comparing the classifications for two variables on the same two items.
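For a binary classifier, one common way to apply this idea is to form every (positive, negative) pair and check whether the positive observation received the higher score. The sketch below assumes that usage; the function names and data are hypothetical. The share of concordant pairs plus half the tied pairs equals the area under the ROC curve.

```python
def pair_concordance(labels, scores):
    """Return the fractions of concordant, discordant and tied pairs.

    A pair is one positive (label 1) and one negative (label 0) observation;
    it is concordant when the positive has the higher score.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    concordant = discordant = tied = 0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1
            elif p < n:
                discordant += 1
            else:
                tied += 1
    total = len(positives) * len(negatives)
    return concordant / total, discordant / total, tied / total


def auc_from_pairs(labels, scores):
    """AUC computed as concordant share plus half of the tied share."""
    concordant, _, tied = pair_concordance(labels, scores)
    return concordant + 0.5 * tied


# Hypothetical labels and predicted probabilities.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3]
print(pair_concordance(labels, scores))
print(auc_from_pairs(labels, scores))
```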
Cross-Validation: This is a resampling procedure and is important in any type
of data modelling. It is used to compare and select a model for a given predictive
modelling problem.
Gain and Lift Charts: Both charts measure the effectiveness of
a model by checking the rank ordering of its predicted probabilities. The method is to
calculate the probability for each observation, rank the observations in decreasing order of probability,
split the ranked observations into deciles and, lastly, calculate the response rate in each
decile.
Kolmogorov-Smirnov (K-S) chart: The K-S chart measures the degree of separation between the
positive and negative distributions of a model. In most classification models, the K-S gives
values between 0 and 100, where a higher value indicates a better model.
Mean Squared Error: This metric is the average of the squared differences between the actual
observations and the predictions. Because the errors are squared, large errors are penalized heavily,
so this metric is sensitive to outliers.
Median Absolute Error: This metric is the median of the absolute differences
between the actual observations and the predictions, which makes it robust to outliers.
Percent Correct Classification (PCC): This metric measures the overall accuracy, where every
error has the same weight.
Root Mean Squared Error: This is one of the popular metrics that is mainly used in
regression problems. It is the square root of the mean squared error. This metric assumes that the error is unbiased and follows a normal
distribution.
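The regression error metrics above can be computed directly from their definitions. The sketch below uses hypothetical data in which one observation is badly mispredicted, to show how differently the metrics react to a single large error.

```python
import math
import statistics


def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error (same units as the target)."""
    return math.sqrt(mean_squared_error(actual, predicted))


def median_absolute_error(actual, predicted):
    """Median of the absolute differences; insensitive to a few large errors."""
    return statistics.median(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical data: the last prediction is far off.
actual = [10, 12, 14, 16, 100]
predicted = [11, 12, 13, 16, 20]

mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
medae = median_absolute_error(actual, predicted)
print(mse, rmse, medae)
```

Here the single large error dominates the squared-error metrics, while the median absolute error still reflects the typical error of about 1.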
FINDING THE RIGHT MODEL PARAMETERS

These datasets provide the opportunity for organizations to deepen business insights and
predict scenarios that will help them stay relevant and prominent in this competitive
world. Many organizations across the world are trying to assess their current big data
and analytics environment so that they can identify performance bottlenecks, how optimal the
solution is, and how well equipped it is to cater to new requirements. The assessment is broadly
based on two things:
1. The existing environment with current requirements
2. The existing environment with future requirements
The existing environment with current requirements: This
spells out how successful the current environment is by validating the implementation of all
previous requirements, that is, to what extent the current analytics and BI satisfy the needs
of the business. The current solution needs to be validated against all the 10 parameters
mentioned in Illustration 1.
It is highly unlikely that the current solution holds its own against all the parameters. Some of
the typical issues with the solution are:
1. It may not satisfy all the analytical needs of the business, and some analytics may require
significant changes, as proposed by the business, before they can be used
2. The performance of the execution of analytics is not up to the mark
3. There are many loopholes in the security of datasets, as anyone within the organization can access any
part of the information
4. The business doesn't treat the solution as a single version of truth
because of data quality issues
5. The business needs to wait a long time before a new
requirement can be materialized
6. The master data, which is a reference for many analytics,
cannot be trusted because of duplicates and non-standardization of data
7. A data set fed from one of the newest data sources cannot be accommodated because of the
inflexibility of the solution
8. The business feels that it has to wait endlessly before any of its
new requirements are implemented
The demands of the business multiply as it undergoes
changes over time due to new acquisitions, new business models, campaigns
and new structures. However, an ill-designed big data solution may fail on many of the
above-mentioned parameters (Illustration 1). It is vital for any organization to find big data
issues and bottlenecks before any major failure occurs. This warrants an assessment of the big
data solution so that all the mentioned parameters are assessed to the deepest level and
all the issues that need a fix are identified. The assessment can be done in the phases below.
The planning phase will primarily define the scope of the assessment with specific timelines so
that all the tasks can be tracked. It is recommended to run the entire plan using an agile
framework so that periodic checks can be done. The discovery phase will examine the
various layers of the solution for bottlenecks and brainstorm with various users so that more
clarity can be obtained. Based on the investigation and findings, a recommendation document
would be created and shared with all the stakeholders. This will lead to a roadmap of
implementation which can be categorized into 3 stages:
1. Immediate fixes and tuning (within 1 to 3 months)
2. Midterm implementation (within 6 months)
3. Long-term implementation (from 6 months to 2 years or more)
TECHNIQUE AND MODEL SELECTION

Model Selection
In machine learning, the process of selecting the top model or algorithm from a list of
potential models to address a certain issue is referred to as model selection. It entails
assessing and contrasting various models according to how well they function and choosing
the one that reaches the highest level of accuracy or prediction power.
Because different models have varied levels of complexity, underlying assumptions, and
capabilities, model selection is a crucial stage in the machine-learning pipeline. Finding a
model that fits the training set of data well and generalizes well to new data is the objective.
While a model that is too complex may overfit the data and be unable to generalize, a model
that is too simple could underfit the data and do poorly in terms of prediction.
The following steps are frequently included in the model selection process:
 Problem formulation: Clearly express the issue at hand, including the kind of
predictions or task that you'd like the model to carry out (for example, classification,
regression, or clustering).
 Candidate model selection: Pick a group of models that are appropriate for the issue
at hand. These models can include straightforward methods like decision trees or
linear regression as well as more sophisticated ones like deep neural networks,
random forests, or support vector machines.
 Performance evaluation: Establish measures for measuring how well each model
performs. Common measurements include accuracy, precision, recall, F1-score, area under
the receiver operating characteristic curve (AUC-ROC), and mean squared error.
The type of problem and the particular requirements will
determine which metrics are used.
 Training and evaluation: Each candidate model should be trained using a subset of
the available data (the training set), and its performance should be assessed using a
different subset (the validation set or via cross-validation). The established evaluation
measures are used to gauge the model's effectiveness.
 Model comparison: Evaluate the performance of various models and determine
which one performs best on the validation set. Take into account elements like data
handling capabilities, interpretability, computational difficulty, and accuracy.
 Hyperparameter tuning: Before training, many models require that certain
hyperparameters, such as the learning rate, regularisation strength, or the number of
hidden layers in a neural network, be configured. Use methods like grid
search, random search, and Bayesian optimization to identify these hyperparameters'
ideal values.
 Final model selection: After the models have been analyzed and fine-tuned, pick the
model that performs the best. Then, this model can be used to make predictions
on fresh, unseen data.
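The steps above can be sketched end to end on a toy regression problem: define candidate models, train each on a training subset, score each on a held-out validation subset, and keep the best. Everything in this sketch is hypothetical (the two candidate models, the synthetic data, the 15/5 split, and the choice of mean squared error as the metric).

```python
import random


def fit_mean_model(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear_model(xs, ys):
    """Second candidate: simple linear regression fitted by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    """Validation metric: mean squared error of the model's predictions."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: a noisy line (hypothetical).
rng = random.Random(0)
xs = list(range(20))
ys = [3 * x + 2 + rng.uniform(-1, 1) for x in xs]

# Train/validation split on shuffled indices.
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx, val_idx = indices[:15], indices[15:]
train_x, train_y = [xs[i] for i in train_idx], [ys[i] for i in train_idx]
val_x, val_y = [xs[i] for i in val_idx], [ys[i] for i in val_idx]

candidates = {"mean": fit_mean_model, "linear": fit_linear_model}
scores = {name: mse(fit(train_x, train_y), val_x, val_y)
          for name, fit in candidates.items()}
best_name = min(scores, key=scores.get)
print(scores, best_name)
```

In practice the selected model would then be refitted and checked once more on a separate test set that played no part in the selection.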
Model Selection in machine learning:

Model selection in machine learning is the process of selecting the best algorithm and model
architecture for a specific job or dataset. It entails assessing and contrasting various models to
identify the one that best fits the data and produces the best results. Model complexity, data
handling capabilities, and generalizability to new examples are all taken into account when
choosing a model. Models are evaluated and contrasted using methods like cross-validation
and grid search, as well as indicators like accuracy and mean squared error. The aim of model
selection is to find a model that balances complexity and performance to produce reliable
predictions and strong generalization.
There are numerous important considerations to bear in mind while selecting a model for
machine learning. These factors help ensure that the chosen model is effective in
solving the core issue and has the potential for strong performance. Here are
some crucial things to remember:
 The complexity of the issue: Determine how complex the issue you're trying to
resolve is. Simple models might effectively solve some issues, but more complicated
models can be necessary to fully represent complex relationships in the data. Take
into account the size of the dataset, the complexity of the input features, and any
potential for non-linear connections.
 Data Availability & Quality: Consider the accessibility and caliber of the data you
already have. Using complicated models with a lot of parameters on a limited dataset
may result in overfitting. Such situations may call for simpler models with fewer
parameters. Take into account missing data, outliers, and noise as well as how various
models respond to these difficulties.
 Interpretability: Consider whether the model's interpretability is crucial in your
particular setting. Some models, like decision trees or linear regression, offer
interpretability by giving precise insights into the correlations between the input data
and the desired outcome. Complex models, such as neural networks, may perform
better but offer less interpretability.
 Model Assumptions: Recognise the presumptions that various models make. For
instance, although decision trees assume piecewise constant relationships, linear
regression assumes a linear relationship between the input characteristics and the
target variable. Make sure the model you choose is consistent with the fundamental
presumptions underpinning the data and the issue.
 Scalability and Efficiency: If you're working with massive datasets or real-time
applications, take the model's scalability and computing efficiency into consideration.
Deep neural networks and support vector machines are two examples of models that
could need a lot of time and computing power to train.
 Regularisation and Generalisation: Assess the model's capacity to generalize to fresh,
unseen data. By adding penalty terms to the objective function of the model,
regularisation approaches like L1 or L2 regularisation can help prevent overfitting.
When the training data is sparse, regularised models may perform better in terms of
generalization.
 Domain Expertise: Consider your expertise and domain knowledge. On the basis of
previous knowledge of the data or particular features of the domain, consider if
particular models are appropriate for the task. Models that are more likely to capture
important patterns can be found by using domain expertise to direct the selection
process.
 Resource Constraints: Take into account any resource limitations you may have,
such as constrained memory space, processing speed, or time. Make sure that the chosen
model can be successfully implemented using the resources at hand. Some models
require significant resources during training or inference.
 Ensemble Methods: Examine the potential advantages of ensemble methods, which
integrate the results of various models in order to perform more effectively. By
utilizing the diversity of several models' predictions, ensemble approaches, such as
bagging, boosting, and stacking, frequently outperform individual models.
 Evaluation and Experimentation: Experimentation with and assessment of several
models should be done thoroughly. Use the right evaluation criteria and statistical
tests to compare their performance. To evaluate the models' performance on unseen
data and reduce the danger of overfitting, use hold-out or cross-validation.
Model Selection Techniques

Model selection in machine learning can be done using a variety of methods and tactics.
These methods assist in comparing and assessing many models to determine which is best
suited to solve a certain issue. Here are some methods for selecting models that are frequently
used:
 Train-Test Split: With this strategy, the available data is divided into two sets: a
training set and a separate test set. The models are trained on the training set and then
evaluated on the test set using a predetermined evaluation metric. This method
offers a quick and easy way to estimate a model's performance on unseen
data.
 Cross-Validation: Cross-validation is a resampling approach that divides the data
into several groups, or folds. Each fold in turn is used as the test set while the remaining folds form
the training set, so the models are trained and evaluated once per fold.
Averaging over the folds lowers the variance of the evaluation and gives a more
reliable assessment of the model's performance. Cross-validation techniques that are
frequently used include leave-one-out, stratified, and k-fold cross-validation.
 Grid Search: Hyperparameter tuning is done using the grid search technique. In
order to do this, a grid containing hyperparameter values must be defined, and all
potential hyperparameter combinations must be thoroughly searched. For each
combination, the models are trained, assessed, and their performances are contrasted.
Finding the ideal hyperparameter settings to optimize the model's performance is
made easier by grid search.
 Random Search: In the random search hyperparameter tuning technique, hyperparameter
values are sampled at random from a specified distribution. In contrast to grid
search, which considers every potential combination, random search only investigates
a portion of the hyperparameter space. When a thorough search is not possible due to
the size of the search space, this strategy can be helpful.
 Bayesian optimization: Bayesian optimization is a more sophisticated method of hyperparameter
tuning. It models the relationship between the performance of the
model and the hyperparameters using a probabilistic model. It intelligently chooses
which set of hyperparameters to investigate next by updating the probabilistic model
and iteratively assessing the model's performance. When the search space is big and
expensive to examine, Bayesian optimization is especially effective.
 Model averaging: This technique combines predictions from various models to get a
single prediction. For regression problems, this can be accomplished by averaging the
predictions, while for classification problems, voting or weighted voting systems can
be used. Model averaging can increase overall prediction accuracy, mainly by lowering the
variance of individual models.
 Information Criteria: Information criteria offer a numerical assessment of the trade-
off between model complexity and goodness of fit. Examples include the Akaike
Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These
criteria discourage the use of too complicated models and encourage the adoption of
simpler models that adequately explain the data.
 Domain Expertise & Prior Knowledge: Prior understanding of the problem and the
data, as well as domain expertise, can have a significant impact on model choice. The
models that are more suitable given the specifics of the problem and the details of the
data may be known by subject matter experts.
 Model Performance Comparison: It is vital to evaluate the performance of the candidate
models using appropriate evaluation measures. Depending on the problem, these
measures could include F1-score, mean squared error, accuracy, precision, recall,
or the area under the receiver operating characteristic curve (AUC-ROC). The
best-performing model can be found by comparing many models.
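As an illustration of the grid search and random search techniques in the list above, here is a minimal sketch using scikit-learn. The synthetic data set, the choice of logistic regression, and the candidate values of the regularization parameter C are all assumptions made for this example only.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# Grid search: evaluates every value in the grid with 5-fold cross-validation.
grid = GridSearchCV(model, param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)

# Random search: samples 8 values of C from a continuous distribution
# (uniform on [0.01, 10.01]) instead of trying a fixed grid.
rand = RandomizedSearchCV(
    model,
    param_distributions={"C": uniform(loc=0.01, scale=10.0)},
    n_iter=8,
    cv=5,
    random_state=0,
)
rand.fit(X, y)

print("grid search best C:", grid.best_params_["C"], "score:", grid.best_score_)
print("random search best C:", rand.best_params_["C"], "score:", rand.best_score_)
```

Grid search cost grows with the product of the grid sizes of all hyperparameters, while random search cost is fixed by n_iter, which is why random search is often preferred for large search spaces.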
Classification of Data
In this article, we are going to discuss the classification of data in which we will cover
structured, unstructured data, and semi-structured data. Also, we will cover the features of the
data. Let’s discuss one by one.
Data Classification :
Data classification is the process of organizing data into relevant categories so that it can
be used or applied more efficiently. Classification makes it easy for the user to retrieve
data. It is important for data security and compliance, and for meeting different types of
business or personal objectives. It is also a major requirement because data must be easily
retrievable within a specific period of time.
Types of Data Classification :
Data can be broadly classified into 3 types.
1. Structured Data :
Structured data is created using a fixed schema and is maintained in tabular format. The
elements in structured data are addressable for effective analysis. It contains all the data
which can be stored in an SQL database in a tabular format. It is the simplest kind of
data to manage and process.
Examples –
Relational data , Geo-location, credit card numbers, addresses, etc.
Consider an example of relational data: a university has to maintain a record of each
student, such as the student's name, ID, address, and email. The following relational
schema and table can be used to store these records.
S_ID    S_Name    S_Address    S_Email
1001    A         Delhi        A@gmail.com
1002    B         Mumbai       B@gmail.com
2. Unstructured Data :
Unstructured data is data that does not follow a pre-defined standard, that is, it does not
follow any organized format. This kind of data does not fit a relational database,
because a relational database stores data in a pre-defined, organized manner.
Unstructured data is very important for the big data domain, and there are many
platforms for storing and managing it, such as NoSQL databases.
Examples –
Word, PDF, text, media logs, etc.
3. Semi-Structured Data :
Semi-structured data is information that does not reside in a relational database but
has some organizational properties that make it easier to analyze. With some processing
it can be stored in a relational database, although this is very hard for some kinds of
semi-structured data. Semi-structured formats exist in part to save space.
Example –
XML data.
Features of Data Classification :
The main goal of organizing data is to arrange it in such a form that it becomes readily
available to the users. Its basic features are as follows.
 Homogeneity – The data items in a particular group should be similar to each other.
 Clarity – There must be no confusion in the positioning of any data item in a
particular group.
 Stability – The data item set must be stable i.e. any investigation should not affect
the same set of classification.
 Elastic – One should be able to change the basis of classification as the purpose of
classification changes.
PREDICTIVE PERFORMANCE MEASURES FOR CLASSIFICATION

Many learning algorithms have been proposed. It is often valuable to assess the efficacy of an
algorithm. In many cases, such assessment is relative, that is, evaluating which of several
alternative algorithms is best suited to a specific application.
People even end up creating metrics that suit the application. In this article, we will see some
of the most common metrics in a classification setting of a problem.
The most commonly used performance metrics for classification problems are as follows:
 Accuracy
 Confusion Matrix
 Precision, Recall, and F1 score
 ROC AUC
 Log-loss
Accuracy
Accuracy is the ratio of the number of correctly classified points to the total
number of points.
To calculate accuracy, scikit-learn provides a utility function.
from sklearn.metrics import accuracy_score

# predicted y values
y_pred = [0, 2, 1, 3]
# actual y values
y_true = [0, 1, 2, 3]

accuracy_score(y_true, y_pred)
0.5
Accuracy is simple to calculate but has its own disadvantages.
Limitations of accuracy
 If the data set is highly imbalanced, and the model classifies all the data points as the
majority class data points, the accuracy will be high. This makes accuracy not a
reliable performance metric for imbalanced data.
 Accuracy uses only the predicted labels, so the predicted probabilities of the model
cannot be derived from it. From accuracy alone, we cannot measure how confident or
how well calibrated the predictions of the model are.
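The first limitation can be seen in a small sketch. The class counts below (95 negatives, 5 positives) and the "always predict the majority class" model are made up for illustration.

```python
from sklearn.metrics import accuracy_score

# 95 negatives and 5 positives: a highly imbalanced data set (made-up counts).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.95, although no positive case is ever detected
```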
Confusion Matrix
A confusion matrix is a summary of prediction results in a specific table layout that allows
visualization of the performance of a machine learning model for a binary
classification problem (2 classes) or a multi-class classification problem (more than 2 classes).
Confusion matrix of a binary classification

 TP means True Positive. It can be interpreted as the model predicted positive class
and it is True.
 FP means False Positive. It can be interpreted as the model predicted positive class but
it is False.
 FN means False Negative. It can be interpreted as the model predicted negative class
but it is False.
 TN means True Negative. It can be interpreted as the model predicted negative class
and it is True.
For a sensible model, the principal diagonal element values will be high and the off-diagonal
element values will be low, i.e., TP and TN will be high.
To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to
determine whether a person has a certain disease. A false positive in this case occurs when the
person tests positive but does not actually have the disease. A false negative, on the other
hand, occurs when the person tests negative, suggesting they are healthy when they actually do
have the disease.
For a multi-class classification problem, with ‘c’ class labels, the confusion matrix will be a
(c*c) matrix.
To calculate confusion matrix, sklearn provides a utility function
from sklearn.metrics import confusion_matrix
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
[0, 0, 1],
[1, 0, 2]])
Advantages of a confusion matrix:
 The confusion matrix provides detailed results of the classification.
 Metrics derived from the confusion matrix are widely used.
 Visual inspection of results can be enhanced by using a heat map.
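For the binary case, the four counts TP, FP, FN and TN can be read directly from scikit-learn's confusion matrix. The sketch below uses made-up labels; it relies on scikit-learn's layout, in which rows are actual classes and columns are predicted classes, with labels in sorted order.

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes (labels sorted: 0, 1),
# so flattening the 2x2 matrix gives TN, FP, FN, TP in that order.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```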
Precision, Recall, and F-1 Score
Precision is the fraction of correctly classified positive instances among all instances
classified as positive. Recall is the fraction of correctly classified positive instances
among all instances that are actually positive. Using the confusion matrix, precision and
recall are given as follows,
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
For example, consider that a search query results in 30 pages, out of which 20 are relevant.
And the results fail to display 40 other relevant results. So the precision is 20/30 and recall is
20/60.
Precision helps us understand how useful the results are. Recall helps us understand how
complete the results are.
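Checking both numbers separately is inconvenient, so the F1 score combines them into one. The
F1 score is the harmonic mean of precision and recall. It is given as,
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$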
When to use the F1 Score?

 The F-score is often used in the field of information retrieval for measuring search,
document classification, and query classification performance.
 The F-score has been widely used in the natural language processing literature, such as
the evaluation of named entity recognition and word segmentation.
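The search-query example above (20 relevant pages returned, 10 irrelevant pages returned, 40 relevant pages missed) can be checked with scikit-learn. This is a sketch: the number of true negatives (30) is an assumption added only so that the label arrays are complete; it does not affect precision, recall or F1.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# 1 = relevant page, 0 = irrelevant page.
# 20 relevant pages returned (TP), 10 irrelevant pages returned (FP),
# 40 relevant pages missed (FN), 30 irrelevant pages not returned
# (TN; an assumed count that does not influence the three scores).
y_true = [1] * 20 + [0] * 10 + [1] * 40 + [0] * 30
y_pred = [1] * 20 + [1] * 10 + [0] * 40 + [0] * 30

precision = precision_score(y_true, y_pred)  # 20 / 30
recall = recall_score(y_true, y_pred)        # 20 / 60
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R) = 4 / 9
print(precision, recall, f1)
```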
Log Loss
Logarithmic loss (or log loss) measures the performance of a classification model where the
prediction is a probability value between 0 and 1. Log loss increases as the predicted
probability diverges from the actual label. Log loss is a widely used metric for Kaggle
competitions.
$$\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$
Here N is the total number of data points in the data set, y_i is the actual value of y and p_i
is the predicted probability of y belonging to the positive class.
The lower the log-loss value, the better the predictions of the model.
To calculate log-loss, scikit-learn provides a utility function.
from sklearn.metrics import log_loss

# y_pred must contain predicted probabilities, not class labels
log_loss(y_true, y_pred)
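To make the definition concrete, the sketch below computes the binary log loss by hand with NumPy and compares it with scikit-learn's log_loss. The four labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.3])

# Average negative log-probability assigned to the true label.
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
library = log_loss(y_true, p)
print(manual, library)
```

The last observation (label 1, predicted probability 0.3) contributes most of the loss, which shows how log loss penalizes confident predictions in the wrong direction.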
ROC AUC
A Receiver Operating Characteristic curve or ROC curve is created by plotting the True
Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings,
where TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The True Positive Rate is plotted
on the y-axis and the False Positive Rate on the x-axis.
The dashed curved line is the ROC Curve
The area under the ROC curve (ROC AUC) is the single-valued metric used for evaluating the
performance.
The higher the AUC, the better the performance of the model at distinguishing between the
classes.
In general, an AUC of 0.5 suggests no discrimination, a value between 0.5 and 0.7 is acceptable,
and anything above 0.7 indicates a good-to-go model. For medical diagnosis models, however,
usually only an AUC of 0.95 or more is considered good-to-go.
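A minimal sketch of computing the ROC curve and the AUC with scikit-learn follows. The four labels and scores are a small made-up example; roc_curve returns the false positive rates, true positive rates and the thresholds at which they were evaluated.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted scores for the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(fpr, tpr)
print(auc)  # 0.75: 3 of the 4 (positive, negative) pairs are ranked correctly
```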
DISTANCE-BASED LEARNING ALGORITHMS
Clustering consists of grouping objects that are similar to each other; it can be
used to decide whether two items are similar or dissimilar in their properties. In a data
mining sense, the similarity measure is a distance whose dimensions describe object
features. If the distance between two data points is small, there is a high degree of
similarity between the objects, and vice versa. Similarity is subjective and depends
heavily on the context and application. For example, similarity among vegetables can be
determined from their taste, size, colour etc. Most clustering approaches use distance
measures to assess the similarities or differences between a pair of objects. The most
popular distance measures are:
1. Euclidean Distance: Euclidean distance is considered the traditional metric for
problems with geometry. It can be simply explained as the ordinary straight-line distance
between two points. It is one of the most used distance measures in cluster analysis;
K-means is one of the algorithms that use it. Mathematically, it is the square root of the
sum of squared differences between the coordinates of two objects. For two points with
coordinates (X1, X2, ..., XN) and (Y1, Y2, ..., YN):
$$d = \sqrt{\sum_{i=1}^{N} (X_i - Y_i)^2}$$

Figure – Euclidean Distance
2. Manhattan Distance: This is the sum of the absolute differences between the
coordinates of a pair of points. Suppose we have two points P and Q; to determine the
distance between them we add the distance travelled along the X-axis to the distance
travelled along the Y-axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
In the figure, the total length of the red line gives the Manhattan distance between the
two points.
3. Jaccard Index: The Jaccard index measures the similarity of two sets of items as the
size of their intersection divided by the size of their union. The Jaccard distance is
1 minus the Jaccard index.
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Figure – Jaccard Index
4. Minkowski distance: It is the generalized form of the Euclidean and Manhattan
Distance Measure. In an N-dimensional space, a point is represented as,
(x1, x2, ..., xN)
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
Then, the Minkowski distance between P1 and P2 is given as:
$$D(P_1, P_2) = \left(\sum_{i=1}^{N} |X_i - Y_i|^p\right)^{1/p}$$
 When p = 2, Minkowski distance is the same as the Euclidean distance.
 When p = 1, Minkowski distance is the same as the Manhattan distance.
5. Cosine Index: The cosine similarity measure for clustering is the cosine of the
angle between two vectors, given by the following formula.
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$
Here θ (theta) is the angle between the two vectors, and A, B are n-dimensional vectors.
The cosine distance is 1 minus this value.
Figure – Cosine Distance
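The distance measures listed above can be written in a few lines of NumPy. This is a sketch with hypothetical helper names (minkowski, cosine_similarity, jaccard_index) and made-up points; it assumes non-zero vectors for the cosine and a non-empty union for the Jaccard index.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_index(s, t):
    """Size of the intersection divided by the size of the union."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

P, Q = (1, 2), (4, 6)
euclidean = minkowski(P, Q, p=2)             # sqrt(3^2 + 4^2) = 5
manhattan = minkowski(P, Q, p=1)             # 3 + 4 = 7
cos_sim = cosine_similarity([1, 0], [1, 1])  # 1 / sqrt(2)
jac = jaccard_index({1, 2, 3}, {2, 3, 4})    # 2 / 4
print(euclidean, manhattan, cos_sim, jac)
```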
PROBABILISTIC CLASSIFICATION ALGORITHMS
Classifying cats and dogs

Imagine creating a model with the sole purpose of classifying cats and dogs. The classification
model will not be perfect and therefore wrongly classify certain observations. Some cats will
be classified as dogs and vice versa. That’s life. In this example, the model classifies 100 cats
and dogs. The confusion matrix is a commonly used visualization tool to show prediction
accuracy and Figure 1 shows the confusion matrix for this example.
Figure 1: Confusion matrix for classification of 100 cats and dogs. Source: Author.
Let’s focus on the 12 observations where the model predicts a cat while in reality it is a dog.
If the model predicts a 51% probability of cat and it turns out to be a dog, that is quite
possible. However, if the model predicts a 95% probability of cat and it turns out to be a
dog, this should be highly unlikely.
Figure 2: Predicted probability of cat and the classification threshold. Source: Author.
Classifiers use a predicted probability and a threshold to classify the observations. Figure 2
visualizes the classification for a threshold of 50%. It seems intuitive to use a threshold of
50% but there is no restriction on adjusting the threshold. So, in the end the only thing that
matters is the ordering of the observations. Changing the objective to predict probabilities
instead of labels requires a different approach. For this, we enter the field of probabilistic
classification.
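The role of the threshold can be sketched in a few lines. The six predicted probabilities below are hypothetical; the point is only that the same probabilities give different labels under different thresholds.

```python
import numpy as np

# Hypothetical predicted probabilities of "cat" for six animals.
p_cat = np.array([0.95, 0.51, 0.49, 0.80, 0.10, 0.60])

# Same probabilities, two different classification thresholds.
labels_50 = np.where(p_cat >= 0.5, "cat", "dog")
labels_70 = np.where(p_cat >= 0.7, "cat", "dog")
print(labels_50)  # cat cat dog cat dog cat
print(labels_70)  # cat dog dog cat dog dog
```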
Evaluation metric 1: Logloss
Let us generalize from cats and dogs to class labels of 0 and 1. Class probabilities are any real
number between 0 and 1. The model objective is to match predicted probabilities with class
labels, i.e. to maximize the likelihood, given in Eq. 1, of observing class labels given the
predicted probabilities.
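$$L = \prod_{i=1}^{N} p_i^{\,y_i}\,(1 - p_i)^{1 - y_i}$$
Equation 1: Likelihood for class labels y and predicted probabilities based on features x,
where p_i is the predicted probability that y_i = 1.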
A major drawback of the likelihood is that if the number of observations grow, the product of
the individual probabilities becomes increasingly small. So, with enough data, the likelihood
will underflow the numerical precision of any computer. Next to that, a product of parameters
is difficult to differentiate. That’s the reason the logarithm of the likelihood is preferred,
commonly referred to as the loglikelihood. A logarithm is a monotonically increasing function
of its argument. Therefore, maximization of the log of a function is equivalent to maximization
of the function itself.
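The underflow problem is easy to reproduce. In this sketch, 2,000 observations each receive a predicted probability of 0.5 for their true label (made-up values): the product underflows to zero in double precision, while the sum of logarithms remains a perfectly ordinary number.

```python
import numpy as np

# 2,000 observations, each with probability 0.5 assigned to the true label.
p = np.full(2000, 0.5)

likelihood = np.prod(p)              # 0.5 ** 2000 underflows to 0.0
log_likelihood = np.sum(np.log(p))   # 2000 * log(0.5), about -1386.29
print(likelihood, log_likelihood)
```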
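$$\text{Logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$
Equation 2: Logloss for class labels y_i and predicted probabilities p_i based on features x_i.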
Nonetheless, the loglikelihood still scales with the number of observations, so the average
loglikelihood is a better metric for comparing results across data sets of different sizes.
In practice, most people minimize the negative average loglikelihood instead of maximizing
the average loglikelihood, because optimizers normally minimize functions. Data scientists
commonly refer to this metric as Logloss, as given in Eq. 2. For a more elaborate discussion
of the Logloss and its relation to the evaluation metrics normally used in classification
model evaluation, I refer you to this article.
Evaluation metric 2: Brier Score
Next to the Logloss, the Brier Score, as given in Eq. 3, is commonly used as an evaluation
metric for predicted probabilities. In essence, it is a quadratic loss on the predicted
probabilities and the class labels. Note the similarity with the Mean Squared Error
(MSE) used in regression model evaluation.
$$\text{BS} = \frac{1}{N}\sum_{i=1}^{N} (p_i - y_i)^2$$
Equation 3: Brier Score for class labels y_i and predicted probabilities p_i based on features x_i.
However, a notable difference from the MSE is that the minimum attainable Brier Score is
generally not 0. The Brier Score is the squared loss between 0/1 labels and probabilities,
so it reaches 0 only if every predicted probability is exactly 0 or 1 and correct. Simply
said, the minimum is not 0 if the underlying process is non-deterministic, which is the
reason to use probabilistic classification in the first place. To cope with this problem,
the probabilities are commonly evaluated on a relative basis against other probabilistic
classifiers, using for instance the Brier Skill Score.
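A minimal sketch of the Brier Score and a Brier Skill Score follows. The labels and probabilities are made up, and the skill score is computed here against one common choice of reference forecast (always predicting the observed base rate); other references are possible.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Made-up labels and predicted probabilities of class 1.
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.3])

bs = brier_score_loss(y_true, y_prob)  # mean of (y_prob - y_true) ** 2 = 0.0375

# Reference forecast (an assumption of this sketch): always predict the base rate.
y_ref = np.full(len(y_true), y_true.mean())
bs_ref = brier_score_loss(y_true, y_ref)  # 0.25

# Brier Skill Score: 1 is perfect, 0 is no better than the reference.
bss = 1.0 - bs / bs_ref  # 0.85
print(bs, bs_ref, bss)
```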
Example with dummy data
In this section I will show an example of the steps to go from classification to probability
estimation using dummy data. The example will show multiple ML models, ranging from
Logistic Regression to Random Forests. Let us first create dummy data using Sklearn. The
dummy dataset contains both informative as well as redundant features and multiple clusters
per class are introduced.
The dummy data is classified using the ML model structures:
 Logistic Regression (LR),
 Support Vector Machine (SVM),
 Decision Tree (DT),
 Random Forest (RF).
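The setup described above could be sketched as follows. The original code is not reproduced in these notes, so every concrete value here (sample size, number of features, train/test split, model settings) is an assumption chosen to keep the example small, and the resulting scores will not match the figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Dummy data with informative and redundant features and two clusters per class.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, n_redundant=5,
    n_clusters_per_class=2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="rbf", probability=True, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and score its predicted probabilities on held-out data.
auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]
    auc[name] = roc_auc_score(y_test, p)
    print(name, round(auc[name], 3))
```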
The ML model’s ability to correctly classify is evaluated using the ROC-AUC score. Figure 3
shows that all ML models do a fairly good job at classifying the dummy data, i.e. ROC-AUC
> 0.65, whereas RBF SVM and RF perform best.
Figure 3: ROC-AUC score on out-of-sample data for different ML model structures. Source:
Author.
However, recall that the model objective is predicting probabilities. It is nice that the ML
models accurately classify the observations, but how well do the models predict class probabilities?
There are two routes to evaluate the predicted probabilities:
 Quantitatively with the Brier Score and Logloss;
 Qualitatively with the calibration plot.
Quantitative evaluation of probabilities
Firstly, the ML models are quantitatively evaluated using the Brier Score and Logloss. Figure
4 shows that RBF SVM and RF perform best at probabilities estimation based on the Brier
Score (left) and the Logloss (right). Note, the Logloss of the DT is relatively high and to
understand the reason for this, I refer you to this article.
Figure 4: Brier Score (left) and Logloss (right) on out-of-sample data for different ML model
structures. Source: Author.
Qualitative evaluation of probabilities
Secondly, the ML models are qualitatively evaluated using the calibration plot. The goal of the
calibration plot is to show and evaluate whether predicted probabilities match the actual
fraction of positives. The plot groups the predicted probabilities into uniform buckets and, for
each bucket, compares the mean predicted probability to the fraction of positives. Figure 5 shows
the calibration plot
for our example. You can see that the LR and RBF SVM are well calibrated, i.e., the mean
predicted probability matches the fraction of positives nicely. However, inspecting the
distribution of predicted probabilities for LR shows that the predicted probabilities are more
centered than for the RBF SVM. Next to that, you see that the DT is ill-calibrated and the
distribution of predicted probabilities seems wrong.
Figure 5: Calibration plot (upper) and distribution of probabilities (under) for different ML
models. Source: Author.
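The numbers behind a calibration plot can be computed with scikit-learn's calibration_curve. The sketch below uses synthetic predictions that are perfectly calibrated by construction (each label is drawn with exactly the predicted probability), so the two columns should agree up to sampling noise; the sample size and number of buckets are arbitrary choices.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
# Synthetic, perfectly calibrated predictions: each label is 1 with
# exactly the predicted probability.
p_pred = rng.uniform(0, 1, size=5000)
y = (rng.uniform(0, 1, size=5000) < p_pred).astype(int)

# Ten uniform buckets over [0, 1]; one point of the calibration plot per bucket.
frac_pos, mean_pred = calibration_curve(y, p_pred, n_bins=10, strategy="uniform")
for m, f in zip(mean_pred, frac_pos):
    print(f"mean predicted {m:.2f} -> fraction of positives {f:.2f}")
```

For an ill-calibrated model the two columns diverge, which is what the calibration plot makes visible.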
Why do predicted probabilities not match posterior probabilities?
Niculescu-Mizil & Caruana explain in their 2005 paper “Predicting Good Probabilities With
Supervised Learning” why some ML models produce distorted predicted probabilities in
comparison to the posterior probabilities. Let us start with explaining the root cause. When a
classification model is not trained to decrease the Logloss, the predicted probabilities do not
necessarily match the posterior probabilities. A solution to this is to map the predicted
probabilities to posterior probabilities after model training, which is known as post-training
calibration.
Frequently used probability calibration techniques are:
 Platt Scaling (Platt, 1999)
 Isotonic Regression (Zadrozny, 2001)
Figure 6: Model performance after post-training probability calibration with Platt Scaling and
Isotonic regression. Source: (Niculescu et al., 2005)
Calibrating the Decision Tree and Random Forest
The ML models are calibrated using Platt scaling and Isotonic regression, which are both
easily coded in Sklearn. Note, the LR is not calibrated because this model structure is trained
to decrease the Logloss and therefore has calibrated probabilities by default.
The only tunable parameter is the number of cross-validation folds for probability calibration.
Niculescu-Mizil and Caruana (2005) show that a small calibration set can deteriorate performance
and that the performance improvement is largest for Boosted and Bagged Decision Trees and
Support Vector Machines. Our example contains 8,000 observations in the test dataset. With
5-fold cross-validation, 1,600 observations are reserved for the calibration set in each fold.
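As a sketch of how such a calibration could be coded with scikit-learn's CalibratedClassifierCV: method="sigmoid" corresponds to Platt scaling and method="isotonic" to isotonic regression. The data set, its size, and the decision to calibrate with 5-fold cross-validation on the training split are assumptions of this sketch, not the original experiment.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic data set (assumed sizes, for illustration only).
X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

scores = {}

# Uncalibrated decision tree as the baseline.
raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
scores["uncalibrated"] = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])

# Platt scaling ("sigmoid") and isotonic regression, each with 5-fold CV.
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(
        DecisionTreeClassifier(random_state=0), method=method, cv=5
    )
    calibrated.fit(X_train, y_train)
    p = calibrated.predict_proba(X_test)[:, 1]
    scores[method] = brier_score_loss(y_test, p)

for name, value in scores.items():
    print(name, round(value, 4))
```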
Figure 7: Brier Score for Platt Scaling and Isotonic Regression for different ML models and
different calibration set sizes. Source: (Niculescu et al., 2005)
Model evaluation after probability calibration
Let’s see whether probability calibration improves the Brier Score, Logloss and the calibration
plot. Figure 7 shows the Brier Score and Logloss after probability calibration. Isotonic
Regression has equivalent performance compared to Platt Scaling, because the dataset
contains a sufficient number of observations. Given the non-parametric nature of Isotonic
regression, I must warn against using it with small calibration sets. If you intend to do
probabilistic classification on small data sizes, I suggest you use prior information and explore
the field of Bayesian classification.
Figure 7: Brier Score (left) and Logloss (right) after post-training calibration on out-of-
sample data. Source: Author.
Figure 8 shows the calibration plots after post-training calibration. You see an improvement in
the predicted probabilities for the SVM, DT and RF. Next to that, the distribution of predicted
probabilities covers the range of [0, 1] completely and provides accurate mean predicted
probabilities.
Figure 8: Calibration plot (upper) and probabilities (under) after post-training calibration of
the ML models. Source: Author.
Does probability calibration impact the classification ability?
Calibration does not change the ordering of predicted probabilities. The calibration only
changes the predicted probabilities to better match the observed fraction of positives. Figure 9
shows that after probability calibration, the model’s classification ability, as measured by the
ROC-AUC score is either equal or better.
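The claim about ordering can be checked directly: ROC-AUC depends only on the ranking of the scores, so any strictly increasing map, such as a Platt-style sigmoid, leaves it unchanged. The sketch below uses made-up scores and made-up sigmoid coefficients. Isotonic regression is only non-decreasing, so it can introduce ties and shift the AUC slightly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
# Hypothetical raw scores: informative, but not calibrated probabilities.
scores = y + rng.normal(0, 1, size=500)

# A Platt-style sigmoid with made-up coefficients (slope 2, intercept -1).
probs = 1.0 / (1.0 + np.exp(-(2.0 * scores - 1.0)))

auc_raw = roc_auc_score(y, scores)
auc_sigmoid = roc_auc_score(y, probs)
print(auc_raw, auc_sigmoid)  # identical: the ranking is unchanged
```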
