MZU-MBA-DATA ANALYTICS - Data Science and Business Analysis - Unit 3
SEMESTER-I
All rights reserved. No Part of this book may be reproduced in any form without permission
in writing from TeamLease Edtech Pvt. Ltd.
CONTENT
UNIT - 3: Classification
UNIT - 3: CLASSIFICATION
STRUCTURE
3.2 Introduction
3.4 SVM
3.4.2 Kernels
3.8 Clustering
3.8.2 K-means
3.9 Summary
3.2 INTRODUCTION
This unit focuses on the classification of data. It provides an insight into various ensemble methods and introduces the concept of probability, including Bayes' theorem and the method of logistic regression. The unit also covers the concept of clustering.
The raw data obtained from surveys or macro databases must be processed and organised so that it is ready for analysis. As a result, researchers must format data files for the purposes of the research report. Data management involves various tasks undertaken by researchers, such as arranging, appending, combining, and collapsing data, to make the data easy to navigate and use.
The data is sorted by placing the observations in a particular order. For instance, data may be
sorted by month or year.
If researchers use more than one data file for analysis, the files are merged into a single file. This is achieved by appending the records of the second file to the first. If the researcher discovers that two files contain the same observations but under separate variables, the files are combined for analysis.
3.3 DATA MANAGEMENT AND ANALYSIS
The data management process includes the editing, coding, classification, and tabulation of
the data gathered. Let us read about each of these processes.
i. Editing: Editing is a method for examining the raw material collected in surveys to find any mistakes, omissions, or inconsistencies. Data editing is often referred to as data cleaning. For example, if the study had school girls as subjects, we would expect ages to fall within a certain range. If the data mistakenly records an age that is not consistent with that range, it must be verified.
Cleaning also helps in detecting inconsistent data, where an answer is improbable or impossible. For example, a respondent recorded as illiterate cannot also claim to read the newspaper. Missing values, or responses that were not provided, are also often detected during data cleaning. Editing or cleaning data ensures that the data is as reliable as possible, consistent with all other data collected, and entered uniformly to support the coding process. Such data should be corrected wherever feasible and prepared for coding. For example, if a respondent in a study of teenagers reports an age that does not fall into the expected range, the age must be checked and a corrected value entered.
ii. Coding: Coding is the method of conceptualising research data and classifying it into concrete and specific categories. Coding is performed to enable the review and evaluation of the results. It is the process of assigning numbers or other symbols to answers so that responses can be grouped into particular categories. All data must fit into some category. For instance, if respondents are categorised by gender, all males could be coded as 1 and all females as 2.
Thus, categories 1 and 2 are mutually exclusive, which means each observation is coded as either 1 or 2. Coding can be performed manually by transferring data onto a coding sheet. However, a large volume of data is usually coded in spreadsheets using a computer. The coding is carried out by a single coder or the researcher in smaller studies, and by a team of coders for a comparatively large sample.
iii. Classification: Classification is the next step in data management. The coded data is grouped into distinct classes or groups based on a shared trait: data is classified either by attributes or by class intervals.
iv. Tabulation: Once the data is grouped into distinct classes, it is organised in an orderly manner in tables. This systematic arrangement of data in tables is called tabulation. Presenting the data in a compact table helps readers interpret the findings of the analysis. Tabulation eliminates the need for long descriptions and clarifications, and it facilitates easy comparison of results and further statistical analysis. Tabulation can be performed manually or with the aid of a computer. Manual tabulation is feasible for smaller studies, but larger studies involving comparatively large amounts of quantitative data are practical only where machines are available for tabulation.
The grouping of data by attributes is based on some common characteristic, either qualitative or quantitative. Data on attributes are commonly used in the social sciences such as Psychology, Sociology, Political Science, Communication Studies, and Gender Studies. The classification of data by attributes may be simple, where only one attribute is considered and the data is divided into two classes; the respondents in a sample, for example, are listed as either men or women. There is also manifold classification of data by attributes, in which data are grouped into various categories and classes depending on two or more attributes.
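A manifold (two-attribute) tabulation like the one described can be sketched with the standard library; the respondents and attributes below are invented for illustration:

```python
# Manual tabulation sketch: cross-tabulating respondents by two
# attributes (gender x reads_newspaper) with collections.Counter.
from collections import Counter

respondents = [
    ("male", "yes"), ("female", "no"), ("female", "yes"),
    ("male", "no"), ("female", "yes"), ("male", "yes"),
]
table = Counter(respondents)

# Print one row of the cross-tabulation per gender category.
for gender in ("male", "female"):
    row = {answer: table[(gender, answer)] for answer in ("yes", "no")}
    print(gender, row)
```

Each cell of the resulting table is a count of respondents sharing both attribute values, which is the compact form the text describes.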
Machine learning involves predicting and classifying data, and to do so we apply different machine learning algorithms to the dataset.
Support Vector Machines (SVM, for short) is one of the most common algorithms. It was highly popular when it was developed and refined in the 1990s, remains popular today as one of the best options for high-performance algorithms with little tuning, and offers one of the most robust prediction methods.
SVM is applied differently as opposed to other ML algorithms. The SVM training algorithm
creates a model that assigns new examples to one group or another, making it a non-
probabilistic linear classifier.
SVM is a supervised learning algorithm used for both classification and regression problems. However, it is mostly used for classification problems in machine learning. For classification, SVMs can effectively perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
SVM also has an unsupervised variant. When data is not labeled, supervised learning is not feasible; an unsupervised learning approach is needed to find natural clusters in the data and then map new data to these clusters. The support-vector clustering algorithm applies the statistics of support vectors, developed in the support vector machine algorithm, to categorise unlabeled data. It is one of the most widely used clustering algorithms in industrial applications.
Let's get started with a problem. Suppose there is a dataset as seen below, and it is important
to classify the red rectangles from the blue ellipses (let's assume the positive ones from the
negatives). So the job is to find the optimal line that divides this data set into two groups (say
red and blue).
As you can see, there is no single obvious line that does the job; in fact, there are infinitely many lines that can separate these two groups. So how does SVM choose the ideal one?
Figure 3.2: Example Dataset
There are two options, a green-colored line and a yellow-colored line. Which line would
better separate the data?
If the yellow line is considered best, that is indeed the line being looked for. In this case it is visually intuitive that the yellow line separates the classes better, but a more rigorous criterion is needed.
The green line in the picture above passes very close to the red class. While it classifies the current dataset, it is not a generalised line, and in machine learning our goal is to find a more generalised separator.
According to the SVM algorithm, we find the points nearest to the line from both classes. These points are called support vectors. We then measure the distance between the line and the support vectors; this distance is referred to as the margin. The goal is to maximise the margin: the hyperplane with the highest margin is the optimal hyperplane.
Figure 3.3: Example Dataset
SVM is thus attempting to make a decision on the boundaries in such a way that the
distinction between the two groups (the street) is as wide as possible.
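A minimal sketch of this idea using scikit-learn's `SVC`; the toy points and the large-`C` hard-margin approximation are assumptions, not from the text:

```python
# Fit a linear SVM on a tiny two-class dataset and inspect the
# support vectors (the points nearest the separating line).
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],      # class 0 (e.g., "red")
     [5, 5], [6, 5], [5, 6]]      # class 1 (e.g., "blue")
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1e6)  # a large C approximates a hard margin
clf.fit(X, y)

print(clf.support_vectors_)           # only these points define the margin
print(clf.predict([[2, 2], [6, 6]]))  # → [0 1]
```

Note that only the support vectors influence the fitted boundary; removing any other point would leave the maximum-margin hyperplane unchanged.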
Consider data that is not linearly separable: no straight line can separate it. However, this data can be transformed into linearly separable data in a higher dimension. Let's add another dimension and call it the z-axis, with the z-coordinates governed by the constraint z = x^2 + y^2.
So, essentially, the z-coordinate is the squared distance of the point from the origin. Let's plot the data against the z-axis.
The data is simply linearly separable now. Let the purple line dividing the higher dimension
data be z=k, where k is a constant. Since z=x^2+y^2, we get x^2 + y^2 = k, which is a circle
equation. This transformation can be used to project this linear separator in higher
dimensions back to the original dimensions.
Thus, data can be made linearly separable by adding an extra dimension, and the decision boundary can then be projected back to the original dimensions by the inverse mathematical transformation. But finding the right transformation for every dataset is not easy.
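The z = x^2 + y^2 lifting described above can be sketched directly; the sample points and the choice k = 1 are illustrative:

```python
# Dimension-lifting sketch: map each 2-D point (x, y) to (x, y, z)
# with z = x**2 + y**2. Points inside the circle of radius sqrt(k)
# get z < k and points outside get z > k, so the plane z = k
# linearly separates them in 3-D.
def lift(point):
    x, y = point
    return (x, y, x**2 + y**2)

inner = [(0.5, 0.0), (0.0, -0.5), (0.3, 0.3)]   # inside the unit circle
outer = [(2.0, 0.0), (1.5, 1.5), (0.0, -3.0)]   # outside the unit circle

k = 1.0  # the plane z = k projects back to the circle x^2 + y^2 = k
assert all(lift(p)[2] < k for p in inner)
assert all(lift(p)[2] > k for p in outer)
print("plane z =", k, "separates the lifted points")
```

Projecting the separating plane z = k back down recovers exactly the circle x^2 + y^2 = k described in the text.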
3.4.2 Kernels
The SVM algorithm uses a class of mathematical functions known as kernels. Often it is not possible to find a hyperplane or a linear decision boundary for a classification problem in the original space. If the data is projected into a higher dimension, a hyperplane in the projected space may be found that separates the data.
For instance, it may be difficult to find a line that divides the two classes in the input space, but if the same data points are projected into a higher dimension, the two classes can be separated using a hyperplane. Refer to the example below: initially it is hard to distinguish the two classes, but once projected onto a higher dimension, a hyperplane quickly separates the two classes for classification.
The kernel trick uses a function that transforms the data into a suitable form. There are different types of kernel functions used in the SVM algorithm, e.g., linear, polynomial, radial basis function (RBF), and so on. Using the kernel trick, a low-dimensional input space is implicitly transformed into a higher-dimensional space.
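One way to see the kernel choice matter is a concentric-ring dataset, which no straight line can separate but an RBF kernel handles easily; this sketch assumes scikit-learn and invents the data:

```python
# Compare kernels on a ring-shaped (non-linearly-separable) problem:
# an inner cluster (class 0) surrounded by a ring (class 1).
import math
from sklearn.svm import SVC

X, y = [], []
for i in range(40):
    a = 2 * math.pi * i / 40
    X.append([0.3 * math.cos(a), 0.3 * math.sin(a)]); y.append(0)
    X.append([2.0 * math.cos(a), 2.0 * math.sin(a)]); y.append(1)

rbf = SVC(kernel="rbf").fit(X, y)   # RBF kernel fits the circular boundary
print(rbf.score(X, y))              # close to 1.0 on this data
print(rbf.predict([[0, 0], [2, 0]]))  # → [0 1]
```

The RBF kernel corresponds exactly to the implicit high-dimensional mapping discussed above, so the circular boundary in 2-D is a hyperplane in the kernel's feature space.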
Ensemble methods are a machine learning technique that integrates many simple models to create one optimised predictive model. Ensemble learning boosts machine learning results by combining several models, allowing better predictive performance than any single model. That is why ensemble methods rank first in many prestigious machine learning competitions.
● Parallel ensemble methods: where the basic learners are created in parallel (e.g.,
Random Forest).
The basic motive of parallel approaches is to exploit the independence between the basic learners, since the error can be reduced significantly by averaging.
Rather than considering the same set of features at each node, Random Forest models split each tree on a different random subset of features. This degree of differentiation provides a more diverse set of trees to aggregate over, thereby producing a more reliable predictor. Please refer to the picture for a clearer interpretation.
Figure 3.8: Classification of Random Forests
Similar to bagging, bootstrapped subsamples are taken from the larger dataset, and a decision tree is built for each subsample. However, each tree is split on a different random selection of features.
In an extremely randomised trees algorithm, randomness goes one step further: the split thresholds themselves are randomised. Instead of searching for the most discriminative threshold, thresholds are drawn at random for each candidate feature, and the best of these randomly generated thresholds is selected as the splitting rule. This usually reduces the model's variance a little more, at the cost of a slightly greater increase in bias.
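Both variants are available in scikit-learn; this hedged sketch contrasts `RandomForestClassifier` and `ExtraTreesClassifier` on an invented toy dataset:

```python
# Contrast bagged random-feature splits (RandomForest) with fully
# randomised thresholds (ExtraTrees) in scikit-learn.
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Tiny synthetic dataset: label is 1 when the first feature exceeds 5.
X = [[i, i % 3] for i in range(10)]
y = [1 if row[0] > 5 else 0 for row in X]

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
et = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

print(rf.predict([[9, 0], [1, 2]]))  # → [1 0]
print(et.predict([[9, 0], [1, 2]]))  # → [1 0]
```

On a problem this simple both ensembles agree; the bias/variance trade-off between them only shows up on noisier, higher-dimensional data.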
Q1 What do you understand by the term ensembling? Give an overview of the various
ensemble methods used in data science.
The theory of probability originated in the gaming houses of France in the seventeenth century. Gambling problems were brought to the notice of mathematicians like Pascal and Fermat by upper-class gamblers. They and other mathematicians in Europe noted that many gambling and betting problems could be solved by means of permutation and combination methods. They also found that certain other problems of chance required further mathematical methods, such as calculus, and they foresaw the immense possibilities of practical application of the laws of chance they had derived in various fields of science and human life.
However, it was only in the 20th century that the fundamental ideas and definitions of probability theory were given a rigorous mathematical foundation by mathematicians such as Kolmogorov and Markov, among others.
If we extend this definition to case A: "getting an odd number when a die is rolled" then we
get P(A)=3/6= 1/2.
This description of probability is referred to as the classical definition. Apart from this, there is another way to define probability, called the empirical or relative frequency approach.
Suppose a die is rolled 600 times, and the number '2' appears 98 times. The relative frequency of the occurrence of 2 is then 98/600, and we can take this as a reasonable approximation of its probability. Thus, the probability of an event A can also be defined as:
P(A) = (number of trials in which A was observed) / (total number of trials)
The only difference between the classical and the empirical definition of P(A) is the word "observed": it means that the experiment was repeated a number of times before arriving at the value of P(A).
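The empirical definition can be illustrated by simulation; the trial count and random seed below are arbitrary choices:

```python
# Empirical (relative-frequency) probability sketch: roll a fair die
# many times and compare the observed frequency of a '2' with the
# classical probability 1/6.
import random

random.seed(42)
trials = 60_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) == 2)

relative_freq = hits / trials
print(round(relative_freq, 3))            # close to 1/6 ≈ 0.167
assert abs(relative_freq - 1 / 6) < 0.01  # converges as trials grow
```

As the number of trials grows, the relative frequency settles ever closer to the classical value, which is the content of the law of large numbers.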
Q1 Define the concept of probability. How is the classical definition of probability different
from the empirical definition? Support your answer with suitable examples.
Q2 Find the probability of getting i) a queen, ii) a jack of hearts, and iii) a red card, from a
pack of well-shuffled cards.
Suppose there are two balls, 1 and 2. Suppose ball 1 is selected with a probability of 0.7, and ball 2 with a probability of 0.8. Let S1 denote the event that ball 1 is picked up, and S2 the event that ball 2 is picked up. Clearly, the information that S1 has occurred does not change the probability of the occurrence of S2, and similarly, whether or not S2 happened does not affect the probability of S1. Thus, P(S2) = 0.8 whether or not S1 occurred, and P(S1) = 0.7 whether or not S2 occurred. Such events are called independent events.
Now consider another example. A box contains five white and three red cubes; a second, identical box contains three white and five red cubes. A box is chosen, and a cube is drawn from it.
Let B be the event that the first box is selected, B' the event that the second box is selected, and W the event that a white cube is drawn.
In this case, if event B happened, that is, if we chose the first box, the probability of drawing a white cube is 5/8. But if event B' happened, that is, if we chose the second box, the probability of drawing a white cube is 3/8. This implies that the probability of W depends on our choice of box, so events B and W are dependent events. We write P(W|B) for the probability of W given that B has occurred.
Definition 2: Two events A and B are said to be independent if the probability of occurrence of event B is in no way influenced by the occurrence or non-occurrence of event A.
The probability of event B, provided that event A has already happened, is called the
conditional probability of B being given by A and is denoted by P(B|A). A and B are both
independent if and only if P(B|A) = P (B).
Example: Let A and B be events on the same sample space, with P(A) = 0.6 and P(B) = 0.7. Can these two events be disjoint?
Solution: P(A∪B) = P(A) + P(B) − P(A∩B). If A and B were disjoint, P(A∩B) = 0, giving P(A∪B) = 0.6 + 0.7 = 1.3. Since a probability cannot be greater than 1, the two events cannot be disjoint.
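The arithmetic of this example can be checked in a line or two of Python:

```python
# Check of the worked example: if A and B were disjoint,
# P(A ∪ B) = P(A) + P(B) = 1.3 > 1, which is impossible.
p_a, p_b = 0.6, 0.7
union_if_disjoint = p_a + p_b   # P(A∩B) would be 0 for disjoint events
print(union_if_disjoint)        # → 1.3
assert union_if_disjoint > 1    # hence A and B cannot be disjoint
```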
Bayes Theorem
Theorem 1 (Bayes' Theorem): Suppose that event A can occur in conjunction with exactly one of the mutually exclusive events H1, H2, ..., Hk. The occurrence of A indicates that one of the events Hi, i = 1, 2, ..., k, has happened. Let P(Hi) denote the probability of occurrence of Hi, and P(A|Hi) the conditional probability of A given that Hi occurred, i = 1, 2, ..., k. The conditional probability that Hi occurred, given that A is known to have occurred, is:
P(Hi|A) = P(Hi) P(A|Hi) / [P(H1) P(A|H1) + ... + P(Hk) P(A|Hk)]
P(Hi) is referred to as the prior probability of Hi, i = 1, 2, ..., k, and P(Hi|A) as the posterior probability. For instance, with just two mutually exclusive events M and M' in place of H1, ..., Hk, and an observed event C in place of A, the theorem gives the posterior probabilities of M and M' given C.
The utility of this theorem can be seen from the following example.
Example: The box includes seeds in 4 grades A, B, C, D. Each of the four grades has 80,
50, 40, and 20 percent chances of germinating. The box comprises four grades of seed in
1:2:3:4 proportions. A seed is taken randomly from the box and is planted. If it
germinates, what are the probabilities that it was grade A, B, C, or D?
Solution: Let HA, HB, HC, and HD denote the events that the seed is of grade A, B, C, and D, respectively, and let G be the event that it germinates. The 1:2:3:4 proportions give the priors P(HA) = 0.1, P(HB) = 0.2, P(HC) = 0.3, P(HD) = 0.4, and the germination rates give P(G|HA) = 0.8, P(G|HB) = 0.5, P(G|HC) = 0.4, P(G|HD) = 0.2. By Bayes' theorem,
P(HA|G) = (0.1 × 0.8) / (0.1 × 0.8 + 0.2 × 0.5 + 0.3 × 0.4 + 0.4 × 0.2) = 0.08/0.38 = 4/19
Similarly,
P(HB|G)= 5/19
P(HC|G)= 6/19
P(HD|G)= 4/19
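The posteriors above can be verified numerically; this sketch uses Python's `fractions` module to keep the 4/19, 5/19, 6/19, 4/19 values exact:

```python
# Numeric check of the seed example: priors from the 1:2:3:4 proportions,
# likelihoods from the germination rates 0.8 / 0.5 / 0.4 / 0.2.
from fractions import Fraction as F

priors = {"A": F(1, 10), "B": F(2, 10), "C": F(3, 10), "D": F(4, 10)}
likelihood = {"A": F(8, 10), "B": F(5, 10), "C": F(4, 10), "D": F(2, 10)}

evidence = sum(priors[g] * likelihood[g] for g in priors)   # P(G) = 19/50
posterior = {g: priors[g] * likelihood[g] / evidence for g in priors}

for g in ("A", "B", "C", "D"):
    print(g, posterior[g])   # → 4/19, 5/19, 6/19, 4/19
```

Working in exact fractions avoids the rounding that decimal arithmetic would introduce and reproduces the textbook answers directly.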
Q1 If 4 cards are drawn from a pack of cards, what is the probability that there is one card from each suit?
Q2 It is known that 3 out of every 100 men and 25 out of every 10,000 women are color-blind. In a community, about half the population is male.
(i) What is the probability that a person chosen at random from the community will be color-
blind?
(ii) What is the probability that a color-blind person chosen at random from among all
colorblind persons in the community will be a male?
Logistic Function
Logistic regression is named for the function at the core of the process, the logistic function.
The logistic function, also known as the sigmoid function, was developed by statisticians to describe the growth of populations in ecology: rising quickly at first and then levelling off at the carrying capacity of the environment. It is an S-shaped curve that can take any real-valued number and map it to a value between 0 and 1, but never exactly at those boundaries.
1 / (1 + e^-value)
Here e is the base of the natural logarithms (Euler's number, the EXP() function in spreadsheets), and value is the actual number you want to transform. Below is a graph of numbers transformed into the range 0 to 1 using the logistic function.
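A direct sketch of the function in Python (the sample inputs are arbitrary):

```python
# The logistic (sigmoid) function from the text: 1 / (1 + e^-value),
# mapping any real number into the open interval (0, 1).
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

for v in (-6, -2, 0, 2, 6):
    print(v, round(sigmoid(v), 4))

# 0 maps to exactly 0.5; large |v| approaches but never reaches 0 or 1.
assert sigmoid(0) == 0.5
assert 0 < sigmoid(-10) < sigmoid(10) < 1
```

The S-shape comes from the exponential: far below zero the output hugs 0, far above zero it hugs 1, with the steepest change around the midpoint.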
Input values (x) are linearly combined using weights or coefficient values (referred to as the
Greek capital letter Beta) to estimate the output value (y). The main distinction from linear
regression is that the output variable being modelled is a binary value (0 or 1) rather than a
numerical value.
For example, the logistic regression equation for a single input can be written as:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Here y is the predicted output, b0 is the bias or intercept term, and b1 is the coefficient for the single input value (x). Each column in your input data has an associated b coefficient (a constant real value) that must be learned from your training data.
The actual representation of the model that you would store in memory or in a file is the set of coefficients in the equation (the beta values, or b's).
For example, suppose we model whether a student is a boy or a girl from their height. In that case, the default class could be boy, and the logistic regression model could be written as the probability of boy given the person's height, or more formally:
P(sex=boy|height)
In another way, we model the probability that the input (X) belongs to the default class
(Y=1), and we can write this formally as:
P(X) = P(Y=1|X)
Note that the probability forecast must be converted into a binary value (0 or 1) in order to
actually make a probability prediction.
Logistic regression is a linear method, but the predictions are transformed using the logistic function. The effect of this is that we can no longer interpret the predictions as a linear combination of the inputs as we can with linear regression. Continuing from above, the model can be stated as:
ln(p(X) / (1 - p(X))) = b0 + b1 * X
This is useful because we can see that the calculation of the output on the right is linear again (just like linear regression), and the quantity on the left is the log of the probability of the default class.
The ratio on the left is called the odds of the default class. Odds are calculated as the probability of an event divided by the probability of the event not occurring, e.g. 0.8/(1 - 0.8), which gives odds of 4. Writing the model in terms of odds:
ln(odds) = b0 + b1 * X
Because the odds are log-transformed, this left-hand side is called the log-odds or the logit. Other types of transform functions could be used, but it is common to refer to a transformation that relates a linear regression equation to probabilities as a link function.
We can move the exponent back to the right and write the model as:
odds = e^(b0 + b1 * X)
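The probability-odds-log-odds round trip can be checked numerically, using the text's p = 0.8 example:

```python
# Probability <-> odds <-> log-odds relationships from the text.
import math

p = 0.8
odds = p / (1 - p)             # → 4, as in the text
log_odds = math.log(odds)      # the logit: the linear part b0 + b1*X

# Invert: exponentiate the log-odds, then convert odds back to probability.
p_back = math.exp(log_odds) / (1 + math.exp(log_odds))
print(round(odds, 6), round(log_odds, 4), round(p_back, 6))

assert abs(odds - 4.0) < 1e-9
assert abs(p_back - p) < 1e-9
```

The inversion in the last step is exactly the sigmoid applied to the linear term, which is why logistic regression's predictions always land in (0, 1).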
All of this lets one realise that the model is still a linear combination of inputs, but that this
linear combination corresponds to the log-odds of the default class.
One might recognise predicting a person's weight from their height as a regression problem, where a linear regression model can be constructed. Once the model has been trained, the weight for a given unknown height value can be predicted.
Now assume there is an additional field, Obesity, and whether or not a person is obese must be classified based on height and weight. This is clearly a classification problem, where dividing the dataset into two classes (obese and not-obese) is required.
Using the steps of linear regression, construct a regression line. This time the line is based on the two parameters, Height and Weight, and the regression line must fit between two discrete sets of values. Because this regression line is particularly vulnerable to outliers, it would not do a decent job of classifying the two groups.
To get a better classification, feed the output values from the regression line into the sigmoid function. The sigmoid function returns the probability for each output value from the regression line. Then, based on a predefined threshold value, each output can be assigned to one of the two classes, obese or not-obese.
Figure 3.12: Logistic Regression vs Linear Regression graph
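A hedged sketch of this height/weight classification with scikit-learn's `LogisticRegression`; the measurements and labels are invented for illustration:

```python
# Classify obese vs not-obese from height (cm) and weight (kg).
from sklearn.linear_model import LogisticRegression

X = [[160, 50], [170, 60], [180, 70],    # not obese (0)
     [160, 95], [170, 110], [180, 120]]  # obese (1)
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)

# predict_proba applies the sigmoid to the linear combination of inputs;
# predict then thresholds the probability at 0.5.
print(model.predict([[165, 55], [165, 105]]))         # → [0 1]
print(model.predict_proba([[165, 105]])[0, 1] > 0.5)  # → True
```

This mirrors the two-step picture in the text: a linear combination of Height and Weight, squashed through the sigmoid, then thresholded into the two classes.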
Example: Consider the model p = P(y = 1|x; w) = g(w·x), where g(z) is the logistic function and P(y = 1|x; w), viewed as a function of x, is parameterised by the weights w. What would be the range of p in such a case?
Solution: For values of x ranging over the real numbers from −∞ to +∞, the logistic function gives output in the interval (0,1).
Similarities
i. Both linear and logistic regressions are supervised machine learning algorithms.
ii. Linear regression and logistic regression, both models are parametric regression, i.e., both
models use linear equations for estimation.
Differences
i. Linear regression is used to solve regression problems, while logistic regression is used to
address classification problems.
ii. Linear regression produces continuous output, but logistic regression delivers discrete
output.
iii. Linear regression aims to find the best-fitted line, while logistic regression is one step
ahead and aligns the values of the line to the sigmoid curve.
iv. The loss function in linear regression is the mean squared error, while logistic regression is fitted by maximum likelihood estimation.
CHECK YOUR PROGRESS-6
Q1 For values of x in the range of real numbers from −∞ to +∞, which function gives p between (0,1)?
Q2 The logit function (written l(x)) is the log-odds function. What is the range of the logit function on the domain x = [0,1]?
Statistical models have become particularly relevant as they have become prevalent in
contemporary culture. They help us to make different kinds of predictions in our daily lives.
For example, physicians rely on general laws drawn from formulas that tell them which
individual patient cohorts are at an elevated risk of a particular disease or occurrence. A
numerical estimate of the flight's arrival time will help us understand whether our aircraft is
expected to be delayed. In other instances, models are effective at showing us what is important or concrete.
In all these instances, models are generated by taking existing data and finding a mathematical representation with appropriate fidelity to the data. Important values can then be estimated from such a model. In the case of airline delays, predicting the outcome (arrival time) is the quantity of interest, whereas in other settings a particular model parameter may be the quantity of interest, such as an estimate of hiring bias. In the latter case, the bias estimate is normally compared to the expected variation (i.e., noise) in the data.
A decision is then made based on how unusual such a finding would be relative to the noise, a notion commonly referred to as "statistical significance." This type of approach is typically considered inferential: a conclusion is drawn in order to explain something.
On the other hand, the prediction of a particular value (such as arrival time) is an estimation problem, where the aim is not simply to explain whether a pattern holds but to make the most accurate determination of that value. The uncertainty of the prediction is another essential quantity, in particular for gauging the reliability of the value produced by the model.
Whether the model can be used for inference or prediction (or, in some situations, both) is determined by certain essential characteristics of it. Parsimony (or simplicity) is a central element. Simple
models are usually preferred to complex models, particularly when the target is inference. It is simpler, for example, to verify practical distributional assumptions in models with fewer parameters. Parsimony also brings a higher potential for understanding a model. For
example, an economist might be interested in quantifying the benefit of postgraduate education on salary. A basic model could represent a linear relationship between years of schooling and wages. This parameterisation can conveniently facilitate statistical
inferences on the possible benefits of such schooling. But suppose the relationship is
significantly different between professions and/or not linear. A more complex model would
do a better job of capturing data patterns but would be much less interpretable.
The dilemma, though, is that accuracy cannot be seriously sacrificed for the sake of simplicity.
A simple model can be easy to understand but will not work if it does not preserve a
reasonable degree of data fidelity; if the model is just 50 percent correct, can it be used to
draw inferences or predictions? Complexity is typically the solution for low precision. Using
additional parameters or using an essentially non-linear model, we can increase precision, but
interpretability is likely to suffer greatly. This trade-off is a crucial factor for model design.
However, the variables used in the model and how they are represented are just as important to success. It is difficult to talk about modelling without mentioning models, but one goal of this discussion is to increase the focus on a model's predictors.
In terms of nomenclature, the quantity that is being modelled or forecast is known as either
the response, the outcome, or the dependent variable. Variables used to simulate the result
are called predictors, functions, or independent variables (depending on the context).
E.g., when modelling the selling price of a house (outcome), the features of a property (e.g.,
square feet, number of bedrooms and bathrooms) may be used as predictors. Consider, though, engineered terms that are composites of one or more factors, such as the number of bedrooms per bathroom. Such a variable can be more properly referred to as a feature (or a derived feature). In any scenario, features and predictors are used to describe the result of the model.
The belief that there are different ways to represent predictors in a model, and that some of these representations are better than others, leads to the idea of feature engineering: the process of creating data representations that improve the effectiveness of a model.
Notice that many factors affect model performance. If the predictor has no relationship to the
result, its interpretation is meaningless. However, it is very important to note that there are
several types of models and that each has its sensitivities and needs. For instance:
● Some models are unable to accommodate predictors that calculate the same
underlying quantity (i.e., the correlation between predictors).
Feature engineering and variable selection can help to alleviate many of these problems. The
purpose is to help create better models by concentrating on predictors. "Better" depends on
the context of the problem but most definitely includes the following factors: precision,
simplicity, and robustness. To obtain these characteristics or to make good trade-offs
between them, it is important to consider the relationship between the predictors used in the
model and the form of the model. Accuracy and/or simplicity can often be enhanced by
depicting information in ways that are more appealing to the model or by reducing the
number of features used.
i. Overfitting
Overfitting is a condition where the model fits the existing data very well but fails on new samples. It typically happens when the model relies too heavily on patterns and anomalies in the current data set that do not occur otherwise. Since the model only has access to the current data set, it has no ability to recognise that certain phenomena are anomalous. Very flexible models are especially likely to overfit the data. It is not difficult for these models to do exceptionally well on the data set used to construct the model, yet without any protective mechanism they can quickly fail to generalise to new data.
Just as models can overfit the data points, feature selection strategies can overfit the predictors. This happens when a variable appears important on the existing data collection but shows no real relationship to the result once new data is obtained. The risk of this type of overfitting is especially high when the number of data points, denoted n, is small and the number of possible predictors, p, is very large. As with the overfitting of data points, this issue can be mitigated using a technique that signals when it happens.
Supervised data analysis involves detecting trends between predictors and an identified
outcome that needs to be modelled or forecast, whereas unsupervised approaches rely
exclusively on identifying patterns between predictors. Usually, all forms of analyses will
require a degree of experimentation. Exploratory Data Analysis (EDA) (Tukey 1977) is used
to understand the main features of the predictors and the outcome so that any relevant
problems associated with the data can be established before modelling. This could involve
investigations of association mechanisms in variables, patterns of incomplete data, and/or
deviation patterns in data that may contradict the modeler's initial expectations.
Both supervised and unsupervised analyses are vulnerable to overfitting, but supervised analyses are more prone to finding spurious trends in the data for predicting the result. In short, careless use of these methods can build a self-fulfilling predictive prophecy.
The "No Free Lunch" Theorem (Wolpert 1996) is the principle that, without any specific knowledge of the problem or data at hand, no one predictive model can be assumed to be the best. Several models are tailored to particular data characteristics, such as missing values or collinear predictors, and in such cases it would be fair to expect them to do better than other models (all other things being equal). In reality, things aren't that easy: a model designed for collinear predictors may be limited to linear data patterns and vulnerable to missing data. It is very difficult to predict the best model, especially before the data is in hand.
Experiments have been performed to determine whether some models perform better than
others on average, notably Demšar (2006) and Fernández-Delgado et al. (2014). These
analyses show that certain models tend to generate the most reliable results, but their rate of
"winning" is not high enough to justify a policy of "always use Model X."
In practice, it is prudent to try out a variety of different types of models to find what works
best for your specific data set.
The process of building an effective model is both iterative and heuristic. It is impossible to
know the needs of a data set before working with it, and it is normal for multiple approaches
to be tried and revised before a model can be finalised. Many books and tools focus
exclusively on the modelling technique, but this step is only a small part of the overall
process.
Figure 3.13 Analysis Process
The initial activity starts at marker (a), where exploratory data analysis is used to
examine the data. After initial exploration, marker (b) shows where early data analysis can
take place. This may involve computing basic summary measures or identifying predictors
that have strong associations with the outcome. The process may cycle between visualisation
and interpretation until the modeller is confident that the data is well understood. At (c), the
first draft of how the predictors will be represented in the models is drawn up based on the
previous analysis.
At this stage, several separate modelling approaches may be evaluated with an initial set of
features. This is shown in (d), where four clusters of models appear as thin red marks,
representing four different models being assessed; each is evaluated several times over a
range of candidate hyperparameter values. Once the four models have been tuned, they are
numerically analysed to understand their performance characteristics (e). Summary
measures for each model, such as model accuracy, are used to gauge the difficulty of the
problem and to decide which models are best suited to the data.
Further EDA can then be carried out on the model results (f), such as residual analysis. A
further round of feature engineering (g) may be used to address any issues uncovered. By
this stage it should be evident which models fit the problem well, and a further, more
thorough round of model tuning can be done on a smaller set of models (h). After additional
tuning and refinement of the predictor representation, two candidate models are finalised.
These models can be tested on an external test set as a final "bake off" between the models
(i). The final model is then picked (j), and this fitted model can be used to predict new
samples or draw inferences.
This schematic's point is to demonstrate that there are many more activities in the process
than merely fitting a single mathematical model. For most problems, it is common to have
feedback loops that evaluate and re-evaluate how well any model/feature combination is
performing.
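The evaluate-and-compare loop between markers (d) and (e) can be sketched with a simple resampling comparison; the toy data, the two candidate models (a mean-only baseline and an ordinary least-squares fit), and the fold count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

def fit_mean(Xtr, ytr):
    """Baseline model: always predict the training mean."""
    m = ytr.mean()
    return lambda Xte: np.full(len(Xte), m)

def fit_linear(Xtr, ytr):
    """Least-squares linear model with an intercept."""
    A = np.column_stack([Xtr, np.ones(len(Xtr))])
    coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return lambda Xte: np.column_stack([Xte, np.ones(len(Xte))]) @ coef

def cv_rmse(fit, X, y, k=5):
    """k-fold cross-validated root-mean-squared error (resampling step)."""
    folds = np.array_split(np.arange(len(X)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.hstack([folds[j] for j in range(k) if j != i])
        pred = fit(X[train], y[train])(X[test])
        errs.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    return float(np.mean(errs))

scores = {"mean baseline": cv_rmse(fit_mean, X, y),
          "linear model": cv_rmse(fit_linear, X, y)}
best = min(scores, key=scores.get)
```

In a real project, each candidate would also be tuned over its hyperparameters before the resampled scores are compared.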
3.7.2 Feature Selection
Usually, new features are derived sequentially to improve the performance of the model.
These sets are generated, added to the model, and then re-sampling is used to determine
their usefulness. The new predictors should not be pre-filtered for statistical significance
before being added to the model. Feature selection should be a supervised process, and
precautions must be taken to ensure that there is no over-fitting.
However, it may be that only some of the predictors carry enough information to predict the
result accurately. The predictor set may well include non-informative variables, and these
can affect the results to some degree. A supervised feature selection strategy can be used to
narrow the predictor set down to a smaller subset containing only informative predictors.
There is also a possibility that a small number of genuinely important predictors go
undiscovered because of all the non-informative variables in the set.
In other instances, all raw predictors are known and accessible at the start of the modelling
process. In this case, a more direct approach may be taken by simply using a feature
selection routine to try to identify the best and worst predictors.
There are a variety of common methods for the supervised selection of features. Search
methods differ in how the subsets are derived:
i. Wrapper approaches use an external search procedure to pick various subsets of the entire
predictor set to be evaluated in a model.
This approach separates the feature search process from the model fitting process.
Examples of this method are backward or stepwise selection as well as genetic algorithms.
ii. Embedded methods are models where the feature selection process happens
naturally during the model fitting process. An example is a basic decision tree,
where variables are selected when the model uses them in a split. If a predictor is never used
in a split, the prediction equation is functionally independent of that variable, and it has
effectively been selected out.
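A wrapper approach such as forward stepwise selection can be sketched as follows; the toy data, the scoring function (in-sample RMSE from a least-squares fit), and the stopping threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
# Only columns 0 and 2 actually drive the outcome.
y = 2.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=80)

def rmse_with(cols):
    """Fit least squares on the given columns and return in-sample RMSE."""
    A = np.column_stack([X[:, list(cols)], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

selected, remaining = [], set(range(X.shape[1]))
score = rmse_with([])  # intercept-only baseline

while remaining:
    # External search step: try each remaining predictor in the model.
    best_col = min(remaining, key=lambda c: rmse_with(selected + [c]))
    new_score = rmse_with(selected + [best_col])
    if score - new_score < 0.05:  # stop when the improvement is negligible
        break
    selected.append(best_col)
    remaining.remove(best_col)
    score = new_score
```

In practice, the score at each step should come from resampling rather than the in-sample fit used here, precisely to guard against the over-fitting risk the text describes.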
As with model fitting, the main risk during feature selection is overfitting. This is
particularly true where wrapper approaches are used and/or where the number of data points
in the training set is small compared to the number of predictors.
Finally, unsupervised selection methods can also have a very positive impact on model
performance. Predictors with degenerate distributions, such as those that take a varying
value only with very low frequency, can have a negative impact on certain models (such as
linear regression), and it may be desirable to exclude them before building a model.
When searching for a subset of variables, it is important to remember that there may not be a
single set of predictors that delivers the best results. There is also a compensatory effect:
when one seemingly significant variable is excluded, the model adjusts using the
remaining variables. This is particularly true when there is some degree of correlation
between the explanatory variables or when a low-bias model is used. For this reason,
feature selection should not be used as a formal method of evaluating the importance of a
variable. More conventional inferential statistical methods are a safer approach for
evaluating a predictor's contribution to the underlying model or data set.
3.8 CLUSTERING
Cluster analysis is a method used to segment the market. The goal is to identify categories
of consumers in the marketplace that are homogeneous, i.e., share similar features, so that
they can be grouped into one category. Each cluster/group should be big enough to allow
the company to grow profitably, since the overall aim of the company is to satisfy the
consumer and make money.
The group of customers that the company hopes to serve should be large enough for the
company to build a commercially feasible plan. This holds for the customer as well, since
the customer cannot spend beyond a certain amount on a given good.
Let's use an example to explain this. Suppose a rental store owner wants to learn the tastes
of the customers in order to grow the business. Can the owner look at each customer's
details and formulate a specific business plan for each of them? Certainly not. What can be
done instead is to group all of the customers into, say, 10 categories depending on their
shopping patterns and have a different plan for the customers in each of these 10 groups.
That is what we call clustering.
Cluster sampling is useful in situations such as the following:
i. A sampling frame is not available, and the identification and interviewing of sampling
units is expensive, time-consuming, and labour-intensive, e.g., compiling a list of
households in a metro region or a list of state farm owners;
ii. The sampling units are located far apart, so surveying them would consume a lot of time
and resources;
iii. It may not be easy to locate well-defined and readily identifiable elementary units.
Cluster sampling yields adequate results in the sampling of elementary units in the face of
the above issues. In cluster sampling, elementary units are formed into groups based on
location, class, or region.
3.8.2 K-means
Before we begin with k-means clustering, let’s first understand the concept of centroid
models where such type of clustering is applied.
Centroid models are iterative clustering algorithms in which the notion of similarity is
derived from the proximity of a data point to the cluster centroid. The K-Means clustering
algorithm is a common algorithm in this group. In these models, the number of
clusters required at the end must be specified beforehand, which makes it necessary to have
prior knowledge of the dataset. These models run iteratively to find a local optimum.
The first step of this algorithm is to place, among our unlabelled observations, new
randomly located points called 'centroids.' The number of centroids equals the number of
output groups, which must be specified in advance. Then an iterative process begins, made
up of two steps:
i. First, for each centroid, the algorithm finds the nearest points (in terms of distance,
normally measured as the Euclidean distance) and assigns them to its category;
ii. Second, for each class (represented by one centroid), the algorithm calculates the average
of all the points assigned to that class. The result of this computation becomes the new
centroid for that class.
Each time the procedure is repeated, some observations initially assigned to one
centroid may be redirected to another. In addition, after several repetitions, the movement
of the centroids becomes less and less significant as the initial random centroids converge
towards the true ones. The process stops when there is no further change in the location of
the centroids.
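The two-step loop above can be sketched as follows; the sample points and the initial centroids (taken from the data rather than placed randomly, for reproducibility) are illustrative assumptions:

```python
import numpy as np

def k_means(points, centroids, max_iter=100):
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Step i: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step ii: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(len(centroids))])
        if np.allclose(new_centroids, centroids):  # no movement: converged
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = k_means(pts, centroids=[(0, 0), (10, 10)])
```

Note that this sketch assumes no cluster ever becomes empty; production implementations handle that case and typically restart from several random initialisations.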
Figure 3.14: Centroid models
Several approaches may be used to choose the number of clusters. In this unit, we will use
the so-called 'Elbow Method.' The idea is that what we would like to see inside our clusters
is a low degree of heterogeneity, as measured by the within-cluster sum of squares (WCSS):
WCSS = Σ_k Σ_{x in C_k} ||x − μ_k||², where μ_k is the centroid of cluster C_k.
It is intuitive that the greater the number of centroids, the lower the WCSS. In fact, if we
had as many centroids as observations, the WCSS would be zero. However, by the rule of
parsimony, we know that setting the largest possible number of centroids would defeat the
purpose. The idea is therefore to choose the number of centroids beyond which the
reduction in WCSS becomes negligible. The relationship is typically visualised by plotting
WCSS against the number of clusters; the 'elbow' of the curve marks the point to choose.
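The Elbow Method can be sketched as follows; the data set (three well-separated groups), the fixed seed centroids, and the range of k values are illustrative assumptions:

```python
import numpy as np

# Three well-separated groups of four points each.
pts = np.array([(x, y) for cx, cy in [(0, 0), (10, 10), (20, 0)]
                for x, y in [(cx, cy), (cx, cy + 1), (cx + 1, cy), (cx + 1, cy + 1)]],
               dtype=float)
# Deterministic initial centroids taken from the data.
seeds = np.array([(0, 0), (10, 10), (20, 0), (0, 1)], dtype=float)

def wcss_for(k, max_iter=100):
    """Run K-means with k centroids and return the within-cluster sum of squares."""
    centroids = seeds[:k].copy()
    for _ in range(max_iter):
        d = np.linalg.norm(pts[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([pts[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return float(((pts - centroids[labels]) ** 2).sum())

wcss = {k: wcss_for(k) for k in range(1, 5)}
# The big drops in WCSS happen up to k = 3 (the true number of groups);
# beyond that the curve flattens, so the elbow suggests k = 3.
```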
Example: In the figure below, if you draw a horizontal line at y = 2, what will be the
number of clusters formed?
Solution: Since the number of vertical lines intersecting the red horizontal line at y = 2 in the
dendrogram is 2, two clusters will be formed.
Hierarchical clustering works by repeating two basic steps:
i. Identify the two clusters that are nearest to each other, and
ii. Merge the two most similar clusters. These steps are continued until all the
clusters are merged together.
(i) Agglomerative
Initially, consider each data point as an independent cluster and combine the closest cluster
pairs at each step (a bottom-up method). At first, each data point is treated as an
independent entity or cluster. At each iteration, clusters merge with other clusters until
one cluster remains. The agglomerative hierarchical clustering algorithm is as follows:
i. Calculate the similarity of each cluster to all other clusters (compute the proximity matrix)
ii. Consider every data point as an individual cluster
iii. Merge the clusters that are closest or most similar to each other
iv. Recalculate the proximity matrix for the new clusters
v. Repeat Step 3 and Step 4 until just one cluster remains
(ii) Divisive
In hierarchical clustering (HC), the number of clusters K can be set exactly as in K-means,
with n the number of data points such that n > K. Agglomerative HC starts from n clusters
and merges them until K clusters are obtained. Divisive HC starts from a single cluster
and splits it according to similarities until the K clusters are formed. The similarity here is
the distance between points, which is a central aspect of the separation and can be
calculated using various approaches:
Min (single linkage): Given two clusters C1 and C2 such that point a belongs to C1 and
point b to C2, the similarity between the clusters is based on the minimum distance over all
such pairs a, b.
Max (complete linkage): The similarity between the clusters is based on the maximum
distance over all pairs a, b.
Average (average linkage): All pairs of points are taken and their distances are computed;
the average of these distances is the similarity between C1 and C2.
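The agglomerative procedure with a pluggable linkage (min/max/average as described above) can be sketched as follows; the sample points and the target number of clusters are illustrative assumptions:

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def linkage(c1, c2, how):
    """Distance between two clusters under the chosen linkage."""
    pair_dists = [dist(a, b) for a in c1 for b in c2]
    if how == "min":
        return min(pair_dists)                    # single linkage
    if how == "max":
        return max(pair_dists)                    # complete linkage
    return sum(pair_dists) / len(pair_dists)      # average linkage

def agglomerative(points, k, how="min"):
    clusters = [[p] for p in points]  # each point starts as its own cluster
    while len(clusters) > k:
        # Find the two closest clusters under the chosen linkage and merge them.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], how))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = agglomerative(pts, k=2, how="min")
```

Recomputing all pairwise linkages each round keeps the sketch simple; real implementations update the proximity matrix incrementally instead.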
Q1 What do you understand by the term clustering? What are the various parameters that are
considered before forming clusters? List the significance of clustering.
Q2 Consider the following example. The marketing department has asked you to supply
consumer segments for the forthcoming marketing plan. What features would you feed into
your model, and what transformations would you apply to arrive at these segments?
Explain using the Elbow Method.
Q3 Cluster the following 8 points (with (x, y) representing locations) into three clusters:
A(2, 10), B(2, 5), C(8, 4), D(5, 8), E(7, 5), F(6, 4), G(1, 2) and H(4, 9)
The initial cluster centres are A(2, 10), D(5, 8) and G(1, 2).
The distance function between a = (x1, y1) and b = (x2, y2) is defined as-
Use the K-Means Algorithm to identify the three cluster centres after the second iteration.
Q4 What do you understand by Hierarchical Clustering? Name the two types of Hierarchical
Clustering.
3.9 SUMMARY
Data management involves arranging, appending, combining, and collapsing data to make
data easy to navigate and use. The data is sorted by placing the observations in a particular
order. Editing of data is a method for analysing the raw material obtained. Coding is the
method of assigning numbers or other symbols to answers to questions. All data must be
classified into specific groups or classes by the researchers.
Machine learning involves predicting and classifying data, and to do so we use different
machine learning algorithms. Support Vector Machine (SVM) is one of the most common
algorithms and continues to be popular. An SVM training algorithm creates a model that
assigns new examples to one group or another, making it a non-probabilistic linear classifier. The
data on attributes are commonly used in the social sciences like Psychology, Sociology,
Political Science, Communication Studies, and Gender Studies.
SVM is a supervised learning algorithm used for both classification and regression
problems. Ensemble learning is a machine learning technique that integrates many simple
models to create one optimised predictive model. The theory of probability originated in the
gaming houses of France in the seventeenth century. Only in the 20th century were the
fundamental ideas and definitions of probability theory placed on a purely mathematical
footing by mathematicians such as Kolmogorov and Markov, among others. Logistic
regression is used to obtain an odds ratio in the presence of more than one explanatory
variable. The method is somewhat similar to multiple linear regression, except that the
response variable is binomial. The result is the effect of each variable on the odds ratio
of the event of interest. Its biggest benefit is avoiding confounded results
when evaluating the influence of multiple factors. The goal of cluster analysis is to determine
which category of consumers is in the marketplace. The cluster/group thus defined should be
big enough to allow the company to grow profitably.
4. Suppose you use a linear SVM classifier for a 2-class classification problem. You have
obtained the following plot, in which some points are circled in red, representing support
vectors. Answer the following questions:
a. Will the decision boundary change if you remove any one of the red-circled points
from the data?
b. Will the decision boundary change if you remove the non-red-circled points from the
data?
1. Explain why the algorithm of bagging works best for the models which have high variance
and low bias.
2. Discuss the most appropriate strategy for data cleaning before performing clustering
analysis, given a less than desirable number of data points.
5. An employee has 2 kids and one of them is a girl. What is the probability that the other
child is also a girl? You can assume that there are an equal number of males and females in
the world.
a. Discovery
b. Model Planning
c. Communication Building
d. Operationalize
a. Selection of Kernel
b. Kernel Parameters
4. Choose the correct statement about Random Forest and Gradient Boosting ensemble
methods.
b. Random Forest is used for classification, while Gradient Boosting is used for
regression tasks.
c. Random Forest is used for regression, while Gradient Boosting is used for
classification tasks.
a. AdaBoost
b. Extra Trees
c. Random Forest
d. Decision Trees
b. Clustering
c. KNN
d. Regression
a. Yes
b. No
c. Insufficient data
d. Can’t say
8. Choose the method that best fits the data in Logistic Regression.
b. Maximum Likelihood
c. Jaccard distance
d. Both a. and b.
9. What is the other name for the process decision programme chart?
a. Affinity diagram
b. Relationship diagram
c. Decision tree
d. Matrix diagram
a. SVG
b. SVM
c. Random forest
Reference Books
Mize (2017). Data Analytics: The Ultimate Beginner's Guide to Data Analytics
Foster Provost; Tom Fawcett. Data Science for Business: What You Need to Know about
Data Mining and Data-Analytic Thinking
Websites
www.analyticsvidhya.com
www.towardsdatascience.com
www.wikipedia.org