
MBA – DATA ANALYTICS

SEMESTER-I

DATA SCIENCE AND BUSINESS


ANALYTICS
MBA-DA-104
All rights reserved. No Part of this book may be reproduced or transmitted, in any form or by
any means, without permission in writing from Mizoram University. Any person who does
any unauthorized act in relation to this book may be liable to criminal prosecution and civil
claims for damages. This book is meant for educational and learning purposes. The authors
of the book has/have taken all reasonable care to ensure that the contents of the book do not
violate any existing copyright or other intellectual property rights of any person in any
manner whatsoever. In the event the Authors has/ have been unable to track any source and if
any copyright has been inadvertently infringed, please notify the publisher in writing for
corrective action.

© TeamLease Edtech Pvt. Ltd.

All rights reserved. No Part of this book may be reproduced in any form without permission
in writing from TeamLease Edtech Pvt. Ltd.
CONTENT

UNIT - 3: Classification
UNIT - 3: CLASSIFICATION

STRUCTURE

3.1 Learning Objectives

3.2 Introduction

3.3 Data Management and Analysis

3.4 SVM

3.4.1 The Theory

3.4.2 Kernels

3.5 Ensemble methods

3.5.1 Random forests

3.6 Introduction to probability

3.6.1 Classical Definition of Probability

3.6.2 Conditional Probability

3.6.3 Bayes Theorem

3.6.4 Logistic regression

3.7 Feature engineering and selection

3.7.1 Important Concepts

3.7.2 Feature Selection

3.8 Clustering

3.8.1 Significance of cluster sampling

3.8.2 K-means

3.8.3 Hierarchical clustering

3.8.4 Comparison between k-means and Hierarchical Clustering

3.9 Summary

3.10 Self-Assessment Questions

3.11 Suggested Readings


3.1 LEARNING OBJECTIVES

After studying this unit, you will be able to:

● Learn Data Management and Analysis

● Define the concept of SVM and Kernels

● Define and categorize various ensemble methods

● Apply the random forest method

● Explain the phenomenon of Probability

● Calculate Conditional Probability for the given problem

● Learn and implement Bayes Theorem

● Derive Logistic regression function

● Conceptualize Feature engineering and selection

● Define the concept of Clustering

3.2 INTRODUCTION

This unit shall focus on the classification of data. The unit shall provide an insight into
various ensemble methods. This unit shall also introduce you to the concept of probability,
including the Bayes theorem and the method of logistic regression. This unit will help us
learn about the concept of clustering as well.

The raw data obtained from surveys or macro databases should be processed and handled so
that it is ideal for review. As a result, the researchers must format data files for the purposes
of the research report. Data management requires various functions undertaken by
researchers, such as arranging, appending, combining, and collapsing data, to make data easy
to navigate and use.

The data is sorted by placing the observations in a particular order. For instance, data may be
sorted by month or year.

If researchers use more than one data file for the analysis, the files are merged into a single
file. This is achieved by appending the observations of the second file to the first file. If the
researcher discovers that two files contain the same observations, but under separate
variables, the files are combined for analysis.
3.3 DATA MANAGEMENT AND ANALYSIS

The data management process includes the editing, coding, classification, and tabulation of
the data gathered. Let us read about each of these processes.

i. Editing: Editing is the process of examining the raw data collected in surveys to find any
mistakes, omissions, or inconsistencies. Data editing is often referred to as data cleaning. For
example, if the study had school girls as subjects, we would expect ages to lie within a certain
range. If a record by mistake indicates an age that is not consistent with that range, it must be
verified.

Cleaning also helps detect inconsistencies where an answer is improbable or impossible. For
example, illiterate respondents cannot claim that they read the newspaper. Missing values, or
responses that are not provided, are also often identified during data cleaning. Editing or
cleaning the data ensures that it is as reliable as possible, consistent with all the data
collected, and entered uniformly to support the coding process. Such data should be corrected
wherever feasible and prepared for coding. For example, if a respondent in a study of
teenagers reports an age that does not fall within the expected range, the age must be checked
and the corrected value entered.

ii. Coding: Coding of data is the process of conceptualising responses and classifying them
into concrete, specific categories. Coding is performed to enable the review and evaluation of
the results. It is the method of assigning numbers or other symbols to answers so that
responses can be classified into particular groups or categories. All data must fit into a
certain group. For instance, if respondents are categorised by gender, all males could be
coded as 1, and all females could be coded as 2.

Thus, groups 1 and 2 are mutually exclusive, which means that each observation is coded as
either 1 or 2. Coding can be performed manually by transferring the data to a coding sheet.
However, a large volume of data may be coded in spreadsheets using a computing facility.
The coding process is carried out by a single coder or researcher in smaller studies and by a
group of coders in comparatively large samples.

iii. Classification: Classification is the next step in data management. The coded data is
grouped into distinct classes or groups based on a shared trait; data may be classified
according to attributes or class intervals.

iv. Tabulation: Once the data is grouped into distinct classes, it is organised in sequential
order in tables. This systematic arrangement of data in tables is called tabulation. Presenting
the data in compact tables helps readers interpret the findings of the analysis. Tabulation
eliminates the need for long descriptions and explanations, and it facilitates easy comparison
of results and further statistical analysis. Tabulation can be performed manually or with the
aid of a computer. Manual tabulation is feasible for smaller studies, but larger studies, which
involve comparatively large amounts of quantitative data, are practical only where computers
are available for tabulation.

The grouping of data by attributes is based on some common characteristic, either qualitative
or quantitative. Data on attributes are commonly used in social sciences such as Psychology,
Sociology, Political Science, Communication Studies, and Gender Studies. The division of
data by attributes may be simple, where only one attribute is considered and the data is
separated into two classes; for example, the respondents in a sample are listed as either men
or women. There is also a more complex grouping of data by attributes, in which data are
grouped into various categories and classes depending on the presence of two or more
attributes.

CHECK YOUR PROGRESS-1

Q1 What do you understand by editing, tabulation, and classification of data?

3.4 SUPPORT VECTOR MACHINES

Machine learning involves predicting and classifying data, and to do so we use different
machine learning algorithms suited to the dataset.

Support Vector Machines, or SVMs for short, are among the most common algorithms. They
were highly popular when they were developed and refined in the 1990s and remain popular
today; they are one of the best options for a high-performing algorithm that needs relatively
little tuning, and they offer one of the most robust prediction methods.

SVM is applied differently as opposed to other ML algorithms. The SVM training algorithm
creates a model that assigns new examples to one group or another, making it a non-
probabilistic linear classifier.

SVM is a Supervised Learning algorithm used for both classification and regression
problems. However, it is mostly used for classification problems in machine learning. For
classification, SVMs can effectively perform non-linear classification using what is called the
kernel trick, which implicitly maps the inputs into high-dimensional feature spaces.
An SVM-based approach also exists for unsupervised learning. When data is not labelled,
supervised learning is not feasible, and an unsupervised method is needed to find natural
groupings of the data and then map new data to these groups.

The support vector clustering algorithm applies the statistics of support vectors, developed in
the support vector machine algorithm, to categorise unlabelled data. It is one of the most
commonly used clustering algorithms in industrial applications.

3.4.1 The Theory


As a first approximation, what SVMs do is find a separating line (or hyperplane) between
data of two classes. SVM is an algorithm that takes data as input and, where possible, outputs
a line that separates the two groups.

Let's get started with a problem. Suppose there is a dataset as seen below, and we need to
separate the red rectangles from the blue ellipses (say, the positive examples from the
negatives). The job is therefore to find the optimal line that divides this dataset into the two
groups (say red and blue).

Figure 3.1: Example Dataset

As can be seen, there is no single obvious line that does the job; in fact, there are infinitely
many lines that can separate these two groups. So how does SVM choose the ideal one?
Figure 3.2: Example Dataset

There are two options, a green-colored line and a yellow-colored line. Which line would
better separate the data?

If you chose the yellow line, you chose well, because that is the line we are looking for. In
this case it is visually intuitive that the yellow line separates the classes better, but something
more concrete than intuition is needed to justify this.

The green line in the picture above passes very close to the red class. While it classifies the
current dataset correctly, it is not a generalised line, and in machine learning our goal is to
find a more generalised separator.

According to the SVM algorithm, we find the points nearest to the line from both classes.
These points are called support vectors. We then measure the distance between the line and
the support vectors; this gap is referred to as the margin. The goal is to maximise the margin,
and the hyperplane with the largest margin is the optimal hyperplane.
Figure 3.3: Example Dataset

SVM thus chooses the decision boundary in such a way that the separation between the two
groups (the "street") is as wide as possible.

Let's consider a slightly more complex dataset, which is not linearly separable.

Figure 3.4: Example Dataset

These data are not linearly separable: no straight line can separate them. However, the data
can be transformed into linearly separable, higher-dimensional data. Let's add another
dimension and call it the z-axis, with the z-coordinates governed by the constraint
z = x^2 + y^2.

So, essentially, the z-coordinate is the squared distance of the point from the origin. Let's plot
the data along the z-axis.

Figure 3.5: Example Dataset

The data is clearly linearly separable now. Let the purple line dividing the higher-dimensional
data be z = k, where k is a constant. Since z = x^2 + y^2, we get x^2 + y^2 = k, which is the
equation of a circle. The transformation can thus be used to project this linear separator in the
higher dimension back to the original dimensions.

Figure 3.6: Example Dataset

Thus, data can be classified by adding an extra dimension so that it becomes linearly
separable, and the decision boundary can then be projected back to the original dimensions
using the mathematical transformation. But finding the right transformation for every dataset
is not easy.
3.4.2 Kernels
The SVM algorithm uses a set of mathematical functions known as kernels. Often it is not
possible to find a hyperplane or a linear decision boundary for a classification problem in the
original space. If the data is projected from the original space into a higher dimension, a
hyperplane that separates the data can be obtained in the projected space.

For instance, it may be difficult to find a line that divides the two classes in the input space,
but if the same data points are projected into a higher dimension, the two classes can be
separated by a hyperplane. Refer to the example below: initially it is hard to distinguish the
two classes, but once projected onto a higher dimension, the two classes can quickly be
separated by a hyperplane for classification.

Figure 3.7: Example classification

Thus, a kernel helps to locate a hyperplane in a higher-dimensional space without increasing
the cost of computation. Typically, the computational cost would increase with the number of
dimensions.

The kernel trick applies a function that converts the data into a suitable form. Different types
of kernel functions are used in the SVM algorithm, e.g., linear, polynomial, radial basis
function (RBF), and other non-linear kernels. Using the kernel trick, a low-dimensional input
space is transformed into a higher-dimensional space.
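The idea can be sketched with a few lines of code. The following is a minimal illustration assuming scikit-learn is available; the toy data, variable names, and parameter values are assumptions chosen only for demonstration.

# A minimal sketch of kernel SVM classification, assuming scikit-learn is available.
import numpy as np
from sklearn.svm import SVC

# Toy data: points inside a circle (class 1) vs. outside it (class 0),
# which is not linearly separable in the original 2-D space.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# An RBF kernel implicitly maps the inputs into a higher-dimensional space,
# where a separating hyperplane can be found.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("Training accuracy:", clf.score(X, y))
print("Support vectors per class:", clf.n_support_)

Changing kernel="rbf" to kernel="linear" in this sketch would force a straight-line boundary and noticeably lower the accuracy on this circular data, which mirrors the discussion above.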

CHECK YOUR PROGRESS-2

Q1 What do you understand by the Support Vector Mechanism?

Q2 Explain the functioning of kernels.


3.5 ENSEMBLE METHODS

Ensemble methods are machine learning techniques that integrate many simple models to
create one optimised predictive model. Ensemble learning boosts machine learning results by
combining several models, allowing better predictive performance than any single model.
This is why ensemble methods often rank first in prestigious machine learning competitions.

Ensemble approaches are meta-algorithms that combine several machine learning strategies
into one predictive model in order to reduce variance (bagging), reduce bias (boosting), or
improve predictions (stacking). Ensemble methods can be classified into two groups:

● Sequential ensemble methods (where the base learners are generated sequentially): The
fundamental motive of sequential approaches is to exploit the dependence between base
learners. Overall performance can be improved by giving previously mislabelled examples a
higher weight.

● Parallel ensemble methods: where the base learners are created in parallel (e.g.,
Random Forest).

The basic motive of parallel approaches is to exploit the independence of the base learners, as
the error can be significantly reduced by averaging. The sketch below contrasts the two
families.
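As a rough illustration, the following trains one parallel ensemble (bagging) and one sequential ensemble (boosting) on the same synthetic data, assuming scikit-learn is available; the dataset and parameter choices are illustrative assumptions only.

# A minimal sketch contrasting a parallel ensemble (bagging) with a
# sequential ensemble (boosting), assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Parallel: base learners (decision trees by default) are fit independently
# on bootstrap samples, which mainly reduces variance.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Sequential: each learner focuses on the examples the previous ones got
# wrong, which mainly reduces bias.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean accuracy:", round(scores.mean(), 3))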

3.5.1 Random Forests


Random Forest models can be thought of as bagging with a small tweak. When deciding
where to split and how to make decisions, bagged decision trees have the full set of features
to choose from. Therefore, although the bootstrapped samples may be slightly different, the
data will largely be split on the same features in each model. In contrast, random forest
models decide where to split based on a random subset of features.

Rather than splitting on similar features at every node, random forest models introduce a
level of differentiation because each tree splits on different features. This differentiation
provides a more diverse set of models to aggregate over, thereby producing a more reliable
predictor. Please refer to the picture for a clearer interpretation.
Figure 3.8: Classification of Random Forests

As in bagging, bootstrapped subsamples are drawn from the larger dataset, and a decision tree
is built for each subsample. However, each tree is split on a different random subset of the
features.

In extremely randomised trees, randomness goes one step further: the splitting thresholds
themselves are randomised. Instead of searching for the most discriminative threshold,
thresholds are drawn at random for each candidate feature, and the best of these randomly
generated thresholds is selected as the splitting rule. This usually reduces the model's
variance a little more, at the cost of a slightly greater increase in bias.
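A minimal sketch comparing a random forest with an extremely randomised (extra-trees) ensemble is shown below, assuming scikit-learn is available; the synthetic data and settings are assumptions for illustration only.

# A minimal sketch of random forests vs. extremely randomised trees,
# assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Random forest: bootstrapped samples plus a random subset of features at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# Extra trees: in addition, split thresholds are drawn at random for each candidate feature.
et = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=0)

for name, model in [("random forest", rf), ("extra trees", et)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean accuracy:", round(scores.mean(), 3))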

Figure 3.9: Trees in random forests

CHECK YOUR PROGRESS-3

Q1 What do you understand by the term ensembling? Give an overview of the various
ensemble methods used in data science.

Q2 Explain the concept of the random forests technique of ensembling.


3.6 INTRODUCTION TO PROBABILITY

The theory of probability originated in the gaming houses of France in the seventeenth
century. Gambling problems were brought to the notice of mathematicians such as Pascal and
Fermat by upper-class gamblers. They, and other mathematicians in Europe, observed that
numerous gambling and betting problems could be solved by means of permutations and
combinations. They also found that certain other problems of chance required different
mathematical methods, such as calculus, and they foresaw immense possibilities for applying
the laws of chance they had derived to various fields of science and human life.

However, it was only in the 20th century that the fundamental ideas and definitions of
probability theory were placed on a rigorous mathematical foundation by mathematicians
such as Kolmogorov and Markov, among others.

3.6.1 Classical Definition of Probability


The probability of an occurrence in a finite sample space is described in this subsection.

Description 1: The probability of event A, P(A), is determined by,

P(A)= number of outcomes favorable to A / number of all possible outcomes

If we extend this definition to case A: "getting an odd number when a die is rolled" then we
get P(A)=3/6= 1/2.

This description of probability is often referred to as the classical definition. Apart from
this, there is another way to define probability, called the empirical or relative-frequency
approach.

Suppose a die is rolled 600 times, and suppose the number '2' appears 98 times. The relative
frequency of occurrence of 2 is then 98/600, and we can consider this a reasonable
approximation of its probability. Thus, the probability of an event A can also be defined as:

P(A)= number of observed favorable outcomes / total number of observed outcomes

The only difference between the classical and the empirical definition of P(A) is the word
"observed". This word means that we performed the experiment a number of times before
arriving at the value of P(A).
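The relative-frequency idea can be illustrated with a short simulation. The sketch below estimates the probability of rolling an odd number from 600 simulated rolls (mirroring the sample size in the example above) and compares it with the classical value of 1/2; the random seed is an arbitrary choice.

# A minimal sketch of the empirical (relative-frequency) definition of probability.
import random

random.seed(1)
trials = 600

# Count how often an odd number appears when a fair die is rolled 600 times.
odd_count = sum(1 for _ in range(trials) if random.randint(1, 6) % 2 == 1)

empirical_p = odd_count / trials   # observed favourable outcomes / observed total
classical_p = 3 / 6                # favourable outcomes / all possible outcomes

print("Empirical estimate of P(odd):", empirical_p)
print("Classical value of P(odd):", classical_p)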

CHECK YOUR PROGRESS- 4

Q1 Define the concept of probability. How is the classical definition of probability different
from the empirical definition? Support your answer with suitable examples.

Q2 Find the probability of getting i) a queen, ii) a jack of hearts, and iii) a red card, from a
pack of well-shuffled cards.

3.6.2 Conditional Probability


You are now familiar with a variety of rules governing the probabilities of events in a sample
space. But there are also cases in which these rules alone are not enough. For example,
suppose two cards are drawn, one by one, from a pack of cards. What is the probability that
the first card is red and the second card is black? The concept of conditional probability is
introduced to deal with such situations.

Suppose there are two balls, 1 and 2. Suppose ball 1 is selected with probability 0.7 and ball
2 is selected with probability 0.8. Let S1 denote the event that ball 1 is picked up and S2 the
event that ball 2 is picked up. It is clear that knowing S1 has occurred does not change the
likelihood of S2 occurring. Similarly, whether or not S2 happened does not affect the
probability of S1. Thus, P(S2) = 0.8 whether or not S1 occurred, and P(S1) = 0.7 whether or
not S2 occurred. Such events are called independent events.

Now consider another example. A box contains five white and three red cubes. A second,
identical box contains three white and five red cubes. A box is chosen, and a cube is drawn
from it.

Let B: the first box is selected, and B': the second box is selected.

W: a white cube is drawn.

A: a red cube is drawn.

In this case, if event B has happened, that is, if we chose the first box, the chance of drawing
a white cube is 5/8. But if event B' has happened, that is, if we chose the second box, the
chance of drawing a white cube is 3/8. This implies that the likelihood of W occurring
depends on our choice of box, so events B and W are dependent events. We write P(W|B) for
the probability of W given that B has occurred.

Here P(W|B) = 5/8, and as seen above, P(W|B') = 3/8.

Definition 2: Two events A and B are said to be independent if the probability of occurrence
of event B is in no way influenced by the occurrence or non-occurrence of event A.
The probability of event B, given that event A has already happened, is called the
conditional probability of B given A and is denoted by P(B|A). A and B are independent if
and only if P(B|A) = P(B).

Example: Let A and B be events on the same sample space, with P(A) = 0.6 and P(B) =
0.7. Can these two events be disjoint?

Solution: These two events cannot be disjoint because P(A) + P(B) > 1.

By the addition rule, P(A∪B) = P(A) + P(B) − P(A∩B).

A and B are disjoint if P(A∩B) = 0. If A and B were disjoint, then P(A∪B) = 0.6 + 0.7 = 1.3.

Since a probability cannot be greater than 1, the two events cannot be disjoint.

3.6.3 Bayes Theorem


Bayes' theorem formalises the observation that information about the occurrence of one event
can significantly alter the probability of occurrence of another.

Bayes Theorem

Theorem 1 (Bayes' Theorem): Suppose that event A can occur only in conjunction with one
of the events H1, H2, ..., Hk, where H1, ..., Hk are mutually exclusive. The occurrence of A
therefore implies that one of the events Hi, i = 1, 2, ..., k, has happened. Let P(Hi) denote the
probability of occurrence of Hi, and P(A|Hi) the conditional probability of A given that Hi
occurred, i = 1, 2, ..., k. The conditional probability that Hi occurred, given that A is known to
have occurred, is given by:

P(Hi|A) = P(Hi) P(A|Hi) / Σj P(Hj) P(A|Hj), where the sum in the denominator runs over j = 1, 2, ..., k.

P(Hi) are referred to as the prior probabilities of occurrence of Hi, i = 1, 2, ..., k, and P(Hi|A)
as the posterior probabilities of occurrence. When there are only two mutually exclusive
events, say M and M', they take the place of H1, ..., Hk, and an event C takes the place of
event A in Bayes' theorem. The utility of the theorem can be seen from the following example.

Example: A box contains seeds of four grades A, B, C, and D, with germination
probabilities of 80, 50, 40, and 20 percent, respectively. The four grades are present in the
box in the proportions 1:2:3:4. A seed is taken at random from the box and planted. If it
germinates, what are the probabilities that it was of grade A, B, C, or D?

Solution: Let HA, HB, HC, and HD denote the events that the seed is of grade A, B, C, and D,
respectively. Let G be the event that it germinates.

P(HA) = 1/10, P(HB) = 1/5, P(HC) = 3/10, P(HD) = 2/5 and


P(G|HA) = 0.8, P(G|HB) = 0.5, P(G|HC) = 0.4, P(G|HD) = 0.2.

P(HA|G) = [P(HA) P(G|HA)] / [P(HA) P(G|HA) + P(HB) P(G|HB) + P(HC) P(G|HC) + P(HD) P(G|HD)]

= [0.1 x 0.8] / [(0.1 x 0.8) + (0.2 x 0.5) + (0.3 x 0.4) + (0.4 x 0.2)]

= 4/19

Similarly,

P(HB|G)= 5/19

P(HC|G)= 6/19

P(HD|G)= 4/19
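The calculation above can be reproduced in a few lines of code. The sketch below is a direct transcription of Bayes' theorem, with the priors (the 1:2:3:4 proportions) and germination rates taken from the example; exact fractions are used so that the output matches 4/19, 5/19, 6/19, and 4/19.

# A minimal sketch of Bayes' theorem applied to the seed-germination example.
from fractions import Fraction

priors = {"A": Fraction(1, 10), "B": Fraction(2, 10),
          "C": Fraction(3, 10), "D": Fraction(4, 10)}       # grade proportions 1:2:3:4
germination = {"A": Fraction(8, 10), "B": Fraction(5, 10),
               "C": Fraction(4, 10), "D": Fraction(2, 10)}  # P(G | grade)

# Total probability of germination: the sum of P(Hi) * P(G|Hi) over all grades.
p_g = sum(priors[g] * germination[g] for g in priors)

# Posterior probability P(Hi | G) for each grade.
for g in priors:
    posterior = priors[g] * germination[g] / p_g
    print(f"P(H{g}|G) = {posterior}")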

CHECK YOUR PROGRESS-5

Q1 If 4 cards are drawn from a pack of cards, what is the probability that there is one card
from each suit?

Q2 It is known that 3 out of every 100 men and 25 out of every 10,000 women are
color-blind. In a community, about half the population is male.

(i) What is the probability that a person chosen at random from the community will be color-
blind?

(ii) What is the probability that a color-blind person chosen at random from among all
colorblind persons in the community will be a male?

3.6.4 Logistic Regression


Consider a case where we need to classify whether or not an email is spam. If linear
regression is used for this problem, a threshold must be set on which the classification is
based. Say the actual class is spam, the predicted continuous value is 0.4, and the threshold is
0.5: the data point will then be labelled as not spam, which can lead to serious consequences
in practice. From this example it can be concluded that linear regression is not suitable for
classification problems. Linear regression output is unbounded, which motivates logistic
regression, whose output lies strictly between 0 and 1.

Logistic Function

Logistic regression is named for the function at the core of the method, the logistic function.
The logistic function, also known as the sigmoid function, was developed by statisticians to
describe the characteristics of population growth in ecology: rising quickly and then levelling
off at the carrying capacity of the environment. It is an S-shaped curve that can take any
real-valued number and map it to a value between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)

Here e is the base of the natural logarithms (Euler's number, the EXP() function in
spreadsheets), and value is the actual numerical value to be transformed. Below is a plot of
numbers transformed into the range 0 to 1 using the logistic function.

Figure 3.10 Logistic regression graph

Representation of Logistic Regression

Logistic regression uses an equation as a representation, much like linear regression.

Input values (x) are linearly combined using weights or coefficient values (referred to as the
Greek capital letter Beta) to estimate the output value (y). The main distinction from linear
regression is that the output variable being modelled is a binary value (0 or 1) rather than a
numerical value.

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Here y is the predicted output, b0 is the bias or intercept term, and b1 is the coefficient for
the single input value (x). Each column in your input data has an associated b coefficient (a
constant real value) that must be learned from your training data.

The actual representation of the model that you would store in memory or in a file is the set
of coefficients in the equation (the beta values or b's).

Calculation of Probabilities using Logistic Regression


Logistic regression models the probability of the default class (e.g., the first class).

For example, suppose we model whether a student is a boy or a girl from their height. The
first class could then be "boy", and the logistic regression model could be written as the
probability of "boy" given the person's height, or more formally:

P(sex=boy|height)

In another way, we model the probability that the input (X) belongs to the default class
(Y=1), and we can write this formally as:

P(X) = P(Y=1|X)

Note that the probability prediction must be converted into a binary value (0 or 1) in order to
actually make a class prediction.

Logistic regression is a linear method, but the predictions are transformed using the logistic
function. The effect is that we can no longer interpret the predictions as a linear combination
of the inputs as we can with linear regression. Continuing from above, the model can be
stated as:

p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))

The equation above can now be written as follows:

ln(p(X) / (1 - p(X))) = b0 + b1 * X

This is useful because it shows that the calculation on the right is linear again (just like linear
regression), while the quantity on the left is the log of the odds of the default class.

The ratio on the left is called the odds of the default class. Odds are calculated as the
probability of the event divided by the probability of the event not occurring; for example, a
probability of 0.8 gives odds of 0.8/(1-0.8) = 4. We can therefore write:

ln(odds) = b0 + b1 * X

Because the odds are log-transformed, the left-hand side is called the log-odds or the probit.
Other types of transformation functions can be used, but it is common to refer to any
transformation that relates a linear regression equation to probabilities as a link function.

The exponential can be moved back to the right-hand side, and we can write:

odds = e^(b0 + b1 * X)

All of this lets one realise that the model is still a linear combination of inputs, but that this
linear combination corresponds to the log-odds of the default class.
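The relationships between the linear combination, the odds, and the probability can be sketched numerically as below; the coefficient values b0 and b1 and the input x are arbitrary assumptions chosen only to show how the quantities relate.

# A minimal sketch of the logistic transformation and the odds / log-odds
# relationships described above. The coefficients are arbitrary illustrative values.
import math

b0, b1 = -4.0, 0.05        # assumed intercept and slope (not fitted to real data)
x = 100.0                  # an example input value

log_odds = b0 + b1 * x                  # linear combination of the inputs
odds = math.exp(log_odds)               # odds = e^(b0 + b1*x)
p = 1.0 / (1.0 + math.exp(-log_odds))   # logistic function maps the log-odds to (0, 1)

print("log-odds:", log_odds)            # 1.0
print("odds:", round(odds, 3))          # about 2.718
print("P(Y=1|x):", round(p, 3))         # about 0.731

# A crisp class prediction requires a threshold on the probability, e.g. 0.5.
print("predicted class:", int(p >= 0.5))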

Logistic Regression vs. Linear Regression


Consider a dataset containing the Height and Weight of a group of individuals. The task is to
predict the weight for new entries in the Height column.

This is a regression problem, so a linear regression model can be constructed. Once the
model has been trained, the weight for a given, previously unseen height value can be
predicted.

Figure 3.11: Logistic Regression vs Linear Regression graph

Now assume there is an additional field, Obesity, and whether or not a person is obese needs
to be classified based on height and weight. This is clearly a classification problem, where
dividing the dataset into two classes (obese and not obese) is the goal.

Following the steps of linear regression, a regression line can be constructed. This time the
line will be based on the two parameters, Height and Weight, and the regression line will fit
between two discrete sets of values. Because this regression line is particularly vulnerable to
outliers, it would not do a good job of classifying the two groups.

To obtain a better classification, the output values from the regression line are fed to the
sigmoid function. The sigmoid function returns the probability for each output value of the
regression line. Based on a predefined threshold, each output can then be assigned to one of
the two classes, obese or not obese.
Figure 3.12: Logistic Regression vs Linear Regression graph

Example: Consider the following model for logistic regression: P(y = 1|x, w) = g(w0 + w1x),
where g(z) is the logistic function. In this equation, P(y = 1|x; w), viewed as a function of x,
can be varied by changing the parameters w. What would be the range of p in such a case?

Solution: For values of x ranging over the real numbers from −∞ to +∞, the logistic function
gives an output between (0, 1).

Similarities

i. Both linear and logistic regressions are supervised machine learning algorithms.

ii. Both linear regression and logistic regression are parametric regression models, i.e., both
use a linear equation for estimation.

Differences

i. Linear regression is used to solve regression problems, while logistic regression is used to
address classification problems.

ii. Linear regression produces continuous output, but logistic regression delivers discrete
output.

iii. Linear regression aims to find the best-fitting line, while logistic regression goes one step
further and maps the values of the line onto the sigmoid curve.

iv. Linear regression uses mean squared error as its loss function, while logistic regression is
estimated using maximum likelihood.
CHECK YOUR PROGRESS-6

Q1 For values of x ranging over the real numbers from −∞ to +∞, which function keeps p
between (0,1)?

Q2 The logistic function, l(x), is the log of the odds function. What is the range of the
logistic function on the domain x = [0,1]?

3.7 FEATURE ENGINEERING AND SELECTION

Statistical models have become particularly relevant as they have become prevalent in
contemporary culture. They help us to make different kinds of predictions in our daily lives.
For example, physicians rely on general laws drawn from formulas that tell them which
individual patient cohorts are at an elevated risk of a particular disease or occurrence. A
numerical estimate of the flight's arrival time will help us understand whether our aircraft is
expected to be delayed. In other instances, models succeed in showing us what is important
or real.

In all these instances, models are built by taking existing data and finding a mathematical
representation with an appropriate degree of fidelity to the data. Important quantities can then
be estimated from such a model. In the case of airline delays, the quantity of interest is the
estimate of the outcome (the arrival time), whereas in other settings the quantity of interest
may be a particular model parameter, such as an estimate of a possible selection bias. In the
latter case, the bias estimate is usually compared with the expected variation (i.e., noise) in
the data.

A decision is then taken based on how rare such a finding would be relative to the noise, a
notion commonly referred to as "statistical significance." This type of approach is usually
considered inferential: a conclusion is drawn for the purpose of explanation.

On the other hand, the prediction of a particular value (such as arrival time) is an estimation
problem where our aim is not simply to explain whether a pattern or fact is valid, but to make
the most accurate determination of that value. Prediction uncertainty is another essential
quantity, in particular for gauging the reliability of the value produced by the model.

Whether a model can be used for inference or prediction (or, in some situations, both) is
determined by several of its essential characteristics. Parsimony (or simplicity) is a central
one. Simple models are usually preferred to complex models, particularly when the goal is
inference. It is easier, for example, to verify the distributional assumptions in models with
fewer parameters. Parsimony also means greater potential for interpreting a model. For
example, an economist might be interested in quantifying the benefit of postgraduate
education on salary. A basic model could represent a linear relationship between years of
education and wages. This parameterisation conveniently facilitates statistical inferences
about the possible benefits of such schooling. But suppose the relationship differs
significantly between professions and/or is not linear. A more complex model would do a
better job of capturing the data patterns but would be much less interpretable.

The dilemma, though, is that accuracy should not be seriously sacrificed for the sake of
simplicity. A simple model may be easy to interpret, but it will not be useful if it does not
maintain a reasonable degree of fidelity to the data; if the model is only 50 percent accurate,
should it be used to draw inferences or make predictions? Complexity is typically the remedy
for low accuracy. By using additional parameters or an inherently non-linear model, we can
increase accuracy, but interpretability is likely to suffer greatly. This trade-off is a crucial
consideration in model design.

However, the variables used in the model and their representation are just as important to
progress. It is difficult to talk about modeling without mentioning models, but one of the
goals is to increase the focus on model predictors.

In terms of nomenclature, the quantity being modelled or predicted is known as the response,
the outcome, or the dependent variable. The variables used to model the outcome are called
the predictors, features, or independent variables (depending on the context).

For example, when modelling the selling price of a house (the outcome), the characteristics
of the property (e.g., square footage, number of bedrooms and bathrooms) may be used as
predictors. Note, though, that artificial terms can also be constructed as composites of one or
more factors, such as the number of bedrooms per bathroom. Such a variable is more properly
referred to as a feature (or a derived feature). In any case, features and predictors are used to
explain the outcome in the model.

The recognition that there are different ways to represent predictors in a model, and that some
of these representations are better than others, leads to the idea of feature engineering: the
process of creating representations of the data that improve the effectiveness of a model.
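As a small illustration, the sketch below constructs the bedrooms-per-bathroom composite mentioned above as a derived feature; the column names and values are hypothetical and assume pandas is available.

# A minimal sketch of feature engineering: deriving a composite predictor.
# The column names and values are hypothetical.
import pandas as pd

houses = pd.DataFrame({
    "square_feet": [1500, 2200, 950],
    "bedrooms": [3, 4, 2],
    "bathrooms": [2, 3, 1],
})

# Derived feature: bedrooms per bathroom, a composite of two raw predictors.
houses["bedrooms_per_bathroom"] = houses["bedrooms"] / houses["bathrooms"]

print(houses)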

Notice that many factors affect model performance. If a predictor has no relationship with the
outcome, its representation is irrelevant. However, it is very important to note that there are
many types of models and that each has its own sensitivities and needs. For instance:

● Some models cannot tolerate predictors that measure the same underlying quantity
(i.e., correlated predictors).

● Many models cannot use samples with missing values.

● Some models are seriously compromised when irrelevant predictors are present in the
data.

Feature engineering and variable selection can help to alleviate many of these problems. The
purpose is to help create better models by concentrating on predictors. "Better" depends on
the context of the problem but most definitely includes the following factors: precision,
simplicity, and robustness. To obtain these characteristics or to make good trade-offs
between them, it is important to consider the relationship between the predictors used in the
model and the form of the model. Accuracy and/or simplicity can often be enhanced by
depicting information in ways that are more appealing to the model or by reducing the
number of features used.

3.7.1 Important Concepts


Some key principles need to be addressed before progressing to concrete techniques and
processes. These principles concern both the analytical dimensions of modelling and the
practical realities of modelling. Several of these topics are discussed here.

i. Overfitting

Overfitting occurs when a model fits the current data very well but fails to predict new
samples. It typically happens when the model relies too heavily on patterns and anomalies in
the current data set that do not occur elsewhere. Because the model only has access to the
current data, it has no way of recognising that such patterns are anomalous. Very flexible
models are especially prone to overfitting; it is not difficult for such models to perform
exceptionally well on the data set used to build them, yet without some protective mechanism
they can easily fail to generalise to new data.

Just as models can overfit the data points, feature selection strategies can overfit the
predictors. This happens when a variable appears important in the existing data set but shows
no real relationship with the outcome once new data are collected. The risk of this type of
overfitting is especially high when the number of data points, denoted n, is small and the
number of potential predictors (p) is very large. As with overfitting to data points, this issue
can be mitigated by using a methodology that signals when it is happening.

ii. Supervised and unsupervised data analysis

Supervised data analysis involves identifying relationships between the predictors and an
identified outcome that is to be modelled or predicted, whereas unsupervised approaches
focus solely on identifying patterns among the predictors. Both forms of analysis usually
involve some degree of exploration. Exploratory Data Analysis (EDA) (Tukey 1977) is used
to understand the main characteristics of the predictors and the outcome so that any relevant
problems with the data can be identified before modelling. This could involve investigating
correlation structures among the variables, patterns of missing data, and/or anomalous
patterns in the data that may contradict the modeller's initial expectations.

Predictive models are strictly supervised, since there is a direct focus on identifying
relationships between the predictors and the outcome. Unsupervised analyses include
approaches such as cluster analysis, principal component analysis, and related techniques for
exploring patterns in the data.

Both supervised and unsupervised analyses are vulnerable to overfitting, but supervised
analyses are more prone to finding spurious patterns in the data for predicting the outcome. In
short, these methods can be misused to build a self-fulfilling predictive prophecy.

iii. No free lunch

The "No Free Lunch" theorem (Wolpert 1996) is the principle that, without specific
knowledge of the problem or data at hand, no predictive model can be assumed to be the best.
Some models are tailored to particular data characteristics, such as missing values or collinear
predictors, and in such cases it may be reasonable to expect them to do better than other
models (all other things being equal). In practice, things are not that simple. A model
designed for collinear predictors may be limited to modelling linear trends in the data and
may be sensitive to missing data. It is very difficult to predict the best model, especially
before the data are in hand.

Experiments have been performed to determine which models tend to perform better than
others on average, notably Dysar (2006) and Fernandez-Delgado et al. (2014). These analyses
show that certain models tend to produce the most accurate results, but the rate of "winning"
is not high enough to justify a policy of "always use Model X."

In practice, it is prudent to try out a variety of different types of models to find out which
works best for your particular data set.

iv. The model and the modelling process

The process of developing an effective model is both iterative and heuristic. It is impossible
to know the needs of a data set before working with it, and it is normal for several approaches
to be evaluated and modified before a model can be finalised. Many books and resources
focus exclusively on the modelling technique, but this activity is only a small part of the
overall process.
Figure 3.13 Analysis Process

The initial activity starts at marker (a), where exploratory data analysis is used to investigate
the data. After the initial exploration, marker (b) shows where early data analysis can take
place. This may involve evaluating simple summary measures or identifying predictors that
have strong correlations with the outcome. The process may iterate between visualisation and
analysis until the modeller is satisfied that the data are well understood. At (c), a first draft of
how the predictors will be represented in the models is created, based on the previous
analysis.

At this stage, several different modelling approaches may be evaluated with the initial set of
features. This is shown in (d), where four clusters of models appear as thin red marks. These
represent four distinct models being evaluated, each assessed several times over a range of
candidate hyperparameter values. Once the four models have been tuned, they are
numerically evaluated on the data to understand their performance characteristics (e).
Summary measures for each model, such as model accuracy, are used to gauge the level of
difficulty of the problem and to determine which models are best suited to the data. Further
EDA can then be carried out on the model results (f), such as residual analysis. As a result,
another round of feature engineering (g) may be used to address any issues discovered. By
this point it should be apparent which models tend to work best for the problem at hand, and
a further, longer round of model tuning can be carried out on fewer models (h). After more
adjustment of the models and of the predictor representation, two candidate models are
finalised.

These models can be tested on an external test set as a final "bake off" between the models
(i). The final model is then picked (j), and this fitted model can be used to predict new
samples or draw inferences.

This schematic's point is to demonstrate that there are much more events in the process than
merely fitting a single mathematical model. For most problems, it is common to have
feedback loops that assess and re-evaluate how well any model/feature combination is doing.
3.7.2 Feature Selection
Usually, new features are derived sequentially to address specific shortcomings of the model.
These sets are generated, applied to the model, and then resampling is used to assess their
usefulness. The new predictors are not filtered for statistical significance before being added
to the model. The process should be monitored, and precautions must be taken to ensure that
there is no overfitting.

In some situations, the available predictors contain enough relevant information to predict the
outcome accurately, but the set may also include non-informative variables, which can affect
the results to some degree. A supervised feature selection strategy can be used to narrow the
predictor set down to a smaller subset containing only informative predictors. In addition, a
small number of important predictors may exist whose usefulness has not been discovered
because of all the non-informative variables in the set.

In other instances, all the raw predictors are known and available at the start of the modelling
process. In this case, a simpler approach may be to use a feature filtering routine to identify
the best and worst predictors.

There are a variety of common methods for supervised feature selection. The search methods
differ in how the candidate subsets of predictors are derived:

i. Wrapper approaches use an external search procedure to choose different subsets of the
full predictor set to be evaluated in a model. This approach separates the feature search
process from the model fitting process. Examples of this method are backward or stepwise
selection as well as genetic algorithms.

ii. Embedded methods are models in which feature selection occurs naturally during the
model fitting process. An example is a simple decision tree, where variables are selected
when the model uses them in a split. If a predictor is never used in a split, the prediction
equation is functionally independent of that variable, and it has effectively been deselected.
A rough sketch of this idea follows below.
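The following sketch assumes scikit-learn is available: a decision tree is fitted to synthetic data, and predictors that are never used in a split receive an importance of zero, i.e., they are effectively deselected. The data and settings are illustrative assumptions.

# A minimal sketch of embedded feature selection: predictors never used in a
# split of a fitted decision tree receive zero importance.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only 3 of the 10 predictors carry information about the outcome.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

for i, importance in enumerate(tree.feature_importances_):
    status = "used in a split" if importance > 0 else "never used (effectively deselected)"
    print(f"feature {i}: importance={importance:.3f} ({status})")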

As with model fitting, the main concern during feature selection is overfitting. This is
particularly true when wrapper approaches are used and/or when the number of data points in
the training set is small compared to the number of predictors.

Finally, unsupervised selection methods can also have a very positive impact on model
performance. Predictors whose values occur with very low frequency can harm certain
models (such as linear regression), and it may be advisable to remove them before building a
model.

When searching for a subset of variables, it is important to remember that there may not be a
single set of predictors that delivers the best performance. There is often a compensating
effect whereby, when one seemingly important variable is removed, the model adjusts using
the remaining variables. This is especially true when there is some degree of correlation
between the explanatory variables or when a low-bias model is used. For this reason, feature
selection should not be used as a formal method of assessing the importance of a feature.
More traditional inferential statistical approaches are a better solution for evaluating a
predictor's contribution to the underlying model or data set.

3.8 CLUSTERING

Cluster analysis is a method used to segment the market. The goal is to identify categories of
consumers in the marketplace that are homogeneous, i.e., that share similar features and can
therefore be grouped into one category. Each cluster/group should be big enough to allow the
company to serve it profitably, since the overall aim of the company is to satisfy the consumer
and make money.

The group of customers that the company hopes to serve should be large enough for the
company to build a commercially feasible plan around it. The same consideration applies
from the customer's side, since a customer cannot spend beyond a certain amount on a given
good.

Let's use an example. Suppose the owner of a rental store wants to understand customers'
preferences in order to grow the business. Can the owner look at each customer's details and
formulate a separate business plan for each of them? Certainly not. What can be done instead
is to group all the customers into, say, 10 categories based on their shopping patterns and
adopt a separate strategy for the customers in each of these 10 groups. This is what we call
clustering.

3.8.1 Significance of Cluster sampling


Several difficulties arise when collecting a sample of elementary units directly; cluster
sampling allows us to deal with the following problems:

i. Often a sampling frame of elementary units is not available, and identifying and
interviewing sampling units is expensive in terms of resources, time, and effort; e.g., a list of
households in a metro region, a list of state farm owners, etc.
ii. The sample units may be located far apart, so surveying them consumes a lot of time and
resources;

iii. It may not be easy to find well-defined and readily identifiable elementary units.

Thus, cluster sampling yields adequate results for sampling elementary units while avoiding
the above issues. In cluster sampling, elementary units are formed into groups based on
location, class, or region.

3.8.2 K-means
Before we begin with k-means clustering, let's first understand the concept of centroid
models, the family to which this type of clustering belongs.

Centroid models are iterative clustering algorithms in which the notion of similarity is
derived from how close a data point is to the centroid of a cluster. The K-means clustering
algorithm is a popular algorithm in this group. In these models, the number of clusters
required at the end must be specified beforehand, which makes it important to have prior
knowledge of the dataset. These models run iteratively to find a local optimum.

The first step of this algorithm is to create, among our unlabelled observations, new
randomly placed points called 'centroids'. The number of centroids corresponds to the
number of output clusters (which, remember, we do not know in advance). The iterative
process then begins, consisting of two steps:

i. First, for each centroid, the algorithm finds the points nearest to it (in terms of distance,
normally measured as the Euclidean distance) and assigns them to its cluster;

ii. Second, for each cluster (represented by one centroid), the algorithm calculates the
average of all the points assigned to it. The result of this computation becomes the new
centroid of the cluster.

Each time the procedure is repeated, some observations initially assigned to one centroid may
be reassigned to another. Moreover, after several repetitions, the change in the position of the
centroids becomes less and less significant, as the initial random centroids converge towards
the actual ones. The process stops when there is no further change in the position of the
centroids.
Figure 3.14: Centroid models

There are several approaches that can be used for choosing the number of clusters; here we
will explain and use the so-called 'Elbow Method'. The idea is that what we want inside our
clusters is a low degree of heterogeneity, measured by the within-cluster sum of squares
(WCSS), i.e., the sum of the squared distances between each point and the centroid of its
cluster.

It is intuitive that the greater the number of centroids, the lower the WCSS. In fact, if we had
as many centroids as observations, the WCSS would be zero. However, by the rule of
parsimony, we know that simply setting the largest possible number of centroids would
defeat the purpose.

The idea is to choose the number of centroids beyond which the reduction in WCSS becomes
negligible. The relationship can be represented with the following graph:

Figure 3.15: Centroid Graph


The idea is that if the points form an arm shape, the elbow of the arm corresponds to the
optimal number of centroids.
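The elbow heuristic can be sketched as follows, assuming scikit-learn is available; the synthetic blobs and the candidate range of k are illustrative assumptions. The within-cluster sum of squares is exposed by KMeans as the inertia_ attribute.

# A minimal sketch of the elbow method: compute the WCSS (inertia) for several
# candidate numbers of centroids and look for the "elbow" in the curve.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters (unknown to the algorithm).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares (WCSS) for this value of k.
    print(f"k={k}: WCSS={km.inertia_:.1f}")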

Example: In the dendrogram below, if you draw a horizontal line at y = 2, how many
clusters will be formed?

Solution: Since the number of vertical lines intersecting the horizontal line at y = 2 in the
dendrogram is 2, two clusters will be formed.

3.8.3 Hierarchical Clustering


A hierarchical clustering approach works by organising the data into a tree of clusters.
Hierarchical clustering starts by treating each data point as a single cluster. It then repeatedly
performs the following steps:

i. Identify the two clusters that are nearest to each other, and

ii. Merge these two most similar clusters. These steps are repeated until all the clusters are
merged together.

The goal of hierarchical clustering is to produce a hierarchical series of nested clusters.

A diagram called a dendrogram (a tree-like diagram that displays the sequence of merges or
splits) graphically represents this hierarchy; it is an inverted tree that shows the order in
which clusters are merged (bottom-up view) or split (top-down view).

This algorithm uses two techniques: Agglomerative and Divisive.

(i) Agglomerative

Initially, each data point is considered an independent cluster, and the closest pairs of clusters
are merged at each step (a bottom-up approach). At first, every data point is treated as an
individual cluster. At each iteration, similar clusters merge with other clusters until one single
cluster remains. The agglomerative hierarchical clustering algorithm is as follows:

i. Consider each data point as an independent cluster

ii. Calculate the similarity of each cluster to all other clusters (compute the proximity matrix)

iii. Merge the clusters that are closest or most similar to each other

iv. Recalculate the proximity matrix for the new clusters

v. Repeat Steps iii and iv until only one cluster remains

(ii) Divisive

Divisive hierarchical clustering is the reverse of the agglomerative method. In divisive
hierarchical clustering, we start by considering all data points as a single cluster and, in each
iteration, separate out the data points that are least similar to the rest of the cluster. In the
end, we are left with N clusters.

In hierarchical clustering, the number of clusters K can be set just as in K-means, with n the
number of data points and n > K. Agglomerative HC starts from n clusters and merges them
until K clusters are obtained. Divisive HC starts from a single cluster and splits it according
to dissimilarities until K clusters are formed. The similarity here is the distance between
points, which can be measured in several ways and is a central aspect of the separation. It can
be calculated using various approaches (a short code sketch follows the list below):

Min (single linkage): Given two clusters C1 and C2 such that point a belongs to C1 and point
b to C2, the similarity between the clusters is based on the minimum distance between such
pairs of points a and b.

Max (complete linkage): The similarity between the clusters is based on the maximum
distance between points a and b.

Average (average linkage): All pairs of points are taken and their similarities are computed;
the average of these similarities is the similarity between C1 and C2.
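The three linkage criteria can be sketched with SciPy as below; the toy points are arbitrary, and "single", "complete", and "average" correspond to the Min, Max, and Average criteria just described.

# A minimal sketch of agglomerative hierarchical clustering with different
# linkage criteria, assuming SciPy is available. The points are arbitrary.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])

for method in ("single", "complete", "average"):      # Min, Max, and Average linkage
    Z = linkage(points, method=method)                # bottom-up merge history (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, "linkage cluster labels:", labels)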

3.8.4 Comparison between k-means and Hierarchical Clustering

Basis of Distinction | K-means Clustering                               | Hierarchical Clustering
Category             | Centroid based, partition based                  | Hierarchical, agglomerative
Method used          | Elbow Method (using WCSS)                        | Dendrogram
Approach             | Only the centroid is considered to form clusters | Top-down or bottom-up view
Running Time         | Faster                                           | Slower

Table 3.1 Comparison between k-means and Hierarchical Clustering

CHECK YOUR PROGRESS- 7

Q1 What do you understand by the term clustering? What are the various parameters that are
considered before forming clusters? List the significance of clustering.

Q2 Consider this example: the marketing department has asked you to supply consumer
segments for the forthcoming marketing plan. What features would you feed into your model,
and what transformations would you apply to produce these segments? Explain using the
Elbow Method.

Q3 Cluster the following 8 points (with (x, y) representing locations) into three clusters:

A(2, 10), B(2, 5), C(8, 4), D(5, 8), E(7, 5), F(6, 4), G(1, 2) and H(4, 9)

The initial cluster centres are A(2, 10), D(5, 8) and G(1, 2).

The distance function between a = (x1, y1) and b = (x2, y2) is defined as-

P(a, b) = |x2-x1|+ |y2-y1|

Use the K-Means Algorithm to identify the three cluster centres after the second iteration.

Q4 What do you understand by Hierarchical Clustering? Name the two types of Hierarchical
Clustering.

3.9 SUMMARY

Data management involves arranging, appending, combining, and collapsing data to make
data easy to navigate and use. The data is sorted by placing the observations in a particular
order. Editing of data is a method for examining the raw material obtained. Coding is the
method of assigning numbers or other symbols to answers to questions. All data must fit into
a certain group and be classified by the researchers into specific groups or categories.
Machine learning involves predicting and classifying data, and to do so we use different
machine learning algorithms. Support Vector Machines, or SVMs, are among the most
common algorithms and continue to be popular. The SVM training algorithm creates a model
that assigns new examples to one group or another, making it a non-probabilistic linear
classifier. Data on attributes are commonly used in social sciences such as Psychology,
Sociology, Political Science, Communication Studies, and Gender Studies.

SVM is a supervised learning algorithm used for both classification and regression problems.
Ensemble methods are machine learning techniques that integrate many simple models to
create one optimised predictive model. The theory of probability originated in the gaming
houses of France in the seventeenth century, but it was only in the 20th century that its
fundamental ideas and definitions were placed on a rigorous mathematical foundation by
mathematicians such as Kolmogorov and Markov, among others. Logistic regression is used
to obtain an odds ratio when there is more than one explanatory variable. The method is
similar to multiple linear regression, except that the response variable is binomial. The result
is the effect of each variable on the odds ratio of the observed event of interest, and the main
benefit is to avoid confounded results when assessing the joint influence of the variables. The
goal of cluster analysis is to identify categories of consumers in the marketplace; each
cluster/group so defined should be big enough to allow the company to grow profitably.

3.10 SELF-ASSESSMENT QUESTIONS

A. Descriptive Type Questions

1. What is the difference between a Data Analyst and a Business Analyst?

2. What are the different tools used in Business Analytics?

3. Discuss the cost parameter in the SVM.

4. Suppose you use a linear SVM classifier for a two-class classification problem. You have
obtained the data shown below, in which some points are circled in red, representing the
support vectors. Answer the following questions:
a. Will the decision boundary change if you remove any one of the red circled points
from the data?

b. Will the decision boundary change if you remove the points that are not circled in red
from the data?

5. Define the K-means algorithm.

B. Practical / Scenario Based Questions

1. Explain why the algorithm of bagging works best for the models which have high variance
and low bias.

2. Discuss the most appropriate strategy for data cleaning before performing clustering
analysis, given a less than desirable number of data points.

3. List the algorithms we use for Variable Selection.

4. How do we select Support Vector Machine kernels?

5. An employee has 2 kids and one of them is a girl. What is the probability that the other
child is also a girl? You can assume that there are an equal number of males and females in
the world.

C. Multiple Choice Questions

1. Choose the incorrect statement.

a. Subsetting can be used to pick and remove variables and observations

b. Raw data can only be analysed once

c. Merging combines datasets on the same observations in order to generate a result with
more variables

d. None of the above mentioned options


2. Select the option that is not a part of the data science process.

a. Discovery

b. Model Planning

c. Communication Building

d. Operationalize

3. On which of the following does the effectiveness of an SVM depend upon?

a. Selection of Kernel

b. Kernel Parameters

c. Soft Margin Parameter C

d. All of the above mentioned options

4. Choose the correct statement about Random Forest and Gradient Boosting ensemble
methods.

a. Both approaches can be used for the purpose of classification

b. Random Forest is used for classification, while Gradient Boosting is used for
regression tasks.

c. Random Forest is used for regression, while Gradient Boosting is used for
classification tasks.

d. Both approaches can’t be used for regression tasks.

5. Choose the algorithm which is not an ensemble learning algorithm.

a. AdaBoost

b. Extra Trees

c. Random Forest

d. Decision Trees

6. Which of the following is an example of movie recommendation systems?


a. Classification

b. Clustering

c. KNN

d. Regression

7. Can decision trees be used for performing clustering?

a. Yes

b. No

c. Insufficient data

d. Can’t say

8. Choose the method that best fits the data in Logistic Regression.

a. Least Square Error

b. Maximum Likelihood

c. Jaccard distance

d. Both a. and b.

9. What is the other name for the process decision programme chart?

a. Affinity diagram

b. Relationship diagram

c. Decision tree

d. Matrix diagram

10. Which is not a machine learning algorithm?

a. SVG

b. SVM

c. Random forest

d. None of the above mentioned options


Answers:

1-b, 2-c, 3-d, 4-a, 5-d,6-b,7-a,8-b,9-c,10-a

3.11 SUGGESTED READINGS

References Books

● Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer Japan.
● Tony Hey; Stewart Tansley; Kristin Michele Tolle. The Fourth Paradigm: Data-intensive Scientific Discovery. Microsoft Research.
● Bell, G.; Hey, T.; Szalay, A. "COMPUTER SCIENCE: Beyond the Data Deluge". Science.

Textbook References

● Mize (2017). Data Analytics: The Ultimate Beginner's Guide to Data Analytics.
● Foster Provost; Tom Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking.

Websites

● www.analyticsvidhya.com
● www.towardsdatascience.com
● www.wikipedia.org
