20IT503 - Big Data Analytics - Unit 3


Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
20IT503
Big Data Analytics
Department: IT
Batch/Year: 2020-24/ III
Created by: K.Selvi AP/IT

Date: 30.07.2022
Table of Contents
S.NO   CONTENTS
1      Contents
2      Course Objectives
3      Pre Requisites (Course Names with Code)
4      Syllabus (With Subject Code, Name, LTPC details)
5      Course Outcomes
6      CO-PO/PSO Mapping
7      Lecture Plan
8      Activity Based Learning
9      UNIT 3 - PREDICTIVE MODELING AND MACHINE LEARNING
       3.1   Linear Regression
       3.2   Polynomial Regression
       3.3   Multivariate Regression
       3.4   Bias/Variance Trade Off
       3.5   K Fold Cross Validation
       3.6   Data Cleaning and Normalization
       3.7   Cleaning Web Log Data
       3.8   Normalizing Numerical Data
       3.9   Detecting Outliers
       3.10  Introduction to Supervised and Unsupervised Learning
       3.11  Dealing with Real World Data
       3.12  Machine Learning Algorithms
       3.13  Clustering
10     Assignments
11     Part A (Questions & Answers)
12     Part B Questions
13     Supportive Online Certification Courses
14     Real time Applications
15     Content Beyond the Syllabus
16     Assessment Schedule
17     Prescribed Text Books & Reference Books
Course Objectives
To Understand the Big Data Platform and its Use cases

To Provide an overview of Apache Hadoop

To Provide HDFS Concepts and Interfacing with HDFS

To Understand Map Reduce Jobs

Pre Requisites

CS8391 – Data Structures

CS8492 – Database Management System


Syllabus
20IT503 BIG DATA ANALYTICS (L T P C: 3 0 0 3)
UNIT I INTRODUCTION TO BIG DATA 9

Data Science – Fundamentals and Components – Types of Digital Data – Classification of Digital Data – Introduction to Big Data – Characteristics of Data – Evolution of Big Data – Big Data Analytics – Classification of Analytics – Top Challenges Facing Big Data – Importance of Big Data Analytics.

UNIT II DESCRIPTIVE ANALYTICS USING STATISTICS 9

Mean, Median and Mode – Standard Deviation and Variance – Probability – Probability Density Function – Percentiles and Moments – Correlation and Covariance – Conditional Probability – Bayes’ Theorem – Introduction to Univariate, Bivariate and Multivariate Analysis – Dimensionality Reduction using Principal Component Analysis (PCA) and LDA.

UNIT III PREDICTIVE MODELING AND MACHINE LEARNING 9

Linear Regression – Polynomial Regression – Multivariate Regression – Bias/Variance Trade Off – K Fold Cross Validation – Data Cleaning and Normalization – Cleaning Web Log Data – Normalizing Numerical Data – Detecting Outliers – Introduction to Supervised And Unsupervised Learning – Reinforcement Learning – Dealing with Real World Data – Machine Learning Algorithms – Clustering.
UNIT IV BIG DATA HADOOP FRAMEWORK 9

Introducing Hadoop – Hadoop Overview – RDBMS versus Hadoop – HDFS (Hadoop Distributed File System): Components and Block Replication – Processing Data with Hadoop – Introduction to MapReduce – Features of MapReduce – Introduction to NoSQL: CAP theorem – MongoDB: RDBMS Vs. MongoDB – MongoDB Database Model – Data Types and Sharding – Introduction to Hive – Hive Architecture – Hive Query Language (HQL).

UNIT V PYTHON AND R PROGRAMMING 9

Python Introduction – Data types – Arithmetic – Control flow – Functions – args – Strings – Lists – Tuples – Sets – Dictionaries – Case study: Using R, Python, Hadoop, Spark and Reporting tools to understand and analyze real-world data sources in the financial, insurance and healthcare domains, with Iris and UCI datasets.
Course Outcomes
CO#   COs                                                      K Level
CO1   Identify Big Data and its Business Implications.         K3
CO2   List the components of Hadoop and Hadoop Eco-System      K4
CO3   Access and Process Data on Distributed File System       K4
CO4   Manage Job Execution in Hadoop Environment               K4
CO5   Develop Big Data Solutions using Hadoop Eco System       K4
CO6   Examine the given data with R programming                K4

CO-PO/PSO Mapping

CO#   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
CO1   2    3    3    3    3    1    1    -    1    2     1     1     2     2     2
CO2   2    3    2    3    3    1    1    -    1    2     1     1     2     2     2
CO3   2    3    2    3    3    1    1    -    1    2     1     1     2     2     2
CO4   2    3    2    3    3    1    1    -    1    2     1     1     2     2     2
CO5   2    3    2    3    3    1    1    -    1    2     1     1     1     1     1
CO6   2    3    2    3    3    1    1    -    1    2     1     1     1     1     1
Lecture Plan
UNIT – III

S.No  Topics                                                    Periods  CO   Taxonomy level  Mode of delivery
1     Linear Regression, Polynomial Regression                  1        CO3  K4              Chalk & Board
2     Multivariate Regression, Bias/Variance Trade Off          1        CO3  K4              Chalk & Board
3     Data Cleaning and Normalization – Cleaning Web Log Data   1        CO3  K4              Chalk & Board
4     Normalizing Numerical Data – Detecting Outliers           1        CO3  K4              Chalk & Board
5     Introduction to Supervised and Unsupervised Learning      1        CO3  K4              Chalk & Board
6     Reinforcement Learning                                    1        CO3  K4              Chalk & Board
7     Dealing with Real World Data                              1        CO3  K4              Chalk & Board
8     Machine Learning Algorithms                               1        CO3  K4              Chalk & Board
9     Clustering                                                1        CO3  K4              Chalk & Board
ACTIVITY BASED LEARNING

Match the Picture with its relevance


Lecture Notes
PREDICTIVE MODELING AND MACHINE LEARNING

Introduction to Machine Learning:


Machine learning is a branch of Artificial Intelligence (AI) focused
on building applications that learn from data and improve their accuracy over
time without being programmed to do so.
Types of Machine Learning:
Supervised Machine Learning:
It is an ML technique where models are trained on labelled data i.e.
output variable is provided in these types of problems. Here, the models find
the mapping function to map input variables with the output variable or the
labels.
Regression and Classification problems are a part of Supervised
Machine Learning.
Unsupervised Machine Learning:
It is the technique where models are not provided with the labelled data and
they have to find the patterns and structure in the data to know about the data.
Clustering and Association algorithms are a part of Unsupervised
ML.
3.1 Linear Regression

In the simplest words, Linear Regression is a supervised Machine Learning model in which the model finds the best-fit linear line between the independent and dependent variables, i.e. it finds the linear relationship between the dependent and independent variables.
Linear Regression is of two types: Simple and Multiple.
In Simple Linear Regression only one independent variable is present, and the model has to find its linear relationship with the dependent variable.
In Multiple Linear Regression there is more than one independent variable for the model to find the relationship with.
Equation of Simple Linear Regression: y = b0 + b1*x, where b0 is the intercept, b1 is the coefficient or slope, x is the independent variable and y is the dependent variable.

Equation of Multiple Linear Regression: y = b0 + b1*x1 + b2*x2 + ... + bn*xn, where b0 is the intercept, b1, b2, b3, ..., bn are the coefficients or slopes of the independent variables x1, x2, x3, ..., xn and y is the dependent variable.
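
The following is a minimal, illustrative sketch (not part of the original notes) of fitting these two equations with scikit-learn; the sample numbers and column layout are made up for demonstration.

# Minimal sketch: simple and multiple linear regression with scikit-learn (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: y = b0 + b1*x
x = np.array([[1], [2], [3], [4], [5]])           # one independent variable
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])          # dependent variable
simple = LinearRegression().fit(x, y)
print("intercept b0:", simple.intercept_, "slope b1:", simple.coef_[0])

# Multiple linear regression: y = b0 + b1*x1 + b2*x2
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])   # two independent variables
multiple = LinearRegression().fit(X, y)
print("intercept b0:", multiple.intercept_, "coefficients b1, b2:", multiple.coef_)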

3.2 Polynomial Regression


What is Machine Learning?
Machine Learning algorithms may access data (categorical, numerical, image, video, or anything else) and use it to learn for themselves without any explicit programming. But how does Machine Learning function, exactly? Simply by inspecting the data or facts (following instructions to observe patterns) and making decisions or predictions.
Types of Machine Learning
Algorithmic approaches for machine learning may be divided into three categories.
1. Supervised Machine Learning – Task-Oriented (Classification – Regression)
2. Unsupervised Machine Learning – Fact or Data-Oriented (Cluster – Anomaly
detection)
3. Reinforcement Machine Learning – Learning by trial and error (from rewards and penalties)
Supervised Machine Learning
In supervised learning, algorithms are trained using labelled datasets, and the algorithm learns about each category of input. The approach is evaluated using test data (a portion of the dataset held out from training) and predicts the outcome once the training phase is over. Supervised machine learning is classified into two types:
1. Classification
2. Regression
Classification Vs Regression
When there is a link between the input and output variables,
regression methods are applied. It is used to forecast continuous variables such
as weather and market movements, among others.
Classification methods are employed when the output variable is
categorical, such as Yes-No, Male-Female, True-False, Normal – Abnormal, and
so on.
Why do we need Regression?
Regression analysis is frequently used for one of two purposes:
forecasting the value of the dependent variable for those who have knowledge of
the explanatory components, or assessing the influence of an explanatory
variable on the dependent variable.
What is Polynomial Regression?
Polynomial Regression is a form of linear regression in which the
relationship between the independent variable x and dependent variable y is
modelled as an nth degree polynomial. Polynomial regression fits a nonlinear
relationship between the value of x and the corresponding conditional mean of y,
denoted E(y|x).
Why Polynomial Regression:
• There are some relationships that a researcher will hypothesize to be curvilinear. Clearly, such cases will include a polynomial term.
• Inspection of residuals. If we try to fit a linear model to curved data, a scatter plot of residuals (Y-axis) against the predictor (X-axis) will show patches of many positive residuals in the middle; in such a situation a straight-line model is not appropriate.
• An assumption in usual multiple linear regression analysis is that all the independent variables are independent. In the polynomial regression model, this assumption is not satisfied.
Uses of Polynomial Regression:
These are basically used to define or describe non-linear phenomena such as:
• The growth rate of tissues.
• Progression of disease epidemics
• Distribution of carbon isotopes in lake sediments
The basic goal of regression analysis is to model the expected
value of a dependent variable y in terms of the value of an independent
variable x.
In simple regression, we used the following equation –
y = a + bx + e
Here y is a dependent variable, a is the y-intercept, b is the slope and e is the
error rate.
In many cases, this linear model will not work out. For example, if we are analysing the production of a chemical synthesis in terms of the temperature at which the synthesis takes place, we use a quadratic model:
y = a + b1x + b2x^2 + e
Here y is the dependent variable on x, a is the y-intercept and e is the error rate.
In general, we can model it for the nth degree:
y = a + b1x + b2x^2 + .... + bnx^n + e
Since the regression function is linear in terms of the unknown coefficients, these models are linear from the point of view of estimation. Hence, through the least squares technique, we can compute the response value, that is y.
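
As a hedged illustration of the least squares fit described above, the sketch below fits a quadratic model with scikit-learn; the temperature/yield numbers are hypothetical.

# Minimal sketch: quadratic (degree-2) polynomial regression, y = a + b1*x + b2*x^2.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([50, 60, 70, 80, 90, 100]).reshape(-1, 1)   # hypothetical temperature
y = np.array([20, 35, 45, 48, 44, 30])                   # hypothetical chemical yield

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(x)                  # columns: x, x^2
model = LinearRegression().fit(X_poly, y)       # least squares estimation
print("intercept a:", model.intercept_)
print("coefficients b1, b2:", model.coef_)
print("predicted yield at x=75:", model.predict(poly.transform([[75]]))[0])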
Advantages of using Polynomial Regression:
• A broad range of functions can be fit under it.
• Polynomial basically fits a wide range of curvatures.
• Polynomial provides the best approximation of the relationship between
dependent and independent variables.
Disadvantages of using Polynomial Regression:
• These are too sensitive to the outliers.
• The presence of one or two outliers in the data can seriously affect the
results of nonlinear analysis.
• In addition, there are unfortunately fewer model validation tools for the
detection of outliers in nonlinear regression than there are for linear
regression.
Conclusion:
Polynomial Regression is used in many organizations when they
identify a nonlinear relationship between the independent and dependent
variables. It is one of the difficult regression techniques as compared to other
regression methods, so having in-depth knowledge about the approach and
algorithm will help you to achieve better results.
3.3 Multivariate Regression :
Multivariate Regression is a type of machine learning algorithm
that involves multiple data variables for analysis. It is mostly considered as
a supervised machine learning algorithm. Steps involved for Multivariate
regression analysis are feature selection and feature engineering,
normalizing the features, selecting the loss function and hypothesis
parameters, optimizing the loss function, testing the hypothesis and generating
the regression model. The major advantage of multivariate regression is to
identify the relationships among the variables associated with the data set.
It helps to find the correlation between the dependent and multiple
independent variables. Multivariate linear regression is a commonly used
machine learning algorithm.
What is Multivariate Regression?
Multivariate Regression helps us to measure the relationships between more than one independent variable and more than one dependent variable. It finds the relation between the variables (linearly related).
• It is used to predict the behaviour of the outcome variables and the association of predictor variables, and how the predictor variables are changing.
• It can be applied to many practical fields like politics, economics,
medical, research works and many different kinds of businesses.
• Multivariate regression is a simple extension of multiple regression.
• Multiple regression is used to predict the value of one variable based on the collective values of more than one predictor variable.
First, we will take an example to understand the use of multivariate
regression after that we will look for the solution to that issue.
Examples of Multivariate Regression:
• An e-commerce company has collected data on its customers, such as age, purchase history and gender, and the company wants to find the relationship between these different dependent and independent variables.
• A gym trainer has collected data on the clients coming to his gym and wants to observe some things about each client: health, eating habits (which kind of product the client consumes every week) and the client's weight. The trainer wants to find a relation between these variables.
• As you have seen in the above two examples, in both situations there is more than one variable, some dependent and some independent, so single regression is not enough to analyse this kind of data. This is where multivariate regression comes into the picture.
1. Feature selection
• The selection of features plays the most important role in multivariate regression.
• We need to find the features on which the dependent variable depends.
2. Normalizing Features
• For better analysis, features need to be scaled to get them into a specific range. We can also change the value of each feature.
3. Select Loss function and Hypothesis
• The loss function calculates the loss when the hypothesis predicts the wrong value.
• The hypothesis is the value predicted from the feature variables.
4. Set Hypothesis Parameters
• Set the hypothesis parameters so that they reduce the loss function and predict well.
5. Minimize the Loss Function
• Minimize the loss by using some loss-minimization algorithm over the dataset, which helps to adjust the hypothesis parameters. Once the loss is minimized, the model can be used for prediction. There are many algorithms that can be used for reducing the loss, such as gradient descent.
6. Test the hypothesis function
• Check how correctly the hypothesis function predicts values by testing it on test data.

Steps to follow to achieve Multivariate Regression

1) Import the necessary common libraries such as numpy and pandas.
2) Read the dataset using the pandas library.
3) As discussed above, normalize the data for better results. Why normalization? Because every feature has a different range of values.
4) Create a model that can achieve regression; if you are using linear regression, use the equation
Y = mx + c
in which x is the given input, m is the slope, c is a constant and y is the output variable.
5) Train the model using hyperparameters. Understand the hyperparameters and set them according to the model, such as learning rate, epochs and iterations.
6) As discussed above, the hypothesis plays an important role in the analysis; check the hypothesis and measure the loss/cost function.
7) The loss/cost function will help us to measure how accurate the hypothesis values are.
8) Minimizing the loss/cost function will help the model to improve its predictions.
9) The loss can be defined as the sum of the squared differences between the predicted values and actual values, divided by twice the size of the dataset.
10) To minimize the loss/cost function, use gradient descent; it starts with a random value and finds the point where the loss function is least.
By following the steps above, we can implement multivariate regression; a minimal sketch is shown below.
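
A minimal sketch of these steps is given below, assuming synthetic data with three features and two outcome variables; scikit-learn's least squares solver stands in for a hand-written gradient descent loop.

# Minimal sketch of multivariate regression (synthetic data, illustrative only).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                   # three features
B = np.array([[2.0, 0.5], [1.0, -1.0], [0.0, 3.0]])             # "true" coefficients for the synthetic data
Y = X @ B + rng.normal(scale=0.1, size=(100, 2))                # two dependent variables

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

scaler = StandardScaler().fit(X_train)                          # step 2: normalize features
model = LinearRegression()                                      # hypothesis: Y = XB + c
model.fit(scaler.transform(X_train), Y_train)                   # least squares minimizes the loss

Y_pred = model.predict(scaler.transform(X_test))                # step 6: test the hypothesis
print("test MSE (loss):", mean_squared_error(Y_test, Y_pred))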
Advantages of Multivariate Regression
• The multivariate technique allows finding a relationship between variables or
features
• It helps to find a correlation between independent and dependent variables.
Disadvantages of Multivariate Regression
• Multivariate techniques are a little complex and require high-level mathematical calculations.
• The multivariate regression model's output is not always easy to interpret, because the loss and error outputs are sometimes not identical.
• It cannot be applied to a small dataset, because results are more straightforward on larger datasets.
Conclusion – Multivariate Regression
• The main purpose of multivariate regression is to handle the case where more than one variable is available; in that case, single linear regression will not work.
• The real world mainly has multiple variables or features; when multiple variables/features come into play, multivariate regression is used.
3.4 Bias/Variance Trade Off:
Whenever we discuss model prediction, it’s important to understand
prediction errors (bias and variance). There is a trade-off between a model’s
ability to minimize bias and variance. Gaining a proper understanding of these
errors would help us not only to build accurate models but also to avoid the
mistake of over fitting and under fitting.

What is bias?

Bias is the difference between the average prediction of our model


and the correct value which we are trying to predict. Model with high bias pays
very little attention to the training data and oversimplifies the model. It always
leads to high error on training and test data.
What is variance?
Variance is the variability of model prediction for a given data point or a value which tells us the spread of our data. A model with high variance pays a lot of attention to training data and does not generalize on data which it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
In supervised learning, under fitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model or when we try to build a linear model with nonlinear data. Also, these kinds of models, such as linear and logistic regression, are too simple to capture the complex patterns in data.
In supervised learning, over fitting happens when our model captures the noise along with the underlying pattern in data. It happens when we train our model a lot over a noisy dataset. These models have low bias and high variance. These models tend to be very complex, like decision trees, which are prone to over fitting.

Total Error
To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error:
Total Error = Bias^2 + Variance + Irreducible Error
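
The trade-off can be made concrete with a small experiment. This sketch (illustrative, not from the notes) fits polynomials of degree 1, 4 and 15 to noisy nonlinear data; the degree-1 model underfits (high bias), the degree-15 model overfits (high variance), and the middle degree balances the two.

# Minimal sketch: bias/variance trade-off via polynomial degree.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)          # nonlinear data + noise
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):                                       # high bias, balanced, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(x_tr))
    test_mse = mean_squared_error(y_te, model.predict(x_te))
    print("degree", degree, "train MSE:", round(train_mse, 3), "test MSE:", round(test_mse, 3))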
3.5 K-Fold Cross-Validation:
Cross-validation is a resampling procedure used to evaluate machine
learning models on a limited data sample.
The procedure has a single parameter called k that refers to the
number of groups that a given data sample is to be split into. As such, the
procedure is often called k-fold cross-validation. When a specific value for k is
chosen, it may be used in place of k in the reference to the model, such as k=10
becoming 10-fold cross-validation.
Cross-validation is primarily used in applied machine learning to
estimate the skill of a machine learning model on unseen data. That is, to use a
limited sample in order to estimate how the model is expected to perform in
general when used to make predictions on data not used during the training of
the model.
It is a popular method because it is simple to understand and because
it generally results in a less biased or less optimistic estimate of the model skill
than other methods, such as a simple train/test split.
The general procedure is as follows:
• Shuffle the dataset randomly.
• Split the dataset into k groups
• For each unique group:
-Take the group as a hold out or test data set
-Take the remaining groups as a training data set
-Fit a model on the training set and evaluate it on the test set
-Retain the evaluation score and discard the model
• Summarize the skill of the model using the sample of model evaluation scores
Importantly, each observation in the data sample is assigned to an
individual group and stays in that group for the duration of the procedure. This
means that each sample is given the opportunity to be used in the hold out set 1
time and used to train the model k-1 times.
This approach involves randomly dividing the set of observations into
k groups, or folds, of approximately equal size. The first fold is treated as a
validation set, and the method is fit on the remaining k − 1 folds.
Configuration of K:
The k value must be chosen carefully for your data sample.
A poorly chosen value for k may result in a mis-representative idea of
the skill of the model, such as a score with a high variance (that may change a
lot based on the data used to fit the model), or a high bias, (such as an
overestimate of the skill of the model).
Three common tactics for choosing a value for k are as follows:
Representative: The value for k is chosen such that each train/test group of
data samples is large enough to be statistically representative of the broader
dataset.
k=10: The value for k is fixed to 10, a value that has been found through
experimentation to generally result in a model skill estimate with low bias and a modest variance.
k=n: The value for k is fixed to n, where n is the size of the dataset to give each
test sample an opportunity to be used in the hold out dataset. This approach is
called leave-one-out cross-validation.
The choice of k is usually 5 or 10, but there is no formal rule. As k
gets larger, the difference in size between the training set and the resampling
subsets gets smaller. As this difference decreases, the bias of the technique
becomes smaller.
Worked Example:
To make the cross-validation procedure concrete, let’s look at a
worked example.
Imagine we have a data sample with 6 observations:

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

The first step is to pick a value for k in order to determine the


number of folds used to split the data. Here, we will use a value of k=3. That
means we will shuffle the data and then split the data into 3 groups. Because we
have 6 observations, each group will have an equal number of 2 observations.
For example:

Fold1: [0.5, 0.2]
Fold2: [0.1, 0.3]
Fold3: [0.4, 0.6]

We can then make use of the sample, such as to evaluate the skill of
a machine learning algorithm.
Three models are trained and evaluated with each fold given a
chance to be the held out test set.
For example:
Model1: Trained on Fold1 + Fold2, Tested on Fold3
Model2: Trained on Fold2 + Fold3, Tested on Fold1
Model3: Trained on Fold1 + Fold3, Tested on Fold2
The models are then discarded after they are evaluated as they have
served their purpose.
The skill scores are collected for each model and summarized for use.
Cross-Validation API:
We do not have to implement k-fold cross-validation manually. The
scikit-learn library provides an implementation that will split a given data sample
up.
The KFold() scikit-learn class can be used. It takes as arguments the
number of splits, whether or not to shuffle the sample, and the seed for
the pseudorandom number generator used prior to the shuffle.
For example, we can create an instance that splits a dataset into 3
folds, shuffles prior to the split, and uses a value of 1 for the pseudorandom
number generator.
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

The split() function can then be called on the class where the data
sample is provided as an argument. Called repeatedly, the split will return each
group of train and test sets. Specifically, arrays are returned containing the
indexes into the original data sample of observations to use for train and test
sets on each iteration.
For example, we can enumerate the splits of the indices for a data
sample using the created KFold instance as follows

# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (train, test))

We can tie all of this together with our small dataset used in the
worked example of the prior section.
# scikit-learn k-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold
# data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
# prepare cross validation
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))

Running the example prints the specific observations chosen for each train
and test set. The indices are used directly on the original data array to retrieve the
observation values.
train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]

Usefully, the k-fold cross-validation implementation in scikit-learn is provided as a component operation within broader methods, such as grid-searching model hyper parameters and scoring a model on a dataset.
Nevertheless, the KFold class can be used directly in order to split up
a dataset prior to modelling such that all models will use the same data splits.
This is especially helpful if you are working with very large data samples.
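
As a small illustration of that point, the sketch below (not from the original notes) passes the same KFold object to cross_val_score; the logistic-regression model and the iris dataset are arbitrary choices for demonstration.

# Minimal sketch: reusing a KFold splitter inside cross_val_score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print("fold accuracies:", scores, "mean accuracy:", scores.mean())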
Variations on Cross-Validation
There are a number of variations on the k-fold cross validation procedure.
Three commonly used variations are as follows:
Train/Test Split: Taken to one extreme, k may be set to 2 (not 1) such that a
single train/test split is created to evaluate the model.
LOOCV: Taken to another extreme, k may be set to the total number of
observations in the dataset such that each observation is given a chance to be
the held out of the dataset. This is called leave-one-out cross-validation,
or LOOCV for short.
Stratified: The splitting of data into folds may be governed by criteria such as
ensuring that each fold has the same proportion of observations with a given
categorical value, such as the class outcome value. This is called stratified
cross-validation.
Repeated: This is where the k-fold cross-validation procedure is repeated n
times, where importantly, the data sample is shuffled prior to each repetition,
which results in a different split of the sample.
Nested: This is where k-fold cross-validation is performed within each fold of
cross-validation, often to perform hyper parameter tuning during model
evaluation. This is called nested cross-validation or double cross-validation.
3.6 Data Cleaning and Normalization:
What is Data Cleaning?
Data Cleaning is a critical aspect of the domain of data
management. The data cleansing process involves reviewing all the data
present within a database to either remove or update information that is
incomplete, incorrect, duplicated or irrelevant. Data cleansing is not simply about erasing old information to make space for new data; rather, the process is about finding a way to maximize the dataset's accuracy without necessarily tampering with the data available. Data Cleaning is the
process of determining and correcting the wrong data. Organizations rely on
data for most things but only a few properly address the data quality.
Utilizing the effectiveness and use of data can tremendously
increase the reliability and value of the brand. Hence, Business enterprises
have started giving more importance to data quality. Data Cleaning includes
many more actions than just removing the data, the process also requires
fixing wrong spellings and syntactical errors, correcting and filling of empty
fields, and identifying duplicate records to name a few. Data cleaning is
considered a foundational step for data science basics, as it plays an
important role in an analytical process that helps uncover reliable answers.
Improving the data quality through data cleaning can eliminate problems like
expensive processing errors and incorrect invoices.
Data quality is also very important as several pieces of data like
customer information are always changing and evolving. Although there is no
one such absolute way to describe the precise steps in the data cleaning
process as the processes vary from dataset to dataset it plays an important
part in deriving reliable answers. The crucial motive of data cleaning is to
construct uniform and standardized data sets that enable the analytical tools
and business intelligence easy access and help to perceive accurate data for
each problem.
• Getting rid of unwanted outliers is another method because
outliers can cause problems with certain models. Removing outliers will
not only help with the model’s performance but also improve its accuracy.
Although one should make sure that there is a legitimate reason to remove
them.
• Small mistakes are often made when numbers are entered. If there are any mistakes in the numbers being entered, they need to be changed to actual readable data. All of the data present will have to be converted so that the numbers are readable by the system. Data types should be uniform across all of the datasets. A string can't be termed numeric, nor can a numeric be a boolean value.
• Correcting missing values is another important method that can’t be
ignored. Knowing how to handle missing values will help keep the data
clean. At times, there might be too many missing values present in a single
column. For such occurrences, there might not be enough data to work
with, so deleting the column may stand as the best option in such cases.
There are also other ways to impute missing data values in the dataset. This can be done by estimating what the missing data might be; linear regression or the median can help calculate this.
• Fixing the typos as a result of human error is important and one can fix
typos through multiple algorithms and techniques. One of the methods can
be to map the values and convert them into their correct spelling. Typos
are essential to fix because models treat different values differently. Strings
present in data rely a lot on their spellings and cases.
What is Data Normalization?
Normalization is the process of organizing data from a database.
This includes processes like creating tables and establishing relationships
between those tables according to the rules designed to protect the data as
well as to make the database more flexible by eliminating the redundancy and
inconsistency present. Data normalization is the method of organizing data to
appear similar across all records and fields. Performing so always results in
getting higher quality data.
This process basically includes eliminating unstructured data and
duplicates in order to ensure logical data storage.
When data normalization is performed correctly a higher value of
insights are generated. In machine learning, some feature values at times
differ from others multiple times. The features with higher values will always
dominate the learning process. However, it does not mean that those
variables are more important to predict the outcome of the model. Data
normalization transforms the multiscaled data all to the same scale. After
normalization, all variables have a similar weightage on the model, hence
improving the stability and performance of the learning algorithm.
Normalization gives equal importance to each variable so that no single
variable drives the model performance. It also prevents any issues created
from database modifications such as insertions, deletions, and updates.
To achieve further heights, businesses need to perform data normalization regularly. It is one of the most important things that can be
done to get rid of errors that make an analysis of data a complicated and
difficult task. With normalization, an organization can make the most of its
data as well as invest in data gathering at a greater, more efficient level.
Cleaning and Normalization of Data In Python
Using Python we can easily perform Data Cleaning and
Normalization. Here I have tried to demonstrate an example using Python and
Numpy, to demonstrate the difference between the general data and
normalized data.
import pandas as pd
import numpy as np
from scipy import stats
from mlxtend.preprocessing import minmax_scaling
import seaborn as sns
import matplotlib.pyplot as plt
You might highly consider normalizing your data if you are going
to implement a machine learning or statistical technique that comes with the
assumption of your data being normally distributed. Some examples of these
include linear regression, T-tests and Gaussian naive Bayes. Now let’s see
what normalizing data looks like:

# original_data is assumed to be a positive-valued numeric array loaded earlier
normalized_data = stats.boxcox(original_data)   # Box-Cox transform returns (transformed data, lambda)
fig, ax = plt.subplots(1, 2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")
Output:

We can notice that the shape of our data has changed. Before normalizing it was almost L-shaped, but after normalizing it looks more like a bell-shaped curve. This helps us ensure that the data is no longer heavily skewed and is properly scaled.

How to clean data:
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including
duplicate observations or irrelevant observations. Duplicate observations will
happen most often during data collection.
When you combine data sets from multiple places, scrape data, or
receive data from clients or multiple departments, there are opportunities to create
duplicate data. De-duplication is one of the largest areas to be considered in this
process. Irrelevant observations are when you notice observations that do not fit
into the specific problem you are trying to analyse. For example, if you want to
analyze data regarding millennial customers, but your dataset includes older
generations, you might remove those irrelevant observations. This can make
analysis more efficient and minimize distraction from your primary target—as well
as creating a more manageable and more performant dataset.
Step 2: Fix structural errors:
Structural errors are when you measure or transfer data and
notice strange naming conventions, typos, or incorrect capitalization. These
inconsistencies can cause mislabelled categories or classes. For example,
you may find “N/A” and “Not Applicable” both appear, but they should be
analysed as the same category.
Step 3: Filter unwanted outliers
Often, there will be one-off observations where, at a glance, they
do not appear to fit within the data you are analyzing. If you have a
legitimate reason to remove an outlier, like improper data-entry, doing so will
help the performance of the data you are working with. However, sometimes
it is the appearance of an outlier that will prove a theory you are working on.
Remember: just because an outlier exists, doesn’t mean it is incorrect. This
step is needed to determine the validity of that number. If an outlier proves
to be irrelevant for analysis or is a mistake, consider removing it.
Step 4: Handle missing data
You can’t ignore missing data because many algorithms will not
accept missing values. There are a couple of ways to deal with missing data.
Neither is optimal, but both can be considered.
• As a first option, you can drop observations that have missing values, but
doing this will drop or lose information, so be mindful of this before you
remove it.
• As a second option, you can input missing values based on other
observations; again, there is an opportunity to lose integrity of the data
because you may be operating from assumptions and not actual
observations.
• As a third option, you might alter the way the data is used to effectively navigate null values. A minimal pandas sketch of these three options follows.
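
The sketch below uses hypothetical column names and values purely for illustration.

# Hypothetical data frame with missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 34, 41],
                   "income": [50000, 62000, np.nan, 58000]})

dropped = df.dropna()                                   # option 1: drop observations with missing values
imputed = df.fillna(df.median(numeric_only=True))       # option 2: impute from other observations (median)
flagged = df.assign(age_missing=df["age"].isna())       # option 3: keep the nulls but flag them for later handling
print(dropped, imputed, flagged, sep="\n\n")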
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to
answer these questions as a part of basic validation:
• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory, or bring any insight to light?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?
False conclusions because of incorrect or “dirty” data can inform
poor business strategy and decision-making. False conclusions can lead to an
embarrassing moment in a reporting meeting when you realize your data
doesn’t stand up to scrutiny. Before you get there, it is important to create a
culture of quality data in your organization. To do this, you should document
the tools you might use to create this culture and what data quality means to
you.
3.7 Cleaning web log data:
As the services provided on the web are increasing day by day, the number of clickstreams captured at the server is also increasing exponentially. Every single click made by a user is recorded because it is assumed that it can be a potential source of useful information. If this clickstream is analyzed, the results can help the organization to understand the interest of each user and hence provide more personalized information to each user; the design of the website can be modified accordingly, and e-commerce websites can use this information to build their cross-marketing strategies. This analysis can be done by applying data mining techniques to the huge data captured in the log files, maintained at the server in unstructured or semi-structured format, and this is called Web Usage Mining. Prior to analysis these web logs need to be preprocessed because data in these log files is generally unstructured, incomplete, noisy and inconsistent.
Preprocessing results in a smaller volume of datasets which can be analyzed more efficiently. The various steps included in web usage mining are:
- Data Preprocessing
- Pattern Discovery
- Pattern Analysis
The data preprocessing phase includes various tasks like:
- data cleaning
- user and session identification
- path completion
In the pattern discovery phase, techniques from several research areas such as data mining, machine learning and statistics are applied to generate meaningful patterns out of the preprocessed data.
In the pattern analysis phase, uninteresting patterns are filtered out from the set of patterns obtained in the pattern discovery phase.
DATA PREPROCESSING
The information present in the web logs is generally heterogeneous
and semi-structured in nature. These log files also contain some entries which are
not of any use during analysis. So to perform analysis in a better and fruitful way
it is important to remove these undesirable entries. This will reduce the volume of
data by keeping only the useful data for analysis. The goal of preprocessing is to
transform the raw click streams into a set of user profiles [1]. Preprocessing
presents a number of challenges which led to a variety of algorithms and a
number of heuristic techniques for each step of preprocessing. Various phases of
preprocessing are discussed below:-
1.Data Cleaning
In this phase, irrelevant entries are removed from the web logs. These irrelevant entries may include:
- Accessorial entries like .jpg, .css and .png files which may have no relation with the content of the page; rather, they can be part of some advertisement embedded inside the HTML page.
- Clickstreams having a failed HTTP status, i.e. other than the 2XX series.
- Spider navigation records, which are captured when the software used by search engines periodically accesses the website to keep its index up to date.
Web logs can be used in many applications, and the steps in cleaning also vary accordingly. The points given above are appropriate when the cleaned web log is to be used for understanding the navigation pattern of the users for site modification or personalization purposes; a minimal cleaning sketch is shown below. But when the web log is to be used for intrusion detection, cleaning is performed in a different way.
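
For instance, a minimal sketch of this cleaning step in pandas, with a made-up log layout, could be:

# Minimal sketch: dropping accessorial entries and failed requests from a web log.
import pandas as pd

log = pd.DataFrame({
    "url":    ["/index.html", "/logo.png", "/style.css", "/products.html"],
    "status": [200, 200, 304, 404],
})

unwanted_ext = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js")
cleaned = log[~log["url"].str.lower().str.endswith(unwanted_ext)]   # remove image/stylesheet/script entries
cleaned = cleaned[cleaned["status"].between(200, 299)]              # keep only the 2XX series
print(cleaned)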
2.User Identification
The easiest way of identifying a user is to use login credentials such as username and password. But this information is not always provided by users, due to security concerns. Hence some heuristics are used to identify users:
• Each different IP will represent a new user.
• When IP is same but browser and operating system information is
different, this is also marked as a different user.
3. Session Identification
The clickstream of each user has to be divided into a delimited set of individual sessions. The methods to identify user sessions include the timeout mechanism and the referrer-based method. In the timeout-based method, if there is a large gap (usually more than 30 mins or 25.5 mins) between two consecutive accesses, then it is assumed that the user is starting a new session. In the referrer-based method the referrer URI is checked; if the URL in the referrer field has not been accessed previously, then it is assumed to be a new session. A minimal sketch of the timeout heuristic follows.
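
The sketch below assumes a simple log with an IP column and a timestamp column (hypothetical values).

# Minimal sketch: timeout-based session identification (a 30-minute gap starts a new session).
import pandas as pd

hits = pd.DataFrame({
    "ip":   ["1.1.1.1", "1.1.1.1", "1.1.1.1", "1.1.1.1"],
    "time": pd.to_datetime(["10:00", "10:05", "10:50", "10:55"]),
})
hits = hits.sort_values(["ip", "time"])
new_session = hits.groupby("ip")["time"].diff() > pd.Timedelta(minutes=30)
hits["session_id"] = new_session.groupby(hits["ip"]).cumsum()   # sessions 0, 0, 1, 1
print(hits)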
4.Path Completion
This is an important and difficult phase. Path completion is used
to access the complete user access path. The incomplete user access path is
recognized on the basis of user session identification. There are chances of
missing path because of proxy and caching problems.
Various algorithms can be used for path completion like Maximal
Forward Reference (MFR) and Reference Length (RL) algorithm. At the end
of this step we will get the user session file.
3.8 Normalizing Numerical Data :
This section describes techniques to normalize numeric values in
your datasets. Ideally, your source systems are configured to capture and
deliver data using a consistent set of units in a standardized structure and
format. In practice, data from multiple systems can illuminate differences in
the level of precision used in numeric data or differences in text entries that
reference the same thing.
Numeric precision
Mathematical computations are performed using 64-bit floating
point operations to 15 decimals of precision. However, due to rounding off,
truncation, and other technical factors, small discrepancies in outputs can
be expected. Example:

-636074.22

-2465086.34

Suppose you apply the following transformation:

Transformation Name: New formula
Parameter: Formula type: Single row formula
Parameter: Formula: (-636074.22 + -2465086.34)
Parameter: New column name: MySum

The expected output in the MySum column: -3101160.56
The actual output in the MySum column: -3101160.5599999996

Depending on your precision requirements, you can manage precision across your columns using a transformation like the following, which rounds off MySum to three digits:

Transformation Name: Edit column with formula
Parameter: Columns: MySum
Parameter: Formula: ROUND($col,3)
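
The same behaviour can be reproduced directly in Python, since it also performs 64-bit floating point arithmetic:

# Floating point precision and rounding, mirroring the example above.
total = -636074.22 + -2465086.34
print(total)             # prints -3101160.5599999996, not -3101160.56
print(round(total, 3))   # rounding to three digits recovers -3101160.56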


3.9 Detecting Outliers:
An outlier, in plain English, can be called the odd man out in a series of data. Outliers can be unusually and extremely different from most of the data points existing in our sample. An outlier could be a very large observation or a very small observation. Outliers can create biased results while calculating the statistics of the data due to their extreme nature, thereby affecting further statistical/ML models.
Detecting and dealing with outliers is one of the most important
phases of data cleansing.
For example, let us consider the small table of values below. Since the table is very small, one look at it gives us an idea that 10000 is an outlier. However, in real life, the data to be dealt with will be very large, and it is not an easy task to detect outliers at one look in real scenarios.

The mean of the above observations is 1307 which is higher than


most of the values in the table. We all know that mean is the arithmetic
average and generally represents that centre of the data. Here, 1307 is
nowhere near the centre of the entire data. And the culprit for this is the one
extreme observation 10000. Hence, 10000 can be termed as an outlier
which distorts the actual structure of the data.
Outliers can be univariate or multivariate.
Univariate outliers are generally referred to as extreme points
on a variable. For eg: 10000 in above example.
Multivariate outliers are generally a combination of unusual data points for two or more variables. Scatter plots are mostly used in multivariate settings, indicating the relationship between the response variable and one or more predictor variables. Sometimes an outlier may fall within the expected range of the response variable (y-axis) and the predictor variable (x-axis) but can still be an outlier, as it does not fit the model, i.e. it does not fit the regression line of the model. Contrary to univariate outliers, multivariate outliers may not necessarily be extreme data points.
The simplest way to detect an outlier is by graphing the features or
the data points. Visualization is one of the best and easiest ways to have an
inference about the overall data and the outliers. Scatter plots and box plots are
the most preferred visualization tools to detect outliers.
Scatter plots — Scatter plots can be used to explicitly detect when a dataset or
particular feature contains outliers.
As is clearly visible in the graph, the dependent variable "SalePrice" is concentrated mostly within the range of 0–55000 (approx.) of the feature LotArea, and the points above 150000 are very clearly outliers, as these can result in disproportionate statistics about the overall structure of the data.
• Hence, we can graph scatter plots for all the features of the dataset which we suspect may contain outliers.
Box Plots: Box plot is another very simple visualization tool to detect outliers
which use the concept of Interquartile range (IQR) technique.
In the graph below, a few of the outliers are highlighted in red circles. Here we have plotted SalePrice against LotConfiguration, based on year. Example outlier in the graph: the "Corner" LotConfig has an outlier with SalePrice greater than 700000 for the year 2007.

Histograms can also be used to identify outliers; in a histogram, the existence of outliers can be detected by isolated bars.
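
The box-plot (IQR) rule can also be applied directly in code. The sketch below uses hypothetical values in the spirit of the 10000 example above and flags any point outside Q1 - 1.5*IQR and Q3 + 1.5*IQR.

# Minimal sketch: IQR-based outlier detection.
import numpy as np

data = np.array([110, 120, 133, 145, 150, 160, 170, 10000])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print("bounds:", lower, upper)
print("outliers:", outliers)   # [10000]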
3.10 Introduction to Supervised And Unsupervised Learning:
Supervised learning:
Supervised learning, as the name indicates, has the presence of
a supervisor as a teacher. Basically supervised learning is when we teach or
train the machine using data that is well labelled. Which means some data is
already tagged with the correct answer. After that, the machine is provided
with a new set of examples(data) so that the supervised learning algorithm
analyses the training data(set of training examples) and produces a correct
outcome from labelled data.
Supervised learning is classified into two categories of algorithms:
Classification: A classification problem is when the output variable is a
category, such as “Red” or “blue” , “disease” or “no disease”.
Regression: A regression problem is when the output variable is a real
value, such as “dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies
that some data is already tagged with the correct answer.
Types:-
• Regression
• Logistic Regression
• Classification
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine
Advantages:-
• Supervised learning allows collecting data and produces data output from
previous experiences.
• Helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world
computation problems.
Disadvantages:-
• Classifying big data can be challenging.
• Training for supervised learning needs a lot of computation time. So, it
requires a lot of time.

Unsupervised learning
Unsupervised learning is the training of a machine using
information that is neither classified nor labeled and allowing the algorithm to
act on that information without guidance. Here the task of the machine is to
group unsorted information according to similarities, patterns, and differences
without any prior training of data.
Unlike supervised learning, no teacher is provided that means no training will
be given to the machine. Therefore the machine is restricted to find the hidden
structure in unlabeled data by itself.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that
buy X also tend to buy Y.
Types of Unsupervised Learning:-
Clustering
1.Exclusive (partitioning)
2.Agglomerative
3.Overlapping
4.Probabilistic
Clustering Types:-
• Hierarchical clustering
• K-means clustering
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
Supervised vs. Unsupervised Machine Learning

Parameters                 Supervised machine learning                         Unsupervised machine learning
Input Data                 Algorithms are trained using labeled data.          Algorithms are used against data that is not labeled.
Computational Complexity   Simpler method                                      Computationally complex
Accuracy                   Highly accurate                                     Less accurate
No. of classes             No. of classes is known                             No. of classes is not known
Data Analysis              Uses offline analysis                               Uses real-time analysis of data
Algorithms used            Linear and Logistic regression, Random forest,      K-Means clustering, Hierarchical clustering,
                           Support Vector Machine, Neural Network, etc.        Apriori algorithm, etc.
Reinforcement Learning:
Reinforcement learning is a machine learning training
method based on rewarding desired behaviours and/or punishing undesired
ones. In general, a reinforcement learning agent is able to perceive and
interpret its environment, take actions and learn through trial and error.
How does reinforcement learning work?
In reinforcement learning, developers devise a method of
rewarding desired behaviours and punishing negative behaviours. This
method assigns positive values to the desired actions to encourage the
agent and negative values to undesired behaviours. This programs the
agent to seek long-term and maximum overall reward to achieve an optimal
solution.
These long-term goals help prevent the agent from stalling on
lesser goals. With time, the agent learns to avoid the negative and seek the
positive. This learning method has been adopted in artificial intelligence (AI)
as a way of directing unsupervised machine learning through rewards and
penalties.
Difference between Reinforcement learning and Supervised learning:

Reinforcement learning: Reinforcement learning is all about making decisions sequentially. In simple words, we can say that the output depends on the state of the current input, and the next input depends on the output of the previous input. Decisions are dependent, so we give labels to sequences of dependent decisions. Example: a chess game.

Supervised learning: In supervised learning, the decision is made on the initial input or the input given at the start. Decisions are independent of each other, so labels are given to each decision. Example: object recognition.

Types of Reinforcement: There are two types of Reinforcement:
Positive:
Positive Reinforcement is defined as when an event, occurring due to a particular behaviour, increases the strength and the frequency of that behaviour. In other words, it has a positive effect on behaviour.
Advantages:
• Maximizes performance
• Sustains change for a long period of time
Drawback: too much reinforcement can lead to an overload of states, which can diminish the results.
Negative:
Negative Reinforcement is defined as the strengthening of behaviour because a negative condition is stopped or avoided.
Advantages:
• Increases behaviour
• Provides defiance to a minimum standard of performance
Drawback: it only provides enough to meet the minimum behaviour.
Various Practical applications of Reinforcement Learning
• RL can be used in robotics for industrial automation.
• RL can be used in machine learning and data processing
• RL can be used to create training systems that provide custom instruction
and materials according to the requirement of students.
Common reinforcement learning algorithms:
State-action-reward-state-action (SARSA). This reinforcement learning
algorithm starts by giving the agent what's known as a policy. The policy is
essentially a probability that tells it the odds of certain actions resulting in
rewards, or beneficial states.
Q-learning: This approach to reinforcement learning takes the opposite
approach. The agent receives no policy, meaning its exploration of its
environment is more self-directed.
Deep Q-Networks: These algorithms utilize neural networks in addition to
reinforcement learning techniques. They utilize the self-directed environment
exploration of reinforcement learning. Future actions are based on a random
sample of past beneficial actions learned by the neural network.
3.11 Dealing with Real World Data
In supervised machine learning, you use data to teach automated
systems how to make accurate decisions. ML algorithms are designed to
discover patterns and associations in historical training data; they learn from
that data and encode that learning into a model to accurately predict a data
attribute of importance for new data. Training data, therefore, is fundamental in
the pursuit of machine learning. With high-quality data, subtle nuances and
correlations can be accurately captured and high-fidelity predictive systems can
be built. But if training data is of poor quality, the efforts of even the best ML
algorithms may be rendered useless.
3.12 Machine Learning Algorithms:

3 types of Machine Learning Algorithms:


1. Supervised Learning
How it works: This algorithm consists of a target/outcome variable
(or dependent variable) which is to be predicted from a given set of predictors
(independent variables). Using this set of variables, we generate a function that
maps inputs to desired outputs. The training process continues until the model
achieves a desired level of accuracy on the training data. Examples of
Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic
Regression etc.
2. Unsupervised Learning
How it works: In this algorithm, we do not have any target or
outcome variable to predict / estimate. It is used for clustering populations in
different groups, which is widely used for segmenting customers into different
groups for specific interventions. Examples of Unsupervised Learning: Apriori
algorithm, K-means.
3. Reinforcement Learning:
How it works: Using this algorithm, the machine is trained to make
specific decisions. It works this way: the machine is exposed to an environment
where it trains itself continually using trial and error. This machine learns from
past experience and tries to capture the best possible knowledge to make
accurate business decisions. Example of Reinforcement Learning: Markov
Decision Process.
List of Common Machine Learning Algorithms:

Here is the list of commonly used machine learning algorithms.


These algorithms can be applied to almost any data problem:
• Linear Regression
• Logistic Regression
• Decision Tree
• SVM
• Naive Bayes
• kNN
• K-Means
• Random Forest
• Dimensionality Reduction Algorithms
• Gradient Boosting algorithms
a) GBM
b) XGBoost
c) LightGBM
d) CatBoost
Logistic Regression
Don’t get confused by its name! It is a classification algorithm, not a
regression algorithm. It is used to estimate discrete values (binary values
like 0/1, yes/no, true/false) based on a given set of independent
variable(s). In simple words, it predicts the probability of occurrence of an
event by fitting data to a logit function. Hence, it is also known as logit
regression. Since it predicts the probability, its output values lie between 0
and 1 (as expected).
Again, let us try and understand this through a simple example.
Let’s say your friend gives you a puzzle to solve. There are only 2
outcome scenarios – either you solve it or you don’t. Now imagine, that you
are being given a wide range of puzzles/quizzes in an attempt to understand
which subjects you are good at. The outcome of this study would be
something like this – if you are given a trigonometry-based tenth-grade
problem, you are 70% likely to solve it. On the other hand, if it is a
fifth-grade history question, the probability of getting an answer is only 30%. This
is what Logistic Regression provides you.
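As an illustration of fitting data to a logit function, the following sketch uses scikit-learn's LogisticRegression on a tiny, made-up "hours practised vs. puzzle solved" dataset; the numbers are invented purely for demonstration.
```python
# A minimal logistic-regression sketch with scikit-learn.
# The "hours practised vs. puzzle solved" data below is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # single feature
solved = np.array([0, 0, 0, 1, 0, 1, 1, 1])                  # binary outcome

model = LogisticRegression()
model.fit(hours, solved)

# predict_proba returns [P(not solved), P(solved)] for each input
print(model.predict_proba([[4.5]]))   # probability of solving after 4.5 hours
print(model.predict([[4.5]]))         # hard 0/1 decision at the 0.5 threshold
```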
Decision Tree:
This is a widely used algorithm. It is a type of supervised learning algorithm that is mostly used
for classification problems. Surprisingly, it works for both categorical and
continuous dependent variables. In this algorithm, we split the population
into two or more homogeneous sets. This is done based on the most
significant attributes/ independent variables to make as distinct groups as
possible.
For example, such a tree can classify a population into different groups based
on multiple attributes to identify ‘whether they will play or not’. To split the
population into groups that are as distinct as possible, it uses techniques like
Gini, Information Gain, Chi-square, and entropy.
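A minimal scikit-learn sketch of such a split is shown below; the encoded weather data, the Gini criterion and the depth of 2 are illustrative choices, not the course's dataset.
```python
# A minimal decision-tree sketch (scikit-learn) for a "play or not" style problem.
# The encoded weather data below is a made-up illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [outlook (0=sunny, 1=overcast, 2=rainy), humidity (0=normal, 1=high)]
X = [[0, 1], [0, 1], [1, 1], [2, 1], [2, 0], [1, 0], [0, 0], [2, 1]]
y = [0, 0, 1, 1, 1, 1, 1, 0]          # 1 = play, 0 = don't play

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["outlook", "humidity"]))  # the learned splits
print(tree.predict([[1, 0]]))         # overcast + normal humidity -> play?
```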
SVM (Support Vector Machine)
It is a classification method. In this algorithm, we plot each data item as a
point in n-dimensional space (where n is the number of features you have), with
the value of each feature being the value of a particular coordinate.
For example, if we only had two features, like the height and hair length of an
individual, we would first plot these two variables in two-dimensional space,
where each point has two coordinates. The points that lie closest to the
separating boundary are known as support vectors.
Now, we find a line that splits the data between the two differently
classified groups. The best such line is the one for which the distance from the
closest point in each of the two groups (the margin) is the largest.
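A minimal linear-SVM sketch along these lines, with made-up height and hair-length values, might look as follows; the support_vectors_ attribute exposes the points closest to the separating line.
```python
# A minimal linear SVM sketch (scikit-learn) on two made-up features,
# height (cm) and hair length (cm); the numbers are purely illustrative.
from sklearn.svm import SVC

X = [[150, 30], [155, 25], [160, 28], [175, 5], [180, 3], [185, 7]]
y = [0, 0, 0, 1, 1, 1]                # two classes separated in feature space

clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)           # the points closest to the separating line
print(clf.predict([[170, 10]]))       # classify a new individual
```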
Naive Bayes:
It is a classification technique based on Bayes’ theorem with an
assumption of independence between predictors. In simple terms, a Naive
Bayes classifier assumes that the presence of a particular feature in a class is
unrelated to the presence of any other feature. For example, a fruit may be
considered to be an apple if it is red, round, and about 3 inches in diameter.
Even if these features depend on each other or upon the existence of the
other features, a naive Bayes classifier would consider all of these properties
to independently contribute to the probability that this fruit is an apple.
The Naive Bayesian model is easy to build and particularly useful
for very large data sets. Along with simplicity, Naive Bayes is known to
outperform even highly sophisticated classification methods.
Bayes theorem provides a way of calculating posterior probability
P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:
P(c|x) = [P(x|c) × P(c)] / P(x)
Here,
• P(c|x) is the posterior probability of class (target)
given predictor (attribute).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood which is the probability of predictor given class.
• P(x) is the prior probability of predictor.
Example: Let’s understand it using an example. Below I have a training
data set of weather and the corresponding target variable ‘Play’. Now, we
need to classify whether players will play or not based on weather
conditions. Let’s follow the below steps to perform it.
Step 1: Convert the data set to a frequency table.
Step 2: Create a Likelihood table by finding the probabilities like Overcast
probability = 0.29 and probability of playing is 0.64.
Step 3: Now, use the Naive Bayesian equation to calculate the posterior
probability for each class. The class with the highest posterior probability is
the outcome of the prediction.
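The three steps can be sketched in pandas as below; the ten-row weather table is a reduced, made-up illustration (the 0.29 and 0.64 figures quoted above come from the classic 14-row weather dataset), so the printed probabilities will differ.
```python
# A sketch of the three Naive Bayes steps on a small, made-up weather/Play table.
import pandas as pd

data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast",
                "Sunny", "Rainy", "Overcast", "Sunny"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "No"],
})

# Step 1: frequency table of Outlook vs Play
freq = pd.crosstab(data["Outlook"], data["Play"])

# Step 2: likelihoods P(Outlook | Play) and priors P(Play)
likelihood = freq.div(freq.sum(axis=0), axis=1)     # divide each column by its total
prior = data["Play"].value_counts(normalize=True)

# Step 3: posterior (up to the constant P(x)) for a Sunny day, then normalise
posterior = likelihood.loc["Sunny"] * prior
print(posterior / posterior.sum())                  # P(Play | Sunny) for Yes and No
```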
kNN (k- Nearest Neighbors):
It can be used for both classification and regression problems.
However, it is more widely used in classification problems in the industry. K
nearest neighbors is a simple algorithm that stores all available cases and
classifies new cases by a majority vote of its k neighbors. A new case is
assigned to the class that is most common amongst its K nearest neighbors, as
measured by a distance function.
These distance functions can be Euclidean, Manhattan,
Minkowski and Hamming distances. The first three are used for
continuous variables and the fourth one (Hamming) for categorical variables.
If K = 1, then the case is simply assigned to the class of its nearest
neighbor. At times, choosing K turns out to be a challenge while performing
kNN modeling.
Things to consider before selecting kNN:
• KNN is computationally expensive
• Variables should be normalized else higher range variables can bias it
• More work is needed at the pre-processing stage before applying kNN, such as
outlier and noise removal
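A minimal kNN sketch that also normalizes the variables first, in line with the note above, could look like this; the tiny dataset and the choice k = 3 are illustrative assumptions.
```python
# A minimal kNN sketch (scikit-learn); features are scaled first because kNN
# is distance-based. The tiny dataset is made up for illustration.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X = [[150, 60], [152, 58], [160, 55], [175, 10], [180, 8], [178, 12]]  # height, hair length
y = [0, 0, 0, 1, 1, 1]

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
print(knn.predict([[165, 30]]))       # majority vote of the 3 nearest neighbours
```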
K-Means:
It is a type of unsupervised algorithm which solves the
clustering problem. Its procedure follows a simple and easy way to classify
a given data set through a certain number of clusters (assume k clusters).
Data points inside a cluster are homogeneous, and heterogeneous with respect to
other clusters.
Remember figuring out shapes from ink blots? K-means is somewhat similar to
this activity: you look at the shape and spread to decipher how many different
clusters/populations are present.
How K-means forms cluster:
• K-means picks k points, known as centroids, one for each cluster.
• Each data point forms a cluster with the closest centroids i.e. k clusters.
• Finds the centroid of each cluster based on existing cluster members. Here
we have new centroids.
• As we have new centroids, repeat steps 2 and 3. Find the closest distance
for each data point from new centroids and get associated with new k-
clusters. Repeat this process until convergence occurs i.e. centroids do not
change.
How to determine the value of K:
In K-means, we have clusters and each cluster has its own
centroid. The sum of the squared differences between the centroid and the
data points within a cluster constitutes the sum-of-squares value for that
cluster. When the sum-of-squares values for all the clusters are added, it
becomes the total within-cluster sum of squares for the cluster solution.
We know that as the number of clusters increases, this value keeps
on decreasing but if you plot the result you may see that the sum of squared
distance decreases sharply up to some value of k, and then much more slowly
after that. Here, we can find the optimum number of clusters.
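This "elbow" idea can be sketched with scikit-learn as below, using synthetic blob data; KMeans.inertia_ is the within-cluster sum of squares described above, and the data-generation settings are illustrative assumptions.
```python
# A minimal K-Means "elbow" sketch: the within-cluster sum of squares
# (KMeans.inertia_) is printed for each k so the knee can be spotted.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))   # the drop flattens noticeably after k = 3
```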
Random Forest:
Random Forest is a trademarked term for an ensemble of
decision trees. In Random Forest, we have a collection of decision trees
(a so-called “forest”). To classify a new object based on attributes, each
tree gives a classification and we say the tree “votes” for that class. The
forest chooses the classification having the most votes (over all the trees in
the forest).
Each tree is planted & grown as follows:
• If the number of cases in the training set is N, then a sample of N cases
is taken at random but with replacement. This sample will be the training
set for growing the tree.
• If there are M input variables, a number m<<M is specified such that at
each node, m variables are selected at random out of the M and the best
split on this m is used to split the node. The value of m is held constant
during the forest growth.
• Each tree is grown to the largest extent possible. There is no pruning.
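A minimal random-forest sketch on a standard built-in dataset is shown below; n_estimators=100 and max_features="sqrt" are illustrative defaults (the square-root rule mirrors the m << M idea above), not tuned values.
```python
# A minimal random-forest sketch (scikit-learn) on the built-in iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each grown on a bootstrap sample with a random subset of features per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))       # accuracy via majority vote of the trees
```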
3.13 Clustering:
It is basically a type of unsupervised learning method. An
unsupervised learning method is a method in which we draw references from
datasets consisting of input data without labeled responses. Generally, it is
used as a process to find meaningful structure, explanatory underlying
processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points
into a number of groups such that data points in the same groups are more
similar to other data points in the same group and dissimilar to the data
points in other groups. It is basically a collection of objects on the basis of
similarity and dissimilarity between them.
For example, data points that lie close together in a scatter plot can be
grouped into a single cluster; in a typical plot of this kind, we can distinguish
the clusters and identify, say, three of them. It is not necessary for clusters
to be spherical.
Why Clustering?
Clustering is very important as it determines the intrinsic
grouping among the unlabelled data present. There are no universal criteria for
a good clustering; it depends on the user and on the criteria that satisfy their
need. For instance, we could be interested in finding representatives for
homogeneous groups (data reduction), in finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful
and suitable groupings (“useful” data classes), or in finding unusual data
objects (outlier detection). The algorithm must make some assumptions about what
constitutes the similarity of points, and each assumption yields different but
equally valid clusters.
Clustering Methods :
Density-Based Methods: These methods consider clusters as dense regions of the
space that have some similarity and differ from the surrounding lower-density
regions. These methods have good accuracy and the ability to
merge two clusters. Example DBSCAN (Density-Based Spatial Clustering of
Applications with Noise), OPTICS (Ordering Points to Identify Clustering
Structure), etc.
Hierarchical Based Methods: The clusters formed in this method form a
tree-type structure based on the hierarchy. New clusters are formed using
the previously formed ones. It is divided into two categories:
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative
Reducing and Clustering using Hierarchies), etc.
Partitioning Methods: These methods partition the objects into k clusters,
and each partition forms one cluster. They optimize an objective criterion
(a similarity function), for example one in which distance is the major
parameter. Examples: K-means, CLARANS (Clustering Large Applications based
upon Randomized Search), etc.
Grid-based Methods: In this method, the data space is formulated into a
finite number of cells that form a grid-like structure. All the clustering
operations done on these grids are fast and independent of the number of
data objects. Examples: STING (Statistical Information Grid), WaveCluster,
CLIQUE (Clustering In Quest), etc.
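To contrast a density-based method with a hierarchical (agglomerative) one, here is a short sketch on synthetic two-moons data; the eps, min_samples and n_clusters values are illustrative assumptions.
```python
# A short sketch contrasting a density-based method (DBSCAN) with a
# hierarchical, agglomerative method on the same synthetic data.
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)          # density-based; label -1 marks noise
agg = AgglomerativeClustering(n_clusters=2).fit(X)  # bottom-up (agglomerative) hierarchy

print(set(db.labels_), set(agg.labels_))            # cluster labels found by each method
```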
Clustering Algorithms :
K-means clustering algorithm – It is the simplest unsupervised
learning algorithm that solves the clustering problem. The K-means algorithm
partitions n observations into k clusters where each observation belongs to the
cluster with the nearest mean serving as a prototype of the cluster.
Applications of Clustering in different fields:
Marketing: It can be used to characterize & discover customer segments for
marketing purposes.
Biology: It can be used for classification among different species of plants
and animals.
Libraries: It is used in clustering different books on the basis of topics and
information.
Insurance: It is used to acknowledge the customers, their policies and
identifying the frauds.
City Planning: It is used to make groups of houses and to study their values
based on their geographical locations and other factors present.
Earthquake studies: By learning the earthquake-affected areas we can
determine the dangerous zones.
Assignments
Q. No. 1 (CO3, K3): Consider a training data set of weather and the
corresponding target variable ‘Play’. Classify whether players will play or
not based on the weather conditions, and write the steps to perform it.
Part-A Questions and Answers
1. What is a multivariate regression model? (CO3,K3)
Multivariable regression models are machine learning algorithms
designed to determine the statistical relationship between one dependent
variable and multiple independent variables.
2. What is the use of multivariate regression? (CO3,K3)
Multivariate regression models find ample use in research studies
for more efficient analysis of data. They are usually applied where there are
multiple independent variables or features present.
3. Which are the two most common multivariate analysis methods?
(CO3,K3)
The two main multivariate analysis methods are common factor
analysis and principal component analysis.
4. What are the disadvantages of the linear regression model?
(CO3,K3)
One of the most significant demerits of the linear model is that it
is sensitive and dependent on the outliers. It can affect the overall result.
Another notable demerit of the linear model is overfitting. Similarly,
underfitting is also a significant disadvantage of the linear model.
5. What is a Linear Regression? (CO3,K3)
In simple terms, linear regression is adopting a linear approach
to modeling the relationship between a dependent variable (scalar response)
and one or more independent variables (explanatory variables). In case you
have one explanatory variable, you call it a simple linear regression. In case
you have more than one independent variable, you refer to the process as
multiple linear regressions.
6.What are the possible ways of improving the accuracy of a linear
regression model? (CO3,K3)
There could be multiple ways of improving the accuracy of a linear
regression, most commonly used ways are as follows:
Outlier Treatment:
- Regression is sensitive to outliers, hence it becomes very important
to treat the outliers with appropriate values. Replacing the values with mean,
median, mode or percentile depending on the distribution can prove to be
useful.
7. What are the disadvantages of the linear model? (CO3,K2)
– Linear regression is sensitive to outliers which may affect the result.
– Over-fitting
– Under-fitting
8. What is a bias-variance trade-off? (CO3,K2)
Bias: Bias is an error introduced in your model due to the oversimplification of
the machine learning algorithm. It can lead to underfitting. When you train your
model at that time model makes simplified assumptions to make the target
function easier to understand.
Low bias machine learning algorithms: Decision Trees, k-NN and SVM.
High bias machine learning algorithms: Linear Regression, Logistic Regression.
Variance: Variance is an error introduced in your model due to an overly complex
machine learning algorithm; the model learns noise from the training data set
and performs badly on the test data set. It can lead to high sensitivity and
overfitting. Normally, as you increase the complexity of your model, you will
see a reduction in error due to lower bias in the model. However, this only
happens up to a particular point. As you continue to make your model more
complex, you end up over-fitting it, and your model will start suffering from
high variance.
9. What are the differences between over-fitting and under-fitting?
(CO3,K2)
In statistics and machine learning, one of the most common
tasks is to fit a model to a set of training data, so as to be able to make
reliable predictions on general untrained data.
In overfitting, a statistical model describes random error or noise instead
of the underlying relationship. Overfitting occurs when a model is
excessively complex, such as having too many parameters relative to the
number of observations. A model that has been overfitted has poor
predictive performance, as it overreacts to minor fluctuations in the training
data.
Underfitting occurs when a statistical model or machine learning algorithm
cannot capture the underlying trend of the data. Underfitting would occur,
for example, when fitting a linear model to non-linear data. Such a model
too would have poor predictive performance.
10. How does data cleaning play a vital role in the analysis?
(CO3,K2)
Data cleaning can help in analysis because:
Cleaning data from multiple sources helps to transform it into a
format that data analysts or data scientists can work with.
Data Cleaning helps to increase the accuracy of the model in machine
learning.
It is a cumbersome process because as the number of data
sources increases, the time taken to clean the data increases exponentially
due to the number of sources and the volume of data generated by these
sources.
It might take up to 80% of the time for just cleaning data
making it a critical part of the analysis task.
11. Explain cross-validation. (CO3,K2)
Cross-validation is a model validation technique for evaluating how the
outcomes of statistical analysis will generalize to an independent
dataset. It is mainly used in settings where the objective is forecasting and one
wants to estimate how accurately a model will perform in practice.
The goal of cross-validation is to set aside a portion of the data to test the
model during the training phase (i.e. a validation data set), in order to limit
problems like overfitting and to get an insight into how the model will
generalize to an independent data set.
12. What is data cleaning? (CO3,K2)
Data cleaning is the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data within a
dataset. When combining multiple data sources, there are many opportunities
for data to be duplicated or mislabeled. If data is incorrect, outcomes and
algorithms are unreliable, even though they may look correct. There is no one
absolute way to prescribe the exact steps in the data cleaning process because
the processes will vary from dataset to dataset. But it is crucial to establish a
template for your data cleaning process so you know you are doing it the right
way every time.
13. List methods of data cleaning. (CO3,K2)
• Remove duplicates.
• Remove irrelevant data.
• Standardize capitalization.
• Convert data type.
• Clear formatting.
• Fix errors.
• Language translation.
• Handle missing values
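A small pandas sketch of a few of these steps is given below; the column names and records are made up purely for illustration.
```python
# A small pandas sketch of a few data-cleaning steps on made-up records.
import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha", "asha", "Ravi", None],
    "city":  ["Chennai ", "chennai", "Madurai", "Chennai"],
    "score": ["85", "85", "ninety", "70"],
})

df["name"] = df["name"].str.strip().str.title()            # standardize capitalization
df["city"] = df["city"].str.strip().str.title()
df["score"] = pd.to_numeric(df["score"], errors="coerce")  # convert data type; bad values -> NaN
df = df.drop_duplicates()                                  # remove duplicates
df["score"] = df["score"].fillna(df["score"].median())     # handle missing values
df = df.dropna(subset=["name"])                            # drop rows with no usable name
print(df)
```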
14.What are the types of outliers? (CO3,K2)
In statistics and data science, there are three generally accepted
categories which all outliers fall into:
Type 1: Global Outliers (aka Point Anomalies)
Type 2: Contextual Outliers (aka Conditional Anomalies)
Type 3: Collective Outliers
15. What is an outlier? Give an example. (CO3,K2)
A value that "lies outside" (is much smaller or larger than) most of
the other values in a set of data. For example in the scores
25,29,3,32,85,33,27,28 both 3 and 85 are "outliers".
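One common rule of thumb (not the only one) is the 1.5 × IQR rule; applied to the scores above it flags 3 and 85, as the following sketch shows.
```python
# A quick IQR-based outlier check applied to the scores mentioned above.
import numpy as np

scores = np.array([25, 29, 3, 32, 85, 33, 27, 28])
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(scores[(scores < low) | (scores > high)])   # -> [ 3 85]
```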
16. What is Machine Learning? (CO3,K2)
Machine Learning algorithms may access data (categorical,
numerical, image, video, or anything else) and use it to learn for themselves
without any explicit programming. Machine learning works simply by inspecting
the data or facts, observing the patterns in them and making decisions or
predictions.
17. Types of Machine Learning. (CO3,K2)
Algorithmic approaches for machine learning may be divided into
three categories.
1. Supervised Machine Learning – Task-Oriented (Classification – Regression)
2. Unsupervised Machine Learning – Fact or Data-Oriented (Cluster – Anomaly
detection)
3. Reinforcement Machine Learning – learning by trial and error, i.e. from
rewards and mistakes.
18. What is Supervised Machine Learning? (CO3,K2)
In supervised learning, algorithms are trained using labelled
datasets, and the algorithm learns about each category of input. The
approach is evaluated using test data (a subset of the training set) and
predicts the outcome when the training phase is over. Supervised machine
learning is classified into two types:
1. Classification
2. Regression
19. Why do we need Regression? (CO3,K2)
Regression analysis is frequently used for one of two purposes:
forecasting the value of the dependent variable for those who have knowledge
of the explanatory components, or assessing the influence of an explanatory
variable on the dependent variable.
20.What is Polynomial Regression ? (CO3,K2)
Polynomial Regression is a form of linear regression in which
the relationship between the independent variable x and dependent variable y
is modelled as an nth degree polynomial. Polynomial regression fits a
nonlinear relationship between the value of x and the corresponding
conditional mean of y, denoted E(y|x).
21. What is reinforcement learning? (CO3,K2)
Reinforcement learning is a machine learning training
method based on rewarding desired behaviours and/or punishing undesired
ones. In general, a reinforcement learning agent is able to perceive and
interpret its environment, take actions and learn through trial and error.
22. What is Clustering? (CO3,K2)
Clustering is the task of dividing the population or data points into a
number of groups such that data points in the same group are more similar to
each other and dissimilar to the data points in other groups. It is basically
a collection of objects on the basis of similarity and dissimilarity between
them.
23. Differentiate Supervised and Unsupervised Machine Learning.
• Input Data: Supervised learning – algorithms are trained using labeled data;
Unsupervised learning – algorithms are used against data that is not labeled.
• Computational Complexity: Supervised – simpler method; Unsupervised –
computationally complex.
• Accuracy: Supervised – highly accurate; Unsupervised – less accurate.
• No. of classes: Supervised – the number of classes is known; Unsupervised –
the number of classes is not known.
• Data Analysis: Supervised – uses offline analysis; Unsupervised – uses
real-time analysis of data.
• Algorithms used: Supervised – Linear and Logistic Regression, Random Forest,
Support Vector Machine, Neural Network, etc.; Unsupervised – K-Means
clustering, Hierarchical clustering, Apriori algorithm, etc.
24. Types of Clustering Algorithm. (CO3,K2)
• Hierarchical clustering
• K-means clustering
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
25. Write the general procedure k-Fold Cross-Validation (CO3,K2)
• Shuffle the dataset randomly.
• Split the dataset into k groups
• For each unique group:
a) Take the group as a hold out or test data set
b) Take the remaining groups as a training data set
c) Fit a model on the training set and evaluate it on the test set
d) Retain the evaluation score and discard the model
• Summarize the skill of the model using the sample of model evaluation
scores
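A minimal sketch of this procedure with scikit-learn's KFold is given below; the iris dataset and the logistic-regression model are illustrative choices.
```python
# A minimal k-fold cross-validation sketch following the steps above.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle, then split into k groups

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)          # fit a fresh model on each training fold
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(np.mean(scores), np.std(scores))                 # summarize the k evaluation scores
```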
26. Difference between Reinforcement learning and Supervised
learning: (CO3,K2)
• Reinforcement learning is all about making decisions sequentially; in simple
words, the output depends on the state of the current input, and the next
input depends on the output of the previous input. In supervised learning,
the decision is made on the initial input or the input given at the start.
• In reinforcement learning, decisions are dependent, so we give labels to
sequences of dependent decisions. In supervised learning, the decisions are
independent of each other, so labels are given to each decision.
• Example: a chess game (reinforcement learning) versus object recognition
(supervised learning).
27.List Various Practical applications of Reinforcement Learning.
(CO3,K2)
• RL can be used in robotics for industrial automation.
• RL can be used in machine learning and data processing
• RL can be used to create training systems that provide custom instruction
and materials according to the requirement of students.
• RL can be used in large environments in the following situations:
- A model of the environment is known, but an analytic solution is not
available;
- Only a simulation model of the environment is given (the subject of
simulation-based optimization);
- The only way to collect information about the environment is to interact
with it.
28. List of Common Machine Learning Algorithms. (CO3,K2)
• Linear Regression
• Logistic Regression
• Decision Tree
• SVM
• Naive Bayes
• kNN
• K-Means
• Random Forest
• Dimensionality Reduction Algorithms
• Gradient Boosting algorithms
GBM
XGBoost
LightGBM
CatBoost
Part-B Questions
1. Explain in detail about Linear Regression. (CO3, K3)
2. Explain in detail about Polynomial Regression and mention its advantages
and disadvantages. (CO3, K3)
3. Explain in detail about Multivariate Regression and the processes involved
in multivariate regression analysis. (CO3, K3)
4. Explain in detail about the bias/variance trade-off and the overfitting and
underfitting cases in supervised learning. (CO3, K2)
5. Discuss in detail K-fold cross validation and its procedure with a data
sample. (CO3, K3)
6. Discuss in detail data cleaning and several methods for performing data
normalization. (CO3, K3)
7. Explain in detail supervised and unsupervised machine learning with
examples. (CO3, K2)
8. Explain cleaning web log data with a step-by-step procedure. (CO3, K2)
9. What is a reinforcement algorithm? Explain in detail the common
reinforcement learning algorithms. (CO3, K2)
10. Discuss in detail about machine learning algorithms. (CO3, K2)
11. Explain in detail about detecting outliers with an example. (CO3, K2)
12. Discuss in detail about clustering. (CO3, K4)
Supportive Online Courses
1. Big Data Computing – Swayam
2. Python for Data Science – Swayam
3. Applied Machine Learning for Python – Coursera
4. R Programming – Coursera
REAL TIME APPLICATIONS IN DAY TO DAY LIFE
Working of Recommender Systems
Title : How Recommender Systems Work (Netflix/Amazon)
Description : During the last few decades, with the rise of YouTube, Amazon,
Netflix and many other such web services, recommender systems have taken more
and more place in our lives. From e-commerce to online advertisement,
recommender systems are today unavoidable in our daily online journeys.
In a very general way, recommender systems are algorithms aimed at suggesting
relevant items to users. Recommender systems are critical in some industries,
as they can generate a huge amount of income when they are efficient, and can
also be a way to stand out significantly from competitors.
Watch this YouTube video, which explains the techniques behind different
recommendation systems with real-time applications:
https://www.youtube.com/watch?v=n3RKsY2H-NE
Food Recommendations @ Swiggy
Title : Our experiments with food recommendations @Swiggy – Nitin Hardeniya
Description : Food is a very personal choice. Swiggy is obsessed with customer
experience and wants to make food discovery on the platform seamless and a
delight for the consumer. So, when you open the Swiggy app, the system uses
consumers’ implicit/explicit feedback to figure out your taste preferences,
price affinity, single/group orders, and breakfast/late-night cravings, and
provides a convenient, simple but highly personalized food-ordering experience.
Watch this YouTube video on the food recommendation experience shared by
Nitin, Senior Data Scientist @Swiggy:
https://www.youtube.com/watch?v=fRXWHJlgmHA
Mini Project Suggestions
Analyze Your Personal Facebook Posting Habits — Are you spending too
much time posting on Facebook? The numbers don’t lie, and you can find
them in this beginner-to-intermediate Python data project. Implementation
of machine learning algorithms. (CO3, K4)
Text & Reference Books
1. EMC Education Services, "Data Science and Big Data Analytics: Discovering,
Analyzing, Visualizing and Presenting Data", Wiley Publishers, 2015.
(Text Book)
2. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets",
Cambridge University Press, 2012. (Text Book)
3. "An Introduction to Statistical Learning: with Applications in R"
(Springer Texts in Statistics), Hardcover, 2017. (Text Book)
4. Dietmar Jannach and Markus Zanker, "Recommender Systems: An Introduction",
Cambridge University Press, 2010. (Reference Book)
5. Kim H. Pries and Robert Dunnigan, "Big Data Analytics: A Practical Guide
for Managers", CRC Press, 2015. (Reference Book)
6. Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with MapReduce",
Synthesis Lectures on Human Language Technologies, Vol. 3, No. 1,
Pages 1-177, Morgan & Claypool Publishers, 2010. (Reference Book)
Thank you