20IT503 - Big Data Analytics - Unit3
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only for the respective group /
learning community. If you are not the addressee, you should not
disseminate, distribute, or copy it through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
it from your system. If you are not the intended recipient, you are
notified that disclosing, copying, distributing, or taking any action in reliance on
the contents of this information is strictly prohibited.
20IT503
Big Data Analytics
Department: IT
Batch/Year: 2020-24/ III
Created by: K.Selvi AP/IT
Date: 30.07.2022
Table of Contents

1. Contents
2. Course Objectives
5. Course Outcomes
7. Lecture Plan
10. Assignments
12. Part B Questions
Pre Requisites

CO – PO/PSO Mapping

| CO  | PO1 | PO2 | PO3 | PO4 | PO5 | PO6 | PO7 | PO8 | PO9 | PO10 | PO11 | PO12 | PSO1 | PSO2 | PSO3 |
| CO1 | 2 | 3 | 3 | 3 | 3 | 1 | 1 | - | 1 | 2 | 1 | 1 | 2 | 2 | 2 |
| CO2 | 2 | 3 | 2 | 3 | 3 | 1 | 1 | - | 1 | 2 | 1 | 1 | 2 | 2 | 2 |
| CO3 | 2 | 3 | 2 | 3 | 3 | 1 | 1 | - | 1 | 2 | 1 | 1 | 2 | 2 | 2 |
| CO4 | 2 | 3 | 2 | 3 | 3 | 1 | 1 | - | 1 | 2 | 1 | 1 | 2 | 2 | 2 |
| CO5 | 2 | 3 | 2 | 3 | 3 | 1 | 1 | - | 1 | 2 | 1 | 1 | 1 | 1 | 1 |
| CO6 | 2 | 3 | 2 | 3 | 3 | 1 | 1 | - | 1 | 2 | 1 | 1 | 1 | 1 | 1 |
Lecture Plan
UNIT – III

| S.No | Topics | No. of Periods | Proposed Date | Actual Date | Pertaining CO | Taxonomy Level | Mode of Delivery |
| 1 | Linear Regression, Polynomial Regression | 1 | | | CO3 | K4 | Chalk & Board |
| 2 | Multivariate Regression, Bias/Variance Trade-Off | 1 | | | CO3 | K4 | Chalk & Board |
| 4 | Normalizing Numerical Data – Detecting Outliers | 1 | | | CO3 | K4 | Chalk & Board |
| 5 | Introduction to Supervised and Unsupervised Learning | 1 | | | CO3 | K4 | Chalk & Board |
| 6 | Reinforcement Learning | 1 | | | CO3 | K4 | Chalk & Board |
| 7 | Dealing with Real World Data | 1 | | | CO3 | K4 | Chalk & Board |
| 8 | Machine Learning Algorithms | 1 | | | CO3 | K4 | Chalk & Board |
| 9 | Clustering | 1 | | | CO3 | K4 | Chalk & Board |
ACTIVITY BASED LEARNING
What is bias?

Bias is the error introduced by approximating a real-world problem with an overly simple model; a high-bias model underfits the data. Variance is the error due to the model's sensitivity to the particular training sample; a high-variance model overfits.

[Figure: Total Error versus model complexity, showing the bias-variance trade-off]

To build a good model, we need to find a good balance between bias
and variance such that it minimizes the total error.
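In symbols (a standard decomposition, not shown on the original slide), the expected squared error of a model f̂ at a point x breaks into three parts:

E[(y - \hat{f}(x))^2] = (\mathrm{Bias}[\hat{f}(x)])^2 + \mathrm{Var}[\hat{f}(x)] + \sigma^2

where the bias measures how far the average prediction is from the truth, the variance measures how much the prediction fluctuates across training sets, and σ² is the irreducible noise. Simple models tend to have high bias and low variance; complex models the reverse, which is why the total error curve is U-shaped.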
3.5 K-Fold Cross-Validation:
Cross-validation is a resampling procedure used to evaluate machine
learning models on a limited data sample.
The procedure has a single parameter called k that refers to the
number of groups that a given data sample is to be split into. As such, the
procedure is often called k-fold cross-validation. When a specific value for k is
chosen, it may be used in place of k in the reference to the model, such as k=10
becoming 10-fold cross-validation.
Cross-validation is primarily used in applied machine learning to
estimate the skill of a machine learning model on unseen data. That is, to use a
limited sample in order to estimate how the model is expected to perform in
general when used to make predictions on data not used during the training of
the model.
It is a popular method because it is simple to understand and because
it generally results in a less biased or less optimistic estimate of the model skill
than other methods, such as a simple train/test split.
The general procedure is as follows:
• Shuffle the dataset randomly.
• Split the dataset into k groups.
• For each unique group:
  - Take the group as a hold-out or test data set.
  - Take the remaining groups as a training data set.
  - Fit a model on the training set and evaluate it on the test set.
  - Retain the evaluation score and discard the model.
• Summarize the skill of the model using the sample of model evaluation scores.
Importantly, each observation in the data sample is assigned to an
individual group and stays in that group for the duration of the procedure. This
means that each sample is given the opportunity to be used in the hold out set 1
time and used to train the model k-1 times.
This approach involves randomly dividing the set of observations into
k groups, or folds, of approximately equal size. The first fold is treated as a
validation set, and the method is fit on the remaining k − 1 folds.
Configuration of K:
The k value must be chosen carefully for your data sample.
A poorly chosen value for k may give a misleading picture of the
skill of the model, such as a score with high variance (one that changes a
lot depending on the data used to fit the model) or high bias (such as an
overestimate of the skill of the model).
Three common tactics for choosing a value for k are as follows:
Representative: The value for k is chosen such that each train/test group of
data samples is large enough to be statistically representative of the broader
dataset.
k=10: The value for k is fixed to 10, a value that has been found through
experimentation to generally result in a model skill estimate with low bias and
modest variance.
k=n: The value for k is fixed to n, where n is the size of the dataset to give each
test sample an opportunity to be used in the hold out dataset. This approach is
called leave-one-out cross-validation.
The choice of k is usually 5 or 10, but there is no formal rule. As k
gets larger, the difference in size between the training set and the resampling
subsets gets smaller. As this difference decreases, the bias of the technique
becomes smaller.
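In practice, once k is chosen we rarely loop over the folds by hand. A minimal sketch using scikit-learn's cross_val_score helper (the model and synthetic dataset here are illustrative assumptions, not part of the original notes):

# 10-fold cross-validation with scikit-learn's high-level helper
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic data, assumed for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
model = LogisticRegression()

# cv=10 -> 10-fold cross-validation; each fold serves as the test set once
scores = cross_val_score(model, X, y, cv=10)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))

Passing cv equal to the number of samples reproduces leave-one-out cross-validation as described above.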
Worked Example:
To make the cross-validation procedure concrete, let’s look at a
worked example.
Imagine we have a data sample with 6 observations:

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

With k=3, the shuffled sample is split into three folds of two observations each.
We can then make use of the sample, such as to evaluate the skill of
a machine learning algorithm.
Three models are trained and evaluated with each fold given a
chance to be the held out test set.
For example:
Model1: Trained on Fold1 + Fold2, Tested on Fold3
Model2: Trained on Fold2 + Fold3, Tested on Fold1
Model3: Trained on Fold1 + Fold3, Tested on Fold2
The models are then discarded after they are evaluated as they have
served their purpose.
The skill scores are collected for each model and summarized for use.
Cross-Validation API:
We do not have to implement k-fold cross-validation manually. The
scikit-learn library provides an implementation that will split a given data sample
up.
The KFold() scikit-learn class can be used. It takes as arguments the
number of splits, whether or not to shuffle the sample, and the seed for
the pseudorandom number generator used prior to the shuffle.
For example, we can create an instance that splits a dataset into 3
folds, shuffles prior to the split, and uses a value of 1 for the pseudorandom
number generator.
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
The split() function can then be called on the class where the data
sample is provided as an argument. Called repeatedly, the split will return each
group of train and test sets. Specifically, arrays are returned containing the
indexes into the original data sample of observations to use for train and test
sets on each iteration.
For example, we can enumerate the splits of the indices for a data
sample using the created KFold instance as follows
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
We can tie all of this together with our small dataset used in the
worked example of the prior section.
# scikit-learn k-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold
# data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
# prepare cross-validation with 3 folds
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
Running the example prints the specific observations chosen for each train
and test set. The indices are used directly on the original data array to retrieve the
observation values.
train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
Normalizing numerical data: a Box-Cox transformation reshapes skewed, strictly positive data so that it looks closer to a normal distribution. A minimal runnable version of the snippet (the imports and the right-skewed example data are assumptions added for illustration):

# Box-Cox normalization of skewed data
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# example data: exponential (right-skewed) and strictly positive,
# as Box-Cox requires positive values (assumed for illustration)
original_data = np.random.exponential(size=1000)

# boxcox returns (transformed data, fitted lambda)
normalized_data = stats.boxcox(original_data)

fig, ax = plt.subplots(1, 2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")

Output: [Figure: side-by-side histograms of the original (skewed) data and the Box-Cox normalized data]
Unsupervised learning
Unsupervised learning is the training of a machine using
information that is neither classified nor labeled and allowing the algorithm to
act on that information without guidance. Here the task of the machine is to
group unsorted information according to similarities, patterns, and differences
without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training is
given to the machine. The machine is therefore restricted to finding the hidden
structure in unlabeled data by itself.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that
buy X also tend to buy Y.
Types of Unsupervised Learning:

Clustering:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic

Commonly used unsupervised learning algorithms:
• Hierarchical clustering
• K-means clustering
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
Supervised vs. Unsupervised Machine Learning

One practical difference is computational complexity: supervised learning is the simpler method, while unsupervised learning is computationally complex.

Support Vector Machine (SVM):
Now, we find a line that splits the data between the two differently
classified groups. This will be the line for which the distance to the
closest point in each of the two groups is largest (the maximum-margin line).
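A minimal sketch of fitting such a maximum-margin line with scikit-learn (the toy 2-D data below is an assumption, not from the notes):

# Fit a linear maximum-margin classifier (SVM) on toy 2-D data
import numpy as np
from sklearn.svm import SVC

# two small, linearly separable groups (illustrative assumption)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")  # linear kernel -> separating line
clf.fit(X, y)

# the support vectors are the closest points that define the margin
print(clf.support_vectors_)
print(clf.predict([[3, 2], [7, 6]]))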
Naive Bayes:
It is a classification technique based on Bayes’ theorem with an
assumption of independence between predictors. In simple terms, a Naive
Bayes classifier assumes that the presence of a particular feature in a class is
unrelated to the presence of any other feature. For example, a fruit may be
considered to be an apple if it is red, round, and about 3 inches in diameter.
Even if these features depend on each other or upon the existence of the
other features, a naive Bayes classifier would consider all of these properties
to independently contribute to the probability that this fruit is an apple.
The Naive Bayesian model is easy to build and particularly useful
for very large data sets. Along with simplicity, Naive Bayes is known to
outperform even highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability
P(c|x) from P(c), P(x), and P(x|c). Look at the equation below:

P(c|x) = P(x|c) · P(c) / P(x)
Here,
• P(c|x) is the posterior probability of class (target)
given predictor (attribute).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood which is the probability of predictor given class.
• P(x) is the prior probability of predictor.
Example: Let's understand it using an example. Suppose we have a training
data set of weather conditions and the corresponding target variable 'Play'. Now, we
need to classify whether players will play or not based on the weather
conditions. Let's follow the steps below to perform it.
Step 1: Convert the data set to a frequency table.
Step 2: Create a likelihood table by finding the probabilities, e.g. P(Overcast) = 0.29
and P(Play = Yes) = 0.64.
Step 3: Now, use the Naive Bayesian equation to calculate the posterior
probability for each class. The class with the highest posterior probability is
the outcome of the prediction.
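A small sketch of the same idea with scikit-learn's CategoricalNB (the encoded weather data below is an illustrative assumption based on the classic play example, not the original table):

# Naive Bayes on a tiny categorical weather dataset
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# weather encoded as 0=Sunny, 1=Overcast, 2=Rainy (illustrative assumption)
X = np.array([[0], [0], [1], [2], [2], [1], [0], [2], [1]])
# target: 0 = No play, 1 = Play
y = np.array([0, 0, 1, 1, 0, 1, 1, 1, 1])

model = CategoricalNB()
model.fit(X, y)

# posterior P(Play | Overcast): the class with the highest posterior wins
print(model.predict_proba([[1]]))
print(model.predict([[1]]))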
kNN (k- Nearest Neighbors):
It can be used for both classification and regression problems.
However, it is more widely used in classification problems in the industry. K
nearest neighbors is a simple algorithm that stores all available cases and
classifies new cases by a majority vote of its k neighbors: a new case is assigned
to the class most common amongst its K nearest neighbors, as measured by a
distance function.
These distance functions can be Euclidean, Manhattan,
Minkowski, or Hamming distance. The first three are used for
continuous variables and the fourth one (Hamming) for categorical variables.
If K = 1, then the case is simply assigned to the class of its nearest
neighbor. At times, choosing K turns out to be a challenge while performing
kNN modeling.
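A brief sketch with scikit-learn's KNeighborsClassifier (the toy data is assumed for illustration):

# k-nearest neighbors: classify by majority vote of the k closest points
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]])  # toy data
y = np.array([0, 0, 0, 1, 1, 1])

# k=3 with Euclidean distance (metric='minkowski' with p=2 is the default)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 1], [9, 9]]))  # -> [0 1]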
Why Clustering?
Clustering is important because it determines the intrinsic grouping
among unlabelled data. There are no universal criteria for good clustering;
it depends on the user and on which criteria satisfy their need. For instance,
we could be interested in finding representatives for homogeneous groups
(data reduction), finding "natural clusters" and describing their unknown
properties ("natural" data types), finding useful and suitable groupings
("useful" data classes), or finding unusual data objects (outlier detection).
Every clustering algorithm must make some assumptions about what
constitutes the similarity of points, and each assumption yields different
but equally valid clusters.
Clustering Methods:

Density-Based Methods: These methods consider clusters to be dense
regions of the space, separated from regions of lower density. They have
good accuracy and the ability to merge two clusters. Examples: DBSCAN
(Density-Based Spatial Clustering of Applications with Noise), OPTICS
(Ordering Points To Identify the Clustering Structure), etc.

Hierarchical Based Methods: The clusters formed in this method form a
tree-type structure based on the hierarchy. New clusters are formed using
previously formed ones. It is divided into two categories:
• Agglomerative (bottom-up approach)
• Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced
Iterative Reducing and Clustering using Hierarchies), etc.

Partitioning Methods: These methods partition the objects into k clusters,
and each partition forms one cluster. They optimize an objective criterion
(a similarity function), e.g. one in which distance is a major parameter.
Examples: K-means, CLARANS (Clustering Large Applications based upon
Randomized Search), etc.

Grid-based Methods: In these methods, the data space is divided into a
finite number of cells that form a grid-like structure. All clustering
operations done on these grids are fast and independent of the number of
data objects. Examples: STING (Statistical Information Grid), WaveCluster,
CLIQUE (Clustering In Quest), etc.
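As an illustration of a density-based method, a minimal DBSCAN sketch with scikit-learn (the parameters and blob data are assumptions for demonstration):

# DBSCAN: grow clusters from dense regions; sparse points become noise (-1)
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.5, random_state=0)

# eps = neighborhood radius, min_samples = points needed for a dense region
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_[:10])  # cluster ids; -1 marks outliers/noise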
Clustering Algorithms:
K-means clustering algorithm – It is one of the simplest unsupervised
learning algorithms that solve the clustering problem. The K-means algorithm
partitions n observations into k clusters, where each observation belongs to the
cluster with the nearest mean, which serves as the prototype of the cluster.
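A minimal K-means sketch with scikit-learn (the synthetic blob data is an assumption for illustration):

# K-means: assign points to the nearest of k cluster means (centroids)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)   # the k means (cluster prototypes)
print(km.labels_[:10])       # cluster assignment per observation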
Part-A Questions and Answers
REAL TIME APPLICATIONS IN DAY TO DAY LIFE
Thank you