Unit - II Data Analysis


Data Analytics (KIT-601)

3rd year (Semester VI)


Unit - II

Anil Singh
Asst. Prof.
CSE Dept.
UIT, Prayagraj
Data Analysis
• Data Analysis:
– Data analysis is defined as a process of cleaning,
transforming, and modeling data to discover
useful information for business decision-making.
– The purpose of Data Analysis is to extract useful
information from data and to take decisions based
upon that analysis.

Regression Modeling
• Regression Modeling:
– Regression is a method to mathematically formulate a relationship
between variables that can in due course be used to estimate,
interpolate, and extrapolate.
– Suppose we want to estimate the weight of individuals, which is
influenced by height, diet, workout, etc. Here, Weight is the predicted
variable. Height, Diet, Workout are predictor variables.
– The predicted variable is a dependent variable in the sense that it
depends on predictors. Predictors are also called independent
variables.
– Regression reveals to what extent the predicted variable is affected by
the predictors. In other words, what amount of variation in predictors
will result in variations of the predicted variable.
– The predicted variable is mathematically represented as Y. The
predictor variables are represented as X1, X2, X3, etc. This
mathematical relationship is often called the regression model.

– Regression techniques allow the identification and estimation of
possible relationships between a pattern or variable of interest, and
factors that influence that pattern.
• For example, a company may be interested in understanding the effectiveness
of its marketing strategies. It may deploy a variety of marketing activities in a
given time period: TV advertising, print advertising, social media campaigns,
radio advertising, and so on.
– A regression model can be used to understand and quantify which of
its marketing activities actually drive sales, and to what extent.

– The advantage of regression over simple correlations is that it allows
you to control for the simultaneous impact of multiple other factors
that influence your variable of interest, or the “target” variable.
• That is, in this example, things like pricing changes or competitive activities
also influence sales of the brand of interest, and the regression model allows
you to account for the impacts of these factors when you estimate the true
impact of say each type of marketing activity on sales.

– Regression is mainly used for prediction, forecasting, time
series modeling and determining the causal-effect
relationship between variables.
– In regression, we plot a graph between the variables and fit a line
or curve that best fits the given data points. Using this fit, the
machine learning model can make predictions about the data.
– In simple words, “Regression shows a line or a curve that passes
as close as possible to the data points on the target-predictor graph,
in such a way that the vertical distance between the data points
and the regression line is minimized.”
– The distance between data points and the line tells
whether a model has captured a strong relationship or not.
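– As a minimal illustration (not from the original slides), the following Python sketch fits a least-squares line; the height/weight numbers below are made up:

```python
import numpy as np

# Made-up data: height (cm) as predictor X, weight (kg) as predicted variable Y
X = np.array([150, 160, 165, 170, 180, 185])
Y = np.array([50, 56, 61, 66, 74, 79])

# Fit Y = b0 + b1*X by ordinary least squares (degree-1 polynomial fit)
b1, b0 = np.polyfit(X, Y, 1)
print(f"weight = {b0:.2f} + {b1:.2f} * height")

# Estimate (interpolate) the weight of a new individual
print("predicted weight at 175 cm:", round(b0 + b1 * 175, 1))
```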

• Some examples of regression are as follows:
– Prediction of rain using temperature and other
factors
– Determining market trends
– Prediction of road accidents due to rash driving

Multivariate Analysis
• Multivariate analysis:
– It is a set of statistical techniques used for analysis
of data that contain more than one variable.

– Multivariate analysis is typically used for:
• Quality control and quality assurance
• Process optimization and process control
• Research and development
• Consumer and market research

• How multivariate methods are used
– Obtain a summary or an overview of a table. This
analysis is often called Principal Components Analysis
or Factor Analysis. In the overview, it is possible to
identify the dominant patterns in the data, such as
groups, outliers, trends, and so on. The patterns are
displayed as two plots.
– Analyse groups in the table, how these groups differ,
and to which group individual table rows belong. This
type of analysis is called Classification and
Discriminant Analysis.

– Find relationships between columns in data tables, for
instance relationships between process operation
conditions and product quality. The objective is to use
one set of variables (columns) to predict another, for
the purpose of optimization, and to find out which
columns are important in the relationship. The
corresponding analysis is called Multiple Regression
Analysis or Partial Least Squares (PLS), depending on
the size of the data table.

Bayesian Modeling
• Bayesian modeling:
– Bayesian analysis is a statistical paradigm that answers research
questions about unknown parameters using probability
statements.
– For example,
• What is the probability that the average male height is between 70
and 80 inches or that the average female height is between 60 and 70
inches?
• What is the probability that people in a particular state vote
Republican or vote Democratic?
• What is the probability that a person accused of a crime is guilty?
• What is the probability that treatment A is more cost effective than
treatment B for a specific health care provider?
• What is the probability that a patient's blood pressure decreases if he
or she is prescribed drug A?

• What is the probability that the odds ratio is between 0.3 and 0.5?
• What is the probability that three out of five quiz questions will be
answered correctly by students?
• What is the probability that children with ADHD underperform
relative to other children on a standardized test?
• What is the probability that there is a positive effect of schooling on
wage?
• What is the probability that excess returns on an asset are positive?

– Such probabilistic statements are natural to Bayesian analysis
because of the underlying assumption that all parameters are
random quantities.
– In Bayesian analysis, a parameter is summarized by an entire
distribution of values instead of one fixed value as in classical
frequentist analysis.
– Estimating this distribution, a posterior distribution of a
parameter of interest, is at the heart of Bayesian analysis.
• Posterior Distribution:
– A posterior distribution comprises a prior distribution about a
parameter and a likelihood model providing information about the
parameter based on observed data.
– Depending on the chosen prior distribution and likelihood model, the
posterior distribution is either available analytically or approximated
by, for example, one of the Markov Chain Monte Carlo (MCMC)
methods.
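– As a concrete sketch (not from the slides), consider a conjugate Beta prior with a binomial likelihood, where the posterior is available analytically; the counts below are made up:

```python
from scipy import stats

# Prior: Beta(2, 2) over an unknown success probability p
a_prior, b_prior = 2, 2

# Observed data: 7 successes in 10 trials (binomial likelihood)
successes, trials = 7, 10

# Conjugacy: posterior is Beta(a_prior + successes, b_prior + failures)
posterior = stats.beta(a_prior + successes, b_prior + trials - successes)

print("posterior mean:", posterior.mean())            # a point estimate
print("95% credible interval:", posterior.interval(0.95))
```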

• Bayesian inference:
– It uses the posterior distribution to form various summaries for the
model parameters, including point estimates such as posterior means,
medians, percentiles, and interval estimates known as credible
intervals.
– Moreover, all statistical tests about model parameters can be
expressed as probability statements based on the estimated posterior
distribution.

• Bayesian Network:
– These are a type of probabilistic graphical model that
uses Bayesian inference for probability computations.
– Bayesian networks aim to model conditional
dependence by edges in a directed graph.
– Through these relationships, one can efficiently
conduct inference on the random variables in the
graph through the use of factors.
– Using the relationships specified by our Bayesian
network, we can obtain a compact, factorized
representation of the joint probability distribution by
taking advantage of conditional independence.


[Figures: the example Bayesian network with nodes Cloudy, Sprinkler, Rain, and WetGrass, together with their conditional probability tables]

– A Bayesian network is a directed acyclic graph, in which
each edge corresponds to a conditional dependency and
each node corresponds to a unique random variable.

– Formally, if an edge (A, B) exists in the graph connecting
random variables A and B, it means that P(B|A) is a factor
in the joint probability distribution, so we must know
P(B|A) for all values of B and A in order to conduct
inference.
– In the above example, since Rain has an edge going into
WetGrass, it means that P(WetGrass|Rain) will be a factor,
whose probability values are specified next to the
WetGrass node in a conditional probability table.

– Bayesian networks satisfy the local Markov property,
which states that a node is conditionally independent of its
non-descendants given its parents.
– In the above example, this means that P(Sprinkler|Cloudy,
Rain) = P(Sprinkler|Cloudy) since Sprinkler is conditionally
independent of its non-descendant, Rain, given Cloudy.
– This property allows us to simplify the joint distribution,
obtained in the previous section using the chain rule, to a
smaller form.
– After simplification, the joint distribution for a Bayesian
network is equal to the product of P(node|parents(node))
for all nodes.
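– A minimal sketch of this factorization for the Cloudy/Sprinkler/Rain/WetGrass example (the probability values below are illustrative assumptions, not taken from the slides):

```python
# Conditional probability tables; the numbers are made up for illustration
P_C = {True: 0.5, False: 0.5}                    # P(Cloudy)
P_S = {True: {True: 0.1, False: 0.9},            # P(Sprinkler | Cloudy)
       False: {True: 0.5, False: 0.5}}
P_R = {True: {True: 0.8, False: 0.2},            # P(Rain | Cloudy)
       False: {True: 0.2, False: 0.8}}
P_W = {(True, True): {True: 0.99, False: 0.01},  # P(WetGrass | Sprinkler, Rain)
       (True, False): {True: 0.90, False: 0.10},
       (False, True): {True: 0.90, False: 0.10},
       (False, False): {True: 0.00, False: 1.00}}

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)."""
    return P_C[c] * P_S[c][s] * P_R[c][r] * P_W[(s, r)][w]

print(joint(True, False, True, True))  # 0.5 * 0.9 * 0.8 * 0.9 = 0.324
```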

• Inference:
– Inference over a Bayesian network can come in two forms.
• The first is simply evaluating the joint probability of a particular
assignment of values for each variable (or a subset) in the
network.
• For this, we already have a factorized form of the joint distribution,
so we simply evaluate that product using the provided conditional
probabilities.
• If we only care about a subset of variables, we will need to
marginalize out the ones we are not interested in.
• In many cases, this may result in underflow, so it is common to
take the logarithm of that product, which is equivalent to adding
up the individual logarithms of each term in the product.

• The second, more interesting inference task, is to find
P(x|e), or, to find the probability of some assignment of a
subset of the variables (x) given assignments of other
variables (our evidence, e).
• In the above example, an example of this could be to find
P(Sprinkler, WetGrass | Cloudy), where {Sprinkler, WetGrass}
is our x, and {Cloudy} is our e.
• In order to calculate this, we use the fact that
P(x|e) = P(x, e) / P(e) = αP(x, e),
where α is a normalization constant that we will calculate at
the end such that P(x|e) + P(¬x | e) = 1.
• In order to calculate P(x, e), we must marginalize the joint
probability distribution over the variables that do not appear
in x or e, which we will denote as Y.
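– Continuing the sketch above, P(x|e) can be computed by enumeration: sum the joint distribution over the hidden variables Y and then normalize by P(e):

```python
from itertools import product

def query(x_vars, e_vars):
    """P(x | e) by enumeration, reusing joint() from the sketch above.
    x_vars and e_vars map variable names ('C','S','R','W') to True/False."""
    names = ['C', 'S', 'R', 'W']
    hidden = [n for n in names if n not in x_vars and n not in e_vars]

    def marginal(assign):
        # Sum the joint over every assignment of the hidden variables
        p = 0.0
        for values in product([True, False], repeat=len(hidden)):
            full = {**assign, **dict(zip(hidden, values))}
            p += joint(full['C'], full['S'], full['R'], full['W'])
        return p

    # P(x, e) / P(e); dividing by P(e) plays the role of the constant alpha
    return marginal({**x_vars, **e_vars}) / marginal(dict(e_vars))

# The example from the text: P(Sprinkler, WetGrass | Cloudy)
print(query({'S': True, 'W': True}, {'C': True}))
```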
Support Vector Machines (SVM)
• Support Vector Machines (SVM):
– It is a machine learning algorithm which
• Solves Classification Problem
• Uses a flexible representation of the class boundaries
• Implements automatic complexity control to reduce
overfitting
• Has a single global minimum which can be found in
polynomial time
– Support Vector Machine (SVM) is a supervised
machine learning algorithm which can be used for
both classification and regression challenges. However,
it is mostly used in classification problems.
• In the SVM algorithm, we plot each data item as a point in
n-dimensional space (where n is number of features you
have) with the value of each feature being the value of a
particular coordinate. Then, we perform classification by
finding the hyper-plane that differentiates the two classes
very well.
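– As a hedged illustration (assuming scikit-learn is available; the toy points below are made up), a linear SVM in Python:

```python
from sklearn import svm

# Toy 2-dimensional data: two classes, e.g. circles (0) and stars (1)
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

# Fit a linear SVM; the learned hyper-plane separates the two classes
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

print(clf.support_vectors_)   # the points that define the margin
print(clf.predict([[4, 4]]))  # classify a new point
```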

• Working of SVM:
– Identify the right hyper-plane (Scenario-1):

• Here, we have three hyper-planes (A, B and C). Now, identify the right
hyper-plane to classify star and circle.
• “Select the hyper-plane which segregates the two classes better”. In
this scenario, hyper-plane “B” has excellently performed this job.

– Identify the right hyper-plane (Scenario-2):
• Here, we have three hyper-planes (A, B and C) and all are
segregating the classes well. Now, How can we identify the right
hyper-plane?
• Here, maximizing the distance between the nearest data point (of either
class) and the hyper-plane will help us decide the right hyper-plane.
This distance is called the Margin.

• In the corresponding figure (not reproduced here), you can see that the
margin for hyper-plane C is high as compared to both A and B. Hence, we
name the right hyper-plane as C. Another strong reason for selecting the
hyper-plane with the higher margin is robustness: if we select a hyper-plane
having a low margin, then there is a high chance of mis-classification.

• Identify the right hyper-plane (Scenario-3):
• Hint: Use the rules as discussed in previous section to identify the right
hyper-plane

• Some of you may have selected hyper-plane B as it has a higher margin
compared to A. But here is the catch: SVM selects the hyper-plane which
classifies the classes accurately prior to maximizing the margin. Here, hyper-
plane B has a classification error while A has classified all points correctly.
Therefore, the right hyper-plane is A.

• SVM Kernel:
– In the SVM classifier, it is easy to have a linear hyper-plane
between two linearly separable classes. But another burning question
which arises is: do we need to add extra features manually to obtain a
hyper-plane when the classes are not linearly separable? No, the SVM
algorithm has a technique called the kernel trick.
– The SVM kernel is a function that takes a low-dimensional input
space and transforms it into a higher-dimensional space, i.e., it
converts a non-separable problem into a separable problem. It is
mostly useful in non-linear separation problems.
– Simply put, it does some extremely complex data
transformations, then finds out the process to separate the data
based on the labels or outputs you’ve defined.
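– For instance, the widely used RBF (Gaussian) kernel can be sketched as follows (a sketch of the formula, not a library's internals):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2): similarity in an
    implicit higher-dimensional space, computed without mapping explicitly."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

print(rbf_kernel([1.0, 2.0], [2.0, 3.0]))
```

– In the scikit-learn example shown earlier, the same effect is obtained by passing kernel='rbf' to svm.SVC.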

Analysis of Time Series
• Time series:
– It is a collection of observations made sequentially in time.
– In this, successive observations are NOT independent and the
order of observation is crucial.
• Goals of time series analysis:
– Descriptive: identify patterns in correlated data, such as trends
and seasonal variation
– Explanation: understanding and modeling the data
– Forecasting: prediction of short-term trends from previous
patterns
– Intervention analysis: how does a single event change the time
series?
– Quality control: deviations of a specified size indicate a problem

• Linear System Analysis:
– A time series model is a tool used to predict future values
of a series by analyzing the relationship between the
values observed in the series and the time of their
occurrence.
– Time series models can be developed using a variety of
time series statistical techniques.
– If there has been any trend and/or seasonal variation
present in the data in the past then time series models can
detect this variation, use this information in order to fit the
historical data as closely as possible, and in doing so
improve the precision of future forecasts.

– There are many traditional techniques used in time series
analysis. Some of these include:
• Exponential Smoothing
• Linear Time Series Regression and Curvefit
• Autoregression
• ARIMA (Autoregressive Integrated Moving Average)
• Intervention Analysis
• Seasonal Decomposition

• ARIMA:
– ARIMA stands for AutoRegressive Integrated Moving
Average, and the assumption of these models is that
the variation accounted for in the series variable can
be divided into three components:
• Autoregressive (AR)
• Integrated (I) or Difference
• Moving Average (MA)

– An ARIMA model can have any component, or combination of
components, at both the nonseasonal and seasonal levels.
– There are many different types of ARIMA models
and the general form of an ARIMA model is
ARIMA(p,d,q)(P,D,Q), where:
• p refers to the order of the nonseasonal autoregressive
process incorporated into the ARIMA model (and P the
order of the seasonal autoregressive process)
• d refers to the order of nonseasonal integration or
differencing (and D the order of the seasonal
integration or differencing)
• q refers to the order of the nonseasonal moving
average process incorporated in the model (and Q the
order of the seasonal moving average process).
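– A minimal nonseasonal ARIMA(1,1,1) fit, assuming statsmodels is installed; the series below is made up:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Made-up series with an upward trend plus noise
np.random.seed(0)
series = np.cumsum(np.random.normal(loc=0.5, scale=1.0, size=60))

# p=1 autoregressive term, d=1 difference, q=1 moving-average term
model = ARIMA(series, order=(1, 1, 1))
result = model.fit()

# Forecast the next 6 periods
print(result.forecast(steps=6))
```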

• Non-linear dynamics:
– Detecting trends and patterns in financial data is of
great interest to the business world to support the
decision-making process.
– A new generation of methodologies, including neural
networks, knowledge-based systems and genetic
algorithms, has attracted attention for analysis of
trends and patterns.
– In particular, neural networks are being used
extensively for financial forecasting in stock
markets, foreign exchange trading, commodity futures
trading and bond yields.
– The application of neural networks in time series
forecasting is based on the ability of neural networks
to approximate nonlinear functions.
– In fact, neural networks offer a novel technique that
doesn’t require pre-specification of the model during the
modeling process, because they independently learn the
relationships inherent in the variables.
– The term neural network applies to a loosely related
family of models, characterized by a large parameter
space and a flexible structure, descended from studies
of brain function.
Rule Induction
• Rule induction is a data mining process of deducing if-then rules
from a data set.
• These symbolic decision rules explain an inherent relationship
between the attributes and class labels in the data set.
• Many real-life experiences are based on intuitive rule induction.
– For example, we can proclaim a rule that states “if it is 8 a.m. on a
weekday, then highway traffic will be heavy” and “if it is 8 p.m. on a
Sunday, then the traffic will be light.”
– These rules are not necessarily right all the time: 8 a.m. weekday
traffic may be light during a holiday season. But, in general, these rules
hold true and are deduced from real-life experience based on our
everyday observations.
• Rule induction provides a powerful classification approach.
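– A toy sketch of applying such induced if-then rules (the rules and record fields below are hypothetical):

```python
# Each rule is a (condition, class label) pair, checked in order
rules = [
    (lambda r: r["hour"] == 8 and r["day"] == "weekday", "heavy traffic"),
    (lambda r: r["hour"] == 20 and r["day"] == "Sunday", "light traffic"),
]

def classify(record, default="unknown"):
    for condition, label in rules:
        if condition(record):
            return label
    return default

print(classify({"hour": 8, "day": "weekday"}))  # -> heavy traffic
```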

Neural Network: Learning and
Generalization
• Learning:
– The principal reason why neural networks have
attracted such interest is the existence of learning
algorithms for neural networks:
• algorithms that use data to estimate the optimal weights in a
network to perform some task.
– There are three basic approaches to learning in neural
networks.
– Supervised learning:
• It uses a training set that consists of a set of pattern pairs: an
input pattern and the corresponding desired (or target)
output pattern. The desired output may be regarded as the
network’s “teacher” for that input.

• The basic approach in supervised learning is for the
network to compute the output its current weights
produce for a given input, and to compare this network
output with the desired output.
• The aim of the learning algorithm is to adjust the
weights so as to minimize the difference between the
network output and the desired output.
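• A minimal sketch of this weight-adjustment idea for a single linear neuron (the delta rule; the training pairs and learning rate below are made up):

```python
import numpy as np

# Training set: input patterns and desired (target) outputs, here t = x1 + x2
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
targets = np.array([0., 1., 1., 2.])

w = np.zeros(2)   # initial weights
lr = 0.1          # learning rate

for epoch in range(100):
    for x, t in zip(X, targets):
        out = w @ x              # network output with current weights
        w += lr * (t - out) * x  # shrink the output/target difference

print(w)  # approaches [1, 1]
```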

– Reinforcement learning:
• It uses much less supervision.
• If a network aims to perform some task, then the
reinforcement signal is a simple “yes” or “no” at the end of
the task to indicate whether the task has been performed
satisfactorily.

– Unsupervised learning:
• It uses only input data; there is no training signal, unlike the
previous two approaches.
• The aim of unsupervised learning is to make sense of some
dataset, for example by clustering similar patterns together or
by compressing the data.

• Generalization of Neural Network:
– One of the major advantages of neural nets is
their ability to generalize. This means that a
trained net could classify data from the same class
as the learning data that it has never seen before.
– In real world applications developers normally
have only a small part of all possible patterns for
the generation of a neural net.

– To reach the best generalization, the dataset
should be split into three parts:
• The training set is used to train a neural net. The error
of this dataset is minimized during training.
• The validation set is used to determine the
performance of a neural network on patterns that are
not trained during learning.
• A test set for finally checking the overall performance
of a neural net.

– The learning should be stopped at the minimum of
the validation-set error. At this point the net
generalizes best.
– When learning is not stopped, overtraining occurs
and the performance of the net on the whole data
decreases, despite the fact that the error on the
training data still gets smaller.
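– This early-stopping idea can be sketched as follows (train_one_epoch, validation_error, and the weight accessors are hypothetical placeholders, not a specific library's API):

```python
def train_with_early_stopping(net, max_epochs=1000, patience=10):
    """Stop once the validation error has not improved for `patience` epochs."""
    best_error, best_weights, waited = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(net)             # hypothetical: one pass over the training set
        err = validation_error(net)      # hypothetical: error on the validation set
        if err < best_error:
            best_error, best_weights, waited = err, net.get_weights(), 0
        else:
            waited += 1
            if waited >= patience:       # minimum of the validation error passed
                break
    net.set_weights(best_weights)        # restore the best-generalizing weights
    return net
```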
– After finishing the learning phase, the net should
be finally checked with the third data set, the test
set.

Competitive Learning
• Competitive learning:
– It is a form of unsupervised learning in artificial neural
networks, in which nodes compete for the right to respond
to a subset of the input data.
– Models and algorithms based on the principle of
competitive learning include vector quantization and self-
organizing maps (Kohonen maps).
– In a competitive learning model, there are hierarchical sets
of units in the network with inhibitory and excitatory
connections.
– The excitatory connections are between individual layers
and the inhibitory connections are between units in
layered clusters. Units in a cluster are either active or
inactive.
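– A minimal winner-take-all sketch of this competition (made-up data; this update is the core of vector quantization):

```python
import numpy as np

np.random.seed(1)
data = np.random.rand(200, 2)   # made-up 2-D input patterns
units = np.random.rand(3, 2)    # weight vectors of 3 competing units
lr = 0.05                       # learning rate

for x in data:
    # The unit closest to the input wins the competition...
    winner = np.argmin(np.linalg.norm(units - x, axis=1))
    # ...and only the winner moves its weights toward the input
    units[winner] += lr * (x - units[winner])

print(units)  # each unit has drifted toward a region of the input data
```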
Principal Component Analysis (PCA)
• Principal Component Analysis:
– Principal components analysis (PCA) is a statistical
technique that allows identifying underlying linear
patterns in a data set, so that it can be expressed in
terms of another data set of significantly lower
dimension without much loss of information.
– The final data set should explain most of the variance
of the original data set by reducing the number of
variables.
– The final variables are called the principal
components.

• The steps to perform principal components analysis are
as follows:
– Subtract mean:
• The first step in the principal component analysis is to subtract the
mean for each variable of the data set.
– Calculate the covariance matrix:
• The covariance of two random variables measures how they
vary from their means with respect to each other.
• The sign of the covariance provides us with information about the
relation between them:
– If the covariance is positive, then the two variables increase and decrease
together.
– If the covariance is negative, then when one variable increases, the other
decreases, and vice versa.
• These values determine the linear dependencies between the
variables, which will be used to reduce the data set's dimension.
– Calculate eigen vectors and eigen values:
• Eigen vectors are those vectors whose directions remain
unchanged after the linear transformation has been applied.
– However, their length may not remain the same after the
transformation, i.e., the result of this transformation is the
vector multiplied by a scalar.
– This scalar is called the eigen value, and each eigen vector
has one associated with it.

• The number of eigen vectors or components that we
can calculate for each data set is equal to the data set's
dimension.
– Since they are calculated from the covariance matrix
described before, eigen vectors represent the directions in
which the data have a higher variance.
– On the other hand, their respective eigen values determine
the amount of variance that the data set has in that direction.

– Select principal components:
• Among the eigen vectors calculated previously, we must
select those onto which we project the data. The selected
eigen vectors will be called the principal components.
• To establish a criterion to select the eigen vectors, we must
first define the relative variance of each and the total
variance of a data set.
• The relative variance of an eigen vector measures how much
information can be attributed to it.
• The total variance of a data set is the sum of the variance of
all the variables.
• These two concepts are determined by the eigen values.

– Reduce the data dimension:
• Once we have selected the principal components, the
data must be projected onto them.
• Although this projection can explain most of the
variance of the original data, we have lost the
information about the variance along the discarded
components.
• In general, this process is irreversible, which means that
we cannot recover the original data from the
projection.
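– The whole procedure above can be sketched with NumPy (made-up data; an eigendecomposition of the covariance matrix):

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 3)               # made-up data: 100 samples, 3 variables

Xc = X - X.mean(axis=0)                  # step 1: subtract the mean
cov = np.cov(Xc, rowvar=False)           # step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # step 3: eigen values and eigen vectors

order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                    # step 4: keep the top-k principal components
components = eigvecs[:, :k]
print("relative variance:", eigvals[:k] / eigvals.sum())

Z = Xc @ components                      # step 5: project to reduce the dimension
print(Z.shape)                           # (100, 2)
```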



Fuzzy Logic
• Fuzzy Logic:
– It is an approach to reasoning that resembles human
decision-making: instead of the crisp digital values yes or no,
it works with degrees of truth.
– It uses fuzzy sets so that a computer process can work with
natural-language notions.
– It is applied in rule-based automatic controllers, where it
establishes a non-linear mapping between inputs and outputs.
– The system works on the principle that, based on the degree
to which an input state holds, a particular output is assigned.
– The word fuzzy refers to imprecision: things that are not
clear or precise.
– It comprises four components, namely the fuzzifier, rules,
inference engine, and defuzzifier.
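– A minimal sketch of the four components for a one-input, one-output controller (the membership functions and rule outputs below are made up):

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fan_speed(temp):
    # Fuzzifier: degrees to which the input temperature is cold / warm / hot
    cold = tri(temp, 0, 10, 20)
    warm = tri(temp, 15, 25, 35)
    hot  = tri(temp, 30, 40, 50)
    # Rules: if cold then slow (20), if warm then medium (50), if hot then fast (90)
    firing = [(cold, 20.0), (warm, 50.0), (hot, 90.0)]
    # Inference engine + defuzzifier: weighted average (a center-of-gravity style step)
    total = sum(mu for mu, _ in firing)
    return sum(mu * out for mu, out in firing) / total if total else 0.0

print(fan_speed(28))  # only the 'warm' rule fires here, so the output is 50.0
```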

• Definition:
– It is defined as a control logic that uses degrees
of input and output to approximate human reasoning,
integrated with a rule-based implementation.
– It is a technique for handling imprecise information
or facts that involve some degree of uncertainty.

Extracting Fuzzy Models from the
Data
• In the context of intelligent data analysis it is of
great interest how such fuzzy models can
automatically be derived from example data.
• Since, besides prediction, understandability is of
prime concern, the resulting fuzzy model should
offer insights into the underlying system.
• To achieve this, different approaches exist that
construct grid-based rule sets defining a global
granulation of the input space, as well as fuzzy
graph based structures.

• Extracting Grid-Based Fuzzy Models from Data:
– Grid-based rule sets model each input variable
through a usually small set of linguistic values.
– The resulting rule base uses all, or a subset, of the
possible combinations of these linguistic values for
each variable, resulting in a global granulation of the
feature space into “tiles”.
– Extracting grid-based fuzzy models from data is
straightforward when the input granulation is fixed,
that is, the antecedents of all rules are predefined.
Then only a matching consequent for each rule needs
to be found.
• Extracting Fuzzy Graphs from Data:
– In high-dimensional feature spaces a global granulation
results in a large number of rules. For these tasks a fuzzy
graph based approach is more suitable.
– A possible disadvantage of the individual membership
functions is the potential loss of interpretation. Projecting
all membership functions onto one variable will usually not
lead to meaningful linguistic values.
– In many data analysis applications, however, such a
meaningful granulation of all attributes is either not
available or hard to determine automatically.

– The primary function of a fuzzy graph is to serve as a
representation of an imprecisely defined dependency. Fuzzy
graphs do not have a natural linguistic interpretation of the
granulation of their input space.
– The main advantage is the low dimensionality of the individual
rules. The algorithm only introduces restriction on few of the
available input variables, thus making the extracted rules easier
to interpret.
– The final set of fuzzy points forms a fuzzy graph, where each
fuzzy point is associated with one output region and is used to
compute a membership value for a certain input pattern.
– Fuzzy inference then produces a soft value for the output and
using the well-known center-of-gravity method a final crisp
output value can be obtained, if so desired.
A Few Questions… (Unit-II)
• Explain what is regression modeling.
• Explain supervised learning and unsupervised
learning.
• Explain Principal Component Analysis (PCA) in
data analysis.
• Describe the Bayesian network.
