Chapter 7 SQQS1033
WHAT IS EDA?
Exploratory Data Analysis (EDA) is an
approach/philosophy for data analysis that employs a
variety of techniques (mostly graphical) to
• maximize insight into a data set;
• uncover underlying structure;
• extract important variables;
• detect outliers and anomalies;
• test underlying assumptions;
• develop parsimonious models; and
• determine optimal factor settings.
Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
INSIGHT INTO THE DATA
Insight implies detecting and uncovering underlying structure
in the data. Such underlying structure may not be
encapsulated in the list of items above; such items serve as
the specific targets of an analysis, but the real insight and
"feel" for a data set comes as the analyst judiciously probes
and explores the various subtleties of the data. The "feel" for
the data comes almost exclusively from the application of
various graphical techniques, the collection of which serves
as the window into the essence of the data. Graphics are
irreplaceable--there are no quantitative analogues that will
give the same insight as well-chosen graphics.
INSIGHT INTO THE DATA
To get a "feel" for the data, it is not enough for
the analyst to know what is in the data; the
analyst also must know what is NOT in the
data, and the only way to do that is to draw
on our own human pattern-recognition and
comparative abilities in the context of a series
of judicious graphical techniques applied to
the data.
PHILOSOPHY
EDA is not a mere collection of techniques; EDA is a
philosophy as to how we dissect a data set;
what we look for;
how we look; and
how we interpret.
It is true that EDA heavily uses the collection of techniques
that we call "statistical graphics", but it is not identical to
statistical graphics per se.
Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
HOW DOES EXPLORATORY DATA ANALYSIS
(EDA) DIFFER FROM CLASSICAL DATA ANALYSIS?
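For classical analysis, the sequence is
Problem => Data => Model => Analysis => Conclusions
For EDA, the sequence is
Problem => Data => Analysis => Model => Conclusions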
Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
HOW DOES EXPLORATORY DATA ANALYSIS (EDA)
DIFFER FROM SUMMARY ANALYSIS?
A summary analysis is simply a numeric reduction of a historical data set. It
is quite passive. Its focus is on the past. Quite commonly, its purpose is to
simply arrive at a few key statistics (for example, mean and standard
deviation) which may then either replace the data set or be added to the
data set in the form of a summary table.
In contrast, EDA has as its broadest goal the desire to gain insight into the
data. Whereas summary statistics are passive and historical, EDA is
active and futuristic. In an attempt to "understand" the process and
improve it in the future, EDA uses the data as a "window" to peer into the
heart of the data. There is an archival role in the research and
manufacturing world for summary statistics, but there is an enormously
larger role for the EDA approach.
Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
CLASSICAL VERSUS EDA
EDA AND DESCRIPTIVE STATISTICS
EDA emphasizes graphical techniques, while classical analysis emphasizes
quantitative techniques.
In practice, an analyst typically should use a mixture of graphical and
quantitative techniques.
Both of these are part of Descriptive Statistics.
• Frequency table, summary statistics (min, max, total, etc.)
• Graphs
• Measures of central tendency
• Measures of dispersion / variation
• Measures of shape / distribution shape
• Relationship/Association
AN EDA/GRAPHICS EXAMPLE
A simple, classic (Anscombe) example of the central role
that graphics play in terms of providing insight into a
data set starts with the following data set:
X Y
10.00 8.04
8.00 6.95
13.00 7.58
9.00 8.81
11.00 8.33
14.00 9.96
6.00 7.24
4.00 4.26
12.00 10.84
7.00 4.82
5.00 5.68
Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
AN EDA/GRAPHICS EXAMPLE
In contrast, the following simple scatter plot of the data
[Figure: scatter plot of Y versus X]
AN EDA/GRAPHICS EXAMPLE
                         Data Set 1   Data Set 2   Data Set 3   Data Set 4
N                        11           11           11           11
Mean of X                9            9            9            9
Mean of Y                7.5          7.5          7.5          7.5
Standard Deviation of X  3.32         3.32         3.32         3.32
Standard Deviation of Y  2.03         2.03         2.03         2.03
Correlation              0.816        0.816        0.816        0.816
One might naively assume that the four data sets are
"equivalent" since that is what the statistics tell us; but
what do the statistics not tell us?
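As a minimal sketch, the Data Set 1 column of the summary table can be reproduced from the X and Y values listed earlier in these slides. The helper names (`mean`, `sample_sd`, `pearson_r`) are illustrative choices, not part of the original material, and the standard deviation is assumed to be the sample (n - 1) version.

```python
import math

# Data Set 1 (X, Y) exactly as listed in the slides.
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


summary = {
    "N": len(x),
    "Mean of X": mean(x),
    "Mean of Y": mean(y),
    "Standard Deviation of X": sample_sd(x),
    "Standard Deviation of Y": sample_sd(y),
    "Correlation": pearson_r(x, y),
}

for name, value in summary.items():
    print(f"{name:<25}{value:.3f}")
```

These numbers summarize location, spread, and linear association only; they carry no information about the shape of the point cloud, which is why the plots are needed.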
BASIC EDA ASSUMPTION
Normality:
If the normal distribution assumption holds, then
1. the histogram will be bell-shaped;
2. the normal probability plot will be linear;
3. the skewness and excess kurtosis will be close to zero; and
4. a test of normality (Kolmogorov-Smirnov or Shapiro-Wilk) will not
reject normality: if the p-value is greater than the 0.05 significance
level, normality is assumed.
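The numerical check in item 3 can be sketched with moment-based formulas. This is an illustration only: the function names are hypothetical, the simple population-moment definitions are assumed (statistical packages often apply small-sample corrections), and the formal tests in item 4 are not shown because they need a statistics library.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Y values of Data Set 1 from the Anscombe example in these slides.
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

print(f"skewness        = {skewness(y):.3f}")
print(f"excess kurtosis = {excess_kurtosis(y):.3f}")
```

With only 11 observations these measures are noisy, so they are best read alongside the histogram and the normal probability plot rather than in place of them.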
BASIC EDA ASSUMPTION
Normality: Histogram and Normal Probability Plot
[Figures: an approximately normal distribution; a non-normal, U-shaped distribution]
BASIC EDA ASSUMPTION
Normality checking Versus Central Limit Theorem.
The central limit theorem is one of the most
remarkable results of the theory of probability. In its
simplest form, the theorem states that the sum of a
large number of independent observations from the
same distribution has, under certain general
conditions, an approximate normal distribution.
Moreover, the approximation steadily improves as
the number of observations increases.
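A small simulation can illustrate the theorem. The sketch below is an assumption-laden example, not part of the original slides: it draws observations from an exponential distribution (strongly right-skewed), averages n of them, and uses the skewness of those averages as a rough measure of departure from normality. The seed, the number of replications, and the values of n are arbitrary choices.

```python
import random


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for a symmetric sample)."""
    m = sum(values) / len(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


def sample_means(n, replications, rng):
    """Means of n independent exponential(1) draws, repeated many times."""
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(replications)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
skew_by_n = {n: skewness(sample_means(n, 5000, rng)) for n in (1, 5, 30)}

for n, s in skew_by_n.items():
    print(f"n = {n:>2}: skewness of the sample means = {s:.2f}")
```

The skewness shrinks toward zero as n grows, which is the practical reason why procedures based on means can tolerate non-normal raw data when the sample is large.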
BASIC EDA ASSUMPTION
DEALING WITH NON-NORMAL
Many classical statistical tests depend on normality assumptions.
Significant skewness and kurtosis clearly indicate that data are not
normal. If a data set exhibits significant skewness or kurtosis (as
indicated by a histogram or the numerical measures), what can we do
about it? One approach is to apply some type of transformation to try
to make the data normal, or more nearly normal. The
Box-Cox transformation is a useful technique for trying to normalize a
data set. In particular, taking the log or square root of a data set is
often useful for data that exhibit moderate right skewness.
Another approach is to use techniques based on distributions other than
the normal, or non-parametric techniques that do not assume normality.
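The effect of the log and square-root transformations on right skewness can be sketched as follows. The sample is hypothetical (a doubling sequence chosen so that its logarithms are evenly spaced), and the skewness function uses the simple moment-based definition.

```python
import math


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for a symmetric sample)."""
    m = sum(values) / len(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample (not from the slides).
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]

skew_raw = skewness(raw)
skew_sqrt = skewness([math.sqrt(v) for v in raw])
skew_log = skewness([math.log(v) for v in raw])

print(f"raw data    : skewness = {skew_raw:.2f}")
print(f"square root : skewness = {skew_sqrt:.2f}")
print(f"log         : skewness = {skew_log:.2f}")
```

For this sample the square root reduces the skewness and the log removes it; real data will rarely behave so neatly, so the transformed data should still be checked with a histogram or normal probability plot.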
BASIC EDA ASSUMPTION
DEALING WITH NON-NORMAL - TRANSFORMATION
Purpose: Find transformation to normalize data
Many statistical tests are based on the assumption of normality. The
assumption of normality often leads to tests that are simple,
mathematically tractable, and powerful compared to tests that do not
make the normality assumption. Unfortunately, many real data sets
are in fact not approximately normal. However, an appropriate
transformation of a data set can often yield a data set that does follow
approximately a normal distribution. This increases the applicability
and usefulness of statistical techniques based on the normality
assumption.
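One widely used family of such transformations is the Box-Cox transformation. As a sketch of the usual definition, for a positive observation $y$ and a parameter $\lambda$ chosen by the analyst (the symbols are introduced here for illustration):

```latex
y^{(\lambda)} =
\begin{cases}
  \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \ln y, & \lambda = 0,
\end{cases}
\qquad y > 0 .
```

Setting $\lambda = 0$ gives the log transformation, and $\lambda = 0.5$ is the square-root transformation up to a linear rescaling; the value of $\lambda$ is typically chosen so that the transformed data look as close to normal as possible.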
BASIC EDA ASSUMPTION DEALING WITH
NON-NORMAL – NON-PARAMETRIC
Applied Statistics
    Descriptive Statistics
    Inferential Statistics
        Parametric
        Non-parametric
INFERENTIAL STATISTICS:
PARAMETRIC VERSUS NONPARAMETRIC
                                  Parametric                           Non-parametric
Assumed distribution              Normal                               Any
Assumed variance                  Homogeneous                          Any
Typical data                      Ratio or Interval                    Ordinal or Nominal
Data set relationships            Independent                          Any
Usual central measure             Mean                                 Median
Benefits                          Can draw more conclusions            Simplicity; Less affected by outliers

Tests
Choosing                          Choosing parametric test             Choosing a non-parametric test
Correlation test                  Pearson                              Spearman
Independent measures, 2 groups    Independent-measures t-test          Mann-Whitney test
Independent measures, >2 groups   One-way, independent-measures ANOVA  Kruskal-Wallis test
Repeated measures, 2 conditions   Matched-pair t-test                  Wilcoxon test
Repeated measures, >2 conditions  One-way, repeated measures ANOVA     Friedman's test
Source: http://changingminds.org/explanations/research/analysis/parametric_non-parametric.htm
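To make the correlation row of the table concrete, the sketch below computes both coefficients for Data Set 1 from the Anscombe example in these slides. It assumes the usual definition of Spearman's coefficient as the Pearson correlation of the ranks, with tied values given their average rank; the function names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient (parametric)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation (non-parametric): Pearson on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Data Set 1 from the Anscombe example in these slides.
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

print(f"Pearson  r   = {pearson_r(x, y):.3f}")
print(f"Spearman rho = {spearman_rho(x, y):.3f}")
```

Because Spearman's coefficient uses only the ordering of the values, it is unaffected by how extreme an outlier is and it captures any monotonic relationship, not just a linear one.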
DESCRIPTIVE STATISTICS PERMISSIBLE WITH
DIFFERENT TYPES OF MEASUREMENTS
Type of measurement    Type of descriptive analysis
Ratio                  Index numbers
                       Geometric mean
                       Harmonic mean
source: Zikmund, W. G. (2003). Business Research Methods.
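As a brief illustration of two of the ratio-scale measures named above, the sketch below uses Python's standard `statistics` module. The data are hypothetical: yearly growth factors for the geometric mean and two speeds over equal distances for the harmonic mean.

```python
import statistics

# Hypothetical yearly growth factors (1.10 means +10%).
growth_factors = [1.10, 1.20, 0.90]
# Hypothetical speeds (km/h) over two legs of equal distance.
speeds = [40.0, 60.0]

# Geometric mean: the constant factor giving the same overall growth.
average_growth = statistics.geometric_mean(growth_factors)
# Harmonic mean: the average speed over the whole trip.
average_speed = statistics.harmonic_mean(speeds)

print(f"average growth factor = {average_growth:.4f}")
print(f"average speed         = {average_speed:.1f} km/h")
```

Both measures rely on ratios of values being meaningful, which is why they are listed for ratio-scale data only.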
MEASURES OF CENTRAL TENDENCY AND DISPERSION
PERMISSIBLE WITH EACH TYPE OF MEASUREMENT SCALE
Ordinal    Mann-Whitney U-test
           Kruskal-Wallis test
           Wilcoxon test
source: Zikmund, W. G. (2003). Business Research Methods.
BIVARIATE ANALYSIS-COMMON PROCEDURES
FOR TESTING ASSOCIATION
Number of dependent variables:
    One dependent variable
    Several dependent variables
    Multiple independent and dependent variables
Techniques:
    Multiple regression analysis
    Multiple discriminant analysis
    Multivariate analysis of variance
    Conjoint analysis
source: Zikmund, W. G. (2003). Business Research Methods.
MULTIVARIATE ANALYSIS:
CLASSIFICATION OF INTERDEPENDENCE METHODS
Interdependence methods
    Metric
        Factor analysis
        Cluster analysis
        Multidimensional scaling
    Nonmetric
        Nonmetric multidimensional scaling
source: Zikmund, W. G. (2003). Business Research Methods.
OTHER EXAMPLES OF EDA GRAPHICAL TECHNIQUES
Bihistogram
The bihistogram is an EDA tool for assessing whether a
before-versus-after experiment has caused a change in
location;
variation; or
distribution.
It is a graphical alternative to the paired two-sample t-test.
The bihistogram can be more powerful than the t-test in
that all of the distributional features (location, variation,
skewness, outliers) are evident on a single plot.
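The idea can be sketched without a plotting library by counting both samples in shared bins and printing the two histograms back to back. Everything here is illustrative: the function name `bihistogram_counts`, the choice of five equal-width bins, and the before/after measurements are all hypothetical.

```python
def bihistogram_counts(before, after, bins=5):
    """Count both samples in the same equal-width bins.

    Returns (edges, before_counts, after_counts).
    """
    combined = list(before) + list(after)
    lo, hi = min(combined), max(combined)
    width = (hi - lo) / bins or 1.0  # avoid zero width if all values are equal

    def count(sample):
        counts = [0] * bins
        for v in sample:
            # The maximum value falls in the last bin rather than past it.
            index = min(int((v - lo) / width), bins - 1)
            counts[index] += 1
        return counts

    edges = [lo + i * width for i in range(bins + 1)]
    return edges, count(before), count(after)


# Hypothetical measurements before and after a change to a process.
before = [9.8, 10.1, 10.4, 10.0, 9.7, 10.2, 10.3, 9.9, 10.0, 10.1]
after = [10.6, 10.9, 11.4, 10.2, 11.0, 12.1, 10.8, 11.7, 10.5, 11.2]

edges, before_counts, after_counts = bihistogram_counts(before, after)

print(f"{'before':>10} | {'bin':^14} | after")
for i in range(len(before_counts)):
    left = "#" * before_counts[i]
    right = "#" * after_counts[i]
    print(f"{left:>10} | {edges[i]:5.2f} - {edges[i + 1]:5.2f} | {right}")
```

Reading the two sides against the common bins shows a shift in location and a wider spread in the "after" sample at a glance, which a single t-statistic would not separate.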
OTHER EXAMPLES OF EDA GRAPHICAL TECHNIQUES
Residual plots for checking regression assumptions.
Berry & Feldman (1995) pointed out that most regression assumptions concern the
residuals.
Residuals should have constant variance (homoscedasticity).
When the variance of the error term appears constant, the data are considered homoscedastic.
This can be checked with a plot of residuals versus predicted values.
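The quantities behind such a plot can be sketched as follows, assuming a simple least-squares line fitted to Data Set 1 from the Anscombe example in these slides. The function name `fit_line` is illustrative, and the (predicted, residual) pairs are printed rather than plotted to keep the example free of plotting libraries.

```python
# Data Set 1 from the Anscombe example in these slides.
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(x, y)
predicted = [intercept + slope * a for a in x]
residuals = [b - p for b, p in zip(y, predicted)]

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")
print("predicted  residual")
for p, r in sorted(zip(predicted, residuals)):
    print(f"{p:9.2f}  {r:8.2f}")
```

If the spread of the residuals stays roughly the same across the range of predicted values, the constant-variance assumption looks reasonable; a funnel shape would suggest heteroscedasticity.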
OTHER EXAMPLES OF EDA GRAPHICAL TECHNIQUES: UNIVARIATE
Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
ONLINE TASK
Explain the Exploratory Data Analysis (EDA) approach
introduced by Tukey (1977), using an example.