
CHAPTER 7

DATA EXPLORATORY APPROACHES


This topic covers:
• TRADITIONAL/CLASSICAL APPROACH
• EDA APPROACH BY TUKEY
• MIXED APPROACH
INTRODUCTION
Data exploratory approaches: “how a data analysis should be carried out”
• Traditional/Classical Approach - focuses more on numerical techniques
• EDA (Tukey) Approach - focuses more on graphical techniques
• Mixed Approach - graphical + numerical techniques
EXPLORING DATA ≈ DETECTIVE WORK
(JOHN W TUKEY, 1915-2000)

The role of the researcher is to explore the data in as many ways as possible until a plausible “story” of the data emerges.
A detective does not collect just any
information. Instead he collects evidence and
clues related to the central question of the
case.

WHAT IS EDA?
Exploratory Data Analysis (EDA) is an
approach/philosophy for data analysis that employs a
variety of techniques (mostly graphical) to
• maximize insight into a data set;
• uncover underlying structure;
• extract important variables;
• detect outliers and anomalies;
• test underlying assumptions;
• develop parsimonious models; and
• determine optimal factor settings.

Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
INSIGHT INTO THE DATA
Insight implies detecting and uncovering underlying structure
in the data. Such underlying structure may not be
encapsulated in the list of items above; such items serve as
the specific targets of an analysis, but the real insight and
"feel" for a data set comes as the analyst judiciously probes
and explores the various subtleties of the data. The "feel" for
the data comes almost exclusively from the application of
various graphical techniques, the collection of which serves
as the window into the essence of the data. Graphics are
irreplaceable--there are no quantitative analogues that will
give the same insight as well-chosen graphics.
INSIGHT INTO THE DATA
To get a "feel" for the data, it is not enough for
the analyst to know what is in the data; the
analyst also must know what is NOT in the
data, and the only way to do that is to draw
on our own human pattern-recognition and
comparative abilities in the context of a series
of judicious graphical techniques applied to
the data.

PHILOSOPHY

EDA is not identical to statistical graphics although the two terms are used almost interchangeably.
Statistical graphics is a collection of techniques--all graphically
based and all focusing on one data characterization aspect.
EDA encompasses a larger venue; EDA is an approach to data
analysis that postpones the usual assumptions about what
kind of model the data follow with the more direct approach
of allowing the data itself to reveal its underlying structure
and model.

Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
PHILOSOPHY
EDA is not a mere collection of techniques; EDA is a
philosophy as to how we dissect a data set;
• what we look for;
• how we look; and
• how we interpret.
It is true that EDA heavily uses the collection of techniques
that we call "statistical graphics", but it is not identical to
statistical graphics per se.

Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
HOW DOES EXPLORATORY DATA ANALYSIS
(EDA) DIFFER FROM CLASSICAL DATA ANALYSIS?
For classical analysis, the sequence is

Problem => Data => Model => Analysis => Conclusions

For EDA, the sequence is

Problem => Data => Analysis => Model => Conclusions

Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
HOW DOES EXPLORATORY DATA ANALYSIS (EDA)
DIFFER FROM SUMMARY ANALYSIS?
A summary analysis is simply a numeric reduction of a historical data set. It
is quite passive. Its focus is in the past. Quite commonly, its purpose is to
simply arrive at a few key statistics (for example, mean and standard
deviation) which may then either replace the data set or be added to the
data set in the form of a summary table.
In contrast, EDA has as its broadest goal the desire to gain insight into the
data. Whereas summary statistics are passive and historical, EDA is
active and futuristic. In an attempt to "understand" the process and
improve it in the future, EDA uses the data as a "window" to peer into the
heart of the data. There is an archival role in the research and
manufacturing world for summary statistics, but there is an enormously
larger role for the EDA approach.

Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
CLASSICAL VERSUS EDA

Classical (Traditional)      Exploratory Data Analysis
Frequency distribution       Stem and leaf plot
Histogram                    Boxplot
Mean                         Median
Standard Deviation           Interquartile range

Source: Bluman, A.G. (2008). Elementary Statistics: A Step By Step Approach. 6th Edition. New York: McGraw-Hill.
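
The contrast in the table above (mean and standard deviation versus median and interquartile range) can be sketched in a few lines of Python. This is only an illustration: the two data lists are hypothetical, and the helper name `summarize` is an assumption made for the example. The quartiles come from the standard library's `statistics.quantiles` with its default method; other quartile conventions give slightly different values.

```python
import statistics


def summarize(values):
    """Return classical (mean, SD) and EDA-style (median, IQR) summaries."""
    # Quartiles using the standard library's default ("exclusive") method.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


# Hypothetical data: the second list replaces the largest value with an outlier.
clean = [4, 5, 5, 6, 6, 7, 7, 8, 9]
with_outlier = [4, 5, 5, 6, 6, 7, 7, 8, 90]

clean_summary = summarize(clean)
outlier_summary = summarize(with_outlier)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: clean={clean_summary[name]:.2f}  "
          f"with outlier={outlier_summary[name]:.2f}")
```

In this example the single extreme value moves the mean and standard deviation noticeably, while the median and interquartile range stay the same, which is one reason the EDA column favours them.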

EDA AND DESCRIPTIVE STATISTICS
EDA emphasizes graphical techniques while classical analysis emphasizes quantitative
techniques.
In practice, an analyst typically should use a mixture of graphical and
quantitative techniques.
Both of these are part of Descriptive Statistics.
• Frequency table, summary statistics (min, max, total, etc.)
• Graphs
• Measures of central tendency
• Measures of dispersion / variation
• Measures of shapes/ distribution shapes
• Relationship/Association

AN EDA/GRAPHICS EXAMPLE
A simple, classic (Anscombe) example of the central role
that graphics play in terms of providing insight into a
data set starts with the following data set:
X Y
10.00 8.04
8.00 6.95
13.00 7.58
9.00 8.81
11.00 8.33
14.00 9.96
6.00 7.24
4.00 4.26
12.00 10.84
7.00 4.82
5.00 5.68

Anscombe, Francis (1973), Graphs in Statistical Analysis, The American Statistician, pp. 195-199.
AN EDA/GRAPHICS EXAMPLE
If the goal of the analysis is to compute summary statistics
plus determine the best linear fit for Y as a function of
X, the results might be given as:
N = 11
Mean of X = 9.0
Mean of Y = 7.5
Standard Deviation of X = 3.32
Standard Deviation of Y = 2.03
Correlation = 0.816
The above quantitative analysis, although valuable, gives
us only limited insight into the data.

Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
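
The summary statistics listed above can be reproduced directly from the X and Y values on the earlier slide. The following is a minimal sketch using only the Python standard library; the variable names are assumptions made for the example, and the least-squares slope and intercept are included because the slide mentions a best linear fit.

```python
import math

# X and Y values copied from the data set shown earlier.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

sd_x = math.sqrt(sxx / (n - 1))   # sample standard deviation of X
sd_y = math.sqrt(syy / (n - 1))   # sample standard deviation of Y
r = sxy / math.sqrt(sxx * syy)    # Pearson correlation
slope = sxy / sxx                 # least-squares slope of Y on X
intercept = mean_y - slope * mean_x

print(f"N = {n}")
print(f"Mean of X = {mean_x:.1f}, Mean of Y = {mean_y:.1f}")
print(f"SD of X = {sd_x:.2f}, SD of Y = {sd_y:.2f}")
print(f"Correlation = {r:.3f}")
print(f"Fitted line: Y = {intercept:.2f} + {slope:.2f} X")
```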
AN EDA/GRAPHICS EXAMPLE
In contrast, the following simple scatter plot of the data suggests the following:
1. The data set "behaves like" a linear curve with some scatter;
2. There is no justification for a more complicated model (e.g., quadratic);
3. There are no outliers;
4. The vertical spread of the data appears to be of equal height irrespective of the X-value; this indicates that the data are equally precise throughout and so a "regular" (that is, equi-weighted) fit is appropriate.

AN EDA/GRAPHICS EXAMPLE
                          Data Set 1   Data Set 2   Data Set 3   Data Set 4
N                         11           11           11           11
Mean of X                 9            9            9            9
Mean of Y                 7.5          7.5          7.5          7.5
Standard Deviation of X   3.32         3.32         3.32         3.32
Standard Deviation of Y   2.03         2.03         2.03         2.03
Correlation               0.816        0.816        0.816        0.816

One might naively assume that the four data sets are
"equivalent" since that is what the statistics tell us; but
what do the statistics not tell us?
AN EDA/GRAPHICS EXAMPLE

In fact, the four data sets are far from "equivalent" and a scatter plot of each data set, which would be step 1 of any EDA approach, would tell us that immediately.

Conclusions from the scatter plots are:
1. data set 1 is clearly linear with some scatter.
2. data set 2 is clearly quadratic.
3. data set 3 clearly has an outlier.
4. data set 4 is obviously the victim of a poor experimental design with a single point
far removed from the bulk of the data "wagging the dog".
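
The point can also be checked numerically. The sketch below is an illustration, not part of the original slides: the values for data sets 2 to 4 are not shown in the slides and are taken here from Anscombe's published quartet, and the two extra diagnostics (number of distinct X values and largest absolute residual from the fitted line) are crude stand-ins chosen for the example. They do not replace the scatter plots.

```python
import math

# Data set 1 is the one shown earlier; sets 2-4 are assumed from Anscombe (1973).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
DATASETS = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def describe(x, y):
    """Summary statistics plus two simple diagnostics for one data set."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": math.sqrt(sxx / (n - 1)),
        "sd_y": math.sqrt(syy / (n - 1)),
        "r": sxy / math.sqrt(sxx * syy),
        "distinct_x": len(set(x)),                           # design check
        "max_abs_residual": max(abs(e) for e in residuals),  # outlier check
    }


results = {name: describe(x, y) for name, (x, y) in DATASETS.items()}
for name, s in results.items():
    print(f"set {name}: mean_y={s['mean_y']:.2f} sd_y={s['sd_y']:.2f} "
          f"r={s['r']:.3f} distinct_x={s['distinct_x']} "
          f"max|resid|={s['max_abs_residual']:.2f}")
```

The summary columns agree across the four sets, while the diagnostics differ: data set 4 has only two distinct X values, and data set 3 has one residual much larger than the rest.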

BASIC EDA ASSUMPTION
Normality:
If the normal distribution assumption holds, then
1. the histogram will be bell-shaped;
2. the normal probability plot will be linear;
3. the skewness and excess kurtosis will be close to zero; and
4. a test of normality (Kolmogorov-Smirnov or Shapiro-Wilk) will give a significance level (p-value) greater than 0.05, in which case normality is not rejected and is usually assumed.
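
The numerical part of this checklist (item 3) can be sketched with the standard library alone. The functions `skewness` and `excess_kurtosis` below are illustrative helpers written for this example using simple moment formulas, and both samples are simulated. Formal tests such as Shapiro-Wilk are not implemented here; they are available in statistical libraries (for example `scipy.stats.shapiro` in SciPy).

```python
import random


def skewness(values):
    """Moment-based sample skewness (about 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (about 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


rng = random.Random(0)  # fixed seed so the illustration is repeatable
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(5000)]

for label, sample in (("normal", normal_sample), ("exponential", skewed_sample)):
    print(f"{label:>12}: skewness={skewness(sample):.2f}, "
          f"excess kurtosis={excess_kurtosis(sample):.2f}")
```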

BASIC EDA ASSUMPTION
Normality: Histogram and Normal Probability Plot
[Figure: histogram and normal probability plot for an approximately normal distribution and for a non-normal, U-shaped distribution]
BASIC EDA ASSUMPTION
Normality checking Versus Central Limit Theorem.
The central limit theorem is one of the most
remarkable results of the theory of probability. In its
simplest form, the theorem states that the sum of a
large number of independent observations from the
same distribution has, under certain general
conditions, an approximate normal distribution.
Moreover, the approximation steadily improves as
the number of observations increases.
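
A small simulation can illustrate the theorem. The sketch below is an assumption-laden example rather than a proof: it draws observations from an exponential distribution (which is strongly right-skewed), forms sums of 1, 5 and 30 observations, and uses the skewness of those sums as a rough measure of how far they are from a symmetric, normal-like shape. The function names and the choice of distribution are made up for the illustration.

```python
import random


def skewness(values):
    """Moment-based sample skewness (about 0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def simulated_sums(n_obs, n_rep, rng):
    """Return n_rep sums, each of n_obs independent exponential observations."""
    return [sum(rng.expovariate(1.0) for _ in range(n_obs)) for _ in range(n_rep)]


rng = random.Random(1)  # fixed seed so the illustration is repeatable
skew_by_n = {n_obs: skewness(simulated_sums(n_obs, 4000, rng))
             for n_obs in (1, 5, 30)}

for n_obs, skew in skew_by_n.items():
    print(f"sum of {n_obs:>2} observations: skewness = {skew:.2f}")
```

The skewness shrinks toward zero as more observations are added to each sum, which is the behaviour the theorem describes.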

BASIC EDA ASSUMPTION
DEALING WITH NON-NORMAL
Many classical statistical tests depend on normality assumptions.
Significant skewness and kurtosis clearly indicate that data are not
normal. If a data set exhibits significant skewness or kurtosis (as
indicated by a histogram or the numerical measures), what can we do
about it? One approach is to apply some type of transformation to try
to make the data normal, or more nearly normal. The
Box-Cox transformation is a useful technique for trying to normalize a
data set. In particular, taking the log or square root of a data set is
often useful for data that exhibit moderate right skewness.
Another approach is to use techniques based on distributions other than
the normal, or non-parametric techniques that do not assume a particular distribution.

BASIC EDA ASSUMPTION
DEALING WITH NON-NORMAL - TRANSFORMATION
Purpose: Find transformation to normalize data
Many statistical tests are based on the assumption of normality. The
assumption of normality often leads to tests that are simple,
mathematically tractable, and powerful compared to tests that do not
make the normality assumption. Unfortunately, many real data sets
are in fact not approximately normal. However, an appropriate
transformation of a data set can often yield a data set that does follow
approximately a normal distribution. This increases the applicability
and usefulness of statistical techniques based on the normality
assumption.
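
The effect of the log, square root and Box-Cox transformations mentioned earlier can be sketched as follows. This is an illustration under stated assumptions: the data are simulated from a lognormal distribution (so the log transform is known in advance to work well), `box_cox` is a hand-written helper using the usual Box-Cox formula for positive data, and the parameter values 0.5 and 0 are picked by hand rather than estimated from the data.

```python
import math
import random


def skewness(values):
    """Moment-based sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def box_cox(values, lam):
    """Box-Cox power transform for positive data; lam = 0 is the natural log."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


rng = random.Random(2)  # fixed seed so the illustration is repeatable
raw = [rng.lognormvariate(0.0, 0.8) for _ in range(2000)]  # right-skewed data

skew_raw = skewness(raw)
skew_sqrt = skewness(box_cox(raw, 0.5))  # equivalent in shape to a square root
skew_log = skewness(box_cox(raw, 0.0))   # log transform

print(f"raw: {skew_raw:.2f}  sqrt-like: {skew_sqrt:.2f}  log: {skew_log:.2f}")
```

In practice the Box-Cox parameter is usually chosen from the data, for example with a Box-Cox normality plot, rather than fixed in advance as it is here.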

BASIC EDA ASSUMPTION DEALING WITH
NON-NORMAL – NON-PARAMETRIC

Applied Statistics
• Descriptive Statistics: statistics that are used to describe a population or sample.
• Inferential Statistics: the process of drawing information from sampled observations of a population and making conclusions about the population.
   • Parametric
   • Non-parametric
INFERENTIAL STATISTICS:
PARAMETRIC VERSUS NONPARAMETRIC
                                  Parametric                            Non-parametric
Assumed distribution              Normal                                Any
Assumed variance                  Homogeneous                           Any
Typical data                      Ratio or Interval                     Ordinal or Nominal
Data set relationships            Independent                           Any
Usual central measure             Mean                                  Median
Benefits                          Can draw more conclusions             Simplicity; less affected by outliers

Tests
Choosing                          Choosing parametric test              Choosing a non-parametric test
Correlation test                  Pearson                               Spearman
Independent measures, 2 groups    Independent-measures t-test           Mann-Whitney test
Independent measures, >2 groups   One-way, independent-measures ANOVA   Kruskal-Wallis test
Repeated measures, 2 conditions   Matched-pair t-test                   Wilcoxon test
Repeated measures, >2 conditions  One-way, repeated measures ANOVA      Friedman's test
Source: http://changingminds.org/explanations/research/analysis/parametric_non-parametric.htm
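
The "Correlation test" row (Pearson versus Spearman) and the "less affected by outliers" benefit can be illustrated with a short sketch. The helpers `pearson`, `average_ranks` and `spearman` are written for this example, and the data are hypothetical: Y increases with X throughout, but the last Y value is extreme.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: perfectly increasing, but with one extreme Y value.
x = list(range(1, 11))
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson = {r_pearson:.3f}, Spearman = {r_spearman:.3f}")
```

The rank-based coefficient reports a perfect monotone relationship, while the Pearson coefficient is pulled down by the single extreme value.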
DESCRIPTIVE STATISTICS PERMISSIBLE WITH
DIFFERENT TYPES OF MEASUREMENTS

Type of measurement                  Type of descriptive analysis
Nominal, two categories              Frequency table; proportion (percentage); mode
Nominal, more than two categories    Frequency table; category proportions (percentages); mode
Ordinal                              Rank order; median
Interval                             Arithmetic mean
Ratio                                Index numbers; geometric mean; harmonic mean
source: Zikmund, W. G. (2003). Business Research Methods.
MEASURES OF CENTRAL TENDENCY AND DISPERSION
PERMISSIBLE WITH EACH TYPE OF MEASUREMENT SCALE

Type of Scale       Measure of Central Tendency   Measure of Dispersion
Nominal             Mode                          None
Ordinal             Median                        Percentile
Interval or ratio   Mean                          Standard deviation

source: Zikmund, W. G. (2003). Business Research Methods.
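
The table above can be read as a simple lookup rule. The sketch below encodes it with the Python `statistics` module; the function name `central_tendency`, the scale labels and the sample values are all assumptions made for the illustration. For ordinal data it uses `median_low` so that the result is always one of the observed categories.

```python
import statistics


def central_tendency(values, scale):
    """Return the measure of central tendency permitted for the given scale."""
    if scale == "nominal":
        return statistics.mode(values)
    if scale == "ordinal":
        # median_low returns an actual observed value (no averaging of ranks).
        return statistics.median_low(values)
    if scale in ("interval", "ratio"):
        return statistics.mean(values)
    raise ValueError(f"unknown measurement scale: {scale!r}")


# Hypothetical examples for each scale.
brands = ["A", "B", "A", "C"]           # nominal labels
satisfaction = [1, 2, 2, 3, 5]          # ordinal codes (1 = low, 5 = high)
temperatures = [20.0, 22.0, 24.0]       # interval measurements

print(central_tendency(brands, "nominal"))
print(central_tendency(satisfaction, "ordinal"))
print(central_tendency(temperatures, "interval"))
```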


COMMON BIVARIATE TESTS OF DIFFERENCES

Type of measurement   Differences among two independent groups    Differences among three or more independent groups
Interval and ratio    Independent groups: t-test or Z-test        One-way ANOVA
Ordinal               Mann-Whitney U-test; Wilcoxon test          Kruskal-Wallis test
Nominal               Z-test (two proportions); Chi-square test   Chi-square test
source: Zikmund, W. G. (2003). Business Research Methods.
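
For the two-group column, the test statistics behind the t-test and the Mann-Whitney U-test can be sketched as below. This is a minimal illustration with hypothetical data and hand-written helpers (`pooled_t_statistic`, `mann_whitney_u`); it computes only the statistics, not p-values, which would normally come from a statistics package.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in the two groups."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def mann_whitney_u(a, b):
    """U statistic for group a: pairs with a > b count 1, ties count 1/2."""
    u = 0.0
    for ai in a:
        for bi in b:
            if ai > bi:
                u += 1.0
            elif ai == bi:
                u += 0.5
    return u


# Hypothetical scores for two independent groups.
group_a = [12, 15, 14, 10, 13]
group_b = [18, 20, 17, 21, 19]

t_stat = pooled_t_statistic(group_a, group_b)
u_stat = mann_whitney_u(group_a, group_b)
print(f"t = {t_stat:.3f}, U = {u_stat:.1f}")
```

The t statistic uses the actual distances between values, so it suits interval and ratio data; U depends only on the ordering of the values, which is why it is listed for ordinal data.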
BIVARIATE ANALYSIS-COMMON PROCEDURES
FOR TESTING ASSOCIATION

Measurement level           Measure of association                   Sample question
Interval and ratio scales   Correlation coefficient (Pearson's r);   Are dollar sales associated with advertising
                            bivariate regression analysis            dollar expenditures?
Ordinal scales              Chi-square; Spearman rank correlation;   Is rank preference for shopping centers associated
                            Kendall's rank correlation               with Likert scale ranking of convenience of locations?
Nominal scales              Chi-square; phi-coefficient;             Is sex associated with brand awareness
                            contingency coefficient                  (aware/not aware)?
source: Zikmund, W. G. (2003). Business Research Methods.
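
For the nominal row, the chi-square statistic, phi-coefficient and contingency coefficient can be computed from a contingency table as sketched below. The 2 x 2 counts are hypothetical (they are not from the source), and `chi_square` is a helper written for this example; it returns only the statistic and the total count, not a p-value.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and total count for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, n


# Hypothetical counts: rows = two groups, columns = aware / not aware.
table = [[30, 10],
         [20, 40]]

chi2, n = chi_square(table)
phi = math.sqrt(chi2 / n)                    # phi-coefficient (2 x 2 tables)
contingency = math.sqrt(chi2 / (chi2 + n))   # contingency coefficient

print(f"chi-square = {chi2:.3f}, phi = {phi:.3f}, C = {contingency:.3f}")
```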
MULTIVARIATE ANALYSIS:
CLASSIFICATION OF DEPENDENCE METHODS
Dependence methods: how many variables are dependent?
• One dependent variable
   • Metric (the scales are ratio or interval): multiple regression analysis
   • Nonmetric (the scales are nominal or ordinal): multiple discriminant analysis
• Several dependent variables
   • Metric (the scales are ratio or interval): multivariate analysis of variance
   • Nonmetric (the scales are nominal or ordinal): conjoint analysis
• Multiple independent and dependent variables: canonical analysis
source: Zikmund, W. G. (2003). Business Research Methods.
MULTIVARIATE ANALYSIS:
CLASSIFICATION OF INTERDEPENDENCE METHODS
Interdependence methods: are inputs metric?
• Metric (the scales are ratio or interval): factor analysis; cluster analysis; metric multidimensional scaling
• Nonmetric (the scales are nominal or ordinal): nonmetric multidimensional scaling
source: Zikmund, W. G. (2003). Business Research Methods.
OTHER EXAMPLES OF EDA GRAPHICAL TECHNIQUES

Bihistogram
The bihistogram is an EDA tool for assessing whether a
before-versus-after experiment has caused a change in
• location;
• variation; or
• distribution.
It is a graphical alternative to the paired two-sample t-test.
The bihistogram can be more powerful than the t-test in
that all of the distributional features (location, variation,
skewness, outliers) are evident on a single plot.
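
A rough text version of a bihistogram can be sketched without any plotting library. In the example below the before and after values are hypothetical, the helper names are assumptions, and the two histograms are printed back to back (left and right of a shared axis) rather than above and below it as in a real plot. The key idea it keeps is that both groups are binned on the same bin edges.

```python
def histogram_counts(values, low, high, n_bins):
    """Count values in n_bins equal-width bins spanning [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - low) / width), n_bins - 1)  # top edge joins last bin
        counts[index] += 1
    return counts


def text_bihistogram(before, after, n_bins=6):
    """Return shared-bin counts for both groups and printable rows."""
    low = min(min(before), min(after))
    high = max(max(before), max(after))
    if high == low:
        raise ValueError("all values are identical; nothing to bin")
    width = (high - low) / n_bins
    before_counts = histogram_counts(before, low, high, n_bins)
    after_counts = histogram_counts(after, low, high, n_bins)
    lines = []
    for k in range(n_bins):
        left_bar = "#" * before_counts[k]
        right_bar = "#" * after_counts[k]
        bin_start = low + k * width
        lines.append(f"{left_bar:>12} | {bin_start:6.1f} | {right_bar}")
    return before_counts, after_counts, lines


# Hypothetical before/after measurements.
before = [10, 11, 11, 12, 12, 12, 13, 13, 14, 15]
after = [12, 13, 14, 14, 15, 15, 15, 16, 16, 18]

before_counts, after_counts, lines = text_bihistogram(before, after)
print("      before | bin start | after")
print("\n".join(lines))
```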

OTHER EXAMPLES OF EDA GRAPHICAL TECHNIQUES
Residual plots for checking regression assumptions.
Berry & Feldman (1995) pointed out that most regression assumptions are concerned with
residuals.
• Residuals have constant variance (homoscedasticity)
• When the error term variance appears constant, the data are considered homoscedastic
• This is checked with a plot of residuals versus predicted values

If the residuals are scattered randomly around zero with roughly the same spread across the predicted values, the assumption of homoscedasticity is not violated. If the spread of the residuals changes with the predicted values, or there is a high concentration of residuals above zero or below zero, the variance is not constant (heteroscedasticity).
OTHER EXAMPLES OF EDA GRAPHICAL TECHNIQUES: UNIVARIATE

• Run Sequence Plot: 1.3.3.25
• Lag Plot: 1.3.3.15
• Histogram: 1.3.3.14
• Normal Probability Plot: 1.3.3.21
• 4-Plot: 1.3.3.32
• PPCC Plot: 1.3.3.23
• Weibull Plot: 1.3.3.30
• Probability Plot: 1.3.3.22
• Box-Cox Linearity Plot: 1.3.3.5
• Box-Cox Normality Plot: 1.3.3.6
• Bootstrap Plot: 1.3.3.4

Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
OTHER EXAMPLES OF EDA GRAPHICAL TECHNIQUES: TIME SERIES

• Run Sequence Plot: 1.3.3.25
• Spectral Plot: 1.3.3.27
• Autocorrelation Plot: 1.3.3.1
• Complex Demodulation Amplitude Plot: 1.3.3.8
• Complex Demodulation Phase Plot: 1.3.3.9

Source: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
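
The quantities behind two of these plots (the lag plot and the autocorrelation plot) are easy to compute by hand. The sketch below is an illustration with simulated series, not output from the handbook: `autocorrelation` is a helper written for this example, and `lag_pairs` holds the (value at time t, value at time t+1) points that a lag plot would display.

```python
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum((series[t] - mean) * (series[t + lag] - mean)
                    for t in range(n - lag))
    return numerator / denominator


rng = random.Random(4)  # fixed seed so the illustration is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # random, uncorrelated series

# A random walk: each value is the previous value plus a random step.
walk = []
level = 0.0
for step in noise:
    level += step
    walk.append(level)

acf_noise = autocorrelation(noise, 1)
acf_walk = autocorrelation(walk, 1)
lag_pairs = list(zip(walk[:-1], walk[1:]))  # points for a lag-1 lag plot

print(f"lag-1 autocorrelation: noise = {acf_noise:.3f}, random walk = {acf_walk:.3f}")
```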
ONLINE TASK
Explain the Exploratory Data Analysis (EDA) approach
introduced by Tukey (1977), using an example.

Post your answer on Padlet: https://padlet.com/norhaslizadesa/EDA_GROUPE
