
Module 2 - Statistical Foundations
• Descriptive statistics, Statistical Features,
summarizing the data, outlier analysis,
Understanding distributions and plots,
Univariate statistical plots and usage, Bivariate
and multivariate statistics, Dimensionality
Reduction, Over and Under Sampling,
Bayesian Statistics, Statistical Modeling for
data analysis
Basics of Statistics
• Definition: Science of collection, presentation, analysis,
and reasonable interpretation of data.
• Statistics presents a rigorous scientific method for
gaining insight into data. For example, suppose we
measure the weight of 100 patients in a study. With so
many measurements, simply looking at the data fails to
provide an informative account. However, statistics can
give an instant overall picture of the data based on graphical
presentation or numerical summarization, irrespective of
the number of data points. Besides data summarization,
another important task of statistics is to make inferences
and predict relations among variables.
A Taxonomy of Statistics
Statistics in Data science

 Statistics is a set of mathematical methods and tools that
enable us to answer important questions about data. It is
divided into two categories:
 Descriptive Statistics - offers methods to summarise data
by transforming raw observations into meaningful information
that is easy to interpret and share.
 Inferential Statistics - offers methods to study experiments
done on small samples of data and generalise the inferences
to the entire population (entire domain).
General Statistics Skills

 How to define statistically answerable questions for
effective decision making.
 Calculating and interpreting common statistics and how
to use standard data visualization techniques to
communicate findings.
 Understanding of how mathematical statistics is applied to
the field, concepts such as the central limit theorem and
the law of large numbers.
 Making inferences from estimates of location and
variability (ANOVA).
 How to identify the relationship between target variables
and independent variables.
 How to design statistical hypothesis testing experiments,
A/B testing, and so on.
 How to calculate and interpret performance metrics like the
p-value, alpha, Type I and Type II errors, and so on.
Important Statistics Concepts

 Getting Started — Understanding types of data (rectangular
and non-rectangular), estimates of location, estimates of
variability, data distributions, binary and categorical data,
correlation, and relationships between different types of variables.
 Distribution of Statistic — random numbers, the law of
large numbers, Central Limit Theorem, standard error, and
so on.
 Data sampling and Distributions — random sampling,
sampling bias, selection bias, sampling distribution,
bootstrapping, confidence interval, normal distribution,
t-distribution, binomial distribution, chi-square distribution,
F-distribution, Poisson and exponential distributions.
Six-Step Problem Solving Model

 This technique uses an analytical approach to solve any given
problem. As the name suggests, it uses 6 steps to solve a
problem, which are:
1. Have a clear and concise problem definition.
2. Study the roots of the problem.
3. Brainstorm possible solutions to the problem.
4. Examine the possible solutions and choose the best one.
5. Implement the solution effectively.
6. Evaluate the results.
 This model follows the mindset of continuous development and
improvement. So, at step 6, if your results didn’t turn out the
way you wanted, you can go back to step 4 and choose another
solution, or to step 1 and try to define the problem differently.
Statistical Description of Data
• Statistics describes a numeric set of data by its
• Center
• Variability
• Shape
• Statistics describes a categorical set of data by
• Frequency, percentage or proportion of each category
Some Definitions
Variable - any characteristic of an individual or entity. A variable can take different
values for different individuals. Variables can be categorical or quantitative. Per S. S.
Stevens…
• Nominal - Categorical variables with no inherent order or ranking sequence, such as names or classes
(e.g., gender). Values may be numeric labels without numerical meaning (e.g., I, II, III). The only operation
that can be applied to Nominal variables is enumeration.
• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe. Can be compared for
equality, or greater or less, but not how much greater or less.
• Interval - Values of the variable are ordered as in Ordinal, and additionally, differences between
values are meaningful, however, the scale is not absolutely anchored. Calendar dates and
temperatures on the Fahrenheit scale are examples. Addition and subtraction, but not
multiplication and division are meaningful operations.
• Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary zero point, e.g. age,
weight, temperature (Kelvin). Addition, subtraction, multiplication, and division are all meaningful
operations.
Some Definitions
Distribution - (of a variable) tells us what values the variable takes and how often it
takes these values.
• Unimodal - having a single peak
• Bimodal - having two distinct peaks
• Symmetric - left and right half are mirror images.
Frequency Distribution
Consider a data set of 26 children of ages 1-6 years. Then the frequency
distribution of variable ‘age’ can be tabulated as follows:

Frequency Distribution of Age

Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2

Grouped Frequency Distribution of Age:


Age Group 1-2 3-4 5-6

Frequency 8 12 6
Cumulative Frequency
Cumulative frequency of data in previous page

Age 1 2 3 4 5 6

Frequency 5 3 7 5 4 2

Cumulative Frequency 5 8 15 20 24 26

Age Group 1-2 3-4 5-6


Frequency 8 12 6
Cumulative Frequency 8 20 26
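The frequency and cumulative frequency tables above can be reproduced in a few lines of Python (a sketch, assuming raw ages consistent with the tabulated counts):

```python
# Frequency and cumulative frequency of the ages of 26 children,
# matching the tables above (hypothetical raw data consistent with them).
from collections import Counter
from itertools import accumulate

ages = [1]*5 + [2]*3 + [3]*7 + [4]*5 + [5]*4 + [6]*2  # 26 observations

freq = Counter(ages)                       # age -> count
values = sorted(freq)                      # [1, 2, 3, 4, 5, 6]
counts = [freq[v] for v in values]         # [5, 3, 7, 5, 4, 2]
cum = list(accumulate(counts))             # running total: [5, 8, 15, 20, 24, 26]

for v, f, c in zip(values, counts, cum):
    print(v, f, c)
```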
Data Presentation
Two types of statistical presentation of data - graphical and numerical.

Graphical Presentation: We look for the overall pattern and for striking deviations from
that pattern. The overall pattern is usually described by the shape, center, and spread of
the data. An individual value that falls outside the overall pattern is called an outlier.

Bar diagrams and pie charts are used for categorical variables.

Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
Data Presentation – Categorical Variable
Bar Diagram: Lists the categories and presents the percent or count of individuals who fall
in each category.

[Figure 1: Bar chart of the number of subjects in treatment groups]

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.000             100
Data Presentation – Categorical Variable
Pie Chart: Lists the categories and presents the percent or count of individuals who fall in
each category.

[Figure 2: Pie chart of subjects in treatment groups: Group 1 = 25%, Group 2 = 42%, Group 3 = 33%]

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.000             100
Graphical Presentation – Numerical Variable

Histogram: The overall pattern can be described by its shape, center, and spread. The
following age distribution is right skewed. The center lies between 80 and 100. There are
no outliers.

[Figure 3: Histogram of the age distribution, number of subjects by age in months (40 to 140+)]

Mean 90.42
Standard Error 3.90
Median 84
Mode 84
Standard Deviation 30.23
Sample Variance 913.84
Kurtosis -1.18
Skewness 0.39
Range 95
Minimum 48
Maximum 143
Sum 5425
Count 60
Graphical Presentation – Numerical Variable

Box-Plot: Describes the five-number summary.

[Figure 3: Box plot of the distribution of age, marking min, q1, median, q3, and max on a 0-160 scale]
Numerical Presentation
A fundamental concept in summary statistics is that of a central value for a set of
observations and the extent to which the central value characterizes the whole
set of data. Measures of central value such as the mean or median must be
coupled with measures of data dispersion (e.g., average distance from the
mean) to indicate how well the central value characterizes the data as a whole.

To understand how well a central value characterizes a set of observations, let
us consider the following two sets of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50. But the distances of the observations from the
mean in data set A are larger than in data set B. Thus, the mean of data set B is
a better representation of its data than is the case for set A.
Methods of Center Measurement

Center measurement is a summary measure of the overall level of a dataset

Commonly used methods are mean, median, mode, geometric mean etc.

 Mean: Sum up all the observations and divide by the number of observations. The mean
of 20, 30, 40 is (20 + 30 + 40)/3 = 30.

Notation: Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a variable $x$. Then the mean of
this variable is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Methods of Center Measurement

Median: The middle value in an ordered sequence of observations. That is, to find
the median we need to order the data set and then find the middle value. In case of
an even number of observations the average of the two middle most values is the
median. For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data
giving {3, 5, 6, 7, 9}, then choose the middle value 6. If the number of observations
is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of the two middle
values from the sorted sequence, in this case, (5 + 6) / 2 = 5.5.

Mode: The value that is observed most frequently. The mode is undefined for
sequences in which no observation is repeated.
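The three center measures above are available in Python's standard library; a quick check against the slide's examples:

```python
import statistics

# Center measures for the examples in the slides above.
mean_val = statistics.mean([20, 30, 40])             # (20 + 30 + 40) / 3
median_odd = statistics.median([9, 3, 6, 7, 5])      # sorted: 3 5 6 7 9 -> middle is 6
median_even = statistics.median([9, 3, 6, 7, 5, 2])  # average of 5 and 6 -> 5.5
mode_val = statistics.mode([3, 5, 3, 9, 3])          # most frequent value

print(mean_val, median_odd, median_even, mode_val)
```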
Mean or Median
The median is less sensitive to outliers (extreme scores) than the mean and thus a
better measure than the mean for highly skewed distributions, e.g. family income. For
example mean of 20, 30, 40, and 990 is (20+30+40+990)/4 =270. The median of these
four observations is (30+40)/2 =35. Here 3 observations out of 4 lie between 20-40.
So, the mean 270 really fails to give a realistic picture of the major part of the data. It
is influenced by extreme value 990.
Methods of Variability Measurement

 Variability (or dispersion) measures the amount of scatter in a dataset.
 Commonly used methods: range, variance, standard deviation, interquartile range,
coefficient of variation, etc.

 Range: The difference between the largest and the smallest observations. The range of
10, 5, 2, 100 is (100 - 2) = 98. It's a crude measure of variability.
Methods of Variability Measurement

Variance: The variance of a set of observations is the average of the squares of the
deviations of the observations from their mean. In symbols, the variance of the n
observations x1, x2,…xn is

$$s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Variance of 5, 7, 3? The mean is (5 + 7 + 3)/3 = 5 and the variance is

$$\frac{(5-5)^2 + (3-5)^2 + (7-5)^2}{3 - 1} = \frac{0 + 4 + 4}{2} = 4$$

Standard Deviation: The square root of the variance. The standard deviation of the above
example is 2.
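The `statistics` module computes the same sample variance (dividing by n - 1) and standard deviation:

```python
import statistics

data = [5, 7, 3]
var = statistics.variance(data)  # sample variance: divides by n - 1
sd = statistics.stdev(data)      # square root of the sample variance

print(var, sd)  # 4 2.0
```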
Methods of Variability Measurement

Quartiles: Data can be divided into four regions that cover the total range of observed
values. Cut points for these regions are known as quartiles.

 In notation, the q-th quartile of a data set is the ((n+1)/4)·q-th observation of the
ordered data, where q is the desired quartile and n is the number of observations.

 The first quartile (Q1) is the cut point below which the first 25% of the ordered data
lie. The second quartile (Q2) is the cut point at 50% of the data; it is the median. The
third quartile (Q3) is the cut point below which 75% of the data lie, i.e., 25% of the
data lie between the median and Q3.

 Q1 is the median of the first half of the ordered observations and Q3 is the median of
the second half of the ordered observations.
Methods of Variability Measurement

 In the following example, Q1 = the ((15+1)/4)·1 = 4th observation of the ordered data.
The 4th observation is 11, so Q1 of this data is 11.

 An example with 15 numbers:

3 6 7 11 13 22 30 40 44 50 52 61 68 80 94

The first quartile is Q1 = 11. The second quartile is Q2 = 40 (this is also the median). The
third quartile is Q3 = 61.

 Inter-quartile Range: The difference between Q3 and Q1. The inter-quartile range of the
previous example is 61 - 11 = 50. The middle half of the ordered data lie between 11 and 61.
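Python's `statistics.quantiles` with the default `method="exclusive"` uses exactly the (n+1)-based rule above, so it reproduces the 15-number example:

```python
import statistics

# The 15 ordered observations from the slide.
data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]

# method="exclusive" positions cut points at (n + 1) * q / 4,
# matching the quartile rule described above.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 11.0 40.0 61.0 50.0
```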
Deciles and Percentiles
Deciles: If data is ordered and divided into 10 parts, then cut points are called Deciles

Percentiles: If data is ordered and divided into 100 parts, then the cut points are called
Percentiles. The 25th percentile is Q1, the 50th percentile is the Median (Q2), and the
75th percentile is Q3.

 In notation, the p-th percentile of a data set is the ((n+1)/100)·p-th observation of the
ordered data, where p is the desired percentile and n is the number of observations.

 Coefficient of Variation: The standard deviation of the data divided by its mean, usually
expressed in percent:

$$CV = \frac{s}{\bar{x}} \times 100\%$$
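The (n+1)-rule percentile and the coefficient of variation can be sketched directly (the linear interpolation for non-integer ranks is one common convention, not the only one):

```python
import statistics

def percentile(sorted_data, p):
    """Percentile via the ((n + 1) * p / 100)-th observation rule from the slide.
    Interpolates linearly when the rank is not a whole number."""
    n = len(sorted_data)
    rank = (n + 1) * p / 100
    lo = int(rank)            # floor of the rank
    frac = rank - lo
    if lo < 1:
        return sorted_data[0]
    if lo >= n:
        return sorted_data[-1]
    return sorted_data[lo - 1] + frac * (sorted_data[lo] - sorted_data[lo - 1])

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
print(percentile(data, 25))   # 25th percentile = Q1 = 11
print(percentile(data, 50))   # 50th percentile = median = 40
print(percentile(data, 75))   # 75th percentile = Q3 = 61

cv = statistics.stdev(data) / statistics.mean(data) * 100  # CV in percent
print(round(cv, 1))
```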
Five Number Summary

 Five Number Summary: The five number summary of a distribution consists of the
smallest observation (Minimum), the first quartile (Q1), the median (Q2), the third
quartile (Q3), and the largest observation (Maximum), written in order from smallest
to largest.

Box Plot: A box plot is a graph of the five number summary. The central box spans
the quartiles. A line within the box marks the median. Lines extending above and
below the box mark the smallest and the largest observations (i.e., the range).
Outlying samples may be additionally plotted outside the range.
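The five number summary can be computed in one small helper (using the (n+1)-based quartile rule from the earlier slides):

```python
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3, max; quartiles via the (n + 1) rule
    (statistics.quantiles with method='exclusive')."""
    s = sorted(data)
    q1, q2, q3 = statistics.quantiles(s, n=4, method="exclusive")
    return min(s), q1, q2, q3, max(s)

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
summary = five_number_summary(data)
print(summary)  # (3, 11.0, 40.0, 61.0, 94)
```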
Boxplot
[Figure: Box plots of the distribution of age in months, marking min, q1, median, q3, and max on a 0-160 scale]
Choosing a Summary
The five number summary is usually better than the mean and standard deviation for
describing a skewed distribution or a distribution with extreme outliers. The mean and
standard deviation are reasonable for symmetric distributions that are free of outliers.

In real life we can’t always expect symmetry of the data. It is common practice to include
the number of observations (n), mean, median, standard deviation, and range for data
summarization purposes. Other summary statistics, such as Q1, Q3, and the coefficient
of variation, can be included when they are important for describing the data.
Shape of Data
• Shape of data is measured by
– Skewness
– Kurtosis
Skewness
• Measures asymmetry of data
– Positive or right skewed: Longer right tail
– Negative or left skewed: Longer left tail

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then

$$\text{Skewness} = \frac{\sqrt{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$
Kurtosis
• Measures peakedness of the distribution of data. The
kurtosis of normal distribution is 0.

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then

$$\text{Kurtosis} = \frac{n\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{2}} - 3$$
Summary of the Variable ‘Age’ in the given
data set
Mean 90.42
Standard Error 3.90
Median 84
Mode 84
Standard Deviation 30.23
Sample Variance 913.84
Kurtosis -1.18
Skewness 0.39
Range 95
Minimum 48
Maximum 143
Sum 5425
Count 60

[Histogram of Age: number of subjects by age in months, 40 to 160]
Summary of the Variable ‘Age’ in the given
data set

[Boxplot of Age in Month: age (months) on a 60-140 scale]
Class Summary (First Part)
So far we have learned:
 Statistics and data presentation/data summarization
 Graphical presentation: bar chart, pie chart, histogram, and box plot
 Numerical presentation: measuring the central value of data (mean, median, mode, etc.),
measuring dispersion (standard deviation, variance, coefficient of variation, range,
inter-quartile range, etc.), quartiles, percentiles, and the five number summary
Outlier Analysis
What Are Outliers?

• Outlier: A data object that deviates significantly from the normal objects, as if it
were generated by a different mechanism
– Ex.: an unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ...
• Outliers are different from noise data
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: at an early stage a novel pattern is an outlier;
later it is merged into the model
• Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis

Types of Outliers (I)
• Three kinds: global, contextual and collective outliers
• Global outlier (or point anomaly) Global Outlier

– Object is Og if it significantly deviates from the rest of the data set


– Ex. Intrusion detection in computer networks
– Issue: Find an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
– Object is Oc if it deviates significantly based on a selected context
– Ex. 80 °F in Urbana: outlier? (depends on whether it is summer or winter)
– Attributes of data objects should be divided into two groups
• Contextual attributes: defines the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
– Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
– Issue: How to define or formulate meaningful context?

Types of Outliers (II)
• Collective Outliers
– A subset of data objects collectively deviate significantly
from the whole data set, even if the individual data objects
may not be outliers
– Applications: E.g., intrusion detection: Collective Outlier
• When a number of computers keep sending denial-of-
service packages to each other
 Detection of collective outliers:
 Consider not only the behavior of individual objects, but also that of groups of objects
 Need background knowledge of the relationship among data objects, such as a
distance or similarity measure on objects
 A data set may have multiple types of outliers
 One object may belong to more than one type of outlier
Challenges of Outlier Detection
 Modeling normal objects and outliers properly
 Hard to enumerate all possible normal behaviors in an application
 The border between normal and outlier objects is often a gray area
 Application-specific outlier detection
 The choice of distance measure among objects, and the model of the relationship
among objects, are often application-dependent
 E.g., in clinical data a small deviation could be an outlier, while marketing
analysis tolerates larger fluctuations
 Handling noise in outlier detection
 Noise may distort the normal objects and blur the distinction between normal
objects and outliers; it may help hide outliers and reduce the effectiveness of
outlier detection
 Understandability
 Understand why these are outliers: justification of the detection
 Specify the degree of an outlier: the unlikelihood of the object being generated
by a normal mechanism


Outlier Detection (1): Statistical Methods
• Statistical methods (also known as model-based methods) assume that the normal data
follow some statistical model (a stochastic model)
– The data not following the model are outliers.

 Example (right figure): First use a Gaussian distribution to model the normal data
 For each object y in region R, estimate g_D(y), the probability that y fits the
Gaussian distribution
 If g_D(y) is very low, y is unlikely to have been generated by the Gaussian model
and is thus an outlier
 Effectiveness of statistical methods: highly depends on whether the assumed
statistical model holds for the real data
 There are rich alternatives among statistical models
 E.g., parametric vs. non-parametric


Statistical Approaches
Statistical approaches assume that the objects in a data set are generated by
a stochastic process (a generative model)
• Idea: learn a generative model fitting the given data set, and then identify the
objects in low probability regions of the model as outliers
• Methods are divided into two categories: parametric vs. non-parametric
• Parametric method
– Assumes that the normal data is generated by a parametric distribution
with parameter θ
– The probability density function of the parametric distribution f(x, θ)
gives the probability that object x is generated by the distribution
– The smaller this value, the more likely x is an outlier
• Non-parametric method
– Does not assume an a priori statistical model; determines the model from
the input data
– Not completely parameter-free, but the number and nature of the
parameters are flexible and not fixed in advance
– Examples: histogram and kernel density estimation

Parametric Methods I: Detection of Univariate Outliers Based on
Normal Distribution
• Univariate data: A data set involving only one attribute or variable
• Often assume that data are generated from a normal distribution, learn the
parameters from the input data, and identify the points with low probability as
outliers
• Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
– Use the maximum likelihood method to estimate μ and σ

 Taking derivatives of the log-likelihood with respect to μ and σ², we derive the
following maximum likelihood estimates:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \hat{\mu}\right)^2$$

 For the above data with n = 10, we have $\hat{\mu} = 28.61$ and $\hat{\sigma} \approx 1.51$

 Then (24 - 28.61)/1.51 = -3.04 < -3, so 24 is an outlier, since it lies more than
three standard deviations below the mean
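A minimal sketch of this procedure in Python, assuming a |z| ≥ 3 rule of thumb (rounded intermediate values may differ slightly from the slide's figures):

```python
import math

# The ten average temperatures from the slide.
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

n = len(temps)
mu = sum(temps) / n                                       # MLE of the mean
sigma = math.sqrt(sum((x - mu) ** 2 for x in temps) / n)  # MLE of the std dev

# z-score of each point; magnitudes near or beyond 3 are flagged as outliers.
z = {x: (x - mu) / sigma for x in temps}
most_extreme = max(temps, key=lambda x: abs(x - mu))

print(round(mu, 2), round(sigma, 2), round(z[24.0], 2), most_extreme)
```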
Parametric Methods I: The Grubbs’ Test
• Univariate outlier detection: The Grubbs’ test (maximum normed residual
test) ─ another statistical method under the normal distribution assumption
– For each object x in a data set, compute its z-score, z = |x − x̄|/s; x is an outlier if

$$z \ge \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$

where $t^2_{\alpha/(2N),\,N-2}$ is the value taken by a t-distribution with N − 2 degrees
of freedom at a significance level of α/(2N), and N is the number of objects in the
data set

Parametric Methods II: Detection of Multivariate
Outliers
• Multivariate data: A data set involving two or more attributes or variables
• Transform the multivariate outlier detection task into a univariate outlier
detection problem
• Method 1. Compute the Mahalanobis distance
– Let ō be the mean vector of a multivariate data set. The Mahalanobis
distance from an object o to ō is MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō), where S is
the covariance matrix
– Use the Grubbs’ test on this measure to detect outliers
• Method 2. Use the χ²-statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}$$

– where Eᵢ is the mean of the i-th dimension among all objects, oᵢ is the
object's value in the i-th dimension, and n is the dimensionality
– If the χ²-statistic is large, then object o is an outlier

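A pure-Python sketch of the MDist definition above for two attributes, on made-up data (real code would use numpy/scipy and handle arbitrary dimensionality):

```python
# Mahalanobis distance as defined above: MDist(o, mean) = (o - mean)^T S^-1 (o - mean),
# with S the sample covariance matrix, hand-inverted here for the 2x2 case.

def mahalanobis_sq(o, X):
    n = len(X)
    mx = sum(r[0] for r in X) / n
    my = sum(r[1] for r in X) / n
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]]
    sxx = sum((r[0] - mx) ** 2 for r in X) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in X) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in X) / (n - 1)
    det = sxx * syy - sxy * sxy          # S must be invertible
    inv = [[syy / det, -sxy / det], [-sxy / det, sxx / det]]
    dx, dy = o[0] - mx, o[1] - my
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

# Five clustered points plus one point far away in the first attribute.
X = [[2, 2], [2, 3], [3, 2], [3, 3], [2.5, 2.5], [9, 2.5]]
dists = [mahalanobis_sq(o, X) for o in X]
print(dists)  # the last object has the largest distance
```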
Parametric Methods III: Using Mixture of Parametric
Distributions
• Assuming the data are generated by a single normal distribution can
sometimes be overly simplistic
• Example (right figure): The objects between the two
clusters cannot be captured as outliers, since they are
close to the estimated mean
 To overcome this problem, assume the normal data is generated by two
normal distributions. For any object o in the data set, the probability that
o is generated by the mixture of the two distributions is given by

$$\Pr(o \mid \theta_1, \theta_2) = f_{\theta_1}(o) + f_{\theta_2}(o)$$

where $f_{\theta_1}$ and $f_{\theta_2}$ are the probability density functions of θ1 and θ2
 Then use the EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from the data
 An object o is an outlier if it does not belong to any cluster
Non-Parametric Methods: Detection Using Histogram

• The model of normal data is learned from the input data without any
a priori structure
• Often makes fewer assumptions about the data, and thus can be
applicable in more scenarios
• Outlier detection using a histogram:
 The figure shows the histogram of purchase amounts in transactions
 A transaction in the amount of $7,500 is an outlier, since only 0.2% of
transactions have an amount higher than $5,000
 Problem: it is hard to choose an appropriate bin size for the histogram
 Too small a bin size → normal objects fall in empty or rare bins: false positives
 Too large a bin size → outliers fall in some frequent bins: false negatives
 Solution: adopt kernel density estimation to estimate the probability density
distribution of the data. If the estimated density at an object is high, the object
is likely normal; otherwise, it is likely an outlier.
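A sketch of histogram-based outlier flagging on made-up transaction amounts (the bin width and rarity threshold below are arbitrary illustrative choices):

```python
# Histogram-based (non-parametric) outlier scoring: bin the values and
# treat points that fall in very rare bins as suspicious.

def histogram_outliers(values, bin_width, min_fraction=0.01):
    counts = {}
    for v in values:
        b = int(v // bin_width)              # bin index for this value
        counts[b] = counts.get(b, 0) + 1
    n = len(values)
    # Flag values whose bin holds less than min_fraction of all observations.
    return [v for v in values if counts[int(v // bin_width)] / n < min_fraction]

amounts = [35, 40, 45, 50, 55, 60, 70, 80, 90, 100] * 50 + [7500]
flagged = histogram_outliers(amounts, bin_width=100, min_fraction=0.01)
print(flagged)  # [7500]
```

A too-small `bin_width` would scatter normal amounts into rare bins (false positives); a too-large one would absorb the outlier into a frequent bin (false negatives), mirroring the trade-off described above.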
Analysis Strategies
• Why do we have to have them?

– People who read our ‘research’ are


interested in the highlights
– Should try to communicate findings in
an understandable and ‘painless
fashion’
Three types of analysis
• Univariate analysis
– the examination of the distribution of cases on
only one variable at a time (e.g., college
graduation)
• Bivariate analysis
– the examination of two variables simultaneously
(e.g., the relation between gender and college
graduation)
• Multivariate analysis
– the examination of more than two variables
simultaneously (e.g., the relationship between
gender, race, and college graduation)
“Purpose”
• Univariate analysis
– Purpose: description
• Bivariate analysis
– Purpose: determining the empirical relationship between the two variables
• Multivariate analysis
– Purpose: determining the empirical relationship among the variables
Univariate Analysis
• Involves examination of the distribution of
cases on only ONE variable at a time
• Frequency distributions are listings of the
number of cases in each attribute of a variable
– Ungrouped frequency distribution
– Grouped frequency distribution

• Proportions express the number of cases of the criterion variable as part
of the total population: the frequency of the criterion variable divided by N
• Percentages are simply 100 × proportion
– Or [100 × (frequency of criterion variable divided by N)]
• Rates make comparisons more meaningful by controlling for
population differences
Measures of Central Tendency
• Measures of central tendency reflect the
central tendencies of a distribution
– Mode reflects the attribute with the greatest frequency
– Median reflects the attribute that cuts the distribution in half
– Mean reflects the average: the sum of attributes divided by the # of cases
Measures of Dispersion
• Measures of dispersion reflect the spread of the distribution
– Range is the difference between the largest & smallest scores: high − low
– Variance is the average of the squared differences between each
observation and the mean
– Standard deviation is the square root of the variance
Types of Variables
• Continuous: increase steadily in tiny fractions
• Discrete: jump from category to category


Subgroup Comparisons
• Somewhere between univariate & bivariate are subgroup comparisons
• Present descriptive univariate data for each of several subgroups
– Ratios: compare the number of cases in one category with the
number in another
Contingency Tables
• Format: attributes of independent variable are
used as column headings and attributes of the
dependent variable are used as row headings
• Guidelines for presenting & interpreting
contingency tables
– Contents of table described in title
– Attributes of each variable clearly described
– Base on which percentages are computed should be shown
– Norm is to percentage down & compare across
– Table should indicate # of cases omitted from analysis
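The "percentage down, compare across" norm can be sketched as follows (the counts are made up for illustration):

```python
# "Percentage down, compare across": convert each column of a contingency
# table to percentages of its column total, then compare row-by-row.
# Counts below are hypothetical.

table = {                       # rows = dependent variable (graduated?)
    "graduated":     {"women": 120, "men": 80},
    "not graduated": {"women": 80,  "men": 120},
}

cols = ["women", "men"]         # columns = independent variable
col_totals = {c: sum(table[r][c] for r in table) for c in cols}
percent = {r: {c: 100 * table[r][c] / col_totals[c] for c in cols} for r in table}

print(percent["graduated"])  # compare across: 60.0% of women vs 40.0% of men
```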
Bivariate Analysis
• Bivariate analysis focus on the
relationship between two variables
Bivariate Analysis Examples
Multivariate Analysis
• Multivariate analysis allows the separate and combined effects of the
independent variables to be examined
Multivariate Analysis
• Many statistical techniques focus on just one
or two variables
• Multivariate analysis (MVA) techniques allow
more than two variables to be analysed at
once
– Multiple regression is not typically included under
this heading, but can be thought of as a
multivariate analysis
Many Variables
• Commonly have many relevant variables in market
research surveys
– E.g. one not atypical survey had ~2000 variables
– Typically researchers pore over many crosstabs
– However it can be difficult to make sense of these, and the
crosstabs may be misleading
• MVA can help summarise the data
– E.g. factor analysis and segmentation based on agreement
ratings on 20 attitude statements
• MVA can also reduce the chance of obtaining
spurious results
Multivariate Analysis Methods
• Two general types of MVA technique
– Analysis of dependence
• Where one (or more) variables are dependent
variables, to be explained or predicted by others
– E.g. Multiple regression, PLS, MDA
– Analysis of interdependence
• No variables thought of as “dependent”
• Look at the relationships among variables, objects or
cases
– E.g. cluster analysis, factor analysis
Multivariate Analysis Example
• Example 1. A researcher has collected data on three psychological variables, four
academic variables (standardized test scores), and the type of educational program
the student is in for 600 high school students. She is interested in how the set of
psychological variables is related to the academic variables and the type of program
the student is in.
• Example 2. A doctor has collected data on cholesterol, blood pressure, and weight. 
She also collected data on the eating habits of the subjects (e.g., how many ounces
of red meat, fish, dairy products, and chocolate consumed per week).  She wants to
investigate the relationship between the three measures of health and eating habits.
• Example 3. A researcher is interested in determining what factors influence the
health of African Violet plants. She collects data on the average leaf diameter, the mass
of the root ball, and the average diameter of the blooms, as well as how long the
plant has been in its current container. For predictor variables, she measures several
elements in the soil, as well as the amount of light and water each plant receives.
• A study to determine the possible causes of a medical condition, such as heart disease. An
initial survey of non-diseased males is conducted and data are collected on age, body weight,
height, serum cholesterol, phospholipids, blood glucose, diet, and many other putative factors.
The history of these males is followed, and it is determined if and when they may be diagnosed
with heart disease.
• Determining the value of an apartment. Factors possibly related to the value are the size of the
apartment, the age of the building, the number of bedrooms, the number of bathrooms, and
location (e.g. floor, view, etc.).
• A medical study is conducted to determine the effects of air pollution on lung function.
Because you can’t assign people randomly to treatment groups (i.e. a rural environment with
no air quality concerns vs. a large city with air quality issues), a researcher chose four cohorts
that live in areas with very different air quality, each location close to an air-monitoring
device. The researchers took measures of lung function on each individual at two different
time periods by recording one breath. Data collected on this breath included the length of time
for the inhale and exhale, the speed and force of the exhale, and the amount of air exhaled after
one second and at the mid-point of the exhale. The air quality at these two times was also
recorded.
• Political surveys to determine which qualities in a candidate are most important in garnering
popularity
Case Study
• Fake News Detection
• Road Lane Line Detection
• Sentiment Analysis
• Detecting Parkinson’s Disease
• Color Detection [Fruit]
• Leaf Disease Detection
• Chatbot Project
Case Study 1: How can we improve client acquisition rate?
Case Study 2: How do I create a sales incentive model?
Case Study 3: How can I price more accurately?
• https://www.baselismail.com/data-science-guide-and-case-studies/
• https://data-flair.training/blogs/data-science-case-studies/
• https://data-flair.training/blogs/data-science-use-cases/
• https://eleuven.github.io/statthink/exercise-solutions.html#chapter-1
Graphic Presentation
• The Pie Chart
• The Bar Graph
• The Statistical Map
• The Histogram
• Statistics in Practice
• The Frequency Polygon
• Times Series Charts
• Distortions in Graphs
It is important to choose the appropriate graphs
to make statistical information coherent.
The Pie Chart: The Race and Ethnicity of the Elderly

• Pie chart: a graph showing the differences in frequencies or percentages among categories of a nominal or an ordinal variable. The categories are displayed as segments of a circle whose pieces add up to 100 percent of the total frequencies.
Too many categories can be messy!
Figure 3.1 Annual Estimates of U.S. Population 65 Years and Over by Race, 2003 (pie chart: White 87.7%, Black 8.3%; the American Indian, Asian, Pacific Islander, and two-or-more-races categories make up the remainder; N = 35,919,174)
We can reduce some of the categories
Figure 3.2 Annual Estimates of U.S. Population 65 Years and Over, 2003 (pie chart: White 87.7%, Black 8.3%, Other and two or more 4%; N = 35,919,174)
The Bar Graph: The Living Arrangements and Labor Force Participation of the Elderly

• Bar graph: a graph showing the differences in frequencies or percentages among categories of a nominal or an ordinal variable. The categories are displayed as rectangles of equal width with their height proportional to the frequency or percentage of the category.
Figure 3.3 Living Arrangements of Males (65 and Older) in the United States, 2000 (bar graph over the categories Living alone, Married, and Other; N = 13,886,000)
Can display more info by splitting sex
Figure 3.4 Living Arrangements of U.S. Elderly (65 and Older) by Gender, 2003 (grouped bars for males and females over the categories Living alone, Married, and Other)
Figure 3.5 Percent of Men and Women 55 Years and Over in the Civilian Labor Force, 2002:

Age group | Men  | Women
55 to 59  | 77.1 | 63.4
60 to 64  | 57.2 | 44.3
65+       | 18.0 |  9.8
The Statistical Map: The Geographic
Distribution of the Elderly
We can display dramatic geographical changes
in American society by using a statistical map.
Maps are especially useful for describing
geographical variations in variables, such as
population distribution, voting patterns,
crime rates, or labor force participation.
The Histogram
• Histogram: a graph showing the differences
in frequencies or percentages among
categories of an interval-ratio variable. The
categories are displayed as contiguous bars,
with width proportional to the width of the
category and height proportional to the
frequency or percentage of that category.
Figure 3.7 Age Distribution of U.S. Population 65 Years and Over, 2000 (histogram with contiguous bars over the 5-year age classes 65-69 through 95+)
The Frequency Polygon

• Frequency polygon: a graph showing the differences in frequencies or percentages among categories of an interval-ratio variable. Points representing the frequencies of each category are placed above the midpoint of the category and are joined by straight lines.
Source: Adapted from U.S. Bureau of the Census, Center for International Research, International Data Base, 2003.

Figure 3.11 Population of Japan, Age 55 and Over, 2000, 2010, and 2020 (frequency polygons for the three years over the age classes 55-59 through 80+)
Time Series Charts

• Time series chart: a graph displaying changes in a variable at different points in time. It shows time (measured in units such as years or months) on the horizontal axis and the frequencies (percentages or rates) of another variable on the vertical axis.
Source: Federal Interagency Forum on Aging Related Statistics, Older Americans 2004: Key Indicators of Well-Being, 2004.

Figure 3.12 Percentage of Total U.S. Population 65 Years and Over, 1900 to 2050 (time series chart with years on the horizontal axis and the percentage on the vertical axis)
Source: U.S. Bureau of the Census, “65+ in America,” Current Population Reports, 1996, Special Studies, P23-190, Table 6-1.

Figure 3.13 Percentage Currently Divorced Among U.S. Population 65 Years and Over, by Gender, 1960 to 2040 (separate time series for males and females)
Distortions in Graphs
Graphs not only inform us quickly; they can also deceive us quickly. Because we are often more interested in general impressions than in detailed analyses of the numbers, we are vulnerable to being swayed by distorted graphs.

– What are graphical distortions?
– How can we recognize them?
Shrinking and Stretching the Axes: Visual Confusion
Distortions with Picture Graphs
Statistics in Practice
The following graphs are particularly
suitable for making comparisons among
groups:

- Bar chart
- Frequency polygon
- Time series chart
Source: Smith, 2003.

Figure 3.17 Percentage of College Graduates among People 55 Years and Over by Age and Sex, 2002:

Age group | Men  | Women
55-64     | 31.1 | 21.5
65-74     | 23.3 | 14.7
85+       | 20.8 | 11.8
Source: Stoops, Nicole. 2004. “Educational Attainment in the United States: 2003.” Current Population Reports, P20-550. Washington D.C.: U.S. Government Printing Office.

Figure 3.18 Years of School Completed in the United States by Race and Age, 2003 (grouped bars by years of schooling — 0 to 8, 9 to 12, 13 to 16+ — for all races 15-64, all races 65+, black alone 15-64, and black alone 65+)
Why use charts and graphs?

What do you lose?
– the ability to examine the numeric detail offered by a table
– potentially the ability to see additional relationships within the data
– potentially time: often we get caught up in selecting colors and formatting charts when a simply formatted table is sufficient

What do you gain?
– the ability to direct readers’ attention to one aspect of the evidence
– the ability to reach readers who might otherwise be intimidated by the same data in a tabular format
– the ability to focus on the bigger picture rather than perhaps minor technical details
References
• This material was prepared by drawing on various textbooks and other internet sources.