Review of Basic Statistics
QBUS2810
A review of basic statistics
Example: Earthlink
The company (www.earthlink.net)
Exploratory analysis
What does this table tell us? What does it say about the
relationship between sessions and mailboxes?
Scatter plots
Relationships:
Example: Earthlink
Scatter plot
Test the null hypothesis that the mean number of sessions is the
same in the two groups, against the alternative that the means
differ (hypothesis testing, confidence intervals).
Compare small (mailboxes $< 2$) and large (mailboxes $\geq 2$) webmail
customers:
Step 1: estimation
Step 2: hypothesis testing
$$t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\frac{s_s^2}{n_s} + \frac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)}$$
etc.
Computing the test
|t| < 1.96, so do not reject (at the 5% significance level) the null
hypothesis that the two means are the same.
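The test can be sketched in a few lines of Python. This is a minimal illustration, assuming made-up session counts (not the Earthlink data) and a hypothetical helper name `two_sample_t`; it uses the unpooled standard error and the large-sample critical value 1.96.

```python
import math
from statistics import mean, variance

def two_sample_t(small, large):
    """Return (difference in means, standard error, t-statistic)."""
    diff = mean(small) - mean(large)
    # Unpooled standard error: sqrt(s_s^2 / n_s + s_l^2 / n_l)
    se = math.sqrt(variance(small) / len(small) + variance(large) / len(large))
    return diff, se, diff / se

# Hypothetical session counts, NOT the Earthlink data
small = [12, 15, 9, 20, 14, 11, 17, 13]
large = [16, 14, 19, 12, 18, 15, 21, 13]

diff, se, t = two_sample_t(small, large)
reject = abs(t) > 1.96  # 5% level, large-sample critical value
print(f"diff={diff:.3f}, SE={se:.3f}, t={t:.3f}, reject H0: {reject}")
```

With samples as small as these the 1.96 cut-off is only a rough guide; it is used here to mirror the rule stated above.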
Step 3: confidence interval
A 95% confidence interval for the difference between the means is,
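In general form, a large-sample 95% interval for a difference in means is the difference in sample means plus or minus 1.96 standard errors. A minimal sketch, assuming made-up session counts (not the Earthlink data) and a hypothetical helper name `diff_in_means_ci`:

```python
import math
from statistics import mean, variance

def diff_in_means_ci(small, large, z=1.96):
    """Large-sample confidence interval for mean(small) - mean(large)."""
    diff = mean(small) - mean(large)
    se = math.sqrt(variance(small) / len(small) + variance(large) / len(large))
    return diff - z * se, diff + z * se

# Hypothetical session counts, NOT the Earthlink data
small = [12, 15, 9, 20, 14, 11, 17, 13]
large = [16, 14, 19, 12, 18, 15, 21, 13]

lower, upper = diff_in_means_ci(small, large)
print(f"95% CI for the difference: [{lower:.3f}, {upper:.3f}]")
```

An interval that contains zero corresponds to not rejecting the null hypothesis of equal means at the 5% level.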
Proportions
$P(\text{low} \mid \text{small})$ vs $P(\text{low} \mid \text{large})$
Proportion of low sessions by mailbox size
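A conditional proportion such as P(low | small) can be estimated as a relative frequency within the group. The sketch below assumes a made-up list of records and a hypothetical helper `conditional_proportion`; how "low" sessions are defined is not specified here and is treated as given.

```python
from collections import Counter

# Hypothetical (mailbox group, session level) records, NOT the Earthlink data
customers = [
    ("small", "low"), ("small", "low"), ("small", "high"), ("small", "low"),
    ("large", "high"), ("large", "low"), ("large", "high"), ("large", "high"),
]

def conditional_proportion(records, group, level="low"):
    """Estimate P(level | group) as a relative frequency within the group."""
    in_group = [lvl for grp, lvl in records if grp == group]
    return Counter(in_group)[level] / len(in_group)

p_low_small = conditional_proportion(customers, "small")
p_low_large = conditional_proportion(customers, "large")
print(f"P(low|small)={p_low_small:.2f}, P(low|large)={p_low_large:.2f}")
```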
What comes next
Review of statistics
Data types
Probability
Estimation
Testing
Confidence Intervals
Definitions and concepts
Population
The group or collection of all possible entities of interest. We will
abstractly think of populations as infinitely large.
Variable
A quantity of interest that varies and can be measured: e.g.
categories, numerical values, counts etc.
Sample
A subset of the population available for analysis.
Parameter
An unknown non-random quantity of interest regarding the
population.
Estimation
Using sample data to approximate the value of a parameter.
Inference
Using probability to quantify the uncertainty of an estimate
obtained from a sample, and drawing probabilistic conclusions
about a parameter based on that sample.
Measurement
Categorical data
Unordered categories: nominal data.
Ordered categories: ordinal data.
Numerical data
Interval data: ordered numerical data; differences are meaningful
but there is no true zero.
Ratio data: numerical data with a true zero, e.g. continuous
numbers, discrete counts.
Probability distribution
Discrete RVs and probability
Discrete or categorical RVs and probability
Example: Y is the number of times your PC crashes while
completing your assignment task.
Discrete RVs and probability
$$P(Y = y_i) = p_i$$
for $i = 1, 2, \ldots, m$, where
$$\sum_{i=1}^{m} P(Y = y_i) = \sum_{i=1}^{m} p_i = 1$$
Discrete RVs
Mean and variance
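For a discrete random variable with values $y_i$ and probabilities $p_i$, the standard definitions are $E(Y) = \sum_i y_i p_i$ and $\mathrm{Var}(Y) = \sum_i (y_i - E(Y))^2 p_i$. A minimal sketch, assuming a made-up distribution for the number of PC crashes:

```python
import math

# Hypothetical pmf for Y = number of PC crashes (value -> probability)
pmf = {0: 0.80, 1: 0.10, 2: 0.06, 3: 0.03, 4: 0.01}

# A valid pmf must sum to one
assert math.isclose(sum(pmf.values()), 1.0)

mean_y = sum(y * p for y, p in pmf.items())
var_y = sum((y - mean_y) ** 2 * p for y, p in pmf.items())
print(f"E(Y)={mean_y:.4f}, Var(Y)={var_y:.4f}")
```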
Median and mode
Discrete or categorical RVs
Example: Internet access percentage across countries (2008)
Continuous RVs and probability
Continuous RVs
Mean and variance
Median and mode
And the mode of $Y$ is the value $a$ such that $p(y) \leq p(a)$ for all
possible values $y$ of $Y$ (the value that maximises the pdf).
The normal (Gaussian) distribution
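The normal density with mean $\mu$ and variance $\sigma^2$ is $f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$. As a small illustration, Python's standard-library `statistics.NormalDist` evaluates the density, cumulative probabilities and quantiles; the standard normal is used here as an example choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

density_at_zero = std_normal.pdf(0.0)                             # height of the pdf at the mean
prob_within_196 = std_normal.cdf(1.96) - std_normal.cdf(-1.96)    # P(-1.96 < Z < 1.96)
z_975 = std_normal.inv_cdf(0.975)                                 # 97.5% quantile

print(f"pdf(0)={density_at_zero:.4f}, P(|Z|<1.96)={prob_within_196:.4f}, z_0.975={z_975:.4f}")
```

The last two quantities are where the 1.96 critical value for 5% two-sided tests and 95% intervals comes from.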
Measurement
Graphing and summary statistics for ratio, interval variables
Continuous:
Discrete:
Graphing and summary statistics for category variables
Categorical RVs
Parameters
Earthlink: Mailboxes
Earthlink: Customer churn (60 days)
Good graphing principles
What's wrong with these plots?
Are these graphs better? Why?
Comments?
Statistical tools
1 Summary statistics.
2 Graph.
3 Estimation.
4 Testing.
Sampling distributions
$y_1, y_2, y_3, \ldots, y_n$
from a population $Y$.
$y_1, y_2, y_3, \ldots, y_n$
Sampling distribution of the sample mean
Estimation
$\bar{Y}$ is a random variable.
Things we want to know
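If $Y_i$ represent i.i.d. samples (from any distribution), then across
many such samples:
$$E(\bar{Y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu$$
$$\begin{aligned}
\mathrm{Var}(\bar{Y}) &= \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) \\
&= \frac{1}{n^2}\left[\sum_{i=1}^{n} \mathrm{Var}(Y_i) + 2\sum_{i=1}^{n}\sum_{j<i} \mathrm{Cov}(Y_i, Y_j)\right] \\
&= \frac{1}{n^2}\left(n\sigma^2 + 0\right) = \frac{\sigma^2}{n}
\end{aligned}$$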
Mean and variance of the sampling distribution of $\bar{Y}$
$$E(\bar{Y}) = \mu$$
$$\mathrm{Var}(\bar{Y}) = \frac{\sigma^2}{n}$$
Implications:
$\bar{Y}$ is an unbiased estimator of $\mu$.
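These two results can be checked by simulation. The sketch below assumes a normal population with arbitrarily chosen mean 5 and standard deviation 2, draws many samples of size 25, and compares the average and variance of the sample means with $\mu$ and $\sigma^2/n$.

```python
import random

random.seed(0)
mu, sigma, n, replications = 5.0, 2.0, 25, 10000  # illustrative values

# One sample mean per replication
sample_means = [
    sum(random.gauss(mu, sigma) for _ in range(n)) / n
    for _ in range(replications)
]

avg_of_means = sum(sample_means) / replications
var_of_means = sum((m - avg_of_means) ** 2 for m in sample_means) / replications

print(f"average of sample means: {avg_of_means:.3f} (theory: {mu})")
print(f"variance of sample means: {var_of_means:.4f} (theory: {sigma**2 / n})")
```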
The Central Limit Theorem (CLT)
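$$\bar{Y} \overset{\text{approx.}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right)$$
That is, for a standardised $\bar{Y}$,
$$\frac{\bar{Y} - E(\bar{Y})}{\sqrt{\mathrm{Var}(\bar{Y})}} = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \overset{\text{approx.}}{\sim} N(0, 1).$$
The larger $n$, the better the approximation is.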
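A quick simulation illustrates the approximation for a clearly non-normal population. This sketch assumes an Exponential(1) population (mean 1, standard deviation 1) and $n = 50$, both chosen only for illustration, and checks how often the standardised sample mean falls within $\pm 1.96$.

```python
import math
import random

random.seed(42)
n, replications = 50, 10000
mu, sigma = 1.0, 1.0  # mean and standard deviation of an Exponential(1) variable

inside = 0
for _ in range(replications):
    y_bar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z = (y_bar - mu) / (sigma / math.sqrt(n))  # standardised sample mean
    if abs(z) < 1.96:
        inside += 1

share_inside = inside / replications
print(f"share of |z| < 1.96: {share_inside:.3f} (N(0, 1) gives about 0.950)")
```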
Summary
Why use $\bar{Y}$ to estimate $\mu$?
1 $\bar{Y}$ is unbiased: $E(\bar{Y}) = \mu$.
2 $\bar{Y}$ is consistent: $\bar{Y} \xrightarrow{p} \mu$.
3 $\bar{Y}$ is the least squares estimator of $\mu$, i.e. it solves
$\min_{m} \sum_{i=1}^{n} (Y_i - m)^2$.
Why use $\bar{Y}$ to estimate $\mu$?
$$\frac{d}{dm} \sum_{i=1}^{n} (Y_i - m)^2 = \sum_{i=1}^{n} \frac{d}{dm} (Y_i - m)^2 = -2 \sum_{i=1}^{n} (Y_i - m)$$
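Setting this derivative to zero gives $\sum_i (Y_i - m) = 0$, so the minimiser is the sample mean. A small numerical check, assuming made-up observations and hypothetical helper names:

```python
def sum_of_squares(ys, m):
    """Least squares criterion: sum_i (y_i - m)^2."""
    return sum((y - m) ** 2 for y in ys)

def derivative(ys, m):
    """d/dm of the criterion, i.e. -2 * sum_i (y_i - m)."""
    return -2 * sum(y - m for y in ys)

ys = [2.0, 4.0, 4.0, 7.0, 8.0]  # hypothetical observations
y_bar = sum(ys) / len(ys)

print("derivative at the sample mean:", derivative(ys, y_bar))
for m in (y_bar - 0.1, y_bar, y_bar + 0.1):
    print(f"m={m:.1f}: sum of squares = {sum_of_squares(ys, m):.2f}")
```

The criterion is smallest at the sample mean and increases on either side of it.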
Why use $\bar{Y}$ to estimate $\mu$?
Bias, consistency, and efficiency
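Let $\hat{\mu}$ be an estimator of $\mu$.
The bias of $\hat{\mu}$ is $E(\hat{\mu}) - \mu$.
$\hat{\mu}$ is an unbiased estimator of $\mu$ if $E(\hat{\mu}) - \mu = 0$.
$\hat{\mu}$ is a consistent estimator of $\mu$ if $\hat{\mu} \xrightarrow{p} \mu$.
Let $\tilde{\mu}$ be another estimator of $\mu$.
$\hat{\mu}$ is more efficient than $\tilde{\mu}$ if $\mathrm{Var}(\hat{\mu}) < \mathrm{Var}(\tilde{\mu})$.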
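Bias and efficiency can be compared by simulation. The sketch below is an illustration under assumed settings: a standard normal population, samples of size 25, and the sample mean and sample median as two competing estimators of the population mean.

```python
import random
from statistics import median

random.seed(1)
mu, sigma, n, replications = 0.0, 1.0, 25, 5000  # illustrative values

means, medians = [], []
for _ in range(replications):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)
    medians.append(median(sample))

def average(xs):
    return sum(xs) / len(xs)

def spread(xs):
    """Variance of the estimates across replications."""
    centre = average(xs)
    return sum((x - centre) ** 2 for x in xs) / len(xs)

bias_mean, bias_median = average(means) - mu, average(medians) - mu
var_mean, var_median = spread(means), spread(medians)
print(f"bias: mean {bias_mean:.4f}, median {bias_median:.4f}")
print(f"variance: mean {var_mean:.4f}, median {var_median:.4f}")
```

Under these assumptions both estimators show bias close to zero, while the sample mean has the smaller variance, so it is the more efficient of the two.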
Confidence intervals
$$\bar{Y} \pm z_{(1-\alpha/2)} \frac{S_Y}{\sqrt{n}}.$$
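The interval can be computed directly from a sample. A minimal sketch, assuming made-up data and a hypothetical helper `mean_confidence_interval`; the normal quantile comes from the standard-library `statistics.NormalDist`.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(ys, alpha=0.05):
    """Large-sample (1 - alpha) confidence interval for the population mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)          # e.g. about 1.96 for alpha = 0.05
    half_width = z * stdev(ys) / math.sqrt(len(ys))  # z * S_Y / sqrt(n)
    centre = mean(ys)
    return centre - half_width, centre + half_width

ys = [4.1, 5.3, 6.0, 4.8, 5.5, 5.1, 4.7, 5.9, 5.2, 4.9]  # hypothetical sample
lower, upper = mean_confidence_interval(ys)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Lowering `alpha` (for example to 0.01) widens the interval, since a larger quantile multiplies the same standard error.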
Confidence intervals: a quiz
Confidence intervals
A tip
In classical statistical inference, all the probabilistic statements
that we make are about samples and sample estimators.
Hypothesis testing
Example:
Parametric location - t test
P-values
$$\text{p-val} = P(t_{212} < -9.3) + P(t_{212} > 9.3) = 2\,P(t_{212} > 9.3) \approx 0.$$
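A two-sided p-value of this kind can be sketched with the standard library alone. Note the assumption: the code uses $N(0, 1)$ as a large-sample stand-in for the $t_{212}$ distribution, so it is an approximation rather than the exact calculation; `two_sided_p_value` is a hypothetical helper name.

```python
import math

def two_sided_p_value(t_stat):
    """Two-sided p-value, using N(0, 1) as an approximation to the t distribution.

    Uses the identity 2 * Phi(-|t|) = erfc(|t| / sqrt(2)), which stays
    accurate far out in the tails.
    """
    return math.erfc(abs(t_stat) / math.sqrt(2))

p_large = two_sided_p_value(9.3)    # statistic from the example above
p_border = two_sided_p_value(1.96)  # a borderline case for comparison
print(f"p-value for |t| = 9.3: {p_large:.3g}")
print(f"p-value for |t| = 1.96: {p_border:.4f}")
```

The exact t-based value differs in the far tail, but for a statistic this large the p-value is negligible either way.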
One-sided p-values
A common pitfall
Note that the significance level is pre-specified, technically before
you see any data. You may sometimes read statements such as
"the test statistic is almost significant", which are at odds with the
underlying principles of hypothesis testing. Either the result is
statistically significant given the pre-specified level, or not. End of
story.
Student's t distribution
$$\frac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$
Properties
Symmetric.
More widely dispersed than N (0, 1). More area in tails and
less in the centre than the normal distribution.
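These properties can be seen by evaluating the two densities side by side. The sketch below implements the standard t density formula with `math.lgamma` (the helper `t_pdf` and the choice of 5 degrees of freedom are illustrative assumptions) and compares it with the standard normal density at the centre and in the tail.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

normal = NormalDist()
for x in (0.0, 3.0):
    print(f"x={x}: t(5) density {t_pdf(x, 5):.4f}, N(0,1) density {normal.pdf(x):.4f}")
```

At the centre the t density is lower than the normal density, while in the tail it is higher; symmetry follows because the formula depends on $x$ only through $x^2$.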
Confidence interval
Computing the p-value with an estimated $\sigma^2$
$$\text{p-value} = P_{H_0}\left(|t| > \left|\frac{\bar{Y}^{obs} - \mu_0}{s/\sqrt{n}}\right|\right) = P_{H_0}(|t| > |t\text{-stat}|)$$
Summary
If $Y$ is distributed $N(\mu, \sigma^2)$, then
$\frac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1}$ is an exact
result when $\sigma^2$ is unknown.
For $n > 30$, the t-distribution and $N(0, 1)$ are very close and
there is no practical difference. As $n$ grows large, the $t_{n-1}$
distribution $\to N(0, 1)$.
Hypothesis testing
What is the link between the p-value and the significance level?
Student's t distribution
But...
A skewed distribution
Summary
From the two assumptions of:
1 Simple random sampling of a population, that is, $Y_i$ for
$i = 1, \ldots, n$ are i.i.d.
2 $0 < E(Y^4) < \infty$.