
Statistical Modelling for Business

QBUS2810

Module 0: Review of Basic Statistics

Marcel Scharth and Richard Gerlach

Discipline of Business Analytics, The University of Sydney Business School

1 / 86
A review of basic statistics

Let's work through an example using a data set from Earthlink, a large internet provider that uses loyalty marketing to reduce churn.

Data: dueu-100kurs-sets.csv, 100,000 observations.

Churn: the percentage of customers who stop choosing your company in a fixed time period, e.g. 90 days or 1 year.

2 / 86
Example: Earthlink

Goal: predicting the number of online sessions per customer from their number of unique mailboxes.

Empirical question: What is the effect on the number of sessions of a user adding one extra mailbox? Adding 5 extra mailboxes?

Is there any way to answer this without data? Which variable would Earthlink like to predict? Why?

3 / 86
Example: Earthlink
The company (www.earthlink.net)

Number of sessions (no. of sessions) per month.

Number of WebMail boxes (no. of mailboxes) for each customer.

Number of customers: 100,000.

4 / 86
Example: Earthlink
Exploratory analysis

What does this table tell us? What does it say about the
relationship between sessions and mailboxes?

5 / 86
Example: Earthlink
Exploratory analysis

Do these plots say anything?

6 / 86
Example: Earthlink
Exploratory analysis

7 / 86
Scatter plots

Scatterplots are an important graphical tool for exploring relationships between numerical variables.

Relationships:

If Y tends to increase with X: positively related.

If Y tends to decrease as X increases: negatively related.

Here we mean typically increase or decrease, e.g. on average.

8 / 86
Example: Earthlink
Scatter plot

Does this figure suggest a relationship? Why are mailboxes on the horizontal axis?
9 / 86
Example: Earthlink
How to quantify the effect of mailboxes on sessions?

We can start with simple approaches. Consider customers with a small no. of mailboxes and those with a large no. of mailboxes (estimation/comparison).

Test the null hypothesis that the mean no. of sessions is the same in the two groups, against the alternative that they differ (hypothesis testing, confidence intervals).

Assess whether the proportion of high no. of sessions differs between customers with small and large numbers of mailboxes (HT, CIs for proportions).

10 / 86
Example: Earthlink
Compare small (mailboxes < 2) and large (mailboxes ≥ 2) webmail customers:

11 / 86
Example: Earthlink

Compare small (mailboxes < 2) and large (mailboxes ≥ 2) webmail customers:

1. Estimation of the difference between means.
2. Test the hypothesis that the mean difference = 0.
3. Construct a confidence interval for the mean difference.

12 / 86
Example: Earthlink
Step 1: estimation

$\bar{Y}_{\text{small}} - \bar{Y}_{\text{large}} = 22.15 - 23.17 = -1.02$

Is this a large difference statistically?

Is this a large difference in a practical (i.e. profit) sense?

Are averages the right measure here? $E(\text{no. sessions} \mid \text{small})$ vs $E(\text{no. sessions} \mid \text{large})$.

What about medians? They are actually 0 for both groups!

13 / 86
Example: Earthlink
Step 2: hypothesis testing

Difference in means test. Compute the t-statistic:

$$t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\dfrac{s_s^2}{n_s} + \dfrac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)}$$

where $SE(\bar{Y}_s - \bar{Y}_l)$ is the standard error of $\bar{Y}_s - \bar{Y}_l$, the subscripts $s$ and $l$ refer to small and large numbers of mailboxes, and

$$s_s^2 = \frac{1}{n_s - 1} \sum_{i=1}^{n_s} (Y_{s,i} - \bar{Y}_s)^2,$$

etc.

14 / 86
Example: Earthlink
Computing the test

$$t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\dfrac{s_s^2}{n_s} + \dfrac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)} = \frac{22.15 - 23.17}{0.64} = \frac{-1.02}{0.64} = -1.60$$

|t| < 1.96, so do not reject (at the 5% significance level) the null
hypothesis that the two means are the same.
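A minimal Python sketch of this calculation, using the summary numbers from the slide. The commented `ttest_ind` call shows how the same (unequal-variance, Welch-form) test would run on the raw observations, which are not reproduced here:

```python
from scipy import stats

# Summary numbers from the slides
ybar_s, ybar_l = 22.15, 23.17   # sample means: small and large mailbox groups
se_diff = 0.64                  # standard error of (ybar_s - ybar_l)

t_stat = (ybar_s - ybar_l) / se_diff
print(t_stat)                   # approx -1.60; |t| < 1.96, so do not reject

# With the raw observations y_s, y_l the same test is one call:
# t_stat, p_val = stats.ttest_ind(y_s, y_l, equal_var=False)
```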
15 / 86
Example: Earthlink
Step 3: confidence interval

A 95% confidence interval for the difference between the means is

$$(\bar{Y}_s - \bar{Y}_l) \pm 1.96\, SE(\bar{Y}_s - \bar{Y}_l) = -1.02 \pm 1.96 \times 0.64 = (-2.27,\ 0.23).$$

Two equivalent statements:

The 95% confidence interval for the difference includes 0;

The hypothesis that the difference = 0 is not rejected at the 5% level.
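The same interval in Python (a sketch using the slide's numbers; `norm.ppf` supplies the 1.96 critical value):

```python
from scipy import stats

diff, se = -1.02, 0.64
z = stats.norm.ppf(0.975)            # approx 1.96
ci = (diff - z * se, diff + z * se)
print(ci)                            # approx (-2.27, 0.23): includes 0
```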

16 / 86
Example: Earthlink
Proportions

Compare small (boxes < 2) and large (boxes ≥ 2) customers with low sessions (< 5) and high sessions (≥ 5):

$P(\text{low} \mid \text{small})$ vs $P(\text{low} \mid \text{large})$

17 / 86
Example: Earthlink
Proportion of low sessions by mailbox size

Test the null hypothesis that $P(\text{low} \mid \text{small}) = P(\text{low} \mid \text{large})$. What is the best way to measure and assess this relationship?
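One standard answer is a two-proportion z-test. A hedged sketch follows; the counts below are hypothetical, since the slide does not report them:

```python
import numpy as np

# Hypothetical counts: x = low-session customers, n = group size
x_small, n_small = 30_000, 60_000
x_large, n_large = 18_000, 40_000

p_small, p_large = x_small / n_small, x_large / n_large
p_pool = (x_small + x_large) / (n_small + n_large)   # pooled proportion under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_small + 1 / n_large))

z = (p_small - p_large) / se      # compare |z| with 1.96 at the 5% level
print(z)
```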

18 / 86
What comes next

Mechanics of estimation, hypothesis testing, and confidence intervals for assessing and testing relationships. First, we will review some basics:

Data types, graphing, summary stats.

What assumptions do these procedures rely on? What alternatives exist in case these don't hold?

Foundations of probability and statistics.

19 / 86
Review of statistics

Data types
Probability
Estimation
Testing
Confidence Intervals

Readings: Fox Chapters 1, 2, 3; Berenson et al. Chapters 1–11 (better). NB: this is your BUSS1020 text.

20 / 86
Definitions and concepts

Population
The group or collection of all possible entities of interest. We will
abstractly think of populations as infinitely large.

Variable
A quantity of interest that varies and can be measured: e.g.
categories, numerical values, counts etc.

Random variable (RV)
A variable whose values appear (to us) random (i.e. not able to be perfectly predicted).

21 / 86
Definitions and concepts

Sample
A subset of the population available for analysis.

Exploratory data analysis
Graphs and summary measures regarding a sample of data.

Parameter
An unknown non-random quantity of interest regarding the
population.

22 / 86
Definitions and concepts

Estimation
Using sample data to approximate the value of a parameter.

Inference
Employing statistical methods to estimate uncertainty in
estimation from a sample, using probability. Making statistical,
probabilistic conclusions about a parameter based on a sample.

23 / 86
Measurement

Categorical data
Unordered categories: nominal data.
Ordered categories: ordinal data.

Numerical data
Interval data: ordered, numerical data, differences meaningful
but no TRUE zero.
Ratio data: continuous numbers, discrete counts.

24 / 86
Probability distribution

What exactly is a probability?

The probabilities for each possible value of Y in the population, e.g. $Pr(Y = \text{yes})$ (Y is discrete or categorical).

Or ranges of values of Y, e.g. $Pr(40 \le Y \le 60)$ (Y is continuous).

25 / 86
Discrete RVs and probability

Let Y be a discrete RV with m possible values. The probability distribution for Y is

$$P(Y = y_i) = p_i$$

for $i = 1, 2, \ldots, m$, where

$$\sum_{i=1}^{m} P(Y = y_i) = \sum_{i=1}^{m} p_i = 1$$

and $0 \le p_i \le 1$ for all $i$.

26 / 86
Discrete or categorical RVs and probability
Example: Y is the number of times your PC crashes while
completing your assignment task.

27 / 86
Discrete RVs and probability

Let Y be a discrete random variable with m possible values. The probability distribution for Y is

$$P(Y = y_i) = p_i$$

for $i = 1, 2, \ldots, m$, where

$$\sum_{i=1}^{m} P(Y = y_i) = \sum_{i=1}^{m} p_i = 1$$

and $0 \le p_i \le 1$ for all $i$.

28 / 86
Discrete RVs
Mean and variance

Let Y be a discrete RV with m possible numerical values. The expected value of Y is defined as:

$$\mu = E(Y) = \sum_{i=1}^{m} p_i y_i.$$

The variance of Y is:

$$\sigma^2 = \mathrm{Var}(Y) = E(Y - \mu)^2 = \sum_{i=1}^{m} p_i (y_i - \mu)^2.$$
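A small numerical illustration of these definitions (the pmf below is made up, for a variable like the PC-crash count in the earlier example):

```python
import numpy as np

y = np.array([0, 1, 2, 3, 4])                  # possible values of Y
p = np.array([0.80, 0.10, 0.06, 0.03, 0.01])   # hypothetical probabilities
assert np.isclose(p.sum(), 1.0)

mu = np.sum(p * y)                  # E(Y) = sum of p_i * y_i
var = np.sum(p * (y - mu) ** 2)     # Var(Y) = sum of p_i * (y_i - mu)^2
print(mu, var)
```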

29 / 86
Discrete RVs
Median and mode

Let Y be a discrete RV with m possible numerical values and, without loss of generality, let the $y_i$ be ordered for $i = 1, \ldots, m$.

The median of Y is the value $y_i$ such that $P(Y \le y_{i-1}) \le 0.5$ and $P(Y \ge y_{i+1}) \le 0.5$.

And the mode of Y is the value $y_i$ such that $P(Y = y_i) = \max_j P(Y = y_j)$ over $j = 1, \ldots, m$.

30 / 86
Discrete or categorical RVs
Example: Internet access percentage across countries (2008)

31 / 86
Continuous RVs and probability

Let Y be a continuous random variable. The cumulative distribution function (CDF) is defined as

$$P(Y \le a) = \int_{-\infty}^{a} p(y)\,dy,$$

where $p(y)$ is the probability density function (pdf), $p(y) \ge 0$, and

$$\int_{-\infty}^{\infty} p(y)\,dy = 1.$$

We have that $P(Y = y) = 0$ for any particular value $y$ (why?).

32 / 86
Continuous RVs

33 / 86
Continuous RVs
Mean and variance

Let Y be a continuous RV. The expected value of Y is defined as

$$\mu = E(Y) = \int_{-\infty}^{\infty} y\, p(y)\,dy.$$

The variance of Y is

$$\sigma^2 = \mathrm{Var}(Y) = E(Y - \mu)^2 = \int_{-\infty}^{\infty} (y - \mu)^2\, p(y)\,dy.$$
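These integrals can be checked numerically. A sketch using an exponential pdf with mean 2 (so the true variance is 4):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

pdf = stats.expon(scale=2).pdf     # exponential density: E(Y) = 2, Var(Y) = 4

mu, _ = quad(lambda y: y * pdf(y), 0, np.inf)               # integral of y p(y)
var, _ = quad(lambda y: (y - mu) ** 2 * pdf(y), 0, np.inf)  # integral of (y - mu)^2 p(y)
print(mu, var)                     # approx 2.0 and 4.0
```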

34 / 86
Continuous RVs
Median and mode

Let Y be a continuous RV.

The median $\nu$ of Y is defined as the value such that $P(Y < \nu) = \int_{-\infty}^{\nu} p(y)\,dy = 0.5$.

And the mode of Y is the value $a$ such that $p(y) \le p(a)$ for all possible values of Y (the value that maximises the pdf).

35 / 86
The normal (Gaussian) distribution

36 / 86
Measurement
Graphing and summary statistics for ratio, interval variables

Continuous:

Graphs: histogram, boxplot, dotplot, scatterplot.


Location: mean, median, percentiles, mode.
Spread: std. deviation, range, inter-quartile range.
Shape: skewness, kurtosis.

Discrete:

Small range: bar chart.


Large range: as for ratio data.
Location: mode, median, mean.
Spread: range, std deviation.

37 / 86
Measurement
Graphing and summary statistics for category variables

Ordinal (ordered categories):

Graphs: Bar chart, Pie chart.


Location: Mode, % in each category, median.
Spread: Range, IQR.

Nominal (unordered categories):

Graphs: Pareto chart, Bar chart, Pie chart.


Location: Mode, % in each category.

38 / 86
Categorical RVs
Parameters

Let Y be a categorical random variable with m possible values. The expected value and variance of Y are undefined. The mode of Y is the category $i$ such that $p_i = \max\{p_1, p_2, \ldots, p_m\}$.

39 / 86
Earthlink: Mailboxes

40 / 86
Earthlink: Customer churn (60 days)

41 / 86
Good graphing principles

Highlight the message, minimise the noise.
Include 0 on the vertical axis for proper comparison of heights.
Show horizontal spacings that reflect reality (e.g. time).
3D effects usually induce noise and don't help the message.
Properly label axes.

42 / 86
Whats wrong with these plots?

43 / 86
Are these graphs better? Why?

44 / 86
Comments?

45 / 86
Comments?

46 / 86
Statistical tools

How do we know which to use?

1 Summary statistics.
2 Graph.
3 Estimation.
4 Testing.

47 / 86
Sampling distributions

We have a sample of data

y1 , y2 , y3 , . . . , yn

from a population Y .

We estimate the parameters of interest with statistics; for example, the sample average

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

estimates the population mean $\mu$.

How much could the estimate change if a different sample had been taken?
48 / 86
Sampling distribution of the sample mean

Assume a simple random sample (SRS)

y1 , y2 , y3 , . . . , yn

is taken from the population Y .

What is the (sampling) distribution of the sample mean? Or: how much could the sample mean tell us about the true mean $\mu$?

49 / 86
Sampling distribution of the sample mean

Because each observation is selected at random, $Y_i$ has no information about $Y_j$. Thus:

Yi and Yj are independently distributed.


Yi and Yj come from the same distribution, that is, Yi and Yj
are identically distributed.
That is, Yi and Yj are i.i.d.
More generally, under SRS, Yi , i = 1, . . . , n are i.i.d.

SRS allows inferential statements to be made about the population, using only a sample of data from it.

50 / 86
Sampling distribution of the sample mean
Estimation

$\bar{Y}$ is the natural sample estimator of the population mean. But:

What are the (sampling) properties of $\bar{Y}$?

Why use $\bar{Y}$ rather than some other estimator? E.g. $y_1$ (the first observation); or a weighted average of sample points ($\sum_{i=1}^{n} w_i y_i$ where $\sum_{i=1}^{n} w_i = 1$); or the median.

51 / 86
Sampling distribution of the sample mean

$\bar{Y}$ is a random variable.

If the sample is drawn at random, then the observed $\bar{Y}$ is also random.
The distribution of possible values of $\bar{Y}$ over different possible samples of size n is called the sampling distribution of $\bar{Y}$.
The mean and variance of $\bar{Y}$ are the mean and variance of its sampling distribution, $E(\bar{Y})$ and $\mathrm{Var}(\bar{Y})$.

The concept of the sampling distribution underpins all of statistical inference.

52 / 86
Sampling distribution of the sample mean
Things we want to know

Is the mean of $\bar{Y}$ the true population mean, $E(\bar{Y}) = \mu$? I.e., is $\bar{Y}$ an unbiased estimator of $\mu$?

What is the variance of $\bar{Y}$? How does $\mathrm{Var}(\bar{Y})$ depend on n?

Does $\bar{Y}$ become closer and closer to $\mu$ when n is large? I.e., is $\bar{Y}$ a consistent estimator of $\mu$?

Does $\bar{Y}$ appear Gaussian for large n? Is this generally true? The Central Limit Theorem (CLT) suggests $\bar{Y}$ is approximately normally distributed for n large.

53 / 86
Sampling distribution of the sample mean
If the $Y_i$ are i.i.d. samples (from any distribution), then across many such samples:

$$E(\bar{Y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu$$

$$\mathrm{Var}(\bar{Y}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\left(\sum_{i=1}^{n} \mathrm{Var}(Y_i) + 2\sum_{i=1}^{n}\sum_{j<i} \mathrm{Cov}(Y_i, Y_j)\right) = \frac{1}{n^2}\left(n\sigma^2 + 0\right) = \frac{\sigma^2}{n}$$
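A quick Monte Carlo sketch of these two results (simulated normal data here; any distribution with finite variance would do):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 25, 100_000

# Many independent samples of size n; one sample mean per row
ybars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(ybars.mean())                 # approx mu = 5.0 (unbiasedness)
print(ybars.var(), sigma**2 / n)    # both approx 0.16 (variance = sigma^2 / n)
```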

54 / 86
Sampling distribution of the sample mean
Mean and variance of the sampling distribution of $\bar{Y}$

$$E(\bar{Y}) = \mu, \qquad \mathrm{Var}(\bar{Y}) = \frac{\sigma^2}{n}$$

Implications:

$\bar{Y}$ is an unbiased estimator of $\mu$.

$\mathrm{Var}(\bar{Y})$ is inversely proportional to n. The standard deviation of the sampling distribution, or sampling uncertainty, is proportional to $1/\sqrt{n}$.

Thus the sampling uncertainty associated with $\bar{Y}$ goes to zero as n increases!
55 / 86
Sampling distribution of the sample mean

For small n, the distribution of $\bar{Y}$ is complicated: it depends on the distribution of Y.

As n increases, the distribution of $\bar{Y}$ becomes more tightly centred around $E(Y)$ (the Law of Large Numbers).

And the distribution of $\bar{Y}$ becomes Gaussian (the Central Limit Theorem).

56 / 86
Sampling distribution of the sample mean
The Central Limit Theorem (CLT)

If $Y_1, \ldots, Y_n$ are i.i.d. and $\mu, \sigma^2 < \infty$, then for n large the distribution of $\bar{Y}$ is well approximated by a Gaussian:

$$\bar{Y} \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$

That is, for a standardised $\bar{Y}$,

$$\frac{\bar{Y} - E(\bar{Y})}{\sqrt{\mathrm{Var}(\bar{Y})}} = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1).$$

The larger n, the better the approximation is.
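A simulation sketch of the CLT in action, starting from a strongly skewed population (exponential with $\mu = \sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 100_000

# Sample means of n exponential draws, standardised as in the formula above
ybars = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (ybars - 1.0) / (1.0 / np.sqrt(n))

# Under a good N(0,1) approximation, about 95% of z lies inside (-1.96, 1.96)
print(np.mean(np.abs(z) < 1.96))    # close to 0.95
```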

57 / 86
Sampling distribution of the sample mean
Summary

For $Y_1, \ldots, Y_n$ i.i.d. and $\mu, \sigma^2 < \infty$:

The exact (finite sample) sampling distribution of $\bar{Y}$ has mean $\mu$ and variance $\sigma^2/n$.

The distribution of $\bar{Y}$ is complicated and depends on the distribution of Y.

When n is large, the sampling distribution simplifies: $\bar{Y} \xrightarrow{p} \mu$ (Law of Large Numbers), and $\dfrac{\bar{Y} - E(\bar{Y})}{\sqrt{\mathrm{Var}(\bar{Y})}}$ is approximately $N(0, 1)$ (CLT).

58 / 86
Sampling distribution of the sample mean
Why use $\bar{Y}$ to estimate $\mu$?

1. $\bar{Y}$ is unbiased: $E(\bar{Y}) = \mu$.

2. $\bar{Y}$ is consistent: $\bar{Y} \xrightarrow{p} \mu$.

3. $\bar{Y}$ is the least squares estimator of $\mu$; i.e., it solves $\min_m \sum_{i=1}^{n} (Y_i - m)^2$.

59 / 86
Sampling distribution of the sample mean
Why use $\bar{Y}$ to estimate $\mu$?

$$\frac{d}{dm} \sum_{i=1}^{n} (Y_i - m)^2 = \sum_{i=1}^{n} \frac{d}{dm} (Y_i - m)^2 = -2 \sum_{i=1}^{n} (Y_i - m)$$

Set the derivative to zero and denote the optimal value of m by $\hat{m}$:

$$\sum_{i=1}^{n} Y_i = n\hat{m} \quad \Longrightarrow \quad \hat{m} = \frac{1}{n} \sum_{i=1}^{n} Y_i = \bar{Y}$$
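A numerical check of this least squares property (minimising the sum of squared deviations recovers the sample mean):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
y = rng.normal(10, 3, size=50)      # any sample will do

# Minimise SSE(m) = sum of (y_i - m)^2 over m
res = minimize_scalar(lambda m: np.sum((y - m) ** 2))
print(res.x, y.mean())              # the minimiser equals the sample mean
```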

60 / 86
Sampling distribution of the sample mean
Why use $\bar{Y}$ to estimate $\mu$?

4. $\bar{Y}$ has the smallest sampling variance among all linear unbiased estimators.

Consider $\hat{\mu} = \sum_{i=1}^{n} a_i Y_i$, where the $a_i$ are chosen so that $\hat{\mu}$ is unbiased.
Then $\mathrm{Var}(\hat{\mu}) \ge \mathrm{Var}(\bar{Y})$ (proof: beyond this unit).

The sample mean is the most efficient (Best) Linear Unbiased Estimator of the population mean (abbreviated as BLUE).

61 / 86
Bias, consistency, and efficiency

Let $\hat{\mu}$ be an estimator of $\mu$.

The bias of $\hat{\mu}$ is $E(\hat{\mu}) - \mu$.

$\hat{\mu}$ is an unbiased estimator of $\mu$ if $E(\hat{\mu}) - \mu = 0$.

$\hat{\mu}$ is a consistent estimator of $\mu$ if $\hat{\mu} \xrightarrow{p} \mu$.

Let $\tilde{\mu}$ be another estimator of $\mu$. $\hat{\mu}$ is more efficient than $\tilde{\mu}$ if $\mathrm{Var}(\hat{\mu}) < \mathrm{Var}(\tilde{\mu})$.
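A simulation sketch of relative efficiency: for normal data, both the sample mean and the sample median are unbiased for $\mu$, but the mean has the smaller sampling variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 25, 100_000
samples = rng.normal(0.0, 1.0, size=(reps, n))

means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print(means.mean(), medians.mean())   # both approx 0 (unbiased)
print(means.var(), medians.var())     # median variance approx pi/2 times larger
```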

62 / 86
Confidence intervals

Under the assumptions of the CLT, an approximate 95% confidence interval (CI) for $\mu$ is

$$\bar{Y} \pm 1.96 \sqrt{\mathrm{Var}(\bar{Y})}.$$

Under the assumptions of the CLT, a general approximate $100(1 - \alpha)\%$ CI for $\mu$ is

$$\bar{Y} \pm z_{1-\alpha/2} \frac{s_Y}{\sqrt{n}}.$$

63 / 86
Confidence intervals: a quiz

A poll constructed a 95% CI of (0.63, 0.73) for the proportion of NSW residents that support the continuation of the lockout laws. What is the accurate interpretation of this CI?

(A) There is a 95% probability that the sample proportion is between 0.63 and 0.73.
(B) The poll estimated that the proportion of NSW residents that
supports the lockouts is between 0.63 and 0.73. The estimator has
the property that it covers the population parameter 95% of the
time in repeated samples.
(C) There is a 95% probability that a random sample of the NSW
population will yield a sample proportion between 0.63 and 0.73.
(D) There is a 95% probability that the proportion of NSW
residents that support the lockouts is between 0.63 and 0.73.

64 / 86
Confidence intervals

Another way to think about it is:

Obtaining a sample proportion of 0.68 (holding the sample size fixed) is unlikely if the population proportion is outside (0.63, 0.73).

65 / 86
Confidence intervals

A tip
In classical statistical inference, all the probabilistic statements
that we make are about samples and sample estimators.

Any statement that treats the population parameter as a random variable is incorrect in this framework. The sample is random; the parameter is not.

66 / 86
Hypothesis testing

If a research question suggests a specific value of a parameter, a hypothesis test can be appropriate.

NB: A two-sided hypothesis test is equivalent to a central confidence interval.

Example:

What is the typical internet access percentage?


Is the typical internet percentage equal to 25%?

67 / 86
Hypothesis testing

The first and most important step is setting the alternative hypothesis.

Example: An internet access level of 25% is a threshold for developed nations.

Is the typical internet access level 25%?

Is the typical internet access level less than 25%?

These require different alternative hypotheses.

68 / 86
Hypothesis testing
Parametric location - t test

69 / 86
Hypothesis testing
P-values

Roughly, the p-value represents the evidence against the null hypothesis.

When the alternative is $H_1: \mu \ne 25$,

$$\text{p-val} = P(t_{212} < -9.3) + P(t_{212} > 9.3) = 2 \times P(t_{212} > 9.3) \approx 0.$$

70 / 86
Hypothesis testing
One-sided p-values

When the alternative is $H_1: \mu < 25$,

$$\text{p-val} = P(t_{212} < -9.3) \approx P(Z < -9.3) \approx 0.$$

This is exactly half of the two-sided p-value.
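In Python (a sketch using the slide's t-statistic and degrees of freedom):

```python
from scipy import stats

t_stat, df = -9.3, 212

p_two = 2 * stats.t.sf(abs(t_stat), df)   # two-sided: H1 is mu != 25
p_one = stats.t.cdf(t_stat, df)           # one-sided: H1 is mu < 25

print(p_two, p_one)    # both effectively zero; p_one is half of p_two
```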


71 / 86
Hypothesis testing
Revision of terminology

The p-value is the probability of drawing a sample statistic (e.g. $\bar{Y}$) at least as extreme as that observed, if the null hypothesis is true.

The significance level of a test is a pre-specified probability of incorrectly rejecting the null hypothesis when it is true.

Calculating the p-value based on $\bar{Y}$:

$$P(\text{observing } \bar{Y} \text{ as far or further away from } \mu_0 \mid \mu = \mu_0).$$

72 / 86
Hypothesis testing

A common pitfall
Note that the significance level is pre-specified, technically before you see any data. You may sometimes read statements such as "the test statistic is almost significant", which are at odds with the underlying principles of hypothesis testing. Either the result is statistically significant at the pre-specified level, or it is not. End of story.

73 / 86
Student's t distribution

Remember the recipe?

1. Compute the t-statistic.

2. Compute the degrees of freedom, which is n − 1.

3. Look up the 5% critical value.

4. If the t-statistic exceeds (in absolute value) this critical value, reject the null hypothesis.

74 / 86
Student's t distribution

Usually the population $\sigma$ is unknown, so we estimate it by the sample standard deviation $s$. Then the sampling distribution of $\bar{Y}$ is better approximated by a Student's t distribution than by a Gaussian:

$$\frac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

A t distribution has higher variance and fatter tails than a Gaussian, allowing for the estimation of $\sigma$ by $s$.

75 / 86
Student's t distribution

76 / 86
Student's t distribution
Properties

Symmetric.

More widely dispersed than N(0, 1): more area in the tails and less in the centre than the normal distribution.

For large n, the t distribution converges to the standard normal $N(\mu = 0, \sigma^2 = 1)$.

A t distribution has an associated degrees of freedom; in the case of $\bar{Y}$ it is n − 1.

77 / 86
Student's t distribution
Confidence interval

In general, the confidence interval for a population mean is given by:

$$\bar{Y} \pm t_{1-\alpha/2;\,n-1} \frac{s}{\sqrt{n}}$$

For a 95% confidence interval with large sample size n, $t_{0.975;\,n-1} \approx 1.96$ by the central limit theorem. For smaller sample sizes $t_{0.975;\,n-1}$ is larger. Python or R can calculate this interval for us.
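For example, a minimal Python version (the sample below is randomly generated for illustration; with real data, replace `y`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(50, 10, size=20)          # illustrative sample, n = 20

n, ybar, s = len(y), y.mean(), y.std(ddof=1)
tcrit = stats.t.ppf(0.975, df=n - 1)     # t_{0.975; n-1} for a 95% CI

ci = (ybar - tcrit * s / np.sqrt(n), ybar + tcrit * s / np.sqrt(n))
print(ci)
```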

78 / 86
Student's t distribution

Here are some 5% critical values for two-sided tests:

Degrees of freedom (n − 1)    5% t-distribution critical value
10                            2.23
20                            2.09
30                            2.04
60                            2.00
∞                             1.96
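These values come straight from the t quantile function; a short loop reproduces the table:

```python
from scipy import stats

# Two-sided 5% critical values are the 97.5% quantiles of t with df degrees of freedom
for df in [10, 20, 30, 60]:
    print(df, round(stats.t.ppf(0.975, df), 2))    # 2.23, 2.09, 2.04, 2.00
print("inf", round(stats.norm.ppf(0.975), 2))      # limiting normal value 1.96
```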

79 / 86
Student's t distribution
Computing the p-value with an estimated $\sigma^2$

$$\text{p-value} = P\left(|t| > \left|\frac{\bar{Y}^{obs} - \mu_0}{s/\sqrt{n}}\right|\right) = P_{H_0}\left(|t| > |t\text{-stat}|\right)$$

(the probability under the Student's t distribution outside $\pm$ t-stat).

80 / 86
Student's t distribution
Summary

If Y is distributed $N(\mu, \sigma^2)$, then $\dfrac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1}$ is an exact result when $\sigma^2$ is unknown.

However, the assumption that Y is distributed $N(\mu, \sigma^2)$ is rarely plausible in practice.

For n > 30, the t distribution and N(0, 1) are very close and there is no practical difference. As n grows large, the $t_{n-1}$ distribution converges to N(0, 1).

81 / 86
Hypothesis testing
What is the link between the p-value and the significance level?

The significance level is pre-specified. For example, if the pre-specified significance level is 5%:

You reject the null hypothesis if |t| > 1.96.

Equivalently, you reject if p < 0.05.

The p-value is sometimes called the marginal significance level.

Often, it is better to communicate the p-value than simply whether a test rejects or not, as long as your discussion does not fall into the trap discussed earlier. The p-value contains more information than the yes/no statement about whether the test rejects.

82 / 86
Student's t distribution

The theory of the t distribution was one of the early triumphs of mathematical statistics. It is very neat: if Y is i.i.d. normal, then you can know the exact, finite-sample distribution of the t-statistic; it is the Student's t. So you can construct confidence intervals (using the Student t critical value) that have exactly the right coverage rate, no matter what the sample size.

But...

83 / 86
Student's t distribution

The Student t distribution is most relevant only when the sample size is very small; but in that case, for it to be correct, you must be sure that the population distribution of Y is normal. In business and economic data, the normality assumption is rarely credible.

For example, do you think that earnings are normally distributed? Suppose you have a sample of n = 10 observations from one of these distributions: would you feel comfortable using the Student t distribution?

84 / 86
A skewed distribution

85 / 86
Summary
From the two assumptions of:
1. Simple random sampling of a population; that is, $Y_i$ for $i = 1, \ldots, n$ are i.i.d.
2. $0 < E(Y^4) < \infty$.

we developed, for large samples (large n):

Theory of estimation (the sampling distribution of $\bar{Y}$).

Theory of hypothesis testing (the large-n distribution of the t-statistic and computation of the p-value).

Theory of confidence intervals (constructed by inverting the test statistic).

Are assumptions (1) and (2) plausible in practice?

86 / 86
