
Statistical Modelling for Business

QBUS2810

Module 0: Review of Basic Statistics

Marcel Scharth and Richard Gerlach

Discipline of Business Analytics, The University of Sydney Business School

1 / 86
A review of basic statistics

Let's work through an example using a data set from Earthlink, a large internet provider that uses loyalty marketing to reduce churn.

Data: dueu-100kurs-sets.csv, 100,000 observations.

Churn: the percentage of customers who stop choosing your company in a fixed time period, e.g. 90 days or 1 year.

2 / 86
Example: Earthlink

Goal: predicting the number of online sessions per customer from their number of unique mailboxes.

Empirical question: What is the effect on the number of sessions of a user adding one extra mailbox? Adding 5 extra mailboxes?

Is there any way to answer this without data? Which variable would Earthlink like to predict? Why?

3 / 86
Example: Earthlink
The company (www.earthlink.net)

Number of sessions (no. of sessions) per month.

Number of WebMail boxes (no. of mailboxes) for each customer.

Number of customers: 100,000.

4 / 86
Example: Earthlink
Exploratory analysis

What does this table tell us? What does it say about the
relationship between sessions and mailboxes?

5 / 86
Example: Earthlink
Exploratory analysis

Do these plots say anything?

6 / 86
Example: Earthlink
Exploratory analysis

7 / 86
Scatter plots

Scatterplots are an important graphical tool for exploring relationships between numerical variables.

Relationships:

If Y tends to increase with X: positively related.

If Y tends to decrease as X increases: negatively related.

Here we mean typically increase or decrease, e.g. on average.

8 / 86
Example: Earthlink
Scatter plot

Does this figure suggest a relationship? Why are mailboxes on the horizontal axis?
9 / 86
Example: Earthlink
How to quantify the effect of mailboxes on sessions?

We can start with simple approaches. Consider customers with a small no. of mailboxes and those with a large no. of mailboxes (estimation/comparison).

Test the null hypothesis that the mean no. of sessions is the same in the two groups, against the alternative that they differ (hypothesis testing, confidence intervals).

Assess whether the proportion of high no. of sessions differs between customers with small and large numbers of mailboxes (HT, CIs for proportions).

10 / 86
Example: Earthlink
Compare small (mailboxes < 2) and large (mailboxes ≥ 2) webmail customers:

11 / 86
Example: Earthlink

Compare small (mailboxes < 2) and large (mailboxes ≥ 2) webmail customers:

1. Estimation of the difference between means.
2. Test the hypothesis that the mean difference = 0.
3. Construct a confidence interval for the mean difference.

12 / 86
Example: Earthlink
Step 1: estimation

$\bar{Y}_{\text{small}} - \bar{Y}_{\text{large}} = 22.15 - 23.17 = -1.02$

Is this a large difference statistically?

Is this a large difference in a practical (i.e. profit) sense?

Are averages the right measure here? $E(\text{no. sessions} \mid \text{small})$ vs $E(\text{no. sessions} \mid \text{large})$.

What about medians? They are actually 0 for both groups!

13 / 86
Example: Earthlink
Step 2: hypothesis testing

Difference in means test. Compute the t-statistic:

$$t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\dfrac{s_s^2}{n_s} + \dfrac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)}$$

where $SE(\bar{Y}_s - \bar{Y}_l)$ is the standard error of $\bar{Y}_s - \bar{Y}_l$, the subscripts $s$ and $l$ refer to small and large numbers of mailboxes, and

$$s_s^2 = \frac{1}{n_s - 1} \sum_{i=1}^{n_s} (Y_{s,i} - \bar{Y}_s)^2,$$

etc.

14 / 86
Example: Earthlink
Computing the test

$$t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\dfrac{s_s^2}{n_s} + \dfrac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)} = \frac{22.15 - 23.17}{0.64} = \frac{-1.02}{0.64} = -1.60$$

|t| < 1.96, so do not reject (at the 5% significance level) the null
hypothesis that the two means are the same.
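A minimal Python sketch of this calculation, using the summary numbers from the slide. The commented `ttest_ind` call shows how the same (unequal-variance, Welch-form) test would run on the raw observations, which are not reproduced here:

```python
from scipy import stats

# Summary numbers from the slides
ybar_s, ybar_l = 22.15, 23.17   # sample means: small and large mailbox groups
se_diff = 0.64                  # standard error of (ybar_s - ybar_l)

t_stat = (ybar_s - ybar_l) / se_diff
print(t_stat)                   # approx -1.60; |t| < 1.96, so do not reject

# With the raw observations y_s, y_l the same test is one call:
# t_stat, p_val = stats.ttest_ind(y_s, y_l, equal_var=False)
```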
15 / 86
Example: Earthlink
Step 3: confidence interval

A 95% confidence interval for the difference between the means is

$$(\bar{Y}_s - \bar{Y}_l) \pm 1.96\, SE(\bar{Y}_s - \bar{Y}_l) = -1.02 \pm 1.96 \times 0.64 = (-2.27,\ 0.23).$$

Two equivalent statements:

The 95% confidence interval for the difference includes 0;

The hypothesis that the difference = 0 is not rejected at the 5% level.
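The same interval in Python (a sketch using the slide's numbers; `norm.ppf` supplies the 1.96 critical value):

```python
from scipy import stats

diff, se = -1.02, 0.64
z = stats.norm.ppf(0.975)            # approx 1.96
ci = (diff - z * se, diff + z * se)
print(ci)                            # approx (-2.27, 0.23): includes 0
```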

16 / 86
Example: Earthlink
Proportions

Compare small (boxes < 2) and large (boxes ≥ 2) customers with low sessions (< 5) and high sessions (≥ 5):

$P(\text{low} \mid \text{small})$ vs $P(\text{low} \mid \text{large})$

17 / 86
Example: Earthlink
Proportion of low sessions by mailbox size

Test the null hypothesis that $P(\text{low} \mid \text{small}) = P(\text{low} \mid \text{large})$. What is the best way to measure and assess this relationship?
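One standard answer is a two-proportion z-test. A hedged sketch follows; the counts below are hypothetical, since the slide does not report them:

```python
import numpy as np

# Hypothetical counts: x = low-session customers, n = group size
x_small, n_small = 30_000, 60_000
x_large, n_large = 18_000, 40_000

p_small, p_large = x_small / n_small, x_large / n_large
p_pool = (x_small + x_large) / (n_small + n_large)   # pooled proportion under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_small + 1 / n_large))

z = (p_small - p_large) / se      # compare |z| with 1.96 at the 5% level
print(z)
```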

18 / 86
What comes next

Mechanics of estimation, hypothesis testing, and confidence intervals for assessing and testing relationships. First, we will review some basics:

Data types, graphing, summary stats.

What assumptions do these procedures rely on? What alternatives exist in case these don't hold?

Foundations of probability and statistics.

19 / 86
Review of statistics

Data types
Probability
Estimation
Testing
Confidence Intervals

Readings: Fox Chapters 1, 2, 3; Berenson et al. Chapters 1–11 (better). NB: this is your BUSS1020 text.

20 / 86
Definitions and concepts

Population
The group or collection of all possible entities of interest. We will
abstractly think of populations as infinitely large.

Variable
A quantity of interest that varies and can be measured: e.g.
categories, numerical values, counts etc.

Random variable (RV)
A variable whose values appear (to us) random (i.e. not able to be perfectly predicted).

21 / 86
Definitions and concepts

Sample
A subset of the population available for analysis.

Exploratory data analysis
Graphs and summary measures regarding a sample of data.

Parameter
An unknown non-random quantity of interest regarding the
population.

22 / 86
Definitions and concepts

Estimation
Using sample data to approximate the value of a parameter.

Inference
Employing statistical methods to estimate uncertainty in
estimation from a sample, using probability. Making statistical,
probabilistic conclusions about a parameter based on a sample.

23 / 86
Measurement

Categorical data
Unordered categories: nominal data.
Ordered categories: ordinal data.

Numerical data
Interval data: ordered, numerical data, differences meaningful
but no TRUE zero.
Ratio data: continuous numbers, discrete counts.

24 / 86
Probability distribution

What exactly is a probability?

The probabilities for each possible value of Y in the population, e.g. $Pr(Y = \text{yes})$ (Y is discrete or categorical).

Or ranges of values of Y, e.g. $Pr(40 \le Y \le 60)$ (Y is continuous).

25 / 86
Discrete RVs and probability

Let Y be a discrete RV with m possible values. The probability distribution for Y is

$$P(Y = y_i) = p_i$$

for $i = 1, 2, \ldots, m$, where

$$\sum_{i=1}^{m} P(Y = y_i) = \sum_{i=1}^{m} p_i = 1$$

and $0 \le p_i \le 1$ for all $i$.

26 / 86
Discrete or categorical RVs and probability
Example: Y is the number of times your PC crashes while
completing your assignment task.

27 / 86
Discrete RVs and probability

Let Y be a discrete random variable with m possible values. The probability distribution for Y is

$$P(Y = y_i) = p_i$$

for $i = 1, 2, \ldots, m$, where

$$\sum_{i=1}^{m} P(Y = y_i) = \sum_{i=1}^{m} p_i = 1$$

and $0 \le p_i \le 1$ for all $i$.

28 / 86
Discrete RVs
Mean and variance

Let Y be a discrete RV with m possible numerical values. The expected value of Y is defined as:

$$\mu = E(Y) = \sum_{i=1}^{m} p_i y_i.$$

The variance of Y is:

$$\sigma^2 = \mathrm{Var}(Y) = E(Y - \mu)^2 = \sum_{i=1}^{m} p_i (y_i - \mu)^2.$$
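A small numerical illustration of these definitions (the pmf below is made up, for a variable like the PC-crash count in the earlier example):

```python
import numpy as np

y = np.array([0, 1, 2, 3, 4])                  # possible values of Y
p = np.array([0.80, 0.10, 0.06, 0.03, 0.01])   # hypothetical probabilities
assert np.isclose(p.sum(), 1.0)

mu = np.sum(p * y)                  # E(Y) = sum of p_i * y_i
var = np.sum(p * (y - mu) ** 2)     # Var(Y) = sum of p_i * (y_i - mu)^2
print(mu, var)
```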

29 / 86
Discrete RVs
Median and mode

Let Y be a discrete RV with m possible numerical values and, without loss of generality, let the $y_i$ be ordered for $i = 1, \ldots, m$.

The median of Y is the value $y_i$ such that $P(Y \le y_{i-1}) \le 0.5$ and $P(Y \ge y_{i+1}) \le 0.5$.

And the mode of Y is the value $y_i$ such that $P(Y = y_i) = \max_j P(Y = y_j)$ over $j = 1, \ldots, m$.

30 / 86
Discrete or categorical RVs
Example: Internet access percentage across countries (2008)

31 / 86
Continuous RVs and probability

Let Y be a continuous random variable. The cumulative distribution function (CDF) is defined as

$$P(Y \le a) = \int_{-\infty}^{a} p(y)\,dy,$$

where $p(y)$ is the probability density function (pdf), $p(y) \ge 0$, and

$$\int_{-\infty}^{\infty} p(y)\,dy = 1.$$

We have that $P(Y = y) = 0$ for any particular value $y$ (why?).

32 / 86
Continuous RVs

33 / 86
Continuous RVs
Mean and variance

Let Y be a continuous RV. The expected value of Y is defined as

$$\mu = E(Y) = \int_{-\infty}^{\infty} y\, p(y)\,dy.$$

The variance of Y is

$$\sigma^2 = \mathrm{Var}(Y) = E(Y - \mu)^2 = \int_{-\infty}^{\infty} (y - \mu)^2\, p(y)\,dy.$$
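These integrals can be checked numerically. A sketch using an exponential pdf with mean 2 (so the true variance is 4):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

pdf = stats.expon(scale=2).pdf     # exponential density: E(Y) = 2, Var(Y) = 4

mu, _ = quad(lambda y: y * pdf(y), 0, np.inf)               # integral of y p(y)
var, _ = quad(lambda y: (y - mu) ** 2 * pdf(y), 0, np.inf)  # integral of (y - mu)^2 p(y)
print(mu, var)                     # approx 2.0 and 4.0
```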

34 / 86
Continuous RVs
Median and mode

Let Y be a continuous RV.

The median $\nu$ of Y is defined as the value such that $P(Y < \nu) = \int_{-\infty}^{\nu} p(y)\,dy = 0.5$.

And the mode of Y is the value $a$ such that $p(y) \le p(a)$ for all possible values of Y (the value that maximises the pdf).

35 / 86
The normal (Gaussian) distribution

36 / 86
Measurement
Graphing and summary statistics for ratio, interval variables

Continuous:

Graphs: histogram, boxplot, dotplot, scatterplot.


Location: mean, median, percentiles, mode.
Spread: std. deviation, range, inter-quartile range.
Shape: skewness, kurtosis.

Discrete:

Small range: bar chart.


Large range: as for ratio data.
Location: mode, median, mean.
Spread: range, std deviation.

37 / 86
Measurement
Graphing and summary statistics for category variables

Ordinal (ordered categories):

Graphs: Bar chart, Pie chart.


Location: Mode, % in each category, median.
Spread: Range, IQR.

Nominal (unordered categories):

Graphs: Pareto chart, Bar chart, Pie chart.


Location: Mode, % in each category.

38 / 86
Categorical RVs
Parameters

Let Y be a categorical random variable with m possible values. The expected value and variance of Y are undefined. The mode of Y is the category $i$ such that $p_i = \max\{p_1, p_2, \ldots, p_m\}$.

39 / 86
Earthlink: Mailboxes

40 / 86
Earthlink: Customer churn (60 days)

41 / 86
Good graphing principles

Highlight the message, minimise the noise.
Include 0 on the vertical axis for proper comparison of heights.
Show horizontal spacings that reflect reality (e.g. time).
3D effects usually induce noise and don't help the message.
Properly label axes.

42 / 86
Whats wrong with these plots?

43 / 86
Are these graphs better? Why?

44 / 86
Comments?

45 / 86
Comments?

46 / 86
Statistical tools

How do we know which to use?

1 Summary statistics.
2 Graph.
3 Estimation.
4 Testing.

47 / 86
Sampling distributions

We have a sample of data

y1 , y2 , y3 , . . . , yn

from a population Y .

We estimate the parameters of interest with statistics; for example, the sample average

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

estimates the population mean $\mu$.

How much could the estimate change if a different sample had been taken?
48 / 86
Sampling distribution of the sample mean

Assume a simple random sample (SRS)

y1 , y2 , y3 , . . . , yn

is taken from the population Y .

What is the (sampling) distribution of the sample mean? Or: how much could the sample mean tell us about the true mean $\mu$?

49 / 86
Sampling distribution of the sample mean

Because each observation is selected at random, $Y_i$ has no information about $Y_j$. Thus:

Yi and Yj are independently distributed.


Yi and Yj come from the same distribution, that is, Yi and Yj
are identically distributed.
That is, Yi and Yj are i.i.d.
More generally, under SRS, Yi , i = 1, . . . , n are i.i.d.

SRS allows inferential statements to be made about the population, using only a sample of data from it.

50 / 86
Sampling distribution of the sample mean
Estimation

$\bar{Y}$ is the natural sample estimator of the population mean. But:

What are the (sampling) properties of $\bar{Y}$?

Why use $\bar{Y}$ rather than some other estimator? E.g. $y_1$ (the first observation); or a weighted average of sample points ($\sum_{i=1}^{n} w_i y_i$ where $\sum_{i=1}^{n} w_i = 1$); or the median.

51 / 86
Sampling distribution of the sample mean

$\bar{Y}$ is a random variable.

If the sample is drawn at random, then the observed $\bar{Y}$ is also random.
The distribution of possible values of $\bar{Y}$ over different possible samples of size n is called the sampling distribution of $\bar{Y}$.
The mean and variance of $\bar{Y}$ are the mean and variance of its sampling distribution, $E(\bar{Y})$ and $\mathrm{Var}(\bar{Y})$.

The concept of the sampling distribution underpins all of statistical inference.

52 / 86
Sampling distribution of the sample mean
Things we want to know

Is the mean of $\bar{Y}$ the true population mean, $E(\bar{Y}) = \mu$? I.e., is $\bar{Y}$ an unbiased estimator of $\mu$?

What is the variance of $\bar{Y}$? How does $\mathrm{Var}(\bar{Y})$ depend on n?

Does $\bar{Y}$ become closer and closer to $\mu$ when n is large? I.e., is $\bar{Y}$ a consistent estimator of $\mu$?

Does $\bar{Y}$ appear Gaussian for large n? Is this generally true? The Central Limit Theorem (CLT) suggests $\bar{Y}$ is approximately normally distributed for n large.

53 / 86
Sampling distribution of the sample mean
If the $Y_i$ are i.i.d. samples (from any distribution), then across many such samples:

$$E(\bar{Y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu$$

$$\mathrm{Var}(\bar{Y}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\left(\sum_{i=1}^{n} \mathrm{Var}(Y_i) + 2\sum_{i=1}^{n}\sum_{j<i} \mathrm{Cov}(Y_i, Y_j)\right) = \frac{1}{n^2}\left(n\sigma^2 + 0\right) = \frac{\sigma^2}{n}$$
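A quick Monte Carlo sketch of these two results (simulated normal data here; any distribution with finite variance would do):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 25, 100_000

# Many independent samples of size n; one sample mean per row
ybars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(ybars.mean())                 # approx mu = 5.0 (unbiasedness)
print(ybars.var(), sigma**2 / n)    # both approx 0.16 (variance = sigma^2 / n)
```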

54 / 86
Sampling distribution of the sample mean
Mean and variance of the sampling distribution of $\bar{Y}$

$$E(\bar{Y}) = \mu, \qquad \mathrm{Var}(\bar{Y}) = \frac{\sigma^2}{n}$$

Implications:

$\bar{Y}$ is an unbiased estimator of $\mu$.

$\mathrm{Var}(\bar{Y})$ is inversely proportional to n. The standard deviation of the sampling distribution, or sampling uncertainty, is proportional to $1/\sqrt{n}$.

Thus the sampling uncertainty associated with $\bar{Y}$ goes to zero as n increases!
55 / 86
Sampling distribution of the sample mean

For small n, the distribution of $\bar{Y}$ is complicated: it depends on the distribution of Y.

As n increases, the distribution of $\bar{Y}$ becomes more tightly centred around $E(Y)$ (the Law of Large Numbers).

And the distribution of $\bar{Y}$ becomes Gaussian (the Central Limit Theorem).

56 / 86
Sampling distribution of the sample mean
The Central Limit Theorem (CLT)

If $Y_1, \ldots, Y_n$ are i.i.d. and $\mu, \sigma^2 < \infty$, then for n large the distribution of $\bar{Y}$ is well approximated by a Gaussian:

$$\bar{Y} \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$

That is, for a standardised $\bar{Y}$,

$$\frac{\bar{Y} - E(\bar{Y})}{\sqrt{\mathrm{Var}(\bar{Y})}} = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1).$$

The larger n, the better the approximation is.
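A simulation sketch of the CLT in action, starting from a strongly skewed population (exponential with $\mu = \sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 100_000

# Sample means of n exponential draws, standardised as in the formula above
ybars = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (ybars - 1.0) / (1.0 / np.sqrt(n))

# Under a good N(0,1) approximation, about 95% of z lies inside (-1.96, 1.96)
print(np.mean(np.abs(z) < 1.96))    # close to 0.95
```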

57 / 86
Sampling distribution of the sample mean
Summary

For $Y_1, \ldots, Y_n$ i.i.d. and $\mu, \sigma^2 < \infty$:

The exact (finite sample) sampling distribution of $\bar{Y}$ has mean $\mu$ and variance $\sigma^2/n$.

The distribution of $\bar{Y}$ is complicated and depends on the distribution of Y.

When n is large, the sampling distribution simplifies: $\bar{Y} \xrightarrow{p} \mu$ (Law of Large Numbers), and $\dfrac{\bar{Y} - E(\bar{Y})}{\sqrt{\mathrm{Var}(\bar{Y})}}$ is approximately $N(0, 1)$ (CLT).

58 / 86
Sampling distribution of the sample mean
Why use $\bar{Y}$ to estimate $\mu$?

1. $\bar{Y}$ is unbiased: $E(\bar{Y}) = \mu$.

2. $\bar{Y}$ is consistent: $\bar{Y} \xrightarrow{p} \mu$.

3. $\bar{Y}$ is the least squares estimator of $\mu$; i.e., it solves $\min_m \sum_{i=1}^{n} (Y_i - m)^2$.

59 / 86
Sampling distribution of the sample mean
Why use $\bar{Y}$ to estimate $\mu$?

$$\frac{d}{dm} \sum_{i=1}^{n} (Y_i - m)^2 = \sum_{i=1}^{n} \frac{d}{dm} (Y_i - m)^2 = -2 \sum_{i=1}^{n} (Y_i - m)$$

Set the derivative to zero and denote the optimal value of m by $\hat{m}$:

$$\sum_{i=1}^{n} Y_i = n\hat{m} \quad \Longrightarrow \quad \hat{m} = \frac{1}{n} \sum_{i=1}^{n} Y_i = \bar{Y}$$
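A numerical check of this least squares property (minimising the sum of squared deviations recovers the sample mean):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
y = rng.normal(10, 3, size=50)      # any sample will do

# Minimise SSE(m) = sum of (y_i - m)^2 over m
res = minimize_scalar(lambda m: np.sum((y - m) ** 2))
print(res.x, y.mean())              # the minimiser equals the sample mean
```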

60 / 86
Sampling distribution of the sample mean
Why use $\bar{Y}$ to estimate $\mu$?

4. $\bar{Y}$ has the smallest sampling variance among all linear unbiased estimators.

Consider $\hat{\mu} = \sum_{i=1}^{n} a_i Y_i$, where the $a_i$ are chosen so that $\hat{\mu}$ is unbiased.
Then $\mathrm{Var}(\hat{\mu}) \ge \mathrm{Var}(\bar{Y})$ (proof: beyond this unit).

The sample mean is the most efficient (Best) Linear Unbiased Estimator of the population mean (abbreviated as BLUE).

61 / 86
Bias, consistency, and efficiency

Let $\hat{\mu}$ be an estimator of $\mu$.

The bias of $\hat{\mu}$ is $E(\hat{\mu}) - \mu$.

$\hat{\mu}$ is an unbiased estimator of $\mu$ if $E(\hat{\mu}) - \mu = 0$.

$\hat{\mu}$ is a consistent estimator of $\mu$ if $\hat{\mu} \xrightarrow{p} \mu$.

Let $\tilde{\mu}$ be another estimator of $\mu$. $\hat{\mu}$ is more efficient than $\tilde{\mu}$ if $\mathrm{Var}(\hat{\mu}) < \mathrm{Var}(\tilde{\mu})$.
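A simulation sketch of relative efficiency: for normal data, both the sample mean and the sample median are unbiased for $\mu$, but the mean has the smaller sampling variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 25, 100_000
samples = rng.normal(0.0, 1.0, size=(reps, n))

means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print(means.mean(), medians.mean())   # both approx 0 (unbiased)
print(means.var(), medians.var())     # median variance approx pi/2 times larger
```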

62 / 86
Confidence intervals

Under the assumptions of the CLT, an approximate 95% confidence interval (CI) for $\mu$ is

$$\bar{Y} \pm 1.96 \sqrt{\mathrm{Var}(\bar{Y})}.$$

Under the assumptions of the CLT, a general approximate $100(1 - \alpha)\%$ CI for $\mu$ is

$$\bar{Y} \pm z_{1-\alpha/2} \frac{s_Y}{\sqrt{n}}.$$

63 / 86
Confidence intervals: a quiz

A poll constructed a 95% CI of (0.63, 0.73) for the proportion of NSW residents that support the continuation of the lockout laws. What is the accurate interpretation of this CI?

(A) There is a 95% probability that the sample proportion is between 0.63 and 0.73.
(B) The poll estimated that the proportion of NSW residents that
supports the lockouts is between 0.63 and 0.73. The estimator has
the property that it covers the population parameter 95% of the
time in repeated samples.
(C) There is a 95% probability that a random sample of the NSW
population will yield a sample proportion between 0.63 and 0.73.
(D) There is a 95% probability that the proportion of NSW
residents that support the lockouts is between 0.63 and 0.73.

64 / 86
Confidence intervals

Another way to think about it is:

Obtaining a sample proportion of 0.68 (holding the sample size fixed) is unlikely if the population proportion is outside (0.63, 0.73).

65 / 86
Confidence intervals

A tip
In classical statistical inference, all the probabilistic statements
that we make are about samples and sample estimators.

Any statement that treats the population parameter as a random variable is incorrect in this framework. The sample is random; the parameter is not.

66 / 86
Hypothesis testing

If a research question suggests a specific value of a parameter, a hypothesis test can be appropriate.

NB: A two-sided hypothesis test is equivalent to a central confidence interval.

Example:

What is the typical internet access percentage?


Is the typical internet percentage equal to 25%?

67 / 86
Hypothesis testing

The first and most important step is setting the alternative hypothesis.

Example: An internet access level of 25% is a threshold for developed nations.

Is the typical internet access level 25%?

Is the typical internet access level less than 25%?

These require different alternative hypotheses.

68 / 86
Hypothesis testing
Parametric location - t test

69 / 86
Hypothesis testing
P-values

Roughly, the p-value represents the evidence against the null hypothesis.

When the alternative is $H_1: \mu \ne 25$,

$$\text{p-val} = P(t_{212} < -9.3) + P(t_{212} > 9.3) = 2 \times P(t_{212} > 9.3) \approx 0.$$

70 / 86
Hypothesis testing
One-sided p-values

When the alternative is $H_1: \mu < 25$,

$$\text{p-val} = P(t_{212} < -9.3) \approx P(Z < -9.3) \approx 0.$$

This is exactly half of the two-sided p-value.
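In Python (a sketch using the slide's t-statistic and degrees of freedom):

```python
from scipy import stats

t_stat, df = -9.3, 212

p_two = 2 * stats.t.sf(abs(t_stat), df)   # two-sided: H1 is mu != 25
p_one = stats.t.cdf(t_stat, df)           # one-sided: H1 is mu < 25

print(p_two, p_one)    # both effectively zero; p_one is half of p_two
```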


71 / 86
Hypothesis testing
Revision of terminology

The p-value is the probability of drawing a sample statistic (e.g. $\bar{Y}$) at least as extreme as that observed, if the null hypothesis is true.

The significance level of a test is a pre-specified probability of incorrectly rejecting the null hypothesis when it is true.

Calculating the p-value based on $\bar{Y}$:

$$P(\text{observing } \bar{Y} \text{ as far or further away from } \mu_0 \mid \mu = \mu_0).$$

72 / 86
Hypothesis testing

A common pitfall
Note that the significance level is pre-specified, technically before you see any data. You may sometimes read statements such as "the test statistic is almost significant", which are at odds with the underlying principles of hypothesis testing. Either the result is statistically significant at the pre-specified level, or it is not. End of story.

73 / 86
Student's t distribution

Remember the recipe?

1. Compute the t-statistic.

2. Compute the degrees of freedom, which is n − 1.

3. Look up the 5% critical value.

4. If the t-statistic exceeds (in absolute value) this critical value, reject the null hypothesis.

74 / 86
Student's t distribution

Usually the population $\sigma$ is unknown, so we estimate it by the sample standard deviation $s$. Then the sampling distribution of $\bar{Y}$ is better approximated by a Student's t distribution than by a Gaussian:

$$\frac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

A t distribution has higher variance and fatter tails than a Gaussian, allowing for the estimation of $\sigma$ by $s$.

75 / 86
Student's t distribution

76 / 86
Student's t distribution
Properties

Symmetric.

More widely dispersed than N(0, 1): more area in the tails and less in the centre than the normal distribution.

For large n, the t distribution converges to the standard normal $N(\mu = 0, \sigma^2 = 1)$.

A t distribution has an associated degrees of freedom; in the case of $\bar{Y}$ it is n − 1.

77 / 86
Student's t distribution
Confidence interval

In general, the confidence interval for a population mean is given by:

$$\bar{Y} \pm t_{1-\alpha/2;\,n-1} \frac{s}{\sqrt{n}}$$

For a 95% confidence interval with large sample size n, $t_{0.975;\,n-1} \approx 1.96$ by the central limit theorem. For smaller sample sizes $t_{0.975;\,n-1}$ is larger. Python or R can calculate this interval for us.
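For example, a minimal Python version (the sample below is randomly generated for illustration; with real data, replace `y`):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(50, 10, size=20)          # illustrative sample, n = 20

n, ybar, s = len(y), y.mean(), y.std(ddof=1)
tcrit = stats.t.ppf(0.975, df=n - 1)     # t_{0.975; n-1} for a 95% CI

ci = (ybar - tcrit * s / np.sqrt(n), ybar + tcrit * s / np.sqrt(n))
print(ci)
```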

78 / 86
Student's t distribution

Here are some 5% critical values for two-sided tests:

Degrees of freedom (n − 1)    5% t-distribution critical value
10                            2.23
20                            2.09
30                            2.04
60                            2.00
∞                             1.96
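These values come straight from the t quantile function; a short loop reproduces the table:

```python
from scipy import stats

# Two-sided 5% critical values are the 97.5% quantiles of t with df degrees of freedom
for df in [10, 20, 30, 60]:
    print(df, round(stats.t.ppf(0.975, df), 2))    # 2.23, 2.09, 2.04, 2.00
print("inf", round(stats.norm.ppf(0.975), 2))      # limiting normal value 1.96
```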

79 / 86
Student's t distribution
Computing the p-value with an estimated $\sigma^2$

$$\text{p-value} = P\left(|t| > \left|\frac{\bar{Y}^{obs} - \mu_0}{s/\sqrt{n}}\right|\right) = P_{H_0}\left(|t| > |t\text{-stat}|\right)$$

(the probability under the Student's t distribution outside $\pm$ t-stat).

80 / 86
Student's t distribution
Summary

If Y is distributed $N(\mu, \sigma^2)$, then $\dfrac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1}$ is an exact result when $\sigma^2$ is unknown.

However, the assumption that Y is distributed $N(\mu, \sigma^2)$ is rarely plausible in practice.

For n > 30, the t distribution and N(0, 1) are very close and there is no practical difference. As n grows large, the $t_{n-1}$ distribution converges to N(0, 1).

81 / 86
Hypothesis testing
What is the link between the p-value and the significance level?

The significance level is pre-specified. For example, if the pre-specified significance level is 5%:

You reject the null hypothesis if |t| > 1.96.

Equivalently, you reject if p < 0.05.

The p-value is sometimes called the marginal significance level.

Often, it is better to communicate the p-value than simply whether a test rejects or not, as long as your discussion does not fall into the trap discussed earlier. The p-value contains more information than the yes/no statement about whether the test rejects.

82 / 86
Student's t distribution

The theory of the t distribution was one of the early triumphs of mathematical statistics. It is very neat: if Y is i.i.d. normal, then you can know the exact, finite-sample distribution of the t-statistic; it is the Student's t. So you can construct confidence intervals (using the Student t critical value) that have exactly the right coverage rate, no matter what the sample size.

But...

83 / 86
Student's t distribution

The Student t distribution is most relevant only when the sample size is very small; but in that case, for it to be correct, you must be sure that the population distribution of Y is normal. In business and economic data, the normality assumption is rarely credible.

For example, do you think that earnings are normally distributed? Suppose you have a sample of n = 10 observations from one of these distributions: would you feel comfortable using the Student t distribution?

84 / 86
A skewed distribution

85 / 86
Summary
From the two assumptions of:
1. Simple random sampling of a population; that is, $Y_i$ for $i = 1, \ldots, n$ are i.i.d.
2. $0 < E(Y^4) < \infty$.

we developed, for large samples (large n):

Theory of estimation (the sampling distribution of $\bar{Y}$).

Theory of hypothesis testing (the large-n distribution of the t-statistic and computation of the p-value).

Theory of confidence intervals (constructed by inverting the test statistic).

Are assumptions (1) and (2) plausible in practice?

86 / 86
