
STA1000S

Notes
Table of Contents
Week 1
Probability Versus Odds
Statistical Distributions
Excel Formulas
Fair Game
Win Percentage
House Advantage
Expected Gain/Loss
Week 2
Counting Rules
Permutations
Combinations
Counting Rules
Counting Rule 1:
Counting Rule 2:
Counting Rule 3:
Counting Rule 4:
Conditional Probability
Bayes’ Theorem
If A and B are Unrelated/Independent:
The Table Method for Bayes Theorem
Week 3
Qualitative data
Quantitative data:
Ordinal data:
Excel
Exploratory Data Analysis
Visually Displaying Data
Skewness
Five-Number Data Summaries
o Median
o Lower-quartile:
o Upper quartile
Constructing Box and Whisker Plots:
Five-Number Summary in Excel
Summary Statistics
Standard Deviation in Excel
Formulas for Mean and Variance
Measures of Location and Spread
Location
Spread
For more notes, videos and explanations:
Week 4
Random Variable
Probability Mass Functions/Discrete Random Variables
Probability Density Functions/Continuous Random Variables
Expected Values of PDFs and PMFs
Variance of Random Variable X
Probability Mass Function (Discrete):
Probability Density Function (Continuous):
Coefficient of Variation
Expected Winnings/Loss
Combining Random Variables
Week 5
Probability Distribution
Uniform Distribution
• Expected Value
• Variance
• Graph of Uniform Distribution
Binomial Distribution
• Expected Value
• Variance
Week 6
Probability Distributions
Poisson Distribution
• Expected Value
• Variance
• Graph
Exponential Distribution
• Expected Value
• Variance
Central Limit Theorem
The Normal Distribution
Calculating Probability in Normal Distributions
Things to Remember
Subtracting / Adding / Multiplying Normal Distributions
Lower/Upper Quartiles with Normal Distributions
Example of Normal Distribution Question
Week 7
Sample v Population
Percentage Point Notation
Confidence Intervals
Point Estimate
Interval
Confidence Interval Formula:
Width of Confidence Interval:
Determining Sample Size
General Sample Size Formula
Sample Size Formula When Trying to Achieve ‘L’ Accuracy
Some Common Z Values:
Some Things to Remember:
Week 8
The Hypothesis Test
2-Sided Test
Rejection Region in a 2-Sided Test
Which Level of Significance (α) to Use?
Some Things to Remember:
Comparing 2 Sample Means
Subtracting Distributions
Finding the Z Value for Calculating the Test Statistic
The Modified Approach
The P Value Explained
• Example of P-Value Question
Week 9
Unknown Population Variances
Finding the t-value:
Comparing to Z-Table
Confidence Interval Without Knowing Population Variance
Testing the Mean:
Two-Sided Test with Same Example
The Modified Approach
P-Value with One-Sided
P-Value with Two-Sided
• P-Value Example
What are Degrees of Freedom?
The Degree of Freedom “Rule”
Some Things to Remember
Comparing Two Means with the T-Distribution (6 Step Approach)
Finding a T-Value when Dealing with Two Means
Comparing Two Means with the T-Distribution (Modified Approach)
Finding the P-Value
Week 10
Comparing two Means in Paired Data Sets (6 Step Approach)
Comparing two Means in Paired Data Sets (Modified Approach)
Calculate P-Value
P-Value and Rejecting or Accepting H0
Excel and the T-Distribution
- Right-Tailed Test ( > )
- Left-Tailed Test ( < )
- Two-Tailed Test ( < > )
Confidence Intervals Under the T-Test
Some Things to Remember
Example of a Question
Goodness-of-Fit-Test: Whether Data Fits Various Distributions
Goodness-of-Fit-Test Under the 6-Step Approach
Chi-squared Distribution
Getting Correct Degrees of Freedom for Chi-Squared
What Does the Critical Value Mean?
Test Statistic Formula for Chi-Squared
Goodness-of-Fit Test Under the Modified Approach
- Finding the P-Value
A Note on Both the Modified and 6-Step Approach
Some Things to Remember
Week 11
Testing for an Association Between Two Categorical Variables
Testing for Association Using the 6-Step Approach
Getting Correct Degrees of Freedom Under Two Variable Association Test
Degrees of Freedom for Tests of Association
Testing for Association Using the Modified Approach
Finding the P-Value
Excel and the Chi-Squared Test of Association
- P-Value
- Critical Value
- Test Statistic
Testing for a Linear Relationship Between Two Variables
Is There a Linear Relationship Between X and Y?
The Coefficient of Determination: R²
Using X to Predict Y
Straight Line Formula with True Parameters
Straight Line Formula with Estimated Parameters
Hypothesis Test About 𝛽 Using the 6-Step Approach
Week 1
\[ \Pr(\text{Outcome}) = \frac{\text{no. of equally likely ways of getting outcome}}{\text{no. of equally likely outcomes}} \]

Probability Versus Odds

Probability reflects the number of ways of getting a specific outcome relative to the
total number of ways of conducting the experiment.

Odds reflect the number of ways that give you the event of interest relative to the
number of ways that don’t give you the event of interest.
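
The difference between the two definitions can be sketched in a few lines of Python. The example (throwing a six with a fair die) and the function names are illustrative assumptions, not part of the course material.

```python
from fractions import Fraction


def probability(favourable: int, total: int) -> Fraction:
    """Ways of getting the outcome relative to all equally likely outcomes."""
    return Fraction(favourable, total)


def odds(favourable: int, total: int) -> tuple[int, int]:
    """Ways that give the event versus ways that don't give the event."""
    return favourable, total - favourable


# Hypothetical example: throwing a six with a fair die.
p_six = probability(1, 6)   # 1 way out of 6 outcomes
odds_six = odds(1, 6)       # 1 way for, 5 ways against

print(p_six)      # 1/6
print(odds_six)   # (1, 5)
```

So a probability of 1/6 corresponds to odds of 1 to 5.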

Statistical Distributions

Statistical distributions are long-run patterns of various outcomes – used to predict


etc.

Excel Formulas

=RAND()
→ Generates a random number between 0 and 1. Pressing F9 recalculates the sheet, which re-generates the random numbers.
=COUNTIF(A1:A20;1)
→ “A1:A20” refers to the range of data.
→ “1” refers to the value you’re counting.
Pressing F4 while editing a reference locks it (e.g. $A$1:$A$20), so the range stays fixed when the formula is copied.
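
For readers who prefer code to a spreadsheet, here is a rough Python analogue of the two formulas. It is only a sketch: the 20 simulated die throws stand in for the range A1:A20 and are an assumption, not data from the course.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable; any seed works

# Analogue of =RAND(): a random number in [0, 1).
u = random.random()

# Hypothetical data: 20 simulated die throws, standing in for A1:A20.
throws = [random.randint(1, 6) for _ in range(20)]

# Analogue of =COUNTIF(A1:A20;1): how many throws equal 1.
ones = throws.count(1)

print(u, throws, ones)
```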

Fair Game

Nobody is expected to win and nobody is expected to lose in the long run.

(Example):

I have a stall where people come and bet on the numbers from die throws. Each
bet is R1. If Amy bets R1 on each of the 6 numbers (R6 in total), then for the game
to be fair the winning number should pay out R6: what she pays = what she wins.
Win Percentage

\[ \text{Win Percentage} = \frac{\text{total payout for winning}}{\text{amount bet across all numbers}} \times 100 \]
Indicates the percentage of what we have received in the form of bets that we pay
back in winnings.

The amount we pay back of what we take in.

House Advantage

This is the percentage of the money bet that we [the stall owners] retain.

\[ \text{House Adv} = 100 - \text{Win Percentage} \]
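
A minimal Python sketch of the win percentage and house advantage, using the die stall from the fair-game example. The R5 payout in the second case is a hypothetical value chosen to show an unfair game.

```python
def win_percentage(total_payout: float, total_bet: float) -> float:
    """Percentage of the money taken in as bets that is paid back in winnings."""
    return total_payout / total_bet * 100


def house_advantage(total_payout: float, total_bet: float) -> float:
    """Percentage of the money taken in as bets that the stall keeps."""
    return 100 - win_percentage(total_payout, total_bet)


# Fair game: R1 on each of 6 numbers (R6 in total), winner paid R6.
fair_win = win_percentage(6, 6)
fair_house = house_advantage(6, 6)

# Hypothetical unfair game: same bets, but the winner is only paid R5.
unfair_win = win_percentage(5, 6)
unfair_house = house_advantage(5, 6)

print(fair_win, fair_house)
print(round(unfair_win, 2), round(unfair_house, 2))
```

In a fair game the win percentage is 100 and the house advantage is 0.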

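Expected Gain/Loss

\[ E(G \text{ or } L) = \text{Total Initial Bet} - (\Pr(\text{Winning}) \times \text{Total Payout}) \]

\[ \text{Initial Bet} = \text{R}2 \]
\[ \Pr(\text{Winning}) = \left(\frac{1}{12}\right) \times 2 \]
\[ \text{Total Payout} = \text{R}11 \times 2 = \text{R}22 \]

Therefore:
\[ E(G \text{ or } L) = 2 - \left(\left(\frac{1}{12}\right) \times 22\right) \]
\[ E(G \text{ or } L) \approx 0.167 \]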

*With this formula, a positive E(G or L) means the bet is larger than the expected payout, so the bettor expects a loss.
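
The arithmetic can be checked with a short Python sketch. It assumes one reading of the worked example: a R2 stake, a win probability of 1/12 and a R22 payout, which are the numbers used in the final calculation above. The function name is hypothetical.

```python
from fractions import Fraction


def expected_gain_or_loss(total_bet, pr_winning, total_payout):
    """Total initial bet minus the expected payout, as in the formula above."""
    return total_bet - pr_winning * total_payout


# Assumed reading of the example: R2 staked, Pr(Winning) = 1/12, payout R22.
result = expected_gain_or_loss(2, Fraction(1, 12), 22)

print(result)          # exact fraction
print(float(result))   # decimal value in rand
```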


Week 2
Counting Rules

• Assist us in determining the number of elements in a given set.

Permutations

A permutation is an arrangement of objects where order matters.

When objects can be arranged in several different orders (like 6 people swapping
seats), count the number of possible arrangements by working from the first
position to the last.
o The first position has 6 options, the second has 5 (because somebody is
now sitting in position one), and so on. We then multiply the numbers:
6 × 5 × 4 × 3 × 2 × 1 = 720.

Combinations

A combination is a selection of objects where order does not matter.

Counting Rules

Counting Rule 1:
Arrangement of n objects without repetition:
\[ n! \]
Counting Rule 2:
Number of ways of ordering (order matters) n items chosen r at a time, without
repetition:
\[ \frac{n!}{(n-r)!} \]
Counting Rule 3:
Number of ways of selecting (order doesn’t matter) r objects from a total of n
objects, without repetition:
\[ \frac{n!}{r!\,(n-r)!} \]
Counting Rule 4:
Number of arrangements of n taken r at a time, with repetition:
\[ n^r \]
Always ask:
Does order matter?
Is repetition allowed?
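
The four counting rules map directly onto Python's standard `math` module (Python 3.8 or later). The values n = 6 and r = 2 are hypothetical, chosen only to give small numbers.

```python
import math

n, r = 6, 2  # hypothetical values: 6 objects, taken 2 at a time

rule1 = math.factorial(n)   # Rule 1: arrange all n objects, no repetition -> n!
rule2 = math.perm(n, r)     # Rule 2: order matters, no repetition -> n!/(n-r)!
rule3 = math.comb(n, r)     # Rule 3: order doesn't matter, no repetition -> n!/(r!(n-r)!)
rule4 = n ** r              # Rule 4: order matters, repetition allowed -> n^r

print(rule1, rule2, rule3, rule4)   # 720 30 15 36
```

The answers to the two questions above (does order matter, is repetition allowed) decide which line applies.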
Conditional Probability

Conditional probability is a method of updating our knowledge of the probability
of an event when we are provided with new information about another event which
may or may not occur.

Formula for conditional probability = the probability of event A occurring, given
that event B has occurred.
\[ \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} \]
\[ \Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)} \]

Formula for not A given B:

\[ \Pr(\bar{A} \mid B) = 1 - \Pr(A \mid B) \]
• The rule only applies when both probabilities are conditioned on the same event (here, B).
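
As a small illustration of both formulas, the sketch below counts outcomes for one throw of a fair die. The events A (an even number) and B (a number greater than 3) are hypothetical choices, not taken from Introstat.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                 # one throw of a fair die
A = {x for x in outcomes if x % 2 == 0}     # hypothetical event A: even number
B = {x for x in outcomes if x > 3}          # hypothetical event B: greater than 3


def pr(event: set) -> Fraction:
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


pr_a_given_b = pr(A & B) / pr(B)        # Pr(A | B) = Pr(A and B) / Pr(B)
pr_not_a_given_b = 1 - pr_a_given_b     # same conditioning event B on both sides

print(pr_a_given_b, pr_not_a_given_b)   # 2/3 1/3
```

Counting directly gives the same answer: of the three outcomes in B, only 5 is not even.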

Bayes’ Theorem

\[ \Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B \mid A)\,\Pr(A) + \Pr(B \mid \bar{A})\,\Pr(\bar{A})} \]

If A and B are Unrelated/Independent:

\[ \Pr(A \mid B) = \Pr(A) \]

(Example 3.36 in Introstat page 87)


The Table Method for Bayes Theorem

We know that:
• $\Pr(Z) = 0,02$
• $\Pr(P \mid \bar{Z}) = 0,07$
• $\Pr(\bar{P} \mid Z) = 0,01$

1. Draw up a table
2. Fill in the blocks that you can.
3. Perform calculations. Remember that these all represent intersections!
Therefore, we’ll be multiplying.
o (Ex. For $\bar{P}$):
§ We know that $\Pr(\bar{P} \mid Z) = 0,01$
• And we know that $\Pr(Z) = 0,02$
o Therefore, multiply the two numbers. (= 0,0002)
o Then for $P$, we go 0,02 − 0,0002 = 0,0198
o Then we know that $\Pr(P \mid \bar{Z}) = 0,07$
§ And we know that $\Pr(Z) = 0,02$ and therefore $\Pr(\bar{Z}) = 0,98$
• Therefore, 0,07 × 0,98 = 0,0686
• Then, 0,98 − 0,0686 = 0,9114

            $Z$       $\bar{Z}$   Total
$P$         0,0198    0,0686      0,0884
$\bar{P}$   0,0002    0,9114      0,9116
Total       0,02      0,98        1
[Final calculation not recoverable from the source] = 0.053
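
The table method can be reproduced step by step in Python. This is a sketch using the three probabilities given above; the variable names are assumptions. The last line computes Pr(Z | P) purely as an illustration of reading a conditional probability off the table, and it is not necessarily the quantity worked out in the calculation above.

```python
# Given probabilities (decimal points instead of decimal commas).
pr_z = 0.02                # Pr(Z)
pr_p_given_not_z = 0.07    # Pr(P | not Z)
pr_not_p_given_z = 0.01    # Pr(not P | Z)

# Column totals.
pr_not_z = 1 - pr_z                          # 0,98 in the table

# Cells are intersections, so multiply conditional by marginal.
not_p_and_z = pr_not_p_given_z * pr_z        # 0,0002 in the table
p_and_z = pr_z - not_p_and_z                 # 0,0198 in the table
p_and_not_z = pr_p_given_not_z * pr_not_z    # 0,0686 in the table
not_p_and_not_z = pr_not_z - p_and_not_z     # 0,9114 in the table

# Row totals.
pr_p = p_and_z + p_and_not_z                 # 0,0884 in the table
pr_not_p = not_p_and_z + not_p_and_not_z     # 0,9116 in the table

# Illustration only: a conditional probability read off the table.
pr_z_given_p = p_and_z / pr_p

print(round(pr_p, 4), round(pr_not_p, 4), round(pr_z_given_p, 3))
```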
Week 3
Qualitative data: (Categorical/Nominal Data)
• No numbers
• More than two categories but without intrinsic order
Quantitative data: (Fully Numeric Data)
• Can be ranked
• Always has numbers
Ordinal data: falls between the two (semi-numerical);
o Has an order, but can still be categorical
o The gaps between consecutive values do not have to be the same size
§ E.g. levels of satisfaction/education
Excel
=FREQUENCY(J2:J974; M20:M29)
Data array (J2:J974): data from which you want to retrieve your answers.
Bins array (M20:M29): bins = categories. Therefore if you want marks from 0 to
100% going like “0,10,20…”, these are your bins.
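
To see what the formula counts, here is a small Python sketch that mimics Excel's FREQUENCY. The function name `frequency` and the marks are hypothetical. The sketch assumes the bins are sorted in ascending order; each value is counted in the first bin that is greater than or equal to it, and values above the last bin land in one extra slot.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin: value v goes to the first bin b with v <= b.

    The returned list has one extra slot at the end for values above the last bin.
    `bins` must be sorted in ascending order.
    """
    counts = [0] * (len(bins) + 1)
    for v in data:
        counts[bisect_left(bins, v)] += 1
    return counts


marks = [5, 10, 37, 50, 50, 68, 99, 100]             # hypothetical marks
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]     # upper edge of each bin
counts = frequency(marks, bins)

for upper, count in zip(bins, counts):
    print(f"<= {upper}: {count}")
print(f"above {bins[-1]}: {counts[-1]}")
```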

R² tells us the proportion of the variation in ‘y’ which can be explained by the
variation in ‘x’.
Exploratory Data Analysis

Source: ww.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf

Visually Displaying Data


Diagrams provide insight into data

Qualitative data: (Categorical/Nominal Data)


Bar chart
Pie chart

Quantitative data: (Fully Numeric Data)


Can be ranked
1. Histogram (no 3D columns)
o No gap in between adjacent bars.
o Bars do not correspond to named categories; correspond to intervals
on a number line.
o Class intervals should be same width.
§ $\text{Size of Each Interval} = \dfrac{\text{Range}}{\text{Sample Size}}$
§ $L = \dfrac{x_{\max} - x_{\min}}{n}$
§ Range = max – min
o The 4 step histogram procedure:
o 1. Determine sample size. Find xmin and xmax.
o 2. Choose class intervals which cover the range. (Use formula)
o 3. Observe the number of things within each interval. Make a tick
sheet and then set up a frequency distribution.
§ Class intervals = bins. Intervals = bin width. Class frequency =
bin frequency.
o 4. Plot histogram with appropriate scales for axes.
o Histogram with 2 clear peaks = bimodal.
2. Scatter plot (relationship between two variables)
3. Box and whisker plots (see after ‘Five-Number Data Summaries’)
4. Stem and Leaf Plots
Visual effect like histogram, but, original data values can be extracted from
the display.
Means of data storage.
o How to construct stem and leaf plots – see page 15-16 in Introstat.
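
To illustrate why a stem and leaf plot doubles as data storage, here is a minimal Python sketch. It assumes non-negative whole numbers with the tens digit as the stem and the units digit as the leaf; the construction in Introstat may differ in detail, and the marks are hypothetical. Stems with no leaves are simply omitted in this sketch.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group values by tens digit (stem); the units digits are the leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)


def render(plot):
    """Draw one line per stem, e.g. ' 6 | 1 3 3'."""
    return "\n".join(
        f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(plot.items())
    )


marks = [63, 47, 58, 61, 52, 63, 89, 70, 75]   # hypothetical marks
plot = stem_and_leaf(marks)
print(render(plot))

# The original values can be read back from the display.
recovered = sorted(10 * stem + leaf for stem, leaves in plot.items() for leaf in leaves)
```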
Skewness
Skew to the right:
Tail stretches off to right.
Skew to the left:
Tail stretches off to left.
Negatively skewed distribution:
Mean < median; the few low scores shift mean to the left.
Positively skewed distribution:
Mode < mean and median.

Five-Number Data Summaries


Aim to reduce large batch of data to a few key numbers which summarize the
data.

In numerical dataset of size ‘n’, sorted from smallest to largest, smallest number
has rank 1.
o X(r) denotes number with rank r.
o X(r+1/2) denotes number half-way between numbers with rank r and
rank r+1.
o Median: X(m) is number with rank (n+1)/2
§ Divides data into 2 equal halves
• If ‘n’ is even - sample median is the average of the
two middle observations.
• If ‘n’ is odd - sample median is the middlemost
observation.
• Is “robust” – not sensitive to outliers.
o Lower-quartile: X(l) number with rank l=([m]+1)/2
§ Where ‘m’ = rank of median. “[m]” means that if ‘m’ =
something and a half, we drop the half.
§ LQ (if it were representing a mark for a test) is the mark,
below which, the lowest 25% of students scored.
o Upper quartile: X(u) rank u=n-l+1
§ UQ (if it were representing a mark for a test), is the mark,
below which, 75% of the students’ marks lie.

Five-number summary:
\[ x_{(1)},\; x_{(l)},\; x_{(m)},\; x_{(u)},\; x_{(n)} \]
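
The rank rules above translate into a short Python sketch. The function names and the data are hypothetical; the ranks follow the definitions in these notes: m = (n+1)/2, l = ([m]+1)/2 and u = n − l + 1.

```python
import math


def value_at_rank(sorted_data, rank):
    """X(r) for a whole rank; for a rank of r + 1/2, the value half-way
    between the numbers with rank r and rank r + 1."""
    lower = math.floor(rank)
    if rank == lower:
        return sorted_data[lower - 1]          # ranks start at 1
    return (sorted_data[lower - 1] + sorted_data[lower]) / 2


def five_number_summary(data):
    xs = sorted(data)
    n = len(xs)
    m = (n + 1) / 2                # rank of the median
    l = (math.floor(m) + 1) / 2    # rank of the lower quartile ([m] drops the half)
    u = n - l + 1                  # rank of the upper quartile
    return (xs[0], value_at_rank(xs, l), value_at_rank(xs, m),
            value_at_rank(xs, u), xs[-1])


# Hypothetical data with n = 10: m = 5.5, l = 3, u = 8.
summary = five_number_summary([7, 1, 4, 9, 3, 10, 2, 8, 6, 5])
print(summary)   # (1, 3, 5.5, 8, 10)
```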

Constructing Box and Whisker Plots:


1) Draw box from lower quartile to upper quartile.
2) Draw line across the box at the median.
3) Draw whiskers protruding out from the box to the extremes.

Box and whisker plots are useful when we want to compare two or more sets
of data; this is done by constructing the plots side-by-side. (Use same vertical
scale for all plots which are being compared)
Five-Number Summary in Excel

=QUARTILE(array; quart)
For ‘quart’, put in 0-4 depending on which value you would like (0 = minimum,
1 = lower quartile, 2 = median, 3 = upper quartile, 4 = maximum).
Five-number summaries can also identify unusually small/large values.
These are referred to as ‘strays’, or as ‘outliers’ if more extreme.
• Strays:
o Cut-off for strays on the lower side[1] = Median − 3 × (Median − LQ)
o Cut-off for strays on the bigger side = Median + 3 × (UQ − Median)
• Outliers:
o Cut-off for outliers on the lower side = Median − 6 × (Median − LQ)
o Cut-off for outliers on the bigger side = Median + 6 × (UQ − Median)
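
A sketch of how the four cut-offs could be applied in Python. It assumes that a value counts as a stray or outlier only when it lies strictly beyond the corresponding cut-off; the function name and the quartile values are hypothetical.

```python
def classify(value, lq, median, uq):
    """Label a value as 'outlier', 'stray' or 'typical' using the cut-offs above."""
    lower_stray = median - 3 * (median - lq)
    upper_stray = median + 3 * (uq - median)
    lower_outlier = median - 6 * (median - lq)
    upper_outlier = median + 6 * (uq - median)

    if value < lower_outlier or value > upper_outlier:
        return "outlier"
    if value < lower_stray or value > upper_stray:
        return "stray"
    return "typical"


# Hypothetical summary: LQ = 40, median = 50, UQ = 65.
# Stray cut-offs are 20 and 95; outlier cut-offs are -10 and 140.
for mark in (60, 15, 100, 150):
    print(mark, classify(mark, 40, 50, 65))
```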
Summary Statistics
Numerical rather than graphical.
o Statistic: any quantity calculated from the data values of a sample

Standard Deviation in Excel

=STDEV.S(E2:E974)
o “.S” because it’s a sample of the population.
SD gives an indication of how variable the data is.
Used by analysts to measure volatility of price changes.
Formulas for Mean and Variance
𝑆 = 𝑠𝑑

Most important measure of location.


The mean is not robust, meaning it is
easily-affected by outliers.
𝑥̅ ∗ = 𝑈𝑝𝑑𝑎𝑡𝑒𝑑 𝑀𝑒𝑎𝑛

Measures of Location and Spread


Location

Measure of location describes any statistic which purports to locate the middle of
the data set. Here are the two measures of location:
1. The sample median
2. The sample mean (most important measure of location)
• Most useful with symmetric distribution in datasets.
• Predominant measure of location.

[1] Lower side: numbers smaller than the median.
Spread

Measure of spread gives insight into variability of the dataset. Three measures of
spread:
1. Range $R = x_{(n)} - x_{(1)}$
→ Unreliable measure of spread; depends only on smallest and largest
values in the sample. Thus, it is the most sensitive to outliers.
o Non-robust
2. Interquartile Range $I = x_{(u)} - x_{(l)}$
→ Length of the interval covering the central half of the dataset. Therefore
not sensitive to outliers.
o Robust
3. Sample variance is the most-used measure of spread;
→ Easily-manipulated algebraically.
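A small sketch of these measures using Python's standard `statistics` module. The data are hypothetical; `statistics.variance` and `statistics.stdev` divide by n − 1, which matches the sample versions (as with STDEV.S in Excel).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

sample_mean = statistics.mean(data)          # measure of location
sample_median = statistics.median(data)      # robust measure of location
sample_range = max(data) - min(data)         # x(n) - x(1)
sample_variance = statistics.variance(data)  # sample variance (divides by n - 1)
sample_sd = statistics.stdev(data)           # square root of the sample variance

print(sample_mean, sample_median, sample_range, sample_variance, sample_sd)
```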

For more notes, videos and explanations:

http://www.stat.berkeley.edu/~stark/SticiGui/Text/location.htm
Week 4
Random Variable

“X” = random variable


“x” = particular outcome

Pr(X = x) = Pr(x)

Probability Mass Functions/Discrete Random Variables


Discrete = value obtained by counting.
Examples: number of students present, number of red marbles in a jar, students’
grade level.
• Finite or a countably infinite number of possible outcomes.
o Possible values that the random variable can take on = isolated
values along the real line
Probability function = Probability mass function
(See Introstat p94 ex. 7A for further explanation.)

Probability Density Functions/Continuous Random Variables


Continuous = value obtained by measuring.
Examples: height of students in class, weight of students in class, time it takes to get
to school
Continuous random variables = possible values form a continuous set
• Probability density function of the random variable “X”: provides probabilities
associated with the continuous random variable taking on values within specified ranges.
o Total area between the x-axis and the curve of the function = 1
o The area under the curve between any two values c and d = Pr(X lies between c and d)
Expected Values of PDFs and PMFs

Expected value of X.
Created through observing patterns.
o Weighted sum of all possible values of X
• Expected value of X also acts as the mean for probability density functions
and probability mass functions!

Variance of Random Variable X

3C, 4C on p146
Probability Mass Function (Discrete):

A: $E(X^2) = \sum_{x} x^2 \times p(x)$

B: $E(X) = \sum_{x} x \times p(x)$

$Var(X) = A - B^2$

Probability Density Function (Continuous):

A: $E(X^2) = \int_{\text{all } x} x^2 \times f(x)\,dx$

B: $E(X) = \int_{\text{all } x} x \times f(x)\,dx$

$Var(X) = A - B^2$
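For the discrete case, the A and B recipe can be sketched as follows. The fair six-sided die is an illustrative assumption, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def expected_value(pmf):
    """B = E(X): weighted sum of the possible values."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    a = sum(x ** 2 * p for x, p in pmf.items())  # A = E(X^2)
    b = expected_value(pmf)                      # B = E(X)
    return a - b ** 2


# Hypothetical example: a fair die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(die), variance(die))
```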
Coefficient of Variation

Measures the variability of the data relative to the mean.

$C.V. = \dfrac{\sqrt{Var(X)}}{E(X)}$

Expected Winnings/Loss

(Example): Buy one $10 raffle ticket for a new car valued at $15,000. Two
thousand tickets are sold. What is the expected value of your gain?
                    Win        Lose
Gain (x)            14990      −10
Probability P(x)    1/2000     1999/2000
x × P(x)            7.495      −9.995

Therefore, expected value of my gain:

E(X) = 7.495 − 9.995
E(X) = −$2.50
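The raffle calculation can be checked with a short sketch, treating a loss as a negative gain. The variable names are illustrative only.

```python
from fractions import Fraction

# Gain in dollars mapped to its probability (2000 tickets, one winner).
outcomes = {
    14990: Fraction(1, 2000),    # win: car worth 15000 minus the 10 ticket
    -10: Fraction(1999, 2000),   # lose: the ticket price
}

expected_gain = sum(gain * prob for gain, prob in outcomes.items())
print(float(expected_gain))
```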

Combining Random Variables


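• The expected-value rules hold for any A and B; for the variance rules for A + B and A − B, A and B have to be independent.

E(A + B) = E(A) + E(B)

E(A − B) = E(A) − E(B)

E(cA) = c × E(A)

Var(A + B) = Var(A) + Var(B)

Var(A − B) = Var(A) + Var(B)

Var(cA) = c² × Var(A)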
Week 5
Probability Distribution
1) Uniform
2) Binomial

Uniform Distribution
If a continuous random variable (measuring) is equally likely to lie anywhere
between ‘a’ and ‘b’, but cannot lie outside that interval, then it has a
uniform distribution, where ‘a’ is the lower bound and ‘b’ is the upper bound.
o It has a probability density function.
§ Therefore you must integrate the probability density function!

• Expected Value

$E(X) = \dfrac{a + b}{2}$

• Variance

$Var(X) = \dfrac{(b - a)^2}{12}$
• Graph of Uniform Distribution
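A minimal sketch of working with the uniform distribution, assuming the usual density f(x) = 1/(b − a) between a and b (the density itself is not written out above). Integrating a constant gives the area of a rectangle, so no calculus library is needed; the function names are hypothetical.

```python
def uniform_probability(a, b, c, d):
    """Pr(c < X < d) for X uniform on [a, b]: width of the overlap times 1/(b - a)."""
    lo, hi = max(a, c), min(b, d)  # clip the interval to [a, b]
    return max(0.0, hi - lo) / (b - a)


def uniform_mean(a, b):
    return (a + b) / 2


def uniform_variance(a, b):
    return (b - a) ** 2 / 12


print(uniform_probability(0, 10, 2, 5), uniform_mean(0, 10), uniform_variance(0, 10))
```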
Binomial Distribution

Discrete distribution (counting)


• Only two outcomes (success or failure)

Pr(Success) = p
Pr(Failure) = 1 − p

• Random variable which records the number of successes in ‘n’ trials with
probability ‘p’ of success, where ‘p’ remains constant throughout.
(Each trial is independent of the previous one; they don’t influence one
another.) Probability associated with ‘X’ is:

$Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$ for x = 0, 1, ..., n

“X is distributed according to the binomial distribution with parameters ‘n’ and ‘p’.”

Complement rule, useful for “at least” questions:
Pr(X ≥ 3) = 1 − Pr(X < 3)
or, in general,
Pr(A) = 1 − Pr(not A)
∴ Pr(X ≥ 3) = 1 − Pr(X = 2) − Pr(X = 1) − Pr(X = 0)

• Expected Value

E(X) = n × p

• Variance

Var(X) = n × p × (1 − p)

Where:
‘n’ = number of repetitions
‘p’ = Pr(Success)
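A sketch of the binomial calculations, assuming the standard probability mass function C(n, x) p^x (1 − p)^(n − x). The values n = 10 and p = 0.5 are hypothetical, and "at least 3" is computed through the complement as described above.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Pr(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def prob_at_least(k, n, p):
    """Pr(X >= k) = 1 - Pr(X = 0) - ... - Pr(X = k - 1)."""
    return 1 - sum(binomial_pmf(x, n, p) for x in range(k))


n, p = 10, 0.5  # hypothetical: 10 coin tosses
print(prob_at_least(3, n, p), n * p, n * p * (1 - p))
```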
Week 6
Probability Distributions
Poisson Distribution
This distribution depends on events occurring randomly at an average rate
of occurrence.
• Events which occur at an average rate of occurrence occur according to a
‘Poisson Process’.

Poisson distribution: a discrete (counting) random variable, and therefore
it has a probability mass function.
- Models how many times events occur.

• Expected Value
E(X) = λ

• Variance
Var(X) = λ

• Graph
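A short sketch, assuming the standard Poisson probability mass function e^(−λ) λ^x / x! (the formula is not written out above). The rate of 4 events per hour is hypothetical and is converted to a half-hour rate before use.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Pr(X = x) when events occur at an average rate of lam per interval."""
    return exp(-lam) * lam ** x / factorial(x)


rate_per_hour = 4               # hypothetical average rate
lam = rate_per_hour * 0.5       # question asks about a 30-minute interval
print(poisson_pmf(0, lam))      # probability of no events in 30 minutes
```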
Exponential Distribution

Exponential distribution: a continuous random variable, and therefore
it has a probability density function.
- Models the space between events.
- Integrate!
• Expected Value
$E(X) = \dfrac{1}{\lambda}$
• Variance
$Var(X) = \dfrac{1}{\lambda^2}$
Probability of more than 3? = Pr(X > 3)
Way 1: $\int_{3}^{\infty} \lambda e^{-\lambda x}\,dx$
(Better) Way 2: $1 - \int_{0}^{3} \lambda e^{-\lambda x}\,dx$

- Ensure lambda uses same units as question being asked! Convert lambda
and then when answering question, use the converted lambda.
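Both ways of finding Pr(X > 3) reduce to the same closed form, because the integral of λe^(−λx) from 0 to t is 1 − e^(−λt). A minimal sketch, with a hypothetical rate of λ = 0.5 in the same units as the question:

```python
from math import exp


def exponential_cdf(t, lam):
    """Pr(X <= t): the integral of lam * exp(-lam * x) from 0 to t."""
    return 1 - exp(-lam * t)


def exponential_tail(t, lam):
    """Pr(X > t), computed as 1 minus the integral from 0 to t (Way 2)."""
    return 1 - exponential_cdf(t, lam)


lam = 0.5  # hypothetical rate
print(exponential_tail(3, lam))
```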
Central Limit Theorem

If X is the sum of a large number of random increments, then X [random
variable] has approximately a normal distribution.

The Normal Distribution

• Bell-shaped and symmetrical


• Continuous (measure) and has probability density function:

$f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \times e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \quad \text{for } -\infty < x < \infty$

$X \sim N(\mu, \sigma^2)$: random variable X has the normal distribution with the
parameters µ and σ².

Normal distribution has two parameters:

µ = “mu” = Mean
• Tells us where the graph is located
• Centre of Normal Distribution is located at the value µ.
o Symmetrical about the mean/E(X)

σ² = Variance (σ = “sigma” = standard deviation)
• Tells us how spread out the distribution is
o Distribution gets flatter as σ² gets bigger
• Always positive

Area under density function = 1. Hence, area on the left of the mean
= area on the right of the mean = 0.5.

Even though it’s a probability density function, the probabilities cannot be found
by integrating by hand (the integral has no closed form); instead, we use the Z table.
Probability of lying within a given number of standard deviations from
the mean is the same for all normal distributions, regardless of their
parameters.
Hence, area A = area B (the shaded areas in the figure).

Calculating Probability in Normal Distributions

To find the probability of X lying in a given interval, the units of X have to be
converted to a “Standard Normal Distribution” (Z).
• Standard Normal Distribution (Z) has a mean of 0 and variance of 1.
• Z measures the number of standard deviations that X lies away from
its mean.

In order to convert, use this formula:

$Z = \dfrac{X - \mu}{\sigma}$

Remember that when given ‘X~N(100, 4)’ you need to take the square root (√) of the 4
so that you get σ = 2.
Following this, we use the Z tables in order to calculate probability.

Example
When finding probability from the Z
table, you start going down the first
column and then look at the values in the
top row and then match them up.

Things to Remember

Pr(Z < −1/3) = Pr(Z > 1/3)
Pr(Z < −a) = Pr(Z > a)

Subtracting / Adding / Multiplying Normal Distributions

Subtracting:

$X_1 - X_2 \sim N(\mu_1 - \mu_2,\ \sigma_1^2 + \sigma_2^2)$

Adding:

$X_1 + X_2 \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$

Multiplying a Constant in:

$aX_1 \sim N(a\mu_1,\ a^2\sigma_1^2)$

Lower/Upper Quartiles with Normal Distributions

Lower Quartile

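$-0{,}67 = \dfrac{X - \mu}{\sigma}$

Upper Quartile

$0{,}67 = \dfrac{X - \mu}{\sigma}$

(Divide by the standard deviation σ, i.e. the square root of the variance σ².)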
Example of Normal Distribution Question
What is the probability of X lying between 4 and 14, i.e. Pr(4 < X < 14), with
X~N(10, 2)?

Because we essentially have two X values, we need to find the z values for both:

Lower bound: z = (4 − 10)/√2 = −4.24
Upper bound: z = (14 − 10)/√2 = 2.83

Therefore the inequality changes to Pr(−4.24 < Z < 2.83)

Draw a normal distribution graph and mark −4.24 and 2.83 on it.

Look in Z table for 4.24 and 2.83 (the table gives the area between 0 and z):

• 2.83 → 0.49767
• −4.24 → 0.49998
*For −4.24: the biggest the table goes up to is 4.09, and hence we have to use
that number.

Therefore, Pr(4 < X < 14) = 0.49998 + 0.49767 = 0.99765
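The table lookup can be cross-checked with Python's standard `statistics.NormalDist`. This is only a sketch; note that `NormalDist` takes the standard deviation, so the variance 2 has to be square-rooted, and the exact answer differs from the table answer only in the last decimals because the table stops at 4.09.

```python
from math import sqrt
from statistics import NormalDist

X = NormalDist(mu=10, sigma=sqrt(2))  # X ~ N(10, 2): the 2 is the variance

z_low = (4 - 10) / sqrt(2)
z_high = (14 - 10) / sqrt(2)
prob = X.cdf(14) - X.cdf(4)           # Pr(4 < X < 14)

print(round(z_low, 2), round(z_high, 2), round(prob, 5))
```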
Week 7
Sample v Population

Sample is a portion of the population from which inference (conclusion based on


reasoning and evidence) about the population can be drawn.
• Quantity of interest in the population is called a parameter.
• What we measure in a sample is called a statistic.

In order to make an inference about the population, our sample needs to be


representative (similar in structure to the population) and random.

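X̄ is a random variable: the sample mean varies from sample to sample.
The probability distribution of a statistic is called a sampling distribution.

The distribution of the sum of ‘n’ independent, normally distributed random variables, each N(µ, σ²), is given by:

$\sum_{i=1}^{n} X_i \sim N(n\mu,\ n\sigma^2)$

The mean X̄ is given by:

$\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$

The mean of the sum is given by:

$E\left(\sum X_i\right) = n\mu$

The expected value of the sample mean X̄ is given by:

$E(\bar{X}) = \dfrac{1}{n} \times n\mu = \mu$, where ‘mu’ is whatever the true population mean is.

Variance of a constant times X is given by:

$Var(cX) = c^2 \times Var(X)$

Knowing that X̄ is given by $\frac{1}{n}\sum X_i$, we can calculate the variance of X̄ by:

$Var(\bar{X}) = \dfrac{1}{n^2} \times Var\left(\sum X_i\right)$

And we know that the variance of the sum of the X’s is equal to:

$Var\left(\sum X_i\right) = n\sigma^2$

Hence,

$Var(\bar{X}) = \dfrac{1}{n^2} \times (n\sigma^2) = \dfrac{\sigma^2}{n}$

Therefore, the distribution of X̄ is:

$\bar{X} \sim N\left(\mu,\ \dfrac{\sigma^2}{n}\right)$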


The above calculations apply exactly to normal distributions only. However, the
central limit theorem says that the average of a large number of random variables
has approximately a normal distribution.
• As a rule of thumb, the sample size should be bigger than 30 for the
normal approximation to be reasonable.

When a probability distribution is chosen as a statistical model for a


population, when determining the parameters of the distribution, the mean and
variance of the probability distribution should be equal to the population’s
mean and variance.
• Remember that the true mean and true variance would be the values you
would find if you were able to examine the entire population without
logistical issues.

Clarifying Some Stuff

If a sample of any size [n] is taken from a population with a normal
distribution with a mean of µ and a variance of σ², then the mean of the sample
will have a normal distribution with mean µ but a variance of σ²/n. This also holds
approximately when taking a large sample from a population which is not normally
distributed. (Central limit theorem)

• The approximation of the sampling distribution to a normal distribution


shape gets better and better as the sample size ‘n’ increases.
• As the sample size gets bigger, the sample mean is going to get
closer and closer to the true population mean.

To find probabilities, we need to convert to Z:

$Z = \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

Where X̄ is the sample-mean value whose probability we are trying to find.

Videos to help:

http://www.statisticshowto.com/central-limit-theorem-examples/
Percentage Point Notation

$z^{(0.1)}$ = the value with 10% of the distribution lying to its right. Hence, the area between 0 and $z^{(0.1)}$ is 0.5 − 0.1 = 0.4.
Therefore the corresponding $z^{(0.1)}$ value can be found by looking in the z table for
the entry which is as close to 0.4 as possible
• = 0.3997
• Z-score which corresponds to this is 1.28.
o Therefore, $z^{(0.1)} = 1.28$
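As a cross-check on the table lookup, a percentage point can be found with the inverse of the standard normal cumulative distribution. This sketch uses `statistics.NormalDist().inv_cdf`; the helper name is hypothetical.

```python
from statistics import NormalDist


def percentage_point(upper_tail_area):
    """Return z such that the given proportion of the distribution lies to its right."""
    return NormalDist().inv_cdf(1 - upper_tail_area)


z_10 = percentage_point(0.10)
print(round(z_10, 2))
```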

Confidence Intervals

Can be written as: 100(1 − α)%

A confidence interval reports an estimate as an ‘interval’ rather than as a single
‘point estimate’.

Point Estimate
• Just a number
o No information with regards to how uncertain the estimate is.

Interval
• Range of values
o Shows how much certainty is in the estimate.
A wide interval would show that we weren’t too sure what the true value is and
the value could be anything really. Opposite applies for a narrow interval.

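$Standard\ Error = \sqrt{\dfrac{\sigma^2}{n}}$

Confidence Interval Formula:

$\left( \bar{X} - z^{(\alpha/2)} \times \sqrt{\dfrac{\sigma^2}{n}} \ ;\ \bar{X} + z^{(\alpha/2)} \times \sqrt{\dfrac{\sigma^2}{n}} \right)$

Width of Confidence Interval:

Where $Z \times \sqrt{\sigma^2 / n}$ is half the width of the confidence interval.
• We define $L = Z \times \sqrt{\sigma^2 / n}$ for future use.

Width of Confidence Interval = 2 × L
Width of Confidence Interval = $2 \times Z \times \sqrt{\dfrac{\sigma^2}{n}}$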
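A sketch of the interval calculation when the population standard deviation is known. The sample mean, σ and n below are hypothetical, and the Z value is taken from `NormalDist().inv_cdf` instead of the table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. about 1.96 for 95%
    half_width = z * sqrt(sigma ** 2 / n)    # L = Z x standard error
    return sample_mean - half_width, sample_mean + half_width


low, high = z_confidence_interval(sample_mean=50, sigma=10, n=25)
print(round(low, 2), round(high, 2), round(high - low, 2))
```

The printed width equals 2 × L, as in the width formula above.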

Determining Sample Size

• We want our estimate to lie within $L = Z \times \sqrt{\sigma^2 / n}$ units of the true mean.
• How confident do we want to be of our interval? (Z)
• How variable is the population? (σ)

General Sample Size Formula

Using the ‘L’ formula from earlier, we can derive a general sample size
formula:

$L = Z \times \sqrt{\dfrac{\sigma^2}{n}}$

$n = \left(\dfrac{Z \times \sigma}{L}\right)^2$

Sample Size Formula When Trying to Achieve ‘L’ Accuracy

$n = \left(\dfrac{Z \times \sigma}{L}\right)^2$

Where ‘L’ is within how much of the true mean you need to be.
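A minimal sketch of the sample size formula. Rounding up to the next whole observation is an assumption made here so that the accuracy target is met; the inputs are hypothetical.

```python
from math import ceil


def required_sample_size(z, sigma, L):
    """n = (Z * sigma / L)^2, rounded up to a whole number of observations."""
    return ceil((z * sigma / L) ** 2)


# Hypothetical: 95% confidence (Z = 1.96), sigma = 10, within L = 2 of the true mean.
n_needed = required_sample_size(z=1.96, sigma=10, L=2)
print(n_needed)
```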

Some Common Z Values:

Confidence Percentage    Z Value
90                       1,65
95                       1,96
98                       2,33
99                       2,58

Some Things to Remember:


Week 8
The Hypothesis Test
Step 1:

Define null hypothesis (H0)


o Statement about one of the true, unknown parameters of the
population. Tests if the mean is a specific value.

Null hypothesis [H0] will always say that the true parameter = some
hypothesized value.
o H0 generally assumes no effect.

H0: μ = μ0 = some number

Step 2:

Define alternative hypothesis (H1)
o Statement about the true, unknown population parameter.

H1 will have either a ≠, < or > sign, but never an = sign.

If we suspected that the true mean was greater than a particular number,
we would say:
H1: μ > some number

If we weren’t sure whether it was bigger or smaller than a particular
number, but only suspected that it was different (bias in either direction), we say:
H1: μ ≠ some number

= 2-sided test
Step 3:

Set up significance level (α) for test.
o Determines how tough the test is.
§ Very tough test = difficult to find a difference from
hypothesized mean.
§ α = 0.01 (1%) is tough
o What this means is, we will reject H0 1% of the time, even if H0 is
in fact correct. (α is the probability of rejecting the null hypothesis
when it is in fact true)

Commonly used α = 0.05 (5%)
o If the implications of wrongly rejecting the null hypothesis are
serious, a 1% (0.01) significance level is used.

Step 4:

Set up a rejection region based on α.
o If α = 0.05, we reject H0 if the test statistic is within the 5% most
extreme values of the standard normal distribution.

As in our above example, if our alternative hypothesis says that mu is
greater than a particular number, then we would only reject for the 5%
greatest possible values.

For a 5% level of significance (one-sided) we use 1.645 (or 1.64) as the critical
value in hypothesis testing. It is obtained through 0.5 − 0.05 = 0.45: this is the
area we need from the Z table, and it lies between the entries for 1.64 and 1.65.

In general, the critical value is obtained by going ‘0.5 − x = desired alpha’.
Then we have to find that x in the normal table.
Step 5:

Calculate the test statistic Z
o The hypothesis test is conducted under the assumption that H0 (null
hypothesis) is true. (When comparing two means, this means μ1 − μ2 = 0.)

If H0 is true then:

$\bar{X} \sim N\left(\mu_0,\ \dfrac{\sigma^2}{n}\right)$

Now we take our observed sample mean and transform it using the Z
formula. This shows us how many standard errors above or below the
hypothesized mean the sample mean lies.

$Z = \dfrac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$

The value we look up in the Z table is called the critical value.

We compare the test statistic to the critical value from the rejection region.

Step 6:

Conclusion:

We will reject H0 if the test statistic we obtained from step 5 falls within our
rejection region, i.e. it is more extreme than the critical value. Otherwise,
we do not reject H0.
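The six steps can be sketched as one function. All numbers below are hypothetical, the function name is illustrative, and the critical value comes from `NormalDist().inv_cdf` in place of the Z table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05, alternative="greater"):
    """Return (test statistic, critical value, reject H0?) for H0: mu = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))    # step 5
    std_normal = NormalDist()
    if alternative == "greater":                   # H1: mu > mu0
        critical = std_normal.inv_cdf(1 - alpha)
        reject = z > critical
    elif alternative == "less":                    # H1: mu < mu0
        critical = std_normal.inv_cdf(alpha)
        reject = z < critical
    else:                                          # H1: mu != mu0, alpha split in two
        critical = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) > critical
    return z, critical, reject                     # step 6 uses `reject`


z, crit, reject = one_sample_z_test(51.8, mu0=50, sigma=6, n=36)
z2, crit2, reject2 = one_sample_z_test(51.8, mu0=50, sigma=6, n=36, alternative="two-sided")
print(round(z, 2), round(crit, 3), reject, round(crit2, 2), reject2)
```

With these numbers the one-sided test rejects H0 while the two-sided test does not, because the two-sided critical value is larger.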

2-Sided Test

If we weren’t sure whether it was bigger or smaller than a particular
number, but only suspected that it was different (bias in either direction), we say:

H1: μ ≠ some number

This is a good default position unless there is a specific reason to suspect
bias on a particular side. (Greater or less than)

Rejection Region in a 2-Sided Test

If α was 0.05 in the one-sided test, then in a 2-sided test we have to
split the α between both extremities (0.025 in each tail):

It is therefore harder to reject H0 under a two-tailed test.


Which Level of Significance (α) to Use?

Type 1 Errors:
When we reject H0 erroneously (H0 is in fact true).
o We can control this through α:
§ If α is small then it’s more difficult to reject H0 as the rejection
region is small. So the chances of making a type 1 error when
there’s a small α are small.
Type 2 Errors:
When we fail to reject (“accept”) H0 erroneously (H0 is in fact false).

There is a trade-off between the two errors:

• If we make α very small: we reduce the chance of a type 1 error but
then increase the chance of a type 2 error.

Some Things to Remember:


Comparing 2 Sample Means

Do they come from the same population with the same underlying true mean?

We need the distributions of both:

Subtracting Distributions

Finding the Z Value for Calculating the Test Statistic

Variances can be equal!

The Modified Approach

• Doesn’t require a specific significance level.


o Instead of a specific significance level, we report on the
observed level of significance / the p-value.

Step 1:

Calculate test statistic.


Step 2:

Now report on how significant the statistic is.


o We want the probability of getting a test statistic at least as
extreme as the test statistic we just calculated. (Assumes H0
is true)
• When calculating the p-value, we base our calculation on the test statistic; we
look up the value of our test statistic in the z-table.
o If H1 is 2-sided:

P-Value = Pr(Z > |Test Statistic|) × 2

o If H1 is 1-sided (“greater than”; for “less than”, use Pr(Z < Test Statistic)):

P-Value = Pr(Z > Test Statistic)

The P Value Explained

Basically, if H0 is true (the null hypothesis), then the p-value is the


probability of getting a test statistic more/as extreme as
our calculated test statistic.

• Example of P-Value Question

∴ Test Stat = (30 − 31) / √(4/50 + 9/40)
∴ Test Stat ≈ −1.81
Look 1.81 up in Z table:
= 0.4649

∴ P-Value = (0.5 − 0.4649) × 2 = 0.0702
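The worked p-value can be reproduced in a few lines. This sketch assumes, from the layout of the calculation, that 30 and 31 are the two sample means, 4 and 9 the variances, and 50 and 40 the sample sizes; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mean1, var1, n1 = 30, 4, 50
mean2, var2, n2 = 31, 9, 40

# Test statistic under H0: mu1 - mu2 = 0
test_stat = (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)

# Two-sided p-value: double the tail area beyond |test statistic|
p_value = 2 * (1 - NormalDist().cdf(abs(test_stat)))

print(round(test_stat, 2), round(p_value, 4))
```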
Week 9
The T-Distribution
Unknown Population Variances
If we have sample size “n” and we estimate our σ² using s², then the test
statistic has a t-distribution.
- Estimating σ² makes the test statistic more variable.
o As sample size increases, the distribution “peaks” and looks more
like a normal distribution; the distribution is determined by “n”.

Finding the t-value:

$t = \dfrac{\bar{X} - \mu}{s / \sqrt{n}} \sim t_{n-1}$, where “s” = sample standard deviation

- “n−1” is also referred to as degrees of freedom.
o Hence, test statistic has a t-distribution with n−1 degrees of
freedom.
- See p199

Because the t-distribution varies


according to sample size, we use a
subscript which indicates the
degrees of freedom; “n-1”
The value which you look up in the t-table is called the critical value.
- In the t-table on the left, those numbers are for the degrees of freedom
[19], the values in the top row are the probabilities of the t19 distribution
being bigger than the critical value; 1.729.
o i.e. The probability of a t19 random variable being greater than
1.729 (the critical value) is 5%.
- Note that if the exact degrees of freedom is not seen in the table,
choose the closest one.

Comparing to Z-Table

You can see the impact of the


t-distribution being flatter
with fatter tails.

Confidence Interval Without Knowing Population Variance

Instead of using population variance, we use sample variance s².

Do examples on p202/203

$\bar{X} \pm t_{n-1}^{(\alpha/2)} \times \sqrt{\dfrac{s^2}{n}}$

Testing the Mean:

Step 1:

Define null hypothesis (H0)


o Statement about one of the true, unknown parameters of the
population. Tests if the mean is a specific value.

Null hypothesis [H0] will always say that the true parameter = some
hypothesized value.
o H0 generally assumes no effect.

H0: μ = μ0 = some number (in our example: 3,5)


Step 2:

Define alternative hypothesis (H1)


o Statement about the true, unknown population parameter.

H1 will have either a ≠, < or > sign, but never an = sign.

If we suspected that the true mean was greater than a particular number,
we would say:
H1: μ > some number (in example: 3,5)

If we weren’t sure whether it was bigger or smaller than a particular
number, but only suspected that it was different (bias in either direction), we say:
H1: μ ≠ some number

= 2-sided test
Step 3:

Set up significance level (α) for test.
o Determines how tough the test is.
Commonly used α = 0.05 (5%)
o Used in the example below.
Step 4:

Set up a rejection region based on α.
o If α = 0.05, we reject H0 if the test statistic is within the 5% most
extreme values

As in our above example, if our alternative hypothesis says that mu is
greater than a particular number, then we would only reject for the 5%
greatest possible values.

(The shaded tail area = α = significance level.)

With n = 30 and therefore n − 1 = 29, look up 29 on the left of
the t-table and 0,05 at the top and you get 1,699.

The critical value of 1,699 demarcates the top 5% of the distribution.
Step 5:

Calculate the test statistic T
o The hypothesis test is conducted under the assumption that H0 (null
hypothesis) is true. (When comparing two means, this means μ1 − μ2 = 0.)

If H0 is true then T has a t-distribution with n − 1 degrees of freedom.

Now we take our observed sample mean and transform it using the T
formula. This shows us how many standard errors above or below the
hypothesized mean the sample mean lies.

$T = \dfrac{\bar{X} - \mu_0}{s / \sqrt{n}}$

We compare the test statistic to the critical value from the rejection region.

Step 6:

Conclusion.

We will reject H0 if the test statistic we obtained from step 5 falls within our
rejection region, i.e. it is more extreme than the critical value.

With a test statistic of 3,14, which is bigger than 1,699 (hence falling into the
rejection region), we can conclude that H0 can be rejected and say that the true
mean is greater than the hypothesized value of 3,5.
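A sketch of the T calculation from raw data. The sample below is hypothetical (it is not the data behind the 3,14 in the example), and because the Python standard library has no t-distribution, the critical value 2.015 is read from a t-table for 5 degrees of freedom at the 5% level.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t_statistic(sample, mu0):
    """Return (T, degrees of freedom) for H0: mu = mu0."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # s / sqrt(n), with s the sample sd
    return (mean(sample) - mu0) / standard_error, n - 1


sample = [3.9, 4.1, 3.6, 4.4, 3.8, 4.0]       # hypothetical measurements
t_stat, df = one_sample_t_statistic(sample, mu0=3.5)

critical_value = 2.015                        # t-table: 5 degrees of freedom, 5% one-sided
reject_h0 = t_stat > critical_value
print(round(t_stat, 3), df, reject_h0)
```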

Two-Sided Test with Same Example

Instead of using the 5%, we divide it into two 2,5%s.

Now, the rejection region lies below −2,045 and above 2,045; it is now more
difficult to reject H0 if we are doing a two-sided test.

The Modified Approach

If we didn’t want to choose a significance level/we weren’t given one, and we just
wanted to report the observed significance level/the p-value:

P-Value with One-Sided

In order to do this we need to find the probability of our test statistic being
greater than 3,14; hence, look in the t-table for the largest value which
the test statistic exceeds; we do this because we are trying to see where the
test statistic would lie along the line.
- Our test statistic is 3,14 and in the table, 3,038 is the largest value which
the test statistic exceeds.
- This means that the probability of getting a test statistic bigger than 3,14
is less than 0,0025. Therefore the p-value is < 0,0025.
P-Value with Two-Sided
We would need to double the probability, hence in our example we would be
looking for the probability of observing a test statistic greater than 3,14 or less than
−3,14.
- Probability/p-value would be < 0,005
o Because the p-value is small, it provides strong evidence
to reject H0.

• P-Value Example

What are Degrees of Freedom?

The number of degrees of freedom is the number of values in the final


calculation of a statistic that are free to vary without violating any
constraint imposed on it.

For example, suppose we are told that n = 6 terms must add up to 36 (i.e. the mean
is 6), and that the first 4 terms are 4, 10, 9 and 2, which add up to 25. Even though
the last two terms are not given to us, we can conclude that the last two terms
together must equal 11, and they can be: 6&5, 7&4, 10&1 etc.
The degrees of freedom here is 5: the first 5 terms can be various numbers,
but the last term is then fixed, because it must be the number which brings the
total to 36. Hence, there are 5 degrees of freedom.

The Degree of Freedom “Rule”

For every parameter which we estimate before evaluating the current parameter
of interest, we lose one degree of freedom.
Some Things to Remember

Comparing Two Means with the T-Distribution (6 Step Approach)


Finding a T-Value when Dealing with Two Means

- Step 4: Calculating the rejection region

Therefore, $s_1^2$ and $s_2^2$ can be viewed as estimates of the same true
variance; hence, we can combine the two sample variances to form $s_p^2$,
a pooled estimate of the true population variance (a weighted average
of the two sample variances).

- Step 5: Calculating the test statistic
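A sketch of the pooled calculation, assuming the usual textbook forms: the pooled variance weights each sample variance by its degrees of freedom, and the test statistic has n1 + n2 − 2 degrees of freedom. These formulas and all the numbers are assumptions for illustration, since the formulas are not reproduced in the notes above.

```python
from math import sqrt


def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Weighted average of two sample variances (weights n1 - 1 and n2 - 1)."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


def two_sample_t(mean1, s1_sq, n1, mean2, s2_sq, n2):
    """Return (T, degrees of freedom) for H0: mu1 - mu2 = 0."""
    sp_sq = pooled_variance(s1_sq, n1, s2_sq, n2)
    t = (mean1 - mean2) / sqrt(sp_sq * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


sp_sq = pooled_variance(4, 10, 9, 15)      # hypothetical sample variances and sizes
t_stat, df = two_sample_t(20, 4, 10, 18, 9, 15)
print(round(sp_sq, 3), round(t_stat, 3), df)
```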
Comparing Two Means with the T-Distribution (Modified Approach)

We want to report the observed significance of the test statistic.


- Essentially, we are asking what the probability is of finding a test
statistic which is as extreme as, or more extreme than, the one we calculated.

Finding the P-Value

o Go to the t-table, use the correct degrees of freedom and then find
the greatest number which the test statistic exceeds.
§ We then multiply that probability (seen in the top row) by 2 (if
two-sided test).
Week 10
Comparing Two Means in Paired Data Sets (6 Step Approach)

When two sets of data are dependent, we conduct a test using a single
sample of differences; dependent measures are known as repeated
measures.

All steps are the same up until step 5, just bear in mind that with the degrees of
freedom, “n” now refers to the number of pairs of data.

- Step 5: Calculating the test statistic:

Where d̄ refers to the mean of the differences between the paired observations.

- Step 6: Conclude

If test statistic is more extreme than critical value, then we have enough
evidence to reject H0.
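A sketch of the paired test statistic, assuming the usual form T = d̄ / (s_d / √n) under H0: mean difference = 0, where s_d is the sample standard deviation of the differences. The before/after values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (T, degrees of freedom) from the single sample of differences."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)                               # number of pairs
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


before = [10, 12, 9, 11]   # hypothetical repeated measures
after = [12, 14, 10, 15]
t_stat, df = paired_t_statistic(before, after)
print(round(t_stat, 3), df)
```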

Comparing two Means in Paired Data Sets (Modified Approach)

Since it is a modified test, we are just asking for the probability of getting a
test statistic as small as our test statistic or smaller.

Calculate P-Value

This is done by looking in the left-hand column for the correct degrees of
freedom, then you look for a critical value which the test statistic only just
exceeds. If this critical value was 3,601 for instance (at a 34 degrees of freedom
level), the p-value would be < or > 0,0005, depending on the question.

P-Value and Rejecting or Accepting H0


- P-value is less than (or equal to) 𝜶 (level of significance), then null
hypothesis is rejected in favour of the alternative hypothesis.
- P-value is greater than 𝜶 (level of significance), then the null
hypothesis is not rejected.

Excel and the T-Distribution

- Right-Tailed Test ( > )

Use the “T.DIST.RT” formula. This returns the right-tailed t-distribution’s
p-value.

= T.DIST.RT(x; deg_freedom)

x: the value at which the p-value is evaluated; i.e. x = the test statistic.

- Left-Tailed Test ( < )

Use the “T.INV” formula. This returns the left-tailed t-distribution’s critical
value.

= T.INV(probability; deg_freedom)

Probability: the level of significance.

- Two-Tailed Test ( ≠ )

Use the “T.INV.2T” formula. This returns the two-tailed t-distribution’s critical
value.

= T.INV.2T(probability; deg_freedom)

Probability: the level of significance.

Confidence Intervals Under the T-Test

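$\left( \bar{d} - t_{n-1}^{(\alpha/2)} \times \sqrt{\dfrac{s_d^2}{n}} \ ;\ \bar{d} + t_{n-1}^{(\alpha/2)} \times \sqrt{\dfrac{s_d^2}{n}} \right)$

(The variance is estimated from the sample of differences, which is why the t-value is used.)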
Some Things to Remember
Example of a Question

We’ll need to use the confidence interval formula (lower bound shown):

$\bar{d} - t_{n-1}^{(\alpha/2)} \times \sqrt{\dfrac{s_d^2}{n}}$

And to find the t-value $t_{n-1}^{(\alpha/2)}$, we do the following:
- Since it is a 99% confidence interval, α = 1%; but because the interval has an
upper and a lower bound, over which the 1% needs to be spread, we look in the
top row of the t-table for 1%/2 = 0,005.
- Therefore at 45 degrees of freedom and a t-value of 2,690, we have
the following:

9,2 − 2,690 × 14,2/√46 = 3,568
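The arithmetic of this example can be checked with a short sketch. It assumes that 9,2 is the mean difference, 14,2 the standard deviation of the differences and 46 the number of pairs (this reading reproduces the 3,568 above); the upper bound is computed the same way with a plus sign.

```python
from math import sqrt

d_bar = 9.2      # assumed: mean of the differences
s_d = 14.2       # assumed: standard deviation of the differences
n = 46           # number of pairs, so 45 degrees of freedom
t_value = 2.690  # t-table value for 45 degrees of freedom and 0,005 in each tail

half_width = t_value * s_d / sqrt(n)
lower, upper = d_bar - half_width, d_bar + half_width
print(round(lower, 3), round(upper, 3))
```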
Goodness-of-Fit-Test: Whether Data Fits Various Distributions

Comparing what we observe in sample of data to what we expect under a


specified hypothesis.

Goodness-of-Fit-Test Under the 6-Step Approach

We can use the 6-step approach to test whether some data fits various
distributions:

Step 1:

Define null hypothesis (H0)

H0: X has some (specified) distribution

Step 2:

Define alternative hypothesis (H1)

H1: X does not have that distribution

Step 3:

Set up significance level (α) for test.
o Determines how tough the test is.
Commonly used α = 0.05 (5%)

Step 4:

Set up a rejection region

Chi-squared Distribution
- Similar to t-distribution; has degrees of freedom which influence
shape of distribution.
- However, the 𝜒 g distribution is skewed to the right and is always
positive.
o Chi-squared distribution has its own table too.
(Figure: the shape of the χ² distribution changes according to the degrees of freedom.)

Don’t need to split our α in two; chi-squared is only positive.

Getting Correct Degrees of Freedom for Chi-Squared

Correct DF = Number of Categories Being Compared − Number of Parameters Estimated − 1

(Example)

With a die: we are comparing 6 categories and assume no parameters are estimated
from the data; therefore degrees of freedom = 6 − 0 − 1 = 5.

Now in order to get rejection region, go to the chi-squared distribution


table, search for the degrees of freedom in the left hand column, and
match it up with the significance level. (For this example assume 5%)

What Does the Critical Value Mean?

The probability of a random variable, which follows a chi-squared


distribution with 5 degrees of freedom, exceeding 11,070 is 5%.

Therefore back to Step 4:


If D², our test statistic, exceeds the critical value (i.e. falls in the rejection
region), it means that the observed and expected values are too far apart for
their differences to be explained by chance sampling fluctuations alone.
- If test statistic > critical value then: reject H0

Step 5:

Calculate the test statistic D²: our test statistic is where we compare
what we have observed to what we would have expected if H0 is true.

(Example, back to the die)

Because the probability of each outcome 1, 2, ..., 6 on a fair die is 1/6, we can say
that “If H0 is true, we can expect 1/6 of the total number of tosses to produce each
outcome.” (i.e. if the total number of tosses was 60: 10 × 1’s, 10 × 2’s etc.)

Test Statistic Formula for Chi-Squared

$D^2 = \sum \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

D² has approximately a χ² distribution, provided that all of the expected
frequencies exceed 5.

Measure of discrepancy between what you have observed and what you
would’ve expected under H0.
- If test statistic is large then it means you observed something very
different to what you expect. Hence, a large test statistic provides
good evidence to reject H0.
- If test statistic is small then it means you observed something very
similar to what you expect. Hence, a small test statistic does not
provide good enough evidence to reject H0.
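A sketch of the die example, assuming the usual chi-squared form D² = Σ (observed − expected)² / expected. The observed counts for 60 tosses are hypothetical; 11,070 is the table critical value for 5 degrees of freedom at the 5% level quoted earlier.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [5, 8, 9, 8, 10, 20]    # hypothetical counts of faces 1..6 in 60 tosses
total = sum(observed)
expected = [total / 6] * 6         # under H0 each face is expected 1/6 of the time

d_squared = chi_squared_statistic(observed, expected)
critical_value = 11.070            # chi-squared table: 5 degrees of freedom, 5%
reject_h0 = d_squared > critical_value
print(round(d_squared, 2), reject_h0)
```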

Step 6:

Conclude

- If test statistic > critical value (rejection region), we then reject


H0 at the specific significance level and conclude that the
distribution of ____ is different to what we would’ve expected
from ____ .
Goodness-of-Fit Test Under the Modified Approach

Hypotheses and test statistics are the same as 6-step.

We want to report on the observed significance of the test statistic; this is


done through the p-value.
- The p-value is the probability of getting a test statistic as big as the
one we calculated, or bigger; assuming H0 is true.

- Finding the P-Value

o Find the degrees of freedom.


o Find the largest critical value which is still smaller than our
test statistic.

Our test statistic (in the example) of 14,8 is bigger than 12,832
but it is also less than 15,086, the next critical value. Hence, we
have observed something which will occur with a probability
of less than 0,025, assuming H0 is true.

Therefore: P-Value < 0,025

“<” because our test statistic is bigger than 12,832, which corresponds to a
probability of 0,025.
Our p-value is really small, hence showing that it is unlikely that we would’ve
observed this difference if H0 were true, and so we reject H0 and conclude that X
doesn’t have the specified distribution.

- Small p-value: unlikely we would see such a difference; hence reject
H0; X doesn’t have the hypothesized distribution.
- Large p-value: likely we would see such a difference; hence do not
reject H0; the data are consistent with X having the hypothesized distribution.
See IntroStat Example 1A, p227 - 234
In order for the goodness-of-fit test to be valid, you need to have sufficient
data. In the case of Chi-squared, sufficient data is having an expected value of
at least 5 in every category that we are comparing.
- If you don’t have sufficient data, you may need to merge
categories together so that you have enough data and so that you get
expected values which are greater than 5.

A Note on Both the Modified and 6-Step Approach

When you are comparing the observed to the expected, in order to find the
expected, you need to use the H0 hypothesized distribution:
Some Things to Remember

- If you use the sample to estimate one or two parameters of the
distribution you are testing, your test statistic will lose one or two
degrees of freedom.
- If we were testing whether the data followed a normal distribution,
and there were 4 categories being compared, the degrees of
freedom would be 4 − 1 − 2 = 1, because μ and σ are both estimated.
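The degrees-of-freedom rule above can be written as a one-line calculation. The function name below is made up for illustration.

```python
def goodness_of_fit_df(num_categories, num_estimated_parameters=0):
    """Degrees of freedom: categories - 1 - parameters estimated from the sample."""
    df = num_categories - 1 - num_estimated_parameters
    if df < 1:
        raise ValueError("not enough categories for this many estimated parameters")
    return df


# Normal distribution example from the notes: 4 categories, mu and sigma estimated.
print(goodness_of_fit_df(4, 2))
```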
Week 11
Testing for an Association Between Two Categorical Variables

Testing for Association Using the 6-Step Approach

Step 1:

Define null hypothesis (H0)

H0: there is no association between X and Y

Step 2:

Define alternative hypothesis (H1)

H1: there is an association between X and Y

Step 3:

Set up significance level (α) for test.

o Determines how tough the test is.
Commonly used α = 0.05 (5%)

Step 4:

Set up a rejection region

Our test statistic (D²) will follow a chi-squared distribution and will
compare observed and expected values.

D² has approximately a χ² distribution, provided that all of the expected
frequencies exceed 5 and assuming that H0 is true.


Getting Correct Degrees of Freedom Under Two Variable Association Test

Correct DF = (No. rows − 1) × (No. columns − 1)

Now in order to get the rejection region, go to the Chi-squared distribution
table, search for the degrees of freedom in the left hand column, and
match it up with the significance level. (For this example assume 5%)

Therefore back to Step 4:

We will reject H0 if the test statistic [D²] is
bigger than the critical value of the chi-squared
distribution at the 5% significance level with
5 degrees of freedom (11,070).

Step 5:

Calculate the test statistic T: our test statistic is where we compare
what we have observed to what we would have expected if H0 is true.
- Now because the testing is done under the assumption that H0
is true, we want to calculate the expected values assuming H0 is
true. [ie. Assuming that there is no association]

(Example) We are testing for an association between owning a die and the
outcome of a die.

Let A = ownership of a die and B = the outcome of a die

Pr(your die rolls a 1) = Pr(it being your die) × Pr(scoring a 1)

In general, the probability of getting a count in row Ai and
column Bj, assuming H0 (no association) is true, is Pr(Ai) × Pr(Bj).

(Example)

Degrees of Freedom for Tests of Association

Degrees of freedom, for a table with r rows and c columns, is given by: (r − 1) × (c − 1)


We can also see that the probability of your die rolling a 1 can also be
obtained using the row/column formula:

So now, we have the expected and observed values. Therefore our test
statistic (D²) can be calculated:
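As a rough sketch of Step 5, the expected values, D² and the degrees of freedom can be computed as below. The expected count for each cell is taken as (row total × column total) / grand total, which is the usual "row/column" formula under no association, and D² as the sum of (observed − expected)² / expected over all cells. The function names and the 2 × 2 table are hypothetical, not the die example from the lecture.

```python
def expected_counts(observed):
    """Expected count per cell under H0: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def d_squared(observed, expected):
    """Chi-squared test statistic: sum of (O - E)^2 / E over every cell."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def association_df(observed):
    """Degrees of freedom: (rows - 1) * (columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


# Hypothetical observed table (rows = categories of X, columns = categories of Y).
observed = [[10, 20], [20, 10]]
expected = expected_counts(observed)
print(expected)
print(d_squared(observed, expected))
print(association_df(observed))
```

For a table like the die example (2 rows for ownership, 6 columns for outcomes) the same `association_df` calculation gives 5 degrees of freedom.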

Step 6:

Conclude

- If test statistic > critical value (rejection region), we then reject
H0 at the specific significance level and conclude that there is
an association between ____ and ____ .

We would say something like: ‘in the case of the test statistic being bigger
than the critical value (rejection region); at the 5% significance level, we
have enough evidence to reject H0 and we can conclude that the outcome
of ____ depends on ____.’

Testing for Association Using the Modified Approach

(Knowing that D² = 11,03 and the critical value is 11,07):

Finding the P-Value

- Find the degrees of freedom.
- Find the largest critical value which is still smaller than our test statistic.

With a test statistic of 11,03 we look in our chi-squared distribution table and
can see under the 5 degrees of freedom level, that the largest critical value
which is still smaller than our test statistic is 9,236. This [9,236] corresponds to a
probability of 0,1.
- Because our test statistic is bigger than 9,236 but smaller than
11,070 (which corresponds to a probability of 0,05), our p-value is
smaller than 0,1 but bigger than 0,05.

Excel and the Chi-Squared Test of Association

- P-Value

Use the formula =CHISQ.TEST(actual_range, expected_range)


o Actual range = observed

This formula will give you the p-value immediately; the test statistic is
calculated behind the scenes. All you need to have are both the expected
and observed values/tables.
- Critical Value

Use the formula =CHISQ.INV.RT(significance level, degrees of freedom)

- Test Statistic
o Is calculated from the p-value.

Use the formula =CHISQ.INV.RT(p-value, degrees of freedom)

Here, the p-value is entered in place of the significance level.
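For anyone who wants to check the Excel answers outside Excel, here is a minimal Python sketch of the same two calculations: the right-tail probability (the p-value for a given test statistic) and its inverse (the critical value for a given significance level). The function names are made up for these notes, the code only handles whole-number degrees of freedom, and the inverse uses simple bisection rather than whatever method Excel uses internally.

```python
import math


def chi2_right_tail(x, df):
    """P(chi-squared variable with integer df is >= x), i.e. the p-value."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Closed-form starting points for df = 2 (even) and df = 1 (odd).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Step up two degrees of freedom at a time until a equals df / 2.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_critical_value(significance_level, df):
    """Value with right-tail probability equal to significance_level (bisection)."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_right_tail(mid, df) > significance_level:
            lo = mid  # tail still too big, so the critical value is further right
        else:
            hi = mid
    return (lo + hi) / 2.0


print(chi2_right_tail(11.03, 5))      # p-value for the example test statistic
print(chi2_critical_value(0.05, 5))   # critical value at 5%, 5 degrees of freedom
```

With D² = 11,03 and 5 degrees of freedom the p-value comes out just above 0,05, which agrees with the table lookup (between 0,05 and 0,1).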

Testing for a Linear Relationship Between Two Variables

Is There a Linear Relationship Between X and Y?

A linear relationship means that the dependent variable is a linear
function of the independent variable.
We make use of correlation coefficients.
- The true (population) correlation coefficient is, more often than not, unknown, and
hence, we use the evidence of correlation measured in a sample to
infer about that of the population.

Sample correlation, and the population correlation, lie between -1 and 1:


The Coefficient of Determination: R²

Measures the proportion of the variation in y which is explained by x.

(Example)

R² is often expressed as a percentage: e.g. 74% of the
variation in y is explained by x.

- Therefore:
o The closer r is to -1 or 1 (the higher the value of r²), the stronger
the linear relationship between x and y.
o The closer r is to 0 (the lower the value of r²), the weaker the
linear relationship between x and y.
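To make r and R² concrete, here is a small Python sketch that computes the sample correlation coefficient from paired data and then squares it. The formula used is the standard one (sum of cross-deviations divided by the square root of the product of the sums of squared deviations); the function name and the marks are hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical year marks (x) and final marks (y).
year_marks = [45, 52, 60, 68, 75, 81]
final_marks = [40, 50, 49, 62, 66, 70]

r = correlation(year_marks, final_marks)
print(r)       # lies between -1 and 1
print(r ** 2)  # coefficient of determination, lies between 0 and 1
```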

Using X to Predict Y

Straight Line Formula with True Parameters

y = α + βx

α = y intercept/intercept parameter
β = slope of line
We often don’t know β and α, hence we estimate them.

Straight Line Formula with Estimated Parameters

y = a + bx

a = y intercept
b = slope of line
a and b are random variables, which estimate α and β, and vary from sample to
sample.

Generally, we test hypotheses about β.

- If β = 0, then no linear relationship between x and y.
o We use this fact [β = 0] as a basis from which we test whether
or not there is a linear relationship.
- More About ‘a’ and ‘b’

They are statistics since they are computed from a sample. They also
follow a normal distribution.

The constants, ‘a’ and ‘b’ are chosen in order to minimize the sum of
squared residuals (ie the sum of the squared differences between
the observed and predicted values of y).

Essentially, residuals are the differences between observed and
expected values, and hence we are trying to minimize the total squared
distance between the points and the line (shown as the red space in the original figure).

A residual [e] is defined as the difference between the actual y and the
predicted value of y (the predicted value of y is denoted ŷ).
- e = y − ŷ

- Sum of the Squared Residuals

= Σ eᵢ² (summed over i = 1, ..., n)

We do not need to know the formula for calculating the regression coefficients; we rely
on Excel.
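Although the course relies on Excel for this, a short Python sketch may help show what "minimizing the sum of squared residuals" produces. It uses the standard least-squares formulas, b = Sxy / Sxx and a = (mean of y) − b × (mean of x), which are not given in these notes. The function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of e_i^2, where e_i = observed y minus predicted y."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data that do not lie exactly on a line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
print(a, b)
print(sum_squared_residuals(xs, ys, a, b))
# Any other line gives a larger sum of squared residuals, for example:
print(sum_squared_residuals(xs, ys, a + 0.1, b))
```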

Hypothesis Test About 𝛽 Using the 6-Step Approach


This entire example is about whether or not there is a relationship
between the year marks and final marks of a course.

Step 1:

Define null hypothesis (H0)

H0: β = 0

- No relationship between x and y.

Step 2:

Define alternative hypothesis (H1)

H1: β < 0 (one-sided), or
H1: β > 0 (one-sided), or
H1: β ≠ 0 (two-sided)

Step 3:

Set up significance level (α) for test.

o Determines how tough the test is.
Commonly used α = 0.01 (1%)

Step 4:

Set up a rejection region

Our test statistic (tₙ₋₂) follows a t-distribution with n-2 degrees of
freedom (where n = the number of pairs of observations/pairs of x
and y)

For a One-Sided Test:


For a Two-Sided Test:

Step 5:

Calculate the test statistic T:

Step 6:

Conclude:

- Test statistic falls in rejection region: reject H0
- Otherwise we conclude that we have insufficient evidence to reject
our null hypothesis.
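The six steps can be traced through in a short Python sketch. The test statistic is (b − 0) / standard error of b, and the standard error is computed with the usual textbook formula s / √Sxx, where s² = (sum of squared residuals) / (n − 2); this formula is an addition, since the notes take the standard error from Excel. The data are hypothetical, and the critical value 5,841 (two-sided, 1% level, 3 degrees of freedom) is read from a t table rather than computed.

```python
import math


def slope_t_test(xs, ys):
    """Return (b, standard error of b, t statistic, degrees of freedom)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # estimated slope
    a = mean_y - b * mean_x            # estimated intercept
    ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    df = n - 2
    residual_sd = math.sqrt(ssr / df)  # spread of the points around the line
    standard_error_b = residual_sd / math.sqrt(sxx)
    t_stat = (b - 0) / standard_error_b  # H0 says beta = 0
    return b, standard_error_b, t_stat, df


# Hypothetical data: 5 pairs of (x, y), so 3 degrees of freedom.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b, se_b, t_stat, df = slope_t_test(xs, ys)

critical_value = 5.841  # from t tables: two-sided test, alpha = 0.01, df = 3
print(b, se_b, t_stat, df)
if abs(t_stat) > critical_value:
    print("Test statistic falls in the rejection region: reject H0")
else:
    print("Insufficient evidence to reject H0")
```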

Computing Regression in Excel

Data > Data Analysis > Regression

Y = what we are trying to predict (dependent)

X = the ‘predictor’ (independent)

Labels: if you select your data from the very top row (where
your labels are), click that box.

Constant is zero: makes line go through origin.

Line fit plots: will plot the line for you.
Results from Regression in Excel

Multiple R (r) = correlation coefficient
- Square Multiple R, you get R Square (coefficient of determination).
o Correlation coefficient (r) lies between -1 and 1.

R Square (r²) = coefficient of determination
- Indicates the proportion of variation in y that x is able to explain.
o Coefficient of determination (r²) lies between 0 and 1.

If r is close to 1 or -1, then r² will be close to 1. If r is close to 0 then r² will
be close to 0.

Observations:
- Number of pairs of data.

Standard Error:
- Average amount that each point is away from the regression line.
o Smaller value = points are closer to regression line.

Intercept = a value in ‘y = a + bx’
- Value you would expect to receive for your final mark if your year mark was 0.

Year Mark = b value in ‘y = a + bx’
- Intercept is 3,06 and slope is 0,82.
o You can expect your final mark (Y) to be 82% of your year mark, plus 3,06.

P-Value
- 1E-277 = 1 × 10⁻²⁷⁷
- Calculated by:
o =T.DIST.2T(x; deg_freedom)
o x = t-stat
o deg_freedom = (number of pairs − 2)

t-Stat for Year Mark

- = (Coefficient − 0) / Standard Error

Lower 95%/Upper 95%

- Gives the 95% confidence interval for the true value of β.

Residuals:
- Difference between observed and expected for each observed pair of data.
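To show how the intercept and slope from the output are used, here is a small sketch that plugs the values quoted above (intercept 3,06 and slope 0,82) into ‘y = a + bx’. The function names and the student marks are hypothetical.

```python
INTERCEPT = 3.06  # 'a' from the regression output quoted in these notes
SLOPE = 0.82      # 'b' (the Year Mark coefficient) from the same output


def predict_final_mark(year_mark):
    """Predicted final mark from y = a + b*x."""
    return INTERCEPT + SLOPE * year_mark


def residual(observed_final_mark, year_mark):
    """Observed final mark minus the predicted final mark."""
    return observed_final_mark - predict_final_mark(year_mark)


print(predict_final_mark(0))   # the intercept: expected final mark if year mark is 0
print(predict_final_mark(50))  # predicted final mark for a year mark of 50
print(residual(50, 50))        # hypothetical student: final mark 50, year mark 50
```

A positive residual means the student did better in the final than the line predicts; a negative residual means worse.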
Hypothesis Test About 𝛽 Using the Modified Approach

Step 1:

Define null hypothesis (H0)

H0: β = 0

- No relationship between x and y.

Step 2:

Define alternative hypothesis (H1)

H1: β ≠ 0

Step 3:

Find the test statistic T:

Step 4:

Find the p-value:

Step 5:

Conclude

H0 said that there was no relationship between x and y.
Regression and Correlation

Correlation:

Is a measure of the strength and direction of the linear relationship between
two quantitative variables.

- Correlation Coefficient

r = 0,69 (Fairly strong positive linear relationship)

r = -0,64 (Moderate negative linear relationship)
Regression:

Predicting values for one variable, given particular values for another variable.
- Regression analysis is only effective if there is a relationship
between the dependent and independent variables.

(Example)

Y = death rate
X = number of gwaais smoked per
person

Examples of Regression Analysis Questions

On Vula, read the slides in week 11 titled ‘Regression and Correlation Oct2015’.
