MATH1710 - Probability and Statistics I - Full Notes: Harry Collins 2021

Part I

Notes from the Notes

Chapter 1

Exploratory Data Analysis

1.1 What is Exploratory Data Analysis?


Exploratory data analysis is about "first impressions" of the data. We are going
to concentrate on summary statistics and data visualisation.

Summary Statistics: Summary statistics summarise the data - they can tell
us what the "typical" values of the data are, how spread
out it is, or how two variables relate to each other.

Data Visualisation: Data visualisation is how we draw pictures to represent
the shape of the data, or how two variables are related.

We also have other important questions to ask about the data before we calculate
or draw anything:
• What is the data? - What has been measured and how? How much
data is there?
• How was it collected? - Is it from a whole population or a sample? If
sampled, how was the sample chosen?
• Are there any outliers? - If so, should they be excluded?
• Are there any ethical questions? - Was there informed consent,
proper storage of data, etc.?

1.2 What is R?
R is a programming language well suited to statistical problems. It is widely used
in university statistics modules and in academia, and is increasing in popularity
for use in industry.
RStudio is a program for using R.

1.3 Summary Statistics and Boxplots


Suppose that we have $n$ datapoints that are real numbers. We can express this
as a vector $x = (x_1, x_2, \dots, x_n)$.
A statistic is then a calculation from the data $x$, which is usually also a real
number.
The two types of summary statistic we will look at are:
• Measures of Centrality - Where the "middle" of the data is.
• Measures of Spread - How far the data spreads from the middle.

Definitions: Measures of Centrality


Let $x$ be some real-valued data such that $x = (x_1, x_2, \dots, x_n)$.
• The mode is the most common value of $x_i$.
• The median is the central value in the ordered list where
$x_1 \le x_2 \le \dots \le x_n$.
If $n$ is odd, the median is $x_{(n+1)/2}$, but if $n$ is even it is $\frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right)$.
• The mean, denoted $\bar{x}$, is defined as follows:
\[
\bar{x} = \frac{1}{n}(x_1 + x_2 + \dots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i , \quad \text{using sigma notation.}
\]
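The three measures of centrality can be sketched in a few lines of Python. This is only an illustration: the function names and the sample data are made up for the example, and the median uses the two central values of the ordered list when $n$ is even.

```python
from collections import Counter


def mean(x):
    """Sample mean: the sum of the data divided by the number of datapoints."""
    return sum(x) / len(x)


def median(x):
    """Central value of the ordered data (average of the two central values if n is even)."""
    ordered = sorted(x)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the single central value, position (n+1)/2 counting from 1
        return ordered[(n + 1) // 2 - 1]
    # n even: halfway between positions n/2 and n/2 + 1 counting from 1
    return 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])


def modes(x):
    """All most-common values, since a dataset can have more than one mode."""
    counts = Counter(x)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


data = [3, 1, 4, 1, 5, 9, 2, 6]  # made-up illustrative data
print(mean(data), median(data), modes(data))
```

Returning a list from `modes` is a design choice for the sketch, so that ties between equally common values are visible rather than hidden.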

Quantiles
The median is an example of a "quantile" of the data. The $\alpha$-quantile of some
data, called $q(\alpha)$, is the datapoint $\alpha$ of the way through the ordered list of the
data, where $0 \le \alpha \le 1$. The median is therefore the $\frac{1}{2}$-quantile, $q(\frac{1}{2})$.
Generally, if $1 + \alpha(n-1)$ is an integer, $q(\alpha) = x_{1+\alpha(n-1)}$.
Two other important quantiles are $q(\frac{1}{4})$ and $q(\frac{3}{4})$, called the lower and upper
quartiles, respectively.
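As a rough sketch, the quantile rule can be written as a small Python function. When $1 + \alpha(n-1)$ is an integer it returns the datapoint at that position, exactly as above. When it is not an integer, the notes do not say what to do, so the linear interpolation between the two neighbouring datapoints used here is an assumption made for the example.

```python
import math


def quantile(x, alpha):
    """alpha-quantile q(alpha) of the data x, for 0 <= alpha <= 1."""
    ordered = sorted(x)
    n = len(ordered)
    position = 1 + alpha * (n - 1)  # position in the ordered list, counting from 1
    lower = math.floor(position)
    frac = position - lower
    if frac == 0:
        # Integer position: q(alpha) is exactly the datapoint at that position
        return ordered[lower - 1]
    # Assumption (not from the notes): interpolate linearly between neighbours
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


data = [1, 2, 3, 4, 5]  # made-up illustrative data
print(quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.75))
```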

Definitions: Measures of Spread
Using the same definition of $x = (x_1, x_2, \dots, x_n)$:
• The interquartile range is the difference between the upper and lower
quartiles: $\mathrm{IQR} = q(\frac{3}{4}) - q(\frac{1}{4})$.
• The sample variance is $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$,
where $\bar{x}$ is the sample mean.
• From this, the standard deviation is the square root of the
sample variance: $s_x = \sqrt{s_x^2}$.

Definitional and Computational Formulae


The above formula for the sample variance is what we use to define sample
variance, but we can rearrange it to be more convenient for calculations:
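\[
\begin{aligned}
s_x^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 && \text{the definitional formula} \\
&= \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) && \text{expanding the brackets} \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} 2x_i\bar{x} + \sum_{i=1}^{n} \bar{x}^2\right) && \text{taking the sum term-by-term} \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + \bar{x}^2\sum_{i=1}^{n} 1\right) && \text{taking out any constants} \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2\right) && \text{simplifying with } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \text{ and } \sum_{i=1}^{n} 1 = n \\
s_x^2 &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right) && \text{simplifying the last two terms}
\end{aligned}
\]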
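A quick numerical check of the rearrangement: the sketch below computes the sample variance both ways in Python and confirms that they agree. The function names and the data are made up for the illustration.

```python
import math


def variance_definitional(x):
    """Sample variance from the definition: sum of squared deviations over n - 1."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)


def variance_computational(x):
    """Sample variance from the rearranged form: (sum of squares - n * mean^2) over n - 1."""
    n = len(x)
    xbar = sum(x) / n
    return (sum(xi ** 2 for xi in x) - n * xbar ** 2) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up illustrative data with mean 5
v_def = variance_definitional(data)
v_comp = variance_computational(data)
print(v_def, v_comp, math.isclose(v_def, v_comp))
```

The two forms are algebraically identical, but on a computer the computational form subtracts two numbers of similar size, so for data with a large mean and a small spread it can lose some precision.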

Boxplots
Boxplots are a way of illustrating data that make it easier to compare different
data sets "by eye," as opposed to comparing raw statistics.
The features of a boxplot are as follows:
• The vertical axis represents the data values.
• A box is drawn from the lower quartile to the median.
• Another box is joined on top from the median to the upper quartile.
• Any outliers are plotted as circles. Generally in R an outlier is less than
$q(\frac{1}{4}) - 1.5\,\mathrm{IQR}$ or more than $q(\frac{3}{4}) + 1.5\,\mathrm{IQR}$.
• From the ends of the boxes, "whiskers" are drawn to the smallest and
largest non-outlier data points.
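The numbers behind a boxplot can be sketched in Python as follows. This is an illustration rather than a description of how R computes its plots: the helper names and data are made up, and the quantile helper interpolates linearly when the position is not a whole number, which is an assumption.

```python
import math


def quantile(x, alpha):
    """alpha-quantile, interpolating linearly between neighbours (an assumption)."""
    ordered = sorted(x)
    position = alpha * (len(ordered) - 1)  # position counting from 0
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    frac = position - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def boxplot_summary(x):
    """Quartiles, outliers by the 1.5 * IQR rule, and whisker ends."""
    q1, med, q3 = quantile(x, 0.25), quantile(x, 0.5), quantile(x, 0.75)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in x if v < low_fence or v > high_fence]
    inside = [v for v in x if low_fence <= v <= high_fence]
    return {
        "lower_quartile": q1,
        "median": med,
        "upper_quartile": q3,
        "outliers": outliers,
        # Whiskers reach the smallest and largest non-outlier datapoints
        "whiskers": (min(inside), max(inside)),
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # made-up data with one extreme value
summary = boxplot_summary(data)
print(summary)
```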

1.4 Binned Data and Histograms


When collecting data, it is often more practical to collect data in "bins," rather
than as exact values.
The frequency $f_j$ of bin $j$ is the number of observations in that bin. Given that
there are $n$ observations in total, we can also calculate the relative frequency
$f_j / n$, which is the proportion of observations in the bin.
Whilst we cannot find the exact median of the data, we can find out what
bin it is in. To do this, we can add up the relative frequencies of each bin until
the sum is greater than or equal to 0.5, and then the last bin we added contains
the median.
To find the mode, it would not be fair to simply use the bin with
the most observations, as some bins are wider than others. Instead, we use the
frequency density, calculated by dividing the relative frequency by the width of
the bin, and take the bin with the highest frequency density.
Again, as we don't have exact data, we can't exactly calculate the mean or
variance. However, we can make a reasonable estimate by assuming that all the data
in each bin lies exactly on its midpoint, which we can call $m_j$. Then, the mean
and variance become:
\[
\bar{x} = \frac{1}{n}\sum_{j} f_j m_j
\]
\[
s_x^2 = \frac{1}{n-1}\sum_{j} f_j (m_j - \bar{x})^2
\]
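The binned-data calculations can be sketched in Python as below. The bins and frequencies are made up, the function names are hypothetical, and the variance uses squared deviations from the estimated mean, as in the definition of the sample variance. The example is chosen so that the widest bin has the most observations but not the highest frequency density.

```python
# Each bin is (lower edge, upper edge, frequency); the numbers are made up.
bins = [(0, 10, 6), (10, 20, 8), (20, 60, 10)]


def total_count(bins):
    """Total number of observations n."""
    return sum(f for _, _, f in bins)


def median_bin(bins):
    """First bin at which the cumulative relative frequency reaches 0.5."""
    n = total_count(bins)
    cumulative = 0
    for lower, upper, f in bins:
        cumulative += f
        if 2 * cumulative >= n:  # same as cumulative / n >= 0.5
            return (lower, upper)


def modal_bin(bins):
    """Bin with the highest frequency density (relative frequency / width)."""
    n = total_count(bins)
    lower, upper, _ = max(bins, key=lambda b: (b[2] / n) / (b[1] - b[0]))
    return (lower, upper)


def binned_mean(bins):
    """Estimated mean, treating every observation as lying on its bin midpoint."""
    return sum(f * (lo + hi) / 2 for lo, hi, f in bins) / total_count(bins)


def binned_variance(bins):
    """Estimated sample variance using bin midpoints."""
    xbar = binned_mean(bins)
    squared = sum(f * ((lo + hi) / 2 - xbar) ** 2 for lo, hi, f in bins)
    return squared / (total_count(bins) - 1)


print(median_bin(bins), modal_bin(bins), binned_mean(bins), binned_variance(bins))
```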

Figure 1.1: A typical boxplot.

Histograms
Histograms are a way of representing data in bins. A histogram has the measurement
on the x-axis and frequency density on the y-axis, with one bar for
each bin, so that the area of each bar is its bin's relative frequency.
You can also group exact data into bins of your own in order to draw histograms,
but you have to be careful not to use too many or too few bins, so
that the histogram is neither too "noisy" nor lacking in detail.

1.5 Multiple Variables and Scatterplots
Often, we want to compare multiple pieces of data taken from subjects and see
if any two (or more) variables are related. For example, from $n$ subjects, we
could collect data $x_i$ and $y_i$ from each subject $i$, giving us two "paired" datasets,
$x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$. Whilst we could calculate sample
statistics and plot the data for both $x$ and $y$ individually, we might also want
to see if there is a relationship between the two.
One good way to visualise this is using a scatterplot, where each pair of
datapoints is represented by a point or mark with the coordinates $(x_i, y_i)$.

A useful summary statistic for paired data like this is the correlation:
\[
r_{xy} = \frac{s_{xy}}{s_x s_y} ,
\]
where $s_x$ and $s_y$ are the standard deviations of $x$ and $y$, and $s_{xy}$ is the sample
covariance:
\[
s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) .
\]

The correlation $r_{xy}$ is always a value between $-1$ and $+1$, with values closer to
$+1$ indicating the points are close to a straight line with an upwards slope, whilst
values closer to $-1$ indicate they are closer to a straight line with a downwards
slope. Where $r_{xy}$ is close to 0, there is only a weak linear relationship between
the two variables.
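A minimal Python sketch of the correlation, built directly from the sample covariance and the two standard deviations. The function names and the data are made up for the illustration; it uses the fact that the sample variance of $x$ is the covariance of $x$ with itself.

```python
import math


def covariance(x, y):
    """Sample covariance of paired data x and y."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Sample correlation: covariance divided by the product of standard deviations."""
    s_x = math.sqrt(covariance(x, x))  # standard deviation of x
    s_y = math.sqrt(covariance(y, y))  # standard deviation of y
    return covariance(x, y) / (s_x * s_y)


x = [1, 2, 3, 4]
upwards = [2, 4, 6, 8]    # exactly on a line with upwards slope
downwards = [8, 6, 4, 2]  # exactly on a line with downwards slope
print(correlation(x, upwards), correlation(x, downwards))
```

Points lying exactly on a straight line give a correlation of $+1$ or $-1$ up to floating-point rounding, matching the interpretation above.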

Part II

Notes from Lectures

Chapter 2

Lecture 1

2.1 General Weekly Work Schedule


Each week, the work will be split across three parts:
1. Notes and Videos: Learn new material by reading and watching.
(2 hours)
2. Problem Sheets: Test your knowledge by answering questions.
Each has two assessed questions you need to submit.
(3 hours + 1 hour to submit)
3. R Worksheets: Learn how to use R and RStudio.
Some have assessed questions.
(1 hour, maybe 1 extra hour for assessed questions)
(all timings are per week)

2.2 Types of Teaching
1. Lectures:
• Mondays at 15:00.
• Online, via Zoom.
• We will go over last week’s materials.
• There will be some interactivity via polls and Q&As.
2. Tutorials:
• Weeks 2, 4, 6, 8 and 10 (check the timetable to be sure).
• In person, on campus.
• In small groups of about 10-12 people.
• We will go over answers to the problem sheets in an interactive
setting.
3. Office Hours Drop-In:
• Wednesday 10:00, Charles Thackrah SR G.07
or Wednesday 12:00, Emmanuel Centre SR 02.
• In person, on campus.
• Optional.
• For any questions about the module.

4. R Troubleshooting Drop-Ins:
• Weeks 2 and 3 (check the timetable).
• In person, on campus.
• Optional.
• If you have questions about R, get help.

2.3 R
R is a programming language well suited to statistical problems.
It is widely used in university statistics modules and in academia, and is
increasingly popular for use in industry.
RStudio is a program for using R.

Part III

Assessed Questions

Chapter 3

Problem Sheet 1

Question C1
Let $x$ be a vector of the monthly average exchange rate for US dollars into
British pounds over a 12-month period, such that
$x = (1.306, 1.301, 1.290, 1.266, 1.290, 1.302, 1.317, 1.304, 1.284, 1.268, 1.247, 1.215)$.

Part (a): Calculate the median for this data.


In order to find the median of $x$, we will first have to order the data such that
$x_1 \le x_2 \le \dots \le x_n$.
This gives us $x = (1.215, 1.247, 1.266, 1.268, 1.284, 1.290, 1.290, 1.301, 1.302, 1.304, 1.306, 1.317)$.
As there is an even number of datapoints, the median will be halfway
between the two most central values, $x_{n/2}$ and $x_{(n+2)/2}$, where $n = 12$ is the total number
of datapoints, so the median equals:
\[
\frac{1}{2}\left(x_{n/2} + x_{(n+2)/2}\right) = \frac{1}{2}(x_6 + x_7) = \frac{1}{2}(1.290 + 1.290) = 1.290.
\]

Part (b): Calculate the mean for this data.


The mean of this dataset, $\bar{x}$, is calculated by dividing the sum of all the datapoints
by the number of datapoints. This can be written as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

For our dataset $x$, the mean is
\[
\begin{aligned}
\bar{x} &= \frac{1}{12}(1.306 + 1.301 + 1.290 + 1.266 + 1.290 + 1.302 \\
&\qquad + 1.317 + 1.304 + 1.284 + 1.268 + 1.247 + 1.215) \\
&= \frac{1}{12}(15.39) \\
&= 1.2825.
\end{aligned}
\]

Part (c): Calculate the sample variance for this data.
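The sample variance $s_x^2$ is defined as $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, which is rather difficult
to use for calculations, so instead we will use the "computational formula"
\[
s_x^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right).
\]
Using this,
\[
\begin{aligned}
s_x^2 &= \frac{1}{11}\left(\sum_{i=1}^{n} x_i^2 - 12(1.2825)^2\right), \quad \text{where } \sum_{i=1}^{n} x_i^2 = 19.747016, \\
&= \frac{1}{11}(19.747016 - 19.737675) \\
&= 0.0008491818 \\
&\approx 8.49 \times 10^{-4} \text{ to three significant figures.}
\end{aligned}
\]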
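The hand calculations in parts (a) to (c) can be checked with a short Python script. This is only a sanity check, using the twelve values from the ordered list in part (a); the variable names are chosen for the example.

```python
import statistics

# The twelve monthly averages, in the order given by the sorted list in part (a)
x = [1.215, 1.247, 1.266, 1.268, 1.284, 1.290,
     1.290, 1.301, 1.302, 1.304, 1.306, 1.317]

n = len(x)
median = statistics.median(x)    # average of the 6th and 7th ordered values
xbar = sum(x) / n                # sample mean
sum_sq = sum(v ** 2 for v in x)  # sum of the squared datapoints

# Computational formula for the sample variance
s2 = (sum_sq - n * xbar ** 2) / (n - 1)

print(f"n = {n}, median = {median}, mean = {xbar:.4f}")
print(f"sum of squares = {sum_sq:.6f}, sample variance = {s2:.3e}")
```

The standard library's `statistics.variance(x)` uses the same $n - 1$ divisor, so it can serve as an independent check on `s2`.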

Part (d): Is the mode an appropriate summary statistic for this data? Why/why not?

I would say that the mode is not an appropriate summary statistic for this
data, as it would not really give us any useful information. Each datapoint in
$x$ is only a monthly average of an exchange rate that is itself
constantly changing, so knowing the most common monthly average is far
less useful than, for example, the mean rate for the year as given by $\bar{x}$.

Question C2
Part (a): Prove the following computational formula for
the sample covariance:
\[
s_{xy} = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)
\]

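We know the definitional formula for the sample covariance to be:

\[
\begin{aligned}
s_{xy} &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \\
&= \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i y_i - x_i\bar{y} - \bar{x}y_i + \bar{x}\bar{y}\right) && \text{after expanding the brackets,} \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i\bar{y} - \sum_{i=1}^{n} \bar{x}y_i + \sum_{i=1}^{n} \bar{x}\bar{y}\right) && \text{taking each term of the sum separately,} \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i - \bar{x}\sum_{i=1}^{n} y_i + \bar{x}\bar{y}\sum_{i=1}^{n} 1\right) && \text{taking out any constants,} \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y}\right) && \text{using } \sum_{i=1}^{n} x_i = n\bar{x},\ \sum_{i=1}^{n} y_i = n\bar{y} \text{ and } \sum_{i=1}^{n} 1 = n, \\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right) && \text{after combining the last three terms,}
\end{aligned}
\]

which is the computational formula we wish to prove.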

Part (b): Suppose that a dataset $x = (x_1, x_2, \dots, x_n)$ (with
$n \ge 2$) has a sample variance $s_x^2 = 0$. Show that all the
datapoints are in fact equal.
Given that $s_x^2 = 0$ and the definitional formula for the sample variance is
$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, we can write $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 0$.
Then, by multiplying both sides of the equation by $(n-1)$, which is non-zero because $n \ge 2$, we get
$\sum_{i=1}^{n}(x_i - \bar{x})^2 = 0$.

Each term $(x_i - \bar{x})^2$ is the square of a real number, so it is greater than or
equal to 0. A sum of non-negative terms can only equal 0 if every one of the terms
equals 0, so $(x_i - \bar{x})^2 = 0$ for every $i$. Taking square roots gives
$x_i - \bar{x} = 0$, which can be written as $x_i = \bar{x}$ for all $i$.

Therefore, when a dataset $x$ has sample variance $s_x^2 = 0$, every datapoint
$x_i$ equals the mean $\bar{x}$, so all of its datapoints are equal.

Part IV

R Worksheets

Part V

Other Things I am Proud of
