MATH1710 - Probability and Statistics I - Full Notes: Harry Collins 2021
Notes
Harry Collins
2021
Part I
Chapter 1
We also have other important questions to ask about the data, before we calculate
or draw anything:
• What is the data? - What has been measured and how? How much
data is there?
• How was it collected? - Is it from a whole population or a sample? If
sampled, how was the sample chosen?
• Are there any outliers? - If so, should they be excluded?
1.2 What is R?
R is a programming language well suited to statistical problems. It is widely used
in university statistics modules and in academia, and is increasing in popularity
for use in industry.
RStudio is a program for using R.
Quantiles
The median is an example of a “quantile” of the data. The $\alpha$-quantile of some
data, called $q(\alpha)$, is the datapoint $\alpha$ of the way through the ordered list of the
data, where $0 \le \alpha \le 1$. The median is therefore the $\frac{1}{2}$-quantile, $q(\frac{1}{2})$.
Generally, if $1 + \alpha(n - 1)$ is an integer, $q(\alpha) = x_{1+\alpha(n-1)}$, where the
datapoints are numbered in increasing order.
Two other important quantiles are $q(\frac{1}{4})$ and $q(\frac{3}{4})$, called the lower and upper
quartiles, respectively.
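The rule above can be tried out with a short Python sketch. The function name `quantile` and the example data are made up for illustration. The notes only cover the case where $1 + \alpha(n - 1)$ is an integer, so the linear interpolation used otherwise is an assumption, not something taken from the notes.

```python
import math


def quantile(data, alpha):
    """Return q(alpha), the datapoint at position 1 + alpha*(n - 1) of the ordered data."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    ordered = sorted(data)
    n = len(ordered)
    # Zero-based version of the one-based position 1 + alpha*(n - 1).
    position = alpha * (n - 1)
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        # The position is a whole number, so q(alpha) is one of the datapoints.
        return ordered[lower]
    # ASSUMPTION (not in the notes): interpolate linearly between the two neighbours.
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])


example = [7, 1, 5, 3, 9]  # hypothetical data with n = 5
median = quantile(example, 0.5)
lower_quartile = quantile(example, 0.25)
upper_quartile = quantile(example, 0.75)
```

With n = 5, the positions for the quartiles and the median are all whole numbers, so no interpolation is needed for this example.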
Definitions: Measures of Spread
Using the same definition of $x = (x_1, x_2, \dots, x_n)$:
• The interquartile range is the difference between the upper and lower
quartiles: $\mathrm{IQR} = q(\frac{3}{4}) - q(\frac{1}{4})$.
• The sample variance is $s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$,
where $\bar{x}$ is the sample mean.
• From this, the standard deviation is the square root of the
sample variance: $s_x = \sqrt{s_x^2}$.
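As a rough illustration of these three definitions, here is a Python sketch. The function names and data are hypothetical. For the quartiles it relies on `statistics.quantiles` with `method="inclusive"`, which places them at position $1 + \alpha(n - 1)$ and interpolates linearly when that is not a whole number; the interpolation is an assumption beyond what the notes define.

```python
import math
import statistics


def sample_variance(data):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def standard_deviation(data):
    """Standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(data))


def interquartile_range(data):
    """IQR = q(3/4) - q(1/4), using the 'inclusive' quantile method."""
    lower, _median, upper = statistics.quantiles(data, n=4, method="inclusive")
    return upper - lower


example = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean 5
variance = sample_variance(example)
spread = standard_deviation(example)
iqr = interquartile_range(example)
```

The hand-written variance can be compared against `statistics.variance`, which also divides by n − 1.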
Boxplots
Boxplots are a way of illustrating data that make it easier to compare different
data sets “by eye,” as opposed to comparing raw statistics.
The features of a box plot are as follows:
\[ s_x^2 = \frac{1}{n-1} \sum_j f_j (m_j - \bar{x})^2 \]
Figure 1.1: A typical boxplot.
Histograms
Histograms are a way of representing data in bins. A histogram has the measurement
on the x-axis and frequency density on the y-axis, with one bar for
each bin, so that the area of each bar is its bin’s relative frequency.
You can also group exact data into bins of your own in order to draw histograms,
but you have to be careful not to use too many or too few bins, so
that the histogram is neither too “noisy” nor lacking in detail.
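To make the link between frequency density and area concrete, here is a small Python sketch. The function name, the data, and the bin edges are all hypothetical; the convention that each bin includes its left edge (and the last bin also its right edge) is an assumption, since the notes do not say how boundary values are assigned.

```python
def frequency_densities(data, bin_edges):
    """Return a (relative frequency, frequency density) pair for each bin.

    ASSUMPTION: bins are [left, right), except the last, which is [left, right].
    """
    n = len(data)
    last = len(bin_edges) - 2
    results = []
    for j in range(len(bin_edges) - 1):
        left, right = bin_edges[j], bin_edges[j + 1]
        if j == last:
            count = sum(1 for x in data if left <= x <= right)
        else:
            count = sum(1 for x in data if left <= x < right)
        relative_frequency = count / n
        # Height of the bar: dividing by the width makes the bar's AREA
        # equal to the bin's relative frequency.
        density = relative_frequency / (right - left)
        results.append((relative_frequency, density))
    return results


example = [1, 2, 2, 3, 5, 6, 8, 9, 12, 18]  # hypothetical data
edges = [0, 4, 10, 20]                      # hypothetical, unequal bin widths
bars = frequency_densities(example, edges)
areas = [density * (edges[j + 1] - edges[j]) for j, (_, density) in enumerate(bars)]
```

Because the areas are relative frequencies, they add up to 1 whenever every datapoint falls in some bin.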
1.5 Multiple Variables and Scatterplots
Often, we want to compare multiple pieces of data taken from subjects and see
if any two (or more) variables are related. For example, from n subjects, we
could collect data $x_i$ and $y_i$ from each subject $i$, giving us two “paired” datasets,
$x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$. Whilst we could calculate sample
statistics and plot the data for both x and y individually, we might also want
to see if there is a relationship between the two.
One good way to visualise this is using a scatterplot, where each pair of
datapoints is represented by a point or mark with the coordinates $(x_i, y_i)$.
A useful summary statistic for paired data like this is the correlation:
\[ r_{xy} = \frac{s_{xy}}{s_x s_y}, \]
where $s_x$ and $s_y$ are the standard deviations of x and y, and $s_{xy}$ is the sample
covariance:
\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \]
The correlation $r_{xy}$ is always a value between −1 and +1, with values closer to
+1 indicating the points are close to a straight line with an upwards slope, whilst
values closer to −1 indicate they are closer to a straight line with a downwards
slope. Where $r_{xy}$ is close to 0, there is only a weak linear relationship between
the two variables.
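The covariance and correlation formulas can be sketched in Python as follows. The function names and the paired datasets are invented for illustration, and the sketch assumes both datasets have the same length and are not constant (otherwise a standard deviation would be zero).

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance s_xy of two paired datasets of equal length."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Correlation r_xy = s_xy / (s_x * s_y)."""
    # The sample variance of a dataset is its covariance with itself.
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


x = [1, 2, 3, 4, 5]                   # hypothetical data
y_up = [2.1, 3.9, 6.2, 7.8, 10.1]     # roughly on an upward-sloping line
y_down = [10, 8, 6, 4, 2]             # exactly on a downward-sloping line

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
```

Here `r_up` comes out close to +1 and `r_down` is −1 up to rounding, matching the description of upward and downward slopes.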
Part II
Chapter 2
Lecture 1
2.2 Types of Teaching
1. Lectures:
• Mondays at 15:00.
• Online, via Zoom.
• We will go over last week’s materials.
• There will be some interactivity via polls and Q&As.
2. Tutorials:
• Weeks 2, 4, 6, 8 and 10 (check the timetable to be sure).
• In person, on campus.
• In small groups of about 10-12 people.
• We will go over answers to the problem sheets in an interactive
setting.
3. Office Hours Drop-In:
• Wednesday 10:00, Charles Thackrah SR G.07
or Wednesday 12:00, Emmanuel Centre SR 02.
• In person, on campus.
• Optional.
• For any questions about the module.
4. R Troubleshooting Drop-Ins:
• Weeks 2 and 3 (check the timetable).
• In person, on campus.
• Optional.
• For help with any questions about R.
2.3 R
R is a programming language well suited to statistical problems.
It is widely used in university statistics modules and in academia, and is
increasingly popular for use in industry.
RStudio is a program for using R.
Part III
Assessed Questions
Chapter 3
Problem Sheet 1
Question C1
Let x be a vector of the monthly average exchange rate for US dollars into
British pounds over a 12-month period, such that
x = (1.306, 1.301, 1.290, 1.266, 1.290, 1.302, 1.317, 1.304, 1.284, 1.268, 1.247, 1.215).
For our dataset x, the mean is
\[ \bar{x} = \frac{1}{12}(1.306 + 1.301 + 1.290 + 1.266 + 1.290 + 1.302 + 1.317 + 1.304 + 1.284 + 1.268 + 1.247 + 1.215) = \frac{1}{12}(15.39) = 1.2825. \]
Part (c): Calculate the sample variance for this data.
The sample variance $s_x^2$ is defined as $\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, which is rather difficult
to use for calculations, so instead we will use the “computational formula”
\[ s_x^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right). \]
Using this, with $\sum_{i=1}^{n} x_i^2 = 19.747016$,
\[ s_x^2 = \frac{1}{11} \left( \sum_{i=1}^{n} x_i^2 - 12(1.2825)^2 \right) = \frac{1}{11}(19.747016 - 19.737675) = 0.0008491818 \]
≈ 8.49 × 10^{-4} to three significant figures.
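The arithmetic for the mean and the variance can be double-checked with a few lines of Python. This is only a sketch: the variable names are made up, and it uses the twelve monthly values that are consistent with the totals quoted in the working (a sum of 15.39 and a sum of squares of 19.747016).

```python
import statistics

# The twelve monthly exchange rates (sum 15.39, as in the working).
rates = [1.306, 1.301, 1.290, 1.266, 1.290, 1.302,
         1.317, 1.304, 1.284, 1.268, 1.247, 1.215]

n = len(rates)
mean = sum(rates) / n
sum_of_squares = sum(x ** 2 for x in rates)

# Computational formula: (sum of squares - n * mean^2) / (n - 1).
variance_computational = (sum_of_squares - n * mean ** 2) / (n - 1)

# Definitional formula: squared deviations from the mean, divided by n - 1.
variance_definitional = sum((x - mean) ** 2 for x in rates) / (n - 1)

# Independent check using the standard library, which also divides by n - 1.
variance_library = statistics.variance(rates)

print(f"mean = {mean:.4f}")
print(f"sample variance = {variance_computational:.3e}")
```

All three variance values agree up to floating-point rounding, which is a quick way to confirm that the computational formula has been applied correctly.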
Question C2
Part (a): Prove the following computational formula for the sample covariance:
\[ s_{xy} = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \right) \]
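One possible route to this result, offered as a sketch rather than as the intended model answer, is to expand the brackets in the definition of $s_{xy}$ and use the facts that $\sum_{i=1}^{n} x_i = n\bar{x}$ and $\sum_{i=1}^{n} y_i = n\bar{y}$:

```latex
\begin{align*}
s_{xy} &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
       &= \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i
          - \bar{y} \sum_{i=1}^{n} x_i
          - \bar{x} \sum_{i=1}^{n} y_i
          + n\bar{x}\bar{y} \right) \\
       &= \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i
          - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} \right) \\
       &= \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} \right).
\end{align*}
```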
Part (b): Suppose that a dataset $x = (x_1, x_2, \dots, x_n)$ (with
$n \ge 2$) has a sample variance $s_x^2 = 0$. Show that all the
datapoints are in fact equal.
Given that $s_x^2 = 0$ and the definitional formula for the sample variance is
$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, we can write
\[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0. \]
Then, by multiplying both sides of the equation by $(n - 1)$, which is not zero because $n \ge 2$, we get
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0. \]
Now, if we take each term of the sum separately, every term $(x_i - \bar{x})^2$ is a square,
so it is greater than or equal to 0. A sum of terms that are all greater than or equal to 0
can only equal 0 if every one of those terms is 0. So $(x_i - \bar{x})^2 = 0$ for all $i$,
which means $x_i - \bar{x} = 0$, and hence $x_i = \bar{x}$ for all $i$.
Every datapoint is therefore equal to the same number, the mean $\bar{x}$.
Therefore, when a dataset x has sample variance $s_x^2 = 0$, all of its datapoints
$x_i$ are equal.
Part IV
R Worksheets
Part V