Intro To Statistics CH 1

MATH10282 Introduction to Statistics
Semester 2, 2012-13

Course Notes

Chapter 1 - Populations and Samples


Suppose that we have some specified population and we wish to say or infer
something about the (numerical) characteristics of certain variables measured
on each member of the population. One possibility may be to collect information
from each member of the population (i.e. conduct a census) to give
a reliable assessment of the information required. However, this may not be
feasible for one of the following reasons:
(i) Time and/or financial constraints may render this impractical.
(ii) It is not possible if the measurement process is destructive.
(iii) The population may be a conceptual one.
Some simple examples of populations are:
(i) All the adults in the UK who are eligible to vote, where information is
required on the proportions who support each of the main political
parties. Clearly, interviewing all voters is not feasible.
(ii) We have a particular type of electronic component and it is required
to know the length of time it might work for before failure. The actual
population here is conceptual in that it consists of all the possible components which could be made under essentially unchanged conditions.
The testing process is necessarily destructive in order to determine a
component's lifetime.
(iii) All full-time, working, adult males in Manchester for whom we would
like to know their mean gross income.
(iv) A particular measurement made during a laboratory experiment. Again,
the population is conceptual in that it comprises the (noncountable)
infinitely many measurements that would be obtained if the experiment
were to be carried out repeatedly under unchanged conditions.
The practical solution is first to take a sample, that is, to select a subset
of the population being studied. For the four examples above:
(i) Interview, say, 1000 voters and ask each person which political party
they support.
In an opinion poll conducted in February 2008 voters were asked: "If
there were a general election tomorrow which party do you think you
would vote for?" A summary of the responses is as follows:
Party               Party code   Number of supporters
Conservative        1            366
Labour              2            344
Liberal-Democrats   3            212
Other               4            78

(ii) Take, say, 50 such manufactured items from the production line and
determine the lifetime (in hours) of each of them. A sample of data
representing this scenario has been simulated and resulted in the following
frequency table, where each of the listed intervals is open on the
left and closed on the right:
Interval            Frequency   Percent
323.75 to 326.25    1           2
326.25 to 328.75    0           0
328.75 to 331.25    9           18
331.25 to 333.75    12          24
333.75 to 336.25    11          22
336.25 to 338.75    10          20
338.75 to 341.25    5           10
341.25 to 343.75    1           2
343.75 to 346.25    1           2
Totals              50          100

(iii) Obtain the gross income details (in units of a thousand pounds) of 500
adult males in Manchester who are working full-time. Such a sample
has been simulated and the data is summarized in the following table.
Again, each interval is open on the left and closed on the right.

Interval      Frequency   Percent
5 to 15       83          16.6
15 to 25      142         28.4
25 to 35      90          18.0
35 to 45      79          15.8
45 to 55      46          9.2
55 to 65      28          5.6
65 to 75      13          2.6
75 to 85      6           1.2
85 to 95      4           0.8
95 to 105     3           0.6
105 to 115    0           0.0
115 to 125    2           0.4
125 to 135    0           0.0
135 to 145    0           0.0
145 to 155    1           0.2
155 to 165    0           0.0
165 to 175    1           0.2
175 to 185    1           0.2
185 to 195    1           0.2
Totals        500         100.0

(iv) Repeat the laboratory experiment, say, 15 times under the same
conditions, and record the measurement of interest each time.
As we can see by the above examples, the nature of the data collected
can vary. In general, we have either qualitative or quantitative variables.
Qualitative variables are either nominal (such as the sex of a person or
the political party they support) or ordinal (such as the variable size with
categories small, medium and large). Quantitative variables are either
discrete (based on counting, for example) or continuous like the variables
income and lifetime in the above examples.
Sampling techniques used are probabilistic in nature in that members of
the population will be included in the sample with a certain probability so
that the actual composition of the final sample is random. If these samples
are representative of the populations from which they were drawn then the
information determined from them should enable us to say something about
the characteristics of the population as a whole. For example, in (i) above
we might estimate the unknown proportion of voters in the population who
support Labour by the proportion in the sample found to actually support
Labour. The subject of Statistics is concerned with studying methods for
drawing and measuring the reliability of such conclusions.
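As a small illustration of estimating a population proportion by the corresponding sample proportion, the following sketch uses the counts from the opinion poll table in example (i). The names `poll_counts` and `sample_proportions` are introduced here for illustration only and are not part of the course notes.

```python
# Counts taken from the February 2008 opinion poll table in example (i).
poll_counts = {
    "Conservative": 366,
    "Labour": 344,
    "Liberal-Democrats": 212,
    "Other": 78,
}


def sample_proportions(counts):
    """Return each category's share of the total sample size."""
    n = sum(counts.values())
    return {category: count / n for category, count in counts.items()}


proportions = sample_proportions(poll_counts)
for party, p_hat in proportions.items():
    # Each sample proportion serves as an estimate of the unknown
    # population proportion for that party.
    print(f"{party}: {p_hat:.3f}")
```

Here the estimate of the proportion of Labour supporters is simply 344 divided by the sample size of 1000.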
The most common sampling procedure is to select a simple random
sample for which each possible sample combination in the population has
an equal probability of being selected. This means that every element in
the population should have the same probability of occupying each position
in the sample, independently of which other members of the population are
chosen.
Formal definition: Let $X$ be a random variable with cdf $F_X(x)$. Let
$X_1, \ldots, X_n$ be $n$ independent random variables each having the same
distribution as $X$. We call $X_1, \ldots, X_n$ a random sample from the
distribution $F_X$.
More informally, a random sample of size $n$ from a distribution $F_X$
corresponds to $n$ repeated independent measurements on $X$ made under
essentially unchanged conditions. This mathematically idealized notion of a
random sample can usually only be approximated by actual experimental
conditions.
The random variable $X_i$ represents the numerical value that the $i$th
member of the sample will assume. After a particular sample is observed the
actual values of $X_1, \ldots, X_n$ are known and denoted by $x_1, \ldots, x_n$.
Now let $X_1, \ldots, X_n$ be $n$ independent random variables each having the
same distribution with cdf $F_X$. $X_1, \ldots, X_n$ are said to be independent if and
only if

$$F_{(X_1,\ldots,X_n)}(x_1, \ldots, x_n) = P(X_1 \le x_1, \ldots, X_n \le x_n) = \prod_{i=1}^{n} F_X(x_i)$$

for all possible $\{x_1, \ldots, x_n\} \in R_{(X_1,\ldots,X_n)}$ and where $F_X$ denotes the common
cdf of each $X_i$.
In the discrete case, the joint pmf of $X_1, \ldots, X_n$ under independence is

$$p_{(X_1,\ldots,X_n)}(x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} p_X(x_i)$$

for all possible $\{x_1, \ldots, x_n\} \in R_{(X_1,\ldots,X_n)}$ and where $p_X$ denotes the common
pmf of each $X_i$.
We have an analogous result in the continuous case.
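For reference, the continuous analogue can be sketched as follows. The symbol $f_X$ for the common pdf of each $X_i$ is notation assumed here; the notes do not write this case out.

```latex
f_{(X_1,\ldots,X_n)}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i)
\quad \text{for all possible } \{x_1, \ldots, x_n\} \in R_{(X_1,\ldots,X_n)}
```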
Suppose that the population is finite of size $N$ and that we wish to obtain
a random sample of size $n$. Random sampling with replacement involves
at each draw selecting an item and then returning it to the population. Hence,
at any of the $n$ draws all $N$ members of the population have an equal chance
of being selected, no matter how often they have already been selected, and
there are $N^n$ possible samples. In this scenario the $X_i$s are independent
and identically distributed.
Random sampling without replacement
requires that each of the $\binom{N}{n}$ possible samples has an equal probability of
being selected. Such a sample is usually drawn one-at-a-time and at each
draw each item left in the population has an equal chance of being selected.
The $X_i$s in this case are actually dependent but the level of dependence is
very small provided $N$ is large. In fact, when $N \gg n$ they can be regarded
as being essentially independent.
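The two sampling schemes can be sketched with Python's standard `random` module. The population of labels 1 to 20, the sample size of 5 and the seed are arbitrary choices made for this illustration and do not come from the notes.

```python
import random


def sample_with_replacement(population, n, rng):
    """Draw n items; each draw picks uniformly from the whole population."""
    return [rng.choice(population) for _ in range(n)]


def sample_without_replacement(population, n, rng):
    """Draw n distinct items; a selected item is not returned to the pool."""
    return rng.sample(population, n)


rng = random.Random(0)            # fixed seed so the run is repeatable
population = list(range(1, 21))   # hypothetical population with N = 20

with_repl = sample_with_replacement(population, 5, rng)
without_repl = sample_without_replacement(population, 5, rng)

print("with replacement:   ", with_repl)     # repeated items are possible
print("without replacement:", without_repl)  # items are always distinct
```

Only the first scheme can return the same member twice, which is why the draws are independent there but not in the second scheme.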
For illustrative purposes, consider the very simplistic scenario where we
have a population of size N = 3 consisting of three items A, B, C taking
values 1, 2, 3, respectively. We want to select a random sample of size n = 2
from this population using random sampling with replacement and then
calculate the sample mean, which will be the average of the two selected items.
This means that, at each draw, each of A, B and C has an equal probability
(of 1/3) of being selected. There are thus 9 possible, equally likely samples
(written in order of selection), which are:
{A, B}, {A, C}, {B, C}, {B, A}, {C, A}, {C, B}, {A, A}, {B, B}, {C, C}
We can then define sample variables to be Xj = the measured value associated with the jth member of the sample for j = 1, 2. This leads to 9 possible
pairs of data values which, in the corresponding order as above, are:
{1, 2}, {1, 3}, {2, 3}, {2, 1}, {3, 1}, {3, 2}, {1, 1}, {2, 2}, {3, 3}
Under this scheme, A, B, and C each have probability 1/3 of being the
first member of the sample and probability 1/3 of being the second member
of the sample. Consequently, their respective values 1, 2 and 3 also each have
probability 1/3 of being the first member of the sample and of being the
second member of the sample.
The set of all possible sample means is
{1.5, 2.0, 2.5, 1.5, 2.0, 2.5, 1.0, 2.0, 3.0}
Note that the mean of the 9 possible sample means listed above is 18.0/9 =
2.0 which is the same as the population mean calculated as $\mu$ = (1+2+3)/3 =
2.0. This is an important property of random samples in that the mean of the
sample means over all possible samples of size n from the population is equal
to the population mean.
The population variance is $\sigma^2 = \frac{14.0}{3} - 2.0^2 = 0.667$ whereas the variance
of the above set of means is $\frac{39.0}{9} - 2.0^2 = 0.333$, which can be shown to be equal
to $\sigma^2/n = 0.667/2 = 0.333$.
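The with-replacement calculations above can be checked by enumerating all 9 ordered samples. This is a minimal sketch using exact fractions; the helper names `mean` and `variance` are introduced here for illustration, and `variance` divides by the number of values, as in the notes.

```python
from fractions import Fraction
from itertools import product

values = [1, 2, 3]   # values of items A, B, C
n = 2                # sample size


def mean(xs):
    """Arithmetic mean as an exact fraction."""
    return sum(xs, Fraction(0)) / len(xs)


def variance(xs):
    """Mean squared deviation from the mean (divisor is len(xs))."""
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])


# All ordered samples of size n drawn with replacement: N**n = 9 of them.
samples = list(product(values, repeat=n))
sample_means = [mean(s) for s in samples]

print(len(samples))                              # number of possible samples
print(mean(sample_means), mean(values))          # both equal the population mean
print(variance(sample_means), variance(values) / n)  # both equal sigma^2 / n
```

Using `Fraction` avoids rounding, so the equality between the variance of the sample means and the population variance divided by n holds exactly (1/3 in both cases).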
If we now use random sampling without replacement to select samples of
size two then the set of possible samples is
{1, 2}, {1, 3}, {2, 3}
with corresponding sample means
1.5, 2.0, 2.5.
The mean of these sample means is 6.0/3 = 2.0 which is again equal to the
population mean. The variance of these sample means is $\frac{12.5}{3} - 2.0^2 = 0.1667$,
which can be shown to be related to the population variance by the formula
$\frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}$. When sampling without replacement, the variance of the sampling
distribution of sample means is smaller by the factor $\frac{N-n}{N-1}$ than in the case
when we are sampling with replacement.
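The without-replacement case can be checked in the same way by enumerating the 3 possible unordered samples. As before, this is only a sketch with illustrative helper names, using exact fractions and a variance with divisor equal to the number of values.

```python
from fractions import Fraction
from itertools import combinations

values = [1, 2, 3]   # values of items A, B, C
N = len(values)      # population size
n = 2                # sample size


def mean(xs):
    """Arithmetic mean as an exact fraction."""
    return sum(xs, Fraction(0)) / len(xs)


def variance(xs):
    """Mean squared deviation from the mean (divisor is len(xs))."""
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])


# All unordered samples of size n drawn without replacement.
samples = list(combinations(values, n))
sample_means = [mean(s) for s in samples]

# Variance predicted by (sigma^2 / n) * (N - n) / (N - 1).
predicted = variance(values) / n * Fraction(N - n, N - 1)

print(samples)                             # the three possible samples
print(mean(sample_means))                  # equals the population mean
print(variance(sample_means), predicted)   # both equal 1/6
```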
We can select a random sample from any size population by making use of
random numbers. A sequence of random numbers is a collection of digits
(0 to 9) such that each one is equally likely to occur in any one position,
independently of all the others. To select a random sample:
(i) Number the individual members of the target population (if finite).
Sample by time and/or space if this is not possible.
(ii) Obtain a random number, for example from a published table of random numbers or by using a routine programmed on a computer and
then identify the corresponding member of the population.
(iii) Repeat step (ii) n times to obtain a random sample of size n.
For example, if N = 8500 then label the population members 0001 to
8500 and select 4-digit random numbers (with or without replacement, as
appropriate), ignoring 0000 and 8501 to 9999.
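The labelling procedure for N = 8500 can be sketched as follows, with a pseudo-random number generator standing in for a published table of random numbers. The function name `draw_labels`, its parameters and the seed are assumptions made for this illustration.

```python
import random


def draw_labels(N, n, replace, rng):
    """Select n labels from 1..N using fixed-width random numbers.

    Numbers are generated with as many digits as N has (4 digits for
    N = 8500) and any number that is 0 or exceeds N is ignored.
    """
    if not replace and n > N:
        raise ValueError("cannot draw more than N labels without replacement")
    upper = 10 ** len(str(N))            # 10000 when N = 8500
    chosen = []
    while len(chosen) < n:
        number = rng.randrange(upper)    # a random number in 0000..9999
        if number == 0 or number > N:
            continue                     # ignore 0000 and 8501 to 9999
        if not replace and number in chosen:
            continue                     # skip labels already selected
        chosen.append(number)
    return chosen


rng = random.Random(1)                   # fixed seed so the run is repeatable
labels = draw_labels(8500, 10, replace=False, rng=rng)
print([f"{label:04d}" for label in labels])
```

Discarding out-of-range numbers, rather than rescaling them, keeps every label from 0001 to 8500 equally likely at each draw.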
For the four examples above we can obtain random samples as follows:
(i) Use random numbers to sample individuals by firstly location (e.g.
town) and then time or order of appearance until we have a sample
of n = 1000.
(ii) Use random numbers to select n = 50 items from the production line.
(iii) Enumerate the tax records of all full-time working adult males in
Manchester and then use random numbers to select a sample of n = 500.
(iv) If we repeat a particular experiment n = 15 times then we can hope
that the results represent, to a reasonable degree of approximation, a
random sample.
In the above examples we are trying to do our best from a practical and
feasibility point of view to obtain a truly random sample. Sometimes, as in
(iv), we have to compromise but hope that the data we do obtain can be
regarded as representing a random sample.
