
STA 201 (STATISTICS FOR PHYSICAL SCIENCES AND ENGINEERING)

LECTURE NOTE
INTRODUCTION
Definition
Statistics is the science of learning from experience, especially experience that
arrives a little bit at a time. This century has seen statistical techniques become the
analytic methods of choice in biomedical science, genetic studies, epidemiology,
agricultural science and other areas. Statistics implies both statistical data and
statistical methods. When it means statistical data, it refers to numerical descriptions of
quantitative aspects of things. These descriptions may be in the form of counts or
measurements. Thus, statistics of the students of a faculty of science include counts of the
number of students, such as males and females, married and unmarried, or
postgraduates and undergraduates. They may also include such measurements as
their heights, weights and IQs (Intelligence Quotients). Statistics is broadly divided into
descriptive and inferential statistics.
Population, Sample and Model
The data in medical, biomedical, nutritional or agricultural studies are generally
based on individual observations. They are observations or measurements taken on
the smallest sampling unit. These smallest sampling units, frequently but not
necessarily, are also individuals in the biological sense.
Population: The population or universe is well defined in the science of statistics. Though
the biological definition of the term “population” is the totality of individuals of a given
species in a given area at a given time, population in statistics always means the
totality of the individual observations about which inferences are to be made. A
population may refer to variables of a concrete collection of objects or creatures, such
as the weights or tail lengths of all albino rats, anthropometric measurements and
haemoglobin or serum protein levels of adults, or the nutrient contents of varieties of
foods.
Samples: A sample is a part of the population. A large number of samples may be taken
from the same population, and still all members may not be covered. Inferences drawn
from a sample refer to the defined population from which the sample or samples are
drawn.
The drop of blood examined in the laboratory is a sample from the “population” of all
blood in the body.
Model: Inductive inference is based on the assumption that the values in the
population under study are scattered according to a certain pattern. This pattern is
modeled by a probability distribution.
For example, “the height of students in the faculty of science, Olabisi Onabanjo
University follow the normal distribution with mean 120cm and standard deviation
10cm” is the specification of the probability (or stochastic) model.
Fitting of a probability model to the values of a certain population is done by
specifying the probability distribution of the underlying random variable. The
following purposes are the reasons for fitting a probability model.
a. It may be used to describe the population,
b. It may be used to predict some future value.
c. Usually a probability model is fitted as a first step towards choosing one
among a set of several possible actions.
d. Sometimes the value of the parameter may be of independent interest.
Data Collection
Data form the bedrock on which statistical analysis mostly relies. Data collection is an
activity aimed at getting information to satisfy some decisions or objectives. The
process of collecting data varies and depends upon the kind of data to be collected.
Sources of Data
Basically, there are two major sources of data, namely primary and secondary
sources of data collection.
Primary Sources
This refers to the statistical data or information which the investigator originates
himself for the purpose of the enquiry at hand. Examples are census, surveys and
experiments
Advantages
i. It allows detailed and accurate information to be collected.
ii. It is more reliable
iii. The method of data collection and the level of accuracy are known.
Disadvantages
i. Often time consuming.
ii. More expensive.
Secondary sources
This refers to those statistical data which are not originated by the investigator
himself, but which he obtains from someone else’s records or from some
organization, either in published or unpublished forms. Examples include
publications of the Federal Office of Statistics (FOS), Central Bank of Nigeria (CBN),
National Population Commission (NPC) World Health Organization (WHO), etc.
Advantages
i. Not expensive
ii. Not time consuming
iii. Very easy to collect, especially in a computerized organization.
Disadvantages
i. The information may be misleading.
ii. It may not allow detailed and accurate information to be collected.
Method of Data Collection
Three methods of collecting data are:
i. Postal questionnaires
ii. Personal interviews
iii. Telephone interviews
Questionnaires
A questionnaire contains a sequence of questions relevant to the data or information
being sought. It is a set of formal questions prepared to be answered by the
respondent. Questionnaires usually have two parts: part one is the classification
section, which contains such details of the respondent as sex, age, marital status,
occupation, state of origin, etc. The second part relates to the subject matter of the
enquiry.
Types of questionnaire
a. Closed-End Questionnaire: this is a questionnaire designed in such a way that
respondents are limited to stated alternatives or options, thereby not permitting
further or additional explanation; it is also called a structured questionnaire.
b. Open-End Questionnaire: This is an unstructured questionnaire design which leaves
the respondents free to make whatever reply they choose, that is, the
respondents are not in any way restricted to options.

Quality of a Good Questionnaire


1. Questions should be simple and easily understood.
2. They should be in a logical sequence.
3. They should be short and unambiguous.
4. Questions should not offend, frighten or lead the respondent.
5. Questions that may arouse the resentment of the respondents should be
avoided.
6. Questions should not require calculations to be made.
7. Questions should admit a precise answer like “Yes” or “No”.
8. Questions that rely too much on memory should be avoided, since some people
forget events too soon.
Editing
This is a way of checking the answered questionnaires to correct some of the
mistakes. The returned questionnaires filled by the informants or by enumerators
should be scrutinized at an early stage with a view to detecting errors, omissions and
inconsistencies. The work of editing requires skill and scientific impartiality of a high
degree. Four types of editing are: editing for consistency, uniformity,
completeness and accuracy.
Coding
The responses in the edited questionnaire are then translated into
numerical terms in order to facilitate analysis. This is done by setting out a list of
codes for the possible responses to the questions.
Tabulation and Classification
This is an act of arranging facts and figures in the form of table(s) or list. In
order to make the data easily understandable, the first task of the statistician is to
condense and simplify them in such a manner that irrelevant details are eliminated
and their significant features stand out prominently. The procedure that is adopted
for this purpose is known as the method of classification and tabulation.
Data Presentation
It is the representation of data in appropriate form in order to make the
comparison and understanding easy through charts, diagram or graph. No matter
how informative and well designed a statistical table is, as a medium for conveying to
the reader an immediate and clear impression of its content, it is inferior to a good
chart, diagram or graph. The most popular charts, diagrams and graphs are, pie
charts, bar diagrams (bar chart and histogram) and graphs (frequency polygons and
Ogives).
Pie charts
A pie chart is simply a circle divided into sections. This circle represents the
total of the data being presented and each section is drawn proportional to its relative
size. The main advantage of a pie chart is that it is easy to understand.
Example
An investigation of the marital status of the staff of an institution reveals the
following:
Marital status No of staff
Single 35
Married 130
Widowed 25
Divorced 10

Draw a pie chart using the above information.


Solution
Total no of staff in the institution is
35 + 130 + 25 + 10 = 200
Angles corresponding to each status are found thus:

    Single   = (35/200) × 360° = 63°
    Married  = (130/200) × 360° = 234°
    Widowed  = (25/200) × 360° = 45°
    Divorced = (10/200) × 360° = 18°
Thus, the pie chart is:
[Pie chart: Total Number of Staff in the Institution — Single 63° (17%), Married 234° (65%), Widowed 45° (13%), Divorced 18° (5%)]
Observation: the chart clearly shows that the majority of the staff in the institution are
married.
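The sector angles above can be checked with a short Python sketch; each angle is the category's share of the total multiplied by 360 degrees (counts taken from the table above):

```python
# Sector angles for the pie chart of staff marital status.
counts = {"Single": 35, "Married": 130, "Widowed": 25, "Divorced": 10}
total = sum(counts.values())  # 200 staff in all

# sector angle = (count / total) * 360 degrees
angles = {status: n * 360 / total for status, n in counts.items()}
for status, angle in angles.items():
    print(f"{status}: {angle:.0f} degrees")
```

The four angles always add up to 360°, which is a useful check on the arithmetic.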
Bar Charts
Bar charts may be simple, multiple or component in nature. A simple bar chart
comprises a number of equally spaced rectangles.
A multiple bar chart is usually used in the comparison of two or more attributes.
A component bar chart comprises bars which are subdivided into components.
Example
Represent the data used above in a bar chart.
Solution:
[Bar chart of the number of staff by marital status: Single, Married, Widowed, Divorced]
Example
The sex distribution of staff in five departments of the faculty of science is given
below:

S/No   Department             Male   Female   Total
i.     Chemical science        25      15       40
ii.    Mathematical science    65      30       95
iii.   Biological science      45      40       85
iv.    Physics                 35      15       50
v.     Earth science           30      10       40
       Total                  200     110      310
Present the above information on a
i.) Multiple Bar Chart
ii.) Component Bar Chart.

Solution:
i.) [Multiple bar chart comparing male and female staff numbers across the five departments]

ii.) [Component bar chart showing each department's total staff subdivided into male and female]

HISTOGRAMS
Histograms and bar charts look alike in presentation, but while the bars of a bar
chart are usually not joined, those of a histogram are usually joined. Furthermore,
while the bar chart attaches importance only to the heights of its bars, the histogram attaches
importance to both the heights and the widths.
Example
Obtain the histogram of the data in example above
Solution
[Histogram of the number of staff by marital status: Single, Married, Widowed, Divorced]
Descriptive Statistics
Statistics is concerned with variability. It is of interest to know how to describe it,
how to measure it, and how to reach sensible conclusions from the results of
experiments and comparative studies. Descriptive statistics deals with the classification
of data and the drawing of histograms, diagrams and graphs, such as line graphs, bar graphs
and pictograms, that correspond to the frequency distribution that results after the data
have been classified. It also includes the computation of sample means, medians and modes,
and the computation of ranges, mean absolute deviations and variances.
Variable, Variation and Distribution
The results of an experiment or comparative study can always be presented as a set
of measurements on each of a group of units. For example, the units may be animals
of a particular species, patients with a particular disease, or families living in a
particular housing estate. A general term for any feature of the unit which is
observed or measured is a variable. Thus, the weight of an animal and the presence or
absence of a symptom in a patient are variables.
Variations
There are two main types of variation: one is variation between units and the other
is variation within units. Variation between units is universal in any scientific
investigation. Variation within units is seen when observations are made over a
period of time. Variation is best described by the relative frequencies with which
different observed values occur.
Distribution
The variation between observations is best described by distributions. The way in
which the relative frequencies of the observed values of a variable are displayed
depends to some extent on the scale on which the variable takes its values. Variables
can be on a qualitative scale, consisting of values like red, black or white, or the
presence or absence of a disease. Qualitative variables are also called attributes.
These variables are not capable of being described numerically. Examples are sex,
religion, nationality, colour of the eye or skin, etc. These characteristics are called
“attributes”, “attributive variates” or “descriptive characteristics”.
The second type of variable comprises those taking values on a quantitative scale, for which a
comparison of magnitude is involved. Examples of quantitative variables are height,
weight, haemoglobin, and the calorie and nutrient content of foods.
Measure of Central Tendency
Classification and tabulation of data are helpful in reducing and understanding the
bulk of a large mass of data, but they are only descriptive. To be more precise, the data
should be expressed in numerical terms. So the need arises to find a constant which
will be representative of a group of data. This is a measure of how the data are
centrally placed; it is also called a measure of location. There are three possible
measures of location, namely the mean, the median and the mode. The mean can further
be divided into three types, namely the arithmetic, geometric and harmonic means. By
careful observation of data, it can be noticed that the observations tend to cluster
around a central value. This is called the central tendency of that group, and this central value
is known as an average.
Essentials of a Good Average
Since an ‘average’ is to represent the statistical data and is also used for purposes of
comparison, it must possess the following properties:
i. It must be rigidly defined, and not left to the mere estimation of the
observer.
ii. It must be based on all values given in the distribution.
iii. It should be easily understandable.
iv. It should be capable of being calculated with reasonable ease and rapidity.
v. It should be as little affected as possible by fluctuations of sampling.
vi. It should lend itself readily to algebraic treatment.
The Arithmetical Mean
The arithmetic mean of a series is obtained by adding the values of all observations
and dividing the total by the number of observations. It is the most commonly used
measure. In symbols, if x1, x2, …, xn are n observed values, then the mean is given by:

    x̄ = (total of all individual values) / (sample size) = (x1 + x2 + … + xn) / n = Σxi / n

Example: The gains in weight of 6 albino rats over a period of 5 days are 5, 6, 4, 4, 4, 7.
The arithmetic mean or mean is

    x̄ = (5 + 6 + 4 + 4 + 4 + 7) / 6 = 30 / 6 = 5.0
Mean of Grouped Data
Three methods of calculation are: the long method, the assumed mean method and the
coding method.
Long method

    x̄ = Σfx / Σf

Assumed mean method

    x̄ = A + Σfd / Σf

Where
A = a guessed or assumed mean
d = x − A are the deviations from the assumed mean.
Coding method

    x̄ = A + (Σfu / Σf) c

Where
A is an appropriately chosen x value
c is the common class size
u = …, −3, −2, −1, 0, 1, 2, 3, …
Example 2.2
The weights in kg of a collection of 40 students in the faculty of science of O.O.U. are
given below:
59, 53, 66, 55, 57, 65, 48, 59, 51, 58, 52, 68, 60, 70, 71, 55, 70, 64, 54, 67, 62, 53, 49,
56, 63, 48, 57, 61, 58, 55, 50, 55, 61, 52, 54, 65, 56, 50, 62, 60
Calculate the mean using:
a. The long method
b. The assumed mean 61
c. The coding method
Solution:

Weights (kg)    f     x     fx    d = x − 61    fd    u    fu
48 – 52         8    50    400       −11       −88   −2   −16
53 – 57        12    55    660        −6       −72   −1   −12
58 – 62        10    60    600        −1       −10    0     0
63 – 67         6    65    390         4        24    1     6
68 – 72         4    70    280         9        36    2     8
Total          40         2330                 −110        −14
a. Long method

    x̄ = Σfx / Σf = 2330 / 40 = 58.25

b. Assumed mean method with A = 61

    x̄ = A + Σfd / Σf = 61 + (−110 / 40) = 61 − 2.75 = 58.25

c. Coding method

    x̄ = A + (Σfu / Σf) c

Here c = 5, and A is the value of x corresponding to u = 0; for an odd number of classes
we choose u = 0 at the centre, so A = 60. Thus,

    x̄ = 60 + (−14 / 40) × 5 = 60 − 1.75 = 58.25
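The three grouped-mean calculations above can be reproduced with a short Python sketch, using the frequencies and midpoints from the table:

```python
# Grouped mean by the long, assumed-mean and coding methods,
# using the frequency table above (midpoints 50, 55, 60, 65, 70).
f = [8, 12, 10, 6, 4]       # class frequencies
x = [50, 55, 60, 65, 70]    # class midpoints
n = sum(f)                  # 40

# Long method: x-bar = sum(f*x) / sum(f)
long_mean = sum(fi * xi for fi, xi in zip(f, x)) / n

# Assumed-mean method with A = 61: x-bar = A + sum(f*d)/sum(f), d = x - A
A = 61
assumed_mean = A + sum(fi * (xi - A) for fi, xi in zip(f, x)) / n

# Coding method with A = 60 (the midpoint where u = 0) and class size c = 5
A2, c = 60, 5
u = [(xi - A2) // c for xi in x]    # -2, -1, 0, 1, 2
coded_mean = A2 + sum(fi * ui for fi, ui in zip(f, u)) * c / n

print(long_mean, assumed_mean, coded_mean)  # 58.25 58.25 58.25
```

All three methods are algebraically identical, so they must agree exactly; the assumed-mean and coding methods simply shrink the numbers involved in the hand calculation.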
The Geometric Mean (GM):
If the observations, instead of being added, are multiplied, the geometric mean
is the nth root of the product. In algebraic symbols, the geometric mean of n
observations x1, x2, x3, …, xn is given by the formula:

    GM = (x1 × x2 × x3 × … × xn)^(1/n)

For its logarithmic calculation, the relationship used is

    log GM = (log x1 + log x2 + log x3 + … + log xn) / n = Σ log xi / n

i.e. the simple arithmetic mean of the logarithms of the individual values. The
anti-logarithm of this mean log is the geometric mean. The geometric mean
is preferable to the arithmetic mean if the series of observations contains one or more
unusually large values.
Example
The intakes of baby milk food observed in fifteen children in one day are provided
below:
101 114 109 135 122
184 196 185 217 198
148 233 227 336 253
Calculate the geometric mean.
Solution:
n = 15

    GM = (101 × 114 × 109 × … × 253)^(1/15)

    log GM = (log 101 + log 114 + … + log 253) / 15 = 33.5959 / 15 = 2.2397

The anti-logarithm of 2.2397 is 173.7, so the geometric mean is GM = 173.7.
Harmonic Mean
It is the reciprocal of the arithmetic mean of the reciprocals of the observations. For
individual values x1, x2, …, xn, the harmonic mean (HM) is

    HM = 1 / [ (1/n) Σ (1/xi) ] = n / [ (1/x1) + (1/x2) + … + (1/xn) ]

Example
For the numerical values of 1, 2, 3, 4, 5, calculate and compare the AM, GM and HM
Solution

    Arithmetic mean (AM) = x̄ = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3.0

    Geometric mean (GM) = (1 × 2 × 3 × 4 × 5)^(1/5) = (120)^(1/5) = 2.605

With logarithms, the calculation of GM is:

    log GM = (1/5) log(1 × 2 × 3 × 4 × 5) = (1/5)(log 1 + log 2 + log 3 + log 4 + log 5)
           = (1/5)(0 + 0.30103 + 0.47712 + 0.60206 + 0.69897)
           = (1/5)(2.07918) = 0.415836

    GM = anti-log(0.415836) = 2.60517 ≈ 2.605

    Harmonic mean (HM) = 5 / (1/1 + 1/2 + 1/3 + 1/4 + 1/5) = 5 / 2.2833 = 2.19

Therefore, the AM is the highest, followed by the GM and then the HM.
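A minimal Python sketch of the same comparison, using only the standard library:

```python
# AM, GM and HM of the values 1..5; for any positive data, AM >= GM >= HM.
import math

data = [1, 2, 3, 4, 5]
n = len(data)

am = sum(data) / n                      # arithmetic mean
gm = math.prod(data) ** (1 / n)         # nth root of the product
hm = n / sum(1 / xi for xi in data)     # reciprocal of the mean reciprocal

print(round(am, 3), round(gm, 3), round(hm, 2))  # 3.0 2.605 2.19
```

The ordering AM ≥ GM ≥ HM is the classical mean inequality and holds for every set of positive values, with equality only when all values are identical.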


The Mode
This is the value or number that has the highest frequency in a distribution. The mode
may not exist and even when it does exist, it may not be unique.
For example:
5, 2, 4, 7, 5, 3; has mode 5 (unimodal)
2, 6, 3, 4, 3, 2, 5 has two modes 2 and 3 (bimodal)
4, 7, 2, 1, 3 has no mode
The mode can be obtained both graphically and by calculation. For grouped data, we
use the histogram to estimate the mode, while by calculation we use the formula:

    Mode = L + [ (fm − fa) / (2fm − fa − fb) ] C

Where
L = lower class boundary of the modal class
fm = frequency of the modal class
fa = frequency of the class above the modal class
fb = frequency of the class below the modal class
C = size of the modal class interval.
THE MEDIAN
If a set of data is arranged in order of magnitude, the middle value, which divides the
set into two equal groups, is the median. Generally, for N data values,

    Median = the ((N + 1)/2)th item

For example, find the median of the following sets of data:
a. 3, 6, 2, 4, 3
b. 2, 5, 3, 4, 8, 3
Solution
a. Arranged in order: 2, 3, 3, 4, 6
Here N = 5

    Median = the ((5 + 1)/2)th item = the 3rd item = 3

b. Arranged in order: 2, 3, 3, 4, 5, 8
Here N = 6

    Median = the ((6 + 1)/2)th item = the 3.5th item

This is interpreted as

    (3rd item + 4th item) / 2 = (3 + 4) / 2 = 3.5
Median of Grouped Data
The median can be obtained graphically from the cumulative frequency curve (ogive)
or by calculation using the formula:

    Median = L + [ (N/2 − F) / f ] C

Where
L = value of the lower class boundary of the median class
F = cumulative frequency of the class just above the one containing the median
f = frequency of the median class
C = size of the median class interval
Example: Using the data given in example above
i. Construct the histogram and from it estimate the mode of the distribution.
ii. Calculate the mode and compare your answer with the estimated value in
(i) above
iii. Construct the cumulative frequency curve and from it estimate the median.
iv. Calculate the median and compare your results.
Solution
i. [Histogram of the weight distribution, with class boundaries 47.5, 52.5, 57.5, 62.5, 67.5, 72.5 on the horizontal axis]

The mode is approximately 56.


ii. Mode = L + [ (fm − fa) / (2fm − fa − fb) ] C

The modal class is 53 – 57.
Hence, L = 52.5, fm = 12, fa = 8, fb = 10 and C = 5.
Thus,

    Mode = 52.5 + [ (12 − 8) / (2(12) − 8 − 10) ] × 5
         = 52.5 + 3.33
         = 55.83

Comparison: graphical value = 56; calculated value = 55.83. These values agree approximately.
iii.
Weight (kg)    Frequency (f)    Cumulative frequency
48 – 52             8                   8
53 – 57            12                  20
58 – 62            10                  30
63 – 67             6                  36
68 – 72             4                  40
Total              40

[Cumulative frequency curve (ogive), plotted against the class boundaries 47.5, 52.5, 57.5, 62.5, 67.5, 72.5]

Estimated median ≈ 57.5
iv. Median = L + [ (N/2 − F) / f ] C

N/2 = 40/2 = 20, i.e. the median is the 20th value. From the cumulative frequency
distribution table, the 20th item falls within the class 53 – 57. Thus the median class is
53 – 57; hence L = 52.5, F = 8, f = 12 and C = 5.

    Median = 52.5 + [ (20 − 8) / 12 ] × 5 = 57.5

Comparison: the graphical and calculated values are equal.
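The grouped mode and median formulas above can be sketched in Python for the same weight distribution (boundaries and frequencies taken from the table; the sketch assumes the modal class is not the first or last class):

```python
# Grouped mode and median for the weight distribution above
# (classes 48-52, ..., 68-72 with frequencies 8, 12, 10, 6, 4).
freqs = [8, 12, 10, 6, 4]
lower = [47.5, 52.5, 57.5, 62.5, 67.5]   # lower class boundaries
c = 5                                    # class width
n = sum(freqs)                           # 40

# Mode = L + (fm - fa) / (2*fm - fa - fb) * c, taken in the modal class
# (assumes the modal class is an interior class)
m = freqs.index(max(freqs))              # index of the modal class
fm, fa, fb = freqs[m], freqs[m - 1], freqs[m + 1]
mode = lower[m] + (fm - fa) / (2 * fm - fa - fb) * c

# Median = L + (n/2 - F) / f * c, taken in the class holding the n/2-th value
cum, k = 0, 0
while cum + freqs[k] < n / 2:            # find the median class
    cum += freqs[k]
    k += 1
median = lower[k] + (n / 2 - cum) / freqs[k] * c

print(round(mode, 2), median)  # 55.83 57.5
```

Both results match the hand calculation in the worked example.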
Measures of Variation and Dispersion
While studying a frequency distribution of a variable, it is important to know how the
frequencies are clustered around or scattered away from the measures of averages
or central tendency. Two distributions may centre around the same point i.e.
arithmetic means, but differ in variation from arithmetic mean. Such variation is
called dispersion, spread or variability. The degree to which numerical data tend to
spread about an average value is called the variation or dispersion of the data. Various
measures of variation are, range, quartile deviation, mean deviation, standard
deviation, variance and standard error.
i. The Range
The range is the difference between the largest and smallest items of the sample of
observations. For the sample of observations 5, 6, 7, 8 and 9, the range is 9 − 5 = 4,
i.e. maximum value − minimum value; it depends only on the two extreme values.
ii. Quartile Deviation
The quartile deviation or semi-interquartile range Q is given by the formula

    Q = (Q3 − Q1) / 2

Where Q1 and Q3 are the first and third quartiles respectively. The quartile deviation is
better than the range, since it is calculated using the first and third quartile values.

iii. Mean Deviation
The mean deviation is the arithmetic mean of the absolute values of the
deviations from some average like the mean, median or mode.

    Mean deviation = Σ fi |xi − x̄| / N    for grouped data

    Mean deviation = Σ |xi − x̄| / N       for ungrouped data

Where
fi = the frequency of the ith class interval
xi = the ith mid-value of the class interval, or the ith individual value
x̄ = the arithmetic mean
N = the number of observations, or N = Σfi
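As a small illustration (the data set is ours, not from the text), the mean deviation of the ungrouped values 5, 6, 7, 8, 9 about their arithmetic mean is:

```python
# Mean deviation of 5, 6, 7, 8, 9 about the arithmetic mean.
data = [5, 6, 7, 8, 9]
n = len(data)
mean = sum(data) / n                          # 7.0
md = sum(abs(x - mean) for x in data) / n     # (2 + 1 + 0 + 1 + 2) / 5
print(md)  # 1.2
```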

iv. Standard Deviation
This is the most commonly used measure of variation or dispersion. It takes into
account all the values of the variable. The standard deviation is defined as the square root
of the arithmetic mean of the squared deviations of the individual values from their
arithmetic mean. The formula for large samples is

    SD² = (1/n) Σ (xi − x̄)²

Where
xi = the ith individual value
x̄ = the arithmetic mean
n = sample size

For small samples, the formula is

    SD² = (1/(n − 1)) Σ (xi − x̄)² = (1/(n − 1)) [SS − CF]

Where
SS = sum of squares = Σ xi²
CF = correction factor = (Σ xi)² / n

For grouped data the formulae are

    SD² = (1/(n − 1)) Σ fi (xi − x̄)²

    SD = √[ (1/(n − 1)) Σ fi (xi − x̄)² ] = √[ (1/(n − 1)) ( Σ fi xi² − (Σ fi xi)² / n ) ]

Where
fi = the frequency of the ith class interval
xi = the mid-value of the ith class interval
x̄ = the arithmetic mean
n = sample size
Example
Ungrouped data: for the values 5, 6, 7, 8, 9,

    AM = x̄ = 35 / 5 = 7

    SS = Σ xi² = 5² + 6² + 7² + 8² + 9² = 255

    CF = (Σ xi)² / n = 35² / 5 = 245

    SD = √[ (SS − CF) / (n − 1) ] = √[ (255 − 245) / 4 ] = 1.58
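The same sum-of-squares / correction-factor calculation in Python:

```python
# Small-sample SD via sum of squares and correction factor,
# for the values 5, 6, 7, 8, 9.
import math

data = [5, 6, 7, 8, 9]
n = len(data)

ss = sum(x ** 2 for x in data)           # sum of squares = 255
cf = sum(data) ** 2 / n                  # correction factor = 245
sd = math.sqrt((ss - cf) / (n - 1))      # sqrt(10 / 4)

print(round(sd, 2))  # 1.58
```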

Grouped data, using the frequency distribution of weights (kg) of 30 adults below:

Class interval of weights (kg)   Mid-value xi   Frequency (fi)   Cumulative frequency   fi xi
45 – 50                              47.5             2                  2               95.0
50 – 55                              52.5             3                  5              157.5
55 – 60                              57.5             6                 11              345.0
60 – 65                              62.5             4                 15              250.0
65 – 70                              67.5             6                 21              405.0
70 – 75                              72.5             4                 25              290.0
75 – 80                              77.5             5                 30              387.5
Total                                               30                                1930.0

N = Σfi = 30 and Σ fi xi = 1930

    x̄ = Σ fi xi / Σ fi = 1930 / 30 = 64.33

    Σ fi xi² = 126637.50

    (Σ fi xi)² / n = 3724900 / 30 = 124163.33

    SD = √[ (1/(n − 1)) ( Σ fi xi² − (Σ fi xi)² / n ) ]
       = √[ (1/29) (126637.50 − 124163.33) ]
       = 9.24
v. Standard Error
The standard deviation of mean values is known as the standard error. It is used to
compare means with one another.

    Standard Error (SE) = Standard deviation / √(sample size) = SD / √n

vi. Coefficient of Variation
To compare the variability of two series which differ widely in their averages,
or which are measured in different units, a relative measure of dispersion is used,
known as the coefficient of variation or dispersion. The formula is

    Coefficient of variation (CV) = (Standard deviation / mean) × 100

When the variability of two series is compared, the series having the greater CV is said
to have more variation than the other, and the series with the lower CV is said to be more
homogeneous than the other.

Example: Using the grouped data above,

    Mean = 64.33,  SD = 9.24
    SE = SD / √n = 9.24 / √30 = 1.69
    CV = 100 × 9.24 / 64.33 = 14.36

For the ungrouped data 5, 6, 7, 8, 9,

    Mean = 7,  SD = 1.58
    CV = 100 × 1.58 / 7 = 22.57
vii. Variance
The variance is measured in the square of the units in which the variable X is
measured. The formula for the variance is:

    Variance = Σ (xi − x̄)² / n = ( Σ xi² − n x̄² ) / n

A better estimate of the population variance is obtained by using a divisor of (n − 1)
instead of n:

    Estimated variance = S² = Σ (xi − x̄)² / (n − 1)

    Estimated standard deviation = S = √[ Σ (xi − x̄)² / (n − 1) ]

The characteristics of a sample and a population are denoted as follows:

                        Sample    Population
Number                    n           N
Mean                      x̄           µ
Variance                  S²          σ²
Standard deviation        S           σ

S² will be a representative, unbiased estimate of the population variance σ² only if
(n − 1) is used in the denominator of S².
Example
The body surface areas of fifteen children are given. Calculate the mean, variance,
standard deviation and standard error.
Body Surface Area
196  101  184  227  253
185  217  122  336  148
114  135  233  198  109
Solution

    Mean = Σx / n = 2758 / 15 = 183.9

    Variance = S² = [ Σx² − (Σx)²/n ] / (n − 1) = 58499.7 / 14 = 4178.55

    Standard deviation, SD = √S² = √4178.55 = 64.64

    Standard error, SE = SD / √n = 64.64 / √15 = 16.69

The standard deviation (S) is a measure of the variation or dispersion of a group of
values around an arithmetic mean. The variance (S²) is the square of the standard
deviation. The standard error (SE) is a measure of the variation or dispersion of the
means of a set of measurements.
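The whole example can be reproduced in Python. Note that the eighth value is taken as 122, which is consistent with the stated total Σx = 2758 (the same fifteen values appear in the geometric-mean example earlier):

```python
# Mean, variance, SD and SE for the fifteen body-surface-area values above.
import math

data = [196, 101, 184, 227, 253, 185, 217, 122, 336, 148,
        114, 135, 233, 198, 109]
n = len(data)

mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / (n - 1)   # divisor n - 1
sd = math.sqrt(variance)
se = sd / math.sqrt(n)

print(round(mean, 1), round(variance, 2), round(sd, 2), round(se, 2))
# 183.9 4178.55 64.64 16.69
```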
REGRESSION ANALYSIS
Regression analysis is often used to predict the response variable from
knowledge of the independent variables. Likewise, regression analysis is used
primarily to examine the nature of the relationship between the independent
variables and the response (dependent) variable. Regression is therefore the study of
relationships among variables. One purpose of regression may be to predict, or
estimate, the values of one variable from those of other variables related to it.
EXAMPLE
The table below shows the weights of male (X) and female (Y) staff of an institution.
i) Find the least squares regression line of Y on X.
ii) Find the least squares regression line of X on Y, considering X as the dependent and
Y as the independent variable respectively.

X  65  63  67  64  68  62  70  66  68  67  69  71
Y  68  66  68  65  69  66  68  65  71  67  68  70
Solution:
i) The regression line of Y on X is given by

    Y = α + βX

X      Y      X²       Y²       XY
65     68     4225     4624     4420
63     66     3969     4356     4158
67     68     4489     4624     4556
64     65     4096     4225     4160
68     69     4624     4761     4692
62     66     3844     4356     4092
70     68     4900     4624     4760
66     65     4356     4225     4290
68     71     4624     5041     4828
67     67     4489     4489     4489
69     68     4761     4624     4692
71     70     5041     4900     4970
800    811    53418    54849    54107

For Y on X:

    β = [ nΣxy − ΣxΣy ] / [ nΣx² − (Σx)² ]
      = [ 12(54107) − (800)(811) ] / [ 12(53418) − (800)² ]
      = 0.4764

    α = Σy/n − β Σx/n = ȳ − β x̄
      = 811/12 − 0.4764 × 800/12
      = 35.8233

The regression equation of Y on X is given as Y = 35.823 + 0.476X.

(ii) The regression line of X on Y is given by

    X = α + βY

    β = [ nΣxy − ΣxΣy ] / [ nΣy² − (Σy)² ]
      = [ 12(54107) − (800)(811) ] / [ 12(54849) − (811)² ]
      = 1.036

    α = Σx/n − β Σy/n = 800/12 − 1.036 × 811/12
      = −3.38

The regression equation of X on Y is given as X = −3.38 + 1.036Y.
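A Python sketch of the regression of Y on X, using the same twelve pairs (the small difference in the intercept comes from carrying full precision for β rather than the rounded 0.4764):

```python
# Least-squares regression of Y on X for the 12 staff weight pairs above.
xs = [65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71]
ys = [68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70]
n = len(xs)

sx, sy = sum(xs), sum(ys)                    # 800, 811
sxy = sum(x * y for x, y in zip(xs, ys))     # 54107
sxx = sum(x * x for x in xs)                 # 53418

# slope and intercept of the least-squares line Y = alpha + beta * X
beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
alpha = sy / n - beta * sx / n

print(round(beta, 4), round(alpha, 2))  # 0.4764 35.82
```

Swapping the roles of `xs` and `ys` in the same code gives the line of X on Y from part (ii).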
CORRELATION ANALYSIS
We have dealt with the problem of regression or estimation of one variable (the
dependent variable) from one or more related variables (the independent variables).
We shall now consider the degree of relationship that exists between variables, the
correlation analysis.
Correlation analysis is a technique for estimating the closeness or degree of
relationship between two or more variables. Correlation is the degree of association
between two or more variables. The relationship may be positive, that is, an
increase in one variable is accompanied by an increase in the other, or negative, when
a decrease in one variable is accompanied by an increase in the other. The patterns of
correlation are: perfect positive correlation when r = 1, perfect negative
correlation when r = −1, positive correlation when r > 0, negative correlation when
r < 0, and no correlation when r = 0.
The correlation coefficient, or coefficient of correlation, denoted by r, is a measure of
the strength of the linear relationship between two variables. Two types of
measures of correlation are:
(i) Karl Pearson’s product moment correlation coefficient (r)
(ii) Spearman’s rank correlation coefficient (R)

PRODUCT MOMENT CORRELATION COEFFICIENT
Karl Pearson’s product moment correlation coefficient is denoted by r and given by:

    r = [ nΣxy − ΣxΣy ] / √{ [ nΣx² − (Σx)² ] [ nΣy² − (Σy)² ] }

where −1 ≤ r ≤ 1. It should be noted that the higher the magnitude of r, the stronger the
association.
EXAMPLE
The table below gives the weight of heart (x) and the weight of kidneys (y) in a
random sample of 12 adult males between the ages of 25 and 55 years
Male no Heart weight (X) Kidney weight (Y)
1 11.50 11.25
2 9.50 11.75
3 13.00 11.75
4 15.50 12.50
5 12.50 12.50
6 11.50 12.75
7 9.00 9.50
8 11.50 10.75
9 9.25 11.00
10 9.75 9.50
11 14.25 13.00
12 10.75 12.00

Calculate the coefficient of correlation


Solution:

    r = [ Σxy − ΣxΣy/n ] / √{ [ Σx² − (Σx)²/n ] [ Σy² − (Σy)²/n ] }

    Σx = 138.00,  Σy = 138.25,  Σxy = 1608.12
    Σx² = 1632.75,  Σy² = 1607.81

    r = [ 1608.12 − (138.00 × 138.25)/12 ] / √{ [ 1632.75 − (138.00)²/12 ] [ 1607.81 − (138.25)²/12 ] }

    r = 0.70 (to 2 decimal places)

There is a strong positive relationship between heart weight and kidney weight.
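A Python sketch of the same calculation. Note that the last heart weight is taken as 10.75 kg, which is what the stated sums Σx = 138.00, Σx² = 1632.75 and Σxy = 1608.12 all imply:

```python
# Pearson product-moment correlation for the heart/kidney weights above.
import math

x = [11.50, 9.50, 13.00, 15.50, 12.50, 11.50,
     9.00, 11.50, 9.25, 9.75, 14.25, 10.75]   # 10.75 inferred from the sums
y = [11.25, 11.75, 11.75, 12.50, 12.50, 12.75,
     9.50, 10.75, 11.00, 9.50, 13.00, 12.00]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(f"{r:.2f}")  # 0.70
```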
SPEARMAN RANK CORRELATION COEFFICIENT
When the variables do not follow a normal distribution and one desires to assess the
relationship, the correlation coefficient known as Spearman’s rank correlation coefficient
is used. The variables are ranked according to magnitude, and the correlation between the
ranks of variables x and y is obtained. The symbol used is R, and the formula is:

    R = 1 − 6Σdi² / [ n(n² − 1) ]

Where d is the difference between the ranks given to the variables of each pair and n is
the number of pairs studied. The procedure was developed by Spearman; hence it is
known as Spearman’s rank correlation coefficient. Its value also ranges from −1 to 1.
EXAMPLE
From the table below, calculate the Spearman rank correlation between smoking and
cancer.
Individual                        1    2    3    4    5    6    7    8    9    10
Rank for grade of smoking (x)     1    2    3    4    5    6    7    8    9    10
Rank for severity of cancer (y)   2    1    4    3    6    7    5    9    8    10
d = difference between ranks     −1    1   −1    1   −1   −1    2   −1    1     0
d²                                1    1    1    1    1    1    4    1    1     0

    Σd² = 1 + 1 + 1 + 1 + 1 + 1 + 4 + 1 + 1 + 0 = 12

    R = 1 − 6Σd² / [ n(n² − 1) ] = 1 − 6(12) / [ 10(10² − 1) ] = 1 − 72/990
      = 1 − 0.073
      = 0.927

Severity of cancer and grade of smoking are positively correlated.
EXAMPLE
Calculate the value correlation coefficient between the corresponding values of X and
Y given is the table below
X 22 24 25 16 28 19
Y 48 42 40 38 47 45
Solution
The varying is in ascending order of magnitude
X Y RX RY d d2
22 48 3 6 -3 9
24 42 4 3 1 1
25 40 5 2 3 9
16 38 1 1 0 0
28 47 6 5 1 1
19 45 2 4 -2 4
∑d² = 24

R = 1 − 6∑d²/[n(n² − 1)]

= 1 − 6(24)/[6(6² − 1)] = 1 − 144/210

= 1 − 0.6857
= 0.3143
≈ 0.31
There is a low or weak positive correlation between the two variables.
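A minimal Python sketch of the same calculation (an illustration of ours, assuming no tied values so plain ranks can be used):

```python
def ranks(values):
    """Rank in ascending order of magnitude; 1 = smallest (no ties assumed)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman rank correlation R = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [22, 24, 25, 16, 28, 19]
y = [48, 42, 40, 38, 47, 45]
print(round(spearman(x, y), 4))  # 0.3143
```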
TIE IN RANKS
Most times, two or more values of a variable might be equal. In such cases, we assign to each of the tied observations the mean of the ranks which they jointly occupy. For example, if the 5th and 6th largest values of a variable are equal, we assign to each the rank (5 + 6)/2 = 5.5, and if the fifth, sixth and seventh largest values of a variable are the same, we assign each the rank (5 + 6 + 7)/3 = 6.
EXAMPLE
The table given below shows the respective weights X and Y (in kg) of 12 fathers and their eldest sons.
Father (X) 66 64 68 65 69 63 71 67 69 68 70 72
Son (Y)    69 67 69 66 70 67 69 66 72 68 69 71
Calculate the coefficient of rank correlation and comment on the degree of correlation between the fathers' weights and their sons'.
Solution
X    Y    RX    RY    d = RX − RY    d²
66   69   4     7.5   −3.5           12.25
64   67   2     3.5   −1.5           2.25
68   69   6.5   7.5   −1.0           1.00
65   66   3     1.5   1.5            2.25
69   70   8.5   10    −1.5           2.25
63   67   1     3.5   −2.5           6.25
71   69   11    7.5   3.5            12.25
67   66   5     1.5   3.5            12.25
69   72   8.5   12    −3.5           12.25
68   68   6.5   5     1.5            2.25
70   69   10    7.5   2.5            6.25
72   71   12    11    1.0            1.00
                            ∑d² =    72.50

R = 1 − 6∑d²/[n(n² − 1)]

= 1 − 6(72.50)/[12(144 − 1)]

= 1 − 72.50/[2(143)]

= 1 − 0.2535
= 0.7465
≈ 0.75
Comment: There is a fairly high positive correlation between the fathers' weights and those of their eldest sons.
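The average-rank logic can be sketched in Python (our illustration; like the notes, it uses the simple Spearman formula without a tie-correction term):

```python
def avg_ranks(values):
    """Give tied observations the mean of the ranks they jointly occupy."""
    ordered = sorted(values)
    out = []
    for v in values:
        first = ordered.index(v) + 1           # smallest rank v occupies
        last = first + ordered.count(v) - 1    # largest rank v occupies
        out.append((first + last) / 2)
    return out

def spearman(x, y):
    rx, ry = avg_ranks(x), avg_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

fathers = [66, 64, 68, 65, 69, 63, 71, 67, 69, 68, 70, 72]
sons = [69, 67, 69, 66, 70, 67, 69, 66, 72, 68, 69, 71]
print(round(spearman(fathers, sons), 2))  # 0.75
```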
ELEMENTS OF PROBABILITY
Probability concepts are the foundations of statistics. The understanding of the
concepts of probability will help the interpretation of the statistics in a skilful way.
Probability is a term applied to events that are not certain. It is the study of random
or non-deterministic experiments. So probability is defined as the ratio of favorable outcomes to the total number of possible outcomes. Briefly, the interpretation of probabilities can be
summarized as follows:
i. Probabilities are numbers between 0 and 1, inclusive, that reflect the
chances of a particular physical event occurring.
ii. Probabilities near 1 indicate that the event involved is expected to occur.
iii. Probabilities near ½ indicate that the event is just as likely to occur as not.
The above properties are guidelines for interpreting probabilities once these
numbers are available, but they do not indicate how to actually go about assigning
probabilities to events. Three methods are commonly used: the classical approach, the
relative frequency approach and personal or subjective approach.

The Classical Approach


This method can be used whenever the possible outcomes of the experiment are
equally likely. In this case, the probability of the occurrence of event A is given by:
P[A] = n(A)/n(S) = (number of ways A can occur)/(number of ways the experiment can proceed)

Where S is the sample space and A ⊆ S.


Its main drawback is that it is not always applicable; it does require that the possible
outcomes be equally likely. Its main advantage is that, when applicable, the
probability obtained is exact.
Example
What is the probability that a child born to a couple, each with genes from both brown
and blue eyes, will be brown-eyed?
Solution
We note that since the child receives one gene from each parent, the possibilities for the child are (brown, blue), (blue, brown), (blue, blue) and (brown, brown), where the first member of each pair represents the gene received from the father.
Since each parent is just as likely to contribute a gene for brown eyes as for blue eyes,
all four possibilities are equally likely.
Since the gene for brown eyes is dominant, three of the four possibilities lead
to a brown-eyed child. Hence, the probability that the child is brown-eyed is ¾ = 0.75.
Example
What is the probability of drawing an ace at random from a well shuffled deck
of 52 playing cards?
Solution
There are 4 aces in a deck of 52 cards, that is, x = 4 and n = 52.
Hence, probability of an ace = x/n = 4/52 = 1/13

The Relative Frequency Approach


This method can be used in any situation in which the experiment can be repeated
many times and the results observed. Then the approximate probability of the
occurrence of event A, denoted P (A), is given by:
P[A] = n(A)/N = (number of times event A occurred)/(number of times the experiment was run)

The disadvantage of this method is that the experiment cannot be a one-shot situation; it must be repeatable. The advantage of this approach is that it is usually more accurate, because it is based on actual observation rather than personal opinion.
Thus for a large number of trials, the approximate probability obtained by using the
relative frequency approach is usually quite accurate.
Example
A researcher is developing a new drug to be used in desensitizing patients to bee stings. Of 200 subjects tested, 180 showed a lessening in the severity of symptoms upon being stung after the treatment was administered. It is natural to assume, then, that the probability of this occurring in another patient receiving treatment is at least approximately

180/200 = 0.90
On the basis of this study, the drug is reported to be 90% effective in lessening the
reaction of sensitive patients to stings.
Example
If 1,000 tosses of a coin result in 520 heads, then the relative frequency of heads is 520/1000 = 0.52
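The relative frequency idea is easy to see by simulation. A quick sketch (the seed and the number of trials are arbitrary choices of ours):

```python
import random

random.seed(1)                       # fixed seed for reproducibility
n = 10_000
heads = sum(random.random() < 0.5 for _ in range(n))
approx_p = heads / n                 # relative frequency estimate of P(head)
print(approx_p)                      # close to 0.5 for a fair coin
```

As the number of tosses grows, the relative frequency settles near the true probability 0.5.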
The subjective or personal approach
This is the probability assigned to an event based on subjective or personal experience, information and belief. Hence, probabilities are interpreted as the strength of one's belief in the occurrence of an event.
Some Basic Definitions
Experiment: This refers to any process of observation or measurement whose outcome we may not be able to predict.
Outcome: This refers to results obtained from an experiment.
Sample point: This is an outcome in the sample space
Sample space: This refers to the collection of all possible outcomes of an
experiment.
Event: This refers to any subset of a sample space.
Axioms of probability
1. Let S denote a sample space of an experiment. Then P [S] = 1
2. P [A] ≥ 0 for every event A
3. Let A1, A2, A3, … be a sequence of mutually exclusive events. Then P[A1 ∪ A2 ∪ A3 ∪ …] = P[A1] + P[A2] + P[A3] + …
Axiom 1 states a fact that most people would regard as obvious, namely that the probability assigned to a sure or certain event is 1.
Axiom 2 ensures that probabilities can never be negative.
Axiom 3 is called the property of countable additivity.
Probability Laws
1. If A and A′ are complementary events in a sample space S, then
P(A′) = 1 – P (A)
Complementary events: Two events A and A′ are said to be complementary if they are mutually exclusive and together make up the whole sample space, so that
P (A) + P (A′) = 1
Mutually exclusive events: Two events are said to be mutually exclusive if the occurrence of one event excludes or prevents the occurrence of the other event.
2. P (∅) = 0 for any sample space S
S and ∅ are mutually exclusive and S ∪ ∅ = S
∴ P (S) = P (S ∪ ∅)
= P (S) + P(∅)
P (∅) = P(S) – P (S) = 0
3. If A and B are events in a sample space S and A ⊆ B, then P(A) ≤ P(B)
4. 0 ≤ P (A) ≤ 1 for any event A.
5. Addition rule: if A and B are any two events in a sample space S, then
P (A∪B) = P(A) + P(B) – P(A∩B)
= P (A) + P(B) when P (A∩B) = 0, that is, when A and B are mutually exclusive
6. Multiplication Rule:
The probability that independent events will occur jointly is the product of the probabilities of each event. If A and B are independent events, then
P (A∩B) = P (A) × P(B)
This rule generalizes to an arbitrary number of independent events.
Example
What is the probability that a card drawn at random from a well shuffled standard
pack will be either a spade or a club?
Solution
Number of spades = 13, number of clubs = 13, n = 52.

P(S) = P(spade) = 13/52 = 1/4
P(C) = P(club) = 13/52 = 1/4

The outcomes are mutually exclusive, therefore P(S or C) = P(S) + P(C)
Hence P(S or C) = 1/4 + 1/4 = 1/2
Example
Find the probability of getting three heads in three random tosses of a balanced coin?

Solution
Probability of each toss is ½
Multiplying the three probabilities gives
½ × ½ × ½ = 1/8
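Both worked examples can be verified with exact fractions in Python (an illustration of ours):

```python
from fractions import Fraction

# Addition rule for mutually exclusive events: P(spade or club)
p_spade = Fraction(13, 52)
p_club = Fraction(13, 52)
p_spade_or_club = p_spade + p_club   # P(A ∪ B) = P(A) + P(B), since A ∩ B = ∅
print(p_spade_or_club)               # 1/2

# Multiplication rule for independent events: P(three heads in three tosses)
p_three_heads = Fraction(1, 2) ** 3
print(p_three_heads)                 # 1/8
```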
7. Conditional Probability
Given a sample space S, let A be a non-empty proper subset of S, i.e. A ≠ ∅ and A ⊂ S. The probability of an event B happening given that an event A has taken place is denoted by P(B|A) and is defined as:

P(B|A) = P(B ∩ A)/P(A)

If A and B are any two events in a sample space S and P(A) ≠ 0, the conditional probability of B given A is:

P(B|A) = P(A ∩ B)/P(A) = P(both events)/P(given event)

Likewise, the conditional probability of A given B, where P(B) ≠ 0, is:

P(A|B) = P(A ∩ B)/P(B) = P(both events)/P(given event)

Example
It is estimated that 15% of the adult population has hypertension, but that 75%
of all adults feel that personally they do not have this problem. It is also estimated
that 6% of the population has hypertension but does not think that the disease is
present. If an adult patient reports thinking that he or she does not have hypertension,
what is the probability that the disease is, in fact, present?
Solution
Letting A denote the event that the patient does not feel that the disease is present and B the event that the disease is present, we are given that P(A) = 0.75, P(B) = 0.15 and P(A ∩ B) = 0.06.
We are asked to find:

P(B|A) = P(both)/P(given) = P(A ∩ B)/P(A) = 0.06/0.75 = 0.08

There is an 8% chance that a patient who expresses the opinion that she or he has no
problem with hypertension does, in fact, have the disease.
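In code, the conditional probability is a single division (variable names are ours):

```python
p_feels_fine = 0.75            # P(A): patient does not feel the disease is present
p_both = 0.06                  # P(A ∩ B): disease present AND patient feels fine
p_disease_given_fine = p_both / p_feels_fine   # P(B|A)
print(round(p_disease_given_fine, 2))  # 0.08
```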

Bayes' Theorem
This theorem was formulated by the Reverend Thomas Bayes (1702-1761). It deals with conditional probability. Bayes' theorem is used to find P(A|B) when the available information is not directly compatible with that required in conditional probability. That is, it is used to find P[A|B] when P[A∩B] and P[B] are not immediately available.
Theorem:
Let A1, A2, A3, …, An be a collection of events which partition S. Let B be an event such that P[B] ≠ 0. Then for any of the events Aj, j = 1, 2, 3, …, n:

P(Aj|B) = P(B|Aj) P[Aj] / ∑ from i = 1 to n of P(B|Ai) P[Ai]

Bayes' theorem is much easier to use in practical problems than to state formally.
Example
The blood type distribution in Olabisi Onabanjo University is type A, 41%; type B, 9%; type AB, 4%; and type O, 46%. It is estimated that during an investigation, 4% of inductees with type O blood were typed as having type A; 88% of those with type A blood were correctly typed; 4% with type B blood were typed as A; and 10% with type AB were typed as A. One student was wounded and brought to surgery. He was typed as having type A blood. What is the probability that this is his true blood type?

Solution
Let
A1 = he has type A blood
A2 = he has type B blood
A3 = he has type AB blood
A4 = he has type O blood
B = he is typed as type A.
We want to find P [A1/B]
We are given that
P [A1] = 0.41 P [B/A1] = 0.88
P [A2] = 0.09 P [B/A2] = 0.04
P [A3] = 0.04 P[B/A3] = 0.10
P [A4] = 0.46 P [B/A4] = 0.04
By Bayes' theorem

P[A1|B] = P[B|A1] P[A1] / ∑ from i = 1 to 4 of P[B|Ai] P[Ai]

= (0.88)(0.41) / [(0.88)(0.41) + (0.04)(0.09) + (0.10)(0.04) + (0.04)(0.46)]

= 0.3608/0.3868 ≈ 0.93

Practically speaking, this means that there is a 93% chance that the blood type is A if it has been typed as A, and there is a 7% chance that it has been mistyped as A when it is actually some other type.
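The posterior can be reproduced with a short Python sketch of Bayes' theorem (the dictionary layout is our choice):

```python
priors = {"A": 0.41, "B": 0.09, "AB": 0.04, "O": 0.46}   # P(true blood type)
typed_a = {"A": 0.88, "B": 0.04, "AB": 0.10, "O": 0.04}  # P(typed A | true type)

# Total probability of being typed A (the denominator of Bayes' theorem)
p_typed_a = sum(priors[t] * typed_a[t] for t in priors)
posterior_a = priors["A"] * typed_a["A"] / p_typed_a     # P(true A | typed A)
print(round(posterior_a, 2))  # 0.93
```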
FACTORIALS
Factorial is a special multiplication operator. The factorial sign “!” indicates a special
repeated multiplication which is used frequently in statistical applications.
Examples
3! = 3 × 2 × 1 = 6
4! = 4 × 3 × 2 × 1 = 24
In general, n! = n × (n − 1) × (n − 2) × … × 3 × 2 × 1
Where n is an integer
The operator "Π" is used to indicate a multiplication of a series of numbers, and the operator "Σ" is used to indicate a summation of a series of numbers:

Π from i = 1 to 5 of Yi = Y1 × Y2 × Y3 × Y4 × Y5

∑ from i = 1 to 5 of Yi = Y1 + Y2 + Y3 + Y4 + Y5

PERMUTATION
If r objects are selected from a set of n objects, any particular arrangement (order) of
these objects is called a permutation.
The number of permutations of r objects selected from a set of n distinct objects is

nPr = n!/(n − r)!
Example
Find the number of ways of arranging the letters of the word CHEMISTRY if:
a. All the letters are to be taken at a time
b. Four of the letters are to be taken at a time
Solution
a. Required number of arrangements = n! = 9! = 362880
b. Required number of arrangements = nPr

9P4 = 9!/(9 − 4)! = 9!/5! = 362880/120 = 3024

Notes
i. 0! = 1 and nPn = n!
ii. The number of permutations of n objects of which n1 are of one kind, n2 of a second kind, …, nk of a kth kind is n!/(n1! n2! … nk!)

COMBINATION
This deals with the number of ways in which r objects can be selected from a set of n objects. The number of ways in which r objects can be selected from a set of n distinct objects is denoted nCr and is given by:

nCr = n!/[r!(n − r)!]

Example
In how many ways can a person select three items from a list of 7 such items?
Solution
Here n = 7 and r = 3
Number of possible selections = nCr

7C3 = 7!/[3!(7 − 3)!] = 7!/(3! 4!) = (7 × 6 × 5)/(3 × 2 × 1) = 35
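Python's standard library exposes these counts directly through math.perm and math.comb (available from Python 3.8), which can be used to check both worked examples:

```python
import math

# Permutations: ordered arrangements of r objects chosen from n distinct objects
assert math.perm(9, 9) == math.factorial(9) == 362880  # all letters of CHEMISTRY
assert math.perm(9, 4) == 3024                         # four letters at a time

# Combinations: unordered selections of r objects from n
assert math.comb(7, 3) == 35                           # three items from seven

print(math.perm(9, 4), math.comb(7, 3))  # 3024 35
```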
Mathematically speaking, an event which is impossible to occur, for example, an animal giving birth to a human child, has probability zero, and an event which is certain to occur, for example, death, has probability unity. If live births of octuplets to a woman are not known to have occurred in the history of a community, the statistical probability of such an event is zero in that community, but it does not mean that the event is an impossibility. No probability can be negative, nor can it exceed one. In simple terms, if malnutrition is present in 8 percent of children in a population, the probability that a randomly picked child would have that condition is 0.08. Thus, this measures the likelihood of the event and in a way is the complement of uncertainty.
Such quantification of uncertainties has proved immensely useful in effective management of health conditions, both at the individual level as well as at the community level. Knowing that the probability of developing coronary artery disease in senior executives is, say, 3 times higher than in clerks provides us a scientific basis to give appropriate advice or to institute an intervention at the individual level, and to plan and execute preventive measures to combat the problem in the target group. If the analysis of records shows that 90 percent of the large number of patients with abdominal tuberculosis (TB) came with complaints of pain in the abdomen, vomiting and constipation of long duration, then
P (pain, vomiting, constipation | abdominal TB) = 0.90
Such probabilities, which are restricted to a specific group, are called conditional probabilities. The above illustrations are some of the biological and health-specific examples of probability.
EXPERIMENTAL DESIGN
Definition of Terms
Randomization: This is the allocation of treatments to units such that the probability that a particular treatment will be allocated to a particular unit is the same for all treatments. That is, both the allocation of the experimental material and the order in which the individual trials of the experiment are to be performed are randomly determined. Statistical methods require that the observations (or errors) be independently and identically distributed random variables, and randomization makes this assumption valid. Thus, randomization removes bias and allows the application of probability concepts.
Replication: This is a complete repetition of the basic experiment. It provides an estimate of the magnitude of the experimental error and a more precise measure of treatment effects.
Reduction of Random Variation
The third basic principle is the use of techniques of experimental design for the reduction of random variation, also called local control of variability or error control. This refers to the way in which the experimental units in a particular design are balanced, blocked and grouped. Local control is necessary to increase the efficiency of the experiment. The commonly used terms are experiment, treatment, experimental unit, experimental error, grouping, blocking, factors, balancing and precision.
Experiment: It is a means of getting an answer to the question that the experimenter
has in mind. This may be to decide which of several pain relieving drugs is most
effective or whether they are equally effective.
Similarly, the effectiveness of various types of diets on the growth status of children or albino rats can be assessed. For assessing the effectiveness of the experiment, it should have one group to serve as a control.
Treatment: This means the experimental conditions which are imposed on an
experimental unit in a particular experiment. In a dietary or medical experiment, the
different diets or medicines are the treatments. In an agricultural experiment, the
different varieties of a crop or different manures will be the treatments.
Experimental unit: An experimental unit is the material to which the treatment is
applied and on which the variable under study is measured. In a feeding experiment
of cows or albino rats, the whole cow or albino rat is the experimental unit.
Experimental Error: We usually come across variation in the measurement made on
different experimental units even when they get the same treatments. A part of this
variation is systematic and can be explained, whereas the remainder is to be taken to
be of the random type. The unexplained random part of the variation is termed the
experimental error.
Grouping: This is the placement of homogenous experimental units into different
groups to which separate treatments may be assigned.
Blocking: This is the assignment of the experimental units to blocks in such a manner
that the units within any particular block are as homogenous as possible.
Factors: A factor is a possible cause of response or variation. Factors include age, sex, variety, etc. It may be observed that treatments are often different combinations of the levels of one or more factors.
Balancing: This is the assignment of the treatment combinations to the experimental
units in such a way that a balanced or symmetric configuration is obtained.

One-way ANOVA is used when we wish to test the equality of k population means. The procedure is based on the assumptions that each of the k groups of observations is a random sample from a normal distribution and that the population variance σ² is constant among the groups. ANOVA models provide an appropriate framework to facilitate comparison of several means.
The statistical model for a one-way classification ANOVA is

Xij = µ + αi + εij,   i = 1, 2, …, k;  j = 1, 2, …, n

Where Xij = the jth observation on the ith treatment
µ = overall or grand mean
αi = ith treatment effect
εij = random error, with εij ~ NID(0, σ²)

Notation                            trt total   trt mean
Trt 1   X11 X12 … X1n               X1.         X̄1.
Trt 2   X21 X22 … X2n               X2.         X̄2.
⋮
Trt k   Xk1 Xk2 … Xkn               Xk.         X̄k.

Where

X̄i. = (1/n) ∑ from j = 1 to n of Xij,   i = 1, 2, …, k

X̄.. = (1/k) ∑ from i = 1 to k of X̄i. = ∑∑ Xij / (kn)

Sum of Squares Identity

ANOVA is the partitioning of total variability into component parts.
Total sum of squares (TSS): The TSS is defined as the sum of the squares of the deviations from the grand mean:

TSS = ∑∑ (Xij − X̄..)²

It is a measure of the dispersion of all the variates about the grand mean. Its degrees of freedom (df) = kn − 1. It can be shown that the TSS, or total variation, can be partitioned into two parts:

TSS = ∑∑ (Xij − X̄..)² = ∑∑ (Xij − X̄i.)² + n ∑ (X̄i. − X̄..)²
                             (WSS)              (BSS)

Within sum of squares (WSS): WSS, or the sum of squares due to error (residual error), is defined as the sum of squared deviations of the original observations Xij from their treatment means. It represents the experimental error of the given experiment; its degrees of freedom = k(n − 1). It is also denoted SSE.
Between sum of squares (BSS): This is defined as the sum of squared deviations of the treatment means about the grand mean; its degrees of freedom = k − 1. The less the samples differ from each other, the smaller the BSS, or treatment sum of squares (SSTr).
For easy computation, we can use the following:

TSS = ∑∑ (Xij − X̄..)² = ∑∑ Xij² − T²/(nk)

Where T = ∑∑ Xij is the grand total,

Xi. = ∑ from j = 1 to n of Xij,   X̄i. = Xi./n

X.. = T = ∑∑ Xij,   X̄.. = X../N

Where N = nk = total number of observations.

BSS = n ∑ (X̄i. − X̄..)² = (1/n) ∑ Ti.² − T²/(nk)

Where Ti. = sum of observations in treatment group i, and

WSS = TSS − BSS
One-way ANOVA Table (equal observations)
Given the model Xij = µ + αi + εij, we test the hypothesis
H0: α1 = α2 = … = αk
H1: at least two of the αi's are not equal.
Test statistic: F from the ANOVA table below.
ANOVA Table

Source               SS    df         MS                     F
Between treatments   BSS   k − 1      BSS/(k − 1) = A        A/B
Within treatments    WSS   k(n − 1)   WSS/[k(n − 1)] = B
Total                TSS   kn − 1

The critical value is F(1−α), v1, v2 where v1 = k − 1, v2 = k(n − 1) and α is the significance level.
Example
Given the following five treatments A, B, C, D and E with three observations each, perform an analysis of variance to test whether the treatment effects are the same or not, and compute the coefficient of variation to determine the precision, at the 5% level of significance.
A B C D E
3 5 7 6 4
2 8 8 8 9
4 8 6 7 5
Solution:
Test of Hypothesis
H0: α1 = α2 = … = α5
H1: at least two of the αi's are not equal.
A B C D E
3 5 7 6 4
2 8 8 8 9
4 8 6 7 5
Ti. (total)   9   21   21   21   18
X̄i. (mean)    3    7    7    7    6

Grand mean X̄.. = 90/15 = 6

TSS = ∑∑ (Xij − X̄..)² = (3 − 6)² + (2 − 6)² + (4 − 6)² + … + (9 − 6)² + (5 − 6)² = 62

BSS = n ∑ (X̄i. − X̄..)² = 3[(3 − 6)² + (7 − 6)² + (7 − 6)² + (7 − 6)² + (6 − 6)²]

= 3[9 + 1 + 1 + 1 + 0] = 36

WSS = TSS − BSS = 62 − 36 = 26
Another Computational Formula or Method

TSS = ∑∑ Xij² − T²/(nk), where T = grand total = 90

= (3² + 2² + 4² + … + 4² + 9² + 5²) − (90)²/(3 × 5) = 602 − 540 = 62

BSS = (1/n) ∑ Ti.² − T²/(nk)

= (1/3)[9² + 21² + 21² + 21² + 18²] − (90)²/(3 × 5) = 576 − 540 = 36
WSS = TSS – BSS = 62 – 36 = 26
ANOVA TABLE

Source of Variation   SS   df   MS            F
Between treatments    36   4    36/4 = 9      9/2.6 = 3.462
Within treatments     26   10   26/10 = 2.6
Total                 62   14

Test statistic: Fcal = 3.462
Critical value: F(1−α),4,10 = F0.95,4,10 = 3.48
Decision: Since Fcal < Ftab, we accept H0 and conclude that the treatment mean effects in the five treatments are equal; there is no significant difference between the treatment means in the five treatments.
Coefficient of variation: CV = (√MSE / X̄..) × 100 = (√2.6 / 6) × 100 ≈ 26.9%, which indicates the precision of the experiment.
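The whole example can be reproduced with a small Python function that follows the text's formulas (the function and variable names are our own):

```python
def one_way_anova(groups):
    """Equal-replication one-way ANOVA; returns TSS, BSS, WSS and F."""
    k, n = len(groups), len(groups[0])
    grand = sum(sum(g) for g in groups) / (k * n)            # grand mean
    bss = n * sum((sum(g) / n - grand) ** 2 for g in groups) # between treatments
    tss = sum((x - grand) ** 2 for g in groups for x in g)   # total
    wss = tss - bss                                          # within (error)
    f = (bss / (k - 1)) / (wss / (k * (n - 1)))              # MS_between / MS_within
    return tss, bss, wss, f

groups = [[3, 2, 4], [5, 8, 8], [7, 8, 6], [6, 8, 7], [4, 9, 5]]  # treatments A..E
tss, bss, wss, f = one_way_anova(groups)
print(tss, bss, wss, round(f, 3))  # 62.0 36.0 26.0 3.462
```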
Statistical Hypothesis
The most frequent application of statistics is to test some scientific hypothesis. Results of experiments and investigations are usually not clear-cut and, therefore, need statistical tests to support decisions between alternative hypotheses. A statistical test examines a set of sample data and, on the basis of an expected distribution of the data, leads to a decision on whether to accept the hypothesis or whether to reject that hypothesis and accept an alternative one. The nature of the test varies with the data and the hypothesis, but the same general philosophy of hypothesis testing is common to all tests. A statistical hypothesis is an assumption or statement, which may or may not be true, concerning one or more populations.
A statistical hypothesis (or inference) is a statement about the parameters or form of a population. A test of a statistical hypothesis is a criterion which specifies for what sample results the hypothesis is to be accepted or rejected. The hypothesis which is to be tested is generally called the null hypothesis, denoted by H0, and the hypothesis against which it is tested is called the alternative hypothesis, denoted by H1.
Type I and Type II Errors
A type I error has been committed if we reject the null hypothesis when it is true and
a type II error has been committed if we accept the null hypothesis when it is false.
The following table summarizes the various situations that can arise when testing H0
against H1:
Accept H0 Accept H1
H0 is true No error Type I Error
H1 is true Type II error No error

The probabilities of committing Type I and Type II errors are written as α and β, respectively. α, the probability of a Type I error, is called the level of significance (or size) of the test, and (1 − β) is called the power of the test; (1 − β) is the probability of rejecting the null hypothesis (H0) when it is false. The region such that, if the sample point falls in it, we reject H0 is called the critical region. When the primary concern of a test is to see whether the null hypothesis can be rejected, such a test is called a test of significance. In that case, the quantity α is called the level of significance at which the test is being conducted.
One and Two Tailed Test
A test of any statistical hypothesis where the alternative is one-sided, such as:
H0: θ = θ0        or        H0: θ = θ0
H1: θ > θ0                  H1: θ < θ0
is called a one-tailed test. The critical region for H1: θ > θ0 lies entirely in the right tail, while the critical region for H1: θ < θ0 lies entirely in the left tail.
A test of any statistical hypothesis where the alternative is two-sided, such as:
H0: θ = θ0
H1: θ ≠ θ0
is called a two-tailed test; values in both tails of the distribution constitute the critical region.
TEST PROCEDURE AND STEPS
The steps involved in general and in the utilization of any test of significance
are:
i. Identify the type of problem and the question to be answered.
ii. State the null hypothesis (H0) and the appropriate alternative hypothesis (H1).
iii. Select the appropriate test to be utilized and calculate the test criterion based on the type of test.
iv. Fix the level of significance, α.
v. Make a decision based on the test criterion value, whether to reject or accept the hypothesis.
vi. Draw the conclusion (or inference): on the basis of the level of significance, decide whether the difference observed is due to chance or due to some other known factors.
‘P’ Values
'P' values are used to assess the degree of dissimilarity between two or more sets of measurements, or between one set of measurements and a standard. The 'P' value is actually a probability, usually the probability of obtaining a result as extreme as or more extreme than the one observed if the dissimilarity is entirely due to variation in measurements or in subject response, that is, if it is the result of chance alone.
'P' values measure the strength of evidence in scientific studies by indicating the probability that a result at least as extreme as that observed would occur by chance.
‘P’ values are derived from statistical tests that depend on the size and
direction of the effect. ‘P’ Values should be considered in making decisions about the
usefulness of a treatment.
One popular approach is to indicate only that the 'P' value is smaller than 0.05 (P < 0.05) or smaller than 0.01 (P < 0.01). When the 'P' value is between 0.05 and 0.01, the result is usually called statistically significant; when it is less than 0.01, the result is taken to be highly significant.
TESTS CONCERNING THE MEAN (FOR LARGE SAMPLE).
We will assume that the sampling distribution of the sample estimates is approximately normal and that the variance is known. Hence, for large samples (n ≥ 30), we can use the normal probability distribution for testing a hypothesized value of the population mean.
The test statistic is

Z = (X̄ − µ) / S.E.(X̄)

Where
X̄ is the sample mean
µ is the population mean
S.E.(X̄) is the standard error of the sample mean:

S.E.(X̄) = σ/√n

Where
σ is the population standard deviation (usually known)
n is the sample size.
We then compare the modulus of Z, that is |Z|, to its value at the given level of significance, usually 5% or 1%. The corresponding values of Z for both one-tailed and two-tailed tests are tabulated below:

Level of significance   One-tailed   Two-tailed
5% (or 0.05)            1.64         1.96
1% (or 0.01)            2.33         2.58

Decision
i. If Z calculated is less than the Z tabulated then there is no reason to reject
the null hypothesis H0.
ii. If Z calculated is more than the Z tabulated then we reject the null
hypothesis H0 and accept H1 the alternative hypothesis.
Example
A bottling company which bottles a soft drink claims that the liquids content is
35cl with standard deviation 0.75cl. A researcher randomly collects 50 bottles,
measured their contents and got mean of 34.2cl. Test at 0.01 level of significance that
the bottling company has been cheating their consumers.
Solution
µ = 35cl, σ = 0.75cl, n = 50, X̄ = 34.2cl, α = 0.01 (1%)
H0: µ = 35, that is, the company has not been cheating the consumers.
H1: µ < 35, that is, the company has been cheating the consumers.
Test statistic:

Z = (X̄ − µ)√n / σ = (34.2 − 35)√50 / 0.75 = (−0.8 × 7.0711)/0.75 = −7.54

Thus, |Z| = |−7.54| = 7.54
At the 0.01 level of significance the tabulated Z value (one-tailed) is 2.33.
Decision: The calculated Z value 7.54 is greater than the tabulated Z value 2.33, so we reject H0 and accept H1.
Conclusion: There is a significant difference between the population and sample means. Hence, the bottling company has been cheating their consumers.
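The Z statistic is a one-line computation in Python (our illustrative sketch, not part of the notes):

```python
import math

def z_statistic(xbar, mu, sigma, n):
    """Large-sample Z test statistic, Z = (xbar - mu) / (sigma / sqrt(n))."""
    return (xbar - mu) / (sigma / math.sqrt(n))

z = z_statistic(34.2, 35, 0.75, 50)
print(round(z, 2))          # -7.54
critical = 2.33             # one-tailed tabulated value, alpha = 0.01
print(abs(z) > critical)    # True, so reject H0
```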
Example
The mean height from a random sample of size 100 is 64cm. The standard
deviation is known to be 3cm. test the statement that the mean height of the
population is 67cm at 5% level of significance.

Solution
X̄ = 64cm, σ = 3cm, µ = 67cm, n = 100, α = 0.05
H0: µ = 67cm
H1: µ ≠ 67cm
Test statistic:

Z = (X̄ − µ)√n / σ = (64 − 67)√100 / 3 = −10

Thus, |Z| = |−10| = 10
At the 0.05 level of significance the tabulated Z value (two-tailed) is 1.96.
Decision: Since Zcal > Ztab, we reject H0 and accept H1.
Conclusion: The mean height of the population could not be 67cm.
TEST CONCERNING THE MEANS (SMALL SAMPLES)
There are situations in real-life experiments, such as testing the efficacy of a newly produced drug, where it is impracticable to get a large sample and yet tests of significance still have to be carried out. When we do not know the value of the population standard deviation and the sample size is small (n < 30), we shall assume again that the population we are sampling from has roughly the shape of a normal distribution. The test statistic is:

t = (X̄ − µ)/(S/√n) = (X̄ − µ)√n / S

whose sampling distribution is the t distribution with n − 1 degrees of freedom. S is the sample standard deviation. As with large samples, we compare it with its value at a given level of significance, and then draw our conclusions.
Example
Suppose that we want to test, on the basis of a random sample of size n = 5, whether or not the fat content of a certain kind of ice cream exceeds 12 percent. What can we conclude about the null hypothesis µ = 12 percent at the 0.01 level of significance, if the sample has mean X̄ = 12.7 percent and standard deviation S = 0.38 percent?
Solution
Hypothesis
H0: µ = 12%
H1: µ > 12%
α = 0.01, n = 5, d.f. = n − 1 = 4
Test statistic:

t = (X̄ − µ)/(S/√n) = (12.7 − 12)/(0.38/√5) = 0.7/0.1699 = 4.12

t0.01,4 = 3.747
Decision: Since tcal > ttab, we reject H0.
Conclusion: Therefore, the fat content of the given kind of ice cream exceeds 12 percent.
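The same t statistic from summary figures, as a quick Python check (function name is ours; the critical value is taken from standard t tables):

```python
import math

def t_statistic(xbar, mu, s, n):
    """One-sample t statistic, t = (xbar - mu) / (s / sqrt(n))."""
    return (xbar - mu) / (s / math.sqrt(n))

t = t_statistic(12.7, 12.0, 0.38, 5)
print(round(t, 2))      # 4.12
critical = 3.747        # t(0.01, 4 df), one-tailed, from tables
print(t > critical)     # True, so reject H0
```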
Example
The lifetimes of electric bulbs for a random sample of 10 from a large consignment give the following data:

Item   Life in 1,000 hrs   X − X̄    (X − X̄)²
1      4.2                 −0.1     0.01
2      4.0                 −0.3     0.09
3      3.9                 −0.4     0.16
4      4.1                 −0.2     0.04
5      5.2                  0.9     0.81
6      3.8                 −0.5     0.25
7      3.9                 −0.4     0.16
8      4.3                  0       0
9      4.4                  0.1     0.01
10     5.6                  1.3     1.69

Can we accept the hypothesis that the average lifetime of the bulbs is 4,000 hours at the 5% level of significance?
Solution:
Hypothesis
H0: µ = 4,000hours
H1: µ ≠ 4,000hours
α = 0.05 level of significance
Since n = 10, d.f. = n − 1 = 10 − 1 = 9, and the critical value is t(α/2, n−1) = t0.025,9

X̄ = ∑Xi / n = (4.2 + 4.0 + … + 5.6)/10 = 43.4/10 = 4.34 ≈ 4.3

S² = ∑(Xi − X̄)² / (n − 1)

= (0.01 + 0.09 + … + 0.01 + 1.69)/9 = 3.22/9 = 0.358

S = √0.358 = 0.598

Test statistic:

t = (X̄ − µ)√n / S = (4.3 − 4)√10 / 0.598 = 1.587

t0.025,9 = 2.262
Decision: Reject H0 if tcal > ttab.
Conclusion: Since tcal < ttab, we accept H0 and conclude that the average lifetime of the bulbs is 4,000 hours.
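Working from the raw data in Python (our sketch; using the exact mean 4.34 rather than the rounded 4.3 gives a slightly larger t, with the same conclusion):

```python
import math
import statistics

lifetimes = [4.2, 4.0, 3.9, 4.1, 5.2, 3.8, 3.9, 4.3, 4.4, 5.6]  # in 1,000 hrs
n = len(lifetimes)
xbar = statistics.mean(lifetimes)      # 4.34, unrounded
s = statistics.stdev(lifetimes)        # sample standard deviation (n - 1 divisor)
t = (xbar - 4.0) * math.sqrt(n) / s
print(round(xbar, 2), round(t, 2))
print(abs(t) < 2.262)   # True, so accept H0 at the 5% level; t(0.025, 9 df) = 2.262
```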
