
1 INTRODUCTION

What is an experiment? It is a planned inquiry to obtain new facts or to confirm or deny the
results of previous experiments. Such inquiry may aid in recommending a seed variety, an
animal breed, a ration, a medicine, a pesticide or a management procedure.
Experiments have to be planned because the effects under investigation vary from experiment
to experiment, and this variation makes the drawing of inferences difficult. The design of the
research (experiment) is therefore aimed at identifying and reducing this variation, so that the
experiment can be reproduced with a reasonable degree of repeatability. If an experiment is
improperly designed or blindly carried out, the results are almost certain to be irrelevant or
misleading.
Experiments can be classified into three categories:
(a) Preliminary experiments. Here the investigator tries out a large number of treatments in
order to obtain leads for future work.
(b) Critical experiments. Here the investigator compares responses to different treatments,
using sufficient observations of the responses to give reasonable assurance of detecting
differences.
(c) Demonstrational experiments. These are performed when the investigator compares a new
treatment with a standard.

1.1 Terminologies

1.1.1 Experimental units


This is the smallest unit of experimental material to which a treatment is applied. In agronomy
studies it might be a plot. In animal studies an animal may constitute an experimental unit, or,
for pigs in a pen, all animals in that pen (a group of animals receiving a single treatment) may
be the experimental unit. In some cases the experimental unit will be so large that it is
impractical to measure it whole. In such cases, two or more random subdivisions of the
experimental unit are measured instead, e.g. milk from the 4 udder quarters.

1.1.2 Sampling unit


This is the fraction of the experimental unit on which the effect of the treatment is measured,
e.g. each chick in a cage of 5, each pig in a group of 3, a pen, or one tree in a plot.

1.1.3 Treatment
The various conditions that distinguish the populations of interest, e.g. crop varieties, drugs
in animal experiments, fertilizers in agronomy studies, etc.

1.1.4 Control
This is a standard treatment. It might also be a placebo (that is, a tablet, cream or solution
looking exactly like the real medicine but lacking the active ingredient).

1.1.5 Factors
When several aspects are studied in a single experiment, e.g. rainfall and fertilizer, each of
these is called a factor. Factors are thus the aspects under study, and each factor can occur at
different levels.

1.1.6 Variation
These are differences exhibited by individuals in a population. The variation could arise from
different causes, which may be known or unknown. However, there are two main causes of
variation:
(1). Inherent (genetic) variability that exists in the experimental material e.g. between
animals, plants in a plot, leaves on a plant, each test tube or each machine.
(2) Lack of uniformity in the physical conduct of the experiment e.g. mixing feed, reading at
different times, where different people are involved etc.
If variation is measured in a population, we talk of total variation, which can further be sub-
divided into different sources causing the variation using analysis of variance (ANOVA).

1.1.7 ANOVA
This is the process of subdividing the total variability of experimental observation into
portions attributed to recognised causes. For example, gain in weight in humans is normally
attributed to amount of feed, genes, type of feed etc.

1.1.8 Systematic errors


These arise from failure to standardize experimental technique. They are errors that one can
account for through the use of the correct experimental design.

1.1.9 Experimental design


This refers to the grouping of experimental units. There are three basic designs, namely the
completely randomised design (CRD), the randomised complete block design (RCBD) and
the Latin square (LS). The others are just modifications of these three, e.g. incomplete block
designs.

1.1.10 Random error


In the absence of systematic error, the difference between the estimate of the treatment and
the true value is called the random error. The magnitude of the random error is measured by
the standard error.

1.1.11 Standard error


The standard error gives the precision of your experiment. It depends on:
(1) The intrinsic (in-built) variability of the experimental material.
(2) The number of experimental units used. The larger the number, the lower the standard
error.
(3) The design and method of analysis.
Any experiment should try to reduce the standard error. If the standard error is too big, the
experiment is useless because no conclusion can be made. If the standard error is too small,
there was a waste of experimental material and hence money.

The standard error of the mean is calculated as

SEM = √(σ²/n)

where σ² is the variance and n is the number of experimental units per treatment.
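As an illustration, the following minimal Python sketch (with invented observations for one
treatment group) computes the standard error of the mean from raw data:

    import math

    # Hypothetical observations from a single treatment group
    observations = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2]
    n = len(observations)
    mean = sum(observations) / n

    # Sample variance: squared deviations divided by (n - 1)
    variance = sum((y - mean) ** 2 for y in observations) / (n - 1)

    # Standard error of the mean
    sem = math.sqrt(variance / n)
    print(f"mean = {mean:.2f}, SEM = {sem:.3f}")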

1.1.12 Precision
Reflects the repeatability of measurement of any variable.

1.2 Koch's postulates and the stages of scientific inquiry


(1) Observation - something happens and someone wonders why. This requires being a keen
observer.
(2) Hypothesis - an attempt to explain why something happened. This can be a formal guess,
a statistical hypothesis or a literal statement.
(3) Design an experiment to test, investigate or evaluate the hypothesis. In the experiment the
objective(s) must be clearly stated as a hypothesis.
(4) Carry out the experiment, sample and/or collect data on a selected population or area, and
analyse the data.

1.3 Steps considered in planning and conducting a scientific research experiment


Research has to be scientifically performed to avoid unwanted information. To achieve this,
the following steps must be considered before starting the experiment.
(1) Formulation of research plan
This involves two things:
(a) A precise statement of the problem to be solved, e.g. factors affecting weaning weight in
Red Maasai sheep.
(b) The objective of the research, e.g. to calculate correction factors that can be used to
compare lambs weaned in different seasons, from different dams and in different years.
(2) Choice of factors to be used
Factors are aspects to be studied in a single experiment, e.g. year and season the lambs were
born, number of young born per ewe, parity, sex, etc. The factors must be relevant and all
important factors must be included. If one includes too many factors, the experiment will be
too big and unmanageable and the analysis will also be complicated.
(3) Choice of variable to be measured
Examples are feed intake, birth weight, growth rates, etc. A variable should be measured only
if it provides information about the phenomenon under investigation, and this will be
determined by the cost and importance of the variable itself. For example, it might be
expensive to measure birth weight because ewes lamb day and night and employing someone
to watch over them will be costly.
(4) Choice of inference space
This is the range of validity of the results and is defined as the set of populations to which the
inference may be applicable, e.g. is the variety better (or the result valid) at the coast? The
range of validity will determine the size of the experiment. If you want the experiment to
cover a wide range of validity, then you need a big experiment.

Steps 1 to 4 constitute the experimental setup phase.

(5) Selection of experimental material
The type and amount of experimental material will depend upon:
(a) Objective of the experiment
(b) The factors under study
(c) The inference space
(d) The budget available

(6) Choice of experimental design


This is the most important step.
(7) Formulation of a model
A linear mathematical model that describes the observations anticipated under the
experimental plan.

Steps 5 to 7 constitute the design phase of the experiment.

(8) Collection of the data


Make sure that measurements are accurate and precise.
(9) Analysis of the data
This will depend on the design and the model of the experiment.
(10) Conclusion and interpretation
This is the practical implication of the results, which will relate back to the objectives.

Steps 8 to 10 constitute the analysis phase.

1.4 Features of a good experiment


(1) Reduction of systematic errors
These are errors that are not easily explained, i.e. they get confounded with your treatment
effects. For example, if you have 2 rations A and B and you feed animals in pens C and D,
respectively, any difference observed could be due to a pen effect, since the pen will have an
effect on production. When designing an experiment, this error should be reduced as much as
possible.
(2) Precision
We are interested in the repeatability of measurements. It is measured by the standard error of
the experiment. Precision is affected by the number of experimental units, the variation
among the animals (i.e. their condition), etc. Therefore the animals should be as uniform as
possible to reduce the standard error. To increase precision, and hence reduce the standard
error, we have to reduce the standard deviation and increase the number of experimental units
per treatment. Increasing the experimental units would make the experiment expensive, and
therefore we have to look for a design that takes care of this.
(3) The range of validity of the results (i.e. the concept of accuracy)
Accuracy refers to the concept of having a value close to the true value.
(4) Simplicity in design and also in analysis
(5) Proper statistical analysis without making artificial assumptions.
Assumptions should be made to fit our practical situation as much as possible.

1.5 Hypothesis testing

1.5.1 Basic concepts of testing population means


Hypothesis testing is important in assisting a researcher to make decisions concerning a
population based on a sample. A statistical hypothesis is an assertion, statement or conjecture
about the distribution of one or more populations. A test of hypothesis is a rule for deciding
whether to reject your hypothesis. Hypothesis testing can be presented as an eight-step
procedure.

Step 1. Understanding the Data


The nature of the data that form the basis of that procedure must be understood.

Step 2. Stating the assumptions.


This involves stating the various assumptions made on the data, which may be:
- Assumptions on normality of the data.
- Equality of variances.
- Independence or non-independence of samples.

Step 3. Stating the hypothesis.


There are two statistical hypotheses involved in hypothesis testing:

Null hypothesis (H0): this is set up for the purpose of being discredited. Consequently, the
opposite of the conclusion that the researcher is seeking to reach becomes the statement of the
null hypothesis. If H0 is accepted, we say the data do not provide sufficient evidence to reject
it. If H0 is rejected, we conclude that the data are not compatible with H0 but are supportive
of some other hypothesis.
Alternative hypothesis (H1): this is the conclusion the researcher is seeking to reach; it is
accepted when H0 is rejected.

Step 4: - Test statistic


This is a statistic that can be computed from the sample and serves as the basis for deciding
whether to reject or accept H0, e.g. the Z statistic.

Step 5: - Distribution of test statistic


The distribution of the test statistic must be known (e.g. standard normal or t) so that the
rejection region can be determined.

Step 6: - Decision rule.


All possible values that the test statistic can assume lie along the horizontal axis of the graph
of its distribution and are divided into two regions: the rejection region and the acceptance
region. The boundary between them is determined by the desired level of significance,
denoted α. The significance level α is the probability of rejecting a true null hypothesis.

The error committed when a true hypothesis is rejected is called a TYPE I ERROR. The
probability of committing a type I error is denoted α (the significance level). A TYPE II
ERROR is the error committed when a false hypothesis is accepted. The probability of
committing a type II error is denoted β.
Step 7: - Computing Test Statistic.
From the sample data we compute the test statistic and compare it with acceptance and
rejection regions as determined by level of significance α.

Step 8: - Decision making.

A statistical decision consists of rejecting or accepting H0 depending on whether the computed
test statistic falls in the acceptance or rejection region.

When sampling from a normal population with known population variance σ².

This assumes a random sample from a normally distributed population with mean µ unknown
and variance σ² known.

To test any of the following hypotheses

H0: µ = µ0        H0: µ ≤ µ0        H0: µ ≥ µ0
H1: µ ≠ µ0        H1: µ > µ0        H1: µ < µ0

the test statistic used is

z = (Ȳ − µ0) / (σ/√n)

or, equivalently, test by the confidence interval Ȳ ± z(α/2) · σ/√n.

Sampling from a normal population with mean unknown and variance unknown.
This assumes a random sample from a normally distributed population with mean µ unknown
and variance σ² unknown.
To test any of the following hypotheses
H0: µ = µ0        H0: µ ≤ µ0        H0: µ ≥ µ0
H1: µ ≠ µ0        H1: µ > µ0        H1: µ < µ0
the test statistic is

t = (Ȳ − µ0) / (s/√n),  with n − 1 degrees of freedom.

Note that in this statistic s (the sample standard deviation) is used instead of σ (the population
standard deviation).

Sampling from a population that is not normally distributed


This assumes a sample from a population that is not normally distributed with mean µ and
variance σ².
To test any of the following hypotheses
H0: µ = µ0        H0: µ ≤ µ0        H0: µ ≥ µ0
H1: µ ≠ µ0        H1: µ > µ0        H1: µ < µ0

the problem is that we do not know the population distribution. If n is small we have a
problem, but if n is large we can invoke the central limit theorem, and the test statistic in this
case is

z = (Ȳ − µ0) / (s/√n) ~ N(0, 1) approximately,

where s is the sample standard deviation.
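As a concrete illustration, here is a minimal Python sketch of these one-sample tests; the data,
the hypothesised mean µ0 and the known σ are all invented for the example:

    import math
    from statistics import mean, stdev

    data = [5.1, 4.8, 5.6, 5.3, 4.9, 5.2, 5.5, 5.0]   # hypothetical sample
    mu0 = 5.0                                         # hypothesised mean
    n = len(data)

    # Variance known: z statistic, referred to standard normal tables
    sigma = 0.3                                       # assumed known sigma
    z = (mean(data) - mu0) / (sigma / math.sqrt(n))

    # Variance unknown: t statistic with n - 1 degrees of freedom
    t = (mean(data) - mu0) / (stdev(data) / math.sqrt(n))

    # Non-normal population, large n: reuse the second formula but refer
    # the value to N(0, 1), justified by the central limit theorem
    print(f"z = {z:.3f}, t = {t:.3f} on {n - 1} df")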

2 POPULATION MEAN AND VARIANCE
It is imperative to first define statistics. Statistics is the science of collecting, analysing and
interpreting quantitative data in such a way that the reliability of the conclusions can be
evaluated in an objective way. There are two types of statistics:
(1) Descriptive statistics: deals with organizing and summarizing numerical data. This also
includes the presentation of data in graphical or pictorial form.
(2) Inferential statistics: deals with formulating theories and making conclusions from
empirical facts or data. The conclusions made can be useful knowledge applicable in life and
production.

2.1 Population
A population is a group of things whose number may not be known, whereas a sample is a
part of a population whose size is known and which is subjected to investigation or research
in order to generate data. The purpose of statistical inference is to establish facts about
populations. Ordinarily, a statistical analysis involves only certain aspects of a population -
only certain attributes, activities or characteristics, e.g. weight, height, motion, etc.
The attributes, activities or characteristics in a population or sample are assumed to be
normally or randomly distributed according to the laws of probability.
We can measure a certain characteristic of an element of the population. Such a measurement
is called an observation. Interest in statistics often lies in the set of measurements taken from
the elements of the population. Such a set will often also be called the universe or population
of observations. The number of elements in the population is N and is called the population
size. In practice populations are finite; in statistical theory populations are assumed to be
infinite, which is a good model if N is large.

2.2 Population mean and variance


One of the first objectives of statistics is to summarize a population of observations,
{Y1, Y2, ..., YN}.
A measure of the location of the distribution of these observations is the population mean.
This is calculated as the arithmetic average and is denoted by the symbol µ:

µ = (1/N) Σ(i=1..N) Yi

A measure of how the distribution of observations is spread is called the variance. This is
calculated as the arithmetic average of the squared deviations of the observations from the
population mean. The variance is denoted by the symbol σ²:

σ² = (1/N) Σ(i=1..N) (Yi − µ)²
A more convenient formula to calculate σ², especially if µ is not an integer, is

σ² = (1/N) [ Σ(i=1..N) Yi² − (Σ(i=1..N) Yi)² / N ]

The variance is often replaced by another measure of how the distribution of observations is
spread: the square root of the variance, called the standard deviation. The standard deviation
is denoted by σ and is

σ = √[ (1/N) Σ(i=1..N) (Yi − µ)² ] = √[ (1/N) ( Σ(i=1..N) Yi² − (Σ(i=1..N) Yi)² / N ) ] = √(σ²)
The advantage of the standard deviation is that it has the same dimension as the observations.
Be aware that the two measures population mean and population variance generally do not
describe the population completely. To show the influence of how the distribution is spread,
measured by the variance, we describe a population of 10 observations.
Example 1.1
The population consists of (3, 4, 4, 5, 5, 5, 5, 6, 6, 7) i.e. N = 10. We can summarize these
observations in a so-called frequency table. For this we first look at the values in the
population, which are different. In our example we have five different values namely 3, 4, 5, 6
and 7. The number of different values of the population is denoted by M.

Yi fi fi/N
3 1 0.1
4 2 0.2
5 4 0.4
6 2 0.2
7 1 0.1

fi is called the frequency and counts the number of times a certain value Yi occurs.
fi/N is called the relative frequency.

We find that µ = 5 and σ² = 1.2 for this distribution because

Σ(i=1..10) Yi = 50,   Σ(i=1..10) Yi² = 262,   hence µ = 50/10 = 5

σ² = (262 − 50²/10)/10 = (262 − 250)/10 = 12/10 = 1.2

The calculation of µ and σ² is easier when we already have the frequency table:

µ = (1/N) Σ(i=1..M) fi Yi

σ² = (1/N) [ Σ(i=1..M) fi Yi² − (Σ(i=1..M) fi Yi)² / N ]

where N = Σ(i=1..M) fi and M = the number of different values. (The values are still denoted
by Yi, but now i runs from 1 to M and all M Yi-values must be different.)
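The following Python sketch reproduces Example 1.1, computing µ and σ² both from the raw
population and from its frequency table (note these are population formulas, dividing by N
rather than N − 1):

    # Raw population from Example 1.1
    Y = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
    N = len(Y)
    mu = sum(Y) / N                                # population mean
    sigma2 = sum((y - mu) ** 2 for y in Y) / N     # population variance
    print(mu, sigma2)                              # 5.0 1.2

    # The same result from the frequency table {value: frequency}
    freq = {3: 1, 4: 2, 5: 4, 6: 2, 7: 1}
    N = sum(freq.values())
    total = sum(f * y for y, f in freq.items())
    total_sq = sum(f * y * y for y, f in freq.items())
    mu = total / N
    sigma2 = (total_sq - total ** 2 / N) / N
    print(mu, sigma2)                              # 5.0 1.2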

Exercise 1
Calculate the population mean and the population variance for each of the following five distributions.

1.
Yi 3 4 5 6 7
fi 1 3 2 3 1

2.
Yi 3 4 5 6 7
fi 2 2 2 2 2

3.
Yi 3 4 5 6 7
fi 1 4 2 0 3
4.
Yi 3 4 5 6 7
fi 3 2 0 2 3
5.
Yi 3 4 5 6 7
fi 2 4 0 0 4

3 TYPES OF LINEAR MODELS
In any experiment our main aim is to study some variable, e.g. growth rate. The behaviour of
growth rate can be expressed in terms of an equation, and this is referred to as a linear model.
This equation explains the response of the variable in terms of its component parts. For all
the models of analysis, a linear model equation of the form

Yi = µ + ei

is defined. In this equation Yi denotes not only the trait but also an observation induced by the
model. This observation is the sum of the mean µ and an error term ei containing
observational errors and the variability between experimental units. The models of analysis of
variance differ by the number and nature of the parameters under study. The observations in
an analysis of variance are allocated to at least two classes, which are determined by the levels
of the factors. Each factor occurring in the model has at least two levels.
For example
An experiment to measure the effect of feeding dairy meal to cows of the same breed, age and
in the same stages of lactation on milk yield will have the following linear model (assuming
presence of a control group)

Yij = µ + τi + eij
where
Yij = observation on the jth cow of the ith treatment
µ = overall population mean
τi = effect due to the ith treatment (deviation of each treatment mean from the overall mean)
eij = random error associated with Yij.
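To make the model concrete, here is a small Python sketch that simulates observations under
this model; the overall mean, treatment effects and error standard deviation are invented for
illustration:

    import random

    mu = 20.0                                      # overall mean yield (hypothetical)
    tau = {"control": 0.0, "dairy meal": 2.5}      # treatment effects (hypothetical)
    sigma_e = 1.0                                  # sd of the random error e_ij

    random.seed(1)
    # Y_ij = mu + tau_i + e_ij, for j = 1..5 cows per treatment
    for treatment, effect in tau.items():
        y = [mu + effect + random.gauss(0, sigma_e) for _ in range(5)]
        print(treatment, [round(v, 1) for v in y])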

3.1 Fixed effects model


Assume that our experiment can use all the treatments of interest. After the experiment, the
inference drawn will be restricted to the set of treatments used. If this is the case, the linear
model is called a fixed effects model, e.g. if we are testing 3 rations our model will be fixed
because the inference will refer only to those 3 rations.

3.2 Random effects model


Suppose we are conducting an experiment with a large number of possible treatments. We are
not able to test all of them, but we want the inference made to apply to all of them, so we are
forced to use only a sample of treatments. The set of treatments that could be used is therefore
larger than the set appearing in the experiment. In such a case we are dealing with a random
effects model, e.g. in an experiment on the effect of humidity on larva development we may
be forced to work with three treatments - 10%, 20% and 30% humidity (we could have used
more). The inference made will concern how humidity in general affects larva development.

3.3 Mixed effects model


If an experiment has more than one factor, some of which are fixed and others random, then
the model is called a mixed effects model.

Exercise 2
Mr. Nyasi designed an experiment to investigate the effect of Rhodes grass hay, maize silage,
sorghum silage and Napier silage on milk yield. His aim was to adopt and use the forage that
gives highest milk yield.
He used Friesian cows in 3 stages of lactation (First, mid and last third). Each group of cows
was kept in a zerograzing pen. Write down his linear model.

4 DATA MANAGEMENT

4.1 What is data?


At the simplest level, "data" are the values recorded in field books, record books or data-
logging devices that are to be entered into the computer and then analyzed. Roughly, one can
regard the data management task in a project as simple if all the data to be computerized have
been collected on a single type of unit, e.g. plots or animals. The task is complex where data
have been collected from a number of different units or levels. For example, in an on-farm
study there will often be interview data at the farm level and response measurements at the
plot, animal or tree level.

Examples of data:
- Univariate data, e.g. the height of an individual drawn randomly from a population of
graduate students.
- Multivariate data, e.g. post-graduate students' weights, heights and ages.
- Time series, e.g. the height of plants at different ages, or heights and weights at different
ages.

4.2 Metadata
This is data about data or descriptive information about data, which allows a potential user to
determine a dataset's fitness for use. Metadata has many applications. It can be used to:
- Concisely describe datasets and other resources using elements such as the name of the
dataset, its quality, who the custodian is, how to access the data, what its intended purpose is,
and whom to contact for more information about the data.
- Enable effective management of data resources.
- Enable accurate search and data resource discovery.
- Accompany a dataset when it is transferred to another computer, so that the dataset can be
fully understood and put to proper use, and so that the custodian of the dataset is duly
acknowledged.

Why should we use metadata?


Metadata help users find the data they need and determine how best to use them. They are
also of benefit to the data-producing organization, for example when personnel change in an
organization: new employees may have little understanding of the contents and uses of a
dataset and may find that they do not understand the results generated from these data. Lack
of knowledge about other organizations' data can also lead to duplication of effort and waste
of time.
Information needed to create metadata is often readily available when the data are collected.
A small amount of time invested at the beginning of a project will save money and time in the
future. Data producers cannot afford to be without documented data. The initial expense of
documenting data clearly outweighs the potential costs of duplicated or redundant data
generation.

How much metadata should be stored?


There are now many different 'standards' for metadata and there are metadata-authoring tools
available, which may or may not suit your needs. To help you decide how much metadata to
store we suggest you think what, where, when, how, why and who.
What do the data represent?

What was the name of the project that generated them? Perhaps include an introduction or
abstract to the project.
What is the format and structure of the data? Include here any naming conventions used for
the files.
Where were the data collected? Give details about the site and perhaps the sampling frame
used.
When were they collected? State the time period covered by the data.
How were the data collected? Describe the measuring instruments used.
Why were they collected? What was the purpose of the research?
Who collected the data, or who were the principal researchers involved? Who holds the data?
Who has property rights to the data? Include names and contact details.

These are just ideas but if you can answer these questions then you are well on the way to
having a well-documented dataset. Of course for this information to be useful for others it
must be stored somewhere that others can access - in other words not just your memory!
Below is an outline of what might be included in the metadata.
- Title: the name of the dataset or project.
- Authors: the name of the principal investigator and other major players in the research,
including mailing address, phone number, fax number, email, web address, etc.
- Dataset overview: an introduction or abstract, the time period covered by the data, the
physical location of the data, and any references to the Internet.
- Instrument description: brief text describing the instrument, with references, figures or links
if applicable, and a table of specifications.
- Data collection and processing: a description of the data collection, of any derived
parameters, and of the quality control procedures used.
- Data format: data file structure and naming conventions, data format and layout, data
version number and date, and a description of the codes in the data.
Data documentation should be sufficiently complete that persons unfamiliar with a given
project could read the documentation and be able to use and interpret the data.

Part of your data management strategy might be to develop a database of metadata for all your
research projects. Of course if your datasets are currently in disarray and undocumented, this
will be a mammoth task, but you could start with current and future projects. We don't expect
you to solve all your data management problems overnight.

4.3 Data management software


One of the first decisions is to decide on the appropriate software to use. This will depend on
the complexity of the study and the different types of observational units at different
hierarchical levels. So far we are considering simple data structures, essentially at one layer.
The different types of software that can be used for handling such data include:

- Database (DBMS) packages, e.g. Access, dBase, EpiInfo, Paradox, DataEase;
- Statistical packages, e.g. Genstat, MSTAT, SAS, SPSS, Statgraphics, Systat;
- Spreadsheet packages, e.g. Excel, Lotus-123;
- Word processors, e.g. Word, WordPerfect, or text editors, e.g. Edit.
- Geographic information systems are also available for storing spatial data.
Some of the advantages and disadvantages of the three types of packages for data
management are given in the following table.

Package: Statistical
Advantages: Data management and data analysis can be done within the same package. Can
do most of what a spreadsheet package can do. Will usually have programming capabilities.
Disadvantages: Usually unsuitable for multi-level data structures. Lacks the security of a
relational database management system. Graphical facilities may not be as good as in
specialized graphical software.

Package: Spreadsheet
Advantages: User friendly, well known and widely used. Good exploratory facilities.
Disadvantages: Unsuitable for multi-level data structures. Lacks the security of a relational
database management system. Statistical analysis capabilities limited to simple descriptive
methods.

Package: Database management
Advantages: Secure. Can handle complex multi-level data structures. Allows screen design
for data input. Will generally have standard facilities for reporting.
Disadvantages: Needs computer expertise in database development. Graphical facilities may
not be as good as in specialized graphical software. Statistical analysis capabilities limited to
simple descriptive methods.

Each type of package has its own special use. Nevertheless, statistical, spreadsheet and
database management packages have overlapping facilities for data management, and all can
now 'talk' to each other. In other words, a data file stored in one package can be exported
(transferred) to another. The researcher needs to anticipate at the start of a project how data
entry, management and analysis will proceed, and plan accordingly.
Excel is popular among researchers and is suitable for storing data from simple studies. But
care is needed in handling spreadsheets, as it is easy to make mistakes. Excel does not offer
the same data security as a database management system, so it is easy to lose or spoil data.

There is no reason nowadays why data entry, management and analysis cannot all be done in
the one statistical package, especially for the simpler studies in which data can be stored in
standard spreadsheets. The problems of data security remain but the advantage, as indicated in
the above table, is that all the stages from data entry to analysis and interpretation can be done
within the same package.
Once all the data handling facilities of a statistical package have been exhausted, attention can
then be turned to Excel or another spreadsheet package to see what additional facilities are
available there. Finally, the use of more specialized database management software for
handling the multi-level datasets on which some of the case studies are based can be taught.
The advantage of this approach is that the student can find out what data handling facilities
are available within the statistical package itself. This means that the package will be taught
not just as a statistical tool but as a data handling tool too. The student will not have to move
from one package to another with the risk of making mistakes. At the end of the course the
student will be able to grasp better the appropriate roles for each type of package in the data
handling and management process.
Generally, the following observations can be made on software for data management. The
transfer of data between packages is now simple enough that the same package need not be
used for the different stages of the work. The data entry task should be conceptually separated
from the task of analysis. This will help when thinking about what software is needed for data
keying, for checking purposes, for managing the "data archive" and for analysis.

Database management software (DBMS) should be used far more than at present. Many
research projects involve data management tasks that are sufficiently complex to warrant the
use of a relational database package such as Access. Spreadsheet packages are ostensibly the
simplest type of package to use. They are often automatically chosen for data entry because
they are familiar, widespread and flexible – but their very flexibility means that they can
result in poor data entry and management. They should thus be used with great care. Users
should apply the same rigor and discipline that is obligatory with more structured data entry
software.

More consideration should be given to alternative software for data entry. Until recently the
alternatives have been harder to learn than spreadsheets. Some statistical packages, for
example SPSS, have special modules for data entry and are therefore candidates for use at the
entry and checking stages. If a package with no special facilities for data checking is used for
the data entry, a clear specification should be made of how the data checking will be done. A
statistics package – not a spreadsheet – should normally be used for the analysis.

4.4 Data storage and backup strategies for your research data.
Data disasters
There exists much literature on perceptions of risk. Most people will overestimate the chance
of relatively rare disasters such as earthquakes and plane crashes. They underestimate the
chance of common problems. In particular, they underestimate the chance of their computer
crashing, a probability that approaches 1. Power fluctuations and dirt can damage the hard
disk, and the read/write mechanism can just wear out. In addition, the disk can become too
fragmented and slow down; these and several other problems lead to corrupted files, lost files,
an inaccessible hard disk, or worse. On top of that there are different categories of human
error: loss of a laptop, deleting files, deleting parts of files. Damage can also be done on
purpose by unscrupulous persons, jealous colleagues or disgruntled employees: theft of
laptops, deliberate deletion of files or damage to computers. And then of course there is
always the chance of fire or water damage. Last but not least there is an ever-growing group
of "malware": computer viruses, Trojan horses, backdoors and spyware.
The goal of this section is not to protect you from a complete system failure; that is the quite
complex job of a skilled system administrator. The goal is to protect you from loss of data
from your current research activities - and, in a way, to protect you from yourself, since it
involves some discipline. The only way to avoid unrecoverable data loss is to back up
regularly in an organized way.
The following steps can be used to store and backup your research data.
Step 1: Organize your data.

In this step you want to organize your data in a structured logical way. Physically, you put
them in a structured set of subfolders within a folder. Give those folders a logical name that is
easy to understand. It will make life easier when trying to find a file. This data organization
can take some time, but you only have to do it once. The moment you have your data
structured, change the default location where some common software saves its files.

Step 2: Add documentation to your data

With your data logically ordered in subfolders you can easily locate the folder where a
specific data file is stored. But that folder might contain so many files that you still don’t
easily find what you’re looking for.

It is therefore important to add documentation to the data at the level of individual files. This
involves naming your files, but also creating file properties, which is useful when performing
searches. Give your data files descriptive and meaningful names. For example, if you wanted
to name an Excel workbook containing yield data from an experiment with tomatoes, you
could have named it something like "tomyld.xls". With such a file name, any person who is
not specialized in tomato yields needs a lot of imagination to figure out the contents of the
file. Instead, use long, descriptive and meaningful file names, but do not exaggerate their
length. It is good practice to name the file containing the tomato yield data something like
"Tomato yield harvest year 1999.xls". This file name is only 34 characters, but is long enough
to show what the file is about.

Step 3: Select an appropriate storage medium.

Your data are now ordered in a structured and logical way. And you have added sufficient
documentation to each file. The only thing you have to do now to make a backup is to copy
the folder containing your data (D:\Data or C:\Data) and paste it onto a storage medium.

The following is a review of the advantages and disadvantages of some common storage
media. As you will learn, it has been concluded that the CD-R (and CD-RW) is the best
backup medium for the research data of individual researchers and small organizations.

5 EXPERIMENTAL DESIGNS

5.1 Principles of experimental design


There are only three principles of experimental design: replication, randomization and local
control.

5.1.1 Replication
If I were to tell you that a farmer gave his sick cow a home-made medicine and that a week
later she had got better, you would be rightly sceptical about this 'evidence'. That was not an
experiment because there was only one treatment. Any experiment must have at least two
treatments (one of which may, if we choose, be a control involving no treatment). Suppose
that the farmer had two sick cows and he gave his medicine to one of them: she got better in a
week and the other, untreated, cow took a month to recover. This is now an experiment but it
is un-replicated and will not impress us greatly. However, if the farmer had six cows showing
similar symptoms and he gave his medicine to three of them, who recovered in 5-10 days,
while the other three cows, left untreated, took 30-50 days to get better, you may well think
that this farmer's medicine deserves to be taken seriously. This is now a replicated
experiment. The whole basis of science is that observations are repeatable, although when we
are using biological material we have to allow for the effects of natural variation which clouds
our observations. We know that cows, like people, sometimes recover from illness whether
they are treated or not and so we demand controls (animals not treated) and replication before
we are satisfied that the evidence indicates that the medicine is the cause of the early recovery
of treated cows.
The essence of a scientific experiment conducted with variable biological material is that we
judge differences between units treated differently in the light of observed variation in units
treated alike.

Definition
Replication is the appearance of a treatment several times in an experiment. A treatment may
be replicated 2, 3, 4 or more times. Replicating a treatment is essential for the following
reasons:
1. Replicates provide a means of estimating experimental error. Without replication there is
no true estimate of experimental error, because there are no degrees of freedom for error
[t(r − 1) = 0 when r = 1].
2. Replication improves the precision of an experiment by reducing the standard deviation,
and hence the variance, of a treatment mean: the more replications there are, the lower the
variance, and the closer the treatment mean comes to the population mean.
3. Replicates increase the scope of inference of the experiment through the selection and use
of more variable experimental units.
4. Replicates permit some control of experimental error variance through the grouping of
experimental units.

5.1.2 Randomization
If I were now to tell you that the farmer reported above looked at his six ailing cows and
chose the three that appeared least affected to receive his medicine, you would immediately
revise your opinion of his 'experiment'. You would rightly say that the question of which
cows were to be treated should have been determined at random. This, like replication, is such
a common sense precaution against biased results that it does not need much elaboration as a
principle. The question of how randomness is achieved in practice will be dealt with later in
this course.
In medical research, where patients are randomly allocated to different medicines or to a
medicine and a placebo (that is a tablet, cream or solution looking exactly like the real
medicine but lacking the active ingredient), it is common practice to conceal the information
about which patient is receiving which treatment from both the patients themselves and from
the doctors recording the results of the trial. This is called a 'double-blind' trial. This
procedure is not commonly applied in animal experiments, on the grounds that the
observations made are 'objective' (e.g. weights of animals or their products). But when, for
example, behavioural observations are made, there is a serious risk that subjective bias may
creep in and you should then aim to guard against this by specifying that, wherever possible,
the person making the observations does not know which animals are receiving which
treatment. There are, however, cases where the treatments applied are apparent even to a
casual observer (trials assessing grazing behaviour on different sown pastures, for example)
and then it becomes important to make the observations as objective as possible by, for
example, formally recording with a stop-watch the times that animals spend in defined
activities.

Definition
Randomization is a procedure by which the experimental units are randomly (by chance)
assigned to the treatments (or treatment combinations). The importance of randomisation
includes:
1. It is aimed at avoiding bias. It is somewhat analogous to insurance, in that it is a precaution
against disturbances that may or may not occur.
2. It assures a valid estimate of experimental error and avoids bias in estimating the means.
3. Randomization tends to destroy the correlation among errors and makes valid the usual
tests of significance.

5.1.3 Local control


Local control means imposing some restriction on the random allocation of treatments to take
account of known or supposed initial differences in the experimental material. Much the most
common form of local control in formal experimentation is blocking, which will be described
later in this course.
Another kind of local control, often used in survey work, is stratification. This is like
blocking, except that the numbers are usually ragged. For example, you might be interested in
mastitis in dairy cows and the correlation between various husbandry and hygiene practices
and the incidence of the disease. For this purpose, you decide to take a random sample of the
farms in a study area (you cannot visit and study them all) but before drawing names and
addresses from a hat containing names of all the farmers in the designated area, you may
decide to classify farms according to the size of their dairy herds. You will then draw similar
proportions (at random) from each size class and thereby ensure that your survey fairly
samples the situation in small, medium and large herds, and thus gives a truer picture of
mastitis in the region. In such cases the number of farms in each stratum of the survey will
probably be different, and the number yielding reliable information will probably differ from
the number intended at the outset. Blocking is rather like stratification but, because it is
applied in a planned experiment, it is usually more regular in its features.

5.2 Types of treatments


Treatments can either be structured or unstructured. Unstructured treatments are treatments
that are unrelated, e.g. an animal nutritionist testing three feeds, one using maize as the source
of energy, another cassava and the last sorghum. His aim is to determine whether maize can
be substituted by cassava or sorghum as an energy source. These three treatments are
unstructured.
Structured treatments are treatments with some sort of relationships. These relationships could
take the form of
1. A regression function
2. A factorial treatment with 2 or more factors
3. A gradient relationship
4. A group relationship

5.2.1 Gradient relationship


Example: testing the effect of humidity on larva development, where the treatments are
various percentages of humidity, e.g. 10%, 20% and 30%.

5.2.2 Group relationship


This is difficult to ascertain but a good example is seen in milking machines. There are
different types of machines. They include machines with (a) roller bearing manually oiled (b)
roller bearing automatically oiled (c) brass bush manually oiled (d) roller bearing but sealed
(e) nylon bush automatically oiled. If our aim is to look at the maintenance, these machines
can be grouped into those with roller bearing and those without or those manually oiled and
automatically oiled or those that are sealed and those not sealed.
It is important to note the type of treatment one is dealing with, because the post-ANOVA
procedures differ, even though the ANOVA itself is the same for all these treatment types.

5.3 Basic statistics on experimental data
Experiments may entail the calculation of basic statistics (variance, standard deviation,
standard error of the mean, coefficient of variation, etc.) for the purpose of summarizing the
data. The following example shows how this can be done.

Example 4.1
Calculate the statistics for the following data of number of eggs laid in 30 days by a flock of
chickens treated with a vitamin (H = hens number 1 to 12).

Observation H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12


Data 14 18 15 24 19 16 15 26 22 20 14 18

1. The grand total = Σ Yi = 14 + 18 + … + 18 = 221

2. The mean = 221/12 = 18.42

3. The uncorrected sum of squares = Σ Yi² = 14² + 18² + … + 18² = 4243

4. The corrected sum of squares (SS) = Σ Yi² − (Σ Yi)²/n = 4243 − (221)²/12 = 173

5. The variance σ² = SS/degrees of freedom (df) = 173/(12 − 1) = 15.73

6. The variance of the mean = σ²/n = 15.73/12 = 1.31

7. The standard deviation σ = √15.73 = 3.97

8. The standard error of the mean (SEM), the square root of the variance of the mean,
= √(σ²/n) = √1.31 = 1.14

9. The coefficient of variation CV = σ × 100/mean = 3.97 × 100/18.42 = 21.5%
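A short Python sketch that reproduces these statistics for the egg data:

    import math

    data = [14, 18, 15, 24, 19, 16, 15, 26, 22, 20, 14, 18]
    n = len(data)

    grand_total = sum(data)                        # 221
    mean = grand_total / n                         # 18.42
    uncorrected_ss = sum(y * y for y in data)      # 4243
    ss = uncorrected_ss - grand_total ** 2 / n     # 173 (corrected SS)
    variance = ss / (n - 1)                        # 15.73
    var_of_mean = variance / n                     # 1.31
    sd = math.sqrt(variance)                       # 3.97
    sem = math.sqrt(var_of_mean)                   # 1.14
    cv = 100 * sd / mean                           # 21.5
    print(round(variance, 2), round(sem, 2), round(cv, 1))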

5.4 Completely randomized design (CRD)


The CRD is used when all the experimental units are homogeneous, i.e. when there is no
source of variation among the experimental units and one is therefore not able to group the
animals on the basis of some characteristic, e.g. feeding different rations to dairy cows of the
same breed, age and stage of lactation. The only source of variation would be the effect of the
rations. The design is complete because exactly ni experimental units receive the ith
treatment, and randomised because the treatments are allocated randomly.

5.4.1 Characteristics of CRD


1. Treatments are assigned completely at random.
2. The experimental units must be homogeneous.
3. Any difference among experimental units receiving the same treatment is attributed to
experimental error.
4. It is a one-way classification in ANOVA; classification is by treatment only.

5.4.2 Randomization layout (randomisation of given treatments)


Assume an experiment carried out with 4 treatments (A, B, C and D), each replicated 5 times.
The steps involved in randomisation are:
1. Determine the total number of experimental units or plots, n, given as the product of the
number of treatments (t) and the number of replications (r),
i.e. n = t × r = 4 × 5 = 20.
2. Assign each plot (experimental unit) a number in a convenient manner, for example
consecutively from 1 to n (1 to 20).

1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
3. Assign the treatments to the experimental units by way of randomisation.

(I) Using a table of random numbers:

(a) Locate a starting point in the table of random numbers, e.g. 6th row, 12th column (using 3
digits).
(b) From the starting point obtained in (a), read downwards to obtain 20 three-digit numbers.
We find the following:

Random number  Sequence  Rank     Random number  Sequence  Rank
937            1         17       918            11        16
149            2         2        772            12        14
908            3         15       243            13        6
361            4         7        494            14        9
953            5         19       704            15        12
749            6         13       549            16        10
180            7         4        957            17        20
951            8         18       157            18        3
018            9         1        571            19        11
427            10        8        226            20        5
(c) Rank these random numbers.
(d) Divide the n ranks into t groups each consisting of r numbers according to the sequence in
which the random numbers appeared.

Groups Ranks in the group
1 17 (A) 2 (A) 15 (A) 7 (A) 19 (A)
2 13 (B) 4 (B) 18 (B) 1 (B) 8 (B)
3 16 (C) 14 (C) 6 (C) 9 (C) 12 (C)
4 10 (D) 20 (D) 3 (D) 11 (D) 5 (D)

(e) Assign the t treatments to the n experimental plots (units) by using the group number as
the treatment number and the corresponding ranks in each group as the plot (unit) numbers to
which that treatment is assigned.

II Drawing lots
Rather than using random numbers e can use drawing lots by using papers which should
conform to your experimental units. For them to be homogenous, they have to be identical.
Steps involved include: -
1. Prepare n identical pieces of paper and divide them into t groups, each group with r pieces
of paper.
2. Label each piece of paper in the same group with the same letter (or number) corresponding
to a treatment. Uniformly fold each of the n labelled pieces of paper, mix them thoroughly and
place them in a container.
3. For our example, there should be 20 pieces of paper, five each labelled with treatments A,
B, C and D.
4. Draw one piece of paper at a time without replacement and with a constant shaking of the
container after each draw to mix its contents.
5. For our example, the label and the corresponding sequence in which each paper is drawn
may be as follows: -

Treatment D B A B C A D C B D D A A B B C D C C A
label
Sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
6. Assign the treatments to plots (units) based on the corresponding treatment labels and the
sequence drawn.
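Both randomisation methods can be mimicked in software with a single shuffle. A minimal
Python sketch assigning t = 4 treatments, each replicated r = 5 times, to the 20 plots:

    import random

    treatments = ["A", "B", "C", "D"]
    r = 5                              # replications per treatment
    labels = treatments * r            # 20 labels, 5 of each

    random.shuffle(labels)             # the electronic equivalent of drawing lots
    for plot, treatment in enumerate(labels, start=1):
        print(f"plot {plot}: treatment {treatment}")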

5.4.3 Linear model


The linear model for the CRD is
Yij = µ + τi + eij
where
Yij = observation on the jth unit of the ith treatment
µ = overall population mean
τi = effect due to the ith treatment (deviation of the treatment mean from the overall mean,
µi − µ)
eij = random error associated with Yij (deviation of each observation from its own treatment
mean, Yij − µi).

5.4.4 Format for data recording and ANOVA in CRD


Data from a CRD experiment is normally organised as shown in the table below.

Treatment    Observation: 1    2    ……    j    ……     Treatment total
1            Y11    Y12    ……    Y1j    ……            Y1.
2            Y21    Y22    ……    Y2j    ……            Y2.
…
i            Yi1    Yi2    ……    Yij    ……            Yi.
…
t            Yt1    Yt2    ……    Ytj    ……            Yt.
                                        Overall total  Y..

Assuming that the model is fixed and that the treatments are unstructured, the ANOVA is
tabulated as follows (note that rt = n):

Source              Degrees of freedom (d.f.)  Sum of Squares (SS)  Mean Squares (MS)     F
Treatment           t − 1                      SST                  MST = SST/(t − 1)     MST/MSE
Experimental error  t(r − 1)                   SSE                  MSE = SSE/[t(r − 1)]
Total               rt − 1                     SSY

where
1. SSY = total sum of squares = Σij Yij² − CF
2. CF = correction factor = Y..²/n
3. SST = sum of squares for treatments = (Σi Yi.²)/r − CF
4. SSE = sum of squares for error = SSY − SST
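These quantities translate directly into code. A minimal Python sketch for a balanced CRD,
assuming the data are held as one list of observations per treatment:

    def crd_anova(groups):
        """One-way ANOVA for a balanced CRD: groups is a list of t lists,
        each containing r observations."""
        t = len(groups)
        r = len(groups[0])
        n = t * r

        grand_total = sum(sum(g) for g in groups)
        cf = grand_total ** 2 / n                           # correction factor
        ssy = sum(y * y for g in groups for y in g) - cf    # total SS
        sst = sum(sum(g) ** 2 for g in groups) / r - cf     # treatment SS
        sse = ssy - sst                                     # error SS

        mst = sst / (t - 1)
        mse = sse / (t * (r - 1))
        return ssy, sst, sse, mst / mse                     # last value is F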

Let us assume that our aim is to look at the effect of feeding dairy meal on the milk yield of
cattle which are homogeneous, i.e. of the same breed and age and in the same stage of
lactation, and that there is a group that is un-supplemented (control). The experimental design
is a CRD.
In such an experiment the hypotheses are
H0 (null hypothesis): Feeding of dairy meal does not lead to changes in milk yield, i.e. µ1
(mean of control) = µ2 (mean of supplemented).
Ha (alternative hypothesis): Feeding of dairy meal leads to changes in milk yield, i.e. µ1
(mean of control) ≠ µ2 (mean of supplemented).

To test which hypothesis is the correct one, we use F values, tables of which are found in
most books of statistics. From the ANOVA table, F = MST/MSE. The tabulated value is
obtained by using the degrees of freedom of treatment (t − 1) for the numerator and those of
error, t(r − 1), for the denominator. If the tabulated F value is greater than the calculated F
value, then the treatment effect is not significant and H0 is accepted (Ha is rejected), i.e.
feeding of dairy meal does not lead to changes in milk yield. If the tabulated F value is less
than the calculated F value, then the treatment effect is significant and Ha is accepted (H0 is
rejected), i.e. the feed does lead to changes in milk yield.
If the ANOVA gives a significant effect, all it tells you is that at least one of the means is
significantly different from the others. If this is the case, then more analysis is needed to see
which treatments are significantly different. However, when there is no significant effect of
treatment, that is the end of the analysis.

Example 4.2
An agronomy student has 2 varieties of soybeans. He wants to compare them with a control
variety to see if either of them is better than the control. The following are the data obtained
from the study:

Variety      Observations on the yield of soybean per plot


Variety 1 6.6 6.4 5.9 6.6 6.2 6.7 6.3 6.5 6.5 6.8
Variety 2 5.6 5.2 5.3 5.1 5.7 5.6 5.6 6.3 5.0 5.4
Variety 3 6.9 7.1 6.4 6.7 6.5 6.6 6.6 6.6 6.8 6.8
Procedure
1. Calculate the total for each treatment (Yi.):

Variety              Yi.
Variety 1 (Y1.)      64.5
Variety 2 (Y2.)      54.8
Variety 3 (Y3.)      67.0
TOTAL (Y..)          186.3

Then calculate the correction factor CF = Y..²/n = (186.3)²/30 = 1156.9.

2. Calculate the total sum of squares (SSY) by squaring each observation, summing, and
subtracting the CF:

SSY = Σ Yij² − CF = [(6.6)² + (6.4)² + … + (6.8)²] − CF
    = 1167.5 − 1156.9
    = 10.6

Note that the total uncorrected sum of squares is 1167.5.

3. Calculate the sum of squares of treatments (SST):

SST = (Σ Yi.²)/r − CF = [(64.5)² + (54.8)² + (67.0)²]/10 − CF
    = 1165.2 − 1156.9
    = 8.3

4. Calculate the sum of squares of error: SSE = SSY − SST = 10.6 − 8.3 = 2.3

5. We are testing all the treatments of interest, and therefore our conclusions will refer only to
those treatments (fixed model). The ANOVA table looks as follows:

Source              d.f.             SS     MS     F
Variety             3 − 1 = 2        8.3    4.15   4.15/0.09 = 46.1
Experimental error  3(10 − 1) = 27   2.3    0.09
Total               3 × 10 − 1 = 29  10.6

Note that MS = (corrected sum of squares)/degrees of freedom. The F test requires the
comparison of the mean squares; for our case F = 46.1. To check for significance at a certain
level, e.g. 5% or 1%, we use the F tables, taking the degrees of freedom of variety (i.e. 2) as
the numerator and those of experimental error (i.e. 27) as the denominator. For these degrees
of freedom the table gives a value of 3.35. If the calculated F value is less than 3.35, the
varieties are not statistically different at the 5% level. If the calculated F is 3.35 or greater, the
varieties are statistically different at the 5% level.
Since our calculated F of 46.1 is greater than the tabulated F (3.35), the varieties are
statistically different. The ANOVA has simply told us that at least 2 varieties are significantly
different from each other.
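If SciPy is available, the critical value can be computed instead of being read from printed
tables; this sketch checks Example 4.2 at the 5% level:

    from scipy.stats import f

    f_calculated = 46.1                    # from the ANOVA table above
    df_treatment, df_error = 2, 27

    f_critical = f.ppf(0.95, df_treatment, df_error)   # about 3.35
    print(f"critical F = {f_critical:.2f}")
    if f_calculated > f_critical:
        print("varieties differ significantly at the 5% level")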
It is customary to make a table and list the difference between pair of means and their
standard errors.

Remember that the standard error of a mean (SEM) is calculated as √(σ²/n), where σ is the
standard deviation and n is the sample size. If an ANOVA has been performed, then σ² is the
MSE, and in calculating the SEM for a treatment mean, n is the number of replicates
contributing to that mean. For our case, therefore, let us use the letter r instead of n. The
standard error of the difference between two means (SED) is calculated as

SED = √[(SEM1)² + (SEM2)²] = √(σ1²/r1 + σ2²/r2)

The above formula must be used if the two treatments being compared have unequal
variances. However, if you have performed an ANOVA and obtained a pooled MSE which is
presumed to be applicable to all treatments, and if the treatments all have the same number of
replicates, then σ1² = σ2² and r1 = r2, and therefore

SED = √(2σ²/r) = √(2MSE/r)

When the number of observations in each treatment is not equal, then

SED = √[MSE(1/r1 + 1/r2)]

Please note that SED = √2 · √(σ²/r) = √2 · SEM.
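A small Python sketch of these SED formulas (the MSE and replicate number below are the
ones from Example 4.2):

    import math

    def sed_pooled_equal(mse, r):
        """Pooled MSE, equal replication."""
        return math.sqrt(2 * mse / r)

    def sed_pooled_unequal(mse, r1, r2):
        """Pooled MSE, unequal replication."""
        return math.sqrt(mse * (1 / r1 + 1 / r2))

    def sed_unequal_var(var1, r1, var2, r2):
        """Unequal variances between the two treatments."""
        return math.sqrt(var1 / r1 + var2 / r2)

    print(round(sed_pooled_equal(0.09, 10), 2))   # 0.13, as used below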

The appropriate SED for the difference between two differences can also be calculated. Given
four treatment means Ȳ1, Ȳ2, Ȳ3 and Ȳ4, each with its own SEM:

SED(1−2) for the difference Ȳ1 − Ȳ2 = √[(SEM1)² + (SEM2)²]

SED(3−4) for the difference Ȳ3 − Ȳ4 = √[(SEM3)² + (SEM4)²]

If we then wish to ask whether the difference (Ȳ1 − Ȳ2) differs significantly from the
difference (Ȳ3 − Ȳ4), the appropriate SED is

SED(1−2)−(3−4) = √[(SEM1)² + (SEM2)² + (SEM3)² + (SEM4)²],
or √[MSE(1/r1 + 1/r2 + 1/r3 + 1/r4)] when a pooled MSE is obtained which is presumed to be
applicable to all treatments.

If in a particular case the four means Ȳ1, Ȳ2, Ȳ3 and Ȳ4 all have the same estimated variance
and are all based on the same number of replications, r, then the SED for the difference
between two differences is

SED(1−2)−(3−4) = √(4σ²/r) = √(4MSE/r)

For our situation the number of observations in each treatment is equal, therefore

SED = √(2MSE/r) = √[(2 × 0.09)/10] = 0.13

i   j   Ȳi − Ȳj                SED
1   2   6.45 − 5.48 = 0.97     0.13
1   3   6.45 − 6.70 = −0.25    0.13
2   3   5.48 − 6.70 = −1.22    0.13

and

SED(i−j)−(k−l) = √(4MSE/r) = √[(4 × 0.09)/10] = 0.19

(i − j) − (k − l)     (Ȳi − Ȳj) − (Ȳk − Ȳl)    SED(i−j)−(k−l)
(1 − 2) − (1 − 3)     0.97 + 0.25 = 1.22       0.19
(1 − 2) − (2 − 3)     0.97 + 1.22 = 2.19       0.19
(1 − 3) − (2 − 3)     −0.25 + 1.22 = 0.97      0.19

Exercise 3
The following are weekly first-lactation yields of Friesian cows on 4 different diets. Two sets
of weekly yields were incomplete due to the death of the cows. The researcher is interested in
determining whether the average weekly yields differ among the four diets, each diet being a
treatment.

Diet Weekly lactation yields


1 231 209 226 214 230 218
2 160 183 210 179 191 -
3 251 246 238 227 240 -
4 195 188 204 192 210 197
Assume fixed model
(a) Is there significant difference between the diets?
(b) Tabulate difference of all pairs of means and their corresponding SED.
(c) Tabulate differences between two differences and their corresponding SED

As already alluded to, if ANOVA gives a significant effect, then more analysis is needed to
see which treatments are significantly different. There is therefore the need to separate the
means using a suitable procedure. This is called Post-ANOVA analysis and uses one of the
mean separation procedures described in the next section.

5.5 Mean separation procedures


There are five different procedures that can be used in cases where the model used was fixed
and the treatments were unstructured, i.e. there are no relationships between the treatments.
These procedures include: -
1. Least significant difference method (LSD)
Two ways
- using students' t-distribution tables (you are familiar with this)
- using studentized range values
2. Tukey's range procedure (TK or HSD)
- using studentized range value tables
3. Student Newman Keuls test (SNK)
- using studentized range values tables
4. Duncan's multiple Range test (DMRT)
- using values for Duncan's new multiple Range test (DMRT) tables

5. Dunnet's procedure.
- used where one mean is control and all other means are compared with it.

Each of these procedures has advantages and disadvantages in guarding against Type I and
Type II errors. Rejecting or accepting a hypothesis can involve committing errors.
Type I error
Occurs when you judge a pair of means significantly different when they are actually equal, i.e.
falsely accepting Ha (μ1 ≠ μ2) or falsely rejecting H0 (μ1 = μ2).
Type II error
Occurs when a pair of means is actually different but this difference is not detected, i.e.
falsely accepting H0 (μ1 = μ2) or falsely rejecting Ha (μ1 ≠ μ2). In other words, this error
occurs when you judge a pair of means equal when they are significantly different.
(In other words, if we reject a hypothesis when it should be accepted, we say that a Type I
error has been made. If, on the other hand, we accept a hypothesis when it should be rejected,
we say that a Type II error has been made. In either case a wrong decision or error in
judgment has occurred.)
Post-ANOVA analysis using the above procedures is done to strike a balance between these
two types of error.

If the comparison of means is between each treatment mean and the control, then the number
of comparisons to be made is k - 1, where k is the number of treatments. If k = 4, then the
number of comparisons is 3, i.e. compare 1 with 2, 1 with 3 and 1 with 4. If the experiment has
no control, then the number of comparisons to be made will be k(k - 1)/2. If k = 4, then the
number of comparisons will be 6, i.e. compare 1 with 2, 3 and 4; 2 with 3 and 4; and 3 with 4.

In cases where the treatments are structured, i.e. related, separation of means using these
procedures is meaningless. Therefore, we do not compare the means but form contrasts. A
contrast is a linear equation whose coefficients add up to zero. Note that contrasts can also be
used in situations where the treatments are unstructured. How contrasts can be used in
separating treatment means will be discussed later, under the subsection on orthogonal
polynomials.
To illustrate the different types of the separation procedures, let us use the following arbitrary
example.
Example 4.3
Suppose we are testing the effect of 4 diets (treatments) on milk yield per day, that the number
of observations for each treatment is six, and that the design is CRD. The treatment means are
30.5, 30.3, 35.7 and 33.0. After carrying out an ANOVA, MSE = 4.76 while the df for MSE = 20
(i.e. n - t = 24 - 4 = 20). The standard error of the difference between two means (SED) is
1.26, i.e.

SED = √(2MSE/r) = √[(2 x 4.76)/6] = 1.26

5.5.1 Least significant difference method (LSD)

This procedure is also called Fisher's LSD. The procedure normally looks at the differences at
the 5% significance level. Two types of tables can be used in this case.
(a) Student t-distribution tables

LSD(0.05) = t(0.025, df = 20) x √(2MSE/r)
          = 2.086 x 1.26
          = 2.63

Therefore, if the difference between two of the means above is greater than 2.63, then we say
that the means are significantly different from one another. The differences between the means
are obtained by first arranging them in the following order.

                     Increasing magnitude
                    2        1        4
i      Mean       30.3     30.5     33.0
3      35.7       5.4*     5.2*     2.7*
4      33.0       2.7*     2.5
1      30.5       0.2
2      30.3
(b) Studentized range values tables
The formula used is

LSD = SED x q(k, v)/√2

where
q(k, v) = studentized range value of order k at α = 5%, v = df of the MSE and SED = 1.26.
Please note that for the LSD, k is always 2.

LSD = 1.26(2.95)/√2 = 2.63

                     Increasing magnitude
                    2        1        4
i      Mean       30.3     30.5     33.0
3      35.7       5.4*     5.2*     2.7*
4      33.0       2.7*     2.5
1      30.5       0.2
2      30.3
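The whole LSD calculation for Example 4.3 can be checked in a few lines of Python; the sketch
below assumes SciPy is available for the t quantile and reproduces the critical value of 2.63:

    import numpy as np
    from itertools import combinations
    from scipy import stats

    MSE, r, df_error = 4.76, 6, 20
    means = {1: 30.5, 2: 30.3, 3: 35.7, 4: 33.0}

    sed = np.sqrt(2 * MSE / r)                       # 1.26
    lsd = stats.t.ppf(1 - 0.025, df_error) * sed     # 2.086 x 1.26 = 2.63

    for i, j in combinations(means, 2):              # flag pairs exceeding the LSD
        d = abs(means[i] - means[j])
        print(f"{i} vs {j}: {d:.1f}{'*' if d > lsd else ''}")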

5.5.2 Tukey's range procedure (TK or HSD)
This test is similar to the LSD except that it guards more strongly against Type I error, and it is
therefore also referred to as the honestly significant difference (HSD).

TK or HSD = SED x q(k, v)/√2

where
k = number of means being compared (for our example k = 4), therefore

TK or HSD = SED x q(4, v)/√2

Since SED = 1.26 and v = 20, TK = 1.26(3.96)/√2 = 3.52

                     Increasing magnitude
                    2        1        4
i      Mean       30.3     30.5     33.0
3      35.7       5.4*     5.2*     2.7
4      33.0       2.7      2.5
1      30.5       0.2
2      30.3

5.5.3 Student-Newman-Keuls test (SNK)

This test compromises between Type I and Type II errors. In this case several critical values
are calculated, the number of which is determined by the number of treatments to be compared.
For example, in the case of 4 treatments, three values must be calculated, one for each span of
p means, p = 2, ..., k:

SNK(p) = SED x q(p, v)/√2

Remember that in the LSD and TK only one value is calculated.

p = 2: SNK(2) = 1.26(2.95)/√2 = 2.63 (remember this was the value obtained for the LSD)
p = 3: SNK(3) = 1.26(3.58)/√2 = 3.19
p = 4: SNK(4) = 1.26(3.96)/√2 = 3.52

Then compare the first (lowest) diagonal with the first value, i.e. 2.63, the second with 3.19 and
the third with 3.52, i.e.

3.52, 3.19, 2.63

                     Increasing magnitude
                    2        1        4
i      Mean       30.3     30.5     33.0
3      35.7       5.4*     5.2*     2.7*
4      33.0       2.7      2.5
1      30.5       0.2
2      30.3
Treatment 3 is significantly different from all the others, but the others are not significantly
different from one another.
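Fisher's studentized-range LSD, Tukey's HSD and the SNK values all come from the same
studentized range distribution, so the tabulated q values used above can be regenerated rather
than looked up. A minimal sketch, assuming SciPy 1.7 or later (which provides
scipy.stats.studentized_range):

    import numpy as np
    from scipy.stats import studentized_range   # available from SciPy 1.7

    SED, df_error, k = 1.26, 20, 4

    for p in range(2, k + 1):
        # q(2,20) = 2.95, q(3,20) = 3.58, q(4,20) = 3.96
        q = studentized_range.ppf(0.95, p, df_error)
        print(f"p = {p}: SED x q / sqrt(2) = {SED * q / np.sqrt(2):.2f}")

With p = 2 this gives the LSD (2.63); with p = k = 4 it gives Tukey's HSD (3.52); the full set
2.63, 3.19, 3.52 are the SNK critical differences.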

5.5.4 Duncan's multiple range test (DMRT)

Similar to SNK, but its critical values are smaller, so it is more powerful (it guards further
against Type II error) at the cost of a higher Type I error rate. It uses Duncan's tables (i.e.
values for Duncan's new multiple range test).

D(p) = SED x d(p, v)

Several values, depending on the number of treatments, are calculated.

p = 2: D(2) = 1.26(2.09) = 2.63
p = 3: D(3) = 1.26(2.19) = 2.76
p = 4: D(4) = 1.26(2.25) = 2.83

2.83, 2.76, 2.63

                     Increasing magnitude
                    2        1        4
i      Mean       30.3     30.5     33.0
3      35.7       5.4*     5.2*     2.7*
4      33.0       2.7      2.5
1      30.5       0.2
2      30.3

5.5.5 Dunnett's procedure

This procedure is used when we want to compare the treatments with a control.

D = SED x d(k, v), where k is always 2; thus D = SED x d(2, v)

Let us go back to our soybean example where we had three treatments. Treatment 1 (i.e.
variety 1) is the control. ANOVA showed that the treatment effect was significant and that
MSE = 0.09 with df of MSE = 27; therefore

SED = √[(2 x 0.09)/10] = 0.134

D = 0.134 (2.33) = 0.31

Remember that in that example the treatment means for the three varieties were 6.45, 5.48 and
6.70 respectively. If treatment one is the control, then the number of comparisons to be made
is given by k - 1, i.e. 3 - 1 = 2.

i - h    Ȳi - Ȳh         Difference
3 - 1    6.70 - 6.45      0.25
2 - 1    5.48 - 6.45     -0.97*

Variety 3 is not significantly different from the control but variety 2 is.

Exercise 4
In a completely randomized design, a researcher recorded the following information:
(1) Each treatment had 11 observations.
(2) Ȳ1 = 14, Ȳ2 = 20, Ȳ3 = 12, Ȳ4 = 5.
(3) SS for treatments was 200 with df = 3.
    SS for experimental error was 440 with df = 40.
    SS for total was 640 with df = 43.

(a) Give the linear model for this experiment. Explain the terms used in the model and specify
the ranges of the subscripts.
(b) Test the equality of the means.
(c) What difference between pairs of treatment means would be judged significantly different
at the 5% level by:
(i) Fisher's LSD (use both tables)
(ii) Tukey's procedure
(iii) Student-Newman-Keuls method
(iv) Duncan's multiple range test
(d) Assuming that Ȳ1 = 14 is a control, what treatment means would be judged significantly
different from it at the 5% level?

5.6 Randomized complete block design

The CRD is normally applicable when we are dealing with few treatments and using
experimental units that are homogeneous. Homogeneity of experimental units is normally not
simple to achieve. Therefore experimental units are normally grouped into blocks before they
are allocated to treatments for an experiment. Attributes commonly used for blocking are
liveweight, age, parity, litter size (single vs twins in growing lambs), previous yield of milk or
eggs, breed, growth rate prior to the start of an experiment, etc.
If the experimental units are not homogeneous and one decides to use a CRD, then the
experimental error will be large, to the extent that the sensitivity of the experiment is going to
be poor. There is, however, a price to pay for blocking: the cost of blocking is a loss of
degrees of freedom from the error term.
The purpose of blocking is to increase precision by reducing the error variance. An
experiment arranged in blocks is called a randomized complete block (RCB) design:
'randomized' because treatments have been allocated randomly to positions within each block;
'complete' because each block contains every treatment. There is no limit to the number of
blocks one can have, but the number of treatments appearing in each block should be the same.

5.6.1 Randomization layout

Let us take an example of a field experiment with 6 treatments (A, B, ..., F) and 4 replications.
Randomization is applied independently in each of the blocks.
(1) The experimental area is divided into 4 blocks.
(2) Each block is sub-divided into t experimental plots, where t is the number of treatments.
(3) Assign the t treatments at random following the randomization procedure used in the CRD.

5.6.2 Linear model

The linear model for the RCB is
Yijk = μ + τi + βj + eijk
where
Yijk = observation on the kth cow of the ith treatment in the jth block
μ = overall population mean
τi = effect due to the ith treatment (deviation of the ith treatment mean from the overall mean, μi - μ)
βj = effect due to the jth block (deviation of the jth block mean from the overall mean, μj - μ)
eijk = random error associated with Yijk (deviation of each observation from its expected value)

5.6.3 Format for data recording and ANOVA in RCB

Data from an RCB experiment are normally organised as shown in the table below.

               Block
Treatment      1      2      ...    j      ...    r      Treatment totals
1              Y11    Y12           Y1j           Y1r    Y1.
2              Y21    Y22           Y2j           Y2r    Y2.
i              Yi1    Yi2           Yij           Yir    Yi.
t              Yt1    Yt2           Ytj           Ytr    Yt.
Block totals   Y.1    Y.2           Y.j           Y.r    Y..

Assuming that the model is fixed and that the treatment is unstructured, the ANOVA is
tabulated as follows.

Source               Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)            F
Treatment            t - 1                       SST                   SST/(t - 1) = MST            MST/MSE
Block                r - 1                       SSB                   SSB/(r - 1) = MSB            MSB/MSE
Experimental error   (r - 1)(t - 1)              SSE                   SSE/[(r - 1)(t - 1)] = MSE
Total                rt - 1                      SSY

where (note that rt = n):
1. SSY = total sum of squares = ΣΣ Yij² - CF
2. CF = correction factor = Y..²/n (note n = rt)
3. SST = sum of squares for the treatments = Σi Yi.²/r - CF (note r = number of blocks)
4. SSB = sum of squares for the blocks = Σj Y.j²/t - CF
5. SSE = sum of squares for error = SSY - SST - SSB

Example 4.4
The following data were taken from an experiment in which four dietary treatments were
compared, with eight sheep allocated to each treatment in a randomised complete block design.
The blocks were based on the live weight of the sheep at the start of the trial.

                     Treatment
Block                1        2        3        4        Block totals
I                    16.3     18.9     19.4     18.0     72.6
II                   16.4     18.2     17.6     17.5     69.7
III                  16.7     18.9     17.6     18.6     71.8
IV                   17.7     19.5     19.8     19.1     76.1
V                    18.0     17.4     19.3     18.4     73.1
VI                   19.1     18.0     16.5     17.6     71.2
VII                  19.1     21.0     18.9     21.3     80.3
VIII                 18.0     21.3     19.9     21.1     80.3
Treatment totals     141.3    153.2    149.0    151.6    595.1

1. Calculate the correction factor: CF = Y..²/n = (595.1)²/32 = 11067.0003

2. Calculate the total sum of squares (SSY):
SSY = ΣΣ Yij² - CF = (16.3)² + (16.4)² + ... + (21.3)² + (21.1)² - 11067.0003
    = 58.9297

3. Calculate the treatment sum of squares (SST):
SST = Σ Yi.²/r - CF = [(141.3)² + (153.2)² + (149.0)² + (151.6)²]/8 - CF
    = 11077.4362 - 11067.0003
    = 10.4359

4. Calculate the block sum of squares (SSB):
SSB = Σ Y.j²/t - CF = [(72.6)² + (69.7)² + ... + (80.3)²]/4 - CF
    = 11096.1325 - 11067.0003
    = 29.1322

5. Calculate the error sum of squares: SSE = SSY - SST - SSB
    = 58.9297 - 10.4359 - 29.1322 = 19.3616

6. Enter the sums of squares in an ANOVA table and calculate the mean squares (MS = SS/df).
7. Calculate F values as the ratio of each MS to the MSE.
8. Check the F-ratios against the tabulated values to obtain the corresponding probability
estimates.

The ANOVA table looks as follows.

Source               Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)   F
Treatment            3                           10.4359               3.4786              3.77*
Block                7                           29.1322               4.1617              4.51**
Experimental error   21                          19.3616               0.92198
Total                31                          58.9297
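The whole RCB analysis of Example 4.4 can be verified with a short NumPy sketch (ours, not
a library routine); it reproduces every sum of squares and both F values above:

    import numpy as np

    # carcass weights: rows = blocks I-VIII, columns = treatments 1-4
    y = np.array([
        [16.3, 18.9, 19.4, 18.0],
        [16.4, 18.2, 17.6, 17.5],
        [16.7, 18.9, 17.6, 18.6],
        [17.7, 19.5, 19.8, 19.1],
        [18.0, 17.4, 19.3, 18.4],
        [19.1, 18.0, 16.5, 17.6],
        [19.1, 21.0, 18.9, 21.3],
        [18.0, 21.3, 19.9, 21.1],
    ])
    r, t = y.shape                               # 8 blocks, 4 treatments
    CF  = y.sum() ** 2 / y.size                  # 11067.0003
    SSY = (y ** 2).sum() - CF                    # 58.9297
    SST = (y.sum(axis=0) ** 2).sum() / r - CF    # 10.4359 (column sums = treatment totals)
    SSB = (y.sum(axis=1) ** 2).sum() / t - CF    # 29.1322 (row sums = block totals)
    SSE = SSY - SST - SSB                        # 19.3616
    MSE = SSE / ((r - 1) * (t - 1))
    print(SST / (t - 1) / MSE, SSB / (r - 1) / MSE)   # F = 3.77 and 4.51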

Exercise 5
In the above example, which treatments and blocks are significantly different from one
another? Use LSD, TK and SNK to separate the means.

Normally experiments are not predictable: an animal might die in the course of the experiment.
For an RCB such a missing value, m, can be estimated using the following formula:

m = (rBh + tTg - Y..) / [(r - 1)(t - 1)]

where
m = missing value
Tg = total of all non-missing observations for treatment g
Bh = total of all non-missing observations for block h
Y.. = grand total of all non-missing observations
r = number of blocks
t = number of treatments

Example 4.5
Assume the following data on 4 diets fed to 5 different litters. During the experiment, one
animal died.

                  Litter number
Diet              1      2          3      4      5      Diet totals
I                 21     MISSING    25     18     22     86
II                26     38         27     17     26     134
III               16     25         22     18     21     102
IV                28     35         27     20     24     134
Litter totals     91     98         101    73     93     456

The missing value m = [5(98) + 4(86) - 456] / [(5 - 1)(4 - 1)] = 378/12 = 31.5
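The missing-value formula is a one-line function; a small sketch (the function name is ours)
that reproduces m = 31.5 for Example 4.5:

    def rcb_missing(r, t, block_total, trt_total, grand_total):
        """Estimate one missing RCB observation; r = blocks, t = treatments."""
        return (r * block_total + t * trt_total - grand_total) / ((r - 1) * (t - 1))

    print(rcb_missing(r=5, t=4, block_total=98, trt_total=86, grand_total=456))   # 31.5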

5.6.4 Improvement in precision due to blocking

Let us use Example 4.4 to illustrate how to calculate the effectiveness of blocking in
retrospect. The ANOVA showed that blocks accounted for a significant chunk of the variation
in carcass weight, and thus it was wise to block the animals on initial liveweight.
To work out exactly what has been gained by blocking, we first calculate that, without
blocking, the error MS would have been

(29.1322 + 19.3616) / (7 + 21) = 1.7319 (with 28 d.f.)

The error in the RCB design is only 53% (i.e. 0.92198/1.7319) of what it would have been
without blocking. However, we must also allow for the cost in lost d.f. For a test of treatment
differences, we have 3 and 21 d.f. in the RCB design, for which the F value at P = 0.05 is
3.07, but 3 and 28 d.f. for the completely randomized design, with an F value of 2.95. The
efficiency of the RCB relative to complete randomization in this example is therefore:

(1.7319 x 2.95) / (0.92198 x 3.07) = 1.81

Thus the RCB design was 81% more efficient than the completely randomized design in this
case.

5.6.5 Confounding
Confounding is a general term used to mean that two effects are so mixed up that they cannot
be separated. An extreme example would be a trial using two-year-old heifers and three-year-
old steers. Suppose we find a difference in fatness between these two groups; we have no way
of telling whether this is due to age or to sex, because the two effects are completely
confounded.

Exercise 6
The following data represent the weight gains in grams of rabbit breeds under different
temperatures.

                       Breed
Treatment              1     2     3     4
1   20°C               54    21    48    68
2   25°C               63    17    50    56
3   30°C               50    15    46    49
4   32°C               55    13    45    70
5   35°C               60    19    47    37

(a) Is there a significant breed effect?
(b) Which is the best breed across all temperatures?
(c) Is there a significant temperature effect?
(d) Which is the best temperature for all breeds?

5.6.6 Orthogonal polynomials

As already pointed out in section 5.5, in cases where the treatments are structured, i.e. related,
separation of means using the procedures described is meaningless. In this case, we do not
compare the means but form contrasts. A contrast is a linear equation whose coefficients add
up to zero. (Note that contrasts can also be used in situations where the treatments are
unstructured.) We will illustrate how means can be separated using orthogonal polynomials
using Example 4.4.
Suppose that we wish to find the SS corresponding to the contrast:
Treatment 1 versus the mean of the other three treatments.
In numerical terms, this statement is represented by
141.3 - 1/3 (153.2 + 149.0 + 151.6)
or, multiplying by 3 to get rid of fractions,

+3(141.3) - 1(153.2) - 1(149.0) - 1(151.6) = -29.9

We call this number the contrast (C), and the corresponding SS is

SS = C² / (Σc² x n)

where c represents the coefficients used (+3, -1, -1 and -1) and n = the number of plots in each
total used (8 in this case).

Since 3² + (-1)² + (-1)² + (-1)² = 12, the SS we are looking for = (-29.9)²/(12 x 8) = 9.3126. Suppose

we decide to look for three contrasts:

1. Treatment 1 versus the rest.
2. Treatments 2 and 4 versus treatment 3.
3. Treatment 2 versus treatment 4.

The corresponding SS are shown in the table below.

                  Treatment totals            Contrast
                  (n = 8)              1          2          3
Treatment 1       141.3               +3          0          0
Treatment 2       153.2               -1         +1         +1
Treatment 3       149.0               -1         -2          0
Treatment 4       151.6               -1         +1         -1
Σc =                                   0          0          0
Σc² =                                 12          6          2
C =                                  -29.9        6.8        1.6
SS = C²/(Σc² x 8) =                    9.3126     0.9633     0.1600

The following are the rules to be followed when making up an orthogonal set of contrasts
using coefficients (or polynomials).
Rule 1. Any contrast must be between two quantities and thus represents 1 d.f. only.
Rule 2. The maximum number of contrasts available in one orthogonal set is equal to the d.f.
available.
Rule 3. To be a valid contrast, the coefficients must sum to zero (Σc = 0).
Rule 4. For one contrast to be orthogonal to another, the sum of the products of their
coefficients must be zero (Σc1c2 = 0).

If we test this last rule on the table of coefficients, we obtain for columns 1 and 2:

Σ(c1c2) = (3)(0) + (-1)(+1) + (-1)(-2) + (-1)(+1) = 0

and therefore contrasts 1 and 2 are orthogonal.

Applying the same test to columns 2 and 3 we have:

Σ(c2c3) = (0)(0) + (+1)(+1) + (-2)(0) + (+1)(-1) = 0

Note that orthogonality must be checked for every pair of contrasts in the set: it is not enough
that one column is orthogonal with each of the others, because orthogonality is not transitive.
Checking columns 1 and 3 gives Σ(c1c3) = (3)(0) + (-1)(+1) + (-1)(0) + (-1)(-1) = 0, so the
three contrasts above are mutually orthogonal.
If a set of contrasts obeys the rules for orthogonality given above, then the corresponding
component SS will add up to the treatment SS. You will have noticed that:

9.3126 + 0.9633 + 0.1600 = 10.4359 (= treatment sum of squares)
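The contrast arithmetic and the orthogonality rules can be checked mechanically. A NumPy
sketch for the three contrasts of this section (the matrix layout is ours):

    import numpy as np

    totals = np.array([141.3, 153.2, 149.0, 151.6])   # treatment totals, n = 8 plots each
    n = 8
    coef = np.array([
        [3, -1, -1, -1],    # 1: treatment 1 vs the rest
        [0,  1, -2,  1],    # 2: treatments 2 and 4 vs treatment 3
        [0,  1,  0, -1],    # 3: treatment 2 vs treatment 4
    ])

    assert (coef.sum(axis=1) == 0).all()              # rule 3: each row sums to zero
    gram = coef @ coef.T                              # off-diagonals = sums of products
    assert (gram == np.diag(np.diag(gram))).all()     # rule 4: every pair orthogonal

    C  = coef @ totals                                # -29.9, 6.8, 1.6
    SS = C ** 2 / ((coef ** 2).sum(axis=1) * n)       # 9.3126, 0.9633, 0.1600
    print(SS, SS.sum())                               # the SS add up to 10.4359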


After applying all these rules, we note that with four treatments, for example, there are only
three orthogonal contrasts available in any one set, although, by deciding which treatments you
call A, B, C and D, there are substantially more permutations. The sets available are shown in
the table below.

         Set 1               Set 2               Set 3
         Contrasts:          Contrasts:          Contrasts:
         1    2    3         1    2    3         1    2    3
A        3    0    0         1    1    1         1    1    0
B       -1    2    0         1   -1   -1         1   -1    0
C       -1   -1    1        -1    1   -1        -1    0    1
D       -1   -1   -1        -1   -1    1        -1    0   -1

Exercise 7
Using the information given in Example 4.3, find the corresponding SS for each contrast and
set in the table above.

5.7 Latin squares

Let us use an arbitrary example. You are employed as the manager of a dairy farm somewhere
in the Rift Valley province. The Director is in a dilemma over choosing which dairy meal to
purchase for feeding the animals. There are many companies supplying dairy meal nowadays,
and some of them might be conning farmers by selling low-quality dairy meal. He wants to
make a choice between the following companies: Unga feeds, ABC feeds, United Millers and
Golden feeds.
He tells you to make a decision, which must have scientific backing, and you should use all
available breeds and ages in your experiment. The dairy herd is mixed, i.e. all ages and breeds
are represented. For one to use a CRD, the experimental units must be homogeneous: they are
not in your case. In an RCB, the EU are arranged in blocks based on one aspect (factor), e.g.
weight, height, lactation, breed etc. In this particular case, the blocking factors are more than
one.
In a Latin square design, the EU are arranged by "double blocking", using two aspects of the
EU, e.g. weight and lactation number, or breed and lactation number. If we have t treatments,
then we require t² EU, which can be grouped by the 2 blocking factors. This will form a basic
Latin square. If we have 4 treatments, then the Latin square will look as below.

Breed
A B C D
Lactation no. B C D A
C D A B
D A B C

When we have 4 treatments A, B, C and D, the LS is referred to as a 4 x 4 Latin square. To
form an LS, the number of animals and the number of periods must equal the number of
treatments, and this generally restricts their use to cases where four or five treatments are to be
compared. A Latin square is defined as an array where each letter occurs once in every row
and once in every column. The number of squares will be determined by the number of
treatments. The name Latin square comes from ancient Roman tessellated pavements, which
were often laid out using different coloured tesserae in rows and columns such that each colour
occurred once only in each row and each column.

5.7.1 Characteristics of LS
1. Gives more accurate treatment comparisons.
2. Has greater sensitivity because we have identified 2 sources of variation besides
treatments, thus our experimental error is low.
3. It is fairly easy to analyze.
4. The number of treatments determines the number of rows and columns. Their use is
restricted to cases where four or five treatments are to be compared. If used to compare a
large number of treatments, there will be a problem in that the MSE will be inflated. This
means that we will be dividing by a bigger number and thus the chances of getting a
significant effect will be minimal, i.e. the F value will be reduced.

5.7.2 Randomization layout

The process is the same as in CRD and RCB; the only difference is that we also have to
randomize both rows and columns. Let us illustrate the process using an experiment with 5
treatments and the following existing 5 x 5 plan.

A B C D E
B A E C D
C D A E B
D E B A C
E C D B A

1. Select the type of square you are dealing with, i.e. select an LS plan with 5 treatments from
a statistics book (see the above existing plan).
2. Randomize the row arrangement of the plan selected in step 1 following a randomization
scheme of random numbers, e.g. select five 3-digit random numbers, for example 628, 846,
475, 902 and 452, then rank them from lowest to highest.

Random number Sequence Rank


628 1 3
846 2 4
475 3 2
902 4 5
452 5 1

Use the rank to represent the existing row number of the selected plan and the sequence to
represent the row number of the new plan.
New plan
C D A E B
D E B A C
B A E C D
E C D B A
A B C D E

3. Randomize the column arrangement using the same procedure used for the rows in step 2,
e.g. select five 3-digit random numbers - 792, 032, 947, 293 and 196.

Random number Sequence Rank


792 1 4
032 2 1
947 3 5
293 4 3
196 5 2

The rank will now be used to represent the column number of the plan obtained in step 2 and
the sequence will be used to represent the column number of the final plan.
Final plan
E C B A D
A D C B E
C B D E A
B E A D C
D A E C B
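Steps 2 and 3 amount to applying a random permutation to the rows and then to the columns of
the selected plan; any such permutation preserves the Latin-square property. A minimal NumPy
sketch (using a random generator in place of ranked 3-digit numbers):

    import numpy as np

    rng = np.random.default_rng()

    # step 1: an existing 5 x 5 plan from a statistics book
    plan = np.array([list(row) for row in
                     ["ABCDE", "BAECD", "CDAEB", "DEBAC", "ECDBA"]])

    plan = plan[rng.permutation(5), :]   # step 2: randomize the rows
    plan = plan[:, rng.permutation(5)]   # step 3: randomize the columns
    print(plan)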

5.7.3 Linear model

The linear model for the LS is
Yijkl = μ + τi + Cj + Rk + eijkl
where
Yijkl = observation from the lth EU of the ith treatment in the jth column and kth row
μ = overall population mean
τi = effect due to the ith treatment (deviation of the ith treatment mean from the overall mean, μi - μ)
Cj = effect due to the jth column (deviation of the jth column mean from the overall mean, μj - μ)
Rk = effect due to the kth row (deviation of the kth row mean from the overall mean, μk - μ)
eijkl = random error associated with Yijkl

5.7.4 ANOVA setup for LS

Assuming that the model is fixed and that the treatment is unstructured, the ANOVA is
tabulated as follows.

Source               Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)            F
Treatment            t - 1                       SST                   SST/(t - 1) = MST            MST/MSE
Columns              t - 1                       SSC                   SSC/(t - 1) = MSC
Rows                 t - 1                       SSR                   SSR/(t - 1) = MSR
Experimental error   (t - 1)(t - 2)              SSE                   SSE/[(t - 1)(t - 2)] = MSE
Total                t² - 1                      SSY

where (note that t² = n):
1. SSY = total sum of squares = ΣΣ Yij² - CF
2. CF = correction factor = Y..²/n (note n = t²)
3. SST = sum of squares for the treatments = Σi Yi..²/t - CF
4. SSC = sum of squares for the columns = Σj Y.j.²/t - CF
5. SSR = sum of squares for the rows = Σk Y..k²/t - CF
6. SSE = sum of squares for error = SSY - SST - SSC - SSR

Example 4.6
An animal scientist is testing 4 diets A, B, C and D using 4 breeds of cows, which are in
lactation numbers 1, 2, 3 and 4. The animals were put on the new ration 10 days after the start
of lactation, for 3 months. The following are the total 3-month milk yields.

                        Breeds
                  Ayrshire   Friesian   Jersey     Guernsey   Lactation total
Lactation no. 1   810: B     1080: C    700: A     910: D     3500
              2   1100: C    880: D     780: B     600: A     3360
              3   840: D     540: A     1055: C    830: B     3265
              4   650: A     740: B     1025: D    900: C     3315
Breed total       3400       3240       3560       3240       13440

1. Calculate the treatment totals:
A = 700 + 600 + 540 + 650 = 2490
B = 810 + 780 + 830 + 740 = 3160
C = 1080 + 1100 + 1055 + 900 = 4135
D = 910 + 880 + 840 + 1025 = 3655

2. Calculate the correction factor: CF = Y..²/n = (13440)²/16 = 11289600

3. Calculate the total sum of squares (SSY):
SSY = ΣΣ Yij² - CF = (810)² + (1080)² + ... + (830)² + (900)² - 11289600
    = 11723250 - 11289600
    = 433650

4. Calculate the treatment sum of squares (SST):
SST = Σ Yi..²/t - CF = [(2490)² + (3160)² + (4135)² + (3655)²]/4 - CF
    = 11660737.5 - 11289600
    = 371137.5

5. Calculate the column sum of squares (SSC):
SSC = Σ Y.j.²/t - CF = [(3400)² + (3240)² + (3560)² + (3240)²]/4 - CF
    = 11307200 - 11289600
    = 17600

6. Calculate the row sum of squares (SSR):
SSR = Σ Y..k²/t - CF = [(3500)² + (3360)² + (3265)² + (3315)²]/4 - CF
    = 11297262.5 - 11289600
    = 7662.5

7. Calculate the error sum of squares: SSE = SSY - SST - SSC - SSR
    = 433650 - 371137.5 - 17600 - 7662.5 = 37250

8. Enter the sums of squares in an ANOVA table and calculate the mean squares (MS = SS/df).
9. Check the F-ratios against the tabulated values to obtain the corresponding probability
estimates.

The ANOVA table looks as follows.

Source               Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)   F
Treatment            3                           371137.5              123712.5            19.9*
Columns              3                           17600                 5866.7
Rows                 3                           7662.5                2554.2
Experimental error   6                           37250                 6208.3
Total                15                          433650

Under the LS one ordinarily does not perform tests for rows and columns, because they were
not randomized in the LS and therefore the validity of such tests is questionable. Since
treatments were assigned at random, a significance test can be performed for them.

Ho: all treatments are the same, i.e. μ1 = μ2 = μ3 = μ4
Ha: the treatments are not all the same

F = MST/MSE = 123712.5/6208.3 = 19.9


With 3 as the numerator d.f. and 6 as the denominator d.f., at the 5% significance level, the
value from the F tables is 4.76. The treatments are significantly different at the 5% level, so we
accept Ha. Separating the treatment means using the LSD we get

LSD(0.05) = t(0.025, df = 6) x √(2MSE/r) = 2.447 x √[(2 x 6208.3)/4] = 136.3

The four treatment means (A = 622.50, B = 790.00, D = 913.75, C = 1033.75) are then
compared in a table.

                     Increasing magnitude
                     A          B          D
i       Mean       622.50     790.00     913.75
C      1033.75     411.25*    243.75*    120.00
D       913.75     291.25*    123.75
B       790.00     167.50*
A       622.50
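A short NumPy sketch (ours, not a library routine) reproduces the Latin-square sums of squares
and the treatment F value for Example 4.6:

    import numpy as np

    # 3-month milk yields; rows = lactation numbers 1-4, columns = breeds
    y = np.array([
        [ 810, 1080,  700,  910],
        [1100,  880,  780,  600],
        [ 840,  540, 1055,  830],
        [ 650,  740, 1025,  900],
    ])
    trt = np.array([list(r) for r in ["BCAD", "CDBA", "DACB", "ABDC"]])
    t = 4

    CF  = y.sum() ** 2 / y.size                                   # 11289600
    SSY = (y ** 2).sum() - CF                                     # 433650
    SSR = (y.sum(axis=1) ** 2).sum() / t - CF                     # rows:     7662.5
    SSC = (y.sum(axis=0) ** 2).sum() / t - CF                     # columns: 17600
    SST = sum(y[trt == k].sum() ** 2 for k in "ABCD") / t - CF    # 371137.5
    SSE = SSY - SST - SSC - SSR                                   # 37250
    F = (SST / (t - 1)) / (SSE / ((t - 1) * (t - 2)))
    print(f"F = {F:.1f}")                                         # 19.9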

When we have missing data in a row or column, we can use the formula:

m = [t(T + R + C) - 2G] / [(t - 1)(t - 2)]

where
T = treatment total corresponding to the missing value
R = row total corresponding to the missing value
C = column total corresponding to the missing value
t = number of treatments
G = sum of all observations

Example 4.7
Calculate the missing value in the data below.

                        Breeds
                  Ayrshire   Friesian   Jersey     Guernsey     Lactation total
Lactation no. 1   810: B     1080: C    700: A     910: D       3500
              2   1100: C    880: D     780: B     MISSING: A   2760
              3   840: D     540: A     1055: C    830: B       3265
              4   650: A     740: B     1025: D    900: C       3315
Breed total       3400       3240       3560       2640         12840

The treatment total for treatment A is 1890 (= 650 + 540 + 700).

m = [4(1890 + 2760 + 2640) - 2(12840)] / [(4 - 1)(4 - 2)] = 3480/6 = 580
Exercise 8
An LS design was used to test the effect on egg weight of including molasses in the diet of
laying hens at four concentrations (0, 70, 140 and 210 g/kg). Four groups, each of 48 birds,
received each diet in turn for a period of 4 weeks. Data for the first 2 weeks after changing
diets were discarded.

                    Molasses in the diet (g/kg)
                  0            70           140          210          Group totals
Group A           55.4 (II)    55.1 (III)   53.6 (IV)    53.5 (I)     217.6
Group B           55.0 (IV)    56.1 (I)     54.8 (II)    53.9 (III)   219.8
Group C           55.2 (III)   52.9 (IV)    53.8 (I)     54.1 (II)    216.0
Group D           53.1 (I)     54.4 (II)    53.0 (III)   51.1 (IV)    211.6
Treatment totals  218.7        218.5        215.2        212.6        865.0
Treatment means   54.68        54.63        53.80        53.15

(The corrected total sum of squares = 22.6575.)

The table above shows the mean egg weight for each group on each diet (based on weighing
eggs in bulk on weekdays in weeks 3 and 4 of each period). The Roman numerals in brackets
alongside the data give the period in which the results were obtained.
1. Carry out an analysis of variance of these data.
2. Do the figures justify the conclusion that the inclusion of molasses in the diet depresses egg
weight?

5.8 Factorial experiments
It is not unusual to plan experiments that investigate two or more factors simultaneously. When
the treatment consists of 2 or more factors, the experiment is referred to as a factorial. The term
factorial refers to the treatment design (i.e. the relationship among treatments). For example:
(a) A 2 x 2 factorial experiment refers to a situation where we are dealing with 2 factors, each
occurring at 2 levels. For example, one factor could be the energy level (factor A) of the diet,
which occurs at two levels - high or low energy. The other could be the protein level (factor
B), also occurring at two levels - high and low.

                  Factor B
                  b1       b2
Factor A   a1     a1b1     a1b2
           a2     a2b1     a2b2

where
a1b1 = high energy, high protein
a1b2 = high energy, low protein
a2b1 = low energy, high protein
a2b2 = low energy, low protein

Each animal receives a combination of the 2 factors at one time. In a 2 x 2 factorial experiment,
we need 4 EU per replicate. If each treatment combination is replicated r times, then the
number of EU needed is 4 x r.

(b) 3 x 3 factorial experiment refers to a situation where we are dealing with 2 factors
occurring at 3 levels.

Factor B
b1 b2 b3
Factor A a1 a1b1 a1b2 a1b3
a2 a2b1 a2b2 a2b3
a3 a3b1 a3b2 a3b3

The number of treatments is 3 x 3 = 9. If we have r replicates, then the number of EU required
in a 3 x 3 factorial is 9 x r.

(c) A 3 x 4 factorial experiment refers to an experiment with 2 factors, one occurring at 3
levels and the other at 4 levels.

                  Factor B
                  b1       b2       b3       b4
           a1     a1b1     a1b2     a1b3     a1b4
Factor A   a2     a2b1     a2b2     a2b3     a2b4
           a3     a3b1     a3b2     a3b3     a3b4

The number of treatments is 3 x 4 = 12. If we have r replicates, then the number of EU required
in a 3 x 4 factorial is 12 x r.

Please note that "levels" does not always imply a numerical description: it can also imply
qualitative differences or multidimensional differences, e.g. breed, which could differ in a
number of quantifiable ways. We can also have factorial experiments with three, four or more
factors. For example, a 2 x 2 x 2 factorial experiment simply means one with 8 treatment
combinations, i.e. 3 factors each having 2 levels.

5.8.1 Characteristics of factorial experiments

1. Factorial experiments allow us to look at more than one factor at a time. Instead of
conducting 2 or more separate experiments, one experiment is conducted which
incorporates all the factors and levels.
2. The bigger the factorial, the bigger the number of EU required.
3. In a factorial, one is able to study the main effects as well as the interaction effects
between the factors under study.
4. Factorial experiments can be used with any experimental design.

5.8.2 Linear model

As already indicated, factorial experiments can be used with any experimental design, and
therefore the variables in the model will depend on the experimental design and the number of
factors under investigation. Normally it is only the treatment effect that is subdivided, into
effects due to the factors (main effects) and those due to interactions between these factors
(interaction effects). Depending on the number of factors, interactions can be two-, three-,
four- etc. way: if there are only two factors we can only have 2-way interactions; if three
factors, we can have two-way and three-way interactions; with four factors, we can have two-,
three- and four-way interactions, and so on.
In the example linear models given below, we will assume a situation with only 2 factors, A
and B.
1. The linear model under CRD
Yijk = μ + Ai + Bj + (AB)ij + eijk
where
Yijk = observation on the kth EU
μ = overall population mean
Ai = effect of the ith level of factor A
Bj = effect of the jth level of factor B
(AB)ij = effect of the interaction of factors A and B
eijk = random error associated with Yijk

2. The linear model under RCB

Yijkl = μ + Ai + Bj + ρk + (AB)ij + eijkl
where
Yijkl = observation on the lth EU
μ = overall population mean
Ai = effect of the ith level of factor A
Bj = effect of the jth level of factor B
ρk = effect due to the kth block
(AB)ij = effect of the interaction of factors A and B
eijkl = random error associated with Yijkl

3. The linear model under LS

Yijklm = μ + Ai + Bj + Ck + Rl + (AB)ij + eijklm
where
Yijklm = observation on the mth EU
μ = overall population mean
Ai = effect of the ith level of factor A
Bj = effect of the jth level of factor B
Ck = effect due to the kth column
Rl = effect due to the lth row
(AB)ij = effect of the interaction of factors A and B
eijklm = random error associated with Yijklm

5.8.3 Format for data recording and ANOVA in factorial experiments under CRD

Data from a factorial experiment under CRD are normally organised as shown in the following
table.

                  Factor B
             1       2       ...    b       A total
           1 Y11.    Y12.           Y1b.    Y1..
Factor A   2 Y21.    Y22.           Y2b.    Y2..
           i Yi1.    Yi2.           Yib.    Yi..
           a Ya1.    Ya2.           Yab.    Ya..
B total      Y.1.    Y.2.           Y.b.    Y...

Assuming that the model is fixed and that the treatment is unstructured, the ANOVA is
tabulated as follows.

Source               Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)                 F
Factor A             a - 1                       SSA                   SSA/(a - 1) = MSA                 MSA/MSE
Factor B             b - 1                       SSB                   SSB/(b - 1) = MSB                 MSB/MSE
Interaction (AB)     (a - 1)(b - 1)              SSAB                  SSAB/[(a - 1)(b - 1)] = MSAB      MSAB/MSE
Experimental error   ab(r - 1)                   SSE                   SSE/[ab(r - 1)] = MSE
Total                abr - 1                     SSY

where a = levels of factor A, b = levels of factor B and r = number of replications.
Please note that abr = n.
1. SSY = total sum of squares = ΣΣΣ Yijk² - CF
2. CF = correction factor = Y...²/n
3. SAB = sum of squares for the AB subclass = Σi Σj Yij.²/r - CF
4. SSA = sum of squares for factor A = Σi Yi..²/(br) - CF
5. SSB = sum of squares for factor B = Σj Y.j.²/(ar) - CF
6. SSAB = sum of squares for the interaction between A and B = SAB - SSA - SSB
7. SSE = sum of squares for error = SSY - SAB

Example 4.8
The data below are from an experiment in which chicks were fed protein from two different
sources, with the objective of finding which one resulted in faster daily growth rates in grams.
Source one was fed at three levels and the other at four levels. Each of the 12 combinations
was fed to three groups of chicks from 7 to 21 days of age.
Factor B
b1 b2 b3 b4
11 8 12 9
a1 12 10 10 11
9 10 13 10
Factor A 13 14 8 9
a2 11 10 12 9
14 10 10 8
9 10 11 7
a3 9 8 11 11
9 11 9 6

1. Calculate the AB subclass totals


Factor B
b1 b2 b3 b4 A totals
a1 32 28 35 30 125
Factor A a2 38 34 30 26 128
a3 27 29 31 24 111
B totals 97 91 96 80 364

2. Calculate the correction factor: CF = Y...²/n = (364)²/36 = 3680.44

3. Calculate the total sum of squares (SSY):
SSY = ΣΣΣ Yijk² - CF = (11)² + (12)² + ... + (11)² + (6)² - 3680.44
    = 3798 - 3680.44
    = 117.56

4. Calculate the AB subclass sum of squares (SAB):
SAB = Σ Yij.²/r - CF = [(32)² + (28)² + ... + (24)²]/3 - CF
    = 3738.67 - 3680.44
    = 58.23

5. Calculate the factor A sum of squares (SSA):
SSA = Σ Yi..²/(br) - CF = [(125)² + (128)² + (111)²]/(3 x 4) - CF
    = 3694.17 - 3680.44
    = 13.73

6. Calculate the factor B sum of squares (SSB):
SSB = Σ Y.j.²/(ar) - CF = [(97)² + (91)² + (96)² + (80)²]/(3 x 3) - CF
    = 3700.67 - 3680.44
    = 20.23

7. Calculate the sum of squares for the interaction between A and B:
SSAB = SAB - SSA - SSB = 58.23 - 13.73 - 20.23 = 24.27

8. Calculate the error sum of squares: SSE = SSY - SAB
    = 117.56 - 58.23 = 59.33

9. Enter the sums of squares in an ANOVA table and calculate the mean squares (MS = SS/df).
10. Calculate F values as the ratio of each MS to the MSE.
11. Check the F-ratios against the tabulated values to obtain the corresponding probability
estimates.

The ANOVA table looks as follows.

Source               Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)   F
Factor A             2                           13.73                 6.86
Factor B             3                           20.23                 6.74
Interaction (AB)     6                           24.27                 4.04
Experimental error   24                          59.33                 2.47
Total                35                          117.56

Fill in the F values and come up with conclusions.
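A NumPy sketch (ours) that reproduces all the sums of squares of Example 4.8 and fills in the
F values requested above:

    import numpy as np

    # daily gains; axes = (level of A, level of B, replicate)
    y = np.array([
        [[11, 12,  9], [ 8, 10, 10], [12, 10, 13], [ 9, 11, 10]],
        [[13, 11, 14], [14, 10, 10], [ 8, 12, 10], [ 9,  9,  8]],
        [[ 9,  9,  9], [10,  8, 11], [11, 11,  9], [ 7, 11,  6]],
    ])
    a, b, r = y.shape
    CF   = y.sum() ** 2 / y.size                              # 3680.44
    SSY  = (y ** 2).sum() - CF                                # 117.56
    SAB  = (y.sum(axis=2) ** 2).sum() / r - CF                # 58.23
    SSA  = (y.sum(axis=(1, 2)) ** 2).sum() / (b * r) - CF     # 13.73
    SSB  = (y.sum(axis=(0, 2)) ** 2).sum() / (a * r) - CF     # 20.23
    SSAB = SAB - SSA - SSB                                    # 24.27
    SSE  = SSY - SAB                                          # 59.33
    MSE  = SSE / (a * b * (r - 1))
    for name, ss, df in [("A", SSA, a - 1), ("B", SSB, b - 1),
                         ("AB", SSAB, (a - 1) * (b - 1))]:
        print(f"F({name}) = {ss / df / MSE:.2f}")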

5.8.4 Some examples of ANOVA tables under different situations

1. Factorial experiment in RCB with two factors

Source               Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)                F
Factor A             a - 1                       SSA                   SSA/(a - 1) = MSA                MSA/MSE
Factor B             b - 1                       SSB                   SSB/(b - 1) = MSB                MSB/MSE
Block                r - 1                       SSBL                  SSBL/(r - 1) = MSBL              MSBL/MSE
Interaction (AB)     (a - 1)(b - 1)              SSAB                  SSAB/[(a - 1)(b - 1)] = MSAB     MSAB/MSE
Experimental error   (ab - 1)(r - 1)             SSE                   SSE/[(ab - 1)(r - 1)] = MSE
Total                abr - 1                     SSY

2. Factorial experiment in CRD with three factors

Source               Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)                          F
Factor A             a - 1                       SSA                   SSA/(a - 1) = MSA                          MSA/MSE
Factor B             b - 1                       SSB                   SSB/(b - 1) = MSB                          MSB/MSE
Factor C             c - 1                       SSC                   SSC/(c - 1) = MSC                          MSC/MSE
Interaction (AB)     (a - 1)(b - 1)              SSAB                  SSAB/[(a - 1)(b - 1)] = MSAB               MSAB/MSE
Interaction (AC)     (a - 1)(c - 1)              SSAC                  SSAC/[(a - 1)(c - 1)] = MSAC               MSAC/MSE
Interaction (BC)     (b - 1)(c - 1)              SSBC                  SSBC/[(b - 1)(c - 1)] = MSBC               MSBC/MSE
Interaction (ABC)    (a - 1)(b - 1)(c - 1)       SSABC                 SSABC/[(a - 1)(b - 1)(c - 1)] = MSABC      MSABC/MSE
Experimental error   abc(r - 1)                  SSE                   SSE/[abc(r - 1)] = MSE
Total                abcr - 1                    SSY

Exercise 9
An experiment was conducted to determine the effects of 4 different anthelmintics on the
weight gain (kg) of 3 breeds of sheep. Four animals from each breed were randomly selected
from a flock of sheep and drenched, the four anthelmintics being randomly assigned to the
animals within each breed. Weight gain was obtained after one month.
The data appear in the table below. Set up an analysis of variance table, computing all the
sums of squares and mean squares. Separate the means (use α = 0.05 for all F tests). What
conclusions can you draw?

Anthelmintic
1 2 3 4
1 29 50 43 53
Breed 2 41 58 42 73
3 66 85 69 85

5.9 Split-plot design
This design can also be referred to as a factorial design with unequal replication. In most
animal experiments where two or more factors are investigated, all the factorial treatment
combinations will receive equal replication. There are a few circumstances, however, where
this is either not possible or not convenient, and you are then likely to end up with a split-plot
design. In animal trials, split-plot designs usually arise because circumstances dictate them.

5.9.1 Characteristics of split-plot design

1. Whole plots (units), to which levels of one or more factors are applied, are divided into
subplots (or subunits), to which levels of one or more additional factors are applied.
2. Precision in the measurement of the effects of the main plot factor is sacrificed to
improve that of the subplot factor.
3. Measurement of the main effect of the subplot factor and its interaction with the main
plot factor is more precise than that obtained with a randomized complete block design.
On the other hand, the measurement of the effect of the main plot treatments (i.e. the
levels of the main plot factor) is less precise than in a randomized complete block
design.

5.9.2 When to use a split-plot design

1. It may be used when the treatments associated with the levels of one or more of the
factors require larger amounts of experimental material in an experimental unit than do
the treatments of the other factor.
2. When an additional factor is to be incorporated in an experiment to increase its scope.
For example, suppose that the major purpose of an experiment is to compare the effects
of several fungicides as protectants against infection from a disease. To increase the
scope of the experiment, several varieties are included which are known to differ in
their resistance to the disease. Here, the varieties could be arranged in whole units and
the seed protectants in subunits.
3. From previous information, it may be known that larger differences can be expected
among the levels of certain factors than among the levels of others. In this case,
treatment combinations for the factors where large differences are expected could be
assigned at random to the whole units, simply as a matter of convenience.
4. Where greater precision is desired for comparisons among certain factors than for
others. Guidelines for deciding where to apply the factors are: -
(a) Degree of precision. For a greater degree of precision for factor B than for
factor A, assign factor B to the subplot and factor A to the main plot.
(b) Relative size of the main effects. If the main effect of one factor (factor B) is
expected to be much larger and easier to detect than that of the other factor
(factor A), factor B can be assigned to the main plot and factor A to the
subplot.
5.9.3 Randomization layout
1. This consists of two separate randomization processes, for
(a) the main plots
(b) the subplots
2. In each replication, main plot treatments are first randomly assigned to the main plots,
followed by a random assignment of the subplot treatments within each main plot. Each is
done by any of the randomization schemes discussed earlier.
3. Use a as the number of main plot treatments, b as the number of subplot treatments and r as
the number of replications. For illustration, a two-factor experiment involving 6 levels of
nitrogen (main plot treatments) and 4 rice varieties (subplot treatments) in 3 replications is
used.
Step 1
Divide the experimental area into r = 3 blocks, each of which is further divided into a = 6
main plots.

Replication 1
1 2 3 4 5 6
Replication 2
1 2 3 4 5 6
Replication 3
1 2 3 4 5 6

Step 2
Following the RCB randomization procedure with a = 6 treatments and r = 3 replications,
randomly assign the 6 nitrogen treatments to the 6 main plots in each of the 3 blocks. The
result may be as shown below.

Replication 1
N4 N3 N1 N0 N5 N2
Replication 2
N1 N0 N5 N2 N4 N3
Replication 3
N0 N1 N4 N3 N5 N2

Step 3
Divide each of the (r)(a) = 3 x 6 = 18 main plots into b = 4 subplots and, following the RCB
randomization procedure for b = 4 treatments and (r)(a) = 18 replications, randomly assign the
4 varieties to the 4 subplots in each of the 18 main plots. The result may be as shown below.
Replication 1
N4 N3 N1 N0 N5 N2
V2 V1 V1 V2 V4 V3

V1 V4 V2 V3 V3 V2
V3 V2 V4 V1 V2 V1
V4 V3 V3 V4 V1 V4
Replication 2
N1 N0 N5 N2 N4 N3
V1 V4 V3 V1 V1 V3
V3 V1 V4 V2 V4 V2
V2 V2 V1 V4 V2 V4
V4 V3 V2 V3 V3 V1
Replication 3
N0 N1 N4 N3 N5 N2
V4 V3 V3 V1 V2 V1
V2 V4 V2 V3 V3 V4
V1 V1 V4 V2 V4 V2
V3 V2 V1 V4 V1 V3

5.9.4 Linear model

The linear model for the split-plot design is
Yijkl = μ + Ai + Bj + Cij + Rk + (BR)jk + eijkl
where
Yijkl = observation on the lth EU
μ = overall population mean
Ai = effect of the ith block
Bj = effect of the jth whole unit (main plot) treatment
Cij = between-main-plot error (error a)
Rk = effect of the kth subunit treatment
(BR)jk = effect of the interaction between the whole unit and subunit treatments
eijkl = within-main-plot error (error b)

5.9.5 ANOVA setup for split-plot

As seen in the model above, there are normally two distinct errors, one for testing treatments
applied to whole main plots and the other for testing effects compared within main plots.
Assuming that the model is fixed and that the treatment is unstructured, the ANOVA is
tabulated as follows.

Source                          Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)                  F
Replication (blocks)            r - 1                       SSBL                  SSBL/(r - 1) = MSBL
Main plot factor (A)            a - 1                       SSA                   SSA/(a - 1) = MSA                  MSA/MSEa
Error (a), between main plots   (r - 1)(a - 1)              SSEa                  SSEa/[(r - 1)(a - 1)] = MSEa
Subplot factor (B)              b - 1                       SSB                   SSB/(b - 1) = MSB                  MSB/MSEb
Interaction between A and B     (a - 1)(b - 1)              SSAB                  SSAB/[(a - 1)(b - 1)] = MSAB       MSAB/MSEb
Error (b), within main plots    a(r - 1)(b - 1)             SSEb                  SSEb/[a(r - 1)(b - 1)] = MSEb
Total                           abr - 1                     SSY

Example 4.9
The following data (grain yield) were obtained in a two-factor experiment involving 6 levels
of nitrogen (main plot treatments) and 4 rice varieties (subplot treatments) in 3 replications.

Variety Replication 1 Replication 2 Replication 3


N0 (0 kg N/ha)
V1 4430 4478 3850
V2 3944 5314 3660
V3 3464 2944 3142
V4 4126 4482 4836

N1 (60 kg N/ha)
V1 5418 5166 6432
V2 6502 5858 5586
V3 4768 6004 5556
V4 5192 4604 4652

N2 (90 kg N/ha)
V1 6076 6420 6704
V2 6008 6127 6642
V3 6244 5724 6014
V4 4546 5744 4146

N3 (120 kg N/ha)
V1 6462 7056 6680
V2 7134 6982 6564
V3 5792 5880 6370
V4 2776 5036 3638

N4 (150 kg N/ha)
V1 7290 7848 7552
V2 7682 6594 6576
V3 7080 6662 6320
V4 1414 1960 2766

N5 (180 kg N/ha)
V1 8452 8832 8818
V2 6228 7387 6006
V3 5594 7122 5480
V4 2248 1380 2014

1. Construct two tables of totals.

(a) Replication x factor A two-way table of totals, with the replication totals, factor A totals
and grand total.

Replication x nitrogen yield totals

                        Yield total
Nitrogen      Replication 1   Replication 2   Replication 3   Nitrogen total (A)
N0            15964           17218           15488           48670
N1            21880           21632           22226           65738
N2            22874           24015           23506           70395
N3            22167           24954           23252           70373
N4            23466           23064           23214           69744
N5            22522           24721           22318           69561
Replication total (R)  128873  135604         130004
Grand total                                                   394481

(b) Factor A x factor B two-way table of totals.

Nitrogen x variety yield totals

                  Yield total (AB)
Nitrogen V1 V2 V3 V4
N0 12758 12918 9550 13444
N1 17016 17946 16328 14448
N2 19200 18777 17982 14436
N3 20198 20685 18042 11448
N4 22690 20852 20062 6140
N5 26102 19621 18196 5642
Variety total 117964 110799 100160 65558

2. Compute the correction factor (CF) and the sums of squares for the main plot analysis.
Please note that abr = n.

CF = correction factor = Y...²/n = (394481)²/[(3)(6)(4)] = 2161323047

SSY = total sum of squares = ΣΣΣ Yijk² - CF
    = (4430)² + (3944)² + ... + (2014)² - 2161323047
    = 204747916

SSBL = replication (block) sum of squares = Σ R²/(ab) - CF
     = [(128873)² + (135604)² + (130004)²]/(6 x 4) - CF
     = 1082577

SSA = sum of squares for the main plot factor (nitrogen) = Σ A²/(rb) - CF
    = [(48670)² + ... + (69561)²]/(3 x 4) - CF
    = 30429200

SSEa = error between main plots = Σ (RA)²/b - CF - SSBL - SSA
     = [(15964)² + ... + (22318)²]/4 - 2161323047 - 1082577 - 30429200
     = 1419678

where (RA) denotes the replication x nitrogen totals.

3. Compute the sums of squares for the subplot analysis.
SSB = sum of squares for the subplot factor (variety) = Σ B²/(ra) - CF
    = [(117964)² + ... + (65558)²]/(3 x 6) - CF
    = 89888101

SSAB = sum of squares for the interaction between A and B = Σ (AB)²/r - CF - SSA - SSB
     = [(12758)² + ... + (5642)²]/3 - 2161323047 - 30429200 - 89888101
     = 69343487

SSEb = error within main plots = SSY - SSBL - SSA - SSEa - SSB - SSAB
     = 12584873
4. For each source of variation compute the MS:
MSBL = SSBL/(r - 1) = 541289
MSA = SSA/(a - 1) = 6085840
MSEa = SSEa/[(r - 1)(a - 1)] = 141968
MSB = SSB/(b - 1) = 29962700
MSAB = SSAB/[(a - 1)(b - 1)] = 4622899
MSEb = SSEb/[a(r - 1)(b - 1)] = 349580

5. Compute F:
F(A) = MSA/MSEa = 42.87
F(B) = MSB/MSEb = 85.71
F(AB) = MSAB/MSEb = 13.22

6. Compute the two coefficients of variation. Note that the grand mean is
Ȳ... = Y.../(abr) = 394481/[(3)(6)(4)] = 5478.9.

CV(a) = (√MSEa / Ȳ...) x 100 = (√141968 / 5478.9) x 100 = 6.9%

This indicates the degree of precision attached to the main plot factor.

CV(b) = (√MSEb / Ȳ...) x 100 = (√349580 / 5478.9) x 100 = 10.8%

This indicates the precision of the subplot factor and its interaction with the main plot factor.
The ANOVA table looks as follows.

Source                          Degrees of freedom (d.f.)   Sum of Squares (SS)   Mean Squares (MS)   F
Replication (blocks)            2                           1082577               541289
Nitrogen (A)                    5                           30429200              6085840             42.87**
Error (a), between main plots   10                          1419678               141968
Variety (B)                     3                           89888101              29962700            85.71**
Interaction between A and B     15                          69343487              4622899             13.22**
Error (b), within main plots    36                          12584873              349580
Total                           71                          204747916

Separate the means using LSD and draw conclusions.
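Because every sum of squares above is built from totals, the split-plot analysis can be verified
directly from the two tables of totals (only SSY needs the raw data). A NumPy sketch:

    import numpy as np

    a, b, r = 6, 4, 3                       # nitrogen levels, varieties, replications
    rep_by_N = np.array([                   # rows N0-N5, columns replications 1-3
        [15964, 17218, 15488],
        [21880, 21632, 22226],
        [22874, 24015, 23506],
        [22167, 24954, 23252],
        [23466, 23064, 23214],
        [22522, 24721, 22318],
    ])
    N_by_V = np.array([                     # rows N0-N5, columns V1-V4
        [12758, 12918,  9550, 13444],
        [17016, 17946, 16328, 14448],
        [19200, 18777, 17982, 14436],
        [20198, 20685, 18042, 11448],
        [22690, 20852, 20062,  6140],
        [26102, 19621, 18196,  5642],
    ])
    CF   = rep_by_N.sum() ** 2 / (a * b * r)
    SSY  = 204747916                                            # from the raw data
    SSBL = (rep_by_N.sum(axis=0) ** 2).sum() / (a * b) - CF     #  1082577
    SSA  = (rep_by_N.sum(axis=1) ** 2).sum() / (r * b) - CF     # 30429200
    SSEa = (rep_by_N ** 2).sum() / b - CF - SSBL - SSA          #  1419678
    SSB  = (N_by_V.sum(axis=0) ** 2).sum() / (r * a) - CF       # 89888101
    SSAB = (N_by_V ** 2).sum() / r - CF - SSA - SSB             # 69343487
    SSEb = SSY - SSBL - SSA - SSEa - SSB - SSAB                 # 12584873

    MSEa = SSEa / ((r - 1) * (a - 1))
    MSEb = SSEb / (a * (r - 1) * (b - 1))
    print(SSA / (a - 1) / MSEa,                    # F(A)  = 42.87
          SSB / (b - 1) / MSEb,                    # F(B)  = 85.71
          SSAB / ((a - 1) * (b - 1)) / MSEb)       # F(AB) = 13.22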

Exercise 10
The figure below shows the plan of an experiment testing three temperatures (allocated to six
rooms at random) in combination with four diets (A, B, C and D) allocated at random within
rooms in a split-plot design. This experiment was planned to investigate the interaction
between environmental temperature and dietary nutrient concentration. The question was
whether putting more energy (and protein and minerals in due proportion) into the diet helps
to overcome the adverse effects of heat stress on laying performance. The numbers in
parentheses indicate the total number of eggs laid after a three-month period. Analyse these
data and draw conclusions at α = 0.05.

Room        1            2            3            4            5            6
Temp (°C)   27           33           27           30           30           33
            C     D      A     C      B     A      C     D      B     C      C     A
            (56)  (22)   (61)  (62)   (74)  (88)   (71)  (14)   (66)  (67)   (57)  (64)
            A     B      B     D      C     D      B     A      D     A      B     D
            (85)  (62)   (60)  (46)   (71)  (14)   (77)  (73)   (20)  (78)   (61)  (57)

6 REGRESSION AND CORRELATIONS
If we are interested in the question of whether two variables are related, we speak about
correlation and focus our attention on the correlation coefficient, r; if we are interested in the
dependence of one variable on another, we call this regression and describe the relationship
with an equation such as Y = a + bX, where b is the regression coefficient. Regression and
correlation can be classified based on the following:
1. Number of variables
2. Form of relationship
If we have only 2 variables (independent and dependent), then we call that a simple regression
or correlation. If the number of variables is more than 2, we have multiple regression or
correlation. There are cases whereby there can be k independent variables but only 1
dependent variable. When classified based on the form of relationship, two forms are
distinguishable: linear and non-linear. If the data appear to be approximated well by a straight
line, we say that a linear relationship exists between the variables. If a relationship exists but it
is not linear, then we call it a non-linear relationship. Non-linear relationships can sometimes
be reduced to linear relationships by appropriate transformation of variables.
The two classification bases result in different types of regression and correlation, namely:
simple linear regression or correlation, simple non-linear regression or correlation, multiple
linear regression or correlation, and multiple non-linear regression or correlation.

6.1 Simple correlation

If X and Y denote the two variables under consideration, a scatter diagram shows the locations
of the points (X, Y) on a rectangular coordinate system. If all the points in this scatter diagram
seem to lie near a line, the correlation is called linear; it could be positive or negative. If all the
points seem to lie near some curve, the correlation is called non-linear. If there is no
relationship indicated between the variables, we say that there is no correlation between them,
i.e. they are uncorrelated.
The measure of the amount of correlation between two variables is called the coefficient of
correlation, r, and is given by the equation

r = Σ(X - X̄)(Y - Ȳ) / √[Σ(X - X̄)² Σ(Y - Ȳ)²]

There is another way of writing the above formula, which is generally easier and quicker to
use when performing actual calculations; this is

r = [ΣXY - (ΣX)(ΣY)/n] / √{[ΣX² - (ΣX)²/n][ΣY² - (ΣY)²/n]}
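As a quick check of the computational formula, the sketch below applies it to the steer data of
Example 5.1 (further on) and compares the answer with NumPy's built-in np.corrcoef:

    import numpy as np

    x = np.array([4.2, 3.8, 4.8, 3.4, 4.5, 4.6, 4.3, 3.7, 3.9])   # live weight
    y = np.array([2.8, 2.5, 3.1, 2.1, 2.9, 2.8, 2.6, 2.4, 2.5])   # dressed weight
    n = len(x)

    sxy = (x * y).sum() - x.sum() * y.sum() / n
    sxx = (x ** 2).sum() - x.sum() ** 2 / n
    syy = (y ** 2).sum() - y.sum() ** 2 / n
    r = sxy / np.sqrt(sxx * syy)
    print(r, np.corrcoef(x, y)[0, 1])    # both about 0.95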

6.1.1 Factors affecting the correlation between X and Y

1. Chance: there is some probability of events appearing correlated by accident.
2. The possibility that X causes Y.
3. The possibility that Y causes X.
4. X and/or Y are caused by a third variable, Z.

6.1.2 Characteristics of correlation

1. It has no units or dimensions, unlike regression.
2. It does not imply a dependent-independent relationship, as regression does.
3. It can vary from +1 to -1, while a regression coefficient (b, the change in Y when X
changes by one unit) can vary from -∞ to +∞.
4. Its other names are the correlation coefficient and the Pearson product-moment
correlation.
5. It does not require that X or Y be fixed or independent; the two variables follow a
bivariate distribution.
A correlation can be significant, but it does not prove cause and effect; hence a valid reason
must be found for the association of the two factors, and a significance test alone is not
enough.

6.1.3 Uses of correlation

1. It is important as a descriptive statistic to emphasize a relationship, but it does not
point to a cause-and-effect relationship.
2. For many situations either correlation or regression can be used; the researcher
chooses the appropriate one for the relationship to be emphasized.

6.2 Simple regression

Simple regression analysis usually describes a relationship between two variables or
parameters. The variables are usually called Y and X. The Y variable is termed the dependent
variable, since any Y value depends on the population sampled. The X variable is called the
independent variable, or argument, since X is being assessed and its effects are being measured
as Y values.
Y values have to be obtained from several populations, each population providing the Y
values plus a corresponding X value, all measured at the same time. Randomness of Y is
essential for probability theory to apply. X is usually fixed but may also be random.

6.2.1 Uses of simple regression analysis


1. To examine or describe the mathematical relationship between X and Y.
2. To use a value of X to predict a value of Y.

6.2.2 Simple regression equation

The simple regression equation is of the form Y = a + bX. This is the equation of a straight
line, where any point (X, Y) on the line has an X coordinate, or abscissa, and a Y coordinate,
or ordinate. Coordinates of points not on the line do not satisfy the equation.
When X = 0, then Y = a = the intercept (the point where the line crosses the Y axis). When
a = 0, the line goes through the origin. A change of one unit in X results in a change of b units
in Y, so b is a measure of the slope of the line. When b is positive, both variables increase or
decrease together; when b is negative, one variable increases as the other decreases.

6.2.3 Calculation of a and b

The values of a (intercept) and b (slope) can be calculated mathematically as:

b = SXY / SXX

a = Ȳ - bX̄

where

SXX = Σ(X - X̄)² = ΣX² - (ΣX)²/n

and

SXY = Σ(X - X̄)(Y - Ȳ) = ΣXY - (ΣX)(ΣY)/n

Note that, in practice, when a population is considered, the lines computed in regression
problems are lines about which the pairs of values (X, Y) cluster; they are strictly not lines
upon which the points fall. A point on a regression line is therefore an estimate of the mean of
a population of Y's having the corresponding X value.

Example 5.1
In a random sample of n = 9 steers, the live weights and dressed weights were recorded. Let
Y = dressed weight (in hundreds of kg) and X = live weight (in hundreds of kg). Use the data
below to obtain a and b.

      X       Y
      4.2     2.8
      3.8     2.5
      4.8     3.1
      3.4     2.1
      4.5     2.9
      4.6     2.8
      4.3     2.6
      3.7     2.4
      3.9     2.5
Σ     37.2    23.7

SXX = ΣX² - (ΣX)²/n = 155.48 - (37.2)²/9 = 155.48 - 153.76 = 1.72

SXY = ΣXY - (ΣX)(ΣY)/n = 99.02 - (37.2)(23.7)/9 = 99.02 - 97.96 = 1.06

b = 1.06/1.72 = 0.616

X̄ = ΣX/n = 37.2/9 = 4.133
Ȳ = ΣY/n = 23.7/9 = 2.633

a = Ȳ - bX̄ = 2.633 - 0.616(4.133) = 0.087

Y = 0.087 + 0.616X

This is a deterministic model because we have not included the error term; we have assumed
that the error term is zero. Using this equation we can predict the value of Y given a value of
X, e.g. when the value of X is 4, Y is 0.087 + 0.616(4) = 2.551.
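A short sketch verifies the whole fit; the last lines anticipate the residual mean square, the
t-test for b and R² derived in the next subsections, and scipy.stats.linregress is used as an
independent cross-check:

    import numpy as np
    from scipy import stats

    x = np.array([4.2, 3.8, 4.8, 3.4, 4.5, 4.6, 4.3, 3.7, 3.9])   # live weight
    y = np.array([2.8, 2.5, 3.1, 2.1, 2.9, 2.8, 2.6, 2.4, 2.5])   # dressed weight
    n = len(x)

    sxx = (x ** 2).sum() - x.sum() ** 2 / n        # 1.72
    sxy = (x * y).sum() - x.sum() * y.sum() / n    # 1.06
    syy = (y ** 2).sum() - y.sum() ** 2 / n        # 0.72

    b = sxy / sxx                                  # 0.616
    a = y.mean() - b * x.mean()                    # 0.087
    s2 = (syy - b * sxy) / (n - 2)                 # residual mean square, ~0.0095
    t_b = b / np.sqrt(s2 / sxx)                    # ~8.3 (8.11 in the text, after rounding s2 to 0.01)
    r2 = b * sxy / syy                             # ~0.91

    res = stats.linregress(x, y)                   # independent cross-check
    print(a, b, res.intercept, res.slope)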

6.2.4 Testing the significance of a and b
It is assumed that the Y's are normally distributed and hence the estimators are also normally
distributed. Hence we may base the confidence intervals and tests of hypotheses on the
t-distribution.

Testing the significance of the slope, b

1. Compute the residual mean square as

S² = [SYY - (SXY)²/SXX] / (n - 2)

Remember that b = SXY/SXX, therefore S² = (SYY - bSXY)/(n - 2)

where

SYY = Σ(Y - Ȳ)² = ΣY² - (ΣY)²/n

and

SXY = Σ(X - X̄)(Y - Ȳ) = ΣXY - (ΣX)(ΣY)/n

2. Compute tb as

tb = b / √(S²/SXX)

3. Compare the computed tb value to the tabulated t value with n - 2 degrees of freedom. The
slope b is judged to be significantly different from 0 if the absolute value of tb is greater than
the tabulated t value at the prescribed level of significance.

In our Example 5.1:

SYY = ΣY² − (ΣY)²/n = 63.13 − (23.7)²/9 = 63.13 − 62.41 = 0.72

S² = [SYY − (SXY)²/SXX] / (n − 2) = [0.72 − (1.06)²/1.72] / (9 − 2) = (0.72 − 0.65)/7
= 0.01

tb = 0.616 / √(0.01/1.72) = 8.11
The tabulated t values at the 5% and 1% levels of significance with 7 (n − 2) degrees of
freedom are 2.365 and 3.499 respectively. Because the computed tb value is greater than these
tabular t values, it is concluded that there is a significant linear response of dressed weight to
changes in live weight.

Testing significance of intercept, a


1. Compute the residual mean square as described above.
2. Compute ta as

   ta = a / √( S² [1/n + X̄²/SXX] )

Then compare the calculated t with the tabular t value at the 0.05 level with n − 2 df. If the
calculated value is greater than the tabular value, reject the null hypothesis and conclude that
the intercept is significantly different from zero.
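
For illustration only, here is a minimal Python sketch of both t tests for Example 5.1 (the
variable names are ours, not from the text). Carrying full precision gives tb ≈ 8.28 rather
than the 8.11 obtained above from rounded intermediate values, and ta ≈ 0.28, so the intercept
is not significantly different from zero:

```python
import math

# t tests for slope (b) and intercept (a) of Example 5.1 (a sketch; names are ours).
x = [4.2, 3.8, 4.8, 3.4, 4.5, 4.6, 4.3, 3.7, 3.9]
y = [2.8, 2.5, 3.1, 2.1, 2.9, 2.8, 2.6, 2.4, 2.5]
n = len(x)

sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = sxy / sxx
a = sum(y) / n - b * sum(x) / n
s2 = (syy - b * sxy) / (n - 2)  # residual mean square S^2

t_b = b / math.sqrt(s2 / sxx)                                # tb = b / sqrt(S^2/SXX)
t_a = a / math.sqrt(s2 * (1 / n + (sum(x) / n) ** 2 / sxx))  # ta = a / sqrt(S^2[1/n + Xbar^2/SXX])
print(f"t_b = {t_b:.2f}, t_a = {t_a:.2f}")  # compare with t(0.05, 7 df) = 2.365
```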

6.2.5 Regression ANOVA


ANOVA setup
Regression also has identifiable sources of variation and can therefore be expressed in terms
of an analysis of variance. The table below shows the ANOVA setup for a regression analysis.

Source of variation   Degrees of freedom   Sum of squares   Mean square         F

Regression            1                    SSR              SSR/1 = MSR         MSR/MSE
Error                 n − 2                SSE              SSE/(n − 2) = MSE
Total                 n − 1                SSY

Note that the sum of squares for regression has 1 degree of freedom, the error (residual) sum
of squares has n − 2 degrees of freedom and the total sum of squares has, as usual, n − 1
degrees of freedom.

1. Sum of squares due to regression (SSR) = b·SXY
2. Total sum of squares (SSY) = SYY
3. Sum of squares for the error term (SSE) = SYY − b·SXY
4. Variance of the error term = SSE/(n − 2) = σ²E
5. Standard errors of a and b are estimated as

   SEa = σE √( ΣX² / (n·SXX) )  and  SEb = σE / √SXX

Coefficient of determination (regression index)
This is the fraction of the total variation in Y that is accounted for by the association between
Y and X and is given by:

R² = SSR/SSY = b·SXY/SYY
R² must always be between 0 and 1. If all the points are close to the line, the value of R² will
be close to one; but as the scatter of the points becomes greater, R² becomes smaller,
indicating a poor fit. For this reason, R² is a useful measure of the strength of the relationship
between Y and X.
1 − R² is referred to as the coefficient of alienation; it is the fraction of the variation in Y
that is unaccounted for by X, i.e. the fraction associated with the errors of prediction.
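
To connect the formulas above, here is a short Python sketch of our own that builds the
regression ANOVA and R² for Example 5.1; note that in simple regression the F value
(MSR/MSE) is the square of the tb computed earlier:

```python
# Regression ANOVA and R^2 for Example 5.1 (a sketch; variable names are ours).
x = [4.2, 3.8, 4.8, 3.4, 4.5, 4.6, 4.3, 3.7, 3.9]
y = [2.8, 2.5, 3.1, 2.1, 2.9, 2.8, 2.6, 2.4, 2.5]
n = len(x)

sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b = sxy / sxx

ssr = b * sxy            # regression sum of squares, 1 df
sse = syy - ssr          # error sum of squares, n - 2 df
msr, mse = ssr / 1, sse / (n - 2)
print(f"F = {msr / mse:.1f}")   # about 68.5, which equals tb squared
print(f"R2 = {ssr / syy:.2f}")  # about 0.91: 91% of the variation in Y explained by X
```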

6.3 Multiple regression


Multiple regression deals with the situation where we have more than one independent
variable. We want to find the effect of k independent variables on the dependent variable Y:

Y = a + b1X1 + b2X2 + … + bkXk

The independent variables are assumed to be independent of one another, and each is assumed
to be linearly related to Y. The form of the equation depends on the number of independent
variables. If k = 2, then
Y = a + b1X1 + b2X2
If k = 3, then
Y = a + b1X1 + b2X2 + b3X3
Assuming that we have 2 independent variables X1 and X2, the layout is as follows.

Corrected sums of squares and cross products

Variable   Mean   X1       X2        Y
X1         X̄1     Σx1²     Σx1x2     Σx1y
X2         X̄2              Σx2²      Σx2y
Y          Ȳ                         Σy²

where

X̄ = ΣX/n and Ȳ = ΣY/n

Σx² = Σ(X − X̄)²
Σx1x2 = Σ(X1 − X̄1)(X2 − X̄2)
Σxy = Σ(X − X̄)(Y − Ȳ)
Σy² = Σ(Y − Ȳ)²

Example 5.2
Assume the following data on different varieties of maize.

Variety number Grain yield kg/ha (Y) Plant height cm (X1) Tiller number (X2)
1 5755 110.5 14.5
2 5939 105.4 16.0
3 6010 118.1 14.6
4 6545 104.5 18.2
5 6730 93.6 15.4
6 6750 84.1 17.6
7 6899 77.8 17.9
8 7862 75.6 19.4

Mean 6561 96.2 16.7

1. Solve for the corrected sums of squares and cross products:

Σx1² = 1753.72
Σx1x2 = −156.65
Σx2² = 23.22
Σx1y = −65194
Σx2y = 7210
Σy² = 3211504

2. Solve for b1, b2, …, bk from the normal equations:

b1Σx1² + b2Σx1x2 + … + bkΣx1xk = Σx1y
b1Σx1x2 + b2Σx2² + … + bkΣx2xk = Σx2y
⋮
b1Σx1xk + b2Σx2xk + … + bkΣxk² = Σxky

In our case, we have 2 independent variables and thus the normal equations are

b1Σx1² + b2Σx1x2 = Σx1y
b1Σx1x2 + b2Σx2² = Σx2y

After some simplification:

b1 = [ (Σx2²)(Σx1y) − (Σx1x2)(Σx2y) ] / [ (Σx1²)(Σx2²) − (Σx1x2)² ]

b1 = [ (23.22)(−65194) − (−156.65)(7210) ] / [ (1753.72)(23.22) − (−156.65)² ] = −23.75

and

b2 = [ (Σx1²)(Σx2y) − (Σx1x2)(Σx1y) ] / [ (Σx1²)(Σx2²) − (Σx1x2)² ]

b2 = [ (1753.72)(7210) − (−156.65)(−65194) ] / [ (1753.72)(23.22) − (−156.65)² ] = 150.27

3. Compute the intercept a:

a = Ȳ − b1X̄1 − b2X̄2 − … − bkX̄k

For our case:

a = Ȳ − b1X̄1 − b2X̄2 = 6561 − (−23.75)(96.2) − (150.27)(16.7) = 6336

Therefore, the equation is

Y = 6336 − 23.75X1 + 150.27X2

4. Sum of squares due to regression (SSR):

SSR = Σ bi(Σxiy), summed over i = 1, …, k

For our case:

SSR = b1Σx1y + b2Σx2y = (−23.75)(−65194) + (150.27)(7210) = 2631804
5. Residual sum of squares (SSE):

SSE = Σy² − SSR = 3211504 − 2631804 = 579700

6. Calculate the coefficient of determination (R²):

R² = SSR/Σy² = 2631804/3211504 = 0.82

7. Test the significance of R²:

F = (SSR/k) / (SSE/(n − k − 1))

where k = number of independent variables.

F = (2631804/2) / (579700/(8 − 2 − 1)) = 11.35

Compare the computed F value to the tabular F value with df1 = k and df2 = n − k − 1. The R²
is significant if the computed F value is greater than the tabular F value at the prescribed level
of significance.
For our example, the tabular F values with df1 = 2 and df2 = 5 are 5.79 at the 5% level of
significance and 13.27 at the 1% level. The computed F (11.35) exceeds the 5% value but not
the 1% value, so the estimated equation Y = 6336 − 23.75X1 + 150.27X2 is significant at the
5% level.
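
The whole calculation for Example 5.2 can be reproduced with a short NumPy sketch (our
own illustration, not part of the original text; variable names are ours). It forms the corrected
sums of squares and cross products, solves the normal equations, and computes R² and F:

```python
import numpy as np

# Multiple regression Y = a + b1*X1 + b2*X2 for Example 5.2 (a sketch).
y  = np.array([5755, 5939, 6010, 6545, 6730, 6750, 6899, 7862], float)  # grain yield
x1 = np.array([110.5, 105.4, 118.1, 104.5, 93.6, 84.1, 77.8, 75.6])     # plant height
x2 = np.array([14.5, 16.0, 14.6, 18.2, 15.4, 17.6, 17.9, 19.4])         # tiller number
n, k = len(y), 2

# Deviations from the means give the corrected sums of squares and cross products.
d1, d2, dy = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()
A = np.array([[(d1 * d1).sum(), (d1 * d2).sum()],
              [(d1 * d2).sum(), (d2 * d2).sum()]])
rhs = np.array([(d1 * dy).sum(), (d2 * dy).sum()])

b1, b2 = np.linalg.solve(A, rhs)                # solve the normal equations
a = y.mean() - b1 * x1.mean() - b2 * x2.mean()  # intercept

ssr = b1 * rhs[0] + b2 * rhs[1]                 # SSR = sum of bi*(sum xi*y)
sse = (dy * dy).sum() - ssr                     # SSE = sum y^2 - SSR
f = (ssr / k) / (sse / (n - k - 1))
print(f"a = {a:.0f}, b1 = {b1:.2f}, b2 = {b2:.2f}")      # about 6336, -23.75, 150.27
print(f"R2 = {ssr / (dy * dy).sum():.2f}, F = {f:.2f}")  # about 0.82, 11.35
```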

Exercise 10
An experiment was conducted to determine the association between minutes of mixing (X)
and an index of the textural quality (Y) of an animal feed. The data is shown below.

Number X Y Number X Y
1 5 67.45 16 20 55.65
2 5 67.90 17 20 57.74
3 5 69.41 18 20 53.54
4 5 67.64 19 20 57.05
5 5 64.17 20 20 56.98
6 10 61.94 21 25 54.78
7 10 60.32 22 25 51.91
8 10 62.47 23 25 49.45
9 10 64.78 24 25 52.25
10 10 67.96 25 25 53.94
11 15 60.83 26 30 49.11
12 15 55.78 27 30 50.29
13 15 55.90 28 30 43.63
14 15 63.89 29 30 50.82
15 15 59.00 30 30 48.76

Using the methods of linear regression:

(a) Calculate the values of a and b in the equation Y = a + bX.
(b) Determine the statistical significance of a and b by t test.
(c) Determine and interpret R².
(d) Calculate the standard deviation of Y when X = 10.
(e) Show the ANOVA.

Exercise 11
The following data were recorded for a regression study

Y 15 12 14 18 19 16 17 26 20 22 24
X1 6 7 7 8 8 9 9 10 10 11 12
X2 10 12 13 14 15 15 16 17 18 19 19
(a) Write the regression model
(b) Use the sample data to fit the model.

7 DISCRETE DATA
Normally for an analysis of variance to be carried out, the data should be drawn from a
continuous variable which is normally distributed. However, it is not uncommon to encounter
data that are not continuous, either because the results are qualitative, not quantitative (e.g.,
male or female, alive or dead, pregnant or not pregnant), or because the numbers are small
(e.g. litter size in sheep or goats). You cannot be 'a little bit pregnant' and a single ewe cannot
have a litter of 1.7 lambs.
If you happen to have several large groups of animals, then the numbers become, for all
practical purposes, continuous and approximately normal, even though, at its root, the
character is discontinuous. A herd of cows can have a conception rate (to single insemination)
of 53.7% and a flock of sheep can have a mean litter size of 1.73. Comparisons amongst
numerous herds or numerous flocks can thus be made by treating the data as continuous. But
in formal experiments employing large animals, it is not usually possible to allocate replicated
groups to each treatment; the basis of replication is almost always the individual animal.
Some of the data collected may then be categorical, meaning that the result for any one animal
falls into one or another of a small number of categories. Although litter size in cattle, sheep
or goats can be represented by a number, you can also think of it as a set of categories (single,
twins or triplets). 'Alive or dead' are clearly two mutually exclusive categories which cannot
be realistically represented by numbers, even though you might assign dummy values of 0
and 1 to these conditions for certain analytical purposes.
For pigs and rabbits, where the litters are larger, it is usually safe to treat litter size as though
it were a continuous variable, and the same goes for egg numbers at a single ovulation in
polytocous mammals or for egg laying by poultry over an extended period. Individual litter
sizes in pigs might range from 6 to 14 and the number of eggs laid by one chicken in a month
might range from 20 to 31, and such data will generate variances that can be treated as part of
a normal distribution. However, if you were considering eggs laid by individual hens on 1
single day, that would be a discrete variable with values limited to 0, 1 or 2.
The usual method of analysing categorical data is to employ a chi-squared test (χ²). This is
easy to apply, but be aware that the test is approximate if the number in any one category is
less than 5.

Event                A1   A2   A3   …   Ak
Observed frequency   O1   O2   O3   …   Ok
Expected frequency   E1   E2   E3   …   Ek

Suppose that in a particular sample a set of possible events A1, A2, A3,…, Ak (see table above)
are observed to occur with frequencies O1, O2, O3,…Ok called observed frequencies, and that
according to probability rules they are expected to occur with frequencies E1, E2, E3, …, Ek
called expected or theoretical frequencies, then Chi-square is given by

 =
bO  E g  bO
2 1 1
2
2  E2 g 2

....
bO k  Ek g 2


k
bO  E gi i
2

E1 E2 Ek i 1 Ei
If 2 = 0, observed and theoretical frequencies agree exactly; while if 2 > 0, they do not agree
exactly. The larger the value of 2, the greater the discrepancy between observed and
expected frequencies.

In practice, expected frequencies are computed on the basis of a hypothesis H0. If under this
hypothesis the computed value of χ² is greater than some critical value (such as χ²0.95 or χ²0.99,
which are the critical values at the 0.05 and 0.01 significance levels respectively), we
conclude that observed frequencies differ significantly from expected frequencies and would
reject H0 at the corresponding level of significance. Otherwise we would accept it, or at least
not reject it. This procedure is called the chi-square test of hypothesis or significance.
It should be noted that we must look with suspicion upon circumstances where χ² is too close
to zero, since it is rare that observed frequencies agree too well with expected frequencies. To
examine such situations, we can determine whether the computed value of χ² is less than
χ²0.05 or χ²0.01, in which cases we would decide that the agreement is too good at the 0.05 or
0.01 levels of significance respectively.
The following table shows a chi-squared analysis of deaths and survival in two groups of
animals.

           Observed numbers (O)   Expected numbers (E)   Deviations (O − E)   (O − E)²/E
Group      A     B     Totals     A       B              A       B            A      B
Died       13    31      44       21.8    22.2          −8.8     8.8          3.55   3.49
Survived   204   190    394       195.2   198.8          8.8    −8.8          0.40   0.39
Totals     217   221    438       217     221                                 χ² = 7.83

In this table, the expected number dying in each group is readily derived from the null
hypothesis that there is no difference (other than chance) between the groups, so that the
expected proportion of deaths is the proportion observed in the entire sample, i.e., 44/438.
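
As an illustration (our own sketch, not from the original text), the χ² value of 7.83 for this
table can be reproduced as follows:

```python
# Chi-squared test for the 2 x 2 mortality table above (a sketch; names are ours).
observed = {("died", "A"): 13, ("died", "B"): 31,
            ("survived", "A"): 204, ("survived", "B"): 190}
row_totals = {"died": 44, "survived": 394}
col_totals = {"A": 217, "B": 221}
grand_total = 438

chi2 = 0.0
for (row, col), o in observed.items():
    e = row_totals[row] * col_totals[col] / grand_total  # expected count under H0
    chi2 += (o - e) ** 2 / e                             # sum of (O - E)^2 / E
print(f"chi-squared = {chi2:.2f}")  # about 7.83, with (2 - 1)(2 - 1) = 1 df
```

With 1 degree of freedom the 5% critical value of χ² is 3.84, so the computed 7.83 indicates a
significant difference in mortality between the two groups.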

