Introduction To Biostatistics Student Lecture Notes

Business Statistics (Kenyatta University)
DIPLOMA IN HEALTH RECORDS & INFORMATION TECHNOLOGY
HEALTH STATISTICS
Thika School of Medical & Health Sciences

Department of Business & Informatics
Fax: +254 067 22280
E-mail: info@tsmhs.com
Website: www.tsmhs.com
Table of Contents
CHAPTER ONE
1.0 Introduction to Statistics
1.2 What is statistics?
CHAPTER TWO
2.1 Introduction
2.1.1 Mean (Arithmetic)
2.2 Median
2.3 Mode
2.4 Skewed Distributions and the Mean and Median
2.5 Summary of when to use the mean, median and mode
2.6 Measures of Dispersion
2.6.1 Introduction
2.6.2 Range
2.6.3 Standard Deviation
CHAPTER THREE
CHAPTER FOUR
4.0 Introduction to Probability
CHAPTER FIVE
5.0 Correlation and Regression
Rank Correlation Formula
5.13 Rank Correlation
Coefficient of correlation
CHAPTER SIX
6.0 The Chi square test
6.5 Hypothesis testing
REFERENCES
CHAPTER ONE
1.0 INTRODUCTION TO BIOSTATISTICS
OBJECTIVES
By the end of this course, the participant will be able to:-
1. Discuss terminologies, basic principles and concepts of Bio-statistics
2. Discuss commonly used descriptive statistics
3. Discuss elements of inferential statistics
4. Discuss the application of statistical methods in data management
1.1 Definition of biostatistics
1. A scientific approach to information presented in numerical form, which enables
us to maximize our understanding of such information.
2. A mathematical science pertaining to the collection, analysis, interpretation or
explanation, and presentation of data. (W. M. Harper)
3. The application of statistics to a wide range of topics in medicine.
(Wikipedia)
4. The art and science of dealing with variation in such a way as to obtain reliable results.
(Mainland 1963)
1.2 What is Biostatistics?
Biostatistics is the application of statistical techniques to scientific research in health-related
fields, including medicine, biology, and public health, and the development of new tools to study
these areas.
Statistical techniques are used in studies such as identifying the causes of diseases and injuries,
evaluating public health programs to determine what works best in solving health problems, and
designing mathematical models that describe the progression of diseases in populations.
Biostatisticians collaborate with practitioners and researchers in clinical and public health and
with local, state, and national health institutions. Biostatisticians also advise public health
officials at the local, regional and national levels.
Biostatisticians find employment in various types of organizations and settings, including local
and state health departments, with the federal government such as at the Centers for Disease
Control and Prevention or other divisions in the Department of Health and Human Services,
and in academic settings, industry such as pharmaceutical companies, and health care providers
including hospitals and managed care organizations.
1.3 Use of Statistical Data

i. Making informed decisions
ii. Allows general conclusions to be made from data provided (limited/unlimited)
iii. Evidence based decision making affecting:-
a. Age and sex distribution of the population by social groups
– Birth rate, crude death rates, IMR, MMR, CDR, PMR, SBR, NDR, PNDR etc.
– Incidence, prevalence and attack rates
1.4 Importance of Statistics

i. Analyses general activities of an organization
ii. Planning for recurrent and development votes e.g. Capital projects of organizations
iii. Monitoring and evaluation of activities being undertaken e.g. clinical practices by
doctors etc.
iv. Clinical research and clinical trials
v. Epidemiological Studies
vi. An aid to supervision
vii. Base for planning
viii. Eyes of administration
ix. Arithmetic of human welfare
x. Disclose connection between related factors
xi. Helpful in business
xii. Used in all sciences
xiii. Helpful in data processing
1.5 Statistical Methods

i. Collection of data
a. Step one in statistical investigation
b. Foundation for statistical analysis
c. Must be accurate for reliable conclusions
ii. Organization
a. Involves editing to remove omissions
b. Data classification
c. Data tabulation
iii. Presentation
a. Facilitate analysis by having good presentation
iv. Analysis
a. Data analysis is done using statistical techniques, e.g. with software
such as SPSS, Epi Info, SAS etc.
v. Interpretation
– Finding/results/conclusions
1.6 Bio-statistics process involves

a) Basic understanding
b) Measurement
c) Data collection
d) Descriptive statistics
e) Inferential statistics
f) Methodological decision-making
g) Study designs
h) Data management
i) Use of data analysis software
j) Reviewer and conveyer of published research
k) Written and oral presentation
l) Relation to PH problems/issues/policies
m) Ethical practices
n) Working with other PH professionals
1.7 Types of Numerical DATA

As “data” we consider the result of an experiment or information collected in an
observational study. A rough classification is as follows:
• Nominal data
Numbers or text representing unordered categories (e.g., 0=male, 1=female)
• Ordinal data
Numbers or text representing categories where order counts (e.g., 1=fatal injury,
2=severe injury, 3=moderate injury, etc.)
• Discrete data
This is numerical data where both ordering and magnitude are important but only whole
number values are possible (e.g., the number of deaths caused by heart disease, 765,156 in
1988, versus suicide, 40,368 in 1988; see page 10 in the text).
• Continuous data
Numerical data where any conceivable value is, in theory, attainable (e.g., height, weight,
etc.)
1.8 Descriptive Statistics
i. Deals with methods of describing large data (masses of numbers)
ii. Describe a collection of data
iii. Identifies patterns in the data
iv. Describe samples in summary
v. Guides choice of statistical test
vi. Describe numerical data which includes mean, median, mode, standard deviation etc.
1.9 Inferential Statistics

i. Deals with the method of drawing conclusions from observed
variable/numbers. Involves use of statistical tests e.g. T-Test, Chi square.
ii. Used to determine the likelihood that a conclusion based on data from a sample is
true.
iii. Used in estimations, description of association (Correlation)
iv. Modeling of relationships (Regression)

1.10 Data Collection Methods
Direct Observation
Interviewing using interviewing schedule designs
Postal Questionnaires
Abstraction from already published statistics
1.11 Types of Data

i. Numerical
a. Continuous
b. Discrete
ii. Categorical
a. Ordinal
b. Nominal
1.12 Main function of statistics

Study design
Descriptive statistics
Inferential Statistics
– Estimation
– Detect relationships
There is no other way of representing "meaning" except in terms of
relations between some quantities or qualities; either way involves
relations between variables
– Prediction
Review questions
1. What are the terminologies, basic principles and concepts of Bio-statistics?
2. Define the commonly used descriptive statistics.
3. What are the elements of inferential statistics?
CHAPTER TWO
2.0 Measures of Central Tendency

By the end of this session, the participant will be able to:-
1. Define the measures of central tendency
2. Calculate measures of central tendency
3. Define measure of dispersion
2.1 Introduction
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. As such, measures of central tendency are
sometimes called measures of central location. They are also classed as summary statistics. The
mean (often called the average) is most likely the measure of central tendency that you are most
familiar with, but there are others, such as, the median and the mode.
The mean, median and mode are all valid measures of central tendency but, under different
conditions, some measures of central tendency become more appropriate to use than others. In
the following sections we will look at the mean, mode and median and learn how to calculate
them and under what conditions they are most appropriate to be used.
2.1.1 Mean (Arithmetic)
The mean (or average) is the most popular and best-known measure of central tendency. It can
be used with both discrete and continuous data, although it is most often used with continuous
data. The mean is equal to the sum of all the values in the data set divided by the number of
values in the data set. So, if we have n values in a data set and they have values
x1, x2, ..., xn, then the sample mean, usually denoted by $\bar{x}$ (pronounced "x bar"), is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
This formula is usually written in a slightly different manner using the Greek capital letter
$\Sigma$, pronounced "sigma", which means "sum of...":
$$\bar{x} = \frac{\sum x}{n}$$
You may have noticed that the above formula refers to the sample mean. So, why have we
called it a sample mean? This is because, in statistics, samples and populations have very
different meanings and these differences are very important, even if, in the case of the mean,
they are calculated in the same way. To acknowledge that we are calculating the population
mean and not the sample mean, we use the Greek lower case letter "mu", denoted as µ:
$$\mu = \frac{\sum x}{n}$$
The mean is essentially a model of your data set: a single value chosen to represent all the
values. You will notice, however, that the mean is not often one of the actual values that you
have observed in your data set. However, one of its important properties is that it minimizes
error in the prediction of any one value in your data set. That is, it is the value that
produces the lowest total squared error when used in place of every value in the data set.
An important property of the mean is that it includes every value in your data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of
the deviations of each value from the mean is always zero.
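The two properties above can be checked numerically. The following is a minimal sketch in Python; the data values and the helper name `sum_squared_error` are made up for illustration and do not come from these notes.

```python
# Illustrative values only; they are not taken from the notes.
values = [4, 8, 15, 16, 23, 42]

n = len(values)
mean = sum(values) / n  # x-bar = (sum of x) / n

# Deviation of each value from the mean; these always sum to zero.
deviations = [x - mean for x in values]


def sum_squared_error(centre):
    """Total squared error if `centre` is used in place of every value."""
    return sum((x - centre) ** 2 for x in values)


print(mean)                     # 18.0
print(sum(deviations))          # 0.0
print(sum_squared_error(mean))  # 910.0
print(sum_squared_error(17))    # 916
print(sum_squared_error(19))    # 916
```

Moving the centre away from the mean in either direction increases the total squared error, which is the sense in which the mean "minimizes error".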
2.1.2 When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.
These are values that are unusual compared to the rest of the data set by being especially small
or large in numerical value. For example, consider the wages of staff at a factory below:
Staff    1    2    3    4    5    6    7    8    9    10
Salary   15k  18k  16k  14k  15k  15k  12k  17k  90k  95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this
mean value might not be the best way to accurately reflect the typical salary of a worker, as most
workers have salaries in the $12k to $18k range. The mean is being skewed by the two large
salaries. Therefore, in this situation we would like to have a better measure of central tendency.
As we will find out later, taking the median would be a better measure of central tendency in this
situation.
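The effect of the two large salaries can be reproduced with Python's standard `statistics` module. This is a small sketch using the salary figures from the table above (in thousands of dollars); the variable names are chosen here for illustration.

```python
import statistics

# Salaries of the ten staff, in thousands of dollars (from the table above).
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = statistics.mean(salaries_k)
median_salary = statistics.median(salaries_k)

# The same calculation without the two outlying salaries (90k and 95k).
typical_k = [s for s in salaries_k if s < 50]
mean_without_outliers = statistics.mean(typical_k)

print(mean_salary)            # 30.7
print(median_salary)          # 15.5
print(mean_without_outliers)  # 15.25
```

The median of all ten salaries is close to the mean of the eight typical salaries, while the overall mean is roughly double either figure.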
Another time when we usually prefer the median over the mean (or mode) is when our data is
skewed (i.e. the frequency distribution for our data is skewed). Consider the normal
distribution, the one most frequently assessed in statistics: when the data is perfectly
normal, the mean, median and mode are identical. Moreover, they all represent the most
typical value in the data set. However, as the data becomes skewed, the mean loses its ability to
provide the best central location for the data because the skewed data drags it away from the
typical value. The median retains this position better and is not as strongly influenced by
the skewed values. This is explained in more detail in Section 2.4.
2.2 Median
The median is the middle score for a set of data that has been arranged in order of magnitude.
The median is less affected by outliers and skewed data. In order to calculate the median,
suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark, in this case 56 (the sixth value in the ordered list). It is
the middle mark because there are 5 scores before it and 5 scores after it. This works fine when
you have an odd number of scores, but what happens when you have an even number of scores? What
if you had only 10 scores? Well, you simply have to take the middle two scores and average the
result. So, if we look at the example below:
65 55 89 56 35 14 56 55 87 45
We again rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89
Only now we have to take the 5th and 6th score in our data set and average them to get a
median of 55.5.
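The odd and even cases described above can be sketched as a short Python function. The function name `median` and its structure are illustrative; the two data sets are the ones used in this section.

```python
def median(scores):
    """Return the middle value of the scores, averaging the two middle
    values when the number of scores is even."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle score.
        return ordered[mid]
    # Even count: average the two middle scores.
    return (ordered[mid - 1] + ordered[mid]) / 2


eleven_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
ten_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(median(eleven_scores))  # 56
print(median(ten_scores))     # 55.5
```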
2.3 Mode
The mode is the most frequent score in our data set. On a bar chart or histogram it is
represented by the highest bar. You can, therefore, sometimes consider the mode as being the
most popular option. An example of a mode is presented below:
[Figure: histogram with the mode at the highest bar]
Normally, the mode is used for categorical data where we wish to know which is the most
common category, as illustrated below:
[Figure: bar chart of forms of transport by frequency]
We can see above that the most common form of transport, in this particular data set, is the bus.
However, one of the problems with the mode is that it is not unique, so it leaves us with
problems when we have two or more values that share the highest frequency, such as below:
[Figure: bar chart in which two values share the highest frequency]
We are now stuck as to which mode best describes the central tendency of the data. This is
particularly problematic when we have continuous data, as we are more likely not to have any
one value that is more frequent than the others. For example, consider measuring the weights of
30 people (to the nearest 0.1 kg). How likely is it that we will find two or more people
with exactly the same weight, e.g. 67.4 kg? The answer is that it is very unlikely: many people
might be close, but with such a small sample (30 people) and a large range of possible weights
you are unlikely to find two people with exactly the same weight, that is, to the nearest 0.1 kg.
This is why the mode is very rarely used with continuous data.
Another problem with the mode is that it will not provide us with a very good measure of central
tendency when the most common mark is far away from the rest of the data in the data set, as
depicted in the diagram below:
[Figure: histogram in which the most frequent value lies far from the bulk of the data]
In the above diagram the mode has a value of 2. We can clearly see, however, that the mode is
not representative of the data, which is mostly concentrated around the 20 to 30 value range. To
use the mode to describe the central tendency of this data set would be misleading.
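The behaviour of the mode discussed in this section, including ties and continuous data, can be sketched with `statistics.multimode` (available in Python 3.8 and later). All three data sets below are invented for illustration; they are not the data behind the figures in these notes.

```python
import statistics

# Categorical data: one clear mode.
transport = ["bus", "car", "bus", "walk", "train", "bus", "car"]
print(statistics.multimode(transport))  # ['bus']

# Two values share the highest frequency, so the mode is not unique.
tied_scores = [2, 2, 5, 5, 7]
print(statistics.multimode(tied_scores))  # [2, 5]

# Continuous data measured to 0.1 kg: every value occurs once,
# so every value is returned and the mode tells us nothing.
weights_kg = [67.4, 70.1, 58.9, 81.3, 64.0]
print(statistics.multimode(weights_kg))  # [67.4, 70.1, 58.9, 81.3, 64.0]
```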
2.4 Skewed Distributions and the Mean and Median
We often test whether our data is normally distributed as this is a common assumption
underlying many statistical tests. An example of a normally distributed set of data is presented
below:
[Figure: histogram of a normally distributed data set]
When you have a normally distributed sample you can legitimately use either the mean or the
median as your measure of central tendency. In fact, in any symmetrical distribution with a
single peak the mean, median and mode are equal. However, in this situation, the mean is widely
preferred as the best measure of central tendency as it is the measure that includes all the
values in the data set for its calculation, and any change in any of the scores will affect the
value of the mean. This is not the case with the median or mode.
However, when our data is skewed, for example, as with the right-skewed data set below:
[Figure: histogram of a right-skewed data set]
we find that the mean is being dragged in the direction of the skew. In these situations, the
median is generally considered to be the best representative of the central location of the data.
The more skewed the distribution, the greater the difference between the median and mean, and the
greater the emphasis that should be placed on using the median as opposed to the mean. A classic
example of the above right-skewed distribution is income (salary), where higher earners provide a
false representation of the typical income if expressed as a mean and not a median.
If the data is expected to follow a normal distribution but tests of normality show that it is
non-normal, then it is customary to use the median instead of the mean. However, this is more a
rule of thumb than a strict guideline. Sometimes, researchers wish to report the mean of a skewed
distribution if the median and mean are not appreciably different (a subjective assessment) and if
it allows easier comparisons to previous research to be made.
2.5 Summary of when to use the mean, median and mode
Please use the following summary table to choose the best measure of central tendency
for each type of variable.
Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
2.6 Measures of Dispersion
2.6.1 Introduction
While measures of central tendency are used to estimate the typical values of a
dataset, measures of dispersion are important for describing the spread of the
data, or its variation around a central value. Two distinct samples may have the
same mean or median, but completely different levels of variability, or vice
versa. A proper description of a set of data should include both of these
characteristics. There are various methods that can be used to measure the
dispersion of a dataset, each with its own set of advantages and disadvantages.
2.6.2 Range
i. Defined as the difference between the largest and smallest sample values.
ii. One of the simplest measures of variability to calculate.
iii. Depends only on extreme values and provides no information about how
the remaining data is distributed.
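The points above can be illustrated with a short Python sketch. The three samples are invented for illustration: all share the same mean, the first two differ widely in range, and the last two share a range despite being distributed very differently.

```python
def value_range(sample):
    """Range = largest sample value minus smallest sample value."""
    return max(sample) - min(sample)


# Hypothetical samples, all with a mean of 50.
tight = [48, 49, 50, 51, 52]
spread = [10, 30, 50, 70, 90]
clustered_with_extremes = [10, 50, 50, 50, 90]

print(value_range(tight))                    # 4
print(value_range(spread))                   # 80
print(value_range(clustered_with_extremes))  # 80
```

The last two samples have identical ranges because the range depends only on the two extreme values, which is the limitation noted in point iii.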
2.6.3 Standard Deviation
i. The standard deviation is the square root of the sample variance.
ii. Defined so that it can be used to make inferences about the population variance.
iii. Calculated using the formula:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
iv. The values computed in the squared term, $x_i - \bar{x}$, are the deviations of each value
from the mean (also called anomalies).
v. Not restricted to large sample datasets, in contrast to the root mean square
anomaly, which divides by n rather than n - 1.
vi. When the data are approximately normally distributed, it provides significant information
about the distribution of data around the mean:
a. The mean ± one standard deviation contains approximately 68% of the
measurements in the series.
b. The mean ± two standard deviations contains approximately 95% of
the measurements in the series.
c. The mean ± three standard deviations contains approximately 99.7% of
the measurements in the series.
vii. Climatologists often use standard deviations to help classify abnormal climatic
conditions. The chart below describes the abnormality of a data value by how many
standard deviations it is located away from the mean. The probabilities in the third
column assume the data is normally distributed.
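As a rough check of the points above, the sketch below computes the sample standard deviation by hand (dividing by n - 1), compares it with `statistics.stdev`, and then uses `statistics.NormalDist` (Python 3.8 and later) to recover the 68%, 95% and 99.7% figures for a normal distribution. The data values are invented for illustration.

```python
import math
import statistics

# Illustrative values only; they are not taken from the notes.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n
squared_deviations = [(x - mean) ** 2 for x in values]

# Sample standard deviation: divide by n - 1, then take the square root.
sample_sd = math.sqrt(sum(squared_deviations) / (n - 1))

print(round(sample_sd, 3))                  # 2.138
print(round(statistics.stdev(values), 3))   # 2.138

# Share of a normal distribution within k standard deviations of the mean.
std_normal = statistics.NormalDist()  # mean 0, standard deviation 1
shares = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(k, round(share * 100, 1))  # 1 68.3, then 2 95.4, then 3 99.7
```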
Review question
1. Describe measures of central tendency
2. Describe measures of dispersion
HOSPITAL ADMINISTRATIVE STATISTICS

Hospital Administrative Statistics are statistics required for in-patient returns needed for
administrative purposes. These statistics are commonly collected using a form known
as the Daily Bed Return (DBR) or Daily Bed State.
The Daily Bed Return is a document completed in a ward, covering the ward's bed state over
24 hours. The actual time for completing the DBR is determined by the Hospital Administration;
however, it should be completed during the night, normally at 12 midnight, when patient
movement within the hospital is minimized.
Daily Bed Return should indicate
1. Patient movement in and out of the ward that is admissions and discharges from
the ward.
2. Patient movement within the hospital that is ward inter-transfers.
3. Actual patient counts that is number of patients in the ward
4. Number of vacant beds and cots
5. A section for computation of figures by the Medical Records Officer
Example of Daily Ward Return
Form 1
Section 1
Hospital ……………. Date …………….. Ward ……….
Admissions                        | Discharges and Deaths
Hospital No.   Name               | Hospital No.   Name   Discharged To
Section 2 Inter-ward transfers within the hospital
Admissions                        | Discharges
From Ward   Hosp. No.   Name      | To Ward   Hosp. No.   Name
Section 3 Paroles
Admission from Parole             | Discharge to Parole
Hospital Number   Name            | Hospital Number   Name
Section 4 Abscondees
Admissions                        | Absconded
Hospital Number   Name            | Hospital Number   Name
Section 5 Computation
Previously Daily Return Numbers Today’s Daily Return Numbers
Beds Cots Total Beds Cots Total
Patients Patients
Well/People Well/People
Vacant Vacant
Total Total
Signed ……………. i/c for ward
RECORDS USE ONLY Well people…….
Patients ……. Previous………
Previous……. Admissions…..
Admissions…. Discharges…..
Discharges ….. TOTAL……..
Total………… CHECKED BY…………..
Form II
Daily summary form for In-patient Statistics
Hospital……………………….. Ward……………… Month……….. Year…..
ADMISSIONS DISCHARGES
T/I T/O DEATH W/P OBD
DATE HOME PAROLE ABSC HOME PAROLE ABSC
1
2
3
4
5
TOTAL
The Daily summary form summarizes the day's statistics and indicates the total
admissions and discharges of the ward. There is a column for OBD (Occupied Bed
Days), which is the total number of patients remaining in the ward each day.
In- patient Statistics Terminology

1. Bed complement or Staffed Bed. The number of beds permanently allocated to
a hospital for the treatment of in-patients only.
2. Bed State. The number of patients occupying hospital beds at any given time.
3. Bed Turnover / Turnover per bed per year / Throughput per bed. The average
number of patients treated per bed during a period.
4. Turnover Interval. The average number of days a bed remains vacant between
successive patients.
5. Percentage occupancy. The percentage of actual patient days or occupied bed
days to the maximum available patient days or available bed days as determined
by the bed capacity during any given period.
6. Occupied Bed Days. The total number of patients remaining in the hospital or
ward each day added together over the reference period.
7. Length of Stay. The number of days a patient occupies a bed.
8. Average occupancy or average number of patients. The average number of
patients in a hospital during a specified period.
9. In-patient. A person who has gone through the full admission procedure of
the hospital and is occupying a bed in the in-patient department.
10. Day case. A person attending hospital as a non-resident patient for
investigation, therapeutic or operative procedures, or other treatment, who
requires some form of preparation, a period of recovery, or both, involving the
provision of accommodation and services.
Analytical Formulae
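1) Available Bed Days (ABD) or Staffed Bed Days:
$$\text{Available Bed Days} = \text{Available Beds} \times \text{Days in Period}$$
2) Available Beds (AB): the total authorized beds in a hospital, or beds allocated during
a specified period, or
$$\text{Available Beds} = \frac{\text{Available Bed Days}}{\text{Days in Period}}$$
3) Occupied Bed Days (OBD), Patient Days or Inpatient Days: the total patients
remaining each day, added together for the days in the period.
$$\text{OBD} = \frac{\text{Percentage Occupancy} \times \text{Available Bed Days}}{100\%}$$
4) Percentage Occupancy:
$$\text{Percentage Occupancy} = \frac{\text{Occupied Bed Days}}{\text{Available Bed Days}} \times 100\%$$
5) Average Number of Occupied Beds or Average Number of Patients:
$$\text{Average Number of Occupied Beds} = \frac{\text{Occupied Bed Days}}{\text{Days in Period}}$$
6) Vacant Bed Days (VBD):
$$\text{Vacant Bed Days} = \text{Available Bed Days} - \text{Occupied Bed Days}$$
7) Length of Stay or Average Length of Stay:
$$\text{Length of Stay} = \frac{\text{Occupied Bed Days}}{\text{Discharges and Deaths}}$$
$$\text{Discharges and Deaths} = \frac{\text{Occupied Bed Days}}{\text{Length of Stay}}$$
$$\text{Occupied Bed Days} = \text{Length of Stay} \times \text{Discharges and Deaths}$$
8) Excess Patient Days:
$$\text{Excess Patient Days} = \text{Occupied Bed Days} - \text{Available Bed Days}$$
9) Average Daily Population without Beds:
$$\text{Average Daily Population without Beds} = \frac{\text{Occupied Bed Days} - \text{Available Bed Days}}{\text{Days in Period}}$$
10) Turnover per Bed (T.O/B), Discharges and Deaths per Available Bed, or Throughput per Bed:
$$\text{T.O/B} = \frac{\text{Discharges and Deaths}}{\text{Available Beds}}$$
$$\text{Discharges and Deaths} = \text{T.O/B} \times \text{AB}$$
$$\text{Available Beds} = \frac{\text{Discharges and Deaths}}{\text{T.O/B}}$$
11) Turnover Interval (TOI), where D\&D is Discharges and Deaths:
$$\text{TOI} = \frac{\text{VBD}}{\text{D\&D}}$$
$$\text{VBD} = \text{TOI} \times \text{D\&D}$$
$$\text{D\&D} = \frac{\text{VBD}}{\text{TOI}}$$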
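The formulae above can be sketched as a small set of Python functions. The function names and the ward figures (100 beds, a 30-day period, 2,400 occupied bed days, 300 discharges and deaths) are hypothetical and chosen only to give round numbers; they are not taken from the exercise that follows.

```python
def available_bed_days(beds, days_in_period):
    return beds * days_in_period


def percentage_occupancy(occupied_bed_days, avail_bed_days):
    return occupied_bed_days * 100 / avail_bed_days


def average_occupied_beds(occupied_bed_days, days_in_period):
    return occupied_bed_days / days_in_period


def vacant_bed_days(avail_bed_days, occupied_bed_days):
    return avail_bed_days - occupied_bed_days


def average_length_of_stay(occupied_bed_days, discharges_and_deaths):
    return occupied_bed_days / discharges_and_deaths


def turnover_per_bed(discharges_and_deaths, beds):
    return discharges_and_deaths / beds


def turnover_interval(vacant_days, discharges_and_deaths):
    return vacant_days / discharges_and_deaths


# Hypothetical ward figures for one 30-day month.
beds, days, obd, d_and_d = 100, 30, 2400, 300

abd = available_bed_days(beds, days)
vbd = vacant_bed_days(abd, obd)

print(abd)                                   # 3000
print(percentage_occupancy(obd, abd))        # 80.0
print(average_occupied_beds(obd, days))      # 80.0
print(vbd)                                   # 600
print(average_length_of_stay(obd, d_and_d))  # 8.0
print(turnover_per_bed(d_and_d, beds))       # 3.0
print(turnover_interval(vbd, d_and_d))       # 2.0
```

When occupancy exceeds 100%, the vacant bed days come out negative; the same difference with its sign reversed is the excess patient days of formula 8.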
EXERCISE
a. In the year 2003 the Agha Khan gynecological specialty had 120 beds allocated
throughout the year. It had a percentage occupancy of 110%. Calculate:
i. Occupied Bed Days
ii. Average daily population
iii. Excess patient days
iv. Average daily population without beds
b. In the year 2004, Etihad hospital had 800 beds permanently allocated for in-patient
use. During the period the hospital's percentage occupancy was 110%, 1500 patients were
discharged home alive, 80 patients went on parole and there were 20 deaths. Use the
information to calculate:
i. Occupied bed days
ii. Average length of stay
iii. Turnover per bed
iv. Excess/vacant bed days
v. Turnover interval
CHAPTER THREE
3.0 Stem-and-Leaf Plots
By the end of this session, the participant will be able to:-
1. Define stem and leaf plot

2. Create stem and leaf plot
3.1 Definition: A stem-and-leaf plot is a convenient method of displaying every piece of
data by showing the digits of each number.
In a stem-and-leaf plot, the greatest common place value of the data is used to form the
stems.
The numbers in the next greatest place-value position are then used to form the leaves.
Leaf: The last digit on the right of the number.
Stem: The digit or digits that remain when the leaf is dropped.
Look at the number 284.
The leaf is the last digit: the number 4.
The stem is the remaining digits when the leaf is dropped: the number 28.
The stem with the leaf forms the number 284.
Example
Stem and Leaf Plot
Array of Lengths (cm)
138 145 147 142
150 144 139 144
147 151 150 143
144 154 152 150
148 139 144 151
144 141 138 147
139 143 140 149
Arm Span Lengths (cm)
Stem | Leaves
15   | 0 1 0 4 2 0 1
14   | 5 7 2 4 4 7 3 4 8 4 4 1 7 3 0 9
13   | 8 9 9 8 9
Legend: 15 | 0 = 150
A stem and leaf plot is a display that organizes data to show its shape and distribution. In a
stem-and-leaf plot each data value is split into a "stem" and a "leaf". The "leaf" is usually the
last digit of the number and the other digits to the left of the "leaf" form the "stem". The
number 123 would be split as: stem = 12, leaf = 3.
A stem-and-leaf plot resembles a histogram turned sideways. The stem values
represent the intervals of a histogram, and the number of leaves on each stem represents the
frequency for each interval.
One advantage of the stem-and-leaf plot over the histogram is that the stem-and-leaf plot displays
not only the frequency for each interval, but also all of the individual values within that
interval.
Look at the stem and leaf plot above. Notice the following features of the graph:
1. title – “Arm Span Lengths (cm)”
a. all graphs need a title so that people analyzing the data can understand at a glance
what the graph is trying to portray
2. stem – on this particular plot, the stem column consists of the hundreds and tens digits
3. leaves – on this particular graph, the leaves consist of the ones digits
4. legend – helps the reader create numbers from the stem and leaf
3.3 HOW DO I CREATE A STEM AND LEAF PLOT?
i. write the data in numerical order to help organize

ii. separate each number into a stem and leaf
a. for the most part, for 2 digit numbers, the tens digit is the stem and the ones digit
is the leaf
b. for 3 digit numbers, the hundreds and tens are the stem and the ones are the leaf
iii. group the numbers with the same stems
iv. list the stems in numerical order
v. insert the leaf numbers for each stem
vi. prepare an appropriate legend for the graph (e.g. 3 | 6 means 36)
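The steps above can be sketched in a few lines of Python. This is only an illustration, assuming non-negative whole-number data where the last digit is the leaf; the function name stem_and_leaf is made up for this example, and the data are the arm span lengths from the earlier example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return a dict mapping each stem to its list of leaves (in order)."""
    plot = defaultdict(list)
    for v in sorted(values):            # step i: put the data in numerical order
        stem, leaf = divmod(v, 10)      # step ii: the ones digit is the leaf
        plot[stem].append(leaf)         # steps iii and v: group leaves by stem
    return dict(sorted(plot.items()))   # step iv: stems in numerical order

arm_spans = [138, 145, 147, 142, 150, 144, 139, 144, 147, 151, 150, 143,
             144, 154, 152, 150, 148, 139, 144, 151, 144, 141, 138, 147,
             139, 143, 140, 149]

plot = stem_and_leaf(arm_spans)
for stem, leaves in plot.items():
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
print("Legend: 15 | 0 = 150")           # step vi: the legend
```

Because the data are sorted first, the leaves on each stem come out in increasing order, which makes the median and the mode easier to read off the plot.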
Special Case:
If you are comparing two sets of data, you can use back-to-back stem and leaf plots. For
example:
Data Set A: 40, 42, 43     Data Set B: 41, 45, 46, 47
Data Set A | Stem | Data Set B
      Leaf | Stem | Leaf
     3 2 0 |  4   | 1 5 6 7
The data in the table below shows math test scores (out of 50) from a grade seven class. Using
the data from the table below and this graph as a guide, fill in the stems and leaves to complete
the plot:
Array of Test Scores (out of 50)

50 42 44 50
35 48 36 40
42 45 50 38
45 49 47 50
Title:_____________________________________
Stem Leaves
Legend:__________________________________
Stem and Leaf Plot Activities
Answer the following questions on a separate sheet of paper.
1. a) The chart shows the age of 34 actresses when they won the Academy Award. Display
the data in a stem and leaf plot.
Ages of Actresses (1959 – 1992)

38 28 27 31 37 30 24 34 60 61
26 35 34 34 26 37 42 41 35 31
41 33 30 74 33 49 38 61 21 41
26 80 42 29
b) How old was the youngest actress when she won the award? the oldest actress?
c) What patterns can you find?
d) Did you discover patterns more easily from the chart or from the stem and leaf plot?
Why?
2.a) Data similar to the data in problem 1 is shown for Best Actors. Display the data in a
stem and leaf plot.
Ages of Actors (1959 – 1992)
35 47 31 46 39 56 41 44 42 43
62 43 40 48 48 56 38 60 32 41
42 37 76 39 55 45 35 61 33 51
31 42 62 62
b) How do the ages of the youngest and oldest actors compare with the actresses in
problem 1?
c) Create a stem and leaf plot to display both sets of data together to help make
comparisons.
Review question
1. What is a stem and leaf plot?
2. Describe the application of stem and leaf plots.
3. How can a stem and leaf plot be used in data management?
CHAPTER FOUR
4.0 Introduction to Probability
By the end of the session, the participant will be able to:-
1. Define probability
2. Define sample space
3. Be able to calculate relative frequency
4. Apply multiplication and addition rules
4.1 Definition of Probability

The probability of an event equals the number of times it happens divided by the number of
opportunities.
These numbers can be determined by experiment or by knowledge of the system.
For instance, rolling a die (singular of dice). The chance of rolling a 2 is 1/6, because there is a 2
on one face and a total of 6 faces. So, assuming the die is balanced, a 2 will come up 1 time in 6.
Probability is a measure of how likely it is for an event to happen.

We name a probability with a number from 0 to 1.
If an event is certain to happen, then the probability of the event is 1.
If an event is certain not to happen, then the probability of the event is 0.
4.2 Outcome
An outcome is the result of an experiment or other situation involving uncertainty.
The set of all possible outcomes of a probability experiment is called a sample space.
4.3 Sample Space
The sample space is an exhaustive list of all the possible outcomes of an experiment. Each
possible result of such a study is represented by one and only one point in the sample
space, which is usually denoted by S.
Examples
Experiment rolling a die once:
Sample space S = {1,2,3,4,5,6}
Experiment Tossing a coin:

Sample space S = {Heads, Tails}
Experiment Measuring the height (cm) of a girl on her first day at school:
Sample space S = the set of all positive real numbers
4.4 Event
An event is any collection of outcomes of an experiment.
Formally, any subset of the sample space is an event.
Any event which consists of a single outcome in the sample space is called an elementary
or simple event. Events which consist of more than one outcome are called compound
events.
Set theory is used to represent relationships among events. In general, if A and B are two events
in the sample space S, then
A ∪ B (A union B) = 'either A or B occurs or both occur'
A ∩ B (A intersection B) = 'both A and B occur'
A ⊂ B (A is a subset of B) = 'if A occurs, so does B'
A' or Ā = 'event A does not occur'
∅ (the empty set) = an impossible event
S (the sample space) = an event that is certain to occur
Example
Experiment: rolling a die once
Events A = 'score < 4' = {1,2,3}
B = 'score is even' = {2,4,6}
C = 'score is 7' = ∅
A ∪ B = 'the score is < 4 or even or both' = {1,2,3,4,6}
A ∩ B = 'the score is < 4 and even' = {2}
A' or Ā = 'event A does not occur' = {4,5,6}
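The die example can be checked with Python's built-in sets, where | is union, & is intersection and - is set difference. This is a small sketch; the variable names are chosen only for this illustration.

```python
# Sample space for one roll of a die
S = {1, 2, 3, 4, 5, 6}

A = {s for s in S if s < 4}        # 'score < 4'
B = {s for s in S if s % 2 == 0}   # 'score is even'
C = {s for s in S if s == 7}       # 'score is 7' (impossible event)

a_or_b = A | B      # union: A or B or both
a_and_b = A & B     # intersection: both A and B
not_a = S - A       # complement of A within the sample space

print(a_or_b, a_and_b, not_a, C)
```

An impossible event such as C comes out as the empty set, written set() in Python.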
4.5 Relative Frequency
Relative frequency is another term for proportion; it is the value calculated by dividing the
number of times an event occurs by the total number of times an experiment is carried out. The
probability of an event can be thought of as its long-run relative frequency when the experiment
is carried out many times.
If an experiment is repeated n times, and event E occurs r times, then the relative frequency
of the event E is defined to be
rfn(E) = r/n
Example
Experiment: Tossing a fair coin 50 times (n = 50)

Event E = 'heads'
Result: 30 heads, 20 tails, so r = 30
Relative frequency: rfn(E) = r/n = 30/50 = 3/5 = 0.6
If an experiment is repeated many, many times without changing the experimental conditions,
the relative frequency of any particular event will settle down to some value. The probability of
the event can be defined as the limiting value of the relative frequency:
P(E) = lim rfn(E) as n → ∞
For example, in the above experiment, the relative frequency of the event 'heads' will settle
down to a value of approximately 0.5 if the experiment is repeated many more times.
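The settling-down of the relative frequency can be illustrated with a simulated fair coin. This is a sketch only: the function name relative_frequency and the fixed seed are assumptions made for the illustration, and a simulation is not a substitute for a real experiment.

```python
import random

def relative_frequency(n, seed=0):
    """Toss a simulated fair coin n times and return r/n for the event 'heads'."""
    rng = random.Random(seed)                            # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n))    # r = number of heads
    return heads / n

for n in (50, 500, 5000, 50000):
    print(n, relative_frequency(n))
```

For small n the relative frequency can be some way from 0.5 (as in the 30 heads out of 50 above); as n grows it tends to stay closer to 0.5.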
4.6 Probability
A probability provides a quantitative description of the likely occurrence of a particular event.

Probability is conventionally expressed on a scale from 0 to 1; a rare event has a probability
close to 0, a very common event has a probability close to 1.
The probability of an event has been defined as its long-run relative frequency. It has also been
thought of as a personal degree of belief that a particular event will occur (subjective
probability).
In some experiments, all outcomes are equally likely. For example if you were to choose one
winner in a raffle from a hat, all raffle ticket holders are equally likely to win, that is, they have
the same probability of their ticket being chosen. This is the equally-likely outcomes model and
is defined to be:
P(E) = (number of outcomes corresponding to event E) / (total number of outcomes)
Examples
1. The probability of drawing a spade from a pack of 52 well-shuffled playing cards is
13/52 = 1/4 = 0.25, since:
event E = 'a spade is drawn';
the number of outcomes corresponding to E = 13 (spades);
the total number of outcomes = 52 (cards).
2. When tossing a coin, we assume that the results 'heads' or 'tails' each have equal
probabilities of 0.5.
4.7 Subjective Probability
A subjective probability describes an individual's personal judgment about how likely a
particular event is to occur. It is not based on any precise computation but is often a reasonable
assessment by a knowledgeable person.
Like all probabilities, a subjective probability is conventionally expressed on a scale from 0 to
1; a rare event has a subjective probability close to 0, a very common event has a subjective
probability close to 1.
A person's subjective probability of an event describes his/her degree of belief in the event.
Example
A Rangers supporter might say, "I believe that Rangers have probability of 0.9 of winning the
Scottish Premier Division this year since they have been playing really well."
4.8 Independent Events
Two events are independent if the occurrence of one of the events gives us no information about
whether or not the other event will occur; that is, the events have no influence on each other.
In probability theory we say that two events, A and B, are independent if the probability that they
both occur is equal to the product of the probabilities of the two individual events, i.e.
P(A ∩ B) = P(A).P(B)
The idea of independence can be extended to more than two events. For example, A, B and C are
independent if:
a. A and B are independent; A and C are independent and B and C are independent
(pairwise independence);
b. P(A ∩ B ∩ C) = P(A).P(B).P(C)
If two events, each with non-zero probability, are independent then they cannot be mutually
exclusive (disjoint), and vice versa.
Example
Suppose that a man and a woman each have a pack of 52 playing cards. Each draws a card from
his/her pack. Find the probability that they each draw the ace of clubs.
We define the events:
A = 'man draws ace of clubs', P(A) = 1/52
B = 'woman draws ace of clubs', P(B) = 1/52
Clearly events A and B are independent so:
P(A ∩ B) = P(A).P(B) = 1/52 . 1/52 = 0.00037
That is, there is a very small chance that the man and the woman will both draw the ace of clubs.
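The same answer can be reached by counting. The sketch below (variable names are illustrative only) lists every possible pair of cards the man and the woman could draw and counts the pairs where both are the ace of clubs, then compares the result with the product of the two individual probabilities.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
pack = list(product(ranks, suits))            # one pack of 52 cards

pairs = list(product(pack, pack))             # (man's card, woman's card): 52 x 52 outcomes
ace_of_clubs = ("A", "clubs")
both = sum(1 for man, woman in pairs if man == ace_of_clubs and woman == ace_of_clubs)

p_both = Fraction(both, len(pairs))           # counted directly
p_product = Fraction(1, 52) * Fraction(1, 52) # P(A).P(B)
print(p_both, float(p_both))
```

Using Fraction keeps the arithmetic exact, so the counted probability and the product of the two individual probabilities can be compared without rounding.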
4.9 Mutually Exclusive Events
Two events are mutually exclusive (or disjoint) if it is impossible for them to occur together.
Formally, two events A and B are mutually exclusive if and only if
A ∩ B = ∅
If two events, each with non-zero probability, are mutually exclusive, they cannot be independent,
and vice versa.
Examples
1. Experiment: Rolling a die once
Events A = 'observe an odd number' = {1,3,5}
B = 'observe an even number' = {2,4,6}
A ∩ B = ∅ (the empty set), so A and B are mutually exclusive.
2. A subject in a study cannot be both male and female, nor can they be aged 20 and 30. A
subject could however be both male and 20, or both female and 30.
4.10 Addition Rule
The addition rule is a result used to determine the probability that event A or event B occurs or
both occur.
The result is often written as follows, using set notation:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A ∪ B) = probability that event A or event B occurs
P(A ∩ B) = probability that event A and event B both occur
For mutually exclusive events, that is events which cannot occur together:
P(A ∩ B) = 0
The addition rule therefore reduces to
P(A ∪ B) = P(A) + P(B)
For independent events, that is events which have no influence on each other:
P(A ∩ B) = P(A).P(B)
The addition rule therefore reduces to
P(A ∪ B) = P(A) + P(B) - P(A).P(B)
Example
Suppose we wish to find the probability of drawing either a king or a spade in a single draw from
a pack of 52 playing cards.
We define the events A = 'draw a king' and B = 'draw a spade'
Since there are 4 kings in the pack and 13 spades, but 1 card is both a king and a spade, we have:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 4/52 + 13/52 - 1/52 = 16/52
So, the probability of drawing either a king or a spade is 16/52 (= 4/13).
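The king-or-spade example can be verified by listing all 52 cards and counting, using the equally-likely outcomes model. This is a minimal sketch; the helper names prob, is_king and is_spade are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))     # 52 equally likely outcomes

def prob(event):
    """P(event) = favourable outcomes / total outcomes."""
    favourable = sum(1 for card in DECK if event(card))
    return Fraction(favourable, len(DECK))

def is_king(card):
    return card[0] == "K"

def is_spade(card):
    return card[1] == "spades"

p_a = prob(is_king)                                        # P(A)
p_b = prob(is_spade)                                       # P(B)
p_a_and_b = prob(lambda c: is_king(c) and is_spade(c))     # P(A and B)
p_a_or_b = prob(lambda c: is_king(c) or is_spade(c))       # P(A or B), counted directly

print(p_a_or_b, p_a + p_b - p_a_and_b)   # the two values agree
```

Subtracting P(A and B) stops the king of spades from being counted twice, once as a king and once as a spade.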
4.11 Multiplication Rule
The multiplication rule is a result used to determine the probability that two events, A and B,
both occur.
The multiplication rule follows from the definition of conditional probability.
P(A ∩ B) = P(A | B).P(B) or P(A ∩ B) = P(B | A).P(A)
where:
P(A ∩ B) = probability that event A and event B occur
P(A | B) = the conditional probability that event A occurs given that event B has occurred
already
P(B | A) = the conditional probability that event B occurs given that event A has occurred
already
For independent events, that is events which have no influence on one another, the rule
simplifies to:
P(A ∩ B) = P(A).P(B)
That is, the probability of the joint events A and B is equal to the product of the individual
probabilities for the two events.
4.12 Conditional Probability
In many situations, once more information becomes available, we are able to revise our
estimates for the probability of further outcomes or events happening. For example, suppose you
go out for lunch at the same place and time every Friday and you are served lunch within 15
minutes with probability 0.9. However, given that you notice that the restaurant is exceptionally
busy, the probability of being served lunch within 15 minutes may reduce to 0.7. This is the
conditional probability of being served lunch within 15 minutes given that the restaurant is
exceptionally busy.
The usual notation for "event A occurs given that event B has occurred" is "A | B" (A given B).
The symbol | is a vertical line and does not imply division. P(A | B) denotes the probability that
event A will occur given that event B has occurred already.
A rule that can be used to determine a conditional probability from unconditional probabilities is:
P(A | B) = P(A ∩ B) / P(B)
where:
P(A | B) = the (conditional) probability that event A will occur given that event B has
occurred already
P(A ∩ B) = the (unconditional) probability that event A and event B both occur
P(B) = the (unconditional) probability that event B occurs
4.13 Law of Total Probability
P(A) = P(A ∩ B) + P(A ∩ B')
where:
P(A ∩ B) = probability that event A and event B both occur
P(A ∩ B') = probability that event A and event B' both occur, i.e. A occurs and B
does not.
Using the multiplication rule, this can be expressed as
P(A) = P(A | B).P(B) + P(A | B').P(B')
Bayes' Theorem
Bayes' Theorem is a result that allows new information to be used to update
the conditional probability of an event.
Using the multiplication rule gives Bayes' Theorem in its simplest form:
P(A | B) = P(B | A).P(A) / P(B)
Using the Law of Total Probability:
P(A | B) = P(B | A).P(A) / [P(B | A).P(A) + P(B | A').P(A')]
where:
P(A') = probability that event A does not occur
P(A | B) = probability that event A occurs given that event B has occurred already
P(B | A) = probability that event B occurs given that event A has occurred already
P(B | A') = probability that event B occurs given that event A has not occurred already
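A short numerical sketch may help to see how the Law of Total Probability feeds into Bayes' Theorem. The figures below are hypothetical and are not taken from these notes: A = 'person has a disease' with P(A) = 0.01, and B = 'screening test is positive' with P(B | A) = 0.95 and P(B | A') = 0.05. The function names are also made up for the illustration.

```python
def total_probability(p_b_given_a, p_a, p_b_given_not_a):
    """P(B) = P(B | A).P(A) + P(B | A').P(A')"""
    return p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

def bayes(p_b_given_a, p_a, p_b_given_not_a):
    """P(A | B) = P(B | A).P(A) / P(B)"""
    p_b = total_probability(p_b_given_a, p_a, p_b_given_not_a)
    return p_b_given_a * p_a / p_b

# Hypothetical values, for illustration only
p_a = 0.01               # P(A): person has the disease
p_b_given_a = 0.95       # P(B | A): test positive given disease
p_b_given_not_a = 0.05   # P(B | A'): test positive given no disease

p_b = total_probability(p_b_given_a, p_a, p_b_given_not_a)
posterior = bayes(p_b_given_a, p_a, p_b_given_not_a)
print(round(p_b, 3), round(posterior, 3))
```

With these made-up figures the probability of disease given a positive test is only about 0.16, because most positive results come from the much larger group of people without the disease.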
Review question
1. What is probability?
2. State what is meant by sample space
3. Describe how relative frequency is calculated
4. When do we use multiplication and addition rules?
CHAPTER FIVE
5.0 Correlation and Regression
By the end of the session, participant will be able to:-

1. Define and calculate correlation
2. Define and use regression line
3. Draw and interpret a scatter graph
4. Calculate and interpret correlation coefficient
5. Calculate and interpret rank correlation
5.1 Correlation and regression
Correlation and regression are concerned with the investigation of two variables.
Previously we have only considered a single variable - now we look at two associated
variables. We might want to know:
i. whether a relationship exists between those variables;
ii. if so, how strong that relationship is;
iii. what form that relationship takes;
iv. whether we can make use of that relationship for predictive purposes, i.e. forecasting.
5.2 Correlation
Correlation describes the strength of the relationship. It is not concerned with 'cause' and
'effect'.
5.3 Regression
Regression describes the relationship itself in the form of a straight line equation which best
fits the data.
Some initial insight into the relationship between two continuous variables can be obtained
by plotting a scatter diagram and simply looking at the resulting graph.
Does the relationship seem to be linear or curved?
1. If there appears to be a linear relationship, it can be quantified. A correlation
coefficient is calculated as the measure of the strength of this relationship. Its symbol
is 'r' and its value lies between -1 and +1.
2. Is the association between the two variables strong enough to be useful?
3. If the relationship is found to be significantly strong, then its nature can be
found using linear regression, which defines the equation of the straight line
of 'best fit' through the bivariate data, y = a + bx. For example, £x spent on
Advertising is expected to increase Sales by £y.
4. The 'goodness of fit' can be calculated to see how well the line fits the data.
5. Once defined by an equation, the relationship can be used for predictive purposes.
Example
'Ice cream Sales' for a particular firm of manufacturers and 'Average Monthly Temperature'.
Month Av.Temp. (°C) Sales (£'000)
January 4 73
February 4 57
March 7 81
April 8 94
May 12 110
June 15 124
July 16 134
August 17 139
September 14 124
October 11 103
November 7 81
December 5 80
From this data we need:
i. Scatter diagram of Sales against Temp.
ii. Correlation coefficient between them
iii. Regression line associating Sales with Temp.
iv. Goodness of fit of data to line
v. Predictions (estimates) of Sales
5.4 Scatter diagrams
We look for a linear relationship, with the plotted bivariate points being reasonably close to the
as yet unknown 'line of best fit'.
Plot the independent variable, x, on the horizontal axis.
Plot the potentially dependent variable, y, on the vertical axis. (Minitab output below)
Looks promising: a straight-line relationship, with all points fairly close to a 'line of best
fit'.
Sales against Average Monthly Temperature
[Scatter diagram: Sales (vertical axis, 50 to 140) against Av.Temp. (horizontal axis, 5 to 15)]
5.5 Pearson’s Correlation Coefficient (r)

(for quantitative data only)
The strength of the relationship will be quantified by calculating the correlation coefficient and
tested for significance.
It is best to use a calculator in L.R. (linear regression) mode, as will be demonstrated in tutorials.
5.6 Correlation coefficient

Data are input to the calculator in L.R. mode.
The correlation coefficient, r, is produced from shift ( ('open bracket') on Casio
calculators: r = 0.9833
Is the size of this correlation coefficient, 0.9833, significant?

Hypothesis test for a Pearson’s correlation coefficient
H0: There is no association between ice-cream sales and average monthly temperature.
H1: There is an association between them.
5.7 Critical Value:
At the 5% level with 10 degrees of freedom (n - 2 = 12 - 2), the critical value = 0.576
Test statistic: r = 0.983
5.8 Conclusion: The test statistic exceeds the critical value so we reject the Null Hypothesis
and conclude that there is a significant association between ice-cream sales and average
monthly temperature.
5.9 Regression equation

Since there is a significant relationship between the two variables, the next step is to define it as
a regression equation.
This can be produced directly from a calculator in L.R. mode (shift 7 for a and shift 8 for b).
The regression line is described, in general, as the straight line of ‘best fit’ with the equation:
y = a + bx
where x and y are the independent and dependent variables respectively, a is the intercept on the
y-axis, and b is the slope of the line.
The values of the coefficients for this data are:
a = 45.5 (shift 7) b = 5.45 (shift 8)
giving the regression equation:
y = 45.5 + 5.45x
To draw this line on the scatter diagram, any three points are plotted and joined up. These may
be the point (0, a), the centroid (x̄, ȳ), and/or any other points calculated from the
regression equation, as long as these are in the region of the observed data.
If x = 15, y = 45.5 + 5.45 × 15 = 127.2
For any value of x the corresponding value of y can be found directly from the calculator in L.R.
mode from key [ŷ].
Scatter diagram with Regression line added
Sales against Average Monthly Temperature
[Scatter diagram with regression line: Sales (vertical axis, 50 to 140) against Av.Temp. (horizontal axis, 5 to 15)]
5.10 Goodness of Fit
How well does this line fit the data?
Goodness of fit is measured by (r² × 100)%.
The correlation coefficient r was 0.983, so we have (0.983)² × 100 = 96.6% fit.
This indicates the percentage of the variation in Ice-cream Sales which is accounted for by
the variation in Average monthly temperature.
5.11 Correlation coefficient

The formula used to find Pearson’s Product Moment Correlation coefficient is:
r = Sxy / √(Sxx × Syy),   −1 ≤ r ≤ 1
where
Sxx = ∑x² − (∑x)(∑x)/n
Syy = ∑y² − (∑y)(∑y)/n
Sxy = ∑xy − (∑x)(∑y)/n
The following example gives ice-cream sales per month against mean monthly
temperature (°C).
Needed: ∑x, ∑y, ∑x², ∑y², ∑xy, n.
Month    Average temp (x)   Sales (y)   x²     y²       xy
Jan.     4                  73          16     5329     292
Feb.     4                  57          16     3249     228
March    7                  81          49     6561     567
April    8                  94          64     8836     752
May      12                 110         144    12100    1320
June     15                 124         225    15376    1860
July     16                 134         256    17956    2144
Aug.     17                 139         289    19321    2363
Sept.    14                 124         196    15376    1736
Oct.     11                 103         121    10609    1133
Nov.     7                  81          49     6561     567
Dec.     5                  80          25     6400     400
Sum Σ    120                1200        1450   127674   13362
Calculations
n = 12, ∑x = 120, ∑y = 1200
∑x² = 1450, ∑y² = 127674, ∑xy = 13362
Sxx = 1450 − (120 × 120)/12 = 250
Syy = 127674 − (1200 × 1200)/12 = 7674
Sxy = 13362 − (120 × 1200)/12 = 1362
r = 1362 / √(250 × 7674) = 0.9833
5.12 Regression equation
Further use is made of the values:
∑x, ∑y, ∑x², ∑y², ∑xy, n
(as produced previously.)
The gradient, b, is calculated from:
b = Sxy / Sxx
where Sxy = ∑xy − (∑x)(∑y)/n
and Sxx = ∑x² − (∑x)(∑x)/n
Since the regression line passes through the centroid (the point given by both means), its
equation can be used to find the value of a, the intercept on the y-axis:
a = ȳ − b x̄
Ice-cream Sales example
From previously (Sales now assumed to be dependent on temperature):
Sxy = 1362, Sxx = 250
b = 1362/250 = 5.448
a = 1200/12 − 5.448 × 120/12 = 45.52
The values of a and b are therefore 45.5 and 5.45 respectively giving the regression line:
y = 45.5 + 5.45x
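The hand calculation for the ice-cream data can be checked with a short script. This is a sketch that follows the sums-of-squares formulas Sxx, Syy and Sxy used in these notes; the variable names and the helper predict are chosen only for the illustration.

```python
from math import sqrt

# Ice-cream example: average monthly temperature (x) and sales (y), January to December
temps = [4, 4, 7, 8, 12, 15, 16, 17, 14, 11, 7, 5]
sales = [73, 57, 81, 94, 110, 124, 134, 139, 124, 103, 81, 80]
n = len(temps)

sum_x, sum_y = sum(temps), sum(sales)
s_xx = sum(x * x for x in temps) - sum_x * sum_x / n
s_yy = sum(y * y for y in sales) - sum_y * sum_y / n
s_xy = sum(x * y for x, y in zip(temps, sales)) - sum_x * sum_y / n

r = s_xy / sqrt(s_xx * s_yy)      # Pearson's correlation coefficient
b = s_xy / s_xx                   # gradient of the regression line
a = sum_y / n - b * sum_x / n     # intercept: a = mean of y - b * mean of x

def predict(x):
    """Estimated sales for a given temperature, from y = a + bx."""
    return a + b * x

print(round(r, 4), round(a, 2), round(b, 3), round(predict(15), 1))
print(round(r * r * 100, 1))      # goodness of fit, (r squared x 100)%
```

Using the unrounded value of r gives a goodness of fit of about 96.7%, slightly different from the 96.6% obtained from the rounded r = 0.983. Predictions should only be made for temperatures within the range of the observed data.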
RANK CORRELATION FORMULA
The data is ranked in order of size, using the numbers 1, 2, …, n. If two variables x and
y are ranked in such a manner, the coefficient of rank correlation is given by
r(Rank) = 1 – 6∑d² / [n(n² – 1)]
where
d = difference between ranks of corresponding values of x and y
n = number of pairs of values (x, y) in the data
This is Spearman’s formula for rank correlation.
e.g.
The following table shows how 10 students, arranged in alphabetical order, were
ranked according to their achievements in both the laboratory and lecture parts of the
biology course. Find the coefficient of rank correlation.
Laboratory   8   3   9   2   7   10   4   6   1   5
Lecture      9   5   10  1   8   7    3   4   2   6
Solution
The difference of ranks d in laboratory and lecture for each student is given in the
following table. Also given in the table are d² and ∑d².
Rank difference d   -1  -2  -1   1  -1   3   1   2  -1  -1
d²                   1   4   1   1   1   9   1   4   1   1
∑d² = 24
r(Rank) = 1 – 6∑d² / [n(n² – 1)] = 1 – 6(24) / [10(10² – 1)]
= 0.8545
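Spearman's formula is easy to check by computer when the ranks are already known and there are no ties. The sketch below uses the laboratory and lecture ranks of the 10 students; the function name spearman_rank_correlation is made up for this illustration.

```python
def spearman_rank_correlation(rank_x, rank_y):
    """Spearman's coefficient: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(rank_x) != len(rank_y):
        raise ValueError("both lists must contain the same number of ranks")
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

laboratory = [8, 3, 9, 2, 7, 10, 4, 6, 1, 5]
lecture = [9, 5, 10, 1, 8, 7, 3, 4, 2, 6]

r_rank = spearman_rank_correlation(laboratory, lecture)
print(round(r_rank, 4))   # 0.8545
```

A value close to +1 means the two rankings largely agree; a value close to -1 would mean they run in opposite orders.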
5.13 RANK CORRELATION
This measures the degree of correlation between two sets of observation of paired
values when only the relative order of magnitude is available for each set of data.
An example makes this clearer.
Physics   French   x’   y’   d   d²
3         1        5    7    2   4
2         4        6    4    2   4
1         2        7    6    1   1
4         3        4    5    1   1
6         5        2    3    1   1
5         7        3    1    2   4
7         6        1    2    1   1
∑d² = 16
rr = 1 – 6∑d² / [n(n² – 1)] = 1 – (6 × 16) / [7(7² – 1)] = 0.714
e.g. suppose the students sat two papers in an examination, Medical Terminology and
Anatomy and Physiology, and instead of the actual marks awarded on each paper they
were told only their ranking in order of merit. To establish whether the performances
on the two papers are correlated or not, the method of rank correlation is used.
rr = 1 – 6∑d² / [n(n² – 1)]
STUDENTS’ RANKING IN MEDICAL TERMINOLOGY AND ANATOMY AND PHYSIOLOGY
Student   Rank (Med. Term.)   Rank (Anat. & Physiology)   d    d²
1         1                   3                           -2   4
2         2                   2                           0    0
3         3                   1                           2    4
4         4                   6                           -2   4
5         5                   5                           0    0
6         6                   8                           -2   4
7         7                   4                           3    9
8         8                   10                          -2   4
9         9                   7                           2    4
10        10                  9                           1    1
∑d² = 34
The coefficient of rank correlation is given by Spearman’s formula:
rr = 1 – 6∑d² / [n(n² – 1)]
where d is the numerical difference between corresponding pairs of ranks and n is the
number of pairs. Applying this formula to the example:
rr = 1 – (6 × 34) / [10(10² – 1)]
= 1 – 204 / [10(100 – 1)]
= 1 – 204/990
= 1 – 0.206060
= 0.79394
This suggests that there is quite a strong relationship between performances in the two papers.
As with all the techniques described so far, correlation analysis has no value for its own
sake. It is useful solely because, if properly used, it permits theories and hypotheses to be
verified or refuted on the basis of empirical evidence. It cannot be used on its own as
proof of cause and effect. At all times it must be remembered that such specialized
tools may easily be misapplied and give misleading results.
Another exercise
Subject   Height (inches)   Weight (pounds)   R (height)   R (weight)   d     d²
1         70                165               4            3            1     1
2         66                130               9            10           1     1
3         72                180               1            1            0     0
4         68                145               7            7.5          0.5   0.25
5         71                160               2.5          4.5          2     4
6         64                150               10           6            4     16
7         68                140               7            9            2     4
8         71                168               2.5          2            0.5   0.25
9         69                145               5            7.5          2.5   6.25
10        68                160               7            4.5          2.5   6.25
∑d² = 39.0
rr = 1 – 6∑d² / [n(n² – 1)]
= 1 – (6 × 39) / [10(10² – 1)]
= 1 – 234/990
= 1 – 0.2363636
= 0.7636364
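When values are tied, each tied value is given the average of the ranks it would otherwise occupy (for example, the three heights of 68 inches share ranks 6, 7 and 8, so each gets rank 7). The sketch below assigns such average ranks, largest value first, and then applies Spearman's formula to the height and weight data. The function names are made up for the illustration. Note that with tied ranks the simple formula is only an approximation.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied values share the mean of their ranks."""
    order = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = order.index(v) + 1             # first rank position occupied by v
        count = order.count(v)                 # how many times v occurs
        rank_of[v] = first + (count - 1) / 2   # mean of the positions first .. first+count-1
    return [rank_of[v] for v in values]

def spearman(rank_x, rank_y):
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

heights = [70, 66, 72, 68, 71, 64, 68, 71, 69, 68]             # inches
weights = [165, 130, 180, 145, 160, 150, 140, 168, 145, 160]   # pounds

rank_h = average_ranks(heights)
rank_w = average_ranks(weights)
r_rank = spearman(rank_h, rank_w)
print(rank_h)
print(rank_w)
print(round(r_rank, 4))
```

The ranks produced agree with the R columns of the table above.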
Assignment
The following observations were made on seven children in a clinic
Child   Weight   Height
1 121 59.0
2 122 54.5
3 124 61.5
4 126 60.0
5 129 61.0
6 131 60.0
7 133 61.0
Compute:
(a) Rank correlation
(b) Regression value
(c) Coefficient of correlation
(a) Rank correlation
Child   w     h      Rw   Rh    d     d²
1       121   59.0   7    6     1     1
2       122   54.5   6    7     1     1
3       124   61.5   5    1     4     16
4       126   60.0   4    4.5   0.5   0.25
5       129   61.0   3    2.5   0.5   0.25
6       131   60.0   2    4.5   2.5   6.25
7       133   61.0   1    2.5   1.5   2.25
∑d² = 27
rr = 1 – 6∑d² / [n(n² – 1)]
= 1 – 6(27) / [7(49 – 1)]
= 1 – 162/336
= 1 – 0.4821
= 0.5179
= 0.52
(b) Regression value
The means are rounded to whole numbers to simplify the working:
Mean of weight w’ = 886/7 ≈ 127
Mean of height h’ = 417/7 ≈ 60
w     h      (w – w’)   (h – h’)   (w – w’)²   (h – h’)²   (w – w’)(h – h’)
121   59.0   -6         -1         36          1           6
122   54.5   -5         -5.5       25          30.25       27.5
124   61.5   -3         1.5        9           2.25        -4.5
126   60.0   -1         0          1           0           0
129   61.0   2          1          4           1           2
131   60.0   4          0          16          0           0
133   61.0   6          1          36          1           6
Sum   886    417                   127         35.5        37
For the regression line of weight on height, w = a + bh:
b = ∑(w – w’)(h – h’) / ∑(h – h’)²
= 37 / 35.5
= 1.042
The line passes through the point of the two means, so:
127 = a + (1.042 × 60)
a = 127 – 62.52
= 64.5
(c) Coefficient of correlation
Formula
r = ∑(x – x’)(y – y’) / √[∑(x – x’)² ∑(y – y’)²]
r = ∑(w – w’)(h – h’) / √[∑(w – w’)² ∑(h – h’)²]
r = 37 / √(127 × 35.5)
= 37 / 67.1
= 0.55
Coefficient of correlation
x y (x-x’) (y-y’) (x-x’)2 (y-y’)2 (x-x’) (y-y’)
121 59.0 -5 -0.1 25 0.01 0.5
122 54.5 -4 -4.6 16 21.16 18.4
124 61.5 -2 2.4 4 5.76 -4.8
123 54.5 -3 -4.6 9 21.16 13.8
124 60.0 -2 0.9 4 0.81 -1.8
126 64.0 0 4.9 0 24.01 0
127 61.0 1 1.9 1 3.61 1.9
129 54.5 3 -4.6 9 21.16 -13.8
131 60.5 5 1.4 25 1.96 7
133 61.0 7 1.9 49 3.61 13.3
Sum   1260   590.5   0   -0.5   142   103.25   34.5
Mean x = x̄ = ∑x / n
= 1260 / 10
= 126
Mean y = ȳ = ∑y / n
= 590.5 / 10
= 59.05 (rounded to 59.1 for the deviations in the table)
r = ∑(x – x̄)(y – ȳ) / √[∑(x – x̄)² ∑(y – ȳ)²]
r = 34.5 / √(142 × 103.25)
= 34.5 / (11.9 × 10.16)
= 34.5 / 120.9
= 0.285
= 0.29
or, using the standard deviations SDx and SDy:
r = ∑(x – x̄)(y – ȳ) / (n × SDx × SDy)
r = 34.5 / [10 × (3.8 × 3.21)]
= 34.5 / (10 × 12.2)
= 34.5 / 122
= 0.283
= 0.28 (the small difference from 0.29 comes from rounding the standard deviations)
Review questions
1. Define correlation
2. Define regression
3. How do you interpret a scatter graph?
4. How do you interpret rank correlation?
CHAPTER SIX
6.0 The Chi square test
By the end of the session, the participant will be able to:-

1. Understand hypothesis testing
2. Calculate and interpret chi square
3. Use chi square in measurement
The subject of statistics may be unfamiliar now, but you'll see more of it in future science
coursework. Statistics provides extremely important tools that investigators use to
interpret experimental results. Probably the most familiar statistic is the average, or
"mean." It tells us something useful about a collection of things: the mean grade on an
exam, the mean life span of a species, the mean rainfall for a locale, etc.
Chi is a letter of the Greek alphabet; the symbol is χ and it's pronounced like KYE, the
sound in "kite." The chi square test uses the statistic chi squared, written χ². The "test"
that uses this statistic helps an investigator determine whether an observed set of
results matches an expected outcome. In some types of research (genetics provides
many examples) there may be a theoretical basis for expecting a particular result- not a
guess, but a predicted outcome based on a sound theoretical foundation.
A familiar example will help to illustrate this. In a single toss of a coin (called a "trial"),
there are two possible outcomes: head and tail. Further, both outcomes are equally
probable. That is, neither one is more likely to occur than the other. We can express this
in several ways; for example, the probability of "head" is 1/2 (= 0.5), and the probability
of "tail" is the same. Then if we tossed a single coin 100 times, we would expect to see
50 heads and 50 tails. That distribution (50:50 or 1:1) is an expected result, and you see
the sound basis for making such a prediction about the outcome. Suppose that you do the
100 tosses and get 48 heads and 52 tails. That is an observed result, a real set of data.
However, 48 heads: 52 tails is not exactly what you expected. Do you suspect
something's wrong because of this difference? No? But why aren't you suspicious? Is
the observed 48 heads: 52 tails distribution close enough to the expected 50 heads: 50
tails (= 1:1) distribution for you to accept it as legitimate?
We need to consider for a moment what might cause the observed outcome to differ from
the expected outcome. You know what all the possible outcomes are (only two: head and
tail), and you know what the probability of each is. However, in any single trial (toss) you
can't say what the outcome will be. Why? Because of the element of chance, which is a
random factor. Saying that chance is a random factor just means that you can't control it.
But it's there every time you flip that coin. Chance is a factor that must always be
considered; it's often present but not recognized. Since it may affect experimental work, it
must be taken into account when results are interpreted.
What else might cause an observed outcome to differ from the expected? Suppose that at
your last physical exam, your doctor told you that your resting pulse rate was 60 (per
minute) and that that's good, that's normal for you. When you measure it yourself later
you find it's 58 at one moment, 63 ten minutes later, 57 ten minutes later. Why isn't it the
same every time, and why isn't it 60 every time? When measurements involving living
organisms are under study, there will always be the element of inherent variability. Your
resting pulse rate may vary a bit, but it's consistently about 60, and those slightly different
values are still normal.
In addition to these factors, there's the element of error. You've done enough lab work
already to realize that people introduce error into experimental work in performing
steps of procedures and in making measurements. Instruments, tools, implements
themselves may have built-in limitations that contribute to error.
Putting all of these factors together, it's not hard to see how an observed result may differ
a bit from an expected result. But these small departures from expectation are not
significant departures. That is, we don't regard the small differences observed as being
important.
What if the observed outcome in your coin toss experiment of 100 tosses (trials) had been
20 heads and 80 tails? Would you attribute to chance that much difference between the
expected and observed distributions? We expect chance to affect results, but not that
much. Such a large departure from expectation makes one suspect that the assumption
about equal probabilities of heads and tails is not valid. Suppose that the coin had been
tampered with, had been weighted, so as to favor the tail side coming up more often?
How do we know where to draw the line between an amount of difference that can be
explained by chance (not significant) and an amount that must be due to something other
than chance (significant)? That is what the chi square test is for, to tell us where to draw
that line.
In performing the χ² test, you have an expected distribution (like 50 heads : 50 tails) and
an observed distribution (like 40 heads : 60 tails, the results of doing the experiment).
From these data you calculate a χ² value and then compare that with a predetermined χ²
value that reflects how much difference can be accepted as insignificant, caused by
random chance. The predetermined values of χ² are found in a table of "critical values."
Such a table is shown at the end of this chapter.
6.1 Calculation of chi square
The formula for calculating χ² is:
$$\chi^2 = \sum \frac{(o - e)^2}{e}$$
where "o" is observed and "e" is expected.
The sigma symbol, Σ, means "sum of what follows."
For each category (type or group, such as "heads") of outcome that is possible, we would
have an expected value and an observed value (for the number of heads and the number
of tails, e.g.). For each one of those categories (outcomes) we would calculate the quantity
(o - e)²/e and then add them for all the categories, of which there were two in the coin toss
example (head category and tail category). It is convenient to organize the data in table
form, as shown below for two coin toss experiments.
                 Experiment 1          Experiment 2
                 (100 tosses)          (100 tosses)
                 heads     tails       heads     tails
o (observed)     47        53          61        39
e (expected)     50        50          50        50
o - e            -3        3           11        -11
(o - e)²         9         9           121       121
(o - e)²/e       0.18      0.18        2.42      2.42
χ²               0.36                  4.84
NOTE: do not take the square root of χ². The statistic is χ², not χ.
Note that in each experiment the total number observed must equal the total number
expected. In expt. 1, for example, 47 + 53 = 100 = 50 + 50.
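The table calculation above can be sketched in a few lines of Python. This is only an illustration; the function name chi_square is our own choice, not part of any library.

```python
import math

def chi_square(observed, expected):
    """Return the sum of (o - e)^2 / e over all categories."""
    if len(observed) != len(expected):
        raise ValueError("need one observed and one expected value per category")
    # The totals must agree, as noted in the text above.
    if not math.isclose(sum(observed), sum(expected)):
        raise ValueError("total observed must equal total expected")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Experiment 1: 47 heads, 53 tails. Experiment 2: 61 heads, 39 tails.
chi_sq_expt1 = chi_square([47, 53], [50, 50])
chi_sq_expt2 = chi_square([61, 39], [50, 50])
print(round(chi_sq_expt1, 2))  # 0.36
print(round(chi_sq_expt2, 2))  # 4.84
```

The totals check is there to catch the setup mistake of entering expected values that do not add up to the number of trials.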
6.2 Selection of critical value of chi square
Having calculated a χ² value for the data in experiment #1, we now need to evaluate that
χ² value. To do so we must compare our calculated χ² with the appropriate critical value
of χ² from the table at the end of this chapter. [All of the critical values in the table
have been predetermined by statisticians.] To select a value from the table, we need to
know 2 things:
1. The number of degrees of freedom. That is one less than the number of categories
(groups) we have. For our coin toss experiment that is 2 groups - 1 = 1. So our critical
value of χ² will be in the first row of the table.
2. The probability value, which reflects the risk of error we are willing to accept in our
interpretation. The column headings 0.05 and 0.01 are these probabilities. Using 0.05
means that, if chance alone were operating, a χ² value larger than the critical value would
turn up only 5% of the time. So when we declare a difference significant, we accept a 5%
risk of being wrong; this is commonly described as working at the 95% confidence level.
That shows that we can't be certain, but 95% is very good. Using 0.01 corresponds to 99%
confidence, only a 1% risk of wrongly declaring a difference significant. We will now
agree that, unless told otherwise, we will always use the 0.05 probability column (95%
confidence level).
For 1 degree of freedom, in our coin toss experiment, the table χ² value is 3.84. We
compare the calculated χ² (0.36) to that.
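Selecting the critical value can be sketched as a simple lookup. The dictionary below copies only the first five rows of the critical value table given in these notes; the names CRITICAL_VALUES and critical_value are hypothetical, chosen for this illustration.

```python
# {degrees of freedom: {probability: critical chi-square value}}
# Values copied from the critical value table in these notes (rows 1 to 5).
CRITICAL_VALUES = {
    1: {0.05: 3.84, 0.01: 6.63},
    2: {0.05: 5.99, 0.01: 9.21},
    3: {0.05: 7.81, 0.01: 11.34},
    4: {0.05: 9.49, 0.01: 13.28},
    5: {0.05: 11.07, 0.01: 15.09},
}

def critical_value(n_categories, probability=0.05):
    """Look up the critical value for a test with the given number of categories."""
    degrees_of_freedom = n_categories - 1  # one less than the number of categories
    return CRITICAL_VALUES[degrees_of_freedom][probability]

print(critical_value(2))        # coin toss: 2 categories, 1 d.f. -> 3.84
print(critical_value(2, 0.01))  # same test at the 0.01 probability -> 6.63
```

The default of 0.05 mirrors the agreement above to use the 0.05 column unless told otherwise.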
6.3 The interpretation
In every χ²-test the calculated χ² value will either be (i) less than or equal to the critical χ²
value OR (ii) greater than the critical χ² value.
• If calculated χ² ≤ critical χ², then we conclude that there is no statistically
significant difference between the two distributions. That is, the observed results are
not significantly different from the expected results, and the numerical difference
between observed and expected can be attributed to chance.
• If calculated χ² > critical χ², then we conclude that there is a statistically significant
difference between the two distributions. That is, the observed results are significantly
different from the expected results, and the numerical difference between observed and
expected cannot be attributed to chance alone. That suggests that the difference found is
due to some other factor. This test won't identify that other factor, only that there is some
factor other than chance responsible for the difference between the two distributions.
For our expt. #1, 0.36 < 3.84. Therefore, at the 0.05 probability level, there is no
significant difference between the 47:53 observed distribution and the 50:50 expected
distribution. That small difference can be attributed to random chance.
For expt. #2 shown in the table above, the calculated χ² = 4.84, and 4.84 > 3.84. Therefore the
61:39 observed distribution is significantly different from the 50:50 expected
distribution. That much difference cannot be attributed to chance alone. We may be 95%
confident that something else, some other factor, caused the difference. The χ²-test won't
identify that other factor, only that there is some factor other than chance responsible for
the difference between the two distributions.
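The decision rule can be written as a short function. This is a sketch only; the function name interpret and the wording of its return values are our own.

```python
def interpret(calculated, critical):
    """Apply the decision rule: compare calculated chi-square with the critical value."""
    if calculated <= critical:
        return "not significant: the difference can be attributed to chance"
    return "significant: the difference cannot be attributed to chance alone"

CRITICAL_1_DF = 3.84  # 1 degree of freedom, 0.05 probability column

print(interpret(0.36, CRITICAL_1_DF))  # experiment 1 (47:53)
print(interpret(4.84, CRITICAL_1_DF))  # experiment 2 (61:39)
```

Note that a calculated value exactly equal to the critical value falls on the "not significant" side, matching case (i) above.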
6.4 A common mistake in performing the chi square test
Suppose in a coin toss experiment you got 143 heads and 175 tails; see the table setup
below. That's 318 tosses (trials) total. In setting up the table to calculate χ², note that the
expected 1:1 distribution here means that you expect 159 heads: 159 tails, not 50 heads
and 50 tails as previously when the total was only 100. The point here: the sum of
observed values for all groups must equal the sum of expected values for all groups. In
this example 143 + 175 = 318 and 159 + 159 = 318.
                 heads     tails
o (observed)     143       175
e (expected)     159       159
o - e            -16       16
(o - e)²         256       256
(o - e)²/e       1.61      1.61
χ²               3.22
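One way to avoid this mistake is to compute the expected values from the expected ratio and the actual total, rather than typing them in. A minimal sketch, with a hypothetical helper named expected_counts:

```python
def expected_counts(ratio, total):
    """Scale an expected ratio (such as 1:1) to the actual number of trials."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]

observed = [143, 175]                               # heads, tails
expected = expected_counts([1, 1], sum(observed))   # 318 tosses at 1:1
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(expected)          # [159.0, 159.0]
print(round(chi_sq, 2))  # 3.22
```

Because the expected values are derived from sum(observed), the two totals agree automatically.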
Another example, this time from genetics: Here's the situation.

Suppose that one locus controls height of a type of plant. Dominant allele "G" produces
tall plants; recessive allele "g" produces short plants. A second, unlinked locus controls
flower color. Dominant "R" produces red flowers, and recessive "r" produces white
flowers. Then, we cross two heterozygous tall, heterozygous red-flower plants, and
collect 80 seeds (progeny) from the mating. We plant the seeds and look to see the height
and flower color features of these offspring. The results: 40 tall red: 21 tall white: 9 short
red: 10 short white. Does this observed distribution fit the expected distribution? That is,
are these results significantly different from the expected outcome?
                 tall red   tall white   short red   short white
o (observed)     40         21           9           10
e (expected)     45         15           15          5
o - e            -5         6            -6          5
(o - e)²         25         36           36          25
(o - e)²/e       0.56       2.40         2.40        5.00
χ²               10.36
The given information says that both parents have the genotype G//g R//r. Then the
expected distribution of progeny phenotypes would be 9/16 tall red: 3/16 tall white: 3/16
short red: 1/16 short white. [This is a cross like the first hybrid cross we did in lecture.]
The total number of observed progeny in this cross is 80. So the expected values are
based on that total: 9/16 of 80 = 45 tall red expected, and so forth. (Refer to the setup
table above.) The total number expected must equal the total number observed.
Entering the fractions 9/16, 3/16, etc. in the setup table for expected values is incorrect
and would give a wildly incorrect χ² value. And that, in turn, would probably lead us to
the wrong conclusion in interpreting the results.
The observed distribution (which is given in the problem) is obviously "different" from
the expected. The numbers aren't the same, are they? But that difference may just be due
to chance, as discussed earlier here. The χ²-test will help us decide whether the
difference is significant. Here the calculated χ² value is 10.36. There are 3 degrees of
freedom here (4 categories - 1). So, the critical χ² value for 0.05 probability (see the table
at the end of this chapter) is 7.81. Since our calculated value exceeds the critical value, we
conclude that there is a significant difference between the observed distribution and the
expected distribution. The difference found here cannot be attributed to chance alone.
We may be 95% confident of this conclusion. What does this mean? Perhaps the
inheritance of these traits is not as simple as 2 unlinked loci with dominant and recessive
alleles at each. Or maybe there is some environmental factor that influenced the
outcome. That is for additional investigation to determine. The χ²-test alerted us to the
fact that our results differed too much from the expectation to be explained by chance.
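The whole genetics example can be reproduced in a short script. This is a sketch under the assumptions stated in the text (a 9:3:3:1 expected ratio, 80 progeny, critical value 7.81 for 3 degrees of freedom at 0.05); the variable names are our own.

```python
phenotypes = ["tall red", "tall white", "short red", "short white"]
observed = [40, 21, 9, 10]
ratio = [9, 3, 3, 1]          # expected 9:3:3:1 phenotype ratio

total = sum(observed)         # 80 progeny
# Expected values must be counts (9/16 of 80, and so on), not the fractions themselves.
expected = [total * r / sum(ratio) for r in ratio]

chi_sq = 0.0
for name, o, e in zip(phenotypes, observed, expected):
    contribution = (o - e) ** 2 / e
    chi_sq += contribution
    print(f"{name:12s} o={o:3d} e={e:5.1f} (o-e)^2/e={contribution:.2f}")

degrees_of_freedom = len(observed) - 1   # 4 categories - 1 = 3
CRITICAL_3_DF = 7.81                     # from the table, 0.05 column

print(f"chi-square = {chi_sq:.2f} with {degrees_of_freedom} d.f.")
print("significant" if chi_sq > CRITICAL_3_DF else "not significant")
```

Running it gives a χ² of 10.36, which exceeds 7.81, so the script reports a significant difference, in agreement with the conclusion above.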
6.5 Hypothesis testing
Statisticians formally describe what we've just done in terms of testing a hypothesis. This
process begins with stating the "null hypothesis." The null hypothesis says that the
difference found between observed distribution and expected distribution is not
significant, i.e. that the difference is just due to random chance. Then we use the χ²-test to
test the validity of that null hypothesis.
If calculated χ² ≤ critical χ², then we accept the null hypothesis (strictly speaking, we fail
to reject it). That means that the two distributions are not significantly different, that the
difference we see can be attributed to chance, not some other factor. On the other hand, if
calculated χ² > critical χ², then we reject the null hypothesis. That means that the two
distributions are significantly different, that the difference we see is not due to chance alone.
Note this well: In performing the chi square test in this course, it is not sufficient in
your interpretation to say "accept null hypothesis" or "reject null hypothesis." You will
be expected to state fully whether the distributions being compared are significantly
different or not and whether the difference is due to chance alone or other factors.
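As a side illustration that goes slightly beyond the table method used in these notes: for 1 degree of freedom, the probability of getting a χ² value at least as large as the one calculated, if the null hypothesis is true, can be computed directly with the complementary error function from Python's standard math module. The function name p_value_1_df is our own, and the formula applies to 1 degree of freedom only.

```python
import math

def p_value_1_df(chi_sq):
    """Probability of a chi-square value this large or larger (1 d.f. only),
    assuming chance alone is operating (the null hypothesis)."""
    return math.erfc(math.sqrt(chi_sq / 2))

print(round(p_value_1_df(0.36), 3))  # experiment 1: 0.549, well above 0.05
print(round(p_value_1_df(4.84), 3))  # experiment 2: 0.028, below 0.05
print(round(p_value_1_df(3.84), 3))  # the critical value itself: 0.05
```

A probability below 0.05 corresponds to a calculated χ² above the critical value of 3.84, so this leads to the same accept or reject decision as the table.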
Table of chi square critical values

                         probability (P)
degrees of freedom*      0.05       0.01
1                        3.84       6.63
2                        5.99       9.21
3                        7.81       11.34
4                        9.49       13.28
5                        11.07      15.09
6                        12.59      16.81
7                        14.07      18.48
8                        15.51      20.09
9                        16.92      21.67
10                       18.31      23.21
11                       19.68      24.73
12                       21.03      26.22
13                       22.36      27.69
14                       23.68      29.14
15                       25.00      30.58
16                       26.30      32.00
17                       27.59      33.41
18                       28.87      34.81
19                       30.14      36.19
20                       31.41      37.57
* d.f. = one less than the number of categories (groups, classes) in the test
Review questions
1. What is hypothesis testing?
2. How do you calculate and interpret chi square?
Instructions
Answer ALL questions in Section A and any 2 in Section B
SECTION A
Q1. An outbreak of Pediculosis Capitis is being investigated in a girls’ school
containing 291 pupils. Of 130 children who live in a nearby housing estate, 18 were
infected and of 161 who live elsewhere, 37 were infected.
a) Calculate the χ² value of the difference.
b) What is the significance?
c) What is your conclusion on the χ² value?
(20 Marks)
Q2. Describe the following terms used in biostatistics
a) Chi square measurement and results.
b) Statistically significant at P < 0.05 level
c) Degree of freedom
d) Expected values
e) Normal Distribution
(20 Marks)
Q3. a) Make a tree diagram to illustrate the sample space for the event
“TOSSING A COIN 4 TIMES”
b) What is the complement of the event ‘Mr. Quinn’s car is red’? If the probability
of Mr. Quinn having a red car is 44.3%, what is the probability of the
complement?
c) A bag contains 5 red marbles and 9 blue ones. If I draw two marbles from
the bag without replacement, what is the probability that they are both blue?
d) Compute: 8C5 and 17P4
(20 Marks)
Q5. Using diagrams, illustrate the following terms in correlation and regression
a) Positive or direct relationship
b) Negative or inverse relationship
c) Zero or no relationship
d) Line of best fit
(20 Marks)
Q6. The following information refers to patients who attended Kiambu District Hospital over a
period of 2 months in 2001.
135, 132, 141, 131, 165, 142, 158, 171, 182, 164, 147, 161, 163, 158, 172,
140, 141, 150, 127, 166, 166, 172, 180, 158, 159, 149, 154, 161, 173, 182,
169, 159, 155, 148, 150, 157, 156, 141, 163, 168, 156, 169, 176, 175, 161,
176, 169, 154, 152, 144, 143, 159, 160, 135, 161, 152, 157, 185, 169, 170
From the information provided:-
a) Construct a grouped frequency table with a class interval of 10
Calculate to one decimal place:-
b) The arithmetic mean
c) The median
d) The mode by use of the formula
e) The mode by use of graph.
(20 Marks)
Q7. A company doctor is investigating the possible effect of stress upon the health of the
company’s management employees. She suspects that employees under stress will suffer from
high systolic blood pressure. She takes a random sample of ten employees, aged between 35
and 55 years, and records their age and blood pressure:
Management employee   Age (x)   Systolic blood pressure (y)
A                     37.2      133
B                     39.8      143
C                     42.1      135
D                     44.6      151
E                     47.2      143
F                     48.9      158
G                     50.0      163
H                     51.3      146
I                     52.8      168
J                     54.4      160
a) Draw a scatter graph showing the above data
b) What do you conclude from the graph?
(20 Marks)
Q8. The following table shows the marks of eight pupils in Biology and Chemistry.
Biology     65   65   70   75   75   80   85   85
Chemistry   50   55   58   55   65   58   61   65
a) Rank the results
b) Find the value of Spearman’s rank correlation coefficient.
c) Comment on your result.
Instructions
Answer ALL questions in Section A and any 2 in Section B
SECTION A
Q1. The following data represent the number of children for 10 physicians on a particular
hospital staff; 5, 4, 2, 3, 6, 9, 8, 4, 7, 4. Using the above data, find the following
descriptive measures:-
(i) Arithmetic mean
(ii) Median
(iii) Mode
(iv) Quartile Deviation
(v) Mean Deviation
(vi) Standard Deviation
(vii) Coefficient of variation
(30 Marks)
Q2. The following information refers to patients who attended Mbagathi District Hospital
over a period of 2 months in 2001.
135, 132, 141, 131, 165, 142, 158, 171, 182, 164, 147, 161, 163, 158, 172,
140, 141, 150, 127, 166, 166, 172, 180, 158, 159, 149, 154, 161, 173, 182,
169, 159, 155, 148, 150, 157, 156, 141, 163, 168, 156, 169, 176, 175, 161,
176, 169, 154, 152, 144, 143, 159, 160, 135, 161, 152, 157, 185, 169, 170
From the information provided:-
a) Construct a grouped frequency table with a class interval of 10
b) Calculate to one decimal place:-
i. The arithmetic mean
ii. The median
iii. The mode by use of the formula
iv. The mode by use of graph.
v. The standard deviation
(30 Marks)
Q3. In-patient diseases 2001, Kenyatta National Hospital
           April   May   June   July   Aug   Sept   Oct   Nov   Dec
Malaria    96      78    72     65     77    71     67    73    53
T.B        13      10    9      10     10    9      8     7     6
Make a graphical comparison, by means of:-
a) A component bar chart of the volume of diseases given in the above table from
April to August.
b) A multiple component bar chart from September to December.
c) Present the entire data for 2001 on a pie chart.
d) Use the chart to write a brief interpretation for the indicated period.
(20 Marks)
Q4. a) Describe how health data can be used in evidence-based decision making in
day-to-day hospital activities.
(20 Marks)
PART ONE
SECTION A
State whether the following statements are True or False (10mks)
a) Health statistics is not a scientific approach to health information that is presented in
numerical form
b) Sampling techniques can be classified into non-random and random techniques
c) Stratified sampling is a method in which a representative sample is selected by
following specific stages
d) Length of stay and turnover interval have some relationship
e) Occupied bed days and in-patient bed days are not the same
f) Compound, complex and simple tables are the same
g) Literacy is not a major factor in the interview and questionnaire methods of data collection
h) Statistics is a mathematical method of collecting, collating, analysing and presenting
data for decision making and action
i) One disadvantage of using observation is that first-hand information is collected
j) The entire population is not a source of data
SECTION B
Stating or Describing (10mks)
i) State the clear difference between a questionnaire and an interview schedule
ii) State TWO major disadvantages of the narrative method of data presentation
iii) Describe the major difference between open and closed ended questions in a questionnaire
iv) Differentiate between a sample and sample size
v) Distinguish between data editing and data analysis
SECTION C
Fill in the following blank spaces (8mks)
1. When inspecting data after collection for correction of obvious mistakes and omissions;
the process is referred to as data……………..
2. Arrangement of data according to geographic location…………………
3. Presentation of data in either written or oral form………………
4. …………………. is any attempt to explain the meaning and features of the data collected
and to draw conclusions from it
5. ……………….. is the total number of patients remaining in the ward each day added
together
6. An actual data collection process being done in small scale is known as…………………
7. A tool commonly used in hospital bed state data collection………………
8. The systematic arrangement of data in columns and rows is called……………………
PART TWO
a) Describe 10 principles of table construction (20mks)
b) Define the following terms as used in statistics (12mks)
i. sample
ii. Population
iii. Health problem
iv. Sampling error
v. Pilot survey
vi. Literature review
c) In the year 2004, Etihad hospital had 800 beds permanently allocated for in-patient use.
During the period the hospital percentage occupancy was 110%, 1,500 patients were
discharged home alive, 80 patients went to parole and 20 deaths occurred. Use the
information to calculate (20mks)
i. Occupied bed days
ii. Average length of stay
iii. Turnover per bed
iv. Excessive/vacant bed days
v. Turnover interval
d) Homa Bay County Hospital had three specialties of admission: males, females and
pediatrics. From the inpatient admission register, a total of 9,312 patients were admitted in
the year 2012, of which 5,584 were pediatrics and 1,934 were males. 1,989 females and 2,056
males were admitted in the following year. In the year 2014, a total of 9,209 patients
were admitted, of which 2,013 were females. Over the whole 3 years, 26,796 were admitted,
of which 14,934 were pediatrics.
i) What form of data presentation is used above (2mks)
ii) Present the above data in tabular form obeying all the principles of tabular
presentation (18mks)