
8/10/2015

BIOSTATISTICS
Niranjan Kanaki
K. B. Institute of Pharmaceutical Education and Research, Gandhinagar

“Statistics is the science which deals with collection, classification and tabulation of numerical facts as the basis for explanation, description and comparison of phenomenon”.

------ Lovitt

© 2006

BIOSTATISTICS

• Statistics arising out of biological sciences, such as medicine, public health, plant sciences, agriculture, etc.
• The methods used in the fields of medicine, biology and public health for planning and conducting investigations and for analyzing the data that arise from them.

NEED FOR BIOSTATISTICS

• Variation is an inherent characteristic of experimental observations.

Reasons for variation:
• The instrument used for the analysis
• The analyst performing the assay
• The particular sample chosen
• Unidentified, uncontrollable background error – “Noise”

FUNCTIONS OF STATISTICS

Statistics is a field of study concerned with
1- collection, organization, summarization and analysis of data.
   - DESCRIPTIVE STATISTICS
2- drawing of inferences about a body of data when only a part of the data is observed.
   - INFERENTIAL STATISTICS

MAIN BRANCHES OF BIOSTATISTICS

Descriptive Biostatistics
• Methods of producing quantitative summaries of information in biological sciences
• Tabulation and graphical presentations
• Measures of central tendency
• Measures of dispersion

BRANCHES OF BIOSTATISTICS

Inferential Biostatistics
Methods of making generalizations about a larger group based on information about a subset (sample) of that group in biological sciences
• Estimation
• Testing of hypothesis

POPULATIONS AND SAMPLES

• Samples are usually a relatively small number of observations taken from a relatively large population.

POPULATION
A population is the group from which a sample is drawn
Example
• All the students in a school
• All the patients in a hospital
• In research, it is not practical to include all members of a population
• Thus, a sample (a subset of a population) is taken

SAMPLES
A sample is a subset which should be representative of a population
• A sample is more likely to be representative if it is selected randomly
• In some cases, the sample may be stratified and then randomized within the strata

EXAMPLE

We want a sample that will reflect a population’s gender and age:
1. Stratify the data by gender
2. Within each stratum, further stratify by age
3. Select randomly within each gender/age stratum so that the number selected will be proportional to that of the population

EXAMPLES OF POPULATION AND SAMPLE

POPULATION: SAMPLE
• Tablet batch: 20 tablets taken for content uniformity
• Normal males between 18-65 years: 24 subjects selected for a phase I clinical study
• Sprague-Dawley rats: 100 rats selected to study toxicity of a new drug
• Analysts working for company X: 3 analysts selected to test a new assay method
• Persons with diastolic BP between 105 and 120 mmHg: 120 such patients selected for clinical study to compare two antihypertensive drugs
• Serum cholesterol levels of one patient: Blood samples drawn once a week for 3 months from the patient

PARAMETER AND STATISTIC

• Parameter: Summary value or characteristic of population or universe
• Statistic: Summary value or characteristic of sample used for making inferences about parameter

DATA

• The raw material of Statistics is data.
• We may define data as figures. Figures result from the process of counting or from taking a measurement.
For example:
- When a hospital administrator counts the number of patients (counting).
- When a nurse weighs a patient (measurement)

SOURCES OF DATA

1- Routinely kept records
For example:
- Hospital medical records contain immense amounts of information on patients.
- Records of students’ attendance, marks, etc. in a school/college

2- External sources
The data needed to answer a question may already exist in the form of published reports, commercially available data banks, or the research literature, i.e. someone else has already asked the same question.
For example:
- studying patients’ data from various hospitals and correlating it with a disease condition

3- Surveys
The source may be a survey, if the data needed can be obtained by asking people certain questions.
For example:
Collecting information about patients’ lifestyle, dietary habits, etc.

4- Experiments
Frequently the data needed to answer a question are available only as the results of an experiment.

A VARIABLE (DATA)
It is a characteristic that takes on different values in different persons, places, or things.
For example:
- heart rate
- the heights of adult males
- the serum cholesterol levels of a person
- the weight of tablets from a batch

TYPES OF VARIABLES
• Independent variables
  - Precede dependent variables in time
  - Are often manipulated by the researcher
  - The treatment or intervention that is used in a study
• Dependent variables
  - What is measured as an outcome in a study
  - Values depend on the independent variable

TYPES OF VARIABLES

• Qualitative variables: Nominal variables, Ordinal variables
• Quantitative variables: Discrete variables, Continuous variables

QUALITATIVE DATA OR VARIABLE (CATEGORICAL DATA)
• A variable or characteristic which cannot be measured in quantitative form
• It can only be identified by name or categories
• For example:
  - place of birth
  - religion
  - stages of breast cancer (I, II, III, or IV)

QUALITATIVE NOMINAL DATA

• Data that represent categories or names.
• There is no implied order to the categories of nominal data.
• In these types of data, individuals are simply placed in the proper category or group, and the number in each category is counted.
• Each item must fit into exactly one category.

NOMINAL SCALE DATA EXAMPLE

• Survival status of propranolol-treated and control patients with myocardial infarction

Status 28 days after hospital admission   Propranolol-treated patients   Control patients
Dead                                      7                              17
Alive                                     38                             29
Total                                     45                             46
Survival rate                             84%                            63%

SOME OTHER EXAMPLES OF NOMINAL DATA

Example:
- Sex (M, F)
- Exam result (P, F)
- Blood group (A, B, O or AB)
- Color of eyes (blue, green, brown, black)
- Anemias (microcytic, macrocytic)
- Religion (Christianity, Islam, Hinduism, etc.)

QUALITATIVE ORDINAL DATA

• It is similar to nominal data because the measurements involve categories; however, the categories are ordered by rank.
• Pain level (Mild, Moderate, Severe)
• Tumors (Stage 0, ……, IV)
• Arthritis (Class 1, ……, 4)
• Military Rank (Lt., Capt., Maj., Col., General)
• Response to treatment (poor, fair, good)
• Severity of disease (mild, moderate, severe)
• Income status (low, middle, high)

QUANTITATIVE DATA OR VARIABLE (NUMERICAL DATA)

A quantitative variable is one that can be measured and expressed numerically. It can be of two types (discrete or continuous).

QUANTITATIVE DISCRETE VARIABLES
• Discrete variables have a set of possible values that is finite or countably infinite.
• Often whole numbers (integers)
• Characterized by gaps or interruptions in the values that they can assume.
• Examples:
  - The number of daily admissions to a general hospital
  - Attendance in a class

QUANTITATIVE CONTINUOUS VARIABLES

• Can assume any value within a specified relevant interval of values assumed by the variable.
Examples:
- Height
- Weight
- Duration of seizure
No matter how close together the observed heights of two people, we can find another person whose height falls somewhere in between.

Discrete data -- Gaps between possible values (e.g. Number of Children)
Continuous data -- Theoretically, no gaps between possible values (e.g. Hb)

HOW TO DESCRIBE A CATEGORICAL VARIABLE (QUALITATIVE DATA)?

STATISTICS
• Frequency distribution
• Relative frequency distribution
• Cumulative frequency distribution

FIGURES/CHARTS
• Bar
• Pie

FREQUENCY DISTRIBUTION TABLE
A table that organizes data values into classes or intervals along with the number of values that fall in each class (frequency, f).

FREQUENCY DISTRIBUTION TABLE

Distribution of Religion in a school

Religion    Frequency   % Frequency
Hindu       478         79.7
Muslim      65          10.8
Christian   51          8.5
Others      6           1.0
Total       600         100.0

PIE CHART
[Figure: pie chart, Distribution of Religion in a school]

BAR CHART
[Figure: bar chart, Distribution of Religion in a school]

CUMULATIVE FREQUENCY DISTRIBUTION

Patients undergoing treatment in a cancer hospital

Stage of cancer   No. of patients   Cumulative frequency
I                 52                52
II                24                (52+24=) 76
III               69                (52+24+69=) 145
IV                20                (52+24+69+20=) 165
Total             165
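The frequency, percentage and cumulative-frequency columns for a categorical variable can be derived mechanically from the counts. The following is a minimal Python sketch using the stage-of-cancer counts from the cumulative frequency table; the function name `frequency_table` is illustrative and not part of the original slides.

```python
from itertools import accumulate

# Stage-of-cancer counts from the cumulative frequency table in the slides.
stage_counts = {"I": 52, "II": 24, "III": 69, "IV": 20}

def frequency_table(counts):
    """Return (category, frequency, percent, cumulative frequency) rows."""
    total = sum(counts.values())
    running = accumulate(counts.values())  # running totals: 52, 76, 145, 165
    return [
        (category, f, round(100 * f / total, 1), cf)
        for (category, f), cf in zip(counts.items(), running)
    ]

table = frequency_table(stage_counts)
for row in table:
    print(*row, sep="\t")
```

The last cumulative value always equals the total number of observations, which is a quick check on the table.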

HOW TO DESCRIBE A NUMERICAL VARIABLE (QUANTITATIVE DATA)?

STATISTICS
• Frequency Distribution
• Central tendency
• Dispersion

FIGURES/CHARTS
• Histogram
• Frequency polygon

FREQUENCY DISTRIBUTIONS
1. Ungrouped Frequency Distribution: for data sets with few different values. Each value is a class of its own.
2. Grouped Frequency Distribution: for data sets with many different values, which are grouped together into classes.

GROUPED AND UNGROUPED FREQUENCY DISTRIBUTIONS

Ungrouped                     Grouped
Age of child   Frequency, f   Age of voters   Frequency, f
1              25             18-30           202
2              38             31-42           508
3              217            43-54           620
4              1462           55-66           413
5              932            67-78           158
6              15             79-90           32

UNGROUPED FREQUENCY DISTRIBUTIONS

Number of Peas in a Pea Pod
Sample Size: 50

5 5 4 6 4
3 7 6 3 5
6 5 4 5 5
6 2 3 5 5
5 5 7 4 3
4 5 4 5 6
5 1 6 2 6
6 6 6 6 4
4 5 4 5 3
5 5 7 6 5

Peas per pod   Freq, f
1              1
2              2
3              5
4              9
5              18
6              12
7              3

GRAPHS OF FREQUENCY DISTRIBUTIONS: FREQUENCY HISTOGRAMS

Frequency Histogram
• A bar graph that represents the frequency distribution.
• The horizontal scale is quantitative and measures the data values.
• The vertical scale measures the frequencies of the classes.
• Consecutive bars must touch.

FREQUENCY HISTOGRAM

Peas per pod   Freq, f
1              1
2              2
3              5
4              9
5              18
6              12
7              3

[Figure: frequency histogram, Number of Peas in a Pod; horizontal axis: Number of Peas (data values, 1 to 7); vertical axis: Frequency, f (0 to 20)]

RELATIVE FREQUENCY DISTRIBUTIONS AND RELATIVE FREQUENCY HISTOGRAMS

Relative Frequency Distribution
Shows the portion or percentage of the data that falls in a particular class.

$$\text{relative frequency} = \frac{\text{class frequency}}{\text{sample size}} = \frac{f}{n}$$

Relative Frequency Histogram
Has the same shape and horizontal scale as a histogram, but the vertical scale is marked with relative frequencies.

Peas per pod   Freq, f   Rel. Freq. (%)
1              1         2
2              2         4
3              5         10
4              9         18
5              18        36
6              12        24
7              3         6
Total          50

[Figure: relative frequency histogram, No. of peas in a pod; horizontal axis: No. of peas per pod (1 to 7); vertical axis: Relative frequency (0 to 40 %)]
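As a quick check of the relative frequency column, the definition f / n can be applied to the peas-per-pod frequencies. This is a small Python sketch; the function name `relative_frequencies` is an illustrative choice, and the result is expressed in percent to match the slide.

```python
# Ungrouped frequency distribution of peas per pod (from the slides).
peas_per_pod = {1: 1, 2: 2, 3: 5, 4: 9, 5: 18, 6: 12, 7: 3}

def relative_frequencies(freq):
    """Relative frequency of each class, in percent: 100 * f / n."""
    n = sum(freq.values())  # sample size
    return {value: 100 * f / n for value, f in freq.items()}

rel = relative_frequencies(peas_per_pod)
for value, percent in rel.items():
    print(value, percent)
```

The relative frequencies of all classes add up to 100%.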

GROUPED FREQUENCY DISTRIBUTIONS

Grouped Frequency Distribution
• For data sets with many different values.
• Groups data into 5-20 classes of equal width.

Exam Scores   Freq, f
30-39         1
40-49         0
50-59         4
60-69         9
70-79         13
80-89         10
90-99         3

GROUPED FREQUENCY DISTRIBUTION TERMS
o Class limits: the smallest value of a class is the lower class limit (LCL) and the highest value of a class is its upper class limit (UCL).
o Class width: the difference between two consecutive lower class limits.

LABELING GROUPED FREQUENCY DISTRIBUTIONS

o Class midpoints: the value halfway between LCL and UCL

$$\text{class midpoint} = \frac{(\text{Lower class limit}) + (\text{Upper class limit})}{2}$$

o Class boundaries: the value halfway between a UCL and the next LCL

$$\text{class boundary} = \frac{(\text{Upper class limit}) + (\text{next Lower class limit})}{2}$$

CONSTRUCTING A GROUPED FREQUENCY DISTRIBUTION

1. Determine the range of the data.
   • Range = highest data value – lowest data value
   • May round up to the next convenient number
2. Decide on the number of classes.
   • Usually between 5 and 20
3. Find the class width.

$$\text{class width} = \frac{\text{range}}{\text{number of classes}}$$

   Round up to the next convenient number.

CONSTRUCTING A GROUPED FREQUENCY DISTRIBUTION

4. Find the class limits.
   • Choose the first LCL: use the minimum data entry or something smaller that is convenient.
   • Find the remaining LCLs: add the class width to the lower limit of the preceding class.
   • Find the UCLs: Remember that classes must cover all data values and cannot overlap.
5. Find the frequencies for each class.

Larson/Farber 4th ed.

SERUM CHOLESTEROL CHANGES (MG%) FOR 156 PATIENTS AFTER ADMINISTRATION OF AN ANTI-HYPERCHOLESTEROLEMIC DRUG
[Table: individual serum cholesterol changes for the 156 patients]
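The five construction steps (range, number of classes, class width, class limits, frequencies) can be sketched in Python as follows. This is only an illustration for integer-valued data: the function name `grouped_frequency` and the list `scores` are hypothetical and not taken from the slides, and "round up to the next convenient number" is interpreted here as "the next whole number above range / number of classes".

```python
import math

def grouped_frequency(data, num_classes):
    """Build (lower limit, upper limit, frequency) rows for integer data."""
    lowest, highest = min(data), max(data)
    data_range = highest - lowest                      # step 1: range
    # step 3: class width, rounded up so the classes cover every value
    width = math.floor(data_range / num_classes) + 1
    rows = []
    for i in range(num_classes):                       # step 4: class limits
        lcl = lowest + i * width                       # first LCL = minimum entry
        ucl = lcl + width - 1                          # classes do not overlap
        freq = sum(lcl <= x <= ucl for x in data)      # step 5: frequencies
        rows.append((lcl, ucl, freq))
    return rows

# Hypothetical exam scores, used only to exercise the function.
scores = [30, 42, 45, 51, 58, 60, 63, 67, 71, 74, 79, 82, 88, 95, 99]
rows = grouped_frequency(scores, num_classes=7)        # step 2: 7 classes
for lcl, ucl, f in rows:
    print(f"{lcl}-{ucl}\t{f}")
```

With these scores the range is 69, so 69 / 7 is rounded up to a class width of 10 and the classes run from 30-39 to 90-99.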

FREQUENCY DISTRIBUTION
[Figure: frequency distribution of serum cholesterol changes]

HISTOGRAM / FREQUENCY POLYGON
[Figure: age distribution of people watching a particular TV show]

FREQUENCY POLYGON
[Figure: age distribution of people watching a particular TV show]

MEASURES OF CENTRAL TENDENCY

• Mean
• Median
• Mode

MEASURE OF CENTRAL TENDENCY: MEAN

Mean: The sum of all the values of a data series divided by the total number of values.

• Population mean: $\mu = \frac{\sum x}{N}$
• Sample mean: $\bar{x} = \frac{\sum x}{n}$

MEAN

• Mean of ungrouped and grouped frequency distributions:

$$\bar{x} = \frac{\sum f x}{\sum f}$$

where
x = data values of ungrouped data OR mid-points of the groups in grouped data
f = frequency of each group

FIND MEAN OF THE FOLLOWING DATA

Peas per pod   Freq, f
1              1
2              2
3              5
4              9
5              18
6              12
7              3

FIND MEAN OF THE FOLLOWING DATA

Weight in kg.   No. of persons
50-54           6
55-59           18
60-64           78
65-69           80
70-74           100
75-79           72
80-84           30
85-89           10
90-94           6

Ans.: 70.4
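The answer for the weight data can be reproduced by taking the class midpoints as x and the class frequencies as f in the frequency-weighted mean Σfx / Σf. A minimal Python sketch (the function name `grouped_mean` is illustrative):

```python
# (lower class limit, upper class limit, frequency) from the weight table.
weight_classes = [
    (50, 54, 6), (55, 59, 18), (60, 64, 78), (65, 69, 80), (70, 74, 100),
    (75, 79, 72), (80, 84, 30), (85, 89, 10), (90, 94, 6),
]

def grouped_mean(classes):
    """Mean of grouped data: sum(f * midpoint) / sum(f)."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lcl + ucl) / 2 for lcl, ucl, f in classes)
    return total_fx / total_f

mean_weight = grouped_mean(weight_classes)
print(round(mean_weight, 2))
```

Here Σf = 400 and Σfx = 28160, which gives the stated answer of 70.4 kg.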

MEASURES OF CENTRAL TENDENCY: MEDIAN

Median
• The value that divides a series of values in half when they are all listed in order
• When there are an odd number of values
  - The median is the middle value
• When there are an even number of values
  - The median is the mean of the two middle values.

MEDIAN OF GROUPED DATA

$$\text{Median} = L + \frac{(n/2) - cf_b}{f_m} \times c$$

where:
• L is the lower class boundary of the class containing the median
• n is the total number of data
• cf_b is the cumulative frequency of the class before the median class
• f_m is the frequency of the median class
• c is the class width
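The grouped-median formula, Median = L + ((n/2) − cf_b) / f_m × c, can be sketched in Python as below and applied to the weight table used in this deck. The function name `grouped_median` is illustrative, and the sketch assumes equal-width classes listed in ascending order.

```python
# (lower class limit, upper class limit, frequency) from the weight table.
weight_classes = [
    (50, 54, 6), (55, 59, 18), (60, 64, 78), (65, 69, 80), (70, 74, 100),
    (75, 79, 72), (80, 84, 30), (85, 89, 10), (90, 94, 6),
]

def grouped_median(classes):
    """Median = L + ((n/2) - cf_b) / f_m * c for equal-width classes."""
    n = sum(f for _, _, f in classes)
    width = classes[1][0] - classes[0][0]   # c: gap between consecutive LCLs
    gap = classes[1][0] - classes[0][1]     # distance from a UCL to the next LCL
    cumulative = 0                          # cf_b: cumulative frequency so far
    for lcl, _, f in classes:
        if cumulative + f >= n / 2:         # this is the median class
            lower_boundary = lcl - gap / 2  # L: lower class boundary
            return lower_boundary + (n / 2 - cumulative) / f * width
        cumulative += f

median_weight = grouped_median(weight_classes)
print(round(median_weight, 2))
```

For the weight data n/2 = 200, the median class is 70-74 (L = 69.5, cf_b = 182, f_m = 100, c = 5), giving 69.5 + 0.9 = 70.4.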

FIND MEDIAN OF THE FOLLOWING DATA

Weight in kg.   No. of persons
50-54           6
55-59           18
60-64           78
65-69           80
70-74           100
75-79           72
80-84           30
85-89           10
90-94           6

Ans.: 70.4

MEASURE OF CENTRAL TENDENCY: MODE

Mode
• The data value that occurs with the greatest frequency.
• If no value is repeated, the data set has no mode.
• If two values occur with the same greatest frequency, each entry is a mode (bimodal).

a) 5.40 1.10 0.42 0.73 0.48 1.10 → Mode is 1.10
b) 27 27 27 55 55 55 88 88 99 → Bimodal - 27 & 55
c) 1 2 3 6 7 8 9 10 → No Mode

MEASURES OF CENTRAL TENDENCY

• Mode
  - The modal value corresponds to the highest bar in a histogram
[Figure: frequency histogram, Number of Peas in a Pod, with the mode marked at the highest bar; horizontal axis: Number of Peas (1 to 7); vertical axis: Frequency, f (0 to 20)]

MODE OF GROUPED DATA

$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$$

where
• L is the lower class boundary of the modal class
• f_1 is the frequency of the modal class
• f_0 is the frequency of the class before the modal class in the frequency table
• f_2 is the frequency of the class after the modal class in the frequency table
• c is the class width of the modal class
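The grouped-mode formula, Mode = L + (f1 − f0) / (2·f1 − f0 − f2) × c, can be sketched in Python and applied to the weight table used in this deck. The function name `grouped_mode` is illustrative; treating a missing neighbouring class as frequency 0 and taking the first class when frequencies tie are assumptions of this sketch, not statements from the slides.

```python
# (lower class limit, upper class limit, frequency) from the weight table.
weight_classes = [
    (50, 54, 6), (55, 59, 18), (60, 64, 78), (65, 69, 80), (70, 74, 100),
    (75, 79, 72), (80, 84, 30), (85, 89, 10), (90, 94, 6),
]

def grouped_mode(classes):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * c for equal-width classes."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                     # modal class (first if tied)
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0               # assumption: 0 if no class before
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # assumption: 0 if no class after
    width = classes[1][0] - classes[0][0]           # c: class width
    gap = classes[1][0] - classes[0][1]             # distance from a UCL to next LCL
    lower_boundary = classes[i][0] - gap / 2        # L: lower class boundary
    return lower_boundary + (f1 - f0) / (2 * f1 - f0 - f2) * width

mode_weight = grouped_mode(weight_classes)
print(round(mode_weight, 2))
```

For the weight data the modal class is 70-74 (L = 69.5, f1 = 100, f0 = 80, f2 = 72, c = 5), so the mode is 69.5 + (20 / 48) × 5, about 71.58.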

FIND MODE OF THE FOLLOWING DATA

Weight in kg.   No. of persons
50-54           6
55-59           18
60-64           78
65-69           80
70-74           100
75-79           72
80-84           30
85-89           10
90-94           6

Ans.: 71.58

COMPARING THE MEAN, MEDIAN, AND MODE

• All three measures describe an “average”. Choose the one that best represents a “typical” value in the set.
• Mean:
  - The most familiar average.
  - A reliable measure because it takes into account every entry of a data set.
  - May be greatly affected by outliers or skew.
• Median:
  - A common average.
  - Not as affected by skew or outliers.
• Mode: May be used if one value occurs overwhelmingly often.

PROBLEM 1
Pulse rate of 50 persons is given. Calculate the mean, median and mode of the data.

Pulse rate   No. of persons
67           4
68           5
69           3
70           2
71           7
72           10
73           11
74           3
75           2
76           3

Mean = 71.5
Median = 72
Mode = 73

PROBLEM 2
Calculate the mean, median and mode of the given data.

%Hb       No. of persons
11.1-12   5
12.1-13   10
13.1-14   15
14.1-15   4
15.1-16   2
16.1-17   1

Mean = 13.25
Median = 13.23
Mode = 13.31

PROBLEM 3
The table shows the daily expenditure of 100 college students. Calculate the mean, median and mode of the given data.

Expenditure (Rs.)   No. of students
0 to 10             14
10 to 20            23
20 to 30            26
30 to 40            22
40 to 50            15

Mean = 25.1
Median = 25
Mode = 24.29

PROBLEM 4
Calculate the mean, median and mode of the given data.

Particle size (µ)   No. of particles
100-300             3
301-600             9
601-900             48
901-1200            21
1201-1500           19

Mean = 883
Median = 838
Mode = 777.8

MEASURES OF DISPERSION

• Range
• Standard Deviation (SD)
• Variance
• Interquartile range (IQR)

MEASURES OF DISPERSION

Range
• The difference between maximum and minimum values
• Range = maximum value – minimum value

STANDARD DEVIATION (SD)

• SD is a measure of the variability of a set of data
• The mean represents the average of a group of values, with some of the values being above the mean and some below
• In effect, SD is the average amount of spread in a distribution of values
• SD of sample: S
• SD of population: σ
• Variance (S²) is another measure of spread

DISTANCES AGES DEVIATE ABOVE AND BELOW THE MEAN
[Figure: deviations of individual ages above and below the mean]
Adding deviations always equals zero

CALCULATING S²
• The differences from the mean always total zero
• The differences must therefore first be squared, which removes the negative signs

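FORMULA TO CALCULATE THE SD

Standard deviation, $S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Variance of sample, $S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Variance of population, $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

Calculate the SD of following values:

101.8, 103.2, 104.0, 102.5, 103.5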
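The exercise values 101.8, 103.2, 104.0, 102.5, 103.5 can be worked through with Python's standard `statistics` module. This sketch assumes the usual definitions: the sample variance divides the sum of squared deviations by n − 1, and the population variance divides it by N.

```python
import statistics

values = [101.8, 103.2, 104.0, 102.5, 103.5]

mean = statistics.mean(values)
sample_variance = statistics.variance(values)        # sum of squares / (n - 1)
sample_sd = statistics.stdev(values)                 # square root of the above
population_variance = statistics.pvariance(values)   # sum of squares / N

print(round(mean, 1), round(sample_variance, 3),
      round(sample_sd, 3), round(population_variance, 3))
```

By hand: the mean is 103.0, the deviations are −1.2, 0.2, 1.0, −0.5 and 0.5, their squares sum to 2.98, so S² = 2.98 / 4 = 0.745 and S is about 0.863.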

CALCULATION OF STANDARD DEVIATION

FIND STANDARD DEVIATION OF THE FOLLOWING DATA

Weight in kg.   No. of persons
50-54           6
55-59           18
60-64           78
65-69           80
70-74           100
75-79           72
80-84           30
85-89           10
90-94           6

PROBLEM 1
Pulse rate of 50 persons is given. Calculate the standard deviation of the data.

Pulse rate   No. of persons
67           4
68           5
69           3
70           2
71           7
72           10
73           11
74           3
75           2
76           3

PROBLEM 2
Calculate the standard deviation of the given data.

%Hb       No. of persons
11.1-12   5
12.1-13   10
13.1-14   15
14.1-15   4
15.1-16   2
16.1-17   1

PROBLEM 3
The table shows the daily expenditure of 100 college students. Calculate the standard deviation of the given data.

Expenditure (Rs.)   No. of students
0 to 10             14
10 to 20            23
20 to 30            26
30 to 40            22
40 to 50            15

PROBLEM 4
Calculate the standard deviation of the given data.

Life of bulb (hrs)   No. of bulbs
40-55                10
55-70                12
70-85                15
85-100               13
100-115              10

THE SHAPE OF DATA

• Histograms of frequency distributions have different shapes.
• Distributions are often symmetrical, with most scores falling in the middle and fewer toward the extremes
• Many biological data are approximately symmetrically distributed and form a normal curve (a.k.a. bell-shaped curve)

THE SHAPE OF DATA (CONT.)
[Figure: histogram with a line depicting the shape of the data]

THE NORMAL DISTRIBUTION

• Data whose histogram follows a normal curve have a normal distribution (a.k.a. Gaussian distribution)
• Properties of a normal distribution
  - It is symmetric about its mean
  - The highest point is at its mean
  - The height of the curve decreases as one moves away from the mean in either direction, approaching, but never reaching, zero

THE NORMAL DISTRIBUTION (CONT.)
[Figure: normal curve overlying a histogram, with the mean marked and the following annotations]
• Mean = Median = Mode
• The highest point of the overlying normal curve is at the mean
• As one moves away from the mean in either direction the height of the curve decreases, approaching, but never reaching, zero
• A normal distribution is symmetric about its mean

SKEWED DISTRIBUTIONS

• The data are not distributed symmetrically in skewed distributions
  - Consequently, the mean, median, and mode are not equal and are in different positions
  - Scores are clustered at one end of the distribution
  - A small number of extreme values are located towards one end

SKEWED DISTRIBUTIONS (CONT.)

• Skew is always toward the direction of the longer tail
  - Positive if skewed to the right
  - Negative if to the left
• The mean is shifted the most

SKEWED DISTRIBUTIONS (CONT.)

• Because the mean is shifted so much, it is not the best estimate of the average score for skewed distributions
• The median is a better estimate of the center of skewed distributions
  - It will be the central point of any distribution
  - 50% of the values are above and 50% below the median

COMPARING THE MEAN, MEDIAN, AND MODE

• For a normal distribution:
  Mean = Median = Mode
• For a moderately skewed distribution (an empirical approximation):
  Mode ≈ 3(Median) – 2(Mean)

CENTRAL TENDENCY FOR DIFFERENT TYPES OF DATA

Type of data                Best measure of central tendency
Nominal                     Mode
Ordinal                     Median/Mode
Quantitative, symmetrical   Mean
Quantitative, skewed        Median

MORE PROPERTIES OF NORMAL CURVES

• About 68.3% of the area under a normal curve is within one standard deviation (SD) of the mean
• About 95.5% is within two SDs
• About 99.7% is within three SDs
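The areas within one, two and three SDs of the mean can be checked numerically. For a normal distribution, the fraction of the area within k standard deviations of the mean equals erf(k / √2); the sketch below uses that identity with Python's `math.erf`. The function name `area_within` is illustrative.

```python
import math

def area_within(k):
    """Fraction of a normal distribution within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))

areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(k, f"{100 * area:.2f}%")
```

The computed fractions are approximately 0.6827, 0.9545 and 0.9973, in line with the rounded percentages quoted on the slide.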

MORE PROPERTIES OF NORMAL CURVES (CONT.)
[Figure: areas under the normal curve]

WIDE SPREAD RESULTS IN HIGHER SDS, NARROW SPREAD IN LOWER SDS
[Figure: distributions with wide and narrow spread]

SPREAD IS IMPORTANT WHEN COMPARING 2 OR MORE GROUP MEANS
[Figure: two pairs of group distributions with the same means but different spreads]
It is more difficult to see a clear distinction between groups in the upper example because the spread is wider, even though the means are the same

COEFFICIENT OF VARIATION (CV)

• Also known as Relative Standard Deviation (RSD)

$$\%\,\text{CV / RSD} = \frac{S}{\bar{X}} \times 100$$

• %CV = 10 means that the SD is 10% of the mean.
• Allows comparison of variability in different kinds of measurements.
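As an illustration of %CV = (S / mean) × 100, the sketch below reuses the five values from the standard deviation exercise earlier in this deck (101.8, 103.2, 104.0, 102.5, 103.5). The function name `percent_cv` is illustrative, and S is taken to be the sample standard deviation (n − 1 denominator).

```python
import statistics

def percent_cv(values):
    """Coefficient of variation in percent: sample SD / mean * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

values = [101.8, 103.2, 104.0, 102.5, 103.5]
cv = percent_cv(values)
print(f"{cv:.2f}%")
```

With S of about 0.863 and a mean of 103.0, the %CV is about 0.84%, i.e. the SD is less than 1% of the mean.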

HOW MUCH SPREAD IS ACCEPTABLE?

• This varies with the experiment.
• Repeatability in normal HPLC analysis
  - %CV should not be >2%
• Repeatability in bioanalysis by HPLC
  - %CV should not be >20%
• In biological experiments, CV may be as high as 20-50%

STANDARD DEVIATION OF THE MEAN (STANDARD ERROR OF THE MEAN, SEM)

• It is a measure of the variability of the mean
• Example: Means of Potencies of Five Sets of 100 Tablets Selected from a Production Batch

STANDARD ERROR OF THE MEAN, SEM

• Instead of studying several samples, it is statistically calculated by the formula,

$$\text{SEM} = \frac{S}{\sqrt{n}}$$

where S is the sample standard deviation and n is the sample size.

PRECISION AND ACCURACY

Precision
• Refers to the extent of variability of a group of measurements observed under similar experimental conditions
• Measure of reproducibility

PRECISION AND ACCURACY

Accuracy
• The accuracy of a measurement is determined by how close a measured value is to its “true” value.
• For example, suppose a sample known to weigh 3.182 g is weighed five different times by a student, with the resulting data: 3.200 g, 3.180 g, 3.152 g, 3.168 g, 3.189 g
• The most accurate measurement would be 3.180 g, because it is closest to the true “weight” of the sample.

PRECISION AND ACCURACY
[Figure: difference between accuracy and precision]

SHAPE OF DATA

• Shape of data is measured by
  - Skewness
  - Kurtosis

SKEWNESS, KURTOSIS

• Skewness (Sk), the Pearsonian coefficient, is a measure of asymmetry of a distribution around its mean.
• Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution.

SKEWNESS

• Measures asymmetry of data
  - Positive or right skewed: Longer right tail
  - Negative or left skewed: Longer left tail

$$\text{Coefficient of Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}}$$

A normal distribution has a skewness of 0; a right-skewed distribution has a skewness greater than 0 and a left-skewed distribution has a skewness less than 0.

SKEWNESS

• If skewness = 0, the data are perfectly symmetrical.
• But a skewness of exactly zero is quite unlikely for real-world data
• If skewness is less than −1 or greater than +1, the distribution is highly skewed.
• If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
• If skewness is between −½ and +½, the distribution is approximately symmetric.

EXAMPLE

• Here are grouped data for heights of 100 randomly selected male students.
• Calculate the skewness coefficient and comment on the skewness of the data.

EXAMPLE

College Men’s Heights

Height (inches)   Class Mark, x   Frequency, f   xf     x-mean   (x-mean)²*f   (x-mean)³*f
59.5–62.5         61              5              305    -6.45    208.01        -1341.68
62.5–65.5         64              18             1152   -3.45    214.25        -739.15
65.5–68.5         67              42             2814   -0.45    8.51          -3.83
68.5–71.5         70              27             1890   2.55     175.57        447.70
71.5–74.5         73              8              584    5.55     246.42        1367.63
Total                             100            6745            852.75        -269.33
Mean 67.45
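The skewness coefficient for the College Men's Heights table can be checked with a short Python sketch. It assumes the coefficient is √n · Σf(x − mean)³ / [Σf(x − mean)²]^(3/2), with the class marks standing in for the individual heights; the function name `grouped_skewness` is illustrative.

```python
import math

# (class mark x, frequency f) from the College Men's Heights table.
heights = [(61, 5), (64, 18), (67, 42), (70, 27), (73, 8)]

def grouped_skewness(classes):
    """sqrt(n) * sum f(x - mean)^3 / (sum f(x - mean)^2) ** 1.5"""
    n = sum(f for _, f in classes)
    mean = sum(x * f for x, f in classes) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in classes)
    sum_cube = sum(f * (x - mean) ** 3 for x, f in classes)
    return math.sqrt(n) * sum_cube / sum_sq ** 1.5

skewness = grouped_skewness(heights)
print(round(skewness, 3))
```

With n = 100, Σf(x − mean)² = 852.75 and Σf(x − mean)³ of about −269.33, the coefficient is about −0.108, which falls in the "approximately symmetric" band between −½ and +½.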

$$\text{Coefficient of Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}}$$

= √100 × (−269.33) / (852.75)^(3/2)
= −2693.3 / 24901.91
= −0.108

Conclusion: The skewness lies between −½ and +½, so the data are approximately symmetric, which is consistent with a normal distribution.

KURTOSIS

• Measures peakedness of the distribution of data.
• The height and sharpness of the peak relative to the rest of the data are measured by a number called kurtosis.
• Higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak.
• The kurtosis of a normal distribution is 0 (when 3 is subtracted, as in the formula used here).

KURTOSIS

There are three types of peakedness.
• Leptokurtic - very peaked
• Platykurtic - relatively flat
• Mesokurtic - in between

KURTOSIS

• Mesokurtic has a kurtosis = 0
• Leptokurtic has a kurtosis that is +
• Platykurtic has a kurtosis that is -

Let $x_1, x_2, \ldots, x_n$ be n observations. Then,

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^2} - 3$$

KURTOSIS EXAMPLE

College Men’s Heights
• Here are grouped data for heights of 100 randomly selected male students.
• Calculate the kurtosis of the data and give your interpretation.

Height (inches)   Class Mark, x   Frequency, f   xf     x-mean   (x-mean)²*f   (x-mean)⁴*f
59.5–62.5         61              5              305    -6.45    208.01        8653.84
62.5–65.5         64              18             1152   -3.45    214.25        2550.05
65.5–68.5         67              42             2814   -0.45    8.51          1.72
68.5–71.5         70              27             1890   2.55     175.57        1141.63
71.5–74.5         73              8              584    5.55     246.42        7590.35
Total                             100            6745   -2.25    852.75        19937.59
Mean 67.45

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^2} - 3$$

= 100 × (19937.59) / (852.75)² − 3
= 1993759 / 727182.56 − 3
= −0.258
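The kurtosis of the College Men's Heights data can be checked the same way. This sketch assumes Kurtosis = n · Σf(x − mean)⁴ / [Σf(x − mean)²]² − 3, with class marks standing in for the individual heights; the function name `grouped_kurtosis` is illustrative.

```python
# (class mark x, frequency f) from the College Men's Heights table.
heights = [(61, 5), (64, 18), (67, 42), (70, 27), (73, 8)]

def grouped_kurtosis(classes):
    """n * sum f(x - mean)^4 / (sum f(x - mean)^2) ** 2 - 3"""
    n = sum(f for _, f in classes)
    mean = sum(x * f for x, f in classes) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in classes)
    sum_fourth = sum(f * (x - mean) ** 4 for x, f in classes)
    return n * sum_fourth / sum_sq ** 2 - 3

kurtosis = grouped_kurtosis(heights)
print(round(kurtosis, 3))
```

The result is about −0.258. A slightly negative value indicates a distribution a little flatter than the normal curve (mildly platykurtic).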

SAMPLING TECHNIQUES

SAMPLING
• The sampling procedure is an essential ingredient of a good experiment.
• An otherwise excellent experiment or investigation can be invalidated if proper attention is not given to choosing samples in a manner consistent with the experimental design or objectives.
• Statistical treatment of data and the inference based on experimental results depend on the sampling procedure.

SAMPLING TECHNIQUES

Sampling techniques may be roughly divided into
• Probability sampling (Random Sampling)
• Non-probability sampling (Authoritative sampling)

PROBABILITY SAMPLING TECHNIQUES

• A probability sample is one in which each element of the population has a known probability of being included in the sample, and the elements are chosen by some random device.

PROBABILITY SAMPLING TECHNIQUES

• SIMPLE RANDOM SAMPLING
• STRATIFIED SAMPLING
• SYSTEMATIC SAMPLING
• CLUSTER SAMPLING

SAMPLING TECHNIQUES: SIMPLE RANDOM SAMPLING

• Most commonly used method
• Each individual (object) in the population to be sampled has an equal chance of being selected.
• Simple random sampling is most effective when the variability is relatively small and uniform over the population
• E.g. playing cards, names drawn out of a bowl, lottery

RANDOM SAMPLING

e.g.
(1) Select a sample of 10 bottles from a batch of 800 bottles.
(2) Allocation of treatment to 20 patients in a BA study

RANDOM SAMPLING: RANDOM NUMBER TABLES

BOOK - A MILLION RANDOM NUMBERS
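In place of a printed random number table, a pseudo-random number generator can make both selections. The sketch below uses Python's standard `random` module. The seed value, the numbering of bottles from 1 to 800, and the split of the 20 patients into two treatments of 10 each (labelled "A" and "B") are assumptions made for illustration; the slides do not specify them.

```python
import random

rng = random.Random(2015)  # arbitrary fixed seed so the draw is reproducible

# (1) Simple random sample of 10 bottles from a batch numbered 1..800,
#     drawn without replacement so no bottle is picked twice.
bottles = rng.sample(range(1, 801), k=10)

# (2) Random allocation of 20 patients to two treatments (assumed 10 each).
patients = list(range(1, 21))
rng.shuffle(patients)
allocation = {"A": sorted(patients[:10]), "B": sorted(patients[10:])}

print(sorted(bottles))
print(allocation)
```

Every bottle has the same chance of being selected, and every patient has the same chance of receiving either treatment.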

MERITS OF RANDOM SAMPLING

• More scientific
• Theory of probability is applicable.
• Economical
• Good for homogeneous population

DEMERITS OF RANDOM SAMPLING

• A complete list of all items is required before sampling.
• When units are spread over a large area, this method becomes impractical.

STRATIFIED SAMPLING

• Stratification is the process of dividing members of the population into homogeneous subgroups
• Stratified sampling is a recommended way of sampling when the strata are very different from each other, but objects within each stratum are alike.
• The strata should be mutually exclusive: every element in the population must be assigned to only one stratum.
• The strata should also be collectively exhaustive: no population element can be excluded.

STRATIFIED SAMPLING

• This method improves the representativeness of the sample by reducing sampling error.
• Example:
  - In a clinical study on asthmatics, the stratification could be accomplished by dividing the asthmatic patients into subsets (strata) depending on age, duration of illness, or severity of illness
  - In quality control procedures, items are frequently selected for inspection at random within specified time intervals (strata) rather than in a completely random fashion (simple random sampling)

TYPES OF STRATIFIED SAMPLING

• Proportionate sampling uses a sampling fraction in each of the strata that is proportional to that of the total population.
• For instance, if the population consists of 60% in the male stratum and 40% in the female stratum, then the relative size of the two samples (three males, two females) should reflect this proportion.
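Proportionate allocation can be sketched in Python as follows, using the 60% male / 40% female example with a total sample of five. The population of 100 labelled members, the seed, and the function name `proportionate_stratified_sample` are hypothetical and only serve the illustration. Note that simple rounding of each stratum's share may not always add up exactly to the requested sample size.

```python
import random

def proportionate_stratified_sample(strata, sample_size, seed=None):
    """Draw from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(sample_size * len(members) / total)  # stratum's share
        sample[name] = rng.sample(members, k)          # random within stratum
    return sample

# Hypothetical population: 60 males and 40 females.
population = {
    "male": [f"M{i}" for i in range(1, 61)],
    "female": [f"F{i}" for i in range(1, 41)],
}
chosen = proportionate_stratified_sample(population, sample_size=5, seed=1)
print({name: len(members) for name, members in chosen.items()})
```

With these inputs the sample contains three males and two females, matching the proportion described on the slide.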

TYPES OF STRATIFIED SAMPLING

Disproportionate sampling (Optimum allocation): based on the standard deviation of the variable in each stratum
• Larger samples are taken in the strata with the greatest variability to generate the least possible sampling variance.

MERITS OF STRATIFIED SAMPLING

• More representative of the population
• Ensures greater accuracy
• Good for non-homogeneous population

DEMERITS OF STRATIFIED SAMPLING

• Stratified sampling is not useful when the population cannot be exhaustively partitioned into disjoint subgroups.
• If multiple criteria exist for formation of strata, the sampling plan becomes more difficult.

SYSTEMATIC SAMPLING

• Every kth item is selected
• Sampling is done at regular intervals
• Sampling interval, k = N/n
  where N = population size
        n = desired sample size
• Initial sample (j) is selected randomly and then every kth sample is selected, e.g. j+k, j+2k, etc.

SYSTEMATIC SAMPLING

• For example, N=64, n=8, then k=64/8 = 8
• Randomly selected initial sample (j) is 3.

SYSTEMATIC SAMPLING

• Care should be taken that the process does not show a cyclic or periodic behavior, because systematic sampling will then not be representative of the process.
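The N = 64, n = 8, j = 3 example can be written out as a short Python sketch. It assumes the items are numbered 1 to N and that the random start j is drawn from the first interval (1 to k); the function name `systematic_sample` is illustrative.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select items j, j+k, j+2k, ... with sampling interval k = N // n."""
    k = population_size // sample_size             # sampling interval
    if start is None:
        start = random.Random(seed).randint(1, k)  # random start within 1..k
    return [start + i * k for i in range(sample_size)]

selected = systematic_sample(64, 8, start=3)
print(selected)
```

With k = 8 and j = 3 the selected items are 3, 11, 19, 27, 35, 43, 51 and 59.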

SYSTEMATIC SAMPLING

Merits:
• Simple and convenient to adopt
• If the population is sufficiently large, homogeneous and each unit is numbered, it can yield accurate results.
Demerits:
• Periodicities in the list can make the sample unrepresentative

CLUSTER SAMPLING

• In this technique, the total population is divided into groups (or clusters) and a simple random sample of the groups is selected.
• Then the required information is collected from the elements within each selected group.
• This may be done for every element in these groups (single-stage cluster sampling) or a subsample of elements may be selected within each of these groups (two-stage cluster sampling).

CLUSTER SAMPLING

• The population within a cluster should ideally be as heterogeneous as possible, but there should be homogeneity between the clusters formed.
• Each cluster should be a small-scale representation of the total population.
• The clusters should be mutually exclusive and collectively exhaustive.

CLUSTER SAMPLING

Merits
• Flexibility is high, useful for large populations
Demerits
• Least accurate amongst all probability sampling methods.

STRATIFICATION AND CLUSTERING

STRATIFICATION
• Divide population into groups different from each other, e.g. sex, age, religion, etc.
• Sample randomly from each group
• Less error compared to simple random
• More expensive

CLUSTERING
• Divide population into comparable groups, e.g. cities, schools, etc.
• Randomly sample some of the groups
• More error compared to simple random
• Less expensive

[Figure: Stratified sampling versus Cluster sampling]

SAMPLING TECHNIQUES: NON-PROBABILITY SAMPLING

• Here, some elements have no chance of getting selected.
• Selection is done based on ease of access to subjects (convenience sample).
• Selection is done based on the judgement of a person (judgement sample).
• A pre-planned number of subjects may be selected (quota sample)
  - e.g. 100 men, 100 women
• Sample size may be small
• Doesn’t allow estimation of sampling error.
• Should be restricted to small populations.

MERITS OF NON-PROBABILITY SAMPLING

Merits:
 Simple; a more representative sample can be obtained where a random sample fails
 Widely used in solving business problems and making public policy decisions.

DEMERITS

 Individual bias
 Inaccurate
 Can’t be compared with other studies.


FACTORS AFFECTING SELECTION OF SAMPLING PROCEDURE

 The nature of the population.
   For example, can we enumerate the individual units, such as packaged bottles of a product, or is the population less easily defined, as in the case of hypertensive patients?
 The cost of sampling in terms of both time and money.
 Convenience.
 Desired precision.
   The accuracy and precision desired will be a function of the sampling procedure and sample size.

SAMPLING ERROR

 The discrepancy between a sample statistic and its population parameter is called sampling error.
 Defining and measuring sampling error is a large part of inferential statistics.
 A sample can never be a perfect miniature of the population, hence errors do occur.
 It is incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population.

SAMPLING ERRORS

Types of errors:
Biased:
 Due to non-probability sampling
 Doesn’t decrease even if the sample size is increased.
Unbiased:
 Random sampling errors
 Due to chance difference between members of the population included in the sample and members excluded from the sample.

HOW TO REDUCE SAMPLING ERRORS?


 Increase the sample size.
 Choose correct sampling method.
 Choose correct method for data interpretation.
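
The effect of increasing the sample size on random (unbiased) sampling error can be illustrated with a small simulation. The population below is synthetic and all names are hypothetical; this is a sketch under those assumptions and says nothing about biased errors, which do not shrink with sample size.

```python
import random
import statistics

rng = random.Random(42)
# Hypothetical population: 100,000 values with mean about 50 and SD about 10.
population = [rng.gauss(50, 10) for _ in range(100_000)]
mu = statistics.fmean(population)  # population parameter

def mean_abs_sampling_error(n, repeats=200):
    """Average |sample mean - population mean| over repeated simple random samples."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, n)
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)

error_small = mean_abs_sampling_error(10)
error_large = mean_abs_sampling_error(1000)
print(f"n = 10:   mean |error| = {error_small:.3f}")
print(f"n = 1000: mean |error| = {error_large:.3f}")
```

The average discrepancy for n = 1000 comes out much smaller than for n = 10, in line with the standard error of the mean falling roughly as 1/√n.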


CONFIDENCE INTERVALS

 Quantifies how far the true mean (µ) lies from the measured mean, $\bar{x}$. Uses the mean and standard deviation of the sample.

$$\mu = \bar{x} \pm \frac{t s}{\sqrt{n}}$$

where t is from the t-table and n = number of measurements.
Degrees of freedom (df) = n - 1 for the CI.

EXAMPLE OF CALCULATING A CONFIDENCE INTERVAL

Consider measurement of dissolved Ti in a standard seawater (NASS-3):
Data: 1.34, 1.15, 1.28, 1.18, 1.33, 1.65, 1.48 nM
DF = n – 1 = 7 – 1 = 6
$\bar{x}$ = 1.34 nM or 1.3 nM
s = 0.17 or 0.2 nM

95% confidence interval
t(df=6,95%) = 2.447
CI95 = 1.3 ± 0.16 or 1.3 ± 0.2 nM

50% confidence interval
t(df=6,50%) = 0.718
CI50 = 1.3 ± 0.05 nM
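
The confidence-interval arithmetic for the dissolved Ti data can be checked with a short Python sketch. The helper name `confidence_interval` is hypothetical; the t values are the tabulated ones quoted in the example and are passed in rather than computed.

```python
import math
import statistics

def confidence_interval(data, t_value):
    """Return (mean, sample SD, half-width) for mu = mean ± t*s/sqrt(n)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)               # sample SD, n - 1 in the denominator
    half_width = t_value * s / math.sqrt(n)
    return mean, s, half_width

ti_nM = [1.34, 1.15, 1.28, 1.18, 1.33, 1.65, 1.48]   # dissolved Ti, nM
mean, s, half95 = confidence_interval(ti_nM, t_value=2.447)  # t(df=6, 95%)
_, _, half50 = confidence_interval(ti_nM, t_value=0.718)     # t(df=6, 50%)

print(f"mean = {mean:.2f} nM, s = {s:.2f} nM")
print(f"95% CI: {mean:.2f} ± {half95:.2f} nM")
print(f"50% CI: {mean:.2f} ± {half50:.2f} nM")
```

Passing the tabulated t value as an argument keeps the sketch free of third-party libraries; a statistics package could supply the t quantile instead.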

INTERPRETING THE CONFIDENCE INTERVAL

 For a 95% CI, there is a 95% probability that the true mean (µ) lies within the range 1.3 ± 0.2 nM, or between 1.1 and 1.5 nM
 For a 50% CI, there is a 50% probability that the true mean lies within the range 1.3 ± 0.05 nM, or between 1.25 and 1.35 nM
 Note that the CI will narrow as n is increased
 Useful for characterizing data that are regularly obtained; e.g., quality assurance, quality control

COMPARING A MEASURED RESULT WITH A “KNOWN” VALUE

 “Known” value would typically be a certified value from a standard reference material (SRM)
 Another application of the t statistic

$$t_{calc} = \frac{\text{known value} - \bar{x}}{s}\,\sqrt{n}$$

Will compare tcalc to tabulated value of t at appropriate df and CL.
df = n - 1 for this test
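
A minimal sketch of this test in Python, using the dissolved Fe figures from the worked example in this document (certified value 5.85 nM, measured 5.76 ± 0.17 nM, n = 10, tabulated t = 2.262). The function name `t_calc_known` is hypothetical, and the absolute value is taken because the decision rule compares |tcalc| with ttable.

```python
import math

def t_calc_known(known, mean, s, n):
    """|known value - mean| / s * sqrt(n), for comparison with a tabulated t."""
    return abs(known - mean) / s * math.sqrt(n)

t_calc = t_calc_known(known=5.85, mean=5.76, s=0.17, n=10)
t_table = 2.262                       # tabulated t for df = 9, 95% CL
significantly_different = t_calc >= t_table

print(f"t_calc = {t_calc:.3f}, t_table = {t_table}")
print("significantly different" if significantly_different
      else "not significantly different")
```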

COMPARING A MEASURED RESULT WITH A “KNOWN” VALUE--EXAMPLE

Dissolved Fe analysis verified using NASS-3 seawater SRM
Certified value = 5.85 nM
Experimental results: 5.76 ± 0.17 nM (n = 10)

$$t_{calc} = \frac{\text{known value} - \bar{x}}{s}\,\sqrt{n} = \frac{5.85 - 5.76}{0.17}\,\sqrt{10} = 1.674$$

(Keep 3 decimal places for comparison to table.)
Compare to ttable; df = 10 - 1 = 9, 95% CL
ttable(df=9,95% CL) = 2.262
If |tcalc| < ttable, results are not significantly different at the 95% CL.
If |tcalc| ≥ ttable, results are significantly different at the 95% CL.
For this example, tcalc < ttable, so experimental results are not significantly different at the 95% CL

COMPARING REPLICATE MEASUREMENTS OR COMPARING MEANS OF TWO SETS OF DATA

 Yet another application of the t statistic
 Example: Given the same sample analyzed by two different methods, do the two methods give the “same” result?

$$t_{calc} = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}\,\sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

$$s_{pooled} = \sqrt{\frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}}$$

Will compare tcalc to tabulated value of t at appropriate df and CL.
df = n1 + n2 – 2 for this test


COMPARING REPLICATE MEASUREMENTS OR COMPARING MEANS OF TWO SETS OF DATA--EXAMPLE

Determination of nickel in sewage sludge using two different methods

Method 1: Atomic absorption spectroscopy
Data: 3.91, 4.02, 3.86, 3.99 mg/g
$\bar{x}_1$ = 3.945 mg/g
s1 = 0.073 mg/g
n1 = 4

Method 2: Spectrophotometry
Data: 3.52, 3.77, 3.49, 3.59 mg/g
$\bar{x}_2$ = 3.59 mg/g
s2 = 0.12 mg/g
n2 = 4

$$s_{pooled} = \sqrt{\frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}} = \sqrt{\frac{(0.073)^2 (4 - 1) + (0.12)^2 (4 - 1)}{4 + 4 - 2}} = 0.0993$$

$$t_{calc} = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}\,\sqrt{\frac{n_1 n_2}{n_1 + n_2}} = \frac{3.945 - 3.59}{0.0993}\,\sqrt{\frac{(4)(4)}{4 + 4}} = 5.056$$

Note: Keep 3 decimal places to compare to ttable.
Compare to ttable at df = 4 + 4 – 2 = 6 and 95% CL.
ttable(df=6,95% CL) = 2.447
If |tcalc| < ttable, results are not significantly different at the 95% CL.
If |tcalc| ≥ ttable, results are significantly different at the 95% CL.
Since |tcalc| (5.056) ≥ ttable (2.447), results from the two methods are significantly different at the 95% CL.
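
The nickel example can be reproduced with a short Python sketch built from the summary statistics quoted above (means, standard deviations and n). The function names are hypothetical. Because the sketch carries s_pooled at full precision instead of rounding it to 0.0993, its t value differs from 5.056 in the third decimal place.

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two data sets."""
    return math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))

def t_calc_pooled(x1, s1, n1, x2, s2, n2):
    """t statistic for comparing two means when the SDs are similar."""
    sp = pooled_sd(s1, n1, s2, n2)
    return abs(x1 - x2) / sp * math.sqrt(n1 * n2 / (n1 + n2))

sp = pooled_sd(0.073, 4, 0.12, 4)
t_calc = t_calc_pooled(3.945, 0.073, 4, 3.59, 0.12, 4)
t_table = 2.447                       # tabulated t for df = 6, 95% CL

print(f"s_pooled = {sp:.4f}, t_calc = {t_calc:.3f}")
print("significantly different" if t_calc >= t_table
      else "not significantly different")
```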

COMPARING REPLICATE MEASUREMENTS OR COMPARING MEANS OF TWO SETS OF DATA

Wait a minute! There is an important assumption associated with this t-test:

It is assumed that the standard deviations (i.e., the precision) of the two sets of data being compared are not significantly different.

•How do you test to see if the two std. devs. are different?
•How do you compare two sets of data whose std. devs. are significantly different?

F-TEST TO COMPARE STANDARD DEVIATIONS

 Used to determine if std. devs. are significantly different before application of t-test to compare replicate measurements or compare means of two sets of data
 Also used as a simple general test to compare the precision (as measured by the std. devs.) of two sets of data
 Uses F distribution


F-TEST TO COMPARE STANDARD DEVIATIONS

Will compute Fcalc and compare to Ftable.

$$F_{calc} = \frac{s_1^2}{s_2^2} \quad \text{where } s_1 > s_2$$

DF = n1 - 1 and n2 - 1 for this test.

Choose confidence level (95% is a typical CL).

From D.C. Harris (2003) Quantitative Chemical Analysis, 6th Ed.



F-TEST TO COMPARE STANDARD DEVIATIONS

From previous example:
Let s1 = 0.12 and s2 = 0.073

$$F_{calc} = \frac{s_1^2}{s_2^2} = \frac{(0.12)^2}{(0.073)^2} = 2.70$$

Note: Keep 2 or 3 decimal places to compare with Ftable.
Compare Fcalc to Ftable at df = (n1 -1, n2 -1) = 3,3 and 95% CL.
If Fcalc < Ftable, std. devs. are not significantly different at 95% CL.
If Fcalc ≥ Ftable, std. devs. are significantly different at 95% CL.
Ftable(df=3,3;95% CL) = 9.28
Since Fcalc (2.70) < Ftable (9.28), std. devs. of the two sets of data are not significantly different at the 95% CL. (Precisions are similar.)

COMPARING REPLICATE MEASUREMENTS OR COMPARING MEANS OF TWO SETS OF DATA--REVISITED

The use of the t-test for comparing means was justified for the previous example because we showed that standard deviations of the two sets of data were not significantly different.

If the F-test shows that std. devs. of two sets of data are significantly different and you need to compare the means, use a different version of the t-test
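
The F-test for the two nickel methods (s = 0.12 and 0.073, tabulated F = 9.28 at df = 3,3 and 95% CL) can be sketched in a few lines of Python. The function name `f_calc` is hypothetical; it puts the larger standard deviation in the numerator so that F ≥ 1, matching the condition s1 > s2.

```python
def f_calc(s_a, s_b):
    """F = (larger SD)^2 / (smaller SD)^2, so that F >= 1."""
    s_large, s_small = max(s_a, s_b), min(s_a, s_b)
    return s_large**2 / s_small**2

F = f_calc(0.073, 0.12)
F_table = 9.28                        # tabulated F for df = (3, 3), 95% CL
sds_differ = F >= F_table

print(f"F_calc = {F:.2f}, F_table = {F_table}")
print("std. devs. significantly different" if sds_differ
      else "std. devs. not significantly different")
```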

COMPARING REPLICATE MEASUREMENTS OR COMPARING MEANS FROM TWO SETS OF DATA WHEN STD. DEVS. ARE SIGNIFICANTLY DIFFERENT

$$t_{calc} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$$

$$DF = \left\{ \frac{\left( s_1^2 / n_1 + s_2^2 / n_2 \right)^2}{\dfrac{\left( s_1^2 / n_1 \right)^2}{n_1 + 1} + \dfrac{\left( s_2^2 / n_2 \right)^2}{n_2 + 1}} \right\} - 2$$

FLOWCHART FOR COMPARING MEANS OF TWO SETS OF DATA OR REPLICATE MEASUREMENTS

Use F-test to see if std. devs. of the 2 sets of data are significantly different or not
 Std. devs. are significantly different: use the 2nd version of the t-test (the beastly version)
 Std. devs. are not significantly different: use the 1st version of the t-test (see previous, fully worked-out example)
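
The flowchart can be sketched as one Python function that runs the F-test first and then picks the matching version of the t-test. All function names are hypothetical, and the tabulated F value must be supplied by the caller. The degrees-of-freedom expression follows the form given in this document (n + 1 in the denominators, minus 2); other texts use a variant with n − 1 and no subtraction, so treat the result as specific to this convention.

```python
import math

def pooled_t(x1, s1, n1, x2, s2, n2):
    """1st version: SDs not significantly different. Returns (t, df)."""
    sp = math.sqrt((s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2))
    t = abs(x1 - x2) / sp * math.sqrt(n1 * n2 / (n1 + n2))
    return t, n1 + n2 - 2

def unequal_sd_t(x1, s1, n1, x2, s2, n2):
    """2nd version: SDs significantly different. Returns (t, df)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = abs(x1 - x2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 + 1) + v2**2 / (n2 + 1)) - 2
    return t, df

def compare_means(x1, s1, n1, x2, s2, n2, f_table):
    """F-test first, then the appropriate t-test. Returns (version, t, df)."""
    f = max(s1, s2) ** 2 / min(s1, s2) ** 2
    if f >= f_table:
        return ("2nd version",) + unequal_sd_t(x1, s1, n1, x2, s2, n2)
    return ("1st version",) + pooled_t(x1, s1, n1, x2, s2, n2)

# Nickel example: F_calc = 2.70 < F_table = 9.28, so the 1st version applies.
version, t, df = compare_means(3.945, 0.073, 4, 3.59, 0.12, 4, f_table=9.28)
# For illustration only: what the 2nd version would give on the same numbers.
t2, df2 = unequal_sd_t(3.945, 0.073, 4, 3.59, 0.12, 4)
print(version, round(t, 3), df)
print("2nd version", round(t2, 3), round(df2, 2))
```

The degrees of freedom from the second version are generally not an integer; rounding to the nearest whole number before consulting the t-table is a common practice.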

ONE LAST COMMENT ON THE F-TEST

Note that the F-test can be used simply to test whether or not two sets of data have statistically similar precisions.

Can use to answer a question such as: Do method one and method two provide similar precisions for the analysis of the same analyte?

EVALUATING QUESTIONABLE DATA POINTS USING THE Q-TEST

 Need a way to test questionable data points (outliers) in an unbiased way.
 Q-test is a common method to do this.
 Requires 4 or more data points to apply.

Calculate Qcalc and compare to Qtable

Qcalc = gap/range

Gap = (difference between questionable data pt. and its nearest neighbor)
Range = (largest data point – smallest data point)


EVALUATING QUESTIONABLE DATA POINTS USING THE Q-TEST--EXAMPLE

Consider set of data; Cu values in sewage sample:
9.52, 10.7, 13.1, 9.71, 10.3, 9.99 mg/L

Arrange data in increasing or decreasing order:
9.52, 9.71, 9.99, 10.3, 10.7, 13.1

The questionable data point (outlier) is 13.1

Calculate

$$Q_{calc} = \frac{\text{gap}}{\text{range}} = \frac{13.1 - 10.7}{13.1 - 9.52} = 0.670$$

Compare Qcalc to Qtable for n observations and desired CL (90% or 95% is typical). It is desirable to keep 2-3 decimal places in Qcalc so judgment from table can be made.

Qtable (n=6,90% CL) = 0.56
From G.D. Christian (1994) Analytical Chemistry, 5th Ed.
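
The Q-test calculation for the Cu data, together with the usual rule of rejecting the point when Qcalc ≥ Qtable, can be sketched in Python. The function name `q_test` is hypothetical, and treating the extreme value with the larger gap as the suspect is an assumption made for this sketch, not a rule stated in the slides.

```python
import statistics

def q_test(data, q_table):
    """Return (suspect value, Q_calc, reject?) for the most extreme point."""
    x = sorted(data)
    data_range = x[-1] - x[0]
    gap_low, gap_high = x[1] - x[0], x[-1] - x[-2]
    # Assumption: the suspect is whichever end lies farther from its neighbour.
    if gap_high >= gap_low:
        suspect, gap = x[-1], gap_high
    else:
        suspect, gap = x[0], gap_low
    q = gap / data_range
    return suspect, q, q >= q_table

cu_mg_L = [9.52, 10.7, 13.1, 9.71, 10.3, 9.99]
suspect, q, reject = q_test(cu_mg_L, q_table=0.56)   # Qtable for n = 6, 90% CL

remaining = [v for v in cu_mg_L if v != suspect] if reject else list(cu_mg_L)
mean_rem = statistics.mean(remaining)
sd_rem = statistics.stdev(remaining)
print(f"suspect = {suspect}, Q_calc = {q:.3f}, reject = {reject}")
print(f"remaining data: {mean_rem:.2f} ± {sd_rem:.2f} mg/L")
```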

EVALUATING QUESTIONABLE DATA POINTS USING THE Q-TEST--EXAMPLE

If Qcalc < Qtable, do not reject questionable data point at stated CL.
If Qcalc ≥ Qtable, reject questionable data point at stated CL.

From previous example,
Qcalc (0.670) > Qtable (0.56), so reject data point at 90% CL.

Subsequent calculations (e.g., mean and standard deviation) should then exclude the rejected point.

Mean and std. dev. of remaining data: 10.04 ± 0.47 mg/L

Design                               Data summary   Statistics & Tests
2 independent groups                 Proportions    Chi-square, Fisher-exact
                                     Rank Ordered   Mann-Whitney U
                                     Mean           Unpaired t-test
                                     Survival       Mantel-Haenszel, Log rank
2 related groups                     Proportions    McNemar Chi-square
                                     Rank Ordered   Sign test, Wilcoxon signed rank
                                     Mean           Paired t-test
More than 2 independent groups       Proportions    Chi-square
                                     Rank Ordered   Kruskal-Wallis
                                     Mean           ANOVA
                                     Survival       Log rank
More than 2 related groups           Proportions    Cochran Q
                                     Rank Ordered   Friedman
                                     Mean           Repeated ANOVA
Study of Causation; one              Proportion     Relative Risk, Odds Ratio
independent variable (univariate)    Mean           Correlation coefficient
Study of Causation; more than one    Proportion     Discriminant Analysis, Multiple Logistic Regression, Log Linear Model
independent variable (Multivariate)  Mean           Regression Analysis, Multiple Classification Analysis

Choosing a test for comparing the averages of 2 or more samples of scores of experiments with one treatment factor

Data         Between subjects              Within subjects
             (independent samples)         (related samples)
2 samples
Interval     Independent t-test            Paired t-test
Ordinal      Wilcoxon-Mann-Whitney test    Wilcoxon signed ranks test, Sign test
Nominal      Chi-square test               McNemar test
> 2 samples
Interval     One way ANOVA                 Repeated measures ANOVA
Ordinal      Kruskal-Wallis test           Friedman test
Nominal      Chi-square test               Cochran’s Q test (dichotomous data only)

Scheme for choosing one-sample test

Nominal      2 categories: Binomial test   >2 categories: Chi-square test
Ordinal      Randomness: Runs test         Distribution: Kolmogorov-Smirnov test
Interval     Mean: t-test                  Distribution: Kolmogorov-Smirnov test


Measures of association between 2 variables

Data         Statistic
Interval     Pearson Correlation (r)
Ordinal      Spearman’s Rho, Kendall’s tau-a, tau-b, tau-c
Nominal      Phi, Cramer V

Z-SCORES

 The number of SDs that a specific score is above or below the mean in a distribution
 Raw scores can be converted to z-scores by subtracting the mean from the raw score then dividing the difference by the SD

$$z = \frac{X - \mu}{\sigma}$$

Z-SCORES (CONT.)

 Standardization
   The process of converting raw scores to z-scores
   The resulting distribution of z-scores will always have a mean of zero, a SD of one, and an area under the curve equal to one
 The proportion of scores that are higher or lower than a specific z-score can be determined by referring to a z-table

Z-SCORES (CONT.)

Refer to a z-table to find proportion under the curve
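
Standardization can be illustrated with a minimal Python sketch. The raw scores are hypothetical, and the population SD (`statistics.pstdev`) is used to match the σ in z = (X − µ)/σ.

```python
import statistics

def z_scores(raw):
    """Convert raw scores to z-scores: z = (X - mu) / sigma."""
    mu = statistics.fmean(raw)
    sigma = statistics.pstdev(raw)    # population SD, matching sigma
    return [(x - mu) / sigma for x in raw]

scores = [52, 61, 47, 70, 65, 55]     # hypothetical raw scores
z = z_scores(scores)

print([round(v, 2) for v in z])
print(f"mean of z = {statistics.fmean(z):.6f}, SD of z = {statistics.pstdev(z):.6f}")
```

Whatever the raw scores are, the standardized values have a mean of zero and an SD of one (up to floating-point rounding).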

Z-SCORES (CONT.)

Partial z-table (to z = 1.5) showing proportions of the area under a normal curve for different values of z.

Z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441

The entry for z = 1.5 (0.9332) corresponds to the area under the curve in black.
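
The z-table entries are cumulative proportions of the standard normal curve, so they can be recomputed rather than looked up. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper names `proportion_below` and `z_table_row` are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def proportion_below(z):
    """Proportion of the area under the standard normal curve to the left of z."""
    return std_normal.cdf(z)

def z_table_row(z_tenths):
    """One table row: z_tenths + 0.00, 0.01, ..., 0.09, rounded to 4 places."""
    return [round(proportion_below(z_tenths + h / 100), 4) for h in range(10)]

row_1_0 = z_table_row(1.0)
p_below = proportion_below(1.5)       # proportion of scores lower than z = 1.5
p_above = 1 - p_below                 # proportion of scores higher than z = 1.5

print(row_1_0)
print(f"below z = 1.5: {p_below:.4f}, above: {p_above:.4f}")
```

The proportion above a given z is simply one minus the tabulated value, because the total area under the curve is one.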
