2 Summarizing Data
Objectives
By the end of this lecture, the student should be able to:
• Describe data using tables, charts and graphs
• Know the steps to construct a histogram
• Differentiate between proportions, rates and ratios
• Know the measures of central tendency (measures of central location) and the advantages and disadvantages of each
Tables.
Graphs.
Measures of Central Tendency.
Describing data with tables
1. Frequency table
2. Relative and cumulative frequency
3. Grouped frequency table
4. Cross-tabulation (contingency tables)
5. Non-contingency tables
Frequency table
The following data are the ages of patients admitted to hospital with poliomyelitis:
8,24,18,5,6,12,4,3,3,2,3,23,9,18,16,12,3,5,11,13,15,9,11,11,7,10,6,9,5,16,20,4,3,3,3,10,
3,2,1,6,9,3,7,14,8,1,4,6,4,15,22,2,14,7,1,12,3,23,4,19,6,2,2,4,14,2,2,
21,3,2,9,3,2,1,7,19
Variable   Frequency   Relative frequency (%)   Cumulative frequency   Cumulative relative frequency (%)
1          6           15                       n=11                   27.5
2          14          35                       n=25                   62.5
3          10          25                       n=35                   87.5
4          3           7.5                      n=38                   95
Total      40          100
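As an illustration, the patient ages listed above can be tabulated in a few lines of Python. This is a minimal sketch, not part of the original lecture: the function name `frequency_table` and the column layout are assumptions chosen for the example. It counts each distinct age, then derives the relative and cumulative columns from the counts.

```python
from collections import Counter

# Ages of patients admitted with poliomyelitis, copied from the list above.
ages = [
    8, 24, 18, 5, 6, 12, 4, 3, 3, 2, 3, 23, 9, 18, 16, 12, 3, 5, 11, 13, 15, 9,
    11, 11, 7, 10, 6, 9, 5, 16, 20, 4, 3, 3, 3, 10,
    3, 2, 1, 6, 9, 3, 7, 14, 8, 1, 4, 6, 4, 15, 22, 2, 14, 7, 1, 12, 3, 23, 4,
    19, 6, 2, 2, 4, 14, 2, 2,
    21, 3, 2, 9, 3, 2, 1, 7, 19,
]


def frequency_table(values):
    """Return rows of (value, frequency, relative %, cumulative frequency, cumulative %).

    Hypothetical helper for illustration only.
    """
    counts = Counter(values)
    n = len(values)
    rows = []
    running = 0  # running total used for the cumulative columns
    for value in sorted(counts):
        freq = counts[value]
        running += freq
        rows.append((value, freq, 100 * freq / n, running, 100 * running / n))
    return rows


print(f"{'Age':>5} {'Freq':>6} {'Rel %':>7} {'Cum':>5} {'Cum %':>7}")
for value, freq, rel, cum, cum_pct in frequency_table(ages):
    print(f"{value:>5} {freq:>6} {rel:>7.1f} {cum:>5} {cum_pct:>7.1f}")
```

The last row of any such table should show a cumulative frequency equal to the number of observations and a cumulative percentage of 100, which is a quick check that no value was missed.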
Grouped frequency
Number of Intervals
• There is no clear-cut rule on the number of
intervals or classes that should be used.
• Too many intervals – the data may not be
summarized enough for a clear visualization of
how they are distributed.
• Too few intervals – the data may be over-
summarized and some of the details of the
distribution may be lost.
Cross-tabulation (contingency table), 2 by 2
Breast lump diagnosis    Women with 2 children or less    Totals
                         Yes (%)        No (%)
Contingency tables (2 by 4)
Columns, outcome (number of cases): Cancers, Fatal Heart Disease, Non-Fatal Heart Disease, Healthy, Total
Rows: Factor of diet
Graphs for categorical (nominal) data
The pie chart
• 4-5 categories
• One variable
• Start in the same order as the table
[Figure: pie chart of frequency by group: 0-4, 5-9, 10-14, 15-19, 20-24]
Simple bar chart
• One variable
• Same widths
• Equal spaces between bars
[Figure: bar chart, "Hair colour of the children receiving d-phenothrin"; categories: blonde, brown, red, dark]
Clustered bar chart
• Two or more variables
[Figure: clustered percentage bar chart of hair colour for children receiving malathion and d-phenothrin; legend: blonde, brown, red, dark]
Stacked bar chart
• Two or more variables
[Figure: stacked percentage bar chart, "Smoking status of nursing mothers"; groups: Breast-fed, Bottle-fed; legend: Non-smokers, Former smokers, Smokers]
Graphs for continuous data
Histograms
• Blood pressure data on a sample of 113 men
[Figure: histogram; vertical axis: Number of Men]
Histogram of the Systolic Blood Pressure for 113 men. Each bar
spans a width of 5 mmHg on the horizontal axis. The height of each
bar represents the number of individuals with SBP in that range.
Histograms
[Figures: two histograms; vertical axis: Number of Men]
• Common width: $W = R / k$
where:
R = range of the data
k = the number of intervals (e.g. groups of ages)
Consideration when Determining Width
Data Example
• Weight in pounds of 57 school children at a day-
care center:
68 63 42 27 30 36 28 32 79 27
22 23 24 25 44 65 43 23 74 51
36 42 28 31 28 25 45 12 57 51
12 32 49 38 42 27 31 50 38 21
16 24 69 47 23 22 43 27 49 28
23 19 46 30 43 49 12
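The common-width rule W = R / k can be tried on these weights. The sketch below is illustrative only: the choice of k = 7 intervals, the helper names `class_width` and `grouped_frequency`, and the rounding rule are assumptions, not taken from the lecture. Because the weights are whole pounds, the width is rounded up to a whole number, and R + 1 is used in place of R so that the k classes cover both the smallest and the largest value.

```python
import math

# Weights in pounds of the 57 children, copied from the table above.
weights = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27,
    22, 23, 24, 25, 44, 65, 43, 23, 74, 51,
    36, 42, 28, 31, 28, 25, 45, 12, 57, 51,
    12, 32, 49, 38, 42, 27, 31, 50, 38, 21,
    16, 24, 69, 47, 23, 22, 43, 27, 49, 28,
    23, 19, 46, 30, 43, 49, 12,
]


def class_width(values, k):
    """Whole-number common width for k classes of integer data (assumed rounding rule)."""
    r = max(values) - min(values)  # R = range of the data
    return math.ceil((r + 1) / k)  # round up so k classes span every value


def grouped_frequency(values, k):
    """Return (lower limit, upper limit, count) for each of k equal-width classes."""
    w = class_width(values, k)
    low = min(values)
    counts = [0] * k
    for v in values:
        counts[(v - low) // w] += 1  # index of the class containing v
    return [(low + i * w, low + (i + 1) * w - 1, counts[i]) for i in range(k)]


k = 7  # hypothetical number of intervals, chosen only for this example
print("Range R =", max(weights) - min(weights), "| width W =", class_width(weights, k))
for lower, upper, count in grouped_frequency(weights, k):
    print(f"{lower:>3}-{upper:<3} {count}")
```

Rerunning with a much smaller or larger k shows the trade-off described earlier: too few classes hide the shape of the distribution, too many leave most classes nearly empty.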
Data Example – Step 1
Cumulative frequency curve
[Figure: percentage cumulative frequency curves; horizontal axis: age group (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, > 85); series: Attempting suicide, Later successful]
Percentage cumulative frequency curves of age for male
suicide attempters and later succeeders
Describing data with numeric
summary values
1. Numbers, Percentages, Proportions, Rates.
2. Measures of Central Location.
Numbers, Percentages, Proportions, Rates
Measures of location
Also called measures of central tendency.
Each gives one number that is representative of all the data.
They are the:
Mean
Median
Mode
The mean
• Widely used in statistical calculations
• Calculated as:
The sum of observations divided by the number
of observations
Example ( 5,7,3,38,7) hours, n=5
Total =60 ( sum of data points)
Mean =60/5=12
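The same calculation can be written as a short Python sketch. The helper name `mean` is chosen for this example only. The second call drops the value 38 to show how strongly a single extreme observation pulls the mean.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


hours = [5, 7, 3, 38, 7]
print(mean(hours))         # 60 / 5 = 12.0
print(mean([5, 7, 3, 7]))  # without the extreme value 38: 22 / 4 = 5.5
```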
Mean
Advantages
• Simple and easy
• Most widely used
• Can be used for further statistical tests
• All values are included
• Does not need arrangement of data
Disadvantages
• Affected by extreme values
• Sometimes looks ridiculous, e.g. average number of children = 2.7
The median
• Unlike the mean, is not calculated from the sum and number of observations
• Depends on arranging the data in ascending or descending order of magnitude
• The value of the middle observation is then located
• It divides the observations into two equal parts of 50% each
– If the number of observations in the dataset is odd, the median is the observation at position (n+1)/2 of the ordered data
– If the number of observations is even, the median is defined as the average of the two middle observations
– Example: {8,5,4,12,15,7,28}
• First put the observations in order: 4,5,7,8,12,15,28
• Find position (n+1)/2 = (7+1)/2 = 4; the 4th observation, 8, is the median.
Another Example of the Median
• First arrange the data in order from smallest to
largest.
• If the number of data points is ODD:
– 3, 5, 7, 7, 38
• The median is the value in the middle: 7
• If the number of data points is EVEN:
– 3, 5, 7, 7
• The median is the average of the two values
around the middle: (5+7)/2 = 6
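The odd and even cases above can be sketched in Python as follows. The function name `median` is an assumption made for this example. The data are sorted first, then either the middle value or the average of the two middle values is returned.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # position (n+1)/2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 5, 7, 7, 38]))            # odd n: 7
print(median([3, 5, 7, 7]))                # even n: (5 + 7) / 2 = 6.0
print(median([8, 5, 4, 12, 15, 7, 28]))    # 4th of 7 ordered values: 8
```

For the five values 3, 5, 7, 7, 38 the median is 7 while the mean is 12, which illustrates that the median is not pulled towards the extreme value 38.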
Median
Advantages
• Not affected by extreme values
• Used for growth curves and income
• Can be determined graphically
Disadvantages
• Needs arrangement of data
• Difficult to calculate from large amounts of data
• Not all values are represented
The mode
It is the most common value found in the dataset (the "fashionable" value)
◦ Hb level of 5 pregnant women
12, 12.5, 11, 13, 12.5 Mode = 12.5
◦ Hb level of 6 pregnant women
12, 12.5, 11, 13, 12.5, 8 Mode = 12.5
More than one mode may occur (bimodal, trimodal)
Sometimes there is no mode.
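A minimal Python sketch of the mode follows. The helper name `modes` is hypothetical. It returns every value that shares the highest count, so bimodal and trimodal data give more than one result. Treating a dataset in which no value repeats as having no mode is a convention assumed here, not a rule stated in the lecture.

```python
from collections import Counter


def modes(values):
    """Return all most-common values, or an empty list when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: every value occurs once, so no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([12, 12.5, 11, 13, 12.5]))     # [12.5]
print(modes([12, 12.5, 11, 13, 12.5, 8]))  # [12.5]
print(modes([1, 1, 2, 2, 3]))              # bimodal: [1, 2]
print(modes([1, 2, 3]))                    # no mode: []
```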
Mode
Advantages
• Not affected by extreme values
Disadvantages
• Not all values are represented