
Summarizing data

Objectives
By the end of this lecture, the student should be able to:
• Describe data using tables, charts and graphs
• Know the steps to construct a histogram
• Differentiate between proportions, rates and ratios
• Know the measures of central tendency (measures of central location) and the advantages and disadvantages of each
Tables.
Graphs.
Measures of Central Tendency.

Describing data with tables
1. Frequency table
2. Relative and cumulative frequency
3. Grouped frequency table
4. Cross-tabulation (contingency tables)
5. Non-contingency tables

Frequency table
The following data are the ages of patients admitted to hospital with poliomyelitis:
8,24,18,5,6,12,4,3,3,2,3,23,9,18,16,12,3,5,11,13,15,9,11,11,7,10,6,9,5,16,20,4,3,3,3,10,
3,2,1,6,9,3,7,14,8,1,4,6,4,15,22,2,14,7,1,12,3,23,4,19,6,2,2,4,14,2,2,
21,3,2,9,3,2,1,7,19

Age group (variable)   Tally   Frequency
0-4                            28
5-9
10-14
15-19
20-24
Total
Age       Number of deaths (frequency)
0-<1      564
1-<5      86
5-<15     127
15-<25    490
25-<35    66
35-<45    806
45-<55    1,425
55-<65    3,511
65-<75    6,932
75-<85    10,101
85+       9,825
Total     34,524
(2) Relative frequency, cumulative frequency

Parity   No. of women   Percentage (relative frequency)   Cumulative percentage
0        5              12.5                              12.5 (n=5)
1        6              15                                27.5 (n=11)
2        14             35                                62.5 (n=25)
3        10             25                                87.5 (n=35)
4        3              7.5                               95 (n=38)
7        1              2.5                               97.5 (n=39)
8        1              2.5                               100 (n=40)
Total    40             100
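
The percentage and cumulative percentage columns can be reproduced from the counts alone. The following is a minimal sketch in Python; the function name `frequency_table` and the dictionary layout are illustrative choices, not part of the lecture.

```python
# Parity counts taken from the table above: parity -> number of women.
parity_counts = {0: 5, 1: 6, 2: 14, 3: 10, 4: 3, 7: 1, 8: 1}

def frequency_table(counts):
    """Return rows of (value, frequency, percentage, cumulative percentage)."""
    total = sum(counts.values())
    rows = []
    running = 0  # cumulative frequency so far
    for value in sorted(counts):
        freq = counts[value]
        running += freq
        rows.append((value, freq, 100 * freq / total, 100 * running / total))
    return rows

for value, freq, pct, cum_pct in frequency_table(parity_counts):
    print(f"{value:>6} {freq:>4} {pct:>6.1f} {cum_pct:>6.1f}")
```

The cumulative percentage of the last row is always 100, which is a quick check that no category was missed.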

Grouped frequency

Birth weight (variable)   No. of infants
2500-2899                 2
3000-3399                 3
3400-3899                 9
3900-3899                 9
Total                     23

• Each row is a group interval of 400 g width.
• The first value of an interval is the class lower limit.
• The second value of an interval is the class upper limit.

Number of Intervals
• There is no clear-cut rule on the number of
intervals or classes that should be used.
• Too many intervals – the data may not be
summarized enough for a clear visualization of
how they are distributed.
• Too few intervals – the data may be over-
summarized and some of the details of the
distribution may be lost.

Cross-tabulation (contingency table), 2 by 2

Breast lump diagnosis   Women with 2 children or fewer   Totals
                        Yes (%)        No (%)
Benign                  21 (84%)       11 (73%)         32 (80%)
Malignant               4 (16%)        4 (27%)          8 (20%)
Total                   25 (100%)      15 (100%)        40 (100%)

1. A two-way table (contingency table) is a useful tool for examining relationships between categorical variables.
2. The entries in the cells can be numbers or relative frequencies.
3. Women with 2 children or fewer have a higher percentage of benign breast lumps (84% versus 73%).
4. The number of malignant lumps is the same in both groups (4 each), so malignant lumps do not appear to be influenced by parity.
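
The percentages in the breast lump table are column percentages: each cell is divided by its column total. The following is a minimal sketch in Python; the function name `column_percentages` and the keys "yes" and "no" are illustrative labels for the two parity groups.

```python
# Counts from the breast lump table above.
# "yes" = women with 2 children or fewer, "no" = women with more than 2.
table = {
    "Benign":    {"yes": 21, "no": 11},
    "Malignant": {"yes": 4,  "no": 4},
}

def column_percentages(table):
    """Express each cell as a rounded percentage of its column total."""
    columns = list(next(iter(table.values())).keys())
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        label: {c: round(100 * row[c] / totals[c]) for c in columns}
        for label, row in table.items()
    }

percentages = column_percentages(table)
for label, row in percentages.items():
    print(label, row)
```

Each column of percentages sums to 100, so the two parity groups can be compared even though they differ in size (25 and 15 women).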
Contingency tables (2 by 2)

                                       Outcome
                                       Group I (n=106)    Group II (n=226)
                                       (breast cancer)    (no breast cancer)
Ever use of oral contraceptive   Yes   40 (38%)           138 (61%)
(exposure)                       No    66 (62%)           88 (39%)
Totals                                 106 (100%)         226 (100%)

Contingency tables (2 by 4)

                 Outcome (number of cases)
Factor of diet   Cancers   Fatal heart disease   Non-fatal heart disease   Healthy   Total
Non healthy      15        24                    25                        239       303
Healthy          7         14                    8                         273       302
Total            22        38                    33                        512       605


Describing data with charts or graphs

Graphs for categorical (nominal) data
The pie chart
• 4-5 categories
• One variable
• Start in the same order as the table
[Pie chart: frequency by age group (0-4, 5-9, 10-14, 15-19, 20-24)]

Simple bar chart
• One variable
• Same widths
• Equal spaces between bars
[Bar chart: hair colour of the children receiving d-phenothrin; categories: blonde, brown, red, dark]

Clustered bar chart
• Two or more variables
[Clustered percentage bar chart: hair colour (blonde, brown, red, dark) of the children receiving malathion and d-phenothrin]

Stacked bar chart
• Two or more variables
[Stacked bar chart, 0% to 100%: smoking status of nursing mothers (non-smokers, former smokers, smokers) by feeding group (breast-fed, bottle-fed)]

Graphs for continuous data
Histograms
• Blood pressure data on a sample of 113 men
[Histogram: number of men (vertical axis) by systolic BP in mmHg (horizontal axis, 80 to 160)]

Histogram of the Systolic Blood Pressure for 113 men. Each bar
spans a width of 5 mmHg on the horizontal axis. The height of each
bar represents the number of individuals with SBP in that range.
Histograms

[Histogram: number of men (vertical axis) by systolic BP in mmHg (horizontal axis, 80 to 160)]

Another histogram of the blood pressure of 113 men. In this graph, each bar has a width of 20 mmHg, and there are a total of only 5 bars, making it hard to characterize the distribution of blood pressures in the sample.
Histograms

[Histogram: number of men (vertical axis) by systolic BP in mmHg (horizontal axis, 80 to 160)]

Yet another histogram of the same BP information on 113 men. Here, the bin width is 1 mmHg, perhaps giving more detail than is necessary.
Width of Intervals
• Without some specific reason, the intervals should
all be the same width.

• Common width: W = R / k
where:
R = range of the data
k = the number of intervals (e.g. groups of ages)

Consideration when Determining Width

• Width should be chosen so that it is convenient to use or easy to recognize (e.g. multiples of 5).
• The beginning of the first interval must be low
enough so that the first interval includes the smallest
observation.
• If the data has x decimal places, the interval limits
should also have x decimal places.

Data Example
• Weight in pounds of 57 school children at a day-care center:
68 63 42 27 30 36 28 32 79 27
22 23 24 25 44 65 43 23 74 51
36 42 28 31 28 25 45 12 57 51
12 32 49 38 42 27 31 50 38 21
16 24 69 47 23 22 43 27 49 28
23 19 46 30 43 49 12

Data Example – Step 1

• From the data we have:
• Minimum = 12
• Maximum = 79
• R = 79 - 12 = 67
• W = R / k
• If we use k = 5 and k = 15 we get:
• W = 67/5 = 13.4
• W = 67/15 ≈ 4.5
• Since the dataset is not large, we will choose W = 10 to have fewer intervals.
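
The arithmetic of Step 1 can be checked with a short script. This is a minimal sketch in Python using the 57 weights listed earlier; the function name `interval_width` is an illustrative choice.

```python
# Weights (lbs) of the 57 children, copied from the data example.
weights = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27,
    22, 23, 24, 25, 44, 65, 43, 23, 74, 51,
    36, 42, 28, 31, 28, 25, 45, 12, 57, 51,
    12, 32, 49, 38, 42, 27, 31, 50, 38, 21,
    16, 24, 69, 47, 23, 22, 43, 27, 49, 28,
    23, 19, 46, 30, 43, 49, 12,
]

def interval_width(data, k):
    """Common width W = R / k, where R is the range of the data."""
    data_range = max(data) - min(data)
    return data_range / k

print("R =", max(weights) - min(weights))
for k in (5, 15):
    print(f"k = {k}: W = {interval_width(weights, k):.1f}")
```

The computed widths only bracket the choice. A round value between them, such as 10, is usually easier to read on a table or histogram.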
Data Example – Step 2
• Next we have to construct the intervals.
• With W = 10
• Minimum observation = 12
• Choose the first interval to start at 10 to include 12.
INTERVALS (in lbs):
10-20
21-30
31-40
41-50
51-60
61-70
71-80
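
Once the intervals are fixed, the grouped frequency table follows by counting the observations that fall in each one. The following is a minimal sketch in Python; it treats both limits of each interval as inclusive, which is an assumption about how the intervals above are meant to be read, and the function name `grouped_frequency` is illustrative.

```python
# Weights (lbs) of the 57 children, copied from the data example.
weights = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27,
    22, 23, 24, 25, 44, 65, 43, 23, 74, 51,
    36, 42, 28, 31, 28, 25, 45, 12, 57, 51,
    12, 32, 49, 38, 42, 27, 31, 50, 38, 21,
    16, 24, 69, 47, 23, 22, 43, 27, 49, 28,
    23, 19, 46, 30, 43, 49, 12,
]

# Intervals from Step 2, as (lower limit, upper limit) pairs.
intervals = [(10, 20), (21, 30), (31, 40), (41, 50),
             (51, 60), (61, 70), (71, 80)]

def grouped_frequency(data, intervals):
    """Count how many observations fall in each interval (limits inclusive)."""
    counts = {interval: 0 for interval in intervals}
    for x in data:
        for low, high in intervals:
            if low <= x <= high:
                counts[(low, high)] += 1
                break  # each observation belongs to exactly one interval
    return counts

counts = grouped_frequency(weights, intervals)
for (low, high), freq in counts.items():
    print(f"{low}-{high}: {freq}")
print("Total:", sum(counts.values()))
```

The total of the frequencies should equal the number of observations (57); a smaller total would mean some value fell outside every interval.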

Cumulative frequency curve

[Line chart: percentage cumulative frequency (vertical axis) by age group (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, >85); series: attempting suicide, later successful]
Percentage cumulative frequency curves of age for male
suicide attempters and later succeeders

Describing data with numeric
summary values
1. Numbers, Percentages, Proportions, Rates.
2. Measures of Central Location.

Numbers, Percentages, Proportions, Rates

• Numbers: the numerical summaries of data.
• Percentage: a proportion multiplied by 100 (categorical data).
• Proportion: the number of existing cases in a population divided by the population at risk at a given time.
• Rate: the number of cases divided by the population at risk, expressed per 100 or per 1000 of the population.
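
The three definitions differ only in the multiplier applied to the same fraction. The following is a minimal sketch in Python, using the 8 malignant lumps among 40 women from the breast lump table purely as illustrative numbers; the function names are hypothetical.

```python
def proportion(cases, population_at_risk):
    """Cases divided by the population at risk (a value between 0 and 1)."""
    return cases / population_at_risk

def percentage(cases, population_at_risk):
    """A proportion multiplied by 100."""
    return 100 * cases / population_at_risk

def rate(cases, population_at_risk, per=1000):
    """Cases per `per` members of the population at risk."""
    return per * cases / population_at_risk

cases, at_risk = 8, 40  # illustrative: malignant lumps among 40 women
print(proportion(cases, at_risk))       # 0.2
print(percentage(cases, at_risk))       # 20.0
print(rate(cases, at_risk, per=1000))   # 200.0
```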

Measures of central location
Also called measures of central tendency.
Each gives one number that is representative of all the data.
They are the:
Mean
Median
Mode
The mean
• Widely used in statistical calculations
• Calculated as: the sum of the observations divided by the number of observations
Example: (5, 7, 3, 38, 7) hours, n = 5
Total = 60 (sum of data points)
Mean = 60/5 = 12 hours
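
The same calculation can be written as a short function. This is a minimal sketch in Python; the hand-written `mean` is compared with `statistics.mean` from the standard library as a cross-check.

```python
import statistics

hours = [5, 7, 3, 38, 7]  # the example data above

def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

print(sum(hours))               # 60
print(mean(hours))              # 12.0
print(statistics.mean(hours))   # same value from the standard library
```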

Mean
Advantages
• Simple and easy
• Most widely used
• Can be used for further statistical tests
• All values are included
• Does not need arrangement of data
Disadvantages
• Affected by extreme values
• Sometimes looks ridiculous, e.g. average number of children = 2.7

The median
• Does not depend on the sum and number of observations
• Depends on arranging the data in ascending or descending order of magnitude
• The value of the middle observation is then located
• It divides the observations into two equal parts of 50% each

– If the number of observations in the dataset is odd, the median is the observation in position (n+1)/2
– If the number of observations is even, the median is defined as the average of the two middle observations
– Example: {8,5,4,12,15,7,28}
• First put the observations in order: 4,5,7,8,12,15,28
• Find position (n+1)/2 = (7+1)/2 = 4; the 4th observation is 8, so the median is 8.

Another Example of the Median
• First arrange the data in order from smallest to
largest.
• If the number of data points is ODD:
– 3, 5, 7, 7, 38
• The median is the value in the middle: 7
• If the number of data points is EVEN:
– 3, 5, 7, 7
• The median is the average of the two values
around the middle: (5+7)/2 = 6
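
The odd and even cases can be combined in one function. This is a minimal sketch in Python that sorts the data first and then applies the two rules; the function name `median` is an illustrative choice (the standard library offers `statistics.median` for the same purpose).

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # position (n+1)/2, counting from 1
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([3, 5, 7, 7, 38]))            # odd n: 7
print(median([3, 5, 7, 7]))                # even n: 6.0
print(median([8, 5, 4, 12, 15, 7, 28]))    # 8

# Effect of an extreme value: the mean of 3, 5, 7, 7, 38 is 12,
# while the median stays at 7.
data = [3, 5, 7, 7, 38]
print(sum(data) / len(data), median(data))
```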

Median
Advantages
• Not affected by extreme values
• Used for growth curves and income
• Can be determined graphically
Disadvantages
• Needs arrangement of data
• Difficult to calculate from large amounts of data
• Not all values are represented

Friday, November 13, 2020


The mode

• The most frequent observation
• Easily understandable
• Not affected by extreme values
• Its exact location is not clearly defined
• Not used in biological and medical statistics

It is the most common value found in the dataset (the "fashionable" value).
◦ Hb level of 5 pregnant women: 12, 12.5, 11, 13, 12.5. Mode = 12.5
◦ Hb level of 6 pregnant women: 12, 12.5, 11, 13, 12.5, 8. Mode = 12.5
More than one mode may occur (bimodal, trimodal).
Sometimes there is no mode.
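
The three situations (one mode, several modes, no mode) can be handled by counting how often each value occurs. This is a minimal sketch in Python; the function name `modes` is illustrative, and treating a dataset in which every value occurs once as having no mode is an assumed convention. The bimodal list is made up for illustration.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: every value occurs once, so no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([12, 12.5, 11, 13, 12.5]))       # [12.5]
print(modes([12, 12.5, 11, 13, 12.5, 8]))    # [12.5]
print(modes([1, 1, 2, 2, 3]))                # made-up bimodal data: [1, 2]
print(modes([8, 5, 4, 12, 15, 7, 28]))       # all values distinct: []
```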

Mode
Advantages
• Not affected by extreme values
Disadvantages
• Not all values are represented
