
Chap. 2 Summarizing data
 What is a statistic?
─ In this chapter we shall see how data can be
summarized to help reveal the information they contain.
We do this by calculating numbers from the data which
extract the important material. These numbers are
called statistics.
─ A statistic is anything calculated from the data alone.
Frequency distribution

 Frequency distribution for qualitative data


─ A frequency distribution for qualitative data lists all
categories and the number of elements that belong to each
of the categories.
─ When data are purely qualitative, the simplest way to deal
with them is to count the number of cases in each category. The
count of individuals having a particular quality is called the
frequency of that quality.
─ The proportion of individuals having the quality is called the
relative frequency or proportional frequency.

$$\text{Relative frequency of a category} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}}$$
─ The set of frequencies of all the possible categories is called
the frequency distribution of the variable.
─ The cumulative frequency for a value of a variable is the number
of individuals with values less than or equal to that value.
─ The relative cumulative frequency for a value is the
proportion of individuals in the sample with values less than or
equal to that value.

$$\text{Relative cumulative frequency of a category} = \frac{\text{Cumulative frequency of that category}}{\text{Sum of all frequencies}}$$
─ In the census of patients in Tooting Bec Hospital we also
assessed whether patients were ‘likely to be discharged’,
‘possibly to be discharged’ or ‘unlikely to be discharged’. The
frequencies of these categories are shown in Table 2.1.
Likelihood of discharge is a qualitative variable, like
diagnosis, but the categories are ordered. This enables us to use
another set of summary statistics, the cumulative frequencies.
Table 2.1 Likelihood of discharge of patients in Tooting Bec Hospital

Discharge   Frequency   Relative    Cumulative   Relative cumulative
                        frequency   frequency    frequency
Unlikely       871        0.59          871         0.59
Possible       339        0.23         1210         0.82
Likely         257        0.18         1467         1.00
Total         1467        1.00         1467         1.00
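
The columns of Table 2.1 follow directly from the two formulas for relative and relative cumulative frequency. The following is a minimal sketch in Python; the slides themselves refer to SPSS, so the language and the variable names are assumptions made for illustration only.

```python
from itertools import accumulate

# Frequencies from Table 2.1, in the order the categories are listed there.
categories = ["Unlikely", "Possible", "Likely"]
frequencies = [871, 339, 257]

total = sum(frequencies)                               # sum of all frequencies
relative = [f / total for f in frequencies]            # relative frequency
cumulative = list(accumulate(frequencies))             # cumulative frequency
relative_cumulative = [c / total for c in cumulative]  # relative cumulative frequency

# Print one row per category, rounded to two decimals as in the table.
for name, f, r, c, rc in zip(categories, frequencies, relative,
                             cumulative, relative_cumulative):
    print(f"{name:<10}{f:>6}{r:>8.2f}{c:>7}{rc:>8.2f}")
```

Cumulative frequencies are only meaningful here because the categories are ordered; for an unordered variable such as diagnosis the running total would depend on an arbitrary listing order.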


─ For example, in the analysis of the census of a psychiatric
hospital population, one of the variables of interest was the
patient’s principal diagnosis. To summarize these data, we
count the number of patients having each diagnosis. The results
are shown in Table 4.2.
Table 4.2 Principal diagnosis of patients in Tooting Bec Hospital

Diagnosis                 Number of patients
Schizophrenia                   474
Affective disorders             277
Organic brain syndrome          405
Subnormality                     58
Alcoholism                       57
Other and not known             196
Total                          1467
─ Table 2.3 shows the frequency distribution of a quantitative
variable, parity. This is the number of previous pregnancies
for a sample of women booking for delivery at St. George’s
Hospital. Only certain values are possible, as the number of
pregnancies must be an integer, so this variable is discrete. The
frequency of each separate value is given.
Table 2.3 Parity of 125 women attending antenatal clinics at St. George’s Hospital

Parity   Frequency   Relative    Cumulative   Relative cumulative
                     frequency   frequency    frequency
                     (percent)                (percent)
0            59        47.2          59           47.2
1            44        35.2         103           82.4
2            14        11.2         117           93.6
3             3         2.4         120           96.0
4             4         3.2         124           99.2
5             1         0.8         125          100.0
Total       125       100.0         125          100.0
─ Table 2.4 shows a continuous variable, forced expiratory
volume in one second (FEV1), in a sample of male medical
students. How do we get the frequency distribution of a
continuous variable?

Table 2.4 FEV1 (litres) of 57 male medical students

2.85 3.19 3.50 3.69 3.90 4.14 4.32 4.50 4.80 5.20
2.85 3.20 3.54 3.70 3.96 4.16 4.44 4.56 4.80 5.30
2.98 3.30 3.54 3.70 4.05 4.20 4.47 4.68 4.90 5.43
3.04 3.39 3.57 3.75 4.08 4.20 4.47 4.70 5.00
3.10 3.42 3.60 3.78 4.10 4.30 4.47 4.71 5.10
3.10 3.48 3.60 3.83 4.14 4.30 4.50 4.78 5.10
─ As most of the values occur only once, to get a useful
frequency distribution we need to divide the FEV1 scale into
class intervals, e.g. from 3.0 to 3.5, from 3.5 to 4.0, and so on,
and count the number of individuals with FEV1s in each class
interval. The class intervals should not overlap, so we must
decide which interval contains the boundary point to avoid it
being counted twice. It is usual to put the lower boundary of an
interval into that interval and the upper boundary into the next
interval. Thus the interval starting at 3.0 and ending at 3.5
contains 3.0 but not 3.5. We can write this as ‘3.0 -’ or ‘3.0 -
3.5’ or ‘3.0 – 3.499’. Including the lower boundary in the class
interval has this advantage: most distributions of
measurements have a zero point below which we cannot
go, whereas few have an exact upper limit.
─ If we take a starting point of 2.5 and an interval of 0.5 we get
the frequency distribution shown in Table 2.5.
─ Note that this frequency distribution is not unique: a different
starting point or interval width would give different frequencies.
─ The frequency distribution can be calculated easily and
accurately using a software package such as SPSS.
Table 2.5 Frequency distribution of FEV1 in 57 male medical students

FEV1     Frequency   Relative frequency (percent)
2.0           0            0.0
2.5           3            5.3
3.0           9           15.8
3.5          14           24.6
4.0          15           26.3
4.5          10           17.5
5.0           6           10.5
5.5           0            0.0
Total        57          100.0
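
The counting rule described for class intervals (lower boundary included, upper boundary excluded) can be sketched in a few lines. This is an illustrative Python sketch, not the SPSS procedure the slides refer to; the function name `class_lower_bound` and the constants `START` and `WIDTH` are hypothetical names chosen here.

```python
import math
from collections import Counter

# FEV1 values (litres) of the 57 students, copied row by row from Table 2.4.
fev1 = [
    2.85, 3.19, 3.50, 3.69, 3.90, 4.14, 4.32, 4.50, 4.80, 5.20,
    2.85, 3.20, 3.54, 3.70, 3.96, 4.16, 4.44, 4.56, 4.80, 5.30,
    2.98, 3.30, 3.54, 3.70, 4.05, 4.20, 4.47, 4.68, 4.90, 5.43,
    3.04, 3.39, 3.57, 3.75, 4.08, 4.20, 4.47, 4.70, 5.00,
    3.10, 3.42, 3.60, 3.78, 4.10, 4.30, 4.47, 4.71, 5.10,
    3.10, 3.48, 3.60, 3.83, 4.14, 4.30, 4.50, 4.78, 5.10,
]

START = 2.5   # starting point used for Table 2.5
WIDTH = 0.5   # class interval width used for Table 2.5

def class_lower_bound(value, start=START, width=WIDTH):
    """Lower boundary of the interval [lower, lower + width) containing value."""
    return start + math.floor((value - start) / width) * width

counts = Counter(class_lower_bound(v) for v in fev1)
n = len(fev1)

# One row per class: lower boundary, frequency, relative frequency (percent).
for lower in sorted(counts):
    print(f"{lower:.1f} -  {counts[lower]:>3}  {100 * counts[lower] / n:5.1f}")
```

Because `math.floor` is used, a value that falls exactly on a boundary (such as 3.50) goes into the interval that starts there. With a width that is not exactly representable in binary floating point (0.1, for example), boundary values would need more careful handling.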
Histograms
 Graphical methods are very useful for examining frequency
distributions.
 The most common way of depicting a frequency
distribution is by a histogram. This is a diagram where
the class intervals are on an axis and rectangles with
heights (or areas) proportional to the frequencies are
erected on them.
 Figure 2.1 shows the histogram for the FEV1 distribution in
Table 2.5. The vertical scale shows frequency, the number
of observations in each interval.
 Sometimes we want to show the distribution of a discrete
variable (e.g. Table 2.3) as a histogram. If our intervals
are 0-1-, 1-2-, etc., the actual observations will all be at one
end of the interval (Figure 2.2).
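
To make the idea of "rectangles with heights proportional to the frequencies" concrete, here is a small text-only sketch in Python that draws one bar per class interval of Table 2.5. It is an illustration under stated assumptions: the function name `text_histogram` is hypothetical, and a real analysis would normally use the plotting facilities of a statistics package.

```python
# Class lower boundaries and frequencies taken from Table 2.5
# (the empty classes 2.0 and 5.5 are left out).
classes = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
frequencies = [3, 9, 14, 15, 10, 6]

def text_histogram(lower_bounds, freqs, symbol="*"):
    """Return one line per class; the bar length equals the class frequency."""
    return [f"{lower:4.1f} - | {symbol * f} ({f})"
            for lower, f in zip(lower_bounds, freqs)]

bars = text_histogram(classes, frequencies)
print("\n".join(bars))
```

Since all class intervals here have the same width, making the bar length proportional to the frequency is equivalent to making the area proportional to it.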
[Fig 2.1 Histogram of FEV1: frequency scale (Table 2.5). Horizontal axis: FEV1 (litres); vertical axis: frequency.]

[Fig 2.2 Histogram of parity (Table 2.3). Horizontal axis: parity; vertical axis: frequency.]


Shapes of frequency distribution
 Figure 2.1 shows a frequency distribution of a shape
often seen in medical data.
 The distribution is roughly symmetrical about its central
value and has frequency concentrated about one central
point; this is called central tendency.
 The tendency of observed values to deviate from the
central point toward the two ends of the distribution is
called dispersion or spread.
Measures of central tendency and dispersion:
summary of a numerical variable
 Measures of central tendency
─ Mean or arithmetic mean

─ Geometric mean

─ Median

─ Mode

 Statistic and parameter

─ A descriptive measure computed from the data of a sample is
called a statistic.
─ A descriptive measure computed from the data of a population is
called a parameter.
 Definition of mean
─ It is obtained by dividing the sum of all values by the number of
values in the data set.
─ Mean = Sum of all values / Number of values.
─ It is the most frequently used measure of central tendency for
a variable with a normal distribution.
─ The mean calculated for sample data is denoted by $\bar{x}$
(read as “x bar”).
─ Mean for sample data:
$$\bar{x} = \frac{\sum x}{n}$$
─ The mean calculated for population data is denoted by $\mu$
(read as “mu”).
─ Mean for population data:
$$\mu = \frac{\sum x}{N}$$
 The following are the ages of all eight employees of a small
company: 53 32 61 27 39 44 49 57
Find the mean age of these employees.
 Solution:
$$\sum x = 53 + 32 + 61 + 27 + 39 + 44 + 49 + 57 = 362$$

 The population mean is
$$\mu = \frac{\sum x}{N} = \frac{362}{8} = 45.25 \text{ years}$$

 If n or N is large (say, more than 10), calculating the mean by
hand is tedious and error-prone, so it is better done with
software such as SPSS.
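
As a check on the arithmetic, the calculation can be reproduced in a few lines of Python (an illustrative choice; the slides mention SPSS). The sketch uses the eight ages as they appear in the sum of the solution, and the variable names are assumptions.

```python
from statistics import mean

# The eight ages as they are added up in the solution.
ages = [53, 32, 61, 27, 39, 44, 49, 57]

total = sum(ages)          # sum of all values
mu = total / len(ages)     # population mean: sum divided by N
print(total, mu)

# The standard-library helper applies the same definition.
assert mean(ages) == mu
```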
 Definition of geometric mean (G)
─ When the distribution is positively skewed, a geometric mean
may be more appropriate than the arithmetic mean.
─ Formula:
$$G = \log^{-1}\left(\frac{\sum \log X}{n}\right)$$
 For example, the HBsAg titers of seven patients with
hepatitis are as follows:
1:16, 1:32, 1:32, 1:64, 1:64, 1:128, 1:512.
─ Find the geometric mean.

 Solution:
─ The characteristic of this data set is that the values are
related by multiples (each is a power of 2), so the geometric
mean is appropriate.
─ Applying the formula to the reciprocals of the titers
(lg denotes the logarithm to base 10):
$$G = \log^{-1}\left(\frac{\lg 16 + \lg 32 + \lg 32 + \lg 64 + \lg 64 + \lg 128 + \lg 512}{7}\right) = \log^{-1}(1.8062) = 64$$
─ The geometric mean titer is therefore 1:64.
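
The geometric mean formula (antilog of the mean of the logs) translates directly into code. The following Python sketch works on the reciprocals of the titers; the variable names are assumptions, and `statistics.geometric_mean` from the standard library (Python 3.8 or later) is used only as a cross-check.

```python
import math
from statistics import geometric_mean

# Reciprocal titers from the example (1:16 becomes 16, and so on).
titers = [16, 32, 32, 64, 64, 128, 512]

# Mean of the base-10 logarithms, then the antilog.
mean_log = sum(math.log10(x) for x in titers) / len(titers)
g = 10 ** mean_log
print(round(mean_log, 4), round(g, 2))

# Cross-check against the standard-library implementation.
assert math.isclose(g, geometric_mean(titers))
```

For comparison, the arithmetic mean of the same seven values is 848 / 7, about 121, which is pulled upward by the single large value 512.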
 Definition of median
─ The median is the value of the middle term in a data set that has
been ranked in increasing order. It is often used for a variable
with a skewed distribution.
─ As is obvious from the definition, the median divides a ranked
data set into two equal parts.
─ The calculation of the median consists of the following two steps:
 Rank the data set in increasing order.

 Find the middle term. The value of this term is the median.

 Position of the middle term = $\frac{n+1}{2}$

─ If the number of observations in a data set is odd, then the
median is given by the value of the middle term in the ranked data.
─ If the number of observations is even, then the median is given by
the average of the values of the two middle terms.
 For example, the following data give the weight lost (in
pounds) by a sample of five members of a health club at the
end of two months of membership.
10 5 19 8 3
─ Find the median.

 Solution:
─ First, we rank the given data in increasing order as follows:
3 5 8 10 19
─ There are five observations in the data set. Consequently, n = 5
and
Position of the middle term = $\frac{n+1}{2} = \frac{5+1}{2} = 3$
─ Therefore, the median is the value of the third term in the ranked
data, that is, 8 pounds.
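
The two-step procedure (rank, then locate the middle term) can be written out as a short function. This is a sketch in Python; the name `median_by_position` is hypothetical, and the standard-library `statistics.median` is used only to confirm the result.

```python
from statistics import median

def median_by_position(values):
    """Median via ranking and the (n + 1) / 2 position rule."""
    ranked = sorted(values)                  # step 1: rank in increasing order
    n = len(ranked)
    if n % 2 == 1:
        position = (n + 1) // 2              # 1-based position of the middle term
        return ranked[position - 1]
    # Even n: (n + 1) / 2 falls between two terms, so average them.
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2

weight_lost = [10, 5, 19, 8, 3]
print(median_by_position(weight_lost))
assert median_by_position(weight_lost) == median(weight_lost)
```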
 Definition of mode
─ The mode is the value that occurs with the highest frequency in
a data set.
─ For example: the ages of 10 randomly selected students from a
class are 21, 19, 27, 22, 29, 19, 25, 18, 19, and 30. The value 19
occurs three times and every other value occurs once, so the
mode is 19.
─ For a variable with a normal distribution, the mode, median
and mean are the same value.
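
Finding the mode amounts to building a frequency count and picking the value with the largest count. A minimal Python sketch for the ages in the example follows; the variable names are assumptions.

```python
from collections import Counter

ages = [21, 19, 27, 22, 29, 19, 25, 18, 19, 30]

counts = Counter(ages)                         # frequency of each distinct value
value, frequency = counts.most_common(1)[0]    # value with the highest frequency
print(value, frequency)
```

If several values share the highest frequency the data set has more than one mode, and `most_common(1)` would report only one of them.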
 For example, consider the following two data sets on the
ages of all workers in each of two small companies:
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the
same, 40 years. But the variation in the workers’ ages is very
different for the two companies. As the ranked values below
show, the ages of the workers in the second company have a
much larger variation than the ages of the workers in the
first company.
Company 1 (ranked): 35 36 38 39 40 45 47
Company 2 (ranked): 18 27 33 52 70
 Therefore, to reveal the shape of the distribution of a
data set, it is necessary to measure not only the central
tendency but also the dispersion of a variable.
 The dispersion of a set of observations refers to the variety
that they exhibit.
 A measure of dispersion conveys information regarding the
amount of variability present in a set of data.
 Measures of dispersion
─ Range

─ Variance and standard deviation

─ Percentiles and quartiles

─ Coefficient of variation

─ Box-and-whisker plots
 Range
─ The range is the simplest measure of dispersion to calculate.
─ It is obtained by taking the difference between the largest and the
smallest values in a data set.
─ For example,
In company 1, the range = Largest value (47) – Smallest value
(35) = 12
In company 2, the range = Largest value (70) – Smallest value
(18) = 52
─ The advantage of using the range as a measure of dispersion is
that it is very simple to compute.
─ The disadvantage of using the range as a measure of dispersion is
that its calculation is based on only two values: the largest and the
smallest. All other values in a data set are ignored when
calculating the range.
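
The two company data sets make a compact illustration of equal means with very different ranges. A short Python sketch (the function name `value_range` is a hypothetical helper):

```python
company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

for name, ages in (("Company 1", company_1), ("Company 2", company_2)):
    mean_age = sum(ages) / len(ages)
    print(name, "mean:", mean_age, "range:", value_range(ages))
```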
 The deviation of the x value from the mean
─ $x - \mu$ or $x - \bar{x}$ is called the deviation of the x value
from the mean.
─ The sum of the deviations of the x values from the mean is
always zero, because the positive deviations (values above the
mean) and the negative deviations (values below the mean)
cancel exactly. That is,
$$\sum (x - \mu) = 0 \quad \text{and} \quad \sum (x - \bar{x}) = 0$$
─ For this reason we square the deviations to calculate the
variance and standard deviation.
─ For example, suppose the scores of four students in
Statistics are 82, 95, 67, and 92. The mean score for these
four students is
$$\bar{x} = \frac{82 + 95 + 67 + 92}{4} = 84$$
─ The deviations of the four scores from the mean are calculated
in Table 4.1.
Table 4.1

x       $x - \bar{x}$
82      82 - 84 = -2
95      95 - 84 = +11
67      67 - 84 = -17
92      92 - 84 = +8
        $\sum (x - \bar{x}) = 0$
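
That the deviations always sum to zero follows in one line from the definition of the sample mean, $\bar{x} = \sum x / n$. The worked equation below is a short derivation added for illustration, followed by a check with the four scores:

$$\sum (x - \bar{x}) = \sum x - n\bar{x} = \sum x - n \cdot \frac{\sum x}{n} = 0$$

$$(-2) + 11 + (-17) + 8 = 0$$

The same argument holds for a population, with $\mu$ and $N$ in place of $\bar{x}$ and $n$.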
 Variance and standard deviation
─ Variance for population data is denoted by $\sigma^2$ (read as
sigma squared). The formula is
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}, \quad N = \text{population size}$$
─ Variance for sample data is denoted by $s^2$. The formula is
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}, \quad n = \text{sample size}$$
n - 1 is called the degrees of freedom.
─ The standard deviation is obtained by taking the positive
square root of the variance.
• Population standard deviation: $\sigma = \sqrt{\sigma^2}$
• Sample standard deviation: $s = \sqrt{s^2}$
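
The difference between the population and sample formulas is only the divisor (N versus n - 1). The Python sketch below applies both to the four scores 82, 95, 67 and 92 purely for illustration; whether those scores are a sample or a whole population is not stated in the slides, so treating them both ways is an assumption. The `statistics` functions are used as a cross-check.

```python
import math
from statistics import pvariance, variance

scores = [82, 95, 67, 92]
n = len(scores)
mean_score = sum(scores) / n

# Squared deviations from the mean: 4, 121, 289, 64.
squared_devs = [(x - mean_score) ** 2 for x in scores]

pop_var = sum(squared_devs) / n          # population formula: divide by N
samp_var = sum(squared_devs) / (n - 1)   # sample formula: divide by n - 1
pop_sd = math.sqrt(pop_var)              # positive square root of the variance
samp_sd = math.sqrt(samp_var)
print(pop_var, round(samp_var, 2), round(pop_sd, 2), round(samp_sd, 2))

# Cross-check with the standard library.
assert math.isclose(pop_var, pvariance(scores))
assert math.isclose(samp_var, variance(scores))
```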
 The coefficient of variation
─ The standard deviation is useful as a measure of variation within a
given set of data; it measures absolute variation.
─ However, when one desires to compare the dispersion in two sets
of data, comparing the two standard deviations may lead to
fallacious results.
 It may be that the two variables involved are measured in
different units. For example, weight (grams) and
height (centimeters).
 Although the same unit of measurement is used, the two
means may be quite different. For example, adult height
and children’s height.
─ The coefficient of variation expresses the standard deviation as a
percentage of the mean; it measures relative variation.
─ The formula is given by
$$\text{C.V.} = \frac{s}{\bar{x}} \times 100$$
Suppose two samples of human males yield the following results:

                     Sample 1       Sample 2
Age                  25 years       11 years
Mean weight          145 pounds     80 pounds
Standard deviation   10 pounds      10 pounds

We wish to know which is more variable, the weights of the
25-year-olds or the weights of the 11-year-olds.
 Solution:
A comparison of the standard deviations might lead one to
conclude that the two samples possess equal variability. If we
compute the coefficients of variation, however, we have for the
25-year-olds
$$\text{C.V.} = \frac{10}{145} \times 100 = 6.9$$
and for the 11-year-olds
$$\text{C.V.} = \frac{10}{80} \times 100 = 12.5$$
If we compare these results we get quite a different impression:
relative to their mean, the weights of the 11-year-olds are more
variable than the weights of the 25-year-olds.
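
The comparison can be reproduced with a one-line function. The Python sketch below is illustrative; the name `coefficient_of_variation` is a hypothetical helper, and the inputs are the standard deviations and means of the two samples.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_25 = coefficient_of_variation(10, 145)   # 25-year-olds
cv_11 = coefficient_of_variation(10, 80)    # 11-year-olds
print(round(cv_25, 1), round(cv_11, 1))
```

Because the coefficient of variation is a ratio of two quantities in the same unit, it has no unit itself, which is what makes it usable for comparing variables measured on different scales.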


 Percentiles and quartiles
─ The so-called location parameters “locate” the distribution on the
horizontal axis; examples are the mean, median, percentiles and
quartiles.
─ Definition of percentiles: given a set of n observations x1, x2, …, xn,
the pth percentile P is the value of X such that p percent or less of
the observations are less than P and (100 - p) percent or less of the
observations are greater than P.
─ Subscripts on P serve to distinguish one percentile from another.
The 10th percentile, for example, is designated P10, the 70th is
designated P70, and so on.
─ The 50th percentile is the median and is designated P50.
─ The 25th percentile is often referred to as the first quartile and
denoted Q1.
─ The 50th percentile is referred to as the second or middle quartile
and written Q2.
─ The 75th percentile is referred to as the third quartile, Q3.
 Interquartile range
─ The interquartile range reflects the variability among the
middle 50 percent of the observations in a data set.
─ The interquartile range (IQR) is the difference between
the third and first quartiles, that is,

IQR = Q3 - Q1
─ Calculating the above-mentioned statistics and
parameters by hand is laborious; it is much simpler with
software such as SPSS or R.
 Box-and-whisker plots (boxplots)
A useful visual device for communicating the information contained
in a data set is the box-and-whisker plot. The construction of a
box-and-whisker plot (sometimes called, simply, a boxplot) makes
use of the quartiles of a data set and may be accomplished by
following these five steps:
─ Represent the variable of interest on the horizontal axis.
─ Draw a box in the space above the horizontal axis in such a way
that the left end of the box aligns with the first quartile Q1 and the
right end of the box aligns with the third quartile Q3.
─ Divide the box into two parts by a vertical line that aligns with the
median Q2.
─ Draw a horizontal line called a whisker from the left end of the box
to a point that aligns with the smallest measurement in the data
set.
─ Draw another horizontal line, or whisker, from the right end of the
box to a point that aligns with the largest measurement in the data
set.
Table 4.2 Diameters (cm) of pure sarcomas removed from the
breasts of 20 women
0.5 1.2 2.1 2.5 2.5 3.0 3.8 4.0 4.2 4.5
5.0 5.0 5.0 5.0 6.0 6.5 7.0 8.0 9.5 13.0

Try to describe the amount of spread, the location of
concentration, and the symmetry of the data.
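
The five values needed to draw a box-and-whisker plot (smallest value, Q1, median, Q3, largest value) and the interquartile range can be computed as follows. This is a sketch in Python under two assumptions: the split entry "4. 5" in the table is read as 4.5, and the quartiles are located at positions k(n + 1)/4 in the ranked data with linear interpolation, which is what `statistics.quantiles` does with `method="exclusive"`. Statistical packages use several slightly different quartile conventions, so software output may differ a little from these values.

```python
from statistics import quantiles

# Diameters (cm) from Table 4.2, with the split entry read as 4.5 (an assumption).
diameters = [0.5, 1.2, 2.1, 2.5, 2.5, 3.0, 3.8, 4.0, 4.2, 4.5,
             5.0, 5.0, 5.0, 5.0, 6.0, 6.5, 7.0, 8.0, 9.5, 13.0]

# Quartiles at positions k(n + 1)/4 of the ranked data, linearly interpolated.
q1, q2, q3 = quantiles(diameters, n=4, method="exclusive")
iqr = q3 - q1

five_numbers = (min(diameters), q1, q2, q3, max(diameters))
print(five_numbers, iqr)
```

The distance from Q3 to the largest value is much greater than the distance from the smallest value to Q1, which is the numerical counterpart of one whisker being longer than the other.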
[Figure 4.1 Box-and-whisker plot constructed by SPSS from the data of Table 4.2. Vertical axis: TURNSIZE; labelled features, from bottom to top: minimum, Q1, median (Q2), Q3, maximum, and outliers (point labelled 20).]

The longer upper whisker (the right-hand whisker when the plot is
drawn horizontally, as in the steps above) indicates that the
distribution of diameters is skewed to the right.
