1.2 Mathematical Presentation of Data

Mathematical Presentation of Data

Biostatistics Course 2021-2022 – Block 1

Ali Lateef Jasim


MBChB.
Learning objectives

At the end of this lecture you should be able to:


❑ List the measures of central tendency, describe their characteristics
and identify their uses.
❑ Calculate the measures of central tendency.
❑ List the measures of dispersion, describe their characteristics and
identify their uses.
❑ Calculate the range, standard error and coefficient of variation.
Types of variables

Has the variable got units? (this includes numbers or things)
❖ No: Categorical variable. Can the data be put in meaningful order?
✓ No: Categorical Nominal
✓ Yes: Categorical Ordinal
❖ Yes: Numerical variable. Do the data come from measuring or counting?
✓ Counting: Discrete
✓ Measuring: Continuous
Mathematical Presentation of Data

01 Central Tendency
✓ The mean
✓ The median
✓ The mode
02 Measures of locations (Quantiles)
✓ Such as quintiles, quartiles, tertiles, etc.
03 Dispersion
✓ The range
✓ The Variance
✓ Standard Deviation
✓ Standard Error
✓ Coefficient of Variation
Central Tendency
❖ Central tendency refers to our intuition that there is a center
around which all these scores vary.
❖ A measure of central tendency is a single value that attempts to
describe a set of data by identifying the central position within
that set of data.
❖ The three main measures of central tendency are:
✓ The mean,
✓ The median, and
✓ The mode
The Mean

❖ The mean is the average of the data: the sum of all values of a set of observations divided by the number of these observations.
❖ The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data.
❖ Calculated by these equations:

Mean of sample: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Mean of population (mu): $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

Properties of the Mean

Advantages of mean:
✓ Uniqueness: For a given set of data there is one and only
one mean, it is single value.
✓ Simple to compute.
✓ All values are included.
Disadvantage
✓ The main disadvantage of the mean is its sensitivity to extreme values, i.e. very high or very low values pull the mean towards them.
The Weighted Mean:

❖ The individual values in the set are weighted by their respective frequencies.
❖ A weighted mean is a kind of average. Instead of each data point contributing equally to the final mean, some data points contribute more weight than others.
❖ It can be expressed as the sum of the mean of each group multiplied by its respective weight (the n in each group), divided by the sum of the weights (N): $\bar{x}_w = \frac{\sum_j n_j \bar{x}_j}{\sum_j n_j}$
The Weighted Mean:
Example
Let us imagine that four groups of medical students obtained the following mean scores on the final examination of anatomy: 85, 62, 80 and 79. To calculate the weighted mean we have to know the number of students (weight) in each of the four groups.

So, let us assume that the number of students in each of the four groups is as follows:
Group A, 60 students, their mean score was 85
Group B, 40 students, their mean score was 62
Group C, 45 students, their mean score was 80
Group D, 30 students, their mean score was 79
The Weighted Mean:
❖ Example (continued):

Weighted mean $= \frac{(60 \times 85) + (40 \times 62) + (45 \times 80) + (30 \times 79)}{60 + 40 + 45 + 30} = \frac{13{,}550}{175} \approx 77.4$

What if?
What if we calculate the mean directly by summing 85 + 62 + 80 + 79 and dividing by 4?
Then the mean would be 76.5. This ignores the fact that the groups differ in size, so the weighted mean is the more accurate summary of all 175 students.
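The comparison between the weighted and the simple mean can be checked with a short sketch. It is an illustration only, not part of the lecture: the function name `weighted_mean` is hypothetical, and Group C's mean is taken as 80, the value used in the numerator of the worked calculation.

```python
# Group sizes (weights) and group mean scores from the worked example.
# Assumption: Group C's mean is 80, as in the numerator of the calculation.
group_sizes = [60, 40, 45, 30]
group_means = [85, 62, 80, 79]


def weighted_mean(values, weights):
    """Sum of (value * weight) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)


weighted = weighted_mean(group_means, group_sizes)
# The simple mean treats every group as if it had the same size.
unweighted = sum(group_means) / len(group_means)

print(f"weighted mean   = {weighted:.2f}")
print(f"unweighted mean = {unweighted:.2f}")
```

The two results differ because the largest group (60 students) also has the highest mean, so it pulls the weighted mean upward.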
The Median (50th percentile)
The median of a data set is the value that lies exactly in the middle.
To calculate the median:
1. Create an ordered array of values.
2. Locate the position of the median which depends on the number
of observations and as follows:
✓ For an odd number of observations: position (n+1)/2
✓ For an even number of observations: two positions, (n/2) and (n/2)+1
3. The value of the median is the value at the middle position for an odd number of observations, and the average of the two middle values for an even number.
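The three steps above can be sketched in code. This is a minimal illustration; the function name `median_by_position` and the two small data sets are chosen for the sketch. Positions are 1-based, as in the steps, so the code subtracts 1 when indexing.

```python
def median_by_position(values):
    """Median found by locating the middle position(s) in the ordered array."""
    ordered = sorted(values)              # step 1: create an ordered array
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2           # step 2 (odd n): one middle position
        return ordered[position - 1]      # step 3: the value at that position
    lower = n // 2                        # step 2 (even n): two middle positions
    upper = lower + 1
    # step 3: the average of the two middle values
    return (ordered[lower - 1] + ordered[upper - 1]) / 2


odd_example = median_by_position([7, 3, 11, 5, 9])            # five observations
even_example = median_by_position([29, 31, 24, 29, 30, 25])   # six observations

print(odd_example, even_example)
```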
Properties of the Median

Advantages of median:
✓ It is a single value, simple, easy to compute, easy to understand, and unaffected by extreme values.
Disadvantages:
✓ It does not take the actual values of all observations into account.
✓ It is less amenable than the mean to tests of statistical
significance.
The Mode
❖ It is the value which occurs most frequently.
❖ Data distribution with one mode is called unimodal.
❖ If all values are different, there is no mode (the distribution is nonmodal).
❖ Sometimes there is more than one mode: a distribution with two modes is called bimodal; with more than two, multimodal.
❖ Normally, the mode is used for categorical data where we wish to
know which is the most common category.
Properties of the Mode

Advantage of mode:
✓ Sometimes gives a clue about the etiology of the disease.
Disadvantages:
✓ With small number of observations, there may be no mode
✓ It is less amenable to tests of statistical significance.
Other properties of mode:
✓ Sometimes, it is not unique
✓ It may be used for describing qualitative data
Example
In an outbreak of hepatitis A, 6 persons became ill with clinical
symptoms. The incubation periods for the affected persons (Xi) were
29, 31, 24, 29, 30, and 25 days.

A. Finding the Mean:
Mean = (29 + 31 + 24 + 29 + 30 + 25) / 6 = 168 / 6 = 28 days

B. Finding the Median:


1. Arrange data in order (24, 25, 29, 29, 30, 31)
2. Find the position of the median; for an even number of observations the positions are n/2 and (n/2)+1, so observations no. 3 and 4 (29 and 29).
3. The value of the median is the average of the two values = 29 days.
Example

C. Finding the Mode:


The most frequent observation, So the Mode = (29)

What if?
➢ What if the largest value of the six listed incubation periods were 131 instead of 31? What would happen to the mean, median and mode?
✓ The mean will be (24 + 25 + 29 + 29 + 30 + 131)/6 = 268/6 = 44.7 days instead of 28.
✓ The median and the mode will remain the same, as they are not affected by extreme values.
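The effect of the extreme value can be verified with Python's standard `statistics` module. This is a sketch for checking the arithmetic; the helper name `summarize` is hypothetical.

```python
import statistics

# Incubation periods (days) from the hepatitis A example.
incubation_days = [29, 31, 24, 29, 30, 25]
# The "what if" scenario: the largest value is 131 instead of 31.
with_outlier = [131 if x == 31 else x for x in incubation_days]


def summarize(values):
    """Return the three measures of central tendency for a data set."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


original = summarize(incubation_days)
shifted = summarize(with_outlier)

print("original:    ", original)
print("with outlier:", shifted)
```

Only the mean changes between the two summaries; the median and the mode depend on the middle and the most frequent values, which the outlier leaves untouched.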
Measures of non central locations
Quantiles:
❖ A quantile is a value below which a certain proportion of observations fall in the ordered set of data values.
❖ Quantiles divide an ordered set of data into equal-sized segments.
❖ Quintiles divide the data into five equal parts, each representing 20% of a given population.
✓ The first quintile represents the lowest fifth of the data (1% - 20%)
✓ The second quintile represents the second fifth (21% - 40%), and so on.
❖ A population split into three equal parts is divided into tertiles.
❖ One of the most common metrics in statistical analysis, the median, is actually just the result of dividing a population into two equal parts.
Measures of non central locations
Centiles:
❖ Those values, in a series of observations arranged in ascending
order of magnitude, which divide the distribution into 100 equal
parts.
❖ 10th Percentile: it is the value below which 10% of the observations
lie.
❖ We also frequently use the 3rd, 97th, and 50th (median) percentiles.
Measures of non central locations

Quartiles:
❖ These are the values in an ordered array that divide the distribution into four equal parts.
❖ 1st (lower) quartile: the value below which 25% of observations lie in an ordered array (25th percentile).
❖ 2nd quartile = Median = 50th percentile
❖ Upper quartile = 75th percentile
❖ Interquartile range: the range covering the middle 50% of all observations (from the 25th to the 75th percentile).
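Quartiles and the interquartile range can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). This is a sketch that reuses the incubation periods from the earlier example. The choice of `method="inclusive"` is an assumption of the sketch: statistical packages use different interpolation rules, so the quartile values may differ slightly from a hand calculation or from other software.

```python
import statistics

# Incubation periods (days) from the hepatitis A example.
incubation_days = [29, 31, 24, 29, 30, 25]

# n=4 returns the three cut points that split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(incubation_days, n=4, method="inclusive")

# The interquartile range spans the middle 50% of the observations.
iqr = q3 - q1

# n=5 returns the four cut points that separate the quintiles.
quintile_cuts = statistics.quantiles(incubation_days, n=5, method="inclusive")

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("quintile cut points:", quintile_cuts)
```

Whatever interpolation rule is used, the second quartile coincides with the median.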
Measures of Dispersion
The measures of central tendency are not adequate to describe
data. Two data sets can have the same mean but they can be
entirely different. Thus to describe data, one needs to know the
extent of variability. This is given by the measures of dispersion.
Dispersion is a key concept in statistical thinking. The basic question being asked is: how much do the scores deviate (vary) around the mean? The measures of dispersion are:
1. The range
2. The Variance
3. Standard Deviation
4. Standard Error
5. Coefficient of Variation
The range
❖ The range is the difference between the largest and the smallest
observation in the data.
❖ It is an important measurement. However, it does not give much indication of the spread of observations about the mean.
❖ Should be used in conjunction with other measures of variability.

Advantages
✓ Simple to calculate
✓ Easy to understand
Disadvantages
✓ It neglects all values in the center and depends only on the extreme values, which are themselves dependent on sample size.
✓ It is not based on all observations.
✓ It is not amenable to further mathematical treatment.
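The dependence of the range on the extreme values can be illustrated with a small sketch. The function name `data_range` is hypothetical, and the data are the incubation periods from the earlier example, with and without the extreme value of 131.

```python
def data_range(values):
    """Difference between the largest and the smallest observation."""
    if not values:
        raise ValueError("the range of an empty data set is undefined")
    return max(values) - min(values)


incubation_days = [29, 31, 24, 29, 30, 25]
with_outlier = [29, 131, 24, 29, 30, 25]   # 31 replaced by 131

range_original = data_range(incubation_days)
range_outlier = data_range(with_outlier)

print(range_original, range_outlier)
```

A single extreme observation changes the range completely, while the other five values play no part in it.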
The Variance
❖ The variance is the average of the squared deviations (differences) from the mean.
❖ Example: If we have the following values of certain observations: 3, 5, 7, 9, 11, the mean will be 7, and the difference of each value from the mean will be:
3 - 7 = -4
5 - 7 = -2
7 - 7 = 0
9 - 7 = 2
11 - 7 = 4
✓ If we sum the differences, (-4) + (-2) + 0 + 2 + 4, the result is zero.
✓ But if we square the differences and then sum them, the result is not zero: 16 + 4 + 0 + 4 + 16 = 40
The Variance
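❖ Calculated by the equation:

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
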
❖ Properties of the variance:


✓ Variance can never be a negative value.
✓ All observations are considered.
✓ The problem with the variance is the squared unit.

For the example mentioned above, to calculate the sample variance we divide the sum of squared differences (40) by n-1 = 5-1 = 4.
So, variance = 40/4 = 10
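The variance calculation for the values 3, 5, 7, 9, 11 can be retraced step by step. This is a sketch with illustrative variable names; the population variance (dividing by n instead of n-1) is included only for comparison.

```python
import statistics

values = [3, 5, 7, 9, 11]
n = len(values)
mean = sum(values) / n

# Deviations of each observation from the mean always sum to zero.
deviations = [x - mean for x in values]
sum_of_deviations = sum(deviations)

# Squaring removes the signs, so the sum no longer cancels out.
sum_of_squares = sum(d ** 2 for d in deviations)

sample_variance = sum_of_squares / (n - 1)    # divide by n-1 for a sample
population_variance = sum_of_squares / n      # divide by n for a whole population

print("sum of deviations:", sum_of_deviations)
print("sum of squares:   ", sum_of_squares)
print("sample variance:  ", sample_variance)
```

The standard library function `statistics.variance` also uses the n-1 divisor, so it gives the same sample variance.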
The standard deviation
❖ The standard deviation measures the variability of the observations in the sample or the population around the mean of that sample or that population.
❖ It is the square root of the variance and the unit is not squared.
❖ SD is the most widely used measure of dispersion.

For the same example, to calculate the SD we need to find the square root of the variance. So, SD = √10 = 3.16
And we will express the mean as: 7 ± 3.16
Standard Error of the mean (SE)

❖ When we draw a sample from a study population and compute its sample mean, it is not likely to be identical to the population mean.
❖ If we draw another sample from the same population and compute its sample mean, this may also not be identical to the first sample mean.
❖ It might also differ from the true mean of the total population from which the sample was drawn.
❖ This phenomenon is called sampling variation.

Extra
Standard Error of the mean (SE)
❖ It measures the variability or dispersion of sample means around the population mean.
❖ It is used to estimate the population mean, and to estimate differences between population means.

SE=SD/√n
For the same example:
SE = 3.16 / √5 = 3.16 / 2.24 = 1.4
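The SD and SE for the same five values can be reproduced as follows. This is a sketch: `statistics.stdev` returns the sample standard deviation (n-1 divisor), and the last lines, which keep the SD fixed while changing n, are a hypothetical illustration of how the SE shrinks as the sample grows.

```python
import math
import statistics

values = [3, 5, 7, 9, 11]
n = len(values)

sd = statistics.stdev(values)     # sample SD: square root of the sample variance
se = sd / math.sqrt(n)            # SE = SD / sqrt(n)

print(f"SD = {sd:.2f}")
print(f"SE = {se:.2f}")

# Hypothetical illustration: with the same SD, a larger sample gives a smaller SE.
se_if_n_were_50 = sd / math.sqrt(50)
print(f"SE with n = 50 would be {se_if_n_were_50:.2f}")
```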
Coefficient of variation (CV)
❖ It expresses the SD as a percentage of the mean.
❖ CV= (SD / mean) x 100 (mean of the sample)
❖ It has no unit.
❖ It is used to compare dispersion in two sets of data especially when
the units are different.
❖ It measures relative rather than absolute variation.
❖ It takes into consideration all values in the set.

For the same example:


CV = (3.16 / 7) × 100% ≈ 45%
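The CV calculation, and the reason it suits comparisons across different units, can be sketched as follows. The function name `coefficient_of_variation` and the height data are hypothetical, chosen only to show that converting the units leaves the CV unchanged.

```python
import statistics


def coefficient_of_variation(values):
    """SD expressed as a percentage of the mean (sample SD, n-1 divisor)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


cv_example = coefficient_of_variation([3, 5, 7, 9, 11])
print(f"CV of the worked example = {cv_example:.1f}%")

# Hypothetical data: the same heights recorded in centimetres and in metres.
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
print(f"CV in cm = {cv_cm:.2f}%, CV in m = {cv_m:.2f}%")
```

The SD of the heights differs by a factor of 100 between the two unit systems, but the CV is the same because the mean scales by the same factor.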
Thank You
Ali Lateef Jasim
MBChB.
