1.2 Mathematical Presentation of Data
[Figure: Types of variables. Categorical (nominal, ordinal) and numerical (discrete, continuous).]
Mathematical Presentation of Data
01 Central Tendency
✓ The mean
✓ The median
✓ The mode
02 Measures of non central locations (Quantiles), such as quintiles, quartiles, tertiles, etc.
03 Measures of Dispersion
✓ The range
✓ The variance
✓ Standard deviation
✓ Standard error
✓ Coefficient of variation
Central Tendency
❖ Central tendency refers to our intuition that there is a center
around which all the values in a data set vary.
❖ A measure of central tendency is a single value that attempts to
describe a set of data by identifying the central position within
that set of data.
❖ The three main measures of central tendency are:
✓ The mean,
✓ The median, and
✓ The mode
The Mean
❖ The mean is the average of the data: the sum of all values of
a set of observations divided by the number of these observations.
❖ The mean (or average) is the most popular and well-known
measure of central tendency. It can be used with both discrete and
continuous data.
❖ Calculated by this equation:
Mean of sample = $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Advantages of mean:
✓ Uniqueness: For a given set of data there is one and only
one mean, it is single value.
✓ Simple to compute.
✓ All values are included.
Disadvantage of mean:
✓ The main disadvantage of the mean is that it is strongly affected
by extreme values, i.e. very high or very low values.
The Weighted Mean:
❖ Example: assume that the number of students in each of four
groups, and the mean score of each group, are as follows:
Group A: 60 students, mean score 85
Group B: 40 students, mean score 62
Group C: 45 students, mean score 70
Group D: 30 students, mean score 79
❖ Example (continued):
What if we calculate the mean directly by summing 85+62+70+79
and dividing by 4?
Then the mean will be 74. The weighted mean,
(60×85 + 40×62 + 45×70 + 30×79) / 175 ≈ 74.9, is more accurate
because it takes the size of each group into account.
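The comparison above can be checked with a short Python sketch. The group sizes and mean scores come from the example; the function name weighted_mean and the dictionary names are illustrative and not part of the original material.

```python
# Group sizes and mean scores taken from the example above.
group_sizes = {"A": 60, "B": 40, "C": 45, "D": 30}
group_means = {"A": 85, "B": 62, "C": 70, "D": 79}


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


groups = sorted(group_sizes)
weighted = weighted_mean(
    [group_means[g] for g in groups],
    [group_sizes[g] for g in groups],
)
# Unweighted mean: treats every group as if it had the same size.
unweighted = sum(group_means.values()) / len(group_means)

print(round(weighted, 2))  # 74.86
print(unweighted)          # 74.0
```

The two results differ because the largest group (A) also has the highest score, so it pulls the weighted mean upward.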
The Median (50th percentile)
The median of a data set is the value that lies exactly in the middle.
To calculate the median:
1. Create an ordered array of values.
2. Locate the position of the median, which depends on the number
of observations as follows:
✓ For an odd number of observations: (n+1)/2
✓ For an even number of observations: two positions, (n/2) and (n/2)+1
3. The median is the value at the middle position for an odd number
of observations, and the average of the two middle values for an
even number.
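The three steps above can be sketched in Python as follows. The function name median_by_position and the two input lists are illustrative assumptions; positions are 1-based as in the text, so 1 is subtracted when indexing the list.

```python
def median_by_position(values):
    """Median via the ordered-array procedure described above."""
    ordered = sorted(values)            # step 1: ordered array
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2         # step 2: single position for odd n
        return ordered[position - 1]    # step 3: value in the middle
    lower, upper = n // 2, n // 2 + 1   # step 2: two positions for even n
    return (ordered[lower - 1] + ordered[upper - 1]) / 2  # step 3: average


print(median_by_position([7, 3, 11, 5, 9]))  # 7
print(median_by_position([3, 5, 7, 9]))      # 6.0
```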
Properties of the Median
Advantages of median:
✓ It is a single value, simple, easy to compute, easy to
understand, and unaffected by extreme values.
Disadvantages:
✓ It does not take all values (observations) into account.
✓ It is less amenable than the mean to tests of statistical
significance.
The Mode
❖ It is the value which occurs most frequently.
❖ Data distribution with one mode is called unimodal.
❖ If all values are different, there is no mode (the distribution is
nonmodal).
❖ Sometimes there is more than one mode: a distribution with two
modes is called bimodal; one with more than two is called multimodal.
❖ Normally, the mode is used for categorical data where we wish to
know which is the most common category.
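The rules above can be sketched in Python as follows. The function names modes and modality and the blood-group list are hypothetical, chosen only for illustration. The sketch follows the slide's convention that data with all values different has no mode.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); empty list if every value is unique."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # all values are different: no mode
    return [value for value, count in counts.items() if count == highest]


def modality(values):
    """Name the distribution by its number of modes."""
    names = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return names.get(len(modes(values)), "multimodal")


# Hypothetical categorical data: which blood group is most common?
blood_groups = ["A", "O", "O", "B", "A", "O", "AB"]
print(modes(blood_groups), modality(blood_groups))  # ['O'] unimodal
print(modality([1, 2, 3]))                          # no mode
print(modality([1, 1, 2, 2, 3]))                    # bimodal
```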
Properties of the Mode
Advantage of mode:
✓ Sometimes gives a clue about the etiology of the disease.
Disadvantages:
✓ With a small number of observations, there may be no mode.
✓ It is less amenable to tests of statistical significance.
Other properties of mode:
✓ Sometimes, it is not unique
✓ It may be used for describing qualitative data
Example
In an outbreak of hepatitis A, 6 persons became ill with clinical
symptoms. The incubation periods for the affected persons (Xi) were
29, 31, 24, 29, 30, and 25 days.
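As a minimal sketch, the three measures of central tendency for these six incubation periods can be computed with Python's standard statistics module. The variable names are illustrative.

```python
from statistics import fmean, median, mode

incubation_days = [29, 31, 24, 29, 30, 25]

mean_days = fmean(incubation_days)     # 168 / 6
median_days = median(incubation_days)  # average of the 3rd and 4th ordered values
mode_days = mode(incubation_days)      # 29 occurs twice

print(mean_days, median_days, mode_days)  # 28.0 29.0 29
```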
What if?
➢ What if the largest value of the six listed incubation periods were
131 instead of 31? What will happen to the mean, median and
mode?
✓ The mean will be (24+25+29+29+30+131)/6 = 44.7 days instead
of 28.
✓ The median and the mode will remain the same (29 days), as they
are not affected by extreme values.
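The effect of the extreme value can be checked with a short sketch that replaces 31 with 131 and recomputes all three measures. The variable names are illustrative.

```python
from statistics import fmean, median, mode

original = [29, 31, 24, 29, 30, 25]
# Replace the largest value (31) with the extreme value 131.
with_outlier = [131 if days == 31 else days for days in original]

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(label, round(fmean(data), 1), median(data), mode(data))
# original 28.0 29.0 29
# with outlier 44.7 29.0 29
```

Only the mean moves; the median and mode stay at 29 days.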
Measures of non central locations
Quantiles:
❖ A quantile is a value below which a certain proportion of
observations lie in the ordered set of data values.
❖ Quantiles divide a population into equal-sized segments.
❖ A quintile is a statistical value of a data set that represents 20% of
a given population.
✓ The first quintile represents the lowest fifth of the data (1% - 20%)
✓ The second quintile represents the second fifth (21% - 40%), and
so on.
❖ A population split into three equal parts is divided into tertiles.
❖ One of the most common metrics in statistical analysis, the
median , is actually just the result of dividing a population into
two quantiles.
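A minimal sketch of quintiles, tertiles and the median as cut points, using statistics.quantiles from the Python standard library. The data set (the integers 1 to 100) is hypothetical and chosen only for illustration. Different software packages use slightly different interpolation rules, so the cut points for small samples may differ between tools.

```python
from statistics import median, quantiles

# Hypothetical data: the integers 1..100, chosen only for illustration.
data = list(range(1, 101))

quintile_cuts = quantiles(data, n=5)  # 4 cut points -> 5 equal parts
tertile_cuts = quantiles(data, n=3)   # 2 cut points -> 3 equal parts
half_cut = quantiles(data, n=2)       # 1 cut point  -> 2 equal parts

print(quintile_cuts)
print(tertile_cuts)
print(half_cut[0] == median(data))    # True: the single cut point is the median
```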
Centiles:
❖ Centiles (percentiles) are those values, in a series of observations
arranged in ascending order of magnitude, which divide the
distribution into 100 equal parts.
❖ 10th percentile: the value below which 10% of the observations
lie.
❖ We also frequently use the 3rd, 97th, and 50th (median) percentiles.
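A minimal sketch of centiles, again using statistics.quantiles. The data set (the integers 1 to 200) and the helper name centile are hypothetical, chosen only for illustration.

```python
from statistics import quantiles

# Hypothetical data: the integers 1..200, chosen only for illustration.
data = list(range(1, 201))

centile_cuts = quantiles(data, n=100)  # 99 cut points -> 100 equal parts


def centile(k):
    """Return the k-th centile (k = 1..99) from the precomputed cut points."""
    return centile_cuts[k - 1]


for k in (3, 10, 50, 97):
    below = sum(x < centile(k) for x in data)
    print(k, centile(k), f"{100 * below / len(data):.0f}% of values lie below")
```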
Quartiles:
❖ These are the values in an ordered array that divide the distribution
into four equal parts.
❖ 1st (lower) quartile: the value below which 25% of observations lie
in an ordered array.
❖ 2nd quartile = median = 50th percentile
❖ Upper quartile = 75th percentile
❖ Interquartile range: covers the middle 50% of all observations (from
the 25th to the 75th percentile).
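As a sketch, the quartiles and the interquartile range can be computed for the six hepatitis A incubation periods from the earlier example. With so few observations, the lower and upper quartiles depend on the interpolation rule; this sketch uses the default rule of statistics.quantiles, which is one convention among several.

```python
from statistics import quantiles

incubation_days = [29, 31, 24, 29, 30, 25]

# Three cut points: lower quartile, median (2nd quartile), upper quartile.
q1, q2, q3 = quantiles(incubation_days, n=4)
iqr = q3 - q1  # width of the interval holding the middle 50% of observations

print("Q1:", q1)
print("Q2 (median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
```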
Measures of Dispersion
The measures of central tendency alone are not adequate to describe
data. Two data sets can have the same mean but be entirely
different. Thus, to describe data, one needs to know the
extent of variability. This is given by the measures of dispersion.
Dispersion is a key concept in statistical thinking. The basic question
being asked is: how much do the scores deviate (vary) around the
mean? The measures of dispersion are:
1. The range
2. The Variance
3. Standard Deviation
4. Standard Error
5. Coefficient of Variation
The range
❖ The range is the difference between the largest and the smallest
observation in the data.
❖ It is an important measurement. However, it does not give much
indication of the spread of observations about the mean.
❖ Should be used in conjunction with other measures of variability.
Advantages
✓ Simple to calculate
✓ Easy to understand
Disadvantages
✓ It neglects all values in the center and depends only on the extreme
values, which are themselves dependent on sample size.
✓ It is not based on all observations.
✓ It is not amenable to further mathematical treatment.
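A short sketch of the range, using the hepatitis A incubation periods from the earlier example, shows how strongly it depends on a single extreme value. The function name data_range is illustrative.

```python
def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


incubation_days = [29, 31, 24, 29, 30, 25]
with_outlier = [29, 131, 24, 29, 30, 25]  # 31 replaced by 131, as in the earlier what-if

print(data_range(incubation_days))  # 7
print(data_range(with_outlier))     # 107
```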
The Variance
❖ The variance is the average of the squared deviations (differences)
from the mean.
❖ Example: If we have the following values of certain observations:
3, 5, 7, 9, 11, the mean will be 7, and the difference of each value
from the mean will be:
3-7 = -4,
5-7 = -2,
7-7 = 0,
9-7 = 2,
11-7 = 4
✓ If we sum the differences (-4 - 2 + 0 + 2 + 4), the result will
be zero.
✓ But if we square the differences and then sum them, the result
will not be zero: (16+4+0+4+16) = 40
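The arithmetic in this example can be reproduced with a few lines of Python. The variable names are illustrative.

```python
values = [3, 5, 7, 9, 11]
mean_value = sum(values) / len(values)         # 7.0

deviations = [x - mean_value for x in values]  # -4, -2, 0, 2, 4
squared = [d ** 2 for d in deviations]         # 16, 4, 0, 4, 16

print(sum(deviations))  # 0.0
print(sum(squared))     # 40.0
```

Deviations from the mean always sum to zero, which is why they are squared before being summed.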
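❖ Calculated by the equation:
Variance = $\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
where $\bar{X}$ is the sample mean and n is the number of observations.
❖ For the example above: Variance = 40 / (5 - 1) = 10
Standard Deviation
❖ It is the square root of the variance. For the example above:
SD = √10 = 3.16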
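A minimal sketch of the variance and standard deviation for the same five values (3, 5, 7, 9, 11). It assumes the sample formula with an n - 1 divisor; that is an assumption, chosen because it reproduces the SD of 3.16 used in the standard error example.

```python
from math import sqrt

values = [3, 5, 7, 9, 11]
n = len(values)
mean_value = sum(values) / n
sum_of_squares = sum((x - mean_value) ** 2 for x in values)  # 40.0

# Assumption: sample variance, dividing by n - 1 rather than n.
sample_variance = sum_of_squares / (n - 1)
sample_sd = sqrt(sample_variance)

print(sample_variance)      # 10.0
print(round(sample_sd, 2))  # 3.16
```

Dividing by n instead would give a variance of 8 and an SD of about 2.83, which does not match the 3.16 used later.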
Standard Error of the mean (SE)
❖ It measures the variability or dispersion of the sample mean around
the population mean.
❖ It is used to estimate the population mean, and to estimate
differences between population means.
SE=SD/√n
For the same example:
SE = 3.16 / √5 = 3.16 / 2.24 = 1.4
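The same calculation as a short Python sketch, assuming "the same example" refers to the five values 3, 5, 7, 9, 11 from the variance example (their sample SD is 3.16).

```python
from math import sqrt
from statistics import stdev

values = [3, 5, 7, 9, 11]      # assumed to be "the same example"

sd = stdev(values)             # sample SD (n - 1 divisor)
se = sd / sqrt(len(values))    # SE = SD / sqrt(n)

print(round(sd, 2), round(se, 2))  # 3.16 1.41
```

Without intermediate rounding the result is 1.41, which agrees with the 1.4 shown above.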
Coefficient of variation (CV)
❖ It expresses the SD as a percentage of the mean.
❖ CV = (SD / mean) × 100, where the mean is the mean of the sample.
❖ It has no unit.
❖ It is used to compare dispersion in two sets of data especially when
the units are different.
❖ It measures relative rather than absolute variation.
❖ It takes into consideration all values in the set.
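As a sketch, the CV can be used to compare the two data sets that appear earlier: the five values from the variance example and the six hepatitis A incubation periods. The function name coefficient_of_variation is illustrative, and the sample SD (n - 1 divisor) is assumed.

```python
from statistics import fmean, stdev


def coefficient_of_variation(values):
    """SD expressed as a percentage of the mean: (SD / mean) x 100."""
    return stdev(values) / fmean(values) * 100


observations = [3, 5, 7, 9, 11]             # variance example
incubation_days = [29, 31, 24, 29, 30, 25]  # hepatitis A example

print(round(coefficient_of_variation(observations), 1))     # 45.2
print(round(coefficient_of_variation(incubation_days), 1))  # 10.1
```

Although the two SDs are close in absolute terms (3.16 and 2.83), the first data set varies far more relative to its mean.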