G W Kulaba Assagnment Stat 3
#: U67273SF076372
COURSE NAME:
(Numbers of Life: Graphs & Data Distribution)
Assignment Title:
(Graphs, Data Description & Data Distribution)
Body of Assignment
What are frequency graphs? What do they show?
In statistics and research, graphs are used to convey the data to viewers in pictorial
form. For most people, data presented graphically is easier to understand than data
presented numerically in tables or frequency distributions. In addition, statistical graphs
can serve a number of other purposes including use as a tool for discussing,
summarizing, describing and analyzing the data.
A description of what the following types of frequency distribution graphs are and what
they show is given:
1. Histogram
2. Bar Chart
3. Frequency Polygon
4. Ogives
5. Pie Graph
6. Pareto Charts
7. Time Series Graphs
8. Stem and Leaf Plot
9. Dotplot
A. The Histogram
Figure 1 shows a histogram drawn from the data in Table 1.
Total 50 100%
From the frequency table (Table 1) we can quickly identify information such as 7
children (14% of all children) are in the 160 to less than 170 cm height range, and that
there are more children with heights in the 140 to less than 150 cm range (26% of all
children) than any other height range.
Histograms are visual displays of frequencies using columns plotted on a graph. The Y-
axis (vertical axis) generally represents the frequency count, while the X-axis (horizontal
axis) generally represents the variable being measured.
B. Bar Chart
A bar chart is shown in Figure 2. It is drawn from data in Table 2.
A bar chart is a type of graph in which each column (plotted either vertically or
horizontally) represents a categorical variable or a discrete ungrouped numeric variable.
It is used to compare the frequency (count) for a category or characteristic with another
category or characteristic.
The features of a bar chart are (ABS, 2013):
• In a bar chart, the bar height (if vertical) or length (if horizontal) shows the frequency
for each category or characteristic.
• The distribution of the dataset is not important because the columns each represent
an individual category or characteristic rather than intervals for a continuous
measurement. Therefore, gaps are included between each bar and each bar can be
arranged in any order without affecting the data.
For example:
If data had been collected for 'country of birth' from a sample of children (Table 2), a bar
chart could be used to plot the data as 'country of birth' is a categorical variable.
Australia 16 32%
Fiji 3 6%
India 8 16%
Italy 10 20%
United States of America 4 8%
Total 50 100%
The bar chart below (Figure 2) shows us that 'Australia' is the most commonly observed
country of birth of the 50 children sampled, while 'Fiji' is the least common country of
birth.
Figure 6: Ogive
E. Graphs of Relative Frequencies
Graphs of relative frequencies instead of frequencies are used when the proportion of
data values that fall into a given class is more important than the actual number of data
values that fall into that class.
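As a minimal sketch of how relative frequencies are obtained, the snippet below converts the counts from the rows of Table 2 that are reproduced above into percentages of the stated total of 50 children. The variable names are illustrative only, and the listed rows do not add up to 50 because not every row of the table is reproduced here.

```python
# Counts for the countries of birth listed in Table 2 (only the rows shown above).
counts = {
    "Australia": 16,
    "Fiji": 3,
    "India": 8,
    "Italy": 10,
    "United States of America": 4,
}

# Total number of children sampled, taken from the "Total" row of Table 2.
total_children = 50

# Relative frequency = count divided by the total, expressed as a percentage.
relative_frequency = {
    country: count * 100 / total_children for country, count in counts.items()
}

for country, percent in relative_frequency.items():
    print(f"{country}: {percent:.0f}%")
```

The percentages reproduce the third column of Table 2, which is what a relative frequency graph would plot in place of the raw counts.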
Mode (symbol: none): most frequent data value
Midrange (symbol: MR): lowest value plus highest value, divided by 2
Each of these measures gives a different indication of the typical or central value in
the distribution.
The mean, also known as the arithmetic average, is found by adding the values of the
data and dividing by the total number of values. For example, the mean of 3, 2, 6, 5,
and 4 is found by adding 3 + 2 + 6 + 5 + 4 = 20 and dividing by 5; hence, the mean of
the data is 20 ÷ 5 = 4.
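The same calculation can be sketched in Python. This is only an illustration of the arithmetic above; the variable names are not part of the assignment.

```python
from statistics import mean

values = [3, 2, 6, 5, 4]

# Add the values and divide by how many there are.
manual_mean = sum(values) / len(values)

# The standard library gives the same result.
library_mean = mean(values)

print(manual_mean, library_mean)
```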
The population mean is indicated by the Greek symbol µ, but when the mean is
calculated on a distribution from a sample it is indicated by the symbol x̅ ( X-bar). In the
formulae (Table 3), n represents the total number of values in the sample while N
represents the total number of values in the population.
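The formulae themselves are not reproduced in this text. The standard definitions, which are assumed here to match those in Table 3, can be written as follows, where x_i denotes the i-th data value:

```latex
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
\qquad\text{(sample mean)},
\qquad
\mu = \frac{\sum_{i=1}^{N} x_i}{N}
\qquad\text{(population mean)}
```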
One advantage of the mean is that it can be used for both continuous and discrete
numeric data. However, its limitations include the following:
• The mean cannot be calculated for categorical data, as the values cannot be
summed.
• As the mean includes every value in the distribution the mean is influenced by
outliers and skewed distributions.
The mode is the value that occurs most often in the data set. Consider this dataset
showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Table 4 shows a simple frequency distribution of the retirement age data given above.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2
The most commonly occurring value is 54, therefore the mode of this distribution is 54
years.
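The frequency distribution and the mode can be checked with a short Python sketch. It uses the retirement age data above; the variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

retirement_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Build the frequency distribution: each age mapped to how often it occurs.
frequency = Counter(retirement_ages)
for age in sorted(frequency):
    print(age, frequency[age])

# multimode returns every value that shares the highest frequency.
modes = multimode(retirement_ages)
print("Mode(s):", modes)
```

Because `multimode` returns a list, a data set in which several values are equally common would yield all of them rather than a single value.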
The mode has an advantage over the median and the mean as it can be found for
both numerical and categorical (non-numerical) data.
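However, the mode also has limitations:
• The mode may not reflect the centre of the distribution very well. When the retirement
age data are ordered from lowest to highest, the mode (54 years) sits at the lower end of
the distribution rather than near its middle:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60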
• It is also possible for there to be more than one mode for the same distribution of
data, (bi-modal, or multi-modal). The presence of more than one mode can limit
the ability of the mode in describing the centre or typical value of the distribution
because a single value to describe the centre cannot be identified.
• In some cases, particularly where the data are continuous, the distribution may
have no mode at all (i.e. if all values are different).
In cases such as these, it may be better to consider using the median or mean, or group
the data into appropriate intervals, and find the modal class.
The median is the middle value in a distribution when the values are arranged in
ascending or descending order.
The median divides the distribution in half (there are 50% of observations on either side
of the median value). In a distribution with an odd number of observations, the median
value is the middle value.
Looking at the retirement age distribution example (which has 11 observations), the
median is the middle value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the
mean of the two middle values. In the following distribution, the two middle values are
56 and 57, therefore the median equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
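The two cases (odd and even number of observations) can be sketched as a small Python function. The function name `middle_value` is hypothetical; the standard library's `statistics.median` is used as a cross-check.

```python
from statistics import median

def middle_value(values):
    """Return the median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value exists.
        return ordered[mid]
    # Even number of observations: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_count_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
even_count_ages = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

print(middle_value(odd_count_ages), median(odd_count_ages))
print(middle_value(even_count_ages), median(even_count_ages))
```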
The median is less affected by outliers and skewed data than the mean, and is usually
the preferred measure of central tendency when the distribution is not symmetrical. A
limitation of the median is that it cannot be identified for categorical nominal data, as
these cannot be logically ordered.
The midrange is a rough estimate of the middle. It is defined as the sum of the lowest
and highest values in the data set divided by 2 (Bluman Allan, 2018). It is a very rough
estimate of the average and can be affected by one extremely high or low value.
For example, if you have the following data set:
3, 30, 148, 157, 71
The lowest data value is 3, and the highest data value is 157, so the midrange is
(3 + 157) ÷ 2 = 80.
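A minimal Python sketch of the midrange for this data set follows; the function name `midrange` is illustrative only.

```python
def midrange(values):
    """Return the sum of the lowest and highest values, divided by 2."""
    return (min(values) + max(values)) / 2

data = [3, 30, 148, 157, 71]
data_midrange = midrange(data)
print(data_midrange)
```

Only the two extreme values enter the calculation, which is why a single unusually high or low value can shift the midrange so strongly.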
The weighted mean is a type of mean that is calculated by multiplying each value by its
weight (or probability), summing the products, and dividing by the sum of the weights
(See equation in Table 1). When the weights are probabilities or percentages that total
100%, the sum of the weights is 1, so the weighted mean is simply the sum of the products.
Weighted means are useful in a wide variety of scenarios. For example, Student A
may use a weighted mean to calculate his or her percentage score in a course:
TABLE 5: Weighted percentage score
STUDENT A
Item Weight Grade/Score
Assignment #1 10% 70%
Assignment #2 10% 65%
Midterm Exam 30% 70%
Final Exam 50% 85%
Weighted Mean 77%
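The figure of 77% in Table 5 can be verified with a short Python sketch. The weights are entered as whole percentages so that the arithmetic stays exact; the function name `weighted_mean` is hypothetical.

```python
def weighted_mean(weights, values):
    """Sum of weight * value, divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total_weight

# Rows of Table 5: (item, weight in %, score in %).
student_a = [
    ("Assignment #1", 10, 70),
    ("Assignment #2", 10, 65),
    ("Midterm Exam", 30, 70),
    ("Final Exam", 50, 85),
]

weights = [weight for _, weight, _ in student_a]
scores = [score for _, _, score in student_a]

course_score = weighted_mean(weights, scores)
print(course_score)
```

The final exam carries half of the total weight, which is why the weighted mean lies closer to 85% than the simple average of the four scores would.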
Another example shows how the weighted mean is used to compute a grade point
average. Suppose a student received an A in English Composition I (3 credits), a C in
Introduction to Psychology (3 credits), a B in Biology I (4 credits), and a D in Physical
Education (2 credits). Assuming A = 4 grade points, B = 3 grade points, C = 2 grade
points, D = 1 grade point, and F = 0 grade points, the student’s grade point average is
computed with the credits as weights (Bluman Allan, 2018):
GPA = (3 × 4 + 3 × 2 + 4 × 3 + 2 × 1) ÷ (3 + 3 + 4 + 2) = 32 ÷ 12 ≈ 2.7
The grade point average is 2.7.
The shape of a distribution influences the Measures of Central Tendency as shown in
the following graphs reproduced from Bluman (Bluman Allan, 2018):
In statistics, variability, dispersion, and spread are synonyms that denote the width of
the distribution.
Measures of variation or spread describe how similar or varied (scattered) the set of
observed values are for a particular variable (data item) as well as defining how far
away the data points tend to fall (differ) from the center or mean value. Measures of
variation include the range, interquartile range, variance, standard deviation, standard
error and coefficient of variation (ABS, Statistical Language - Measures of Spread,
2013). According to Bluman (Bluman Allan, 2018), the range, variance, and standard
deviation are the three measures commonly used for the spread or variability of a data
set.
For example:
Dataset A: 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8
Dataset B: 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11
The mode (most frequent value), median (middle value) and mean (arithmetic average)
of both datasets is 6. If we just looked at the measures of central tendency, we may
assume that the datasets are the same.
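The claim that all three measures equal 6 for both datasets can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are illustrative.

```python
from statistics import mean, median, mode

dataset_a = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
dataset_b = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]

# Collect the three measures of central tendency for each dataset.
summary = {
    name: {"mode": mode(data), "median": median(data), "mean": mean(data)}
    for name, data in (("A", dataset_a), ("B", dataset_b))
}

for name, measures in summary.items():
    print(name, measures)
```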
However, if we look at the spread of the values in the following graph, we can see that
Dataset B is more dispersed than Dataset A. Used together, the measures of central
tendency and measures of spread help us to better understand the data.
Source: (ABS, Statistical Language - Measures of Spread, 2013)
The range is the highest value minus the lowest value in a dataset (Bluman Allan,
2018).
Continuing with the previous example, the range for Dataset A is 4, the difference
between the highest value (8) and the lowest value (4). The range for Dataset B is 10,
the difference between the highest value (11) and the lowest value (1).
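A minimal Python sketch of the range calculation for the two datasets; the function name `value_range` is illustrative.

```python
def value_range(values):
    """Return the highest value minus the lowest value."""
    return max(values) - min(values)

dataset_a = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
dataset_b = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]

range_a = value_range(dataset_a)
range_b = value_range(dataset_b)
print(range_a, range_b)
```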
On a number line, you can see that the range of values for Dataset B is larger (therefore
data values less consistent or more spread) than Dataset A:
[Figure: the values of Dataset A and Dataset B plotted on number lines running from 0 to 13]
Quartiles are values that divide the data set into four equal parts.
One fourth of the data lie below the lower quartile and one fourth of the data lie above
the upper quartile, so one half of the data lie between the lower quartile and the upper
quartile. The difference between the upper and lower quartiles is called the interquartile
range. In the current example, the interquartile range is 3.5 – 1 = 2.5, but the range is
5 – 0 = 5.
Therefore, the inter-quartile range provides a clearer picture of the overall dataset by
removing/ignoring the outlying values. Like the range however, the inter-quartile range
is a measure of dispersion that is based upon only two values from the dataset.
Statistically, the standard deviation is a more powerful measure of dispersion because it
takes into account every value in the dataset.
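The interquartile range described above can be sketched in Python with `statistics.quantiles`, here applied to Datasets A and B from the earlier example. Quartile conventions differ between textbooks and software, so the values produced by this sketch are an assumption of the library's default method and may not match a hand calculation exactly.

```python
from statistics import quantiles

def interquartile_range(values):
    """Return Q3 - Q1 using the library's default quartile method."""
    q1, _q2, q3 = quantiles(values, n=4)
    return q3 - q1

dataset_a = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
dataset_b = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]

iqr_a = interquartile_range(dataset_a)
iqr_b = interquartile_range(dataset_b)
print(iqr_a, iqr_b)
```

As with the range, the more dispersed Dataset B has the larger interquartile range.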
The variance and the standard deviation are measures of the spread of the data
around the mean. They summarise how close each observed data value is to the mean
value.
The population standard deviation is the square root of the variance. The symbol for
the population standard deviation is σ. The population variance, σ², is defined as the
average of the squares of the distance each value is from the mean (Bluman Allan,
2018).
In datasets with a small spread all values are very close to the mean, resulting in a
small variance and standard deviation. Where a dataset is more dispersed, values are
spread further away from the mean, leading to a larger variance and standard deviation.
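The definition of the population variance can be turned into a short Python sketch and applied to Datasets A and B from the earlier example. The function names are illustrative; the standard library's `pvariance` is used only as a cross-check.

```python
import math
from statistics import pvariance

def population_variance(values):
    """Average of the squared distances of each value from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def population_sd(values):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(population_variance(values))

dataset_a = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
dataset_b = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]

var_a, var_b = population_variance(dataset_a), population_variance(dataset_b)
sd_a, sd_b = population_sd(dataset_a), population_sd(dataset_b)

print(round(var_a, 2), round(sd_a, 2))
print(round(var_b, 2), round(sd_b, 2))
```

Both datasets have a mean of 6, yet Dataset B has a much larger variance and standard deviation, which reflects its wider spread.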
The standard deviation of a normal distribution enables us to calculate confidence
intervals. In a normal distribution, about 68% of the values are within one standard
deviation either side of the mean and about 95% of the values are within two standard
deviations of the mean, according to the empirical rule (Bluman Allan, 2018), presented
diagrammatically as Figure 3.
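The 68% and 95% figures can be checked numerically. The sketch below uses `statistics.NormalDist` (available from Python 3.8) on a standard normal distribution; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of values within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_one = share_within(1)
within_two = share_within(2)

print(f"{within_one:.1%} within 1 SD, {within_two:.1%} within 2 SD")
```

Because any normal distribution can be rescaled to the standard normal, the same proportions hold whatever the mean and standard deviation are.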
The standard error measures how far an estimate based on a sample is likely to be from
the true value for the population. The standard error of the mean is the standard
deviation of the sample mean across repeated samples from the same population; it is
commonly estimated by dividing the sample standard deviation by the square root of the
sample size. The standard error can be used to construct a confidence interval.
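As a rough sketch, the standard error of the mean is commonly estimated as the sample standard deviation divided by the square root of the sample size. That formula is a standard result rather than something stated in this assignment, and treating Dataset A from the earlier example as a sample is an assumption made purely for illustration.

```python
import math
from statistics import stdev

# Dataset A, treated here as a sample from a larger population (an assumption).
sample = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]

# Sample standard deviation (divides by n - 1 rather than n).
sample_sd = stdev(sample)

# Estimated standard error of the mean: s / sqrt(n).
standard_error = sample_sd / math.sqrt(len(sample))

print(round(sample_sd, 3), round(standard_error, 3))
```

Larger samples give smaller standard errors, because the divisor grows with the square root of the sample size.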
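The coefficient of variation is the standard deviation divided by the mean, expressed as
a percentage. It is a statistic that allows you to compare standard deviations when the
units are different, as in the example that follows:
The mean of the number of sales of cars over a 3-month period is 87, and the standard
deviation is 5. The mean of the commissions is $5225, and the standard deviation is
$773. The coefficients of variation are:
Sales: (5 ÷ 87) × 100 ≈ 5.7%
Commissions: (773 ÷ 5225) × 100 ≈ 14.8%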
Since the coefficient of variation is larger for commissions, it can be concluded that the
commissions are more variable than the sales.
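The comparison can be sketched in Python, taking the coefficient of variation as the standard deviation divided by the mean, expressed as a percentage. The function name is illustrative.

```python
def coefficient_of_variation(standard_deviation, mean):
    """Standard deviation as a percentage of the mean."""
    return standard_deviation / mean * 100

cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

print(f"Sales: {cv_sales:.1f}%  Commissions: {cv_commissions:.1f}%")
```

Dividing by the mean removes the units (cars versus dollars), which is what makes the two figures comparable.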
The standard score, or z-score, tells how many standard deviations a data value is
above or below the mean for a specific distribution of values. A z-score of 0 means the
data value is the same as the mean; a z-score less than 0 represents a value below the
mean, and a z-score greater than 0 represents a value above the mean. For example, a
z-score of 1 represents a value 1 standard deviation above the mean and a z-score of 2
a value 2 standard deviations above it, while z-scores of -1 and -2 represent values 1
and 2 standard deviations below the mean, respectively.
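A z-score or standard score for a value is obtained by subtracting the mean from the
value and dividing the result by the standard deviation. The symbol for a standard
score is z. The formula is:
z = (value − mean) ÷ standard deviation
For a sample this is z = (X − x̅) ÷ s, and for a population z = (X − µ) ÷ σ, where s is
the sample standard deviation.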
Suppose a student scored 85 on an English test for which the mean score of all the
students was 76 and the standard deviation was 4, and that the same student scored
42 on a French test for which the class mean was 36 and the standard deviation was 3.
The relative positions on the two tests can be compared using the z-score:
English: z = (85 − 76) ÷ 4 = 2.25
French: z = (42 − 36) ÷ 3 = 2.00
Since the z-score for the English test is higher than the z-score for the French test, her
relative position in the English class was higher than her relative position in the French
class.
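The comparison of the two test results can be sketched in Python; the function name `z_score` is illustrative.

```python
def z_score(value, mean, standard_deviation):
    """Number of standard deviations a value lies above or below the mean."""
    return (value - mean) / standard_deviation

z_english = z_score(85, mean=76, standard_deviation=4)
z_french = z_score(42, mean=36, standard_deviation=3)

print(z_english, z_french)
```

Although the raw English score is much higher than the raw French score, the z-scores differ only slightly, because each score is judged against the mean and spread of its own class.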
Assume that the elements in a data set are rank ordered from the smallest to the
largest. The values that divide a rank-ordered set of elements into 100 equal parts are
called percentiles, those that divide it into 10 equal parts are deciles, and those that
divide it into 4 equal parts are quartiles.
The 25th percentile is the same as the first quartile (Q1), the 50th percentile is Q2,
which is also the median, and the 75th percentile is Q3. The 100th percentile, the
largest value, is sometimes labelled Q4.
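The relationship between percentiles, deciles and quartiles can be illustrated with `statistics.quantiles`, which returns the cut points that divide the data into `n` equal parts. The sketch uses Dataset B from the earlier example; the exact cut-point values depend on the library's default interpolation method, which is an assumption here and may differ from a textbook hand method.

```python
import math
from statistics import median, quantiles

data = sorted([1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11])

# n equal parts are separated by n - 1 cut points.
percentiles = quantiles(data, n=100)  # 99 cut points
deciles = quantiles(data, n=10)       # 9 cut points
quartiles = quantiles(data, n=4)      # 3 cut points: Q1, Q2, Q3

q1, q2, q3 = quartiles
print("Q1, Q2, Q3:", q1, q2, q3)
print("25th, 50th, 75th percentiles:",
      percentiles[24], percentiles[49], percentiles[74])
```

The 25th, 50th and 75th percentiles coincide with Q1, Q2 and Q3, and Q2 is the median.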
ABS. (2013, July 4). Statistical Language - Measures of Spread. Retrieved September 15, 2020, from
Australian Bureau of Statistics:
https://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+measures+of+spread
Bluman Allan, G. (2018). Elementary Statistics: A Step By Step Approach, 10th Edition. New York:
McGraw-Hill Education.
Frost, J. (2020). Normal Distribution in Statistics. Retrieved September 16, 2020, from Statistics by Jim
Frost:
https://statisticsbyjim.com/basics/normal-distribution/#:~:text=However%2C%20the%20standard%20normal%20distribution,score%20or%20a%20Z%2Dscore.