
GEORGE WILSON KULABA

#: U67273SF076372

COURSE NAME:
(Numbers of Life: Graphs & Data Distribution)

Assignment Title:
(Graphs, Data Description & Data Distribution)

ATLANTIC INTERNATIONAL UNIVERSITY


September/2020
Introduction
This assignment relates to the Numbers of Life seminar 'Graphs & Data Distribution'
(Week 3, AIU), which I watched on 13 September 2020. The seminar was the third course
of a 6-week series aimed at exploring samples and populations. These two concepts are
essential to statistics and help us to describe reality. I will need these basic
concepts when conducting research in my major field of study, the Doctorate in Food
Safety and Quality Management at AIU.
The assignment asked us to write a 3 to 6-page essay answering the following
questions:
- What are frequency graphs? What do they show?
- What are the measures of Central Tendency? Explain them.
- What do measures of variation show?
- What are measures of position?
- What is a normal distribution? Why is it called normal?
- How can you apply a normal distribution?
- What is a standard distribution? How is it different from a normal distribution?

The answers follow.

Body of Assignment
What are frequency graphs? What do they show?
In statistics and research, graphs are used to convey the data to viewers in pictorial
form. For most people, data presented graphically is easier to understand than data
presented numerically in tables or frequency distributions. In addition, statistical graphs
can serve a number of other purposes including use as a tool for discussing,
summarizing, describing and analyzing the data.
The following types of frequency distribution graphs are described below, together with
what they show:
1. Histogram
2. Bar Chart
3. Frequency Polygon
4. Ogives
5. Pie Graph
6. Pareto Charts
7. Time Series Graphs
8. Stem and Leaf Plot
9. Dotplot

A. The Histogram
Figure 1 shows a histogram drawn from the data in Table 1.

Table 1: Height of Children

Height of children (cm)    Tally             Absolute frequency    Relative frequency
120 – less than 130        |||| ||||         9                     18%
130 – less than 140        |||| ||||         10                    20%
140 – less than 150        |||| |||| |||     13                    26%
150 – less than 160        |||| |||| |       11                    22%
160 – less than 170        |||| ||           7                     14%
Total                                        50                    100%

From the frequency table (Table 1) we can quickly identify information such as 7
children (14% of all children) are in the 160 to less than 170 cm height range, and that
there are more children with heights in the 140 to less than 150 cm range (26% of all
children) than any other height range.
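
The relative frequencies in Table 1 follow directly from the absolute frequencies. As a
minimal sketch in Python (the variable names are illustrative and not taken from the
source), the percentages and a rough text version of the histogram could be reproduced
like this:

```python
# Absolute frequencies copied from Table 1 (height classes in cm).
height_counts = {
    "120 to <130": 9,
    "130 to <140": 10,
    "140 to <150": 13,
    "150 to <160": 11,
    "160 to <170": 7,
}

total = sum(height_counts.values())

# Relative frequency = class count / total count, expressed as a percentage.
relative_freq = {label: 100 * count / total for label, count in height_counts.items()}

# The class with the largest count corresponds to the tallest histogram column.
modal_class = max(height_counts, key=height_counts.get)

for label, count in height_counts.items():
    # One '#' per child gives a crude sideways histogram.
    print(f"{label:12} {'#' * count:13} {count:2d} {relative_freq[label]:5.1f}%")
```

The printed rows mirror the columns of Table 1, with the bar of '#' characters standing
in for the tally marks.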

Histograms are visual displays of frequencies using columns plotted on a graph. The Y-
axis (vertical axis) generally represents the frequency count, while the X-axis (horizontal
axis) generally represents the variable being measured.

A histogram (Figure 1) is a type of graph in which each column represents a numeric
variable, in particular that which is continuous and/or grouped. It is the most commonly
used graph to show frequency distributions.

A histogram shows the distribution of all observations in a quantitative dataset. It is
useful for describing the shape, centre and spread to better understand the distribution
of the dataset.

The features of a histogram are (ABS, 2013):

• The height of the column shows the frequency for a specific range of values.
• Columns are usually of equal width; however, a histogram may show data using
unequal ranges (intervals) and therefore have columns of unequal width.
• The values represented by each column must be mutually exclusive and exhaustive.
Therefore, there are no spaces between columns and each observation can only ever
belong in one column.
• It is important that there is no ambiguity in the labelling of the intervals on the
x-axis for continuous or grouped data (e.g. 0 to less than 10, 10 to less than 20,
20 to less than 30).

Figure 1: Histogram (Distribution of Children’s Height). Source: (ABS, 2013)

B. Bar Chart
A bar chart is shown in Figure 2. It is drawn from data in Table 2.
A bar chart is a type of graph in which each column (plotted either vertically or
horizontally) represents a categorical variable or a discrete ungrouped numeric variable.
It is used to compare the frequency (count) for a category or characteristic with another
category or characteristic.
The features of a bar chart are (ABS, 2013):
• In a bar chart, the bar height (if vertical) or length (if horizontal) shows the
frequency for each category or characteristic.
• The distribution of the dataset is not important because the columns each represent
an individual category or characteristic rather than intervals for a continuous
measurement. Therefore, gaps are included between each bar and each bar can be
arranged in any order without affecting the data.
For example:
If data had been collected for 'country of birth' from a sample of children (Table 2), a bar
chart could be used to plot the data as 'country of birth' is a categorical variable.

Table 2: Birthplace of Children

Country of Birth            Absolute frequency    Relative frequency
Australia                   16                    32%
Fiji                        3                     6%
India                       8                     16%
Italy                       10                    20%
New Zealand                 9                     18%
United States of America    4                     8%
Total                       50                    100%

The bar chart below (Figure 2) shows us that 'Australia' is the most commonly observed
country of birth of the 50 children sampled, while 'Fiji' is the least common country of
birth.

Figure 2: Bar Chart (Birthplace of Children), Source: (ABS, 2013).


C. The Frequency Polygon
A frequency polygon is a graph constructed by using lines to join the midpoints of
each interval, or bin. The heights of the points represent the frequencies. A frequency
polygon can be created from the histogram or by calculating the midpoints of the bins
from the frequency distribution table.
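
As a small illustration, the vertices of a frequency polygon for the height data in
Table 1 could be computed as below. This is only a sketch; the list names are
assumptions made for the example.

```python
# Class boundaries and frequencies from Table 1 (heights in cm).
class_bounds = [(120, 130), (130, 140), (140, 150), (150, 160), (160, 170)]
frequencies = [9, 10, 13, 11, 7]

# The midpoint of each class is the average of its lower and upper boundary.
midpoints = [(low + high) / 2 for low, high in class_bounds]

# Each (midpoint, frequency) pair is one vertex of the frequency polygon.
polygon_points = list(zip(midpoints, frequencies))
print(polygon_points)
```

Joining these points with straight lines gives the polygon.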

Figure 5: Frequency polygons


D. Ogives
The fourth type of graph that can be used represents the cumulative frequencies for
the classes. This type of graph is called the cumulative frequency graph, or ogive.
The cumulative frequency is the sum of the frequencies accumulated up to the upper
boundary of a class in the distribution.
The ogive is a graph that represents the cumulative frequencies for the classes in a
frequency distribution (Bluman Allan, 2018).
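
To make the idea of cumulative frequency concrete, the running totals for the height
classes of Table 1 can be accumulated as in the following sketch (the names are
illustrative, and the upper class boundaries are taken as 130, 140, ..., 170 cm):

```python
from itertools import accumulate

# Upper class boundaries and class frequencies from Table 1.
upper_boundaries = [130, 140, 150, 160, 170]
frequencies = [9, 10, 13, 11, 7]

# Cumulative frequency: running total of the frequencies up to each boundary.
cumulative = list(accumulate(frequencies))

# An ogive plots each upper boundary against its cumulative frequency.
ogive_points = list(zip(upper_boundaries, cumulative))
print(ogive_points)
```

The last cumulative value always equals the total number of observations, here 50.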

Figure 6: Ogive
E. Graphs of Relative Frequencies
Graphs of relative frequencies instead of frequencies are used when the proportion of
data values that fall into a given class is more important than the actual number of data
values that fall into that class.

Figure 7: Relative frequency Graphs


Figure 8: Other Types of Graphs Used in Statistics. Source: (Bluman Allan, 2018).
F. Pie Graph
Pie graphs are used extensively in statistics. The purpose of the pie graph is to show
the relationship of the parts to the whole by visually comparing the sizes of the sections.
Percentages or proportions can be used. The variable is nominal or categorical.
A pie graph is a circle that is divided into sections or wedges according to the
percentage of frequencies in each category of the distribution (see Figure 8).
G. Pareto Charts
When the variable displayed on the horizontal axis is qualitative or categorical, a
Pareto chart can also be used to represent the data (see Figure 8).
A Pareto chart is used to represent a frequency distribution for a categorical variable,
and the frequencies are displayed by the heights of vertical bars, which are arranged in
order from highest to lowest (see Figure 8).
H. Time Series Graphs
When data are collected over a period of time, they can be represented by a time series
graph.
A time series graph represents data that occur over a specific period of time (see
Figure 8).
I. Stem and Leaf Plot
The stem and leaf plot is a method of organizing data and is a combination of sorting
and graphing. It has the advantage over a grouped frequency distribution of retaining
the actual data while showing them in graphical form.
A stem and leaf plot is a data plot that uses part of the data value as the stem and part
of the data value as the leaf to form groups or classes. For example, a data value of 34
would have 3 as the stem and 4 as the leaf. A data value of 356 would have 35 as the
stem and 6 as the leaf. Example 2–14 in Bluman's textbook shows the procedure for
constructing a stem and leaf plot (Bluman Allan, 2018).
J. Dotplot
A dotplot uses points or dots to represent the data values. If the data values occur
more than once, the corresponding points are plotted above one another.
A dotplot is a statistical graph in which each data value is plotted as a point (dot) above
the horizontal axis. Dotplots are used to show how the data values are distributed and
to see if there are any extremely high or low data values (Bluman Allan, 2018).
What are the measures of Central Tendency? Explain them.
In statistics, a measure of central tendency (also referred to as measures of
centre or central location) is a summary measure that attempts to describe a whole set
of data with a single value that represents the middle or centre of its distribution.
There are three main measures of central tendency: the mode, the median and the
mean. Others include the midrange and the weighted mean. Table 3 summarizes the
measures of central tendency, a detailed account of which is given in chapter three of
the textbook by Bluman (Bluman Allan, 2018).
TABLE 3: Summary of Measures of Central Tendency
Measure          Definition                               Symbol(s)    Formula
Mean             Sum of values, divided by total          μ, x̅         μ = ΣX / N (population)
                 number of values                                      x̅ = ΣX / n (sample)
Median           Middle point in data set that            MD
                 has been ordered
Mode             Most frequent data value                 None
Midrange         Lowest value plus highest                MR           MR = (lowest value + highest value) / 2
                 value, divided by 2
Weighted Mean    Multiplying each value by its                         x̅ = (w1X1 + w2X2 + . . . + wnXn) / (w1 + w2 + . . . + wn)
                 corresponding weight and                              where w1, w2, . . . , wn are the weights
                 dividing the sum of the products                      and X1, X2, . . . , Xn are the values.
                 by the sum of the weights.

Each of these measures gives a different indication of the typical or central value in
the distribution.
The mean, also known as the arithmetic average, is found by adding the values of the
data and dividing by the total number of values. For example, the mean of 3, 2, 6, 5,
and 4 is found by adding 3 + 2 + 6 + 5 + 4 = 20 and dividing by 5; hence, the mean of
the data is 20 ÷ 5 = 4.
The population mean is indicated by the Greek symbol µ, but when the mean is
calculated on a distribution from a sample it is indicated by the symbol x̅ (X-bar). In the
formulae (Table 3), n represents the total number of values in the sample while N
represents the total number of values in the population.
One advantage of the mean is that it can be used for both continuous and discrete
numeric data. However, its limitations include the following:
• The mean cannot be calculated for categorical data, as the values cannot be
summed.
• As the mean includes every value in the distribution the mean is influenced by
outliers and skewed distributions.
The mode is the value that occurs most often in the data set. Consider this dataset
showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

Table 4 shows a simple frequency distribution of the retirement age data given above.

TABLE 4: Retirement age

Age    Frequency
54     3
55     1
56     1
57     2
58     2
60     2

The most commonly occurring value is 54, therefore the mode of this distribution is 54
years.
The mode has an advantage over the median and the mean as it can be found for
both numerical and categorical (non-numerical) data.

There are some limitations to using the mode:


• In some distributions, the mode may not reflect the centre of the distribution very
well. When the distribution of retirement age is ordered from lowest to highest
value, it is easy to see that the centre of the distribution is 57 years, but the
mode is lower, at 54 years.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

• It is also possible for there to be more than one mode for the same distribution of
data, (bi-modal, or multi-modal). The presence of more than one mode can limit
the ability of the mode in describing the centre or typical value of the distribution
because a single value to describe the centre cannot be identified.
• In some cases, particularly where the data are continuous, the distribution may
have no mode at all (i.e. if all values are different).
In cases such as these, it may be better to consider using the median or mean, or group
the data into appropriate intervals, and find the modal class.
The median is the middle value in a distribution when the values are arranged in
ascending or descending order.
The median divides the distribution in half (there are 50% of observations on either side
of the median value). In a distribution with an odd number of observations, the median
value is the middle value.

Looking at the retirement age distribution example (which has 11 observations), the
median is the middle value, which is 57 years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

When the distribution has an even number of observations, the median value is the
mean of the two middle values. In the following distribution, the two middle values are
56 and 57, therefore the median equals 56.5 years:

52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The median is less affected by outliers and skewed data than the mean, and is usually
the preferred measure of central tendency when the distribution is not symmetrical. A
limitation of the median is that it cannot be identified for categorical nominal data, as this
cannot be logically ordered.
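
The worked examples for the mean, mode and median above can be checked with Python's
standard `statistics` module. This is a sketch for illustration only; the last list is
hypothetical data chosen to show a distribution with two modes.

```python
import statistics

small_sample = [3, 2, 6, 5, 4]
retirement_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
retirement_ages_even = [52] + retirement_ages  # the 12-value example

# Mean: sum of the values divided by how many there are (20 / 5).
mean_small = statistics.mean(small_sample)

# Mode: the most frequent value.
mode_age = statistics.mode(retirement_ages)

# Median: middle value (odd count) or mean of the two middle values (even count).
median_odd = statistics.median(retirement_ages)
median_even = statistics.median(retirement_ages_even)

# multimode returns every value tied for most frequent (hypothetical data).
modes_of_tie = statistics.multimode([1, 1, 2, 2, 3])

print(mean_small, mode_age, median_odd, median_even, modes_of_tie)
```

The results agree with the values derived by hand: a mean of 4, a mode of 54, and medians
of 57 and 56.5.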
The midrange is a rough estimate of the middle. It is defined as the sum of the lowest
and highest values in the data set divided by 2 (Bluman Allan, 2018). It is a very rough
estimate of the average and can be affected by one extremely high or low value.
For example, consider the following data set:
3, 30, 148, 157, 71
The lowest data value is 3, and the highest data value is 157, so the midrange is
MR = (3 + 157) / 2 = 80.

The weighted mean is a type of mean that is calculated by multiplying each value by the
weight (or probability) associated with it, summing all the products together, and
dividing by the sum of the weights (see the formula in Table 3).
Weighted means are useful in a wide variety of scenarios. For example, Student A
may use a weighted mean to calculate his/her percentage score in a course:
TABLE 5: Weighted percentage score
STUDENT A
Item             Weight    Grade/Score
Assignment #1    10%       70%
Assignment #2    10%       65%
Midterm Exam     30%       70%
Final Exam       50%       85%
Weighted Mean              77%

Another example shows how the weighted mean is used to compute a grade point
average. Suppose a student received an A in English Composition I (3 credits), a C in
Introduction to Psychology (3 credits), a B in Biology I (4 credits), and a D in Physical
Education (2 credits). Assuming A = 4 grade points, B = 3 grade points, C = 2 grade
points, D = 1 grade point, and F = 0 grade points, the student’s grade point average is
computed as (Bluman Allan, 2018):
x̅ = (3·4 + 3·2 + 4·3 + 2·1) / (3 + 3 + 4 + 2) = 32 / 12 ≈ 2.7
The grade point average is 2.7.
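
Both weighted-mean examples can be reproduced with a few lines of Python. The helper
name `weighted_mean` is an assumption made for this sketch rather than something taken
from the textbook.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Table 5: scores (%) weighted by 10%, 10%, 30% and 50%.
course_score = weighted_mean([70, 65, 70, 85], [10, 10, 30, 50])

# Grade point example: A=4, C=2, B=3, D=1, weighted by credits 3, 3, 4, 2.
gpa = weighted_mean([4, 2, 3, 1], [3, 3, 4, 2])

print(course_score, round(gpa, 1))
```

Dividing by the sum of the weights means the weights do not have to add up to 1 (or
100%), which is why credits can be used directly in the grade point example.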
The shape of a distribution influences the Measures of Central Tendency as shown in
the following graphs reproduced from Bluman (Bluman Allan, 2018):

In a positively skewed or right-skewed distribution (Histogram (a)), the majority of the
data values fall to the left of the mean and cluster at the lower end of the distribution;
the “tail” is to the right. Also, the mean is to the right of the median, and the mode is to
the left of the median.
In a symmetric distribution (Histogram (b)), the data values are evenly distributed on
both sides of the mean. In addition, when the distribution is unimodal, the mean,
median, and mode are the same and are at the center of the distribution.

In a negatively skewed or left-skewed distribution (Histogram (c)), the majority of the
data values fall to the right of the mean and cluster at the upper end of the distribution,
with the tail to the left. Also, the mean is to the left of the median, and the
mode is to the right of the median. As an example, a negatively skewed distribution
results if the majority of students score very high on an instructor’s examination.

What do measures of variation show?


In statistics, to describe the data set accurately, statisticians must know more than the
measures of central tendency. Just as there are multiple measures of central tendency,
there are several measures of variability.

In statistics, variability, dispersion, and spread are synonyms that denote the width of
the distribution.

Measures of variation or spread describe how similar or varied (scattered) the set of
observed values are for a particular variable (data item) as well as defining how far
away the data points tend to fall (differ) from the center or mean value. Measures of
variation include the range, interquartile range, variance, standard deviation, standard
error and coefficient of variation (ABS, Statistical Language - Measures of Spread,
2013). According to Bluman (Bluman Allan, 2018), the range, variance, and standard
deviation are the three measures commonly used for the spread or variability of a data
set.

For example:

Dataset A: 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8
Dataset B: 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11

The mode (most frequent value), median (middle value) and mean (arithmetic average)
are all 6 for both datasets. If we just looked at the measures of central tendency, we
might assume that the datasets are the same.

However, if we look at the spread of the values in the following graph, we can see that
Dataset B is more dispersed than Dataset A. Used together, the measures of central
tendency and measures of spread help us to better understand the data.
Source: (ABS, Statistical Language - Measures of Spread, 2013)

The range is the highest value minus the lowest value in a dataset (Bluman Allan,
2018).

Continuing with the previous example, the range for Dataset A is 4, the difference
between the highest value (8) and the lowest value (4), and for Dataset B it is 10, the
difference between the highest value (11) and the lowest value (1).

On a number line, you can see that the range of values for Dataset B is larger (therefore
data values less consistent or more spread) than Dataset A:

(Number lines for Dataset A and Dataset B, each drawn on an axis from 0 to 13: Dataset A
spans 4 to 8 and Dataset B spans 1 to 11.)

Quartiles are values that divide the data set into four equal parts.
One fourth of the data lie below the lower quartile and one fourth of the data lie above
the upper quartile, so one half of the data lie between the lower quartile and the upper
quartile. The difference between the upper and lower quartiles is called the
interquartile range. For example, in a data set with a lower quartile of 1 and an upper
quartile of 3.5, the interquartile range is 3.5 – 1 = 2.5, whereas if its values run from
0 to 5 the range is 5 – 0 = 5.

Therefore, the inter-quartile range provides a clearer picture of the overall dataset by
removing/ignoring the outlying values. Like the range however, the inter-quartile range
is a measure of dispersion that is based upon only two values from the dataset.
Statistically, the standard deviation is a more powerful measure of dispersion because it
takes into account every value in the dataset.
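
As an illustration, the range and interquartile range of Dataset A and Dataset B can be
compared in Python. This is a sketch under one assumption worth stating: several
conventions exist for locating quartiles, and `statistics.quantiles` uses its default
('exclusive') method here, so the quartile values may differ slightly from those given
by other textbooks or tools.

```python
import statistics

dataset_a = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
dataset_b = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]

def value_range(data):
    """Highest value minus lowest value."""
    return max(data) - min(data)

def interquartile_range(data):
    """Upper quartile minus lower quartile."""
    # With n=4, quantiles returns the three cut points Q1, Q2 and Q3.
    q1, _q2, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

for name, data in (("A", dataset_a), ("B", dataset_b)):
    print(name, value_range(data), interquartile_range(data))
```

Both measures come out larger for Dataset B, which matches the observation that it is
the more dispersed of the two.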

The variance and the standard deviation are measures of the spread of the data
around the mean. They summarise how close each observed data value is to the mean
value.

The population standard deviation is the square root of the variance. The symbol for
the population standard deviation is σ. The population variance, σ², is defined as the
average of the squares of the distance each value is from the mean (Bluman Allan,
2018):
σ² = Σ(X - μ)² / N

In datasets with a small spread all values are very close to the mean, resulting in a
small variance and standard deviation. Where a dataset is more dispersed, values are
spread further away from the mean, leading to a larger variance and standard deviation.
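
The population variance and standard deviation of the two example datasets can be
computed as in the following sketch, which treats each dataset as a complete population
(hence `pvariance` and `pstdev` rather than the sample versions):

```python
import math
import statistics

dataset_a = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
dataset_b = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]

# Population variance: average squared distance of each value from the mean.
var_a = statistics.pvariance(dataset_a)
var_b = statistics.pvariance(dataset_b)

# Population standard deviation: square root of the population variance.
sd_a = statistics.pstdev(dataset_a)
sd_b = statistics.pstdev(dataset_b)

# If every value is identical there is no spread at all.
constant_sd = statistics.pstdev([6, 6, 6, 6])

print(round(var_a, 2), round(sd_a, 2))
print(round(var_b, 2), round(sd_b, 2))
print(constant_sd)
```

Both datasets have a mean of 6, but the squared deviations sum to 14 for Dataset A and
110 for Dataset B, so Dataset B has the larger variance and standard deviation.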

The smaller the variance and standard deviation, the more the mean value is indicative
of the whole dataset. Therefore, if all values of a dataset are the same, the standard
deviation and variance are zero.

The standard deviation of a normal distribution enables us to calculate confidence
intervals. In a normal distribution, about 68% of the values are within one standard
deviation either side of the mean and about 95% of the scores are within two standard
deviations of the mean according to the empirical rule (Bluman Allan, 2018),
diagrammatically presented as Figure 3.
The standard error is a measure of how much an estimate of a population value that is
based on a sample is likely to vary from the true value for the population. For a sample
mean, it is the standard deviation of the sampling distribution of that mean, estimated
by dividing the sample standard deviation by the square root of the sample size. The
standard error can be used to construct a confidence interval.

A confidence interval is a range in which it is estimated the true population value
lies. Confidence intervals of different sizes can be created to represent different levels
of confidence that the true population value will lie within a particular range. A common
confidence interval used in statistics is the 95% confidence interval. In a 'normal
distribution', the 95% confidence interval extends approximately two standard errors
(more precisely, 1.96) either side of the estimate.
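
The sketch below shows how a standard error and an approximate 95% confidence interval
for a mean might be computed. It rests on assumptions that are not in the source:
Dataset A is treated as a random sample from a larger population, and the interval uses
the normal approximation described above. For a sample this small, a t-based interval
would usually be somewhat wider.

```python
import math
import statistics
from statistics import NormalDist

# Dataset A, treated here (as an assumption) as a sample from a larger population.
sample = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)

# Standard error of the mean: sample standard deviation / square root of n.
standard_error = sample_sd / math.sqrt(n)

# z value that leaves 2.5% in each tail of the standard normal curve (about 1.96).
z_95 = NormalDist().inv_cdf(0.975)

ci_low = sample_mean - z_95 * standard_error
ci_high = sample_mean + z_95 * standard_error

print(round(standard_error, 3), round(ci_low, 2), round(ci_high, 2))
```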

The coefficient of variation, denoted by CVar, is the standard deviation divided by
the mean. The result is expressed as a percentage:
CVar = (standard deviation / mean) × 100%

It is a statistic that allows you to compare standard deviations when the units are
different, as in the example that follows:

The mean of the number of sales of cars over a 3-month period is 87, and the standard
deviation is 5. The mean of the commissions is $5225, and the standard deviation is
$773. The coefficients of variation are:
Sales: CVar = (5 / 87) × 100% ≈ 5.7%
Commissions: CVar = (773 / 5225) × 100% ≈ 14.8%

Since the coefficient of variation is larger for commissions, it can be concluded that the
commissions are more variable than the sales.
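
The comparison can be verified with a short calculation. The function name
`coefficient_of_variation` is an illustrative choice for this sketch.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

# Figures from the car sales example above.
cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

print(round(cv_sales, 1), round(cv_commissions, 1))
```

Because each standard deviation is divided by its own mean, the units (cars and
dollars) cancel out, which is what makes the two figures comparable.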

What are measures of position?


A measure of position is a method by which the position that a particular data value
has within a given data set can be identified. As with other types of measures, there is
more than one approach to defining such a measure. The most common measures of
position are percentiles, quartiles, and standard scores (z-scores).

The standard score or z-score tells how many standard deviations a data value is
above or below the mean for a specific distribution of values. If a standard score is zero,
then the data value is the same as the mean. A z-score less than 0 represents an
element less than the mean, and a z-score greater than 0 represents an element greater
than the mean. A z-score equal to 1 represents an element that is 1 standard deviation
greater than the mean, and a z-score equal to 2 an element that is 2 standard deviations
greater than the mean. Likewise, a z-score equal to -1 represents an element that is 1
standard deviation less than the mean, and a z-score equal to -2 an element that is 2
standard deviations less than the mean.
A z-score or standard score for a value is obtained by subtracting the mean from the
value and dividing the result by the standard deviation. The symbol for a standard
score is z. The formula is:
z = (value - mean) / standard deviation

Suppose a student scored 85 on an English test for which the mean score of all the
students was 76 and the standard deviation was 4, and that the same student scored
42 on a French test for which the class mean was 36 and the standard deviation was 3.
The relative positions on the two tests can be compared using the z-score.

For the English test: z = (85 - 76) / 4 = 2.25

For the French test: z = (42 - 36) / 3 = 2.00

Since the z-score for the English test is higher than the z-score for the French test, the
student's relative position in the English class was higher than in the French class.
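
The two standard scores can be reproduced with a one-line function. This is a minimal
sketch, and the name `z_score` is chosen only for the example.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

z_english = z_score(85, 76, 4)
z_french = z_score(42, 36, 3)

print(z_english, z_french)
```

The English score lies 2.25 standard deviations above its class mean and the French
score 2 standard deviations above its class mean.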
Suppose that the elements in a data set are rank ordered from the smallest to the
largest. The values that divide a rank-ordered set of elements into 100 equal parts are
called percentiles, those that divide it into 10 equal parts are deciles, and those that
divide it into 4 equal parts are quartiles. The 25th percentile is the same as the first
quartile (Q1), the 50th percentile is the second quartile (Q2), which is also the median,
and the 75th percentile is the third quartile (Q3).
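
The correspondence between percentiles and quartiles can be checked numerically. The
sketch below reuses Dataset B from the measures of variation section and relies on the
default method of `statistics.quantiles`; other quartile conventions can give slightly
different cut points.

```python
import math
import statistics

data = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]  # Dataset B

# Dividing into 4 parts needs 3 cut points; dividing into 100 parts needs 99.
quartiles = statistics.quantiles(data, n=4)      # [Q1, Q2, Q3]
percentiles = statistics.quantiles(data, n=100)  # [P1, P2, ..., P99]

# List indices start at 0, so the 25th percentile is at index 24, and so on.
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]

print(quartiles)
print(p25, p50, p75, statistics.median(data))
```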

What is a normal distribution? Why is it called normal?


A normal distribution is a continuous, symmetrical, bell-shaped probability distribution
that is fully described by its mean and standard deviation. Many statistical methods
assume that the data are approximately normally distributed, so samples are commonly
checked for normality before such methods are applied. Normally distributed data obey
the Empirical Rule, with about 68% of the data values falling within 1 standard deviation
of the mean, 95% of the data values falling within 2 standard deviations of the mean,
and 99.7% of the data values falling within 3 standard deviations of the mean (See
Figure 4).
The Key features of the normal distribution can be summarized as:
• Symmetrical bell shape typical of natural phenomena such as distribution of
people weights, heights, etc.
• Mode, median and mean are the same and are together in the centre of the
curve
• No skewness
• There can only be one mode (i.e. there is only one value which is most frequently
observed)
• Most of the data are clustered around the centre, while the more extreme values
on either side of the centre become rarer as the distance from the centre
increases. About 68% of values lie within one standard deviation (σ) of the mean,
about 95% of the values lie within two standard deviations, and about 99.7% are
within three standard deviations. This is known as the empirical rule or the
3-sigma rule.
• Total area under the curve is 1
It is called normal because its symmetrical bell shape is the pattern typically observed
in naturally occurring distributions. It is the most important continuous probability
distribution in statistics. A vast number of random variables of interest in the physical
sciences and economics are either approximately or exactly described by the normal
distribution. Moreover, it can also be used to approximate other probability
distributions, which further justifies the word normal in the sense of the distribution
that is most commonly used.

How can you apply a normal distribution?


In psychological studies it is often assumed that the variables under study are normally
distributed. For example, intelligence, anxiety, self-concept, height, weight, etc. are
approximately normally distributed, provided a sufficiently large random sample is drawn
from the population. Therefore, when drawing conclusions about these variables in the
population based on a sample, the properties of the normal distribution can be used.
The normal distribution has many applications in solving day-to-day problems. Some
of the applications are listed below:
1. If the distribution of scores is normal, then the area property of the normal
distribution can be used to draw different kinds of inferences. For example, if
the marks obtained by students in the GMAT examination are normally distributed
with mean 450 and standard deviation 30, then the area property can be used to
conclude that around 68% of students’ GMAT scores lie between 420 and 480
(M ± σ), around 95% lie between 390 and 510 (M ± 2σ) and around 99.7% lie
between 360 and 540 (M ± 3σ).
2. Normal distribution can be used to develop scales on various behavioural
parameters to assess the performance of individuals.
3. Grading criteria can be developed for an individual on different behavioural
parameters using normal distribution.
4. Statistical tests assume the normality of data for testing of hypotheses
concerning population parameters.
5. If scores are normally distributed, the properties of the normal distribution can be
used to solve a variety of problems, including testing for normality.
6. The principle of central limit theorem is used in testing of parametric and many
non-parametric hypotheses.
7. Application of the standard score: Since, for all practical purposes, the range of
standard scores is from −3 to +3, a value of z close to +3 is on the high side
and a value close to −3 is on the low side. Thus, the standard score can be used
to gauge the performance of an individual on a particular parameter. Further,
because standard scores are unit-free, they can be used to compare the
performance of an individual on two variables measured on different scales. For
instance, the emotional status and creativity of an individual can be compared by
converting the scores into standard scores, even though emotional status and
creativity are assessed on different scales.
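
The GMAT illustration in item 1 can be checked with `statistics.NormalDist` from the
Python standard library. This is a sketch that simply takes the mean of 450 and the
standard deviation of 30 from the example; the helper name `share_within` is an
assumption made here.

```python
from statistics import NormalDist

# Normal model from the example: mean 450, standard deviation 30.
gmat = NormalDist(mu=450, sigma=30)

def share_within(k):
    """Return the interval mean ± k standard deviations and the share of scores inside it."""
    low = gmat.mean - k * gmat.stdev
    high = gmat.mean + k * gmat.stdev
    # Area under the curve between low and high.
    return low, high, gmat.cdf(high) - gmat.cdf(low)

for k in (1, 2, 3):
    low, high, share = share_within(k)
    print(f"{low:.0f} to {high:.0f}: {share:.1%}")
```

The computed areas round to the 68%, 95% and 99.7% figures of the empirical rule.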

What is a standard distribution? How is it different from a normal distribution?


The standard normal distribution is a special case of the normal distribution. It is the
distribution that occurs when a normal random variable has a mean of zero and a
standard deviation of one. The standard normal distribution is centered at zero and the
degree to which a given measurement deviates from the mean is given by the standard
deviation.
The normal random variable of a standard normal distribution is called a standard
score or a z-score. Every normal random variable X can be transformed into a z score
via the following equation:
z = (X - μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard
deviation of X.
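As for any normal distribution, approximately 68% of the observations lie within 1
standard deviation of the mean, 95% lie within 2 standard deviations of the mean, and
99.7% lie within 3 standard deviations of the mean, as illustrated in Figure 3. Because
the standard normal distribution has a mean of 0 and a standard deviation of 1, these
intervals are simply z = -1 to 1, z = -2 to 2 and z = -3 to 3.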
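
The link between a general normal distribution and the standard normal distribution can
be illustrated with the equation z = (X - μ) / σ. The sketch below reuses the GMAT
figures (mean 450, standard deviation 30) purely as an example and compares the
probability of a score below 480 computed both ways.

```python
import math
from statistics import NormalDist

x_dist = NormalDist(mu=450, sigma=30)  # a general normal distribution
z_dist = NormalDist()                  # standard normal: mean 0, standard deviation 1

x = 480
z = (x - x_dist.mean) / x_dist.stdev   # z = (X - mu) / sigma

# The probability of falling below x is the same on either scale.
p_from_x = x_dist.cdf(x)
p_from_z = z_dist.cdf(z)

print(z, round(p_from_x, 4), round(p_from_z, 4))
```

Standardising changes only the scale on which the value is expressed, not the
probability attached to it, which is why a single table of z values serves every normal
distribution.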
Conclusion
The webinar and the assignment reinforced each other in a very constructive way,
enabling us to consolidate statistical concepts and knowledge about the normal
distribution, measures of central tendency, measures of variation and measures of
position. I am sure that what I learnt will bear fruit in my studies at AIU.
Bibliography
ABS. (2013, July 3). Statistical Language - Frequency Distribution. Retrieved September 15, 2020, from
Australian Bureau of Statistics (ABS):
https://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-
+frequency+distribution

ABS. (2013, July 4). Statistical Language - Measures of Spread. Retrieved September 15, 2020, from
Australian Bureau of Statistics:
https://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-
+measures+of+spread

Bluman, Allan G. (2018). Elementary Statistics: A Step by Step Approach, 10th Edition. New York:
McGraw-Hill Education.

Frost, J. (2020). Normal Distribution in Statistics. Retrieved September 16, 2020, from Statistics by Jim
Frost: https://statisticsbyjim.com/basics/normal-
distribution/#:~:text=However%2C%20the%20standard%20normal%20distribution,score%20or
%20a%20Z%2Dscore.
