
DATA REPRESENTATION & ANALYSIS

WHAT IS DATA?
Data is a collection of facts (such as numbers, words, measurements, observations or even just descriptions of things) from
which conclusions may be drawn.
WHY IS DATA IMPORTANT?
Collecting data is the first step in statistical data analysis. Data can be taken from existing sources or gathered through
observation, surveys, or experiments.
TYPES OF DATA
Qualitative data deals with characteristics and descriptors that can't be easily
measured, but can be observed subjectively—such as smells, tastes, textures,
attractiveness, and colour. These observations fall into separate distinct categories.
E.g. Colour of eyes : blue, green, brown etc., Exam result : pass or fail, Socio-economic
status : low, middle or high.
Quantitative data deals with numbers and things you can measure objectively: e.g.
dimensions (such as height, width, and length), temperature, humidity, prices, area
and volume. These numerical responses may be discrete or continuous.
Discrete data is a count that can't be made more precise. Typically it involves integers. For instance, the number of children
(or adults, or pets) in your family is discrete data, because you are counting whole, indivisible entities: you can't have 2.5 kids,
or 1.3 pets.
Continuous data, on the other hand, could be divided and reduced to finer and finer levels. For example, you can measure the
height of your kids at progressively more precise scales—meters, centimetres, millimetres, and beyond—so height is
continuous data.

REPRESENTING DATA WITH DIAGRAMS
You know the saying, "A picture is worth a thousand words." After collecting and organizing data, the next step is to display it
in a manner that makes it easy to read, highlighting similarities, disparities, trends, and other relationships (or the lack of
them) in the data set. Visual representations make collected data easier to understand. In selecting how best to
present your data, think about the purpose and what you want to present, then decide which variables you want to include and
whether they should be expressed as frequencies, percentages, or categories. We now focus on two diagrams: the stem-and-leaf
diagram and the box-and-whisker diagram.

STEM-AND-LEAF DIAGRAMS

A stem-and-leaf diagram, also called a stem-and-leaf plot, is a diagram that quickly organizes and
summarizes data while maintaining the individual data points. In such a diagram, the "stem" is a
column of the unique values that remain after removing the last digit of each data point. The final
digits ("leaves") are then placed in a row next to the appropriate stem and sorted in numerical order.
In general, stems may have as many digits as needed, but each leaf should contain only a single
digit. This diagram was invented by John Tukey. Look at the example below:

Elements of a good stem and leaf plot

A good stem and leaf plot

• shows the first digits of the number (thousands, hundreds or tens) as the stem and shows the last digit (ones) as the leaf.
• usually uses whole numbers. Anything that has a decimal point is rounded to the nearest whole number. For example, test
results, speeds, heights, weights, etc.
• looks like a bar graph when it is turned on its side.
• shows how the data are spread: that is, the highest number, lowest number, most common number and outliers (numbers
that lie outside the main group of numbers).

Tips on how to draw a stem and leaf plot

Once you have decided that a stem and leaf plot is the best way to show your data, draw it as follows:

• On the left hand side of the page, write down the thousands, hundreds or tens (all digits but the last one). These will be
your stems.
• Draw a line to the right of these stems.
• On the other side of the line, write down the ones (the last digit of a number). These will be your leaves.

For example, if the observed value is 25, then the stem is 2 and the leaf is the 5. If the observed value is 369, then the stem is 36
and the leaf is 9. Where observations are accurate to one or more decimal places, such as 23.7, the stem is 23 and the leaf is 7. If
the range of values is too great, the number 23.7 can be rounded up to 24 to limit the number of stems.

In stem and leaf plots, tally marks are not required because the actual data are used.
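
The stem/leaf split described above can be sketched in Python. This is only an illustration, not part of the original notes: the function name `split_stem_leaf` and its `decimals` parameter are hypothetical, and the sketch assumes non-negative values recorded to a known number of decimal places.

```python
def split_stem_leaf(value, decimals=0):
    """Return (stem, leaf) for a non-negative value.

    decimals=0 treats the ones digit as the leaf (25 -> stem 2, leaf 5);
    decimals=1 treats the tenths digit as the leaf (23.7 -> stem 23, leaf 7).
    """
    # Move the last recorded digit into the ones place; round() removes
    # any floating-point noise introduced by the multiplication.
    scaled = round(value * 10 ** decimals)
    # divmod by 10 peels off the final digit as the leaf.
    return divmod(scaled, 10)


print(split_stem_leaf(25))       # (2, 5)
print(split_stem_leaf(369))      # (36, 9)
print(split_stem_leaf(23.7, 1))  # (23, 7)
print(split_stem_leaf(6))        # (0, 6), i.e. 6 written as 06
```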

Example 1 - Making a stem and leaf plot

Each morning, a teacher quizzed his class with 20 geography questions. The class marked them together and everyone kept a
record of their personal scores. As the year passed, each student tried to improve his or her quiz marks. Every day, Elliot
recorded his quiz marks on a stem and leaf plot. This is what his marks looked like plotted out:

Table 1. Elliot's scores on the basic facts quiz last year

Stem Leaf

0 365

1 014356568979

2 0000

Analyse Elliot's stem and leaf plot. What is his most common score on the geography quizzes? What is his highest score? His
lowest score? Rotate the stem and leaf plot onto its side so that it looks like a bar graph. Are most of Elliot's scores in the
10s, 20s or under 10? It is difficult to know from the plot whether Elliot has improved or not because we do not know the
order of those scores.

Example 2 - Making a stem and leaf plot

A teacher asked 10 of her students how many books they had read in the last 12 months. Their answers were as follows:

12, 23, 19, 6, 10, 7, 15, 25, 21, 12

Prepare a stem and leaf plot for these data.

Tip: The number 6 can be written as 06, which means that it has a stem of 0 and a leaf of 6.

The stem and leaf plot should look like this:

Table 2. Books read in a year by 10 students

Stem Leaf

0 67

1 29052

2 351

In Table 2:

• stem 0 represents the class interval 0 to 9;
• stem 1 represents the class interval 10 to 19; and
• stem 2 represents the class interval 20 to 29.

Usually, a stem and leaf plot is ordered, which simply means that the leaves are arranged in ascending order from left to right.
Also, there is no need to separate the leaves (digits) with punctuation marks (commas or periods) since each leaf is always a
single digit. Using the data from Table 2, we made the ordered stem and leaf plot shown below:

Table 3. Books read in a year by 10 students

Stem Leaf

0 67

1 02259

2 135
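
The ordering step can also be done programmatically. The sketch below is a minimal illustration, assuming whole-number data with the ones digit as the leaf; the function name `stem_and_leaf` is hypothetical. It lists only stems that actually occur in the data.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of an ordered stem-and-leaf plot for whole numbers."""
    leaves = defaultdict(list)
    # Sorting the data first means each stem's leaves arrive in ascending order.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    return [f"{stem} " + "".join(str(leaf) for leaf in leaves[stem])
            for stem in sorted(leaves)]


books = [12, 23, 19, 6, 10, 7, 15, 25, 21, 12]
for row in stem_and_leaf(books):
    print(row)
# 0 67
# 1 02259
# 2 135
```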

Example 3 – Splitting stems using decimal values

The weights (to the nearest tenth of a kilogram) of 30 students were measured and recorded as follows:

59.2, 61.5, 62.3, 61.4, 60.9, 59.8, 60.5, 59.0, 61.1, 60.7, 61.6, 56.3, 61.9,
65.7, 60.4, 58.9, 59.0, 61.2, 62.1, 61.4, 58.4, 60.8, 60.2, 62.7, 60.0, 59.3,
61.9, 61.7, 58.4, 62.2

Prepare an ordered stem and leaf plot for the data.

Answer: In this case, the stems will be the whole number values and the leaves will be the decimal values. The data range
from 56.3 to 65.7, so the stems should start at 56 and finish at 65.

Table 8. Weights of 30 students

Stem Leaf

56 3

57

58 449

59 00238

60 0245789

61 124456799

62 1237

63

64

65 7
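
For data recorded to one decimal place, the same construction can be sketched in Python. This is an illustrative sketch only: the function name `stem_and_leaf_tenths` is hypothetical, and it assumes every value has exactly one decimal digit. Unlike a plot that lists only occupied stems, it keeps empty stems (57, 63, 64) so that gaps in the data stay visible.

```python
def stem_and_leaf_tenths(values):
    """Ordered plot for one-decimal data: whole part = stem, tenths = leaf."""
    # Work in integer tenths so that 58.4 becomes 584 with no rounding error.
    tenths = sorted(round(v * 10) for v in values)
    # Create every stem from the smallest to the largest, even if it stays empty.
    rows = {stem: "" for stem in range(tenths[0] // 10, tenths[-1] // 10 + 1)}
    for t in tenths:
        stem, leaf = divmod(t, 10)
        rows[stem] += str(leaf)
    return [f"{stem} {leaves}".rstrip() for stem, leaves in rows.items()]


weights = [59.2, 61.5, 62.3, 61.4, 60.9, 59.8, 60.5, 59.0, 61.1, 60.7,
           61.6, 56.3, 61.9, 65.7, 60.4, 58.9, 59.0, 61.2, 62.1, 61.4,
           58.4, 60.8, 60.2, 62.7, 60.0, 59.3, 61.9, 61.7, 58.4, 62.2]
for row in stem_and_leaf_tenths(weights):
    print(row)
```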

Outliers

An outlier is an extreme value of the data. It is an observation value that is significantly different from the rest of the data. There
may be more than one outlier in a set of data.

Sometimes, outliers are significant pieces of information and should not be ignored. Other times, they occur because of an error
or misinformation and should be ignored.

In the previous example, 56.3 and 65.7 could be considered outliers, since these two values are quite different from the other
values.

By ignoring these two outliers, the previous example's stem and leaf plot could be redrawn as below:

Table 9. Weights of 30 students except for outliers

Stem Leaf

58 449

59 00238

60 0245789

61 124456799

62 1237

What is a Back-to-Back Stem plot?

On a normal plot, the stem is on the left and all the leaves are on the right. There is a vertical line separating the two. On a back
to back plot, the stem remains the same. But to add another set of data points, we begin adding leaves to the LEFT side.

Just like on a typical plot, the smallest leaves are placed closest to the stem, and larger leaves are further away. The stem now
serves a double purpose. It anchors both sets of data points, keeping them separate but it still organizes both.

In a back to back stem and leaf plot, you can compare two sets of data, and still be able to find the statistical measurements of
each set. It also retains the same pros and cons of a normal plot. In the picture above, you can see that the stems work for each
side of the plot, yet the data is separate. We added the points for Team B, but we started building at the centre of the plot. It is
pointed out on the last line of Team B’s points that the smaller leaves are closest to the stem and the larger leaves are farther
away. In this picture you can see that Team B scored 92, 92, 92, 93, 96, and 96, respectively. Note the locations of the “2” leaf
and the “6” leaf.
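
A back-to-back plot can be sketched in Python as follows. This is a rough illustration rather than the plot from the picture: the function name `back_to_back` is hypothetical, Team B's scores are the ones quoted above, and Team A's scores are made up purely to fill the right-hand side.

```python
def back_to_back(left, right):
    """Return rows of a back-to-back stem-and-leaf plot for whole numbers."""
    everything = left + right
    stems = range(min(everything) // 10, max(everything) // 10 + 1)

    def leaves(data, stem):
        # Ascending leaves (ones digits) of the values that share this stem.
        return "".join(str(v % 10) for v in sorted(data) if v // 10 == stem)

    width = max(len(leaves(left, s)) for s in stems)
    rows = []
    for s in stems:
        # Reverse the left-hand leaves so the smallest sits next to the stem.
        left_part = leaves(left, s)[::-1].rjust(width)
        rows.append(f"{left_part} | {s} | {leaves(right, s)}".rstrip())
    return rows


team_b = [92, 92, 92, 93, 96, 96]   # scores quoted in the text
team_a = [88, 91, 94, 95, 97]       # hypothetical scores, for illustration only
for row in back_to_back(team_b, team_a):
    print(row)
#        | 8 | 8
# 663222 | 9 | 1457
```

Reading the left side outward from the stem gives 2, 2, 2, 3, 6, 6, which matches the rule that smaller leaves sit closest to the stem.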

Using stem and leaf plots as graphs

A stem and leaf plot is a simple kind of graph that is made out of the numbers themselves. It is a means of displaying the main features of a
distribution. If a stem and leaf plot is turned on its side, it will resemble a bar graph or histogram and provide similar visual information.

Example 6 – Using stem and leaf plots as graphs

The results of 41 students' math tests (with a best possible score of 70) are recorded below:

31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54, 57, 39, 52, 45, 35,
51, 63, 42

1. Is the variable discrete or continuous? Explain.


2. Prepare an ordered stem and leaf plot for the data and briefly describe what it shows.
3. Are there any outliers? If so, which scores?
4. Look at the stem and leaf plot from the side. Describe the distribution's main features such as:
a. number of peaks
b. symmetry
c. value at the centre of the distribution
Answers
1. A test score is a discrete variable. For example, it is not possible to have a test score of 35.74542341....
2. The lowest value is 4 and the highest is 67. Therefore, the stem and leaf plot that covers this range of values looks like this:

Table 10. Math scores of 41 students

Stem Leaf

0 4

1 89

2 346

3 1245579

4 012345589

5 00011234455677

6 02357

Note: The notation 2|4 represents stem 2 and leaf 4.

The stem and leaf plot reveals that more students scored in the interval between 50 and 59 than in any other interval. The large
number of students who obtained high results could mean that the test was too easy, that most students knew the material well, or a
combination of both.
3. The result of 4 could be an outlier, since there is a large gap between this and the next result, 18.
4. If the stem and leaf plot is turned on its side, it will look like the following:

The distribution has a single peak within the 50–59 interval.


Although there are only 41 observations, the distribution shows that most data are clustered at the right. The left tail extends farther from the
data centre than the right tail. Therefore, the distribution is skewed to the left or negatively skewed.

Since there are 41 observations, the distribution centre (the median value) will occur at the 21st observation. Counting 21 observations up
from the smallest, the centre is 48. (Note that the same value would have been obtained if 21 observations were counted down from the
highest observation.)
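
The position of the median can be checked with a few lines of Python. This is a small verification sketch using the 41 scores listed in the example; the variable names are arbitrary.

```python
import statistics

scores = [31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54,
          26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4,
          54, 57, 39, 52, 45, 35, 51, 63, 42]

ordered = sorted(scores)
# With an odd number of observations n, the median is observation (n + 1) / 2.
middle_position = (len(ordered) + 1) // 2
print(middle_position)               # 21
print(ordered[middle_position - 1])  # 48 (lists are indexed from 0)
print(statistics.median(scores))     # 48
```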

Advantages of stem-and-leaf plots:

1. The main advantage of a stem-and-leaf diagram is that the data are grouped and all the original data are shown, too.
2. Easy to construct
3. Shows range, minimum & maximum, gaps & clusters, and outliers easily

Disadvantages of stem-and-leaf plots:

1. Little flexibility in the choice of stem
2. Not visually appealing
3. Does not easily indicate measures of centrality for large data sets

BOX-AND-WHISKER PLOTS
Box plots provide a visual representation of a five-number summary of data, consisting of the median (the middle value of the
ordered data), the lower and upper quartiles (the values that mark off the lowest quarter and the highest quarter of the data,
respectively) and the smallest and largest values (the extremes). Box plots are particularly useful for comparing distributions
of the results from several experimental conditions. A box and whisker plot is a good way to summarize large amounts of
data. It is usually drawn alongside a number line, as shown:

Example
The oldest person in Mathsminster is 90. The youngest person is 15.
The median age of the residents is 44, the lower quartile is 25, and the upper quartile is 67.
Represent this information with a box-and-whisker plot.

Solution
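
A box-and-whisker plot for these five values can be roughed out as a line of text. The Python sketch below is illustrative only and is not the drawn solution: the function name `text_box_plot` and the symbols it uses (`|` for the extremes, `[` and `]` for the quartiles, `M` for the median) are assumptions. With the default width of 76, each character stands for one year of age between 15 and 90.

```python
def text_box_plot(minimum, q1, median, q3, maximum, width=76):
    """Draw a one-line box plot from a five-number summary."""
    def column(value):
        # Map a data value linearly onto a character column.
        return round((value - minimum) / (maximum - minimum) * (width - 1))

    line = ["-"] * width                    # whiskers span the full range
    for c in range(column(q1), column(q3) + 1):
        line[c] = "="                       # the box covers the middle half
    line[column(minimum)] = line[column(maximum)] = "|"
    line[column(q1)], line[column(q3)] = "[", "]"
    line[column(median)] = "M"
    return "".join(line)


plot = text_box_plot(15, 25, 44, 67, 90)
print(plot)
print("interquartile range:", 67 - 25)   # 42
```

The box runs from the lower quartile (25) to the upper quartile (67), the whiskers reach out to 15 and 90, and the median (44) sits left of the box's centre.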

ANALYZING DIAGRAMS

1. Features of distributions

When you assess the overall pattern of any distribution (which is the pattern formed by all values of a particular variable), look
for these features:

• number of peaks
• general shape (skewed or symmetric)
• centre
• spread

Number of peaks

Line graphs are useful because they readily reveal some characteristics of the data.

The first characteristic that can be readily seen from a line graph is the number of high points or peaks the distribution has.

While most distributions that occur in statistical data have only one main peak (unimodal), other distributions may have two
peaks (bimodal) or more than two peaks (multimodal).

Examples of unimodal, bimodal and multimodal line graphs are shown below:

General shape

The second main feature of a distribution is the extent to which it is symmetric.

A perfectly symmetric curve is one in which both sides of the distribution would exactly match the other if the figure were
folded over its central point. An example is shown below:

A symmetric, unimodal, bell-shaped distribution—a relatively common occurrence—is called a normal distribution. In a
normal distribution, mean, median and mode are identical in value.

If the distribution is lop-sided, it is said to be skewed.

A distribution is said to be skewed to the right, or positively skewed, when most of the data are concentrated on the left of the
distribution. Distributions with positive skews are more common than distributions with negative skews.

Income provides one example of a positively skewed distribution. Most people make under $40,000 a year, but some make
quite a bit more, with a smaller number making many millions of dollars a year. Therefore, the positive (right) tail on the line
graph for income extends out quite a long way, whereas the negative (left) tail stops at zero. The right tail clearly extends
farther from the distribution's centre than the left tail, as shown below:

A distribution is said to be skewed to the left, or negatively skewed, if most of the data are concentrated on the right of the
distribution. The left tail clearly extends farther from the distribution's centre than the right tail, as shown below:

Centre and spread

Locating the centre (median) of a distribution can be done by counting half the observations up from the smallest. Obviously,
this method is impracticable for very large sets of data. A stem and leaf plot makes this easy, however, because the data are
arranged in ascending order. The mean is another measure of central tendency.

The amount of distribution spread and any large deviations from the general pattern (outliers) can be quickly spotted on a graph.

2. Measures of Central Tendency from Raw Data

3. Measures of Dispersion from Raw Data

While measures of central tendency indicate what value of a variable is (in one sense or other) “average” or
“central” or “typical” in a set of data, measures of dispersion (or variability or spread) indicate (in one sense or
other) the extent to which the observed values are “spread out” around that centre — how “far apart” observed
values typically are from each other and therefore from some average value (in particular, the mean). Thus:
• if all cases have identical observed values (and thereby are also identical to [any] average value), dispersion is
zero;
• if most cases have observed values that are quite “close together” (and thereby are also quite “close” to the
average value), dispersion is low (but greater than zero); and
• if many cases have observed values that are quite “far away” from many others (or from the average value),
dispersion is high.
A measure of dispersion provides a summary statistic that indicates the magnitude of such dispersion and, like a
measure of central tendency, is a univariate statistic.

Because dispersion is concerned with how “close together” or “far apart” observed values are (i.e., with the
magnitude of the intervals between them), measures of dispersion are defined only for interval (or ratio)
variables or, in any case, variables we are willing to treat as interval.
There is one exception: a very crude measure of dispersion called the variation ratio, which is defined for ordinal
and even nominal variables.

There are two principal types of measures of dispersion: range measures and deviation measures. Range
measures are based on the distance between pairs of (relatively) “extreme” values observed in the data.
They are conceptually connected with the median as a measure of central tendency.

The (“total” or “simple”) range is the maximum (highest) value observed in the data [the value of the case at the
100th percentile] minus the minimum (lowest) value observed in the data [the value of the case at the 0th
percentile]. That is, it is the “distance” or “interval” between the values of the two most extreme cases,
e.g., the range of a set of test scores.
The problem with the [total] range as a measure of dispersion is that it depends on the values of just two cases,
which by definition have (possibly extraordinarily) atypical values.
In particular, the range makes no distinction between a polarized distribution in which almost all observed values
are close to either the minimum or maximum values and a distribution in which almost all observed values are
bunched together but there are a few extreme outliers.
Also the range is undefined for theoretical distributions that are “open-ended,” like the normal distribution (that
we will take up in the next topic) or the upper end of an income distribution type of curve (as in previous slides).

Therefore other variants of the range measure that do not reach entirely out to the extremes of the frequency
distribution are often used instead of the total range.
The interquartile range is the value of the case that stands at the 75th percentile of the distribution minus the
value of the case that stands at the 25th percentile.
The first quartile is the median observed value among all cases that lie below the overall median and the third
quartile is the median observed value among all cases that lie above the overall median.
In these terms, the interquartile range is third quartile minus the first quartile.
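
The two range measures can be computed directly from these definitions. The Python sketch below is a minimal illustration: the function name `range_and_iqr` is hypothetical, the twenty test scores are sample data chosen for illustration, and for an odd number of cases it assumes the middle case is left out of both halves. Software that interpolates percentiles may give slightly different quartiles.

```python
import statistics


def range_and_iqr(values):
    """Return (total range, interquartile range) using median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]          # cases below the overall median
    upper_half = ordered[(n + 1) // 2:]     # cases above the overall median
    q1 = statistics.median(lower_half)      # first quartile
    q3 = statistics.median(upper_half)      # third quartile
    return ordered[-1] - ordered[0], q3 - q1


scores = [50, 65, 70, 72, 72, 78, 80, 82, 84, 84,
          85, 86, 88, 88, 90, 94, 96, 98, 98, 99]
total_range, iqr = range_and_iqr(scores)
print(total_range)  # 49   (99 - 50)
print(iqr)          # 17.0 (92 - 75)
```

The single low score of 50 stretches the total range to 49, while the interquartile range of 17 describes the spread of the middle half only.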

Deviation measures are based on average deviations from some average value.
Since dispersion measures pertain to interval variables, we can calculate means, and deviation measures are
typically based on the mean deviation from the mean value.
Thus the (mean and) standard deviation measures are conceptually connected with the mean as a measure of
central tendency.
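
Both deviation measures can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `mean_deviation_and_sd` is hypothetical, the eight data values are made up, and the standard deviation uses the population form (dividing by n); dividing by n - 1 instead gives the sample form.

```python
import math


def mean_deviation_and_sd(values):
    """Return (mean absolute deviation, population standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    # Average distance from the mean, ignoring sign.
    mean_abs_dev = sum(abs(d) for d in deviations) / n
    # Square root of the average squared distance from the mean.
    sd = math.sqrt(sum(d * d for d in deviations) / n)
    return mean_abs_dev, sd


print(mean_deviation_and_sd([2, 4, 4, 4, 5, 5, 7, 9]))  # (1.5, 2.0)
print(mean_deviation_and_sd([6, 6, 6, 6]))              # (0.0, 0.0)
```

The second call shows the zero-dispersion case: when all values are identical, both measures are 0.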

PERCENTILES
Percentiles are like quartiles, except that percentiles divide the set of data into 100 equal parts while quartiles divide the set
of data into 4 equal parts. Percentiles measure position from the bottom.
Percentiles are most often used for determining the relative standing of an individual in a population or the rank position of
the individual. Some of the most popular uses for percentiles are connected with test scores and graduation standings.
Percentile ranks are an easy way to convey an individual's standing at graduation relative to other graduates.
About Percentile Ranks:
• percentile rank is a number between 0 and 100 indicating the percent of cases falling at or below that score.
• percentile ranks are usually written to the nearest whole percent: 74.5% ≈ 75% = 75th percentile
• scores are divided into 100 equally sized groups
• scores are arranged in rank order from lowest to highest
• there is no 0 percentile rank - the lowest score is at the first percentile
• there is no 100th percentile - the highest score is at the 99th percentile.
• you cannot perform the same mathematical operations on percentiles that you can on raw scores. You cannot, for
example, compute the mean of percentile scores, as the results may be misleading.

Definition 1: A percentile is a measure that tells us what percent of the total frequency scored at or below that
measure. A percentile rank is the percentage of scores that fall at or below a given score.
Formula:

To find the percentile rank of a score, x, out of a set of n scores, where x is included:

percentile rank = ((B + 0.5E) / n) × 100

Where B = number of scores below x
E = number of scores equal to x
n = number of scores
See this formula in more detail in the Examples section.

Example: If Jason graduated 25th out of a class of 150 students, then 125 students were ranked below Jason. Jason's
percentile rank would be:

((125 + 0.5(1)) / 150) × 100 ≈ 83.7, which rounds to 84.

Jason's standing in the class at the 84th percentile is as high as or higher than 84% of the graduates. Good job, Jason!
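
The percentile-rank calculation can be sketched in Python. The sketch assumes the common form (B + 0.5E) / n × 100, with B the number of scores below x and E the number equal to x, which reproduces the worked results in this section; the function names are hypothetical.

```python
def percentile_rank_from_counts(below, equal, n):
    """Percentile rank from B (scores below x), E (scores equal to x) and n."""
    return (below + 0.5 * equal) * 100 / n


def percentile_rank(scores, x):
    """Percentile rank of the score x within the list of scores."""
    below = sum(1 for s in scores if s < x)    # B
    equal = sum(1 for s in scores if s == x)   # E
    return percentile_rank_from_counts(below, equal, len(scores))


# Jason: 125 students ranked below him, 1 equal (himself), 150 in the class.
print(round(percentile_rank_from_counts(125, 1, 150)))  # 84

test_scores = [50, 65, 70, 72, 72, 78, 80, 82, 84, 84,
               85, 86, 88, 88, 90, 94, 96, 98, 98, 99]
print(percentile_rank(test_scores, 84))  # 45.0
print(percentile_rank(test_scores, 86))  # 57.5, reported as the 58th percentile
```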

Examples: Finding Percentiles

1. The math test scores were: 50, 65, 70, 72, 72, 78, 80, 82, 84, 84, 85, 86, 88, 88, 90, 94, 96, 98, 98, 99. Find the percentile rank
for a score of 84 on this test.

Be sure the scores are ordered from smallest to largest.


Locate the 84.

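Solution Using Formula: with B = 8 scores below 84, E = 2 scores equal to 84 and n = 20, the percentile rank is
((8 + 0.5(2)) / 20) × 100 = 45.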

Solution Using Visualization:

Since there are 2 values equal to 84, assign one to the group "above 84" and the other to the group "below 84".

50, 65, 70, 72, 72, 78, 80, 82, 84, | 84, 85, 86, 88, 88, 90, 94, 96, 98, 98, 99

The score of 84 is at the 45th percentile for this test.

2. The math test scores were: 50, 65, 70, 72, 72, 78, 80, 82, 84, 84, 85, 86, 88, 88, 90, 94, 96, 98, 98, 99. Find the percentile rank
for a score of 86 on this test.

Be sure the scores are ordered from smallest to largest.


Locate the 86.

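Solution Using Formula: with B = 11 scores below 86, E = 1 score equal to 86 and n = 20, the percentile rank is
((11 + 0.5(1)) / 20) × 100 = 57.5, which rounds to 58.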

Solution Using Visualization:

Since there is only one value equal to 86, it will be counted as "half" of a data value for the group "above 86" as well as the group
"below 86".

50, 65, 70, 72, 72, 78, 80, 82, 84, 84, 85, 8|6, 88, 88, 90, 94, 96, 98, 98, 99

The score of 86 is at the 58th percentile for this test.

3. Quartiles can be thought of as percentile measures. Remember that quartiles break the data set into 4 equal parts. If 100% is
broken into four equal parts, we have subdivisions at 25%, 50%, and 75% creating the:

First quartile (lower quartile) to be at the 25th percentile.
Median (or second quartile) to be at the 50th percentile.
Third quartile (upper quartile) to be at the 75th percentile.

For the table below, find the intervals in which the first, second and third quartiles lie.

Test Scores   Frequency   Cumulative Frequency
76-80         3           3
81-85         7           10
86-90         6           16
91-95         4           20

If there are a total of 20 scores, the first quartile will be located (25% · 20 = 5) five values up from the bottom. This puts
the first quartile in the interval 81-85.

In a similar fashion, the second quartile will be located (50% · 20 = 10) ten values up from the bottom in the interval 81-85.

The third quartile will be located (75% · 20 = 15) fifteen values up from the bottom in the interval 86-90.
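
The same lookup can be sketched in Python using cumulative frequencies. This is an illustrative sketch that follows the counting rule used above (25%, 50% and 75% of the total, counted up from the bottom); the function name `quartile_intervals` is hypothetical.

```python
from itertools import accumulate


def quartile_intervals(intervals, frequencies):
    """Return the interval containing each quartile of a grouped frequency table."""
    cumulative = list(accumulate(frequencies))   # running totals: 3, 10, 16, 20
    total = cumulative[-1]
    result = {}
    for name, fraction in (("Q1", 0.25), ("Q2", 0.50), ("Q3", 0.75)):
        position = fraction * total              # values up from the bottom
        # The first interval whose cumulative frequency reaches that position.
        index = next(i for i, c in enumerate(cumulative) if c >= position)
        result[name] = intervals[index]
    return result


table = quartile_intervals(["76-80", "81-85", "86-90", "91-95"], [3, 7, 6, 4])
print(table)  # {'Q1': '81-85', 'Q2': '81-85', 'Q3': '86-90'}
```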

Practice with
Percentiles and Quartiles

1. The Final Exam test scores were: 62, 66, 71, 75, 75, 78, 81, 83, 84, 85,
85, 87, 89, 89, 91, 92, 93, 94, 95, 99. Find the percentile rank for a score
of 85 on this test.
Choose:
25th percentile
50th percentile
75th percentile
85th percentile


2. The heights of students in inches in Block 3 math class are 55, 59, 59, 60,
61, 63, 64, 64, 65, 68, 68, 69, 72, 74. Find the percentile rank for a height
of 61 inches.
Choose:
28th percentile
29th percentile
30th percentile
32nd percentile


3. The following data are exam grades of 10 students in a math class.

Interval Frequency
69 - 76 1
77 - 84 4
85 - 92 4
93 - 100 1

a) Which interval contains the first (lower) quartile?

b) Which interval contains the third (upper) quartile?

c) If students who received at least an 85% on this exam received a "math star" pencil, what percent of the students
received a pencil?

4. The following data represent the heights (in inches) of 14 students in Mrs. Schultzkie's
math class: 65, 63, 68, 59, 74, 59, 68, 61, 64, 60, 69, 72, 55, 64.

Interval Frequency
55-58

59-62

63-66

67-70

71-74

a) Complete the table.

b) Which interval contains the median?

c) Which interval contains the third (upper) quartile?

d) What percent of the students are shorter than 5 feet 7 inches?

STANDARD DEVIATION

Standard deviation (often abbreviated as "Std Dev" or "SD") provides an indication of how far the individual
responses to a question vary or "deviate" from the mean. SD tells the researcher how spread out the responses are: are they
concentrated around the mean, or scattered far and wide? Did all of your respondents rate your product in the middle of your scale,
or did some love it and some hate it?

NOTE: FOR GROUPED DATA, USE 𝒙 AS THE MID-INTERVAL VALUE
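
The mid-interval approach for grouped data can be sketched in Python. This is a sketch under stated assumptions: it uses the population form of the standard deviation (dividing by the total frequency n), which may differ from the formula intended in these notes; the function name `grouped_sd` is hypothetical; and the intervals are taken from the exam-grade practice table purely as sample data.

```python
import math


def grouped_sd(intervals, frequencies):
    """Mean and population SD of grouped data, using mid-interval values as x."""
    midpoints = [(low + high) / 2 for low, high in intervals]
    n = sum(frequencies)
    # Each midpoint stands in for every value in its interval.
    mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    variance = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / n
    return mean, math.sqrt(variance)


grades = [(69, 76), (77, 84), (85, 92), (93, 100)]
counts = [1, 4, 4, 1]
mean, sd = grouped_sd(grades, counts)
print(mean)          # 84.5
print(round(sd, 2))  # 6.45
```

Because every value in an interval is replaced by the midpoint, the result is an estimate of the standard deviation of the underlying raw data, not an exact figure.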

How to Interpret Standard Deviation in a Statistical Data Set

Standard deviation can be difficult to interpret as a single number on its own. Basically, a small standard deviation means that
the values in a statistical data set are close to the mean of the data set, on average, and a large standard deviation means that the
values in the data set are farther away from the mean, on average.

The standard deviation measures how concentrated the data are around the mean; the more concentrated, the smaller
the standard deviation.

A small standard deviation can be a goal in certain situations where the results are restricted, for example, in product
manufacturing and quality control. A particular type of car part that has to be 2 centimetres in diameter to fit properly had better not
have a very big standard deviation during the manufacturing process. A big standard deviation in this case would mean that lots of parts
end up in the trash because they don’t fit right; either that or the cars will have problems down the road.

But in situations where you just observe and record data, a large standard deviation isn’t necessarily a bad thing; it just reflects a
large amount of variation in the group that is being studied. For example, if you look at salaries for everyone in a certain company,
including everyone from the student intern to the CEO, the standard deviation may be very large. On the other hand, if you narrow the
group down by looking only at the student interns, the standard deviation is smaller, because the individuals within this group have
salaries that are less variable. The second data set isn’t better, it’s just less variable.

Here are some properties that can help you when interpreting a standard deviation:

• The standard deviation can never be a negative number, due to the way it’s calculated and the fact that it measures a distance
(distances are never negative numbers).

• The smallest possible value for the standard deviation is 0, and that happens only in contrived situations where every single
number in the data set is exactly the same (no deviation).

• The standard deviation is affected by outliers (extremely low or extremely high numbers in the data set). That’s because the
standard deviation is based on the distance from the mean. And remember, the mean is also affected by outliers.

• The standard deviation has the same units as the original data.
