
Advanced Psychological Statistics Lecture IV-A

Dr. Sobia Aftab

GRAPHICAL PRESENTATION OF DATA

DISPLAYING UNIVARIATE CATEGORICAL DATA

• The first step in constructing a graphical display is often to summarize the data in a table
and then use information in the table to construct the display.

• For categorical data, this table is called a frequency distribution.

For Example:

The U.S. Department of Transportation established standards for motorcycle helmets. To
comply with these standards, helmets should reach the bottom of the motorcyclist’s ears.
The report “Motorcycle Helmet Use in 2005—Overall Results” (National Highway
Traffic Safety Administration, August 2005) summarized data collected by observing
1,700 motorcyclists nationwide at selected roadway locations. Each time a motorcyclist
passed by, the observer noted whether the rider was wearing no helmet, a noncompliant
helmet, or a compliant helmet.

Using the coding

NH = noncompliant helmet
CH = compliant helmet
N = no helmet

A few of the observations were:

CH N CH NH N CH CH CH N N

In total, there were 731 riders who wore no helmet, 153 who wore a noncompliant
helmet, and 816 who wore a compliant helmet.
The corresponding relative frequency distribution is given in Table 2.1:

• From the relative frequency distribution, you can see that a large number of the riders
(43%) were not wearing a helmet, but most of those who wore a helmet were wearing one
that met the Department of Transportation safety standard.
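
As a quick check of the percentages quoted above, the relative frequencies can be recomputed from the three counts. This is only a minimal sketch; the function and variable names are illustrative and are not part of the lecture.

```python
# Helmet-use counts reported in the example (1,700 riders in total).
counts = {"No helmet": 731, "Noncompliant helmet": 153, "Compliant helmet": 816}


def relative_frequencies(freqs):
    """Divide each frequency by the total number of observations."""
    total = sum(freqs.values())
    return {category: f / total for category, f in freqs.items()}


rel = relative_frequencies(counts)
for category, value in rel.items():
    print(f"{category:<20} {counts[category]:>5} {value:.2f}")
```

Rounded to two decimals, the three relative frequencies are 0.43, 0.09, and 0.48, and they add up to 1.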

• Bar charts and comparative bar charts can be used to summarize univariate
categorical data.

1. BAR CHART

• A bar chart is a graphical display of categorical data.

• The bar chart is used when the purpose of the display is to show the data distribution.

• Each category in the frequency distribution is represented by a bar or rectangle, and the
display is constructed so that the area of each bar is proportional to the corresponding
frequency or relative frequency.

When to Use

• Number of variables: 1
• Data type: categorical
• Purpose: displaying data distribution

How to Construct

1. Draw a horizontal axis, and write the category names or labels below the line at
regularly spaced intervals.

2. Draw a vertical axis, and label the scale using either frequency or relative frequency.
3. Place a rectangular bar above each category label. The height is determined by the
category’s frequency or relative frequency, and all bars should have the same width.
With the same width, both the height and the area of the bar are proportional to the
frequency or relative frequency of the corresponding category.
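
The three steps above can be sketched in a few lines using the helmet counts from the earlier example. This is a rough text rendering, not a plotting routine: the bars are drawn horizontally for convenience (the lecture draws them vertically), every bar is one text row thick, and the name `text_bar_chart` and the `width` parameter are assumptions made for the illustration.

```python
# Frequencies from the helmet example.
counts = {"No helmet": 731, "Noncompliant helmet": 153, "Compliant helmet": 816}


def text_bar_chart(freqs, width=40):
    """Return one line per category; bar length is proportional to frequency."""
    largest = max(freqs.values())
    lines = []
    for category, f in freqs.items():
        # All bars have the same thickness, so length alone carries the frequency.
        bar = "#" * round(width * f / largest)
        lines.append(f"{category:<20} {bar} {f}")
    return lines


chart = text_bar_chart(counts)
print("\n".join(chart))
```

Because every bar has the same thickness, comparing lengths is the same as comparing areas, which is the point made in step 3.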

What to look for

• Which categories occur frequently and which categories occur infrequently.

For Example:

• The example above used data on helmet use from a sample of 1,700 motorcyclists to
construct the frequency distribution in Table 2.1.

• A bar chart is an appropriate choice for displaying these data because:

─ There is one variable (helmet use).
─ The variable is categorical.
─ The purpose is to display the data distribution.

Following the three steps above, the bar chart is constructed as follows:


• The completed bar chart is shown in Figure:

• The bar chart provides a visual representation of the distribution of the 1,700 values that
make up the data set.

• From the bar chart, it is easy to see that the compliant helmet category occurred most
often in the data set.

• The bar for compliant helmets is about five times as tall (and therefore has five times the
area) as the bar for noncompliant helmets because approximately five times as many
motorcyclists wore compliant helmets as wore noncompliant helmets.

2. COMPARATIVE BAR CHART

• Bar charts can also be used to provide a visual comparison of two or more groups.

When to Use

• Number of variables: 1 variable with observations for two or more groups
• Data type: categorical
• Purpose: comparing two or more data distributions

How to Construct

• This is constructed by using the same horizontal and vertical axes for the bar charts of
two or more groups.

• When constructing a comparative bar graph, you use the relative frequency rather than
the frequency to construct the scale on the vertical axis, so that you can make meaningful
comparisons even if the sample sizes are not the same.

• The same set of steps that were used to construct a bar chart are used to construct a
comparative bar chart, but in a comparative bar chart each category will have a bar for
each group.

For Example:

Each year, The Princeton Review conducts surveys of high school students who are
applying to college and of parents of college applicants. The report “2009 College
Hopes & Worries Survey Findings” (www.princetonreview/college-hopes-
worries-2009) included a summary of how 12,715 high school students responded to
the question “Ideally how far from home would you like the college you attend to be?”
Students responded by choosing one of four possible distance categories. Also
included was a summary of how 3,007 parents of students applying to college
responded to the question “How far from home would you like the college your child
attends to be?” The accompanying relative frequency table summarizes the student and
parent responses.


• The completed comparative Bar chart is shown in figure:

• It is easy to see the differences between students and parents.

• A higher proportion of parents prefer a college close to home, and a higher proportion of
students believe that the ideal distance from home is more than 500 miles.

• To see why it is important to use relative frequencies rather than frequencies to compare
groups of different sizes, consider the incorrect bar chart constructed using the
frequencies rather than the relative frequencies (in the figure below).

• Because there were so many more students than parents who participated in the surveys
(12,715 students and only 3,007 parents), the incorrect bar chart conveys a very different
and misleading impression of the differences between students and parents.
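
The effect of unequal sample sizes can be sketched with a little arithmetic. The two sample sizes below come from the survey description; the shared proportion `share` is hypothetical and chosen only to show what happens when both groups answer in exactly the same way.

```python
n_students, n_parents = 12_715, 3_007  # sample sizes from the survey report


def expected_count(n, proportion):
    """Frequency a category would have if it held this share of n responses."""
    return n * proportion


share = 0.25  # hypothetical: the same share of responses in both groups
student_bar = expected_count(n_students, share)
parent_bar = expected_count(n_parents, share)

# On a frequency scale the student bar is far taller even though the
# two groups answered identically; on a relative frequency scale they match.
height_ratio = student_bar / parent_bar
print(f"Student bar is {height_ratio:.1f} times as tall as the parent bar")
```

Whatever value is chosen for `share`, the ratio of bar heights on a frequency scale equals the ratio of the sample sizes, a little over 4 to 1 here, which is why the frequency-based chart misleads.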

DISPLAYING UNIVARIATE NUMERICAL DATA

Three different types of graphical displays for univariate numerical data are:

1. Dotplots
2. Stem-and-leaf displays
3. Histograms.

1) DOTPLOTS

• A dotplot is a simple way to display numerical data when the data set is not too large.

• Each observation is represented by a dot above the location corresponding to its value
on a number line.

• When a value occurs more than once in a data set, there is a dot for each occurrence,
and these dots are stacked vertically in the plot.

When to Use

• Number of variables: 1
• Data Type: numerical
• Purpose: display data distribution

How to Construct

1. Draw a horizontal line and mark it with an appropriate measurement scale.

2. Locate each value in the data set along the measurement scale, and represent it by a
dot. If there are two or more observations with the same value, stack the dots
vertically.
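
The two construction steps can be sketched as a small text routine. It assumes whole-number data, and the `scores` list is made up purely for illustration; the function name `text_dotplot` is likewise an assumption, not something from the lecture.

```python
from collections import Counter


def text_dotplot(values, low, high):
    """Stack one dot per observation above its value on a horizontal scale."""
    counts = Counter(values)
    tallest = max(counts.values())
    rows = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        rows.append("".join(" o " if counts[v] >= level else "   "
                            for v in range(low, high + 1)))
    # The last row is the measurement scale itself.
    rows.append("".join(f"{v:^3}" for v in range(low, high + 1)))
    return rows


# Hypothetical whole-number scores, made up for illustration only.
scores = [3, 5, 5, 6, 6, 6, 7, 9]
print("\n".join(text_dotplot(scores, 2, 10)))
```

Repeated values (5 and 6 here) appear as vertical stacks, exactly as step 2 describes.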

What to look for

• Dotplots convey information about:

─ A representative or typical value in the data set.

─ The extent to which the data values spread out.

─ The nature of the distribution of values along the number line.

─ The presence of unusual values in the data set.

For Example:

The article “Keeping Score When It Counts: Graduation Rates and Academic
Progress Rates for 2009 NCAA Division I Basketball Tournament Teams” (The
Institute for Diversity and Ethics in Sport, University of Central Florida, March
2009) included data on graduation rates of basketball players for the universities and
colleges that sent teams to the 2009 Division I playoffs. The following graduation rates
are the percentages of basketball players starting college in 2002 who had graduated
by the end of 2008. (Note: Teams from 65 schools made it to the playoffs, but two of
them—Cornell and North Dakota State—did not report graduation rates.)

• A dotplot is an appropriate choice to summarize these data because the data set consists
of one variable (graduation rate), the variable is numerical, and the purpose is to display
the data distribution.

• The data set is not too large, with 63 observations. (If the data set had been much larger,
a histogram might have been a better choice.)


• The completed dotplot is shown in Figure

• The dotplot shows how the 63 graduation rates are distributed along the number line.

• It can be seen that basketball graduation rates vary a great deal from school to school,
ranging from a low of 8% to a high of 100%.

• You can also see that the graduation rates seem to cluster in several groups, denoted by
the colored ovals that have been added to the dotplot.

• There are several schools with graduation rates of 100% (excellent!) and another group
of 13 schools with graduation rates that are higher than most.

• The majority of schools are in the large cluster, with graduation rates from about 30% to
about 72%.

• And then there is that bottom group of four schools with embarrassingly low graduation
rates for basketball players.

COMPARATIVE DOTPLOTS

• Dotplots can also be used for comparative displays.

When to Use

• Number of variables: 1 variable with observations for two or more groups
• Data type: Numerical
• Purpose: Comparing two or more data distributions

How to Construct

• A comparative dotplot is constructed using the same numerical scale for two or more
dotplots.

• Be sure to include group labels for the dotplots in the display.

For Example:

The article referenced in the example above also gave graduation rates for all student
athletes at the 63 schools in the 2009 Division I basketball playoffs. The data are listed
below. Also listed are the differences between the graduation rate for all student athletes
and the graduation rate for basketball players.


• Figure 2.7 below shows a comparative dotplot:

• Notice that the comparative dotplot actually consists of two labeled dotplots that use the
same numerical scale.

• There are some striking differences that are easy to see when the data are displayed in
this way.

• The graduation rates for all student athletes tend to be higher and to vary less from
school to school than the graduation rates for just basketball players.

• The dotplots in Figure 2.7 are informative, but we can do even better.

• The data given here are paired data. Each basketball graduation rate can be paired with
the graduation rate for all student athletes from the same school.

• When data are paired, it is usually more informative to look at the differences.

• These differences (all − basketball) are also given in the data table above.

• Figure 2.8 gives a dotplot of the 63 differences.


• Notice that:

─ One difference is 0. This corresponds to a school where the basketball graduation
rate equals the graduation rate of all student athletes.

─ There are 11 schools for which the difference is negative. Negative differences
correspond to schools that have a higher graduation rate for basketball players than
for all student athletes.

─ The most interesting features of the difference dotplot are the very large number of
positive differences and the wide spread. Positive differences correspond to schools
that have a lower graduation rate for basketball players.

─ There is a lot of variability in the graduation rate difference from school to school,
and three schools have differences that are noticeably higher than the rest: 53%,
55%, and 69%.

2) STEM-AND-LEAF DISPLAYS

• A stem-and-leaf display is an effective way to summarize univariate numerical data
when the data set is not too large.

When to Use

• Number of variables: 1
• Data type: Numerical
• Purpose: Display data distribution

How to Construct

1. Each number in the data set is broken into two pieces, a stem and a leaf.

─ The stem is the first part of the number and consists of the beginning digit(s).

─ The leaf is the last part of the number and consists of the final digit(s).

For Example:

─ The number 213 might be split into a stem of 2 and a leaf of 13 or a stem of 21 and
a leaf of 3.

─ The resulting stems and leaves are then used to construct the display.

2. List possible stem values in a vertical column.
3. Record the leaf for every observation beside the corresponding stem value.
4. Indicate the units for stems and leaves someplace in the display.
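
The four steps above can be sketched as follows. The sketch assumes non-negative whole numbers, fixes the stem at the tens (and higher) digits and the leaf at the ones digit, and sorts the leaves within each row, which is a common convention rather than a requirement of the steps. The data values are made up for illustration only.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Split each whole number into a stem (tens and above) and a leaf (ones)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    lines = []
    # List every possible stem, including stems with no leaves, so gaps show.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem:>2} | {''.join(leaves.get(stem, []))}")
    # Step 4: state the units for stems and leaves.
    lines.append("Stem: tens   Leaf: ones")
    return lines


# Hypothetical values, made up for illustration only.
data = [12, 15, 21, 21, 27, 48, 50]
print("\n".join(stem_and_leaf(data)))
```

Keeping the empty stem 3 row in the output is deliberate: dropping it would hide the gap between the twenties and the forties.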

What to look for

• The display conveys information about:

─ A representative or typical value in the data set
─ The extent to which the data values spread out
─ The presence of any gaps and outliers (An outlier is an unusually small or large
data value).
─ The extent of symmetry in the data distribution
─ The number and location of peaks

For Example:

Many auto insurance companies give job-related discounts of 5 to 15%. The article
“Auto-Rate Discounts Seem to Defy Data” (San Luis Obispo Tribune, June 19,
2004) included the accompanying data on the number of automobile accidents per
year for every 1,000 people in 40 occupations.


• The figure below shows a stem-and-leaf display for the accident rate data.

• The numbers in the vertical column on the left of the display are the stems.

• Each number to the right of the vertical line is a leaf corresponding to one of the
observations in the data set.

• The legend

Stem: Tens
Leaf: Ones

tells you that the observation with a stem of 4 and a leaf of 3 corresponds to the
occupation with an accident rate of 43 per 1,000.

• Similarly, the observation with the stem of 10 and leaf of 2 corresponds to 102 accidents
per 1,000.


• The display in the figure above suggests that a typical or representative value is in the
stem 8 or 9 row, perhaps around 90.

• The observations are mostly concentrated in the 75 to 109 range, but there are a couple of
values that stand out on the low end (43 and 67) and one observation (152) that is far
removed from the rest of the data on the high end.

COMPARATIVE STEM-AND-LEAF DISPLAYS

• It is also common to see a stem-and-leaf display used to provide a visual comparison
of two groups.

• This display is also sometimes called a back-to-back stem-and-leaf display.

When to Use

• Number of variables: 1 variable with observations for two groups
• Data type: Numerical
• Purpose: Compare two data distributions

How to Construct

• A comparative stem-and-leaf display, in which the leaves for one group are listed to
the right of the stem values and the leaves for the second group are listed to the left,
can show how the two groups are similar and how they differ.

• Be sure to include group labels to identify which group is on the left and which is on
the right.

For Example:

The article “Going Wireless” (AARP Bulletin, June 2009) reported the estimated
percentage of households with only wireless phone service (no landline) for the 50 U.S.
states and the District of Columbia. Data for 19 eastern states and 13 western
states are given in the following table.

• A comparative stem-and-leaf display (using only the first digit of each leaf) is shown in
Figure below:

            Western States     Eastern States

                       989 | 0 | 559875
                   1670681 | 1 | 66301164001
                       512 | 2 | 00

Stem: tens
Leaves: ones

• From the comparative stem-and-leaf display, you can see that although there was
state-to-state variability in both the western and the eastern states, the data distributions
are quite similar.
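
A back-to-back display of this kind can be rebuilt with a short sketch. The values below are read off the display above (stem = tens, leaf = ones), so they are the truncated percentages shown in the figure rather than the article's exact data. Sorting the leaves outward from the stem is a convention assumed here; the original display lists them unsorted.

```python
# Values read off the display above; truncated to whole percentages.
western = [9, 8, 9, 11, 16, 17, 10, 16, 18, 11, 25, 21, 22]
eastern = [5, 5, 9, 8, 7, 5, 16, 16, 13, 10, 11, 11, 16, 14, 10, 10, 11, 20, 20]


def back_to_back(left, right):
    """Return lines with left-group leaves, the stem, then right-group leaves."""
    everything = left + right
    lines = []
    for stem in range(min(everything) // 10, max(everything) // 10 + 1):
        # Left leaves are reversed so they increase moving away from the stem.
        l = "".join(str(v % 10) for v in sorted(left, reverse=True) if v // 10 == stem)
        r = "".join(str(v % 10) for v in sorted(right) if v // 10 == stem)
        lines.append(f"{l:>12} | {stem} | {r}")
    return lines


display = back_to_back(western, eastern)
print(f"{'Western':>12}       Eastern")
print("\n".join(display))
```

In both groups the stem 1 row holds the most leaves, which is one way to see that the two distributions are similar.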

3) HISTOGRAMS

• Dotplots and stem-and-leaf displays are not always effective ways to summarize
numerical data.

• Both are awkward when the data set contains a large number of data values.

• Histograms are displays that don’t work well for small data sets but do work well for
larger numerical data sets.

• Histograms are constructed a bit differently, depending on whether the variable of interest
is discrete or continuous.

FREQUENCY DISTRIBUTION & HISTOGRAM FOR DISCRETE NUMERICAL DATA

• Discrete numerical data almost always result from counting.

• In such cases, each observation is a whole number.

• A frequency distribution for discrete numerical data lists each possible value (either
individually or grouped into intervals), the associated frequency, and sometimes the
corresponding relative frequency. The relative frequency is calculated by dividing
each frequency by the total number of observations in the data set.

• It is possible to create a more compact frequency distribution by grouping some of the
values into intervals.

For example, you might group the values 1, 2, and 3 to form an interval of 1–3, whose
frequency is the sum of the frequencies of those three values, and so on.

• A histogram for discrete numerical data can be constructed using:

─ Frequency Distribution

─ Relative Frequency

─ Grouped Frequency

When to Use

• Number of variables: 1

• Data Type: Discrete numerical

• Purpose: Display data distribution

How to Construct

1. Draw a horizontal scale, and mark the possible values of the variable.

2. Draw a vertical scale, and add either a frequency or relative frequency scale.

3. Above each possible value, draw a rectangle centered at that value (so that the
rectangle for 1 is centered at 1, the rectangle for 5 is centered at 5, and so on).

4. The height of each rectangle is determined by the corresponding frequency or relative
frequency. When possible values are consecutive whole numbers, the base width for
each rectangle is 1.

What to Look For

• Center or typical value

• Extent of spread or variability

• General shape

• Location and number of peaks

• Presence of gaps and outliers

FREQUENCY DISTRIBUTION & HISTOGRAMS FOR CONTINUOUS NUMERICAL DATA

• Before constructing histogram for continuous numerical data, the frequency distribution
for continuous numerical data has to be constructed.

Frequency Distribution for Continuous Numerical Data

• The first step in constructing a frequency distribution for continuous numerical data is to
decide what intervals will be used to group the data. These intervals are called class
intervals.

For Example:

States differ widely in the percentage of college students who attend college in their
home state. The percentages of freshmen who attend college in their home state for each
of the 50 states are shown in the table below.

96 73 60 73 79
86 93 58 81 75
81 76 89 73 59
84 86 86 72 59
77 78 80 56 43
90 76 66 55 50
73 88 70 75 64
53 86 90 77 80
90 87 89 82 82
96 64 82 83 75

The frequency distribution and relative frequencies for the data above are given in the following table:


S#   Class Interval (% of students)   Frequency   Relative Frequency

1    40 to < 50                            1            0.02
2    50 to < 60                            7            0.14
3    60 to < 70                            4            0.08
4    70 to < 80                           15            0.30
5    80 to < 90                           17            0.34
6    90 to < 100                           6            0.12


• The smallest observation is 46 (Massachusetts) and the largest is 96 (Alaska and
Wyoming).

• It is reasonable to start the first class interval at 40 and let each interval have a width of
10.

• This gives class intervals starting with 40 to < 50 and continuing up to 90 to < 100.

• There are no set rules for selecting either the number of class intervals or the length of the
intervals.

• Using a few relatively wide intervals will bunch the data, whereas using a great many
relatively narrow intervals may spread the data over too many intervals, so that no
interval contains more than a few observations.

• Either way, interesting features of the data set may be missed.

• In general, with a small amount of data, relatively few intervals, perhaps between 5 and
10, should be used.

• With a large amount of data, a distribution based on 15 to 20 (or even more) intervals is
often recommended.

• The quantity √(number of observations) is often used as an estimate of an appropriate
number of intervals: 5 intervals for 25 observations, 10 intervals when the number of
observations is 100, and so on.

Histogram for Continuous Numerical Data

• When the class intervals in a frequency distribution are all of equal width, you
construct a histogram in a way that is very similar to what is done for discrete data.

When to Use

• Number of variables: 1
• Data Type: Continuous numerical
• Purpose: Displaying data distribution

How to Construct

1. Mark the boundaries of the class intervals on a horizontal axis.
2. Use either frequency or relative frequency on the vertical axis.
3. Draw a rectangle for each class interval directly above that interval (so that the edges
are at the class interval boundaries). The height of each rectangle is the frequency or
relative frequency of the corresponding class interval.

What to Look For

• Center or typical value


• Extent of spread or variability
• General shape
• Location and number of peaks
• Presence of gaps and outliers

For Example

The article “Early Television Exposure and Subsequent Attention Problems in
Children” (Pediatrics, April 2004) investigated the television viewing habits of U.S.
children. The data summarized in the article were obtained as part of a large-scale
national survey. Table 2.5 gives approximate relative frequencies for the number of hours
spent watching TV per day for a sample of 1-year-old children.


• You can use the construction steps above to build a histogram for the data
summarized in Table 2.5.

• Figure 2.19 shows the completed relative frequency histogram. Notice that the
histogram has a single peak, with a majority of the children watching between 0 and 4
hours of TV per day.

DISPLAYING BIVARIATE NUMERICAL DATA
• A bivariate data set consists of measurements or observations on two variables, x and y.
For example: x might be the weight of a car and y the gasoline mileage rating of the car.

• When both x and y are numerical variables, each observation consists of a pair of
numbers, such as (14, 5.2) or (27.63, 18.9).

• The first number in a pair is the value of x, and the second number is the value of y.

• An unorganized list of bivariate data doesn’t tell you much about the distribution of the x
values or the distribution of the y values, and tells you even less about how the two
variables are related to one another.

• Just as graphical displays are used to summarize univariate data, they can also be used to
summarize bivariate data.

• Scatterplots and time series plots can be used to summarize bivariate numerical
data.

1. SCATTERPLOTS

• The most important graph of bivariate numerical data is a scatterplot.

• In a scatterplot, each observation (pair of numbers) is represented by a point on a
rectangular coordinate system. Figures (a) and (b) show the point representing the
observation (4.5, 15).


When to Use

• Number of variables: 2
• Data Type: Numerical
• Purpose: Investigate the relationship between variables

How to Construct

1. Draw horizontal and vertical axes. Label the horizontal axis and include an
appropriate scale for the x variable. Label the vertical axis and include an appropriate
scale for the y variable.

2. For each (x, y) pair in the data set, add a dot at the appropriate location in the display.

What to Look For

• Relationship between x and y

For Example

The table gives the cost and an overall quality rating for 10 different brands of men’s athletic
shoes (www.consumerreports.org).

Is there a relationship between x = cost and y = quality rating? A scatterplot can help answer
this question.

Cost   Rating
 65      71
 45      70
 45      62
 80      59
110      58
110      57
 30      56
 80      52
110      51
 70      51

The figure below shows the completed scatterplot. There is an interesting and unexpected pattern.
The larger costs tend to be paired with the lower quality ratings, suggesting that there is
actually a negative association between cost and quality!
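
The pattern can also be checked numerically from the ten (cost, rating) pairs in the table above. The sketch below sorts the brands by cost and compares the average rating of the five cheapest with that of the five most expensive. Splitting at the middle is an arbitrary choice made for this illustration, not a method from the lecture.

```python
# (cost, rating) pairs from the table above.
shoes = [(65, 71), (45, 70), (45, 62), (80, 59), (110, 58),
         (110, 57), (30, 56), (80, 52), (110, 51), (70, 51)]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Sort by cost and split into the five cheapest and five most expensive brands.
by_cost = sorted(shoes)
cheaper, pricier = by_cost[:5], by_cost[5:]

mean_rating_cheaper = mean([rating for _, rating in cheaper])
mean_rating_pricier = mean([rating for _, rating in pricier])

print(f"Mean rating, five cheapest brands:       {mean_rating_cheaper:.1f}")
print(f"Mean rating, five most expensive brands: {mean_rating_pricier:.1f}")
```

The cheaper half averages 62.0 and the more expensive half averages 55.4, which agrees with the downward pattern seen in the scatterplot.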

2. TIME SERIES PLOT

• Data sets often consist of measurements collected over time at regular intervals so that
you can learn about change over time.

For example: stock prices, sales figures, and other socio-economic indicators might
be recorded on a weekly or monthly basis.

• A time-series plot (sometimes also called a time plot) is a simple graph of data
collected over time that can help you see interesting trends or patterns.

• A time series plot can be constructed by thinking of the data set as a bivariate data set,
where y is the variable observed and x is the time at which the observation was made.

• These (x, y) pairs are plotted as in a scatterplot. Consecutive observations are then
connected by a line segment.
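
The construction just described can be sketched as follows: the observations are treated as (x, y) pairs in time order, and each consecutive pair of points defines one line segment. The monthly sales figures are hypothetical and made up only for illustration, and the function name is an assumption.

```python
def time_series_segments(times, values):
    """Pair each observation with the next one in time order."""
    points = sorted(zip(times, values))
    return list(zip(points[:-1], points[1:]))


# Hypothetical monthly sales figures, made up for illustration only.
months = [1, 2, 3, 4]
sales = [120, 135, 128, 150]

segments = time_series_segments(months, sales)
for start, end in segments:
    print(f"draw a line from {start} to {end}")
```

With n observations there are n − 1 segments, so four monthly values give three connecting lines.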

For Example

The Christmas Price Index is calculated each year by PNC Advisors. The year 2008 was the
most costly year since the index began in 1984, with the “cost of Christmas” at $21,080. A
plot of the Christmas Price Index over time appears on the PNC web site
(www.pncchristmaspriceindex.com), and the data given there were used to construct the time
series plot in the figure below.

• The plot shows an upward trend in the index from 1984 until 1993.

• There has also been a clear upward trend in the index since 1995.
