
Unit 2 Tutorials: Data Representation and Distributions

INSIDE UNIT 2

Graphical Representation

Bar Graphs and Pie Charts


Histograms
Line Charts and Time-Series Diagrams
Stem-and-Leaf Plots
Dot Plots
Frequency Tables
Cumulative Frequency
Stack Plots
Misleading Graphical Displays

Distributions and Measures of Central Tendency

Distributions
Data Analysis
Shapes of Distribution
Mean, Median, and Mode
Weighted Mean
Measures of Center
Measures of Variation
Range and Interquartile Range (IQR)
Standard Deviation
Five Number Summary and Boxplots
Outliers and Modified Boxplots
Percentiles

Normal Distribution and Central Limit Theorem

Normal Distribution
68-95-99.7 Rule
Standard Scores and Z-Scores
Standard Normal Distribution
Introduction to Sampling Distribution
Center and Variation of a Sampling Distribution
Shape of a Sampling Distribution

Bar Graphs and Pie Charts


by Sophia

 WHAT'S COVERED

This tutorial will cover how to display qualitative data utilizing bar graphs and pie charts. We will explore how to create and interpret bar graphs and pie charts,
discussing the following:

1. Bar Graphs
2. Bar Graphs: Relative Frequency
3. Multiple Bar Graphs

© 2023 SOPHIA Learning, LLC. SOPHIA is a registered trademark of SOPHIA Learning, LLC. Page 1
4. Pie Charts

1. Bar Graphs
Categorical data is qualitative data, and it can be displayed visually in a bar graph, which will compare the number of values in each category. A bar graph is not the only
way to display categorical data, but it is a common way.

Suppose there are 2070 students enrolled in the following college courses, which are taken by different majors. At this particular college, there are 331 economics majors,
435 biology majors, 124 chemistry majors, etc.

Course Frequency

Economics 331

Biology 435

Chemistry 124

Statistics 248

Psychology 311

Sociology 248

Spanish 207

History 166

You begin by drawing a horizontal axis and labeling the categories beneath it. You could instead place the categories along the vertical axis, listed from top to bottom, if you
prefer. Here, economics, biology, chemistry, and the other courses are written along the horizontal axis with a visible gap between them.

Then, you need to create a vertical axis with frequency on it. The highest number in the data set is 435, so a smart choice would be to have the vertical axis frequencies go
up to 450, or even 500. This way, all values will be represented.

Finally, you set up a bar that goes up to the number that corresponds to that category. Therefore, economics will have a bar that goes up to 331, biology will go all the way
up to almost 450, etc.
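
The construction steps above can be sketched in a few lines of Python. This is only an illustration, not part of the original tutorial: the helper names `axis_top` and `text_bar_graph` are made up for this sketch, and the bars are drawn sideways with `#` characters (the tutorial notes that the categories may also go on the vertical axis).

```python
import math

# Course frequencies from the table above.
frequencies = {
    "Economics": 331, "Biology": 435, "Chemistry": 124, "Statistics": 248,
    "Psychology": 311, "Sociology": 248, "Spanish": 207, "History": 166,
}

def axis_top(values, step=50):
    """Round the largest value up to the next multiple of `step` (hypothetical helper)."""
    return math.ceil(max(values) / step) * step

def text_bar_graph(freqs, per_symbol=10):
    """Return one line per category; each '#' stands for about `per_symbol` students."""
    width = max(len(name) for name in freqs)
    return [
        f"{name:<{width}} | {'#' * round(count / per_symbol)} {count}"
        for name, count in freqs.items()
    ]

# The largest frequency is 435, so the frequency axis should reach 450.
print("Frequency axis runs from 0 to", axis_top(frequencies.values()))
for line in text_bar_graph(frequencies):
    print(line)
```

Rounding the axis up to a multiple of 50 is just one reasonable choice; going up to 500 would work equally well.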

The full bar graph looks like this:

 TERM TO KNOW

Bar Graph
A distribution of qualitative data that displays bars that are proportional in length to the frequency or relative frequency of a particular data value

2. Bar Graphs: Relative Frequency


You can also use a bar graph to show relative frequency. Relative frequency shows how much of the whole each class represents. It is the percent of the values that are in
each category.

How do you calculate relative frequency? Take each number, and divide it by the total number.

Course Frequency Relative Freq

Economics 331 /2070 = 16%

Biology 435 /2070 = 21%

Chemistry 124 /2070 = 6%

Statistics 248 /2070 = 12%

Psychology 311 /2070 = 15%

Sociology 248 /2070 = 12%

Spanish 207 /2070 = 10%

History 166 /2070 = 8%

There are 435 out of 2070 students in Biology, which means that Biology students make up 435/2070 = 21% of the total students. So you can see that the biology bar has
just over 20% of the students.
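
As a quick check of the table, the relative frequencies can be computed with a short Python sketch. The variable names are chosen for this illustration only; the counts come from the table above.

```python
# Course frequencies from the table above.
frequencies = {
    "Economics": 331, "Biology": 435, "Chemistry": 124, "Statistics": 248,
    "Psychology": 311, "Sociology": 248, "Spanish": 207, "History": 166,
}

total = sum(frequencies.values())  # total number of students

# Relative frequency = category count divided by the total.
relative = {course: count / total for course, count in frequencies.items()}

# Express each share as a whole-number percent, as in the table.
percent = {course: round(share * 100) for course, share in relative.items()}

for course, pct in percent.items():
    print(f"{course:<10} {frequencies[course]:>3} / {total} = {pct}%")
```

Because each percent is rounded separately, the rounded values are not guaranteed to add up to exactly 100% for every data set.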

Notice that in the previous example with counts, and this example with relative frequency, the shape and size of the bars didn't change. The only thing that changed was
the vertical axis and what it was measuring.

3. Multiple Bar Graphs


Suppose that you wanted to know about the work habits of college students, so you sample 100 students. You record whether each student is male or female, and
whether they did not work at all, worked during the summer only, or had a job all year long.

The following table shows the data collected in this sample.

Male Female

No Job 25 28

Summer Only 17 10

Job All Year 11 9

You can create multiple bar graphs on the same set of axes and compare them by category. One way to display these items in a bar graph would be to break it up by male
and female. You choose green to be males and yellow to be females.

Next, you would break the horizontal axis into no job, summer only, and job all year.

Both bar graphs would be presented side-by-side within each category. There are 25 males who had no job and 28 females who had no job, etc.

This graph tells us that males in this sample are more likely than females to have a job all year or in summer only, and somewhat less likely than females to have no job.
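
Because the sample contains 53 males and 47 females, comparing raw counts and comparing percentages within each group are not quite the same thing. The sketch below, written for this illustration and not taken from the original tutorial, computes the within-group percentages from the table.

```python
# Counts from the table above, grouped by sex.
counts = {
    "Male": {"No Job": 25, "Summer Only": 17, "Job All Year": 11},
    "Female": {"No Job": 28, "Summer Only": 10, "Job All Year": 9},
}

# Number of students in each group.
totals = {group: sum(row.values()) for group, row in counts.items()}

# Percent of each group that falls in each job status.
percent_within = {
    group: {status: round(100 * n / totals[group]) for status, n in row.items()}
    for group, row in counts.items()
}

for group, row in percent_within.items():
    print(group, totals[group], row)
```

In this sample the percentages point the same way as the counts: a larger share of males than females worked in the summer or all year, and a smaller share had no job.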

The other way would be to flip-flop which category is represented by the colors and which category goes on the axis. Male and female could go on the axis, and the job
status could be the colors. In that case, it would look like the following graph:

This graph shows that, for both males and females, not having a job is the most common status, followed by having a summer job only, and then having a job all year.

4. Pie Charts
As mentioned, qualitative data can be displayed in a couple of different ways. As discussed above, one way is to display it in a bar graph. Another way to display it is with a
pie chart. A pie chart displays relative frequencies for each category, which considers how these categories relate to the whole.

Let's use the same set of information from the first example with the 2070 students enrolled in college courses.

To make a pie chart, the first thing to do is to calculate relative frequencies. Remember, relative frequency is the percent of the values that are in each category and can be
calculated by dividing each data value by the total number.

Course Frequency Relative Freq

Economics 331 /2070 = 16%

Biology 435 /2070 = 21%

Chemistry 124 /2070 = 6%

Statistics 248 /2070 = 12%

Psychology 311 /2070 = 15%

Sociology 248 /2070 = 12%

Spanish 207 /2070 = 10%

History 166 /2070 = 8%

Now it's time to create the pie chart by following the steps below:

Step 1: First, find the relative frequency.

Step 2: Next, calculate the central angle for each category using the relative frequency. You may recall there are 360 degrees in a circle. The central angle for economics
has to be 16 percent of the circle. So how do you set this up? You need 16% of 360 degrees. Multiply each percent by 360.

Course Frequency Relative Freq Angle

Economics 331 ÷ 2070 = 16% x 360 = 57.6°

Biology 435 ÷ 2070 = 21% x 360 = 75.6°

Chemistry 124 ÷ 2070 = 6% x 360 = 21.6°

Statistics 248 ÷ 2070 = 12% x 360 = 43.2°

Psychology 311 ÷ 2070 = 15% x 360 = 54°

Sociology 248 ÷ 2070 = 12% x 360 = 43.2°

Spanish 207 ÷ 2070 = 10% x 360 = 36°

History 166 ÷ 2070 = 8% x 360 = 28.8°

 HINT

Remember, when multiplying by a percent, you need to change it into a decimal. For example, you would enter 0.16 for 16%.
0.16 times 360 gives you 57.6, or about 58 degrees. Therefore, your central angle, representing economics, will be approximately 58 degrees. Do the same thing with the
remainder of the categories to obtain angle measurements for each of these central angles.
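
The central-angle calculation can be sketched in Python as follows. This is an illustration only; the percents are the rounded relative frequencies from the table, and the variable names are chosen for this sketch.

```python
# Rounded relative frequencies (in percent) from the table above.
percent = {
    "Economics": 16, "Biology": 21, "Chemistry": 6, "Statistics": 12,
    "Psychology": 15, "Sociology": 12, "Spanish": 10, "History": 8,
}

# A full circle has 360 degrees, so each sector gets percent/100 of 360.
angles = {course: pct * 360 / 100 for course, pct in percent.items()}

for course, angle in angles.items():
    print(f"{course:<10} {percent[course]:>2}% of 360 = {angle} degrees")

# The sectors should fill the whole circle.
print("Total:", sum(angles.values()))
```

Since these percents add up to 100, the angles add up to 360 degrees, apart from tiny floating-point error.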

Step 3: Once you have determined the central angle of each category, you can create the sectors of your pie chart, shown below:

But which sector corresponds to which category? You could write the words inside each sector, labeling each with the names of the majors. It's fairly clear that the biggest
slice is biology. But which ones are the rest? You need to create a key.

Step 4: Finally, add a key off to the side. Instead of writing the word "Economics" inside the blue sector, you can draw a blue square and write "Economics"
next to it. That shows that anything blue means economics. You can do the same for each of the sectors.

 STEP BY STEP

Step 1: Find relative frequency


Step 2: Calculate the central angle using the relative frequency.
Step 3: Create the sectors

Step 4: Add a key
 TERM TO KNOW

Pie Chart/Circle Graph


A distribution of qualitative data that shows the relative frequency of each category as a sector of a circle.

 SUMMARY

Bar graphs are a nice way to view categorical data or the counts in a category for a qualitative data set. We can use frequencies or relative frequencies, if there's
no overlap between the categories, to show how each category relates to the others. We can also put multiple bar graphs on the same set of axes. Pie charts are
another form of visual display used for qualitative data. They display the relative frequency or percent of each category by dividing a circle into sectors that relate
those relative sizes. The biggest advantage of a pie chart over a bar graph is that it is possible to see how each category relates to the whole. It's important to note
that sometimes, because of rounding, the relative frequencies don't add up to exactly 100%. They might add up to only 99% or to 101%. In our example, the rounded percentages happen to add up to exactly 100%.
However, as long as the relative sizes of each are in the right proportions, a small discrepancy is not a major cause for concern. You can also reduce it by keeping
more decimal places when you round.

Good luck!

 TERMS TO KNOW

Bar Graph
A distribution of qualitative data that displays bars that are proportional in length to the frequency or relative frequency of a particular data value

Pie Chart/Circle Graph


A distribution of qualitative data that shows the relative frequency of each category as a sector of a circle.

Histograms
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of histograms and binning. Our discussion breaks down as follows:

1. Histograms
2. Binning
3. Histograms and Bar Graphs

1. Histograms
Histograms are a type of distribution for quantitative data. When you have a quantitative data set, often the values are spread out over a large range of values.

Suppose there's an elementary school class in Muncie, Indiana that chooses to keep track of the high temperature on each of the 180 school days. In Indiana, the
temperature can get low in the winter, down to zero degrees Fahrenheit, and maybe near 90℉ at the beginning or the end of the school year.

To understand the overall trend of the data, you might not be interested in every single individual temperature. Instead, you might be more interested in how many days
were in the 20℉'s--that is, days that the temperature was in the range of 20℉ up to 29.999℉, or in the 30s℉, 40s℉, etc.

The idea that we can break those temperatures that occur over a wide range into more manageable intervals and categorize them that way is called binning. Binning
allows us to make a frequency table out of those categories.

Using a bin width of 10, the class recorded the temperature every day, then categorized each day by whether it was in the 0℉'s, 10℉'s, 20℉'s, 30℉'s,
40℉'s, 50℉'s, 60℉'s, 70℉'s, 80℉'s, or 90℉'s, and created the following frequency table.

Temperature Frequency

0's 10

10's 16

20's 25

30's 38

40's 31

50's 27

60's 13

70's 12

80's 7

90's 1

This means there was one day out of the year that it hit the 90℉'s, seven days were in the 80℉'s, etc. Now we can create a histogram.
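
The binning that produces a frequency table like this one can be sketched in Python. The tutorial does not list the 180 individual temperatures, so `sample_highs` below is a small hypothetical list, and `bin_label` is a helper name invented for this illustration.

```python
from collections import Counter

def bin_label(temp, width=10):
    """Return the lower edge of the bin that contains `temp`."""
    return int(temp // width) * width

# Hypothetical daily high temperatures in degrees Fahrenheit (not the class's data).
sample_highs = [3.5, 12, 18.2, 20, 29.999, 31, 35, 38.4, 47, 90]

# Count how many days land in each bin: 0 means the 0's, 10 means the 10's, etc.
frequency_table = Counter(bin_label(t) for t in sample_highs)

for lower_edge in sorted(frequency_table):
    print(f"{lower_edge}'s: {frequency_table[lower_edge]}")
```

Note that 20 and 29.999 both land in the 20's bin, which matches the "20℉ up to 29.999℉" description above.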

A histogram is somewhat similar to a bar graph in that, on the horizontal axis, you’re going to display the temperatures, which are our categories now. The only difference
is these categories are numbers. Our bins go from 0 to 10, 10 to 20, 20 to 30, 30 to 40, etc. so it makes sense that we would put 0 as being first, and ten as being second,
and 20 as being third. The frequencies, just like a bar graph, will go up the vertical axis.

As you can see from this histogram, the first bin goes from 0 degrees to 10 degrees, and ten days fall into that category. The second bin goes from 10 degrees to 20
degrees, and because there are 16 days there, that bar goes all the way up to 16. Every bar reflects the data from the table.

 TERMS TO KNOW

Histogram
A distribution of data that shows the frequency of different ranges of values. Each frequency is the height of a bar.

Binning
The method of deciding what widths of categories should be used on a histogram.

2. Binning
The way we choose to bin data can change the look, and at times, the shape of the histogram.

In the original histogram, data was classified by 10's. But what if you chose to classify it by 5-degree intervals instead? Instead of going 0℉ to 10℉, what if we split up the
values between 0℉ to 4℉ and 5℉ to 9℉, 10℉ to 14℉, and 15℉ to 19℉, etc.? In that case, the bins might look different.

Reviewing the frequency table, there are ten days in the range between 0℉ and 10℉. Suppose there were four days between 0℉ and 4℉, and six days were between 5℉
and 9℉. Therefore, we took one bin and split it into two bins. If you do that with every one of your bins, you end up with twice as many bins and twice as many bars on
your histogram.

In this new histogram, notice the frequencies. The bars are not as tall as they were before, but they still show the same overall shape. However, the data is now spread
over more bars, with a lot of data in one part of the graph and not much in the other parts. You'll note that in the 90 to 94 bin, there's no bar. The reason for this is that when
we broke up that bar, the one data value that was in the 90's was actually in the 95 to 99 range. When there's no data in a particular bin, there's not going to be any bar
that extends up from the x-axis.

 BIG IDEA

The binning process is important. Problems can arise if you make the bins too narrow or too wide. In the previous examples, there were bins of width 10 degrees and bins of width
5 degrees. You could have decided on bins with width 1 or 2 degrees, but perhaps you wouldn't have gotten the same overall shape of the distribution.
There are two main problems you may have with binning:

The pancake effect: Bins that are too narrow can create the pancake effect, displaying too many bins with almost nothing in them. You don't really get to see the
overall shape of the data. Suppose that our bins go from 0 degrees to 2 degrees, 2 degrees to 4 degrees, etc. If we continued to do this with all of our data, we would
end up with many more bins, but hardly any data in each bin.

The skyscraper effect: Bins that are too wide can create the skyscraper effect. Suppose that your bins go from 0 degrees to 50 degrees and then 50 degrees to 100
degrees. If you have too few bins and lots of data in them, you don't get an accurate sense of what the shape of the distribution looks like. You know that most of the
data is in one bin and not the other, but you don't know where in the first bin that data is. The classes and bins were too wide, so you don't get an overall
understanding of the distribution.
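
The skyscraper effect can be seen directly from the frequency table by merging the 10-degree bins into wider ones. The sketch below is an illustration only; `widen` is a hypothetical helper, and it assumes the new width is a multiple of the old one. Going the other way, to narrower bins, would require the individual temperatures, which the tutorial does not list.

```python
from collections import Counter

# Frequency table from the tutorial, keyed by the lower edge of each 10-degree bin.
tens = {0: 10, 10: 16, 20: 25, 30: 38, 40: 31, 50: 27, 60: 13, 70: 12, 80: 7, 90: 1}

def widen(freqs, new_width):
    """Merge bins (keyed by lower edge) into wider bins of `new_width` degrees."""
    merged = Counter()
    for lower_edge, count in freqs.items():
        merged[lower_edge // new_width * new_width] += count
    return dict(merged)

print(widen(tens, 50))  # only two very tall bars
print(widen(tens, 20))  # five bars, still shows a rough shape
```

With 50-degree bins, all that remains is that twice as many days fall below 50 degrees as above; the peak in the 30's is no longer visible.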

3. Histograms and Bar Graphs
You might confuse a bar graph with a histogram from time to time. However, there are two key differences between the two kinds of graphs.

Histograms vs. Bar Graphs

The bars touch in a histogram.
This makes sense because the bins run one into the other, as in the temperature example. The information goes right from the 0's into the 10's, so it makes sense to have the bars right next to each other. In a bar graph, bars don't have to do that.

The order of the bars matters in a histogram.
In a bar graph, typically, there's no reason to believe that one category has a higher value than the other. Suppose that you have a bar graph of the number of students enrolled in different college majors. There's no reason to put economics further to the right than chemistry because one is not numerically greater than the other. However, in a histogram, the values further to the right are, in fact, numerically greater than the values to the left. Because histograms deal with higher numbers and lower numbers, the order of the bars does matter.

 SUMMARY

Histograms are distributions for quantitative data-- specifically, they are typically used for more "spread out" data. This "spread out" data is binned and used to
create bars, utilizing the frequencies in those bins. It is important to appropriately bin the data so that you don't get the pancake effect or the opposite problem, the
skyscraper effect. Histograms can look like bar graphs but are different in that in histograms, the bars actually touch, and the actual order of the bars matters.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Binning
The method of deciding what widths of categories should be used on a histogram

Histogram
A distribution of data that shows the frequency of different ranges of values. Each frequency is the height of a bar.

Line Charts and Time-Series Diagrams
by Sophia

 WHAT'S COVERED

This tutorial will cover how to display quantitative data with line charts and time-series diagrams. Our discussion breaks down as follows:

1. Line Charts
2. Frequency Polygon
3. Multiple Line Charts
4. Time-Series Diagrams

1. Line Charts
Line charts are a kind of display that can be used to represent quantitative data, using construction that is very similar to a histogram (refer to the tutorial on how to
construct histograms for more about those graphs).

Suppose that there is an elementary school class in Muncie, Indiana, keeping track of the high temperature for each school day. In Indiana, it might get down to, for
example, 0℉ in the winter, or as high as 90℉ at the beginning of the school year in September or at the end of the school year in May.

The frequency table and histogram for this distribution could look something like this:

Temperature Frequency Histogram

0's 10

10's 16

20's 25

30's 38

40's 31

50's 27

60's 13

70's 12

80's 7

90's 1
There were ten days where the temperature fell between 0℉ and 10℉, 16 days that were between 10℉ and 20℉, etc.

So how do we transform the data in this histogram into a line chart?

 STEP BY STEP

Step 1: To create a line chart, take the heights of the bars, and instead of creating them as heights of bars, create dots instead.

Step 2: Next, get rid of the boxes and connect the dots instead. This is the line graph, and it's virtually the same visual display as shown in the histogram.
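
The two steps can be sketched in Python by turning each bar into a point and then pairing neighboring points into line segments. This is an illustration only, and it assumes each dot is placed above the midpoint of its bin; the variable names are chosen for this sketch.

```python
# Frequency table from the tutorial, keyed by the lower edge of each 10-degree bin.
tens = {0: 10, 10: 16, 20: 25, 30: 38, 40: 31, 50: 27, 60: 13, 70: 12, 80: 7, 90: 1}
bin_width = 10

# Step 1: replace each bar with a dot at (bin midpoint, frequency).
points = [(lower + bin_width / 2, count) for lower, count in sorted(tens.items())]

# Step 2: connect each dot to the next one.
segments = list(zip(points, points[1:]))

for start, end in segments:
    print(f"{start} -> {end}")
```

Ten bars give ten dots and nine connecting segments, and the highest dot sits over the 30's bin, just like the tallest bar.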

As with histograms, binning makes a difference. If you bin differently, such as by fives instead of by tens, you end up with a graph that looks like this:

 TERM TO KNOW

Line Chart
A distribution of quantitative data that shows the frequency of different intervals of data. The frequencies are indicated by heights of dots, which are connected to
each other.

2. Frequency Polygon
If you put the histogram and the line chart on the same set of axes, you can create something called a frequency polygon.

When you have the histogram and the line chart together, you connect the midpoints of the tops of the bars, which would look like this:

 TERM TO KNOW

Frequency Polygon
A distribution of data that shows both a histogram and its line chart on the same set of axes.

3. Multiple Line Charts


It's also possible to do multiple line charts on the same set of axes. This can be helpful because it is difficult to plot multiple sets of data with a histogram.

 EXAMPLE Suppose that a school in Tucson, Arizona did the same project as the kids in Muncie, Indiana, and then the two schools shared their information.
Temperature Indiana Arizona

0's 10 0

10's 16 0

20's 25 1

30's 38 3

40's 31 14

50's 27 31

60's 13 43

70's 12 20

80's 7 25

90's 1 22

100's 0 14

110's 0 7

The line charts can be compared, showing the similarities and differences between the two data sets. Days in the 60's, 70's, 80's, 90's, and even 100's are very
common in Arizona, whereas in Indiana, the most common days are days in the 30's, 40's, and 50's.
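
The comparison can also be checked numerically. The sketch below is an illustration written for this purpose: it uses the two columns of the table, and `peak_bin` and `days_at_or_above` are hypothetical helper names.

```python
# Lower edges of the bins and the two frequency columns from the table above.
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
indiana = [10, 16, 25, 38, 31, 27, 13, 12, 7, 1, 0, 0]
arizona = [0, 0, 1, 3, 14, 31, 43, 20, 25, 22, 14, 7]

def peak_bin(counts):
    """Lower edge of the bin with the highest frequency."""
    return bins[counts.index(max(counts))]

def days_at_or_above(counts, threshold):
    """Number of days in bins whose lower edge is at least `threshold`."""
    return sum(c for b, c in zip(bins, counts) if b >= threshold)

print("Indiana peak:", peak_bin(indiana), "Arizona peak:", peak_bin(arizona))
print("Days at 60 or above:", days_at_or_above(indiana, 60), days_at_or_above(arizona, 60))
```

Both schools recorded 180 days, so the counts can be compared directly without converting to relative frequencies.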
 TERM TO KNOW

Multiple Line Charts


A distribution that shows more than one data set's values in line charts. This is advantageous because it is clearer than trying to compare multiple histograms on
the same set of axes.

4. Time-Series Diagrams
Time-series diagrams are among the most common graphs that you will see in everyday life. Time-series graphs are particularly useful for showing how information
changes over time.

IN CONTEXT
You have probably seen time-series graphs almost every day in the stock market report in the business section of the newspaper. This is a graph of the price of a
stock over several months--October, November, December, January.

This graph shows that the stock price has been going up and down, but then during the first couple of months of 2012, it started getting high again. The graph
demonstrates the different values that a variable--in this case, stock price--takes over time. The time-series graph allows you to see certain values that you might
not be able to see in some other kind of graphical display.

 EXAMPLE Take a look at this histogram of high temperatures for the year. There are 365 different values from the National Weather Service for Chanhassen,
Minnesota, binned by 10's.

Now compare the histogram with a time-series diagram of the same data, below. Both graphs contain the same information, but in the time-series graph, you can see
this data over time.

The histogram shows you that the most common temperature throughout the year in Chanhassen, Minnesota was between 70 and 80 degrees, and it occurred about
80 times. You can also see that on the time-series diagram: look at the horizontal band between 70 and 80 on the vertical axis and count how many points fall
within it. Doing so would show that a great many points lie within that band.
Additionally, this shows how the temperature tends to change over the course of the year. It gets high in the middle of the year, which is unsurprising because it's
summer. Conversely, it gets low at both the beginning and the end of the year, bottoming out in early January. This additional trend is something that you don't see on
the histogram.
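
One way to see what a histogram leaves out is to reorder the data: the histogram stays the same, but the time series changes. The sketch below is an illustration only. The monthly values are hypothetical, not the National Weather Service data, and `histogram` is a helper written for this sketch.

```python
from collections import Counter

# Hypothetical average monthly highs in degrees Fahrenheit, January through December.
monthly_highs = [22, 28, 41, 57, 69, 79, 83, 80, 72, 58, 41, 27]

def histogram(values, width=10):
    """Count values per bin, keyed by the lower edge of each bin."""
    return Counter(v // width * width for v in values)

# A time series keeps each value paired with when it occurred.
time_series = list(enumerate(monthly_highs, start=1))  # (month, high) pairs

# Reverse the order of the months: same values, different timing.
reordered = list(reversed(monthly_highs))

same_histogram = histogram(monthly_highs) == histogram(reordered)
same_series = monthly_highs == reordered
print("Same histogram:", same_histogram)
print("Same time series:", same_series)
```

The histogram only records how often each range occurs, while the time series also records the order, which is what reveals the seasonal rise and fall.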
 TERM TO KNOW

Time-Series Diagram
A graphical display that shows the values a variable takes over time.

 SUMMARY

Line charts are a nice way to visualize quantitative data. They use much the same construction as a histogram, but the heights are marked by connected dots instead of a set of
boxes. When both a line chart and a histogram are shown on the same set of axes, you can create a frequency polygon. The best use of a line chart is when you
can compare a couple of different line charts on the same set of axes, creating multiple line charts. A time-series diagram illustrates the change of a value over
time. Time-series diagrams are not only useful in showing you what the data values are, but also when they occurred.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Frequency Polygon
A distribution of data that shows both a histogram and its line chart on the same set of axes.

Line Chart

A distribution of quantitative data that shows the frequency of different intervals of data. The frequencies are indicated by heights of dots, which are connected to each
other.

Multiple Line Charts


A distribution that shows more than one data set's values in line charts. This is advantageous because it is clearer than trying to compare multiple histograms on the
same set of axes.

Time-Series Diagram
A graphical display that shows the values a variable takes over time.

Stem-and-Leaf Plots
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of stem-and-leaf plots. Our discussion breaks down as follows:

1. Stem-and-Leaf Plots
2. Variations
a. Splitting Stems
b. Rounding
c. Two-Digit Leaves
d. Back-to-Back
3. Advantages of Stem-and-Leaf Plots

1. Stem-and-Leaf Plots
While these graphs, known as stem-and-leaf plots, have a funny name, they actually serve a useful purpose and are very versatile.

Many quantitative data sets can be displayed in stem-and-leaf plots, such as the one below:

Percent of College Students Enrolled in Public Colleges by State


95 95 87 76 70 81 43 90 84 55

84 81 84 83 75 76 85 89 84 81

56 80 52 80 63 89 73 56 62 72

91 96 82 80 73 86 81 87 55 82

74 79 89 92 87 85 77 75 88 82

This data set represents the 50 states in the United States, and these numbers are the percent of college students in each state that are enrolled in public colleges. For
example, in one state, 95% of its college students are in public schools, whereas in another state, only 52% are enrolled in public colleges.

To create a stem-and-leaf plot, follow these steps:

 STEP BY STEP

Step 1: Decide on a natural classification. Here, 10 seems like an obvious choice. Those are going to be our bins. You should also choose your bins based on some digit.
In this case, go by the tens digit. Note, if these numbers were in the hundreds, you might decide to go by the hundreds digit, though you could still go by the tens if desired.

Step 2: Next, create “stems.” These are going to be the stems based on the bins you selected.

The 9 means that this is going to be a state with 90 percent, or a percent in the 90's, of their students at public school; the 8 represents states where the percent is in the
80's, and so forth. Write them in order, least to greatest or greatest to least (the direction doesn’t really matter), to the left of a vertical line, as shown above.

Step 3: List the values by their ones digit ascending away from those stems. Those are considered the leaves. When it is completed, it will look like the one shown below.
Notice, for example, that 43% of the students in one state were at a public school.

9 | 0 1 2 5 5 6
8 | 0 0 0 1 1 1 1 2 2 2 3 4 4 4 4 5 5 6 7 7 7 8 9 9 9
7 | 0 2 3 3 4 5 5 6 6 7 9
6 | 2 3
5 | 2 5 5 6 6
4 | 3

 HINT

It is important to note that if a value appears more than once, you will list it more than once. For example, 80 appears three times. Also, note that those numbers are
ascending away from the stem.
Step 4: Create a key. There's one more important feature of a stem-and-leaf plot. You need to be able to convey to someone who's looking at it, exactly what they're
looking at. For example, in our graph above, the “6 bar 2” means that there's a state that has 62% of its students going to public colleges. Therefore, make this clear to the
reader by saying, in a key, “4 bar 3 means 43%.” This tells the reader how these numbers should be interpreted.

9 | 0 1 2 5 5 6
8 | 0 0 0 1 1 1 1 2 2 2 3 4 4 4 4 5 5 6 7 7 7 8 9 9 9
7 | 0 2 3 3 4 5 5 6 6 7 9
6 | 2 3
5 | 2 5 5 6 6
4 | 3

Key: 4 | 3 means 43%
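
The four steps can be sketched in Python as follows. This is an illustration only; `stem_and_leaf` is a helper written for this sketch, and it assumes two-digit whole-number data so that the tens digit is the stem and the ones digit is the leaf.

```python
from collections import defaultdict

# Percent of college students enrolled in public colleges, by state (from the table above).
percents = [
    95, 95, 87, 76, 70, 81, 43, 90, 84, 55,
    84, 81, 84, 83, 75, 76, 85, 89, 84, 81,
    56, 80, 52, 80, 63, 89, 73, 56, 62, 72,
    91, 96, 82, 80, 73, 86, 81, 87, 55, 82,
    74, 79, 89, 92, 87, 85, 77, 75, 88, 82,
]

def stem_and_leaf(values):
    """Group values by tens digit (stem) and list ones digits (leaves) in ascending order."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return dict(stems)

plot = stem_and_leaf(percents)
for stem in sorted(plot, reverse=True):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
print("Key: 4 | 3 means 43%")
```

A stem with no data would simply be missing from this output; for a data set with gaps, you would want to print the empty stems as well so the shape is not distorted.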

 TERM TO KNOW

Stem-and-Leaf Plot/Stemplot
A distribution of quantitative data that shows natural numerical breaks in the data as categories called "stems" and individual values as "leaves."

2. Variations
There is more than one way to display data on a stem-and-leaf plot. Other variations include:

Splitting the stems


Rounding the data
Using two-digit leaves
Showing two groups with a back-to-back plot

2a. Splitting Stems


In our previous graph, the 80's had more values than any other grouping. In fact, they had more than twice as many as any other single grouping, which looked a little strange. Is
there anything we can do about that?

Suppose that you decided that tens were too wide of a bin. Instead, you could break it down by fives, and then write two 8's--a low 8 and a high 8, or 85 to 89 for the high,
and 80 to 84 for the low. Note, though, that if you’re going to split one bucket, you need to split them all.

9 | 5 5 6
9 | 0 1 2
8 | 5 5 6 7 7 7 8 9 9 9
8 | 0 0 0 1 1 1 1 2 2 2 3 4 4 4 4
7 | 5 5 6 6 7 9
7 | 0 2 3 3 4
6 |
6 | 2 3
5 | 5 5 6 6
5 | 2
4 |
4 | 3

Key: 4 | 3 means 43%

If you split the stems into lows and highs, the graph will look like the one above. Because splitting keeps any one stem from holding so much more data than the
others, this is a more appropriate visual than the first one.
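
Splitting every stem into a low half (leaves 0 to 4) and a high half (leaves 5 to 9) can be sketched in Python. This is an illustration only; `split_stem_and_leaf` is a helper written for this sketch, and it creates both halves for every stem, including the empty ones.

```python
# Percent of college students enrolled in public colleges, by state (same data as before).
percents = [
    95, 95, 87, 76, 70, 81, 43, 90, 84, 55,
    84, 81, 84, 83, 75, 76, 85, 89, 84, 81,
    56, 80, 52, 80, 63, 89, 73, 56, 62, 72,
    91, 96, 82, 80, 73, 86, 81, 87, 55, 82,
    74, 79, 89, 92, 87, 85, 77, 75, 88, 82,
]

def split_stem_and_leaf(values):
    """Map (stem, half) to ascending leaves; half is 'low' for 0-4 and 'high' for 5-9."""
    lowest, highest = min(values) // 10, max(values) // 10
    rows = {(s, half): [] for s in range(lowest, highest + 1) for half in ("low", "high")}
    for value in sorted(values):
        half = "low" if value % 10 < 5 else "high"
        rows[(value // 10, half)].append(value % 10)
    return rows

rows = split_stem_and_leaf(percents)
for stem in range(9, 3, -1):
    for half in ("high", "low"):
        print(f"{stem} | {' '.join(str(leaf) for leaf in rows[(stem, half)])}")
print("Key: 4 | 3 means 43%")
```

The 80's now appear as two rows of 10 and 15 values instead of one row of 25.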

2b. Rounding
Take a look at this set of high school GPAs for this group of students. Make a stem-and-leaf plot of these GPAs.

Student GPA

Amy 2.95

Blake 3.55

Holly 3.75

Isaiah 1.94

Jenny 2.23

Jesse 3.41

Jim 1.96

Johnathan 2.25

Katherine 2.56

Kelly 2.89

Ryan 3.24

Sherry 3.61

Teri 4.00

Todd 2.78

Tyler 3.12

In this option, we can round the GPAs, which is a legitimate thing to do. Take Jim, for example. His GPA is 1.96, which would round to 2.0. Isaiah's GPA of 1.94 would
round down to 1.9, Amy's GPA of 2.95 would round to 3.0.

4 | 0
3 | 0 1 2 4 6 6 8
2 | 0 2 3 6 8 9
1 | 9

Key: 2 | 0 means GPA rounds to 2.0

The graph's key says “2 bar 0 means the GPA rounds to 2.0.” That refers to Jim, whose GPA of 1.96 rounds to 2.0.
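
The rounded plot can be reproduced with the Python sketch below, which is an illustration rather than part of the original tutorial. It assumes the usual classroom convention of rounding halves up, which is what the plot reflects (2.25 appears as 2.3 and 3.55 as 3.6). Python's built-in `round` on floating-point numbers may not follow that convention for values ending in 5, so the sketch uses the standard `decimal` module instead; `round_half_up` is a hypothetical helper name.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# GPAs from the table above, kept as strings so the decimals stay exact.
gpas = {
    "Amy": "2.95", "Blake": "3.55", "Holly": "3.75", "Isaiah": "1.94",
    "Jenny": "2.23", "Jesse": "3.41", "Jim": "1.96", "Johnathan": "2.25",
    "Katherine": "2.56", "Kelly": "2.89", "Ryan": "3.24", "Sherry": "3.61",
    "Teri": "4.00", "Todd": "2.78", "Tyler": "3.12",
}

def round_half_up(text, places="0.1"):
    """Round a decimal string to one decimal place, with halves rounded up."""
    return Decimal(text).quantize(Decimal(places), rounding=ROUND_HALF_UP)

rounded = {name: round_half_up(gpa) for name, gpa in gpas.items()}

# Stem = whole-number part, leaf = tenths digit.
plot = defaultdict(list)
for value in sorted(rounded.values()):
    stem, leaf = divmod(int(value * 10), 10)
    plot[stem].append(leaf)

for stem in sorted(plot, reverse=True):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
print("Key: 2 | 0 means GPA rounds to 2.0")
```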

2c. Two-Digit Leaves


In the same GPA example as above, we can leave the numbers as is and not round. This is also a completely legitimate way to represent this data as long as you visually
separate these numbers.

4 | 00
3 | 12 24 41 55 61 75
2 | 23 25 56 78 89 95
1 | 94 96

Key: 2 | 23 means GPA is 2.23

The graph’s key says, “2 bar 23 means a GPA of 2.23.” For example, Tyler's GPA of 3.12 would be represented by the stem 3 and leaf 12.

2d. Back-to-Back
Again, using the same GPA data from above, suppose you are interested in the differences between girls' GPAs, like Amy, Holly, Jenny, Katherine, etc., and the boys'
GPAs.

You could compare those by putting one group of leaves to the right of the stem and another group of leaves to the left of the stem. This is known as a back-to-back stem-
and-leaf plot, and it would look like this:

Girls |   | Boys
    0 | 4 |
8 6 0 | 3 | 1 2 4 6
9 6 2 | 2 | 0 3 8
      | 1 | 9

Key: 3 | 1 means GPA rounds to 3.1

This graph rounds the numbers again, saying that “3 bar 1 means the GPA rounds to 3.1.” Here, the girls' GPAs are on the left. The boys' GPAs are on the right. This
allows you to compare the distributions of boys' GPAs to girls' GPAs, illustrating that the girls' GPAs are typically a little bit higher.
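
A back-to-back plot can be sketched in Python as shown below. This is an illustration only. The rounded GPAs are stored in tenths (30 means 3.0) to keep the arithmetic exact, the assignment of students to the two groups is inferred from the plot above, and `leaves_by_stem` and `back_to_back` are helper names invented for this sketch.

```python
# Rounded GPAs in tenths (30 means 3.0); group membership is inferred from the plot.
girls = [30, 38, 22, 26, 29, 36, 40]
boys = [36, 19, 34, 20, 23, 32, 28, 31]

def leaves_by_stem(tenths):
    """Map each stem (whole-number part) to its leaves (tenths digit) in ascending order."""
    rows = {}
    for value in sorted(tenths):
        rows.setdefault(value // 10, []).append(value % 10)
    return rows

def back_to_back(left, right):
    """Build plot lines with left leaves ascending away from the shared stem."""
    left_rows, right_rows = leaves_by_stem(left), leaves_by_stem(right)
    lines = []
    for stem in sorted(set(left_rows) | set(right_rows), reverse=True):
        left_leaves = " ".join(str(x) for x in reversed(left_rows.get(stem, [])))
        right_leaves = " ".join(str(x) for x in right_rows.get(stem, []))
        lines.append(f"{left_leaves:>7} | {stem} | {right_leaves}".rstrip())
    return lines

lines = back_to_back(girls, boys)
print("  Girls |   | Boys")
for line in lines:
    print(line)
print("Key: 3 | 1 means GPA rounds to 3.1")
```

The left-hand leaves are reversed so that, on both sides, the digits grow as you move away from the stem.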

 TERM TO KNOW

Back-to-Back Stem-and-Leaf Plot


Two stem-and-leaf plots on the same set of stems. This allows us to compare the distributions of two different categories.

3. Advantages of Stem-and-Leaf Plots


Why use a stem-and-leaf plot instead of other graphical displays, like histograms or dot plots? Stem-and-leaf plots have a couple of advantages:

They are like a dot plot in that all the data points can be seen.
They work over a larger range and all the data points can be seen, making them better than a dot plot in some circumstances.

The drawback to stem-and-leaf plots is that they are difficult to create if the data set is too big. For example, the graph at the beginning had 50 data values, and it might be
challenging to see all the data values at once. Therefore, a stem-and-leaf plot might not be the most useful display for that data set.

 SUMMARY

Stem-and-leaf plots are very useful displays of quantitative data. They are very versatile, and there are many ways you can use them. To make these plots, you
start by creating bins from natural numerical breaks so that the reader can identify the numbers, followed by making a key to convey how the numbers should be
interpreted. To make the plot clearer, you can split stems, round the values, create leaves with double digits, or compare across categories using a back-to-back
plot.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Back-to-Back Stem-and-Leaf Plot


Two stem-and-leaf plots on the same set of stems. This allows us to compare the distributions of two different categories.

Stem-and-Leaf Plot/Stemplot
A distribution of quantitative data that shows natural numerical breaks in the data as categories called "stems" and individual values as "leaves."

Dot Plots
by Sophia

 WHAT'S COVERED

Constructing and interpreting a dot plot is important when visually organizing data from a table or chart. This tutorial will examine the following:
1. Dotplots
2. Ideal Settings for Dotplots

1. Dotplots
For quantitative data, dot plots work best when the values are discrete and close together. They can also be used for qualitative data.

IN CONTEXT
Suppose you gathered information on students and the number of pets they have, and you want to create a dot plot to visually organize the information.

Student Pets

Amy 1

Blake 3

Holly 2

Isaiah 1

Jenny 0

Jesse 1

Jim 0

Jonathan 2

Katherine 4

Kelly 6

Ryan 1

Sherry 2

Teri 1

Todd 0

Tyler 2

You can create a dot plot by first drawing an axis. It could be vertical or horizontal; the one below is horizontal. Then, you scale your axis from the
smallest number, which is 0, up to the highest number, which is 6. Include even the numbers that don't appear in the list, like 5. Label the axis as "pets".

Begin with the first number, which is for Amy. You can plot the number of pets she has by placing a dot above the 1. Next is Blake, who has three pets.

Continue through the table, noting that Holly has two pets and Isaiah has one, until you have placed a dot for every student, ending with Tyler's two pets.

Notice that you stack the dots when you get more than one value at a particular number. You can see that there is a gap from 4 to 6--no one has five pets. Most of
the people have either zero, one, or two pets, but you need to keep the 5 in there to visually see the gap.
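The construction steps above can be sketched in Python as a text-only dot plot. The use of `*` for a dot and the variable names are illustrative assumptions; the pet counts come from the table.

```python
from collections import Counter

# Number of pets per student, in table order (Amy through Tyler).
pets = [1, 3, 2, 1, 0, 1, 0, 2, 4, 6, 1, 2, 1, 0, 2]

counts = Counter(pets)
axis = range(min(pets), max(pets) + 1)   # 0 through 6, including the empty 5
tallest = max(counts.values())

# Build the plot from the top row down; a dot appears wherever the stack is tall enough.
lines = []
for level in range(tallest, 0, -1):
    row = " ".join("*" if counts[v] >= level else " " for v in axis)
    lines.append(row.rstrip())
lines.append(" ".join(str(v) for v in axis))
lines.append("pets")
print("\n".join(lines))
```

Because the axis is built from the full range rather than from the observed values, the empty column at 5 is kept and the gap stays visible.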

 TERM TO KNOW

Dot plot
A distribution in which each data value is represented by a dot above that value on an axis.

2. Ideal Settings for Dotplots


There are certain criteria to help signal when to use dot plots. When dealing with quantitative data, it is ideal when the data set is:

Small. There are not too many dots to draw. It doesn't mean there has to be small numbers--although the last example was small numbers. It simply can't have too
many dots to draw. For example, 20 observations would work well with a dot plot.
Discrete. Ideally, integers are nice and easy to plot.
Numbers Close Together. The numbers should not be too spread out. Think about the number of tick marks on the x-axis. If there are more than 15 tick marks
between the smallest and largest number, the dot plot will be more difficult to draw.

Dot plots can be used in a qualitative setting, as well. Suppose that you asked a class of 17 students what their favorite sport was. Three of them said soccer. Five of them
said baseball. The remaining nine said basketball. This is how your dot plot would look:

Notice how the dots are stacked when multiple students are indicating the same sport as their favorite. It was constructed by creating an x-axis with sports labeled across
the bottom.

 SUMMARY

Dot plots are distributions for both quantitative and qualitative data. They are constructed by placing dots above an axis. They are easy to construct and even
easier to interpret. Small data sets, either qualitative or quantitative, are ideal settings for dot plots. If a dot plot is quantitative, the numbers should be discrete and
not too spread out.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Dotplot
A distribution in which each data value is represented by a dot above that value on an axis.

Frequency Tables
by Sophia

 WHAT'S COVERED

This tutorial will introduce you to the basic principles of frequency tables. Our discussion breaks down as follows:

1. Frequency and Frequency Tables


2. Relative Frequency

1. Frequency and Frequency Tables


Frequency tables are tables that show how often each data value occurs. Frequency is the number of times that a particular value occurs in a data set.

Let's look at several examples to show how easy it is to summarize frequencies in a table.

 EXAMPLE Suppose I have the 15 billiard balls from a pool table. One of the variables of these billiard balls that I might be interested in is their color. For
example, the one ball is yellow, the two ball is blue, the six ball is green, the nine ball is also yellow, etc.

Now, I could take the values that the variable color takes and put them in a table.

Color Frequency

Yellow 2

Blue 2

Red 2

Purple 2

Orange 2

Green 2

Maroon 2

Black 1

Frequency is how often these values occur. Two of the balls are yellow, two are blue, two are orange, two are green, etc. The only one that has a frequency of one is
black, the eight ball.

What about when we use quantitative data?

 EXAMPLE Suppose an ice cream taste tester was asked to rate his satisfaction of 20 different ice creams on a 1-10 scale. His satisfaction scores are listed
below.
6, 6, 4, 8, 9, 7, 8, 5, 8, 9, 3, 4, 4, 3, 4, 1, 9, 8, 7, 10
Try to construct a frequency table for these scores. Hopefully, you came up with something that looks like this:

Score Frequency

1 1

2 0

3 2

4 4

5 1

6 2

7 2

8 4

9 3

10 1

One ice cream received a score of one. None of them received a score of two. Two of them received a score of three, and so on. The frequency represents how often
a particular score was received.
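One way to check a hand-built frequency table is to tally the scores in Python. This is a minimal sketch using the standard library's `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

scores = [6, 6, 4, 8, 9, 7, 8, 5, 8, 9, 3, 4, 4, 3, 4, 1, 9, 8, 7, 10]

counts = Counter(scores)
# List every score on the 1-10 scale so that unused scores show a frequency of 0.
frequency_table = [(score, counts[score]) for score in range(1, 11)]

print("Score Frequency")
for score, freq in frequency_table:
    print(f"{score:>5} {freq:>9}")
```

Looping over the whole 1-10 scale, rather than only the scores that occur, is what makes the row for a score of 2 appear with a frequency of 0.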

Now, in some cases, you may not want to look at the raw data, but instead, look directly at the frequency table. This is useful if the data set is very large.

 EXAMPLE Consider the following frequency table that has the heights of 333 sixth-grade students.

Height Frequency

55 11

56 21

57 33

58 37

59 55

60 51

61 44

62 32

63 30

64 12

65 7
These heights are rounded to the nearest inch. This means 11 students are 55 inches tall (a height of 4' 7"), 21 students are 56 inches tall, and so forth.

 TERMS TO KNOW

Frequency Table
A table showing the values of the data, and their respective frequencies.

Frequency
How often a data value, or range of values, occurs.

2. Relative Frequency
Often it's preferable to look not just at frequency, but also at what percent of the students a particular value represents. This value is called relative
frequency, and it is found by dividing each frequency by the total number of data points.

We can use relative frequency, expressed as a percent, to get a better picture of what portion of the whole population the 11 students represent.

Height Frequency Relative Freq

55 11 11/333 = 3%

56 21 21/333 = 6%

57 33 33/333 = 10%

58 37 37/333 = 11%

59 55 55/333 = 17%

60 51 51/333 = 15%

61 44 44/333 = 13%

62 32 32/333 = 10%

63 30 30/333 = 9%

64 12 12/333 = 4%

65 7 7/333 = 2%

As we can see, the 11 students who have a height of 55 inches make up about 3% of the population. We can fill out the entire table to find the relative frequencies as
opposed to "regular" frequencies.
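The division described above can be sketched in a few lines of Python. The dictionary layout and variable names are assumptions for illustration; the counts are taken from the height table.

```python
# Height in inches -> number of students, from the table above.
heights = {55: 11, 56: 21, 57: 33, 58: 37, 59: 55, 60: 51,
           61: 44, 62: 32, 63: 30, 64: 12, 65: 7}

total = sum(heights.values())   # 333 students
relative = {h: f / total for h, f in heights.items()}

for h, f in heights.items():
    # The :.0% format rounds to a whole percent, matching the table.
    print(f"{h}  {f:>3}  {f}/{total} = {relative[h]:.0%}")
```

The relative frequencies always add up to 1 (100%), which is a quick check that no category was missed.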

 TERM TO KNOW

Relative Frequency
The percent of the data points that take a particular value. This is obtained by dividing the frequency of each value by the total number of data points.

 SUMMARY

Data sets can be shown in frequency tables, whether they are qualitative or quantitative. Frequency tables are particularly useful with large data sets, in cases
where you don't want to see all of the raw data and would rather see it categorized. You can also use relative frequency to see what percent of the sample goes in
each bucket of the frequency table. Remember, frequency is the raw count, relative frequency is the percent, and both of those values can go into a frequency
table.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Frequency
How often a data value, or range of values, occurs.

Frequency Table
A table showing the values of the data, and their respective frequencies.

Relative Frequency
The percent of the data points that take a particular value. This is obtained by dividing the frequency of each value by the total number of data points.

Cumulative Frequency
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of cumulative frequency. Our discussion breaks down as follows:

1. Cumulative Frequency
2. Relative Cumulative Frequencies
3. Ogives

1. Cumulative Frequency
You likely already know about frequency, which refers to how often a data value occurs. Cumulative means the accumulation of everything that has occurred up to a
certain point. Therefore, cumulative frequency is the collected frequency of data points.

 EXAMPLE If a teacher says that a test is cumulative, that means that it's going to cover everything that you've learned that year, up to the point of the test.
In this context, cumulative frequencies involve separating the data into bins and determining how many observations fall within or below that bin.

EXAMPLE This is the distribution of daily high temperatures, in bins of 10 degrees, for Chanhassen, Minnesota in the year 2009. Three days had a high between -10℉ and -1℉, eight days
had a high between 0℉ and 9℉, and so forth.

Temperature Frequency

-10 - -1 3

0-9 8

10 - 19 25

20 - 29 39

30 - 39 30

40 - 49 51

50 - 59 46

60 - 69 39

70 - 79 80

80 - 89 40

90 - 99 4
With this information about the distribution of temperatures, you can determine cumulative frequencies by asking, “How many days were at or below 9℉ for the high
temperature?” Eight days fell within the 0 to 9 bin, and three fell below it, for a total of 11.
For the third bin, how many days were at or below 19℉? Twenty-five were in that bin and 11 were below it, for a total of 36. You can continue
this through the entire chart. Unsurprisingly, the final total is 365: every day of the year had a high at or below 99℉ in Chanhassen that year.

Temperature Frequency Cumulative Freq

-10 - -1 3 3

0-9 8 11

10 - 19 25 36

20 - 29 39 75

30 - 39 30 105

40 - 49 51 156

50 - 59 46 202

60 - 69 39 241

70 - 79 80 321

80 - 89 40 361

90 - 99 4 365
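The running totals in this table can be reproduced with `itertools.accumulate` from Python's standard library. This is a minimal sketch; the bin labels and variable names are illustrative.

```python
from itertools import accumulate

bins = ["-10 to -1", "0 to 9", "10 to 19", "20 to 29", "30 to 39", "40 to 49",
        "50 to 59", "60 to 69", "70 to 79", "80 to 89", "90 to 99"]
frequencies = [3, 8, 25, 39, 30, 51, 46, 39, 80, 40, 4]

# Each entry is the count in that bin plus everything in the bins below it.
cumulative = list(accumulate(frequencies))

for label, freq, cum in zip(bins, frequencies, cumulative):
    print(f"{label:>9} {freq:>3} {cum:>4}")
```

The last cumulative value should equal the size of the data set, here 365 days.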

 TERM TO KNOW

Cumulative Frequency
The number of data points that fall within or below a given bin of data.

2. Relative Cumulative Frequencies


Sometimes it is useful to consider relative cumulative frequencies, which is the percent of observations that fall in or below a certain bin.

You may have encountered relative frequency before, but not relative cumulative frequency. Fortunately, it's calculated the same way as relative frequency. To determine
the relative cumulative frequency, divide each cumulative frequency by the total number of values.

In the above example, we are considering a full year or 365 days. So we will divide each cumulative frequency by 365 to get the relative cumulative frequency.

In the first bin, there were 3 out of 365 values that fell in this category. This means that 0.008 of the data fell in or below this bucket. Dividing 11 by 365 gives you about
0.03. Continuing on the rest of the chart, we get these values.

Temperature Frequency Cumulative Freq Rel. Cumulative Freq

-10 - -1 3 3 /365 = 0.008

0-9 8 11 /365 = 0.030

10 - 19 25 36 /365 = 0.099

20 - 29 39 75 /365 = 0.205

30 - 39 30 105 /365 = 0.288

40 - 49 51 156 /365 = 0.427

50 - 59 46 202 /365 = 0.553

60 - 69 39 241 /365 = 0.660

70 - 79 80 321 /365 = 0.879

80 - 89 40 361 /365 = 0.989

90 - 99 4 365 /365 = 1.000
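As a quick check of the arithmetic, each cumulative frequency can be divided by the total in Python. This sketch rounds to three decimal places; for the 50 - 59 bin, for example, 202/365 comes out to about 0.553.

```python
cumulative = [3, 11, 36, 75, 105, 156, 202, 241, 321, 361, 365]
total = cumulative[-1]   # 365 days in the year

# Divide each cumulative frequency by the total and round to three places.
relative_cumulative = [round(c / total, 3) for c in cumulative]

for c, r in zip(cumulative, relative_cumulative):
    print(f"{c:>3} / {total} = {r:.3f}")
```

The values never decrease, and the final one is always 1.000 because every observation falls in or below the last bin.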

 BIG IDEA

The main point here is to divide each cumulative frequency by the total.
 TERM TO KNOW

Relative Cumulative Frequency


The percent of data points that fall within or below a given bin of data.

3. Ogives
In the previous chart, you may notice that the final value of 1.000 means that 100% of the values, or all 365 days, fell at or below this bin. Graphically, this information can
be presented in something called an ogive. It's also called a relative cumulative frequency graph, or sometimes a percentile graph. It's a line chart that uses these bins and
the relative cumulative frequencies to show how many values were at or below these bins.

Temperature Rel. Cum. Freq

-10 - -1 0.008

0-9 0.030

10 - 19 0.099

20 - 29 0.205

30 - 39 0.288

40 - 49 0.427

50 - 59 0.553

60 - 69 0.660

70 - 79 0.879

80 - 89 0.989

90 - 99 1.000

Each point is plotted at the left-hand edge of a bin. Moving left to right along the number line, by the time you reach -10 degrees you haven't encountered any of the days
of the year yet. Once you reach 0 degrees, you’ve encountered three of the days, a relative cumulative frequency of about 0.008. By the time you
reach 100 degrees, you will have encountered every single day, because every day's high fell in or below the last bin.
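The points of the ogive can be computed as follows. This is a sketch under the assumption that the curve starts at 0 at the left edge of the first bin and ends at 100 degrees, the right edge of the last bin; the variable names are illustrative.

```python
from itertools import accumulate

left_edges = [-10, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
frequencies = [3, 8, 25, 39, 30, 51, 46, 39, 80, 40, 4]
total = sum(frequencies)

# The curve starts at 0 at the left edge of the first bin; each later edge
# carries the share of days that fell in the bins to its left.
edges = left_edges + [100]
heights = [0.0] + [c / total for c in accumulate(frequencies)]
ogive_points = list(zip(edges, heights))

for x, y in ogive_points:
    print(f"{x:>4} degrees: {y:.3f}")
```

Connecting these points with line segments in any plotting tool gives the ogive.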

 HINT

Ogives never decrease from left to right. If there's no data in a particular bin, you get a flat line, or no increase.

 SUMMARY

Cumulative frequency and relative cumulative frequency show the number or the percent, respectively, of the data that falls in or below a certain bin. The graph of
relative cumulative frequency is called an ogive. It is a useful way to show how certain values relate to other values or how they relate to the whole. For instance, is a day that is 70
degrees in Chanhassen considered a very hot day? How does it compare to the rest of the days of the year? The relative cumulative frequency can answer those
questions.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Cumulative Frequency
The number of data points that fall within or below a given bin of data.

Relative Cumulative Frequency


The percent of data points that fall within or below a given bin of data.

Stack Plots
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of stack plots. Our discussion breaks down as follows:

1. Stack Bar Graphs


2. Stack Line Charts

1. Stack Bar Graphs


Stack plots layer the components of a total on top of one another so that you can see how each component relates to the total. They allow you to break down bar graphs or line charts into
component pieces to see the components more clearly. The drawback is that stack plots can sometimes be hard to interpret.

EXAMPLE Suppose that a company spent a certain number of millions of dollars over these five-year intervals, as shown in the graph below. So, in 1985, they
spent $10 million. In 1990, they spent a little bit less, and so forth. It looks like their spending increased significantly in the last ten years or so.

What a stack plot will do is allow us to see exactly where that spending breakdown occurred. Look at the stack bar graph below, which breaks out each measured
year's spending into its components:
Operations, which is the day to day business of the company
Marketing, which is the promotion side of the business

If you look at the stack bar graph, you can see that operations (the green component) have grown a little bit over the years that the company has been in business,
perhaps due to inflation. The green bars stay about the same height throughout this period of time.
The marketing budget in yellow, however, has really proliferated over that same period of time.

 THINK ABOUT IT

If the company is looking to cut costs, do you think they would cut from the operations budget or the marketing budget? Well, perhaps they'll cut from marketing
because they can see that, by the year 2010, marketing and operations are costing about the same amount.
You can also see that the marketing budget didn't increase or decrease significantly within the five years from 2005 to 2010, as the size of the yellow ribbons is
approximately the same size in those two years. This is simply another way to use this type of breakdown to analyze the data.
 TERM TO KNOW

Stack Plot
A bar graph or line chart that is subdivided into its components so that the comparisons, as well as the totals, can be seen.

2. Stack Line Charts


Stack line charts can be problematic because they are often very hard to read. It is difficult to tell whether the values are cumulative or individual.

 EXAMPLE Suppose there are four different boutique stores in different locations that a company owns. They want to know how the business is doing in each of
these four markets.

Based on this graph, it's hard to tell whether the New York store is doing poorly because it's at the bottom, or whether these are stacked on top of each other. If they
are stacked, then the Miami store is doing the worst because its difference is the least from the next lowest one. Does the Miami store make $250,000 on its own? Or
is it a total of $250,000 between the four stores and each of these are components?
It's hard to tell whether these are added values or individual values. You can make it clearer to interpret by making the lines ribbons instead, as shown below.

Now you can see what revenue each store accounted for, in each measured month. By using ribbons instead of lines, it is clearer that you're talking about adding the
values, as opposed to individual values.

 TERM TO KNOW

Stack Line Chart


A line chart where the lines represent cumulative amounts rather than individual amounts. These are typically done with different colored "ribbons" to make it
clearer that we are talking about totals.

 SUMMARY

Stack plots are used when two or more data sets are to be shown on the same set of axes, and we are interested in their sum as well. Stack plots are also used to
break down one data set by its components. We learned about two different types of stack plots: stack bar graphs and stack line charts. Different colors typically
are used to distinguish the components. On a line chart, we typically use ribbons instead of straight lines to indicate the size of the component, which shows that
we're talking about stacked groups instead of individual lines.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Stack Line Chart


A line chart where the lines represent cumulative amounts rather than individual amounts. These are typically done with different colored "ribbons" to make it clearer that
we are talking about totals.

Stack Plot
A bar graph or line chart that is subdivided into its components so that the comparisons, as well as the totals, can be seen.

Misleading Graphical Displays
by Sophia

 WHAT'S COVERED

This tutorial will cover how graphics can be used in misleading ways. Our discussion breaks down as follows:

1. Misleading Graphics
2. Misleading Pictographs
3. Misleading Areas
4. Technological Distortions

1. Misleading Graphics
Sometimes people create graphics that make you want to think a certain way. They're essentially trying to sway you to believe something, so they'll distort information or
create misleading graphics to try to persuade you to think that way.

Graphs can be distorted or misleading in a variety of ways.

EXAMPLE Suppose three friends--Paul, Hector, and Juan--were looking at the number of baseball cards they had and created the graph below. Do you notice
anything misleading about this graph?

Well, this graph has a fairly obvious visual distortion. Notice that the y-axis has unequal scaling. One step on the y-axis represents an increase of 50
(between 0 and 50), whereas the next represents an increase of 70 (between 50 and 120). However, the visual gap is the same in both cases, even though
70 is larger than 50.
Even more misleading, another step on the y-axis represents an increase of only 5 (between 120 and 125). But again, the visual gap is the same!
Graphs drawn with unequal scales like this represent quantities disproportionately.
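A simple way to test an axis for this problem is to compare the differences between consecutive tick labels. The sketch below assumes the ticks are drawn at equal visual spacing, as described above, and uses the four labels mentioned in the text.

```python
# Tick labels mentioned in the text, assumed to be drawn at equal visual spacing.
tick_labels = [0, 50, 120, 125]

# Difference in value between each pair of neighboring ticks.
steps = [b - a for a, b in zip(tick_labels, tick_labels[1:])]

# On an honest linear axis, equally spaced ticks represent equal steps.
evenly_scaled = len(set(steps)) == 1

print("Steps between ticks:", steps)
print("Evenly scaled:", evenly_scaled)
```

Here the same visual gap stands for three different amounts, so the check fails.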

 THINK ABOUT IT

Who do you think created this graph? Because the distortion of the data makes Juan appear to have more cards than his friends, it was probably drawn by Juan.

 EXAMPLE Consider this presentation of data about the preferred brand of dish soap. The data states that 15 people selected Brand A, 8 people selected B, and
20 people selected C. Both of these graphs can show us that.

Graph 1 Graph 2

Which graph is more accurate?


As it turns out, Graph 1 is more accurate. Graph 2 is clearly used to exaggerate the difference between B and C. Notice how much taller C is than B in Graph 2 as
opposed to Graph 1. It appears to be twice as tall as B in Graph 1, whereas it seems to be about five times taller than B in Graph 2.
Another consideration if you use a bar graph or a histogram is that it is a good idea to start the vertical axis at zero unless there's a good reason not to start there.

EXAMPLE If you were tracking the change in home prices on a graph whose vertical axis starts at zero and runs all the way up to $300,000, the graph won't show a big difference
between $300,000 and $280,000. However, to the homeowner, that drop of $20,000 is significant.
Graphs beginning anywhere besides zero tend to exaggerate differences, but graphs starting at zero can sometimes minimize very real differences.
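The effect of the starting point can be put into numbers. In the sketch below, the two home prices come from the example, while the $270,000 baseline is a hypothetical value chosen only to illustrate a truncated axis; the function name is also an assumption.

```python
def drawn_height_ratio(a, b, baseline):
    """Ratio of the drawn heights of values a and b when the axis starts at baseline."""
    return (a - baseline) / (b - baseline)

# Axis starting at zero: the two prices look almost the same height.
ratio_from_zero = drawn_height_ratio(300_000, 280_000, baseline=0)

# Hypothetical axis starting at $270,000: the same prices look very different.
ratio_truncated = drawn_height_ratio(300_000, 280_000, baseline=270_000)

print(f"Starting at zero:      {ratio_from_zero:.2f} times as tall")
print(f"Starting at $270,000:  {ratio_truncated:.2f} times as tall")
```

The actual prices differ by about 7%, yet on the truncated axis one appears three times as tall as the other.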

 TERMS TO KNOW

Misleading Graphic
A graph meant to mislead a reader or make a reader feel or believe a certain way.

Scales
The way an axis on a graph is measured. Inappropriate scaling can lead to a misleading graph.

2. Misleading Pictographs
Pictographs are plots that use pictures instead of dots or bars. Pictographs show up in newspapers a lot because they're very visually appealing.

 EXAMPLE Suppose that a class of 17 students is asked to name their favorite sport. One student might have drawn this graph to illustrate the results:

The three soccer balls mean that three students said that soccer was their favorite sport. The five baseballs mean that five students said baseball, and the nine
basketballs mean that nine students said basketball. This is a completely valid graph. It's very analogous to a dot plot, except pictures are used instead of dots.
Another student might have created this pictograph:

This looks a little odd because you can see half of a soccer ball, half of a baseball, and half of a basketball. However, notice that this student went on to say that every
basketball, soccer ball, or baseball actually counts as two students. Therefore, one ball--which is two students--and another half a ball, which equates to one student,
represents a total of three students. This is the same as the data that was presented by the other student.

 BIG IDEA

A pictograph is going to use pictures instead of a scale or dots.


The only problem with pictographs is that sometimes they can be misleading. In the figure below, the USA had the most medals, and Russia had the next highest amount.

However, it's not really clear what one medal icon actually means in terms of relative size. What we see is that if you divide 1,975 by six medal icons, one medal icon
counts for about 329 medals for the USA.

But if you divide 999 by 5 medal icons for Russia, one medal icon actually counts for about 200 medals for Russia.

In fact, none of these are very consistent.

What we should have done is chosen a medal icon to represent a certain, defined number of medals and then extended the ribbon out that far. A better-looking pictograph
would be something like this:

Here, each medal icon represents 100 medals, and the results are rounded to the nearest 100. This gives 20 medal icons for the USA because 1,975 rounds to
2,000. Russia has half as many medal icons to represent its 999 medals. This shows much more accurately how many more Olympic medals the USA has than
the other countries.
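The inconsistency, and the fix, can both be checked with a little arithmetic. This sketch uses only the two countries whose figures are given in the text; the variable names are illustrative.

```python
medals = {"USA": 1975, "Russia": 999}
icons_drawn = {"USA": 6, "Russia": 5}   # icon counts in the misleading version

# In the misleading version, one icon is worth a different amount for each country.
medals_per_icon = {c: medals[c] / icons_drawn[c] for c in medals}

# In the corrected version, every icon is worth the same fixed amount.
MEDALS_PER_ICON = 100
icons_needed = {c: round(medals[c] / MEDALS_PER_ICON) for c in medals}

for country in medals:
    print(f"{country}: {medals_per_icon[country]:.0f} medals per icon as drawn, "
          f"{icons_needed[country]} icons at {MEDALS_PER_ICON} medals each")
```

With a fixed value per icon, the USA row is twice as long as Russia's, which matches the roughly two-to-one ratio in the actual counts.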

 TERM TO KNOW

Pictograph
A graphical display that uses pictures of physical objects rather than dots or bars to indicate the relative size of numbers.

3. Misleading Areas
It's important to know what you're trying to emphasize and not create perceptual distortion. The images you use to represent bins in your graphs can also distort meaning.

 EXAMPLE Suppose a class of 18 students was asked their favorite sport. Three kids said soccer, five said baseball, and the remaining said basketball. A
student drew this graph based on the results:

According to this graph, twice as many students chose basketball as baseball. So, what's the problem? Well, the problem is that while the height of the basketball is
twice the height of the baseball, it's also twice the width. If you compare the areas taken up by the basketball and the baseball, the basketball takes up about four times
as much area as the baseball.

In fact, you can see this even more clearly by putting the box that represents the baseball inside the box that represents the basketball. It's clearly only about ¼ the
size.
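The reason is that when both the height and the width of a picture are scaled by the same factor, its area grows by the square of that factor. A minimal sketch, with the class counts taken from the example and an illustrative function name:

```python
def area_ratio(scale):
    """Area ratio when both height and width are multiplied by `scale`."""
    return scale ** 2

baseball_votes = 5
basketball_votes = 18 - 3 - 5    # the remaining students in the class of 18

value_ratio = basketball_votes / baseball_votes   # what the data says
visual_ratio = area_ratio(value_ratio)            # what the eye sees

print(f"Data ratio: {value_ratio:.0f} to 1, area ratio: {visual_ratio:.0f} to 1")
```

To keep the picture honest, only one dimension (usually the height) should change with the value.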

 TERM TO KNOW

Perceptual Distortion
Using area or three-dimensional visual tricks to make certain values appear bigger or smaller than they are.

4. Technological Distortions
To make matters worse, technology has introduced us to lots of different misleading graphs, like the one below:

While it is clear that this graph is meant to show data across four different cities, it’s unclear what it's supposed to be measuring because there is no label. Even if we know
that this graph is supposed to show how these different markets are doing across time--maybe one of these shapes is meant to represent a store?--it's still unclear what
this is supposed to be measuring or what the numbers 0 through 100 actually mean.

The additional problem comes from the lack of clarity about what the graph is comparing. Is it comparing the height of these things or the volume of these things? For
instance, a cone only has about 1/3 the volume of a cylinder with the same base and height. Therefore, it doesn't really make sense to be comparing cones and pyramids
to cylinders and boxes.

Because the previous graph is three-dimensional, there's no way to easily compare heights. Is the cone supposed to be taller, shorter, or the same height as the cylinder
behind it? It's going to be very hard for anyone to tell, which makes this an incredibly misleading graphic.

Technology, like certain spreadsheet programs, has allowed you to easily create many different graphs. Using so many technological tricks, like three-dimensional cones
and cylinders, can distort your data.

The best choice, if you're going to use bar graphs, would be the simple ones, like the simple two-dimensional ones at the top. For the data above comparing the four cities,
a better choice would be something like a time series, since this graph is meant to compare across time.

This is a lot more useful to anyone reading it than the previous example. It isn’t as flashy, but none of the information is hidden or distorted.

 SUMMARY

Graphical displays can be manipulated in many different ways. If you use a picture that does not represent the same amount in each category, you could be
creating a misleading pictograph. Similarly, if you use an inappropriate scale, you can exaggerate the differences, and if you use areas, you can make differences seem larger
than they actually are. You can also use three-dimensional displays that aren't really clear at all. Technology makes all of these misleading
graphics easy to create. As a statistician, your goal is to make the complicated simple and to make the data
easy to understand. The goal is clarity--and all these misleading graphics don't do that!

Good luck!

 TERMS TO KNOW

Misleading Graphic
A graph meant to mislead a reader or make a reader feel or believe a certain way.

Perceptual Distortion
Using area or three-dimensional visual tricks to make certain values appear bigger or smaller than they are.

Pictograph
A graphical display that uses pictures of physical objects rather than dots or bars to indicate the relative size of numbers.

Scales
The way an axis on a graph is measured. Inappropriate scaling can lead to a misleading graph.

Distributions
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of distributions. Our discussion breaks down as follows:

1. Distributions
2. Matching Distribution Types to Data Sets

1. Distributions
A data set is not just a list of numbers or values; there is some context associated with it, usually the units, or what type of measurement is used, or perhaps some kind of
descriptor.

A variable is any characteristic of the individual members of the population that can be measured. A variable of interest can take on different values for each member of the
population.

 EXAMPLE For example, suppose we are interested in the variable of height for a group of people. This could vary from person to person because people have
different heights.
A distribution is a way to visually show how many times a variable takes a certain value; it is the values the variable takes and how often they show up. There are many
kinds of distributions:

Types of Distributions | Description | Examples

Frequency Tables | Can visually show how often a variable takes on a certain value. |

Qualitative Data | The variables in these distributions are categories. | Bar Graphs, Pie Charts, Dot Plots

Quantitative Data | The variables in these distributions are measures of values or counts. | Stem-and-Leaf Plots, Dot Plots, Histograms, Line Charts, Time-Series Diagrams

Mathematical Rules | Can visually show variables through a certain pattern and are not strictly data-driven. | Normal Distribution, Poisson Distribution

 TERMS TO KNOW

Variable
A measurable factor, characteristic, or attribute of an individual or a system.

Distribution
A way to visually display the values a variable takes and how often it takes each value.

2. Matching Distribution Types to Data Sets


Why are there so many different kinds of distributions? The point of a distribution is to make the data--possibly a large data set that is unwieldy--simpler to understand. You
want to make it easy for yourself and your readers to understand. Therefore, different kinds of distributions will lend themselves better to different types of data sets.

 EXAMPLE A dot plot is better for data that are close together and do not have a lot of values, whereas certain other distributions are better for larger data sets.
A histogram is better than a dot plot when the data is very spread out.
You can determine which kind of distribution to use based on the kind of data you have.

 BIG IDEA

Each distribution has its own situation for which it is ideal. The data will determine which distribution is best to use.

 SUMMARY

There are many types of distributions. The point of all of them is to visually display your data so the reader can take a large data set and succinctly understand
what is going on with it. Some distributions contain every observation or data point, and some only contain summaries; you can match your distribution types to the
data set. Each type of distribution discussed here can be explored further in its own tutorial.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Distribution
A way to visually display the values a variable takes and how often it takes each value.

Variable
A measurable factor, characteristic, or attribute of an individual or a system.

Data Analysis
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of data analysis. Our discussion breaks down as follows:

1. Data Analysis
2. Shape
3. Center
4. Spread
5. Outliers

1. Data Analysis
Data analysis is what we do once we've collected our data. In this lesson, we will look at data analysis to identify the trends or key features of a data set. There are four
components of data analysis that are key:

The shape that a distribution will have


The center of that distribution
The spread of the data
Any outliers in the data

 TERM TO KNOW

Data Analysis
The understanding of the key features of a set of data--shape, center, spread, and outliers.

2. Shape
Shape is a qualitative notion telling us where most of the points lie in the distribution.

 EXAMPLE For instance, in this shape below, you would say that most of the data points are in the hump, where the line is highest on the y-axis. There are not a
lot of data points on the far right side, in what we'd call the tail of the graph.

Shapes can be either skewed to the left or the right: The distribution in the example above is called skewed to the right because it has a hump on the left and a tail on the
right.

In contrast, the distribution below is called skewed to the left. It has a tail on the left and a hump on the right.

 TERM TO KNOW

Shape

The qualitative description of the clustering of data points in a certain location when the data are graphed.

3. Center
The term "center" is essentially what it sounds like: it's wherever the middle is. There are a couple of different ways to measure the center.

In the graph below, a few arrows are pointing to the different measurements of the middle.

The first arrow (the arrow furthest to the left on the x-axis) falls directly below the peak.
The second arrow (the one in the middle) is a little further off to the right. It appears that if you drew a line directly through this arrow, about half the area of the graph
would be to the left of it and about half the area would be to the right of it.
The third arrow is farthest to the right of the x-axis.

Which one is the correct measure of center? They're all different measures, and they can all be correct in different situations.

 TERM TO KNOW

Center
The “middle” of the data set. There are many measures of center.

4. Spread
Spread gives a numerical value relating how spread out the data points are. Just as with center, there are several different measures for spread.

Perhaps you are interested in where most of the data points lie, which would be below the hump:

Or, perhaps you are interested in the full range of data points from the lowest all the way to the highest:

These would both be different, and correct, measurements of the spread.

 TERM TO KNOW

Spread
The numerical description of how close the numbers are to the center.

5. Outliers
When analyzing a data set, it is important to look for outliers, which are not just the highest or lowest numbers, but are numbers that are very far above the next highest
number in the data set or very far below the next lowest number in the data set.

 EXAMPLE Suppose that a small class took an exam, and the scores were as follows:
90, 98, 89, 88, 46, 90, 91, 84, 94
Some students did very well on this test. In fact, most students scored in the 80's or the 90's. However, one person scored only 46. That 46 would be considered an
outlier because it's so much lower than the rest of the pack.
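
One rough way to see why 46 stands out is to measure how far the lowest and highest scores sit from their nearest neighbors. The sketch below does only that; the tutorial gives no numeric cutoff for an outlier at this point, so the code reports the gaps rather than applying a rule. The variable names are illustrative assumptions.

```python
# Exam scores from the example above.
scores = [90, 98, 89, 88, 46, 90, 91, 84, 94]

ordered = sorted(scores)

# Gap between the lowest score and the next lowest score.
low_gap = ordered[1] - ordered[0]

# Gap between the highest score and the next highest score.
high_gap = ordered[-1] - ordered[-2]

print("lowest:", ordered[0], "gap to next lowest:", low_gap)
print("highest:", ordered[-1], "gap to next highest:", high_gap)
```

The lowest score is 38 points below its neighbor, while the highest is only 4 points above its neighbor, which agrees with reading 46 as the unusual value.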

 BIG IDEA

Outliers are important data points because they are so high or low that they would be considered unusual.
 TERM TO KNOW

Outliers
Points in a data set that are so high or so low as to be unusual, given the rest of the values.

 SUMMARY

Data analysis consists of clearly describing the four key elements of the data set: shape, center, spread, and outliers (if there are any). Some standard descriptions
are used to describe the shape, such as skewed to the left and skewed to the right, and there are also several different measures for the center and spread, which
are typically numbers. Outliers are values that are so high above or so far below the rest of the data set that they would be considered unusual.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Center
The "middle" of the data set. There are many measures of center.

Data Analysis
The understanding of the key features of a set of data - shape, center, spread, and outliers.

Outliers
Points in a data set that are so high or so low as to be unusual, given the rest of the values.

Shape
The qualitative description of the clustering of data points in a certain location when the data are graphed.

Spread
The numerical description of how close the numbers are to the center.

© 2023 SOPHIA Learning, LLC. SOPHIA is a registered trademark of SOPHIA Learning, LLC. Page 39
Shapes of Distribution
by Sophia

 WHAT'S COVERED

This tutorial will cover the different shapes that distributions can take. Our discussion breaks down as follows:

1. Distribution
2. Symmetric Distribution
3. Skewed Distribution
4. Uniform Distribution
5. Unimodal Distribution
6. Bimodal Distribution
7. Multimodal Distribution

1. Distribution
A distribution is a way to visually show how many times a variable takes a certain value.

While distribution displays the values the variable takes and how often, shape describes the data points as a whole. This tutorial will use qualifying descriptors to identify
how the distribution of a data set can look when graphed.

 TERM TO KNOW

Distribution
A display of data that shows the values the data take and how often those values occur.

2. Symmetric Distribution
A symmetric distribution will have the same mean as its median. If plotted, it will look like two mirror images on the same plot.

Here are examples of symmetric distribution:

In the graph on the far left, for example, the line in the center of the graph is the mirror line, and it represents both the mean and the median of this distribution.

Symmetrical distribution doesn't happen too often. Only a few distributions are actually truly symmetric. Often we get distributions that look something like this:

Although this distribution is close to being symmetrical, it is not exactly symmetric.

Note that when you say the word symmetric, you must mean exactly. Thus qualifiers like approximately symmetric, roughly symmetric, or nearly symmetric, are necessary
to make it clear when a distribution is nearly, but not exactly, symmetric.
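
As a small check of the idea that an exactly symmetric distribution has the same mean and median, the sketch below uses a made-up data set that mirrors around the value 3. The data and names are assumptions chosen for illustration.

```python
from statistics import mean, median

# Made-up data that mirror exactly around 3.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Reflect every value across 3 (v becomes 6 - v) and reverse the order.
# If the list is unchanged, the data are exactly symmetric about 3.
is_mirror = symmetric == [6 - v for v in reversed(symmetric)]

print("mirror image:", is_mirror)
print("mean:", mean(symmetric), "median:", median(symmetric))
```

Changing a single value (for example, replacing the 5 with a 9) breaks the mirror check, and the mean and median no longer match exactly, which is when a qualifier like "approximately symmetric" would be needed.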

 TERM TO KNOW

Symmetric Distribution
A distribution where the mean and median are the same. It will appear to have a "mirror line" at the median of the distribution.

3. Skewed Distribution
Certain distributions aren't even close to being symmetric. Many asymmetric distributions are called skewed distributions.

These distributions are characterized by a hump, which is sort of a dense grouping with lots of points at certain values and some values that only have a few occurrences.
The part of the distribution with fewer occurrences is called a tail. The tail occurs to one side of the median of the distribution. These distributions look like this:

There are two ways that a distribution can be skewed.

Skewed Distributions

Right-Skewed (Positively Skewed) | Tail is on the right side of the median. Right is more positive on the number line.

Left-Skewed (Negatively Skewed) | Tail is on the left side of the median. Left is more negative on the number line.

 TERMS TO KNOW

Skewed Distribution
A distribution where the majority of values are on one side of the distribution, and there are only a few values on the other.

Skewed Right (Positively Skewed) Distribution
A distribution where the majority of values are low, and there are only a few high values that form a "tail" to the right of the median.

Skewed Left (Negatively Skewed) Distribution
A distribution where the majority of values are high, and there are only a few low values that form a "tail" to the left of the median.

4. Uniform Distribution
When all values occur equally often, the shape is referred to as a uniform distribution. Here is an example of a uniform distribution:

Uniform distributions are a certain kind of symmetric distribution. Imagine you put a line of symmetry between the three and four. The two sides would then be symmetric.
Moreover, this is a distribution where all the values are equally distributed.

You can also use the same qualifiers for uniform distribution as are used with symmetry.

 EXAMPLE If you rolled a die six times, you might get one 6, one 5, one 4, one 3, one 2, and one 1.

If you rolled the die 600 times, you would expect about 100 of each value. However, perhaps you only got 95 1's and 102 2's. The distribution will look almost
uniform, so we can use words like “approximately,” “nearly,” or “almost” uniform in place of “exactly” uniform.
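
The die example can be tried out with a short simulation. This is only a sketch: it uses Python's pseudo-random generator as a stand-in for a fair die, and the seed value is an arbitrary assumption made so the run is repeatable.

```python
import random
from collections import Counter

# Fixed seed (an arbitrary choice) so the run is repeatable.
rng = random.Random(0)

# Simulate 600 rolls of a six-sided die.
rolls = [rng.randint(1, 6) for _ in range(600)]
counts = Counter(rolls)

# Print how many times each face came up.
for face in range(1, 7):
    print(face, counts[face])
```

The counts will typically land near 100 each without all being exactly 100, which is the situation the qualifier "approximately uniform" describes.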
 TERM TO KNOW

Uniform Distribution
A distribution where all values are equally likely.

5. Unimodal Distribution
Often distributions will have a clear peak to their shape. They will peak in just one place on the distribution.

In the table below, each graph has a clear peak, so all of these are called unimodal distributions.

Unimodal Distributions

Peak in the Center

Peak to the Right

Peak to the Left

 HINT

The tallest bar is called the mode.


 TERM TO KNOW

Unimodal/Single-Peaked Distribution
A distribution where one value or bin contains more data than the other values or bins.

6. Bimodal Distribution
You might have a distribution that has two distinct regions with lots of data points and a gap in the middle. When this happens, two peaks form on the distribution. These
are both called modes, and a distribution like this is called a bimodal distribution.

Technically, there's only one bin that is the mode: the very tallest bar. However, in the above graph, there are two bins that are the tallest relative to the others around
them--also known as local modes.

Now, sometimes you have a distribution that appears bimodal, like the graph below:

Even though it appears to be bimodal, upon further examination of heights, it's possible that you have two different distributions that happen to be graphed on the same set
of axes (see below).

There might be some hidden variable that causes the bi-modality. When viewed separately, you end up with two unimodal distributions that just happened to be graphed
on the same set of axes.

 TERM TO KNOW

Bimodal Distribution
A distribution where there are two distinct values or bins that contain more data than the others, usually separated by a gap.

7. Multimodal Distribution
Any distribution with more than two peaks is called a multimodal distribution. This distribution, for instance, has four peaks:

You can have the same issue with this type of distribution as you did with the bimodal distribution, in that it may be multiple distributions graphed on the same set of axes.

 BIG IDEA

Uni means one, bi means two, and multi means many; modal refers to the modes (peaks) that each distribution has.
 TERM TO KNOW

Multimodal Distribution
A distribution where there are many values or bins that contain more data than other nearby bins, usually separated by gaps.

 SUMMARY

Distributions, when graphed, have many descriptors that we can use to describe their shape. Symmetric distributions visually have mirror halves, and
mathematically they have the same mean and median. Uniform distributions are a specific type of a symmetric distribution that are visually very flat. Skewed
distributions have a hump on one side of the median and a tail on the other side of the median; if the tail is on the right side of the median, it is called skewed to the
right, or positively skewed, and if the tail is to the left of the median, it is skewed to the left, or negatively skewed. Some distributions are unimodal, or single-
peaked distributions. Others are bimodal, which means they are clearly double-peaked, and some are multimodal, with more than two peaks. Sometimes, a
bimodal distribution is simply two unimodal distributions graphed together.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Bimodal Distribution
A distribution where there are two distinct values or bins that contain more data than the others, usually separated by a gap.

Distribution
A display of data that shows the values the data take and how often those values occur.

Multimodal Distribution
A distribution where there are many values or bins that contain more data than other nearby bins, usually separated by gaps.

Skewed Distribution
A distribution where the majority of values are on one side of the distribution, and there are only a few values on the other.

Skewed Left (Negatively Skewed) Distribution
A distribution where the majority of values are high, and there are only a few low values that form a "tail" to the left of the median.

Skewed Right (Positively Skewed) Distribution
A distribution where the majority of values are low, and there are only a few high values that form a "tail" to the right of the median.

Symmetric Distribution
A distribution where the mean and median are the same. It will appear to have a "mirror line" at the median of the distribution.

Uniform Distribution
A distribution where all values are equally likely.

Unimodal/Single-Peaked Distribution

A distribution where one value or bin contains more data than the other values or bins.

Mean, Median, and Mode
by Sophia

 WHAT'S COVERED

This tutorial will cover how to calculate the mean, median, and mode of a data set. Our discussion breaks down as follows:

1. Mean
a. Calculating Mean
b. Outliers
c. Notation
2. Median
a. Calculating Median
b. Extreme Values
c. Median Class
3. Mode
a. Qualitative Sets
b. Distributions

1. Mean
The word mean is often used interchangeably with the word “average.” However, several different things can be called an average; the mean is the most common of those.
In this tutorial, "mean” will be used interchangeably with the word “average,” whereas other concepts, such as median, will not be implied.

 TERM TO KNOW

Mean
The "average" value of a data set. It is obtained by dividing the sum of the values by the number of values in the set.

1a. Calculating Mean


The mean of a data set is found by adding up all the values together and dividing by how many there are. Notationally, it looks like this:
 FORMULA

Mean

\[ \text{Mean} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} \]

The 1, 2, and 3 in X₁, X₂, and X₃ are called subscripts. They indicate the first number in the list, the second number in the list, and so forth, until the last number in the
list, marked by Xₙ. The “n” value in the denominator is the total number of values.

 EXAMPLE The data set below shows the height of the players on the Chicago Bulls basketball team. What height would be considered average height for
basketball players on this team?

Player Height (inches)

Omer Asik 84

Carlos Boozer 81

Ronnie Brewer 79

Jimmy Butler 79

Luol Deng 81

Taj Gibson 81

Richard Hamilton 79

Mike James 74

Kyle Korver 79

John Lucas III 71

Joakim Noah 83

Derrick Rose 75

Brian Scalabrine 81

Marquis Teague 74

C.J. Watson 74

To find the answer, add all the values and divide by the total number of values that are represented. In this case, we're going to add up all of the players' heights and
divide by 15, because there are 15 total players.

\[ \text{Mean} = \frac{84 + 81 + 79 + 79 + 81 + 81 + 79 + 74 + 79 + 71 + 83 + 75 + 81 + 74 + 74}{15} = \frac{1175}{15} \approx 78.33 \]

The result is 78.33 inches, which is the average height of a player on the Chicago Bulls.
To calculate mean using technology, you can use a spreadsheet. Create your list of values and type “= average(”. Highlight all of the fields, close the parentheses, and
hit “Enter”.

In this example, the spreadsheet will return with 78.33, which is the same number you calculated in your notation above.
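
The same calculation can be sketched in Python instead of a spreadsheet. The list below copies the heights from the table; the variable names are illustrative assumptions.

```python
# Heights (in inches) of the 15 players, in the order listed in the table.
heights = [84, 81, 79, 79, 81, 81, 79, 74, 79, 71, 83, 75, 81, 74, 74]

# Mean = sum of the values divided by the number of values.
mean_height = sum(heights) / len(heights)

print(round(mean_height, 2))  # 78.33
```

Here `sum(heights)` plays the role of the numerator in the formula and `len(heights)` plays the role of n.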
1b. Outliers
You may come across situations in which the mean is a poor representation of where the center actually is.

 EXAMPLE Suppose that you have 12 employees. Eight of them are shift workers, three of them are managers, and there’s one boss. Here are the salaries for
the respective positions:
Shift Worker: $42,000
Manager: $55,000
Boss: $200,000
Calculate the mean of the eight shift workers, the three managers, and the boss.

\[ \text{Mean} = \frac{8(42{,}000) + 3(55{,}000) + 200{,}000}{12} = \frac{701{,}000}{12} \approx 58{,}416.67 \]

This means that the mean salary of the 12 employees is over $58,000. However, how many of the employees actually make more than $58,000? How many make less than
$58,000?
11 of the 12 employees make less than $58,000. Only one makes more than that, and that one person makes substantially more. Therefore, it doesn't
really make a lot of sense to use the mean to measure center here. The boss’s $200,000 salary is an outlier in this data set.
 HINT

In the presence of outliers, which are a very few unusually high or unusually low values, the mean won't give an accurate representation of center.
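
To see how much a single outlier pulls the mean, the sketch below computes the mean salary with and without the boss. Leaving the boss out is done here only for comparison and is an assumption of this sketch, not a step from the tutorial.

```python
# Eight shift workers, three managers, and one boss (salaries in dollars).
salaries = [42_000] * 8 + [55_000] * 3 + [200_000]

# Mean of all 12 salaries.
mean_all = sum(salaries) / len(salaries)

# Mean of the 11 salaries that remain when the boss (the last entry) is left out.
without_boss = salaries[:-1]
mean_without_boss = sum(without_boss) / len(without_boss)

# Number of employees who earn less than the overall mean.
below_mean = sum(1 for s in salaries if s < mean_all)

print(round(mean_all, 2), round(mean_without_boss, 2), below_mean)
```

One salary out of twelve moves the mean by nearly $13,000, and 11 of the 12 employees end up below the overall mean.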

1c. Notation
There are a couple of accepted notations for expressing averages:

Symbol | Pronunciation | Description

μ | "mew" | A Greek letter (mu); you will see this quite a lot as a notation for the mean.

x̄ | "x-bar" | This is simply an x with a bar over it; you can also use a y-bar, or a bar over whatever variable you're using.

Σ | "sigma" | A Greek letter; used in summation notation.

Summation notation, or sigma notation, is a special notation that shortens a long sum. This notation uses the Greek letter sigma: Σ. The compact
formula in summation notation is equivalent to the lengthier formula.

 FORMULA

Mean

\[ \text{Mean} = \frac{\sum_{i=1}^{n} X_i}{n} \]

The Xᵢ (read as “x subscript i”) is just like the X₁, X₂, and X₃ in the original formula (from the first section). This notation means that you add up
all the X's, starting from the first one (where the i value is 1) and finishing at the “nth,” or last, one. When that is completed, you divide by n.
 TERM TO KNOW

Summation Notation
A notation that uses the Greek letter sigma to state that values should be added together.

2. Median
A median is simply a measure of center for the data set that actually finds the middle value in a sorted list. It's the middle number when the data set is arranged from least
to greatest or greatest to least.

 TERM TO KNOW

Median
The value that is in the "middle" of a data set when the set is arranged from least to greatest.

2a. Calculating Median


Recall the list of heights of players from the Chicago Bulls basketball team.

Player Height (inches)

Omer Asik 84

Carlos Boozer 81

Ronnie Brewer 79

Jimmy Butler 79

Luol Deng 81

Taj Gibson 81

Richard Hamilton 79

Mike James 74

Kyle Korver 79

John Lucas III 71

Joakim Noah 83

Derrick Rose 75

Brian Scalabrine 81

Marquis Teague 74

C.J. Watson 74
You might notice that many of the players are 81 inches tall. Can you, therefore, call that height typical of the Chicago Bulls? To answer that question, you'll need to
calculate the median.

In the above list, the players were sorted alphabetically. To find the median, you need to have that list ordered from least to greatest. The first step, therefore, is to reorder
those numbers, which will look like this:

71, 74, 74, 74, 75, 79, 79, 79, 79, 81, 81, 81, 81, 83, 84

To find the middle number, start by crossing off the lowest and highest numbers and continue working your way in until you have just one number left.

71, 74, 74, 74, 75, 79, 79, [79], 79, 81, 81, 81, 81, 83, 84 (the bracketed value is the one that remains after crossing off seven values from each end)

In this case, the remaining number is 79, which is the median. Notice that half the values in the list are at or below 79, and half the values in the list are at or above 79.

You can also use technology to figure out the median of a data set. Place the list of heights, not ordered, in a spreadsheet. Type "= median(". Then, select the full range of
numbers for which you want to find the median, close the parentheses, and hit "Enter".

Using this method, your spreadsheet will give you a median of 79, just like the first method above.

If you have an even number of data values, such as 16 pets or 20 courses, finding the median will take an extra step.

 EXAMPLE Suppose you have a class of 12 students and you have a 10-point quiz. Below are the scores from each of the students. What is the median?

10, 9, 6, 7, 7, 8, 9, 9, 10, 4, 7, 9

The first step is to reorder these scores, which should result in a list that looks like this:

4, 6, 7, 7, 7, 8, 9, 9, 9, 9, 10, 10

As you cross out the highest and lowest numbers, working toward the center, you will notice that there are two middle values:

4, 6, 7, 7, 7, [8, 9], 9, 9, 9, 10, 10 (the two bracketed values are the ones that remain)

In a case like this, you have to average those two numbers by adding the values together and dividing by 2:

\[ \frac{8 + 9}{2} = 8.5 \]

Therefore, the median is 8.5.
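
Both cases (an odd and an even number of values) can be handled by one short routine. The function below is a minimal sketch that assumes a non-empty list of numbers; the function name is an illustrative choice.

```python
def median(values):
    """Return the middle value of a non-empty list of numbers."""
    ordered = sorted(values)  # arrange from least to greatest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: exactly one value is left in the middle.
        return ordered[middle]
    # Even count: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


heights = [84, 81, 79, 79, 81, 81, 79, 74, 79, 71, 83, 75, 81, 74, 74]
quiz_scores = [10, 9, 6, 7, 7, 8, 9, 9, 10, 4, 7, 9]

print(median(heights))      # 79
print(median(quiz_scores))  # 8.5
```

Sorting first means the input can be given in any order, just as with the spreadsheet method.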

2b. Extreme Values


How is the median affected by extreme values? Suppose that you have another 10-point quiz for a different class of 11 students. Here are the scores, in order:
2, 4, 5, 6, 6, 7, 8, 8, 8, 9, 90

The median for this set of data is 7.


Obviously, one of these values, 90, is completely out of range, perhaps because of a typo. Despite this typo, however, the median of this data set is 7 because that is the
middle number. If you correct the typo, changing that 90 to a 9, for instance, the median will still be 7.
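
The effect of the typo can be checked directly. The sketch below compares the median and the mean of the scores with the 90 and with the corrected 9; including the mean is an addition of this sketch, meant to contrast with the earlier section on outliers.

```python
from statistics import mean, median

with_typo = [2, 4, 5, 6, 6, 7, 8, 8, 8, 9, 90]
corrected = [2, 4, 5, 6, 6, 7, 8, 8, 8, 9, 9]

# The median is the same either way.
print(median(with_typo), median(corrected))

# The mean, by contrast, changes a great deal.
print(round(mean(with_typo), 2), round(mean(corrected), 2))
```

The median stays at 7 in both lists, while the mean drops from about 13.91 to about 6.55 once the typo is fixed.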

 BIG IDEA

The median is not overly affected by outliers or extreme values.


2c. Median Class
Another way to figure out a median is to use data summarized in a frequency table, which can help you find the median class.

When is it best to use a frequency table? Let's explore an example. Here is information about the number of days that the temperature was in a particular range in
Chanhassen, Minnesota in 2009:

Temperature (°F) | Frequency | Cumulative Frequency | Relative Cumulative Frequency

-10 - -1 | 3 | 3 | 0.01

0 - 9 | 8 | 11 | 0.03

10 - 19 | 25 | 36 | 0.10

20 - 29 | 39 | 75 | 0.21

30 - 39 | 30 | 105 | 0.29

40 - 49 | 51 | 156 | 0.43

50 - 59 | 46 | 202 | 0.55

60 - 69 | 39 | 241 | 0.66

70 - 79 | 80 | 321 | 0.88

80 - 89 | 40 | 361 | 0.99

90 - 99 | 4 | 365 | 1.00
You can see, for example, that eight days had a temperature of between 0 and 9 degrees Fahrenheit. Using this table, there are a couple of different ways to find not
exactly what the median temperature is, but which bin it's in.

 HINT

There are 365 days in a year, so the median is the 183rd value when the days are ordered by temperature. There are 182 days before it and 182 days after it, for a total of 365 days.

Temperature (°F) | Frequency | Cumulative Frequency | Relative Cumulative Frequency

-10 - -1 | 3 | 3 | 0.01

0 - 9 | 8 | 11 | 0.03

10 - 19 | 25 | 36 | 0.10

20 - 29 | 39 | 75 | 0.21

30 - 39 | 30 | 105 | 0.29

40 - 49 | 51 | 156 | 0.43

50 - 59 | 46 | 202 | 0.55

60 - 69 | 39 | 241 | 0.66

70 - 79 | 80 | 321 | 0.88

80 - 89 | 40 | 361 | 0.99

90 - 99 | 4 | 365 | 1.00
You can see that the 183rd day, counting from the coldest, falls in the 50-59 category. That means that 182 days were as cold or colder than that particular day, and 182 days were at least
as warm as that particular day. This works because the bins are ordered by temperature, even though the days within each bin are not. Thus, the median is somewhere in the 50's.

You can't be 100% sure exactly where in the 50's it is, but you can be sure that it's in the 50's. Notice the number 183, when you look at the cumulative frequency, falls
between the 156 and the 202. By the time you've gotten to the end of the temperatures in the 40's, you haven’t accounted for half the days in terms of ordered
temperatures. But by the time you finish the 50's, you have accounted for more than half the days, which means that the median is somewhere in the 50's.

If you look at the relative cumulative frequency column, you can see the same thing. By the time you have finished the 40's, you've accounted for less than 43% of the
data. By the time you finish the 50's, however, you will have accounted for over 55% of the data.

Where's the 50th percentile? You don't know what the number is, but again, you know it's somewhere in the 50's: 50% of days fall in or below the 50's. Therefore, you
would call the 50's the median class because you know the median is somewhere in that bin.
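
The search for the median class can be sketched as a running total over the frequency column. The code below assumes an odd number of observations (as with the 365 days here), so there is a single middle position; the variable names are illustrative.

```python
# (temperature bin, number of days) pairs from the frequency table.
bins = [
    ("-10 - -1", 3), ("0 - 9", 8), ("10 - 19", 25), ("20 - 29", 39),
    ("30 - 39", 30), ("40 - 49", 51), ("50 - 59", 46), ("60 - 69", 39),
    ("70 - 79", 80), ("80 - 89", 40), ("90 - 99", 4),
]

total = sum(freq for _, freq in bins)

# Position of the middle value when the total is odd: 183 for 365 days.
median_position = (total + 1) // 2

# Walk through the bins, keeping a cumulative frequency, and stop at the
# first bin whose cumulative frequency reaches the median position.
running = 0
median_class = None
for label, freq in bins:
    running += freq
    if running >= median_position:
        median_class = label
        break

print(median_position, median_class)  # 183 50 - 59
```

The loop stops at the 50 - 59 bin because the cumulative frequency passes from 156 to 202 there, which is the same reasoning used with the table above.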

 TERM TO KNOW

Median Class
The bin that contains the median value. This is the most precise measurement we can obtain when we are looking at data that have already been categorized.

3. Mode
There are a couple of different ways to go about determining what would be considered typical for a set of data:

Mean, which requires adding up all of the values and dividing by how many there are.
Median, which requires ordering the values from least to greatest and finding the number that's in the middle.
Mode, which is the value that appears most frequently.

In a quantitative set, the mode is the most frequently occurring number or numbers, assuming that they occur more than once.

 EXAMPLE Recall our list from above detailing the heights of the Chicago Bulls basketball team. Using this list, can you find out what height would be considered
the mode?

Player Height (inches)

Omer Asik 84

Carlos Boozer 81

Ronnie Brewer 79

Jimmy Butler 79

Luol Deng 81

Taj Gibson 81

Richard Hamilton 79

Mike James 74

Kyle Korver 79

John Lucas III 71

Joakim Noah 83

Derrick Rose 75

Brian Scalabrine 81

Marquis Teague 74

C.J. Watson 74

In our data set of heights, you can see that 81 occurs four times, and 79 also occurs four times. Unlike with means and medians, a distribution can have more than one
mode. In this case, 79 and 81 are both modes.
 TRY IT

Suppose a class has 12 students. Here are the grades from a 10-point quiz. Determine the mode.

10, 9, 6, 7, 7, 9, 9, 9, 10, 4, 7, 8

You probably realize that the mode is nine--it appears four times, and nothing else appears more than three times.
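
Counting occurrences by hand gets tedious for longer lists, so here is a minimal Python sketch that returns every mode of a non-empty quantitative data set. It follows the rule stated above that a value must occur more than once to count as a mode; the function name is an illustrative choice.

```python
from collections import Counter


def modes(values):
    """Return all modes of a non-empty list, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # No value occurs more than once, so there is no mode.
        return []
    # Keep every value that ties for the highest count.
    return sorted(v for v, c in counts.items() if c == highest)


heights = [84, 81, 79, 79, 81, 81, 79, 74, 79, 71, 83, 75, 81, 74, 74]
quiz_scores = [10, 9, 6, 7, 7, 9, 9, 9, 10, 4, 7, 8]

print(modes(heights))      # [79, 81]
print(modes(quiz_scores))  # [9]
```

Returning a list rather than a single number reflects the point that a data set can have more than one mode, or none at all.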
 TERM TO KNOW

Mode
The most frequently appearing number in a set of quantitative data or most frequently occurring category in a set of qualitative data.

3a. Qualitative Sets


You can also find the mode of a qualitative data set. In a qualitative data set, the mode is the most frequently occurring category or the largest category.

In the pie chart above, there are several large categories, but the largest category is the red one: biology. Therefore, biology would be the mode of this data set.

3b. Distributions
In a distribution that is fully graphed out, you might have something that's multi-peaked, like in the graph below.

As this graph shows, there is a gap between the two highest peaks, where the amounts decrease very precipitously and then rise again. In a distribution, we would call
both of these areas modes. The values near five and eight would both be considered modes because they are the different peaks in the distribution. It might still be called
bimodal, although in reality there is only the one mode, the highest bar at eight.

 SUMMARY

The mean is one measure of center that we can use, and it's what is meant by the term “average.” When measuring mean, it is important to consider outliers,
which are a very few unusually high or unusually low values. In the presence of outliers, the mean won't give an accurate representation of center. Sometimes, summation
notation can be used as a shortcut instead of writing the whole long string of added values.

The median identifies the middle number in a set of ordered data. If there's an even number of data values, you take the mean of those two middle numbers. Even
for data sets with extreme values, the median will still be the middle number. If the data is on a frequency table, you can find the median class, but you can't find
the exact median itself.

The mode is the most common value in a data set. In qualitative data sets, that value can be a category, and in a quantitative data set, the value will be a number.
There can be one mode, many modes (if several values tie for the most appearances, each appearing more than once), or no mode if no value appears more than one time. Modes may
also refer to the peak or peaks of distributions, even if they're not the tallest point in the distribution. If a distribution has two or more peaks, it can be called bimodal or
multimodal.

Good luck!

 TERMS TO KNOW

Mean
The "average" value of a data set. It is obtained by dividing the sum of the values by the number of values in the set.

Median
The value that is in the "middle" of a data set when the set is arranged from least to greatest.

Median Class
The bin that contains the median value. This is the most precise measurement we can obtain when we are looking at data that have already been categorized.

Mode
The most frequently appearing number in a set of quantitative data or most frequently occurring category in a set of qualitative data.

Summation Notation
A notation that uses the Greek letter sigma to state that values should be added together.

Weighted Mean/Average
A way of calculating a mean when not all the values count for the same amount. Each value is multiplied by its weight, the products are added together, and that sum is divided
by the sum of the weights.

 FORMULAS TO KNOW

Mean

\[ \text{Mean} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n} \]
Weighted Mean
by Sophia

 WHAT'S COVERED

This tutorial will explain finding the weighted mean of a data set. Our discussion breaks down as follows:

1. Calculating Weighted Mean


a. Example 1
b. Example 2

1. Calculating Weighted Mean


Sometimes, all of the values in your data set will not be weighted the same. If you are calculating the mean of a data set where there are weighted values, you will need to
account for these differences by using a weighted mean/average.

When calculating the weighted mean, it is important to keep the following formula in mind:

 FORMULA

Weighted Mean
$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$

Sometimes in an academic course, exams are weighted for a certain percentage of the grade, and the final is weighted for a greater percentage.

 TERM TO KNOW

Weighted Mean/Average
A way of calculating a mean when not all the values count for the same amount. Each value is multiplied by its weight, the products are added together, and the
sum is divided by the sum of the weights.

1a. Example 1
Suppose a statistics course has four tests which comprise the course grade, but the final exam is weighted three times as much as the others.

A student receives the following scores:

Exam 1: 78
Exam 2: 83
Exam 3: 82
Final: 94

What is this student's grade for the course? How do you find the mean?

You would first multiply each of the tests by its weight. Count each of the first three tests as one test. However, count the final exam as three tests because it’s
weighted three times as much:

$\frac{78(1) + 83(1) + 82(1) + 94(3)}{1 + 1 + 1 + 3} = \frac{525}{6} = 87.5$

This weighted average, or the weighted mean, is 87.5.

 BIG IDEA

It's important to note that you are counting these not as four tests, but essentially more like six tests, because the final one counts for three of the others. Therefore,
you divide by 6 in the end.
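
The calculation in Example 1 can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_mean` and the variable names are made up for this sketch, and the weights 1, 1, 1, 3 encode the statement that the final counts three times as much as each other test.

```python
# Scores from Example 1; the final exam counts three times as much.
scores = [78, 83, 82, 94]
weights = [1, 1, 1, 3]


def weighted_mean(values, weights):
    """Multiply each value by its weight, add, then divide by the sum of the weights."""
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)


course_grade = weighted_mean(scores, weights)
print(course_grade)  # 87.5
```

Note that the divisor is the sum of the weights (6), not the number of tests (4).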

1b. Example 2
Suppose a student named Sam is taking a class where each of the grades has different importance. Participation is worth 10% of Sam's grade, homework is worth 25%,
quizzes are worth 50%, and tests are worth 15%. This is indicated with the following table:

Assignment Weight

Participation 10%

Homework 25%

Quizzes 50%

Tests 15%
In this class, Sam earned the following scores:

Assignment Score

Participation 100

Homework 50

Quizzes 70

Tests 93
To calculate the weighted mean, we need to multiply each score that Sam received by the corresponding weight, add the values together, and divide by the total weight.

$\frac{100(10) + 50(25) + 70(50) + 93(15)}{10 + 25 + 50 + 15} = \frac{7145}{100} = 71.45$

The weighted mean is 71.45, which tells us that Sam received a final grade of 71.45.
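
As a rough check of Example 2, the same idea can be written with percentage weights. The dictionary names below are illustrative only; the weights are kept as whole-number percentages so that the total weight is 100.

```python
# Weights (in percent) and Sam's scores from Example 2.
weights = {"Participation": 10, "Homework": 25, "Quizzes": 50, "Tests": 15}
scores = {"Participation": 100, "Homework": 50, "Quizzes": 70, "Tests": 93}

# Multiply each score by its weight and add the products.
weighted_sum = sum(scores[name] * weights[name] for name in weights)

# Divide by the total weight (the percentages add up to 100).
total_weight = sum(weights.values())
final_grade = weighted_sum / total_weight

print(round(final_grade, 2))  # 71.45
```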

 SUMMARY

The mean, or what is meant by the term "average," is one measure of center that we can use. Weighted averages can be found by multiplying each value times its
weight and counting it, essentially, that many times.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Mean
The "average" value of a data set. It is obtained by dividing the sum of the values by the number of values in the set.

Summation Notation
A notation that uses the Greek letter sigma to state that values should be added together.

Weighted Mean/Average
A way of calculating a mean when not all the values count for the same amount. Each value is multiplied by its weight, the products are added together, and the sum is divided
by the sum of the weights.

 FORMULAS TO KNOW

Mean
$\bar{x} = \frac{\sum x_i}{n}$

Weighted Mean
$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$

Measures of Center
by Sophia

 WHAT'S COVERED

This tutorial addresses the question, “Which measure of center should I use?” Our discussion breaks down as follows:

1. Measures of Center
a. Mean
b. Median
c. Mode

1. Measures of Center
There are multiple measures of center:

The mean
The median
The mode

Each of these measures was previously covered in its own tutorial. However, you may be wondering which measure of center you should use for a given situation.

1a. Mean
The mean is the default measurement that you should use if there's no compelling reason to use anything else.

The mean is the best measurement to use if possible because it's the most versatile measure of center and therefore, the most appropriate one in the vast majority of
cases. However, there are certain situations in which the mean is not an appropriate gauge for center. In those cases, you should use the median. You will rarely use the
mode.

 TERM TO KNOW

Mean
The average number in a quantitative data set; the sum of all the values, divided by the number of values.

1b. Median
Sometimes the mean is a poor representation of where the center really is, so the median becomes a better measure of center.
Consider this table of one company’s employees and their salaries:

Title Number of Employees Salary

Boss 1 $200,000

Manager 3 $55,000

Shift Worker 8 $42,000


The mean of this set of data is about $58,000, but how many of the employees actually make more than $58,000 and how many make less than $58,000?

Eleven of the 12 people make less than $58,000. Only one employee makes more than that, and that employee makes substantially more. The boss’s $200,000 salary is
an outlier. Therefore, the mean doesn’t make very much sense as a measurement of the middle.

In this case, a better measure of center would be the median. If you took all the salaries and wrote them out from least to greatest, the median (the one in the middle)
would be $42,000. That more accurately describes what a typical worker makes.

$42,000
$42,000
$42,000
$42,000
$42,000
$42,000
$42,000

$42,000
$55,000
$55,000
$55,000
$200,000
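
The salary comparison can be reproduced with Python's standard `statistics` module. This is a small sketch using the salary table above; the variable names are chosen for illustration.

```python
from statistics import mean, median

# One boss, three managers, eight shift workers (from the salary table).
salaries = [200_000] * 1 + [55_000] * 3 + [42_000] * 8

salary_mean = mean(salaries)      # pulled upward by the $200,000 salary
salary_median = median(salaries)  # middle of the ordered list

# Count how many employees earn less than the mean.
below_mean = sum(1 for s in salaries if s < salary_mean)

print(round(salary_mean), salary_median, below_mean)
```

The mean comes out a little above $58,000 with 11 of the 12 salaries below it, while the median stays at $42,000.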

 BIG IDEA

In the presence of outliers, which are a few values much higher or much lower than the rest, the mean won't give an accurate representation of center. Use the median in cases like
these.
 TERM TO KNOW

Median
The value that is in the "middle" of a data set when the set is arranged from least to greatest.

1c. Mode
When should you use mode? The mode isn’t used very often. It's used mainly for qualitative data sets, to determine the category that has the most values in it. Consider
the graph below:

In this case, the mode is biology, the red section.

The mode is also used to describe the peak of a distribution, such as in a histogram.
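
For qualitative data, finding the mode amounts to finding the category with the largest count. The sketch below assumes the graph shows the course enrollment counts from the Bar Graphs and Pie Charts tutorial earlier in this unit; the dictionary name is illustrative.

```python
# Enrollment counts by course (assumed to be the data behind the graph).
course_counts = {
    "Economics": 331,
    "Biology": 435,
    "Chemistry": 124,
    "Statistics": 248,
    "Psychology": 311,
    "Sociology": 248,
    "Spanish": 207,
    "History": 166,
}

# The mode is the category with the highest frequency.
mode_course = max(course_counts, key=course_counts.get)
print(mode_course)  # Biology
```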

 TERM TO KNOW

Mode
The most frequently appearing number in a set of quantitative data or most frequently occurring category in a set of qualitative data.

 SUMMARY

The mean is our default measure of center. It's the preferred one and the most versatile. However, sometimes if we have outliers or a few values that can skew the
mean towards them--either on the high side or the low side--the mean won't accurately represent center anymore. In those cases, the median should be used
instead. Typically we reserve the mode for qualitative distributions.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Mean
The average number in a quantitative data set; the sum of all the values, divided by the number of values.

Median
The value that is in the "middle" of a data set when the set is arranged from least to greatest.

Mode
The most frequently appearing number in a set of quantitative data or most frequently occurring category in a set of qualitative data.

Measures of Variation
by Sophia

 WHAT'S COVERED

This tutorial will explain different measures of variation and why they are necessary. Our discussion breaks down as follows:

1. Spread
2. Types of Variation

1. Spread
Different data sets will have different measures of variation; therefore, it is important to understand this spread when examining data.

Sometimes it's not sufficient to report simply an average, or a measure of center, when you're talking about a data set.

 EXAMPLE Suppose you were going to compare and contrast the high, low, and average temperatures in the month of January, for Buffalo Grove, Illinois versus
Valdez, Alaska.
City Low High Average

Buffalo Grove 12° 28° 21°

Valdez 15° 25° 21°


The average for both of these cities is 21 degrees in January. However, if you look at the typical high temperature in Buffalo Grove, it's a little higher than the typical
high temperature in Valdez. In addition, the low temperature in Buffalo Grove is a little bit lower than the low temperature in Valdez.
Buffalo Grove's temperatures, although they average the same as Valdez, are a bit more variable, which means that the data is spread out a bit more. It gets a little
colder at night and a little warmer in the day. Valdez's temperatures seem a little bit more consistent. The data is not as spread.
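
One simple way to put a number on this difference is to subtract each city's low from its high. The sketch below does that with the table values; the names are illustrative, and the high-minus-low gap is just one possible way to describe spread here.

```python
# (low, high) January temperatures in degrees, from the table above.
january = {"Buffalo Grove": (12, 28), "Valdez": (15, 25)}

# Gap between the typical high and the typical low for each city.
spread = {city: high - low for city, (low, high) in january.items()}

print(spread)  # {'Buffalo Grove': 16, 'Valdez': 10}
```

Both cities share the same average, but Buffalo Grove's gap is wider, which matches the observation that its temperatures are more variable.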

 BIG IDEA

Because the different data sets have such different spreads, it would be inappropriate to simply compare them based on their averages.
 TERM TO KNOW

Measures of Variation/Spread
Statistical measures that indicate how close values are to the center of the distribution. For every measure of variation, a large number indicates the data are very
spread out, and a small number indicates the values are very close together.

2. Types of Variation
It is important to understand how much the values vary around the measure of center (whichever one you are using). Just like measures of center, there are several
measures of spread:

Range
Standard deviation
Interquartile range

All of these are covered in more detail in other tutorials.

Whatever measure of variation you use, high and low values are indicative of different things:

A high value means that the data set is not consistent, that it's more spread out.
A low value indicates that the values are not very spread out, that they're tightly clustered together. When the data does deviate from the center, it's not by a significant
amount.

You can have measures of spread or measures of variation that are zero, which would indicate that all the data values are, in fact, the same.

 SUMMARY

Variation indicates the extent to which the data set values are close together. There are many ways to measure variation, and all of those methods have a simple
rule: a high value means that the data are more varied and a smaller value means that the data are less varied. Variation and spread are synonyms that will be
used fairly extensively throughout these tutorials.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Measures of Variation/Spread
Statistical measures that indicate how close values are to the center of the distribution. For every measure of variation, a large number indicates the data are very spread
out, and a small number indicates the values are very close together.

Range and Interquartile Range (IQR)
by Sophia

 WHAT'S COVERED

This tutorial will explain the range and interquartile range (IQR). Our discussion breaks down as follows:

1. Range
2. Interquartile Range

1. Range
One measure of variation is range. The range is one of the simplest ways to calculate variation. It is calculated by simply subtracting the minimum value from the maximum
value.

 FORMULA

Range
$\text{range} = \text{maximum} - \text{minimum}$

 EXAMPLE This chart shows the height of the Chicago Bulls basketball team for a particular year:
Height of Chicago Bulls Players

Omer Asik 84

Carlos Boozer 81

Ronnie Brewer 79

Jimmy Butler 79

Luol Deng 81

Taj Gibson 81

Richard Hamilton 79

Mike James 74

Kyle Korver 79

John Lucas III 71

Joakim Noah 83

Derrick Rose 75

Brian Scalabrine 81

Marquis Teague 74

C.J. Watson 74

It's easy to see from the list that the minimum value is 71 and the maximum value is 84. The range is actually the easiest measure of spread or variation to find: take the
maximum value (in this case, the tallest person) and subtract the minimum value (in this case, the shortest person).

Therefore, 84 minus 71 equals 13 inches. This means that every individual on the team falls within a 13-inch range.
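
The same subtraction can be done in Python with the built-in `max` and `min` functions. The list below is the heights from the table, in inches; the variable names are illustrative.

```python
# Heights of the Chicago Bulls players, in inches, in table order.
heights = [84, 81, 79, 79, 81, 81, 79, 74, 79, 71, 83, 75, 81, 74, 74]

# Range = maximum value minus minimum value.
height_range = max(heights) - min(heights)
print(height_range)  # 13
```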

 TERM TO KNOW

Range
The difference between the largest and smallest number in a data set.

2. Interquartile Range

Range and interquartile range are similar ideas in that they're both calculated by subtraction, so they're not particularly difficult to find. While they are calculated
similarly, they measure different aspects of variation.

The interquartile range, also abbreviated IQR, is another measure of spread, but it's median based. To review, the median is the middle number of an ordered data set.

The IQR represents the range in which the middle 50% of the data points lie. Finding the interquartile range takes a few steps.

 STEP BY STEP

Step 1: First, order the data set from smallest to largest and then work your way in towards the center until you find the median.

71, 74, 74, 74, 75, 79, 79, 79, 79, 81, 81, 81, 81, 83, 84
The median, in this case, is 79.

Step 2: The median splits the ordered data into two halves, a low half and a high half. Find the middle number of each of those halves. These two numbers, together with the
median, are called the quartiles.

The first quartile, or Q1, is the point at or below which 1/4 of the data falls.

The median, which can also be called the second quartile (Q2), is the point at or below which half the data falls.

The third quartile, or Q3, is the point at or below which 3/4 of the data falls.

With this data set:

71, 74, 74, 74, 75, 79, 79, 79, 79, 81, 81, 81, 81, 83, 84
Q1 = 74
Q2 (median) = 79
Q3 = 81

Step 3: Finally, to calculate the interquartile range, find the difference between the third and first quartiles.

 FORMULA

Interquartile Range
$\text{IQR} = Q_3 - Q_1$

Going back to our example, the IQR is calculated as Q3 minus Q1, or 81 minus 74, which equals 7 inches. This means that the middle half of the data set falls within a 7-
inch range, whereas the entire data set falls within a 13-inch range.
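
The three steps can be sketched in Python as follows. The helper name `quartiles` is made up for this sketch, and it assumes the convention used in the example above: when the number of values is odd, the median itself is left out of both halves. Other software may use a slightly different quartile convention.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(data)          # Step 1: order the data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # low half
    # For an odd count, skip the median itself (assumed convention).
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


heights = [84, 81, 79, 79, 81, 81, 79, 74, 79, 71, 83, 75, 81, 74, 74]
q1, q2, q3 = quartiles(heights)     # Step 2: find the quartiles
iqr = q3 - q1                       # Step 3: subtract Q1 from Q3

print(q1, q2, q3, iqr)  # 74 79 81 7
```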

Visually, the IQR is the box on a box plot, shown below:

The range gives the entire spread of the data set lowest to highest whereas the IQR gives the range of the middle 50%.

 HINT

The advantage of using the IQR over the range is that outliers, which disproportionately affect the range, do not affect the IQR.
 TERM TO KNOW

Interquartile Range
The difference between the third and first quartiles. It represents the range in which the middle 50% of the data points lie.

 SUMMARY

The range is not the most useful measure of variation, but it is the easiest to calculate. The interquartile range is more useful: it measures the range of the
middle 50% of the data, the most typical values. It's a useful measure of spread for skewed distributions or distributions with outliers. In fact, you should use
IQR as your measure of variation when those factors are a concern. Because the IQR is based on finding the median, it should only be used as the measure of

spread when the median is the measure of center. You should not, for instance, use the mean as the measure of center and then report IQR as the measure of
spread.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Interquartile Range
The difference between the third and first quartiles. It represents the range in which the middle 50% of the data points lie.

Range
The difference between the largest and smallest number in a data set.

 FORMULAS TO KNOW

Interquartile Range
$\text{IQR} = Q_3 - Q_1$

Range
$\text{range} = \text{maximum} - \text{minimum}$

Standard Deviation
by Sophia

 WHAT'S COVERED

This tutorial will explain the concept of standard deviation. Our discussion breaks down as follows:

1. Standard Deviation
2. Calculating Standard Deviation
3. Interpreting Standard Deviation
4. Using Technology

1. Standard Deviation
Standard deviation is a measure of variation that we use quite often in statistics. Standard deviation measures spread. You will interpret the standard deviation as the
typical amount by which you would expect data values to differ from the mean.

The full name for standard deviation is "standard deviation from the mean." If you break that down, "standard" means typical, and "deviation" means that you expect a value to
be off from the mean, just by chance. So "standard deviation from the mean" describes how far a typical data value lies from the mean.

Here is what the formula to find the standard deviation of a sample looks like:

 FORMULA

Standard Deviation of a Sample
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

As complicated as the formula looks, standard deviation is the preferred measure of spread, provided that the distribution you’re looking at is roughly symmetric and doesn’t have
outliers. If the data is not symmetric or has outliers, you will use a different measure of spread, the interquartile range.

 TERM TO KNOW

Standard Deviation
A typical amount by which we would expect a data point to differ from the mean. Typically, about half to two-thirds of the data points fall within one standard
deviation of the mean.

2. Calculating Standard Deviation


When calculating the standard deviation, you follow these steps:

 STEP BY STEP

Step 1: Subtract the mean from each value.


Step 2: Square those values, resulting in (x minus mean)².
Step 3: Use the sigma notation, which is the same as summation notation, to add these values.
Step 4: Divide that sum by (n minus 1).
Step 5: Take the square root.

 EXAMPLE These are the heights of the Chicago Bulls basketball team.

Height of Chicago Bulls Players

Omer Asik 84

Carlos Boozer 81

Ronnie Brewer 79

Jimmy Butler 79

Luol Deng 81

Taj Gibson 81

Richard Hamilton 79

Mike James 74

Kyle Korver 79

John Lucas III 71

Joakim Noah 83

Derrick Rose 75

Brian Scalabrine 81

Marquis Teague 74

C.J. Watson 74

When considering standard deviation, each of these heights is an xᵢ (x sub i) in the original formula. Let's follow the five steps to calculate the standard deviation:

 STEP BY STEP

Step 1: The first step is to subtract the mean, which in this case, means you have to first calculate the mean. A thorough explanation of how to calculate the mean is
covered in another tutorial, so for today, know that the mean is around 78.33 inches. Therefore, subtract 78.33 from each of these height values.

84 minus 78.33 is 5.67, 81 minus 78.33 is 2.67, etc.

84 5.67

81 2.67

79 0.67

79 0.67

81 2.67

81 2.67

79 0.67

74 -4.33

79 0.67

71 -7.33

83 4.67

75 -3.33

81 2.67

74 -4.33

74 -4.33

Step 2: Square those values, resulting in (x minus mean)².

5.67² is 32.15, 2.67² is 7.13, etc.

84 5.67 32.15

81 2.67 7.13

79 0.67 0.45

79 0.67 0.45

81 2.67 7.13

81 2.67 7.13

79 0.67 0.45

74 -4.33 18.75

79 0.67 0.45

71 -7.33 53.73

83 4.67 21.81

75 -3.33 11.09

81 2.67 7.13

74 -4.33 18.75

74 -4.33 18.75

Step 3: Use this sigma notation, which is the same as summation notation, to add these values up. They sum up to 205.35.

84 5.67 32.15

81 2.67 7.13

79 0.67 0.45

79 0.67 0.45

81 2.67 7.13

81 2.67 7.13

79 0.67 0.45

74 -4.33 18.75

79 0.67 0.45

71 -7.33 53.73

83 4.67 21.81

75 -3.33 11.09

81 2.67 7.13

74 -4.33 18.75

74 -4.33 18.75

Sum 205.35

Step 4: Divide that sum by n minus 1. In this case, n is 15 because there were 15 players; therefore n minus 1 equals 14. Dividing our sum by 14 equals 14.67.

 HINT

If you stopped here, this 14.67 number would measure another kind of variation called "variance." Variance is equivalent to the square of the standard deviation.
Variance isn’t used very often, mainly because it is still a squared value. The variance, in this case, would be 14.67 square inches rather than inches.
 FORMULA

Variance of a Sample
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Step 5: The final step is to take the square root of that number, which equals 3.83.
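
The five steps can also be traced in Python. This is a sketch with illustrative variable names. Because the code keeps full precision instead of rounding each difference to two decimals, its sum of squares is about 205.33 rather than the 205.35 in the table; the variance and standard deviation still round to 14.67 and 3.83.

```python
from math import sqrt

heights = [84, 81, 79, 79, 81, 81, 79, 74, 79, 71, 83, 75, 81, 74, 74]
n = len(heights)
mean = sum(heights) / n                     # about 78.33

deviations = [x - mean for x in heights]    # Step 1: subtract the mean
squared = [d ** 2 for d in deviations]      # Step 2: square each difference
total = sum(squared)                        # Step 3: add them up
variance = total / (n - 1)                  # Step 4: divide by n minus 1
std_dev = sqrt(variance)                    # Step 5: take the square root

print(round(mean, 2), round(total, 2), round(variance, 2), round(std_dev, 2))
```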

Here is an overview of the entire calculation:

84 5.67 32.15

81 2.67 7.13

79 0.67 0.45

79 0.67 0.45

81 2.67 7.13

81 2.67 7.13

79 0.67 0.45

74 -4.33 18.75

79 0.67 0.45

71 -7.33 53.73

83 4.67 21.81

75 -3.33 11.09

81 2.67 7.13

74 -4.33 18.75

74 -4.33 18.75

Sum 205.35

Variance: 205.35 ÷ 14 ≈ 14.67
Standard deviation: √14.67 ≈ 3.83

For this data set, you would expect a good portion of the heights to be within 3.83 inches of the mean, 78.33.

 BIG IDEA

Variance and standard deviation share an important relationship. The variance is the square of the standard deviation, and the standard deviation is the square root of
the variance.
 TERM TO KNOW

Variance
The square of standard deviation. While it has some uses in statistics, it is not a practical unit of measurement. It is calculated the same way as standard deviation,
but without the square root.

3. Interpreting Standard Deviation


Standard deviation is a typical amount by which we would expect values to vary around the mean.

 TRY IT

Considering the original list of heights, which fall within one standard deviation of the mean, meaning between 74.50 and 82.16 inches (3.83 below and above 78.33)?

Height of Chicago Bulls Players

Omer Asik 84 ✘

Carlos Boozer 81 ✔

Ronnie Brewer 79 ✔

Jimmy Butler 79 ✔

Luol Deng 81 ✔

Taj Gibson 81 ✔

Richard Hamilton 79 ✔

Mike James 74 ✘

Kyle Korver 79 ✔

John Lucas III 71 ✘

Joakim Noah 83 ✘

Derrick Rose 75 ✔

Brian Scalabrine 81 ✔

Marquis Teague 74 ✘

C.J. Watson 74 ✘

You'll notice that 9 of the 15 players (60%, or roughly ⅔) had heights in this range. That is how you interpret the standard deviation.
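
The check marks above can be reproduced with a short Python sketch that counts the heights lying within one standard deviation of the mean. It uses the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, stdev

heights = [84, 81, 79, 79, 81, 81, 79, 74, 79, 71, 83, 75, 81, 74, 74]

m = mean(heights)    # about 78.33
s = stdev(heights)   # sample standard deviation, about 3.83

# Keep the heights whose distance from the mean is at most one standard deviation.
within = [h for h in heights if abs(h - m) <= s]
share = len(within) / len(heights)

print(len(within), share)  # 9 0.6
```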

4. Using Technology
As you may have noticed, calculating the standard deviation can require many steps, which could result in calculation errors. Because of this, the standard deviation is
almost always found on a calculator or a spreadsheet, or some kind of applet on the internet. Typically, it is not solved by hand.

If you're frustrated with the calculation you just practiced, you can use your calculator or a spreadsheet in the future.

Here is how to use spreadsheets to calculate standard deviation:

 STEP BY STEP

Step 1: Enter your list of data into the spreadsheet.


Step 2: For the formula, type “=stdev(”.
Step 3: Highlight the list that you want to find the standard deviation for, close the parentheses, and hit "Enter."

This spreadsheet formula finds the same number as the calculation you did by hand: 3.83.
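
If you prefer a programming language to a spreadsheet, Python's standard `statistics` module offers a comparable function. In the sketch below, `stdev` divides by n minus 1 (the sample formula used in this tutorial), while `pstdev` divides by n and is shown only for contrast.

```python
from statistics import pstdev, stdev

heights = [84, 81, 79, 79, 81, 81, 79, 74, 79, 71, 83, 75, 81, 74, 74]

sample_sd = stdev(heights)       # divides by n - 1, like the hand calculation
population_sd = pstdev(heights)  # divides by n, so it is slightly smaller

print(round(sample_sd, 2), round(population_sd, 2))  # 3.83 3.7
```

Whichever tool you use, it is worth checking whether it computes the sample or the population version.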

 BIG IDEA

If you have Excel or another spreadsheet program, there should be a standard deviation formula that you can use.

 SUMMARY

Standard deviation is important to interpret, but it can be tedious to calculate by hand. For this reason, it is typically found using technology. It's a
measure of how far we would expect a typical data point to be from the mean. Standard deviation is the square root of a value called variance, which is a little bit
easier to calculate but not as useful as the standard deviation. Since the standard deviation is based on the mean, it should only be
reported as the measure of spread when the mean is the reported measure of center. You shouldn't mix and match the standard deviation with the
median or the interquartile range.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Standard Deviation
A typical amount by which we would expect a data point to differ from the mean. Typically, about half to two-thirds of the data points fall within one standard deviation of
the mean.

Variance

The square of standard deviation. While it has some uses in statistics, it is not a practical unit of measurement. It is calculated the same way as standard deviation, but
without the square root.

 FORMULAS TO KNOW

Standard Deviation of a Sample
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Variance of a Sample
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Five Number Summary and Boxplots
by Sophia

 WHAT'S COVERED

This tutorial will discuss the five number summary of a data set and explain box-and-whisker plots. Our discussion breaks down as follows:

1. Five Number Summary


2. Obtaining the Five Numbers
3. Box-and-Whisker Plots/Boxplots
4. Using Boxplots: Comparing Two or More Distributions

1. Five Number Summary


The five number summary takes larger data sets and makes them more manageable and easier to understand. By breaking down large data sets from many numbers to
just five, this method can help to summarize the center and variability.

The five number summary consists of five parts:

Minimum
Q1
Median
Q3
Maximum

 TERM TO KNOW

Five Number Summary


A brief overview of a data set consisting of the minimum, the first quartile, the median, the third quartile, and the maximum.

2. Obtaining the Five Numbers


Two of the numbers in the five number summary are the smallest and largest--the minimum and the maximum.

 EXAMPLE Suppose you have a list of the heights of the Chicago Bulls basketball team:

Height of Chicago Bulls Players

Omer Asik 84

Carlos Boozer 81

Ronnie Brewer 79

Jimmy Butler 79

Luol Deng 81

Taj Gibson 81

Richard Hamilton 79

Mike James 74

Kyle Korver 79

John Lucas III 71

Joakim Noah 83

Derrick Rose 75

Brian Scalabrine 81

Marquis Teague 74

C.J. Watson 74

It's easy to see that the shortest person on the team is 71 inches tall, and the tallest person on the team is 84 inches tall. Those are two of the numbers in the five number
summary. The three remaining numbers will be based on the median.
The median measures the center of a data set; it's the middle of an ordered set of data. Currently, this is alphabetical by the last name, so it needs to be rearranged from
least to greatest height order. We can then see that the middle number, 79, is the median.

71 74 74 74 75 79 79 [79] 79 81 81 81 81 83 84
Median = 79 (shown in brackets)

Dividing at that point, you are left with two groups: a low group and a high group. Next, take the median of each of those data sets. Now you have 74 in the low group, 81
in the high group, and 79 in the middle.

71 74 74 [74] 75 79 79 [79] 79 81 81 [81] 81 83 84
Q1 = 74, Median (Q2) = 79, Q3 = 81 (each shown in brackets)

In this data set, 74 is the first quartile, 79 is the second quartile or the median, and 81 is the third quartile.

Now, the five number summary consists of the following five numbers.

Minimum
First quartile (Q1)
Second Quartile/Median (Q2)
Third quartile (Q3)
Maximum.

71 74 74 74 75 79 79 79 79 81 81 81 81 83 84
Minimum ~25% Q1 ~25% Median ~25% Q3 ~25% Maximum

A benefit of this particular summary is that about 25% of the data falls within each of these bands.
You'll notice that:

25% of the data falls at or below the first quartile


50% falls at or below the median
75% falls at or below the third quartile
All the data falls at or below the maximum

Also, you can see where data values are concentrated within the data set. For instance, there are about as many
data values between 79 and 81 as there are between 74 and 79. Although the number of data values is about the same, the 79 to 81 band is narrower than the 74 to
79 band. Therefore, you can tell the data are more clustered together in the 79 to 81 band than in the 74 to 79 band.
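
The five number summary and the width of each band can be computed with a short Python sketch. The function name `five_number_summary` is illustrative, and the quartiles use the same median-of-halves convention as the example (for an odd count, the median is left out of both halves).

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the median itself (assumed convention).
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


heights = [84, 81, 79, 79, 81, 81, 79, 74, 79, 71, 83, 75, 81, 74, 74]
summary = five_number_summary(heights)

# Width of each of the four bands between consecutive summary values.
band_widths = [b - a for a, b in zip(summary, summary[1:])]

print(summary)      # (71, 74, 79, 81, 84)
print(band_widths)  # [3, 5, 2, 3]
```

The band from 79 to 81 is only 2 inches wide while the band from 74 to 79 is 5 inches wide, which is the clustering described above.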

 TERMS TO KNOW

Quartiles
The values that divide the data set into four equal partitions.

First/Lower Quartile
The number at which approximately 25% of the data set falls at or below that value.

Second Quartile/Middle Quartile/Median


The number at which approximately 50% of the data set falls at or below that value.

Third/Upper Quartile
The number at which approximately 75% of the data set falls at or below that value.

3. Box-and-Whisker Plots/Boxplots

Boxplots are also sometimes called box-and-whisker plots. A boxplot is a way to graphically display the five number summary for a data set. It is composed of a box, which
contains the middle 50% of the values, and whiskers, which extend out to the maximum and minimum values.

To create a boxplot, follow the simple steps below:

 STEP BY STEP

Step 1: Draw an axis. It can be horizontal or vertical.


Step 2: Scale the axis with equal increments.
Step 3: Make a mark to identify the five numbers from the five number summary.
Step 4: Draw a box from the first quartile to the third quartile. Draw a whisker from Q1 to the minimum and from Q3 to the maximum.

 EXAMPLE Refer to the chart above of the heights of the Chicago Bulls basketball team. Recall that the five number summary consists of:
Minimum: 71
Q1: 74
Median: 79
Q3: 81
Maximum: 84

So, how do you put this information into a boxplot?

 STEP BY STEP

Step 1: Draw an axis. It can be horizontal or vertical.


Step 2: Scale the axis with equal increments. Here, the graph runs from the lowest number, 71, to the highest number, 84.

Step 3: Make a mark to identify the five numbers from the five number summary: 71, 74, 79, 81, and 84.

Step 4: Draw a box from the first quartile to the third quartile. The box shows where the middle 50% of the data lies. Then, about 25% of the data falls in the
"whisker" to the left side, and about 25% of the data falls in the "whisker" to the right side. This is why it's sometimes called a box-and-whisker plot.

 TERM TO KNOW

Boxplot/Box-and-Whisker Plot
A graphical distribution of the five number summary. The "box" in the middle contains the middle 50% of the values, and the "whiskers" extend out to the maximum
and minimum values from the quartiles.
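
To make the four drawing steps concrete, here is a rough text-only sketch that draws a boxplot with one character per inch. The function `ascii_boxplot` and its symbols are invented for this illustration, and it assumes whole-number values; a real plot would normally be drawn by hand or with graphing software.

```python
def ascii_boxplot(minimum, q1, med, q3, maximum):
    """Render a one-line text boxplot, one character per unit on the axis."""
    # Steps 1 and 2: the axis is a row of equally spaced character positions.
    row = []
    for x in range(minimum, maximum + 1):
        # Step 4: '=' fills the box from Q1 to Q3, '-' fills the whiskers.
        row.append("=" if q1 <= x <= q3 else "-")
    # Step 3: mark the five numbers from the five number summary.
    row[0] = "|"
    row[-1] = "|"
    row[q1 - minimum] = "["
    row[q3 - minimum] = "]"
    row[med - minimum] = "M"
    return "".join(row)


plot = ascii_boxplot(71, 74, 79, 81, 84)
print(plot)  # |--[====M=]--|
```

The brackets mark Q1 and Q3, `M` marks the median, and the vertical bars at each end mark the minimum and maximum.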

4. Using Boxplots: Comparing Two or More Distributions


You can use boxplots to compare two distributions. For instance, if you were comparing the heights of girls and boys, you might say that
the spread, or variation, in the girls' heights is much less than the variation in the boys' heights.

You can see this variation not only in the width of the boxes but also in the total width from the minimum to the maximum in each of these two data sets. Therefore, you can
use boxplots as sort of a summary distribution for the boys and the girls.

 SUMMARY

The five number summary is a brief overview of a data set consisting of the minimum, the first quartile, the median, the third quartile, and the maximum. It allows
us to understand where clusters of data points might be and where the data might be more spread out. Boxplots allow you to display, visually, the five number
summary. You can interpret a boxplot to see where the data points are close together and where the data points are further apart. With boxplots, you can analyze
for data skews or look for symmetry. You can use multiple boxplots on the same set of axes to compare two or more distributions.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Boxplot/Box-and-Whisker Plot
A graphical distribution of the five number summary. The "box" in the middle contains the middle 50% of the values, and the "whiskers" extend out to the maximum and
minimum values from the quartiles.

First/Lower Quartile
The number at which approximately 25% of the data set falls at or below that value.

Five Number Summary


A brief overview of a data set consisting of the minimum, the first quartile, the median, the third quartile, and the maximum.

Quartiles
The values that divide the data set into four equal partitions.

Second Quartile/Middle Quartile/Median


The number at which approximately 50% of the data set falls at or below that value.

Third/Upper Quartile
The number at which approximately 75% of the data set falls at or below that value.

Outliers and Modified Boxplots
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of outliers and modified boxplots. Our discussion breaks down as follows:

1. Outliers
2. The 1.5xIQR Rule
3. Boxplots

1. Outliers
You may recall that outliers are values that are far outside the pattern established by the rest of the data. They're either very high or very low in comparison to the rest of
the data set.

Boxplots, introduced in another tutorial, are a way to graphically display the five number summary for a data set. This tutorial will present a modified version of boxplots so
that it is easier to observe outliers in them.

 EXAMPLE Here is a set of test scores.

90, 98, 89, 88, 46, 90, 91, 84, 94

Almost everyone scored in the 80's or 90's, except for one student, who scored a 46. That student is an outlier.
 TERMS TO KNOW

Outlier
A point that is so large or small as to be unusual, given the rest of the data points.

Modified Boxplot
A graphical display showing a modified version of the five number summary. If a distribution has outliers, then the "whiskers" only extend to the highest and lowest
points that are not outliers.

2. The 1.5xIQR Rule


To make it easier to find outliers, there is a mathematical rule for determining whether a point is an outlier or not. This is called the “1.5xIQR rule.” IQR stands for
Interquartile Range.

So, how do you use the 1.5xIQR method?

 STEP BY STEP

Step 1: Find the quartiles of the data set.


Step 2: Find the interquartile range (IQR).
Step 3: If a point is more than 1.5 IQRs below the first quartile or more than 1.5 IQRs above the third quartile, then it is an outlier.

 EXAMPLE Consider the data set of test scores from above.

90 98 89 88 46 90 91 84 94

Step 1: First, find the quartiles of the data set. To do this, order the data from least to greatest, find the median, and then find the medians of the lower and upper halves of
the data.

46 84 88 89 90 90 91 94 98

Q1 = 86, Median = 90, Q3 = 92.5

The median of this data set is 90. The first quartile is the median of the lower half (46, 84, 88, 89), which falls between 84 and 88, at 86. The third quartile is the median of the
upper half (90, 91, 94, 98), which falls between 91 and 94, at 92.5.

Step 2: Next, find the interquartile range, or IQR. The interquartile range is the distance between the first and third quartiles. The difference between 92.5 and 86 is 6.5.

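Step 3: Any point that is more than 1.5 IQRs below the first quartile or more than 1.5 IQRs above the third quartile will be considered an outlier.

1.5 IQR below the first quartile: Q1 - 1.5 × IQR = 86 - 1.5(6.5) = 86 - 9.75 = 76.25

1.5 IQR above the third quartile: Q3 + 1.5 × IQR = 92.5 + 1.5(6.5) = 92.5 + 9.75 = 102.25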

This indicates that any test score higher than 102.25 points would be considered an outlier on the high side. Anything below 76.25 will be considered an outlier on the low
side.
Of the test scores, only 46 falls outside this range, so this test score would be an outlier.
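
If you would like to check these steps with a short program, here is a minimal Python sketch. It assumes the same "median of each half" method for quartiles that is used in this example; the function names are made up for illustration, and many software packages compute quartiles slightly differently, so their fences may not match a hand calculation exactly.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    When the number of values is odd, the overall median is left out
    of both halves, as in the worked example.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


def iqr_outliers(values):
    """Return (lower fence, upper fence, outliers) under the 1.5xIQR rule."""
    q1, _, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # A point is flagged only if it lies strictly beyond a fence.
    outliers = [x for x in values if x < low_fence or x > high_fence]
    return low_fence, high_fence, outliers


scores = [90, 98, 89, 88, 46, 90, 91, 84, 94]
low_fence, high_fence, outliers = iqr_outliers(scores)
print(low_fence, high_fence, outliers)  # 76.25 102.25 [46]
```

The printed fences match the hand calculation, and 46 is the only score flagged.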

 TRY IT

Suppose that you have the data set of all house prices for homes purchased in Albuquerque, New Mexico, from February to April in 1993. These are in thousands of
dollars.

The first and third quartiles have been calculated:


Home Prices in Albuquerque, New Mexico
From February - April, 1993
205 72 93.9 99.5 87.5 105

208 72 82 97.5 88.9 104.5

215 74.9 78 97.5 85.5 105

215 73.1 77 90 83.5 102

199.9 72.5 70 96 81 100

190 67 62 86 80.5 103

180 215 54 169.5 79.9 97.5

156 159.9 107 155.3 75 95

145 135 210 125 75.9 94

144.9 129.9 72.5 130 75.5 92

137.5 125 66 102 75 94.5

127 123.9 60 102 73 87.4

125 120 58 92.2 72.9 87.2

123.5 112.5 184.4 92.5 71 87

117 110 158 89.9 77.3 86.9

118 108 69.9 85 69 76.6

115.5 105 133 87.6 67 73.9

111 104.9 116 89 61.9

113.9 95.5 110.9 87 129.5

99.5 93.4 112.9 70 97.5

Q1 = 78, Q3 = 120

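What is the range for outliers? What is the lower fence for outliers and the upper fence for outliers?
IQR = Q3 - Q1 = 120 - 78 = 42

1.5 IQR below the first quartile: Q1 - 1.5 × IQR = 78 - 1.5(42) = 78 - 63 = 15

1.5 IQR above the third quartile: Q3 + 1.5 × IQR = 120 + 1.5(42) = 120 + 63 = 183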

The interquartile range is 42, so any point below 78 minus 1.5xIQR, which is 15, or above 120 plus 1.5xIQR, which is 183, will be an outlier.
Notice that there's nothing in the list below 15, but there are nine values above 183: 184.4, 190, 199.9, 205, 208, 210, and 215 (which appears three times). That is nine
outlying homes at seven distinct prices, which is a completely legitimate occurrence; a data set can contain many outliers.
 TERM TO KNOW

1.5xIQR Rule
If a point is larger than Q3 + 1.5xIQR, or smaller than Q1 - 1.5xIQR, then it is an outlier.

3. Boxplots
You can use this new information to create a new version of a plot you already know. You’ve made boxplots in another tutorial; now you can modify them to
show outliers.

Generally, you would make the whiskers on the box-and-whisker plot extend all the way out to the maximum and minimum. If the minimum or maximum (or both) are
outliers, that will make the whiskers really long. For a modified boxplot, instead of going all the way out to those outliers, you can extend them only to the highest and
lowest values that aren't outliers and notate the outliers separately.

 EXAMPLE Refer back to the student data set from the section above. Here are the values from least to greatest.

46 84 88 89 90 90 91 94 98
Q1 = 86, Q3 = 92.5

Mark the same values that you would have if you were making a regular box-and-whisker plot. However, don’t go all the way down to 46 for your minimum, even
though 46 is the actual minimum. 46 is an outlier, so instead go to the next lowest number that isn't an outlier (84) and make your line there. Then you can make your
box and whiskers.

You still have to show the 46 as part of this data set somehow, so you will mark it with a dot. This is a modified boxplot.
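
The pieces of a modified boxplot can also be worked out in a few lines of Python. This is only a sketch: the function name is hypothetical, and the quartiles (86 and 92.5) are passed in directly from the example above rather than recomputed.

```python
def modified_boxplot_parts(values, q1, q3):
    """Return (low whisker end, high whisker end, outlier dots)."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values that are NOT outliers.
    inside = [x for x in values if low_fence <= x <= high_fence]
    # Anything beyond a fence is drawn separately as a dot.
    dots = sorted(x for x in values if x < low_fence or x > high_fence)
    return min(inside), max(inside), dots


scores = [90, 98, 89, 88, 46, 90, 91, 84, 94]
low_whisker, high_whisker, dots = modified_boxplot_parts(scores, q1=86, q3=92.5)
print(low_whisker, high_whisker, dots)  # 84 98 [46]
```

The lower whisker stops at 84, the upper whisker still reaches the maximum of 98, and 46 is plotted as its own dot. Plotting software often draws boxplots this way by default, but it may compute quartiles differently, so its whiskers can differ slightly from a hand-drawn plot.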

In the home value data set, the high outliers fell at seven distinct prices. This is a modified plot for that data set:

 SUMMARY

You can determine in some measurable way if a point within a data set is an outlier using the 1.5xIQR rule. Data sets might have no outliers, or they might have
one or more outliers on the low side, one or more outliers on the high side, or both. There's no rule for how many outliers are allowed in a data set. Whatever
outliers exist, you can use a modified boxplot to visually display them.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

1.5xIQR Rule
If a point is larger than Q3 + 1.5xIQR, or smaller than Q1 - 1.5xIQR, then it is an outlier.

Modified Boxplot
A graphical display showing a modified version of the five number summary. If a distribution has outliers, then the "whiskers" only extend to the highest and lowest points
that are not outliers.

Outlier
A point that is so large or small as to be unusual, given the rest of the data points.

Percentiles
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of percentiles. Our discussion breaks down as follows:

1. Percentiles

1. Percentiles
You have probably heard of percentiles, or percentile rank, before. A percentile is the same as a relative cumulative frequency: the percent of data points in, or below, a
particular bin of data.

IN CONTEXT
You may have seen percentiles used to report standardized test scores. If you were in the 95th percentile on a standardized test, it doesn't mean you scored a 95 on
the test. It means that your score was at least as good as 95% of test takers.

Often, large data sets are given in frequency tables, frequently with rounded values.

 EXAMPLE Here is a table showing heights (in inches) of 333 sixth-grade students, along with the frequency, relative frequency, and relative cumulative
frequencies.
Height    Frequency    Relative Freq    Rel. Cum. Freq

55 11 3% 3%

56 23 7% 10%

57 33 10% 20%

58 36 11% 31%

59 54 16% 47%

60 51 15% 62%

61 43 13% 75%

62 32 10% 85%

63 30 9% 94%

64 13 4% 98%

65 7 2% 100%

How do we read this? Notice the first two rows have relative frequencies of 3% and 7%, respectively. Adding these two values gives the relative cumulative frequency of
10% in the second row. You can also check this by dividing the cumulative count of the first two rows (11 + 23 = 34 students) by the total number of students, 333, which
gives a number close to 10%.

By the time we get to 65 inches, we will have accounted for all of the sixth graders in the data set.
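
The relative and relative cumulative frequency columns can be rebuilt from the raw counts. The Python sketch below assumes only the heights and frequencies from the table; the variable names are illustrative, and percents are rounded to whole numbers as in the table.

```python
from itertools import accumulate

heights = list(range(55, 66))  # 55 through 65 inches
freqs = [11, 23, 33, 36, 54, 51, 43, 32, 30, 13, 7]

total = sum(freqs)                    # 333 students
cum_counts = list(accumulate(freqs))  # running totals: 11, 34, 67, ...

rel_pct = [round(100 * f / total) for f in freqs]
cum_pct = [round(100 * c / total) for c in cum_counts]

for h, f, r, c in zip(heights, freqs, rel_pct, cum_pct):
    print(f"{h}  {f:3d}  {r:3d}%  {c:3d}%")

# The median falls in the first bin whose running total reaches half the students.
median_height = next(h for h, c in zip(heights, cum_counts) if c >= total / 2)
print(median_height)  # 60
```

Each cumulative percent is computed from the running count rather than by adding rounded percents, which avoids rounding drift in the last column.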

 TRY IT

You can use this chart to answer many questions:

Question: Which percentile will a student with a height of 62 inches fall into?
Answer: From the chart, you can see that 62 inches falls in the 85th percentile. That means that a 62-inch student is at least as tall as 85% of his/her classmates.

Question: How tall is a student in the 94th percentile?
Answer: They would be 63 inches tall.

Question: What is the median height for sixth graders?
Answer: This question is a little tricky. By the time you finish counting up through all of the 59-inch students, you still haven't accounted for half the grade yet; only 47% of
the students. However, by the time you're done counting all the 60-inch students, you've accounted for 62% of the grade. That means that somewhere within that 60-inch
range is the median height. So this tells us that at least half the students are at or above 60 inches, and at least half the students are at or below 60 inches.

In the previous example, we summarized the data from hundreds of students in a table format. However, you can also calculate percentile in small datasets.

 EXAMPLE The following are the measurements (in inches) for the 12 fish caught at the annual Fishing Expo at Cam's local pond:

8, 12, 14, 10, 5, 4, 18, 22, 12, 12, 12, 11

Cam's fish measured 10 inches. What percentile does this represent?

First, order the values from smallest to largest:

4, 5, 8, 10, 11, 12, 12, 12, 12, 14, 18, 22

Cam's 10-inch fish is 4th of 12 values, so 4/12 ≈ 0.33, or the 33rd percentile.

Everyone in the 75th percentile or above receives a trophy. What size fish represents the 75th percentile?

Again, we order the fish from smallest to largest:

4, 5, 8, 10, 11, 12, 12, 12, 12, 14, 18, 22

We have 12 values and are interested in the 75th percentile. 12 × 0.75 = 9, which tells us that the 9th fish in the series, at 12 inches, represents the 75th percentile.
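
Both fish calculations can be sketched in Python. This follows the simple counting method used in this example (position divided by the number of values); the function names are hypothetical, and statistical software often interpolates between values, so it may report slightly different percentiles.

```python
import math


def percentile_rank(values, x):
    """Percent of values that are at or below x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)


def value_at_percentile(values, pct):
    """Value sitting at the given percentile, by counting up the sorted list."""
    data = sorted(values)
    position = max(1, math.ceil(len(data) * pct / 100))  # 1-based position
    return data[position - 1]


fish = [8, 12, 14, 10, 5, 4, 18, 22, 12, 12, 12, 11]
print(round(percentile_rank(fish, 10)))  # 33
print(value_at_percentile(fish, 75))     # 12
```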
 TERM TO KNOW

Percentile
Relative Cumulative Frequency; the percent of data points at or below a particular value.

 SUMMARY

Percentiles are the same as relative cumulative frequency. They can be used to compare where individuals rank relative to their group. Percentiles measure what
percent of data points fall in a bin or below that bin.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Percentile
Relative Cumulative Frequency; the percent of data points at or below a particular value.

Normal Distribution
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of normal distribution. Our discussion breaks down as follows:

1. Normal Distribution
2. Center and Spread

1. Normal Distribution
The normal distribution is one of the most important ideas in all of statistics. Normal distribution may also be called a Gaussian distribution, after the mathematician Carl
Friedrich Gauss, or a bell curve.

You may have heard of the term bell curve before. Does that refer to shape, center, or spread? A bell curve refers to shape. When something is called “bell curved” in
terms of its shape, it means that it is single-peaked and symmetric. A bell curve looks like this:

Normal Distribution (a.k.a. Gaussian Distribution)

In this case, "normal" doesn't mean that it's usual, or always happens, or is typical. "Normal" just means that it's going to be single-peaked and symmetric.

 TERM TO KNOW

Normal Distribution/Gaussian Distribution/Bell Curve


A single-peaked, symmetric distribution that follows a specific bell-shaped pattern.

2. Center and Spread


Since the shape of these distributions is always the same, the only things that differ between normal distributions are their center (where each distribution is placed on the
x-axis) and their spread (how wide they are).

 EXAMPLE A graph that has a spread of 5 and one that has a spread of 20 will look a little different, the former narrower and the latter wider, but both still have the same
bell shape.
Because of this, you need a measure of center and a measure of spread. Which measure of center should you use: mean or median? Which measure of spread should
you use: standard deviation or IQR?

Measure of Center and Spread for Normal Distribution

Measure of Center: Mean
The mean is the more versatile measure of center.

Measure of Spread: Standard Deviation
Because the mean is going to be your measure of center, use the standard deviation as the measure of spread.

You might ask, “Since the distribution is symmetric, won’t the mean and the median be the same?” You're right! Likewise, the mode is the same as both the mean and the
median in this distribution. Therefore, it doesn't really matter if you reference the mean or the median, but typically, you’re going to say that your graph has a specific mean
and a specific standard deviation.

To measure center and spread, all you need is:

The bell curve shape


The mean (μ): Given with the Greek letter mu (pronounced “mew”)
The standard deviation (σ): Given with the Greek letter sigma

Sometimes, you'll see this written in shorthand: N(μ, σ). In this notation, N means the normal distribution, making this a convenient and compact way of
writing the normal distribution.

 SUMMARY

All normal distributions look the same. The normal distribution is a very commonly-used family of distributions. Such distributions are not all centered at the same
place nor all spread out the same way, but they all look the same: they're all single-peaked and symmetric. Therefore, all we need to describe one is its center and
its spread, given by the mean and the standard deviation.

Good luck!

Source: This work is adapted from Sophia author Jonathan Osters.

 TERMS TO KNOW

Normal Distribution/Gaussian Distribution/Bell Curve


A single-peaked, symmetric distribution that follows a specific bell-shaped pattern.

68-95-99.7 Rule
by Sophia

 WHAT'S COVERED

This tutorial will explain the 68-95-99.7 rule. Our discussion breaks down as follows:

1. The 68-95-99.7 Rule


2. Finding More Percents

1. The 68-95-99.7 Rule


The 68-95-99.7 rule applies to normal distributions and states that 68% of all data points fall within one standard deviation of the mean, 95% of all data points fall within
two standard deviations of the mean, and 99.7% of all data points fall within three standard deviations of the mean.

To understand this rule, let's start with the normal distribution. As you can see below, the normal distribution is single peaked and symmetric. Thus, this distribution can be
described exclusively by its mean and standard deviation.

All normal distributions have this same shape. The only things that might make them look a little different are a larger or smaller standard deviation or a center at a
different place. Since those are the only differences, we can describe any normal distribution with the notation N(μ, σ): N for normal, followed by the mean and the
standard deviation.

When you learned to calculate standard deviations, you were taught that a good amount of the data points would fall within one standard deviation of the mean. In a
normal distribution, that amount can be stated precisely.

How many of the data points will fall in the first standard deviation? 68% of all the data falls within the first standard deviation of the mean, from one standard deviation
below the mean to one standard deviation above the mean.

If you go out another standard deviation, so now you are two standard deviations below the mean to two standard deviations above the mean, you get 95% of the data
points.

If you go all the way out to three standard deviations below the mean and three standard deviations above, you get almost the entire set of data: 99.7%.

This is why it's called the 68-95-99.7 Rule.

About 68% of the data values are within one standard deviation of the mean.
About 95% are within two standard deviations above or below the mean.
About 99.7% of the data, almost all the data, fall within three standard deviations of the mean.
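
The word "about" matters here: 68, 95, and 99.7 are rounded values. As a quick check, the sketch below computes the more precise percents with Python's standard `math.erf` function. The relationship it uses (the share within k standard deviations equals erf(k/√2)) is a standard property of the normal distribution that is not derived in this tutorial, and the function name is only illustrative.

```python
import math


def percent_within(k):
    """Percent of a normal distribution within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(percent_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```

The rule's figures are these values rounded for easy recall.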

 TERM TO KNOW

68-95-99.7 Rule
A rule that applies to normal distributions, stating that 68% of all data points fall within one standard deviation of the mean, 95% of all data points fall within two
standard deviations of the mean, and 99.7% of all data points fall within three standard deviations of the mean.

2. Finding More Percents


Because of the symmetry of the normal distribution, you can examine this rule further.

You can say that 68% of the data falls within one standard deviation, but because of the symmetry, 34% falls between one standard deviation below and the mean, and
another 34% falls between the mean and one standard deviation above.

You can continue with this logic: the green bars, between one and two standard deviations below and above the mean, each contain 13.5%. That 13.5% is obtained
because we know that two standard deviations on each side of the mean contain 95% of the data points. Since 68% is within the red portion (one standard deviation
above and below the mean), the remaining 27% (95 - 68) must fall within the two green bars. Because they have the same area, they must each contain half of
that remainder, or 13.5%.

Using that same logic again, the remaining 4.7% (99.7 - 95) is split between the two blue bars, so about 2.35% of the data points fall within each of the blue bars.

Way out in the tails, you get almost none of the data points. To make up the full 100%, another 0.15% falls within each of those tails further out than three standard
deviations away.

 EXAMPLE A particular battery from a company has a lifetime that is normally distributed with a mean of 500 hours, and a standard deviation of 18 hours. What
percent of batteries last between 482 and 518 hours?

482 and 518 are exactly one standard deviation above and below the mean of 500. Using the 68-95-99.7 rule, we can say that 68% of batteries last between 482 and
518 hours.

 EXAMPLE What percent of batteries from this company last between 446 hours, and 536 hours?
Well, 446 is three standard deviations below the mean. 536 is two standard deviations above the mean. So, there are two ways to calculate the answer to this
question.

1. You could calculate it as the full 95% (the green area), plus the 2.35% (one of the blue bars).

95% + 2.35% = 97.35%

2. You could have also calculated it as 99.7% (all of the blue area) and subtracted the extra 2.35% (the blue bar to the far right).

99.7% - 2.35% = 97.35%

Either way, you end up with 97.35%.

 EXAMPLE What percent of batteries last longer than 464 hours?


464 is two standard deviations below the mean, so there are a few different ways to do this calculation.

1. You could just start adding each area above 464.

13.5% + 34% + 34% + 13.5% + 2.35% + 0.15% = 97.5%

2. You could add the 95% plus the remaining 2.5% in the upper tail.

95% + 2.5% = 97.5%

3. The entire curve is 100%, and the only part you don't want is the 2.5% on the left.

100% - 2.5% = 97.5%

However you calculate this, you should end up with the same answer: 97.5% of batteries last longer than 464 hours.
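
The three battery questions can be answered with one small lookup built from the band percents above. This is a sketch under the rule's own assumptions: it only accepts values that sit a whole number of standard deviations (from -3 to +3) from the mean, and the names `BANDS`, `PERCENT_BELOW`, `percent_between`, and `percent_above` are invented for illustration.

```python
from itertools import accumulate

# Band percents from left tail to right tail:
# below -3, -3 to -2, -2 to -1, -1 to 0, 0 to 1, 1 to 2, 2 to 3, above 3
BANDS = [0.15, 2.35, 13.5, 34.0, 34.0, 13.5, 2.35, 0.15]

# Percent of values below k standard deviations, for k = -3 ... 3.
PERCENT_BELOW = dict(zip(range(-3, 4), accumulate(BANDS)))


def sd_units(x, mean, sd):
    """How many standard deviations x is from the mean; must be a whole number."""
    k = (x - mean) / sd
    if k != int(k) or abs(k) > 3:
        raise ValueError("the rule only covers whole standard deviations from -3 to 3")
    return int(k)


def percent_between(low, high, mean, sd):
    return PERCENT_BELOW[sd_units(high, mean, sd)] - PERCENT_BELOW[sd_units(low, mean, sd)]


def percent_above(x, mean, sd):
    return 100 - PERCENT_BELOW[sd_units(x, mean, sd)]


MEAN, SD = 500, 18  # battery lifetimes in hours
print(round(percent_between(482, 518, MEAN, SD), 2))  # 68.0
print(round(percent_between(446, 536, MEAN, SD), 2))  # 97.35
print(round(percent_above(464, MEAN, SD), 2))         # 97.5
```

Asking about a value such as 490 hours raises an error, which mirrors the rule's limitation: it only gives answers at whole numbers of standard deviations.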

 SUMMARY

The 68-95-99.7 Rule is a way to generate approximate percents of values that will be within a particular interval of the normal distribution. You can combine this
rule with your knowledge of the symmetry of the normal distribution to find more percents than just 68, 95, and 99.7. This rule will not work if the values are not at
integer standard deviations, meaning whole numbers of standard deviations away from the mean.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

68-95-99.7 Rule
A rule that applies to normal distributions, stating that 68% of all data points fall within one standard deviation of the mean, 95% of all data points fall within two standard
deviations of the mean, and 99.7% of all data points fall within three standard deviations of the mean.

Standard Scores and Z-Scores
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of z-scores. Our discussion breaks down as follows:

1. Z-Scores
a. Calculating Z-Scores
b. Negative Z-Scores
2. Converting Standard Distributions to Z-Scores

1. Z-Scores
Z-scores are values that allow you to make one-to-one comparisons between a couple of different distributions. Often, you want to compare two things, but it's not really
fair to compare them directly. Z-scores help with this.

A z-score is a value that explains how many standard deviations away from the mean an observation is. It can be positive (if the value is above the mean) or negative (if
the value is below the mean).

IN CONTEXT
Suppose you are taking a math class. For the first exam of the year, the class mean was 88 points, and the standard deviation was 5. You scored a 92, so you did
better than the class average. The second test was much harder. You scored an 80, which is a lot worse than you did the first time, but the class mean was 74.
So, even though your score went down, the class average also went down. The standard deviation, this time, was 4 points. Did you do better on the first or the
second test, relative to your classmates?

It's obvious that you scored higher on the first test, but relative to your classmates, did you do better on the first test or the second test? Just based on the scores,
it's not fair to say that you did better on the first test. You want to see how you did relative to your classmates. Z-scores are going to allow you to make this
comparison.

Z-scores are sometimes called standardized scores because they measure how many standard deviations away from the mean your observation is. In the previous
example, the standard deviation for the first exam was 5. You scored 92, which is 4 points higher than the average score of 88.
This means that your z-score is less than 1 because you are less than one standard deviation above the mean.

 TERM TO KNOW

Standard Scores/Z-Scores
A value that explains how many standard deviations away from the mean an observation is. It can be positive (if the value is above the mean) or negative (if the
value is below the mean).
1a. Calculating Z-Scores
In the above example, the actual z-score for the first test is +4/5, or 0.8: your score is 4 points above the mean, and that difference is divided by 5, the standard deviation.

But how do you get this calculation? You use this formula:

 FORMULA

Z-Score

z = (x - μ) / σ

The z-score (represented by z) is equal to the raw score (x) minus the mean (μ, mu), divided by the standard deviation (σ, sigma).

 TRY IT

Using our example above, let's compare the standardized scores by comparing the z-score from the first test to the z-score from the second. First, let's calculate the z-
score from the first test, using our formula:

Comparing Z-Scores of Two Tests

Test 1: The class mean is 88 with a standard deviation of 5. You score a 92.
z = (92 - 88) / 5 = 4 / 5 = 0.8

Test 2: The class mean is 74 with a standard deviation of 4. You score an 80.
z = (80 - 74) / 4 = 6 / 4 = 1.5

What does this mean? Because positive 1.5 is larger than positive 0.8, your score on the second test was, in fact, better relative to the rest of the class.
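
The same comparison can be written as a short Python sketch. The function name `z_score` is chosen here for illustration; the numbers are the two test results from the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_test1 = z_score(92, mean=88, sd=5)
z_test2 = z_score(80, mean=74, sd=4)
print(z_test1, z_test2)  # 0.8 1.5

# The larger z-score marks the better result relative to the class.
better = "second" if z_test2 > z_test1 else "first"
print(better)  # second
```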

1b. Negative Z-Scores


The other thing that's worth noting is that a z-score can be negative. If you're subtracting a bigger number from a smaller number (i.e., if the raw score is a smaller number
than the mean), then you'll end up with a negative value.

 BIG IDEA

If the raw score is below the mean, the z-score will be negative. If the raw score is above the mean, the z-score is positive. If the raw score and the mean are the
same, the z-score is 0.

2. Converting Standard Distributions to Z-Scores


How are z-scores used in standard distributions? Let's explore an example.

 EXAMPLE Men's heights follow a normal distribution, with a mean of 68 and a standard deviation of 3.

Imagine, then, converting the 59, 62, 65, etc., into z-scores. For example:

z = (59 - 68) / 3 = -9 / 3 = -3

A z-score of negative 3 tells us that 59 is three standard deviations below the mean. If we convert all these values into z-scores, what would that normal distribution
look like?

Since the noted numbers are whole numbers of standard deviations away from the mean, and that's what z-scores measure, each can be represented by a z-score.
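
Here is a brief Python sketch of that conversion. The list of heights is an assumption: it takes the marked values to run from 59 to 77 inches in steps of one standard deviation (3 inches), continuing the "59, 62, 65, etc." pattern.

```python
MEAN, SD = 68, 3  # men's heights in inches

heights = [59, 62, 65, 68, 71, 74, 77]  # assumed axis marks, one SD apart
z_scores = [(h - MEAN) / SD for h in heights]
print(z_scores)  # [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
```

The mean of 68 maps to 0, and each step of 3 inches maps to one unit on the z-score axis.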

 BIG IDEA

This normal distribution of z-scores is called the standard normal distribution--standard, because it's the normal distribution of standardized scores.

 SUMMARY

Standardized scores / z-scores allow you to make one-to-one comparisons of scores from one distribution to scores from another distribution. They measure how
many standard deviations above or below the mean you are, and thus normal distributions can be converted to z-scores. A point that's further above the mean will
have a higher z-score than a point that's closer to the mean. A point above the mean will have a positive z-score, and a point below the mean will have a negative
z-score.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Standard Scores/z-scores
A value that explains how many standard deviations away from the mean an observation is. It can be positive (if the value is above the mean) or negative (if the value is
below the mean).

 FORMULAS TO KNOW

z-score
z = (x - μ) / σ

Standard Normal Distribution
by Sophia

 WHAT'S COVERED

This tutorial will cover the topic of standard normal distribution. Our discussion breaks down as follows:

1. Standard Normal Distribution


2. Standard Normal Tables
a. Negative Z-Scores
b. Positive Z-Scores
3. Different Cases

1. Standard Normal Distribution


The standard normal distribution is a specific kind of normal distribution that uses standard scores. Standard scores are also known as z-scores.

In the standard normal distribution, the mean is zero, and the standard deviation is one.

 EXAMPLE Here is an example to show how to use standard normal distribution. Men's heights are normally distributed with a mean of 68 inches, which is five
feet eight inches, and a standard deviation of 3 inches.

What percent of men are over six feet (or 72 inches) tall?

First, recall the 68-95-99.7 Rule, which says that 68% of the data points fall within one standard deviation of the mean. That means that 68% of men's heights will fall
within three inches of 68. 95% will fall within two standard deviations, and 99.7% will fall within three standard deviations.

Where is 72 inches? It falls between the first and second standard deviation above the mean, and the goal is to find the percent of men above that height. Because 72 is
not a whole number of standard deviations from the mean, the rule can't give the answer directly. How do you find the answer?
To solve this challenge, take these heights and standardize them by turning these values into z-scores. Z-scores are how many standard deviations away from the
mean an observation is. In the above distribution, 71 is one standard deviation above the mean, so in our new graph, it will be marked as +1. Next, 74 is two standard
deviations above the mean, so it will be marked as +2. The rest of the numbers will follow, resulting in a graph that looks like this:

In this graph, the 68 has been marked as 0, and all the other numbers marked as their respective integer number of standard deviations away from the mean. But
remember, 72 wasn't an integer number of standard deviations. It was somewhere between one and two standard deviations away. So we can calculate the exact z-
score for that cutoff point of 72 inches by using the z-score formula.

z = (72 - 68) / 3 = 4 / 3 ≈ 1.33

The cutoff is about 1.33 standard deviations above the mean, and our goal will be to find the percent of values that are above this score.

 TERM TO KNOW

Standard Normal Distribution


A normal distribution of z-scores. The mean is zero, and the standard deviation is one.

2. Standard Normal Tables


To solve the problem we were addressing above, you need a standard normal table, or z-table. A standard normal table is a table of probabilities: for each z-score, it gives
the proportion of values that fall at or below that z-score (for example, an entry of 0.1587 means 15.87%).
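
If you are curious where the table entries come from, the sketch below reproduces a few of them in Python. It relies on a standard formula for the area under the standard normal curve, written with the `math.erf` function; that formula is not part of this tutorial, and the function name `phi` is only a common convention.

```python
import math


def phi(z):
    """Proportion of the standard normal distribution at or below z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


for z in (-2.0, -1.0, 0.0, 0.25):
    print(f"{z:+.2f}  {phi(z):.4f}")
# -2.00  0.0228
# -1.00  0.1587
# +0.00  0.5000
# +0.25  0.5987
```

Each printed value matches the table entry found by combining the row (such as 0.2) with the column (such as +0.05). The same function can be evaluated at a z-score that is not a whole number, such as 1.33.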

 TERM TO KNOW

Standard Normal Table/Z-Table


A table that calculates the percent of values below a particular z-score.
2a. Negative Z-Scores
Negative z-scores correspond to values that are below the mean. To find the probability of a negative z-score, we need to use the z-table that shows negative values. In
the table below, the left column shows negative z-scores that fall to the left of the mean. Also, notice in the graph that the values are always below the mean.

z +0.00 +0.01 +0.02 +0.03 +0.04 +0.05 +0.06 +0.07 +0.08 +0.09

-3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002

-3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003

-3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005

-3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007

-3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010

-2.9 0.0019 0.0018 0.0017 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014

-2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019

-2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026

-2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036

-2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048

-2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064

-2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084

-2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110

-2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143

-2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183

-1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233

-1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294

-1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367

-1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455

-1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559

-1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681

-1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823

-1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985

-1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170

-1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379

-0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611

-0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867

-0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148

-0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451

-0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776

-0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121

-0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483

-0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859

-0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247

-0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641

2b. Positive Z-Scores


Positive z-scores correspond to values that are above the mean. To find the percent of values below a positive z-score, use the z-table that shows positive values. In the
table below, the left column shows positive z-scores that fall to the right of the mean. Also, notice in the graph that the values are always above the mean.

z +0.00 +0.01 +0.02 +0.03 +0.04 +0.05 +0.06 +0.07 +0.08 +0.09

0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359

0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753

0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141

0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517

0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879

0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224

0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549

0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852

0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133

0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389

1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621

1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830

1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015

1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177

1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319

1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441

1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545

1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633

1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706

1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767

2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817

2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857

2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890

2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916

2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936

2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952

2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964

2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974

2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981

2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986

3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990

3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993

3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995

3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997

3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997

You're only going to need to use one of these tables at a time. Because your z-score was positive 1.33, you’re going to use the positive z-score table.

The column on the far left represents the ones and tenths places of your z-score. Your z-score was 1.33, so you’re going to find the 1.3 row.
The row across the top represents the hundredths place of your z-score. Your z-score was 1.33, so you’re going to find the +0.03 column.

Then look in the table for the value where the 1.3 row (tenths) meets the +0.03 column (hundredths):

z +0.00 +0.01 +0.02 +0.03 +0.04 +0.05

0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199

0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596

0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987

0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368

0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736

0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088

0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422

0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734

0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023

0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289

1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531

1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749

1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944

1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115

1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265

1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394

1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505

1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599

The value at that intersection is 0.9082. However, this is the area below the z-score, meaning the percent of men shorter than 72 inches. A probability of 0.9082 tells us
that 90.82% of men are shorter than six feet tall.

Since the original question was what percent of men are taller than six feet, subtract 90.82% from 100%. You should find that 9.18% of men have heights of over 72
inches.
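
The table lookup can be cross-checked in software. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` and the mean of 68 inches and standard deviation of 3 inches that this tutorial uses for men's heights. The variable names are illustrative and not part of the tutorial.

```python
from statistics import NormalDist

# Assumed values from the tutorial's men's-height example.
mean_height = 68.0   # inches
sd_height = 3.0      # inches
height = 72.0        # six feet, in inches

# Standardize, then round to two decimals as a printed z-table requires.
z = round((height - mean_height) / sd_height, 2)

standard_normal = NormalDist(mu=0.0, sigma=1.0)
percent_below = standard_normal.cdf(z)   # the value a z-table reports
percent_above = 1.0 - percent_below      # complement: area to the right

print(f"z = {z:.2f}")
print(f"P(Z < z) = {percent_below:.4f}")
print(f"P(Z > z) = {percent_above:.4f}")
```

The `cdf` call plays the role of the table: it returns the area to the left of the z-score, so the area to the right still has to be found by subtraction.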

3. Different Cases
We can use the z-table to find the probability for different scenarios:

Percent below a particular value: This is the easiest to find because the table gives you the percentage below a value.
Percent above a particular value: This can be found by subtracting the table value from 100%.
Percent between two values: This will require a few extra steps.

 EXAMPLE Men's heights are normally distributed with a mean of 68 inches, which is five feet eight inches, and a standard deviation of 3 inches.

What percent of men are between 5'6" and 5'9", or between 66 and 69 inches tall?

The orange area in the above graph is what you’re looking for. First, you need to calculate the z-score of each value by subtracting the mean and dividing by the standard deviation:

z = (66 - 68) / 3 = -0.67
z = (69 - 68) / 3 = 0.33

The z-scores are negative 0.67 and positive 0.33. Next, find the table value that corresponds to positive 0.33, which is 0.6293. Then, find the table value
for negative 0.67, which is 0.2514.

Orange Area = P(z < 0.33) - P(z < -0.67)


Orange Area = 0.6293 - 0.2514
Orange Area = 0.3779

To find just the orange area, subtract the area that corresponds to a z-score of -0.67, which was approximately 25%, from the full area that corresponds to a z-score
of 0.33, which was about 63%.
Subtracting 0.2514 from 0.6293 gives 0.3779. This tells us that about 38% of men fall between those two heights.
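
The "between two values" case can be sketched the same way. This is an illustrative example that assumes Python's standard-library `statistics.NormalDist`; the helper name `percent_between` is hypothetical. It computes the area once with the rounded z-scores used in the table lookup and once directly from the heights, without rounding.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def percent_between(z_low: float, z_high: float) -> float:
    """Area under the standard normal curve between two z-scores."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)


# z-scores rounded to two decimals, as in the table lookup.
table_style = percent_between(-0.67, 0.33)

# Same question asked directly of the height distribution, with no rounding.
heights = NormalDist(mu=68.0, sigma=3.0)
unrounded = heights.cdf(69.0) - heights.cdf(66.0)

print(f"rounded z-scores:  {table_style:.4f}")
print(f"unrounded heights: {unrounded:.4f}")
```

Any small gap between the two printed results comes only from rounding the z-scores to two decimal places before the lookup.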

 SUMMARY

The 68-95-99.7 Rule is not the only way to find the percent of values above or below a particular value: you can also use z-scores and the standard
normal distribution. The standard normal table gives the percent of values below a particular z-score, which you can use to calculate the percent of values above or
below a particular value, or between two values.

Good luck!

 TERMS TO KNOW

Standard Normal Distribution


A normal distribution of z-scores. The mean is zero, and the standard deviation is one.

Standard Normal Table/z-table


A table that calculates the percent of values below a particular z-score.

Introduction to Sampling Distribution
by Sophia

 WHAT'S COVERED

This tutorial will explain the sampling distribution of the sample means. Our discussion breaks down as follows:

1. Sampling Distribution

1. Sampling Distribution
A sampling distribution of sample means is a distribution that shows the means from all possible samples of a given size. Let’s start with an example of a sampling
distribution.

Consider the spinner shown here:

Suppose you spin it four times to obtain an average. You get a 2 the first time, a 4 the second time, a 3 the third time, and a 1 the fourth time. The mean of
2, 4, 3, and 1 is:

(2 + 4 + 3 + 1) / 4 = 2.5

So, your first mean is 2.5.

Sample Mean

S1 = {2, 4, 3, 1} x̄1 = 2.50


However, your mean won't be 2.5 every time. Suppose you repeat this process five more times to get the following six samples:

Sample Mean

S1 = {2, 4, 3, 1} x̄1 = 2.50

S2 = {1, 4, 3, 1} x̄2 = 2.25

S3 = {4, 2, 4, 4} x̄3 = 3.50

S4 = {2, 2, 3, 1} x̄4 = 2.00

S5 = {3, 1, 1, 1} x̄5 = 1.50

S6 = {1, 1, 1, 2} x̄6 = 1.25
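
The sample means in the table can be reproduced with a few lines of code. This is a minimal sketch assuming Python's standard-library `statistics.fmean`; the list name `samples` is illustrative.

```python
from statistics import fmean

# The six samples of four spins listed in the table above.
samples = [
    [2, 4, 3, 1],
    [1, 4, 3, 1],
    [4, 2, 4, 4],
    [2, 2, 3, 1],
    [3, 1, 1, 1],
    [1, 1, 1, 2],
]

# One mean per sample: add the four spins and divide by 4.
sample_means = [fmean(sample) for sample in samples]

for index, (sample, mean) in enumerate(zip(samples, sample_means), start=1):
    print(f"S{index} = {sample}  mean = {mean:.2f}")
```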


So how can we represent all these sample means?

 STEP BY STEP

Step 1: First, take these sample means and graph them. Draw out an axis. For this one, it should go from 1 to 4 because four spins can’t average anything higher than
4 or lower than 1.
Step 2: Take the average value, for example, the mean of 2.5, and put a dot at 2.5 on the x-axis, much like a dot plot. Do this for all the sample means that you have
found.

Step 3: You can keep doing this over and over again. Ideally, you would do this hundreds or thousands of times, to show the distribution of all possible samples that
could be taken of size four. Once you’ve enumerated every possible sample of size four from this spinner, then the sampling distribution looks like this:

On the graph, the lowest number you can get is one, and the highest number you can get is four. On the far right of the graph is the point that represents a spin of 4 fours,
{4, 4, 4, 4}. On the far left is the point that represents a spin of 4 ones, {1, 1, 1, 1}. Notice that 4 ones happens more than 4 fours. Why is that? If you take a look at the
spinner, you'll see that there are more ones on the spinner than there are fours.

You can also notice that, since there are more ones, this actually pulls the average down a bit. The most frequent average is 2.25, not 2.5, which would be the exact middle
between 1 and 4. Therefore, this distribution is skewed slightly to the right because the numbers on the spinner are not evenly distributed.
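
Enumerating every possible sample by hand is tedious, but it is a small job for a computer. The sketch below assumes the spinner has eight equally likely sectors (three 1s, one 2, two 3s, and two 4s), which matches the single-spin frequencies described later in this unit. The sector list is an assumption because the spinner is shown only as an image. It counts how often each sample mean occurs across all ordered sets of four spins.

```python
from collections import Counter
from itertools import product

# Assumption: eight equally likely sectors (three 1s, one 2, two 3s, two 4s).
SECTORS = [1, 1, 1, 2, 3, 3, 4, 4]
SAMPLE_SIZE = 4

# Every ordered sequence of four sectors is equally likely, so counting
# sequences gives the exact sampling distribution of the mean.
mean_counts = Counter(
    sum(spins) / SAMPLE_SIZE for spins in product(SECTORS, repeat=SAMPLE_SIZE)
)

total = sum(mean_counts.values())
for mean in sorted(mean_counts):
    count = mean_counts[mean]
    print(f"{mean:4.2f}  {count:4d}  {count / total:.4f}")

most_common_mean, _ = mean_counts.most_common(1)[0]
print(f"most frequent sample mean: {most_common_mean}")
```

Under that assumption, the counts for a mean of 1 and a mean of 4 show directly why four 1s turn up more often than four 4s.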

 TERM TO KNOW

Sampling Distribution of Sample Means


A distribution that shows the means from all possible samples of a given size.

 SUMMARY

A sampling distribution of sample means is the distribution of all possible sample means for samples of a given size. For example, in a sampling distribution where you graph all the
possible means for samples of size 4, you take the sample, calculate the mean, and plot it with a dot. Then you take another sample, calculate the mean again,
and plot that, and so forth. It can be a long and tedious process! In large populations, the sampling distribution consists of a very large number of points.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Sampling Distribution of Sample Means


A distribution that shows the means from all possible samples of a given size.

Center and Variation of a Sampling Distribution
by Sophia

 WHAT'S COVERED

This tutorial will explain how to find the mean and standard deviation of a sampling distribution of the sample means. Our discussion breaks down as follows:

1. Mean of a Distribution of Sample Means


2. Standard Deviation of a Distribution of Sample Means

1. Mean of a Distribution of Sample Means


In this tutorial, you're going to learn about the center and variation of a sampling distribution. We're going to be using the mean to measure center, and the standard
deviation to measure the variation.

Suppose you have a spinner with the following sectors:

It's fairly easy to find the mean number spun from one spin. You just add up all the values, and divide by 8 because they're all equally likely sectors:

You end up with 2.375 as your mean. Using the standard deviation formula or Excel, the population standard deviation of the spinner is 1.218.
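
As a quick check of those two numbers, here is a minimal sketch using Python's standard-library `statistics` module. It assumes the eight sectors are three 1s, one 2, two 3s, and two 4s. That sector list is an assumption inferred from the frequencies described elsewhere in this unit, since the spinner itself appears only as an image.

```python
from statistics import fmean, pstdev

# Assumption: sector values for the eight equally likely sectors.
sectors = [1, 1, 1, 2, 3, 3, 4, 4]

population_mean = fmean(sectors)   # add the values and divide by 8
population_sd = pstdev(sectors)    # population (not sample) standard deviation

print(f"mean = {population_mean:.3f}")
print(f"standard deviation = {population_sd:.3f}")
```

`pstdev` is used rather than `stdev` because the eight sectors are the entire population of outcomes, not a sample drawn from it.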

Now suppose you spun it four times to obtain an average. Next, you did it again and obtained another average, and then another average. You could eventually consider
every possible set of four outcomes. If you consider every possible scenario and plot each scenario on a graph like the one below, you can create what's called a sampling
distribution of sample means.

For this distribution, the sample means cluster around 2.25 to 2.5. Those are the most likely averages from four spins.

Now, this distribution of sample means itself has a mean, located near the center of the graph. In fact, that mean is
2.375, which is the same as the mean of the original spinner. So, the mean of the distribution of sample means is the same as the mean of the original distribution for the
spinner. The original distribution is sometimes called the parent distribution, and the mean of all the sample means is sometimes called the grand mean.

Symbolically it looks like this:

 FORMULA

Mean of a Distribution of Sample Means

$\mu_{\bar{x}} = \mu$

 TERM TO KNOW

Mean of a Distribution of Sample Means


The average of all possible means from all possible samples of a given size. It will be equal to the mean of the original population.

2. Standard Deviation of a Distribution of Sample Means


Recall that the standard deviation of the original distribution for the spinner was 1.218. There is also a standard deviation of the distribution of sample means, shown below
for 4 spins. The standard deviation, in this case, is 0.609.

How does that 0.609 relate to 1.218? Well, 0.609 is half of 1.218. The standard deviation of the distribution for four spins is only half as large as the original standard
deviation.

What is the distribution of sample means when you spin nine times? You'll notice that the mean of the distribution for nine spins is the same as the other means: 2.375.
Next, look at the shape. You can see that the extreme values are much less likely now, and the sample means cluster more tightly around the center. This suggests that the standard
deviation has gotten smaller yet again. In fact, the standard deviation is just 0.406.

Let's consider the three cases so far:

When you spun just once, the mean was 2.375, and the standard deviation was 1.218.
When you spun four times, the mean was 2.375. The standard deviation was 0.609, which is half of the standard deviation of the original distribution.
When you spun nine times, again the mean was 2.375. The standard deviation was 0.406, which is a third of the standard deviation of the original distribution.

As the number of spins increases, the standard deviation goes down. However, the decrease is not linear; the standard deviation is proportional to the inverse of the square root of n. Therefore, to calculate
the standard deviation of a distribution of sample means, you divide the original standard deviation by the square root of the sample size.
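
The square-root relationship can be checked numerically. The sketch below again assumes eight equally likely sectors (three 1s, one 2, two 3s, two 4s). The sector list and the helper name `sd_of_sample_means` are illustrative assumptions. It applies the divide-by-square-root rule for 1, 4, and 9 spins, then verifies the four-spin case by brute force over every possible set of four spins.

```python
from itertools import product
from math import sqrt
from statistics import fmean, pstdev

# Assumption: the eight equally likely sectors of the spinner.
sectors = [1, 1, 1, 2, 3, 3, 4, 4]
sigma = pstdev(sectors)  # population standard deviation of one spin


def sd_of_sample_means(population_sd: float, n: int) -> float:
    """Standard deviation of the sample mean for samples of size n."""
    return population_sd / sqrt(n)


for n in (1, 4, 9):
    print(f"n = {n}: {sd_of_sample_means(sigma, n):.3f}")

# Brute-force check for n = 4: list the mean of every possible set of four spins.
all_means = [fmean(spins) for spins in product(sectors, repeat=4)]
print(f"enumerated mean = {fmean(all_means):.3f}")
print(f"enumerated sd   = {pstdev(all_means):.3f}")
```

The enumeration is only practical for small n (there are 8 to the power n ordered samples), which is part of why the formula is useful.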

 FORMULA

Standard Deviation of a Distribution of Sample Means

$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

 TERM TO KNOW

Standard Deviation of a Distribution of Sample Means


The standard deviation of all possible means from all possible samples of a given size. It will be equal to the standard deviation of the original population, divided
by the square root of the sample size.

 SUMMARY

The mean of a sampling distribution, from samples of size n, is always going to be the same as the mean of the original distribution it came from, or the parent
distribution. The mean of all the x-bars is the same as the original mean from the parent distribution. The standard deviation, on the other hand, gets smaller as the
sample size increases. The larger the sample size, the more likely it is that the extreme values will get evened out and pulled back towards the mean. Thus, the
standard deviation decreases, and you can quantify the decrease in standard deviation. The standard deviation for the sampling distribution is the standard
deviation of the parent distribution, divided by the square root of sample size.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Mean of a Distribution of Sample Means


The average of all possible means from all possible samples of a given size. It will be equal to the mean of the original population.

Standard Deviation of a Distribution of Sample Means


The standard deviation of all possible means from all possible samples of a given size. It will be equal to the standard deviation of the original population, divided by the
square root of the sample size.

 FORMULAS TO KNOW

Mean of a Distribution of Sample Means

$\mu_{\bar{x}} = \mu$

Standard Deviation of a Distribution of Sample Means

$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

Shape of a Sampling Distribution
by Sophia

 WHAT'S COVERED

This tutorial will cover the shape of a sampling distribution. Our discussion breaks down as follows:

1. The Characteristics of Distributions


2. Shape
3. The Central Limit Theorem

1. The Characteristics of Distributions


In a sampling distribution, the center is the same as the center of the original distribution. That is to say, the mean of all the x-bar averages is the same as the mean of the
original distribution. The shape is also a characteristic of distributions and will be discussed in the next section.

The spread, or standard deviation, is the same as the original standard deviation divided by the square root of sample size. In other words, the standard deviation of all the
x-bars is equal to the original standard deviation divided by the square root of n (n being the sample size).

These two characteristics are notated like this:

 FORMULA

Mean of a Distribution of Sample Means

$\mu_{\bar{x}} = \mu$

Standard Deviation of a Distribution of Sample Means

$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

2. Shape
Consider this spinner:

Consider the sampling distributions caused by averaging different numbers of spins. Well, it's fairly obvious that if you spun it once, you would spin a one about 3 out of 8
times. You'd spin a two about 1 out of 8 times, a three about 2 out of 8 times, and a four about 2 out of 8 times, making the distribution look something like this:

You can see here that one is the most common, three and four are the next most common (and equally common), and two is the least common.

What about the sampling distribution if you averaged four spins? Well, you wouldn't just have options of 1, 2, 3, and 4 anymore. You'd have options of 1.25, 1.5, 1.75, 2,
2.25, 2.5, etc. Having more options would necessarily decrease the likelihood of getting all 1's or all 4's. The distribution would look something like this:

Getting all four 1's would be extremely unlikely, and getting all four 4's would also be extremely unlikely. The most likely average is 2.25. There are a few more 1's
on the spinner than anything else, which pulls the mean down a little from the 2.5 you might have expected, to 2.375. This distribution is slightly skewed to the right.

What if you averaged nine spins? Well, the probability that you get all 1's, and therefore average a 1, goes down even further. Also, the
probability that you get all 4's goes down to almost zero:

As this graph shows, it's possible to get all 1's, but it’s not very likely. It's a lot more common to get something between 2 and 3. The spread of the sampling distribution is
decreasing as n gets bigger. As the previous graphs show, the shape of the distribution changes as the number of spins, n, changes.

Suppose you are averaging 20 spins. It's almost impossible to average a 1, a 4, or even something close to 3. You're almost guaranteed to average something between 2
and 3. The spread, again, is decreasing.

 THINK ABOUT IT

What would you say is happening to the shape as the number of spins in each sample increases?

As n increases, the shape is becoming more normal.
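
One way to put a number on "becoming more normal" is skewness, which is 0 for a perfectly symmetric shape such as the normal curve. The sketch below is an illustration rather than part of the tutorial's method. It assumes the single-spin probabilities described above (3/8, 1/8, 2/8, 2/8), builds the exact distribution of the average of n spins, and reports its skewness for n = 1, 4, 9, and 20. The function names are hypothetical.

```python
# Assumption: single-spin probabilities for the spinner described above.
SPIN_PMF = {1: 3 / 8, 2: 1 / 8, 3: 2 / 8, 4: 2 / 8}


def pmf_of_sum(n: int) -> dict[int, float]:
    """Exact distribution of the total of n spins, built one spin at a time."""
    totals = {0: 1.0}
    for _ in range(n):
        updated: dict[int, float] = {}
        for total, p_total in totals.items():
            for value, p_value in SPIN_PMF.items():
                key = total + value
                updated[key] = updated.get(key, 0.0) + p_total * p_value
        totals = updated
    return totals


def skewness_of_mean(n: int) -> float:
    """Skewness of the average of n spins (0 means perfectly symmetric)."""
    pmf = {total / n: p for total, p in pmf_of_sum(n).items()}
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    third_moment = sum((x - mean) ** 3 * p for x, p in pmf.items())
    return third_moment / variance ** 1.5


skews = {n: skewness_of_mean(n) for n in (1, 4, 9, 20)}
for n, skew in skews.items():
    print(f"n = {n:2d}: skewness = {skew:+.4f}")
```

A positive value means a slight skew to the right, and values closer to 0 indicate a more symmetric, more nearly normal shape.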

3. The Central Limit Theorem


The Central Limit Theorem deals with the changes in shape that we saw above; it discusses the shape of a sampling distribution. The Central Limit Theorem states that
when the sample size is large, the shape of the sampling distribution of means becomes nearly normal.

 BRAINSTORM

How large a sample size is considered large enough for the distribution to be approximately normal? In our distributions so far in this tutorial, is it approximately normal
when n is nine? Is it approximately normal when n is 20? What constitutes a large enough sample so that when you show the distribution of all the averages, you get a
normal distribution?
The definition of “large” will be different depending on what the original distribution looked like. Our original distribution was almost uniform, so it didn't take very many
trials. If the distribution had been heavily skewed, it would have taken more trials to average out some of those high numbers with some of those low numbers.

For most parent distributions, a sample size of 30 is large enough: the means of samples of 30 observations cluster near the population mean, and the
sample means that land farther from it tail off in a roughly normal shape.
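
To see how close to normal the sampling distribution is at n = 30, one rough check is to compare it with the 68-95-99.7 Rule from earlier in this unit. The sketch below assumes the same single-spin probabilities (3/8, 1/8, 2/8, 2/8), computes the exact distribution of the average of 30 spins, and measures how much of it lies within one and two standard deviations of the mean. The names are illustrative, and this is one possible check rather than a formal test of normality.

```python
from math import sqrt

# Assumption: single-spin probabilities for the spinner used in this tutorial.
SPIN_PMF = {1: 3 / 8, 2: 1 / 8, 3: 2 / 8, 4: 2 / 8}
N = 30

# Exact distribution of the total of N spins, built one spin at a time.
totals = {0: 1.0}
for _ in range(N):
    updated = {}
    for total, p_total in totals.items():
        for value, p_value in SPIN_PMF.items():
            key = total + value
            updated[key] = updated.get(key, 0.0) + p_total * p_value
    totals = updated

mu = sum(v * p for v, p in SPIN_PMF.items())
sigma = sqrt(sum((v - mu) ** 2 * p for v, p in SPIN_PMF.items()))
sd_of_mean = sigma / sqrt(N)  # standard deviation of the sample mean


def share_within(k: float) -> float:
    """Probability that the mean of N spins is within k standard deviations of mu."""
    return sum(p for total, p in totals.items() if abs(total / N - mu) <= k * sd_of_mean)


within_one = share_within(1)
within_two = share_within(2)
print(f"within 1 sd: {within_one:.3f}  (normal curve: about 0.68)")
print(f"within 2 sd: {within_two:.3f}  (normal curve: about 0.95)")
```

Because the averages can only take values in steps of 1/30, the shares will not match 68% and 95% exactly, but they should land close to them.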

 BIG IDEA

For almost all distributions, a sample size of at least 30 is large enough for the sampling distribution of means to be approximately normal. It's the Central Limit Theorem
that helps explain why so many real-world processes are approximately normally distributed.
 TERM TO KNOW

Central Limit Theorem
A theorem that explains the shape of a sampling distribution of sample means. It states that if the sample size is large (generally n ≥ 30), and the standard
deviation of the population is finite, then the distribution of sample means will be approximately normal.

 SUMMARY

A sampling distribution is the distribution of all possible sample means for samples of a given size; there are characteristics of distributions that are important, and for the Central Limit
Theorem, the important characteristic is the shape. The Central Limit Theorem states that when the sample size is large (for most distributions, meaning 30 or
larger), the distribution of sample means will be approximately normal. Occasionally, you need to make the sample size even bigger than 30
if the parent distribution that you started with is very skewed or has outliers. The Central Limit Theorem is probably the most important idea in statistics.

Good luck!

Source: Adapted from Sophia tutorial by Jonathan Osters.

 TERMS TO KNOW

Central Limit Theorem


A theorem that explains the shape of a sampling distribution of sample means. It states that if the sample size is large (generally n ≥ 30), and the standard deviation of
the population is finite, then the distribution of sample means will be approximately normal.

 FORMULAS TO KNOW

Mean of a Distribution of Sample Means

$\mu_{\bar{x}} = \mu$

Standard Deviation of a Distribution of Sample Means

$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
