
Chapter 2 Note Outline

 Main Concepts Covered:


o Tables and Graphs
o Measures of Center
o Measures of Position
o Measures of Variability
o Shapes of Distributions
o The Empirical Rule
o z-scores

 Tables and Graphs:


o Example for tables and graphs:
Identify the type of variable for each variable given:

• Classification

• Residence Type

• Number of Classes Fall 2022

• Current GPA
o Tables for Categorical Variables:
 If you are only working with one categorical variable, you make what is called a .

 If you are working with two categorical variables, you make what is called a

o Graphs for Categorical Variables:

o Tables for Quantitative Variables:


 If you are only working with one quantitative variable, you make what is called a ________.
 Possible Columns in Frequency Table:
 ________: the number of values in the range.

 ________: the proportion of values for a given range. Frequency / total number of values. A number between 0 and 1. (The total for this column should be 1, or very close if there is a rounding issue.)

 ________: the proportion written as a percentage. (Frequency / total number of values) x 100%. A number between 0% and 100%. (The total for this column should be 100%, or very close if there is a rounding issue.)

 ________: the sum of all the percentages of values less than the upper limit in the bin. A running total for percentages. A number between 0% and 100%. (The last row should have a value of 100%, or very close if there is a rounding issue.)
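
As a rough illustration of the four columns described above, the following Python sketch builds a frequency table for a small made-up dataset. The sample values, the bin edges, the choice to include the lower limit but not the upper limit in each bin, and the helper name `frequency_table` are all assumptions made for illustration; none of them come from these notes.

```python
def frequency_table(values, bin_edges):
    """Return one row per bin:
    (low, high, frequency, relative frequency, percentage, cumulative percentage).

    Each bin holds values with low <= value < high (an assumed convention).
    """
    total = len(values)
    rows = []
    running = 0  # running count used for the cumulative column
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        freq = sum(1 for v in values if low <= v < high)
        running += freq
        rows.append((low, high, freq, freq / total,
                     100 * freq / total, 100 * running / total))
    return rows

# Hypothetical data: number of classes taken by 10 students.
classes = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
table = frequency_table(classes, [3, 5, 7, 9])

print("bin      freq  rel   pct    cum pct")
for low, high, freq, rel, pct, cum in table:
    print(f"[{low}, {high})   {freq:<5} {rel:<5.2f} {pct:<6.1f} {cum:.1f}")
```

The relative column should total 1 and the last cumulative entry should be 100%, which is a quick check that no value fell outside the bins.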

o Graphs for Quantitative Variables:

 Outliers
o An observation that lies an ________ distance from other values in a dataset.

o An observation that is either ________ than the other values or ________ than the other values.
o Resistant measurements:
 Measurements that are ________ by outliers are not resistant.
 Measurements that are ________ by outliers are resistant.

o At the end of this lecture part we will look at the rule used by statisticians to determine what is an
abnormal distance from the others. There is a rule of thumb to determine if a value is unusual and
considered an outlier.
o As we define the numerical summaries we use we will discuss which measures would be greatly affected
by the presence of an outlier.

 Sample Size
o The number of values in a dataset.
o Notation:
 N= n=
o Note this value will be referenced in many formulas throughout the semester.
 Measures of Center
o Information gained from these measurements:

o Mean
 The in the dataset.
 Notation:
 $\mu =$ ________   $\bar{x} =$ ________
 Calculated:


$\bar{x} = \frac{\sum x}{n} = \frac{\text{sum of all values}}{\text{total number of values}}$
o Median
 The when data is ordered. % of the data
falls at or below the value.
 Calculated:
 Find the middle of the dataset.
 If n is even, the middle falls between the $\frac{n}{2}$ and $\frac{n}{2} + 1$ values. Average these two values.
 If n is odd, the middle falls at the $\frac{n+1}{2}$ value.
o Mode
 The value in the dataset (if there is one).
 If there isn’t a value that occurs most often state that there is .
 Calculated:
 Easiest to do when data is sorted.
 Count how many times each value appears and the one with the highest count is the
mode.
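
The three calculations above can be sketched in Python as follows. This is only an illustrative sketch: the function names are made up here, the choice to return `None` when there is no mode and a list when several values tie is an assumption, and the sample data is the grocery dataset used in the examples of these notes.

```python
from collections import Counter

def mean(values):
    # Sum of all values divided by the total number of values.
    return sum(values) / len(values)

def median(values):
    data = sorted(values)  # the data must be in ascending order
    n = len(data)
    if n % 2 == 0:
        # Positions n/2 and n/2 + 1 (counting from 1) are indexes n/2 - 1 and n/2.
        return (data[n // 2 - 1] + data[n // 2]) / 2
    # Position (n + 1)/2 (counting from 1) is index (n + 1)/2 - 1.
    return data[(n + 1) // 2 - 1]

def mode(values):
    counts = Counter(values)          # how many times each value appears
    top = max(counts.values())
    if top == 1:
        return None                   # every value occurs once: no mode
    return [v for v, c in counts.items() if c == top]

items = [3, 6, 6, 8, 14, 18, 23, 30]
print(mean(items), median(items), mode(items))
```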
 Example 1: The following dataset represents the number of items purchased in a single shopping trip to a
grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 30
The measures of center would answer the question: What is the number of items normally purchased in a single
shopping trip to a grocery store?
o Mean:
$\bar{x} = \frac{\sum x}{n} = \frac{3+6+6+8+14+18+23+30}{8} =$
o Median:
 Note the data is in order (if it wasn’t you would need to put it in ascending order first).
 n= which is
 The middle falls between the $\frac{n}{2} =$ ____ and $\frac{n}{2} + 1 =$ ____ values.
 value = and value =
 Average these to find median =

o Mode:
 occurs which is the most common (all others occur only once).

o Fill in the blanks:


 The average number of items purchased was .

 50% of shoppers purchased or fewer items.

 The most common number of items purchased was .

 Example 2: The following dataset represents the number of items purchased in a single shopping trip to a
grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 100
Assume someone was having a party and needed to purchase a large number of items (100).
Since 100 is abnormally large compared to the other values, it is probably an outlier.
o Mean: Would the value change?
 All values are used in the calculation of the mean.

o Median: Would the value change?


 No matter how large the maximum gets, the middle wouldn't change.

o Mode: Would the value change?


 An outlier is an unusual value and the mode is the most common. By definition they can’t be
the same.
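
One way to check the three questions above is to compute the measures of center for both datasets side by side. The sketch below uses Python's standard `statistics` module; the variable names are chosen here for illustration.

```python
import statistics

original = [3, 6, 6, 8, 14, 18, 23, 30]       # Example 1
with_outlier = [3, 6, 6, 8, 14, 18, 23, 100]  # Example 2

summary = {}
for label, data in (("Example 1", original), ("Example 2", with_outlier)):
    summary[label] = (statistics.mean(data),
                      statistics.median(data),
                      statistics.mode(data))
    print(label, "mean, median, mode =", summary[label])
```

Only the first entry of each tuple differs between the two datasets.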

 Measures of Center and Outliers:

o The is the only measure of center that is greatly affected by outliers.

o The are much less affected by an outlier.


o Sometimes decisions are made about which measurement to use based on how likely an outlier is in the
data.

 Example 3: Find the median of the following dataset.


0 10 15 18 20 22 24 26 35
o Median:
 Note the data is in order (if it wasn’t you would need to put it in ascending order first).
 n= which is
 The middle falls at the $\frac{n+1}{2} =$ ____ value.

 Note: The measures of center are by far the most common and simplest measurements in statistics; however,
to truly get a good picture of the data we need to know more information.

Dataset 1:  3 3 3 4 4 4 4 5 5 5   Mean = 4   Median = 4
Dataset 2:  2 3 3 4 4 4 4 5 5 6   Mean = 4   Median = 4
Dataset 3:  1 1 2 2 3 4 5 6 8 8   Mean = 4   Median = 3.5

All three of these datasets have the same mean, but they are three different datasets.

 Measures of Position
o Information gained from these measurements.

o Five Number Summary:


Minimum First Quartile Second Quartile Third Quartile Maximum

o Minimum: number in the dataset


o First Quartile, Q1: value such that ____% of data is at or below it.

o Second Quartile, Q2: value such that ____% of data is at or below it.

 Which measure of center is this?

o Third Quartile, Q3: value such that ____% of data is at or below it.

o Maximum: number in the dataset


o Calculating the five number summary:
 Make sure the data is sorted in ascending order.
 Minimum and maximum are easy to identify.
 To find the quartile:
 Start by finding Q 2, the second quartile.
o Find the middle of the entire dataset.
o If n is even, the middle falls between the $\frac{n}{2}$ and $\frac{n}{2} + 1$ values. Average these two values.
o If n is odd, the middle falls at the $\frac{n+1}{2}$ value.
 Split the dataset into two equal halves at the location of Q2. If n was odd, do not include the Q2 value in either half. (Make sure you have the same number of values in each half when you split the dataset.)
 Find the middle of the first half of the dataset only. This will give you the first quartile, Q1.
 Find the middle of the second half of the dataset only. This will give you the third quartile, Q3.
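
The procedure above can be sketched in Python as shown below. The function names `middle` and `five_number_summary` are made up for illustration. Note that statistical software often uses other quartile conventions, so a built-in quartile function may return slightly different values than this hand method.

```python
def middle(sorted_values):
    """Median of an already sorted list, using the even/odd rule."""
    n = len(sorted_values)
    if n % 2 == 0:
        return (sorted_values[n // 2 - 1] + sorted_values[n // 2]) / 2
    return sorted_values[n // 2]

def five_number_summary(values):
    if len(values) < 2:
        raise ValueError("need at least two values to split into halves")
    data = sorted(values)          # ascending order first
    n = len(data)
    half = n // 2
    lower = data[:half]            # first half of the data
    upper = data[n - half:]        # second half; leaves out Q2 itself when n is odd
    return (data[0], middle(lower), middle(data), middle(upper), data[-1])

print(five_number_summary([3, 6, 6, 8, 14, 18, 23, 30]))
```

Because both halves are cut with the same length `half`, they always contain the same number of values, as the procedure requires.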
o Example 4: The following dataset represents the number of items purchased in a single shopping trip to
a grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 30
Give the five number summary.
 Minimum

 Maximum

 Median, Q 2
 Note the data is in order (if it wasn’t you would need to put it in ascending order first).
 n= which is
 The middle falls between the $\frac{n}{2} =$ ____ and $\frac{n}{2} + 1 =$ ____ values.
 value = and value =
 Average these to find median =
 First and Third Quartiles
 Split the dataset into two equal halves where the median fell.
 Find middle of first half only: Note each half has 4 values.

n = 4 (even)

Middle between the $\frac{n}{2} =$ 2nd and $\frac{n}{2} + 1 =$ 3rd values
 Find middle of second half only:

 Fill in the blanks:


 The smallest number of items purchased was .
 25% of shoppers purchased or fewer items.
 50% of shoppers purchased or fewer items.
 75% of shoppers purchased or fewer items.
 The largest number of items purchased was .
o Quartiles Break Data into 4 Equal Chunks

3 6 6 8 14 18 23 30

o Example 5: The following dataset represents the number of items purchased in a single shopping trip to
a grocery store for a random sample of 8 shoppers.

3 6 6 8 14 18 23 100

o Measures of Positions and Outliers

 The could be greatly affected by outliers.

 The are much less affected by an outlier.


o Example 6: Find the five number summary of the following dataset.
0 10 15 18 20 22 24 26 35 40 52
 Note: Data is already ordered.
 Find the quartiles:
 Find Q 2.

 Find Q 1 and Q 3.
 Split the data into 2 equal halves (since n was odd, do not include Q2 in either half).

 Give five number summary: Min= Q1 = Q2 = Q3 = Max =


 Boxplot:
o A graph to visualize the five number summary.
o Used to be called a box-and-whisker plot
o Box:
 The three quartiles make up the box.
 On a number line you will draw three vertical lines (of equal height) at the three quartile values.
 Connect the top and bottom of the lines to make a box.
o Whisker:
 Draw a line from the middle of the front of the box to the value of minimum.
 Draw a line from the middle of the end of the box to the value of the maximum.
o Example 7: Draw the boxplot for the five number summary:
 Min = 3   Q1 = 6   Q2 = 11   Q3 = 20.5   Max = 30

o Example 8: Assume a class has 10 homework assignments for the semester. An instructor is interested
in determining if students would do homework assignments if they were not part of the grade.
 The teacher gives one section of the class the homework assignments, and they do not count as part of the final grade.
 Another section of the class is given the homework assignments, and they do count as part of the final grade.

It is clear from the boxplot that there is a difference in the number of assignments done in the semester based on whether or not the assignments count towards the final grade calculation.

 Measures of Variability:
o Information gained from these measurements:
o Example 9: The following dataset represents the number of items purchased in a single shopping trip to
a grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 30
The measures of variability would answer the question: How much variation is there in the number of
items normally purchased in a single shopping trip to a grocery store?
 Range

 IQR

 Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$
o Let's start by understanding the numerator of this equation.

o Recall the standard deviation represents the average distance (or difference) the data falls from the
mean.

o Step 1: Start by calculating the difference between each data point and the mean. (We previously found $\bar{x}$ = 13.5.)
 This is called finding the deviations.

x 3 6 6 8 14 18 23 30

Step 1 $(x - \bar{x})$

• Then find the average deviation – this is where things get tricky.

• Normally to find an average you would take all the values, in this case $(x - \bar{x})$, add them up, and divide by the number of values, n.
• The problem with this method: $\sum (x - \bar{x})$ = -10.5 + (-7.5) + (-7.5) + (-5.5) + 0.5 + 4.5 + 9.5 + 16.5 = 0. The sum of the deviations will always be zero because the negative deviations exactly cancel the positive deviations. This is because the mean is the balance point in the data.

• The fix to the problem: we need to remove the negative values. To do this we square each of the
deviations.

Step 2: Find $\sum (x - \bar{x})^2$, the sum of the squared deviations (which we have to do so we don't end up with zero in our numerator).

Step 2 $(x - \bar{x})^2$

 Step 3: Determine the numerator for the equation:

Step 3 $\sum (x - \bar{x})^2$ = 110.25 + 56.25 + 56.25 + 30.25 + 0.25 + 20.25 + 90.25 + 272.25 =
 Now let’s talk about the denominator.
 We still want to find the average of the squared deviations so we do need to divide:
 For the sample standard deviation you divide by n – 1 because the sample mean is used to
calculate the deviations.
 Explanation:
 If $\bar{x}$ = 13.5 and n = 8, which we have used to calculate the standard deviation, then $\sum x = \bar{x}(n) = 108$.
 This problem has n – 1 degrees of freedom, meaning 7 of the values can be any possible set of numbers, but the last value, the 8th value, must be a value that will result in $\sum x = 108$.
 Step 4: Find the average of the squared deviations:

Step 4 $\frac{\sum (x - \bar{x})^2}{n-1} =$
 Step 5: Take the square root:
 The answer from step 4 is the average squared deviations.
 We don't want to talk about things in squared units, so let's get rid of the squared units by taking ________.

Step 5 $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}} =$
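
The five steps can be traced in a few lines of Python. This is a minimal sketch using the grocery dataset from Example 9; the variable names are chosen here, and the final comparison against `statistics.stdev` (which also divides by n – 1) is included only as a cross-check.

```python
import math
import statistics

items = [3, 6, 6, 8, 14, 18, 23, 30]
n = len(items)
x_bar = sum(items) / n                       # sample mean, 13.5

deviations = [x - x_bar for x in items]      # Step 1: deviations from the mean
squared = [d ** 2 for d in deviations]       # Step 2: squared deviations
numerator = sum(squared)                     # Step 3: sum of squared deviations
variance = numerator / (n - 1)               # Step 4: divide by n - 1
s = math.sqrt(variance)                      # Step 5: take the square root

print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", numerator)
print("s:", round(s, 2), "| statistics.stdev:", round(statistics.stdev(items), 2))
```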
o Fill in the blanks:
 The average distance between the values in the dataset and the mean was .
 The total spread in the data, the difference between the smallest value and the largest value,

was .
 The spread in the middle 50% of the data, the difference between the first and third quartile,

was .
o Measures of Variability and Outliers
o Example 10: The following dataset represents the number of items purchased in a single shopping trip to
a grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 100
 Determine if and by how much the measures of variability change if there is an outlier.

 Range: Max – Min =

 Range will by the presence of an outlier.

 IQR: Q3 − Q1 =

 IQR will by the presence of an outlier.


 Standard Deviation: (Note: $\bar{x}$ = 22.25)

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}} =$

 Standard deviation affected by the presence of an outlier.
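
To see how much each measure of variability moves, the two datasets can be compared directly. The sketch below assumes the quartiles are found by splitting the sorted data into two equal halves, as in the five number summary procedure from these notes; the helper names `middle` and `spread` are made up for illustration.

```python
import statistics

def middle(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    if n % 2 == 0:
        return (sorted_values[n // 2 - 1] + sorted_values[n // 2]) / 2
    return sorted_values[n // 2]

def spread(values):
    """Return (range, IQR, sample standard deviation)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = middle(data[:half])               # middle of the first half
    q3 = middle(data[len(data) - half:])   # middle of the second half
    return (data[-1] - data[0], q3 - q1, statistics.stdev(data))

results = {
    "without outlier": spread([3, 6, 6, 8, 14, 18, 23, 30]),
    "with outlier": spread([3, 6, 6, 8, 14, 18, 23, 100]),
}
for label, (rng, iqr, sd) in results.items():
    print(f"{label}: range = {rng}, IQR = {iqr}, s = {sd:.2f}")
```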


 Shapes of Distributions
o Shapes of distributions are described by the following characteristics:
 Number of peaks
 If there is a single peak in the distribution then it is called .

 If there are two peaks in the distribution then it is called .


 Symmetric or skewed
 Symmetric distributions occur if from the middle of the distribution the two sides are
mirror images of each other.
 Skewed distributions have more points plotted on one side of the graph than the other.
o Left skewed (negatively skewed) distributions have the majority of the data in the ________ portion of the distribution with unusually ________ values pulling the tail further from the peak.

o Right skewed (positively skewed) distributions have the majority of the data in the ________ portion of the distribution with unusually ________ values pulling the tail further from the peak.

 Mean vs Median in Unimodal Distributions:

Notice:
• To help determine the shape of the distribution, find the peak and determine which tail from the peak goes
further. The longer tail is the direction it is skewed.
• The mean gets pulled further from the peak in the direction of the longer tail.
• The longer the tail to the left the smaller the mean. (note < points to the left)
• The longer the tail to the right the larger the mean. (note > points to the right)

 Z-scores:
o Definition:
 A z-score gives the number of ________ above or below the ________ a value falls.
 This will tell us how different a particular value is from the mean in the number of .
o Formula:
 $z = \frac{\text{a value of } x - \text{mean}}{\text{standard deviation}}$
 If you are working with a sample: $z = \frac{x - \bar{x}}{s}$
 If you are working with a population: $z = \frac{x - \mu}{\sigma}$
o The possible values a z-score can take are:
 z = 0. This will occur when a particular value ________ to the mean. $x = \bar{x}$ or $x = \mu$.
 z can be negative. This will occur when a particular value ________ than the mean. $x < \bar{x}$ or $x < \mu$.
 z can be positive. This will occur when a particular value ________ than the mean. $x > \bar{x}$ or $x > \mu$.
o Uses of the z-score
 Standardizes the units:
 Variables generally could have many different units: inches, feet, dollars, pounds, etc.
 We can compare different variables if we can think in the same units; in this case, how
many standard deviations.
 Determines if a value is considered unusual; in other words, determines if a value is an outlier.

 Any z-score ________ three standard deviations of the mean is not considered unusual. (The value is not considered an outlier.)

 Any z-score ________ three standard deviations away from the mean is considered unusual. (The value can be considered an outlier.)

 Many methods in statistics will require us to study variables using the same unit of
measurement, how many standard deviations. (z-scores will continuously come up in this class
throughout the semester).
o Interpreting the z-score
 Outline: A value of ____ (the value of x) falls ____ (z as a positive number) standard deviations ____ (above/below) the mean.
 The first blank is the value of x you used in the equation.
 The second blank is the z score calculated as a positive number (never put a negative
value in this spot).
 The last blank is either the word above (if the z-score was positive) or below (if the z-
score was negative).
o Example 11: The following dataset represents the number of items purchased in a single shopping trip to
a grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 100
 A z-score could be calculated for each of the values of x in the dataset.
 Find and interpret the z-score for the following values and determine if the value would be
considered unusual. Note the mean was found to be 22.25 and the standard deviation was
found to be 32.15.
 Determine the z-score for a shopper that purchased 6 items during a single shopping trip to a
grocery store.

 Determine the z-score for a shopper that purchased 100 items during a single shopping trip to a
grocery store.
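
The z-score formula and the interpretation sentence can be sketched in Python as follows, using the mean (22.25) and standard deviation (32.15) given in Example 11. The function names are made up for illustration, and treating a value as unusual only when it is strictly more than 3 standard deviations from the mean is an assumption about how the cutoff is applied.

```python
def z_score(x, mean, sd):
    # (value - mean) / standard deviation
    return (x - mean) / sd

def interpret(x, mean, sd):
    z = z_score(x, mean, sd)
    if z == 0:
        return f"A value of {x} is equal to the mean."
    direction = "above" if z > 0 else "below"
    # abs() keeps the reported number of standard deviations positive.
    return f"A value of {x} falls {abs(z):.2f} standard deviations {direction} the mean."

def is_unusual(x, mean, sd, cutoff=3):
    # Assumed rule: unusual when more than `cutoff` standard deviations away.
    return abs(z_score(x, mean, sd)) > cutoff

for x in (6, 100):
    print(interpret(x, 22.25, 32.15), "Unusual:", is_unusual(x, 22.25, 32.15))
```

The standard deviation here was itself inflated by the value 100, which is worth keeping in mind when reading that value's z-score.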

o Example 12: An analysis was done on the amount of money students spend per day on Spring Break.
The population mean was found to be $100 with a standard deviation of $50.
 Determine if the following values are considered unusual or not. Justify your answer
referencing the z-score.
 Would a person that spent $200 per day on Spring Break be considered unusual?

 Would a person that spent $270 per day on Spring Break be considered unusual?

 The Empirical Rule


o Only works if you have a symmetric, bell-shaped distribution.

 This is called the and we will talk a lot more about this
later in the class.
o What the rule states:
 If you have a symmetric, bell-shaped distribution:

 Approximately % of the data will fall within one standard deviation of the
mean.
 Approximately % of the data will fall within two standard deviations of the
mean.
 Approximately % of the data will fall within three standard deviations of the
mean.
o How the rule is used:
 Once you determine the shape is symmetric, bell-shaped and you determine the mean and
standard deviation you can determine between which two values these percentages fall.

 % falls between μ−1 σ and μ+1 σ .

 % falls between μ−2 σ and μ+2 σ .

 % falls between μ−3 σ and μ+3 σ .


o Example 13: An analysis was done on the amount of money students spend per day on Spring Break.
The population mean was found to be $100 with a standard deviation of $50.
 Assume this distribution was roughly symmetric and bell-shaped.
 Note the Empirical rule can be used only because we know the shape is roughly
symmetric and bell-shaped. (Without this information or if the data was known to be
skewed the Empirical rule would not apply).

 Between which two values will 68% of the data fall?

 Between which two values will 95% of the data fall?

 Between which two values will 99.7% of the data fall?
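
The three intervals in Example 13 follow directly from the mean and standard deviation. The sketch below is one way to compute them; the function name `empirical_rule_intervals` is made up for illustration, and the result is only meaningful under the stated assumption that the distribution is roughly symmetric and bell-shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Map each percentage to (mean - k*sd, mean + k*sd) for k = 1, 2, 3."""
    return {pct: (mean - k * sd, mean + k * sd)
            for k, pct in ((1, 68), (2, 95), (3, 99.7))}

intervals = empirical_rule_intervals(100, 50)
for pct, (low, high) in intervals.items():
    print(f"About {pct}% of the data falls between {low} and {high} dollars")
```

A lower limit below zero is not a possible amount of spending; it is a reminder that the bell shape is only an approximation for this kind of data.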
