Chapter 2
• Classification
• Residence Type
• Current GPA
o Tables for Categorical Variables:
If you are only working with one categorical variable, you make what is called a ________.
If you are working with two categorical variables, you make what is called a ________.
The sum of all the percentages of values less than the upper
limit in the bin. A running total for percentages. A number between 0% and 100%. (The
last data value should have a value of 100%, or very close if there is a rounding issue).
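The running total described above (often called a cumulative relative frequency) can be sketched in a few lines of Python. The bin counts below are hypothetical and are not taken from these notes; they only illustrate how the percentages accumulate toward 100%.

```python
from itertools import accumulate

# Hypothetical bin counts (an assumption, not data from the notes),
# listed from the lowest bin to the highest bin.
counts = [4, 10, 5, 1]
total = sum(counts)

# Relative frequency of each bin, as a percentage.
relative = [100 * c / total for c in counts]

# Cumulative relative frequency: a running total of the percentages.
cumulative = list(accumulate(relative))

print(relative)
print(cumulative)
```

The last entry of the running total reaches 100% because every value falls below the upper limit of the last bin.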
Outliers
o An observation that lies an ________ distance from other values in a dataset.
o At the end of this lecture part we will look at the rule used by statisticians to determine what is an
abnormal distance from the others. There is a rule of thumb to determine if a value is unusual and
considered an outlier.
o As we define the numerical summaries we use we will discuss which measures would be greatly affected
by the presence of an outlier.
Sample Size
o The number of values in a dataset.
o Notation:
N = ________    n = ________
o Note this value will be referenced in many formulas throughout the semester.
Measures of Center
o Information gained from these measurements:
o Mean
The ________ in the dataset.
Notation:
$\mu$ = ________    $\bar{x}$ = ________
Calculated:
$\bar{x} = \frac{\sum x}{n} = \frac{\text{sum of all values}}{\text{total number of values}}$
o Median
The ________ when data is ordered. ________% of the data falls at or below the value.
Calculated:
Find the middle of the dataset.
If n is even, the middle falls between the $\frac{n}{2}$ and $\frac{n}{2} + 1$ values. Average these two values.
If n is odd, the middle falls at the $\frac{n+1}{2}$ value.
o Mode
The ________ value in the dataset (if there is one).
If there isn’t a value that occurs most often, state that there is ________.
Calculated:
Easiest to do when data is sorted.
Count how many times each value appears and the one with the highest count is the
mode.
Example 1: The following dataset represents the number of items purchased in a single shopping trip to a
grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 30
The measures of center would answer the question: What is the number of items normally purchased in a single
shopping trip to a grocery store?
o Mean:
$\bar{x} = \frac{\sum x}{n} = \frac{3+6+6+8+14+18+23+30}{8} =$ ________
o Median:
Note the data is in order (if it wasn’t you would need to put it in ascending order first).
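n = ________, which is ________.
The middle falls between the $\frac{n}{2} =$ ________ and $\frac{n}{2} + 1 =$ ________ values.
________ value = ________ and ________ value = ________
Average these to find median = ________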
o Mode:
________ occurs ________, which is the most common (all others occur only once).
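As a way to check the hand calculations in Example 1, here is a minimal sketch using Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median, multimode

# Example 1 data: items purchased by 8 shoppers (already in ascending order).
items = [3, 6, 6, 8, 14, 18, 23, 30]

center_mean = mean(items)        # sum of all values divided by n
center_median = median(items)    # n = 8 is even, so the 4th and 5th values are averaged
center_modes = multimode(items)  # list of the most frequently occurring value(s)

print("mean:", center_mean)
print("median:", center_median)
print("mode(s):", center_modes)
```

`multimode` returns a list, so a dataset with a tie for the most common value shows every tied value rather than just one.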
Example 2: The following dataset represents the number of items purchased in a single shopping trip to a
grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 100
Assume someone was having a party and needed to purchase a large amount of items, 100.
Since 100 is abnormal, compared to other values, it is probably an outlier.
o Mean: Would the value change?
All values are used in the calculation of the mean.
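To see how the outlier in Example 2 affects the measures of center, the sketch below compares the two datasets side by side. It assumes the only difference is that the largest value, 30, is replaced by 100.

```python
from statistics import mean, median

original = [3, 6, 6, 8, 14, 18, 23, 30]       # Example 1
with_outlier = [3, 6, 6, 8, 14, 18, 23, 100]  # Example 2

# The mean uses every value, so the outlier pulls it upward.
mean_before = mean(original)
mean_after = mean(with_outlier)

# The median depends only on the middle positions, which did not change.
median_before = median(original)
median_after = median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```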
Note: The measures of center are by far the most common and simplest measurements in statistics; however,
to truly get a good picture of the data we need to know more information.
Measures of Position
o Information gained from these measurements.
Maximum
Median, $Q_2$
Note the data is in order (if it wasn’t you would need to put it in ascending order first).
n = ________, which is ________.
The middle falls between the $\frac{n}{2} =$ ________ and $\frac{n}{2} + 1 =$ ________ values.
________ value = ________ and ________ value = ________
Average these to find median = ________
First and Third Quartiles
Split the dataset into two equal halves where the median fell.
Find middle of first half only: Note each half has 4 values.
n = 4, which is even.
The middle falls between the $\frac{n}{2} =$ 2nd and $\frac{n}{2} + 1 =$ 3rd values.
Find middle of second half only:
3 6 6 8 14 18 23 30
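The split-into-halves method for the quartiles can be sketched as follows. The function name `quartiles_by_halves` is hypothetical. Statistical software often uses other quartile conventions, so its answers may differ slightly from this hand method.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) by splitting the sorted data where the median falls."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]   # first half of the data
    upper = data[-half:]  # second half; when n is odd this leaves out Q2 itself
    return median(lower), median(data), median(upper)

items = [3, 6, 6, 8, 14, 18, 23, 30]
q1, q2, q3 = quartiles_by_halves(items)
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3)
```

Each half here has 4 values, so each quartile is the average of the 2nd and 3rd values in its half.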
o Example 5: The following dataset represents the number of items purchased in a single shopping trip to
a grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 100
Find $Q_1$ and $Q_3$.
Split the data into 2 equal halves (since n was odd do not include $Q_2$ in either half).
o Example 8: Assume a class has 10 homework assignments for the semester. An instructor is interested
in determining if students would do homework assignments if they were not part of the grade.
The teacher gives one section of the class the homework assignments, and they do not count as part of their final grade.
Another section of the class is given the homework assignments, and they do count as part of their final grade.
Measures of Variability:
o Information gained from these measurements:
o Example 9: The following dataset represents the number of items purchased in a single shopping trip to
a grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 30
The measures of variability would answer the question: How much variation is there in the number of
items normally purchased in a single shopping trip to a grocery store?
Range
IQR
Standard Deviation: $s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}}$
o Let's start by understanding the numerator of this equation.
o Recall the standard deviation represents the average distance (or difference) the data falls from the
mean.
o Step 1: Start by calculating the difference between each data point and the mean. (We previously found $\bar{x}$ = 13.5).
This is called finding the deviations.
x 3 6 6 8 14 18 23 30
Step 1 ($x - \bar{x}$)
• Then find the average deviation – this is where things get tricky.
• Normally to find an average you would take all the values, in this case ($x - \bar{x}$), and divide by the number of values, n.
• The problem with this method: $\sum (x-\bar{x})$ = -10.5 + (-7.5) + (-7.5) + (-5.5) + 0.5 + 4.5 + 9.5 + 16.5 = 0.
The sum of the deviations will always be zero because the negative deviations and the positive deviations always add up to the same total distance. This is because the mean is the balance point in the data.
• The fix to the problem: we need to remove the negative values. To do this we square each of the
deviations.
• Step 2: Find $\sum (x-\bar{x})^2$, the sum of the squared deviations (which we have to do so we don’t end up with zero in our numerator).
Step 2 $(x-\bar{x})^2$
Step 3 $\sum (x-\bar{x})^2$ = 110.25 + 56.25 + 56.25 + 30.25 + 0.25 + 20.25 + 90.25 + 272.25 = ________
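Steps 1 through 3 can be checked with a short sketch. It uses the Example 9 data and shows both the deviations summing to zero and the sum of the squared deviations. The variable names are illustrative only.

```python
items = [3, 6, 6, 8, 14, 18, 23, 30]
n = len(items)
x_bar = sum(items) / n  # the sample mean, 13.5

# Step 1: deviation of each value from the mean.
deviations = [x - x_bar for x in items]

# The deviations always cancel out, which is why they cannot be averaged directly.
sum_of_deviations = sum(deviations)

# Step 2: square each deviation so negatives no longer cancel positives.
squared_deviations = [d ** 2 for d in deviations]

# Step 3: add up the squared deviations.
sum_of_squares = sum(squared_deviations)

print("deviations:", deviations)
print("sum of deviations:", sum_of_deviations)
print("squared deviations:", squared_deviations)
print("sum of squared deviations:", sum_of_squares)
```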
Now let’s talk about the denominator.
We still want to find the average of the squared deviations so we do need to divide:
For the sample standard deviation you divide by n – 1 because the sample mean is used to
calculate the deviations.
Explanation:
If $\bar{x}$ = 13.5 and n = 8, which we have used to calculate the standard deviation, then $\sum x = \bar{x} \cdot n = 108$.
This problem has n – 1 degrees of freedom, meaning 7 of the values can be any possible set of numbers, but the last value, the 8th value, must be a value that will result in $\sum x = 108$.
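The degrees-of-freedom idea can be illustrated with a small sketch: once the mean and the first 7 values are known, the 8th value is forced. The first seven values below are simply the ones from the example, but any seven numbers would work the same way.

```python
x_bar = 13.5
n = 8

# With the mean fixed, the total of all n values is fixed too.
required_total = x_bar * n  # 108

# Seven values are free to vary; here we reuse the example's first seven.
first_seven = [3, 6, 6, 8, 14, 18, 23]

# The last value has no freedom left: it must make the total come out right.
last_value = required_total - sum(first_seven)

print("required total:", required_total)
print("forced 8th value:", last_value)
```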
Step 4: Find the average of the squared deviations:
Step 4 $\frac{\sum (x-\bar{x})^2}{n-1} =$ ________
Step 5: Take the square root:
The answer from step 4 is the average squared deviation.
We don’t want to talk about things in squared units, so let’s get rid of the squared units by taking ________.
Step 5 $s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} =$ ________
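Steps 4 and 5 can be sketched in Python and compared against the standard library's `statistics.stdev`, which also divides by n – 1. This is only a check on the hand calculation, not a replacement for working through the steps.

```python
import math
from statistics import stdev

items = [3, 6, 6, 8, 14, 18, 23, 30]
n = len(items)
x_bar = sum(items) / n

# Steps 1-3: sum of the squared deviations from the mean.
sum_of_squares = sum((x - x_bar) ** 2 for x in items)

# Step 4: divide by n - 1 (sample data), giving squared units.
average_squared_deviation = sum_of_squares / (n - 1)

# Step 5: take the square root to return to the original units.
s = math.sqrt(average_squared_deviation)

print("step 4:", round(average_squared_deviation, 2))
print("step 5, s:", round(s, 2))
print("statistics.stdev:", round(stdev(items), 2))
```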
o Fill in the blanks:
The average distance between the values in the dataset and the mean was ________.
The total spread in the data, the difference between the smallest value and the largest value, was ________.
The spread in the middle 50% of the data, the difference between the first and third quartile, was ________.
o Measures of Variability and Outliers
o Example 10: The following dataset represents the number of items purchased in a single shopping trip to
a grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 100
Determine if and by how much the measures of variability change if there is an outlier.
IQR: $Q_3 - Q_1 =$ ________
$s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} =$ ________
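To compare the measures of variability with and without the outlier, the sketch below computes the range, IQR, and sample standard deviation for both datasets. The helper name `spread` is hypothetical, and the quartiles use the split-into-halves method from these notes rather than a software default.

```python
from statistics import median, stdev

def spread(values):
    """Return the range, IQR (split-halves quartiles), and sample standard deviation."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])   # middle of the first half
    q3 = median(data[-half:])  # middle of the second half
    return {
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
        "s": stdev(data),
    }

original = spread([3, 6, 6, 8, 14, 18, 23, 30])
with_outlier = spread([3, 6, 6, 8, 14, 18, 23, 100])

for name in ("range", "iqr", "s"):
    print(name, round(original[name], 2), "->", round(with_outlier[name], 2))
```

The IQR only looks at the middle 50% of the data, so changing the largest value leaves it untouched, while the range and standard deviation both grow.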
Notice:
• To help determine the shape of the distribution, find the peak and determine which tail from the peak goes
further. The longer tail is the direction it is skewed.
• The mean gets pulled further from the peak in the direction of the longer tail.
• The longer the tail to the left the smaller the mean. (note < points to the left)
• The longer the tail to the right the larger the mean. (note > points to the right)
Z-scores:
o Definition:
A z-score gives the number of ________ above or below the ________ a value falls.
This will tell us how different a particular value is from the mean in the number of ________.
o Formula:
$z = \frac{\text{a value of } x - \text{mean}}{\text{standard deviation}}$
If you are working with a sample: $z = \frac{x-\bar{x}}{s}$
If you are working with a population: $z = \frac{x-\mu}{\sigma}$
o The possible values a z-score can take are:
z = 0. This will occur when a particular value ________ to the mean. $x = \bar{x}$ or $x = \mu$.
z can be negative. This will occur when a particular value ________ than the mean. $x < \bar{x}$ or $x < \mu$.
z can be positive. This will occur when a particular value ________ than the mean. $x > \bar{x}$ or $x > \mu$.
o Uses of the z-score
Standardizes the units:
Variables generally could have many different units: inches, feet, dollars, pounds, etc.
We can compare different variables if we can think in the same units; in this case, how
many standard deviations.
Determines if a value is considered unusual; in other words, determines if a value is an outlier.
Many methods in statistics will require us to study variables using the same unit of
measurement, how many standard deviations. (z-scores will continuously come up in this class
throughout the semester).
o Interpreting the z-score
Online: A value of ________ (the value of x) falls ________ (z as a positive number) standard deviations ________ (above/below) the mean.
The first blank is the value of x you used in the equation.
The second blank is the z score calculated as a positive number (never put a negative
value in this spot).
The last blank is either the word above (if the z-score was positive) or below (if the z-
score was negative).
o Example 11: The following dataset represents the number of items purchased in a single shopping trip to
a grocery store for a random sample of 8 shoppers.
3 6 6 8 14 18 23 100
A z-score could be calculated for each of the values of x in the dataset.
Find and interpret the z-score for the following values and determine if the value would be
considered unusual. Note the mean was found to be 22.25 and the standard deviation was
found to be 32.15.
Determine the z-score for a shopper that purchased 6 items during a single shopping trip to a
grocery store.
Determine the z-score for a shopper that purchased 100 items during a single shopping trip to a
grocery store.
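The two z-scores in Example 11 can be sketched as below, using the mean (22.25) and standard deviation (32.15) given in the example. The function names `z_score` and `interpret` are hypothetical. The cutoff for calling a value unusual is not stated in this part of the notes, so the sketch only computes and words the z-score.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x falls from the mean."""
    return (x - mean) / sd

def interpret(x, mean, sd):
    """Word the z-score using the template from the notes."""
    z = z_score(x, mean, sd)
    if z == 0:
        return f"A value of {x} is equal to the mean."
    direction = "above" if z > 0 else "below"
    # The z-score is reported as a positive number; the direction word carries the sign.
    return f"A value of {x} falls {abs(z):.2f} standard deviations {direction} the mean."

sample_mean = 22.25
sample_sd = 32.15

print(interpret(6, sample_mean, sample_sd))
print(interpret(100, sample_mean, sample_sd))
```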
o Example 12: An analysis was done on the amount of money students spend per day on Spring Break.
The population mean was found to be $100 with a standard deviation of $50.
Determine if the following values are considered unusual or not. Justify your answer
referencing the z-score.
Would a person that spent $200 per day on Spring Break be considered unusual?
Would a person that spent $270 per day on Spring Break be considered unusual?
This is called the ________ and we will talk a lot more about this later in the class.
o What the rule states:
If you have a symmetric, bell-shaped distribution:
Approximately ________% of the data will fall within one standard deviation of the mean.
Approximately ________% of the data will fall within two standard deviations of the mean.
Approximately ________% of the data will fall within three standard deviations of the mean.
o How the rule is used:
Once you determine the shape is symmetric, bell-shaped and you determine the mean and
standard deviation you can determine between which two values these percentages fall.
Between which two values will 68% of the data fall?
Between which two values will 95% of the data fall?
Between which two values will 99.7% of the data fall?
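The three questions above amount to computing the mean plus or minus 1, 2, and 3 standard deviations. Here is a minimal sketch, assuming a symmetric, bell-shaped distribution; the mean of 70 and standard deviation of 5 are hypothetical values chosen only for illustration, and the function name is likewise an assumption.

```python
def empirical_rule_intervals(mean, sd):
    """Return the (low, high) interval for each percentage in the rule."""
    intervals = {}
    for k, percent in ((1, 68), (2, 95), (3, 99.7)):
        # k standard deviations on either side of the mean.
        intervals[percent] = (mean - k * sd, mean + k * sd)
    return intervals

# Hypothetical bell-shaped data: mean 70, standard deviation 5.
intervals = empirical_rule_intervals(70, 5)
for percent, (low, high) in intervals.items():
    print(f"About {percent}% of the data fall between {low} and {high}.")
```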