Stat 101 Exam 1: Important Formulas and Concepts 1
Important Formulas and Concepts
1 Chapter 1
1.1 Definitions
1. Data
Any collection of numbers, characters, images, or other items that provide information
about something.
2. Categorical/Qualitative Variables
Name categories for grouping.
3. Quantitative Variables
A variable that contains measured numerical values with measurement units.
4. Identifier Variable
Each record has a unique value like Student ID or SSN.
5. Frequency Table
Records the totals (counts) for each category and uses the category names to label each row.
6. Relative Frequency Table
Displays percentages of the values in each category.
7. Bar Chart
Displays the distribution of a categorical variable, showing counts for each category
next to each other for easy comparison.
8. Relative Frequency Bar Chart
Same as a bar chart but displays the percentage of people in each category rather than
the counts.
9. Pie Charts
Shows a whole group of cases as a circle. The circle is sliced into pieces whose size is
proportional to the fraction of the whole in each category.
10. Distribution
Slices up all the possible values of the variable into equal width bins and gives the
number of values (or counts) falling into each bin.
11. Histogram
Uses adjacent bars to show the distribution of a quantitative variable. Each bar shows
the frequency of values falling into each bin.
1
This version: February 3, 2020, by Dale Embers. May not include all things that could possibly be
tested on. To be used as an additional reference to studying all Chapters 1 - 6.
12. Unimodal
Histogram with one peak.
13. Bi-modal
Histogram with two peaks.
14. Uniform
Histogram that doesn’t appear to have any mode. Bars are approximately the same
height for each bin.
15. Symmetric
Histogram in which the two halves on either side of the center look approximately like
mirror images.
16. Skew
Histogram that is not symmetric.
2 Chapter 2
2.1 Definitions
1. 5 Number summary: Min, Q1, Median, Q3, Max
2. Boxplot: Displays the 5 number summary as a central box with whiskers that extend
to the nonoutlying data values.
3. Use the sample mean and sample standard deviation when the data is symmetric and
has no significant outliers. Use the median and IQR when the data is skewed or has
significant outliers.
2.2 Formulas
1. Median = Once the data is ordered from smallest to largest, it is the middle value in
the data. Divides the histogram into 2 equal pieces.
2. Mean = Average of all of the values = $\bar{x} = \frac{\sum x}{n}$
3. Range = Max - Min
7. Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
8. Standard Deviation: $s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
9. Upper Fence for Boxplot = Q3 + 1.5IQR
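The Chapter 2 formulas can be checked numerically. The following is a minimal Python sketch, assuming a small hypothetical data set that does not come from this guide. The lower fence, Q1 - 1.5 IQR, is the standard counterpart of the upper fence formula above. Quartile conventions vary between textbooks and calculators, so the `method="inclusive"` setting is an assumption and your calculator may report slightly different quartiles.

```python
import math
import statistics

# Hypothetical sample, chosen only for illustration (not from the study guide).
data = [2, 4, 4, 5, 7, 9, 25]

n = len(data)
mean = sum(data) / n                                      # x-bar = sum(x) / n
variance = sum((x - mean) ** 2 for x in data) / (n - 1)   # s^2, divides by n - 1
std_dev = math.sqrt(variance)                             # s = sqrt(s^2)
data_range = max(data) - min(data)                        # Range = Max - Min

median = statistics.median(data)
# Quartile rule is an assumption; other conventions give slightly different values.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Values beyond either fence are flagged as outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("mean, variance, sd:", mean, variance, std_dev)
print("5 number summary:", min(data), q1, median, q3, max(data))
print("fences:", lower_fence, upper_fence, "outliers:", outliers)
```

In this sample the single large value pulls the mean above the median, which is the usual pattern for right-skewed data.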
3 Chapter 3
3.1 Definitions
1. z-score
Tells how many standard deviations a value is from the mean. Regardless of direction,
the farther a data value is from the mean, the more unusual it is.
3. 68-95-99.7 Rule
In a normal model, about 68% of values fall within 1 standard deviation of the mean,
about 95% fall within 2 standard deviations of the mean; about 99.7% of values fall
within 3 standard deviations of the mean.
4. Normal Percentile
The normal percentile corresponding to a z-score gives the percentage of values in a
standard normal distribution found at that z-score or below. This is the area under
the curve to the left of that z-score. See the normal table in the textbook.
3.2 Formulas
1. z-score: $z = \frac{x - \mu}{\sigma}$
6. If a normal curve is split with 30% of the area on one side, the other side of the curve
has the remaining 70% of the area.
7. If a normal curve has 60% of the area in the middle, the remaining portions are a total
of 40%. This 40% is allocated half to each side. So the far left has 20% of the area,
the middle is 60% of the area, and the far right side has 20% of the area.
Textbook Normal Table Note: These tables give the percentage to the left of the z value.
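The z-score formula and the 68-95-99.7 Rule can be sketched in Python with the standard library's `statistics.NormalDist`. This is an illustrative sketch rather than the course's required method. The helper name `z_score` is hypothetical, and the mean of 16 and standard deviation of 3 are taken from example problem 9 in this guide.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


# Values from example problem 9: mean 16, standard deviation 3.
z_low = z_score(9, 16, 3)
z_high = z_score(21, 16, 3)

std_normal = NormalDist()  # standard normal model: mean 0, standard deviation 1

# Normal percentile: area to the LEFT of a z-score, like the textbook table.
pct_below_low = std_normal.cdf(z_low)

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

print("z-scores:", round(z_low, 2), round(z_high, 2))
print("percent within 1, 2, 3 sd:",
      {k: round(100 * area, 1) for k, area in within.items()})
```

The exact areas are slightly different from 68%, 95%, and 99.7%, because the rule uses rounded values.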
4 Chapter 4
4.1 Definitions
1. Scatterplot
Shows the relationship between 2 quantitative variables.
2. Direction of Scatterplot
Positive direction means that as one variable increases, so does the other. A decreasing
direction means the association is negative.
3. Form of Scatterplot
Is it in a straight line or some other form?
4. Strength of Scatterplot
Strong association if there is little scatter around the underlying relationship.
5. Outlier
A point that does not fit the overall pattern seen in the scatterplot.
4.2 Formulas
1. z-Scores for a Scatterplot: $z_x = \frac{x - \bar{x}}{s_x}$, $z_y = \frac{y - \bar{y}}{s_y}$
2. Correlation Coefficient: $r = \frac{\sum z_x z_y}{n - 1} = \frac{\sum [(x - \bar{x})(y - \bar{y})]}{(n - 1) s_x s_y}$
4.3 Properties of Correlation Coefficient
1. Always between -1 and 1
4. No units
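The correlation formula and the properties listed above can be illustrated with a short Python sketch. The paired data and the function names (`mean`, `sample_sd`, `correlation`) are hypothetical and used only for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation, dividing by n - 1."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), using the z-scores of x and y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)


# Hypothetical paired data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)

# Changing the units of x (here, multiplying every x by 100) leaves r
# unchanged, which is why r has no units.
r_rescaled = correlation([100 * x for x in xs], ys)

print(round(r, 4), round(r_rescaled, 4))
```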
5 Chapter 5
5.1 Definitions
1. Linear Model
Equation of the form ŷ = a + bx, where ŷ denotes the estimated (predicted) value of y.
2. Predicted Values
Value of ŷ found for a given x-value in the data.
4. Residual
Differences between data values and the corresponding values predicted by the model
(observed - predicted).
5. $R^2$
Gives the fraction of variability of y accounted for by the least squares linear regression
on x. It is an overall measure of how successful the regression is in linearly relating y
to x.
7. Extrapolation
Using the model to predict y for x-values outside the range of the data. It is unsafe in
any regression situation, and predictions from extrapolation should not be trusted.
8. Influential Point
A point that, if omitted from the data, results in a very different regression model.
5.2 Formulas
1. Residual = Observed value - Predicted value = y − ŷ
2. b = r (sy /sx )
3. a = ȳ − bx̄
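The slope, intercept, and residual formulas fit together as in the following Python sketch. The paired data are hypothetical and chosen only for illustration. The sketch also uses the standard fact that, for a least squares line with one predictor, R^2 equals the square of the correlation r.

```python
import math

# Hypothetical paired data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

# Correlation coefficient from the Chapter 4 formula.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

b = r * (s_y / s_x)    # slope
a = y_bar - b * x_bar  # intercept

predicted = [a + b * x for x in xs]                         # y-hat for each x
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed - predicted
r_squared = r ** 2     # fraction of the variability in y accounted for by x

print("a, b, R^2:", round(a, 2), round(b, 2), round(r_squared, 2))
print("residuals:", [round(e, 2) for e in residuals])
```

The residuals of a least squares line sum to zero (up to rounding), which is a quick check on the arithmetic.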
6 Chapter 6
6.1 Definitions
1. Contingency Table
Table which shows how the individuals are distributed along each variable.
2. Marginal Distribution
Row total or column total in contingency tables.
3. Conditional Distribution
Show distribution of one variable for just those cases that satisfy a condition on another
variable. Example: Event B given Event A occurs first.
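Marginal and conditional distributions can be computed directly from a contingency table. The Python sketch below uses the eye color counts from the Example Problems section of this guide. The dictionary layout and variable names are assumptions made for illustration.

```python
# Counts from the eye color table in the Example Problems section.
table = {
    "Male":   {"Blue": 5, "Green": 7, "Brown": 15},
    "Female": {"Blue": 6, "Green": 2, "Brown": 10},
}
colors = ["Blue", "Green", "Brown"]

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of gender: the row totals (and their share of everyone).
gender_totals = {g: sum(row.values()) for g, row in table.items()}
gender_shares = {g: t / grand_total for g, t in gender_totals.items()}

# Marginal distribution of eye color: the column totals.
eye_totals = {c: sum(table[g][c] for g in table) for c in colors}

# Conditional distribution of eye color GIVEN that the person is male:
# divide each count in the Male row by the Male row total, not the grand total.
eye_given_male = {c: table["Male"][c] / gender_totals["Male"] for c in colors}

print(grand_total, gender_totals, eye_totals)
print({c: round(100 * p, 1) for c, p in eye_given_male.items()})
```

The key design point is the denominator: a marginal percentage divides by the grand total, while a conditional percentage divides by the total of the row or column named in the condition.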
7 Extra Information
Review any and all notes and supplementary materials. It may be the case that something
was accidentally omitted from this study guide. Also, review any problems that may have
been discussed in class as not all example problems may have been provided here.
8 Example Problems
1. Use the below table to answer the following questions.
                    Eye Color
Gender     Blue    Green    Brown    Total
Male          5        7       15       27
Female        6        2       10       18
Total        11        9       25       45
(a) Construct a frequency table for Eye Color based on the data above.
(b) Find the marginal distribution of gender.
(c) What percentage of females have blue eyes?
(d) What percentage of green eyed people are male?
(e) What percentage of people are females and have green eyes?
(f) What percentage of blue eyed people are female?
(g) What percentage of males have brown eyes?
(h) What percentage of people have brown eyes?
(i) What percentage of people are males and have blue eyes?
3. In a histogram that is skewed to the right, which is larger, the mean or the median?
4. In a histogram that is skewed or has outliers, which should be reported, the mean or
the median?
5. In a histogram that is skewed or has outliers, which should be reported, the IQR or
the Standard Deviation?
(a) Mean
(b) Median and Quartiles
(c) Range and IQR
(d) What are the values of the upper fence and the lower fence?
(e) Are there any outliers in this data? Why?
(f) Find the variance and standard deviation.
8. Shown below are the histogram and summary statistics for the number of camp sites
at public parks in Vermont.
(a) Which statistics would you use to identify the center and spread of this distribution?
Why?
(b) How many parks would you classify as outliers? Explain.
(c) Create a boxplot for this data.
9. Given mean 16 and standard deviation 3.
(a) Standardize y = 9
(b) Standardize y = 21
(c) Which of the two above is most unusual?
(a) Using the model described above, draw the model showing what the 68-95-99.7
Rule predicts.
(b) In what interval would you expect the central 99.7% of values to be found?
(c) What percent of values are above 50?
(d) What percent of values are between 40 and 60?
(e) What percent of values are between 40 and 50?
(f) What percent of values are between 50 and 60?
(g) What percent of values are between 45 and 50?
(h) What percent of values are between 50 and 65?
(i) What percent of values are above 60?
(j) What percent of values are below 45?
(k) What percent of values are between 40 and 65?
(l) What percent of values are between 45 and 60?
12. Based on the Normal Model with a mean of 50 and standard deviation of 5, answer
the following questions.
(a) Draw a scatterplot of the above data to study the association between the energy
used and the price it cost.
(b) What is the direction of the association?
(c) What is the form of the relationship?
(d) What is the strength of the relationship?
(e) Are there any outliers?
(f) What is the correlation coefficient?
15. For the below residual plots, decide if a linear model is appropriate. In the case that the
linear model is not appropriate, decide which condition is violated (linearity, outlier,
or equal spread).
(a) $z = \frac{9 - 16}{3} = \frac{-7}{3} = -2.33$.
(b) $z = \frac{21 - 16}{3} = \frac{5}{3} = 1.67$.
(c) y = 9 is more unusual.
(a) Mean = 8.
(b) Standard deviation = 2.
(c) Variance = $2^2 = 4$.
(d) $z = \frac{10 - 8}{2} = \frac{2}{2} = 1$.
11. Use a normal model with a mean of 50 and standard deviation of 5.
(a) For the 68-95-99.7 Rule, we will have the following points on the graph (not shown
here):
$\mu - 3\sigma = 50 - 3(5) = 35$
$\mu - 2\sigma = 50 - 2(5) = 40$
$\mu - \sigma = 50 - 5 = 45$
$\mu = 50$
$\mu + \sigma = 50 + 5 = 55$
$\mu + 2\sigma = 50 + 2(5) = 60$
$\mu + 3\sigma = 50 + 3(5) = 65$
(b) Between 35 and 65.
(c) 50%.
(d) 95%.
(e) 47.5%.
(f) 47.5%.
(g) 34%.
(h) 49.85%.
(i) 2.5%.
(j) 16%.
(k) Between 40 and 50 is 47.5%. Between 50 and 65 is 49.85%. Between 40 and 65 is
97.35%.
(l) Between 45 and 50 is 34%. Between 50 and 60 is 47.5%. Between 45 and 60 is
81.5%.
12. Based on the Normal Model N(50, 5), answer the following questions. Draw pictures
to help you see what is going on.
(a) 50%.
(b) Step 1: Standardize. $z = \frac{62 - 50}{5} = \frac{12}{5} = 2.4$. Step 2: Calculate value from a
calculator or a table. From calculator: normalcdf(2.4, 999) = 0.0082. Solution
= 0.82%. From table: We want area(z > 2.4). The table gives area(z < 2.4) =
0.9918. Our answer is 1 − 0.9918 = 0.0082, which is 0.82%.
(c) Step 1: Standardize. $z = \frac{39 - 50}{5} = \frac{-11}{5} = -2.2$. Step 2: Calculate value from a
calculator or a table. From calculator: normalcdf(−999, −2.2) = 0.0139. Solution
= 1.39%. From table: We want area(z < −2.2). The table gives us this directly
and the value is 0.0139. Solution = 1.39%.
(d) Step 1: Standardize. $z = \frac{43 - 50}{5} = \frac{-7}{5} = -1.4$. Step 2: Calculate value from a
calculator or a table. From calculator: normalcdf(−1.4, 999) = 0.9192. Solution
= 91.92%. From table: We want area(z > −1.4). The table gives area(z <
−1.4) = 0.0808. Our answer is 1 − 0.0808 = 0.9192, which is 91.92%.
(e) Step 1: Standardize. $z = \frac{58 - 50}{5} = \frac{8}{5} = 1.6$. Step 2: Calculate value from a
calculator or a table. From calculator: normalcdf(−999, 1.6) = 0.9452. Solution
= 94.52%. From table: We want area(z < 1.6). The table gives us this directly
and the value is 0.9452. Solution = 94.52%.
(f) Step 1: Standardize both values. $z_1 = \frac{37 - 50}{5} = -2.6$. $z_2 = \frac{52 - 50}{5} = 0.4$. Step 2:
Calculate value from a calculator or a table. From calculator: normalcdf(−2.6, 0.4) =
0.6508. Solution = 65.08%. From table: We want area(−2.6 < z < 0.4). The
table gives area(z < −2.6) = 0.0047 and area(z < 0.4) = 0.6554. Our answer is
0.6554 − 0.0047 = 0.6507, which is 65.07%. The calculator answer and the table
answer differ slightly because the table values are rounded. Show your work!
(g) Step 1: Standardize both values. $z_1 = \frac{57.25 - 50}{5} = 1.45$. $z_2 = \frac{66 - 50}{5} = 3.2$. Step 2:
Calculate value from a calculator or a table. From calculator: normalcdf(1.45, 3.2) =
0.0728. Solution = 7.28%. From table: We want area(1.45 < z < 3.2). The table
gives area(z < 1.45) = 0.9265 and area(z < 3.2) = 0.9993. Our answer is
0.9993 − 0.9265 = 0.0728, which is 7.28%.
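The calculator steps in problem 12 can be reproduced in Python as a cross-check. This is a sketch, not a replacement for showing your work. It assumes the standard library's `statistics.NormalDist` stands in for the calculator's normalcdf, and `area_between` is a hypothetical helper name. Working with x-values directly on the N(50, 5) model is equivalent to standardizing first.

```python
from statistics import NormalDist

MODEL = NormalDist(mu=50, sigma=5)  # the N(50, 5) model from problem 12


def area_between(lower=None, upper=None, model=MODEL):
    """Area under the model between lower and upper; None means unbounded."""
    left = 0.0 if lower is None else model.cdf(lower)
    right = 1.0 if upper is None else model.cdf(upper)
    return right - left


# x-values taken from the worked answers above.
above_62 = area_between(lower=62)         # part (b): area(z > 2.4)
below_39 = area_between(upper=39)         # part (c): area(z < -2.2)
between_37_52 = area_between(37, 52)      # part (f): area(-2.6 < z < 0.4)
between_5725_66 = area_between(57.25, 66)  # part (g): area(1.45 < z < 3.2)

for label, area in [("b", above_62), ("c", below_39),
                    ("f", between_37_52), ("g", between_5725_66)]:
    print(label, f"{100 * area:.2f}%")
```

Using `None` for an open end avoids the calculator convention of typing 999 or −999 as a stand-in for infinity.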
13. Based on the Normal Model N(50, 5), answer the following questions. Draw pictures
to help you see what is going on.
(a) Highest 5% of values corresponds to a z-value of 1.645. Find this using invnorm(0.95)
on your calculator, or by looking for the value on a table. Use the z-score formula to
solve for the value you are looking for. $z = 1.645 = \frac{x - 50}{5}$
⇒ 8.225 = x − 50
⇒ x = 58.225.
(b) Lower 25% of values corresponds to a z-value of -0.67. Find this using invnorm(0.25)
on your calculator, or by looking for the value on a table. Use the z-score formula to
solve for the value you are looking for. $z = -0.67 = \frac{x - 50}{5}$
⇒ −3.35 = x − 50
⇒ x = 46.65.
(c) We need 2 values for z in this case. Let the lower value of z be $z_L$ and the upper
value of z be $z_R$. Find these values by typing invnorm(0.15) and invnorm(0.85)
on your calculator. We have $z_L = -1.04$ and $z_R = 1.04$, respectively. Solve for the
two values of x.
$z_L = -1.04 = \frac{x_L - 50}{5}$
⇒ −5.2 = x_L − 50
⇒ x_L = 44.8,
$z_R = 1.04 = \frac{x_R - 50}{5}$
⇒ 5.2 = x_R − 50
⇒ x_R = 55.2.
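The invnorm steps in problem 13 can also be sketched in Python. The sketch assumes that `NormalDist.inv_cdf` from the standard library plays the role of the calculator's invnorm. Because the worked answers round z to two or three decimal places before solving for x, the values computed here differ slightly from those above.

```python
from statistics import NormalDist

mu, sigma = 50, 5
model = NormalDist(mu=mu, sigma=sigma)  # the N(50, 5) model from problem 13
std_normal = NormalDist()               # standard normal, used to get z first

# Part (a): cutoff for the highest 5% of values, i.e. the 95th percentile.
z_top5 = std_normal.inv_cdf(0.95)
x_top5 = mu + z_top5 * sigma            # z-score formula rearranged: x = mu + z*sigma

# Part (b): cutoff for the lowest 25% of values.
# Calling inv_cdf on the model itself skips the separate z step.
x_low25 = model.inv_cdf(0.25)

# Part (c): the 15th and 85th percentiles used in the worked answer.
x_left = model.inv_cdf(0.15)
x_right = model.inv_cdf(0.85)

print("(a)", round(z_top5, 3), round(x_top5, 3))
print("(b)", round(x_low25, 2))
print("(c)", round(x_left, 2), round(x_right, 2))
```

By symmetry of the normal curve, the two cutoffs in part (c) sit the same distance below and above the mean of 50.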
[Figure: scatterplot axes with Price ($) on the vertical axis and Energy Used (KWH) on the horizontal axis; plotted points not recoverable.]