Unit+4+Assignment+2324+T4 Copy Louis Tanur
Unit 4 Assignment
K/16 A/16 C/12 T/12
Total
Complete all parts of this Assignment and submit it back to Canvas by end of Class May 30th
[Figure: scatter plot with a line of best fit; x-axis from 0 to 20, y-axis from 0 to 25]
a) Calculate the equation of the given line of best fit. DO NOT CALCULATE REGRESSION! [K - 3
marks]
Potential Problem:
This statement implies a causal relationship between two variables that are likely correlated but
not causally linked. Just because iced coffee sales and bike rentals are both increasing does not
mean that one is causing the other. This is a classic example of confusing correlation with
causation.
Possible Fix:
The increase in both iced coffee sales and bike rentals could be due to a third factor, such as
warmer weather or an overall increase in outdoor activities. To avoid implying causation, the
statement could be revised to:
The number of iced coffee sales is increasing as the number of bike rentals increases in the City
of Toronto. Both trends may be influenced by common factors such as warmer weather or
increased outdoor activities.
b) The Movie Theater parking lots are having a lot of traffic. Movie Tickets and food purchased in
the area suddenly increase. Traffic and lack of parking makes people go to the movies.
Potential Problem:
This statement implies that traffic and lack of parking are causing people to go to the movies,
which is illogical. Typically, people go to the movies first, and then traffic and lack of parking
become a consequence of increased attendance at the movie theaters.
Possible Fix:
To correct the causality implied in the statement, it could be revised to: The Movie Theater
parking lots are having a lot of traffic. The increase in movie tickets and food purchases in the
area suggests that more people are going to the movies, which is leading to increased traffic and
a lack of parking.
3) Ms. Lawson wants to investigate if student performance on the MDM4U CA is related to time
spent on the project. She gathered the following sample data:
Hours Spent on Project   6   8   10  12  14  18  15  17  20  5
CA Mark                  52  64  41  73  81  84  82  68  79  25
[Figure: scatter plot of CA Mark (y-axis, 0 to 90) against hours spent on project (x-axis, 4 to 22), with linear trendline f(x) = 3.054x + 26.724, R² = 0.637]
The graph and results are not what she expected. Explain some possible reasons why. Also explain what
steps could be taken to fix/improve the analysis. [A - 4 marks]
One way to improve the analysis is to re-analyze the data after removing the potential outlier and to
consider a non-linear model to see whether it improves the fit.
Re-analysis:
First, remove the potential outlier (5, 25). Then fit both a linear and a quadratic model to the
remaining nine points and compare how well each one fits.
Linear Model:
The linear equation based on the cleaned data is:
y = 2.258x + 39.221
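The linear fits in this answer can be checked from the sample data with a short least-squares calculation. The sketch below is illustrative only: the helper name fit_line is an assumption, and the point (5, 25) is dropped because the answer treats it as the potential outlier.

```python
def fit_line(xs, ys):
    """Least-squares line y = m*x + b; returns (m, b, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b, sxy ** 2 / (sxx * syy)


# Sample data from the question
hours = [6, 8, 10, 12, 14, 18, 15, 17, 20, 5]
marks = [52, 64, 41, 73, 81, 84, 82, 68, 79, 25]

# Fit on all ten points
full = fit_line(hours, marks)

# Fit again without the potential outlier (5, 25)
kept = [(h, m) for h, m in zip(hours, marks) if (h, m) != (5, 25)]
cleaned = fit_line([h for h, _ in kept], [m for _, m in kept])

print("all points: m=%.3f b=%.3f r2=%.3f" % full)
print("cleaned:    m=%.3f b=%.3f r2=%.3f" % cleaned)
```

On this sample the full-data fit reproduces the chart's trendline (slope about 3.054, intercept about 26.724, R² about 0.637), and the cleaned fit reproduces the line above. Note that r² falls to about 0.518 once (5, 25) is dropped, so removing the point does not by itself tighten the linear fit.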
Quadratic Model:
The quadratic equation based on the cleaned data is:
y = -0.112x^2 + 5.169x + 22.601
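A quadratic least-squares fit can be computed by treating x and x^2 as two predictors and solving the resulting 2-by-2 system of centered sums. The sketch below does this for the nine cleaned points; fit_quadratic is a hypothetical helper name, not part of any library.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c, r_squared)."""
    n = len(xs)
    x2s = [x * x for x in xs]
    mean_x, mean_x2, mean_y = sum(xs) / n, sum(x2s) / n, sum(ys) / n

    def centered(us, mean_u, vs, mean_v):
        # Centered sum of products, e.g. S_xy = sum((x - mean_x) * (y - mean_y))
        return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))

    s11 = centered(xs, mean_x, xs, mean_x)
    s12 = centered(xs, mean_x, x2s, mean_x2)
    s22 = centered(x2s, mean_x2, x2s, mean_x2)
    s1y = centered(xs, mean_x, ys, mean_y)
    s2y = centered(x2s, mean_x2, ys, mean_y)
    syy = centered(ys, mean_y, ys, mean_y)

    # Solve  s11*b + s12*a = s1y  and  s12*b + s22*a = s2y  by Cramer's rule
    det = s11 * s22 - s12 * s12
    b = (s1y * s22 - s2y * s12) / det
    a = (s2y * s11 - s1y * s12) / det
    c = mean_y - b * mean_x - a * mean_x2
    r_squared = (b * s1y + a * s2y) / syy
    return a, b, c, r_squared


# Cleaned data: the sample without the point (5, 25)
hours = [6, 8, 10, 12, 14, 18, 15, 17, 20]
marks = [52, 64, 41, 73, 81, 84, 82, 68, 79]

a, b, c, r2 = fit_quadratic(hours, marks)
print("a=%.3f b=%.3f c=%.3f r2=%.3f" % (a, b, c, r2))
```

Comparing the printed r² with the r² of the linear fit on the same nine points shows how much the extra term actually adds.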
Graph Interpretation:
The graph shows the original data (blue points), the cleaned data without the outlier (green points), the
linear fit (red line), and the quadratic fit (orange line).
By following these steps, Ms. Lawson can improve her understanding of the relationship between time
spent on the project and student performance.
4) Find an example of a misleading graph and explain why it is misleading and how it should be
corrected. [You may reference the websites we used, but do not use any of their graph examples] [C -
3 marks]
Let's consider a hypothetical example of a misleading graph commonly seen in media reports or
presentations:
Graph Description:
- Title: "Revenue Growth Over Five Years"
- Y-Axis (Vertical): Revenue (in millions of dollars)
- X-Axis (Horizontal): Years (2019, 2020, 2021, 2022, 2023)
- Data Points:
  - 2019: $5 million
  - 2020: $7 million
  - 2021: $8 million
  - 2022: $9 million
  - 2023: $10 million
Misleading Element:
The graph uses a truncated y-axis that starts at $4 million instead of $0. This exaggerates the
perceived growth in revenue over the years.
Why It Is Misleading:
- Exaggerates Growth: Starting the y-axis at $4 million instead of $0 makes the differences
between the data points appear larger than they actually are. This can mislead viewers into
thinking the revenue growth is more significant than it is.
- Distorts Perception: The visual impact of the bars is amplified, creating a false impression of
steep growth.
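The size of the distortion can be worked out from the bar heights. A minimal sketch, assuming the graph is a bar chart whose bars are drawn upward from the axis baseline (the helper drawn_height is hypothetical):

```python
def drawn_height(value, baseline):
    """Height of a bar as drawn when the y-axis starts at `baseline`."""
    return value - baseline


# Revenue in millions of dollars, from the graph description
revenue = {2019: 5, 2020: 7, 2021: 8, 2022: 9, 2023: 10}

# Actual growth from 2019 to 2023
true_ratio = revenue[2023] / revenue[2019]

# Apparent growth when the axis is truncated at $4 million
drawn_ratio = drawn_height(revenue[2023], 4) / drawn_height(revenue[2019], 4)

print(true_ratio, drawn_ratio)  # 2.0 6.0
```

With the baseline at $4 million, the 2023 bar is drawn six times as tall as the 2019 bar, although revenue only doubled.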
Corrected Graph:
Here is how the corrected graph should look:
By correcting the y-axis to start at $0, the graph will provide a more truthful representation of
the revenue growth, helping viewers interpret the data accurately without being misled by visual
distortions.
Example Reference:
Purdue Online Writing Lab (OWL) or the National Center for Education Statistics (NCES).
5) What considerations would need to be made when determining a model of best fit for this
graph? What steps would you take? [T - 1 mark, K – 2 marks]
[Figure: scatter plot; x-axis from 0 to 18, y-axis from 0 to 30]
Considerations:
1. Type of Relationship:
- Determine if the relationship between the variables appears to be linear or non-linear. The scatter
plot suggests a potential non-linear relationship, as the data points seem to follow a curved pattern.
2. Outliers:
- Identify any potential outliers that could skew the results. Outliers can significantly impact the model
of best fit, so it's important to either account for them or consider removing them if they are not
representative of the general trend.
3. Data Distribution:
- Assess the distribution of the data points. If the data points are not evenly distributed, it might affect
the accuracy of the model. Uneven distribution could indicate the need for different types of regression
models or data transformation.
4. Goodness of Fit:
- Evaluate how well different models (linear, quadratic, exponential, etc.) fit the data. Use statistical
measures such as R-squared to compare the goodness of fit for different models.
Steps:
1. Visual Inspection:
• The scatter plot shows a curved pattern, indicating that a simple linear model might not
be the best fit. Instead, a quadratic model may capture the curvature of the data points more accurately.
2. Fit a Linear Model:
• Fit a linear regression model to the data points.
• Evaluate the goodness of fit using the R-squared value.
3. Fit a Quadratic Model:
• Fit a quadratic regression model to the data points.
• Compare the R-squared value with that of the linear model to determine which model
provides a better fit.
Comparison:
The quadratic model has a higher R-squared value (0.98) compared to the linear model (0.95), indicating
a better fit to the data.
Conclusion:
Based on the visual inspection and the higher R-squared value of the quadratic model, the quadratic
regression model is the better fit of the two for the provided graph. It captures the curvature in the data
points more accurately than a linear model, providing a more precise representation of the relationship
between the two variables plotted.
6) Describe some of the ways that we can determine outliers, using methods from either unit 3 or
4. Also describe what are some of the things that need to be discussed and considered for
keeping or removing outliers. [C - 5 marks + K - 5 marks]
Determining Outliers:
1. Visual Inspection:
- Scatter Plots: Plotting data points on a scatter plot can help visually identify points that fall far outside
the general pattern of the data.
- Box Plots: A box plot highlights the median, quartiles, and potential outliers, which are points that lie
beyond the "whiskers", that is, more than 1.5 times the interquartile range (IQR) below Q1 or above Q3.
2. Statistical Methods:
- Z-Score: Calculate the z-score for each data point, which measures how many standard deviations a
point is from the mean. A common threshold is a z-score greater than 3 or less than -3, indicating a
potential outlier.
- IQR Method: Calculate the IQR (Q3 - Q1) and identify points that lie more than 1.5 times the IQR above
Q3 or below Q1. These points are considered potential outliers.
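Both statistical methods can be sketched in a few lines. The data below is hypothetical and chosen only to illustrate the checks; the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one common convention among several.

```python
from statistics import mean, median, stdev


def z_score_outliers(values, threshold=3.0):
    """Values whose z-score (using the sample standard deviation) exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])    # median of the lower half
    q3 = median(data[-half:])   # median of the upper half
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < low or v > high]


# Hypothetical sample with one unusually large value
sample = [12, 14, 15, 15, 16, 17, 18, 18, 19, 20, 21, 60]

print(z_score_outliers(sample))
print(iqr_outliers(sample))
```

One caveat for small samples: with the sample standard deviation, the largest possible z-score in a set of n values is (n - 1) / sqrt(n), which is about 2.85 for n = 10. A threshold of 3 therefore cannot flag any point in a ten-value data set, so the IQR method or a lower z-score threshold may be more useful there.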
Considerations for Keeping or Removing Outliers:
1. Impact on Analysis:
- Effect on Results: Assess how the outliers affect the overall analysis, such as skewing the mean or
influencing the regression model. Outliers can sometimes represent true variability or important
exceptions.
- Data Distribution: Evaluate whether the outliers significantly distort the data distribution. If they do,
they might need to be addressed to avoid misleading results.
3. Purpose of Analysis:
- Exploratory vs. Predictive: In exploratory analysis, keeping outliers might be useful to understand the
full range of data. In predictive modeling, outliers might be removed to improve model accuracy.
- Domain-Specific Considerations: Different fields have different tolerances for outliers. For example, in
quality control, outliers might indicate a problem, while in environmental studies, they might represent
rare but important events.
Steps:
1. Identify Outliers:
- Use visual and statistical methods to identify potential outliers in the dataset.
2. Investigate Outliers:
- Examine the context and possible reasons for each outlier. Determine if they are due to errors, natural
variability, or other factors.
3. Decide on Action:
- Based on the investigation, decide whether to keep, correct, or remove each outlier. Consider the
impact on the analysis and the purpose of the study.
4. Document Decisions:
- Document the criteria used to identify outliers and the actions taken. Provide justification for keeping
or removing each outlier to ensure transparency.
By following these steps and considerations, you can systematically determine and handle outliers in
your data analysis, ensuring robust and reliable results.
8) An advertising blitz by SuperFast Computer Training Inc. features profiles of some of its young
graduates. The number of months of training that these graduates took, their job titles, and their
incomes appear prominently in the advertisements.
Graduate                        Months of Training   Income (in thousands of $)
Sarah, software developer       8                    85
Zack, programmer                7                    67
Eli, systems analyst            9                    76
Yvette, computer technician     6                    57
Kulwinder, web-site designer    7                    69
Lynn, network administrator     5                    63
Tina, software developer        11                   82
Callum, computer technician     8                    58
Leslie, graphic designer        3                    59
Desi, systems security          7                    90
i) Discuss any potential problems with the sampling technique and the data. [T - 2 marks]
a) Scatter Plot: A scatter plot was created to show the relationship between months of training and
income. You can view the plot image above.
b) Line of Best Fit: The equation for the line of best fit, with income in thousands of dollars, is:
Income = 48.52 + 3.11 x (Months of Training)
c) Coefficient of Determination (r²): The r² value is approximately 0.320, indicating that about
32.0% of the variance in income can be explained by the number of months of training.
d) Coefficient of Correlation (r): The correlation coefficient is approximately 0.566, indicating a
moderate positive correlation between months of training and income.
e) Predicted Income for 15 Months of Training: Using the linear regression model, the predicted income
for a student with 15 months of training is approximately $95,170. This is an extrapolation, since the
longest training period in the sample is 11 months.
f) Determine the equation for a quadratic model of best fit. What is the coefficient of determination?
- Quadratic Model Equation, with income in thousands of dollars:
Income = -0.037 x (Months of Training)^2 + 3.63 x (Months of Training) + 46.87
- Coefficient of Determination (r²) for the quadratic model is approximately 0.321. This indicates that
about 32.1% of the variance in income can be explained by the quadratic relationship between the
number of months of training and income.
g) Do you think the quadratic model would have a more reasonable prediction or not? Explain.
- The quadratic model fits only marginally better than the linear model (r² of 0.321 for the quadratic
model vs. 0.320 for the linear model), so the extra term explains almost no additional variance in the
income data. Its prediction for 15 months of training (about $92,900) is close to the linear prediction,
and both are extrapolations well beyond the 3 to 11 months of training in the sample. Neither
prediction is therefore clearly more reasonable than the other.
h) Does either model show that SuperFast’s training accounts for the graduates’ incomes? Identify possible
extraneous variables.
- Neither the linear nor the quadratic model conclusively proves that SuperFast's training alone accounts
for the graduates' incomes. While there is some correlation, the r² values indicate that a significant
portion of the variance in income is not explained by the number of months of training alone.
Possible extraneous variables that could influence graduates' incomes include:
- Prior experience and education of the graduates
- The specific job market and demand for each role
- Geographic location and cost of living
- Individual performance and skill level
- Networking and professional connections
- Additional certifications or skills obtained outside of SuperFast's training program
i) Discuss any potential problems with the sampling technique and the data.
- Sampling Bias: If the sample of graduates is not representative of the broader population of students
who undergo SuperFast's training, the results may not be generalizable.
- Sample Size: The sample size is relatively small (10 graduates), which may limit the reliability of the
findings.
- Self-selection Bias: Graduates featured in advertisements were probably chosen, or volunteered, because
of their success, leading to an overestimation of the program's effectiveness.
- Lack of Control Group: There is no comparison group of graduates who did not undergo SuperFast's
training, making it difficult to attribute income differences solely to the training program.
- Data Accuracy: The accuracy of the income data and the reported months of training is crucial. Any
inaccuracies can skew the analysis.