
MDM4U Name: Louis Fernando Manuell Tanur Date: 30th May 2024

Unit 4 Assignment
K/16 A/16 C/12 T/12

Total

Complete all parts of this assignment and submit it to Canvas by the end of class on May 30th.

1) The points are listed below in (x,y) form.


[Scatter plot with a line of best fit. x-axis: 0 to 20; y-axis: 0 to 45. Labelled y-values of the points: 42, 40, 36, 32, 32, 29, 29, 23, 19, 16, 15.]

a) Calculate the equation of the given line of best fit. DO NOT CALCULATE REGRESSION! [K - 3 marks]

b) Find the residual for the point (15,16). [K -1 mark]

c) Use the formula to extrapolate a value at x=14. [K - 1 mark]

d) When would y = 10, based on the line of best fit? [K - 1 mark]


2) Explain the potential problems in the situations below and offer a possible fix to the statement.
[T - 4 marks]
a) The number of iced coffee sales is increasing as the number of bike rentals increases in the
City of Toronto. Is Iced Coffee making people want to ride a bike?

Potential Problem:
This statement implies a causal relationship between two variables that are likely correlated but
not causally linked. Just because iced coffee sales and bike rentals are both increasing does not
mean that one is causing the other. This is a classic example of confusing correlation with
causation.
Possible Fix:
The increase in both iced coffee sales and bike rentals could be due to a third factor, such as
warmer weather or an overall increase in outdoor activities. To avoid implying causation, the
statement could be revised to:
The number of iced coffee sales is increasing as the number of bike rentals increases in the City
of Toronto. Both trends may be influenced by common factors such as warmer weather or
increased outdoor activities.

b) The Movie Theater parking lots are having a lot of traffic. Movie Tickets and food purchased in
the area suddenly increase. Traffic and lack of parking makes people go to the movies.

Potential Problem:
This statement implies that traffic and lack of parking are causing people to go to the movies,
which is illogical. Typically, people go to the movies first, and then traffic and lack of parking
become a consequence of increased attendance at the movie theaters.
Possible Fix:
To correct the causality implied in the statement, it could be revised to: The Movie Theater
parking lots are having a lot of traffic. The increase in movie tickets and food purchases in the
area suggests that more people are going to the movies, which is leading to increased traffic and
a lack of parking.

3) Ms. Lawson wants to investigate if student performance on the MDM4U CA is related to time
spent on the project. She gathered the following sample data:
Hours Spent on Project    6   8  10  12  14  18  15  17  20   5
CA Mark                  52  64  41  73  81  84  82  68  79  25

[Chart "CA Mark": scatter plot with linear trendline f(x) = 3.0541x + 26.7243, R² = 0.6371. x-axis: 4 to 22; y-axis: 0 to 90.]

The graph and results are not what she expected. Explain some possible reasons why. Also explain what
steps could be taken to fix/improve the analysis. [A - 4 marks]

Analysis and Explanation:

1. Possible Reasons for Unexpected Results:


- Outliers: There might be outliers in the data that are affecting the trend line significantly. For example,
the point (5, 25) might be an outlier since it shows a very low mark despite some time spent on the
project.
- Non-linear Relationship: The relationship between hours spent and CA marks might not be linear.
There could be diminishing returns on additional hours spent, which means more time spent does not
always translate to significantly higher marks.
- Small Sample Size: The sample size is quite small (only 10 data points), which may not provide a
reliable basis for drawing conclusions about the relationship.
- Other Variables: There could be other factors affecting the CA marks that are not accounted for, such
as students' prior knowledge, study habits, or overall academic performance.

2. Steps to Fix/Improve the Analysis:


- Remove Outliers: Identify and possibly remove outliers to see if the trend becomes clearer. For
instance, reanalyzing the data without the point (5, 25) might help.
- Consider a Non-linear Model: Explore non-linear models (such as quadratic regression) to see if they fit the
data better than a linear model.
- Increase Sample Size: Collect more data points to improve the reliability of the analysis. A larger
sample size can provide a more accurate representation of the relationship.
- Control for Other Variables: Collect data on other potential influencing factors and include them in
the analysis to see if they help explain the variation in CA marks.

To test these ideas, re-analyze the data by removing the potential outlier and considering a non-linear
model to see if it improves the fit.

Re-analysis:

First remove the potential outlier (5, 25), then fit both a linear and a quadratic model to the
remaining nine data points.

Linear Model:
The linear equation based on the cleaned data is:
y = 2.258x + 39.221
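
The linear fits can be recomputed from the table in the question. The following is a minimal sketch using the standard least-squares formulas; the helper name `linear_fit` and the variable names are illustrative assumptions, not part of the original spreadsheet work.

```python
def linear_fit(xs, ys):
    """Least-squares line y = slope*x + intercept, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


hours = [6, 8, 10, 12, 14, 18, 15, 17, 20, 5]
marks = [52, 64, 41, 73, 81, 84, 82, 68, 79, 25]

# Fit on all ten points, as in the original chart.
full = linear_fit(hours, marks)

# Drop the suspected outlier (5, 25) and refit on the remaining nine points.
kept = [(h, m) for h, m in zip(hours, marks) if (h, m) != (5, 25)]
clean_hours, clean_marks = zip(*kept)
cleaned = linear_fit(clean_hours, clean_marks)

for label, (slope, intercept, r2) in (("all points", full), ("without (5, 25)", cleaned)):
    print(f"{label}: y = {slope:.3f}x + {intercept:.3f}, R^2 = {r2:.3f}")
```

Comparing the two printed R-squared values shows whether dropping the point actually tightens the linear fit or only shifts the line.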

Quadratic Model:
The quadratic equation based on the cleaned data is:
y = -0.112x^2 + 5.169x + 22.601

Graph Interpretation:
The graph shows the original data (blue points), the cleaned data without the outlier (green points), the
linear fit (red line), and the quadratic fit (orange line).

Comparison and Recommendations:


- Linear vs. Quadratic Fit: The quadratic model seems to capture the variation in the data better than the
linear model, especially as the hours spent on the project increase. This suggests a non-linear
relationship might be more appropriate.
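- Outlier Impact: Removing (5, 25) changes the linear fit noticeably (the slope drops from about 3.05 to
about 2.26), but R² for the linear model falls from about 0.64 to about 0.52, so removal alone does not
improve the linear fit. This highlights the importance of investigating outliers rather than removing them
automatically.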
- Improving Analysis:
  - Consider Non-linear Models: Use non-linear regression models to better capture relationships in data
    that are not strictly linear.
  - Increase Sample Size: Collecting more data points can provide a more robust basis for analysis.
  - Control for Other Factors: Including other variables that might influence the CA marks can help
    provide a more comprehensive analysis.

By following these steps, Ms. Lawson can improve her understanding of the relationship between time
spent on the project and student performance.

4) Find an example of a misleading graph and explain why it is misleading and how it should be
corrected. [You may reference the websites we used, but do not use any of their graph examples] [C - 3 marks]

Example of a Misleading Graph:

Let's consider a hypothetical example of a misleading graph commonly seen in media reports or
presentations:

Graph Description:
- Title: "Revenue Growth Over Five Years"
- Y-Axis (Vertical): Revenue (in millions of dollars)
- X-Axis (Horizontal): Years (2019, 2020, 2021, 2022, 2023)
- Data Points:
- 2019: $5 million
- 2020: $7 million
- 2021: $8 million
- 2022: $9 million
- 2023: $10 million

Misleading Element:
The graph uses a truncated y-axis that starts at $4 million instead of $0. This exaggerates the
perceived growth in revenue over the years.

Why It Is Misleading:
- Exaggerates Growth: Starting the y-axis at $4 million instead of $0 makes the differences
between the data points appear larger than they actually are. This can mislead viewers into
thinking the revenue growth is more significant than it is.
- Distorts Perception: The visual impact of the bars is amplified, creating a false impression of
steep growth.
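
The size of the distortion can be worked out from the figures listed above. The sketch below is illustrative only: the function name `bar_height_ratio` is an assumption, and it simply compares how tall the 2023 bar is drawn relative to the 2019 bar for a given axis start.

```python
revenue = {2019: 5, 2020: 7, 2021: 8, 2022: 9, 2023: 10}  # millions of dollars


def bar_height_ratio(first, last, axis_start):
    """Ratio of drawn bar heights when the y-axis starts at axis_start."""
    return (last - axis_start) / (first - axis_start)


# With the y-axis truncated at $4 million, the bars are drawn from 4 upward.
truncated = bar_height_ratio(revenue[2019], revenue[2023], axis_start=4)

# With the y-axis starting at $0, bar heights are proportional to revenue.
from_zero = bar_height_ratio(revenue[2019], revenue[2023], axis_start=0)

print(f"Truncated axis: 2023 bar is {truncated:.0f}x the 2019 bar")
print(f"Zero-based axis: 2023 bar is {from_zero:.0f}x the 2019 bar")
```

Revenue doubled over the period, but on the truncated axis the last bar is drawn six times as tall as the first.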

How It Should Be Corrected:


- Start Y-Axis at Zero: The y-axis should start at $0 to accurately represent the growth.
- Use Consistent Intervals: Ensure that the intervals on the y-axis are evenly spaced to provide a
clear and accurate depiction of the data.

Corrected Graph:
Here is how the corrected graph should look:

1. Title: "Revenue Growth Over Five Years"


2. Y-Axis (Vertical): Revenue (in millions of dollars), starting at $0
3. X-Axis (Horizontal): Years (2019, 2020, 2021, 2022, 2023)
4. Data Points: Same as above

By correcting the y-axis to start at $0, the graph will provide a more truthful representation of
the revenue growth, helping viewers accurately interpret the data without being misled by visual
distortions.

Example Reference:
Purdue Online Writing Lab (OWL) or the National Center for Education Statistics (NCES).

5) What considerations would need to be made when determining a model of best fit for this
graph? What steps would you take? [T - 1 mark, K – 2 marks]

[Scatter plot. x-axis: 0 to 18; y-axis: 0 to 30.]

Considerations:
1. Type of Relationship:
- Determine if the relationship between the variables appears to be linear or non-linear. The scatter
plot suggests a potential non-linear relationship, as the data points seem to follow a curved pattern.
2. Outliers:
- Identify any potential outliers that could skew the results. Outliers can significantly impact the model
of best fit, so it's important to either account for them or consider removing them if they are not
representative of the general trend.
3. Data Distribution:
- Assess the distribution of the data points. If the data points are not evenly distributed, it might affect
the accuracy of the model. Uneven distribution could indicate the need for different types of regression
models or data transformation.
4. Goodness of Fit:
- Evaluate how well different models (linear, quadratic, exponential, etc.) fit the data. Use statistical
measures such as R-squared to compare the goodness of fit for different models.

Steps to Determine the Model of Best Fit:

1. Plot the Data:


- Begin by plotting the data points on a graph to visually inspect the pattern and identify the type of
relationship between the variables.

2. Identify Potential Outliers:


- Look for any data points that fall far outside the general pattern. Assess whether these points should
be included in the analysis or if they are outliers that should be removed.
3. Test Different Models:
- Fit different types of models (linear, quadratic, exponential, etc.) to the data.
- Use polynomial regression for non-linear relationships. For example, try fitting both a linear
regression and a quadratic regression to see which better represents the data.
4. Evaluate Goodness of Fit:
- Calculate the R-squared value and other relevant statistics for each model to determine how well the
model fits the data.
- Prefer the model with the higher R-squared value, but keep the simpler model when the gain is small,
since adding terms to a model never lowers R-squared.
5. Validate the Model:
- Use a portion of the data for validation (cross-validation) to ensure that the model generalizes well to
new data.
6. Refine the Model:
- If necessary, refine the model by transforming the data or trying more complex models to improve
the fit.
By considering these factors and following these steps, you can determine the most appropriate model
of best fit for the given data.

Steps to Identify the Model of Best Fit:

1. Visual Inspection:
• The scatter plot shows a curved pattern, indicating that a simple linear model might not
be the best fit. Instead, a quadratic model may capture the curvature of the data points more accurately.
2. Fit a Linear Model:
• Fit a linear regression model to the data points.
• Evaluate the goodness of fit using the R-squared value.
3. Fit a Quadratic Model:
• Fit a quadratic regression model to the data points.
• Compare the R-squared value with that of the linear model to determine which model
provides a better fit.

Linear Regression Model:


A linear regression model has the form:
y = ax + b
Quadratic Regression Model:
A quadratic regression model has the form:
y = ax^2 + bx + c

Let’s calculate and compare both models.

Linear Model Example:


For a hypothetical set of data points:

• Hours: [2, 4, 6, 8, 10, 12, 14, 16]


• Marks: [3, 5, 10, 15, 21, 28, 36, 45]

Linear Model Fit:


y = 3.03x - 6.89
(R-squared value: 0.974)

Quadratic Model Fit:


y = 0.124x^2 + 0.807x + 0.518
(R-squared value: 0.9995)

Comparison:
The quadratic model has a higher R-squared value (0.9995) compared to the linear model (0.974), indicating
a better fit to the data.
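
A minimal sketch of how the two fits and their R-squared values could be computed for the hypothetical data above, using only the Python standard library. The helper names `fit_polynomial` and `r_squared` are assumptions made for illustration; a spreadsheet trendline or a numerical library would normally do this step.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_degree] for c0 + c1*x + ... .
    """
    size = degree + 1
    # Augmented matrix of the normal equations:
    # sum over j of sum(x^(i+j)) * c_j = sum(y * x^i)
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(size)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(size):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][size] / rows[i][i] for i in range(size)]


def r_squared(xs, ys, coeffs):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    predicted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


hours = [2, 4, 6, 8, 10, 12, 14, 16]
marks = [3, 5, 10, 15, 21, 28, 36, 45]

linear = fit_polynomial(hours, marks, degree=1)
quadratic = fit_polynomial(hours, marks, degree=2)
r2_linear = r_squared(hours, marks, linear)
r2_quadratic = r_squared(hours, marks, quadratic)

print(f"linear:    y = {linear[1]:.3f}x + {linear[0]:.3f}, R^2 = {r2_linear:.4f}")
print(f"quadratic: y = {quadratic[2]:.3f}x^2 + {quadratic[1]:.3f}x + {quadratic[0]:.3f}, "
      f"R^2 = {r2_quadratic:.4f}")
```

Because the quadratic model contains the linear model as a special case, its R-squared can never be lower; the question is whether the improvement is large enough to justify the extra term.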

Conclusion:
Based on the visual inspection and the higher R-squared value of the quadratic model, the quadratic
regression model is likely the better fit for the provided graph. It captures the curvature in the data
points more accurately than a linear model, providing a more precise representation of the relationship
between the two variables.

6) Describe some of the ways that we can determine outliers, using methods from either unit 3 or
4. Also describe what are some of the things that need to be discussed and considered for
keeping or removing outliers. [C - 5 marks + K - 5 marks]

Determining Outliers:

1. Visual Inspection:
- Scatter Plots: Plotting data points on a scatter plot can help visually identify points that fall far outside
the general pattern of the data.
- Box Plots: A box plot highlights the median and quartiles, and marks as potential outliers any points
lying more than 1.5 times the interquartile range (IQR) below Q1 or above Q3, beyond the whiskers.

2. Statistical Methods:
- Z-Score: Calculate the z-score for each data point, which measures how many standard deviations a
point is from the mean. A common threshold is a z-score greater than 3 or less than -3, indicating a
potential outlier.
- IQR Method: Calculate the IQR (Q3 - Q1) and identify points that lie more than 1.5 times the IQR above
Q3 or below Q1. These points are considered potential outliers.

3. Machine Learning Methods (note: extra material):

- Isolation Forest: A model that identifies outliers by isolating observations using random partitioning.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clustering method that can
identify points that do not belong to any cluster as outliers.
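
The two statistical methods above can be sketched in a few lines. This is an illustration under stated assumptions: the data set is hypothetical, the function names are invented for the example, and quartiles are computed with the "inclusive" interpolation method of Python's `statistics.quantiles`, which is only one of several quartile conventions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def z_score_outliers(values, threshold=3.0):
    """Values whose |z-score| exceeds the threshold (population std. dev.)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]


# Hypothetical data set with one value far from the rest.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 48]

print("IQR method:", iqr_outliers(data))
print("z-score, threshold 3:", z_score_outliers(data))
print("z-score, threshold 2.5:", z_score_outliers(data, threshold=2.5))
```

For this sample the IQR rule flags 48, while the z-score rule with a cutoff of 3 flags nothing: the extreme value inflates the standard deviation, and in a sample of only ten values no z-score can exceed 3. This is one reason the choice of method and threshold needs to be stated and justified.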

Considerations for Keeping or Removing Outliers:

1. Impact on Analysis:
- Effect on Results: Assess how the outliers affect the overall analysis, such as skewing the mean or
influencing the regression model. Outliers can sometimes represent true variability or important
exceptions.
- Data Distribution: Evaluate whether the outliers significantly distort the data distribution. If they do,
they might need to be addressed to avoid misleading results.

2. Context and Source of Outliers:


- Measurement Errors: Determine if outliers are due to measurement or data entry errors. If they are,
they should typically be corrected or removed.
- Natural Variability: Consider whether the outliers represent natural variability in the data. In some
cases, outliers can provide valuable insights, especially in fields like finance or medical research.

3. Purpose of Analysis:
- Exploratory vs. Predictive: In exploratory analysis, keeping outliers might be useful to understand the
full range of data. In predictive modeling, outliers might be removed to improve model accuracy.
- Domain-Specific Considerations: Different fields have different tolerances for outliers. For example, in
quality control, outliers might indicate a problem, while in environmental studies, they might represent
rare but important events.

4. Ethical and Practical Considerations:


- Transparency: Be transparent about the criteria and methods used to identify and handle outliers.
Document any decisions made regarding outliers in the analysis.
- Consistency: Apply consistent rules for identifying and handling outliers across similar datasets or
analyses to ensure comparability and reproducibility.

Steps to Handle Outliers:

1. Identify Outliers:
- Use visual and statistical methods to identify potential outliers in the dataset.

2. Investigate Outliers:
- Examine the context and possible reasons for each outlier. Determine if they are due to errors, natural
variability, or other factors.

3. Decide on Action:
- Based on the investigation, decide whether to keep, correct, or remove each outlier. Consider the
impact on the analysis and the purpose of the study.
4. Document Decisions:
- Document the criteria used to identify outliers and the actions taken. Provide justification for keeping
or removing each outlier to ensure transparency.

By following these steps and considerations, you can systematically determine and handle outliers in
your data analysis, ensuring robust and reliable results.

7) Suggest a pair of variables that would exhibit: [T - 3 marks]

a) A strong positive relationship


Income and Expenditure:
- Variables: Monthly income (independent variable) and monthly expenditure (dependent
variable).
- Explanation: As a person's income increases, their expenditure typically also increases,
demonstrating a strong positive relationship.

b) A strong negative relationship


Temperature and Heating Bills:
- Variables: Average monthly temperature (independent variable) and heating bills (dependent
variable).
- Explanation: As the average monthly temperature decreases, heating bills typically increase,
demonstrating a strong negative relationship.

c) A weak or zero linear relationship


Shoe Size and Intelligence Quotient (IQ):
- Variables: Shoe size (independent variable) and IQ score (dependent variable).
- Explanation: There is no logical connection between a person's shoe size and their
intelligence quotient, resulting in a weak or zero linear relationship.

Complete this part using Excel. You can submit the Excel file (or other spreadsheet) with your work.

8) An advertising blitz by SuperFast Computer Training Inc. features profiles of some of its young
graduates. The number of months of training that these graduates took, their job titles, and their
incomes appear prominently in the advertisements.
Graduate                        Months of Training   Income (in thousands of $)
Sarah, software developer       8                    85
Zack, programmer                7                    67
Eli, systems analyst            9                    76
Yvette, computer technician     6                    57
Kulwinder, web-site designer    7                    69
Lynn, network administrator     5                    63
Tina, software developer        11                   82
Callum, computer technician     8                    58
Leslie, graphic designer        3                    59
Desi, systems security          7                    90

a) Create a Scatterplot for the data using technology. [A - 4 marks, C – 2 marks]


b) Determine the line of best fit [A - 2 marks]
c) Determine the coefficient of determination (r²) between the amount of training the graduates
took and their incomes. Describe the strength of the relationship. [A - 2 marks]
d) Determine the coefficient of correlation (r) for the line. [K - 1 mark]
e) Use this model to predict the income of a student who graduates from the company's two-year
diploma program after 15 months of training. Does this prediction seem reasonable? Explain.
[K - 2 marks, C - 2 marks]
f) Determine the equation for a quadratic model of best fit. What is the coefficient of
determination? [A - 2 marks]
g) Do you think the quadratic model would have a more reasonable prediction or not? Explain. [A - 2 marks]
h) Does either model show that SuperFast's training accounts for the graduates' incomes? Identify
possible extraneous variables. [T - 2 marks]

i) Discuss any potential problems with the sampling technique and the data. [T - 2 marks]
a) Scatter Plot: A scatter plot was created to show the relationship between months of training and
income. You can view the plot image above.

b) Line of Best Fit: The equation for the line of best fit, with income in thousands of dollars, is:
Income = 48.52 + 3.11 x (Months of Training)

c) Coefficient of Determination (r²): The r² value is approximately 0.320, indicating that about
32.0% of the variance in income can be explained by the number of months of training. This is a
fairly weak linear relationship.

d) Coefficient of Correlation (r): The correlation coefficient is approximately 0.566, indicating a moderate
positive correlation between months of training and income.

e) Predicted Income for 15 Months of Training: Using the linear regression model, the predicted income
for a student with 15 months of training is approximately $95,200. This prediction should be treated with
caution: 15 months lies outside the range of the data (3 to 11 months), so it is an extrapolation, and the
model explains only about a third of the variation in income.

f) Determine the equation for a quadratic model of best fit. What is the coefficient of determination?
- Quadratic Model Equation, with income in thousands of dollars:
Income = -0.0374 x (Months of Training)^2 + 3.630 x (Months of Training) + 46.87
- Coefficient of Determination (r²) for the quadratic model is approximately 0.321. This indicates that
about 32.1% of the variance in income can be explained by the quadratic relationship between the
number of months of training and income.

g) Do you think the quadratic model would have a more reasonable prediction or not? Explain.
- The quadratic model fits the data only marginally better than the linear model (r² of 0.321 for the
quadratic model vs. 0.320 for the linear model), and its squared-term coefficient is close to zero, so the
data show little evidence of curvature. At 15 months the quadratic model predicts about $92,900, close to
the linear prediction of about $95,200. Both predictions are extrapolations beyond the 3 to 11 month
range of the data, so neither is clearly more reasonable, and the simpler linear model is adequate.
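
Parts b) to e) can be recomputed directly from the table in the question. The following is a minimal sketch using the standard least-squares formulas instead of a spreadsheet trendline; the variable names are illustrative assumptions, and income is kept in thousands of dollars as in the table.

```python
import math

months = [8, 7, 9, 6, 7, 5, 11, 8, 3, 7]
income = [85, 67, 76, 57, 69, 63, 82, 58, 59, 90]  # thousands of dollars

n = len(months)
mean_x = sum(months) / n
mean_y = sum(income) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in months)
syy = sum((y - mean_y) ** 2 for y in income)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, income))

# b) Line of best fit: income = intercept + slope * months.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# d) Correlation coefficient, and c) coefficient of determination.
r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

# e) Prediction at 15 months (outside the 3 to 11 month range of the data).
predicted_15 = intercept + slope * 15

print(f"income = {intercept:.2f} + {slope:.2f} * months")
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
print(f"predicted income at 15 months: {predicted_15:.1f} thousand dollars")
```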

h) Do either model show that SuperFast’s training accounts for the graduates’ incomes? Identify possible
extraneous variables.
- Neither the linear nor the quadratic model conclusively proves that SuperFast's training alone accounts
for the graduates' incomes. While there is some correlation, the r² values indicate that a significant
portion of the variance in income is not explained by the number of months of training alone.
Possible extraneous variables that could influence graduates' incomes include:
- Prior experience and education of the graduates
- The specific job market and demand for each role
- Geographic location and cost of living
- Individual performance and skill level
- Networking and professional connections
- Additional certifications or skills obtained outside of SuperFast's training program

i) Discuss any potential problems with the sampling technique and the data.
- Sampling Bias: If the sample of graduates is not representative of the broader population of students
who undergo SuperFast's training, the results may not be generalizable.
- Sample Size: The sample size is relatively small (10 graduates), which may limit the reliability of the
findings.
- Self-selection Bias: Graduates featured in advertisements might have self-selected based on their
success, leading to an overestimation of the program's effectiveness.
- Lack of Control Group: There is no comparison group of graduates who did not undergo SuperFast's
training, making it difficult to attribute income differences solely to the training program.
- Data Accuracy: The accuracy of the income data and the reported months of training is crucial. Any
inaccuracies can skew the analysis.
