
Data Cleaning

1. First we load the dataset into R.


Output:

2. We convert the column names in the raw dataset to R-compatible names.


Output:

3. Removing nulls.
Output:
4. Checking the structure of the dataset.
Output:
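
Steps 1 to 4 can be sketched in base R as follows. This is a minimal sketch, not the code used for this report: the inline CSV text, its values, and the raw header spellings are made up for illustration, and the name-cleaning rule (lowercase, non-alphanumeric runs replaced by underscores) is an assumption chosen because it yields names in the style of mileage_km_ltr_kg and max_power.

```r
# Illustrative raw data (made up); headers deliberately contain characters
# that are not valid in R names.
raw_csv <- "name,mileage(km/ltr/kg),max power,transmission
Car A,23.4,74,Manual
Car B,17.0,120,Automatic
Car C,,88,Manual
Car D,19.7,,Automatic
Car E,21.1,67,Manual"

# Step 1: load the data. For a file on disk, pass its path to read.csv() instead.
cars <- read.csv(text = raw_csv, check.names = FALSE, na.strings = c("", "NA"))

# Step 2: convert the headers to R-compatible snake_case names.
clean <- tolower(names(cars))
clean <- gsub("[^a-z0-9]+", "_", clean)  # runs of other characters become "_"
clean <- gsub("^_+|_+$", "", clean)      # trim leading and trailing "_"
names(cars) <- clean

# Step 3: drop every row that contains a missing value.
cars <- na.omit(cars)

# Step 4: inspect the column types and dimensions.
str(cars)
```

Dropping incomplete rows with na.omit() is the simplest option; it is reasonable when only a small share of rows is affected.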

PART 1
5. Creating a dummy variable for the transmission column.
Output:
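
A dummy variable for step 5 could be created along these lines. This is a sketch under assumptions: the toy data frame is made up, the labels "Manual" and "Automatic" are assumed values of the transmission column, and the column names follow the transmission_manual and transmission_automatic names mentioned in step 10.

```r
# Toy data (made up) with an assumed two-level transmission column.
cars <- data.frame(
  max_power = c(74, 120, 88, 150, 67),
  transmission = c("Manual", "Automatic", "Manual", "Automatic", "Manual")
)

# 1 for manual cars, 0 otherwise.
cars$transmission_manual <- ifelse(cars$transmission == "Manual", 1, 0)

# The complementary indicator: 1 for automatic cars, 0 otherwise.
cars$transmission_automatic <- 1 - cars$transmission_manual

# Cross-tabulate to confirm the coding matches the original labels.
table(cars$transmission, cars$transmission_manual)
```

Only one of the two indicators is needed as a regression predictor, since each is fully determined by the other.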

6. Creating two subsets for manual and automatic transmission.


Output:
Subset for manual transmission
Subset for automatic transmission

7. Printing the number of subsets created.


Output:
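
Steps 6 and 7 might look like the sketch below. The data values and the object names manual_cars, automatic_cars, and subsets are hypothetical; only the column names come from this report.

```r
# Toy data (made up) using the column names from this report.
cars <- data.frame(
  mileage_km_ltr_kg = c(23.4, 17.0, 20.5, 15.2, 21.1),
  max_power = c(74, 120, 88, 150, 67),
  transmission = c("Manual", "Automatic", "Manual", "Automatic", "Manual")
)

# Step 6: one subset per transmission type.
manual_cars <- subset(cars, transmission == "Manual")
automatic_cars <- subset(cars, transmission == "Automatic")

# Step 7: collect the subsets in a list, then count them and their rows.
subsets <- list(manual = manual_cars, automatic = automatic_cars)
length(subsets)        # number of subsets created
sapply(subsets, nrow)  # number of cars in each subset
```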

8. Performing linear regression analysis between max power and mileage.


Output:
Interpretation:
• Intercept: The estimated mileage when maximum power is zero. A car cannot have zero
maximum power, so this coefficient has no practical meaning on its own; it only fixes the
position of the regression line.
• Slope: For each one-unit increase in max_power, the predicted mileage_km_ltr_kg
decreases by approximately 0.0423 units.
• Significance: Both coefficients are highly statistically significant, with p-values well
below the conventional 0.05 threshold. This is strong evidence against the null
hypothesis that there is no linear relationship between max_power and
mileage_km_ltr_kg.
• Residuals: A residual is the difference between an observed value and the value
predicted by the model. The residuals range from -20.671 to 35.626, so individual
predictions can miss the observed mileage_km_ltr_kg by up to these amounts. This
indicates considerable variability around the fitted line.
• R-squared: The value of 0.1403 means that approximately 14.03% of the variability in
mileage_km_ltr_kg is explained by maximum power.
• Adjusted R-squared: The value of 0.1402 is almost identical to R-squared. With only one
predictor in the model, the adjustment for model complexity is very small, so the two
values barely differ.
• F-statistic: The F-statistic tests the overall significance of the regression model. With
a very high F-statistic of 1290 and an extremely low p-value, the model as a whole is
statistically significant. Because max_power is the only predictor, this is equivalent to
the test on the max_power slope.
• Residual standard error: The value of 3.743 is an estimate of the standard deviation of
the residuals. It gives the typical amount by which the observed responses deviate from
the fitted regression line.
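
The quantities interpreted above can all be read from a fitted lm object. The sketch below uses simulated data, so its numbers will not match those reported here; the sample size, coefficients, and noise level in the simulation are arbitrary assumptions.

```r
# Simulated data (not the report's dataset): mileage falls as power rises.
set.seed(1)
n <- 200
max_power <- runif(n, 40, 200)
mileage_km_ltr_kg <- 25 - 0.04 * max_power + rnorm(n, sd = 3.5)
cars <- data.frame(max_power, mileage_km_ltr_kg)

# Simple linear regression of mileage on maximum power.
fit <- lm(mileage_km_ltr_kg ~ max_power, data = cars)
s <- summary(fit)

coef(s)                # estimates, standard errors, t values, p-values
range(residuals(fit))  # smallest and largest residual
s$r.squared            # share of variance explained
s$adj.r.squared        # R-squared adjusted for the number of predictors
s$fstatistic           # F value with its degrees of freedom
s$sigma                # residual standard error
```

With a single predictor, the F value equals R-squared / (1 - R-squared) multiplied by n - 2, which is why a modest R-squared can still produce a very large F-statistic when the sample is large.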
9. Finding the number of regression lines in the linear regression analysis
performed.
Output:

10. Scatter plot and multiple regression plots.

• By showing the two transmission types (transmission_manual and
transmission_automatic) in different colors in the scatter plot, we can visually
inspect how the relationship between maximum power and mileage differs
between cars with manual and automatic transmissions.
• The plot shows how mileage varies with maximum power for both manual and
automatic transmission cars.
• Because both groups are overlaid on the same plot, we can directly compare the
two regression lines and see whether the relationship between maximum power
and mileage differs between the two transmission types.
The slope of each regression line shows how quickly mileage falls as maximum power rises
for that transmission type, and the confidence interval shows how precisely that slope is
estimated.
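
A plot of this kind can be sketched in base R as shown below. The data are simulated and the colors are arbitrary choices; the original figure may have been produced with a different plotting package.

```r
# Simulated data (not the report's dataset) with a different slope per group.
set.seed(2)
n <- 120
transmission <- rep(c("Manual", "Automatic"), each = n / 2)
max_power <- runif(n, 40, 200)
slope <- ifelse(transmission == "Manual", -0.05, -0.03)
mileage_km_ltr_kg <- 24 + slope * max_power + rnorm(n, sd = 2)
cars <- data.frame(max_power, mileage_km_ltr_kg, transmission)

# One color per transmission type, looked up by label.
cols <- c(Manual = "steelblue", Automatic = "tomato")
plot(cars$max_power, cars$mileage_km_ltr_kg,
     col = cols[as.character(cars$transmission)], pch = 16,
     xlab = "max_power", ylab = "mileage_km_ltr_kg")

# Fit and draw one regression line per transmission type.
fits <- list()
for (tr in names(cols)) {
  fits[[tr]] <- lm(mileage_km_ltr_kg ~ max_power,
                   data = cars[cars$transmission == tr, ])
  abline(fits[[tr]], col = cols[[tr]], lwd = 2)
}
legend("topright", legend = names(cols), col = cols, pch = 16, lwd = 2)

# Slope and 95% confidence interval for each transmission type.
sapply(fits, function(f) c(slope = unname(coef(f)["max_power"]),
                           confint(f)["max_power", ]))
```

If the two confidence intervals overlap heavily, the visual difference between the lines should be read with caution.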
PART 2:
11. Subset for manual transmission data.
Output:

Fig 1. Manual transmission subset.

12. Subset for automatic transmission data.


Output:

Fig 2. Automatic transmission subset.

13. Regression analysis between max power and mileage for manual cars.
Output:

Interpretation:
• The regression analysis suggests a statistically significant relationship between the
maximum power of cars and their mileage (mileage_km_ltr_kg) for manual
transmission cars.
• For manual transmission cars, as the maximum power increases by one unit, the
predicted mileage_km_ltr_kg decreases by approximately 0.0526 units.
• The model explains approximately 11.17% of the variability in mileage_km_ltr_kg
for manual transmission cars.
• The overall model is statistically significant (p < 0.001), so maximum power is a
significant predictor of mileage_km_ltr_kg in this subset, although most of the
variability remains unexplained.

Fig 3. Regression line and scatter plot for manual transmission


14. Regression analysis between max power and mileage for automatic cars.
Output:

Interpretation:
• The regression analysis suggests a statistically significant relationship between the
maximum power of cars and their mileage (mileage_km_ltr_kg) for automatic
transmission cars.
• For automatic transmission cars, as the maximum power increases by one unit, the
predicted mileage_km_ltr_kg decreases by approximately 0.0314 units.
• The model explains approximately 16.54% of the variability in mileage_km_ltr_kg
for automatic transmission cars.
• The overall model is statistically significant (p < 0.001), so maximum power is a
significant predictor of mileage_km_ltr_kg in this subset, although most of the
variability remains unexplained.
Fig 4. Regression line and scatter plot for automatic transmission.
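
The two subset regressions can be fitted and compared side by side, as in the sketch below. The data are simulated, so the slopes and R-squared values it prints are not the ones reported above; the object names by_transmission and comparison are hypothetical.

```r
# Simulated data (not the report's dataset).
set.seed(3)
n <- 200
transmission <- rep(c("Manual", "Automatic"), each = n / 2)
max_power <- runif(n, 40, 200)
slope <- ifelse(transmission == "Manual", -0.05, -0.03)
mileage_km_ltr_kg <- 24 + slope * max_power + rnorm(n, sd = 3)
cars <- data.frame(max_power, mileage_km_ltr_kg, transmission)

# Split by transmission type and fit the same model to each subset.
by_transmission <- split(cars, cars$transmission)
fits <- lapply(by_transmission,
               function(d) lm(mileage_km_ltr_kg ~ max_power, data = d))

# One row per subset: slope, R-squared, and the p-value of the slope.
comparison <- t(sapply(fits, function(f) {
  s <- summary(f)
  c(slope = unname(coef(f)["max_power"]),
    r_squared = s$r.squared,
    p_value = unname(coef(s)["max_power", "Pr(>|t|)"]))
}))
comparison
```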
15. Difference between a multiple linear regression plot and a single linear
regression plot.
• In a single (simple) linear regression plot, only one predictor variable is used to
predict the dependent variable.
• The regression line is the best-fit line: it minimizes the sum of squared vertical
distances between the observed data points and the values predicted by the
regression model.
• In a multiple linear regression plot, several predictor variables
(independent variables) are included in the regression analysis to predict the
dependent variable.
• Such a plot typically depicts the relationship between the observed values of the
dependent variable and the values predicted by the multiple regression model,
which allows the overall model fit to be assessed.
• The multiple regression model considers the combined effect of all predictor
variables on the dependent variable, so it can capture more complex
relationships than single linear regression.
In summary, single linear regression plots focus on the relationship between one
predictor and the dependent variable, while multiple linear regression plots incorporate several
predictors and may visualize their individual or combined effects on the dependent variable.
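
The contrast can be illustrated with the sketch below, which fits a single-predictor model and a two-predictor model to the same simulated data and draws one plot for each. The simulated values and the choice of transmission_manual as the second predictor are assumptions made for illustration.

```r
# Simulated data (not the report's dataset).
set.seed(4)
n <- 150
max_power <- runif(n, 40, 200)
transmission_manual <- rbinom(n, 1, 0.6)
mileage_km_ltr_kg <- 22 - 0.04 * max_power + 2 * transmission_manual +
  rnorm(n, sd = 3)
cars <- data.frame(max_power, transmission_manual, mileage_km_ltr_kg)

# Single linear regression: one predictor.
simple_fit <- lm(mileage_km_ltr_kg ~ max_power, data = cars)
# Multiple linear regression: two predictors.
multiple_fit <- lm(mileage_km_ltr_kg ~ max_power + transmission_manual,
                   data = cars)

# Single regression plot: response against the predictor, with the fitted line.
plot(cars$max_power, cars$mileage_km_ltr_kg,
     xlab = "max_power", ylab = "mileage_km_ltr_kg")
abline(simple_fit, lwd = 2)

# Multiple regression plot: observed against predicted values.
# Points close to the dashed 45-degree line indicate a good fit.
plot(fitted(multiple_fit), cars$mileage_km_ltr_kg,
     xlab = "Predicted mileage", ylab = "Observed mileage")
abline(0, 1, lty = 2)

# R-squared of both models for comparison.
c(simple = summary(simple_fit)$r.squared,
  multiple = summary(multiple_fit)$r.squared)
```

Adding a predictor can never lower R-squared, so adjusted R-squared is the fairer measure when comparing models of different sizes.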

CONCLUSION:
Both the single and the multiple linear regression analyses reveal statistically significant
negative relationships between maximum power and mileage for manual and automatic
transmission cars. Each one-unit increase in max_power lowers the predicted mileage by about
0.0526 units for manual cars and 0.0314 units for automatic cars (0.0423 units across all cars).
However, the models' explanatory power is limited, with R-squared values ranging from
11.17% to 16.54%. This suggests that factors other than maximum power also influence
mileage, which warrants further investigation.
