1/7/24, 2:31 AM Assignment_Auto.R - Colaboratory

Assignment
library(dplyr)
library(stringr)

df <- read.csv('/content/Auto.csv')

str(df)

'data.frame': 397 obs. of 9 variables:


$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : chr "130" "165" "150" "150" ...
$ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin : int 1 1 1 1 1 1 1 1 1 1 ...
$ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

head(df)

A data.frame: 6 × 9
  mpg   cylinders displacement horsepower weight acceleration year  origin name
  <dbl> <int>     <dbl>        <chr>      <int>  <dbl>        <int> <int>  <chr>
1 18    8         307          130        3504   12.0         70    1      chevrolet chevelle malibu
2 15    8         350          165        3693   11.5         70    1      buick skylark 320
3 18    8         318          150        3436   11.0         70    1      plymouth satellite
4 16    8         304          150        3433   12.0         70    1      amc rebel sst
5 17    8         302          140        3449   10.5         70    1      ford torino
6 15    8         429          198        4341   10.0         70    1      ford galaxie 500

summary(df)

      mpg          cylinders      displacement    horsepower
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Length:397
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.0   Class :character
 Median :23.00   Median :4.000   Median :146.0   Mode  :character
 Mean   :23.52   Mean   :5.458   Mean   :193.5
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0
 Max.   :46.60   Max.   :8.000   Max.   :455.0
     weight      acceleration        year           origin
 Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000
 1st Qu.:2223   1st Qu.:13.80   1st Qu.:73.00   1st Qu.:1.000
 Median :2800   Median :15.50   Median :76.00   Median :1.000
 Mean   :2970   Mean   :15.56   Mean   :75.99   Mean   :1.574
 3rd Qu.:3609   3rd Qu.:17.10   3rd Qu.:79.00   3rd Qu.:2.000
 Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000
     name
 Length:397
 Class :character
 Mode  :character

non_numeric_values <- df$horsepower[!grepl("^\\d+$", df$horsepower)]


non_numeric_values

'?' · '?' · '?' · '?' · '?'

Replace special characters in specific columns with NA

# Replacing special character with NA


df$horsepower <- ifelse(!grepl("^\\d+$", df$horsepower), NA, df$horsepower)

# Count the NA values in the whole data frame


sum(is.na(df))

https://colab.research.google.com/drive/1-js3QyiIxeYYLpklM33bOqQBaHjp6w8O#scrollTo=QZNq9x2X3T_o&printMode=true 1/6

# Count the NA values in each column


colSums(is.na(df))

mpg: 0 cylinders: 0 displacement: 0 horsepower: 5 weight: 0 acceleration: 0 year: 0 origin: 0 name: 0
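
The same replacement can also be done while reading the file. The following is a minimal sketch, assuming missing values are coded as "?" as the output above shows. The inline CSV text and the names `csv_text` and `toy` are illustrative stand-ins for /content/Auto.csv, not part of the original notebook.

```r
# Illustrative stand-in for the real file; only the "?" coding is taken from the notebook.
csv_text <- "mpg,horsepower,name
18,130,car a
25,?,car b
21,95,car c"

# na.strings tells read.csv which strings to treat as NA,
# so horsepower is parsed as a number instead of as character.
toy <- read.csv(text = csv_text, na.strings = "?")

str(toy$horsepower)
colSums(is.na(toy))
```

With this approach the column arrives as numeric, so the later as.numeric() conversion would not be needed.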

Median Imputation


df$horsepower <- as.numeric(df$horsepower)
df$horsepower <- ifelse(is.na(df$horsepower), median(df$horsepower, na.rm = TRUE), df$horsepower)
# Result: the median of the observed horsepower values is 93.5

sum(is.na(df$horsepower))

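Mode Imputation


# The median step above has already filled every NA in df$horsepower, so at this
# point is.na() is FALSE everywhere and horsepower_mode is just a copy of the column.
# To impute with the mode instead, run this line before the median imputation.
horsepower_mode <- ifelse(is.na(df$horsepower), as.numeric(names(which.max(table(df$horsepower)))), df$horsepower)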


sum(is.na(horsepower_mode))
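
To compare the two imputations, both have to start from the same un-imputed column. The following is a small sketch on a toy vector. The values and the names `hp`, `hp_median`, `hp_mode_value` and `hp_mode` are made up for illustration and are not taken from the Auto data.

```r
# Toy column with one missing value (illustrative, not the Auto data).
hp <- c(150, 150, 90, NA, 110)

# Median imputation: fill NA with the median of the observed values.
hp_median <- ifelse(is.na(hp), median(hp, na.rm = TRUE), hp)

# Mode imputation: fill NA with the most frequent observed value.
# table() drops NA by default; which.max() returns the first value on ties.
hp_mode_value <- as.numeric(names(which.max(table(hp))))
hp_mode <- ifelse(is.na(hp), hp_mode_value, hp)

hp_median
hp_mode
```

Here the two methods fill the gap with different values (130 versus 150), which is only visible because neither one was applied to a column the other had already filled.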

head(df)

A data.frame: 6 × 9
  mpg   cylinders displacement horsepower weight acceleration year  origin
  <dbl> <int>     <dbl>        <dbl>      <int>  <dbl>        <int> <int>
1 18    8         307          130        3504   12.0         70    1
2 15    8         350          165        3693   11.5         70    1
3 18    8         318          150        3436   11.0         70    1

Qualitative/Quantitative
str(df)

'data.frame': 397 obs. of 9 variables:


$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
$ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : int 70 70 70 70 70 70 70 70 70 70 ...
$ origin : int 1 1 1 1 1 1 1 1 1 1 ...
$ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

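Result

Quantitative: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin (all stored as numeric or integer).
Qualitative: name.
Note that origin is stored as an integer but is a code for the region of origin (1 = American, 2 = European, 3 = Japanese), so it is better read as a qualitative variable even though the steps below treat it as numeric.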
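
An integer-coded category such as origin can be made explicit by converting it to a factor. The following is a hedged sketch on a toy vector: the values are illustrative, and the labels assume the usual coding of the Auto data (1 = American, 2 = European, 3 = Japanese), which the notebook itself does not state.

```r
# Illustrative values; the 1/2/3 labels are an assumption about the Auto coding.
origin <- c(1, 1, 2, 3, 3, 1)

origin_f <- factor(origin, levels = 1:3,
                   labels = c("American", "European", "Japanese"))

# A factor is summarised by counts rather than by mean and quantiles.
table(origin_f)
is.numeric(origin_f)
```

After such a conversion, is.numeric() returns FALSE for the column, so it would drop out of the numeric summaries computed with sapply(df, is.numeric).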

Range of each Quantitative predictor

numeric_columns <- sapply(df, is.numeric)
ranges <- sapply(df[, numeric_columns], function(x) range(x, na.rm = TRUE))

print(ranges)

     mpg  cylinders displacement horsepower weight acceleration year origin
[1,]  9.0         3           68         46   1613          8.0   70      1
[2,] 46.6         8          455        230   5140         24.8   82      3

Mean and Standard Deviation of each quantitative predictor


numeric_columns <- sapply(df, is.numeric)

means <- sapply(df[, numeric_columns], function(x) mean(x, na.rm = TRUE))


sds <- sapply(df[, numeric_columns], function(x) sd(x, na.rm = TRUE))

summary_mean_std <- data.frame(Mean = means, SD = sds)

print(summary_mean_std)

Mean SD
mpg 23.515869 7.8258039
cylinders 5.458438 1.7015770
displacement 193.532746 104.3795833
horsepower 104.331234 38.2669944
weight 2970.261965 847.9041195
acceleration 15.555668 2.7499953
year 75.994962 3.6900049
origin 1.574307 0.8025495

Now remove the 10th through 85th observations. What is the range, mean, and
standard deviation of each predictor in the subset of the data that remains?
subset_df <- df[-c(10:85), numeric_columns]

summary_subset <- sapply(subset_df, function(x) list(Min = min(x), Max = max(x),Mean = mean(x), Sd = sd(x)))

print(summary_subset)

     mpg      cylinders displacement horsepower weight   acceleration year     origin
Min  11       3         68           46         1649     8.5          70       1
Max  46.6     8         455          230        4997     24.8         82       3
Mean 24.43863 5.370717  187.0498     100.8629   2933.963 15.72305     77.15265 1.598131
Sd   7.908184 1.653486  99.63539     35.68013   810.6429 2.680514     3.11123  0.8161627
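
The summary above is built with sapply() over a list, which yields a matrix of list cells. A possible alternative, sketched here under assumptions, uses vapply() to get an ordinary numeric matrix that can be rounded directly. The built-in mtcars data stands in for df, the row range 10:20 is arbitrary, and `summarise_col` is a hypothetical helper name.

```r
# mtcars (bundled with R) stands in for df; the row range 10:20 is arbitrary.
dat <- mtcars[-(10:20), c("mpg", "hp", "wt")]

# Hypothetical helper: four summary statistics for one numeric column.
summarise_col <- function(x) {
  c(Min = min(x, na.rm = TRUE), Max = max(x, na.rm = TRUE),
    Mean = mean(x, na.rm = TRUE), SD = sd(x, na.rm = TRUE))
}

# vapply checks that every column yields four numbers,
# and the result is an ordinary numeric matrix.
summary_tbl <- vapply(dat, summarise_col, numeric(4))
round(summary_tbl, 2)
```

The negative index -(10:20) drops rows in the same way that df[-c(10:85), ] does above.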

Investigate the predictors graphically


plot(df$mpg ~ df$displacement, xlab = "Displacement", ylab = "MPG", main = "Scatterplot of MPG vs. Displacement")
plot(df$mpg ~ df$horsepower, xlab = "Horsepower", ylab = "MPG", main = "Scatterplot of MPG vs. Horsepower")
plot(df$mpg ~ df$weight, xlab = "Weight", ylab = "MPG", main = "Scatterplot of MPG vs. Weight")
plot(df$mpg ~ df$acceleration, xlab = "Acceleration", ylab = "MPG", main = "Scatterplot of MPG vs. Acceleration")
plot(df$mpg ~ df$year, xlab = "Year", ylab = "MPG", main = "Scatterplot of MPG vs. Year")
plot(df$mpg ~ df$origin, xlab = "Origin", ylab = "MPG", main = "Scatterplot of MPG vs. Origin")
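
The individual scatterplots above can be complemented by a scatterplot matrix, and a coded categorical predictor is often easier to read as a boxplot. The following is a rough sketch in which the built-in mtcars data stands in for df. The column choice is illustrative, and cyl plays the role that origin plays in the Auto data.

```r
# mtcars (bundled with R) stands in for df; column choice is illustrative.
dat <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Scatterplot matrix: every pairwise scatterplot in a single figure.
pairs(dat, main = "Scatterplot matrix")

# For a coded categorical predictor, one box per level is easier to read
# than a scatterplot against the numeric code.
boxplot(mpg ~ factor(cyl), data = mtcars,
        xlab = "Cylinders", ylab = "MPG", main = "MPG by number of cylinders")
```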


# Correlation Matrix
options(width = 80)
round(cor(df[,1:8]), 2)

A matrix: 8 × 8 of type dbl
               mpg cylinders displacement horsepower weight acceleration  year origin
mpg           1.00     -0.78        -0.80      -0.77  -0.83         0.42  0.58   0.56
cylinders    -0.78      1.00         0.95       0.84   0.90        -0.50 -0.35  -0.56
displacement -0.80      0.95         1.00       0.90   0.93        -0.54 -0.37  -0.61
horsepower   -0.77      0.84         0.90       1.00   0.86        -0.69 -0.41  -0.45
weight       -0.83      0.90         0.93       0.86   1.00        -0.42 -0.31  -0.58
acceleration  0.42     -0.50        -0.54      -0.69  -0.42         1.00  0.28   0.21
year          0.58     -0.35        -0.37      -0.41  -0.31         0.28  1.00   0.18
origin        0.56     -0.56        -0.61      -0.45  -0.58         0.21  0.18   1.00

Observations:
Based on the scatterplots, displacement, horsepower, and weight each have a negative relationship with MPG.

Acceleration and year have a positive relationship with MPG. The trend with year suggests that cars became more fuel-efficient over time, plausibly because of advances in technology, although the plots alone cannot establish the cause.

In the correlation matrix, horsepower and weight are strongly positively correlated (0.86). This makes sense: a heavier car generally needs more horsepower to reach the same level of performance as a lighter car, so a car's horsepower tends to scale with its weight.

Horsepower and displacement are also strongly positively correlated (0.90), since a larger displacement generally produces more power.
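
Reading a correlation matrix by eye gets harder as it grows. The following sketch ranks predictors by the strength of their correlation with mpg and lists the predictor pairs that are highly correlated with each other. The built-in mtcars data stands in for the numeric columns of df, and the 0.85 cutoff is a hypothetical choice, not one made in the notebook.

```r
# mtcars stands in for the numeric columns of df.
dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]

cors <- cor(dat)

# Correlation of each predictor with mpg, strongest first.
with_mpg <- cors["mpg", colnames(cors) != "mpg"]
round(with_mpg[order(abs(with_mpg), decreasing = TRUE)], 2)

# Predictor pairs correlated above 0.85 in absolute value (a hypothetical cutoff);
# such pairs carry largely overlapping information.
pred <- cors[rownames(cors) != "mpg", colnames(cors) != "mpg"]
high <- which(abs(pred) > 0.85 & upper.tri(pred), arr.ind = TRUE)
data.frame(var1 = rownames(pred)[high[, "row"]],
           var2 = colnames(pred)[high[, "col"]],
           r = round(pred[high], 2))
```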

Suppose that we wish to predict gas mileage (mpg) on the basis of the other
variables. Do your plots suggest that any of the other variables might be useful in
predicting mpg? Justify your answer.

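A variable is likely to be useful for predicting the target when its relationship with the target is strong and follows a clear pattern or trend. By that standard, the plots and the correlation matrix suggest that weight (-0.83), displacement (-0.80), cylinders (-0.78), and horsepower (-0.77) are the most promising predictors of mpg, each with a strong negative relationship. Year (0.58) and origin (0.56) show moderate positive relationships, and acceleration (0.42) a weaker one.
These are visual and correlational observations only. No model has been fitted here, so statistical significance cannot be claimed for any variable. The strongest predictors are also highly correlated with each other (for example, displacement and cylinders at 0.95), so they carry overlapping information.
On the other hand, a variable with no clear relationship to the target is less likely to be useful for prediction.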
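
Scatterplots and correlations show association but not statistical significance. One common way to examine significance is to fit a linear model and read its coefficient table. The sketch below does this on the built-in mtcars data as a stand-in for df. The predictors chosen are illustrative, and the results say nothing about the Auto data.

```r
# mtcars stands in for df; the predictors are illustrative.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# The coefficient table gives an estimate, standard error, t value and
# p-value for each predictor, conditional on the others being in the model.
coef_tbl <- summary(fit)$coefficients
round(coef_tbl, 4)
```

Because each p-value is conditional on the other predictors, a variable that is strongly correlated with the response on its own can still appear non-significant when a closely related predictor is in the same model.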
