
Exercise 1. The dataset mpg is part of R's ggplot2 package.

It contains a subset of the fuel economy data
that the Environmental Protection Agency (EPA) makes available on https://fueleconomy.gov/. It contains
only car models which had a new release every year between 1999 and 2008; this was used as a proxy for the
popularity of the car. It is a data frame with 234 rows and 11 variables: manufacturer name (manufacturer),
model name (model), engine displacement (displ), year of manufacture (year), number of cylinders (cyl), type
of transmission (trans), type of drive train (drv), city miles per gallon (cty), highway miles per gallon (hwy),
fuel type (fl) and type of car (class). Applying an appropriate R data visualisation method on the mpg data,
perform the following tasks.
(a). Write code that displays a graph which plots in the order of decreasing medians of the vehicle’s miles-
per-gallon on highway (hwy) against their manufacturers. Plot the graph and list the manufacturers in the
order of fuel efficiency of their vehicles. Using the graph, find out which companies produce the most and
the least fuel efficient vehicles.

#PART 1a)

require(ggplot2)

## Loading required package: ggplot2

library(devtools)

## Loading required package: usethis

install_github("vqv/ggbiplot")

## Skipping install of 'ggbiplot' from a github remote, the SHA1 (7325e880) has not changed since last install.
##   Use `force = TRUE` to force installation

require(ggbiplot)

## Loading required package: ggbiplot

## Loading required package: plyr

## Loading required package: scales

## Loading required package: grid

dim(mpg)

## [1] 234 11

hwymedian <- setNames(aggregate(mpg$hwy, list(mpg$manufacturer), median),
                      c("manufacturer", "hwy"))
ggplot(hwymedian, aes(x = reorder(manufacturer, -hwy), y = hwy)) +
  geom_point() +
  ylab("Miles Per Gallon (hwy)") +
  xlab("Manufacturers") +
  ggtitle("Fuel Efficiency vs Manufacturer")

[Plot: "Fuel Efficiency vs Manufacturer" — median hwy per manufacturer in decreasing order: honda, volkswagen, hyundai, audi, nissan, pontiac, subaru, toyota, chevrolet, jeep, ford, mercury, dodge, lincoln, land rover]

#We can see from the plot that the most fuel-efficient manufacturer is HONDA and the least is LAND ROVER, as it lies at the lowest median hwy value
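The task also asks for an explicit list of manufacturers in order of fuel efficiency. A minimal sketch that prints this list, reusing the same median aggregation (ggplot2 is assumed to be installed, since it provides the mpg data):

```r
library(ggplot2)  # provides the mpg dataset

# median highway mpg per manufacturer
hwymedian <- setNames(aggregate(mpg$hwy, list(mpg$manufacturer), median),
                      c("manufacturer", "hwy"))

# manufacturers from most to least fuel efficient
as.character(hwymedian[order(hwymedian$hwy, decreasing = TRUE),
                       "manufacturer"])
```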

(b). Write code that displays a graph which plots in the order of decreasing medians of the vehicle’s miles-
per-gallon on highway (hwy) against the type of car (class). Plot the graph and list the classes of vehicle in
the order of their fuel efficiency.

#PART 1b)

classmedian <- setNames(aggregate(mpg$hwy, list(mpg$class), median),
                        c("class", "hwy"))
ggplot(classmedian, aes(x = reorder(class, -hwy), y = hwy)) +
  geom_point() +
  ylab("Miles Per Gallon (hwy)") +
  xlab("Class") +
  ggtitle("Fuel Efficiency vs Class")

[Plot: "Fuel Efficiency vs Class" — median hwy per class in decreasing order: compact, midsize, subcompact, 2seater, minivan, suv, pickup]

# The most efficient classes are compact and midsize, while the least efficient are suv and pickup
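Similarly, the ordered list of classes can be printed directly from the aggregated medians. A sketch, assuming ggplot2 is available:

```r
library(ggplot2)  # provides the mpg dataset

classmedian <- setNames(aggregate(mpg$hwy, list(mpg$class), median),
                        c("class", "hwy"))

# classes from most to least fuel efficient
as.character(classmedian[order(classmedian$hwy, decreasing = TRUE), "class"])
```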

(c). Draw a bar chart of manufacturers in terms of numbers of different types of cars manufactured. Based
on this, comment on classes of vehicles manufactured by the companies producing the most and the least
fuel efficient vehicles and possible reason(s) for highest/lowest fuel efficiency.

# PART 1c)

# bar chart of the number of cars per manufacturer, stacked by class
ggplot(mpg, aes(x = manufacturer, fill = class)) +
  geom_bar()

[Bar chart: manufacturers on the x-axis, bars stacked by class (2seater, compact, midsize, minivan, pickup, subcompact, suv)]

#From the bar chart we can see that honda manufactures only subcompact cars while land rover manufactures only suvs; light compact and subcompact bodies are what make the most fuel-efficient vehicles, whereas heavy suvs and pickups are the least fuel efficient

Exercise 2. The diamonds dataset within R’s ggplot2 contains 10 columns (price, carat, cut, color, clarity,
length(x), width(y), depth(z), depth percentage, top width) for 53940 different diamonds. Using this dataset,
carry out the following tasks.
(a). Write code to plot histograms for carat and price. Plot these graphs and comment on their shapes.

#PART 2a)
library(ggplot2)
ggplot(data = diamonds, aes(x = price)) +
geom_histogram(binwidth = 500,colour="blue") + xlab('Price') +
ylab('Frequency')

[Histogram of price (binwidth 500): Price on the x-axis (0-20000), Frequency on the y-axis — strongly right-skewed with a long upper tail]

ggplot(data = diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.25, colour = "blue") + xlab('Carat') +
  ylab('Frequency')

[Histogram of carat (binwidth 0.25): Carat on the x-axis (0-5), Frequency on the y-axis — right-skewed, most diamonds below 1 carat]
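One simple numeric check of the skew visible in both histograms is to compare mean and median: for a right-skewed variable the mean exceeds the median. A sketch, assuming ggplot2's diamonds data is loaded:

```r
library(ggplot2)  # provides the diamonds dataset

# mean > median indicates right skew
c(price_mean = mean(diamonds$price), price_median = median(diamonds$price))
c(carat_mean = mean(diamonds$carat), carat_median = median(diamonds$carat))
```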

(b). Write code to plot bar charts of cut proportioned in terms of color and again bar charts of cuts
proportioned in terms of clarity. Comment on how proportions of diamonds change in terms of clarity and
colour under different cut categories.

#PART 2b)

ggplot(diamonds,
aes(x = clarity,
fill =cut)) +
geom_bar(position = "fill")+ylab("Proportions")

[Stacked proportion bar chart: proportion of each cut (Fair, Good, Very Good, Premium, Ideal) within every clarity grade (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)]

ggplot(diamonds,
aes(x = color,
fill = cut)) +
geom_bar(position = "fill")+ylab("Proportions")

[Stacked proportion bar chart: proportion of each cut within every color grade (D, E, F, G, H, I, J)]

#According to the charts, Fair-cut diamonds form their highest proportion in the I1 clarity grade, and the proportion of Ideal cuts grows as clarity improves
#On the basis of colour, all colour categories have roughly equal proportions of the different cuts
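The proportions read off the charts can be checked numerically with a two-way table normalised within each clarity (or colour) grade. A sketch, assuming the diamonds data from ggplot2:

```r
library(ggplot2)  # provides the diamonds dataset

# proportion of each cut within every clarity grade (columns sum to 1)
round(prop.table(table(diamonds$cut, diamonds$clarity), margin = 2), 2)

# proportion of each cut within every color grade
round(prop.table(table(diamonds$cut, diamonds$color), margin = 2), 2)
```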

(c). Write code to display an appropriate graph that facilitates the investigation of a three-way relationship
between cut, carat and price. Plot the graph. What inferences can you draw regarding the three way
relationship?

#PART 2c)

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +
  scale_color_manual(name = "Cut", values = c("red", "darkblue", "darkgreen",
                                              "grey50", "black"))

[Scatter plot: price vs carat, points coloured by cut (Fair, Good, Very Good, Premium, Ideal)]

# From the plot we can see that price rises steeply with carat for every cut; at a given carat there is not much separation between cuts, so carat dominates price while cut plays only a secondary role
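The dominance of carat over cut can be quantified with a correlation coefficient. A sketch, assuming the diamonds data from ggplot2 (the log-scale version is included because both variables are strongly right-skewed):

```r
library(ggplot2)  # provides the diamonds dataset

cor(diamonds$carat, diamonds$price)            # strong positive correlation
cor(log(diamonds$carat), log(diamonds$price))  # correlation on log scales
```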

Exercise 3. Before deciding about selecting a particular machine learning technique for a data science prob-
lem, it is important to study the data distribution particularly through visualization. However, visualizing a
multivariate data with two or more variables is difficult in a two dimensional plot. In this exercise, you are
required to study the R’s iris dataset which is a multivariate data consisting of four features or properties
(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) characterizing three species of iris flower (setosa,
versicolor, and virginica). The principal component analysis (PCA) is a technique that can help facilitate
visualization of a multivariate data distribution. The first two principal components (PC1 and PC2) ob-
tained after applying PCA, can explain the majority of variations in the data. In order to study the data
variability in iris data-set, perform the following tasks.
(a). Write code to obtain PC scores.

#PART 3a)
library(ggplot2)
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]
pcairis <- prcomp(log.ir, center = TRUE, scale. = TRUE)

print(pcairis)

## Standard deviations (1, .., p=4):
## [1] 1.7124583 0.9523797 0.3647029 0.1656840
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5038236 -0.45499872 0.7088547 0.19147575
## Sepal.Width -0.3023682 -0.88914419 -0.3311628 -0.09125405
## Petal.Length 0.5767881 -0.03378802 -0.2192793 -0.78618732
## Petal.Width 0.5674952 -0.03545628 -0.5829003 0.58044745
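Note that print(pcairis) shows the loadings (the rotation matrix); the PC scores themselves are stored in the $x component of the prcomp object. A minimal sketch for inspecting them:

```r
# recompute the PCA on the log-transformed iris features
log.ir <- log(iris[, 1:4])
pcairis <- prcomp(log.ir, center = TRUE, scale. = TRUE)

head(pcairis$x)   # PC scores: one row per flower, one column per PC
summary(pcairis)  # proportion of variance explained by each component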

(b). Write code to obtain a scatter plot representing PC1 vs. PC2, wherein data clusters corresponding to
three flower types are clearly marked using possibly an ellipsoid.

#PART 3b
ggbiplot(pcairis, choices = c(1, 2), groups = iris$Species, ellipse = TRUE,
         scale = 0, var.scale = 0.2, colour = "blue", varname.size = 3) +
  ggtitle("PCA visualization", subtitle = "ggbiplot()") +
  theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5,
                                  colour = "red"),
        plot.subtitle = element_text(size = 10, face = "bold.italic",
                                     hjust = 0.5))

[PCA biplot ("PCA visualization", ggbiplot()): PC1 (73.3% explained var.) vs PC2 (22.7% explained var.), with ellipses around the setosa, versicolor and virginica clusters and loading arrows for Sepal.Length, Sepal.Width, Petal.Length and Petal.Width]

(c). Run the codes to make the scatter plot, mark flowers using ellipsoids and comment on the feature
distribution.

#Code for the above exercise.

#PART 3c

ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Species)) +
geom_point() +
stat_ellipse()

[Scatter plot: Petal.Width vs Petal.Length, coloured by Species (setosa, versicolor, virginica), with an ellipse around each species cluster]

#setosa forms a clearly separated cluster with small petals, while versicolor and virginica overlap slightly; the petal features therefore separate the three species well

Exercise 4. In this task, you are required to analyze the Animals dataset from the MASS package. This
dataset contains brain weight (in grams) and body weight (in kilograms) for 28 different animal species. The
three largest animals are dinosaurs, whose measurements are obviously the result of scientific modeling rather
than precise measurements.
A scatter plot given below fails to describe any obvious relationship between brain weight and body weight
variables. You are required to apply appropriate power transformations to the variables to obtain a more
interpretable plot and describe the obtained relationship. To this end, undertake the following tasks.

library(ggplot2)
library(MASS)
data(Animals)
qplot(brain, body, data = Animals)

[Scatter plot: body vs brain for Animals — a few extreme values dominate both axes; no obvious relationship]

Task-1. Check whether each of the variables has normal distribution. Your response should be based on an
appropriate statistical test as well as smoothed histogram plots.

#Code for the above exercise.


#Shapiro-Wilk test: a small p-value indicates departure from normality
shapiro.test(Animals$body); shapiro.test(Animals$brain)

##
## Shapiro-Wilk normality test
##
## data: Animals$body
## W = 0.27831, p-value = 1.115e-10

##
## Shapiro-Wilk normality test
##
## data: Animals$brain
## W = 0.45173, p-value = 3.763e-09

#Using density plots and checking the difference between mean and median
plot(density(Animals$body))

[Density plot: density.default(x = Animals$body), N = 28, Bandwidth = 164.1 — extreme right skew, almost all mass near zero]

plot(density(Animals$brain))

[Density plot: density.default(x = Animals$brain), N = 28, Bandwidth = 137.2 — extreme right skew, almost all mass near zero]

hist(Animals$brain,30)

[Histogram of Animals$brain (30 bins): almost all species fall below 1000 g, with a long tail out to 6000 g]

hist(Animals$body,30)

[Histogram of Animals$body (30 bins): almost all species fall in the lowest bin, with a long tail out to 80000 kg]

#The density plots show a large difference between mean and median, which indicates that the data are heavily right-skewed; together with the tiny Shapiro-Wilk p-values, neither variable is normally distributed
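The mean/median gap that the comment refers to can also be printed directly. A sketch, assuming the MASS package is installed for the Animals data:

```r
library(MASS)  # provides the Animals dataset
data(Animals)

# a mean far above the median signals heavy right skew
c(mean = mean(Animals$body), median = median(Animals$body))
c(mean = mean(Animals$brain), median = median(Animals$brain))
```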

Task-2. A power transformation of a variable X consists of raising X to the power lambda. Using an
appropriate statistical test and/or plot, find best lambda values needed for transforming each of the variables
requiring power transformation.

#Code for the above exercise.


body_sqrt = sqrt(Animals$body)
body_cub = sign(Animals$body) * abs(Animals$body)^(1/3)
body_log = log(Animals$body)
four_body = sign(Animals$body) * abs(Animals$body)^(1/4)
brain_sqrt = sqrt(Animals$brain)
brain_cub = sign(Animals$brain) * abs(Animals$brain)^(1/3)
brain_log = log(Animals$brain)

plot(density(brain_sqrt))

[Density plot: density.default(x = brain_sqrt), N = 28, Bandwidth = 5.46 — still clearly right-skewed]

plot(density(brain_cub))

[Density plot: density.default(x = brain_cub), N = 28, Bandwidth = 2.196 — skew reduced but still present]

plot(density(brain_log))

[Density plot: density.default(x = brain_log), N = 28, Bandwidth = 1.03 — roughly symmetric, bell-shaped]

plot(density(body_sqrt))

[Density plot: density.default(x = body_sqrt), N = 28, Bandwidth = 6.94 — still clearly right-skewed]

plot(density(body_cub))

[Density plot: density.default(x = body_cub), N = 28, Bandwidth = 2.196 — skew reduced but still present]

plot(density(body_log))

[Density plot: density.default(x = body_log), N = 28, Bandwidth = 1.74 — roughly symmetric, bell-shaped]

#Our results show that the best transformation is LOG. We confirm this with the Box-Cox procedure.
animal_body = lm(body ~ ., data = Animals)
animal_brain = lm(brain ~ ., data = Animals)
boxcox(animal_body, plotit = TRUE)

[Box-Cox plot for animal_body: log-likelihood against lambda over (-2, 2); the 95% interval for lambda lies near 0]

boxcox(animal_brain,plotit = TRUE)

[Box-Cox plot for animal_brain: log-likelihood against lambda over (-2, 2); the 95% interval for lambda lies near 0]

#Both Box-Cox plots attain their maximum log-likelihood near lambda = 0, which corresponds to the log transformation

#BEST VALUE: lambda = 0, i.e. LOG
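The lambda maximising the Box-Cox log-likelihood can also be read off programmatically rather than from the plot. A sketch using the same model for body (MASS assumed installed):

```r
library(MASS)  # boxcox() and the Animals dataset
data(Animals)

animal_body <- lm(body ~ ., data = Animals)
bc <- boxcox(animal_body, plotit = FALSE)
bc$x[which.max(bc$y)]  # lambda with the highest log-likelihood; near 0 supports log
```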

Task-3. Apply power transformation and verify whether transformed variables have a normal distribution
through statistical test as well as smoothed histogram plots.

#Code for the above exercise.


Animals$body = log(Animals$body)
Animals$brain = log(Animals$brain)
#After transformation we confirm normality with the Shapiro-Wilk test and density plots
shapiro.test(Animals$brain);shapiro.test(Animals$body)

##
## Shapiro-Wilk normality test
##
## data: Animals$brain
## W = 0.95787, p-value = 0.31

##
## Shapiro-Wilk normality test
##
## data: Animals$body
## W = 0.98465, p-value = 0.9433

plot(density(Animals$brain))

[Density plot: density.default(x = Animals$brain) after the log transform, N = 28, Bandwidth = 1.03 — roughly symmetric]

plot(density(Animals$body))

[Density plot: density.default(x = Animals$body) after the log transform, N = 28, Bandwidth = 1.74 — roughly symmetric]

hist(Animals$brain,5)

[Histogram of Animals$brain (5 bins) after the log transform: roughly bell-shaped over the range -2 to 10]

hist(Animals$body,30)

[Histogram of Animals$body (30 bins) after the log transform: values spread roughly symmetrically over the range -5 to 12]
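A Q-Q plot is another quick normality check that complements the Shapiro-Wilk test and the histograms. A sketch on the log-transformed variables (MASS assumed installed; computed from the raw data so it does not depend on the in-place overwrite above):

```r
library(MASS)  # provides the Animals dataset
data(Animals)

log_brain <- log(Animals$brain)
log_body  <- log(Animals$body)

# points lying near the reference line indicate approximate normality
qqnorm(log_brain); qqline(log_brain)
qqnorm(log_body);  qqline(log_body)
```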

Task-4. Create a scatter plot of the transformed data. Based on the visual inspection of the plot, provide
your interpretation of the relationship between brain weight and body weight variables. You may like to add
an appropriate smoothed line curve to your plot to help in interpretation.

#Code for the above exercise.


anim_reg <- lm(brain ~ body, data = Animals)
plot(Animals$body, Animals$brain, xlab = "Log(Body Weight)",
     ylab = "Log(Brain Weight)", main = "Plot of Body Wt. to Brain Wt.")
abline(anim_reg, lwd = 2, col = "red")

[Scatter plot: "Plot of Body Wt. to Brain Wt." — Log(Brain Weight) vs Log(Body Weight) with the fitted regression line; a clear positive, approximately linear trend]

#We have used the logarithmic transformation because it compresses large values while stretching small ones

# There is a strong positive, approximately linear relationship between the two weights on the log-log scale, consistent with the idea that a larger body generally needs a larger brain to control it; the dinosaurs sit below the line, reflecting their unusually small brains for their body size
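The strength and slope of this log-log relationship can be quantified from the fitted model. A sketch, recomputed from the raw data (MASS assumed installed); note that the dinosaur outliers weaken the fit somewhat:

```r
library(MASS)  # provides the Animals dataset
data(Animals)

fit <- lm(log(brain) ~ log(body), data = Animals)
coef(fit)                # slope: % change in brain weight per 1% change in body weight
summary(fit)$r.squared   # proportion of variance explained on the log-log scale
```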
