
Question 1.

To find the data type of each variable, I will use the function class:

class(Laptop.df$Date)
class(Laptop.df$Configuration)
class(Laptop.df$Customer.Postcode)
class(Laptop.df$Store.Postcode)
class(Laptop.df$Retail.Price)
class(Laptop.df$Screen.Size..Inches.)
class(Laptop.df$Battery.Life..Hours.)
class(Laptop.df$RAM..GB.)
class(Laptop.df$Processor.Speeds..GHz.)
class(Laptop.df$Integrated.Wireless.)
class(Laptop.df$HD.Size..GB.)
class(Laptop.df$Bundled.Applications.)
class(Laptop.df$customer.X)
class(Laptop.df$customer.Y)
class(Laptop.df$store.X)
class(Laptop.df$store.Y)
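
The same check can be done for every column in a single call. This is a minimal sketch, assuming a small made-up data frame (toy.df is hypothetical and only stands in for Laptop.df):

```r
# Toy data frame standing in for Laptop.df (all values are made up)
toy.df <- data.frame(
  Configuration = c(1L, 2L, 3L),
  Retail.Price = c(455.5, 500, 610.25),
  Store.Postcode = c("AA1 1AA", "BB2 2BB", "AA1 1AA"),
  stringsAsFactors = FALSE
)

# sapply() applies class() to each column and returns a named character vector
col.classes <- sapply(toy.df, class)
print(col.classes)

# str() shows the same types together with a preview of the values
str(toy.df)
```

With the real data, sapply(Laptop.df, class) would list all sixteen types at once.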

Did I expect exactly the same data types as shown by R?


No. First, for Configuration I expected a character data type, because I think this variable works
like an ID for the combination. I won’t use this variable for calculations, and for this reason I
expected a character.
Second, I expected a non-integer numeric variable for Battery.Life..Hours. and Screen.Size..Inches.
because both are measurements and can take non-integer values.

Question 2.

I will use a heatmap to detect missing values. The heatmap uses a binary coding of the original
dataset: a missing value is coded as 1, and any other value is coded as 0.

heatmap(1 * is.na(Laptop.df), Rowv = NA, Colv = NA)


The heatmap shows that Retail.Price, store.X and store.Y have missing values. I can check this
with the function summary, which reports the number of NA's in each variable.

summary(Laptop.df$Retail.Price)
summary(Laptop.df$store.X)
summary(Laptop.df$store.Y)
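
The missing values can also be counted for all columns at once. This is a sketch on a made-up data frame (toy.df is hypothetical, not the laptop data):

```r
# Toy data frame with a few missing values (all values are made up)
toy.df <- data.frame(
  Retail.Price = c(455, NA, 610, NA),
  store.X = c(1.5, 2.5, NA, 4.5),
  customer.X = c(10, 20, 30, 40)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na.counts <- colSums(is.na(toy.df))
print(na.counts)

# Names of the columns that contain at least one missing value
cols.with.na <- names(na.counts)[na.counts > 0]
print(cols.with.na)
```

Applied to the real data, colSums(is.na(Laptop.df)) should flag the same columns as the heatmap.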

Question 3.

We can estimate the typical price visually with a histogram and a boxplot; the boxplot shows the median directly.

hist(Laptop.df$Retail.Price, xlab = "Price in GBP")


boxplot(Laptop.df$Retail.Price, ylab = "Price in GBP")

We can see that the median price is close to 500. In addition, approximately 440 is the 25th percentile
and 600 is the 75th percentile. To get the exact median and mean, I can use summary.

summary(Laptop.df$Retail.Price)

Question 4.

We have missing values in Retail.Price. Before comparing stores, I will assign a value to the
missing entries. In this case I will replace the NA values with the mean of the observed prices.

Laptop.df$Retail.Price[is.na(Laptop.df$Retail.Price)] <- mean(Laptop.df$Retail.Price, na.rm = TRUE )


summary(Laptop.df$Retail.Price)
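
One side effect of this choice is worth keeping in mind: filling with the mean leaves the mean unchanged but reduces the spread, because the filled-in values sit exactly at the centre. This is a small sketch on a made-up vector (prices is hypothetical):

```r
# Made-up prices with two missing values
prices <- c(400, 500, NA, 600, NA)

# Mean and standard deviation of the observed values only
mean.before <- mean(prices, na.rm = TRUE)
sd.before <- sd(prices, na.rm = TRUE)

# Replace the NA entries with the observed mean
imputed <- prices
imputed[is.na(imputed)] <- mean.before

mean.after <- mean(imputed)
sd.after <- sd(imputed)

# The mean is preserved, while the standard deviation shrinks
print(c(mean.before = mean.before, mean.after = mean.after))
print(c(sd.before = sd.before, sd.after = sd.after))
```

Replacing with the median instead, or dropping the incomplete rows, are alternatives with different trade-offs.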

Now we are ready to compare retail prices across the stores. To compare stores, I need one summary
measure per store; in this case I will use the mean.

data.for.store <- aggregate(Laptop.df$Retail.Price, by = list(Laptop.df$Store.Postcode), FUN = mean)


names(data.for.store)<-c('Store','MeanPrice')

library(ggplot2) # needed for ggplot() and geom_bar()
ggplot(data.for.store) + geom_bar(aes(x = Store, y = MeanPrice), stat = "identity")

Our data contain information about sixteen stores. The mean retail price varies across them: five
stores are below 500, while eleven stores are above 500. Even among those eleven the means are not
the same; for example, NW52QH is higher than S1P3AU even though both are above 500.

Question 5.

We use a boxplot to compare the price of laptops with and without integrated wireless.

boxplot(Laptop.df$Retail.Price ~ Laptop.df$Integrated.Wireless., xlab = "Integrated Wireless", ylab = "Price")

We can conclude that the price distribution is very similar for both groups. Whether or not a laptop
has integrated wireless, the price can be the same.
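
The visual comparison can be backed up with a number per group. This is a sketch on a made-up data frame (toy.df and its values are hypothetical; only the column names follow the dataset):

```r
# Toy data: prices for laptops with and without integrated wireless (made up)
toy.df <- data.frame(
  Retail.Price = c(450, 505, 550, 460, 500, 540),
  Integrated.Wireless. = c("Yes", "Yes", "Yes", "No", "No", "No")
)

# tapply() splits the prices by group and applies median() to each group
group.medians <- tapply(toy.df$Retail.Price, toy.df$Integrated.Wireless., median)
print(group.medians)
```

Medians that are close to each other would support the reading of the boxplot; a large gap would contradict it.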

Question 6.

These scatter plots are useful for the information they reveal. For example, we can see that the
stores sell laptops in only two screen sizes: 15 and 17 inches. In addition, 15-inch laptops can be
cheaper than 17-inch ones, because the minimum price for the 15-inch size is below 200, whereas the
minimum for the 17-inch size is above this price. We can do the same analysis for all the scatter plots.
In the case of Configuration, we can see a clear trend: as the configuration value increases,
the price tends to increase.

Question 7.

In my opinion, RAM (GB) makes the biggest difference in the laptop price. In the scatter plot we
can see a difference in price between laptops with 1 or 2 GB and those with 4 GB. As the RAM
capacity increases, the price tends to increase.

To support this, I will use a heatmap of the correlations between the variables.

data.for.cor<-Laptop.df[,c(2,5,6,7,8,9,11)]
library(ggplot2)
library(reshape) # to generate input for the plot
cor.mat <- round(cor(data.for.cor),2) # rounded correlation matrix
melted.cor.mat <- melt(cor.mat)
ggplot(melted.cor.mat, aes(x = X1, y = X2, fill = value)) +
geom_tile() +
geom_text(aes(x = X1, y = X2, label = value))
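
Besides reading the heatmap, the correlations with the price can be ranked directly. This is a sketch on a made-up data frame (toy.df is hypothetical; only the column names follow the dataset):

```r
# Toy numeric data standing in for data.for.cor (all values are made up)
toy.df <- data.frame(
  Retail.Price = c(300, 400, 500, 600, 700),
  RAM..GB. = c(1, 1, 2, 4, 4),
  Screen.Size..Inches. = c(15, 17, 15, 17, 15)
)

cor.mat <- cor(toy.df)

# Keep the Retail.Price column and drop its correlation with itself
price.cor <- cor.mat[, "Retail.Price"]
price.cor <- price.cor[names(price.cor) != "Retail.Price"]

# Sort by absolute value so that strong negative correlations also rank high
ranked <- price.cor[order(abs(price.cor), decreasing = TRUE)]
print(round(ranked, 2))
```

A high correlation only indicates a linear association with the price; it does not by itself show that the variable causes the price difference.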

We can see that RAM (GB) is the variable with the highest correlation with the price. Configuration
has a strong correlation with Screen Size, but it is less important for the price. I can check whether
RAM is a differentiator by comparing the mean price for each RAM size:

data.for.ram <- aggregate(Laptop.df$Retail.Price, by = list(Laptop.df$RAM..GB.), FUN = mean)
names(data.for.ram)<-c('RAM','MeanPrice')
library(ggplot2)

ggplot(data.for.ram) + geom_bar(aes(x = RAM, y = MeanPrice), stat = "identity")

Now we can conclude that RAM makes the biggest difference in the price.
