Exercise 3


Lab session 3: Missing data, Outliers, Paired plots, Summaries, and Dinosaurs


Data analysis and Visualization – Ilias Thomas

Red: code to copy paste in RStudio

Italics: variable and dataset names

Blue: functions and arguments

For this lab we will also work with the following dataset:

• diamonds: A dataset containing the prices and other attributes of almost 54,000 diamonds.

Missing data
Having seen heatmaps, it is now time to explore another use for them. Since they can be used to visualize
patterns, why not use them to visualize missing data patterns? This can be more informative than
summary statistics, where positioning of the missing data and correlations are not shown.

Task 1
Install the library Amelia.

1. For the six datasets (economics, presidential, txhousing, mpg, msleep, midwest) that you have seen so
far, create both a heatmap and a missing values map. For the missing values, you can work with the function
missmap. Think carefully about which of those datasets it makes sense to create a heatmap for and which it
does not! Furthermore, also think about the variables that can be plotted and the ordering (for the
ordering, the cluster_rows and cluster_cols arguments can be very useful).

2. Can you draw some conclusions for the mpg, msleep, midwest datasets from the heatmaps?
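
As a starting point for the missing values map, here is a minimal sketch for one of the six datasets. It assumes that ggplot2 (which provides msleep) and Amelia are installed; the title text and the conversion with as.data.frame are illustrative choices, not requirements of the task.

```r
library(ggplot2)  # provides the msleep dataset
library(Amelia)   # provides missmap()

# Count the missing values per column first, as a numeric cross-check.
na_counts <- colSums(is.na(msleep))
print(na_counts)

# Draw the missingness map: rows are observations, columns are variables.
# as.data.frame() is used here only to hand missmap() a plain data frame.
missmap(as.data.frame(msleep), main = "Missing values in msleep")
```

Comparing the printed counts with the map shows what the map adds: the counts say how much is missing, the map shows where it is missing and whether the gaps line up across variables.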

Outliers
In exploratory data analysis, understanding outliers is one of the most common goals. We
have already seen that a boxplot or a histogram reveals some information about the outliers.
However, can we assess them in a scatterplot? As an initial exercise, plot the engine size versus the
consumption for the mpg data. What do you see in this plot? Are there any outliers?
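
A minimal sketch of this initial plot could look as follows. It assumes that displ is the engine size and that hwy (highway miles per gallon) stands in for consumption; cty would work in the same way.

```r
library(ggplot2)

# Engine displacement (displ) against highway mileage (hwy).
# Using hwy as the consumption variable is an assumption; cty is the alternative.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p_scatter
```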

Task 2
Install library plotly.

1. Save the previous plot as an object and use the function ggplotly to plot the object you saved. You can
now interact with the points on this plot. Can you tell which points are the outliers?
Hint: use text = paste("Row:", rownames(your data)) inside aes

2. Add the manufacturer variable (the maker of each car) to the plot instead of the row name. Which manufacturer stands out?
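
The hint in step 1 can be sketched as below. This is one possible setup, assuming plotly is installed and again using hwy as the consumption variable; the object names p_rows and p_interactive are made up for this example.

```r
library(ggplot2)
library(plotly)

# Attach the row number to every point through the text aesthetic.
p_rows <- ggplot(mpg, aes(x = displ, y = hwy,
                          text = paste("Row:", rownames(mpg)))) +
  geom_point()

# tooltip = "text" restricts the hover box to the text aesthetic.
p_interactive <- ggplotly(p_rows, tooltip = "text")
p_interactive
```

Hovering over a point now shows its row number, which you can look up in the data.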

As you saw in the previous task, we can evaluate outliers in a scatterplot in terms of a third variable.
Instead of using such an interactive plot, what if we want to create a static image where the third variable
is shown? We have seen how to do that with colour in the aes argument.

Is there another way? The geom_tile function gives us this opportunity. See what happens when you
plot engine size vs. consumption but use geom_tile instead of geom_point.

Task 3
1. Add a third numerical variable to the previous plot (e.g. cyl) and summarize it in relation to the first two.
You have to use z = cyl in the aes and pass the function of interest (e.g. mean) to the geom_tile
layer.

(An alternative way to do this would be to fill by the variable of interest inside aes in the geom_tile
layer.)
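
The two approaches could be sketched as follows. Note that this sketch uses stat_summary_2d rather than calling geom_tile with a function, because stat_summary_2d is where ggplot2 documents the z aesthetic and the fun argument, and it draws tiles by default. The choice of bins = 15 and of hwy as the consumption variable are assumptions made for illustration.

```r
library(ggplot2)

# Mean number of cylinders within each (displ, hwy) bin.
# stat_summary_2d() uses the tile geom by default; bins = 15 is arbitrary.
p_summary <- ggplot(mpg, aes(x = displ, y = hwy, z = cyl)) +
  stat_summary_2d(fun = mean, bins = 15)
p_summary

# Alternative: one tile per observation, filled directly by cyl.
p_fill <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_tile(aes(fill = cyl))
p_fill
```

The first plot aggregates observations that fall into the same bin, while the second draws every observation as its own tile, so overlapping cars hide each other.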

Finally, how about naming the outliers in a plot? We saw previously in the msleep data that two animals
were quite heavy. Can we identify them? We can do that by adding text to the plot! Try plotting sleep
times against weights as points and add this layer to your plot:

geom_text(aes(label = ifelse(bodywt > 100, rownames(msleep), "")), hjust = 1.1)
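
Put together, the full plot might look like the sketch below. It assumes that sleep_total is the sleep time and bodywt the weight; the 100 kg threshold is the one from the line above.

```r
library(ggplot2)

# Points for all animals, with a text label only where bodywt exceeds 100.
p_labels <- ggplot(msleep, aes(x = bodywt, y = sleep_total)) +
  geom_point() +
  geom_text(aes(label = ifelse(bodywt > 100, rownames(msleep), "")),
            hjust = 1.1)
p_labels
```

Because msleep has no custom row names, rownames(msleep) returns the row numbers as text, so the labels are numbers rather than animal names.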

Task 4
1. Repeat this but display the animal names and not the row names in the plot.

2. Plot table vs. depth in the diamonds dataset and print the price for the outliers you find. Look in all
directions!

Summaries
From the diamonds dataset, plot table over depth. What do you notice? With almost 54,000 points, many of
them overlap. One way to improve this plot is to summarize the values by grouping them into two-dimensional
bins and showing how many observations fall into each bin. geom_bin2d can be useful for that and allows us
to specify the binwidth.

Task 5
1. Do the same plot for the diamonds dataset but use geom_bin2d instead of the geom you used before.
Does the plot look better now?
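
A minimal sketch of the binned version is shown below. The bin widths (0.5 along depth, 1 along table) are arbitrary values chosen for illustration and are worth experimenting with.

```r
library(ggplot2)

# Count the diamonds in each rectangular (depth, table) bin.
# binwidth takes one width per axis: c(x width, y width).
p_bins <- ggplot(diamonds, aes(x = depth, y = table)) +
  geom_bin2d(binwidth = c(0.5, 1))
p_bins
```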

Paired plots
An alternative way to visualize the relationship of two or more variables is to plot the relationship between
ALL variables. The function ggpairs can be used in this case, from the library GGally.

Task 6
Install GGally.

1. Plot the variables from the mpg and the diamonds datasets pairwise. Be careful about which variables you
choose to plot.
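
As one possible starting point for mpg, the sketch below restricts the plot to four numeric variables. The selection in num_vars is an assumption made for this example; deciding which variables are sensible to include is part of the task.

```r
library(ggplot2)
library(GGally)

# Hypothetical selection of numeric columns; adjust as you see fit.
num_vars <- c("displ", "cyl", "cty", "hwy")

# One panel for every pair of the selected variables.
pm <- ggpairs(mpg, columns = num_vars)
pm
```

For diamonds, keep in mind that the dataset is large, so a pairwise plot of many columns can take a while to draw.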

Dinosaurs and the importance of data visualization


(The following was accessed through Microsoft Word - DataDino-PostRebuttal.docx (autodesk.net))

In the lab room you will find a dataset called DatasaurusDozen.tsv. Import the data into your workspace
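and create summary statistics. You can use this code to do so:
library(readr)

# The file is tab-separated, so the separator is set to "\t"
df <- read.csv("your path here", header = TRUE, sep = "\t")

# One data frame per value of the dataset column
datasets <- split(df, df$dataset)

lapply(datasets, summary)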

What do you notice here? What conclusion would you draw if only this was available to you?

Task 7
1. Create separate x vs. y plots for all the datasets in the list. Use what you have learned in the previous
labs to do so. What is your conclusion now?
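
If you want to try the technique before working on the lab file, the sketch below uses R's built-in anscombe data as a stand-in, since DatasaurusDozen.tsv is only available in the lab room. The stacked layout (dataset, x, y) and the names long, stats and p_facets are assumptions made for this example.

```r
library(ggplot2)

# Stack the four built-in Anscombe sets into a long layout
# with the columns dataset, x and y.
long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    dataset = paste0("set", i),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# Per-dataset means of x and y.
stats <- aggregate(cbind(x, y) ~ dataset, data = long, FUN = mean)
print(stats)

# One panel per dataset.
p_facets <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ dataset)
p_facets
```

The same facet_wrap call should carry over to the Datasaurus data frame, provided its grouping column is also called dataset.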
