Visualisation ggplot Exercise
1 A first (scatter)plot
1.1 Topic and data set
1.2 Sketching the visualisation
1.3 Preparing the data
1.4 Plotting with ggplot2
1.5 Building our visualisation
1.6 Interpreting the visualisation
1.7 Prettifying our plot
1.7.1 Labeling
1.7.2 Unity line
1.7.3 Datapoint labeling
6 Pick-and-choose your visual adventure (Bonus)
6.1 Another ggplot visualisation
6.1.1 Creating your graph
6.1.2 R4DS: ggplot’s possibilities
6.1.3 Extra extra ggplot
6.2 Leaflet for R
6.3 Plotly
6.4 GGanimate
Chapter 1
A first (scatter)plot
Exercise 3: Sketch the visualisation
Using pen and paper, make a sketch of what you think the visualisation
will look like. Accuracy doesn’t matter for now, especially in a sketch.
Draw the pattern you expect, and add some informative labels.
Exercise 6
Explore the data set, using e.g., head(), glimpse(), distinct().
What data variables (i.e., columns) do you have? What defines the ob-
servations (rows)?
Each row is an individual observation, so the data is tidy. That is nice: it saves
time on cleaning the data. We also don’t have to think about any summary
statistics, as we want every observation (country) visualised.
There is one issue, however: the data is split per quarter, whereas we want
to visualise the total import and export for the whole year.
Exercise 7: Quarters and dimes
Create a column import, with the total import in 2018.
(Hint: which function from the exercises of yesterday can add a new
column to your data frame?)
Exercise 8
Also create a column export, with the total export in 2018.
Great, now we have import and export for the whole year! But... we still
have the columns for each quarter as well. These do not spark joy, so we want
to get rid of them.
Exercise 9: Marie Kondo the dataframe
Use the select() function to remove the unwanted columns. Use ?select
to see the documentation for this function.
Hint: deselecting the columns to remove, is the same as selecting the
columns you want to keep...
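The hint can be tried out on a toy data frame first. The sketch below uses made-up column names (country, q1, q2, total), which are assumptions for illustration and not the columns of the trade data. It suggests that dropping columns with a minus sign and naming the columns to keep lead to the same result.

```r
library(dplyr)

# Toy data; the column names are invented for illustration only
toy <- tibble(
  country = c("A", "B"),
  q1 = c(1, 2),
  q2 = c(3, 4),
  total = c(4, 6)
)

# Deselect the columns you want to remove ...
dropped <- select(toy, -q1, -q2)

# ... or select the columns you want to keep
kept <- select(toy, country, total)

# Compare the two results
identical(dropped, kept)
```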
Ah, much better! Now we have only the information that we want to work
with. Time to get some plotting done.
Specifying the mapping is also done in an argument, using the aes() func-
tion. This function creates a mapping from the dataset to the aesthetics of the
geom. You can + this mapping to the plot:
ggplot(data = dataframe) +
  aes(x = foo, y = bar) +
  geom_point()
Alternatively, you can include it as the mapping = named argument to the
geom function:
ggplot() +
  geom_point(data = df,
             mapping = aes(x = foo, y = bar))
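As a runnable sketch of the two forms above, the following uses R's built-in mtcars data set. That choice is an assumption made here for illustration; it is not the trade data. Both versions should place their points at the same positions.

```r
library(ggplot2)

# Form 1: add the mapping to the plot as its own layer
p1 <- ggplot(data = mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()

# Form 2: pass the data and the mapping directly to the geom
p2 <- ggplot() +
  geom_point(data = mtcars,
             mapping = aes(x = wt, y = mpg))

# Compare the x and y positions that each plot will draw
all.equal(layer_data(p1)[c("x", "y")], layer_data(p2)[c("x", "y")])
```

The first form is convenient when several geoms share one mapping; the second keeps data and mapping local to a single geom.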
1.5 Building our visualisation
That was a bunch of information. Time to start simple.
Exercise 11: Your first ggplot
Create a scatterplot of the trade data, where each point is an observation,
and the x and y coordinates are determined by the total 2018 import and
export.
Hint: look at the syntax of the examples above.
Exercise 15: Reading our plot
Where, in relation to this line, would a country with a surplus show up?
If a country shows up below this unity line, does it have a surplus or a
deficit?
Does this depend on the mapping? Why (not)?
Great! That’s a much better looking visualisation, which you could show
someone else and have them understand what the data is and what the pattern
of import and export values is.
Chapter 2
Last week, you worked hard on a project in which the aim was to infer what
happened to air quality during the 2019-2020 forest fire season in Australia.
You have written a small report on it, even.
What was obviously missing, was a visual representation of your results.
Sometimes, a table just isn’t sufficient to illustrate magnitudes.
In this chapter, we will visualise the change in air quality over time, using a
line graph.
One topic we have avoided so far, is the topic of dates. Dates are a tricky
component of analysis: a date is not quite a number and not quite a piece of
text. Arithmetic with dates is non-trivial: can you calculate how many days
there are between January 20th and the second Wednesday of March quickly?
Computers can, but only if you tell them to treat your dates as dates.
We won’t go into depth on this topic this week. Dates are tricky and
finicky, and there are packages dedicated to helping you work with them.
However, we will see one or two examples where date times are useful.
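To make the date question above concrete, here is a minimal base-R sketch. It assumes the year 2020, which the text does not specify, and it uses only functions that ship with R.

```r
# Assumption: both dates fall in 2020; the text does not name a year
start <- as.Date("2020-01-20")

# All days in March 2020, then keep only the Wednesdays
march <- seq(as.Date("2020-03-01"), as.Date("2020-03-31"), by = "day")
wednesdays <- march[as.POSIXlt(march)$wday == 3]  # wday: 0 = Sunday, 3 = Wednesday
second_wednesday <- wednesdays[2]

# Subtracting two Date objects gives a time difference in days
second_wednesday - start

# The same subtraction on plain text fails, because R does not know
# that the two pieces of text are dates:
# "2020-03-31" - "2020-01-20"
```

The subtraction only works because as.Date() told R to treat the text as dates.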
Exercise 21
Take another look at this (new) air quality data set. Specifically, use
glimpse() on it.
Looking at these results, note the text between <...>.
Figure 2.1: The result of glimpsing this data set.
That looks kinda cool! The careful observer, however, would notice some-
thing amiss. Instead of a clear, simple line, ggplot has drawn for you a line that
is slightly wonky.
As it turns out, without specifying which data points belong to which line,
geom_line() connects all data points into a single line, in order of their position on the x-axis.
Your wonky line is the result of this data set consisting of data from multiple
stations, all measured at the same time points. Unless you tell ggplot that it
should differentiate between stations, it cannot know, and it produces a
graph that is very wrong.
To fix it, we need to make sure that ggplot knows which data point belongs
to which line. This sounds a lot like what we do with the aesthetic. After all,
we’re telling ggplot exactly how to interpret the connection between data and
visual. To tell ggplot which measurement of AQI_Site belongs to which group,
all we have to do is add an argument to the aes() layer:
ggplot(dataframe) +
  geom_*() +
  aes(x = variable_1, y = variable_2, group = variable_3)
Pretty cool, right? Ggplot even immediately adds a legend for you, so that
you can tell the stations apart.
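The effect of grouping can be sketched on a toy data set. The two stations and their values below are invented for illustration and are not the air quality data. Note that the group aesthetic on its own splits the lines without adding a legend, whereas mapping the same variable onto colour splits the lines and adds a legend.

```r
library(ggplot2)

# Toy data: two made-up stations measured at the same three time points
toy <- data.frame(
  time = rep(1:3, times = 2),
  aqi = c(10, 20, 30, 40, 50, 60),
  station = rep(c("North", "South"), each = 3)
)

# Without a grouping, all six points are joined into one zigzag line
wonky <- ggplot(toy) +
  geom_line() +
  aes(x = time, y = aqi)

# Mapping the station onto colour also splits the data into one line per station
fixed <- ggplot(toy) +
  geom_line() +
  aes(x = time, y = aqi, colour = station)

# Number of separate lines in each plot
length(unique(layer_data(wonky)$group))
length(unique(layer_data(fixed)$group))
```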
Exercise 26: A proper visualisation
Remember: any good visualisation has a title, and sensible labels. That
includes a sensible title, also for the legend.
To change the label for any mapping, you can always refer back to the
keywords you used in aes(). For instance, to change the label for your
colored line, you could do something like labs(color = "A title").
Add useful labels to your line graph, where necessary.
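As a sketch of how the keywords in labs() line up with the keywords in aes(), the following uses the built-in mtcars data set rather than the air quality data. The label texts are made up for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars) +
  aes(x = wt, y = mpg, colour = factor(cyl)) +
  geom_point() +
  labs(
    title = "Fuel efficiency by car weight",
    x = "Weight (1000 lbs)",     # relabels the x mapping
    y = "Miles per gallon",      # relabels the y mapping
    colour = "Cylinders"         # relabels the legend of the colour mapping
  )
p
```

Each argument of labs() other than title matches one aesthetic used in aes(), which is why no separate legend function is needed here.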
2.3.1 (Bonus) Air quality’s coda
Exercise 28: Mean vs. median
Add two reference lines to your visualisation: one representing the mean
site AQI during the forest fire season of 2019-2020, and one representing
the median. Give them different colours.
Do you understand what you’re looking at?
Does the result surprise you? Why?
Chapter 3
In this chapter we’ll learn about another common plot: a bar plot. In a bar plot,
the data is visualised as a set of bars, where each bar represents one or more
observations. The position of a bar on the horizontal axis indicates something
about which observation(s) are represented by that bar. The height of the bar,
on the vertical axis, represents some value of the observation(s) in that bar.
(This assumes a vertical bar plot, also known as a column-plot. You can also
create a horizontal bar plot, where this mapping is switched around).
For this chapter, we will continue with the pipeline of chapter 1. Hence,
make sure to keep your previous script at the ready.
There should be two bars, and the trade differential determines the height
of each bar.
3.2 Preparing the data
In ggplot, we can draw the bars with geom_col(). This requires mapping onto (at least) the
aesthetics x (position of the bar on the horizontal axis) and y (the variable to
map onto the vertical axis). So, we want to have a dataset that has one variable
indicating the group (surplus or deficit), and one variable indicating the mean
trade differential.
This is not the most exciting data frame... it’s just one column and one row.
But it’s good to know that we can do the basic summary before we make things
more complicated.
Exercise 34: Determining trade deficits
Create the new column that indicates whether a country has a deficit.
Now that we have this column that dichotomizes our dataset, we can use it
to tell summarise to treat deficits and surpluses separately.
Exercise 35: Group by deficit/surplus
Group the dataframe by the newly created column. Print the grouped
dataframe; how can you see how it is now grouped?
In our case, this should create two groups, one for when our new column is
TRUE (i.e., those observations with a deficit) and one for when our new column
is FALSE (i.e., those observations with a surplus).
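How grouping changes a summary can be sketched on toy data. Everything below is invented for illustration: the column names, the values, and the assumption that a negative differential counts as a deficit.

```r
library(dplyr)

# Toy data; names, values and the sign convention are assumptions
toy <- tibble(
  country = c("A", "B", "C", "D"),
  differential = c(-2, 4, -6, 8),
  deficit = differential < 0
)

# Without grouping: one row summarising all countries together
overall <- summarise(toy, mean_differential = mean(differential))
overall

# With grouping: one row per value of the logical column
by_deficit <- toy %>%
  group_by(deficit) %>%
  summarise(mean_differential = mean(differential))
by_deficit
```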
Exercise 36: Re-summarise
Make another summary, now on the grouped dataframe. Did it work?
3.2.3 Relabeling
Ugh, it’s kind of ugly though that this table has TRUE and FALSE in a column.
This is not very informative about what exactly is “true” or “false”. As a matter
of fact, that is kind of ugly in our big dataframe as well. Let’s clean that up
before we call this analysis done.
What would be better is if our column instead contained the text deficit
if a country has a trade deficit, and surplus otherwise.
To operate on a row in a dataframe conditionally, we use the ifelse()
function. This takes three arguments: a logical expression to evaluate, what
to return if the expression evaluates TRUE, and what to return if the expression
evaluates FALSE. We can put this function inside a mutate() to create a new
column with the output:
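A minimal sketch of this pattern on toy data follows. The data frame and the new column name balance are hypothetical, not taken from the course pipeline.

```r
library(dplyr)

# Toy data with a logical column, invented for illustration
toy <- tibble(
  country = c("A", "B", "C"),
  deficit = c(TRUE, FALSE, TRUE)
)

# ifelse(test, value if TRUE, value if FALSE), applied to every row at once
relabelled <- toy %>%
  mutate(balance = ifelse(deficit, "deficit", "surplus"))
relabelled
```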
3.3 Building our visualisation
Exercise 38
Create a barchart, using the summary dataframe as the data source, using
geom_col(), mapping the deficit/surplus to the horizontal position, and
the mean trade differential to the vertical position.
ggplot(data = summary_df) +
  aes(x = foo,
      y = bar) +
  geom_bar(stat = "identity")
is identical to
ggplot(data = summary_df) +
  aes(x = foo,
      y = bar) +
  geom_col()
In contrast,
ggplot(data = raw_df) +
  aes(x = foo,
      y = bar) +
  geom_bar(stat = "summary",
           fun = "mean")
does not plot the raw values of bar on the y-axis; it first applies the function
mean, thus summarising across rows. For this, we therefore need to specify the raw
dataframe as the data source. (In ggplot2 versions before 3.3.0, the argument was called fun.y.)
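As a check on what the summary statistic does, the sketch below builds a small raw_df with made-up values and compares the bar heights ggplot computes with group means calculated by hand. It uses the fun argument, which ggplot2 versions before 3.3.0 called fun.y.

```r
library(ggplot2)

# Toy raw data; foo and bar follow the placeholder names used above
raw_df <- data.frame(
  foo = c("deficit", "deficit", "surplus", "surplus"),
  bar = c(-2, -6, 4, 8)
)

p <- ggplot(data = raw_df) +
  aes(x = foo, y = bar) +
  geom_bar(stat = "summary", fun = "mean")

# Bar heights computed by ggplot, next to the group means computed by hand
bar_heights <- layer_data(p)$y
hand_means <- tapply(raw_df$bar, raw_df$foo, mean)
bar_heights
hand_means
```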
Exercise 40
Re-create the bar plot in your pipeline, but this time not from the sum-
mary dataframe, but from the full (201 rows) dataframe.
And there we go, now our visualisation is constructed directly from the
original data. This has the advantage that if something changes about the
data, we do not have to worry about updating our summary before updating
the barplot.
Done!
In these chapters you’ve practiced the basics of visualisation within the Gram-
mar of Graphics framework. We’ve dealt briefly with the data-layer, the idea of
mapping variables onto aesthetics, and even the idea of statistical transforma-
tions. We’ve worked with two geoms: point and bar.
You’ve also encountered some more iterative development: we’re continuing
to extend our original data analysis pipeline. We first created a barplot from
our summary table, and only then went back to create it from the full dataframe
itself. This is not necessarily the right order, but in this case, I thought it would
be conceptually easier to understand if we first do the summarising ourselves.
Sometimes iterative development is also for our own understanding.
Chapter 4
4.2 Recognising errors in a bad graph
Exercise 41: The good, the bad, the wrong, and the ugly
The following pages all contain a visualisation. In groups of two or three,
discuss all of these visualisations. For each graph, using your knowledge
about visualisations and bad/wrong/ugly visualisations, you should:
check if you understand what it is trying to communicate. Then,
decide whether you think this is a good, bad, ugly, or wrong graph
(or several of those), and
Figure 4.1: Lunch time in the office
Figure 4.2: Speed versus fuel usage in a specific Toyota.
Figure 4.3: Gun deaths in Florida before and after implementing the ‘Stand your
ground’ law. ‘Stand your ground’ refers to legislation implemented in several
states that allows the defending of property using firearms.
Figure 4.4: Popularity of different genres over time. The text above the graphic
explains what you’re looking at: “This graphic shows film genre popularity
over time, represented as the percentage of all films released that year with the
specified genre tagged on IMDB.”
Figure 4.5: Change of ‘artistic standards’ over time
Figure 4.6: How speed of voting and voting margins in the US 2020 elections
relate. Source.
Figure 4.7: Sleeping pattern of a newborn. For explanation, see here.
Figure 4.8: Incarceration in Real Numbers. This image is only a small part of
the whole, and we want you to look at the full visualisation. You can find the
full visualisation here.
Chapter 5
Creating better visualisations
Avoiding ugly graphs Ugly graphs are easy to make, and seemingly easy
to avoid. Most ugly graphs are the result of over-designing. A good way to
avoid accidental ugly graphs is by keeping colours, text, and lines as simple and
plain as possible. When using plotting software, take extra care that colours
and fonts are not ugly. Avoid 3D graphs.
Avoiding bad graphs Everyone falls into the trap of making a bad graph at
some point. This is the hardest type of mistake to avoid.
By far the most important rule of visualisation: usually, simpler is better.
Some other rules of thumb to avoid bad graphs are:
Geoms often are suitable for either continuous data (e.g. line graph) or
discrete data (e.g. bar plot). Respect that difference.
Avoid breaking your axes to save white space,
Chapter 6
paying under $27,000 were 16% scarcer. Those who feed, transport,
clothe and entertain people who are out-and-about account for about
a quarter of American employment, note David Autor and Elisabeth
Reynolds of MIT. The large number of low-paid service jobs is often
lamented, but “having too few low-wage, economically insecure jobs
is actually worse than having too many”.” (‘Zoom and gloom’, 2020)
In this set of exercises, you will design a visualisation that could be printed
next to this paragraph to support the conclusions drawn in this article.
Exercise 42: The visualisation
What story should your graph tell?
Can you identify what type of geom_*() you might want to use? Can
you specify any other necessary elements you’d use from ggplot?
Sketch approximately what the graph would look like. (Actually do
the sketching - find yourself some paper, or use paint.)
If you write code that’s not meant to be run, but instead meant to convey
an idea or concept, it is referred to as pseudocode. It can be very useful to write
pseudocode, especially when working on more complex projects. By writing the
pseudocode first, you can ignore syntactical challenges, and focus on the broader
concept first. Once you’ve worked out where what is supposed to happen, you
can then fill in the details without worrying about the whole.
Exercise 44: The pseudotechnical nitty-gritty (Bonus)
Write in ggplot-esque pseudocode how you would put this visualisation
together.
6.1.2 R4DS: ggplot’s possibilities
Many components of the tidyverse were written by, or inspired by, the same person:
Hadley Wickham. Amongst other meaningful things in the R-universe, Wickham
co-wrote R for Data Science (Grolemund and Wickham, 2017). This book contains so much useful
information, and is both a great reference and a great teacher of important tidyverse
concepts.
Chapter 3 is a chapter on using ggplot. Work through the chapter.
Exercise 47
Using any other data you find interesting, make an interactive map using
Leaflet for R.
A good source for data is Kaggle.
6.3 Plotly
Plotly can make graphs similar to those you make in R, and it makes them interactive. Personally,
I prefer the way ggplot builds its visualisations, so I hardly ever build graphs in plotly directly.
I do quite like plotly, however, because it can make my ggplot plots interactive!
Exercise 48
Taking any of the graphs you’ve made this morning, create an interactive
ggplot visual.
To do so, first install plotly (install.packages("plotly")).
Run the code to create the original ggplot.
Then run: plotly::ggplotly().
Tada!
6.4 GGanimate
In addition to interactive visualisations, you can make animated visualisations.
Animated visualisations are often really fun to look at, but the added information
transfer is quite niche. However, there are some cases where adding animation
is preferred over other options (such as faceting). And considering this is the
section of ‘fun with visualisations’, why not see if you can get it to work?
NB: creating animated pictures often requires installing software other than
R and RStudio. This doesn’t have to be difficult, but can be a bit of a pain. So
only do this if you have plenty of time, and are comfortable working with your
computer!
Exercise 49: Animations are fun
Reproduce the ‘Yet another example’ example on gganimate
Figure 6.1: A version of the visualisation included in The Economist’s ‘Zoom
and gloom’ (‘Zoom and gloom’, 2020). Title: Separate and unequal. Subtitle:
United States, employment rate by salary. Y-axis: % change since January 2020
(ticks from −40.0% to 10.0%).
Chapter 7
Visualisation exercise: recreating The Economist’s Bremorse visualisation
Figure 7.1: The Economist’s Bremorse visualisation. You can find the original
article here.
7.1.3 Line graphs
Remember, the geom that can draw lines is geom_line().
It works like most other geoms. It needs (at least) two mappings: one to the
x-axis, and one to the y-axis.
Additionally, it can map a variable onto the colour aesthetic. You will need this
functionality for this assignment if you want to use only one geom layer. Use
the documentation to figure out how this works.
7.1.4 Labels
Labels (titles, axis names, legend names) are amongst the most essential com-
ponents of a graph to transfer information. Succinct, short, clear labels will
improve readability immensely.
Like everything else in ggplot, there is a myriad of ways to add labels to a
graph. One option is through the labs() layer.
Changing the title of a legend To change the title of a legend, you can
rename it in the labs() layer by referring to its name in the aesthetic layer.
Subtitle You should know by now that you can add a title in labs(). You
can add a subtitle in the exact same way.
This is a little abstract, so let’s check out an example. Check out the docu-
mentation of
scale_y_continuous()
Scrolling down, you can see all the different components of the y-axis scale that
you can tweak using these functions. For instance, you can tweak the limit of
a plot. Try adding the following layer to your graph, and see what happens:
scale_y_continuous(limits=c(NA, 100)).
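To see what that layer does before adding it to your own graph, here is a small sketch on made-up data in which one value lies above the upper limit of 100.

```r
library(ggplot2)

# Toy data: the last y value lies above the upper limit set below
toy <- data.frame(x = 1:3, y = c(50, 90, 150))

# Upper limit fixed at 100; NA lets ggplot choose the lower limit from the data
p <- ggplot(toy) +
  aes(x = x, y = y) +
  geom_point() +
  scale_y_continuous(limits = c(NA, 100))

# Printing p should warn that a row was removed: y = 150 falls outside the limits
p
```

Setting limits on a scale therefore does more than zoom: data outside the limits is dropped from the plot.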
Labels on axes
Using scale_*_*() functions, you can change the labels on your scales. The
labels parameter can take a vector (c()) of pieces of text to explicitly define
the text on the labels.
While this can be useful, can you imagine a reason why you should refrain
from hard-coding your labels, if at all possible?
Instead, labels also accepts an instruction for a transformation. This is
super powerful, and best practice. Those instructions should come in the shape
of a function. For instance, try and add labels = scales::label_percent()
in your scale_y_continuous(). Does it do what you expected?
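The sketch below shows what kind of object labels receives in this case. The data frame and its share column are invented for illustration; the values are proportions between 0 and 1, because label_percent() multiplies by 100 by default.

```r
library(ggplot2)

# label_percent() returns a function that turns proportions into text
to_percent <- scales::label_percent()
to_percent(c(0.1, 0.25))

# Toy data with proportions between 0 and 1
toy <- data.frame(x = 1:3, share = c(0.1, 0.25, 0.5))

# The function is applied to whatever breaks ggplot puts on the y-axis
p <- ggplot(toy) +
  aes(x = x, y = share) +
  geom_line() +
  scale_y_continuous(labels = scales::label_percent())
p
```

If your data is already expressed in percentage points (for example 25 rather than 0.25), label_percent(scale = 1) keeps the numbers as they are and only appends the % sign.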
ggsave("airquality.pdf")
If you don’t specify the width and height of the plot in ggsave(), it will default
to the size of the current graphics device; in RStudio, that is the pane that contains your visualisation.
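To make the size of the saved file independent of the plotting window, you can pass width and height explicitly. The sketch below uses the built-in mtcars data and a temporary file name, both chosen here for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()

# Written to a temporary file here so the sketch leaves nothing behind
out_file <- tempfile(fileext = ".pdf")

# width and height are in inches unless the units argument says otherwise
ggsave(out_file, plot = p, width = 6, height = 4)
file.exists(out_file)
```

Passing plot = p explicitly also avoids relying on whichever plot happened to be displayed last.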
7.2 Assignment pt. 2: Good visualisations
The Bremorse visualisation you have recreated is only one of two versions The
Economist published. Compare the two visualisations and decide which you
think is better.
Figure 7.3: Bremorse: two versions. Left a line plot, right a scatter plot.
In my opinion, we can only judge how effectively a chart communicates based on its purpose. If the
purpose of the visualisation is to show how erratic the decision making was, then the chart on the left
is the right fit. However, if the purpose is to show the dramatic change in the response to this
question over time, then I think the chart on the right-hand side communicates that best.
7.3 (Bonus) Recreating a visualisation, cont.
What you’ve created so far captures the basic outline of the original visualisation
but doesn’t really look like it. This section outlines some (aesthetic) options
with which you would be able to almost precisely recreate the original.
Bibliography