
Data Visualisation

Joris Vincent & Lucie Kattenbroek

28th December 2021


Contents

1 A first (scatter)plot
  1.1 Topic and data set
  1.2 Sketching the visualisation
  1.3 Preparing the data
  1.4 Plotting with ggplot2
  1.5 Building our visualisation
  1.6 Interpreting the visualisation
  1.7 Prettifying our plot
    1.7.1 Labeling
    1.7.2 Unity line
    1.7.3 Datapoint labeling

2 Line plots - visualising air quality
  2.1 Dates in data
  2.2 Line graphs
  2.3 Air quality’s last appearance (we promise)
    2.3.1 (Bonus) Air quality’s coda

3 More complicated: barplot
  3.1 Sketching the visualisation
  3.2 Preparing the data
    3.2.1 Determining trade differentials
    3.2.2 Group the data
    3.2.3 Relabeling
  3.3 Building our visualisation
  3.4 Statistical transformations
  3.5 Summarising plots

4 Good, bad, ugly, and wrong
  4.1 Plotting data versus visualising results
    4.1.1 Ugly, wrong, and bad graphs
  4.2 Recognising errors in a bad graph

5 Creating better visualisations
  5.1 Avoiding bad, wrong, and ugly graphs
    5.1.1 Common errors in designing graphs

6 Pick-and-choose your visual adventure (Bonus)
  6.1 Another ggplot visualisation
    6.1.1 Creating your graph
    6.1.2 R4DS: ggplot’s possibilities
    6.1.3 Extra extra ggplot
  6.2 Leaflet for R
  6.3 Plotly
  6.4 GGanimate

7 Visualisation exercise: recreating The Economist’s Bremorse visualisation
  7.1 Recreating a Bremorse visualisation
    7.1.1 Data
    7.1.2 The basics
    7.1.3 Line graphs
    7.1.4 Labels
    7.1.5 Colours and themes
    7.1.6 Scales and labels
    7.1.7 Saving a figure
  7.2 Assignment pt. 2: Good visualisations
  7.3 (Bonus) Recreating a visualisation, cont.
    7.3.1 Scales pt. II: colours
    7.3.2 Text on a graph
    7.3.3 More themes
    7.3.4 Scales pt. III: Dates

Chapter 1

A first (scatter)plot

We will start with a simple visualisation: a scatterplot. In a scatterplot, each
individual observation is visualised as an individual point. The position of the
point is determined by two properties (in a 2D scatterplot): the horizontal (x)
position and the vertical (y) position. These are the two aesthetics that we will
have to map two variables onto.

1.1 Topic and data set


We will be analysing trade deficits. We want to answer questions such as: how
many countries have a trade deficit, i.e., import more than they export? Is it
a few countries doing all the exporting? What is the average trade deficit, and
the average trade surplus?
We will use a data set from the International Monetary Fund on the size of
import and export by different countries in 2018.

1.2 Sketching the visualisation


Our simple scatterplot will plot each country as a separate point, and the posi-
tions will be determined by the total 2018 import and export.
Exercise 1: Imagine the arrangement
Think about what this visualisation will look like. Which variable will
you put on the horizontal position? Which variable will you put on the
vertical position? Does it matter?

Exercise 2: Imagine the pattern


What do you think the pattern of the data will be? If a country has a
higher value for import, will it have a higher or lower value for export?

Exercise 3: Sketch the visualisation
Using pen and paper, make a sketch of what you think the visualisation
will look like. Accuracy doesn’t matter for now, especially in a sketch.
Draw the pattern you expect, and add some informative labels.

1.3 Preparing the data


Exercise 4: Project setup
Set up RStudio for a new exercise. Create a new directory for this
project. This is always good practice. Set this directory as the working
directory in RStudio.
Get the CSV file from Blackboard, and save it to the directory.

Exercise 5: Load the data


Create a new R script, load the tidyverse, and import the data set.
Don’t forget to assign the read-in data to a variable.

Exercise 6
Explore the data set, using e.g., head(), glimpse(), distinct().
What data variables (i.e., columns) do you have? What defines the ob-
servations (rows)?

Each row is an individual observation, so the data is tidy. That is nice: it saves
time on cleaning the data. We also don’t have to think about any summary
statistics, as we want every observation (country) visualised.
There is one issue, however: the data is split per quarter, whereas we want
to visualise the total import and export for the whole year.
Exercise 7: Quarters and dimes
Create a column import, with the total import in 2018.
(Hint: which function from the exercises of yesterday can add a new
column to your data frame?)

Exercise 8
Also create a column export, with the total export in 2018.

Great, now we have import and export for the whole year! But... we still
have the columns for each quarter as well. These do not spark joy, so we want
to get rid of them.

Exercise 9: Marie Kondo the dataframe
Use the select() function to remove the unwanted columns. Use ?select
to see the documentation for this function.
Hint: deselecting the columns to remove is the same as selecting the
columns you want to keep...

Ah, much better! Now we have only the information that we want to work
with. Time to get some plotting done.
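
As a rough sketch of these two steps on a made-up data frame: the column names (import_q1 and so on) and the values below are assumptions for illustration only, and the course data set may well name its columns differently.

```r
library(dplyr)

# Hypothetical quarterly trade data; names and values are invented
trade <- tibble(
  region    = c("A", "B"),
  import_q1 = c(1, 5), import_q2 = c(2, 5),
  import_q3 = c(3, 5), import_q4 = c(4, 5)
)

trade_year <- trade %>%
  # add a column holding the yearly total
  mutate(import = import_q1 + import_q2 + import_q3 + import_q4) %>%
  # drop the quarterly columns; same result as select(region, import)
  select(-starts_with("import_q"))
```

Deselecting with a minus sign is handy when there are many columns to keep and only a few to drop.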

1.4 Plotting with ggplot2


ggplot2 is a very powerful plotting library that allows you to build highly
customisable plots. It is part of the Tidyverse, so it is already installed,
and it is loaded whenever you load the tidyverse.
In principle any ggplot plot consists of the same basic elements. For the
simplest visualisation, we need to specify:
1. what data to use
2. what geometric object to use
3. how to map the data onto the geom
The basic command to start a plot is ggplot(). You can run just that
function to see that R creates the space for a plot. To add visual elements to
a ggplot, we use geom_* functions, such as geom_point(). These elements are
added to a ggplot using the + operator. However, we still need to tell ggplot
what to use as the underlying data, and how to map that onto the geoms.
Specifying the data is simple: both ggplot() and the geom_*() functions
accept an argument data, specifying which R object to use as the underlying
data source. If specifying the data source within the ggplot() call, all other
elements of the plot will inherit this same dataset. If specifying the data source
within the geom function call, only that geom will use the specified dataset.
Exercise 10
Why would this distinction be useful? When would you want to specify
a dataset just for one geom in a plot, and another dataset for another
geom?

Specifying the mapping is also done in an argument, using the aes() func-
tion. This function creates a mapping from the dataset to the aesthetics of the
geom. You can + this mapping to the plot:
ggplot(data = dataframe) +
  aes(x = foo, y = bar) +
  geom_point()
Alternatively, you can include it as the mapping = named argument to the
geom function:
ggplot() +
  geom_point(data = df,
             mapping = aes(x = foo, y = bar))
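
To see the difference between the two places where data can be specified, here is a small sketch using mtcars, a data set that ships with R (it is not the course data). The cut-off of wt > 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Data given to ggplot() is inherited by every geom
p_inherit <- ggplot(data = mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()

# Data given to a geom is used by that geom only;
# here a second geom highlights a subset from a different data frame
heavy <- subset(mtcars, wt > 5)
p_layers <- p_inherit +
  geom_point(data = heavy, colour = "red", size = 3)
```

The second layer still inherits the x and y mapping from the plot; only its data source differs.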

1.5 Building our visualisation
That was a bunch of information. Time to start simple.
Exercise 11: Your first ggplot
Create a scatterplot of the trade data, where each point is an observation,
and the x and y coordinates are determined by the total 2018 import and
export.
Hint: look at the syntax of the examples above.

Exercise 12: Part of the pipeline


Rerun your script in a clean R instance, to make sure everything works
well.

1.6 Interpreting the visualisation


Exercise 13: Assess the results
Look at your plot. Describe the features of the plot. Can you think of an
explanation for the pattern that becomes clear? There’s no right answer
there - just your best guess, as the person analysing the data.

1.7 Prettifying our plot


1.7.1 Labeling
Admittedly, the plot isn’t very clear. If you were to show this figure to someone
else, they wouldn’t know what’s plotted, which year the data is from, etc. A
good visualisation also has some labeling. We can add titles and labels to our
plot using labs(), which we can + to a ggplot just like our geoms.
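
A minimal sketch of labs() in action, again on the built-in mtcars data rather than the trade data; the title and axis texts are just examples.

```r
library(ggplot2)

p <- ggplot(data = mtcars) +
  aes(x = wt, y = mpg) +
  geom_point() +
  # labs() takes one named argument per thing you want to label
  labs(title = "Heavier cars use more fuel",
       x = "Weight (1000 lbs)",
       y = "Fuel economy (miles per gallon)")
```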
Exercise 14: Labeling
Add a title, an x-axis label, and a y-axis label. Look at the help of labs()
to figure out how.

1.7.2 Unity line


Another visual aid we could add to our plot is an indicator of whether a country
had a trade surplus or a trade deficit. If a country exports exactly as much as it
imports, then its import is equal to its export – in our plot, this would
show up as a point with the same values for the x and y aesthetics. Thus, if we
draw the line y = x in our plot, any country that imports exactly as much as it
exports would end up on this line. The diagonal line y = x is often
referred to as the unity line.

Exercise 15: Reading our plot
Where, in relation to this line, would a country with a surplus show up?
If a country shows up below this unity line, does it have surplus, or a
deficit?
Does this depend on the mapping? Why (not)?

ggplot has three geoms for drawing straight lines:

• geom_hline draws horizontal lines; it takes an argument yintercept: the
  y-value where the line intercepts the vertical axis.
• geom_vline draws vertical lines; it takes an argument xintercept: the
  x-value where the line intercepts the horizontal axis.
• geom_abline draws any straight line; it takes two arguments: intercept,
  the y-value where the line intercepts the vertical axis, i.e., b in y = ax + b;
  and slope, the slope of the line, i.e., a in y = ax + b.
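
The three line geoms can be sketched together as follows, using the built-in mtcars data. The positions of the lines (20, 3, and y = -5x + 37) are arbitrary values picked for illustration.

```r
library(ggplot2)

p <- ggplot(data = mtcars) +
  aes(x = wt, y = mpg) +
  geom_point() +
  geom_hline(yintercept = 20) +             # horizontal line at y = 20
  geom_vline(xintercept = 3) +              # vertical line at x = 3
  geom_abline(intercept = 37, slope = -5)   # the line y = -5x + 37
```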

Exercise 16: Add unity line


Add (with +) to your plot a geom that forms the unity line y = x.

1.7.3 Datapoint labeling


Adding geom_*()s is fun, and the ease of it is one of the very powerful features
of ggplot. For instance, we can add a label to each datapoint that tells us which
country that point represents. geom_text() draws a text label using the x and
y aesthetics for position, and the label aesthetic to determine the text.
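
A small sketch of geom_text() on the built-in mtcars data. The column model is created here for the example; in the trade data you would map an existing column instead.

```r
library(ggplot2)

# Turn the row names of mtcars into a regular column
cars <- data.frame(model = rownames(mtcars), mtcars)

p <- ggplot(data = cars) +
  aes(x = wt, y = mpg, label = model) +
  geom_point() +
  # vjust nudges each label slightly above its point
  geom_text(vjust = -0.5)
```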
Exercise 17: What’s your name
Add text labels to the scatterplot. Remember, you will have to define a
mapping using aes() to map the column region to the label aesthetic.

Exercise 18: What’s your name 2


Which countries are the big outliers? Does that surprise you?

Exercise 19: Crowded House


With the text labels, this scatterplot is very crowded. Filter the data for
some countries that you are interested in comparing to each other, and
plot just those.

Great! That’s a much better looking visualisation, which you could show
someone else and have them understand what the data is and what the pattern
of import and export values is.

Chapter 2

Line plots - visualising air quality

Last week, you worked hard on a project in which the aim was to infer what
happened to air quality during the 2019-2020 forest fire season in Australia.
You have written a small report on it, even.
What was obviously missing, was a visual representation of your results.
Sometimes, a table just isn’t sufficient to illustrate magnitudes.
In this chapter, we will visualise the change in air quality over time, using a
line graph.

2.1 Dates in data


Exercise 20: Another version of air quality data
Download the latest air quality data set from the course Blackboard. Load
it into R, save it to a variable.
How is this data set different from the data you worked with earlier this
week? (Use the functions that you’ve seen for navigating data sets to
explore this data.)

One topic we have avoided so far, is the topic of dates. Dates are a tricky
component of analysis: a date is not quite a number and not quite a piece of
text. Arithmetic with dates is non-trivial: can you calculate how many days
there are between January 20th and the second Wednesday of March quickly?
Computers can, but only if you tell them to treat your dates as dates.
We won’t go into depth on this topic this week. Dates are tricky and
finicky, and there are packages dedicated to helping you work with them.
However, we will see one or two examples where dates are useful.
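
To make the date question above concrete, here is a sketch in base R. The text does not name a year, so 2021 is an assumption made purely for this example.

```r
# Assumption: both dates are in 2021
start <- as.Date("2021-01-20")

# All days in March 2021, then pick out the Wednesdays.
# format(x, "%u") gives the weekday as a number, with Monday = 1
march <- seq(as.Date("2021-03-01"), as.Date("2021-03-31"), by = "day")
wednesdays <- march[format(march, "%u") == "3"]
second_wednesday <- wednesdays[2]

# Subtracting two Dates gives the number of days between them
days_between <- as.numeric(second_wednesday - start)

# Stored as text, the same subtraction is an error:
# "2021-03-10" - "2021-01-20"
```

This only works because as.Date() tells R to treat the values as dates rather than as text.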
Exercise 21
Take another look at this (new) air quality data set. Specifically, use
glimpse() on it.
Looking at these results, note the text between <...>.

Figure 2.1: The result of glimpsing this data set.

One of the uses of glimpse() is to check whether R considers your columns text,
numbers, dates, or one of the many other things you don’t have to worry about
for now. This information is visible when you glimpse() your data, in between
the <...> signs.
The column Name has <chr> after it. chr stands for ‘character’, a computer
term for ‘text’ (analogous to Python’s string). Anything you provide to R
between quotation marks is seen as chr.
The column AQI_Site has <dbl> behind it. dbl stands for ‘double’, a
computer term for ‘number’ (analogous to Python’s float). R can do
calculations on things that are dbl.
The column Date has <Date> behind it. This makes sense, right? Date is
also a human term for ‘date’. R knows this is a date! One of the beautiful
things about ggplot is that it can automatically handle dates as dates. The power
of this might not be clear to you now, but if you ever run into datetimes after
this course, hopefully this short introduction has helped you understand what’s
going on.
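
If you want to experiment with column types without the course data, a tiny made-up table is enough. The column names below follow the text; the rows are invented.

```r
library(dplyr)

# Invented rows, using the column names from the text
air <- tibble(
  Name     = c("Station A", "Station B"),
  AQI_Site = c(42.5, 130),
  Date     = as.Date(c("2019-12-01", "2019-12-02"))
)

glimpse(air)                      # one line per column, with its type
col_types <- sapply(air, class)   # the same information as a named vector
```

Note that class() reports a dbl column as "numeric"; both names refer to the same kind of number.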

2.2 Line graphs


For this visualisation, we want to make a line graph. Unsurprisingly, there is a
geom that can draw lines: geom_line()!
It works like most other geoms. It needs (at least) two mappings: one to the
x-axis, and one to the y-axis.
Exercise 22: Data and lines
Looking at your data, which of your columns should map to the x-axis?
Which to the y-axis?

2.3 Air quality’s last appearance (we promise)


Exercise 23: Air quality visualised
Create a bare-bones line graph of this data, using geom_line().

That looks kinda cool! The careful observer, however, will notice something
amiss. Instead of a clear, simple line, ggplot has drawn a line that
is slightly wonky.
As it turns out, without specifying which data points belong to which line,
ggplot defaults to connecting all data points in one single line, in order of
their position on the x-axis. Your wonky line is the result of this data set
consisting of data from multiple stations, measured at the same time points.
Unless you tell ggplot to differentiate between stations, it can’t know that it
should, and it produces a graph that is very wrong.
To fix it, we need to make sure that ggplot knows which data point belongs
to which line. This sounds a lot like what we do with the aesthetic. After all,
we’re telling ggplot exactly how to interpret the connection between data and
visual. To tell ggplot which measurement of AQI_Site belongs to which group,
all we have to do is add an argument to the aes() layer:

ggplot(dataframe) +
  geom_*() +
  aes(x = variable_1, y = variable_2, group = variable_3)
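
The effect of group can be checked on a toy data set. The measurements below are invented; only the column names are taken from the text.

```r
library(ggplot2)

# Invented measurements: two stations, the same three dates
air <- data.frame(
  Name     = rep(c("Station A", "Station B"), each = 3),
  Date     = rep(as.Date("2019-12-01") + 0:2, times = 2),
  AQI_Site = c(40, 45, 50, 140, 150, 160)
)

# Without group: one zigzagging line through all six points
p_wonky <- ggplot(air) +
  geom_line() +
  aes(x = Date, y = AQI_Site)

# With group: one line per station
p_grouped <- ggplot(air) +
  geom_line() +
  aes(x = Date, y = AQI_Site, group = Name)
```

Printing p_wonky and p_grouped side by side shows the zigzag disappear.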

Exercise 24: That’s better (?)


Add a group argument to your line graph, to tell ggplot which data points
belong to which station (remember, the weather station is encoded in the
column Name).

Exercise 25: Colours are nice


Instead of telling ggplot which group data points belong to in the
aesthetic, you can also tell it to color the lines based on a variable.
Change your line graph, such that the two lines have different colours.

Pretty cool, right? Ggplot even immediately adds a legend for you, so that
you can tell the stations apart.
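
A sketch of the colour mapping on invented measurements (only the column names come from the text). A discrete variable mapped to colour also splits the data into groups, so a separate group mapping is not needed here. The legend title is set through labs(), using the same keyword as in aes().

```r
library(ggplot2)

# Invented measurements: two stations, the same three dates
air <- data.frame(
  Name     = rep(c("Station A", "Station B"), each = 3),
  Date     = rep(as.Date("2019-12-01") + 0:2, times = 2),
  AQI_Site = c(40, 45, 50, 140, 150, 160)
)

p <- ggplot(air) +
  geom_line() +
  aes(x = Date, y = AQI_Site, colour = Name) +
  labs(colour = "Weather station")   # title of the colour legend
```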
Exercise 26: A proper visualisation
Remember: any good visualisation has a title, and sensible labels. That
includes a sensible title, also for the legend.
To change the label for any mapping, you can always refer back to the
keywords you used in aes(). For instance, to change the label for your
colored line, you could do something like labs(color = "A title").
Add useful labels to your line graph, where necessary.

Exercise 27: Pretty huge effect.


Take a moment to look at your visualisation. Do you understand what
you’re looking at? What is the main message of this visualisation, accord-
ing to you?

2.3.1 (Bonus) Air quality’s coda
Exercise 28: Mean vs. median
Add two reference lines to your visualisation: one representing the mean
site AQI during the forest fire season of 2019-2020, and one representing
the median. Give them different colours.
Do you understand what you’re looking at?
Does the result surprise you? Why?

Chapter 3

More complicated: barplot

In this chapter we’ll learn about another common plot: a bar plot. In a bar plot,
the data is visualised as a set of bars, where each bar represents one or more
observations. The position of a bar on the horizontal axis indicates something
about which observation(s) are represented by that bar. The height of the bar,
on the vertical axis, represents some value of the observation(s) in that bar.
(This assumes a vertical bar plot, also known as a column-plot. You can also
create a horizontal bar plot, where this mapping is switched around).
For this chapter, we will continue with the pipeline of chapter 1. Hence,
make sure to keep your previous script at the ready.

3.1 Sketching the visualisation


Using the data of chapter 1, we will create a column plot, which contains the
mean trade differential (remember, ‘trade differential’ is the difference between
import and export), separately for the countries that have a deficit, and the
countries that have a surplus.
Exercise 29: Imagine the arrangement
Think about what this visualisation will look like. Which variable will
you put on the horizontal position? Which variable will you put on the
vertical position? Does it matter?

Exercise 30: Imagine the pattern


What do you think the pattern of the data will be? If a country has a
higher value for import, will it have a higher or lower value for export?

Exercise 31: Sketch the visualisation


Using pen and paper, make a sketch of what you think the visualisation
will look like. Accuracy doesn’t matter for now, especially in a sketch.
Draw the pattern you expect, and add some informative labels.

There should be two bars, and the trade differential determines the height
of each bar.

3.2 Preparing the data
In ggplot, we can use geom_col(). This requires mapping onto (at least) the
aesthetics x (position of the bar on the horizontal axis) and y (what variable to
map onto the vertical axis). So, we want to have a dataset that has one variable
indicating the group (surplus, or deficit), and one variable indicating the mean
trade differential.

3.2.1 Determining trade differentials


We are interested in how many countries have a trade deficit, and what the
average trade deficit is.
First, we will have to calculate the trade differential for each country. We
do this by subtracting the import from the export, for each country. If this
number is positive, it means the country exported more than they imported,
which means they had a trade surplus. If this number is negative, the country
had a trade deficit.
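
On a made-up data frame, the calculation looks roughly like this. The column name differential and all values are assumptions for illustration.

```r
library(dplyr)

# Invented yearly totals for three regions
trade <- tibble(
  region = c("A", "B", "C"),
  import = c(100, 50, 80),
  export = c(120, 30, 80)
)

trade <- trade %>%
  # positive = surplus, negative = deficit, zero = balanced
  mutate(differential = export - import)
```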
Exercise 32: Calculate trade differentials
Create a new column with the trade differential for each country.

That’s a lot of data, and what we want is a pattern, or a summary; to create
the bar plot we want a single table with two rows: one with the average deficit,
and another with the average surplus.
Exercise 33: Summarise trade differentials
Create a summary data frame, that averages the trade differentials.
Hint: which function could we use to create summaries again? Don’t
hesitate to look back in the materials, or to use the help function in R.

This is not the most exciting data frame... it’s just one column and one row.
But it’s good to know that we can do the basic summary before we make things
more complicated.

3.2.2 Group the data


We said we wanted to have this average differential separately for deficits and
surpluses. This requires telling summarise() to treat deficits and surpluses
separately. But, R does not know about deficits and surpluses – to R, all the
differentials are just numbers.
Instead, we have to tell it which observations are considered “deficits”, and
which are considered “surpluses”. We can do this by creating a new column, of
type “logical” (boolean), that we will call “deficit”. For each observation that
runs a deficit, the value in this column should be TRUE and for each observation
that runs a surplus, the value in this column should be FALSE. Remember that
a country has a deficit if
export < import

Exercise 34: Determining trade deficits
Create the new column that indicates whether a country has a deficit.

Now that we have this column that dichotomizes our dataset, we can use it
to tell summarise to treat deficits and surpluses separately.
Exercise 35: Group by deficit/surplus
Group the dataframe by the newly created column. Print the grouped
dataframe; how can you see how it is now grouped?

In our case, this should create two groups: one for when our new column is
TRUE (i.e., those observations with a deficit) and one for when our new column
is FALSE (i.e., those observations with a surplus).
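
The whole chain of logical column, grouping, and summarising can be sketched on invented numbers. The names differential and mean_differential are assumptions made for this example.

```r
library(dplyr)

# Invented yearly totals for four regions
trade <- tibble(
  region = c("A", "B", "C", "D"),
  import = c(100, 50, 80, 60),
  export = c(120, 30, 70, 90)
) %>%
  mutate(differential = export - import,
         deficit = export < import)   # TRUE for a deficit

summary_df <- trade %>%
  group_by(deficit) %>%
  # one row per group: FALSE (surplus) and TRUE (deficit)
  summarise(mean_differential = mean(differential))
```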
Exercise 36: Re-summarise
Make another summary, now on the grouped dataframe. Did it work?

3.2.3 Relabeling
Ugh, it’s kind of ugly though that this table has TRUE and FALSE in a column.
This is not very informative about what exactly is “true” or “false”. As a matter
of fact, that is kind of ugly in our big dataframe as well. Let’s clean that up
before we call this analysis done.
What would be better is if instead our column contained the text deficit
if a country has a trade deficit, and else surplus.
To operate on a row in a dataframe conditionally, we use the ifelse()
function. This takes three arguments: a logical expression to evaluate, what
to return if the expression evaluates TRUE, and what to return if the expression
evaluates FALSE. We can put this function inside a mutate() to create a new
column with the output:

mutate(dataframe, math_check = ifelse(5 > 3,
                                      "five is greater",
                                      "maths is broken"))

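
In the example above the condition is a constant, so every row gets the same text. In practice the condition refers to columns and is evaluated row by row. A sketch on invented numbers, where the column name balance is an assumption:

```r
library(dplyr)

# Invented yearly totals for two regions
trade <- tibble(
  region = c("A", "B"),
  import = c(100, 50),
  export = c(120, 30)
)

trade <- trade %>%
  # the test is checked for each row separately
  mutate(balance = ifelse(export < import, "deficit", "surplus"))
```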
Exercise 37: Relabel the trade deficits


Relabel the column indicating which observations have deficits. You
should do this before you summarise.
Hint: if you tell mutate to create a column with a name that already exists
in the dataframe, it will overwrite that already existing column.
Hint: You cannot override variables used for grouping. But you can un-
group the data frame, override the variable, and then group by the variable
again.

3.3 Building our visualisation
Exercise 38
Create a barchart, using the summary dataframe as the data source, using
geom_col(), mapping the deficit/surplus to the horizontal position, and
the mean trade differential to the vertical position.

Exercise 39: Labeling


Add a title, an x-axis label, and a y-axis label.

Okay, great! Hopefully, that wasn’t too difficult!

3.4 Statistical transformations


The grammar of graphics also includes the concept of statistical transformations,
which determine how a variable gets mapped onto an aesthetic. Most geoms use
the transformation “identity” by default, which takes the value of the variable,
and does nothing to it, before putting it on the axis; e.g., in our scatterplot,
the import value is not transformed before it is used to determine the x or y
position.
This is also what geom_col() does: whatever you specify as the variable
for the y aesthetic is not transformed at all. This is because geom_col() is
a special case of geom_bar(): this more general geom takes an argument stat,
which specifies the transformation for the y aesthetic. geom_col() uses
stat = "identity":

ggplot(data = summary_df) +
  aes(x = foo,
      y = bar) +
  geom_bar(stat = "identity")

is identical to

ggplot(data = summary_df) +
  aes(x = foo,
      y = bar) +
  geom_col()
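
If you want to convince yourself of this equivalence, one possible check is to compare what each plot actually draws, using layer_data() from ggplot2. The two-row summary_df below is invented for the check.

```r
library(ggplot2)

# Invented summary table with the placeholder names from the text
summary_df <- data.frame(foo = c("a", "b"), bar = c(15, 40))

p_bar <- ggplot(data = summary_df) +
  aes(x = foo, y = bar) +
  geom_bar(stat = "identity")

p_col <- ggplot(data = summary_df) +
  aes(x = foo, y = bar) +
  geom_col()

# Compare the tops of the bars as computed by ggplot
same_heights <- identical(layer_data(p_bar)$ymax, layer_data(p_col)$ymax)
```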

3.5 Summarising plots


Bar plots are interesting, because they can summarise. One bar can represent
more than one observation – in our case, one bar represents all the countries
with a deficit. Compare this to our scatterplot, where each point is representing
a single observation. Currently, we have done that summarising ourselves, but
it turns out that ggplot is smart enough to do some of that for us!
Instead of using the statistical transformation "identity", we can use
stat = "summary". This tells ggplot to essentially do a summarise under the hood,
for which we also have to specify the summary function, using the argument fun
(or fun.y in versions of ggplot2 older than 3.3.0):

ggplot(data = raw_df) +
  aes(x = foo,
      y = bar) +
  geom_bar(stat = "summary",
           fun = "mean")

This does not plot the raw values of bar on the y-axis, but first applies the
function mean, thus summarising across rows. For this, we need to specify the raw
dataframe as the data source.
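
As a quick sanity check, the bar heights that ggplot computes can be compared with a summary done by hand. The data frame below is invented, and the sketch assumes ggplot2 3.3.0 or newer, where the argument is called fun.

```r
library(ggplot2)

# Invented raw data: two observations per group
raw_df <- data.frame(
  foo = c("a", "a", "b", "b"),
  bar = c(10, 20, 30, 50)
)

p <- ggplot(data = raw_df) +
  aes(x = foo, y = bar) +
  geom_bar(stat = "summary", fun = "mean")

# Heights ggplot computed versus the means calculated by hand
ggplot_means <- layer_data(p)$ymax
manual_means <- as.numeric(tapply(raw_df$bar, raw_df$foo, mean))
```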
Exercise 40
Re-create the bar plot in your pipeline, but this time not from the sum-
mary dataframe, but from the full (201 rows) dataframe.

And there we go, now our visualisation is constructed directly from the
original data. This has the advantage that if something changes about the
data, we do not have to worry about updating our summary before updating
the barplot.

Done!
In these chapters you’ve practiced the basics of visualisation within the Gram-
mar of Graphics framework. We’ve dealt briefly with the data-layer, the idea of
mapping variables onto aesthetics, and even the idea of statistical transforma-
tions. We’ve worked with two geoms: point and bar.
You’ve also encountered some more iterative development: we’re continuing
to extend our original data analysis pipeline. We first created a barplot from
our summary table, and only then went back to create it from the full dataframe
itself. This is not necessarily the right order, but in this case, I thought it would
be conceptually easier to understand if we first do the summarising ourselves.
Sometimes iterative development is also for our own understanding.

Chapter 4

Good, bad, ugly, and wrong

4.1 Plotting data versus visualising results


Plotting data and visualising results are two entirely different concepts. You should
be plotting your data really often; it is something that can be done ‘quick and dirty’
and is essential to keep tabs on what’s going on with your data. Visualising
results, on the other hand, takes effort and thought: you need to think about
your story, and how to convey that story to your audience.

4.1.1 Ugly, wrong, and bad graphs


Defining what constitutes a ‘good’ visualisation is (nigh) impossible. The next
best thing, is to avoid creating bad, wrong, or ugly graphs.
These very helpful concepts were introduced by Claus Wilke in his book on data
visualisation (Wilke, 2019). This is a great reference book that we strongly
recommend browsing through.
The concepts can help you recognise good visualisations, and create better
visualisations yourself. At the end of these chapters, we hope that you have a
sense of how to avoid making ugly, wrong, or bad visualisations.

4.2 Recognising errors in a bad graph
Exercise 41: The good, the bad, the wrong, and the ugly
The following pages all contain a visualisation. In groups of two or three,
discuss all of these visualisations. For each graph, using your knowledge
about visualisations and bad/wrong/ugly visualisations, you should:
• check if you understand what it is trying to communicate. Then,
• decide whether you think this is a good, bad, ugly, or wrong graph
  (or several of those), and
• describe why you think that is the case.

For some of the more complicated visualisations, it can be beneficial to
use a theoretical framework to support your arguments. For this, we
recommend skimming through Part II (Chapters 17 - 26) of Wilke’s book
(Wilke, 2019).
(Bonus)
• What kind of data would you need to create this visualisation?
• How would you fix this graph?

Figure 4.1: Lunch time in the office

Figure 4.2: Speed versus fuel usage in a specific Toyota.

Figure 4.3: Gun deaths in Florida before and after implementing the ‘Stand your
ground’ law. ‘Stand your ground’ refers to legislation implemented in several
states that allows the defending of property using firearms.

Figure 4.4: Popularity of different genres over time. The text above the graphic
explains what you’re looking at: “This graphic shows film genre popularity
over time, represented as the percentage of all films released that year with the
specified genre tagged on IMDB.”

Figure 4.5: Change of ‘artistic standards’ over time

Figure 4.6: How speed of voting and voting margins in the US 2020 elections
relate. Source.

Figure 4.7: Sleeping pattern of a newborn. For explanation, see here.

Figure 4.8: Incarceration in Real Numbers. This image is only a small part of
the whole, and we want you to look at the full visualisation. You can find the
full visualisation here.

Chapter 5

Creating better visualisations

5.1 Avoiding bad, wrong, and ugly graphs


In the previous chapter you learnt how to describe bad, ugly, and wrong
visualisations. Now that you’re a pro at what not to do, it’s time to learn how
to avoid making bad visualisations.
This section will outline some theory on how to avoid bad, wrong or ugly
graphs.

5.1.1 Common errors in designing graphs


Designing a good visualisation is not a science. Regardless, there are some simple
principles that should help you deliver visualisations that are not ugly, wrong,
or bad.

Avoiding ugly graphs Ugly graphs are easy to make, and seemingly easy
to avoid. Most ugly graphs are the result of over-designing. A good way to
avoid accidental ugly graphs is by keeping colours, text, and lines as simple and
plain as possible. When using plotting software, take extra care that colours
and fonts are not ugly. Avoid 3D graphs.

Avoiding wrong graphs Creating a wrong graph is a clear no-go. Wrong
graphs are created when hurrying to a finished end product, or when you
misunderstand your data.
The best way to avoid creating wrong graphics is by always double checking
all numbers of your graph. Check whether your data parsing is accurate. Do
a sense check on the graph itself: do the numbers make sense in the context of
your research? Compare the numbers on your visualisation with the numbers
in your data.

Avoiding bad graphs Everyone falls into the trap of making a bad graph at
some point. This is the hardest type of mistake to avoid.

By far the most important rule of visualisation: simpler is usually better.
Some other rules of thumb to avoid bad graphs are:

• Don’t plot data using different axes in one visualisation,

• Only use colours to add information to a graph,

• Clearly label your graphs: title and axes,

• Geoms are often suitable for either continuous data (e.g. line graph) or
discrete data (e.g. bar plot). Respect that difference,

• Avoid breaking your axes to save white space,

• Avoid 3D graphs, unless they increase clarity.

Chapter 6

Pick-and-choose your visual
adventure (Bonus)

Visualisation is a useful tool in data analysis. It can help illuminate
otherwise invisible patterns, condense into a single image information that
would take a large amount of text to convey, and aid understanding in
research that uses data.
Additionally, visualisation can be very fun. Visualisations can be a burst of
colours in an otherwise challenging technical environment.
This chapter will show you some of the cool possibilities at your fingertips
once you get the hang of working with R. There are many other ways of
visualising data than using ggplot: Leaflet, plotly, and many others. A
surprisingly large number of R visualisation packages use the basic principles
of ggplot: any visualisation is a stack of layers, each adapting a different
component of the relation between the data and the visualisation.

6.1 Another ggplot visualisation


In case you’d like to work with ggplot some more, here’s another exercise.

COVID-19 and the job market In October 2020, The Economist published an
article discussing the effects of COVID-19 on the American job market. This is
one excerpt of that article.

“[One implication of the changes in the American labour market due
to COVID-19] is a period of higher inequality. Recessions are usually
worse for the poor and unskilled than for others, but the pandemic
has been bad for them even accounting for the severity of the hit to
the labour market, according to a working paper by Ippei Shibata
of the IMF. Job losses have been heavy among service workers (who
are more likely to be young, female and black) whose employment
depends on the spending of high-earning professionals.
Data from Opportunity Insights, a team of researchers at Harvard
University, reveal that by the end of July there were 2% fewer jobs in
America paying more than $60,000 a year than in January. But jobs
paying under $27,000 were 16% scarcer. Those who feed, transport,
clothe and entertain people who are out-and-about account for about
a quarter of American employment, note David Autor and Elisabeth
Reynolds of MIT. The large number of low-paid service jobs is often
lamented, but “having too few low-wage, economically insecure jobs
is actually worse than having too many”.” (‘Zoom and gloom’, 2020)

In this set of exercises, you will design a visualisation that could be printed
next to this paragraph to support the conclusions drawn in this article.
Exercise 42: The visualisation
• What story should your graph tell?

• Can you identify what type of geom_*() you might want to use? Can
you specify any other necessary elements you’d use from ggplot?

• Sketch approximately what the graph would look like. (Actually do
the sketching - find yourself some paper, or use Paint.)

Exercise 43: The data


• What kind of data would you need to make this graph?

• List the column names of the data.

If you write code that’s not meant to be run, but instead meant to convey
an idea or concept, it is referred to as pseudocode. It can be very useful to write
pseudocode, especially when working on more complex projects. By writing the
pseudocode first, you can ignore syntactical challenges, and focus on the broader
concept first. Once you’ve worked out what is supposed to happen where, you
can then fill in the details without worrying about the whole.
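To give a feel for the idea, here is a small sketch of pseudocode followed by the real code it turns into. It deliberately uses R's built-in mtcars data rather than the job market data, so the plan and the column names are only an illustration, not a solution to the exercises.

```r
library(ggplot2)

# Pseudocode (written first, not meant to run):
#   take the cars data
#   put weight on x, fuel efficiency on y
#   draw one point per car, coloured by number of cylinders
#   add readable axis and legend labels

# The same plan, filled in with real syntax:
plan_plot <- ggplot(mtcars) +
  aes(x = wt, y = mpg, colour = factor(cyl)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(plan_plot)
```

Notice that every line of the pseudocode ends up as roughly one layer in the final plot.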
Exercise 44: The pseudotechnical nitty-gritty (Bonus)
Write in ggplot-esque pseudocode how you would put this visualisation
together.

6.1.1 Creating your graph


On Blackboard, you will find the data that The Economist used to create
the visualisation they printed with this article. The data is a little convoluted:
it contains the relative change of employment across different income brackets,
compared with a date in early January. ‘Relative’ change means that the values
represent the change in employment as a proportion of the level on that
baseline date, rather than as an absolute number of jobs.
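If the idea of a relative change feels abstract, the following sketch works through the arithmetic. The numbers and column names are made up for illustration and are not taken from the course data.

```r
# Hypothetical employment counts (in thousands) for one income bracket
employment <- data.frame(
  date = as.Date(c("2020-01-15", "2020-04-15", "2020-07-15")),
  jobs = c(1000, 650, 840)
)

# Baseline: the first measurement (mid-January in this made-up example)
baseline <- employment$jobs[1]

# Relative change: difference from the baseline, divided by the baseline
employment$rel_change <- (employment$jobs - baseline) / baseline

print(employment)
```

A value of -0.35 here means that employment is 35% below the January baseline, which is the kind of number shown on the y-axis of a "% change since January" plot.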
Exercise 45
Create a line plot using this data, that represents the story that The
Economist wrote. (If you want a hint of what it should look like, a version
is appended at the very bottom of this chapter. Ideally, though, try to
come up with a graph without looking at the example.)

6.1.2 R4DS: ggplot’s possibilities
Many components of the tidyverse were written or inspired by the same person:
Hadley Wickham. Amongst other meaningful contributions to the R universe,
Wickham co-wrote R for Data Science (Grolemund & Wickham, 2017). This book
contains a wealth of useful information, and is both a great reference and a
great teacher of important tidyverse concepts.
Chapter 3 is a chapter on using ggplot. Work through the chapter.

6.1.3 Extra extra ggplot


If you’ve finished all of this, take a moment to solidify your theoretical un-
derpinning of ggplot and have a read through the original paper (Wickham,
2010).

6.2 Leaflet for R


If you often work with geographical data, Leaflet is an invaluable tool. Leaflet
is a library originally written for JavaScript, a language very good at building
interactive web interfaces. However, it has also been ported to R, which means
you can write Leaflet code in the familiar tidyverse dialect.
(Check out Leaflet for R documentation.)
Leaflet creates beautifully interactive maps. For instance, check out this
example. Leaflet can be super tricky, but it doesn’t have to be.
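As a taste of what Leaflet code looks like, here is a minimal sketch that puts a single marker on a map. It assumes the leaflet package is installed; the coordinates and popup text are illustrative and not part of the exercises below.

```r
library(leaflet)

# Illustrative coordinates (roughly central Tokyo); replace with your own data
map <- leaflet() %>%
  addTiles() %>%   # default OpenStreetMap background tiles
  addMarkers(lng = 139.69, lat = 35.69, popup = "Tokyo")

map  # printing the object shows the interactive map in the RStudio viewer
```

Just like in ggplot, the map is built up in layers: first the background tiles, then the markers on top.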
Exercise 46
Using this tutorial, make an interactive map in leaflet displaying earth-
quakes in Japan.

Exercise 47
Using any other data you find interesting, make an interactive map using
Leaflet for R.
A good source for data is Kaggle.

6.3 Plotly
Plotly can make graphs similar to ggplot’s, and make them interactive. Personally
I prefer the way ggplot builds its visualisations, so I hardly ever build plots in
plotly directly. I do quite like using plotly for one thing, though: it can make
my ggplot plots interactive!
Exercise 48
Taking any of the graphs you’ve made this morning, create an interactive
ggplot visual.
To do so, first install plotly (install.packages('plotly')).
Run the code to create the original ggplot.
Then run: plotly::ggplotly().
Tada!
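The steps above might look roughly like the sketch below. It assumes plotly is already installed and uses the built-in mtcars data as a stand-in for one of your own graphs; the variable names are just for illustration.

```r
library(ggplot2)

# Any ggplot will do; mtcars is a built-in stand-in data set
static_plot <- ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()

# Passing the plot explicitly avoids relying on "the last plot created"
interactive_plot <- plotly::ggplotly(static_plot)

interactive_plot  # shows the interactive version in the RStudio viewer
```

Calling plotly::ggplotly() without arguments converts the last plot you created, which is why the exercise asks you to run the original ggplot code first.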

6.4 GGanimate
In addition to interactive visualisations, you can make animated visualisations.
Animated visualisations are often really fun to look at, but the cases in which
they convey extra information are quite niche. However, there are some cases where
adding animation is preferred over other options (such as faceting). And
considering this is the section of ‘fun with visualisations’, why not see if you
can get it to work?
NB: creating animated pictures often requires installing software other than
R and RStudio. This doesn’t have to be difficult, but can be a bit of a pain. So
only do this if you have plenty of time, and are comfortable working with your
computer!
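To show the general shape of gganimate code, here is a minimal sketch that only builds the animation object. It assumes the gganimate package is installed, and the data is made up for illustration; it is not the example the exercise below refers to.

```r
library(ggplot2)
library(gganimate)

# Made-up daily measurements, for illustration only
measurements <- data.frame(
  day   = 1:10,
  value = c(3, 4, 6, 5, 7, 8, 7, 9, 11, 10)
)

animation <- ggplot(measurements) +
  aes(x = day, y = value) +
  geom_line() +
  transition_reveal(day)   # draw the line progressively along `day`

# Printing `animation` (or calling animate(animation)) renders it; that step
# needs a renderer, such as the gifski package, to be installed.
```

The animation is simply one more layer on top of a regular ggplot, which is why it fits so naturally into what you already know.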
Exercise 49: Animations are fun
Reproduce the ‘Yet another example’ example on gganimate

Figure 6.1: A version of the visualisation included in The Economist’s ‘Zoom
and gloom’ (‘Zoom and gloom’, 2020).
[Line plot. Title: ‘Separate and unequal’. Subtitle: ‘United States, employment
rate by salary, % change since January 2020’. y-axis: −40.0% to 10.0%; x-axis:
Jan, Apr, Jul, Oct. Two lines: high-wage workers (over $65,000/year) and
low-wage workers (under $27,000/year).]

Chapter 7

Visualisation exercise:
recreating The Economist’s
Bremorse visualisation

In this exercise, you will focus on two different skills:


• (Re)create a visualisation using ggplot,

• Evaluate what makes a good visualisation.

On September 14th, 2018, The Economist posted an article discussing British
national politics. In it, they included an interesting data visualisation, titled
‘Bremorse’. We’ll try to recreate this visualisation.

7.1 Recreating a Bremorse visualisation


7.1.1 Data
You can find the data for this exercise on the course Blackboard page. The data
is collected by NatCen Social Research, a British organisation collecting data
on social topics.
At the end of this exercise, your graph should look something like Figure 7.2.
Clearly this is not the same graph as the original graph, but it contains all the
important elements.
You should try to recreate a visualisation that is as close as possible to this
one.
You should try to not use more than one layer of geom_*() in this
graph (so don’t draw the two lines separately.)
The rest of this section will outline some useful concepts that should help
you get to your final product. This text is not intended to stand alone: rather,
you should always consult documentation if you’re exploring a new function. If
you feel confident, you are welcome to ignore the help and go about it yourself.

Figure 7.1: The Economist’s Bremorse visualisation. You can find the original
article here.

Figure 7.2: A first attempt at recreating The Economist’s Bremorse graph.

7.1.2 The basics


Take a moment to look at what you want to build, and dissect it into its different
components. Do you recognise which geom_*() is used? Can you understand
the mappings? Are there other elements that surprise you, considering the
data you have? Answering these questions before you start will help you
break down the building of this visualisation into manageable chunks.
Load in your data and convince yourself that it loaded in properly. Consider
whether you already know that you need to transform your data at all, and if
yes, do those transformations.
You have built several ggplot graphs before, so get going with the basic
ggplot skeleton you know. Add any mappings you already understand, and
any other elements you know how to build.
The following sections will help you out with some components of this visu-
alisation. Feel free to use them as and when you please.

7.1.3 Line graphs
Remember, the geom that can draw lines is geom_line().
It works like most other geoms. It needs (at least) two mappings: one to the
x-axis, and one to the y-axis.
Additionally, it can map a variable onto the colour aesthetic. You will need this
functionality for this assignment if you want to use only one geom layer. Use
the documentation to figure out how this works.
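As a rough illustration of the idea, the sketch below draws two lines with a single geom_line() layer. The data is made up and the column names are hypothetical; the point is that the data is in long format, with one row per date and group, so that the group column can be mapped to colour.

```r
library(ggplot2)

# Made-up long-format data: one row per date and group
answers <- data.frame(
  date  = rep(as.Date(c("2018-01-01", "2018-04-01", "2018-07-01")), times = 2),
  group = rep(c("A", "B"), each = 3),
  share = c(45, 44, 42, 42, 44, 46)
)

# A single geom_line() layer; the colour mapping splits it into one line per group
line_plot <- ggplot(answers) +
  aes(x = date, y = share, colour = group) +
  geom_line()

print(line_plot)
```

If your own data has one column per line instead, you would probably need to reshape it first before this approach works.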

7.1.4 Labels
Labels (titles, axis names, legend names) are amongst the most essential com-
ponents of a graph to transfer information. Succinct, short, clear labels will
improve readability immensely.
Like everything else in ggplot, there is a myriad of ways to add labels to a
graph. One option is through the labs() layer.

Changing the title of a legend To change the title of a legend, you can
rename it in the labs() layer by referring to the name of the aesthetic it
belongs to (for example, colour).

Removing a label To remove a label, you can set it in labs() to NULL.

Subtitle You should know by now that you can add a title in labs(). You
can add a subtitle in the exact same way.
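The three options above can be combined in one labs() call. The sketch below uses the built-in mtcars data, and all label texts are just examples.

```r
library(ggplot2)

labelled <- ggplot(mtcars) +
  aes(x = wt, y = mpg, colour = factor(cyl)) +
  geom_point() +
  labs(
    title    = "Heavier cars use more fuel",
    subtitle = "32 cars from the built-in mtcars data",
    x        = NULL,              # NULL removes the x-axis label
    y        = "Miles per gallon",
    colour   = "Cylinders"        # legend title, named after the colour aesthetic
  )

print(labelled)
```

Because the legend belongs to the colour aesthetic, its title is set with the argument colour.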

7.1.5 Colours and themes


Changing the layout of a graph is not difficult, but it can be a lot of work. Luck-
ily, (other) smart people have helped us out, by writing functions (theme_*()s)
that immediately update a bunch of different components of a layout.
Add the layer theme_light() to your graph.

7.1.6 Scales and labels


As you have seen by now, ggplot is super smart. It automatically adapts the
limits (minimum and maximum on the graph), breaks (the place where a label
gets put) and labels (the actual text that’s put next to the axis) to your data.
However, in some cases you want to have more control over these and other
components on your axes.
Any mapping in ggplot is from some kind of variable (usually a column in
your data) to a scale. In this case, we have mapped the date to the x-axis (a
scale). The details of those mappings can be tweaked with a set of functions:
scale_[mapping]_[flavour]()

This is a little abstract, so let’s check out an example. Check out the docu-
mentation of
scale_y_continuous()

Scrolling down, you can see all the different components of the y-axis scale that
you can tweak using these functions. For instance, you can tweak the limit of
a plot. Try adding the following layer to your graph, and see what happens:
scale_y_continuous(limits=c(NA, 100)).

Labels on axes
Using scale_*_*() functions, you can change the labels on your scales. The
labels parameter can take a vector (c()) of pieces of text to explicitly define
the text on the labels.
While this can be useful, can you imagine a reason why you should refrain
from hard-coding your labels, if at all possible?
Instead, labels also accepts an instruction for a transformation. This is
super powerful, and best practice. Those instructions should come in the shape
of a function. For instance, try and add labels = scales::label_percent()
in your scale_y_continuous(). Does it do what you expected?
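If the result surprises you, it may help to look at what the labelling function does on its own. The sketch below assumes the scales package is installed (it comes along with ggplot2) and uses made-up proportions; the variable names are hypothetical.

```r
library(ggplot2)

# label_percent() multiplies by 100 by default, so it expects proportions
as_proportion <- scales::label_percent()(c(0.1, 0.25, 0.5))

# For values that are already percentages, set scale = 1
as_percentage <- scales::label_percent(scale = 1)(c(10, 25, 50))

# Made-up proportions, plotted with an upper limit of 1 and percent labels
shares <- data.frame(year = 2016:2019, share = c(0.42, 0.44, 0.45, 0.47))

percent_plot <- ggplot(shares) +
  aes(x = year, y = share) +
  geom_line() +
  scale_y_continuous(limits = c(NA, 1), labels = scales::label_percent())

print(percent_plot)
```

Both as_proportion and as_percentage contain the same labels, which shows that the right setting depends on whether your column holds proportions (0.42) or percentages (42).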

7.1.7 Saving a figure


While not necessary for this assignment, you might want to know how to export
a figure to your computer. There is a simple function for that: ggsave(). Per
default, ggsave() will save the last plot that you created in R, in a file with the
name that you put into the function. For example, this would produce a basic
lineplot, and save it as a pdf image in your working directory:
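library(ggplot2)

# `airquality` is the course data set, with the columns Date and AQI_Site
ggplot(airquality) +
  aes(x = Date, y = AQI_Site) +
  geom_line()

# Save the last plot as a pdf in the working directory
ggsave('airquality.pdf')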
If you don’t specify the width and height of the plot in ggsave(), R will default
to the size of the box that contains your visualisation in RStudio.
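To get the same figure size every time, you can pass the size and the plot explicitly. This is a small sketch using the built-in mtcars data; the file name is hypothetical and the file is written to a temporary directory so that it does not clutter your project.

```r
library(ggplot2)

size_plot <- ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()

# Hypothetical file name, written to a temporary directory for this example
out_file <- file.path(tempdir(), "example_plot.pdf")

# width and height are in inches unless `units` says otherwise
ggsave(out_file, plot = size_plot, width = 6, height = 4)
```

Naming the plot in the plot argument also means you no longer depend on which plot happened to be created last.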

7.2 Assignment pt. 2: Good visualisations
The Bremorse visualisation you have recreated is only one of two versions The
Economist published. Compare the two visualisations and decide which you
think is better.

Figure 7.3: Bremorse: two versions. Left a line plot, right a scatter plot.

Check out the two versions of this visualisation in Figure 7.3.


The Bremorse graph is one of the graphs featured in The Economist’s article
‘Mistakes, We’ve Drawn a Few’ (Leo, 2019). Once you’ve thought about which
graph you think is best (and maybe have discussed it with a classmate) take a
moment to read through the article.
Remember: whether a visualisation is good or bad is not a science. There
is (usually) no one correct answer. So if you disagree with The Economist’s
answer that doesn’t mean that you drew the wrong conclusion.

In my opinion, we can only judge how effectively a chart communicates in light of
its purpose. If the purpose of the visualisation is to show how erratic the
decision making was, then the chart on the left is the right fit. However, if the
purpose is to show the dramatic change in the response to this question over
time, then I think the chart on the right-hand side communicates that best.

7.3 (Bonus) Recreating a visualisation, cont.
What you’ve created so far captures the basic outline of the original visualisation
but doesn’t really look like it. This section outlines some (aesthetic) options
with which you would be able to almost precisely recreate the original.

7.3.1 Scales pt. II: colours


Before reading this part, read subsection 7.1.6.
Colours, like axes, are just scales: a way of mapping data to an aesthetic.
Hence, they can be changed accordingly.
For any other assignments, check out scale_colour_viridis_*(), or scale_colour_brewer().
For this assignment, though, we want to manually update the colours. After
all, we’re rebuilding a graph, so we want to be true to the original.
To do that, we will use scale_colour_manual(). This function takes the
parameter values, which you have to give what is referred to as a named vector.
Next week you will learn more about what that means. For now, all you have
to know is that this is the syntax:
scale_colour_manual(
  values = c(
    "wrong" = rgb(16, 109, 160, maxColorValue = 255),
    "right" = rgb(220, 112, 111, maxColorValue = 255)
  )
)
rgb() defines the colours. "wrong" and "right" tell R which colour to map
to which value.

7.3.2 Text on a graph


Adding text on a graph can be done in many ways. You can use a geom_*() to do
it (any text geoms are excluded from our one-geom-only rule), or annotate().
See what works for you!
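As one possible starting point, here is a sketch of the annotate() route on the built-in mtcars data. The position and the label text are arbitrary examples.

```r
library(ggplot2)

annotated <- ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point() +
  # x and y are given in data units; the label text is just an example
  annotate("text", x = 4.5, y = 30, label = "Lighter cars,\nbetter mileage")

print(annotated)
```

annotate() is handy for a one-off piece of text, because it does not need a data frame of its own.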

7.3.3 More themes


All colours that are not related to the data itself can be tweaked through
themes. If you are very passionate about colours on your graph (as I am, when
I have to deliver a graph) you can dive into the theme() layer. Somewhere
among its many parameters, there are instructions on how to change the
background colour of this graph.
NB: add theme_economist() (from the ggthemes package) before you tweak the
theme() layer! Any complete theme you apply replaces the current theme, so
adding it afterwards would nullify all your changes.
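The effect of the order can be seen in the sketch below. It uses theme_light() as a stand-in for theme_economist(), so that it runs without extra packages, and an arbitrary grey as the background colour.

```r
library(ggplot2)

base_plot <- ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()

# Complete theme first, then the tweak: the grey background survives
kept <- base_plot +
  theme_light() +
  theme(plot.background = element_rect(fill = "grey90"))

# Tweak first, then the complete theme: the tweak is overwritten
lost <- base_plot +
  theme(plot.background = element_rect(fill = "grey90")) +
  theme_light()
```

Printing kept and lost one after the other should make the difference in background colour visible.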

7.3.4 Scales pt. III: Dates


We have mostly avoided working with dates in this course so far, because dates
can be pretty finicky. If you feel up for it, see if you can change the labels on
the x-axis to match the style of the original graph.
(Hint: use scale_x_date())
Beware: this can be quite tricky!
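To get started, this sketch shows the two arguments of scale_x_date() that usually matter most. The data is made up, and the break spacing and label format are examples rather than the settings of the original graph.

```r
library(ggplot2)

# Made-up monthly values, for illustration only
monthly <- data.frame(
  date  = seq(as.Date("2017-01-01"), by = "month", length.out = 18),
  value = seq(40, 48.5, by = 0.5)
)

date_plot <- ggplot(monthly) +
  aes(x = date, y = value) +
  geom_line() +
  # one label every six months, formatted as abbreviated month plus year
  scale_x_date(date_breaks = "6 months", date_labels = "%b %Y")

print(date_plot)
```

The codes in date_labels follow the conventions of strftime(), so its help page lists the formats you can use. This only works if the column really is of class Date.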

Bibliography

Grolemund, G. & Wickham, H. (2017). R for data science. O’Reilly.
http://r4ds.had.co.nz/
Leo, S. (2019). Mistakes, we’ve drawn a few.
https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368
Wickham, H. (2010). A layered grammar of graphics. Journal of Computational
and Graphical Statistics, 19 (1), 3–28. https://doi.org/10.1198/jcgs.2009.07098
Wilke, C. O. (2019). Fundamentals of data visualization: A primer on making
informative and compelling figures. O’Reilly Media.
Zoom and gloom. (2020).
https://www.economist.com/special-report/2020/10/08/zoom-and-gloom
