Lecture #3

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 67

GET1030

Computers and the humanities

Lecture 3
Visualizing data

Dr Miguel Escobar Varela 1


GET1030

Learning Objectives

1. To describe the main types of visualizations


used today
2. To identify different approaches to data
visualization
3. To identify the potential for bias in
visualizations
4. To offer critical perspectives on data
visualization
Lecture 3
Visualizing data

Part 1: Most common scientific data visualizations today

To describe the main types of


visualizations used today
3
GET1030

What is a visualization?

Representing numerical and categorical data with graphical elements (color, shape, position, size)

What is it useful for?

- To give an overview of the data


- As a first step for further research

4
GET1030

Charts in this session

Boxplots
Barplots
Lineplots
Scatterplots
Histograms
KDE plots
Violinplots
Joint plots

5
GET1030

Boxplot

Description of a univariate distribution (one variable)


For this toy example: number of actors required for a theatre play

1st quartile =

Median =

3rd quartile =

Minimum =

Maximum =

6
GET1030

Boxplot (Outliers)

1.5 x IQR
Calculating outliers using Tukey’s rule
IQR = 75th percentile - 25th percentile

1st quartile = 8.5

Median = 10

3rd quartile = 11.5

Minimum = 3

Maximum = 17

7
GET1030

Categories in boxplots

How many variables are


represented in this graph?

variables
1. days of the week
2. bill size

categories
1. smoke or not

8
GET1030

Standard deviation in bar charts

Measure of the dispersion of the data


Square root of the variance
Variance is the average of the squared differences from the
mean

Mean =

Variance =

Standard Deviation =

9
GET1030

Standard deviation in lineplots

Consider this example of


actors in plays over time

10
GET1030

Scatterplot

Shows the numerical relationship


between two variables. Color can
be used to indicate categorical
variables.

11
GET1030

Scatterplot

Shows the relationship between two variables.

12
GET1030

Scatterplot

Shows the numerical relationship between two variables. Color can be use categorical variables.

13
GET1030

Histogram

Each bar groups numbers into


ranges. Taller bars show that more
data falls in that range. It displays
the shape and spread of the data.

14
GET1030

Histogram

Consider this example of stars given to films in a group of reviews.

15
GET1030

Histogram (relative frequency)

Consider this example of stars given to films in a group of reviews.

16
GET1030

Histogram (relative frequency)

Consider this example of stars given to films in a group of reviews.

17
GET1030

Histogram (bin size)

Consider this example of stars given to films in a group of reviews.

18
GET1030

Histogram

https://tinlizzie.org/histograms/
19
GET1030

Kernel Density Estimation (KDE) plots

Closely related to histograms. They show a


smoothed representation of the data
distribution.

20
GET1030

Violinplot

Another visual representation of


the 5-number summary but also
represents the distribution of
values, in a way similar to a
KDE.

21
GET1030

Jointplots

Combined scatterplot with


regression line, confidence
interval, histogram and KDE.

22
GET1030

Pairplots

Scatterplots and KDE of


multiple variables for the
same samples.

23
Lecture 3
Visualizing data

Part 2: Different approaches to Data Visualization

24
GET1030

William Playfair (1759-1823)

From The Commercial and Political Atlas; Representing, by Means of Stained Copper-Plate
Charts, the Exports, Imports, and General Trade of England, at a Single View (1785)
25
GET1030

John Snow (1813-1858)

John Snow’s map of cholera outbreaks and wells (1854)


Digital version by Robin Wilson (2013)
http://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/

26
GET1030

Florence Nightingale (1820-1910)

‘Coxcom’ (polar area graph) visualization of the cause of death in the army (1850s
27
Crimea War)
*preventable causes, wounds, accidents
GET1030

Charles Joseph Minard (1781-1870)

Napoleon’s March to Moscow (published 1869)


Beginning at the Polish-Russian border, the thick band shows the size of the army at each position. 28
The path of Napoleon's retreat from Moscow is depicted by the dark lower band, which is tied to
temperature and time scales.
GET1030

The scatterplot

First graphic representation of interquartile range boxes.


*This is a reconstruction based on
data from Herschell’s 1833 paper, “On
the Investigation of the Orbits of
Revolving Double Stars” by Friendly
and Denis (2005).

Sir John Frederick William


Herschel (1972-1871)

29
GET1030

The scatterplot

Francis Galton’s (1986)


smoothed correlation
diagram for the data on
heights of parents and
children, showing one ellipse
of equal frequency.

Francis Galton (1822-1911)

30
GET1030

The ‘Infoviz’ Approach

Statisticians often disagree with a type of visualization aesthetic common in the news, which was most
influential developed by Nigel Holmes, while working for TIME magazine in the 1970s.

*as we’ll see later, this is the kind of graph that Tufte would classify as ‘chartjunk’

31
GET1030

Additional resources

https://exhibits.stanford.edu/dataviz 32
GET1030

Exploratory data analysis (EDA)

Theorized by John Tukey (1915-2000).


Visualizations are often central to EDA.
Often includes statistical information (error bars, confidence intervals, standard deviation)
Many different visualizations with shared axes.

33
GET1030

Modern scientific approach to dataviz

Edward Tufte:
Against “chartjunk”
Popularized sparklines.
Famous for several concepts:
lie factor, the data-ink ratio (against decoration), small multiples and
the data density of a graphic.

34
GET1030

What’s the purpose?

To grab attention?
To show trends?
To help scientists analyze data?

35
Lecture 3
Visualizing data

Part 3: Sources of bias in data visualization

36
GET1030

Axis Cropping

Kendall Fortney, “5 Ways Data Visualizations can Lie”, Towards data science,
https://towardsdatascience.com/5-ways-data-visualizations-can-lie-46e54f41de37

37
GET1030

Axis Scalling

Ravi Parikh, ‘How to lie with data visualization’,


Gizmodo, 2014,
https://gizmodo.com/how-to-lie-with-data-visualization-15
63576606

38
GET1030

Axis Scaling

Kendall Fortney, “5 Ways Data Visualizations can Lie”, Towards data science,
https://towardsdatascience.com/5-ways-data-visualizations-can-lie-46e54f41de37 39
GET1030

The problem of pie charts

The green slice is actually equal to a quarter


of the yellow one, and pink is a third of the
value of the purple slice

Maryland is bigger than the others (by 3%). The 3D effect visually
adds more volume to NH, tricking your eyes. Without labels telling
the percentages there would be little to no chance of accurately
guessing it.

Kendall Fortney, “5 Ways Data Visualizations can Lie”, Towards data science,
https://towardsdatascience.com/5-ways-data-visualizations-can-lie-46e54f41de37 40
GET1030

The problem of pie charts

The green slice is actually equal to a quarter of the


yellow one, and pink is a third of the value of the purple
slice

Which is bigger? If you choose New Hampshire you would be wrong, it


was actually Maryland. The 3D affect visually adds more volume to that
slice, tricking your eyes. Without labels telling the percentages there
would be little to no chance of accurately guessing it.

People tend to underestimate the size of acute


angles (<90°) and overestimate the size of obtuse
ones (>90°) (Nundy et al, 2000, text at 41
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC25873/)
GET1030

The problem of pie charts

Walt Hickey, “The Worst Chart In


The World”, Business Insider,
2013,
https://www.businessinsider.com/
pie-charts-are-the-worst-2013-6

42
GET1030

Multiple dimensions

Nathan Yau, How to Spot Visualization


Lies, Flowingdata, 2017,
https://flowingdata.com/2017/02/09/how-t
o-spot-visualization-lies/

43
GET1030

Values not normalized

Chiqui Esteban, ‘A Quick Guide to


Spotting Graphics That Lie, National
Geographic, May 2015,
https://www.nationalgeographic.com/news/2015/06/1
50619-data-points-five-ways-to-lie-with-charts/

44
GET1030

Spurious correlations

Linecharts tend to suggest correlation

http://www.tylervigen.com/spurious-correlations 45
GET1030

Even good graphics tell different stories

Cabanski, C., Gilbert, H., & Mosesova, S. (2018). Can Graphics Tell Lies? A Tutorial on How
To Visualize Your Data. Clinical and translational science, 11(4), 371–377.
doi:10.1111/cts.12554

46
GET1030

Raw data, not just summary statistics

Nine data sets with equivalent


summary statistics. Each data set
has the same x mean (54.26), y
mean (47.83), x SD (16.76), y SD
(26.93), and Pearson correlation
coefficient ( −0.06). The nine distinct
patterns show the importance of
plotting the raw data rather than only
displaying summary statistics or
models.

Cabanski, C., Gilbert, H., &


Mosesova, S. (2018). Can
Graphics Tell Lies? A Tutorial on
How To Visualize Your Data.
Clinical and translational science,
11(4), 371–377.
doi:10.1111/cts.12554
47
GET1030

List of caveats

https://www.data-to-viz.com/caveats.html
48
Lecture 3
Visualizing data

Part 4: critical perspectives on data visualization

49
GET1030

Alternative modes

A basic bar chart compares the


number of men (top bar) and
the number of women (bottom
bar) in seven different nations,
A through F, at the present time
(2010). The assumptions are
that quantities (number),
entities (nations), identities
(gender) and temporality (now)
are all self-evident. Graphic
credit Xárene Eskandar.

Drucker (2011)

50
GET1030

Alternative modes

In this chart gendered identity is


modified. In nation A, the top bar
contains a changing gradient,
indicating that “man” is a continuum
from male infant to adult, or in
countries E and D, gender ambiguity
is displayed.

In country F women only register as


individuals after coming of
reproductive age, thus showing that
quantity is a effect of cultural
conditions, not a self-evident fact.

The movement of men back and forth


across the border of nations B and C
makes the “nations” unstable entities.
Graphic credit Xárene Eskandar.

Drucker (2011)
51
GET1030

Two geographical visualizations

52
Hepword and Church (2011)
GET1030

Two geographical visualizations

53
Hepword and Church (2011)
GET1030

Two geographical visualizations

54
Hepword and Church (2011)
GET1030

Two geographical visualizations

55
GET1030

Each visualization tells a story

By Nathan Yau

“Let the data speak.” It’s a common saying


for chart design. The premise — strip out the
bits that don’t help patterns in your data
emerge — is fine, but people often
misinterpret the mantra to mean that they
should make a stripped down chart and let
the data take it from there.

You have to guide the conversation though.


You must help the data focus and get to the
point. Otherwise, it just ends up rambling
about what it had for breakfast this morning
and how the coffee wasn’t hot enough.

https://flowingdata.com/2017/01/24/one-data
set-visualized-25-ways/

56
GET1030

Jer Thorp

Collection → Computation → Representation

This formula is too simplistic


Whenever you look at data — as a spreadsheet or database view or a visualization, you are looking at [a
system]. What this diagram doesn’t capture is the immense branching of choice that happens at each
step along the way. As you make each decision — to omit a row of data, or to implement a particular
database structure or to use a specific color palette you are treading down a path through this wild, tall
grass of possibility. It will be tempting to look back and see your trail as the only one that you could have
taken, but in reality a slightly divergent you who’d made slightly divergent choices might have ended up
somewhere altogether different. To think in data systems is to consider all three of these stages at once
[...] (Thorp 2017)

https://medium.com/@blprnt/you-say-data-i-say-system-54e84aa7a421

57
GET1030

Ways of Seeing Data

● Book chapter: Gray et al (2016) “Ways of Seeing Data: Toward a Critical


Literacy for Data Visualizations as Research Objects and Research Devices” [in
LumiNUS].
● Checklist of questions for analyzing visualizations
● Influenced by John Berger’s Ways of Seeing (1972), which emphasized political
and cultural histories of how images are produced and consumed.
● Objective: developing a critical literacy of visualization.
● There is no ‘neutral’ option, every type of visualization reflects ‘narrative
regimes’ (ideas of what stories should be told, and how they should be told)
● 3 forms of mediation:
○ World ⇢ data
○ Data ⇢ image
○ Image ⇢ eye

● Culturally and historically specific ‘ways of seeing’ that reflect values, cultures,
aesthetics and prejudices.

58
GET1030

World ⇢ data

• What information and data are being represented in the visualization?


• What are the sources for this information? Where do the data come from?
(is it cited?)
• How are the data generated? What are the rationales, methods, and
standards inscribed in the data infrastructures through which the
data are generated?
• How are the data transformed or prepared?
• Which data sources are combined and how?
• How do the data selectively prioritize certain things over others?

59
GET1030

Data ⇢ image

• How are the data mediated into graphical form?


• What kinds of graphical techniques, methods, and technologies have
been used?
• What are their affordances? what does the data enable you to “do”, what
stories can you tell with it) How do they guide our attention toward
different aspects of the data?
• What design decisions have been taken? What are their consequences?

Try to reverse-engineer the visualizations: what steps were taken to


produce them?

60
GET1030

Image ⇢ eye

• What kinds of visual cultures and practices are implicated or reflected


in the data visualization? Where do these come from?
• What forms of usage are inscribed in the visualization? Who are the
publics of the data visualization? How is it circulated, cited, and
shared?

Visual cultures - what are they and where do they come from?

How are visualizations used, circulated, cited and shared?

Denaturalize - historicize (every visual “idea” comes from somewhere)

Find the genealogy of the visualization styles

61
GET1030

Other takeaways from this chapter

Data infrastructures
Visualization devices, rather than ‘tools’ (they emphasize world-making)
‘Communicative objectivity’ Halpern
Tufte advocates: championing efficiency and parsimony, maximizing the
‘data–ink’ ratio, and eliminating ‘chartjunk’

62
GET1030

For the individual assignment

Choose a data visualization and analyze it using the perspectives seen in


this chapter. It is strongly recommended that you use the checklist in Gray’s
chapter.

63
GEK 2050
GET1030

Individual assignment (20%)

A 1000 word analysis of a data visualization (it doesn’t need to be a


humanities data visualization, but you must use critical data perspectives,
as explained in Lectures #2 and #3).

You must explain the computational thinking process that led to this
project: how data was modeled, which aspects of the analysis were
automated, and what output was produced. You should also explain what
are the potential blind spots and limitations of the project.

File format: one file with your name clearly stated in the body of the text
(5 marks will be deducted if this missing). For citations, please use the
Chicago Manual of Style 17th Edition, Author-Date format. Do not cite
academic journals is if they were websites.

64
GEK 2050
GET1030

Where to find graphs to analyze?

1) The media. You can analyze examples that you consider to be


successful or problematic from web pages, social media and news
sources.
2) Digital Scholarship in the Humanities https://academic.oup.com/dsh
(Access provided via NUS libraries).
3) Ideally you shouldn’t choose a scientific data visualization from
natural science publications, where visualization practices are more
standardized. You want a visualization of a ‘less settled’ area, in
connection to the social and cultural world.

65
GET1030

That’s it for today

https://xkcd.com/688/

66
GET1030

References

Drucker, Johanna. 2011. “Humanities Approaches to Graphical Display.” Digital Humanities Quarterly 005 (1).
Hepworth, Katherine, and Christopher Church. 2019. “Racism in the Machine: Visualization Ethics in Digital Humanities
Projects.” Digital Humanities Quarterly 012 (4).
Gray, Jonathan, Liliana Bounegru, Stefania Milan, and Paolo Ciuccarelli. 2016. “Ways of Seeing Data: Toward a Critical
Literacy for Data Visualizations as Research Objects and Research Devices.” In Innovative Methods in Media and
Communication Research, edited by Sebastian Kubitschko and Anne Kaun, 227–51. Cham: Springer International
Publishing. https://doi.org/10.1007/978-3-319-40700-5_12.
Thorp, Jer. 2017. “You Say Data, I Say System.” Hacker Noon. July 13, 2017.
https://hackernoon.com/you-say-data-i-say-system-54e84aa7a421.
Friendly, Michael and Daniel Denis. “The early origins and development of the scatterplot.” Journal of the History of the
Behavioral Sciences, 41(2), 103–130.
Nundy, Surajit et al. 2000. “Why are angles misperceived?”. Proc Natl Acad Sci 97(10): 5592–5597.

67

You might also like