
MA304 - Data Visualisation

LECTURE 4
DR JACKIE WONG
Recall from last lecture:
• The role of aesthetics in DV
• How to map data to aesthetics
• Various types of aesthetics and their suitability
• Coordinate systems and axes
• Linear and nonlinear axes for data representation
• Colour
In this lecture, we will cover:
• Directory of visualizations
• Quick overview of base graphics in R
• More about basic plots
  - Histograms
  - Density plots
  - ECDF plots
  - Q-Q plots
• Visualizing multiple distributions
  - Box plots
  - Violin plots
  - Strip charts
  - Sina plots
Directory of Visualizations
To visualize amounts (i.e., numerical values shown for some set of categories)
• The most common approach is using bars, either vertically or horizontally arranged.

• We can also place dots at the location where the corresponding bar would end.
• If there are two or more sets of categories for which we want to show
amounts, we can group or stack the bars.
• We can also map the categories onto the x and y axis and show amounts by
colour, via a heatmap.
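The options above can be sketched in base R. This is a minimal illustration only; it assumes the built-in `VADeaths` matrix (5 age groups by 4 population groups), which is not a dataset used elsewhere in this lecture, and it draws the heatmap with `image()` as one possible choice.

```r
# Assumption: the built-in VADeaths matrix stands in for "amounts by two sets of categories"
par(mfrow = c(1, 3))

# Grouped bars: one bar per age group within each population group
mid_grouped <- barplot(VADeaths, beside = TRUE, legend.text = TRUE,
                       args.legend = list(x = "topleft", cex = 0.7),
                       cex.names = 0.6, ylab = "Deaths per 1000",
                       main = "Grouped bars")

# Stacked bars: age groups stacked within each population group
mid_stacked <- barplot(VADeaths, beside = FALSE, cex.names = 0.6,
                       ylab = "Deaths per 1000", main = "Stacked bars")

# Heatmap: categories on both axes, amount shown by shade (darker = larger)
image(x = 1:nrow(VADeaths), y = 1:ncol(VADeaths), z = VADeaths,
      axes = FALSE, xlab = "Age group", ylab = "", main = "Heatmap",
      col = gray(seq(1, 0.2, length.out = 12)))
axis(1, at = 1:nrow(VADeaths), labels = rownames(VADeaths), cex.axis = 0.6)
axis(2, at = 1:ncol(VADeaths), labels = colnames(VADeaths), las = 1, cex.axis = 0.6)
```

`barplot()` returns the bar midpoints, which is handy if text or dots need to be added at the bar positions afterwards.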
To visualize distributions of quantitative data
• Histograms and density plots provide the most intuitive visualizations of a
distribution, but both require arbitrary parameter choices and can be
misleading.
• Cumulative densities and quantile-quantile (q-q) plots always represent the
data faithfully but can be more difficult to interpret.
To visualize many distributions at once
• Boxplots, violins, strip charts, and sina plots
• Stacked histograms and overlapping densities: more in-depth comparison of a
smaller number of distributions, though stacked histograms can be difficult to
interpret and are best avoided.
• Ridgeline plots: useful when visualizing very large numbers of distributions or
changes in distributions over time.
Proportions
• Pie charts, side-by-side bars, or stacked bars, and as in the case for
amounts, bars can be arranged either vertically or horizontally.
• Pie charts emphasize that the individual parts add up to a whole
and highlight simple fractions.
• However, the individual pieces are more easily compared in side-by-
side bars. Stacked bars look awkward for a single set of proportions.
Multiple sets of proportions
• Grouped bars work well as long as the number of conditions compared is
moderate, and stacked bars can work for large numbers of conditions.
• Stacked densities are appropriate when the proportions change along a continuous
variable.
When proportions are specified according to multiple grouping variables
• Mosaic plots, treemaps, or parallel sets are useful visualization approaches. Mosaic plots
assume that every level of one grouping variable can be combined with every level of another
grouping variable, whereas treemaps do not make such an assumption.
• Treemaps work well even if the subdivisions of one group are entirely distinct from the
subdivisions of another.
• Parallel sets work better than either mosaic plots or treemaps when there are more than two
grouping variables.
Association between variables
• Scatterplots represent the archetypical visualization when we want to show one
quantitative variable relative to another.
• If we have three quantitative variables, we can map one onto the dot size,
creating the bubble chart.
• For paired data, where the variables along the x and the y axes are measured in
the same units, it is generally helpful to add a line indicating x = y.
• Paired data can also be shown as a slope graph of paired points connected by
straight lines.
• For large numbers of points, contour lines, 2D bins, or hex bins may be useful.
• When we want to visualize more than two quantities, on the other hand, we
may choose to plot correlation coefficients in the form of a correlogram instead
of the underlying raw data.
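For paired data, the x = y reference line can be sketched as follows. This assumes the built-in `sleep` dataset (extra hours of sleep for the same 10 patients under two drugs), chosen here purely for illustration.

```r
# Assumption: the built-in sleep data serve as an example of paired measurements
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Use the same limits on both axes so that the x = y line is a true diagonal
lims <- range(c(drug1, drug2))
plot(drug1, drug2, xlim = lims, ylim = lims, pch = 20,
     xlab = "Extra sleep, drug 1 (hours)",
     ylab = "Extra sleep, drug 2 (hours)",
     main = "Paired data with x = y line")
abline(a = 0, b = 1, lty = 2)  # points above the line: drug 2 gave more extra sleep
```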
To visualize temporal relationships
• When the x axis represents time or a strictly increasing quantity such as a treatment
dose, we commonly draw line graphs.
• If we have a temporal sequence of two response variables, we can draw a connected
scatterplot where we first plot the two response variables in a scatterplot and then
connect dots corresponding to adjacent time points.
• We can use smooth lines to represent trends in a larger dataset.
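A line graph with a smooth trend can be sketched in base R as below. The built-in `Nile` series and the LOWESS smoother with span `f = 0.3` are assumptions made for this example; other smoothers and spans would work equally well.

```r
# Assumption: the built-in Nile time series is used as example data
year <- as.numeric(time(Nile))
flow <- as.numeric(Nile)

# Line graph of the raw series
plot(year, flow, type = "l", col = "gray50",
     xlab = "Year", ylab = "Annual flow", main = "Line graph with smooth trend")

# Smooth trend line; f controls the span (larger f = smoother curve)
trend <- lowess(year, flow, f = 0.3)
lines(trend, col = 2, lwd = 2)
```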
Geospatial data
• A map takes coordinates on the globe and projects them onto a flat surface, such that
shapes and distances on the globe are approximately represented by shapes and
distances in the 2D representation.
• We can show data values in different regions by colouring those regions in the map
according to the data. Such a map is called a choropleth.
• Cartograms: it may sometimes be helpful to distort the different regions according to
some other quantity (e.g., population number) or to simplify each region into a square.
To visualize uncertainties
• Error bars are meant to indicate the range of likely values for some estimate or
measurement. They extend horizontally and/or vertically from some reference
point representing the estimate or measurement.
• Reference points can be shown in various ways, such as by dots or by bars.
• Graded error bars show multiple ranges at the same time, where each range
corresponds to a different degree of confidence. They are in effect multiple error
bars with different line thicknesses plotted on top of each other.
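Graded error bars can be approximated in base R with `arrows()` and `segments()`. The sketch below uses the `ChickWeight` data at day 21 and takes mean ± 1 and ± 2 standard errors as the two ranges; treating these as the "different degrees of confidence" is an assumption of this example.

```r
# Mean weight and standard error per diet at the final time point (day 21)
final <- ChickWeight[ChickWeight$Time == 21, ]
m  <- tapply(final$weight, final$Diet, mean)
s  <- tapply(final$weight, final$Diet, sd)
n  <- tapply(final$weight, final$Diet, length)
se <- s / sqrt(n)
x_pos <- seq_along(m)

# Reference points shown as dots
plot(x_pos, m, pch = 19, xaxt = "n", xlim = c(0.5, 4.5),
     ylim = range(m - 2 * se, m + 2 * se),
     xlab = "Diet", ylab = "Mean weight at day 21 (grams)")
axis(1, at = x_pos, labels = names(m))

# Thin bar with caps: mean +/- 2 SE
arrows(x_pos, m - 2 * se, x_pos, m + 2 * se, angle = 90, code = 3, length = 0.05)
# Thick bar on top: mean +/- 1 SE
segments(x_pos, m - se, x_pos, m + se, lwd = 4)
```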
More detailed visualization of uncertainties
• Confidence strips provide a clear visual sense of uncertainty but are difficult to read
accurately.
• Eyes and half-eyes combine error bars with approaches to visualize distributions
(violins and ridgelines, respectively), and thus show both precise ranges for some
confidence levels and the overall uncertainty distribution.
• A quantile dot plot shows the distribution in discrete units. It is not as precise
but can be easier to read than the continuous distribution shown by a violin or
ridgeline plot.
• For smooth line graphs, the equivalent of an error bar is a confidence band. It
shows a range of values the line might pass through at a given confidence level.
• As in the case of error bars, we can draw graded confidence bands that show
multiple confidence levels at once.
• We can also show individual fitted draws in lieu of or in addition to the confidence
bands.
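A confidence band can be sketched in base R with `predict()` and `polygon()`. For simplicity this example fits a straight line with `lm()` to the built-in `cars` data rather than a smooth curve; both the dataset and the 95% level are assumptions of the sketch.

```r
# Assumption: a straight-line fit to the built-in cars data stands in for a fitted trend
fit  <- lm(dist ~ speed, data = cars)
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 100))

# Matrix with columns fit, lwr, upr: fitted line and 95% confidence limits
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

plot(cars$speed, cars$dist, type = "n",
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Fitted line with 95% confidence band")
# Shaded band between the lower and upper limits
polygon(c(grid$speed, rev(grid$speed)),
        c(band[, "lwr"], rev(band[, "upr"])),
        col = gray(0.8), border = NA)
points(cars$speed, cars$dist, pch = 20)
lines(grid$speed, band[, "fit"], lwd = 2)
```

A graded band could be drawn by repeating the `polygon()` call for several values of `level`, widest first.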
Base Graphics in R
• Graphics can be easily created in R and subsequently included in Word, PowerPoint, LaTeX, etc.
• Some useful functions are as follows:
- Change display settings of the active graphics window
> par(mfrow = c(1,2), mar = c(4,4,0,0)) # mar = c(bottom, left, top, right)
> ?par # for more options
- Close the active graphics window
> dev.off()
- Close all graphics windows
> graphics.off()
- Save plots directly from R: pdf(), postscript(), png(), jpeg(), etc.
> ?Devices # for more info
> pdf("Figure1.pdf", width = 10, height = 5) # open a PDF device (size in inches)
> par(mfrow = c(1,2), mar = c(3,3,0,3)) # mar = c(bottom, left, top, right)
> x <- seq(-5, 5, 0.1)
> y1 <- x^2
> plot(x, y1, type = "l") # OR plot(y1 ~ x, type = "l")
> y2 <- sin(x)
> plot(y2 ~ x, type = "l")
> dev.off() # close the device so that Figure1.pdf is written to disk
• Questions for you!

PollEv.com/jackiewong677
More on Histograms
• We have covered histograms (see Lecture 2)
• What else?
• The arbitrary choice of break points affects appearances
• By default, R uses Sturges' formula to determine the number of classes:
No. of Classes = ⌈1 + log2(No. of Observations)⌉ ≈ 1 + 3.3 × log10(No. of Observations)
• Always test various values of bin width!
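As a small sketch of the points above, the code below compares the slide's approximation with R's own `nclass.Sturges()` and then draws the same data with three different numbers of breaks. The values 5 and 40 are arbitrary choices for illustration.

```r
w <- ChickWeight$weight
n <- length(w)

k_slide <- 1 + 3.3 * log10(n)   # approximation quoted on the slide
k_r     <- nclass.Sturges(w)    # R's version: ceiling(log2(n) + 1)

# Same data, three different numbers of bins
# (hist() treats a single number as a suggestion and picks "pretty" break points)
par(mfrow = c(1, 3))
for (k in c(5, k_r, 40)) {
  hist(w, breaks = k, main = paste("breaks =", k), xlab = "Weight (grams)")
}
```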
Default histograms (frequency/density)
> par(mfrow=c(1,2))
> hist(ChickWeight$weight,xlab="Weight (grams)",main="Histogram of ChickWeight") # counts on the y axis
> hist(ChickWeight$weight,freq=FALSE,xlab="Weight (grams)",main="Histogram of ChickWeight") # densities on the y axis (bar areas sum to 1)
Changing break points
> par(mfrow=c(1,2))
> hist(ChickWeight$weight,breaks=3,xlab="Weight (grams)",main="Histogram of ChickWeight")
> hist(ChickWeight$weight,breaks=seq(0,400,by=100),xlab="Weight (grams)",col=gray(1:4/4),main="Histogram of ChickWeight")
Adding Text
> hist(ChickWeight$weight,xlab="Weight (grams)",main="Histogram of ChickWeight",col="blue")
> text(100, 50, "some text",srt=45,cex=2,col="white")
Another weakness: Bad at visualising multiple distributions together
> hist(ChickWeight$weight[ChickWeight$Diet==1],col="blue",main="Histogram of ChickWeight by Diet",xlab="Weight (grams)")
> hist(ChickWeight$weight[ChickWeight$Diet==2],col=2,add=TRUE) # drawn on top, hiding part of the Diet 1 bars
> legend("topright",c("Diet=1","Diet=2"),fill=c("blue",2))
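One way to make overlaid histograms slightly more readable is to use common break points and semi-transparent colours. The following is only a sketch of that workaround; the bin width of 25 grams and the alpha value of 0.4 are arbitrary assumptions, and the result still becomes cluttered with more than two groups.

```r
w1 <- ChickWeight$weight[ChickWeight$Diet == 1]
w2 <- ChickWeight$weight[ChickWeight$Diet == 2]

# Common break points so that the bars of both groups line up
brks <- seq(0, 400, by = 25)
h1 <- hist(w1, breaks = brks, plot = FALSE)
h2 <- hist(w2, breaks = brks, plot = FALSE)

# Semi-transparent fills let the overlapping region show through
col1 <- rgb(0, 0, 1, 0.4)
col2 <- rgb(1, 0, 0, 0.4)
plot(h1, col = col1, ylim = c(0, max(h1$counts, h2$counts)),
     main = "Histogram of ChickWeight by Diet", xlab = "Weight (grams)")
plot(h2, col = col2, add = TRUE)
legend("topright", c("Diet 1", "Diet 2"), fill = c(col1, col2))
```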
Density Plots
• Most commonly done using kernel density estimation
• Draw a continuous curve (the kernel) with a small width (controlled by a
parameter called bandwidth) at the location of each data point, and then we
add up all these curves to obtain the final density estimate.
• Related to the concept of “smoothing”
• Two arbitrary parameters:
- Bandwidth choices
- Kernel choices, but these make little difference if the sample size is large
> plot(density(ChickWeight$weight),main="Default Density Plot")
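The "one small curve per data point, then add them up" description can be checked by hand. The sketch below assumes a Gaussian kernel and the default bandwidth rule `bw.nrd0()`, builds the estimate directly, and overlays the result of `density()`. The grid of 200 points is an arbitrary choice, and `density()` uses a fast binned approximation, so the two curves agree closely but not exactly.

```r
w <- ChickWeight$weight
h <- bw.nrd0(w)   # default bandwidth rule used by density()

# Evaluation grid extending 3 bandwidths beyond the data
grid <- seq(min(w) - 3 * h, max(w) + 3 * h, length.out = 200)
step <- grid[2] - grid[1]

# Manual KDE: average of Gaussian curves centred at each data point, each with sd = h
kde_manual <- sapply(grid, function(g) mean(dnorm(g, mean = w, sd = h)))

# Built-in estimate on the same grid for comparison
kde_builtin <- density(w, bw = h, kernel = "gaussian",
                       from = min(grid), to = max(grid), n = 200)

plot(grid, kde_manual, type = "l", lwd = 3, col = "gray60",
     xlab = "Weight (grams)", ylab = "Density", main = "Manual vs built-in KDE")
lines(kde_builtin, col = 2, lty = 2)
legend("topright", c("manual sum of kernels", "density()"),
       col = c("gray60", 2), lty = c(1, 2), lwd = c(3, 1))
```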
Bandwidth Choices
> plot(density(ChickWeight$weight,bw=5),main="Different Bandwidths",xlab="Weight (grams)")
> lines(density(ChickWeight$weight,bw=20),col=2)
> lines(density(ChickWeight$weight,bw=50),col=3)
> legend("topright",c("bandwidth=5","bandwidth=20","bandwidth=50"),lty=1,col=1:3)
Kernel Choices
> plot(density(ChickWeight$weight),main="Different Kernels",xlab="Weight (grams)")
> lines(density(ChickWeight$weight, kernel="rectangular"),col=2)
> lines(density(ChickWeight$weight, kernel="cosine"),col=3)
> legend("topright",c("gaussian","rectangular","cosine"),lty=1,col=1:3)
Good to visualize multiple distributions together
> plot(density(ChickWeight$weight[ChickWeight$Diet==1]),col=1,main="Density of ChickWeight by Diet",xlab="Weight (grams)")
> lines(density(ChickWeight$weight[ChickWeight$Diet==2]),col=2)
> lines(density(ChickWeight$weight[ChickWeight$Diet==3]),col=3)
> lines(density(ChickWeight$weight[ChickWeight$Diet==4]),col=4)
> legend("topright",c("Diet 1","Diet 2","Diet 3","Diet 4"),lty=1,col=1:4)
Density Plots
Danger!
Density Plots
Weaknesses:
- The arbitrary choices of bandwidth and kernel affect the resulting appearance.
- They tend to produce the appearance of data where none exists, in particular in the
tails.
- They can also be misleading for data with small sample sizes.
- They do not show exact frequencies (for discrete data).
Cumulative Distribution Function (CDF) Plot
Student grades examples
> x <- rnorm(30) # 30 randomly generated observations from N(0,1)
> y <- rt(30, df = 3) # 30 random observations from Student's t with 3 df
# verticals = TRUE adds vertical lines (resembling steps); do.points = FALSE suppresses the data points
> plot(ecdf(x), verticals = TRUE, xlim = c(-3,3), do.points = FALSE)
> plot(ecdf(y), verticals = TRUE, col = 2, add = TRUE, do.points = FALSE)
> legend("bottomright", c("Standard Normal","Student's t"), col = c(1,2), lty = 1)
Cumulative Distribution Function (CDF) Plot
Highly skewed variables: may consider transformations
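The effect of a transformation on an ECDF plot can be sketched with simulated data. The log-normal sample below is hypothetical (it is not a dataset from this lecture), and `log10` is just one possible transformation.

```r
set.seed(1)
# Hypothetical right-skewed variable, simulated for illustration only
income <- rlnorm(200, meanlog = 10, sdlog = 1)

F_raw <- ecdf(income)          # ECDF on the original scale
F_log <- ecdf(log10(income))   # ECDF after a log10 transformation

par(mfrow = c(1, 2))
plot(F_raw, do.points = FALSE, verticals = TRUE,
     main = "Original scale", xlab = "Income")
plot(F_log, do.points = FALSE, verticals = TRUE,
     main = "Log scale", xlab = "log10(Income)")

# The ECDF at t is simply the proportion of observations <= t
t0 <- median(income)
prop_below <- mean(income <= t0)
```

On the original scale most of the curve is squeezed against the left edge, while the log scale spreads it out. The cumulative proportions themselves are unchanged, because the transformation preserves the ordering of the data.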
Quantile-Quantile (Q-Q) plots
• Useful for checking if the observed data points follow (or not) a reference
distribution.

• Most commonly based on normal distribution as the reference

• Simply constructed by plotting the ranked (sorted) data points against the corresponding
theoretical quantile values
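The construction can be reproduced by hand and compared with `qqnorm()`. This sketch assumes a standard normal reference and uses `ppoints()` for the plotting positions, which is what `qqnorm()` uses; the simulated sample `z` is hypothetical.

```r
set.seed(1)
z <- rnorm(30)   # hypothetical sample, simulated for illustration
n <- length(z)

# Theoretical quantiles of N(0,1) at the plotting positions
theoretical <- qnorm(ppoints(n))
# Ranked (sorted) data
observed <- sort(z)

plot(theoretical, observed, pch = 20,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "Q-Q plot built by hand")
abline(a = 0, b = 1)   # reference line x = y

# qqnorm() returns the same pairs (in the original data order)
qq <- qqnorm(z, plot.it = FALSE)
```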
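# No discrepancy (x was drawn from N(0,1), so the reference line is y = x; qqline(x) gives a data-based line in general)
> qqnorm(x)
> abline(a=0,b=1) # add the reference line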
#Discrepancy in mean
> x1<-rnorm(30,mean=2)
> qqplot(x,x1,main="Q-Q Plot",xlab="N(0,1)",ylab="N(2,1)")
> abline(a=0,b=1)
#Discrepancy in variance
> x2<-rnorm(30,sd=sqrt(10))
> qqplot(x,x2,main="Q-Q Plot",xlab="N(0,1)",ylab="N(0,10)")
> abline(a=0,b=1)
Visualizing Multiple Distributions
Box Plots (Box and Whisker Plots)
• Invented by the statistician John Tukey in the early
1970s
• Very useful for visualizing multiple quantitative
variables
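• Mainly show the median and the 1st and 3rd quartiles (the box).
• Whiskers extend from the box either to the maximum and minimum values of the
data, or to the most extreme values that fall within 1.5 times the height of the box
(the interquartile range) from the box edges, whichever is shorter. Points beyond
the whiskers are drawn individually as outliers.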
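The whisker rule can be verified numerically. The sketch below computes the whisker ends for the `ChickWeight` weights by hand and compares them with `boxplot.stats()`. Note that R's box is based on Tukey's hinges from `fivenum()`, which are close to, but not always identical to, the quartiles returned by `quantile()`.

```r
w <- ChickWeight$weight

hinges <- fivenum(w)                 # min, lower hinge, median, upper hinge, max
box_height <- hinges[4] - hinges[2]  # height of the box

# Limits at 1.5 box heights beyond the box edges
upper_fence <- hinges[4] + 1.5 * box_height
lower_fence <- hinges[2] - 1.5 * box_height

# Whiskers end at the most extreme data values inside the limits
whisker_hi <- max(w[w <= upper_fence])
whisker_lo <- min(w[w >= lower_fence])

# Same quantities as computed by R
bs <- boxplot.stats(w, coef = 1.5)

boxplot(w, horizontal = TRUE, xlab = "Weight (grams)")
abline(v = c(whisker_lo, whisker_hi), lty = 3)  # dotted lines at the whisker ends
```

The `range` argument of `boxplot()` plays the role of the multiplier 1.5 here.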
Violin Plots

Violin plots have the same shortcomings as density plots. So maybe try showing individual data points?
Not very good…
Strip Charts

But is this good enough?


Sina Plots
Best of both worlds!
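A sina-style plot can be roughly imitated in base R by jittering each point horizontally within a width proportional to the estimated density at its value. The code below is a hand-rolled sketch under that assumption, not the algorithm of any particular package; `max_width` is a hypothetical tuning parameter.

```r
set.seed(42)
w <- ChickWeight$weight
g <- as.integer(ChickWeight$Diet)   # diet codes 1 to 4
max_width <- 0.4                    # hypothetical: widest half-width of a "violin"

# Empty plotting region with one x position per diet
plot(0, 0, type = "n", xlim = c(0.5, 4.5), ylim = range(w), xaxt = "n",
     xlab = "Diet", ylab = "Weight (grams)", main = "Sina-style plot (base R sketch)")
axis(1, at = 1:4)

offsets <- numeric(length(w))
for (i in 1:4) {
  idx <- which(g == i)
  d <- density(w[idx])                                 # density of this group
  dens_at_pt <- approx(d$x, d$y, xout = w[idx])$y      # density at each data value
  half_width <- max_width * dens_at_pt / max(d$y)      # wider where data are dense
  offsets[idx] <- runif(length(idx), -1, 1) * half_width
  points(i + offsets[idx], w[idx], pch = 20, cex = 0.6)
}
```

The cloud of points takes on the outline of a violin while every observation remains visible, which is the combination the sina plot aims for.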
Example
Fabric example: as part of a quality standards check, materials are tested for flammability.

Fabric 1   Fabric 2   Fabric 3   Fabric 4
 17.80      11.20      11.80      11.90
 16.20      11.40      11.00      10.80
 17.50      12.60      10.00      12.80
 17.40      10.00       9.20      10.70
 15.00      10.40       9.20       9.60
   ⋮          ⋮          ⋮          ⋮

Ignition times (in seconds) for different types of fabric.


> par(mfrow=c(1,2))
> boxplot(time ~ Fabric, data = fabric, ylab = "Time (seconds)")
# tuning the lengths of the whiskers: range = 0.5 limits them to 0.5 box heights beyond the box
> boxplot(time ~ Fabric, range = 0.5, data = fabric, ylab = "Time (seconds)")
# strip chart (assumes Fabric is stored as numeric codes 1 to 4 in the data frame fabric)
> with(fabric, plot(jitter(Fabric[Fabric==1], amount=0.1), time[Fabric==1], type="p", pch=20, ylim=range(time), main="Strip Chart", xlim=c(0.5,4.5), xlab="Fabric", ylab="Time (seconds)"))
> for (i in 2:4){ with(fabric, points(jitter(Fabric[Fabric==i], amount=0.1), time[Fabric==i], pch=20)) }
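Base R also provides `stripchart()`, which handles the jittering directly. The sketch below rebuilds a small `fabric` data frame from only the five rows per fabric shown in the table above (the remaining rows are not reproduced here), so the plots are illustrative rather than a reproduction of the lecture figures.

```r
# Only the first five ignition times per fabric, as listed in the table
fabric <- data.frame(
  Fabric = rep(1:4, each = 5),
  time = c(17.8, 16.2, 17.5, 17.4, 15.0,   # Fabric 1
           11.2, 11.4, 12.6, 10.0, 10.4,   # Fabric 2
           11.8, 11.0, 10.0,  9.2,  9.2,   # Fabric 3
           11.9, 10.8, 12.8, 10.7,  9.6)   # Fabric 4
)

par(mfrow = c(1, 2))
boxplot(time ~ Fabric, data = fabric, xlab = "Fabric", ylab = "Time (seconds)",
        main = "Box Plot")
# method = "jitter" spreads the points horizontally to reduce overplotting
stripchart(time ~ Fabric, data = fabric, method = "jitter", jitter = 0.1,
           vertical = TRUE, pch = 20, xlab = "Fabric", ylab = "Time (seconds)",
           main = "Strip Chart")
```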


Summary
• Directory of visualizations

• Quick overview of base graphics in R

• More about basic plots

• Visualizing multiple distributions


What Next?
• Will see more plots in our next lecture

• In particular, plots for visualizing associations between variables

• Introduce the R package “ggplot2”, which provides more graphical tools


(Some of the plots we discussed are difficult to code in R Base Graphics,
e.g. violin plots, sina plots etc.)
References
Wilke, Claus O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling
Figures. O'Reilly Media Inc.
Thank you!
SEE YOU NEXT MONDAY!
REMEMBER TO GO TO YOUR INDIVIDUAL LABS!
