
Plot a PCA in six easy steps

Aaron Teo
Step 1: format dataset
• Data must be formatted as shown
• 1 observation per row
• 1 variable per column
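Variable names must be a single word. Avoid
symbols and keep them simple when
possible.
SeptNo (single word: fine)
Sept.No (a dot is allowed in R names: fine)
Sept No (contains a space: avoid)
53|>t |\|0 !!11!? (symbols and spaces: avoid)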
• Save as “Text (Tab delimited) (*.txt)”
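The layout above can also be sketched directly in R, which makes the "one observation per row, one variable per column" rule concrete. Everything in this sketch is hypothetical: the data frame `example`, its column names and values, and the file name are made up for illustration and are not the dataset used in these slides.

```r
# Hypothetical dataset: one observation per row, one variable per column.
# The column names and values are made up for illustration.
example <- data.frame(
  Population = c("A", "A", "B", "B"),
  SL         = c(41.2, 39.8, 52.1, 50.7),
  SeptNo     = c(12, 11, 15, 14)
)

# Write a tab-delimited text file, the same format as
# "Text (Tab delimited) (*.txt)" in a spreadsheet program.
out_file <- file.path(tempdir(), "example.txt")
write.table(example, out_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Inspect the raw lines: a header row, then one line per observation.
readLines(out_file)
```

The first line of the file holds the variable names; each following line is one observation with values separated by tabs.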
Step 2: load dataset into R
• data <- read.table("file directory", header=TRUE)

Key
data name of the object where the dataset is stored in R
read.table function for loading data in txt files
"file directory" location of the file on computer
header=TRUE loads the column names in the dataset if TRUE
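A minimal sketch of the loading step, assuming a small tab-delimited file. The file name and its contents are hypothetical and are created inside the example so that it runs on its own. The `sep = "\t"` argument is an addition not shown in the slide: it is optional, but it may help when a value contains spaces.

```r
# Create a small tab-delimited file to read back (hypothetical values).
in_file <- file.path(tempdir(), "example_step2.txt")
writeLines(c("Population\tSL\tSeptNo",
             "A\t41.2\t12",
             "B\t52.1\t15"), in_file)

# sep = "\t" tells read.table to split on tabs only, which keeps
# values that contain spaces in a single column.
data <- read.table(in_file, header = TRUE, sep = "\t")

str(data)    # check column names and types
head(data)   # check the first rows
```

Checking the result with `str()` and `head()` right after loading is a quick way to confirm that the header was read as column names and that the measurements came in as numbers.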
Step 3: log the data (numerical portion only)
log.data<-log(data[,2:6])

Key
log.data name of object where logged data is stored
log function for performing natural logarithm
data dataset from step 2
[,2:6] subsets a portion of a matrix or data frame. The format is
data[x,y], which selects row x and column y of data. Blank means all, a single
value selects one row/column, and a:b selects rows/columns a to b. E.g.
data[,2:6] means all rows in columns 2 to 6 of data, i.e. the numerical portion
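The subsetting rules in the key can be tried on a tiny data frame. This is only a sketch: the object `d` and its values are hypothetical and stand in for the real dataset.

```r
# Hypothetical data frame: a group column followed by two measurements.
d <- data.frame(Group = c("A", "B", "C"),
                x = c(1, 10, 100),
                y = c(2, 20, 200))

d[2, 3]     # row 2, column 3: a single value
d[, 2:3]    # all rows, columns 2 to 3: the numerical portion
d[1:2, ]    # rows 1 to 2, all columns

# log() is the natural logarithm; it is applied to every value.
log.d <- log(d[, 2:3])
log.d
```

Taking the log of the group column would fail because it is not numeric, which is why only the numerical columns are selected first.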
Step 4: store groups of interest separately
group.names <- data[,1]

• As before, storage of all rows in column 1 of object called data in a
new object called group.names
• Allows identification of points in plots so that populations can be
colour coded
Step 5: perform PCA
data.pca <- prcomp(log.data, center = TRUE, scale. = TRUE)

Key
data.pca PCA output data stored here
prcomp PCA function
log.data numerical data to be plotted
center = TRUE centers log.data (subtracts each column mean) if TRUE
scale. = TRUE scales log.data to unit variance if TRUE (note the trailing dot in the argument name)
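To see what centering and scaling do, the sketch below repeats the PCA by hand. It uses R's built-in `iris` dataset as a stand-in, which is an assumption of this example and not the dataset from these slides; the object names `log.iris`, `iris.pca`, `z` and `scores` are likewise made up here.

```r
# Stand-in data: R's built-in iris dataset (not the dataset in these slides).
# Columns 1 to 4 are numeric measurements; column 5 is the species.
log.iris <- log(iris[, 1:4])
iris.pca <- prcomp(log.iris, center = TRUE, scale. = TRUE)

# center = TRUE subtracts each column mean; scale. = TRUE then divides
# each column by its standard deviation, as scale() does.
z <- scale(log.iris, center = TRUE, scale = TRUE)

# The PC scores are the standardised data rotated by the loadings.
scores <- z %*% iris.pca$rotation
all.equal(unname(scores), unname(iris.pca$x))
```

Scaling matters when variables are measured in different units or have very different spreads, because otherwise the variable with the largest variance tends to dominate the first component.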
Step 6: plot data
biplot(data.pca)

Key
biplot function plots PCA based on
output from data.pca

Congratulations, PCA is plotted.


However, most people find this
inscrutable.
Alternative step 6: plot data
library(ggbiplot)
ggbiplot(data.pca, obs.scale = 1, var.scale = 1,
         groups = group.names, ellipse = TRUE, circle = TRUE) +
  scale_color_manual(values = c("black", "blue", "green", "red")) +
  theme(legend.direction = 'horizontal', legend.position = 'top')

Congratulations, PCA is plotted.


Breaking down alternative step 6, part A
library(ggbiplot)
ggbiplot(data.pca, obs.scale = 1, var.scale = 1, groups = group.names, ellipse = TRUE,
circle = TRUE)

Key
library(ggbiplot) calls additional functions from a library of functions called ggbiplot
ggbiplot function for plotting, same as library name
groups = group.names group.names object supplies info on how to colour code
data points in plot
ellipse = TRUE draws ellipses if TRUE
circle = TRUE draws correlation circle if TRUE
Product of step 6a
• There is room for improvement.
Colours on the plot do not
intuitively match their colour
groupings in data
Breaking down alternative step 6, part B
+ scale_color_manual(values=c("black","blue","green","red"))

Key
scale_color_manual assign colours in plot manually
values=c("black","blue","green","red") assign colours in order of
“black”, “blue”, “green” and lastly “red”.
Follows alphabetical /numerical order of unique
levels in group.names
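The ordering rule above can be checked in base R before plotting. This is a sketch with hypothetical group labels; the assumption is that the groups are treated as a factor, whose levels are sorted rather than kept in the order they appear in the data.

```r
# Hypothetical group labels, in the order they appear in the data.
group.names <- c("wild", "captive", "hybrid", "captive", "wild")

# Groups are treated as a factor; its levels are sorted, not in data order.
lv <- levels(factor(group.names))
lv

# The i-th colour in values goes to the i-th level.
cols <- c("black", "blue", "green")
setNames(cols, lv)
```

If relying on position feels fragile, `scale_color_manual` also accepts a named vector such as `values = c(wild = "green", captive = "black", hybrid = "blue")`, which ties each colour to a level by name.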
Product of step 6b
• Colours now more intuitive.
Legend can be moved for
aesthetic / space efficiency
purposes
Breaking down alternative step 6, part C
+ theme(legend.direction = 'horizontal', legend.position = 'top')

Key
theme function controls theme of plot
legend.direction = 'horizontal' legend arranges elements horizontally
legend.position = 'top' legend placed at top of plot
Product of step 6c
• As seen in alternative step 6
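If ggbiplot is not installed, a colour-coded plot of the first two components can be approximated with base graphics. This is a simpler stand-in, not the ggbiplot output: it shows the scores only, without variable arrows, ellipses or the correlation circle. It uses R's built-in `iris` dataset and three arbitrary colours, all of which are assumptions of this sketch.

```r
# Stand-in data: R's built-in iris dataset (not the dataset in these slides).
iris.pca <- prcomp(log(iris[, 1:4]), center = TRUE, scale. = TRUE)
groups   <- factor(iris$Species)

# One colour per group level, assigned in level order.
cols <- c("black", "blue", "red")

# Write the plot to a temporary PDF; drop pdf() and dev.off() to draw on screen.
plot_file <- file.path(tempdir(), "pca_scores.pdf")
pdf(plot_file)
plot(iris.pca$x[, 1], iris.pca$x[, 2],
     col = cols[as.integer(groups)], pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("top", legend = levels(groups), col = cols, pch = 19, horiz = TRUE)
dev.off()
```

Indexing `cols` by `as.integer(groups)` gives each point the colour of its group level, and `horiz = TRUE` lays the legend out horizontally at the top, mirroring the legend settings used in part C.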
Other useful information
plot(data.pca, type="l")
• Graphical display of amount of
variation in data explained by
each principal component
• Same information displayed in
summary(data.pca)
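The numbers behind the scree plot can be computed directly from the PCA object. The sketch below uses R's built-in `iris` dataset as a stand-in (an assumption of this example); the names `pc.var`, `prop.var` and `cum.var` are made up here.

```r
# Stand-in data: R's built-in iris dataset (not the dataset in these slides).
iris.pca <- prcomp(log(iris[, 1:4]), center = TRUE, scale. = TRUE)

# sdev holds the standard deviation of each PC; squaring gives variances.
pc.var <- iris.pca$sdev^2

# Proportion of total variance explained by each PC, and the running total.
prop.var <- pc.var / sum(pc.var)
cum.var  <- cumsum(prop.var)
round(rbind(prop.var, cum.var), 3)

# The same proportions appear in the summary table.
summary(iris.pca)$importance["Proportion of Variance", ]
```

The cumulative row is a convenient guide for deciding how many components are worth plotting or interpreting.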
Other useful information
print(data.pca)
• Shows loadings of each PC, allows interpretation of PC and correlation
with and between actual variables
• Magnitude denotes strength of correlation, +/- denotes positive or
negative correlation
• Eg PC1 primarily positively
correlated with Cdmax and CDmin
and secondarily SL. Suggests that
the three variables are positively
correlated too
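The link between loadings and correlations can be checked numerically. In this sketch, which again uses the built-in `iris` dataset as a stand-in rather than the data in these slides, the correlation between each variable and each PC is compared with the loading multiplied by that PC's standard deviation. The relationship shown assumes the data were scaled, as in step 5.

```r
# Stand-in data: R's built-in iris dataset (not the dataset in these slides).
iris.log <- log(iris[, 1:4])
iris.pca <- prcomp(iris.log, center = TRUE, scale. = TRUE)

# Loadings: one column per PC, one row per original variable.
iris.pca$rotation

# Correlation of each variable with each PC's scores.
var.pc.cor <- cor(iris.log, iris.pca$x)

# With scaled data, this correlation equals the loading multiplied
# by the standard deviation of that PC.
loading.cor <- sweep(iris.pca$rotation, 2, iris.pca$sdev, `*`)
all.equal(unname(var.pc.cor), unname(loading.cor))
```

Within a single PC the loadings and the correlations differ only by a constant factor, so ranking variables by loading magnitude gives the same order as ranking them by correlation with that PC.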
Self learning
• Experiment to produce different kinds
of plots in R!
