
BM1, Applied Statistics, Lesson 1: Data and graph basics

Luis del Peso Ovalle


Dept. Biochemistry, Universidad Autónoma de Madrid
Instituto de Investigaciones Biomédicas Alberto Sols (UAM-CSIC)
Madrid, Spain

Contents

1 Data and graph basics
2 Study Cases
3 Files we will use
4 Reading data
  4.1 Spreadsheets and csv format
  4.2 Reading data in R commander
5 Exploring data
  5.1 Visualizing data
  5.2 Summarizing data
  5.3 Types of variables and representation of numerical variables
6 Managing data
  6.1 Stacking data and factors in R
  6.2 Subsetting and aggregating data
7 Summary
8 Exercises

1 Data and graph basics

In this lesson we will learn about the following concepts:
- types of variables and how to represent them graphically
- measures of center, spread and error
- experimental design and replicates
We will also learn how to perform basic data processing in R (R commander):
- read data
- subset and rearrange data sets
- transform data
- make different types of graphs
Disclaimer: This document is not, nor does the author pretend it to be, a substitute for the reference
books cited in the syllabus. It is intended as a mere guide to the course contents. Moreover, it may
(and surely does) contain errors.

2 Study Cases

Case 1.

Suppose you are studying the role of the transcripts from the EFNA3 gene in cancer metastasis. As part
of this project, you contact a collaborator who has developed a zebrafish model to assess the metastatic
potential of cells. The experiment is quite straightforward: breast cancer cells expressing different
transcripts from this gene (each cell line expresses one specific isoform) are labeled with a fluorescent
dye and injected into the perivitelline space of several zebrafish embryos (see figure 2.1). Then, 2-4
days after injection, disseminated (i.e. metastatic) cells are visualized using confocal microscopy and
the total number of disseminated cells per fish is recorded. Several fish are injected with each type of
cell to determine the statistical significance of the results. After a few weeks your colleague sends you
an email with the results (see figure 2.2). Accompanying figure 2.2 is the following legend:

"Systemic dissemination of cancer cells is enhanced by expression of different EFNA3 constructs:
MDA-MB-231 cells infected with lentivirus designed to induce expression of EFNA3, NC1, NC1s, NC2, NC2s
or empty control (pLOC) were labeled with DiI dye and approximately 100-500 cells were injected per
embryo in the perivitelline space of two days old transgenic Tg(fli1a:EGFP) zebrafish embryos. A)
Fluorescent micrographs of the posterior tail showing distal dissemination of tumor cells (red) and
their close association with the blood vessels of the embryo (green). B) Graph depicting the number
of disseminated tumor cells. n = 17-41. * p<0.05; ** p<0.01; *** p<0.001."

You are absolutely delighted with these results, as they clearly show that cells expressing your gene
of interest lead to an increased number of distant metastases. Moreover, the results are statistically
significant. Superb!
Now, on closer inspection, there are a few issues with this figure and its caption. Can you spot them?
If not, ask yourself:
1. Bar graphs are probably the most common way of representing data in the biomedical sciences, but are
they the best way to represent this kind of data? Why?
2. Is any information missing from the figure legend? What do the bars represent? And the error bars?
Why is this important?
3. How were the reported p-values obtained? Does it really matter?

Fig. 2.1: Zebrafish experiment design

Fig. 2.2: (Typical) Representation of experiment results

Case 2.

In another experiment for your project, you investigate the effect of drug X on EFNA3 mRNA levels
using human umbilical vein endothelial cells (HUVEC) as a model. To this end, you seed two culture
plates with HUVEC from the same donor and treat one plate with the drug and the other with
vehicle alone. Then, following the standard procedures in your lab, you extract total RNA from both
culture plates and determine the transcript abundance by RT-qPCR in three aliquots from each RNA
sample (see figure 2.3). The experiment shows a clear effect of the drug on the EFNA3 mRNA level, but
your supervisor is unconvinced, claiming that you must validate this result in additional independent
experiments. Why? You already have three replicates, aren't they enough? Taking the advice, you
repeat the whole experiment three more times on different dates, each time using a new set of
reagents and independent batches of cells from different donors. So you end up having 4 experiments
x 3 RT-qPCR replicates = 12 determinations of the EFNA3 transcript in each condition (control and
drug). A lot of data, good job, time to show your results to your supervisor... wait! How are you
going to process this data? Should you plot all 12 data points? What is the n of your experiment?
Does it really matter? If so, why?

Fig. 2.3: qRT-PCR experiment

3 Files we will use

- Zebrafish_data.xlsx (case 1)
- Zebrafish_data.csv (case 1)
- qPCR_data.xls (case 2)
- antipyretic.xls (exercises)
- tumor_vol.csv (exercises)
- metabolites_diet.xls (exercises)

4 Reading data

4.1 Spreadsheets and csv format

To answer the questions regarding case 1, let's first take a look at the raw data that was generated in the
experiment described in figure 2.1.
Most researchers use spreadsheet software such as MS Office Excel or LibreOffice Calc to process data
and generate graphs. Thus, chances are you'll receive results in one of the file types generated by
these software packages, such as .xls or .xlsx files. However, these kinds of files are not universally
readable by other software packages. For example, R cannot read them directly [1]. Fortunately, almost
all spreadsheet software packages can also save data in a plain text format in which individual data
values in columns are separated by commas, hence its name, "comma separated values", and its extension,
.csv [2]. The main advantage of this format is that it can be read by almost any software package.
Here are the steps you have to follow in MS Excel to save data in csv format (the steps in other software
packages will be very similar):
1. Open the file Zebrafish_data.xls with MS Excel or LibreOffice.
2. Using the Save as option in the File menu, select CSV (Comma delimited) from the Save as type list.
3. Click Save.
You'll probably get some warnings informing you that some features are not compatible with this file
format. That's fine, just click Yes. Certainly this kind of format won't keep any of the fancy formatting
that you may have had in the spreadsheet (borders, highlighting,...) nor the formulas, but the raw data in each
cell (numbers and text) will be preserved [3].
You are done. If you open the file with any text editor you'll see rows of values delimited by commas, each
row corresponding to one row in the original Excel file. The comma separated values keep the same
order as the columns in the original file. Note that we can export data in this format using any
delimiter instead of commas, which can be useful in some cases [4]. In this regard, tabs are commonly
used as delimiters in text files instead of commas [5].

[1] There are dedicated packages that allow R to read and write these file formats.
[2] Since these files are just text files, you'll see them with the extension .txt as well.
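The role of the delimiter and of the decimal mark can be checked with a small round trip in R. The sketch below is only illustrative: the table `toy` and its values are made up, the files are written to temporary locations, and only base R functions (`write.table`, `read.table`) are used.

```r
# Hypothetical toy table standing in for a spreadsheet (values are made up)
toy <- data.frame(group = c("A", "A", "B"), value = c(1.5, 2.25, 3))

# Comma as delimiter, dot as decimal mark (the usual csv)
csv_file <- tempfile(fileext = ".csv")
write.table(toy, csv_file, sep = ",", dec = ".", row.names = FALSE, quote = FALSE)

# Tab as delimiter (often called tsv)
tsv_file <- tempfile(fileext = ".tsv")
write.table(toy, tsv_file, sep = "\t", dec = ".", row.names = FALSE, quote = FALSE)

# Semicolon as delimiter, comma as decimal mark (common where "," marks decimals)
csv2_file <- tempfile(fileext = ".csv")
write.table(toy, csv2_file, sep = ";", dec = ",", row.names = FALSE, quote = FALSE)

# The files are plain text: print the raw lines of the last one
print(readLines(csv2_file))

# Reading back works as long as sep and dec match what was written
from_csv  <- read.table(csv_file,  sep = ",",  dec = ".", header = TRUE)
from_tsv  <- read.table(tsv_file,  sep = "\t", dec = ".", header = TRUE)
from_csv2 <- read.table(csv2_file, sep = ";",  dec = ",", header = TRUE)
```

If the delimiter or decimal mark given to `read.table` does not match the file, the numbers typically end up in a single text column, which is why it is worth opening the file in a text editor first.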

4.2 Reading data in R commander

In the menu bar choose Data>Import data>from text file. Make sure you choose Import data and
not the Load dataset option (see number 1 in figure 4.1). Select the file Zebrafish_data.csv from the
directory where you downloaded it.
Give a descriptive name to your data [6] (e.g. Zebrafish_data) and make sure you indicate the appropriate
delimiter (comma, tabs, white space,...) which, in this case, is comma [7]. Finally, indicate whether the
decimal point character is a dot or a comma (see number 2 in figure 4.1).

Fig. 4.1: Reading data into R commander

[3] Also, if you have several sheets in your Excel document, only the active one will be exported to csv.
[4] For example, in Spain the comma is used as the decimal separator, so it is not a good idea to use commas
to delimit decimal data values in this case.
[5] This variation of the csv format is sometimes referred to as tab separated values or tsv, but many times
text files with the extension .csv use tabs.
[6] By default the name is set to Dataset.
[7] If you are not sure about the delimiter that was used, just open the file with any text editor (or word
processor) and look at its content (remember that csv files are just text files).

Fig. 4.2: R-code: Reading data from file


# Read the comma-delimited file; the first row holds the column names
ZF_data <- read.table("Zebrafish_data.csv", sep = ",", header = TRUE,
    na.strings = "NA", dec = ".")

Now the data has been loaded into the computer memory and is accessible to R commander, so we
can view and edit it by clicking on the appropriate buttons (View data set and Edit data set, located
right below the main menu of the R commander main window, see figure 6.4). Note that the name of the
data set appears in the Active data set area. Of course, we could have performed the same operation,
that is, loading Zebrafish_data.csv into memory, using R commands, as shown in figure 4.2. Note that,
in this case, the data is stored in the variable ZF_data.
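Once a table has been read, it is worth checking that it was imported as expected. The following sketch uses a hypothetical miniature file (two made-up groups with unequal numbers of fish) rather than the course file, so that it runs on its own; the column names only mimic those of the zebrafish table.

```r
# Hypothetical miniature csv: the second group has one more fish than the first,
# so the shorter column has an empty cell in the last row
mini_file <- tempfile(fileext = ".csv")
writeLines(c("pLoC,EFNA3",
             "12,25",
             "19,34",
             ",40"), mini_file)

# Empty cells and the string "NA" are both treated as missing values
mini <- read.table(mini_file, sep = ",", header = TRUE,
                   na.strings = c("NA", ""), dec = ".")

str(mini)                 # column names and types
head(mini)                # first rows of the table
dim(mini)                 # number of rows and columns
na_per_column <- colSums(is.na(mini))  # missing values per column
print(na_per_column)
```

When groups of different sizes are stored side by side, the shorter columns are padded with missing values (NA), which is why summaries of such tables report a count of NA's per column.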

5 Exploring data

5.1 Visualizing data

Fig. 5.1: Histograms to visualize data distribution

Let's take a peek at the data. Usually the best way to get familiar with the raw data is to plot
it. Choose Graphs in the menu bar and look at all the graph options. For example, it is always
useful to explore how data values are distributed. Histograms are a useful representation to visualize
the distribution of numerical data. Go to Graphs>Histograms and select the group of data to be
represented (pLoC, EFNA3,...). Then choose Options to decorate the graph with appropriate labels
(see figure 5.1). The resulting histogram represents the number of animals (y axis) that present a given
number of metastases (x axis). Please also try strip charts [8] from the Graphs menu (figure 5.1). In
this plot, each dot represents a single fish, so it allows the visualization of the number of metastases
recorded in each individual fish. If you repeat this process for the control (pLoC) and experimental
groups (EFNA3, NC1,...), you'll see that fish injected with cells expressing EFNA3 tend to have a higher
number of distant metastases.

[8] Sometimes termed dot plots.


Fig. 5.2: R-code: Comparison histograms vs strip charts


# Six panels in a 2 x 3 grid: histograms on top, strip charts below
par(mfrow = c(2, 3), cex = 0.5)
hist(ZF_data$pLoC, main = "pLOC", xlab = "number of distant metastasis",
    ylab = "number of animals", breaks = seq(0, 120, 20))
hist(ZF_data$EFNA3, main = "EFNA3", xlab = "number of distant metastasis",
    ylab = "number of animals", breaks = seq(0, 120, 20))
hist(ZF_data$NC1s, main = "NC1s", xlab = "number of distant metastasis",
    ylab = "number of animals", breaks = seq(0, 120, 20))
# Same x range in every strip chart so that the groups can be compared
stripchart(ZF_data$pLoC, main = "pLOC", xlab = "number of distant metastasis",
    method = "stack", xlim = c(0, 120), frame.plot = FALSE)
stripchart(ZF_data$EFNA3, main = "EFNA3", xlab = "number of distant metastasis",
    method = "stack", xlim = c(0, 120), frame.plot = FALSE)
stripchart(ZF_data$NC1s, main = "NC1s", xlab = "number of distant metastasis",
    method = "stack", xlim = c(0, 120), frame.plot = FALSE)

[Figure 5.2 output: top row, histograms for pLOC, EFNA3 and NC1s (x axis: number of distant metastasis; y axis: number of animals); bottom row, stacked strip charts for the same three groups (x axis: number of distant metastasis).]

Figure 5.2 shows the histograms and strip charts for a few selected groups along with the R code
used to generate the figure. These results are in general agreement with the figure we got from our
collaborator (see figure 2.2). However, you will also notice that figure 5.2 reveals a wide variation
in the number of metastases among animals injected with the same kind of cells and, importantly,
a large overlap between different conditions. Note that none of these features were evident in figure
2.2, and hence important information, such as the shape of the distribution and the spread of the raw
data, is obscured in that bar plot representation.

5.2 Summarizing data

The shortcomings of figure 2.2 indicated above are further aggravated by the lack of information regarding
the meaning of the bars and error bars in the legend provided with the figure. Although
it was not indicated, the bars in figure 2.2 represent the mean of the data values for each group and the
error bars the standard error of the mean (SEM). The mean and SEM are commonly used to summarize a
large amount of data (in this case, the number of metastases observed in each group of fish and
the uncertainty of this measure), but there are other ways to convey this information. Another common
measure of the center of the distribution is the median (the middle value when data values are sorted).
It is important to note that the median is a much more robust measure of center than the mean, that
is, outliers (isolated data points showing extreme values) have a more profound impact on the value of the
mean than on the median [9]. As for the measures of spread of data, the standard deviation (roughly the
mean deviation of each data value from the mean) and the interquartile range (IQR) [10] are commonly used
in descriptive statistics.

[9] As an example, compute the mean and median of the values 1,2,3,4,5 and 1,2,3,4,50.
[10] To compute the IQR, data is sorted in ascending order. The value below which 25% of the data values lie
is the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile) corresponds to the first
value above 75% of the data values (note that the median is the second quartile, Q2). Finally, the IQR is
calculated as Q3-Q1. Note that 50% of the data values are within the IQR. In fact, the IQR represents the
middle 50% of the data.
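The example proposed in footnote 9 can be worked through in R. This is a minimal sketch using the two sets of values given there; the object names are arbitrary.

```r
# Values taken from footnote 9
a <- c(1, 2, 3, 4, 5)
b <- c(1, 2, 3, 4, 50)   # same values, but the last one is an outlier

# Measures of center: the mean moves from 3 to 12, the median stays at 3
centers <- rbind(a = c(mean = mean(a), median = median(a)),
                 b = c(mean = mean(b), median = median(b)))
print(centers)

# Measures of spread: the standard deviation is inflated by the outlier,
# whereas the IQR (Q3 - Q1, default quantile definition in R) is not
spreads <- rbind(a = c(sd = sd(a), IQR = IQR(a)),
                 b = c(sd = sd(b), IQR = IQR(b)))
print(round(spreads, 2))
```

A single extreme value is enough to pull the mean and the standard deviation away from the bulk of the data, while the median and the IQR are unchanged in this example.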


Other summary statistics [11] commonly used as error bars are the SEM and confidence intervals (CI). It
should be noted, however, that SEM and CI are measures of the confidence in the determination of the
mean rather than a mere description of the data spread. Thus, they are used in inferential statistics [12].
We will see much more about SEM and CI during the course. Table 1 provides a summary of measures
of center and spread, and Cumming et al. [2007] provides an excellent review of the use
of error bars in experimental biology.
It is important to note that, although some of these statistics try to describe the same thing (e.g. the
center of the data distribution), their numerical values are usually not the same.
Because figure symbols and error bars can represent many different things, they are meaningless, or even
misleading, if the figure legend does not state what they are. Thus, always describe in the figure legend
the meaning of all the symbols included in the figure. Specifically, when representing measures of center and/or
spread/error you must always clearly state what they are.

Fig. 5.3: Getting summary statistics

To see the summary statistics for the zebrafish dataset, just go to Summaries under Statistics
in the main menu and then select Active data set (figure 5.3, up to step 1).
For each group you will get the minimum and maximum values (that is, the range), the values of the 1st
and 3rd quartiles (their difference is the interquartile range, IQR), the mean and the median (figure 5.4).

Alternatively, you can get a customized summary from the Numerical summaries option under Summaries
(see 2a in figure 5.3). Make sure you select all the groups (2b in figure 5.3) and then mark all the
statistics you want to be displayed (see 2c in figure 5.3), including the SEM (see 2d in figure 5.3). Note
that you can choose the quantiles to be calculated [13] (figure 5.5).
Please take a moment to explore the summary tables. Although the mean and median values within
each group are different, they do not differ much except in the case of NC1s. The distribution of
values in the NC1s group is right skewed (see figure 5.2) and, as a consequence, the value of the mean
is displaced to higher values. Thus, the use of the mean in figure 2.2, together with the lack of direct
information about data spread, misleads the reader into thinking that the number of disseminated cells
in NC1s is high above the control (pLoC) and, in fact, higher than in any other group. However, a
closer analysis reveals that the variability (data spread) in this group (see for example IQR or SD in
figure 5.5) was much higher than in the rest of the groups and, in fact, the first and second quartiles
are very similar between NC1s and pLoC, suggesting that the increased dissemination was observed
only in some animals. In contrast, EFNA3 seems to induce a generalized increase in the number of
distant metastases in all animals, albeit with different penetrance (see figure 5.2 and the summary
statistics in figure 5.5).

[11] A summary statistic is a single number summarizing a large amount of data.
[12] That is, to make inferences about populations using data drawn from the population.
[13] Remember that quantile 0.5 corresponds to the 2nd quartile, which is the median.

Fig. 5.4: Data summaries

summary(ZF_data)
##       pLoC           EFNA3            NC1             NC1s
##  Min.   : 3.00   Min.   : 0.00   Min.   : 3.00   Min.   :  7.0
##  1st Qu.:12.00   1st Qu.:25.00   1st Qu.:22.00   1st Qu.: 12.0
##  Median :19.00   Median :34.50   Median :28.00   Median : 23.0
##  Mean   :20.18   Mean   :33.75   Mean   :35.71   Mean   : 48.1
##  3rd Qu.:26.00   3rd Qu.:40.50   3rd Qu.:49.00   3rd Qu.: 85.0
##  Max.   :44.00   Max.   :89.00   Max.   :95.00   Max.   :120.0
##  NA's   :24      NA's   :13                      NA's   :20
##       NC2             NC2s
##  Min.   : 7.00   Min.   : 17.00
##  1st Qu.:22.00   1st Qu.: 29.00
##  Median :32.00   Median : 43.00
##  Mean   :35.97   Mean   : 49.72
##  3rd Qu.:42.00   3rd Qu.: 64.00
##  Max.   :82.00   Max.   :115.00
##  NA's   :5       NA's   :16

Fig. 5.5: Customized data summary

##           mean       sd se(mean)  IQR 0% 25%  50%  75% 100%  n NA
## EFNA3 33.75000 15.48625 2.926627 15.5  0  25 34.5 40.5   89 28 13
## NC1   35.70732 21.56298 3.367572 27.0  3  22 28.0 49.0   95 41  0
## NC1s  48.09524 42.76553 9.332204 73.0  7  12 23.0 85.0  120 21 20
## NC2   35.97222 18.67999 3.113332 20.0  7  22 32.0 42.0   82 36  5
## NC2s  49.72000 26.66227 5.332454 35.0 17  29 43.0 64.0  115 25 16
## pLoC  20.17647 12.07391 2.928354 14.0  3  12 19.0 26.0   44 17 24
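A per-group summary of this kind can also be sketched in base R. The example below is an illustration under stated assumptions, not the code R commander runs: the table `wide` and its values are made up, and `describe` is a hypothetical helper that drops missing values before computing each statistic.

```r
# Hypothetical wide table (made-up values); the shorter group is padded with NA
wide <- data.frame(ctrl  = c(3, 12, 19, 26, NA, NA),
                   treat = c(7, 12, 23, 85, 120, 30))

# Hypothetical helper: summary statistics of one column, ignoring missing values
describe <- function(x) {
  x_obs <- x[!is.na(x)]           # observed values only
  n <- length(x_obs)
  c(mean   = mean(x_obs),
    sd     = sd(x_obs),
    sem    = sd(x_obs) / sqrt(n), # standard error of the mean
    median = median(x_obs),
    IQR    = IQR(x_obs),
    n      = n,
    n_NA   = sum(is.na(x)))
}

# Apply the helper to every column; one row per group after transposing
custom_summary <- t(sapply(wide, describe))
print(round(custom_summary, 2))
```

Note that n counts only the observed values, so the SEM of each group is computed with its own sample size.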

5.3 Types of variables and representation of numerical variables

As we have seen above, and in spite of being widely used (and abused) in cellular and molecular biology
publications, bar plots are not the most convenient way to represent numerical variables. A much more
appropriate graphical representation is the box plot, where the distribution of data is clearly visible.
Box plot. To build a box plot, numerical data is sorted in ascending order so that the first quartile (Q1, 25th
percentile), the median (i.e. second quartile or 50th percentile) and the third quartile (Q3, 75th percentile)
can be calculated. Then, the first step is drawing a dark line denoting the median. Next, a rectangle that goes
from Q1 to Q3, and thus represents the middle 50% of the data, is plotted. Finally, error bars or whiskers
are plotted so that they extend to the maximum/minimum values contained within 1.5 times the IQR from
Q1 or Q3 [a] (i.e. they go to the most extreme values contained within Q1-1.5*IQR and Q3+1.5*IQR). Any value
outside the [Q1-1.5*IQR, Q3+1.5*IQR] interval is shown as a dot to make it easier to spot potential outliers.

[a] Actually this is just one common definition of whiskers (Tukey), but there are others. In the Spear definition
whiskers extend to the minimum and maximum values; in the Altman definition, whiskers extend to the 5th and 95th
percentiles.
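The construction described above can be followed step by step in R. This is a sketch on a made-up vector `x`; it uses `quantile()` with its default definition, whereas R's own `boxplot()` relies on Tukey's hinges, which can differ slightly from these quartiles for some sample sizes.

```r
# Hypothetical data with one extreme value
x <- c(2, 4, 5, 7, 8, 9, 11, 12, 14, 40)

# Quartiles and interquartile range
q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
q1 <- q[1]; med <- q[2]; q3 <- q[3]
iqr <- q3 - q1

# Tukey fences: limits beyond which values are drawn as individual dots
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme data values that are still inside the fences
inside   <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)

# Values outside the fences are potential outliers
outliers <- x[x < lower_fence | x > upper_fence]

print(c(Q1 = q1, median = med, Q3 = q3, IQR = iqr))
print(whiskers)
print(outliers)

# For comparison: the outliers reported by the statistics behind boxplot()
print(boxplot.stats(x)$out)
```

Note that the whiskers stop at actual data values, not at the fences themselves.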

Now, let's represent the zebrafish data using box plots [14] (at least the three groups in figure 5.2) [15].
Compare the graphs with the numbers in the summary statistics (figures 5.4 and 5.5). As you'll see, this
representation is much more informative than the bar plot in figure 2.2.
Box plots are an extremely efficient way to represent numerical data, but unfortunately they are seldom used
in biomedical publications. One reason is that standard spreadsheet software tools, such as MS
Excel, do not produce this kind of graph. To overcome this problem and promote the use of this kind
of representation, some researchers (Spitzer et al. [2014]) have created web-based tools that generate
box plots from user-provided data. One such tool is BoxPlotR (http://boxplot.tyerslab.com/).

[14] Box plots are also known as box-and-whisker plots.
[15] To do so, just follow the steps shown in figure 5.1 but choose Boxplot in the Graphs dropdown menu.

Tab. 1: Summary statistics

name | calculated as | represents (measures) | notes
---- | ------------- | --------------------- | -----
median | 50th percentile (Q2) | center of data | robust
interquartile range | $IQR = Q_3 - Q_1$ | spread of data | robust
range | min, max | spread of data | sensitive to outliers
mean | $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ | center of data | sensitive to outliers
variance | $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$ | spread of data | descriptive
standard deviation | $s = \sqrt{s^2}$ | spread of data | descriptive
standard error of the mean | $SE_{\bar{x}} = \frac{s}{\sqrt{n}}$ | uncertainty of the estimate of the mean | inferential
confidence interval | $\bar{x} \pm \text{MarginError}$ | uncertainty of the estimate of the mean | inferential
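The formulas in Table 1 can be checked against R's built-in functions. The sketch below uses made-up values; the margin of error is computed here from the t distribution at the 95% level, which is an assumption of this example (Table 1 names the margin of error without defining it).

```r
# Hypothetical sample (made-up counts)
x <- c(12, 19, 26, 3, 44, 20)
n <- length(x)

# Mean, variance, standard deviation and SEM written out as in Table 1
x_bar <- sum(x) / n
s2    <- sum((x - x_bar)^2) / (n - 1)
s     <- sqrt(s2)
sem   <- s / sqrt(n)

# The hand-written formulas agree with the built-in functions
stopifnot(isTRUE(all.equal(x_bar, mean(x))),
          isTRUE(all.equal(s2, var(x))),
          isTRUE(all.equal(s, sd(x))))

# Assumed definition for this sketch: 95% margin of error from the t distribution
margin <- qt(0.975, df = n - 1) * sem
ci <- c(lower = x_bar - margin, upper = x_bar + margin)

print(c(mean = x_bar, sd = s, sem = sem))
print(ci)
```

The standard deviation describes the spread of the individual values, whereas the SEM and the confidence interval shrink as n grows because they describe the uncertainty of the mean.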

By now you will probably have noticed that it is quite inconvenient to plot each experimental condition
(each group) separately. However, R commander does not seem to allow selecting/plotting more than
one condition at once. Why is that? Actually, the problem is the way the data was recorded (tabulated).
Note that we have a separate column for each condition, but we are actually measuring the same thing
(the same parameter) in all cases, namely the number of distant metastases. Ask yourself the following
questions:
How many variables are present in the zebrafish dataset? [16] What type of variables are they? [17]
In data analysis, columns are used to record different parameters (different variables or attributes)
of each case/observation and rows contain the information about individual cases/observations. For
this reason R commander (actually R) does not allow us to plot more than one condition at once. Since
the groups are in different columns, they are interpreted as distinct variables, and histograms (and also
strip charts and box plots) represent univariate ("one variable") data. We can avoid this inconvenience
by rearranging the data in a more logical way (at least from the data analysis point of view). We'll
see how in the next section. Before that, it is important to review the different types of variables:
Types of variables
There are two main types of variables: numerical (what we record is a number that represents a measurable
quantity, e.g. the height of an individual or the pg of mRNA per cell) and categorical (what we record is a
qualitative characteristic, e.g. eye color or gender). The different values that a categorical variable can
take are referred to as levels (e.g. the categorical variable gender has only two levels, male and female). The
levels can show a natural order (e.g. level of education: elementary school, high school, college graduate, PhD),
in which case the variable is termed ordinal categorical, or be unordered (e.g. gender), in which case we call it
a nominal categorical variable (or just categorical). Numerical [a] variables can be continuous (e.g. temperature)
or discrete (e.g. number of siblings). Note that the labels of some categorical variables are numbers (e.g. zip
codes), which can mislead us into classifying them as numerical. As a rule of thumb, with numerical variables it
is sensible to add, subtract, or take averages of their values. In contrast, it does not make sense to apply
arithmetic operations to categorical variables (what would be the meaning of adding two zip codes?).

[a] Numerical variables are often subdivided into two types: ratio scale, variables that have a true 0 point below
which you cannot have any data (e.g. length of bacterial cilia); and interval scale, where the 0 point is arbitrary
so negative values are possible (e.g. temperature measured in degrees Celsius). For this reason, you may find
numerical variables referred to as ratio or scale variables as well.
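These types map onto R data types fairly directly. The following sketch is illustrative only: all values, including the zip codes, are made up, and the object names are arbitrary.

```r
# Numerical variables
height_cm <- c(172.5, 160.1, 181.0)   # continuous
siblings  <- c(0L, 2L, 1L)            # discrete (integer counts)

# Nominal categorical variable: a factor with unordered levels
eye_color <- factor(c("brown", "blue", "brown"))

# Ordinal categorical variable: a factor whose levels have a natural order
education <- factor(c("high school", "PhD", "elementary school"),
                    levels = c("elementary school", "high school",
                               "college graduate", "PhD"),
                    ordered = TRUE)

# Numbers used as labels: store them as text (or a factor), not as numbers
zip_code <- c("28029", "28049", "28029")

print(levels(eye_color))              # the levels of the nominal variable
print(education[1] < education[2])    # order comparisons make sense for ordinal data
print(mean(height_cm))                # averages make sense for numerical data
print(is.numeric(zip_code))           # FALSE: arithmetic on zip codes is not meaningful
```

Storing label-like numbers as text prevents R from silently averaging them.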

6 Managing data

6.1 Stacking data and factors in R

As indicated above, in the zebrafish experiment we only determined a single type of parameter (number
of metastases). Thus, all the data points belong to the same variable and, accordingly, we should
stack all values in a single column. This column contains the observed number of disseminated
cells per animal and thus it is a numerical variable (it would make sense to say that we see 1.5 times
more disseminated cells in fish X than in Y or that the average number of disseminated cells per fish is
234). After stacking data into a single column, values from each condition appear as a contiguous set
adjacent to the others. Thus, to keep track of where (which group of fish) each data value comes from, we add
another column (another variable or attribute) that records the experimental condition. This second
variable is categorical (it defines a qualitative, non-measurable attribute of each case). In R, categorical
variables are called factors. As mentioned before, the possible values of a categorical
variable (or factor) are called levels. In this case, the variable has six levels (pLoC, EFNA3, NC1,
NC1s, NC2 and NC2s).

[16] The original data table contained 6 columns, one for each group of animals. However, we are only measuring
one parameter, namely the number of distant metastases, which is one of the variables. On the other hand, we have
fish injected with different cell lines. Thus, the cell line used to inject the fish is another variable. In total
we have 2 variables.
[17] As explained below, although there are several types that are classified in slightly different ways by
different authors, in general variables can be of two major types: numerical and categorical. The differences are
explained in the text. In our case, the number of distant metastases is a numerical variable and the type of cell
line used in the experiment is a categorical variable.

Fig. 6.1: Stacking data

Stacking columns is very easy, just go to Data>Active dataset>Stack variables. Make sure you select
all columns and provide a name for the new dataset (by default StackedData). Note that a new
dataset is created and set as the active dataset (see figure 6.1). By clicking on View dataset you will see
that StackedData has two columns: the first one contains the stacked data values of distant metastases
and the second one contains the conditions (see figure 6.1). Remember that this second column is a
categorical variable or factor.
The resulting table is called a data matrix and it is the standard way to record and organize data. In a
data matrix each row corresponds to a unique case and each column corresponds to a variable.
Now that we have our data organized in a data matrix, we can plot all experimental conditions at once.
For example, generate a strip chart or box plot from the StackedData data matrix. Note that there
is a single (numerical) variable to plot (called variable). Select Plot by factor and choose the factor
(in this case there is a single categorical variable, that is, a single factor, termed factor). With this
information the software will partition the numerical data according to the different levels of the
categorical variable and draw a box plot for each subset. The R code that stacks the columns and generates
a box plot of the distant metastases in each experimental condition is shown in figure 6.2.
Compare this graph with the one in figure 2.2. In particular, note that the box plot is a much more
informative representation of the values in the NC1s group.
Exercise. Generate a strip chart of the zebrafish data and compare it to figure 6.2. Which type of
representation is more informative? [18]
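What stacking does to a table can be seen on a small example. This sketch uses a made-up two-column table in place of the zebrafish data; the names `variable` and `factor` mirror the ones mentioned above, and dropping the rows with missing values is a choice made here for clarity.

```r
# Hypothetical wide table: one column per condition, padded with NA
wide <- data.frame(pLoC  = c(3, 12, NA),
                   EFNA3 = c(25, 34, 40))

# stack() puts all values in one column and records the source column in a factor
long <- stack(wide)
names(long) <- c("variable", "factor")

# Drop the padding rows, which carry no measurement
long <- na.omit(long)

print(long)                 # one row per fish: value plus condition
print(levels(long$factor))  # the levels of the categorical variable
print(table(long$factor))   # number of cases per condition
```

After stacking, each row is one case and each column one variable, so a single call such as `boxplot(variable ~ factor, data = long)` can plot every condition at once.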

6.2 Subsetting and aggregating data

Let's focus now on the second case (see figure 2.3). Save the data from the file qPCR_data.xls as a comma
separated text file (qPCR_data.csv) and load it into R commander. This data set contains the results
of all four experiments (figure 2.3) arranged in three columns: Experiment, Treatment and EFNA3_level_AU.
Note that this is the correct data arrangement.
[18] In this case, both types of representation could be used. Strip charts are in general preferred over box
plots when a small number of numerical values is represented. If we have, for example, only 3 values it would
not make much sense to represent quartiles. On the other hand, when a large number of values is represented,
box plots tend to provide a much cleaner visualization of the data distribution.


Fig. 6.2: R-code: Stacking data

# Stack the six group columns into one column of values plus one factor column
StackedData <- stack(ZF_data[, c("EFNA3", "NC1", "NC1s", "NC2", "NC2s", "pLoC")])
names(StackedData) <- c("variable", "factor")
# One box per level of the factor
boxplot(variable ~ factor, ylab = "Number of distant metastasis",
    data = StackedData)

[Figure 6.2 output: box plots of the number of distant metastasis (y axis) for EFNA3, NC1, NC1s, NC2, NC2s and pLoC (x axis).]

Exercise. How many variables does this data matrix contain? [19] What type are they? [20] Indicate the levels of
the categorical variables, if any [21].

[19] 3
[20] The first two columns (Experiment and Treatment) represent categorical variables (factors) and the last one,
EFNA3_level_AU, is a numerical variable.
[21] Experiment has four levels: qpcr0809, qpcr0109, qpcr2508 and qpcr2506; Treatment has two levels: vehicle and
drugX.


Fig. 6.3: Subsetting data sets

Let's start by exploring the results obtained in the first experiment, qpcr2506. To that end we can
just extract the corresponding rows (cases) from the data matrix (i.e. subsetting). In R commander
go to Data>Active dataset>Subset active dataset. In the pop-up window you have to indicate the
subsetting criterion (it is set by default to <all cases>); in this case you want to select those rows
whose value for the Experiment variable is qpcr2506. The expression that encapsulates this criterion is: Experiment=="qpcr250614". Note that in R we use two equal symbols (==) as the equality
operator22. Finally, make sure you check "include all variables" (so that the resulting dataset contains
all three columns) and give a new name to the resulting dataset, for example FirstExp (see figure 6.3).
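
The subsetting criterion can be sketched on a toy example. The data frame below is hypothetical (made-up names and values, not the course data); it only illustrates that == returns one TRUE/FALSE per row and that subset() keeps the rows where the result is TRUE.

```r
# Hypothetical toy data, only to illustrate the equality operator
toy <- data.frame(
  Experiment = c("expA", "expA", "expB", "expB"),
  Treatment  = c("vehicle", "drugX", "vehicle", "drugX"),
  Level      = c(100, 40, 110, 55)
)

# '==' compares every value of Experiment with "expA" and returns a logical vector
is_first <- toy$Experiment == "expA"
print(is_first)

# subset() keeps only the rows for which the condition is TRUE
FirstToy <- subset(toy, Experiment == "expA")
print(FirstToy)
```

Writing Experiment = "expA" (a single equal symbol) inside subset() would not perform a comparison, which is why the double symbol matters here.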

Fig. 6.4: Selecting active data set

Note that the active data set is now FirstExp. However, the original data matrix is still loaded in
memory and you can switch between the different data sets by clicking on the active Data set button
(figure 6.4).
You can now represent the data (you already know how to do it!).
What kind of graphical representation will you use?23
22 The use of two contiguous = symbols as the equality operator is common to many programming languages. It is a frequent
mistake to use a single equal symbol (the assignment operator) instead. Note that the equality operator (==) compares the expressions on its
right and left and returns TRUE when they are equal or FALSE otherwise. The assignment operator (=) assigns
the value on the right to the name on the left.
23 First, ask yourself what type of variable you are representing. In this case it is a numerical variable (EFNA3 level);
note that we use the categorical variable (Treatment) to divide the numerical variable into two groups (vehicle and drugX). Now, when
we want to represent a single (univariate) numerical variable we use either histograms, strip charts or box plots. Histograms are


Fig. 6.5: R-code: Subsetting data

# Read the comma-separated file; the first row contains the column names
qPCRdata <- read.table("qPCR_data.csv", sep = ",", header = TRUE)
# Keep only the rows (cases) from the first experiment
FirstExp <- subset(qPCRdata, Experiment == "qpcr250614")
with(FirstExp, stripchart(EFNA3_level_AU ~ Treatment, method = "jitter",
    vertical = TRUE, ylab = "EFNA3 mRNA level (AU)", xlab = "Treatment"))

[Figure: strip chart of EFNA3 mRNA level (AU) by Treatment (drugX, vehicle).]

The R code to produce this graph is shown in figure 6.5. These data look nice: tightly packed replicates
and a large difference between the two groups (much larger than the difference between values within
each group). The difference between groups will definitely be significant (we'll see how to calculate it
in the next lesson).

mainly used when our focus is the shape of the distribution. Strip charts and box plots are very useful to compare two or more
subgroups of a single numerical variable. Finally, when the number of values in each group is small, it doesn't make much sense
to represent quartiles, etc. Thus for groups of just a few values, as is our case here, we will preferentially use strip charts.
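
To see why quartiles say little for very small groups, consider this minimal sketch with three made-up values (the vector few is hypothetical). The function fivenum() returns the five numbers a box plot is built from; with only three values the hinges simply fall halfway between neighbouring data points, so the box adds no information that the three dots do not already show.

```r
# Three hypothetical measurements (made-up values)
few <- c(10, 20, 60)

# minimum, lower hinge, median, upper hinge, maximum
box_numbers <- fivenum(few)
print(box_numbers)

# A strip chart shows the three values directly
stripchart(few, method = "jitter", vertical = TRUE, ylab = "Value")
```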


Fig. 6.6: Biological Replicates from ref. Vaux et al. [2012]

Now, what is the meaning of these replicates? How generalizable is this result? The replicates in this
case are just independent measures of the same RNA samples (those coming from experiment 2506),
thus their variance represents the technical variability of our RT-qPCR procedure (pipetting errors,
non-uniform measurement of the PCR plate by the optical device,...) but it does not capture the biological
variability, because we measured RNA coming from a single sample. For example, if the basal level
of the EFNA3 mRNA were variable (sensitive to culture conditions, culture age, cell confluence,...),
this variability could explain the observed difference rather than the treatment. It could also happen
that the response to treatment is non-uniform between individuals (remember HUVEC cells are primary
cells from a human donor, see figure 2.3) and, just by chance, we could have bumped into a donor
whose endothelial cells show an extreme response to the drug. For all these reasons, we need several
independent biological replicates (in addition to technical replicates) of the experiment. It is important
to randomize all possible sources of variability in the set of independent experiments. The importance
of replication is clearly and concisely explained in reference Vaux et al. [2012]; see also figure 6.6,
extracted from it.
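
The distinction between the two sources of variability can be illustrated with a small sketch. All names and numbers below are hypothetical (they are not the course data): each of four independent samples is measured three times, and we compare the spread among technical replicates of one sample with the spread among the sample means.

```r
# Hypothetical data: 3 technical replicates from each of 4 independent samples
toy <- data.frame(
  Sample = rep(c("s1", "s2", "s3", "s4"), each = 3),
  Level  = c(100, 102, 98,  70, 71, 69,  130, 128, 132,  90, 91, 89)
)

# Spread among the technical replicates of each sample (technical variability)
technical_sd <- tapply(toy$Level, toy$Sample, sd)

# Spread among the sample means, one value per independent sample
sample_means <- tapply(toy$Level, toy$Sample, mean)
between_sd <- sd(sample_means)

print(technical_sd)
print(between_sd)
```

In this made-up example the replicates of any single sample are tightly packed, yet the samples differ widely from one another; looking at one sample alone would give no hint of that.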

Fig. 6.7: Aggregate data

Fig. 6.8: R-code: Aggregate data

# Mean of the technical replicates for each Experiment/Treatment combination
qPCRmeans <- aggregate(qPCRdata[, c("EFNA3_level_AU"), drop = FALSE],
    by = list(Experiment = qPCRdata$Experiment, Treatment = qPCRdata$Treatment),
    mean)
with(qPCRmeans, stripchart(EFNA3_level_AU ~ Treatment, method = "jitter",
    vertical = TRUE, ylab = "EFNA3 mRNA level (AU)", xlab = "Treatment"))

[Figure: strip chart of the mean EFNA3 mRNA level (AU) of each experiment by Treatment (drugX, vehicle).]

Thus, we must represent all experiments, but how? Should we represent all 4x3 individual data points
for each condition? Well, you have to ask: what is the n of your experiment? Since we want to
make inferences about the biological population, our n is the number of independent biological
experiments, in this case n=4. Thus, we have to calculate the average of the technical replicates for
each experiment and represent those averages. If we represent all 12 data points we are implying n=12, which is
not the case. We can easily aggregate (i.e. summarize) the data in the table according to a given factor.
In this case, we want to aggregate data by calculating the mean of each group according to experiment
date. On the menu bar select Data>Active data set>Aggregate variables in active data set. Then, in
the pop-up window, indicate the name for the aggregated data set (by default AggregatedData), select
the variable you want to aggregate (in this case EFNA3_level_AU), the factors you'll use to aggregate
it (in this case both Experiment and Treatment) and finally select the function to aggregate data
(in this case we're interested in calculating the mean). See figure 6.7 to follow these instructions
in R commander and figure 6.8 for the R-code. Note that all we have done is to calculate the mean of the
instances (rows) that had the same value for the Experiment and Treatment variables (the ones we
used to aggregate data).
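
The effect of aggregating on the n can be checked with a minimal sketch. The data frame below is hypothetical (made-up names and values with the same layout: 4 experiments, 2 treatments, 3 technical replicates), and it uses the formula interface of aggregate() rather than the by = list(...) form generated by R commander.

```r
# Hypothetical data: 4 experiments x 2 treatments x 3 technical replicates
toy <- expand.grid(
  Replicate  = 1:3,
  Treatment  = c("vehicle", "drugX"),
  Experiment = c("exp1", "exp2", "exp3", "exp4"),
  stringsAsFactors = TRUE
)
# Made-up measurements (no biological meaning)
toy$Level <- ifelse(toy$Treatment == "vehicle", 100, 50) +
  10 * as.integer(toy$Experiment) + toy$Replicate

# One mean per Experiment/Treatment combination
toyMeans <- aggregate(Level ~ Experiment + Treatment, data = toy, FUN = mean)
print(toyMeans)

# Number of points per treatment after aggregating: this is the n
print(table(toyMeans$Treatment))
```

Before aggregating there are 12 rows per treatment; afterwards there are 4, one per independent experiment.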

Summary
1. When analyzing experimental data it is good practice to:
(a) order data into a data matrix

(b) perform a basic descriptive analysis of the raw data including summaries and graphs.
2. There are two main types of variables: numerical and categorical. This is an important distinction as,
among other things, it will determine the type of statistical tests that can be applied to them. It also
affects the kind of graphs that are most appropriate to represent them, as shown below.
3. It is important to use an appropriate type of graphical representation. Quit using bar plots as an all-purpose graphical representation. Below is a non-comprehensive list of the most common types of graphical
representation of data according to their type24:
(a) Numerical variables
i. Representation of a single numerical variable:
A. Histogram. Commonly used to represent the shape of the distribution of a numerical variable.
B. Box plots (box-and-whisker plot). Useful and efficient way of representing the center and
spread of a set of values and to compare between different subsets of numerical values corresponding to different groups.
C. Strip chart (dot plot). Same as box plots. It is preferred over box plots for the representation
of a reduced number of data values.
ii. Representation of two numerical variables:
A. Scatter plot (see lessons 2 and 3)
(b) Categorical variables
i. Representation of a single categorical variable:
A. Bar plots. Note that, although bar plots are frequently used in bioscience publications to
represent numerical data, this kind of graph is intended for the representation of the number
of cases in each level of a categorical variable. We'll see much more about representation of
categorical variables in lesson 4.
B. Pie chart (see lesson 4)
ii. Representation of more than one categorical variable:
A. Segmented bar plots (see lesson 4)
B. Mosaic plots (see lesson 4)
4. Make sure that you correctly identify the number of independent repeated experiments (biological replicates) you are dealing with. It is not acceptable to use technical replicates (or a mix of technical and
biological replicates) to calculate statistics, error bars and make inferences.
5. Figure legends must always indicate what statistics are represented. In particular, when present, the legend should
clearly state which measure of center and which measure of spread/error are represented. In addition, the number of
independent biological replicates (the n) should be included in the figure legend.
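
As a reminder of the intended use of bar plots mentioned in point 3, the sketch below counts the cases in each level of a categorical variable and plots those counts. The vector treatment is hypothetical (made-up values, not the course data).

```r
# Hypothetical categorical variable (made-up values)
treatment <- factor(c("vehicle", "drugX", "vehicle", "drugX", "drugX"))

# Number of cases in each level
counts <- table(treatment)
print(counts)

# A bar plot represents these counts, one bar per level
barplot(counts, ylab = "Number of cases", xlab = "Treatment")
```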

Exercises

Please complete the following exercises and answer the corresponding questions in the Self-evaluation
test on the Moodle page of the course.
Exercise 1. You are measuring the volume of tumors in a set of mice treated with a new drug and write them
down on a piece of paper. The file tumor_vol.csv contains the data. Now, you measure a new tumor and
its size is 0.051 mm³. If by mistake you skip the first decimal digit and enter it as 0.51 instead of 0.051, how
will the mean be affected? And the median? Import the data and edit it to generate two new data sets, one
including the value 0.51 and another including the value 0.051. Calculate summary statistics and represent the
two data sets using box plots and strip charts. In the case of the data set containing the incorrect measure,
which statistic (mean or median) best describes the data (i.e. is closer to the statistic calculated from the data set
with the correct measure)?
Exercise 2. Match each of the elements in a box plot with their meaning.

Element                 Definition
Dark line within box    Standard deviation
Dots                    most extreme value within 1.5 times the IQR from the mean
Whiskers                mean
Length of the box       most extreme value within 1.5 times the IQR from the median
                        50th percentile
                        any value farther than 1.5 times the IQR from the mean
                        erroneous data values

24 Although we have not seen many of these graphical representations yet, we include them in this summary for completeness of
the explanation and for your reference.

Exercise 3. Could you match histograms A-C to boxplots d1-d3 (figure 8.1)? How does the value of the mean
compare to the median in each group? And between groups? How does the IQR compare between groups?

Fig. 8.1: Exercise 3 figure

Exercise 4. To analyze the effect of diet on the level of a lipid metabolite (lipid X), 6 mice were randomly
divided into two groups and the animals in each group were fed with regular chow or high fat diet (HFD) for 10
weeks. Then, animals were euthanized and the level of the lipid metabolite was determined in four independent
1 mg samples from the liver and gastrocnemius muscle of each mouse. The file metabolite_diet.xls contains
the amount of lipid (in µg/mg tissue) in each sample. Reorder the data to generate a data matrix (do it
manually by copying-pasting data sets) and save it as a comma-separated text file. How many rows does the
resulting data matrix contain? What type is each variable? How many cases does it contain? Keeping in mind
that with this experiment we wanted to know whether the HFD increases the content of lipid X in the liver
(and/or the muscle), what is the n (independent biological replicates) of each treatment group? Make a plot
of the results.

References
Geoff Cumming, Fiona Fidler, and David L Vaux. Error bars in experimental biology. The Journal of Cell
Biology, 177(1):7–11, 2007. ISSN 0021-9525. doi: 10.1083/jcb.200611141. URL http://jcb.rupress.org/cgi/content/long/177/1/7.

Michaela Spitzer, Jan Wildenhain, Juri Rappsilber, and Mike Tyers. BoxPlotR: a web tool for generation
of box plots. Nature Methods, 11(2):121–122, Feb 2014. ISSN 1548-7105. doi: 10.1038/nmeth.2811. URL
http://www.ncbi.nlm.nih.gov/pubmed/24481215.
David L Vaux, Fiona Fidler, and Geoff Cumming. Replicates and repeats - what is the difference and is it
significant? A brief discussion of statistics and experimental design. EMBO Reports, 13(4):291–296, 2012.
ISSN 1469-221X. doi: 10.1038/embor.2012.36. URL http://dx.doi.org/10.1038/embor.2012.36.
