Data Analysis For The Life Sciences With R 1st Edition

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 61

Data Analysis for the Life Sciences with

R 1st Edition
Visit to download the full and correct content document:
https://ebookmass.com/product/data-analysis-for-the-life-sciences-with-r-1st-edition/
Statistics

Irizarry ■ Love
Genomics is being driven by new measurement technologies that permit us to ob-
serve certain molecular entities for the first time. These observations are leading
to discoveries analogous to identifying microorganisms and other breakthroughs

Data Analysis
permitted by the invention of the microscope. Choice examples of these tech-
nologies are next generation sequencing and microarrays. This book was written
for the many life science researchers who are becoming data analysts due to the
emergence of these new types of data.

for the

Data Analysis for the Life Sciences with R


Although the content of the book is mostly focused on advanced statistical con-
cepts, the basics are covered to make sure readers have a strong grounding on

Life Sciences
the fundamental statistical concepts required for all data analysis. The book be-
gins with statistical inference and then proceeds to an introduction to linear mod-
els and matrix algebra, high-dimensional data, distance and dimension reduction,

with R
and batch effects and factor analysis.
The emphasis is on using a computer to perform data analysis. All sections of
this book are reproducible as they were made with R markdown documents that
include the code used to produce the book’s figures, tables and results.
Rafael A. Irizarry is Professor of Applied Statistics at the Dana Farber Cancer
Center and Harvard School of Public Health. In 2009 he was awarded The Pres-
idents' Award by the Committee of Presidents of Statistical Societies (COPSS).
His work has been highly cited and his open source software tools widely down-
loaded.
Michael I. Love is a Postdoctoral Fellow at Harvard School of Public Health. He
received his Ph.D. in computational biology in 2013 from the Freie Universität
Rafael A. Irizarry
Berlin.
Professors Irizarry and Love have taught seven computational biology courses on
Michael I. Love
edX to hundreds of thousands of students.

K29691
6000 Broken Sound Parkway, NW
Suite 300, Boca Raton, FL 33487
711 Third Avenue
New York, NY 10017
2 Park Square, Milton Park
Abingdon, Oxon OX14 4RN, UK

w w w. c rc p r e s s . c o m A CHAPMAN & HALL BOOK

K29691_cover.indd 1 6/28/16 10:02 AM


Contents vii

6.5 Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184


6.6 Error Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.7 The Bonferroni Correction . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.8 False Discovery Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.9 Direct Approach to FDR and q-values (Advanced) . . . . . . . . . . . . . 195
6.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.11 Basic Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . 201
6.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

7 Statistical Models 209


7.1 The Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
7.2 The Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
7.3 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . 213
7.4 Distributions for Positive Continuous Values . . . . . . . . . . . . . . . . 215
7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
7.6 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
7.8 Hierarchical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

8 Distance and Dimension Reduction 237


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
8.2 Euclidean Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
8.3 Distance in High Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . 239
8.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
8.5 Dimension Reduction Motivation . . . . . . . . . . . . . . . . . . . . . . . 241
8.6 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 246
8.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
8.8 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
8.9 Rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
8.10 Multi-Dimensional Scaling Plots . . . . . . . . . . . . . . . . . . . . . . . 261
8.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
8.12 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . 267

9 Basic Machine Learning 273


9.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
9.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
9.3 Conditional Probabilities and Expectations . . . . . . . . . . . . . . . . . 281
9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
9.5 Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
9.6 Bin Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
9.7 Loess . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
9.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
9.9 Class Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
9.10 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
9.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
viii Contents

10 Batch Effects 305


10.1 Confounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
10.2 Confounding: High-Throughput Example . . . . . . . . . . . . . . . . . . 311
10.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
10.4 Discovering Batch Effects with EDA . . . . . . . . . . . . . . . . . . . . . 313
10.5 Gene Expression Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
10.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
10.7 Motivation for Statistical Approaches . . . . . . . . . . . . . . . . . . . . 323
10.8 Adjusting for Batch Effects with Linear Models . . . . . . . . . . . . . . . 325
10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
10.10 Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
10.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
10.12 Modeling Batch Effects with Factor Analysis . . . . . . . . . . . . . . . . 335
10.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342

Index 343
List of Figures

1.1 GitHub page screenshot. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4


1.2 Integral of a function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1 Empirical cummulative distribution function for height. . . . . . . . . . . 17


2.2 Histogram for heights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Illustration of the null distribution. . . . . . . . . . . . . . . . . . . . . . 19
2.4 Null distribution with observed difference marked with vertical red line. 20
2.5 Histograms of all weights for both populations. . . . . . . . . . . . . . . 27
2.6 Quantile-quantile plots of all weights for both populations. . . . . . . . . 27
2.7 Quantile versus quantile plot of simulated differences versus theoretical
normal distribution for four different sample sizes. . . . . . . . . . . . . . 32
2.8 Quantile versus quantile plot of simulated ratios versus theoretical normal
distribution for four different sample sizes. . . . . . . . . . . . . . . . . . 33
2.9 Quantile-quantile plots for sample against theoretical normal distribution. 38
2.10 We show 250 random realizations of 95% confidence intervals. The color
denotes if the interval fell on the parameter or not. . . . . . . . . . . . . 42
2.11 We show 250 random realizations of 95% confidence intervals, but now
for a smaller sample size. The confidence interval is based on the CLT
approximation. The color denotes if the interval fell on the parameter or
not. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.12 We show 250 random realizations of 95% confidence intervals, but now for
a smaller sample size. The confidence is now based on the t-distribution
approximation. The color denotes if the interval fell on the parameter or
not. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.13 Power plotted against sample size. . . . . . . . . . . . . . . . . . . . . . . 48
2.14 Power plotted against cut-off. . . . . . . . . . . . . . . . . . . . . . . . . 49
2.15 p-values from random samples at varying sample size. The actual value of
the p-values decreases as we increase sample size whenever the alternative
hypothesis is true. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.16 Histogram of 1000 Monte Carlo simulated t-statistics. . . . . . . . . . . . 55
2.17 Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics
to theoretical normal distribution. . . . . . . . . . . . . . . . . . . . . . . 56
2.18 Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics
with three degrees of freedom to theoretical normal distribution. . . . . . 56
2.19 Quantile-quantile plot comparing 1000 Monte Carlo simulated t-statistics
with three degrees of freedom to theoretical t-distribution. . . . . . . . . 57
2.20 Quantile-quantile of original data compared to theoretical quantile distri-
bution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.21 Histogram of difference between averages from permutations. Vertical line
shows the observed difference. . . . . . . . . . . . . . . . . . . . . . . . . 61
2.22 Histogram of difference between averages from permutations for smaller
sample size. Vertical line shows the observed difference. . . . . . . . . . . 62

ix
x List of Figures

3.1 First example of qqplot. Here we compute the theoretical quantiles our-
selves. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.2 Second example of qqplot. Here we use the function qqnorm which com-
putes the theoretical normal quantiles automatically. . . . . . . . . . . . 71
3.3 Example of the qqnorm function. Here we apply it to numbers generated
to follow a normal distribution. . . . . . . . . . . . . . . . . . . . . . . . 71
3.4 We generate t-distributed data for four degrees of freedom and make qq-
plots against normal theoretical quantiles. . . . . . . . . . . . . . . . . . 72
3.5 Histogram and QQ-plot of executive pay. . . . . . . . . . . . . . . . . . . 73
3.6 Simple boxplot of executive pay. . . . . . . . . . . . . . . . . . . . . . . . 73
3.7 Heights of father and son pairs plotted against each other. . . . . . . . . 74
3.8 Boxplot of son heights stratified by father heights. . . . . . . . . . . . . . 75
3.9 qqplots of son heights for four strata defined by father heights. . . . . . . 76
3.10 Average son height of each strata plotted against father heights defining
the strata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.11 Pie chart of browser usage. . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.12 Barplot of browser usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.13 3D version. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.14 Donut plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.15 Bad bar plots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.16 Treatment data and control data shown with a boxplot. . . . . . . . . . 82
3.17 Bar plots with outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.18 Data and boxplots for original data (left) and in log scale (right). . . . . 83
3.19 The plot on the left shows a regression line that was fitted to the data
shown on the right. It is much more informative to show all the data. . . 84
3.20 Gene expression data from two replicated samples. Left is in original scale
and right is in log scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.21 MA plot of the same data shown above shows that data is not replicated
very well despite a high correlation. . . . . . . . . . . . . . . . . . . . . . 86
3.22 Barplot for two variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.23 For two variables a scatter plot or a ‘rotated’ plot similar to an MA plot
is much more informative. . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.24 Another alternative is a line plot. If we don’t care about pairings, then
the boxplot is appropriate. . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.25 Gratuitous 3-D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.26 This plot demonstrates that using color is more than enough to distinguish
the three lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.27 Ignoring important factors. . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.28 Because dose is an important factor, we show it in this plot. . . . . . . . 90
3.29 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.30 Normally distributed data with one point that is very large due to a mis-
take. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.31 Scatterplot showing bivariate normal data with one signal outlier resulting
in large values in both dimensions. . . . . . . . . . . . . . . . . . . . . . 96
3.32 Scatterplot of original data (left) and ranks (right). Spearman correlation
reduces the influence of outliers by considering the ranks instead of original
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.33 Histogram of original (left) and log (right) ratios. . . . . . . . . . . . . . 98
3.34 Histogram of original (left) and log (right) powers of 2 seen as ratios. . . 98
List of Figures xi

3.35 Data from two populations with two outliers. The left plot shows the
original data and the right plot shows their ranks. The numbers are the
w values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.1 Simulated data for distance travelled versus time of falling object mea-
sured with error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2 Galton’s data. Son heights versus father heights. . . . . . . . . . . . . . . 105
4.3 Mouse weights under two diets. . . . . . . . . . . . . . . . . . . . . . . . 105
4.4 Fitted model for simulated data for distance travelled versus time of falling
object measured with error. . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.5 Residual sum of squares obtained for several values of the parameters. . 108
4.6 Galton’s plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.7 Galton’s data with fitted regression line. . . . . . . . . . . . . . . . . . . 121
4.8 Fitted parabola to simulated data for distance travelled versus time of
falling object measured with error. . . . . . . . . . . . . . . . . . . . . . 122

5.1 Mice bodyweights stratified by diet. . . . . . . . . . . . . . . . . . . . . . 135


5.2 Estimated linear model coefficients for bodyweight data illustrated with
arrows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.3 Distribution of estimated regression coefficients obtained from Monte
Carlo simulated falling object data. The left is a histogram and on the
right we have a qq-plot against normal theoretical quantiles. . . . . . . . 140
5.4 Distribution of estimated regression coefficients obtained from Monte
Carlo simulated father-son height data. The left is a histogram and on
the right we have a qq-plot against normal theoretical quantiles. . . . . . 141
5.5 Comparison of friction coefficients of spiders’ different leg pairs. The fric-
tion coefficient is calculated as the ratio of two forces (see paper Methods)
so it is unitless. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.6 Model matrix for linear model with one variable. . . . . . . . . . . . . . 150
5.7 Diagram of the estimated coefficients in the linear model. The green arrow
indicates the Intercept term, which goes from zero to the mean of the
reference group (here the ‘pull’ samples). The orange arrow indicates the
difference between the push group and the pull group, which is negative in
this example. The circles show the individual samples, jittered horizontally
to avoid overplotting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.8 Image of the model matrix for a formula with type + leg . . . . . . . . . 151
5.9 Diagram of the estimated coefficients in the linear model. As before, the
teal-green arrow represents the Intercept, which fits the mean of the
reference group (here, the pull samples for leg L1). The purple, pink,
and yellow-green arrows represent differences between the three other leg
groups and L1. The orange arrow represents the difference between the
push and pull samples for all groups. . . . . . . . . . . . . . . . . . . . . 153
5.10 Image of model matrix with interactions. . . . . . . . . . . . . . . . . . . 158
5.11 Diagram of the estimated coefficients in the linear model. In the design
with interaction terms, the orange arrow now indicates the push vs. pull
difference only for the reference group (L1), while three new arrows (yel-
low, brown and grey) indicate the additional push vs. pull differences in
the non-reference groups (L2, L3 and L4) with respect to the reference
group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
xii List of Figures

5.12 Image of model matrix for linear model with group variable. This model,
also with eight terms, gives a unique fitted value for each combination of
type and leg. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.13 Diagram of the estimated coefficients in the linear model, with each term
representing the mean of a combination of type and leg. . . . . . . . . . 165

6.1 P-value histogram for 10,000 tests in which null hypothesis is true. . . . 181
6.2 Normal qq-plots for one gene. Left plot shows first group and right plot
shows second group. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.3 Q (false positives divided by number of features called significant) is a
random variable. Here we generated a distribution with a Monte Carlo
simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.4 Histogram of p-values. Monte Carlo simulation was used to generate data
with m 1 genes having differences between groups. . . . . . . . . . . . . . 191
6.5 Histogram of p-values with breaks at every 0.01. Monte Carlo simulation
was used to generate data with m 1 genes having differences between
groups. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.6 Plotting p-values plotted against their rank illustrates the Benjamini-
Hochberg procedure. The plot on the right is a close-up of the plot on
the left. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
6.7 FDR estimates plotted against p-value. . . . . . . . . . . . . . . . . . . . 194
6.8 Histogram of Q (false positives divided by number of features called sig-
nificant) when the alternative hypothesis is true for some features. . . . . 195
6.9 p-value histogram with pi0 estimate. . . . . . . . . . . . . . . . . . . . . 197
6.10 q-values versus p-values. . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.11 P-value histogram. We show a simulated case in which all null hypotheses
are true (left) and p-values from the gene expression described above. . . 202
6.12 Histogram obtained after permuting labels. . . . . . . . . . . . . . . . . . 203
6.13 Boxplot for log-scale expression for all samples. . . . . . . . . . . . . . . 204
6.14 The 0.05, 0.25, 0.5, 0.75, and 0.95 quantiles are plotted for each sample. 204
6.15 Smooth histograms for each sample. . . . . . . . . . . . . . . . . . . . . . 205
6.16 Scatter plot (left) and M versus A plot (right) for the same data. . . . . 205

7.1 Number of people that win the lottery obtained from Monte Carlo simu-
lation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
7.2 MA plot of simulated RNA-seq data. Replicated measurements follow a
Poisson distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
7.3 MA plot of replicated RNA-seq data. . . . . . . . . . . . . . . . . . . . . 212
7.4 Variance versus mean plot. Summaries were obtained from the RNA-seq
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
7.5 Palindrome count histogram. . . . . . . . . . . . . . . . . . . . . . . . . . 214
7.6 Likelihood versus lambda. . . . . . . . . . . . . . . . . . . . . . . . . . . 215
7.7 Observed counts versus theoretical Poisson counts. . . . . . . . . . . . . 216
7.8 Histograms of biological variance and technical variance. . . . . . . . . . 217
7.9 Normal qq-plot for sample standard deviations. . . . . . . . . . . . . . . 218
7.10 Histograms of sample standard deviations and densities of estimated dis-
tributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
7.11 qq-plot (left) and density (right) demonstrate that model fits data well. . 220
List of Figures xiii

7.12 Simulation demonstrating Bayes theorem. Top plot shows every individual
with red denoting cases. Each one takes a test and with 90% gives correct
answer. Those called positive (either correctly or incorrectly) are put in
the bottom left pane. Those called negative in the bottom right. . . . . . 226
7.13 Batting average histograms for 2010, 2011, and 2012. . . . . . . . . . . . 227
7.14 Volcano plot for t-test comparing two groups. Spiked-in genes are denoted
with blue. Among the rest of the genes, those with p-value < 0.01 are
denoted with red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
7.15 p-values versus standard deviation estimates. . . . . . . . . . . . . . . . . 232
7.16 Illustration of how estimates shrink towards the prior expectation. Forty
genes spanning the range of values were selected. . . . . . . . . . . . . . 233
7.17 Volcano plot for moderated t-test comparing two groups. Spiked-in genes
are denoted with blue. Among the rest of the genes, those with p-value <
0.01 are denoted with red. . . . . . . . . . . . . . . . . . . . . . . . . . . 234

8.1 Clustering of animals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237


8.2 Example of heatmap. Image Source: Heatmap, Gaeddal, 01.28.2007 Wiki-
media Commons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
8.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
8.4 Simulated twin pair heights. . . . . . . . . . . . . . . . . . . . . . . . . . 242
8.5 Twin height scatterplot (left) and MA-plot (right). . . . . . . . . . . . . 243
8.6 Distance computed from original data and after rotation is the same. . . 244
8.7 Twin height scatterplot (left) and after rotation (right). . . . . . . . . . 245
8.8 Distance computed with just one dimension after rotation versus actual
distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
8.9 Second PC plotted against first PC for the twins height data. . . . . . . 247
8.10 Entries of the diagonal of D for gene expression data. . . . . . . . . . . . 248
8.11 Percent variance explained by each principal component of gene expression
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
8.12 Cumulative variance explained by principal components of gene expression
data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
8.13 Residuals from comparing a reconstructed gene expression table using 95
PCs to the original data with 189 dimensions. . . . . . . . . . . . . . . . 250
8.14 Illustration of projection. . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
8.15 Geometric representation of Y. . . . . . . . . . . . . . . . . . . . . . . . . 255
8.16 Projection of Y onto new subspace. . . . . . . . . . . . . . . . . . . . . . 256
8.17 Plot of (2,3) as coordinates along Dimension 1 (1,0) and Dimension 2 (0,1). 258
8.18 Plot of (2,3) as a vector in a rotatated space, relative to the original
dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
8.19 Twin 2 heights versus twin 1 heights. . . . . . . . . . . . . . . . . . . . . 260
8.20 Rotation of twin 2 heights versus twin 1 heights. . . . . . . . . . . . . . . 261
8.21 Multi-dimensional scaling (MDS) plot for tissue gene expression data. . . 263
8.22 Variance explained for each principal component. . . . . . . . . . . . . . 263
8.23 Third and fourth principal components. . . . . . . . . . . . . . . . . . . . 264
8.24 MDS computed with cmdscale function. . . . . . . . . . . . . . . . . . . 265
8.25 Comparison of MDS first two PCs to SVD first two PCs. . . . . . . . . . 265
8.26 Twin heights scatter plot. . . . . . . . . . . . . . . . . . . . . . . . . . . 268
8.27 Data projected onto space spanned by (1 0). . . . . . . . . . . . . . . . . 269
8.28 Data projected onto space spanned by first PC. . . . . . . . . . . . . . . 270
8.29 Plot showing SVD and prcomp give same results. . . . . . . . . . . . . . 271
xiv List of Figures

9.1 Dendrogram showing hierarchical clustering of tissue gene expression data. 274
9.2 Dendrogram showing hierarchical clustering of tissue gene expression data
with colors denoting tissues. . . . . . . . . . . . . . . . . . . . . . . . . . 275
9.3 Dendrogram showing hierarchical clustering of tissue gene expression data
with colors denoting tissues. Horizontal line defines actual clusters. . . . 275
9.4 Plot of gene expression for first two genes (order of appearance in data)
with color representing tissue (left) and clusters found with kmeans
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
9.5 Plot of gene expression for first two PCs with color representing tissues
(left) and clusters found using all genes (right). . . . . . . . . . . . . . . 278
9.6 Heatmap created using the 40 most variable genes and the function
heatmap.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
9.7 Histogram of son heights. . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
9.8 Son versus father height (left) with the red lines denoting the stratum
defined by conditioning on fathers being 71 inches tall. Conditional dis-
tribution: son height distribution of stratum defined by 71 inch fathers. . 283
9.9 Son versus father height showing predicted heights based on regression line
(left). Conditional distribution with vertical line representing regression
prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
9.10 MA-plot comparing gene expression from two arrays. . . . . . . . . . . . 285
9.11 MA-plot comparing gene expression from two arrays with fitted regression
line. The two colors represent positive and negative residuals. . . . . . . 286
9.12 MAplot comparing gene expression from two arrays with bin smoother fit
shown for two points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
9.13 Illustration of how bin smoothing estimates a curve. Showing 12 steps of
process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
9.14 MA-plot with curve obtained with bin-smoothed curve shown. . . . . . . 288
9.15 MA-plot comparing gene expression from two arrays with bin local regres-
sion fit shown for two points. . . . . . . . . . . . . . . . . . . . . . . . . . 288
9.16 Illustration of how loess estimates a curve. Showing 12 steps of the process. 289
9.17 MA-plot with curve obtained with loess. . . . . . . . . . . . . . . . . . . 289
9.18 Loess fitted with the loess function. . . . . . . . . . . . . . . . . . . . . . 290
9.19 Probability of Y=1 as a function of X1 and X2. Red is close to 1, yellow
close to 0.5, and blue close to 0. . . . . . . . . . . . . . . . . . . . . . . . 292
9.20 Bayes rule. The line divides part of the space for which probability is
larger than 0.5 (red) and lower than 0.5 (blue). . . . . . . . . . . . . . . 292
9.21 Training data (left) and test data (right). . . . . . . . . . . . . . . . . . . 293
9.22 We estimate the probability of 1 with a linear regression model with X1
and X2 as predictors. The resulting prediction map is divided into parts
that are larger than 0.5 (red) and lower than 0.5 (blue). . . . . . . . . . 295
9.23 Prediction regions obtained with kNN for k=1 (top) and k=200 (bottom).
We show both train (left) and test data (right). . . . . . . . . . . . . . . 296
9.24 Prediction error in train (pink) and test (green) versus number of neigh-
bors. The yellow line represents what one obtains with Bayes Rule. . . . 297
9.25 First two PCs of the tissue gene expression data with color representing
tissue. We use these two PCs as our two predictors throughout. . . . . . 300
9.26 Misclassification error versus number of neighbors. . . . . . . . . . . . . 302
9.27 Misclassification error versus number of neighbors when we use 5 dimen-
sions instead of 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
List of Figures xv

10.1 Estimates of the astronomical unit with estimates of spread, versus year it
was reported. The two laboratories that reported more than one estimate
are shown in color. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
10.2 Percent of students that applied versus percent that were admitted by
gender. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
10.3 Admitted are in green and majors are denoted with letters. Here we clearly
see that more males were admitted. . . . . . . . . . . . . . . . . . . . . . 309
10.4 Simpon’s Paradox illustrated. Admitted students are in green. Students
are now stratified by the major to which they applied. . . . . . . . . . . 309
10.5 Admission percentage by major for each gender. . . . . . . . . . . . . . . 310
10.6 Volcano plots for gene expression data. Comparison by ethnicity (left) and
by year within one ethnicity (right). . . . . . . . . . . . . . . . . . . . . . 312
10.7 Histogram of median expression y-axis. We can see females and males. . 315
10.8 Image of correlations. Cell (i,j) represents correlation between samples i
and j. Red is high, white is 0 and red is negative. . . . . . . . . . . . . . 316
10.9 Variance explained plot for simulated independent data. . . . . . . . . . 317
10.10 Variance explained plot for gene expression data. . . . . . . . . . . . . . 317
10.11 First two PCs for gene expression data with color representing ethnicity. 318
10.12 First two PCs for gene expression data with color representing processing
year. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
10.13 Boxplot of first four PCs stratified by month. . . . . . . . . . . . . . . . 320
10.14 Adjusted R-squared after fitting a model with each month as a factor to
each PC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
10.15 Square root of F-statistics from an analysis of variance to explain PCs
with month. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
10.16 Image of gene expression data for genes selected to show difference in
group as well as the batch effect, along with some randomly chosen genes. 324
10.17 p-value histogram and volcano plot for comparison between sexes. The Y
chromosome genes (considered to be positives) are highlighted in red. The
X chromosome genes (a subset is considered to be positive) are shown in
green. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
10.18 p-value histogram and volcano plot for comparison between sexes after ad-
justment for month. The Y chromosome genes (considered to be positives)
are highlighted in red. The X chromosome genes (a subset is considered
to be positive) are shown in green. . . . . . . . . . . . . . . . . . . . . . 327
10.19 p-value histogram and volcano plot for comparison between sexes for Com-
bat. The Y chromosome genes (considered to be positives) are highlighted
in red. The X chromosome genes (a subset is considered to be positive)
are shown in green. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
10.20 Images of correlation between columns. High correlation is red, no corre-
lation is white, and negative correlation is blue. . . . . . . . . . . . . . . 331
10.21 Image of correlations. Cell (i,j) represents correlation between samples i
and j. Red is high, white is 0 and red is negative. . . . . . . . . . . . . . 333
10.22 Image of subset gene expression data (left) and image of correlations for
this dataset (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
10.23 Dates with color denoting month. . . . . . . . . . . . . . . . . . . . . . . 336
10.24 First PC plotted against order by date with colors representing month. . 336
10.25 Variance explained. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
10.26 First 12 PCs stratified by dates. . . . . . . . . . . . . . . . . . . . . . . . 338
10.27 p-value histogram and volcano plot after blindly removing the first two
PCs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
xvi List of Figures

10.28 Illustration of iterative procedure used by SVA. Only two iterations are
shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
10.29 p-value histogram and volcano plot obtained with SVA. . . . . . . . . . . 341
10.30 Original data split into three sources of variability estimated by SVA:
sex-related signal, surrogate-variable induced structure, and independent
error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
Acknowledgments

The authors would like to thank Alex Nones for proofreading the manuscript during its
various stages. Also, thanks to Karl Broman for contributing the “Plots to Avoid” section
and to Stephanie Hicks for designing some of the exercises. Finally, thanks to John Kimmel
and three anonymous referees for excellent feedback and constructive criticism of the book.
This book was conceived during the teaching of several HarvardX courses, coordinated
by Heather Sternshein. We are also grateful to our TAs, Idan Ginsburg and Stephanie
Chan, and all the students whose questions and comments helped us improve the book.
The courses were partially funded by NIH grant R25GM114818. We are very grateful to the
National Institute of Health for its support.
A special thanks goes to all those that edited the book via GitHub pull requests: vjcitn,
yeredh, ste-fan, molx, kern3020, josemrecio, hcorrada, neerajt, massie, jmgore75, molecules,
lzamparo, eronisko, obicke, knbknb, and devrajoh.
Cover image credit: this photograph is La Mina Falls, El Yunque National Forest,
Puerto Rico, taken by Ron Kroetz https://www.flickr.com/photos/ronkroetz/14779273923
Attribution-NoDerivs 2.0 Generic (CC BY-ND 2.0)

xvii
Introduction

The unprecedented advance in digital technology during the second half of the 20th century
has produced a measurement revolution that is transforming science. In the life sciences,
data analysis is now part of practically every research project. Genomics, in particular, is
being driven by new measurement technologies that permit us to observe certain molecular
entities for the first time. These observations are leading to discoveries analogous to identify-
ing microorganisms and other breakthroughs permitted by the invention of the microscope.
Choice examples of these technologies are microarrays and next generation sequencing.
Scientific fields that have traditionally relied upon simple data analysis techniques have
been turned on their heads by these technologies. In the past, for example, researchers
would measure the transcription levels of a single gene of interest. Today, it is possible to
measure all 20,000+ human genes at once. Advances such as these have brought about a shift
from hypothesis to discovery-driven research. However, interpreting information extracted
from these massive and complex datasets requires sophisticated statistical skills as one can
easily be fooled by patterns arising by chance. This has greatly elevated the importance of
statistics and data analysis in the life sciences.

Who Will Find This Book Useful?


This book was written with the many life science researchers who are becoming data analysts
due to the increased reliance on data described above. If you are performing your own
analysis you have probably computed p-values, applied Bonferroni corrections, performed
principal component analysis, made a heatmap, or used one or more of the techniques listed
in the next section. If you don’t quite understand what these techniques are actually doing
or if you are not sure if you are using them appropriately, this book is for you.
Although the content of the book is mostly focused on advanced statistical concepts
we start by covering the basics to make sure all readers have a strong grounding on the
fundamental statistical concepts required for all data analysis. I find that many introduc-
tory statistics courses are taught in a way that makes it hard to relate the concepts to
data analysis. Our approach ensures that you learn the connection between practice and
theory. For this reason, the first two chapters, Inference and Exploratory Data Analysis,
are appropriate for an introductory undergraduate statistics or data science course. After
these two chapters the level of statistical sophistication ramps up relatively fast.
Although the typical reader of this book will have a masters or PhD, we try to keep
the mathematical content at undergraduate introductory level. You do not need calculus to
use this book. However, we do introduce and use linear algebra which is considered more
advanced than calculus. By explaining linear algebra in context of data analysis we believe
you will be able to learn the basics without knowing calculus. The harder part may be
getting used to the symbols and notation. More on this below.

xix
xx Introduction

What Does This Book Cover?


This book will cover several of the statistical concepts and data analytic skills needed to
succeed in data-driven life science research. We go from relatively basic concepts related to
computing p-values to advanced topics related to analyzing high-throughput data.
We start with one of the most important topics in statistics and in the life sciences:
statistical inference. Inference is the use of probability to learn population characteristics
from data. A typical example is deciphering if two groups (for example, cases versus controls)
are different on average. Specific topics covered include the t-test, confidence intervals,
association tests, Monte Carlo methods, permutation tests and statistical power. We make
use of approximations made possible by mathematical theory, such as the Central Limit
Theorem, as well as techniques made possible by modern computing. We will learn how to
compute p-values and confidence intervals and implement basic data analyses. Throughout
the book we will describe visualization techniques in the statistical computer language R
that are useful for exploring new datasets. For example, we will use these to learn when to
apply robust statistical techniques.
We will then move on to an introduction to linear models and matrix algebra. We will
explain why it is beneficial to use linear models to analyze differences across groups, and why
matrices are useful to represent and implement linear models. We continue with a review
of matrix algebra, including matrix notation and how to multiply matrices (both on paper
and in R). We will then apply what we covered on matrix algebra to linear models. We will
learn how to fit linear models in R, how to test the significance of differences, and how the
standard errors for differences are estimated. Furthermore, we will review some practical
issues with fitting linear models, including collinearity and confounding. Finally, we will
learn how to fit complex models, including interaction terms, how to contrast multiple
terms in R, and the powerful technique which the functions in R actually use to stably fit
linear models: the QR decomposition.
In the third part of the book we cover topics related to high-dimensional data. Specifi-
cally, we describe multiple testing, error rate controlling procedures, exploratory data anal-
ysis for high-throughput data, p-value corrections and the false discovery rate. From here we
move on to covering statistical modeling. In particular, we will discuss parametric distribu-
tions, including binomial and gamma distributions. Next, we will cover maximum likelihood
estimation. Finally, we will discuss hierarchical models and empirical Bayes techniques and
how they are applied in genomics.
We then cover the concepts of distance and dimension reduction. We will introduce the
mathematical definition of distance and use this to motivate the singular value decomposi-
tion (SVD) for dimension reduction and multi-dimensional scaling. Once we learn this, we
will be ready to cover hierarchical and k-means clustering. We will follow this with a basic
introduction to machine learning.
We end by learning about batch effects and how component and factor analysis are
used to deal with this challenge. In particular, we will examine confounding, show examples
of batch effects, make the connection to factor analysis, and describe surrogate variable
analysis.

How Is This Book Different?


While statistics textbooks focus on mathematics, this book focuses on using a computer to
perform data analysis. This book follows the approach of Stat Labs1 , by Deborah Nolan and
Terry Speed. Instead of explaining the mathematics and theory, and then showing examples,
1 https://www.stat.berkeley.edu/
~statlabs/
Introduction xxi

we start by stating a practical data-related challenge. This book also includes the computer
code that provides a solution to the problem and helps illustrate the concepts behind the
solution. By running the code yourself, and seeing data generation and analysis happen live,
you will get a better intuition for the concepts, the mathematics, and the theory.
We focus on the practical challenges faced by data analysts in the life sciences and
introduce mathematics as a tool that can help us achieve scientific goals. Furthermore,
throughout the book we show the R code that performs this analysis and connect the lines
of code to the statistical and mathematical concepts we explain. All sections of this book
are reproducible as they were made using R markdown documents that include R code used
to produce the figures, tables and results shown in the book. In order to distinguish it, the
code is shown in the following font:

x <- 2
y <- 3
print(x+y)

and the results in different colors, preceded by two hash characters (##):

## [1] 5

We will provide links that will give you access to the raw R markdown code so you can
easily follow along with the book by programming in R.
At the beginning of each chapter you will see the sentence:

The R markdown document for this section is available here.

The word “here” will be a hyperlink to the R markdown file. The best way to read this
book is with a computer in front of you, scrolling through that file, and running the R code
that produces the results included in the book section you are reading.
1
Getting Started

In this book we will be using the R programming language1 for all our analysis. You will learn
R and statistics simultaneously. However, we assume you have some basic programming
skills and knowledge of R syntax. If you don’t, your first homework, listed below, is to
complete a tutorial. Here we give step-by-step instructions on how to get set up to follow
along.

1.1 Installing R
The first step is to install R. You can download and install R from the Comprehensive R
Archive Network2 (CRAN). It is relatively straightforward, but if you need further help you
can try the following resources:

• Installing R on Windows3
• Installing R on Mac4
• Installing R on Ubuntu5

1.2 Installing RStudio


The next step is to install RStudio, a program for viewing and running R scripts. Technically
you can run all the code shown here without installing RStudio, but we highly recommend
this integrated development environment (IDE). Instructions are here6 and for Windows
we have special instructions7 .

1.3 Learn R Basics


The first homework assignment is to complete an R tutorial to familiarize yourself with
the basics of programming and R syntax. To follow this book you should be familiar with
1 https://cran.r-project.org/
2 https://cran.r-project.org/
3 https://github.com/genomicsclass/windows#installing-r
4 http://youtu.be/Icawuhf0Yqo
5 http://cran.r-project.org/bin/linux/ubuntu/README
6 http://www.rstudio.com/products/rstudio/download/
7 https://github.com/genomicsclass/windows

1
2 Getting Started

the difference between lists (including data frames) and numeric vectors, for-loops, how to
create functions, and how to use the sapply and replicate functions.
If you are already familiar with R, you can skip to the next section. Otherwise, you
should go through the swirl8 tutorial, which teaches you R programming and data science
interactively, at your own pace and in the R console. Once you have R installed, you can
install swirl and run it the following way:
install.packages("swirl")
library(swirl)
swirl()
Alternatively you can take the try R9 interactive class from Code School.
There are also many open and free resources and reference guides for R. Two examples
are:
• Quick-R10 : a quick online reference for data input, basic statistics and plots
• R reference card (PDF)[https://cran.r-project.org/doc/contrib/Short-refcard.pdf] by
Tom Short
Two key things you need to know about R is that you can get help for a function using
help or ?, like this:
?install.packages
help("install.packages")
and the hash character represents comments, so text following these characters is not
interpreted:
##This is just a comment

1.4 Installing Packages


The first R command we will run is install.packages. If you took the swirl tutorial you
should have already done this. R only includes a basic set of functions. It can do much more
than this, but not everybody needs everything so we instead make some functions available
via packages. Many of these functions are stored in CRAN. Note that these packages are
vetted: they are checked for common errors and they must have a dedicated maintainer.
You can easily install packages from within R if you know the name of the packages. As an
example, we are going to install the package rafalib which we use in our first data analysis
examples:
install.packages("rafalib")
We can then load the package into our R sessions using the library function:
library(rafalib)
From now on you will see that we sometimes load packages without installing them. This
is because once you install the package, it remains in place and only needs to be loaded
with library. If you try to load a package and get an error, it probably means you need
to install it first.
8 http://swirlstats.com/
9 http://tryr.codeschool.com/
10 http://www.statmethods.net/
Importing Data into R 3

1.5 Importing Data into R


The first step when preparing to analyze data is to read in the data into R. There are several
ways to do this and we will discuss three of them. But you only need to learn one to follow
along.
In the life sciences, small datasets such as the one used as an example in the next
sections are typically stored as Excel files. Although there are R packages designed to read
Excel (xls) format, you generally want to avoid this and save files as comma delimited
(Comma-Separated Value/CSV) or tab delimited (Tab-Separated Value/TSV/TXT) files.
These plain-text formats are often easier for sharing data with collaborators, as commercial
software is not required for viewing or working with the data. We will start with a simple
example dataset containing female mouse weights11 .
The first step is to find the file containing your data and know its path.

Paths and the Working Directory When you are working in R it is useful to know
your working directory. This is the directory or folder in which R will save or look for files
by default. You can see your working directory by typing:

getwd()

You can also change your working directory using the function setwd. Or you can change
it through RStudio by clicking on “Session”.
The functions that read and write files (there are several in R) assume you mean to look
for files or write files in the working directory. Our recommended approach for beginners
will have you reading and writing to the working directory. However, you can also type the
full path12 , which will work independently of the working directory.

Projects in RStudio We find that the simplest way to organize yourself is to start a
Project in RStudio (Click on “File” and “New Project”). When creating the project, you
will select a folder to be associated with it. You can then download all your data into this
folder. Your working directory will be this folder.

Option 1: Read file over the Internet You can navigate to the femaleMiceWeights.csv
file by visiting the data directory of dagdata on GitHub13 . If you navigate to the file, you
need to click on Raw on the upper right hand corner of the page.
Now you can copy and paste the URL and use this as the argument to read.csv. Here
we break the URL into a base directory and a filename and then combine with paste0
because the URL would otherwise be too long for the page. We use paste0 because we
want to put the strings together as is, if you were specifying a file on your machine you
should use the smarter function, file.path, which knows the difference between Windows
and Mac file path connectors. You can specify the URL using a single string to avoid this
extra step.
dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
url <- paste0(dir, "femaleMiceWeights.csv")
dat <- read.csv(url)
11 https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/

femaleMiceWeights.csv
12 http://www.computerhope.com/jargon/a/absopath.htm
13 https://github.com/genomicsclass/dagdata/tree/master/inst/extdata
4 Getting Started

FIGURE 1.1
GitHub page screenshot.

Option 2: Download file with your browser to your working directory There are
reasons for wanting to keep a local copy of the file. For example, you may want to run
the analysis while not connected to the Internet or you may want to ensure reproducibility
regardless of the file being available on the original site. To download the file, as in option
1, you can navigate to the femaleMiceWeights.csv. In this option we use your browser’s
“Save As” function to ensure that the downloaded file is in a CSV format. Some browsers
add an extra suffix to your filename by default. You do not want this. You want your file
to be named femaleMiceWeights.csv. Once you have this file in your working directory,
then you can simply read it in like this:

dat <- read.csv("femaleMiceWeights.csv")

If you did not receive any message, then you probably read in the file successfully.

Option 3: Download the file from within R We store many of the datasets used here
on GitHub14 . You can save these files directly from the Internet to your computer using R.
In this example, we are using the download.file function in the downloader package to
download the file to a specific location and then read it in. We can assign it a random name
and a random directory using the function tempfile, but you can also save it in directory
with the name of your choosing.
library(downloader) ##use install.packages to install
dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "femaleMiceWeights.csv"
url <- paste0(dir, filename)
if (!file.exists(filename)) download(url, destfile=filename)

We can then proceed as in option 2:

dat <- read.csv(filename)

Option 4: Download the data package (Advanced) Many of the datasets we include
in this book are available in custom-built packages from GitHub. The reason we use GitHub,
rather than CRAN, is that on GitHub we do not have to vet packages, which gives us much
more flexibility.

14 https://github.com/genomicsclass/
Exercises 5

To install packages from GitHub you will need to install the devtools package:

install.packages("devtools")

Note to Windows users: to use devtools you will have to also install Rtools. In general
you will need to install packages as administrator. One way to do this is to start R as
administrator. If you do not have permission to do this, then it is a bit more complicated15 .
Now you are ready to install a package from GitHub. For this we use a different function:

library(devtools)
install_github("genomicsclass/dagdata")

The file we are working with is actually included in this package. Once you install the
package, the file is on your computer. However, finding it requires advanced knowledge.
Here are the lines of code:

dir <- system.file(package="dagdata") #extracts the location of package


list.files(dir)

## [1] "data" "DESCRIPTION" "extdata" "help" "html"


## [6] "Meta" "NAMESPACE" "script"

list.files(file.path(dir,"extdata")) #external data is in this directory

## [1] "admissions.csv" "babies.txt"


## [3] "femaleControlsPopulation.csv" "femaleMiceWeights.csv"
## [5] "mice_pheno.csv" "msleep_ggplot2.csv"
## [7] "README" "spider_wolff_gorb_2013.csv"

And now we are ready to read in the file:

filename <- file.path(dir,"extdata/femaleMiceWeights.csv")


dat <- read.csv(filename)

1.6 Exercises
Here we will test some of the basics of R data manipulation which you should know
or should have learned by following the tutorials above. You will need to have the file
femaleMiceWeights.csv in your working directory. As we showed above, one way to do
this is by using the downloader package:
library(downloader)
dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "femaleMiceWeights.csv"
url <- paste0(dir, filename)
if (!file.exists(filename)) download(url, destfile=filename)

1. Read in the file femaleMiceWeights.csv and report the body weight of the mouse
in the exact name of the column containing the weights.
15 http://www.magesblog.com/2012/04/installing-r-packages-without-admin.html
6 Getting Started

2. The [ and ] symbols can be used to extract specific rows and specific columns of
the table. What is the entry in the 12th row and second column?
3. You should have learned how to use the $ character to extract a column from a
table and return it as a vector. Use $ to extract the weight column and report
the weight of the mouse in the 11th row.
4. The length function returns the number of elements in a vector. How many mice
are included in our dataset?
5. To create a vector with the numbers 3 to 7, we can use seq(3,7) or, because
they are consecutive, 3:7. View the data and determine what rows are associated
with the high fat or hf diet. Then use the mean function to compute the average
weight of these mice.
6. One of the functions we will be using often is sample. Read the help file for
sample using ?sample. Now take a random sample of size 1 from the numbers
13 to 24 and report back the weight of the mouse represented by that row. Make
sure to type set.seed(1) to ensure that everybody gets the same answer.

1.7 Brief Introduction to dplyr


The learning curve for R syntax is slow. One of the more difficult aspects that requires some
getting used to is subsetting data tables. The dplyr package brings these tasks closer to
English and we are therefore going to introduce two simple functions: one is used to subset
and the other to select columns.
Take a look at the dataset we read in:

filename <- "femaleMiceWeights.csv"


dat <- read.csv(filename)
head(dat) #In R Studio use View(dat)

## Diet Bodyweight
## 1 chow 21.51
## 2 chow 28.14
## 3 chow 24.04
## 4 chow 23.45
## 5 chow 23.68
## 6 chow 19.79

There are two types of diets, which are denoted in the first column. If we want just the
weights, we only need the second column. So if we want the weights for mice on the chow
diet, we subset and filter like this:

library(dplyr)
chow <- filter(dat, Diet=="chow") #keep only the ones with chow diet
head(chow)

## Diet Bodyweight
## 1 chow 21.51
## 2 chow 28.14
## 3 chow 24.04
Brief Introduction to dplyr 7

## 4 chow 23.45
## 5 chow 23.68
## 6 chow 19.79

And now we can select only the column with the values:

chowVals <- select(chow,Bodyweight)


head(chowVals)

## Bodyweight
## 1 21.51
## 2 28.14
## 3 24.04
## 4 23.45
## 5 23.68
## 6 19.79

A nice feature of the dplyr package is that you can perform consecutive tasks by using
what is called a “pipe”. In dplyr we use %>% to denote a pipe. This symbol tells the program
to first do one thing and then do something else to the result of the first. Hence, we can
perform several data manipulations in one line. For example:

chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight)

In the second task, we no longer have to specify the object we are editing since it is
whatever comes from the previous call.
Also, note that if dplyr receives a data.frame it will return a data.frame.

class(dat)

## [1] "data.frame"

class(chowVals)

## [1] "data.frame"

For pedagogical reasons, we will often want the final result to be a simple numeric
vector. To obtain such a vector with dplyr, we can apply the unlist function which turns
lists, such as data.frames, into numeric vectors:

chowVals <- filter(dat, Diet=="chow") %>% select(Bodyweight) %>% unlist


class( chowVals )

## [1] "numeric"

To do this in R without dplyr the code is the following:

chowVals <- dat[ dat$Diet=="chow", colnames(dat)=="Bodyweight"]


8 Getting Started

1.8 Exercises
For these exercises, we will use a new dataset related to mammalian sleep. This data is
described here16 . Download the CSV file from this17 location. You can also download the
data using the downloader package:
library(downloader)
dir <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/"
filename <- "msleep_ggplot2.csv"
url <- paste0(dir, filename)
if (!file.exists(filename)) download(url,filename)

We are going to read in this data, then test your knowledge of they key dplyr functions
select and filter. We are also going to review two different classes: data frames and
vectors.
1. Read in the msleep ggplot2.csv file with the function read.csv and use the
function class to determine what type of object is returned.
2. Now use the filter function to select only the primates. How many animals in
the table are primates? Hint: the nrow function gives you the number of rows of
a data frame or matrix.
3. What is the class of the object you obtain after subsetting the table to only
include primates?
4. Now use the select function to extract the sleep (total) for the primates. What
class is this object? Hint: use %>% to pipe the results of the filter function to
select.
5. Now we want to calculate the average amount of sleep for primates (the average
of the numbers computed above). One challenge is that the mean function requires
a vector so, if we simply apply it to the output above, we get an error. Look at
the help file for unlist and use it to compute the desired average.
6. For the last exercise, we could also use the dplyr summarize function. We have
not introduced this function, but you can read the help file and repeat exercise
5, this time using just filter and summarize to get the answer.

1.9 Mathematical Notation


This book focuses on teaching statistical concepts and data analysis programming skills. We
avoid mathematical notation as much as possible, but we do use it. We do not want readers
to be intimidated by the notation though. Mathematics is actually the easier part of learning
statistics. Unfortunately, many text books use mathematical notation in what we believe
to be an over-complicated way. For this reason, we do try to keep the notation as simple
as possible. However, we do not want to water down the material, and some mathematical
notation facilitates a deeper understanding of the concepts. Here we describe a few specific
16 http://docs.ggplot2.org/0.9.3.1/msleep.html
17 https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_

ggplot2.csv
Mathematical Notation 9

symbols that we use often. If they appear intimidating to you, please take some time to
read this section carefully as they are actually simpler than they seem. Because by now you
should be somewhat familiar with R, we will make the connection between mathematical
notation and R code.

Indexing Those of us dealing with data almost always have a series of numbers. To describe
the concepts in an abstract way, we use indexing. For example 5 numbers:

x <- 1:5

can be generally represented like this x1 , x2 , x3 , x4 , x5 . We use dots to simplify this


x1 , . . . , x5 and indexing to simplify even more xi , i = 1, . . . , 5. If we want to describe a
procedure for a list of any size n, we write xi , i = 1, . . . , n.
We sometimes have two indexes. For example, we may have several measurements (blood
pressure, weight, height, age, cholesterol level) for 100 individuals. We can then use double
indexes: xi,j , i = 1, . . . , 100, j = 1, . . . , 5.

Summation A very common operation in data analysis is to sum several numbers. This
comes up, for example, when we compute averages and standard deviations. If we have
many numbers, there is a mathematical notation that makes it quite easy to express the
following:

n <- 1000
x <- 1:n
S <- sum(x)
P
and it is the notation (capital S in Greek):
n
X
S= xi
i=1

Note that we make use of indexing as well. We will see that what is included inside
the summation can become quite complicated. However, the summation part should not
confuse you as it is a simple operation.

Greek letters We would prefer to avoid Greek letters, but they are ubiquitous in the
statistical literature so we want you to become used to them. They are mainly used to
distinguish the unknown from the observed. Suppose we want to find out the average height
of a population and we take a sample of 1,000 people to estimate this. The unknown average
we want to estimate is often denoted with µ, the Greek letter for m (m is for mean). The
standard deviation is often denoted with σ, the Greek letter for s. Measurement error or
other unexplained random variability is typically denoted with ε, the Greek letter for e.
Effect sizes, for example the effect of a diet on weight, are typically denoted with β. We
may use other Greek letters but those are the most commonly used.
You should get used to these four Greek letters as you will be seeing them often: µ, σ,
β and ε.
Note that indexing is sometimes used in conjunction with Greek letters to denote dif-
ferent groups. For example, if we have one set of numbers denoted with x and another with
y we may use µx and µy to denote their averages.
10 Getting Started

Infinity In the text we often talk about asymptotic results. Typically, this refers to an
approximation that gets better and better as the number of data points we consider gets
larger and larger, with perfect approximations occurring when the number of data points is
∞. In practice, there is no such thing as ∞, but it is a convenient concept to understand.
One way to think about asymptotic results is as results that become better and better
as some number increases and we can pick a number so that a computer can’t tell the
difference between the approximation and the real number. Here is a very simple example
that approximates 1/3 with decimals:

onethird <- function(n) sum( 3/10^c(1:n))


1/3 - onethird(4)

## [1] 3.333333e-05

1/3 - onethird(10)

## [1] 3.333334e-11

1/3 - onethird(16)

## [1] 0

In the example above, 16 is practically ∞.

Integrals We only use these a couple of times so you can skip this section if you prefer.
However, integrals are actually much simpler to understand than perhaps you realize.
For certain statistical operations, we need to figure out areas under the curve. For
example, for a function f (x) . . .

FIGURE 1.2
Integral of a function.

. . . we need to know what proportion of the total area under the curve is grey.
The grey area can be thought of as many small grey bars stacked next to each other.
The area is then just the sum of the areas of these little bars. The problem is that we can’t
do this for every number between 2 and 4 because there are an infinite number. The integral
Mathematical Notation 11

is the mathematical solution to this problem. In this case, the total area is 1 so the answer
to what proportion is grey is the following integral:
Z 4
f (x) dx
2

Because we constructed this example, we know that the grey area is 2.27% of the total.
Note that this is very well approximated by an actual sum of little bars:

width <- 0.01


x <- seq(2,4,width)
areaofbars <- f(x)*width
sum( areaofbars )

## [1] 0.02298998

The smaller we make width, the closer the sum gets to the integral, which is equal to
the area.
Another random document with
no related content on Scribd:
DANCE ON STILTS AT THE GIRLS’ UNYAGO, NIUCHI

Newala, too, suffers from the distance of its water-supply—at least


the Newala of to-day does; there was once another Newala in a lovely
valley at the foot of the plateau. I visited it and found scarcely a trace
of houses, only a Christian cemetery, with the graves of several
missionaries and their converts, remaining as a monument of its
former glories. But the surroundings are wonderfully beautiful. A
thick grove of splendid mango-trees closes in the weather-worn
crosses and headstones; behind them, combining the useful and the
agreeable, is a whole plantation of lemon-trees covered with ripe
fruit; not the small African kind, but a much larger and also juicier
imported variety, which drops into the hands of the passing traveller,
without calling for any exertion on his part. Old Newala is now under
the jurisdiction of the native pastor, Daudi, at Chingulungulu, who,
as I am on very friendly terms with him, allows me, as a matter of
course, the use of this lemon-grove during my stay at Newala.
FEET MUTILATED BY THE RAVAGES OF THE “JIGGER”
(Sarcopsylla penetrans)

The water-supply of New Newala is in the bottom of the valley,


some 1,600 feet lower down. The way is not only long and fatiguing,
but the water, when we get it, is thoroughly bad. We are suffering not
only from this, but from the fact that the arrangements at Newala are
nothing short of luxurious. We have a separate kitchen—a hut built
against the boma palisade on the right of the baraza, the interior of
which is not visible from our usual position. Our two cooks were not
long in finding this out, and they consequently do—or rather neglect
to do—what they please. In any case they do not seem to be very
particular about the boiling of our drinking-water—at least I can
attribute to no other cause certain attacks of a dysenteric nature,
from which both Knudsen and I have suffered for some time. If a
man like Omari has to be left unwatched for a moment, he is capable
of anything. Besides this complaint, we are inconvenienced by the
state of our nails, which have become as hard as glass, and crack on
the slightest provocation, and I have the additional infliction of
pimples all over me. As if all this were not enough, we have also, for
the last week been waging war against the jigger, who has found his
Eldorado in the hot sand of the Makonde plateau. Our men are seen
all day long—whenever their chronic colds and the dysentery likewise
raging among them permit—occupied in removing this scourge of
Africa from their feet and trying to prevent the disastrous
consequences of its presence. It is quite common to see natives of
this place with one or two toes missing; many have lost all their toes,
or even the whole front part of the foot, so that a well-formed leg
ends in a shapeless stump. These ravages are caused by the female of
Sarcopsylla penetrans, which bores its way under the skin and there
develops an egg-sac the size of a pea. In all books on the subject, it is
stated that one’s attention is called to the presence of this parasite by
an intolerable itching. This agrees very well with my experience, so
far as the softer parts of the sole, the spaces between and under the
toes, and the side of the foot are concerned, but if the creature
penetrates through the harder parts of the heel or ball of the foot, it
may escape even the most careful search till it has reached maturity.
Then there is no time to be lost, if the horrible ulceration, of which
we see cases by the dozen every day, is to be prevented. It is much
easier, by the way, to discover the insect on the white skin of a
European than on that of a native, on which the dark speck scarcely
shows. The four or five jiggers which, in spite of the fact that I
constantly wore high laced boots, chose my feet to settle in, were
taken out for me by the all-accomplished Knudsen, after which I
thought it advisable to wash out the cavities with corrosive
sublimate. The natives have a different sort of disinfectant—they fill
the hole with scraped roots. In a tiny Makua village on the slope of
the plateau south of Newala, we saw an old woman who had filled all
the spaces under her toe-nails with powdered roots by way of
prophylactic treatment. What will be the result, if any, who can say?
The rest of the many trifling ills which trouble our existence are
really more comic than serious. In the absence of anything else to
smoke, Knudsen and I at last opened a box of cigars procured from
the Indian store-keeper at Lindi, and tried them, with the most
distressing results. Whether they contain opium or some other
narcotic, neither of us can say, but after the tenth puff we were both
“off,” three-quarters stupefied and unspeakably wretched. Slowly we
recovered—and what happened next? Half-an-hour later we were
once more smoking these poisonous concoctions—so insatiable is the
craving for tobacco in the tropics.
Even my present attacks of fever scarcely deserve to be taken
seriously. I have had no less than three here at Newala, all of which
have run their course in an incredibly short time. In the early
afternoon, I am busy with my old natives, asking questions and
making notes. The strong midday coffee has stimulated my spirits to
an extraordinary degree, the brain is active and vigorous, and work
progresses rapidly, while a pleasant warmth pervades the whole
body. Suddenly this gives place to a violent chill, forcing me to put on
my overcoat, though it is only half-past three and the afternoon sun
is at its hottest. Now the brain no longer works with such acuteness
and logical precision; more especially does it fail me in trying to
establish the syntax of the difficult Makua language on which I have
ventured, as if I had not enough to do without it. Under the
circumstances it seems advisable to take my temperature, and I do
so, to save trouble, without leaving my seat, and while going on with
my work. On examination, I find it to be 101·48°. My tutors are
abruptly dismissed and my bed set up in the baraza; a few minutes
later I am in it and treating myself internally with hot water and
lemon-juice.
Three hours later, the thermometer marks nearly 104°, and I make
them carry me back into the tent, bed and all, as I am now perspiring
heavily, and exposure to the cold wind just beginning to blow might
mean a fatal chill. I lie still for a little while, and then find, to my
great relief, that the temperature is not rising, but rather falling. This
is about 7.30 p.m. At 8 p.m. I find, to my unbounded astonishment,
that it has fallen below 98·6°, and I feel perfectly well. I read for an
hour or two, and could very well enjoy a smoke, if I had the
wherewithal—Indian cigars being out of the question.
Having no medical training, I am at a loss to account for this state
of things. It is impossible that these transitory attacks of high fever
should be malarial; it seems more probable that they are due to a
kind of sunstroke. On consulting my note-book, I become more and
more inclined to think this is the case, for these attacks regularly
follow extreme fatigue and long exposure to strong sunshine. They at
least have the advantage of being only short interruptions to my
work, as on the following morning I am always quite fresh and fit.
My treasure of a cook is suffering from an enormous hydrocele which
makes it difficult for him to get up, and Moritz is obliged to keep in
the dark on account of his inflamed eyes. Knudsen’s cook, a raw boy
from somewhere in the bush, knows still less of cooking than Omari;
consequently Nils Knudsen himself has been promoted to the vacant
post. Finding that we had come to the end of our supplies, he began
by sending to Chingulungulu for the four sucking-pigs which we had
bought from Matola and temporarily left in his charge; and when
they came up, neatly packed in a large crate, he callously slaughtered
the biggest of them. The first joint we were thoughtless enough to
entrust for roasting to Knudsen’s mshenzi cook, and it was
consequently uneatable; but we made the rest of the animal into a
jelly which we ate with great relish after weeks of underfeeding,
consuming incredible helpings of it at both midday and evening
meals. The only drawback is a certain want of variety in the tinned
vegetables. Dr. Jäger, to whom the Geographical Commission
entrusted the provisioning of the expeditions—mine as well as his
own—because he had more time on his hands than the rest of us,
seems to have laid in a huge stock of Teltow turnips,[46] an article of
food which is all very well for occasional use, but which quickly palls
when set before one every day; and we seem to have no other tins
left. There is no help for it—we must put up with the turnips; but I
am certain that, once I am home again, I shall not touch them for ten
years to come.
Amid all these minor evils, which, after all, go to make up the
genuine flavour of Africa, there is at least one cheering touch:
Knudsen has, with the dexterity of a skilled mechanic, repaired my 9
× 12 cm. camera, at least so far that I can use it with a little care.
How, in the absence of finger-nails, he was able to accomplish such a
ticklish piece of work, having no tool but a clumsy screw-driver for
taking to pieces and putting together again the complicated
mechanism of the instantaneous shutter, is still a mystery to me; but
he did it successfully. The loss of his finger-nails shows him in a light
contrasting curiously enough with the intelligence evinced by the
above operation; though, after all, it is scarcely surprising after his
ten years’ residence in the bush. One day, at Lindi, he had occasion
to wash a dog, which must have been in need of very thorough
cleansing, for the bottle handed to our friend for the purpose had an
extremely strong smell. Having performed his task in the most
conscientious manner, he perceived with some surprise that the dog
did not appear much the better for it, and was further surprised by
finding his own nails ulcerating away in the course of the next few
days. “How was I to know that carbolic acid has to be diluted?” he
mutters indignantly, from time to time, with a troubled gaze at his
mutilated finger-tips.
Since we came to Newala we have been making excursions in all
directions through the surrounding country, in accordance with old
habit, and also because the akida Sefu did not get together the tribal
elders from whom I wanted information so speedily as he had
promised. There is, however, no harm done, as, even if seen only
from the outside, the country and people are interesting enough.
The Makonde plateau is like a large rectangular table rounded off
at the corners. Measured from the Indian Ocean to Newala, it is
about seventy-five miles long, and between the Rovuma and the
Lukuledi it averages fifty miles in breadth, so that its superficial area
is about two-thirds of that of the kingdom of Saxony. The surface,
however, is not level, but uniformly inclined from its south-western
edge to the ocean. From the upper edge, on which Newala lies, the
eye ranges for many miles east and north-east, without encountering
any obstacle, over the Makonde bush. It is a green sea, from which
here and there thick clouds of smoke rise, to show that it, too, is
inhabited by men who carry on their tillage like so many other
primitive peoples, by cutting down and burning the bush, and
manuring with the ashes. Even in the radiant light of a tropical day
such a fire is a grand sight.
Much less effective is the impression produced just now by the
great western plain as seen from the edge of the plateau. As often as
time permits, I stroll along this edge, sometimes in one direction,
sometimes in another, in the hope of finding the air clear enough to
let me enjoy the view; but I have always been disappointed.
Wherever one looks, clouds of smoke rise from the burning bush,
and the air is full of smoke and vapour. It is a pity, for under more
favourable circumstances the panorama of the whole country up to
the distant Majeje hills must be truly magnificent. It is of little use
taking photographs now, and an outline sketch gives a very poor idea
of the scenery. In one of these excursions I went out of my way to
make a personal attempt on the Makonde bush. The present edge of
the plateau is the result of a far-reaching process of destruction
through erosion and denudation. The Makonde strata are
everywhere cut into by ravines, which, though short, are hundreds of
yards in depth. In consequence of the loose stratification of these
beds, not only are the walls of these ravines nearly vertical, but their
upper end is closed by an equally steep escarpment, so that the
western edge of the Makonde plateau is hemmed in by a series of
deep, basin-like valleys. In order to get from one side of such a ravine
to the other, I cut my way through the bush with a dozen of my men.
It was a very open part, with more grass than scrub, but even so the
short stretch of less than two hundred yards was very hard work; at
the end of it the men’s calicoes were in rags and they themselves
bleeding from hundreds of scratches, while even our strong khaki
suits had not escaped scatheless.

NATIVE PATH THROUGH THE MAKONDE BUSH, NEAR


MAHUTA

I see increasing reason to believe that the view formed some time
back as to the origin of the Makonde bush is the correct one. I have
no doubt that it is not a natural product, but the result of human
occupation. Those parts of the high country where man—as a very
slight amount of practice enables the eye to perceive at once—has not
yet penetrated with axe and hoe, are still occupied by a splendid
timber forest quite able to sustain a comparison with our mixed
forests in Germany. But wherever man has once built his hut or tilled
his field, this horrible bush springs up. Every phase of this process
may be seen in the course of a couple of hours’ walk along the main
road. From the bush to right or left, one hears the sound of the axe—
not from one spot only, but from several directions at once. A few
steps further on, we can see what is taking place. The brush has been
cut down and piled up in heaps to the height of a yard or more,
between which the trunks of the large trees stand up like the last
pillars of a magnificent ruined building. These, too, present a
melancholy spectacle: the destructive Makonde have ringed them—
cut a broad strip of bark all round to ensure their dying off—and also
piled up pyramids of brush round them. Father and son, mother and
son-in-law, are chopping away perseveringly in the background—too
busy, almost, to look round at the white stranger, who usually excites
so much interest. If you pass by the same place a week later, the piles
of brushwood have disappeared and a thick layer of ashes has taken
the place of the green forest. The large trees stretch their
smouldering trunks and branches in dumb accusation to heaven—if
they have not already fallen and been more or less reduced to ashes,
perhaps only showing as a white stripe on the dark ground.
This work of destruction is carried out by the Makonde alike on the
virgin forest and on the bush which has sprung up on sites already
cultivated and deserted. In the second case they are saved the trouble
of burning the large trees, these being entirely absent in the
secondary bush.
After burning this piece of forest ground and loosening it with the
hoe, the native sows his corn and plants his vegetables. All over the
country, he goes in for bed-culture, which requires, and, in fact,
receives, the most careful attention. Weeds are nowhere tolerated in
the south of German East Africa. The crops may fail on the plains,
where droughts are frequent, but never on the plateau with its
abundant rains and heavy dews. Its fortunate inhabitants even have
the satisfaction of seeing the proud Wayao and Wamakua working
for them as labourers, driven by hunger to serve where they were
accustomed to rule.
But the light, sandy soil is soon exhausted, and would yield no
harvest the second year if cultivated twice running. This fact has
been familiar to the native for ages; consequently he provides in
time, and, while his crop is growing, prepares the next plot with axe
and firebrand. Next year he plants this with his various crops and
lets the first piece lie fallow. For a short time it remains waste and
desolate; then nature steps in to repair the destruction wrought by
man; a thousand new growths spring out of the exhausted soil, and
even the old stumps put forth fresh shoots. Next year the new growth
is up to one’s knees, and in a few years more it is that terrible,
impenetrable bush, which maintains its position till the black
occupier of the land has made the round of all the available sites and
come back to his starting point.
The Makonde are, body and soul, so to speak, one with this bush.
According to my Yao informants, indeed, their name means nothing
else but “bush people.” Their own tradition says that they have been
settled up here for a very long time, but to my surprise they laid great
stress on an original immigration. Their old homes were in the
south-east, near Mikindani and the mouth of the Rovuma, whence
their peaceful forefathers were driven by the continual raids of the
Sakalavas from Madagascar and the warlike Shirazis[47] of the coast,
to take refuge on the almost inaccessible plateau. I have studied
African ethnology for twenty years, but the fact that changes of
population in this apparently quiet and peaceable corner of the earth
could have been occasioned by outside enterprises taking place on
the high seas, was completely new to me. It is, no doubt, however,
correct.
The charming tribal legend of the Makonde—besides informing us
of other interesting matters—explains why they have to live in the
thickest of the bush and a long way from the edge of the plateau,
instead of making their permanent homes beside the purling brooks
and springs of the low country.
“The place where the tribe originated is Mahuta, on the southern
side of the plateau towards the Rovuma, where of old time there was
nothing but thick bush. Out of this bush came a man who never
washed himself or shaved his head, and who ate and drank but little.
He went out and made a human figure from the wood of a tree
growing in the open country, which he took home to his abode in the
bush and there set it upright. In the night this image came to life and
was a woman. The man and woman went down together to the
Rovuma to wash themselves. Here the woman gave birth to a still-
born child. They left that place and passed over the high land into the
valley of the Mbemkuru, where the woman had another child, which
was also born dead. Then they returned to the high bush country of
Mahuta, where the third child was born, which lived and grew up. In
course of time, the couple had many more children, and called
themselves Wamatanda. These were the ancestral stock of the
Makonde, also called Wamakonde,[48] i.e., aborigines. Their
forefather, the man from the bush, gave his children the command to
bury their dead upright, in memory of the mother of their race who
was cut out of wood and awoke to life when standing upright. He also
warned them against settling in the valleys and near large streams,
for sickness and death dwelt there. They were to make it a rule to
have their huts at least an hour’s walk from the nearest watering-
place; then their children would thrive and escape illness.”
The explanation of the name Makonde given by my informants is
somewhat different from that contained in the above legend, which I
extract from a little book (small, but packed with information), by
Pater Adams, entitled Lindi und sein Hinterland. Otherwise, my
results agree exactly with the statements of the legend. Washing?
Hapana—there is no such thing. Why should they do so? As it is, the
supply of water scarcely suffices for cooking and drinking; other
people do not wash, so why should the Makonde distinguish himself
by such needless eccentricity? As for shaving the head, the short,
woolly crop scarcely needs it,[49] so the second ancestral precept is
likewise easy enough to follow. Beyond this, however, there is
nothing ridiculous in the ancestor’s advice. I have obtained from
various local artists a fairly large number of figures carved in wood,
ranging from fifteen to twenty-three inches in height, and
representing women belonging to the great group of the Mavia,
Makonde, and Matambwe tribes. The carving is remarkably well
done and renders the female type with great accuracy, especially the
keloid ornamentation, to be described later on. As to the object and
meaning of their works the sculptors either could or (more probably)
would tell me nothing, and I was forced to content myself with the
scanty information vouchsafed by one man, who said that the figures
were merely intended to represent the nembo—the artificial
deformations of pelele, ear-discs, and keloids. The legend recorded
by Pater Adams places these figures in a new light. They must surely
be more than mere dolls; and we may even venture to assume that
they are—though the majority of present-day Makonde are probably
unaware of the fact—representations of the tribal ancestress.
The references in the legend to the descent from Mahuta to the
Rovuma, and to a journey across the highlands into the Mbekuru
valley, undoubtedly indicate the previous history of the tribe, the
travels of the ancestral pair typifying the migrations of their
descendants. The descent to the neighbouring Rovuma valley, with
its extraordinary fertility and great abundance of game, is intelligible
at a glance—but the crossing of the Lukuledi depression, the ascent
to the Rondo Plateau and the descent to the Mbemkuru, also lie
within the bounds of probability, for all these districts have exactly
the same character as the extreme south. Now, however, comes a
point of especial interest for our bacteriological age. The primitive
Makonde did not enjoy their lives in the marshy river-valleys.
Disease raged among them, and many died. It was only after they
had returned to their original home near Mahuta, that the health
conditions of these people improved. We are very apt to think of the
African as a stupid person whose ignorance of nature is only equalled
by his fear of it, and who looks on all mishaps as caused by evil
spirits and malignant natural powers. It is much more correct to
assume in this case that the people very early learnt to distinguish
districts infested with malaria from those where it is absent.
This knowledge is crystallized in the
ancestral warning against settling in the
valleys and near the great waters, the
dwelling-places of disease and death. At the
same time, for security against the hostile
Mavia south of the Rovuma, it was enacted
that every settlement must be not less than a
certain distance from the southern edge of the
plateau. Such in fact is their mode of life at the
present day. It is not such a bad one, and
certainly they are both safer and more
comfortable than the Makua, the recent
intruders from the south, who have made USUAL METHOD OF
good their footing on the western edge of the CLOSING HUT-DOOR
plateau, extending over a fairly wide belt of
country. Neither Makua nor Makonde show in their dwellings
anything of the size and comeliness of the Yao houses in the plain,
especially at Masasi, Chingulungulu and Zuza’s. Jumbe Chauro, a
Makonde hamlet not far from Newala, on the road to Mahuta, is the
most important settlement of the tribe I have yet seen, and has fairly
spacious huts. But how slovenly is their construction compared with
the palatial residences of the elephant-hunters living in the plain.
The roofs are still more untidy than in the general run of huts during
the dry season, the walls show here and there the scanty beginnings
or the lamentable remains of the mud plastering, and the interior is a
veritable dog-kennel; dirt, dust and disorder everywhere. A few huts
only show any attempt at division into rooms, and this consists
merely of very roughly-made bamboo partitions. In one point alone
have I noticed any indication of progress—in the method of fastening
the door. Houses all over the south are secured in a simple but
ingenious manner. The door consists of a set of stout pieces of wood
or bamboo, tied with bark-string to two cross-pieces, and moving in
two grooves round one of the door-posts, so as to open inwards. If
the owner wishes to leave home, he takes two logs as thick as a man’s
upper arm and about a yard long. One of these is placed obliquely
against the middle of the door from the inside, so as to form an angle
of from 60° to 75° with the ground. He then places the second piece
horizontally across the first, pressing it downward with all his might.
It is kept in place by two strong posts planted in the ground a few
inches inside the door. This fastening is absolutely safe, but of course
cannot be applied to both doors at once, otherwise how could the
owner leave or enter his house? I have not yet succeeded in finding
out how the back door is fastened.

MAKONDE LOCK AND KEY AT JUMBE CHAURO


This is the general way of closing a house. The Makonde at Jumbe
Chauro, however, have a much more complicated, solid and original
one. Here, too, the door is as already described, except that there is
only one post on the inside, standing by itself about six inches from
one side of the doorway. Opposite this post is a hole in the wall just
large enough to admit a man’s arm. The door is closed inside by a
large wooden bolt passing through a hole in this post and pressing
with its free end against the door. The other end has three holes into
which fit three pegs running in vertical grooves inside the post. The
door is opened with a wooden key about a foot long, somewhat
curved and sloped off at the butt; the other end has three pegs
corresponding to the holes, in the bolt, so that, when it is thrust
through the hole in the wall and inserted into the rectangular
opening in the post, the pegs can be lifted and the bolt drawn out.[50]

MODE OF INSERTING THE KEY

With no small pride first one householder and then a second


showed me on the spot the action of this greatest invention of the
Makonde Highlands. To both with an admiring exclamation of
“Vizuri sana!” (“Very fine!”). I expressed the wish to take back these
marvels with me to Ulaya, to show the Wazungu what clever fellows
the Makonde are. Scarcely five minutes after my return to camp at
Newala, the two men came up sweating under the weight of two
heavy logs which they laid down at my feet, handing over at the same
time the keys of the fallen fortress. Arguing, logically enough, that if
the key was wanted, the lock would be wanted with it, they had taken
their axes and chopped down the posts—as it never occurred to them
to dig them out of the ground and so bring them intact. Thus I have
two badly damaged specimens, and the owners, instead of praise,
come in for a blowing-up.
The Makua huts in the environs of Newala are especially
miserable; their more than slovenly construction reminds one of the
temporary erections of the Makua at Hatia’s, though the people here
have not been concerned in a war. It must therefore be due to
congenital idleness, or else to the absence of a powerful chief. Even
the baraza at Mlipa’s, a short hour’s walk south-east of Newala,
shares in this general neglect. While public buildings in this country
are usually looked after more or less carefully, this is in evident
danger of being blown over by the first strong easterly gale. The only
attractive object in this whole district is the grave of the late chief
Mlipa. I visited it in the morning, while the sun was still trying with
partial success to break through the rolling mists, and the circular
grove of tall euphorbias, which, with a broken pot, is all that marks
the old king’s resting-place, impressed one with a touch of pathos.
Even my very materially-minded carriers seemed to feel something
of the sort, for instead of their usual ribald songs, they chanted
solemnly, as we marched on through the dense green of the Makonde
bush:—
“We shall arrive with the great master; we stand in a row and have
no fear about getting our food and our money from the Serkali (the
Government). We are not afraid; we are going along with the great
master, the lion; we are going down to the coast and back.”
With regard to the characteristic features of the various tribes here
on the western edge of the plateau, I can arrive at no other
conclusion than the one already come to in the plain, viz., that it is
impossible for anyone but a trained anthropologist to assign any
given individual at once to his proper tribe. In fact, I think that even
an anthropological specialist, after the most careful examination,
might find it a difficult task to decide. The whole congeries of peoples
collected in the region bounded on the west by the great Central
African rift, Tanganyika and Nyasa, and on the east by the Indian
Ocean, are closely related to each other—some of their languages are
only distinguished from one another as dialects of the same speech,
and no doubt all the tribes present the same shape of skull and
structure of skeleton. Thus, surely, there can be no very striking
differences in outward appearance.
Even did such exist, I should have no time
to concern myself with them, for day after day,
I have to see or hear, as the case may be—in
any case to grasp and record—an
extraordinary number of ethnographic
phenomena. I am almost disposed to think it
fortunate that some departments of inquiry, at
least, are barred by external circumstances.
Chief among these is the subject of iron-
working. We are apt to think of Africa as a
country where iron ore is everywhere, so to
speak, to be picked up by the roadside, and
where it would be quite surprising if the
inhabitants had not learnt to smelt the
material ready to their hand. In fact, the
knowledge of this art ranges all over the
continent, from the Kabyles in the north to the
Kafirs in the south. Here between the Rovuma
and the Lukuledi the conditions are not so
favourable. According to the statements of the
Makonde, neither ironstone nor any other
form of iron ore is known to them. They have
not therefore advanced to the art of smelting
the metal, but have hitherto bought all their
THE ANCESTRESS OF
THE MAKONDE
iron implements from neighbouring tribes.
Even in the plain the inhabitants are not much
better off. Only one man now living is said to
understand the art of smelting iron. This old fundi lives close to
Huwe, that isolated, steep-sided block of granite which rises out of
the green solitude between Masasi and Chingulungulu, and whose
jagged and splintered top meets the traveller’s eye everywhere. While
still at Masasi I wished to see this man at work, but was told that,
frightened by the rising, he had retired across the Rovuma, though
he would soon return. All subsequent inquiries as to whether the
fundi had come back met with the genuine African answer, “Bado”
(“Not yet”).
BRAZIER

Some consolation was afforded me by a brassfounder, whom I


came across in the bush near Akundonde’s. This man is the favourite
of women, and therefore no doubt of the gods; he welds the glittering
brass rods purchased at the coast into those massive, heavy rings
which, on the wrists and ankles of the local fair ones, continually give
me fresh food for admiration. Like every decent master-craftsman he
had all his tools with him, consisting of a pair of bellows, three
crucibles and a hammer—nothing more, apparently. He was quite
willing to show his skill, and in a twinkling had fixed his bellows on
the ground. They are simply two goat-skins, taken off whole, the four
legs being closed by knots, while the upper opening, intended to
admit the air, is kept stretched by two pieces of wood. At the lower
end of the skin a smaller opening is left into which a wooden tube is
stuck. The fundi has quickly borrowed a heap of wood-embers from
the nearest hut; he then fixes the free ends of the two tubes into an
earthen pipe, and clamps them to the ground by means of a bent
piece of wood. Now he fills one of his small clay crucibles, the dross
on which shows that they have been long in use, with the yellow
material, places it in the midst of the embers, which, at present are
only faintly glimmering, and begins his work. In quick alternation
the smith’s two hands move up and down with the open ends of the
bellows; as he raises his hand he holds the slit wide open, so as to let
the air enter the skin bag unhindered. In pressing it down he closes
the bag, and the air puffs through the bamboo tube and clay pipe into
the fire, which quickly burns up. The smith, however, does not keep
on with this work, but beckons to another man, who relieves him at
the bellows, while he takes some more tools out of a large skin pouch
carried on his back. I look on in wonder as, with a smooth round
stick about the thickness of a finger, he bores a few vertical holes into
the clean sand of the soil. This should not be difficult, yet the man
seems to be taking great pains over it. Then he fastens down to the
ground, with a couple of wooden clamps, a neat little trough made by
splitting a joint of bamboo in half, so that the ends are closed by the
two knots. At last the yellow metal has attained the right consistency,
and the fundi lifts the crucible from the fire by means of two sticks
split at the end to serve as tongs. A short swift turn to the left—a
tilting of the crucible—and the molten brass, hissing and giving forth
clouds of smoke, flows first into the bamboo mould and then into the
holes in the ground.
The technique of this backwoods craftsman may not be very far
advanced, but it cannot be denied that he knows how to obtain an
adequate result by the simplest means. The ladies of highest rank in
this country—that is to say, those who can afford it, wear two kinds
of these massive brass rings, one cylindrical, the other semicircular
in section. The latter are cast in the most ingenious way in the
bamboo mould, the former in the circular hole in the sand. It is quite
a simple matter for the fundi to fit these bars to the limbs of his fair
customers; with a few light strokes of his hammer he bends the
pliable brass round arm or ankle without further inconvenience to
the wearer.
SHAPING THE POT

SMOOTHING WITH MAIZE-COB

CUTTING THE EDGE


FINISHING THE BOTTOM

LAST SMOOTHING BEFORE


BURNING

FIRING THE BRUSH-PILE


LIGHTING THE FARTHER SIDE OF
THE PILE

TURNING THE RED-HOT VESSEL

NYASA WOMAN MAKING POTS AT MASASI


Pottery is an art which must always and everywhere excite the
interest of the student, just because it is so intimately connected with
the development of human culture, and because its relics are one of
the principal factors in the reconstruction of our own condition in
prehistoric times. I shall always remember with pleasure the two or
three afternoons at Masasi when Salim Matola’s mother, a slightly-
built, graceful, pleasant-looking woman, explained to me with
touching patience, by means of concrete illustrations, the ceramic art
of her people. The only implements for this primitive process were a
lump of clay in her left hand, and in the right a calabash containing
the following valuables: the fragment of a maize-cob stripped of all
its grains, a smooth, oval pebble, about the size of a pigeon’s egg, a
few chips of gourd-shell, a bamboo splinter about the length of one’s
hand, a small shell, and a bunch of some herb resembling spinach.
Nothing more. The woman scraped with the
shell a round, shallow hole in the soft, fine
sand of the soil, and, when an active young
girl had filled the calabash with water for her,
she began to knead the clay. As if by magic it
gradually assumed the shape of a rough but
already well-shaped vessel, which only wanted
a little touching up with the instruments
before mentioned. I looked out with the
MAKUA WOMAN closest attention for any indication of the use
MAKING A POT. of the potter’s wheel, in however rudimentary
SHOWS THE a form, but no—hapana (there is none). The
BEGINNINGS OF THE embryo pot stood firmly in its little
POTTER’S WHEEL
depression, and the woman walked round it in
a stooping posture, whether she was removing
small stones or similar foreign bodies with the maize-cob, smoothing
the inner or outer surface with the splinter of bamboo, or later, after
letting it dry for a day, pricking in the ornamentation with a pointed
bit of gourd-shell, or working out the bottom, or cutting the edge
with a sharp bamboo knife, or giving the last touches to the finished
vessel. This occupation of the women is infinitely toilsome, but it is
without doubt an accurate reproduction of the process in use among
our ancestors of the Neolithic and Bronze ages.
There is no doubt that the invention of pottery, an item in human
progress whose importance cannot be over-estimated, is due to
women. Rough, coarse and unfeeling, the men of the horde range
over the countryside. When the united cunning of the hunters has
succeeded in killing the game; not one of them thinks of carrying
home the spoil. A bright fire, kindled by a vigorous wielding of the
drill, is crackling beside them; the animal has been cleaned and cut
up secundum artem, and, after a slight singeing, will soon disappear
under their sharp teeth; no one all this time giving a single thought
to wife or child.
To what shifts, on the other hand, the primitive wife, and still more
the primitive mother, was put! Not even prehistoric stomachs could
endure an unvarying diet of raw food. Something or other suggested
the beneficial effect of hot water on the majority of approved but
indigestible dishes. Perhaps a neighbour had tried holding the hard
roots or tubers over the fire in a calabash filled with water—or maybe
an ostrich-egg-shell, or a hastily improvised vessel of bark. They
became much softer and more palatable than they had previously
been; but, unfortunately, the vessel could not stand the fire and got
charred on the outside. That can be remedied, thought our
ancestress, and plastered a layer of wet clay round a similar vessel.
This is an improvement; the cooking utensil remains uninjured, but
the heat of the fire has shrunk it, so that it is loose in its shell. The
next step is to detach it, so, with a firm grip and a jerk, shell and
kernel are separated, and pottery is invented. Perhaps, however, the
discovery which led to an intelligent use of the burnt-clay shell, was
made in a slightly different way. Ostrich-eggs and calabashes are not
to be found in every part of the world, but everywhere mankind has
arrived at the art of making baskets out of pliant materials, such as
bark, bast, strips of palm-leaf, supple twigs, etc. Our inventor has no
water-tight vessel provided by nature. “Never mind, let us line the
basket with clay.” This answers the purpose, but alas! the basket gets
burnt over the blazing fire, the woman watches the process of
cooking with increasing uneasiness, fearing a leak, but no leak
appears. The food, done to a turn, is eaten with peculiar relish; and
the cooking-vessel is examined, half in curiosity, half in satisfaction
at the result. The plastic clay is now hard as stone, and at the same
time looks exceedingly well, for the neat plaiting of the burnt basket
is traced all over it in a pretty pattern. Thus, simultaneously with
pottery, its ornamentation was invented.
Primitive woman has another claim to respect. It was the man,
roving abroad, who invented the art of producing fire at will, but the
woman, unable to imitate him in this, has been a Vestal from the
earliest times. Nothing gives so much trouble as the keeping alight of
the smouldering brand, and, above all, when all the men are absent
from the camp. Heavy rain-clouds gather, already the first large
drops are falling, the first gusts of the storm rage over the plain. The
little flame, a greater anxiety to the woman than her own children,
flickers unsteadily in the blast. What is to be done? A sudden thought
occurs to her, and in an instant she has constructed a primitive hut
out of strips of bark, to protect the flame against rain and wind.
This, or something very like it, was the way in which the principle
of the house was discovered; and even the most hardened misogynist
cannot fairly refuse a woman the credit of it. The protection of the
hearth-fire from the weather is the germ from which the human
dwelling was evolved. Men had little, if any share, in this forward
step, and that only at a late stage. Even at the present day, the
plastering of the housewall with clay and the manufacture of pottery
are exclusively the women’s business. These are two very significant
survivals. Our European kitchen-garden, too, is originally a woman’s
invention, and the hoe, the primitive instrument of agriculture, is,
characteristically enough, still used in this department. But the
noblest achievement which we owe to the other sex is unquestionably
the art of cookery. Roasting alone—the oldest process—is one for
which men took the hint (a very obvious one) from nature. It must
have been suggested by the scorched carcase of some animal
overtaken by the destructive forest-fires. But boiling—the process of
improving organic substances by the help of water heated to boiling-
point—is a much later discovery. It is so recent that it has not even
yet penetrated to all parts of the world. The Polynesians understand
how to steam food, that is, to cook it, neatly wrapped in leaves, in a
hole in the earth between hot stones, the air being excluded, and
(sometimes) a few drops of water sprinkled on the stones; but they
do not understand boiling.
To come back from this digression, we find that the slender Nyasa
woman has, after once more carefully examining the finished pot,
put it aside in the shade to dry. On the following day she sends me
word by her son, Salim Matola, who is always on hand, that she is
going to do the burning, and, on coming out of my house, I find her
already hard at work. She has spread on the ground a layer of very
dry sticks, about as thick as one’s thumb, has laid the pot (now of a
yellowish-grey colour) on them, and is piling brushwood round it.
My faithful Pesa mbili, the mnyampara, who has been standing by,
most obligingly, with a lighted stick, now hands it to her. Both of
them, blowing steadily, light the pile on the lee side, and, when the
flame begins to catch, on the weather side also. Soon the whole is in a
blaze, but the dry fuel is quickly consumed and the fire dies down, so
that we see the red-hot vessel rising from the ashes. The woman
turns it continually with a long stick, sometimes one way and
sometimes another, so that it may be evenly heated all over. In
twenty minutes she rolls it out of the ash-heap, takes up the bundle
of spinach, which has been lying for two days in a jar of water, and
sprinkles the red-hot clay with it. The places where the drops fall are
marked by black spots on the uniform reddish-brown surface. With a
sigh of relief, and with visible satisfaction, the woman rises to an
erect position; she is standing just in a line between me and the fire,
from which a cloud of smoke is just rising: I press the ball of my
camera, the shutter clicks—the apotheosis is achieved! Like a
priestess, representative of her inventive sex, the graceful woman
stands: at her feet the hearth-fire she has given us beside her the
invention she has devised for us, in the background the home she has
built for us.
At Newala, also, I have had the manufacture of pottery carried on
in my presence. Technically the process is better than that already
described, for here we find the beginnings of the potter’s wheel,
which does not seem to exist in the plains; at least I have seen
nothing of the sort. The artist, a frightfully stupid Makua woman, did
not make a depression in the ground to receive the pot she was about
to shape, but used instead a large potsherd. Otherwise, she went to
work in much the same way as Salim’s mother, except that she saved
herself the trouble of walking round and round her work by squatting
at her ease and letting the pot and potsherd rotate round her; this is
surely the first step towards a machine. But it does not follow that
the pot was improved by the process. It is true that it was beautifully
rounded and presented a very creditable appearance when finished,
but the numerous large and small vessels which I have seen, and, in
part, collected, in the “less advanced” districts, are no less so. We
moderns imagine that instruments of precision are necessary to
produce excellent results. Go to the prehistoric collections of our
museums and look at the pots, urns and bowls of our ancestors in the
dim ages of the past, and you will at once perceive your error.
MAKING LONGITUDINAL CUT IN
BARK

DRAWING THE BARK OFF THE LOG

REMOVING THE OUTER BARK


BEATING THE BARK

WORKING THE BARK-CLOTH AFTER BEATING, TO MAKE IT


SOFT

MANUFACTURE OF BARK-CLOTH AT NEWALA


To-day, nearly the whole population of German East Africa is
clothed in imported calico. This was not always the case; even now in
some parts of the north dressed skins are still the prevailing wear,
and in the north-western districts—east and north of Lake
Tanganyika—lies a zone where bark-cloth has not yet been
superseded. Probably not many generations have passed since such
bark fabrics and kilts of skins were the only clothing even in the
south. Even to-day, large quantities of this bright-red or drab
material are still to be found; but if we wish to see it, we must look in
the granaries and on the drying stages inside the native huts, where
it serves less ambitious uses as wrappings for those seeds and fruits
which require to be packed with special care. The salt produced at
Masasi, too, is packed for transport to a distance in large sheets of
bark-cloth. Wherever I found it in any degree possible, I studied the
process of making this cloth. The native requisitioned for the
purpose arrived, carrying a log between two and three yards long and
as thick as his thigh, and nothing else except a curiously-shaped
mallet and the usual long, sharp and pointed knife which all men and
boys wear in a belt at their backs without a sheath—horribile dictu!
[51]
Silently he squats down before me, and with two rapid cuts has
drawn a couple of circles round the log some two yards apart, and
slits the bark lengthwise between them with the point of his knife.
With evident care, he then scrapes off the outer rind all round the
log, so that in a quarter of an hour the inner red layer of the bark
shows up brightly-coloured between the two untouched ends. With
some trouble and much caution, he now loosens the bark at one end,
and opens the cylinder. He then stands up, takes hold of the free
edge with both hands, and turning it inside out, slowly but steadily
pulls it off in one piece. Now comes the troublesome work of
scraping all superfluous particles of outer bark from the outside of
the long, narrow piece of material, while the inner side is carefully
scrutinised for defective spots. At last it is ready for beating. Having
signalled to a friend, who immediately places a bowl of water beside
him, the artificer damps his sheet of bark all over, seizes his mallet,
lays one end of the stuff on the smoothest spot of the log, and
hammers away slowly but continuously. “Very simple!” I think to
myself. “Why, I could do that, too!”—but I am forced to change my
opinions a little later on; for the beating is quite an art, if the fabric is
not to be beaten to pieces. To prevent the breaking of the fibres, the
stuff is several times folded across, so as to interpose several
thicknesses between the mallet and the block. At last the required
state is reached, and the fundi seizes the sheet, still folded, by both
ends, and wrings it out, or calls an assistant to take one end while he
holds the other. The cloth produced in this way is not nearly so fine
and uniform in texture as the famous Uganda bark-cloth, but it is
quite soft, and, above all, cheap.
Now, too, I examine the mallet. My craftsman has been using the
simpler but better form of this implement, a conical block of some
hard wood, its base—the striking surface—being scored across and
across with more or less deeply-cut grooves, and the handle stuck
into a hole in the middle. The other and earlier form of mallet is
shaped in the same way, but the head is fastened by an ingenious
network of bark strips into the split bamboo serving as a handle. The
observation so often made, that ancient customs persist longest in
connection with religious ceremonies and in the life of children, here
finds confirmation. As we shall soon see, bark-cloth is still worn
during the unyago,[52] having been prepared with special solemn
ceremonies; and many a mother, if she has no other garment handy,
will still put her little one into a kilt of bark-cloth, which, after all,
looks better, besides being more in keeping with its African
surroundings, than the ridiculous bit of print from Ulaya.
MAKUA WOMEN

You might also like