Journal of Advances in Modeling Earth Systems

RESEARCH ARTICLE
10.1002/2016MS000686

Mapping the global depth to bedrock for land surface modeling

Wei Shangguan 1, Tomislav Hengl 2, Jorge Mendes de Jesus 2, Hua Yuan 1, and Yongjiu Dai 1

1 School of Atmospheric Sciences, Sun Yat-sen University, Guangzhou, China, 2 ISRIC – World Soil Information, Wageningen, Netherlands

Key Points:
• Observations from soil and geological surveys are combined for developing global spatial prediction models of depth to bedrock
• Machine learning explains 59% of variation in spatial distribution of depth to bedrock for interpolation but much less for extrapolation
• The framework proposed can be used to gradually improve accuracy by adding more ground observations

Supporting Information:
• Supporting Information S1
• Table S1

Correspondence to:
Y. Dai, daiyj6@mail.sysu.edu.cn

Citation:
Shangguan, W., T. Hengl, J. Mendes de Jesus, H. Yuan, and Y. Dai (2017), Mapping the global depth to bedrock for land surface modeling, J. Adv. Model. Earth Syst., 9, 65–88, doi:10.1002/2016MS000686.

Abstract Depth to bedrock serves as the lower boundary of land surface models, which controls hydrologic and biogeochemical processes. This paper presents a framework for global estimation of depth to bedrock (DTB). Observations were extracted from a global compilation of soil profile data (ca. 130,000 locations) and borehole data (ca. 1.6 million locations). Additional pseudo-observations generated by expert knowledge were added to fill in large sampling gaps. The model training points were then overlaid on a stack of 155 covariates including DEM-based hydrological and morphological derivatives, lithologic units, MODIS surface reflectance bands and vegetation indices derived from the MODIS land products. Global spatial prediction models were developed using random forest and Gradient Boosting Tree algorithms. The final predictions were generated at the spatial resolution of 250 m as an ensemble prediction of the two independently fitted models. The 10-fold cross-validation shows that the models explain 59% of the variation for absolute DTB and 34% for censored DTB (depths deeper than 200 cm are predicted as 200 cm). The model for occurrence of R horizon (bedrock) within 200 cm performs well. Visual comparisons of predictions in the study areas where more detailed maps of depth to bedrock exist show that there is a general match with spatial patterns from similar local studies. Limitations of the data set and extrapolation in data-sparse areas should not be ignored in applications. To improve the accuracy of spatial prediction, more borehole drilling logs will need to be added to supplement the existing training points in under-represented areas.

Received 4 APR 2016
Accepted 2 DEC 2016
Accepted article online 20 DEC 2016
Published online 22 JAN 2017

© 2016. The Authors.
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.

1. Introduction
Bedrock is either exposed at the Earth's surface or buried under soil and regolith, sometimes over a thousand meters deep. Understanding the global pattern of underground boundaries such as groundwater and bedrock occurrence is of continuing interest to Earth and geosciences [Schenk and Jackson, 2005; Fan et al., 2013]. In land surface modeling, depth to bedrock (DTB) serves as the lower boundary, which affects the energy, water and carbon cycles. A constant DTB has been assumed in most models due to the lack of data, but this can limit the performance of land surface models [Gochis et al., 2010]. Lawrence et al. [2008] found that deepening the soil column improves the simulated near-surface soil temperature for the permafrost area in the Community Land Model (CLM). Peterman et al. [2014] showed that a variable DTB affects simulated carbon and water in a dynamic vegetation model. Brunke et al. [2016] implemented a global DTB data set [Pelletier et al., 2016] in CLM4.5 and found significant influences on water and energy simulations. Information on DTB is also important to other fields such as hydrology, ecology, soil science, geology, agriculture and civil engineering [Tromp-van Meerveld et al., 2007; Fu et al., 2011]. Bedrock restricts roots, animals and/or other biological activities and is a key soil property of interest for global soil mapping [Arrouays et al., 2014]. In geology, DTB helps geologists describe the natural history of a region and can be used as an input for modelling earthquake and landslide risks [McPherson, 2011]. Information on DTB is useful for mineral exploration [Wilford et al., 2016] and crop yield modelling [Calviño et al., 2003]. Civil engineers need information on DTB to build safe, stable buildings, roads, railways, bridges, and to locate water wells [Price, 2009].
Ground observations of DTB can be used as training data to produce spatial predictions of DTB for the whole area of interest. Various mapping methods, including physically based models, interpolation from point samples and empirical-statistical models [Kuriakose et al., 2009], have been used for this purpose. Pelletier and Rasmussen [2009] developed a numerical model to predict soil thickness in uplands by using the balance between soil production and erosion, modeled using digital elevation model (DEM) data. Karlsson et al. [2014] developed a simplified regolith model to estimate regolith thickness in areas with a high fraction of outcrops based on outcrops, slopes and the distance to outcrops in eight directions. Boer et al. [1996] showed that the maximum likelihood classifier performed better in shale and limestone areas than in the phyllite area for the prediction of soil depth in dry Mediterranean areas. Shafique et al. [2009] predicted regolith thickness from landform, elevation and distance to stream. Tesfa et al. [2009] applied generalized additive models and random forest to predict soil depth from topographic and land cover attributes. Dahlke et al. [2009] used class means of merged spatial explanatory variables to extrapolate the soil depth measured at point locations. Wilford [2012] used airborne gamma-ray spectrometry and digital terrain analysis to derive a weathering intensity index, which can be used to estimate the occurrence of outcrops.
Globally, there are several existing maps of depth to bedrock. One of the first estimates of the global distribution of DTB (limited to the upper 2 m) was produced by FAO [1996]. Here global soil depth was mapped using expert rules, based primarily on the soil unit's classification name, the soil phase and the slope class. Miller and White [1998] derived the DTB for the United States based on STATSGO (State Soil Geographic data). DTB in STATSGO2 is expressed as the shallowest depth of soil components that occupy less than 15% of the area of the map unit [USDA-NCSS, 2006]. Shangguan et al. [2013] estimated soil profile depth and seven basic horizon thicknesses based on soil classes of China. Hengl et al. [2014] further tested zero-inflated models to estimate global DTB based on a global compilation of soil profiles. All of the above examples provide information about DTB only within the upper 2 m. Wilford et al. [2016] produced a regolith depth map for the whole of Australia at 3 arc-seconds resolution by using water well records and the R-Cubist package for model fitting and prediction [Kuhn et al., 2014]. Recently, Pelletier et al. [2016] developed a global data set of the average thicknesses of soil, intact regolith, and sedimentary deposits by representing upland areas by soil data and lowland areas by water well data, using topography, climate, and geology data as input.
The above-mentioned global estimates of DTB are available at coarse resolutions only (1 km or coarser) and/or are often of limited accuracy. In addition, soil, hydrological and geological exploration is often done in isolated domains: predictions based only on soil data, i.e., soil maps [e.g., FAO, 1996; Miller and White, 1998; Hengl et al., 2014], are often limited to the near-surface soil, with values limited to several meters. Likewise, maps based on boreholes from geological explorations are only available for some states in the USA and small regions, with values up to several hundred meters [see e.g., Richard et al., 2007; Illinois State Geological Survey, 2004; Witzke et al., 2010]. Combining soil profiles and boreholes in producing DTB maps is necessary to fill this gap and provide consistent estimates.
In this paper we describe a framework to estimate depth to bedrock at the spatial resolution of 250 m by using state-of-the-art machine learning methods. As training points we use a compilation of publicly available soil profiles and borehole logs. As covariates, we use an extensive list of remote sensing based covariates including the most up-to-date lithologic map of the world, DEM-based hydrological and morphological derivatives and MODIS land products. Our main objective is to use a statistical framework to provide the best possible unbiased predictions of DTB. We develop this framework within the domain of automated soil mapping as part of the SoilGrids system [Hengl et al., 2014], in which spatial predictions can be gradually improved by adding new training data.

2. Materials and Methods


2.1. Basic Definitions
Although soil depth is commonly recorded during fieldwork, it can often mean different things to different groups. For example, "soil depth" from many soil databases cannot be considered equivalent to DTB. "Soil depth" is probably more comparable with terms such as ploughing depth, rooting depth, etc. [Miller and White, 1998; Scholes and Colstoun, 2011; Tesfa et al., 2009]. In the Encyclopedia of Soil Science, bedrock is defined as "a rock body underlying a soil and its parent material" [Chesworth, 2008], so that all rocks (whether soft or hard, consolidated or not) below the soil surface may be considered as bedrock. Weathered rocks or weakly consolidated rocks are sometimes also classified as R horizon or bedrock in WRB and Soil Taxonomy [FAO, 2014; Soil Survey Staff, 2014], although the Cr horizon is most commonly used for such cases. Several terms, including "regolith thickness" [Karlsson et al., 2014], "overburden thickness" [Missouri Geological Survey, 2013] and "drift thickness" [Illinois State Geological Survey, 2004], are also used in the literature to describe the depth to bedrock. In contrast, the term "bedrock" is used relatively consistently in the geological literature [Illinois State Geological Survey, 2004; Missouri Geological Survey, 2013; Karlsson et al., 2014; Jain, 2014], though differences also exist.

Figure 1. Schematic explanation of the depth to bedrock.

We define "bedrock" following Jain [2014] as:

"the consolidated solid rock underlying unconsolidated surface materials, such as soil or other regolith,"

which is considered to be equivalent to the definition of the R horizon (hard rock) in soil science [Schoeneberger et al., 2011].
Correspondingly, we define "Depth to Bedrock" (DTB) (Figure 1) as:

"depth (in cm) from the ground surface to the contact with coherent (continuous) bedrock."

As such, DTB is a skewed variable with many values grouped around 0 depth, while maximum values can range up to a few thousand meters. Exposed bedrock or bedrock visible at the surface is referred to as "rock outcrop," i.e., DTB = 0 [Jain, 2014].

2.2. Observation and Measurement of DTB


Depth to bedrock is most commonly observed and measured in two disciplines (Figure 1):
1. In soil science, where bedrock is considered in soil profile description and soil classification [Soil Survey Staff, 2014; Juilleret et al., 2014] and is commonly labeled as the R layer or horizon. Although many countries have their own classification systems, bedrock, i.e., the R horizon, is often the least problematic variable for harmonization and translation from national to international systems.
2. In geology, where bedrock is identified via excavation, borehole drilling and geophysical sensing. Borehole drilling includes water, oil or gas wells and holes drilled for other purposes such as geotechnical investigations, mineral exploration or temperature measurement. Geophysical methods of measuring DTB include refraction microtremor, electrical resistivity, ground penetrating radar, seismic refraction, seismic reflection and similar methods [Lowrie, 2007]. Although measurements from excavations and geophysical methods such as electrical resistivity are quite reliable [Yamakawa et al., 2012], they are still not appropriate for large areas due to the cost and time of sampling [Karlsson et al., 2014].
Of the two DTB data sources above, soil survey data contain measurements of depth to bedrock, i.e., depth to the R horizon in most cases [Schoeneberger et al., 2012]. Records from borehole drilling (primarily water wells) frequently contain more accurate DTB measurements and/or lithology observations than soil survey data; they will likely remain the major data source for DTB and are also the main focus of this paper. Soil profiles are usually shallower than 200 cm, so they often provide only censored observations. On the other hand, borehole drilling logs are usually very deep, sometimes reaching thousands of meters. The problems of the definition of DTB and its measurement are discussed in the Discussion section.

2.3. Training Data


We use three major data sources for training global spatial prediction models for DTB (Figure 2):
1. A global compilation of soil profile data.
2. A global compilation of borehole drilling logs.
3. Pseudo-observations of DTB, i.e., simulated points containing values of target variables based on remote sensing data (shifting sands and rock outcrops) and on published literature sources (observations without coordinates). For example, it is known from the literature that depth to bedrock in the Sahara is on average about 150 m [Dregne, 2011]. Also, rock outcrops (DTB = 0) are highly correlated with the slope of the local terrain: once a certain slope is reached, the chance of surface rock outcrops becomes high. Hence we use a global map of terrain slope to generate pseudo-points for very steep terrain (>40° slope).
Our rationale for using pseudo-observations is the following. Any purely data-driven model fitted with large gaps in the covariate space will most likely result in serious omissions, especially for areas that are often inaccessible or not of interest to soil surveys or geological exploration. Therefore we use pseudo-observations to fill such gaps in the representation of training points and to avoid overshooting predictions for large areas that are under-represented. As a rule of thumb, and to avoid adding too many soft observations, we keep the number of pseudo-observations below 1% of the total number of training points. Surficial geology maps usually provide the distribution of outcrops in geological surveys [Vermont Geological Survey and Vermont Agency of Natural Resources, 2008], but such data are often not available globally, hence we add pseudo-observations of rock outcrops to fill in the gaps in data. We represent DTB using three variables (Figure 1):
1. the absolute DTB in cm,
2. the censored DTB in cm within 0–200 cm (here values equal to 200 cm indicate "as deep as or deeper than"), and
3. the occurrence of R horizon (bedrock) within 0–200 cm expressed as 0–1 probability values.
The absolute DTB is only available for borehole drilling data and for soil profiles where the absolute DTB is within the observed depth. Censored DTB (within 0–200 cm) is, on the other hand, a heavily skewed variable (essentially a zero-inflated variable) with the majority of values (>90%) deeper than 200 cm. Censored DTB within 0–200 cm and occurrence of R horizon within 0–200 cm are available at all locations, hence these are the most complete variables.
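The relationship between the three target variables can be illustrated with a minimal Python sketch. It is not the authors' R code: the function name, the use of `None` for "bedrock not reached", and the choice to treat a DTB of exactly 200 cm as "not within 200 cm" are assumptions made here for illustration only.

```python
from typing import NamedTuple, Optional

CENSOR_DEPTH_CM = 200.0  # censoring threshold stated in the text


class DtbTargets(NamedTuple):
    absolute_cm: Optional[float]  # None when bedrock was not reached
    censored_cm: float            # 200 means "as deep as or deeper than 200 cm"
    r_horizon_within_200: int     # 1 if bedrock occurs within 0-200 cm, else 0


def derive_targets(observed_dtb_cm: Optional[float]) -> DtbTargets:
    """Derive the three target variables from one observation.

    observed_dtb_cm is the measured depth to bedrock in cm, or None when the
    profile ended without reaching bedrock (an assumed encoding).
    """
    if observed_dtb_cm is None:
        # Only the censored value and the occurrence flag are known.
        return DtbTargets(None, CENSOR_DEPTH_CM, 0)
    if observed_dtb_cm < 0:
        raise ValueError("depth to bedrock cannot be negative")
    censored = min(observed_dtb_cm, CENSOR_DEPTH_CM)
    # Assumption: exactly 200 cm counts as "deeper than", i.e., occurrence 0.
    occurrence = 1 if observed_dtb_cm < CENSOR_DEPTH_CM else 0
    return DtbTargets(observed_dtb_cm, censored, occurrence)
```

Under these assumptions, a borehole with bedrock at 1,500 cm yields an absolute DTB of 1,500 cm, a censored DTB of 200 cm and an occurrence of 0, while a soil profile that never reaches bedrock contributes only the last two values.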
2.3.1. Soil Profiles
We used the global compilation of soil profiles generated and maintained at ISRIC, which includes various national and regional soil profile databases [Hengl et al., 2014; Ribeiro et al., 2015]. In almost all cases, there were no direct records of DTB, and DTB was derived by identifying the R horizon (or based on coarse fragments) and then matching the observed depth for the given horizon. The systematic import of soil profiles resulted in a total of 132,193 points with observed or censored DTB (Figure 2). Note that the soil profiles have good spatial coverage, but they are censored in >80% of cases, i.e., for many points we only know that DTB is deeper than 200 cm, but we do not know the actual absolute DTB. All import steps have been documented via Github (R code, https://github.com/ISRICWorldSoil/SoilGrids250m).


Figure 2. Global distribution of depth to bedrock observations. (a) Red colors indicate soil profiles, (b) blue colors boreholes, and (c) the
yellow colors pseudo observations, i.e., points inserted using expert knowledge.

2.3.2. Boreholes
We use 1,574,776 points with borehole logs from the United States (661,441), Canada (580,063), Australia (5,943), Sweden (320,451), Ireland (4,250), Brazil (2,004), China (598) and Russia (26). The spatial distribution of boreholes is shown in Figure 2. Many states in the US have established digital water well databases over the last several decades. The databases used here include data from the Northern High Plains aquifer, South-Central Kansas, and 14 state databases, i.e., Alaska, Indiana, Iowa, Kentucky, Maine, Minnesota, Missouri, Nevada, New Hampshire, New York, Ohio, Pennsylvania, Tennessee and Vermont. The coordinates of the points from Alaska were derived from the Public Land Survey System, with a geo-location error ranging from ±50 m to ±800 m (still compatible with our target resolution of 250 m). For Canada, four provinces, i.e., British Columbia, Nova Scotia, Prince Edward Island and Quebec, have a water well database. The lists of water wells from the United States and Canada are given in the supporting information. Boreholes of Russia are from Melnikov [1998].
For Australia, we derive DTB from the Australia National Groundwater Information System (ANGIS) (http://www.bom.gov.au/water/groundwater/). Each well contains multiple layers of construction, hydrostratigraphy and lithology logs, which can be used to determine the location of the bedrock [Wilford et al., 2016]. Although the number of recorded points is >200,000, only 5,992 points from the total can be classified as DTB measurements with high enough certainty to be further used for building global spatial prediction models. The lookup tables used to convert original records in the ANGIS to values used for building global spatial prediction models are available in the supporting information. For Brazil and China, DTB was extracted from the lithology layer description by manual interpretation. The Brazil Groundwater Information System (SIAGAS, http://siagasweb.cprm.gov.br) contains 273,972 water wells and the Chinese National Database of Geological Drilling (http://zkinfo.cgsi.cn) contains 410,123 boreholes. Only a small fraction of these contain lithological data that could be used as training points; these points are distributed quite evenly across Brazil and China.
2.3.3. Pseudo or Expert-Based Observations
We use two approaches to generate pseudo-observations:
1. Based on the global mask maps of sand dune areas and steep bare surface areas (e.g., the Himalayas) generated using remote sensing and a slope map of the world, and
2. Based on detailed geological maps reporting rock outcrops.
We generated the global mask maps of sand dune areas and steep bare surface areas using the global MODIS surface reflectance product (MCD43A4) and global DEM and slope maps based on the SRTM DEM [Rabus et al., 2003], both derived at 500 m. After some visual inspection, we discovered that the medium infrared band 7 from the MCD43A4 land product [Moody et al., 2005] can be used to detect areas of high surface reflectance (sand dunes and bare rock). For the shifting sand areas we randomly inserted 300 points (DTB = 150 m; average depth of the sand in the Sahara) and for the steep bare surface areas 200 points (DTB = 0 m). Again, these points were carefully inserted only for the purpose of filling the possible gaps in the data. The resulting global mask maps used to generate pseudo-observations are shown in Figure 3.
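The mask-based insertion of pseudo-observations can be sketched in Python as follows. This is an illustrative sketch, not the authors' R implementation: the mask is assumed to be available as a list of hypothetical (lon, lat) cell centres, and the helper names are invented here. Only the point counts (300 and 200), the DTB values (150 m and 0 m) and the 1% rule of thumb come from the text.

```python
import random

SAND_DTB_CM = 15_000.0      # 150 m, the average Sahara sand depth quoted in the text
OUTCROP_DTB_CM = 0.0        # rock outcrop on steep bare surfaces
MAX_PSEUDO_FRACTION = 0.01  # rule of thumb: pseudo-observations below 1% of all points


def sample_pseudo_points(mask_cells, n_points, dtb_cm, rng):
    """Randomly pick n_points cells (without replacement) and attach a DTB value."""
    chosen = rng.sample(mask_cells, n_points)
    return [(lon, lat, dtb_cm) for lon, lat in chosen]


def pseudo_fraction(n_pseudo, n_real):
    """Share of pseudo-observations in the full training set."""
    return n_pseudo / (n_pseudo + n_real)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sketch is reproducible
    # Hypothetical mask cells; real masks would come from MODIS band 7 and slope maps.
    sand_cells = [(0.1 * i, 23.0) for i in range(5_000)]
    steep_cells = [(80.0 + 0.01 * i, 30.0) for i in range(5_000)]
    pseudo = (sample_pseudo_points(sand_cells, 300, SAND_DTB_CM, rng)
              + sample_pseudo_points(steep_cells, 200, OUTCROP_DTB_CM, rng))
    n_real = 132_193 + 1_574_776  # soil profiles + boreholes reported in the text
    print(len(pseudo), pseudo_fraction(len(pseudo), n_real) < MAX_PSEUDO_FRACTION)
```

With the 500 mask-based points and the roughly 1.7 million real observations reported in the text, the pseudo-observation share stays far below the 1% limit.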
In the second approach, we also generate a few hundred points by using a number of detailed regional geological maps. Regions with exposed bedrock maps include New York State, Vermont, Alaska, Alberta, Manitoba and Newfoundland, and areas covered by the NRCan Groundwater Program (http://gin.gw-info.net/). All steps used to generate pseudo-points have been documented via Github (R code).

2.4. Covariate Layers


2.4.1. Land Mask and Covariates
We generate predictions using the official land mask (which defines the prediction area) used within the SoilGrids project for the purpose of global soil mapping. The global soil mask excludes water bodies and all areas covered with permanent ice, i.e., areas to the south of 60°S.

Figure 3. Global mask maps of shifting sand areas (above) and steep bare surface areas. This map was derived using the medium infrared
band 7 from the MCD43A4 MODIS land product, and global DEM and slope images (based on the SRTM DEM). Projected in the original
MODIS sinusoidal projection system.


The land mask is visible from the final prediction maps shown in Figure 8. As covariates, we use 155 global environmental layers (most of them available from http://worldgrids.org/), which include:
1. SRTM DEM-derived parameters such as topographic wetness index, Valley Bottom Flatness index, slope, terrain curvatures, openness and surface ruggedness index,
2. Global lithological map [Hartmann and Moosdorf, 2012],
3. Global landform map [Sayre et al., 2014],
4. Global land cover GLC30 product [Chen et al., 2015],
5. Climatic surfaces based on WorldClim [Hijmans et al., 2005],
6. MODIS land products, including EVI images and surface reflectance bands,
7. Global Water Table Depth in meters based on Fan et al. [2013],
8. Global 1 km Gridded Thickness of Soil, Regolith, and Sedimentary Deposit Layers based on Pelletier et al. [2016].
The complete list of covariates is given in the supporting information. Note that the map by Pelletier et al.
[2016] is generated by combining process-based models and empirical models, and is as such ideal for sta-
tistical calibration using actual point data. For this purpose we use the layer of average soil and
sedimentary-deposit thickness which shows only depths up to 50 m.

2.5. Model Fitting and Validation


The framework for generating spatial predictions consists of four main steps (Figure 4): (1) overlay observations of DTB and covariates and prepare the regression matrix, (2) fit prediction models, (3) apply the spatial prediction models using covariates, and (4) assess accuracy using cross-validation and compare the predictions with regional maps.
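The first step, overlaying points on covariate layers to build a regression matrix, can be sketched in Python as follows. This is a simplified illustration rather than the authors' R workflow: covariates are assumed to be small in-memory grids in geographic coordinates sharing one origin and cell size, and all names (`cell_index`, `overlay`, the covariate keys) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Grid = Sequence[Sequence[float]]  # rows ordered from north to south


def cell_index(lon: float, lat: float, west: float, north: float,
               cell_deg: float) -> Tuple[int, int]:
    """Return (row, col) of the grid cell containing the point."""
    col = int((lon - west) // cell_deg)
    row = int((north - lat) // cell_deg)
    return row, col


def overlay(points: Sequence[Tuple[float, float, float]],
            covariates: Dict[str, Grid],
            west: float, north: float, cell_deg: float) -> List[dict]:
    """Build regression-matrix rows: the target plus one column per covariate.

    points holds (lon, lat, dtb_cm) tuples; points outside the grids are dropped.
    """
    rows = []
    for lon, lat, dtb_cm in points:
        r, c = cell_index(lon, lat, west, north, cell_deg)
        row = {"dtb_cm": dtb_cm}
        inside = True
        for name, grid in covariates.items():
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                row[name] = grid[r][c]
            else:
                inside = False
                break
        if inside:
            rows.append(row)
    return rows
```

Each resulting row pairs one DTB observation with the covariate values of the cell it falls in, which is the table a tree-based learner is then fitted on.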
Figure 4. The spatial prediction framework used to fit models and predict DTB variables globally at 250 m resolution.

Spatial predictions were generated using an ensemble model based on two data-driven algorithms implemented in the R environment, i.e., random forest (ranger package) and Gradient Boosting Tree (xgboost package). Both models are tree ensemble methods. The random forest model uses fully grown decision trees (low bias, high variance) and reduces error by reducing variance [Breiman, 2001]. The Gradient Boosting Tree uses shallow trees (high bias, low variance) and reduces error mainly by reducing bias, and also to some extent by reducing variance through aggregating the output from many models [Chen and Guestrin, 2016].
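How the two model outputs might be combined into one ensemble prediction can be sketched as below. The text does not state the combination rule, so the weighted average used here (equal weights by default) is an assumption for illustration, and `ensemble_predict` is a hypothetical helper, not part of ranger or xgboost.

```python
from typing import List, Sequence


def ensemble_predict(pred_rf: Sequence[float], pred_gb: Sequence[float],
                     weight_rf: float = 0.5) -> List[float]:
    """Combine two independently fitted models by a weighted average.

    pred_rf and pred_gb are predictions for the same locations from the
    random forest and the gradient boosting model; weight_rf is the
    (assumed) weight given to the random forest.
    """
    if not 0.0 <= weight_rf <= 1.0:
        raise ValueError("weight_rf must be between 0 and 1")
    if len(pred_rf) != len(pred_gb):
        raise ValueError("both models must predict the same locations")
    weight_gb = 1.0 - weight_rf
    return [weight_rf * a + weight_gb * b for a, b in zip(pred_rf, pred_gb)]
```

Weights could, for example, be derived from each model's cross-validation accuracy, although whether the authors did so is not stated in this section.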
For model validation we used 10-fold cross-validation and comparison with regional maps. Cross-validation is used to limit the problem of overfitting and gives insight into how the model will generalize to an independent data set. For each of the three target variables we derive the coefficient of determination (R², or the amount of variation explained by the model), mean error (ME) and root mean square error (RMSE) to evaluate model performance. The amount of variation explained by the model is:
   
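$$R^2 = \left[1 - \frac{\mathrm{SSE}}{\mathrm{SSTO}}\right] = \left[1 - \frac{\mathrm{RMSE}^2}{\sigma_z^2}\right] \quad [0\text{–}100\%] \tag{1}$$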

where SSE is the sum of squares for residuals at cross-validation points (i.e., RMSE² · l, with l the number of points), SSTO is the total sum of squares, and σ_z² is the variance of the observed values. A coefficient of determination close to 1 indicates a near-perfect model, i.e., close to 100% of the variation has been explained by the model. RMSE is then:
$$\mathrm{RMSE} = \sqrt{\frac{1}{l} \sum_{i=1}^{l} \left[\hat{z}(s_i) - z(s_i)\right]^2} \tag{2}$$

where l is the number of validation points, and ẑ(s_i) and z(s_i) are the predicted and observed values at location s_i. For the occurrence of R horizon within 0–200 cm expressed as 0–1 probability values (in essence a binomial variable) we also derive the area under the receiver operating characteristic (ROC) curve, known as the AUC. Values of AUC close to 1 are highly satisfactory, while values of AUC close to 0.5 indicate performance no better than random guessing.
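The accuracy measures in equations (1) and (2), together with ME and AUC, can be computed as in the following Python sketch. It is illustrative only: ME is assumed here to be the mean of predicted minus observed values (the sign convention is not given in the text), and the AUC is computed by simple pairwise comparison, which is only practical for small samples.

```python
import math
from typing import Sequence


def mean_error(obs: Sequence[float], pred: Sequence[float]) -> float:
    """ME, assumed here to be the mean of (predicted - observed)."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


def rmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Root mean square error, as in equation (2)."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Amount of variation explained, 1 - SSE/SSTO, as in equation (1)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((p - o) ** 2 for o, p in zip(obs, pred))
    ssto = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / ssto


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For observed values [0, 100, 200, 300] and predictions [50, 100, 150, 300], SSE is 5,000 and SSTO is 50,000, so R² is 0.9; the same value follows from 1 − RMSE²/σ_z² with RMSE² = 1,250 and σ_z² = 12,500.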
To evaluate the extrapolation risk, we used the following procedure (referred to as "cross-validation by region"). First, all samples were partitioned into subsets by region. Then, the spatial prediction model was calibrated using the subset from one region (or group of regions). Finally, this model was validated using the other subset(s). At the continental scale, the spatial prediction model is calibrated using data from one continent and then applied to the other two. The three continents are North America (United States and Canada), Europe (Sweden and Ireland) and Australia. A similar procedure is applied to the provinces of Canada and the states of the US. For convenience, we call these spatial prediction models continental models and state (province) models. The extrapolation risk is also evaluated by leaving one state out of the calibration for the United States. For convenience, we name each such model after the excluded state, e.g., the "without Ohio" model. All code used to generate predictions is available from the Github channels (https://github.com/ISRICWorldSoil/SoilGrids250m).
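The two region-based validation schemes described above can be sketched as index splits in Python. This is a minimal illustration with hypothetical function names; the model fitting itself is omitted, and only the way samples are partitioned follows the text.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def group_by_region(regions: Sequence[str]) -> Dict[str, List[int]]:
    """Map each region label to the indices of its samples."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, region in enumerate(regions):
        groups[region].append(idx)
    return dict(groups)


def calibrate_on_one_region(
        regions: Sequence[str]
) -> Iterator[Tuple[str, List[int], Dict[str, List[int]]]]:
    """Continental/state models: calibrate on one region, validate on each other region."""
    groups = group_by_region(regions)
    for region, train_idx in groups.items():
        validation = {other: idx for other, idx in groups.items() if other != region}
        yield region, train_idx, validation


def leave_one_region_out(
        regions: Sequence[str]
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """'Without X' models: calibrate on all regions except one, validate on that one."""
    groups = group_by_region(regions)
    for held_out, test_idx in groups.items():
        train_idx = [i for region, idx in groups.items()
                     if region != held_out for i in idx]
        yield held_out, train_idx, test_idx
```

In both schemes the validation samples come from regions the model has never seen, so the resulting R² reflects extrapolation rather than interpolation.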

3. Results
3.1. Summary Statistics
The statistics of the absolute DTB and the censored DTB are given in Table 1. Figure 5 shows the histograms of the absolute DTB and the censored DTB. After logarithmic transformation, the absolute DTB had a distribution similar to the normal distribution but with many zero values (i.e., outcrops). The frequency of values larger than 1 m from soil profiles decreased as the DTB increased. Many borehole values were clustered around 0.5 m, 1 m, 1.5 m, 2 m, etc., as well as at integer multiples of one foot (i.e., 30.48 cm). This is because DTB is usually recorded in feet or (half-) meters in borehole logs.

Table 1. Statistics of the Depth to Bedrock (DTB) in Centimeters^a

Variable      Continent      Minimum  Mean     Median  Maximum  Number
Absolute DTB  Africa         2        1,337.3  125     15,000   3,281
              Asia           0        1,057.9  15      65,379   2,070
              Oceania        0        3,335.9  2,250   66,900   6,251
              Europe         0        690.5    400     22,000   281,563
              North America  0        1,487.4  850     312,541  1,227,393
              South America  0        1,595.1  500     37,000   2,433
              World          0        1,309.3  670     312,541  1,590,464
Censored DTB  Africa         2        110.01   110     195      2,636
              Asia           0        25.4     10      197      1,543
              Oceania        0        61.63    55      198      805
              Europe         0        87.73    100     198      78,491
              North America  0        105.88   120     199      192,214
              South America  0        29.25    10      190      892
              World          0        97.51    100     199      307,936

^a There are 1,379,502 observations which have a value equal to or larger than 200 cm, and these are excluded in calculating statistics of censored DTB.

Figure 5. Histograms of (a, b) absolute depth to bedrock (DTB) and (c, d) censored DTB. For absolute DTB, values equal to or larger than 8800 cm are not shown. For censored DTB, values equal to or larger than 200 cm are not shown. The numbers of observations are 1,590,464, 13,416 and 293,095 for (a, b) absolute DTB, (c) censored DTB from soils, and (d) censored DTB from wells, respectively.
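The clustering of borehole depths at whole feet and half meters noted in Section 3.1 can be checked with a small Python sketch. The helper names and the 0.5 cm tolerance are assumptions made for illustration; only the unit conversion (1 foot = 30.48 cm) and the half-meter step come from the text.

```python
FOOT_CM = 30.48        # one foot in centimeters, as stated in the text
HALF_METER_CM = 50.0   # half-meter recording step


def nearest_multiple_offset(depth_cm: float, step_cm: float) -> float:
    """Absolute distance (cm) from depth_cm to the nearest multiple of step_cm."""
    return abs(depth_cm - round(depth_cm / step_cm) * step_cm)


def rounding_unit(depth_cm: float, tol_cm: float = 0.5) -> str:
    """Guess the recording unit of a depth: 'half-meter', 'foot' or 'other'.

    tol_cm is an assumed tolerance for conversion and rounding noise.
    """
    if nearest_multiple_offset(depth_cm, HALF_METER_CM) <= tol_cm:
        return "half-meter"
    if nearest_multiple_offset(depth_cm, FOOT_CM) <= tol_cm:
        return "foot"
    return "other"
```

Applied to a set of borehole depths, counting the three labels gives a quick indication of how strongly the recorded values are affected by rounding in the original logs.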

3.2. Model Fitting Results


In most cases, model fitting using the random forest and Gradient Boosting Tree algorithms does not show major differences. However, Gradient Boosting Tree reports a somewhat lower R² but a similar RMSE, as derived using Out-Of-Bag training samples. Table 2 shows complete summary results for model fitting and cross-validation.
Figure 6 shows the scaled importance of covariates measured by the residual sum of squares of the random forest models. The most important covariates for the absolute DTB were precipitation, surface reflectance of MODIS MIR band 7, valley depth and DEM. It should be noted that the DTB determined by Pelletier et al. [2016] was also important in prediction. Topography and geological units are also clearly visible in the local patterns, while climatic conditions are most visible in the continental patterns. The most important covariates were similar for the censored DTB and the occurrence of R horizon; they include latitude, surface reflectance of MODIS NIR band 4, daytime land surface temperature, MODIS precipitable water vapor, surface roughness, the Multi-resolution Index of Valley Bottom Flatness and DEM.


Table 2. Summary Statistics and Mapping Performance for Depth to Bedrock (DTB)^a

Variable                   Type           Units  Range      Model Fit RF (R²)  Model Fit GB (R²)  Amount of Variation Explained  ME      RMSE
Absolute DTB               log-normal     cm     0–312,500  0.61               0.38               0.59                           -24.6   1,172
Censored DTB               zero-inflated  cm     0–200      0.35               0.25               0.34                           1.25    51
Occurrence (of R horizon)  binomial       prob.  0–1        0.35               0.23               0.34                           -0.006  0.34

^a RF indicates random forest, GB indicates Gradient Boosting Tree. Amount of variation explained, mean error (ME) and root mean square error (RMSE) were determined using 10-fold cross-validation.

3.3. Mapping Accuracy


Table 2 also shows cross-validation summary statistics of interpolation for the random forest and Gradient Boosting Tree models. Random forest yielded more accurate predictions than Gradient Boosting Tree. The percentage of variance explained by random forest was relatively high for the absolute DTB compared with the other two target variables. Figure 7 shows the cross-validation plot of the absolute DTB: lower values (especially values near zero) are significantly overestimated. Overestimation of the lowest values is a common problem in regression, especially when the model cannot explain more than 50% of the variability in the target variable. Figure 7 also shows that the prediction limits are relatively wide. For the censored DTB, most of the lower values were overestimated, while higher values around 2.5 m were underestimated. The prediction of the occurrence of R horizon has an AUC value of 0.87, which indicates that the prediction is quite good. The error rates were 23.6% and 23.9% for random forest and Gradient Boosting Tree, respectively.
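The summary statistics in Table 2 can be illustrated with a minimal sketch. It assumes that ME is the mean of observed minus predicted values and that the amount of variation explained is 1 − SSE/SST; equation (1) is not reproduced in this section, so both conventions are assumptions, and the function name `cv_metrics` is hypothetical. In a 10-fold cross-validation, the inputs would be the pooled out-of-fold predictions.

```python
import math

def cv_metrics(observed, predicted):
    """Return (ME, RMSE, variation explained) for paired values.

    Assumed conventions (not confirmed by the source):
    ME = mean(observed - predicted); variation explained = 1 - SSE/SST.
    """
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    me = sum(residuals) / n
    sse = sum(r * r for r in residuals)
    rmse = math.sqrt(sse / n)
    mean_obs = sum(observed) / n
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return me, rmse, 1.0 - sse / sst

# Toy values in cm (illustrative only, not data from the study).
obs = [0.0, 100.0, 200.0, 300.0]
pred = [50.0, 100.0, 200.0, 250.0]
me, rmse, r2 = cv_metrics(obs, pred)
print(me, round(rmse, 2), round(r2, 2))
```

The toy values mimic the pattern described above: the lowest value is overestimated and the highest is underestimated, so the errors cancel in ME but still show up in RMSE.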
Table 3 shows the goodness of fit of the continental spatial prediction models and their validation. The calibration R2 ranged from 0.51 to 0.68. R2 was very low (below 0.04) for extrapolation, but ranged from 0.44 to 0.63 for interpolation. The interpolation ME was similar to the calibration ME. Extrapolations had a higher absolute ME than interpolations and calibrations in most cases; an exception is the extrapolation to North America by the Australia model, which had a similar absolute ME. The North America model had the best calibration performance, but its accuracy in extrapolation was similar to that of the other models. Table 4 shows the calibration and validation metrics of the province models of Canada. The Nova Scotia model had the lowest R2. All extrapolations had a low R2; the highest was produced by the Quebec model in predicting British Columbia. This implies that the predictability of extrapolation increased at the province scale. Similar results were observed for the state models of the United States (not shown). Table 5 shows the goodness of fit of the leave-one-state-out models of the United States and their validation. The calibration R2 decreased noticeably when the Northern High Plains aquifer was left out of the calibration: the R2 of the "without Northern High Plains" model was 0.677, while that of the other models was around 0.72. Except for the "without Iowa" and "without Ohio" models, all models had an R2 below 0.1 in extrapolation. The results show that a general model (i.e., a leave-one-state-out model) gave better results in extrapolation than a local model (i.e., a state model) in most cases.
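The leave-one-region-out design behind Tables 3 to 5 can be sketched as follows. This is a toy illustration, not the study's workflow: a one-covariate least-squares line stands in for the random forest model, the region names and values are invented, and R2 is assumed to be 1 − SSE/SST (which can become negative when a model extrapolates badly).

```python
def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(rows, model):
    """Variation explained (1 - SSE/SST) of a fitted line on (x, y) pairs."""
    intercept, slope = model
    mean_y = sum(y for _, y in rows) / len(rows)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
    sst = sum((y - mean_y) ** 2 for _, y in rows)
    return 1.0 - sse / sst

def leave_one_region_out(regions):
    """Fit on all regions but one, then validate on the held-out region."""
    scores = {}
    for held_out in regions:
        train = [row for name, rows in regions.items()
                 if name != held_out for row in rows]
        scores[held_out] = r_squared(regions[held_out], fit_line(train))
    return scores

# Hypothetical regions: (covariate, log-DTB) pairs with differing relationships.
regions = {
    "north": [(1, 2.0), (2, 4.0), (3, 6.0)],
    "south": [(1, 2.2), (2, 3.8), (3, 6.1)],
    "island": [(1, 9.0), (2, 8.0), (3, 7.0)],
}
scores = leave_one_region_out(regions)
print(scores)
```

Because the "island" relationship differs from the other two regions, a model fitted without it transfers poorly, which is the situation the extrapolation columns of the tables quantify.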

3.4. Final Predictions


Figure 8 shows the output predictions of the absolute DTB, the censored DTB and the occurrence probability of the R horizon by the ensemble model based on random forest and Gradient Boosting Tree at 250 m resolution. We chose to map DTB at a resolution of 250 m based on the available data sources and our available computing power. The mean predicted absolute DTB was 33.6 m. High values of DTB are visible in the desert areas and around 70°N, 45°N and 40°S. Somewhat lower values of DTB are visible in the tropics. For the censored DTB, low values were found in mountainous areas, especially in Mexico. For about 85% of the land surface, we predict DTB to be larger than 2 m. R horizons, i.e., shallow soils, seem to be most strongly correlated with topography and are visible along the major mountain chains. These patterns fit well with expert knowledge [Brown et al., 2001; Dregne, 2011; Howell, 1960; Swinford, 2004].
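How two independently fitted models might be combined into an ensemble prediction can be sketched as a weighted average per grid cell. This section does not state how the ensemble weights were chosen, so the equal weighting below is an assumption, and `ensemble_mean` is a hypothetical helper.

```python
def ensemble_mean(pred_rf, pred_gb, weight_rf=0.5):
    """Blend two per-cell prediction lists with a weighted average.

    weight_rf = 0.5 (equal weights) is an assumption, not the study's setting.
    """
    if len(pred_rf) != len(pred_gb):
        raise ValueError("prediction lists must have the same length")
    weight_gb = 1.0 - weight_rf
    return [weight_rf * a + weight_gb * b for a, b in zip(pred_rf, pred_gb)]

# Illustrative DTB predictions in cm for three grid cells.
rf_cm = [120.0, 3400.0, 15.0]
gb_cm = [180.0, 2600.0, 45.0]
blended = ensemble_mean(rf_cm, gb_cm)
print(blended)
```

Weights proportional to each model's cross-validation accuracy would be a natural alternative to equal weights.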

3.5. Comparison With Regional Maps and Observations


We used regional maps of DTB from Iowa and Ohio to validate global predictions both visually and statistically. The Iowa map was drawn by geologists based on various data sources, including bedrock outcrop maps, water wells, boreholes and soil descriptions (filtered for those soils encountering bedrock) [Witzke et al., 2010]. The Ohio map was produced using over 162,000 data points as control for the bedrock-topography lines [Swinford, 2004]. Ground-moraine dominated areas have a shallow DTB, the ice-deposited Wisconsinan-age ridge moraines generally have a medium DTB, and limited areas of deep DTB are largely the result of deep bedrock valleys filled with drift.

Figure 6. Scaled importance of covariates with the resolution of 250 m for target variables by random forest model. NIR is near infrared radiation. MIR is middle infrared radiation. MRVBF is Multiresolution Index of Valley Bottom Flatness. LST is land surface temperature. PWV is precipitable water vapor.


Figure 7. Plot showing cross-validation results for absolute depth to bedrock on the logarithmic scale. R-square is calculated using formula
in equation (1).

The correlation coefficients between our predictions and the regional maps are 0.82 and 0.6 for Iowa and Ohio, respectively. Although the regional maps of DTB cannot be considered ground truth, they can nevertheless be considered several times more accurate than our global predictions. For both areas, there is an underestimation according to the mean error (−422 for Iowa and −528 for Ohio). Although the differences in Figures 9 and 10 indicate that there is some underestimation of higher values, especially in the case of Ohio, this comparison also shows that the general patterns of the regional maps and our predictions match in most cases. In Iowa, the bedrock surface is buried by unconsolidated surficial sediments (mostly Quaternary) over most of its extent. Shallow DTB was found in the southwest and northwest of Iowa. Most areas of Ohio are covered by sediments left by continental glaciers. In southwest Ohio, the bedrock surface is very close to the land surface because this area was free from glaciation.
The correlation coefficients between the map of Pelletier et al. [2016] and the regional maps are 0.27 and 0.24 for Iowa and Ohio, respectively. For both areas, the spatial patterns were quite different (Figures 9 and 10). For Iowa, the deep DTB in the eastern part did not appear in the map of Pelletier et al. [2016]. For Ohio, the frequency of medium values in the map of Pelletier et al. [2016] was very low compared to the regional map, and most values were either near zero or 50 m.
Figure 11 shows the comparison between the observations, our prediction and the regional DTB maps along a line in Iowa and Ohio. In general, our predictions coincide well with the observations and the regional maps. Compared to the regional map of Iowa, our prediction underestimated DTB for the bedrock valley along the way from Alvord to Wever. Compared to the regional map of Ohio, our prediction overestimated DTB for the hill slope at around 100 km along the line from Montpelier to Pomeroy and underestimated it for the valley at around 250 km.

Table 3. Calibration and Validation Metrics of Continental Models^a

| Calibration Area | Calibration R2 | Calibration ME | Validation R2: North America | Validation R2: Europe | Validation R2: Australia | Validation ME: North America | Validation ME: Europe | Validation ME: Australia |
|---|---|---|---|---|---|---|---|---|
| North America | 0.684 | −0.02 | 0.63 | 0.005 | 0.0001 | 0.05 | 1.11 | −0.64 |
| Europe | 0.513 | −0.04 | 0.006 | 0.442 | 0.004 | −0.75 | −0.03 | −2.26 |
| Australia | 0.598 | −0.05 | 0.029 | 0.0003 | 0.546 | 0.11 | 0.95 | −0.11 |

^a ME is mean error, which is calculated after logarithm transform.

Table 4. Calibration and Validation Metrics of Province Models of Canada^a

| Calibration Area | Calibration R2 | Calibration ME | Validation R2: British Columbia | Validation R2: Nova Scotia | Validation R2: Ontario | Validation R2: Quebec | Validation ME: British Columbia | Validation ME: Nova Scotia | Validation ME: Ontario | Validation ME: Quebec |
|---|---|---|---|---|---|---|---|---|---|---|
| British Columbia | 0.467 | −0.03 | 0.407 | 0.017 | 0.004 | 0.044 | −0.03 | 0.48 | 0.85 | 1.05 |
| Nova Scotia | 0.376 | −0.006 | 0.0001 | 0.302 | 0.036 | 0.005 | −0.4 | 0.01 | −0.32 | 0.15 |
| Ontario | 0.755 | −0.01 | 0.048 | 0.013 | 0.604 | 0.06 | −0.22 | 0.02 | 0.03 | 0.4 |
| Quebec | 0.444 | −0.03 | 0.065 | 0.025 | 0.008 | 0.386 | −1.92 | −1.01 | −0.64 | 0.006 |

^a ME is mean error, which is calculated after logarithm transform.
Because we used the map of Pelletier et al. [2016] as a covariate in the prediction, the comparisons above may be biased. However, the cross-validation results show that the amount of variation explained decreased only from 58.7% to 58.6% when the map of Pelletier et al. [2016] was removed from the covariate list. The reason may be that many of the patterns in the map of Pelletier et al. [2016] are already represented in the existing list of covariates (especially the DEM-derived parameters, which were also used in producing the map of Pelletier et al. [2016]). Thus, the resulting map will not change much if the map of Pelletier et al. [2016] is removed as a covariate, and the comparisons above are not problematic.
Figures 12 and 13 show the comparison between observations, our prediction and the map of Pelletier et al. [2016] for Kentucky and Pennsylvania. The correlation coefficient between our prediction and the observations was relatively high, and the machine learning models reflected the major spatial patterns of DTB. However, the underestimation of high values and the overestimation of low values were significant. In contrast, the map of Pelletier et al. [2016] gave extreme estimates, i.e., very high or very low, for almost all areas, and did not reflect the major spatial patterns in the observations. For example, in their map almost the whole state of Kentucky has a shallow DTB, and the high values in the southeast corner of the state are almost entirely missing. This may be caused by the misclassification of landforms.
We validated the map of Pelletier et al. [2016] against our DTB observations after excluding values of 50 m or more, because the maximum value of Pelletier et al. [2016] is 50 m. For the interpolation area (Indiana, Kentucky, New York and Pennsylvania, where they used DTB data for calibration), the amount of variation explained is 5%. For extrapolation, the amount of variation explained is 2%.

Table 5. Calibration and Validation Metrics of Leave One State Out Models of United States^a

| Calibration Area (Without) | Calibration R2 | Calibration ME | Validation R2: Interpolation^b | Validation R2: Extrapolation | Validation ME: Interpolation^b | Validation ME: Extrapolation |
|---|---|---|---|---|---|---|
| Indiana | 0.735 | −0.019 | 0.644 | 0.092 | −0.026 | −0.261 |
| Iowa | 0.717 | −0.018 | 0.63 | 0.201 | 0.024 | 0.041 |
| South central Kansas | 0.727 | −0.019 | 0.637 | 0.045 | −0.022 | 0.285 |
| Kentucky | 0.717 | −0.02 | 0.677 | 0.003 | 0.11 | 0.68 |
| Maine | 0.73 | −0.021 | 0.613 | 0.012 | 0.029 | 0.221 |
| Minnesota | 0.714 | −0.019 | 0.616 | 0.099 | 0.026 | −0.567 |
| Missouri | 0.741 | −0.023 | 0.683 | 0.054 | −0.024 | −0.306 |
| New Hampshire | 0.726 | −0.019 | 0.648 | 0.051 | 0.039 | 0.003 |
| New York | 0.746 | −0.02 | 0.684 | 0.008 | −0.034 | −0.151 |
| Ohio | 0.722 | −0.021 | 0.663 | 0.17 | −0.041 | 0.011 |
| Northern High Plains | 0.677 | −0.018 | 0.621 | 0.033 | 0.008 | −0.739 |
| Pennsylvania | 0.736 | −0.02 | 0.674 | 0.007 | −0.016 | −0.042 |
| Tennessee | 0.732 | −0.02 | 0.653 | 0.002 | −0.042 | −0.296 |
| Vermont | 0.729 | −0.02 | 0.652 | 0.012 | 0.034 | 0.119 |

^a ME is mean error, which is calculated after logarithm transform.
^b The average of all interpolations of leave one state out models.


Figure 8. Final prediction of (a) the absolute depth to bedrock (cm), (b) the censored depth to bedrock (cm, here values equal to 200 cm
indicate ‘‘deep as or deeper than’’), and (c) occurrence of R horizon within 200 cm (%). The maximum value of the absolute depth to
bedrock is set as 250 m for the convenience of visualization. But the actual maximum predicted value is about 540 m.

4. Discussion
We used the most abundant depth to bedrock observations from soil surveys and geologic boreholes (primarily water wells) to estimate the global spatial distribution of DTB using data-driven models. This work presents the most up-to-date global DTB maps, with a higher resolution (250 m) and higher accuracy than previous studies such as Pelletier et al. [2016]. The cross-validation statistics show that the absolute DTB map and the occurrence of R horizon have moderate accuracy, and the censored DTB map has low accuracy. There is an overestimation of the absolute DTB, with a mean error of −0.25 m, and an underestimation of the censored DTB, with a mean error of 1.25 cm. The large RMSE (11.7 m) in relation to the mean predicted values highlights the need for considered use of the depth predictions. Our prediction patterns of DTB also match the regional maps from Iowa and Ohio, although the average differences in values are about ±10 m.

Figure 9. Comparison of (a) regional map of Iowa, (b) our prediction and (d) map of Pelletier et al. [2016]. (c, e) The scatter plots with the correlation coefficient indicate how well our prediction and Pelletier et al.'s [2016] prediction match the regional predictions. Values have been stretched using a log-scale to emphasize spatial patterns. Note that the maximum value of Pelletier et al. [2016] is 50 m, so values of 50 m or more were excluded from the corresponding scatter plots.

Figure 11. Comparison of measured and predicted absolute depth to bedrock for (a) Iowa and (b) Ohio. The points are the observations. The black line is the land surface elevation. The red line is the predicted DTB. The blue line is the DTB of regional map.

4.1. Problems of the Definition of Depth to Bedrock and Its Measurements


In this study, we used DTB observations from soil profiles and borehole drillings and assumed that they share the same definition. However, there are some problems with the definition of depth to bedrock and its measurement. In soil surveys, the definition of R horizon or hard rock is not strictly equivalent to bedrock, because intact regolith (weathered bedrock) may be included in the R horizon. Although the definition of bedrock in geological surveys is more consistent, DTB measurements from borehole drillings are based almost entirely on the judgements of nonscientists (i.e., drillers), which lowers the accuracy of DTB. As shown in Figure 1, the majority of soil profiles (usually less than 2 m deep) do not encounter bedrock, so the DTB values from soil profiles are censored data. In contrast, borehole drillings go much deeper and fail to encounter bedrock in only a few cases. For pseudo-observations, we also assumed that DTB is zero where local slopes exceed 40°, even though such surface bedrock is often highly fractured and porous. This assumption was made for simplicity and may not be a serious problem, because the resulting maps did not change significantly when we changed these values to a random number between 0 and 20 cm.
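The slope-based pseudo-observation rule and its sensitivity test can be sketched as follows. The function name, the `None` marker for cells that receive no pseudo-observation, and the use of a uniform distribution for the random variant are assumptions made for illustration; the text only states the 40° threshold and the 0 to 20 cm range.

```python
import random

def pseudo_dtb_from_slope(slopes_deg, threshold_deg=40.0, max_cm=0.0, seed=0):
    """Assign pseudo-observed DTB (cm) where the local slope exceeds a threshold.

    max_cm = 0 gives DTB = 0 on steep slopes; max_cm = 20 mimics the
    sensitivity test with a random value between 0 and 20 cm (uniform
    sampling is an assumption). Cells at or below the threshold get None.
    """
    rng = random.Random(seed)
    result = []
    for slope in slopes_deg:
        if slope > threshold_deg:
            result.append(rng.uniform(0.0, max_cm) if max_cm > 0 else 0.0)
        else:
            result.append(None)
    return result

slopes = [10.0, 45.0, 40.0, 60.0]
print(pseudo_dtb_from_slope(slopes))               # zero-DTB variant
print(pseudo_dtb_from_slope(slopes, max_cm=20.0))  # random 0 to 20 cm variant
```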

4.2. Success and Limitation of the Data Set


As mentioned above, soil profiles are censored data, and censored data produce a censored map. As a result, maps produced solely from soil profiles cannot be interpreted as true DTB, but only as "deeper than" the predicted values. In this study, we also used deep observations from boreholes, which compensate for the shallow observations from soil profiles. Thus, the predicted maps are more realistic. For the occurrence of R horizon, the models provided relatively reliable estimates. However, for the censored DTB, the models were not very successful in finding the relationship between the target variable and the covariates, and the resulting map remains experimental.

Figure 10. Comparison of (a) regional map of Ohio, (b) our predictions and (d) map of Pelletier et al. [2016]. (c, e) The scatter plots with the correlation coefficient indicate how well our prediction and Pelletier et al.'s [2016] prediction match the regional predictions. Values have been stretched using a log-scale to emphasize spatial patterns. Note that the maximum value of Pelletier et al. [2016] is 50 m, so values of 50 m or more were excluded from the corresponding scatter plots.

Figure 12. Comparison of (a) observations of Kentucky, (b) our predictions and (d) map of Pelletier et al. [2016]. (c, e) The scatter plots with the correlation coefficient indicate how well our prediction and Pelletier et al.'s [2016] prediction match the observations. Values have been stretched using a log-scale to emphasize spatial patterns. Note that the maximum value of Pelletier et al. [2016] is 50 m, so values of 50 m or more were excluded from the corresponding scatter plots.
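The relationship between the three target variables can be made concrete with a small sketch. It assumes a 200 cm censoring limit (as in the Figure 8 caption) and treats a DTB of exactly 200 cm as "no R horizon within 200 cm"; that boundary choice and the function names are assumptions.

```python
def censor_dtb(dtb_cm, limit_cm=200.0):
    """Censor depths at the limit: anything deeper is recorded as the limit."""
    return [min(depth, limit_cm) for depth in dtb_cm]

def r_horizon_occurrence(dtb_cm, limit_cm=200.0):
    """1 if bedrock is reached above the limit, else 0 (boundary is assumed)."""
    return [1 if depth < limit_cm else 0 for depth in dtb_cm]

# Illustrative absolute DTB values in cm: a shallow soil, a value at the
# limit, and a deep borehole observation.
absolute_cm = [30.0, 200.0, 1500.0]
print(censor_dtb(absolute_cm))
print(r_horizon_occurrence(absolute_cm))
```

A censored value equal to the limit therefore carries only the information "200 cm or deeper", which is why maps built from soil profiles alone cannot be read as true DTB.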
The amount of variation explained by the models for the absolute DTB is about 59%, which means that about 41% remains unexplained. Mapping depth to bedrock is certainly complex (soils are hidden from view and are the result of past gradual and abrupt processes). Most likely, more detailed geomorphological and lithological maps could be the key to improving the predictions. At the moment we use the GLiM data set, which is of very general scale and low quality. As soon as more detailed global lithological data arrive in the public domain, they will be useful for improving the predictions.

Figure 13. Comparison of (a) regional map of Pennsylvania, (b) our predictions, and (d) map of Pelletier et al. [2016]. (c, e) The scatter plots with the correlation coefficient indicate how well our prediction and Pelletier et al.'s [2016] prediction match the regional predictions. Values have been stretched using a log-scale to emphasize spatial patterns. Note that the maximum value of Pelletier et al. [2016] is 50 m, so values of 50 m or more were excluded from the corresponding scatter plots.
It is common in regression, including the machine learning models we used, for low and high values to be smoothed out when R-square is low. The deepest observation in the source data is about 3000 m, but the maximum predicted value is only about 540 m. The machine learning models also overestimated zero DTB values, i.e., many outcrops were predicted as values around 300 cm (Figure 7). As a result, the signature of the Andes, the Himalayas and many other mountain ranges, where DTB is near zero, is not very clear in the map of absolute DTB. Another reason for the poor performance in mountain ranges is that we have few observations there, only some pseudo-observations.
We could not predict very deep values, such as depths of more than 1 km in the Andean foreland basin, because the borehole data are also censored to some extent, i.e., we do not have many deep observations in such areas. There is no universal requirement on how deep a drilling should go, so we do not know by how much the borehole data are censored (likely dozens of meters). Fortunately, most applications, including Earth System Models, are more interested in shallow DTBs. Even though we estimate the absolute DTB, it should be treated as a censored DTB when the interest is in deep DTBs.

4.3. Process-Based Model and Empirical Model


Pelletier et al. [2016] divided the global land surface into three landform components (upland hillslope, upland valley bottom, and lowland) and used a different model for each component to estimate the DTB. For upland hillslopes, a model based on the balance between soil production and erosion, which uses topographic curvature and mean annual rainfall, was calibrated with soil thickness data to estimate soil thickness, and the regolith thickness was estimated from water table depth, which has a high uncertainty. For upland valley bottoms, the DTB was estimated by assuming that the side slopes project downward to form a V-shaped valley. For lowlands, an empirical model between the topographic roughness index and the DTB was established using water well logs. Process-based models have the strength of capturing one or two major factors and applying general rules globally, but their simplification of soil and regolith formation processes may ignore other factors. The advantage of statistical models, such as the random forest and Gradient Boosting Tree used in this study, is that they can use as many covariates as possible, including DTB maps, and can reflect the spatial variation of the relationship between the target variables and the covariates. Compared to process-based models and other empirical models, however, machine learning models driven with big data require considerable computational power.
According to the comparison in the results section, our data-driven method provided more accurate predictions in interpolation. The process-based model had its strength in extrapolation areas, where it had an R2 of around 0.02, slightly higher than those of the empirical models (Table 3), though both produced maps with high uncertainty. The two approaches are complementary, and the best way forward may be to use data fusion to compensate for the weaknesses of each. It is still challenging to estimate the global DTB accurately because of the lack of observations and the poor understanding of the processes affecting DTB, and improvements are needed for both approaches.
In our study, we used the DTB map of Pelletier et al. [2016] as a covariate, and it ranked as the seventh most important covariate for the absolute DTB, as shown in Figure 6. This indicates that our prediction has quite different spatial patterns, although broad similarities are visible. The algorithm picked precipitation as the most important covariate, which coincides with its control on the rate of soil production. Topographic parameters, including the DEM, valley depth and the Multiresolution Index of Valley Bottom Flatness, were also important factors, as they affect soil erosion processes and the transport of sediments. Surface reflectance of MODIS MIR band 7 also plays an important role in predicting DTB. One deficiency of our study is that, with the exception of the geological units, the covariates reflect only current surface or shallow subsurface conditions. Therefore, little covariate information relating to deeper conditions and/or to long-term changes in DTB could be included in the model fitting.

4.4. Extrapolation Risks


The "homosoil" approach has been proposed to extrapolate from reference areas with soil data to areas of interest without soil data when these areas have similar soil-forming factors [Boettinger et al., 2010]. However, the accuracy in extrapolation areas is usually much lower than in interpolation areas (Tables 3–5). It should be noted that the accuracy assessment by cross-validation in this study is valid only for the interpolation areas. Not only extrapolation in feature space but also extrapolation in geographic space leads to poor performance of spatial prediction models. In this study, although the spatial coverage of soil profiles was quite good, boreholes are spatially clustered and their spatial coverage was not ideal. The systematic omission of deep DTB observations where there are no water wells or other boreholes led to underestimation of the DTB. For example, tropical rainforests usually have a very deep regolith, but this feature is not predicted in the resulting DTB map because of the lack of deep observations in those areas. We used the ellipsoid defined by Montgomery et al. [2001] to determine feature space similarity. The results show that the feature space is covered well by the point observations (above 99.9%), indicating that there is essentially no extrapolation in feature space. However, the relationship between the dependent and independent variables may not carry over from one region to another. To reduce the extrapolation risk, the spatial coverage of deep DTB observations is more important than their coverage in feature space.
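For orientation, a common textbook form of such an ellipsoid is given below. The text does not spell out the exact criterion, so this is an assumption: a prediction location with covariate vector $\mathbf{x}_0$ is treated as lying inside the sampled feature space when its leverage does not exceed the largest leverage among the training points, where $\mathbf{X}$ is the matrix of training covariates and $\mathbf{x}_i$ is its $i$-th row.

```latex
\mathbf{x}_0^{\top}\,(\mathbf{X}^{\top}\mathbf{X})^{-1}\,\mathbf{x}_0 \;\le\; h_{\max},
\qquad
h_{\max} = \max_{i}\, h_{ii},
\qquad
h_{ii} = \mathbf{x}_i^{\top}\,(\mathbf{X}^{\top}\mathbf{X})^{-1}\,\mathbf{x}_i
```

Under this reading, the reported coverage above 99.9% would be the share of prediction locations that satisfy the inequality.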

4.5. Effect of Observation Density on Model Results


We developed a data thinning algorithm, implemented as the function sample.grid in the R package GSIF, to obtain a subset of spatially clustered data points such that the output points are distributed evenly in space. The points are overlaid on a grid with a specified cell size, and at most a specified number of points is retained from each grid cell. If a cell has no more points than the specified number, all of its points are taken. If a cell has more points than the specified number, only that number of points is taken by random sampling. To test the effect of observation density on model results, we used this algorithm to obtain subsets of the observations, taking Kentucky (Figure 12) as an example. Eight cell sizes (50 m, 100 m, 200 m, 500 m, 1000 m, 2000 m, 4000 m, and 8000 m) were tested. The maximum number of points per cell was set to 1. The density and the number of observations increased as the cell size decreased. These subsets were used to fit random forest models, and the rest of the observations were used for validation. Figure 14 shows that the validation performance of the models increased rapidly as the cell size decreased. However, the amount of variation explained did not increase much once the cell size was 100 m or less, where it was slightly smaller than the amount of variation explained by 10-fold cross-validation (54%). This indicates that there should be at least one observation in each 100 m by 100 m grid cell to represent the spatial variation of DTB. It should be noted that less than 1% of the grid cells within the interpolation area had an observation when the cell size was 100 m by 100 m, because the observations are spatially clustered. As a result, adding more observations will improve the prediction even in areas with a high density of observations. We also tested the above procedure on the global observations. The results show that the amount of variation explained in validation was 19% when the cell size was 100 km by 100 km and only 2,308 observations were used for model calibration. This indicates that there was still some predictability when observations were very sparse but evenly distributed in space.

Figure 14. Effects of observation density on the model performance for Kentucky. Black line is the amount of variation explained. Red line is the percentage of the observations used for model calibration. There are 82,905 observations in total.
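The thinning procedure described above can be sketched in a few lines. This is an illustrative re-implementation of the idea in Python, not the code of sample.grid in GSIF; the function name `thin_points`, the projected (x, y) coordinates in meters, and the fixed random seed are assumptions.

```python
import math
import random
from collections import defaultdict

def thin_points(points, cell_size, max_per_cell=1, seed=0):
    """Keep at most max_per_cell points from each square grid cell.

    points: iterable of (x, y) coordinates in a projected system (meters).
    Cells with few points are kept whole; crowded cells are randomly sampled.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append((x, y))
    kept = []
    for key in sorted(cells):
        members = cells[key]
        if len(members) <= max_per_cell:
            kept.extend(members)
        else:
            kept.extend(rng.sample(members, max_per_cell))
    return kept

# A small clustered example: three points share one 100 m cell.
points = [(10, 10), (20, 30), (40, 45), (150, 20), (160, 80), (310, 310)]
calibration = thin_points(points, cell_size=100)
validation = [p for p in points if p not in calibration]
print(len(calibration), len(validation))
```

As in the experiment above, the thinned subset would be used for calibration and the remaining points for validation; a larger cell size yields a sparser but more evenly spread subset.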

4.6. Suggestion on the Usage of the Data Set


Users should be aware of the limitations of the predicted maps and of the low accuracy in extrapolation areas, as mentioned above. If a user is interested in deep DTB (larger than 100 m) or requires accurate values of shallow DTB (on the order of several centimeters), the data set may not be accurate enough to meet these demands. For global-scale applications such as Earth system modelling, it is necessary to aggregate the data set to a lower resolution by averaging. Although the resolution is 250 m, the data set is not the first choice for regional applications if local maps exist. Users will need to make their own decisions, recognizing the advantages and limitations described in this paper.
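Aggregation to a coarser resolution by averaging can be sketched as a block mean over the fine grid. The helper below is hypothetical and makes simplifying assumptions: the grid is a list of rows, missing cells (e.g., water) are `None` and are skipped, and cells are weighted equally, which ignores the change of cell area with latitude on a geographic grid.

```python
def aggregate_mean(grid, factor):
    """Average factor x factor blocks of a 2-D grid, skipping None cells."""
    n_rows, n_cols = len(grid), len(grid[0])
    coarse = []
    for r0 in range(0, n_rows, factor):
        coarse_row = []
        for c0 in range(0, n_cols, factor):
            block = [grid[r][c]
                     for r in range(r0, min(r0 + factor, n_rows))
                     for c in range(c0, min(c0 + factor, n_cols))
                     if grid[r][c] is not None]
            # A block with no valid cells stays missing.
            coarse_row.append(sum(block) / len(block) if block else None)
        coarse.append(coarse_row)
    return coarse

# Illustrative DTB values in cm on a 2 x 4 fine grid, aggregated by a factor of 2.
fine = [[100, 200, 50, None],
        [300, 400, 150, None]]
print(aggregate_mean(fine, 2))
```

Because DTB is strongly skewed, an arithmetic mean is dominated by deep cells; whether to average in the original or the logarithmic scale is a choice left to the user.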

4.7. Further Improvements


Predicting what lies beneath the soil surface is not trivial. We believe that the key to improving the predictions of DTB is adding more training points, especially in areas such as Latin America, Asia, and Africa, where the model extrapolates heavily (see Figure 2). In that context, there are other, less readily available data, including point observations and regional maps, that could be used to improve the data sources used to produce a global map. Some borehole observations contain no direct record of depth to bedrock, though it is possible to extract DTB values using lookup tables, as we did with the Australian, Brazilian and Chinese data. On the other hand, matching DTB values with lookup tables introduces uncertainty. Records from seismic sources and engineering borehole data could also help improve the accuracy of DTB predictions. However, these data usually cover small areas and vary in quality, which presents challenges for harmonization into a global data set. Additional regional DTB maps could also be used to improve global predictions. For example, Brown et al. [2001] provided two classes of overburden thickness (>5–10 m and <5–10 m) in the Circum-Arctic Map of Permafrost. Likewise, surficial geology maps could be used to infer the DTB of adjacent areas [Karlsson et al., 2014]. In any case, increasing the representation and quality of training data will likely remain our main strategy for improving these maps.
Earth System Models can handle multiple layers and subgrid structure when using depth estimation products such as that of Pelletier et al. [2016]. Because of the greater uncertainty of the depth of intact regolith in uplands in the product of Pelletier et al. [2016], it is still not practical to include it in Earth System Models. As a result, the application of that data set in the Community Land Model used only the DTB. In the future, it will be necessary to include both DTB and soil depth (or depth to regolith) to represent the water and energy balance more accurately in land surface processes, because regolith and soil have quite different thermal and hydraulic properties. This depends on the availability of depth data and the development of corresponding methods.

5. Conclusions
We produced maps of the depth to bedrock, including the absolute DTB, the censored DTB, and the occurrence of R horizon within 200 cm, for the whole world using state-of-the-art ground observations of depth to bedrock and machine learning algorithms. This data set provides Earth System Models with a more accurate estimate of the lower boundary condition. The cross-validation suggests moderate performance for the absolute DTB and the occurrence of R horizon. However, the censored DTB contains a significant amount of over-predicted low values. The predictability of DTB was limited by the inherent variability, inaccuracies and censored nature of the observations and by the biased spatial coverage of the input data. In addition, almost all the covariates used in this study reflect surface or near-surface characteristics and processes in modern times, which restricts the ability to predict higher values of DTB (i.e., deeper DTB). Incorporating more observations, especially borehole drilling logs in the tropics, wetlands, mountain ranges, shifting sand areas and similar, would help improve the resulting maps and increase accuracy, especially for higher values of DTB. As all processes, from point-to-raster overlay to model fitting, are fully automated, we hope to produce increasingly accurate maps of the lower boundary of the world's soil and regolith by gradually adding new training data. The resulting global maps are available for download at http://globalchange.bnu.edu.cn/ and http://soilgrids.org/.

Acknowledgments
This work was supported by the Natural Science Foundation of China (under grants 41575072, 41405096) and R&D Special Fund for Nonprofit Industry (Meteorology, GYHY201206013, GYHY201306066). ISRIC is a nonprofit-making organization, core-funded by the Dutch government, with a mandate to serve the international community as custodian of global soil information and to increase awareness and understanding of the role of soils in major global issues.

References
Arrouays, D., et al. (2014), GlobalSoilMap: Toward a fine-resolution global grid of soil properties, Adv. Agron., 125, 93–134, doi:10.1016/B978-0-12-800137-0.00003-0.
Boer, M., G. DelBarrio, and J. Puigdefabregas (1996), Mapping soil depth classes in dry Mediterranean areas using terrain attributes derived from a digital elevation model, Geoderma, 72(1-2), 99–118, doi:10.1016/0016-7061(96)00024-9.
Mallavan, B. P., B. Minasny, and A. B. McBratney (2010), Homosoil, a methodology for quantitative extrapolation of soil information across the globe, in Progress in Soil Science, vol. 2, edited by J. Boettinger et al., pp. 137–150, Springer Netherlands, Dordrecht, doi:10.1007/978-90-481-8863-5.
Breiman, L. (2001), Random forests, Mach. Learning, 45(1), 5–32, Dordrecht, doi:10.1023/A:1010933404324.
Brown, J., O. F. Jr., J. Heginbottom, and E. Melnikov (2001), Circum-arctic map of permafrost and ground ice conditions, Digital media, Natl. Snow and Ice Data Cent., Boulder, Colo.
Brunke, M. A., et al. (2016), Implementing and evaluating variable soil thickness in the Community Land Model version 4.5 (CLM4.5), J. Clim., 29, 3441–3461, doi:10.1175/JCLI-D-15-0307.1.
Calviño, P., V. O. Sadras, and F. H. Andrade (2003), Quantification of environmental and management effects on the yield of late-sown soybean, Field Crops Res., 83(1), 67–77, doi:10.1016/S0378-4290(03)00062-5.
Chen, J., et al. (2015), Global land cover mapping at 30 m resolution: A POK-based operational approach, ISPRS J. Photogramm. Remote Sens., 103, 7–27, doi:10.1016/j.isprsjprs.2014.09.002.
Chen, T., and C. Guestrin (2016), XGBoost: A Scalable Tree Boosting System, in KDD '16 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, Assoc. for Comput. Mach., New York, doi:10.1145/2939672.2939785.
Chesworth, W. (2008), Encyclopedia of Soil Science, Encycl. Earth Sci. Ser., Springer, Dordrecht, Netherlands, doi:10.1007/978-1-4020-3995-9.
Dahlke, H. E., T. Behrens, J. Seibert, and L. Andersson (2009), Test of statistical means for the extrapolation of soil depth point information using overlays of spatial environmental data and bootstrapping techniques, Hydrol. Processes, 23(21), 3017–3029, doi:10.1002/hyp.7413.
Dregne, H. (2011), Soils of Arid Regions, Dev. Soil Sci., Elsevier Sci., Amsterdam.
Fan, Y., H. Li, and G. Miguez-Macho (2013), Global patterns of groundwater table depth, Science, 339(6122), 940–943, doi:10.1126/science.1229881.
FAO (1996), Digitized soil map of the world and derived soil properties, map, Rome.
FAO (2014), World Reference Base for Soil Resources 2014, Rome.
Fu, Z., Z. Li, C. Cai, Z. Shi, Q. Xu, and X. Wang (2011), Soil thickness effect on hydrological and erosion characteristics under sloping lands: A
hydropedological perspective, Geoderma, 167-168, 41–53, doi:10.1016/j.geoderma.2011.08.013.
Gochis, D. J., E. R. Vivoni, and C. J. Watts (2010), The impact of soil depth on land surface energy and water fluxes in the North American
Monsoon region, J. Arid Environ., 74, 564–571.
Hartmann, J., and N. Moosdorf (2012), The new global lithological map database GLiM: A representation of rock properties at the Earth
surface, Geochem. Geophys. Geosyst., 13, Q12004, doi:10.1029/2012GC004370.
Hengl, T., et al. (2014), Soilgrids1km — global soil information based on automated mapping, PLoS One, 9(8), e105992, doi:10.1371/
journal.pone.0105992.
Hijmans, R. J., S. E. Cameron, J. L. Parra, P. G. Jones, and A. Jarvis (2005), Very high resolution interpolated climate surfaces for global land
areas, Int. J. Climatol., 25, 1965–1978, doi:10.1002/joc.1276.
Howell, J. V. (1960), Glossary of Geology and Related Sciences, Am. Geol. Inst., Washington, D. C.
Illinois State Geological Survey (2004), Glacial drift in Illinois: Thickness and character, map, Champaign.
Jain, S. (2014), Fundamentals of Physical Geology, Springer, New Delhi.
Juilleret, J., S. Dondeyne, and C. Hissler (2014), What about the regolith, the saprolite and the bedrock? Proposals for classifying the subso-
lum in WRB, in EGU General Assembly Conference Abstracts, vol. 16, pp. 2716, Copernicus, Vienna.
Karlsson, C., I. Jamali, R. Earon, B. Olofsson, and U. M€ortberg (2014), Comparison of methods for predicting regolith thickness in previously
glaciated terrain, Stockholm, Sweden, Geoderma, 226-227, 116–129, doi:10.1016/j.geoderma.2014.03.003.
Kuhn, M., S. Weston, C. Keefer, N. Coulter, and R. Quinlan (2014), Cubist: Rule- and Instance-Based Regression Modeling, R package. [Available
at https://cran.r-project.org.]
Kuriakose, S. L., S. Devkota, D. G. Rossiter, and V. G. Jetten (2009), Prediction of soil depth using environmental variables in an anthropo-
genic landscape, a case study in the Western Ghats of Kerala, India, Catena, 79(1), 27–38, doi:10.1016/j.catena.2009.05.005.
Lawrence, D. M., A. G. Slater, V. E. Romanovsky, and D. J. Nicolsky (2008), Sensitivity of a model projection of near-surface permafrost degra-
dation to soil column depth and representation of soil organic matter, J. Geophys. Res., 113, F02011, doi:10.1029/2007jf000883.
Lowrie, W. (2007), Fundamentals of Geophysics, 2nd ed., Cambridge Univ. Press, New York.
McPherson, A. (2011), Development of the Australian National Regolith Site Classification Map, map, Geosci. Aust., Symonston ACT.
Melnikov, E. S. (1998), Catalog of boreholes from Russia and Mongolia, in International Permafrost Association, Data and Information Work-
ing Group, comp. Circumpolar Active-Layer Permafrost System (CAPS), Version 1.0 [CD-ROM], Natl. Snow and Ice Data Cent., Univ. of Colo.
at Boulder, Boulder, Colo.
Miller, D. A., and R. A. White (1998), A conterminous United States multilayer soil characteristics dataset for regional climate and hydrology
modeling, Earth Interact., 2, 1–26, doi:10.1175/1087-3562(1998)002h0001:ACUSMSi2.3.CO;2.
Missouri Geological Survey (2013), MO 2014 overburden thickness (depth to bedrock), map, Rolla, Mo.
Montgomery, D. C., E. A. Peck, and G. G. Vining (2001), Introduction to Linear Regression Analysis, Wiley Ser. Probab. Stat., 3rd ed., John Wiley,
Hoboken, N. J., doi:10.1198/tas.2003.s211.
Moody, E. G., M. D. King, S. Platnick, C. B. Schaaf, and F. Gao (2005), Spatially complete global spectral surface albedos: Value-added data-
sets derived from Terra MODIS land products, IEEE Trans. Geosci. Remote Sens., 43(1), 144–158, doi:10.1109/TGRS.2004.838359.
Pelletier, J. D., and C. Rasmussen (2009), Geomorphically based predictive mapping of soil thickness in upland watersheds, Water Resour.
Res., 45, W09417, doi:10.1029/2008WR007319.
Pelletier, J. D., P. D. Broxton, P. Hazenberg, X. Zeng, P. A. Troch, G.-Y. Niu, Z. Williams, M. A. Brunke, and D. Gochis (2016), A gridded global
data set of soil, immobile regolith, and sedimentary deposit thicknesses for regional and global land surface modeling, J. Adv. Model.
Earth Syst., 8, 41–65, doi:10.1002/2015MS000526.

SHANGGUAN ET AL. GLOBAL MAP OF DEPTH TO BEDROCK 87


Journal of Advances in Modeling Earth Systems 10.1002/2016MS000686

Peterman, W., D. Bachelet, K. Ferschweiler, and T. Sheehan (2014), Soil depth affects simulated carbon and water in the MC2 dynamic glob-
al vegetation model, Ecol. Modell., 294, 84–93, doi:10.1016/j.ecolmodel.2014.09.025.
Price, D. G. (2009), Engineering Geology: Principles and Practice, Springer, Berlin.
Rabus, B., M. Eineder, A. Roth, and R. Bamler (2003), The shuttle radar topography mission—A new class of digital elevation models
acquired by spaceborne radar, ISPRS J. Photogramm. Remote Sens., 57(4), 241–262, doi:10.1016/S0924-2716(02)00124-7.
Ribeiro, E., N. Batjes, J. Leenaars, and A. van Oostrum (Eds.) (2015), Towards the standardization and harmonization of world soil data, ISRIC
Rep. 2015/03, 101 pp., ISRIC — World Soil Inf., Wageningen, Netherlands.
Richard, S., T. Shipman, L. Greene, and R. Harris (2007), Estimated depth to bedrock in Arizona, map, Arizona Geological Survey, Tucson,
Ariz.
Sayre, R., et al. (2014), A new map of global ecological land unitsan ecophysiographic stratification approach, map, Assoc. Am. Geogr.,
Washington, D. C.
Schenk, H. J., and R. B. Jackson (2005), Mapping the global distribution of deep roots in relation to climate and soil characteristics, Geo-
derma, 126(1-2), 129–140, doi:10.1016/j.geoderma.2004.11.018.
Schoeneberger, P., D. Wysocki, E. Benham, and W. Broderson (Eds.) (2011), Field Book for Describing and Sampling Soils, 3rd ed., Natl. Soil
Surv. Cent., NRCS USDA, Lincoln.
Schoeneberger, P., D. Wysocki, E. Benham, and Soil Survey Staff (2012), Field Book for Describing and Sampling Soils, Version 3.0, Nat. Resour.
Conserv. Serv., Natl. Soil Surv. Cent., Lincoln.
Scholes, R. J., and E. B. D. Colstoun (2011), ISLSCP II Global Gridded Soil Characteristics, map, Oak Ridge, Tenn.
Shafique, M., M. v. der Meijde, and D. G. Rossiter (2009), Geophysical and remote sensing-based approach to model regolith thickness in a
data-sparse environment, Catena, 87(1), 11–19, doi:10.1016/j.catena.2011.04.004.
Shangguan, W., et al. (2013), A China dataset of soil properties for land surface modeling, J. Adv. Model. Earth Syst., 5, 212–224, doi:10.1002/
jame.20026.
Soil Survey Staff (2014), Keys to Soil Taxonomy, 12th ed., USDA–Nat. Resour. Conserv. Serv., Washington, D. C.
Swinford, E. M. (2004), What the glaciers left behind, Ohio Geol., 1, 1–5.
Tesfa, T. K., D. G. Tarboton, D. G. Chandler, and J. P. McNamara (2009), Modeling soildepth from topographic and land cover attributes,
Water Resour. Res., 45, W10438, doi:10.1029/2008WR007474.
Tromp-van Meerveld, H. J., N. E. Peters, and J. J. McDonnell (2007), Effect of bedrock permeability on subsurface stormflow and the water
balance of a trenched hillslope at the Panola Mountain Research Watershed, Georgia, USA, Hydrol. Processes, 21(6), 750–769, doi:
10.1002/hyp.6265.
USDA-NCSS (2006), Digital General Soil Map of U.S., map, U.S. Dep. of Agric., Nat. Resour. Conserv. Serv., Fort Worth, Tex.
Vermont Geological Survey, and Vermont Agency of Natural Resources (2008), Bedrock geologic map of Vermont, map, Vermont Agency
of Nat. Resour., Montpelier.
Wilford, J. (2012), A weathering intensity index for the Australian continent using airborne gamma-ray spectrometry and digital terrain
analysis, Geoderma, 183-184, 124–142, doi:10.1016/j.geoderma.2010.12.022.
Wilford, J. R., R. Searle, M. Thomas, D. Pagendam, and M. J. Grundy (2016), A regolith depth map of the Australian continent, Geoderma,
266, 1–13, doi:10.1016/j.geoderma.2015.11.033.
Witzke, B. J., R. R. Anderson, and J. P. Pope (2010), Estimated depth to bedrock of Iowa as a 110-meter pixel, 32-bit Imagine Format Raster
Dataset, map, Iowa Geol. and Water Surv., DNR, Iowa City.
Yamakawa, Y., K. Kosugi, N. Masaoka, J. Sumida, M. Tani, and T. Mizuyama (2012), Combined geophysical methods for detecting soil thick-
ness distribution on a weathered granitic hillslope, Geomorphology, 145-146, 56–69, doi:10.1016/j.geomorph.2011.12. 035.

SHANGGUAN ET AL. GLOBAL MAP OF DEPTH TO BEDROCK 88

You might also like