Introduction
Soils store around 1500 Pg of carbon and represent the largest
terrestrial carbon pool; thus, it is critical to
accurately quantify the variability of soil organic carbon (SOC) from
local to global scales. During the fourth session of the Global Soil Partnership
(GSP) Plenary Assembly, held in May 2016 in Rome, it was agreed to develop a
Global Soil Organic Carbon Map (GSOCmap). The overarching
goal is to develop a Global SOC Map of the Global Soil Partnership (GSOCmap-GSP)
using a distributed approach that relies on country-specific
SOC maps. Country-specific maps represent a valuable source of information to
explain the high discrepancy of current global SOC estimates such as the
SoilGrids250m system and the Harmonized World Soil Database.
The Food and Agriculture Organization (FAO) recently compiled how different
statistical methods (e.g., regression kriging and machine learning) could be
used to generate country-specific SOC maps and calculate uncertainty.
All these approaches consider the reference framework of
the Soil, Climate, Organisms, Relief, Parent material, Age, and (N) space or
spatial position (SCORPAN) model for digital soil mapping (DSM). In the
SCORPAN reference framework, a soil attribute (e.g., SOC) can be predicted as
a function of the soil-forming environment, in correspondence with
soil-forming factors from the Dokuchaev hypothesis and Jenny's soil-forming
equation based on climate, organisms, relief, parent material, and elapsed
time of soil formation. The SCORPAN
reference framework is an empirical approach that can be expressed as in
Eq. (1):
$$S_a[x; y \sim t] = f\big(S[x; y \sim t],\, C[x; y \sim t],\, O[x; y \sim t],\, R[x; y \sim t],\, P[x; y \sim t],\, A[x; y \sim t]\big), \tag{1}$$
where Sa is the soil attribute of interest at a specific location
N (represented by the spatial coordinates of field observations x; y)
and at a specific period of time (t); S is the soil or other soil
properties that are correlated with the soil attribute of interest (Sa); C is the climate or climatic properties of the
environment; O is the organisms, vegetation, fauna, or human activity; R
is topography or landscape attributes; P is parent material or lithology;
and A is the substrate age or the time factor. To generate predictions of
Sa across places where no soil data are available, N should be
explicit for the information layers representing the soil-forming factors.
These predictions will be representative of the specific period of time (t)
when the available soil data were collected. Therefore, the prediction factors
ideally should represent the conditions of the soil-forming environment for
the same period of time (as much as possible) in which the available soil data
were collected. In Eq. (1), the left side is usually represented by the available
geospatial soil observational data (e.g., from legacy soil profile
collections) and the right side of the equation is represented by the soil
prediction factors. These prediction factors are normally derived from four
main sources of information: (a) thematic maps (e.g., soil type, rock type,
land use type); (b) remote sensing (i.e., active and passive sensors);
(c) climate surfaces and meteorological data; and (d) digital terrain
analysis or geomorphometry. The SCORPAN reference framework is widely used,
but one critical challenge is to quantify the relative importance of the
soil-forming factors (i.e., prediction factors) that could explain the
underlying soil processes controlling the spatial variability of a specific
soil attribute (e.g., SOC).
Arguably, there are two approaches for statistical modeling
that influence the predictions of the spatial variability
of SOC. One assumes that the variability of observations can be reproduced by
a given stochastic data model (e.g., with hypotheses about the spatial
structure of the variable). The other approach uses algorithms and treats as
unknown the mechanisms generating the structure of values in available
datasets (e.g., with hypotheses about the statistical distribution and
moments of the variable). For SOC modeling, the accuracies of global models
compared with country-specific estimates have not been systematically
evaluated in detail. While globally available SOC predictions rely on large
and complex multivariate spaces to represent the soil-forming environment,
local (i.e., simpler) models may be useful for validation purposes and are
required to measure the bias of global SOC estimates, specifically at
particular sites or countries (well represented by available data) where SOC
drivers may be easier to identify due to a smaller range of SOC variance. In addition, the
assumptions of global models compared with local efforts may be different,
and local datasets could complement global information sources. Because
different mapping approaches use available information (i.e., training data
and predictors) in different ways, comparing several approaches and methods
is useful to quantify the relative importance of prediction factors across
data configurations and distributional properties. We argue that a systematic
analysis of predictive algorithms, and of the predictors selected
by each algorithm, could provide insights about the underlying
factors that control the spatial variability of SOC.
The last decade has seen an increasing diversity of approaches for DSM. Data
mining techniques have been successfully used to model and predict the
spatial variability of soil properties and generate site-specific and
country-specific SOC maps. The combination of regression
modeling approaches with geostatistics of independent model residuals (i.e.,
regression kriging) is a strategy that has been widely used to map
SOC. Machine learning
algorithms such as random forests or support vector machines have also been
used to increase the statistical accuracy of soil carbon models,
including applications for SOC mapping. Machine
learning methods do not necessarily allow the extraction of information about
the main effects of prediction factors on the response variable (e.g., SOC);
consequently, a variable selection strategy is useful to increase the
interpretability of machine learning algorithms. With this diversity of
approaches, a recurring question is whether any single method systematically
outperforms the others when predicting SOC across
large geographic areas (e.g., Latin America). We postulate that there is
probably no universal method (i.e., silver bullet) for DSM, and that both global
and country-specific efforts are needed to test a variety of predictive
algorithms, including variable and parameter selection strategies, for
maximizing explained variance while minimizing prediction bias.
To minimize bias in SOC predictions, baseline
reference estimates are required to quantify SOC stocks and to contribute to
better parameterization for projections of SOC under future soil weathering
conditions and land degradation scenarios. Therefore, SOC estimates based on
statistical predictions should ideally be based on all available information
for specific countries or regions of interest, from both national and global
information sources. However, the availability of public SOC information is
limited across large areas of Latin America, and large discrepancies exist in
current global SOC estimates. Thus, there is a pressing need
to validate the accuracy of global SOC estimates, improve interoperability,
and contribute to the capacity of countries to meet
the GlobalSoilMap specifications to inform policy
decisions around climate change mitigation strategies.
This study focuses on Latin America, where site- or region-specific modeling
efforts report high explained variance when mapping SOC.
Accurate SOC maps are required to identify areas with the potential for soil
carbon sequestration and to distinguish them from areas with high SOC. However,
site-specific efforts to map SOC across Latin America highlight the challenge
of predicting pedologically sound soil maps due to the complexity of SOC
spatial variability, including the inconsistencies of
using simple linear approaches to explain soil and depth interrelationships.
Site-specific SOC mapping efforts across Latin America
also suggest that variable selection and the spatial detail of SOC prediction
factors contribute to discrepancies of SOC predictions.
To increase the accuracy of SOC predictions, the
use of high-performance computing through open-source platforms (i.e., Google
Earth) represents a valuable resource to make and continuously update (as new
and better data become available) country-specific SOC maps.
The constant challenge is how to increase SOC
prediction accuracy while also reducing the uncertainty and granularity of
SOC grids.
The overarching goal of this study is to compare different predictive
algorithms across 19 data/country scenarios with publicly available
information to support the development of country-specific SOC maps to be
included in the GSOCmap-GSP. Currently, SOC information across Latin America
has been derived from global models such as the SoilGrids system or the
Harmonized World Soil Database, which lack
quantification of uncertainty and where large areas remain parameterized with
limited country-specific information. This challenge is not unique to Latin
America, as many regions around the world (e.g., Africa, Siberia) have limited
SOC information to parameterize models to estimate the SOC pool. To inform
future SOC mapping efforts, this study addresses two specific questions:
(a) which environmental variables (derived from publicly available
information) have the highest correlations with country-specific SOC
information, and (b) which method (i.e., predictive algorithm)
is best to represent SOC across Latin America and within each country. We assumed
that methods could inform each other as they may explain different aspects of
SOC variability. The ultimate aim of this study is to empower capacities for
digital SOC mapping across Latin America and to contribute to the
discussion about the importance of integrating country-specific information
for representing and predicting soil-related variables (e.g., SOC) to improve
regional-to-global SOC predictions.
Methods
We based our methodological approach on public sources of information and
methods implemented in open-source platforms for statistical computing. Thus,
our framework for modeling SOC stocks (Fig. ) could be
reproduced across the world for comparative purposes between country-specific
and global estimates.
Flow diagram of the main
methodological steps that we performed in order to generate country-specific
and regional SOC predictions. The World Soil Information Service (WoSIS)
dataset was harmonized with the http://worldgrids.org (last access:
20 February 2018) environmental data using 5 × 5 km grids. SOC
stocks were calculated at points and correlated predictors identified. Five
methods were parameterized and we created an ensemble using a generalized
linear approach. Accuracy of models and the ensembles was assessed with
repeated cross-validation. Country-specific and regional (Latin America)
ensembles were compared with global models. KK is kernel-weighted nearest
neighbors, SVM is support vector machines, RF is random forests, PL is
partial least squares regression, and RK is regression kriging.
SOC observations
Soil organic carbon information was extracted from the World Soil Information
Service (WoSIS) soil profile database. This dataset represents a great
harmonization effort in which a large number of national legacy datasets have
been compiled. It includes local-to-national soil profile collections with a
sampling strategy generally based on morphological soil attributes.
The goal of the GSOCmap-GSP is to produce global
information for the first 30 cm; thus, we generated synthetic horizons for
this depth using a mass-preserving spline approach. We
applied a pedotransfer function based on organic matter (OM) if the bulk
density (BLD) information was missing: BLD = 1/(0.6268 + 0.0361 × OM). We decided to use this equation because it
showed less extreme values than other available pedotransfer functions during
preliminary discussion and training exercises (data not shown). Another
reason is that there is no single pedotransfer function applicable to all
conditions across Latin America. This equation is representative for soils
with organic matter content between 0.17 and 13.5 %. For
coarse fragments (CRFVOL), a value of 0 % was used for missing
information prior to the mass-preserving spline modeling. SOC estimates (0
to 30 cm) were derived following a standardized SOC calculation method
(Eq. 2):
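$$\mathrm{SOC}_{\mathrm{stock}} = \frac{\mathrm{ORCDR}}{1000} \times \frac{H}{100} \times \mathrm{BLD} \times \frac{100 - \mathrm{CRFVOL}}{100}, \tag{2}$$
where ORCDR is SOC density (g kg−1) and H is soil depth
(30 cm).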
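As a minimal sketch of Eq. (2) and of the pedotransfer fallback, the calculation can be written as two short Python functions. The function names are hypothetical, and the units are assumptions that the text does not state explicitly: ORCDR in g kg−1, H in cm, BLD in kg m−3, and CRFVOL in %, which gives a result in kg m−2. The sketch also assumes that the pedotransfer function takes OM in % and returns BLD in g cm−3, so its output is multiplied by 1000 before use.

```python
def bld_from_om(om_percent):
    """Pedotransfer function from the text: BLD = 1 / (0.6268 + 0.0361 * OM).

    Assumption: OM is in % and the result is in g cm-3.
    """
    return 1.0 / (0.6268 + 0.0361 * om_percent)


def soc_stock(orcdr, h_cm, bld, crfvol=0.0):
    """Eq. (2): SOC stock from carbon content, depth, bulk density, coarse fragments.

    Assumed units: orcdr in g kg-1, h_cm in cm, bld in kg m-3, crfvol in %.
    Under these assumptions the result is in kg m-2.
    """
    return (orcdr / 1000.0) * (h_cm / 100.0) * bld * (100.0 - crfvol) / 100.0


# Hypothetical sample: 20 g kg-1 carbon, 30 cm depth, 10 % coarse fragments.
example_stock = soc_stock(orcdr=20.0, h_cm=30.0, bld=1300.0, crfvol=10.0)

# Missing BLD: fall back to the pedotransfer function (OM = 2 %, hypothetical),
# converting g cm-3 to kg m-3; missing CRFVOL defaults to 0 % as in the text.
bld_filled = 1000.0 * bld_from_om(2.0)
example_filled = soc_stock(orcdr=20.0, h_cm=30.0, bld=bld_filled)
```

With these hypothetical inputs, the first sample yields about 7.02 kg m−2, a value within the range reported for the country datasets.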
Because of the limitations and uncertainty in the available BLD and CRFVOL
data, we also included an error approximation of SOC estimates. This error
was derived using Global Soil Information Facilities (GSIF)
as explained in the next section.
SOC error estimates
The GSIF approach for estimating SOC (function OCSKGM) includes an
approximate error, which we used to quantify the reliability of SOC estimates.
This error was approximated with a truncated Taylor series
centered at the means, as explained
previously. We mapped the error trend of SOC estimates
by interpolating the values on a per-country basis using the generic
framework for predictive modeling based on machine learning and buffer
(geographical) distances. We followed
this method to provide a spatially explicit measure of the SOC estimation
error. We used this method because it can be implemented without prediction
factors (e.g., only buffer distances) and because it is practically free of
assumptions but considers the geographical proximity to and composition of
the sampling location points, as explained by its developers.
SOC error estimates represent a
component of uncertainty of the overall quality of country-specific input
data.
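The idea behind a first-order (truncated) Taylor series error approximation can be sketched as follows. This is a generic illustration that assumes independent errors in ORCDR, BLD, and CRFVOL; it is not taken from the GSIF source code, and the function name and the input standard deviations are hypothetical.

```python
import math


def soc_stock_sd(orcdr, bld, crfvol, h_cm, orcdr_sd, bld_sd, crfvol_sd):
    """First-order Taylor approximation of the standard deviation of Eq. (2).

    Eq. (2) can be written as f = 1e-7 * H * ORCDR * BLD * (100 - CRFVOL).
    Assuming independent input errors, var(f) is approximated by the sum of
    (partial derivative)^2 * variance over the three uncertain inputs.
    """
    k = 1e-7 * h_cm
    d_orcdr = k * bld * (100.0 - crfvol)   # df/dORCDR
    d_bld = k * orcdr * (100.0 - crfvol)   # df/dBLD
    d_crfvol = -k * orcdr * bld            # df/dCRFVOL
    variance = (
        (d_orcdr * orcdr_sd) ** 2
        + (d_bld * bld_sd) ** 2
        + (d_crfvol * crfvol_sd) ** 2
    )
    return math.sqrt(variance)


# Hypothetical inputs: only the carbon content is uncertain (sd = 2 g kg-1).
sd_only_orcdr = soc_stock_sd(20.0, 1300.0, 10.0, 30.0, 2.0, 0.0, 0.0)
```

Because Eq. (2) is linear in each input, a 10 % relative error in ORCDR alone propagates to a 10 % relative error in the stock under this approximation.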
SOC training data and exploratory analysis
Each country-specific SOC dataset was transformed to its natural logarithm to
reduce the right-skewed distribution of SOC values and because exploratory
analysis showed that this transformation can improve the prediction capacity
of further modeling methods. To analyze the statistical distribution of SOC
values, a probability distribution function was plotted and a Shapiro–Wilk
test of normality was conducted on each dataset. The units of the SOC
estimates are kg m−2. Our global (Latin America) dataset of
11 268 SOC estimates was divided using a simple bootstrapping technique:
25 % of the data were used for independent validation
purposes and the remaining 75 % for training prediction models.
We coupled this information with a public source of prediction factors; see
Sect. 2.4.
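The log transformation and the 75/25 partition can be sketched with the Python standard library. This sketch uses a plain random split without replacement as a stand-in for the resampling described above, which is an assumption; the function names, the seed, and the sample values are hypothetical. It also assumes strictly positive SOC values, so zero values would need separate handling.

```python
import math
import random


def log_transform(values):
    """Natural-log transform; assumes strictly positive SOC values."""
    return [math.log(v) for v in values]


def train_validation_split(records, validation_fraction=0.25, seed=42):
    """Randomly hold out a fraction of records for independent validation."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical right-skewed SOC values (kg m-2).
soc = [0.8, 1.9, 2.5, 2.7, 3.2, 3.3, 5.6, 86.9]
train, validation = train_validation_split(log_transform(soc))
```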
Soil prediction factors
We used environmental information from WorldGrids (worldgrids.org), which is
an initiative of ISRIC-World Soil Information. We downloaded and masked 118
environmental layers (i.e., prediction factors) for each country to
quantitatively represent the soil-forming environment
(http://worldgrids.org/doku.php/wiki:layers, last access:
20 February 2018). The prediction factors were harmonized into a
1 × 1 km global grid by the WorldGrids project from three main
information sources: remote sensing, climate surfaces, and digital terrain
analysis. Additional terrain parameters (e.g., terrain slope, aspect,
catchment area, channel network base level, terrain curvature, topographic
wetness index, and length–slope factor) from elevation data were calculated
in the System for Automated Geoscientific Analyses geographic information
system (SAGA GIS) for each country following the standard implementation for
basic terrain parameters. We resampled the prediction
factors into a 5 × 5 km pixel size grid to reduce the computational
demand required to make predictions and facilitate the reproducibility of
this DSM framework without the need for high-performance computing.
Prediction of SOC
We made predictions on a country-specific and on a regional (Latin American)
basis. We based our prediction framework on the following six steps:
First, the relationship between SOC and prediction factors was explored
using simple correlation analysis.
Second, the 10 prediction factors with highest correlations with SOC data
were identified for each country and used for further analyses.
Third, we explored, parameterized, and compared five statistical methods
with different assumptions to model SOC variability across Latin America:
regression kriging (RK; based on a multiple linear regression model),
partial least squares (PLS) regression, support vector machines (SVMs), random
forests (RF), and kernel-weighted nearest neighbors (KK). A brief explanation
for each modeling approach is provided in Appendix A.
Fourth, we used a five times repeated 5-fold cross-validation strategy
of the aforementioned models to estimate the RMSE. Then, we used the
caretEnsemble tools for stacking the five predictions. The
caretEnsemble approach uses the RMSE to weight and create ensembles of regression models
under a generalized approach to create a linear blend of predictions.
Fifth, we calculated independent model residuals (by predicting the
25 % of data not used for model parameterization). For each
5 × 5 km pixel, we estimated the full conditional response of these
residuals to the SOC prediction factors following the quantile regression
method available within the quantregForest modeling framework. We used this map as a surrogate of model
uncertainty complementary to the approximated error trend of SOC estimates.
Sixth, we used all Latin American data in the WoSIS system to repeat
the fourth and fifth steps of our modeling framework, generating regional
predictions of SOC and comparing with country-specific results and global SOC
estimates. We also evaluated the prediction capacity of these models.
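The first two steps of this framework (correlation screening and selection of the best correlated prediction factors) can be sketched as follows. The sketch assumes that "simple correlation analysis" means the Pearson coefficient ranked by absolute value, which the text does not specify. The function names are hypothetical, and the numbers are made up for illustration; only the layer codes come from the WorldGrids naming used in this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_predictors(log_soc, predictors, k=10):
    """Rank predictors by absolute correlation with log(SOC) and keep the top k."""
    scored = [(name, pearson_r(values, log_soc)) for name, values in predictors.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:k]


# Hypothetical values at five sampling points (made-up numbers).
log_soc = [0.5, 1.0, 1.5, 2.0, 2.5]
predictors = {
    "tdmmod3a": [30.0, 27.0, 25.0, 21.0, 18.0],          # temperature-like layer
    "DEMSRE3a": [100.0, 900.0, 400.0, 1500.0, 1200.0],   # elevation-like layer
    "twisre3a": [7.0, 7.0, 8.0, 7.0, 8.0],               # wetness-index-like layer
}
selected = top_predictors(log_soc, predictors, k=2)
```

In this toy example the temperature-like layer ranks first with a negative coefficient, followed by the elevation-like layer with a positive one.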
Model evaluation and accuracy
First, we selected the optimal parameters for each model/country by means
of a 10-fold cross-validation strategy following a generic recommendation
(see parameter description in
Appendix A). For each model, the train function of the caret package
included simple resampling techniques for automatic model
parameter selection. Thus, we obtained unbiased residuals for each
model/country that we compared using Taylor diagrams. A Taylor
diagram summarizes multiple aspects of model performance, such as the
agreement and variance between observed and predicted values.
In a Taylor diagram, each model is represented by a point
in the plot describing how well the patterns of observed and modeled values match
each other. Two models have a similar predictive capacity if they overlap
across the intersection of an error vector, a variance ratio, and a
correlation vector.
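The three quantities summarized by a Taylor diagram can be computed directly from observed and predicted values. The sketch below is a generic illustration with hypothetical data and a hypothetical function name; it also checks the law-of-cosines relation between the centered RMS difference, the two standard deviations, and the correlation, which is what allows a single point to encode all three statistics.

```python
import math


def taylor_statistics(observed, predicted):
    """Return correlation, standard deviations, and centered RMS difference."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    dev_o = [o - mean_o for o in observed]
    dev_p = [p - mean_p for p in predicted]
    sd_o = math.sqrt(sum(d * d for d in dev_o) / n)
    sd_p = math.sqrt(sum(d * d for d in dev_p) / n)
    corr = sum(a * b for a, b in zip(dev_o, dev_p)) / (n * sd_o * sd_p)
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_o, dev_p)) / n)
    return {"corr": corr, "sd_obs": sd_o, "sd_pred": sd_p, "crmsd": crmsd}


observed = [1.2, 0.7, 1.9, 2.4, 1.1]    # hypothetical log(SOC) observations
predicted = [1.0, 0.9, 1.6, 2.0, 1.3]   # hypothetical cross-validated predictions
stats = taylor_statistics(observed, predicted)

# crmsd^2 = sd_obs^2 + sd_pred^2 - 2 * sd_obs * sd_pred * corr
identity_rhs = (
    stats["sd_obs"] ** 2
    + stats["sd_pred"] ** 2
    - 2.0 * stats["sd_obs"] * stats["sd_pred"] * stats["corr"]
)
```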
We analyzed the overall ratio (ECr) between model errors (RMSE)
and the correlation between observed and predicted values (corr) for each
model across all countries. We propose this ratio as an
approach to better understand the agreement between the correlation
(calculated by means of cross-validation) and the RMSE (derived from the
unbiased residuals of cross-validation). Before calculating the
RMSE / correlation ratio, the RMSE and the correlation between observed
and predicted values were standardized (by their maximum and minimum values) to a
range between 0 and 1 using
$$\mathrm{RMSE}_{\mathrm{SD}} = \frac{\mathrm{RMSE}_i - \min(\mathrm{RMSE})}{\mathrm{range}(\mathrm{RMSE})}, \qquad \mathrm{corr}_{\mathrm{SD}} = \frac{\mathrm{corr}_i - \min(\mathrm{corr})}{\mathrm{range}(\mathrm{corr})}, \qquad \mathrm{EC}_r = \frac{\mathrm{RMSE}_{\mathrm{SD}}}{\mathrm{corr}_{\mathrm{SD}}},$$
where ECr is the proposed ratio between errors and correlation
between observed and predicted values; RMSEi is the observed RMSE
for the ith model; min(RMSE) is the minimum observed value of
RMSE, and range(RMSE) is the difference between the maximum
and minimum observed values of RMSE; corri is the observed
correlation for the ith model; min(corr) is the minimum observed
value of correlation, and range(corr) is the difference
between the maximum and minimum observed values of correlation.
Spatial distribution of available SOC in WoSIS for Latin America.
(a) SOC estimates calculated for each point using Eq. (2). (b) Approximated
error based on Taylor series as implemented in the R-GSIF
package. Thus,
panel (b) represents the uncertainty of SOC estimates at each point. The
values of this map could be associated with data limitations and missing
information for BLD and CRFVOL.
If the value of the ECr was close to 0, then there was a stronger
agreement between high RMSE and low correlation, or low RMSE and high
correlation. If this value deviated from 0 (up to 1 or more), then the RMSE
would tend to be high while the correlation was also high, suggesting that
the method represented the variability of SOC but with high bias.
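A minimal sketch of the ECr calculation is given below. The function names and the cross-validation values are hypothetical. Note that the model with the lowest correlation has a standardized correlation of 0, so its ratio is undefined; the sketch returns None in that case, which is an assumption because the handling of this case is not described here.

```python
def min_max(values):
    """Standardize values to the 0-1 range using their minimum and range.

    Assumes the values are not all equal (non-zero range).
    """
    lowest = min(values)
    value_range = max(values) - lowest
    return [(v - lowest) / value_range for v in values]


def ecr(rmse_by_model, corr_by_model):
    """Ratio of standardized RMSE to standardized correlation for each model.

    Returns None where the standardized correlation is 0 (ratio undefined).
    """
    names = list(rmse_by_model)
    rmse_sd = min_max([rmse_by_model[m] for m in names])
    corr_sd = min_max([corr_by_model[m] for m in names])
    return {
        m: (r / c if c > 0 else None)
        for m, r, c in zip(names, rmse_sd, corr_sd)
    }


# Hypothetical cross-validation results for three models.
rmse = {"RF": 0.50, "SVM": 0.60, "RK": 0.90}
corr = {"RF": 0.70, "SVM": 0.50, "RK": 0.60}
ratios = ecr(rmse, corr)
```

In this example the model with the lowest RMSE and highest correlation obtains a ratio of 0, while the model with the highest RMSE and an intermediate correlation obtains a ratio above 1.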
Model accuracy (also represented by the RMSE and R2) was assessed for the
model ensembles with a stricter (but computationally expensive) five times
repeated 5-fold cross-validation strategy. This model refitting allowed
more stable accuracy results with the ultimate goal of comparing
country-specific and regional (Latin America) estimates. Repeated 10- and
5-fold cross-validation have been used to compare both machine learning and
geostatistical approaches for mapping soil properties, from book examples to
real applications at the global scale. In addition, independent model residuals were also obtained from
the 25 % of data not used for the country-specific and regional ensembles
to estimate a spatially explicit measure of uncertainty (as explained in
step five of our prediction framework).
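The resampling scheme behind a five times repeated 5-fold cross-validation can be sketched with the Python standard library. This is a generic illustration, not the caret implementation; the function name, the seed, and the sample size are hypothetical.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=5, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a new random partition for every repeat
        for fold in range(n_folds):
            test = indices[fold::n_folds]  # every n_folds-th shuffled index
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


# 5 repeats x 5 folds = 25 model fits for a hypothetical dataset of 23 points.
splits = list(repeated_kfold(n_samples=23, n_folds=5, n_repeats=5))
```

Within each repeat every observation is held out exactly once, so the pooled held-out residuals cover the full dataset while no model is evaluated on its own training points.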
SOC stocks
First, we analyzed the influence of the maximum allowed prediction limits for
each prediction algorithm. We harmonized the units of our SOC estimates with
global datasets in Mg ha−1 (megagrams per hectare at 30 cm depth). The
sensitivity of the total SOC stock to the model prediction limit was tested
by increasing (every 10 Mg ha−1) the maximum prediction limit from
0.5 Mg ha−1 until finding a stable rate. Geopolitical limits were
obtained from the Global Administrative Areas project
(https://gadm.org/, last access: 16 July 2018). Using these country
limits we report our country-specific and Latin American SOC estimates. For
comparative purposes, we also extracted for each country the global SOC
estimates from the SoilGrids system, the Harmonized World
Soil Database, and the GSOCmap-GSP (see
http://54.229.242.119/apps/GSOCmap.html, last access: 16 July 2018). We
also report stocks across the land cover classes derived from the Latin
American Network for Monitoring and Studying of Natural Resources, a product
with an estimated accuracy of 84 %. We report the
overall uncertainty of these stocks with the independent model residuals map
and the approximated error trend of the SOC estimates. Some countries with no
data were filled with the average of the surrounding extent of the SOC
predictions. All analyses were performed using the R software.
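The unit harmonization and the sensitivity test of the total stock to the maximum prediction limit can be sketched as follows. The sketch assumes that every 5 × 5 km pixel covers exactly 25 km2 (2500 ha), which holds only for an equal-area grid; the function name, pixel values, and limits are hypothetical.

```python
KG_M2_TO_MG_HA = 10.0   # 1 kg m-2 = 10 Mg ha-1
PIXEL_AREA_HA = 2500.0  # 5 km x 5 km = 25 km2 = 2500 ha (equal-area assumption)


def total_stock_pg(soc_kg_m2, max_mg_ha=None):
    """Sum pixel SOC densities into a total stock in Pg, optionally capped.

    soc_kg_m2: predicted SOC per pixel (kg m-2).
    max_mg_ha: optional maximum prediction limit (Mg ha-1) applied per pixel.
    """
    total_mg = 0.0
    for value in soc_kg_m2:
        density = value * KG_M2_TO_MG_HA
        if max_mg_ha is not None:
            density = min(density, max_mg_ha)
        total_mg += density * PIXEL_AREA_HA
    return total_mg / 1e9  # 1 Pg = 1e9 Mg


# Hypothetical predictions for four pixels, one with an extreme value.
pixels = [2.5, 3.2, 4.0, 57.0]
limits = [0.5, 10.5, 20.5, 30.5, 40.5, 50.5]
sensitivity = [(limit, total_stock_pg(pixels, limit)) for limit in limits]
```

The total grows with the limit until the cap exceeds the largest predicted value, which is the kind of stabilization the sensitivity test looks for.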
Results
Descriptive statistics
SOC across different countries showed a wide diversity of data scenarios
(Table ). Costa Rica (with a mean of
11.05 kg m−2), Chile (with a mean of 9.88 kg m−2), and
Colombia (with a mean of 8.15 kg m−2) are the countries with the
highest SOC mean values. Brazil (n=5616) and Mexico (n=4321) were the
countries with the highest data availability. In contrast, Honduras (n=11),
Guatemala (n=20), and Belize (n=21) were the countries with the lowest
density of SOC estimated values (Table ). With the original
(untransformed) dataset, the only countries that showed a normal distribution
after the Shapiro–Wilk test of normality with an α of 0.05 were
Belize, Guatemala, Honduras, and Suriname.
Descriptive statistics of SOC estimates (in kg m−2) and
total land area for each analyzed country. n is the number of observations.
We provide quantiles, median, mean, and the standard deviation of SOC data.
The columns p and plog represent the probability values derived from the
Shapiro–Wilk test of normality before (p) and after (plog) the log
transformation of SOC values. When p is larger than plog, the log
transformation of the data did not increase the probability of normality in
the dataset. For comparative purposes, we provide (Fig. S1 in the Supplement)
the probability distribution functions of available data before and after the
log transformations. ARG is Argentina, BLZ is Belize,
BOL is Bolivia, BRA is Brazil, CHL is Chile, COL is Colombia,
CRI is Costa Rica, CUB is Cuba, ECU is Ecuador, GTM is Guatemala,
HND is Honduras, JAM is Jamaica, MEX is Mexico,
NIC is Nicaragua, PAN is Panama, PER is Peru, SUR is Suriname,
SLV is El Salvador, URY is Uruguay, and VEN is Venezuela.
| Country | n | Land area (km2) | Min. | First Q | Med. | Mean | Third Q | Max. | SD | p/plog |
|---|---|---|---|---|---|---|---|---|---|---|
| ARG | 231 | 2 736 690 | 0.34 | 1.88 | 3.21 | 5.65 | 5.96 | 86.85 | 9.33 | <0.001/0.03 |
| BLZ | 21 | 22 970 | 1.84 | 4.49 | 6.72 | 7.71 | 9.99 | 19.48 | 4.32 | 0.08/0.99 |
| BOL | 76 | 1 083 301 | 0.64 | 1.83 | 2.56 | 2.64 | 3.20 | 7.65 | 1.21 | <0.001/0.08 |
| BRA | 5616 | 8 358 140 | 0.07 | 1.99 | 2.67 | 3.23 | 3.34 | 573.76 | 9.18 | <0.001/<0.001 |
| CHL | 44 | 743 812 | 0.43 | 3.58 | 5.19 | 9.88 | 16.52 | 31.87 | 8.86 | <0.001/0.01 |
| COL | 166 | 1 038 700 | 0.66 | 3.44 | 5.78 | 8.15 | 9.95 | 52.62 | 7.35 | <0.001/0.96 |
| CRI | 43 | 51 060 | 2.27 | 4.07 | 7.23 | 11.05 | 10.85 | 82.57 | 14.90 | <0.001/0.001 |
| CUB | 48 | 109 820 | 0.36 | 2.85 | 3.61 | 4.32 | 5.73 | 10.98 | 2.23 | 0.004/<0.001 |
| ECU | 77 | 276 841 | 0.99 | 2.37 | 3.65 | 5.15 | 4.36 | 24.36 | 5.15 | <0.001/<0.001 |
| GTM | 20 | 107 159 | 2.60 | 5.66 | 8.48 | 7.73 | 9.75 | 12.41 | 3.11 | 0.14/0.007 |
| HND | 11 | 111 890 | 2.69 | 5.25 | 6.48 | 6.71 | 8.32 | 12.38 | 2.78 | 0.72/0.39 |
| JAM | 76 | 10 831 | 1.29 | 3.01 | 3.99 | 4.35 | 4.83 | 12.90 | 1.99 | <0.001/0.72 |
| MEX | 4321 | 1 943 945 | 0.00 | 1.73 | 2.49 | 2.56 | 3.25 | 35.55 | 1.49 | <0.001/<0.001 |
| NIC | 26 | 119 990 | 2.93 | 3.94 | 7.31 | 7.50 | 9.04 | 15.91 | 3.78 | 0.05/0.09 |
| PAN | 25 | 74 177 | 3.39 | 4.90 | 7.53 | 7.59 | 9.13 | 19.89 | 3.76 | 0.003/0.49 |
| PER | 145 | 1 279 996 | 0.19 | 1.89 | 2.93 | 2.92 | 3.55 | 8.35 | 1.42 | 0.005/<0.001 |
| SUR | 27 | 156 000 | 1.38 | 2.60 | 3.35 | 3.37 | 4.07 | 6.01 | 1.20 | 0.69/0.51 |
| URY | 130 | 175 015 | 0.82 | 2.70 | 3.38 | 4.34 | 3.90 | 46.54 | 4.67 | <0.001/<0.001 |
| VEN | 164 | 882 050 | 0.31 | 2.58 | 4.14 | 5.92 | 6.57 | 44.35 | 6.37 | <0.001/0.11 |
Spatial distribution and point error estimates
There were large areas of Latin America with no available SOC observational
data in the WoSIS system (e.g., the south of Chile, Argentina, or across large
areas of Central America). We found substantial error estimates across large
areas with high density of SOC data but low carbon contents, such as northern
Mexico or the Brazilian semiarid savanna located at the eastern side of that
country (Fig. ).
Correlation of SOC and its predictors
The best correlated predictors were not the same across countries. We found
higher correlations with the original datasets transformed to their natural
logarithm, as the data had a right-skewed distribution and did not follow a
normal distribution (i.e., log normal). The highest correlations between available SOC
data and their environmental predictors were associated with
temperature-related variables across Honduras, Costa Rica, Peru, Chile,
Guatemala, and Suriname (the r2 varied from 0.35 to 0.58). However,
there was a low number of available SOC observations across these countries
in the WoSIS system (between 11 and 34). Similarly, across countries with high
data availability (e.g., Mexico and Brazil), the strongest correlations
between SOC and prediction factors were associated with temperature-related
variables (Table 2). In all cases, the relationship between SOC and
temperature-related variables was negative. In contrast, SOC had a positive
relationship with elevation-derived terrain parameters (r2 varied from
0.43 to 0.59) such as terrain curvature, potential incoming solar radiation,
and slope of terrain.
Lower correlations of SOC data with prediction factors were found across
Brazil, Bolivia, Uruguay, Cuba, Panama, Venezuela, and Argentina (e.g.,
r2 < 0.2). The correlation analysis was useful to formulate a
working hypothesis about the major drivers of the spatial variability of SOC
across countries based on our DSM conceptual framework (e.g.,
SOCARG = f[px4wcl3a + px3wcl3a + evmmod3a
+ l07igb3a + px2wcl3a + …]). For example, the best
correlated predictors with SOC for Argentina were precipitation-related
variables (px4wcl3a, px3wcl3a, px2wcl3a), remote-sensing-based vegetation
indexes (evmmod3a), and a probability-based shrubland map (l07igb3a)
(Table ) (see sources of these maps in
http://worldgrids.org/doku.php/wiki:layers, last access:
20 February 2018).
Best correlated predictors and their frequency across the analyzed
data/country scenarios, given available data in the WoSIS system; see
predictor codes in http://worldgrids.org/doku.php/wiki:layers (last
access: 20 February 2018). ARG is Argentina, BLZ is Belize, BOL is Bolivia,
BRA is Brazil, CHL is Chile, COL is Colombia, CRI is Costa Rica, CUB is Cuba,
DOM is Dominican Republic, ECU is Ecuador, GTM is Guatemala, HND is Honduras,
JAM is Jamaica, MEX is Mexico, NIC is Nicaragua, PAN is Panama, PER is Peru,
SUR is Suriname, SLV is El Salvador, URY is Uruguay, and VEN is Venezuela.
| Var | Factor | Subfactor | Freq. | Country |
|---|---|---|---|---|
| gachws3a | Soil | Soil type | 2 | CUB, SUR |
| garhws3a | Soil | Soil type | 2 | PER, URY |
| ghshws3a | Soil | Soil type | 2 | BLZ, URY |
| gphhws3a | Soil | Soil type | 2 | CUB, JAM |
| gplhws3a | Soil | Soil type | 2 | BLZ, BOL |
| gvrhws3a | Soil | Soil type | 2 | JAM, URY |
| tdmmod3a | Climate | Temperature | 11 | ARG, BOL, BRA, CHL, COL, CRI, CUB, ECU, MEX, PER, VEN |
| tx1mod3a | Climate | Temperature | 10 | ARG, BOL, BRA, COL, CUB, ECU, JAM, NIC, PER, URY |
| tx4mod3a | Climate | Temperature | 10 | BRA, CHL, CRI, CUB, ECU, GTM, JAM, MEX, PER, VEN |
| tx5mod3a | Climate | Temperature | 9 | BOL, BRA, CHL, CUB, ECU, JAM, MEX, PER, VEN |
| tx6mod3a | Climate | Temperature | 9 | ARG, BOL, BRA, CHL, COL, CRI, ECU, MEX, VEN |
| tnhmod3a | Climate | Temperature | 8 | BLZ, COL, CRI, GTM, HND, JAM, PAN, VEN |
| tnmmod3a | Climate | Temperature | 8 | BLZ, COL, CRI, GTM, HND, PAN, URY, VEN |
| tx3mod3a | Climate | Temperature | 7 | BRA, CHL, CUB, ECU, PAN, PER, VEN |
| tdhmod3a | Climate | Temperature | 6 | ARG, CUB, ECU, JAM, MEX, URY |
| tdlmod3a | Climate | Temperature | 6 | BRA, CHL, COL, ECU, GTM, JAM |
| tnsmod3a | Climate | Temperature | 5 | ARG, MEX, NIC, PAN, SUR |
| tx2mod3a | Climate | Temperature | 4 | ARG, ECU, PER, URY |
| tdsmod3a | Climate | Temperature | 3 | MEX, PAN, SUR |
| tnlmod3a | Climate | Temperature | 3 | BLZ, COL, GTM |
| px2wcl3a | Climate | Precipitation | 2 | BOL, PAN |
| px3wcl3a | Climate | Precipitation | 2 | CHL, MEX |
| px4wcl3a | Climate | Precipitation | 2 | BRA, CHL |
| etmnts3a | Climate | ET | 2 | ARG, MEX |
| evmmod3a | Organism | Vegetation | 5 | ARG, ECU, HND, MEX, VEN |
| l07igb3a | Organism | Vegetation | 2 | ARG, CHL |
| DEMSRE3a | Topography | | 5 | COL, CRI, GTM, HND, SUR |
| twisre3a | Topography | | 5 | BRA, JAM, NIC, PAN, SUR |
| ChannNetworkBLevel | Topography | | 4 | COL, HND, PAN, SUR |
| l3pobi3b | Topography | | 4 | COL, CRI, PAN, VEN |
| inssre3a | Topography | | 3 | BLZ, HND, SUR |
| opisre3a | Topography | | 3 | CRI, NIC, SUR |
| SLPSRT3a | Topography | | 3 | CRI, NIC, SUR |
| AnalyticalHillshading | Topography | | 2 | BLZ, CUB |
| Aspect | Topography | | 2 | BLZ, BOL |
| CovergenceIndex | Topography | | 2 | BOL, HND |
| inmsre3a | Topography | | 2 | CRI, GTM |
| ValleyDepth | Topography | | 2 | BLZ, JAM |
| geaisg3a | Age | | 3 | CHL, NIC, SUR |
SOC-related properties
Correlations between SOC density (ORCDR) and prediction factors were higher
with maximum and mean nighttime temperature, where Costa Rica and Chile had
the highest correlations (r2 varied from 0.61 to 0.71). The best
correlated variables with BLD were terrain parameters: relative slope
position, vertical distance to channel network, flow accumulation areas, and
potential incoming solar radiation. These correlations were stronger across
Guatemala, Belize, and Panama (r2 varied from 0.52 to 0.67). We found
that terrain slope and the standard deviation of temperature were the
variables with highest correlations with CRFVOL, where Nicaragua, Honduras,
and Argentina had the highest correlations (r2 varied from 0.40 to
0.55). We did not find a dominant algorithm to predict ORCDR, BLD, and CRFVOL.
Slightly higher correlations between observed and predicted values were
achieved with RF, but in most cases different methods showed similar
prediction capacity. The highest prediction error was found with RK for
CRFVOL, but for all other output variables all prediction algorithms had a
similar range of errors (Fig. ). PL and SVM had the
lowest variance for prediction of each one of the four soil properties. The
r2 values for predicting the combined SOC-related properties (i.e.,
ORCDR, CRFVOL, and BLD) for each prediction algorithm were RK (r2 0.67
to 0.76), RF (r2 0.56 to 0.74), SVM (r2 0.32 to 0.71), PL (r2
0.46 to 0.69), and KK (r2 0.19 to 0.64). Across countries with lower data
availability and sparse distribution, SVM and RK algorithms resulted in lower
model performance.
Taylor diagrams showing the performance of the five models evaluated.
SOC stock (a), ORCDR (b), BLD (c), and
CRFVOL (d). This analysis is based on all available data across
Latin America. Although RF tends to generate higher correlation, it also shows
high variance in predictions. The points are close to each other and the
differences in accuracy between them generally fall within the same
intersection of error, variance, and correlation, suggesting a similar
prediction capacity by the implemented approaches.
Country-specific SOC predictions
We did not find a dominant algorithm to predict SOC on a country-specific
basis (Fig. ). Overall, machine learning
prediction algorithms generated similar results. Higher agreement of machine
learning prediction algorithms was found in small countries where
environmental conditions and land cover/use characteristics tend to be more
homogeneous (e.g., Jamaica, Suriname). RK showed higher discrepancies in
countries where data distribution was sparse (e.g., Suriname, Chile,
Guatemala) but was effective across countries with higher and/or
well-distributed data availability (e.g., Mexico, Brazil). Machine learning SOC
predictions were conservative compared with RK (RK generated a higher
density of extreme and unreliable SOC values). PL had comparable results with
machine learning algorithms (i.e., KK, SVM, RF). From the cross-validation
strategy, the highest r2 between observed and predicted data was
found for Costa Rica (0.58; n=21) using SVM, while the lowest error was
found in Suriname (0.36 kg m-2; n=37) using PL. In contrast,
algorithms had lower prediction capacity for countries with large areas
(e.g., Brazil, Mexico) despite the large data availability.
Taylor diagrams showing the performance of the five models evaluated
for country-specific SOC estimates across Latin America. The position of each
point/method varies from one dataset to another, suggesting that the
predictive capacity changes when data characteristics are different.
The simple correlation (main effect) between the r2 and RMSE for RF, PL,
KK, and RK was positive (0.18, 0.35, 0.32, and 0.1, respectively). In contrast,
this correlation was stronger for SVM (but negative; -0.65) where
increasing the explained variance resulted in a lower error. Thus, we found a
low level of agreement between these two information criteria (r2 and
RMSE) commonly used in DSM to assess performance of prediction algorithms.
Agreement between the RMSE and r2 was found in only 12 of the 19
countries, resulting in country-specific “recommended” prediction
algorithms. Here, we list the prediction algorithms that generated the best
correlation and the best RMSE for each country: ARG (RK, RK), BLZ (RF, RK),
BOL (SVM, KK), BRA (RF, RF), CHL (PL, PL), COL (RF, RF), CRI (SVM, SVM), CUB
(PL, PL), ECU (RK, RK), GTM (KK, RF), HND (SVM, KK), JAM (RF, RF), MEX (RK,
RK), NIC (RF, RF), PAN (PL, KK), PER (KK, KK), SUR (SVM, PL), URY (RF, RK),
and VEN (RK, RK) (see country codes in Table 1). Brazil and Mexico had the
highest number of observations (nearly 80 % of the total), and in each of
them a single method (RF for Brazil, RK for Mexico) yielded both the highest
r2 and the lowest RMSE. We clarify that the best within-country method was
not the same for every country. The highest ECr was found with PL (0.96),
followed by RF (0.54) and KK (0.43), indicating that these predictive
algorithms did not minimize prediction bias while increasing the explained
variance. SVM (with 0.008) and RK (with 0.003) had the lowest ECr, as they
maximize the explained variance while minimizing prediction bias.
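The agreement count reported above can be checked directly from the listed pairs. The following is a minimal sketch that only tallies the (best by r2, best by RMSE) pairs copied from the text; the variable names are illustrative and it is not part of the original analysis code.

```python
from collections import Counter

# Best algorithm per country as (best by r2, best by RMSE), copied from the text.
best = {
    "ARG": ("RK", "RK"), "BLZ": ("RF", "RK"), "BOL": ("SVM", "KK"),
    "BRA": ("RF", "RF"), "CHL": ("PL", "PL"), "COL": ("RF", "RF"),
    "CRI": ("SVM", "SVM"), "CUB": ("PL", "PL"), "ECU": ("RK", "RK"),
    "GTM": ("KK", "RF"), "HND": ("SVM", "KK"), "JAM": ("RF", "RF"),
    "MEX": ("RK", "RK"), "NIC": ("RF", "RF"), "PAN": ("PL", "KK"),
    "PER": ("KK", "KK"), "SUR": ("SVM", "PL"), "URY": ("RF", "RK"),
    "VEN": ("RK", "RK"),
}

# Countries where the same algorithm is best under both criteria.
agree = sorted(c for c, (by_r2, by_rmse) in best.items() if by_r2 == by_rmse)
n_agree = len(agree)

# Which algorithm wins in the countries where both criteria agree.
winners = Counter(best[c][0] for c in agree)

print(n_agree, "of", len(best), "countries agree:", ", ".join(agree))
print(dict(winners))
```

Among the countries where both criteria agree, no single algorithm dominates, which is consistent with the statement that the best within-country method differs between countries.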
Model ensembles and SOC maps
High discrepancy was found among country-specific SOC predictions and between
country-specific and regional SOC predictions. Although both maps predict SOC
following a similar general pattern, the country-specific ensemble showed a
higher density of unrealistic patterns across Guatemala, Venezuela, northern
Brazil, and the surroundings of Uruguay (Fig. a). These
areas correspond to areas where we report both higher SOC calculation errors
and model uncertainty (Fig. ).
Country-specific (a) and regional (Latin
America) (b) predictions of SOC based on a linear ensemble of
methods. We present the units as Mg ha-1 for visualization purposes.
These units were used to reduce the digits of the value range and highlight
larger differences between SOC maps.
Compared with the country-specific ensemble, the regional model showed
spatial differences, predicting higher SOC across the highlands of the Southern
Andes and the boundaries of the Amazon Basin (Fig. b). As
expected, the country-specific model showed spatial artifacts associated with
country geopolitical borders. Based on the repeated 5-fold cross-validation,
we report an r2=0.39 for the regional model and r2 values for the
country-specific approach that vary from 0.01 to 0.55.
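As an illustration of how r2 and RMSE can be obtained from a repeated k-fold cross-validation, the sketch below uses synthetic data and a univariate least-squares line as a stand-in for the prediction algorithms. The function names, the data, and the number of repeats are assumptions made for this example; r2 is taken here as the squared Pearson correlation between observed and predicted values.

```python
import math
import random


def r2_and_rmse(observed, predicted):
    """Squared Pearson correlation and RMSE between two equal-length lists."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    r2 = cov ** 2 / (var_o * var_p)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return r2, rmse


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (stand-in for a real model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def repeated_kfold(x, y, k=5, repeats=3, seed=0):
    """Return one (r2, rmse) pair per repeat, pooled over the k held-out folds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(x)))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        obs, pred = [], []
        for fold in folds:
            held = set(fold)
            train = [i for i in idx if i not in held]
            a, b = fit_line([x[i] for i in train], [y[i] for i in train])
            obs += [y[i] for i in fold]
            pred += [a + b * x[i] for i in fold]
        scores.append(r2_and_rmse(obs, pred))
    return scores


# Synthetic covariate and response (hypothetical values, not SOC data).
rng = random.Random(42)
x = [rng.uniform(0, 30) for _ in range(100)]
y = [2.0 + 0.5 * xi + rng.gauss(0, 3) for xi in x]
scores = repeated_kfold(x, y)
for r2, rmse in scores:
    print(f"r2={r2:.2f} rmse={rmse:.2f}")
```

Because r2 computed this way is insensitive to a constant offset, predictions shifted by one unit from the observations still give r2 = 1 while the RMSE equals 1, which is one reason the two criteria do not always rank methods in the same order.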
The full conditional response of residuals to the prediction factors
on a country-specific basis (a). The full conditional response of
residuals to the SOC prediction factors in the regional (Latin America)
model (b). The trend of the approximated error of SOC estimates is
derived from buffer distances and the random forest spatial
framework (c).
High uncertainty in our modeling framework was found across tropical, arid,
and semiarid regions of Latin America (Fig. a, b).
Residual uncertainty from independent validation in the country-specific
ensemble showed higher errors across geopolitical borders (in Chile,
Argentina, Colombia, Ecuador, Venezuela, and the Brazilian savanna), while
the residual uncertainty map from the regional model had higher uncertainty
across ecologically meaningful transitions, with no evident effect of
geopolitical borders. The trend of the mean approximated error suggests high
uncertainty in the SOC calculation method (Fig. c). We used
this map just to visualize the general trend of error estimates based only on
geographical buffer distances.
Primarily, the Pacific coastal plains, the delta of the Amazon river, some
closed watersheds and wetlands across Mexico, and some sparse points across
Central America showed the highest discrepancies. Mexico and Brazil, with a
higher density of SOC data, were the countries with the least discrepancy between
country-specific and regional models (Fig. a). We also report the
geographical areas where country-specific models tend to predict higher SOC
values than the regional ensemble (Fig. b). However, we report a
similar SOC stock from both modeling approaches (country-specific and regional),
as we explain in Sect. 3.7.
The absolute distance (Mg ha-1) between the country-specific
and the regional ensemble (a). The areas in white are areas where
the country-specific modeling is predicting higher SOC than the regional
estimate (i.e., country-specific is greater than regional) (b).
SOC stocks and model uncertainties
For comparative purposes with previous reports (i.e., the SoilGrids system
and the Harmonized World Soil Database), we harmonized the units of our maps
to Mg ha-1, which was also useful for visualization purposes. For our
models, the uncertainty of the maximum prediction limit was estimated to be
±10 Pg, which was the variance of the SOC stock obtained by increasing the
prediction limit from 1 to 700 Mg ha-1 (Fig. ). This
relationship showed a stable (close to 0) trend after 200 Mg ha-1. A
larger density of extreme values was found with the regional model, and we
calculated a maximum possible SOC stock of 83.62 Pg with this model.
Despite the spatial differences reported for the country-specific and
regional ensembles, we report a similar stock between both approaches
(77.8 ± 42.2 and 76.8 ± 45.1 Pg, respectively). We found that
the regional ensemble yields a slightly higher uncertainty. Our country-specific
ensembles suggested that the countries with the highest SOC stocks were Brazil,
Argentina, Colombia, Mexico, Peru, and Venezuela
(Table ).
Consistently, all models showed that tropical broadleaf evergreen forests,
croplands, and temperate shrublands were the land cover classes that had the
highest SOC across all available SOC estimates (Table ).
However, using only the dataset contained in the WoSIS system, we predict
nearly half of the SOC reported by previous SOC estimates such
as the SoilGrids system (Table ).
Relationship between the SOC stock and the prediction limit. The
average breakdown points of this relationship are shown in the vertical line
at the right of the plot.
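The dependence of the total stock on the prediction limit can be sketched as follows, assuming that the limit acts as a cap on the per-cell SOC density before cells are summed. The cell densities below are hypothetical, and the capping rule is an assumption made for illustration rather than the procedure used for the maps; the unit conversion uses 2500 ha per 5 × 5 km cell and 1 Pg = 1e9 Mg.

```python
CELL_AREA_HA = 2500.0  # one 5 x 5 km grid cell = 25 km2 = 2500 ha
MG_PER_PG = 1e9        # 1 Pg = 1e15 g = 1e9 Mg


def stock_pg(densities_mg_ha, limit_mg_ha):
    """Total stock (Pg) after capping each cell's SOC density at the limit."""
    total_mg = sum(min(d, limit_mg_ha) * CELL_AREA_HA for d in densities_mg_ha)
    return total_mg / MG_PER_PG


# Hypothetical densities (Mg ha-1), including a few extreme predictions.
cells = [20.0, 35.0, 50.0, 80.0, 120.0, 150.0, 400.0, 650.0]

# Stock as a function of the prediction limit.
curve = [(limit, stock_pg(cells, limit)) for limit in (1, 50, 100, 200, 400, 700)]
for limit, stock in curve:
    print(f"limit={limit:>3} Mg ha-1 -> stock={stock:.6f} Pg")
```

The stock can only grow as the limit is raised, and it stops changing once the limit exceeds the largest predicted density, which is the kind of flattening trend described for limits above 200 Mg ha-1.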
SOC stocks (Pg) at the contextual resolution of 5 × 5 km
grids. The terms used are defined as follows: ens is country-specific,
regional is Latin America ensemble, sg is the SoilGrids system, GSOCmap-GSP
is country-specific 1 km, and hw is the Harmonized World Soil Database.
No. | Country | ens | regional | sg | GSOCmap-GSP | hw
1 | Argentina | 13.19 | 12.77 | 24.45 | 18.00 | 18.13
2 | Belize | 0.24 | 0.12 | 0.28 | 0.28 | 0.19
3 | Bolivia | 3.29 | 3.39 | 8.39 | 6.99 | 5.96
4 | Brazil | 26.82 | 27.16 | 68.45 | 42.79 | 47.20
5 | Chile | 6.31 | 7.20 | 15.15 | 1.93 | 8.28
6 | Colombia | 7.01 | 5.96 | 15.50 | 5.12 | 14.99
7 | Costa Rica | 0.56 | 0.34 | 0.83 | 0.83 | 0.71
8 | Cuba | 0.52 | 0.51 | 1.48 | 0.82 | 0.64
9 | Ecuador | 1.31 | 1.36 | 4.04 | 1.57 | 2.63
10 | Guatemala | 1.02 | 0.57 | 1.27 | 1.27 | 0.99
11 | Jamaica | 0.05 | 0.05 | 0.14 | 0.07 | 0.07
12 | Mexico | 5.98 | 6.12 | 14.43 | 9.04 | 17.59
13 | Nicaragua | 0.74 | 0.62 | 1.42 | 0.71 | 0.92
14 | Panama | 0.56 | 0.43 | 1.10 | 0.33 | 0.69
15 | Peru | 4.38 | 5.13 | 17.08 | 3.14 | 10.51
16 | Suriname | 0.56 | 0.51 | 1.20 | 0.45 | 1.33
17 | Uruguay | 0.92 | 0.88 | 1.99 | 0.84 | 2.27
18 | Venezuela | 4.71 | 3.77 | 9.39 | 5.28 | 5.64
The model variance of predicted SOC reached values over 300 % for
countries such as Mexico and Bolivia. In contrast, countries with higher SOC
per unit area and relatively low prediction variances were Panama, Guatemala,
Costa Rica, Nicaragua, and Belize. Overall, we found a median model
prediction variance of 53 % across countries in Latin America. Areas with
high uncertainty and model variance were across northern Mexico, Central
America, limits between Colombia and Brazil, and the border between Chile and
Argentina.
Discussion
We developed a DSM framework to characterize the spatial variability of SOC
across Latin America. Our results suggest that a multi-model approach was
suitable to better understand modeling bias and uncertainty of SOC maps. We
argue that uncertainty on SOC mapping can be associated with (a) the
complexity of the property of interest (i.e., SOC), (b) the environmental
heterogeneity within the area/country of interest, and (c) the
characteristics of available data (e.g., data density, data quality, and data
representativeness) to meet model-specific assumptions. Thus, when working with legacy
soil profile collections that were compiled for different purposes over
long periods of time (i.e., decades), a multi-model approach (i.e., ensemble)
is a convenient way to maximize the predictive capacity given the
available information.
To maximize accuracy of our models, we used a generalized linear approach to
combine single predictions, and at the continental scale we were able to
explain 39 % of SOC variance using only information contained in the
WoSIS system for Latin America. This result was within the range of the
prediction capacity of country-specific models. Besides the low density of
observation points, the performance could be partially affected by the
generalization from the 1 : 1 scale of a soil profile (or field SOC
observation) to a 5 × 5 km grid, representing an additional source
of uncertainty. Higher discrepancy between country-specific and global
efforts was evident across Brazil, the largest country, where our models tend
to predict nearly half of the SOC predicted by previous efforts (e.g., the GSOCmap-GSP,
the SoilGrids system, and the Harmonized World Soil Database). The SoilGrids
system tends to predict the highest values, while our country-specific
ensemble predicts the lowest. The GSOCmap-GSP and our ensembles predicted < 100 Pg
of SOC across the analyzed countries, while all other products suggest higher
stocks (see Tables and ).
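The comparison of totals across products can be reproduced by summing the columns of the country table of SOC stocks (Pg, 5 × 5 km grids). The sketch below only re-adds the tabulated values; the dictionary layout and variable names are illustrative.

```python
# Column order follows the country table: ens, regional, sg, GSOCmap-GSP, hw (Pg).
columns = ("ens", "regional", "sg", "GSOCmap-GSP", "hw")
stocks = {
    "Argentina": (13.19, 12.77, 24.45, 18.00, 18.13),
    "Belize": (0.24, 0.12, 0.28, 0.28, 0.19),
    "Bolivia": (3.29, 3.39, 8.39, 6.99, 5.96),
    "Brazil": (26.82, 27.16, 68.45, 42.79, 47.20),
    "Chile": (6.31, 7.20, 15.15, 1.93, 8.28),
    "Colombia": (7.01, 5.96, 15.50, 5.12, 14.99),
    "Costa Rica": (0.56, 0.34, 0.83, 0.83, 0.71),
    "Cuba": (0.52, 0.51, 1.48, 0.82, 0.64),
    "Ecuador": (1.31, 1.36, 4.04, 1.57, 2.63),
    "Guatemala": (1.02, 0.57, 1.27, 1.27, 0.99),
    "Jamaica": (0.05, 0.05, 0.14, 0.07, 0.07),
    "Mexico": (5.98, 6.12, 14.43, 9.04, 17.59),
    "Nicaragua": (0.74, 0.62, 1.42, 0.71, 0.92),
    "Panama": (0.56, 0.43, 1.10, 0.33, 0.69),
    "Peru": (4.38, 5.13, 17.08, 3.14, 10.51),
    "Suriname": (0.56, 0.51, 1.20, 0.45, 1.33),
    "Uruguay": (0.92, 0.88, 1.99, 0.84, 2.27),
    "Venezuela": (4.71, 3.77, 9.39, 5.28, 5.64),
}

# Sum each product over all tabulated countries.
totals = {
    name: round(sum(row[i] for row in stocks.values()), 2)
    for i, name in enumerate(columns)
}
below_100 = sorted(name for name, total in totals.items() if total < 100)

for name in columns:
    print(f"{name:>12}: {totals[name]:7.2f} Pg")
print("below 100 Pg:", below_100)
```

The tabulated columns sum to roughly 78 Pg (ens), 77 Pg (regional), 99 Pg (GSOCmap-GSP), 139 Pg (hw), and 187 Pg (sg), so only the two ensembles and the GSOCmap-GSP stay below 100 Pg.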
Another source of discrepancy can be associated with the lack of available
data to represent the SOC stock at the depth of interest (i.e., 0-30 cm of
mineral soil). The predictive performance of the mass-preserving spline to
continuously represent the SOC and depth relationships could in some cases be
strongly influenced by the lack of observations across highly variable soil
profiles. Some examples include SOC-rich agricultural soil profiles
constantly transformed for food production purposes, or soils in volcanic settings.
These high levels of missing data are reflected in the trend map of approximated error
(Fig. ), which provides an idea of the uncertainty in the
SOC estimates.
SOC stocks at the contextual resolution of 5 × 5 km across
land cover classes of Latin America for the 18 analyzed countries. The terms
used are defined as follows: ens is country-specific, regional is Latin
America ensemble, sg is the SoilGrids system, GSOCmap-GSP is country-specific
1 km, and hw is the Harmonized World Soil Database. These are the land cover
classes described in Blanco et al. (2013). This land cover product was
generated using 500 m grids and has an accuracy of 84 %.
No. | Land cover | ens | GSOCmap-GSP | hw | sg | regional
1 | Tropical broadleaf evergreen forest | 30.39 | 40.30 | 59.15 | 80.44 | 29.73
2 | Tropical broadleaf deciduous forest | 0.43 | 0.65 | 1.00 | 1.09 | 0.42
3 | Subtropical broadleaf evergreen forest | 2.38 | 3.91 | 4.51 | 6.57 | 2.25
4 | Subtropical broadleaf deciduous forest | 1.42 | 2.04 | 1.87 | 2.55 | 1.07
5 | Temperate broadleaf evergreen forest | 3.32 | 1.26 | 4.97 | 6.91 | 3.56
6 | Temperate broadleaf deciduous forest | 0.48 | 0.52 | 1.02 | 1.21 | 0.63
7 | Subtropical needleleaf forest | 0.00 | 0.01 | 0.00 | 0.01 | 0.00
8 | Temperate needleleaf forest | 0.23 | 0.36 | 0.45 | 0.54 | 0.24
9 | Mixed forest | 0.67 | 1.08 | 1.34 | 1.66 | 0.66
10 | Tropical shrubland | 4.25 | 6.58 | 6.98 | 10.30 | 4.18
11 | Subtropical shrubland | 3.17 | 4.18 | 6.62 | 6.33 | 2.90
12 | Temperate shrubland | 4.56 | 5.08 | 7.33 | 9.97 | 5.32
13 | Tropical grassland | 3.01 | 2.48 | 3.56 | 5.46 | 2.45
14 | Subtropical grassland | 1.15 | 1.35 | 2.28 | 2.58 | 1.12
15 | Temperate grassland | 2.75 | 3.31 | 4.86 | 5.92 | 3.04
16 | Inland water bodies | 1.21 | 1.37 | 2.07 | 3.45 | 1.21
17 | Urban area | 0.24 | 0.31 | 0.45 | 0.55 | 0.22
18 | Permanent ice and snow | 0.14 | 0.08 | 0.14 | 0.38 | 0.17
19 | Barren land | 1.74 | 2.38 | 2.43 | 2.95 | 1.70
20 | Cropland | 12.95 | 19.33 | 21.89 | 27.94 | 12.42
21 | Wetland | 0.37 | 0.56 | 0.66 | 1.24 | 0.35
22 | Salt flat | 0.13 | 0.17 | 0.16 | 0.18 | 0.10
23 | Coastal areas | 1.59 | 1.39 | 2.23 | 4.31 | 1.78
The GSOCmap-GSP, for example, was generated on a country basis, but the
number of SOC observations used by the countries to generate these maps
(> 1 000 000 points) was considerably higher than the data available in the
WoSIS system. Both of our models predicted more conservative
results than the GSOCmap-GSP, while at the same time, the GSOCmap-GSP
predicted less SOC than the SoilGrids system and the Harmonized World Soil
Database. The SoilGrids system, in turn, relies on a multivariate space
suitable to represent the global soil-forming environment; however, such a model
assumes a similar relation of each covariate with the response across
all land area in the world. The Harmonized World Soil Database may be a
pedologically sound product, but large areas of Latin America have not been
mapped at detailed scales (i.e., larger scales than 1 : 1 million), and this
results in a polygon-based approach relying on wide generalizations.
Despite the aforementioned limitations, across Latin America, there is an
increasing availability of relevant SOC information across site- and
country-specific regions, which could serve for
validating and calibrating global SOC estimates. Thus, regional approaches
considering multiple Latin American countries and SOC models could be a
valuable resource to explain discrepancies between site- or country-specific
and global SOC models.
Our results incorporate a multi-model perspective for quantifying/evaluating
the spatial variability of SOC. The model with the highest predictive capacity in
terms of cross-validated r2 was RF, an ensemble of regression trees based
on bagging. However, this method yields a high ECr, and therefore it
tends to capture the trend but with high bias. Taylor diagrams show that RF
in any case yields the lower variance. SVM and RK were the methods with higher
agreement between RMSE and corr, and therefore lower ECr. Large
values of ECr represent an accuracy limitation that was evident
for RF, PL, and KK. To overcome these types of modeling biases, previous
studies have suggested that the theory of ensemble learning applied to soil
datasets could increase the accuracy of results. Furthermore, recent studies highlight the applicability of
selective ensembles across a large diversity of model algorithms useful for
digital soil mapping purposes. Thus, our modeling
approach included the combination of multiple predictions by using a linear
stack of models as implemented in the caretEnsemble package of R,
with the ultimate goal of reducing the uncertainty in SOC mapping efforts.
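The idea behind a linear stack of models can be sketched in a few lines: held-out predictions from the base models are used as covariates in a least-squares fit against the observations, and the fitted weights are then used to combine new predictions. This is a generic illustration in Python with hypothetical numbers and function names; it is not the caretEnsemble implementation and does not reproduce its interface.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_stack(base_predictions, observed):
    """Least-squares weights (intercept first) for combining base models.

    base_predictions holds one list of held-out predictions per base model.
    """
    rows = [[1.0] + [p[i] for p in base_predictions] for i in range(len(observed))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, observed)) for i in range(k)]
    return solve(xtx, xty)


def predict_stack(weights, base_predictions):
    """Combine base-model predictions with the fitted weights."""
    n = len(base_predictions[0])
    return [
        weights[0] + sum(w * p[i] for w, p in zip(weights[1:], base_predictions))
        for i in range(n)
    ]


# Hypothetical held-out predictions from two base models (e.g., RF and RK).
rf = [2.0, 3.5, 5.0, 4.0, 6.5, 8.0]
rk = [1.0, 4.0, 4.5, 5.5, 6.0, 9.5]
# Observations constructed as an exact blend so the recovered weights are known.
observed = [0.5 + 0.6 * a + 0.4 * b for a, b in zip(rf, rk)]

weights = fit_linear_stack([rf, rk], observed)
stacked = predict_stack(weights, [rf, rk])
print([round(w, 3) for w in weights])
```

Fitting the weights on held-out rather than in-sample predictions matters in practice, because flexible base models would otherwise receive weights that reward overfitting.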
Across Latin America, we did not find a common predictive algorithm for SOC.
These results suggest that country-specific environmental predictors and
available data influence the applicability of different approaches. This
assessment is needed to address the requirements from the GSOCmap-GSP with
the official mandate to generate and update country-specific soil information
by the means of DSM. Thus, we argue that the DSM form of each country should
assess and incorporate country-specific available data and environmental
predictors to select the best prediction algorithm. The FAO SOC mapping
cookbook explores possibilities to derive country-specific SOC maps from a
variety of prediction algorithms, and multiple resources
have described the state of the art of modeling methods focused on DSM of
soil carbon, including geostatistics. Thus, data characteristics (e.g., spatial
structure, representativeness) are especially important for developing a
DSM framework, as legacy soil profile collections, generated for long-term
soil inventory purposes, will determine data availability and spatial
distribution within a country.
This country-specific approach to map regional SOC results in artifacts
across geopolitical borders. Therefore, data sharing, model validation, and
calibration experiments among countries are required to better capture the
spatial variability of SOC. The use of a naturally defined prediction domain
(e.g., ecoregional or physiographic map) could reduce the border effects.
However, we understand that geopolitical borders are required for policy
decisions around country-specific needs. We highlight that there is a lack of
publicly available country-specific data, which ultimately influences the
performance of both country-specific and regional-to-global SOC estimates.
To achieve the highest possible accuracy of country-specific SOC estimates,
the availability of point data sources for SOC modeling and mapping is an
important consideration when selecting an efficient modeling strategy,
especially when dealing with legacy SOC datasets. Our results highlight
important uncertainty levels (> 100 %) across large areas of Latin
America (Table ). The data contained in WoSIS have a
low-density distribution given the large area and environmental complexity of
several countries analyzed. Thus, larger uncertainty dominates countries with
larger SOC pools probably because available data do not capture the large
spatial heterogeneity of SOC stocks. We highlight that the WoSIS dataset is a
unique and invaluable effort that has proven useful for generating global SOC
predictions, but there is a global need to
increase information and networking capabilities for SOC.
This study generated predictions of SOC across Latin America but also
provided information about the main relationships driving the spatial
distribution of SOC. Machine learning (i.e., data-driven) models have proven
to be more efficient at modeling non-linear relationships of SOC,
but our results suggest that linear-based models (e.g.,
RK) could outperform machine learning methods under well-distributed and
representative SOC data scenarios. Similar results were found across
productive landscapes of Brazil. We argue that our
capacity to meet modeling assumptions will determine the most suitable
prediction algorithm or ensemble methods (i.e., stack, blend, bucket of
models). Machine learning models are usually conceived as black boxes and the
influence of non-informative SOC prediction factors on machine learning-based
SOC models has not been evaluated in detail. Therefore, we propose that the
use of simple linear methods (i.e., correlation of available data and their
predictors) can be a useful and parsimonious first step to inform data-driven
approaches and enhance the interpretability of machine learning models to
predict SOC. However, selecting prediction factors by
simple correlation analysis does not prevent multi-collinearity, under which
hypothesis-driven methods (e.g., RK) are at risk of failing, although it provides
useful information about the main effects of the predictors on SOC. Thus, the
use of machine learning and other statistical models (i.e., PL) is suitable
to overcome the bias associated with the potential statistical redundancy of
our variable selection approach based on simple correlation analysis.
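The trade-off described above can be made concrete with a small screening routine: predictors are ranked by their absolute correlation with the response, and pairs of selected predictors that are strongly correlated with each other are flagged as potentially redundant. The predictor names, values, and thresholds below are hypothetical and chosen only to illustrate the idea.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    spread = sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(spread)


def screen_predictors(predictors, target, top_n=3, collinear_at=0.9):
    """Rank predictors by |r| with the target and flag redundant selected pairs."""
    ranked = sorted(
        predictors,
        key=lambda name: abs(pearson(predictors[name], target)),
        reverse=True,
    )
    selected = ranked[:top_n]
    redundant = [
        (a, b)
        for i, a in enumerate(selected)
        for b in selected[i + 1:]
        if abs(pearson(predictors[a], predictors[b])) >= collinear_at
    ]
    return selected, redundant


# Hypothetical SOC values and covariates at eight sites.
soc = [2.0, 3.1, 4.2, 4.8, 6.1, 7.0, 8.2, 9.1]
predictors = {
    "night_temp_mean": [20.0, 18.5, 17.0, 16.2, 14.0, 13.1, 11.0, 10.2],
    "night_temp_max": [24.1, 22.4, 21.2, 20.0, 18.1, 17.0, 15.2, 14.1],
    "slope": [1.0, 5.0, 2.0, 8.0, 3.0, 9.0, 4.0, 7.0],
    "aspect": [90.0, 10.0, 200.0, 45.0, 300.0, 120.0, 30.0, 250.0],
}

selected, redundant = screen_predictors(predictors, soc)
print("selected:", selected)
print("collinear pairs:", redundant)
```

In this toy case the two temperature covariates are both selected and also flagged as collinear, which is the situation where a linear method such as RK is at risk while PL or machine learning methods are less affected.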
Furthermore, our data suggest that country-specific predictor factors are
needed to better parameterize models but also could be useful for
country-specific model interpretation. These results have important
implications because it has been proposed that an extensive set of prediction
factors is required to capture the large variance of the global SOC pool.
Thus, we propose that limited but informative
country-specific prediction factors could be jointly explored to describe the
local biophysical characteristics controlling SOC variability.
This study is expected to increase the capacity of Latin American
institutions to provide accurate baseline estimates of SOC with a
country-specific perspective following recommendations of GSOCmap-GSP.
Ultimately, these efforts will enhance the development of new guidelines for
measuring, mapping, reporting, verification, and monitoring SOC stocks.
Accurate country-specific DSM frameworks for SOC are required to facilitate
interoperability and inform environmental policy across developing countries.
Our results highlight that attention is needed to better
understand the influence of model prediction limits (e.g., the full
conditional distribution) on the predicted SOC stocks. Setting an unreliable
(excessive or low) prediction limit can have important effects (over- or
underestimating) on the overall estimated stocks (Fig. ).
Therefore, we argue that data science systems for DSM focused on carbon
assessments should be fundamentally based on SOC expert knowledge and
informed by expert-based soil mapping systems.