Evaluation of digital soil mapping approaches with large sets of environmental covariates

The spatial assessment of soil functions requires maps of basic soil properties. Unfortunately, these are either missing for many regions or are not available at the desired spatial resolution or down to the required soil depth. The field-based generation of large soil datasets and conventional soil maps remains costly. Meanwhile, legacy soil data and comprehensive sets of spatial environmental data are available for many regions. Digital soil mapping (DSM) approaches relating soil data (responses) to environmental data (covariates) face the challenge of building statistical models from large sets of covariates originating, for example, from airborne imaging spectroscopy or multi-scale terrain analysis. We evaluated six approaches for DSM in three study regions in Switzerland (Berne, Greifensee, ZH forest) by mapping the effective soil depth available to plants (SD), pH, soil organic matter (SOM), effective cation exchange capacity (ECEC), clay, silt, gravel content and fine fraction bulk density for four soil depths (totalling 48 responses). Models were built from 300–500 environmental covariates by selecting linear models through (1) grouped lasso and (2) an ad hoc stepwise procedure for robust external-drift kriging (georob). For (3) geoadditive models we selected penalized smoothing spline terms by component-wise gradient boosting (geoGAM). We further used two tree-based methods: (4) boosted regression trees (BRTs) and (5) random forest (RF). Lastly, we computed (6) weighted model averages (MAs) from the predictions obtained from methods 1–5. Lasso, georob and geoGAM successfully selected strongly reduced sets of covariates (subsets of 3–6 % of all covariates). Differences in predictive performance, tested on independent validation data, were mostly small and did not reveal a single best method for 48 responses. Nevertheless, RF was often the best among methods 1–5 (28 of 48 responses), but was outcompeted by MA for 14 of these 28 responses. 
RF tended to over-fit the data. The performance of BRT was slightly worse than that of RF. GeoGAM performed poorly on some responses and was best for only 7 of 48 responses. The prediction accuracy of lasso was intermediate. All models generally had small bias. Only the computationally very efficient lasso had slightly larger bias because it tended to under-fit the data. In summary, although differences were small, the frequencies of best and worst performance clearly favoured RF if a single method is applied and MA if multiple prediction models can be developed.

Published by Copernicus Publications on behalf of the European Geosciences Union.


Introduction
Human well-being depends on numerous services that soils provide in agriculture, forestry, natural hazards, water protection, resource management and other environmental domains. The capacity of soil to deliver services is largely determined by its functions, e.g. regulation of water, nutrient and carbon cycles, filtering of compounds, production of food and biomass or providing habitats for plants and soil fauna (Haygarth and Ritz, 2009; Robinson et al., 2013). The assessment of the multi-functionality of soils depends on the availability of datasets on chemical, physical and biological soil properties (Calzolari et al., 2016). Greiner et al. (2017) compiled a set of approved assessment methods for soil functions from the applied soil science community that cover the multi-functionality of soils (Table 1). This set of soil functions can be assessed with 12 basic soil properties (see references in Table 1). Unfortunately, the spatial assessment of soil functions is often hindered because accurate maps of soil properties are missing in many countries of the world (Hartemink et al., 2013; Rossiter, 2016). However, for many regions legacy data on soil properties (responses) and comprehensive spatial environmental data (covariates) are available and can be linked by digital soil mapping techniques (DSM; e.g. McBratney et al., 2003; Scull et al., 2003).
Many recent DSM studies used relatively small sets of no more than 30 covariates (e.g. Li et al., 2011; Liess et al., 2012; Adhikari et al., 2013; Vaysse and Lagacherie, 2015; Were et al., 2015; Lacoste et al., 2016; Mulder et al., 2016; Somarathna et al., 2016; Taghizadeh-Mehrjardi et al., 2016; Yang et al., 2016). Geodata availability and deemed importance often determine which covariates are used for DSM. However, Brungard et al. (2015) showed that a priori preselection of covariates using pedological expertise might result in decreased accuracy of soil class predictions. Using comprehensive environmental geodata for DSM improves prediction accuracy because soil-forming factors are likely better represented by a larger number of covariates. Derivatives of geological or legacy soil maps (Nussbaum et al., 2014), multi-scale terrain analysis (Behrens et al., 2010a, b, 2014; Miller et al., 2015), wide ranges of climatic parameters (Liddicoat et al., 2015) and (multi-temporal) imaging spectroscopy (Mulder et al., 2011; Poggio et al., 2013; Viscarra Rossel et al., 2015; Fitzpatrick et al., 2016; Hengl et al., 2017; Maynard and Levi, 2017) all contribute to generating high-dimensional sets of partly multi-collinear covariates. One usually presumes that DSM techniques benefit from a large number of covariates even if a method selects only a small subset of relevant covariates for creating the predictions. DSM model building therefore faces the challenge of dealing with (very) large covariate sets. If, in addition, many responses have to be mapped, a DSM approach should

1. efficiently build models without much user interaction, even if there are more covariates p than observations n (n < p),

2. cope with numerous multi-collinear and likely noisy covariates,

3. result in predictions with good accuracy and

4. avoid over-fitting the calibration data.
The method should fulfil basic DSM requirements like modelling non-linear and non-stationary relations between response and covariates, considering spatial autocorrelation, allowing for a check of the pedological plausibility of the modelled relationships and quantifying predictive uncertainty. DSM approaches used in the past can broadly be grouped into (1) linear regression models (LMs), (2) variants of geostatistical approaches, (3) generalized additive models (GAMs), (4) methods based on single trees like classification and regression trees (CARTs), (5) (ensemble) machine learners like boosted regression trees (BRTs) or random forest (RF), and (6) averaging predictions of any of the mentioned methods (model averaging, MA). LMs (e.g. Meersmans et al., 2008; Wiesmeier et al., 2013) cannot be fitted for n < p, and estimates of coefficients become unstable with collinear covariates. Liddicoat et al. (2015) and Fitzpatrick et al. (2016) used the lasso (least absolute shrinkage and selection operator), a form of penalized LM suitable for large correlated covariate sets. Fitzpatrick et al. (2016) found that the lasso clearly outperformed different stepwise LM selection procedures. Geostatistical approaches are generally popular in DSM (McBratney et al., 2003), and they have clear advantages over other methods: they allow for a change of support, and predictive uncertainty follows straightforwardly from the kriging variances. Similar to LM, external-drift kriging (EDK) requires a parsimonious linear trend model. Nussbaum et al. (2014) used the lasso for initial covariate selection, but subsequent manual model-building steps were needed. Non-linear additive modelling through GAM also relies on covariate selection for stable trend estimation. Poggio et al. (2013) used a covariate selection procedure with a random component, and Nussbaum et al. (2017) applied component-wise gradient boosting to preselect relevant covariates. Unless combined with either the lasso or boosting, large sets of covariates are difficult to process through LM, EDK or GAM. But these methods allow for simple model interpretation with partial-effects plots (Faraway, 2005, p. 73). Generally, more complex approaches seem to yield more accurate predictions than simpler DSM methods (Liess et al., 2012; Brungard et al., 2015). The effects of interactions between covariates on responses can be modelled by tree-based methods, but single trees (CART; e.g. Liess et al., 2012; Heung et al., 2016) tend to be noisy (large variance). Cubist, an extension of CART with LMs at the terminal nodes of the tree, was only occasionally applied to large covariate sets (e.g. Viscarra Rossel et al., 2015; Miller et al., 2015). Forming ensembles of trees aims to reduce their variance and likely outperforms the single components (Liess et al., 2012). RF seems stable for large sets of covariates (Behrens et al., 2010a, b, 2014), while BRT, compared to RF by Yang et al. (2016), yields similar model accuracy. Averaging predictions from different models (MA) follows the strategy of ensemble learners, possibly reducing prediction variance (Hastie et al., 2009, p. 288ff.), but MA has rarely been used for DSM. Malone et al. (2014) explored different weighting strategies for MA, but it was unclear from the study whether MA was indeed better than predictions by a single method because predictions were not validated with independent data. Li et al. (2011) averaged only predictions computed by the best-performing (very similar) models, and this did not result in any advantage of MA over single models.
The comparative studies mentioned above used only small sets of covariates (p < 30), tested only few or very similar (Fitzpatrick et al., 2016) approaches or, with the exception of Vaysse and Lagacherie (2015), did not extend the evaluation to several soil properties or study regions. It is therefore currently unclear how well models can be built from large covariate sets by popular DSM methods. Empirical evidence is still too limited to rate DSM methods with respect to the criteria 1-4 listed above. In particular, it is not known whether methods can be identified that are more prone to over-fit soil data or that yield accurate predictions more often than others.
The objectives of this study were to evaluate, for a broad choice of currently used DSM methods, how well they cope with requirements 1-4 listed above. We compared in our study (a) the lasso, (b) robust EDK (georob), (c) spatial GAM with model selection based on boosting (geoGAM), two ensemble tree methods, namely (d) BRT and (e) RF, and (f) weighted MA. In more detail, our objectives were to

i. automatically build models with methods (a)-(e) and compute MA of (a)-(e) for numerous responses from large sets of covariates (300-500),

ii. evaluate the predictive performance of these models with independent validation data,

iii. evaluate the over-fitting behaviour and the practical usability of the approaches and

iv. briefly compare the accuracies of DSM predictions with predictions derived from a legacy soil map with scale 1 : 5000.
We focused on three study regions in Switzerland: a forested region and two regions with agricultural land where harmonized legacy soil data and, in the latter regions, airborne imaging spectrometer data were available.For the agricultural land, the soil properties required for assessing regulation, habitat and production functions were mapped (Table 1).For forests, we had less diverse soil data and mapped only properties to assess acidification status (Zimmermann et al., 2011).

Study regions
We chose three study regions on the Swiss Plateau with contrasting patterns regarding land use, geology, soil types and availability of airborne remote sensing images (Fig. 1, Table 2). Agricultural land north of the city of Berne and around Lake Greifensee (Canton of Zurich) was selected within the outline of imaging spectroscopy data gathered by the APEX spectrometer in the years 2013 and 2014 (Schaepman et al., 2015). Agricultural land was defined as the area not covered by any areal features extracted from the Swiss topographic landscape model (swissTLM3D; Swisstopo, 2013a); hence wetlands, forests, parks, gardens and developed areas were excluded.
The majority (80 %) of the study region Berne was covered by cropland and 15 % by permanent grassland.In the Greifensee region cropland covered roughly half of the area and one-third was permanent grassland.The remaining areas were orchards, vineyards, horticultural areas or mountain pastures (Hotz et al., 2005).
The third study region comprises the forested areas of the Canton of Zurich (ZH forest), as derived from the forested area of the topographic landscape model (swissTLM3D; Swisstopo, 2013a). Two-thirds of the forested area are dominated by conifers (FSO, 2000b). In all three study regions soils formed mostly on weathered molasse formations and Pleistocene sediments dominantly deposited during the last glaciation. In the north-eastern part, the limestone Jura Hills reach into ZH forest (Hantke, 1967). In parts of the Berne study region, alluvial plains with silty sediments or peat formations prevail (Swisstopo, 2005). Soils are rather young in all study regions (< 20 000 years old) as they mostly formed after the end of the last glaciation. Typical soils are Cambisols and Luvisols (calcaric to dystric), Gleysols and Fluvisols (reflecting frequent wet conditions), and Histosols (on former peatlands). Shallower soils are often Regosols (FSO, 2000a).

Origin of soil data and data harmonization
We gathered and harmonized legacy soil data from various soil surveys performed between 1960 and 2014. Berne data were collected mostly before 1980 in small soil mapping projects for land improvement. Data for Greifensee and ZH forest originate from long-term soil monitoring of the Canton of Zurich (KaBo), a soil pollutant survey (Wegelin, 1989), field surveys for creating soil maps of the agricultural land (scale 1 : 5000; Jäggli et al., 1998) or soil investigations in the course of forest vegetation surveys by the Swiss Federal Institute for Forest, Snow and Landscape Research (WSL; Walthert et al., 2004). Hence, the compiled soil database comprises data on soil properties that were measured or estimated for pedogenic soil horizons of soil profiles or measured at a fixed depth from bulked soil samples. Sites for pollution surveying were chosen on a regular grid. The remaining sites were selected through purposive sampling (Webster and Lark, 2013, p. 86) by field surveyors to best represent soils typical of the given landform. The sites of WSL were chosen by purposive sampling according to the aims of the project. Collating the data from the different sources showed that the soil data were not directly comparable, and tailored harmonization procedures were required to provide consistent soil datasets. The heterogeneity of the legacy soil data resulted from several standards of soil description and soil classification, different data keys, different analytical methods and, particularly, often missing metadata for a proper interpretation of the datasets. Therefore, we elaborated a general harmonization scheme that covers all steps required to merge different legacy soil data into one common, consistent database (Walthert et al., 2016). Sampling sites were recorded in the field on topographic maps (scale 1 : 25 000), and hence we estimated the accuracy of the coordinates to be about ±25 m.
Horizon-based (and non-fixed-depth) soil property data were converted to fixed-depth data for 0-10, 10-30, 30-50 and 50-100 cm of soil depth for Berne and Greifensee and 0-20 and 40-60 cm of depth for ZH forest. The latter intervals were chosen because at the majority of forest sites only these depths had been sampled. Values o_t for a depth interval t were computed from horizon (or fixed-depth) data o_i as the weighted average

o_t = Σ_i w_i o_i / Σ_i w_i,

with w_i given by the product of the fraction of the thickness of horizon (or fixed depth) i within t and the bulk density ρ_i of the soil fraction with particle size ≤ 2 mm. Because we lacked estimates of volumetric gravel content for the majority of samples, we assumed that it was constant; ρ_i was partly derived by pedotransfer functions (PTFs; Table S1 in the Supplement). Soil properties were either measured by standard laboratory procedures, estimated in the field or calculated by PTF (see overview in Table S1). We accounted for fluctuations in the observations over the long period during which the data had been collected and for possible differences between laboratory measurements, field estimates and PTF predictions by statistical modelling. We included categorical covariates (factors) in the statistical models that coded, separately for laboratory measurements, field estimates and PTF predictions, the period when the data had been gathered. For Berne three periods (1968-1974, 1975-1978 and 1979-2010) were coded separately for laboratory measurements and field estimates. For Greifensee and ZH forest, coding required more care because we had replicate samples from soil monitoring. Instead of only using mean or median values per site, this coding allowed us to use all individual observations. For Greifensee we coded the years 1960-1989, 1990-1994 and 1995-1999 separately for laboratory and field data and 2000-2014 for laboratory measurements only. For ZH forest we distinguished the periods 1985-1994, 1995-1999, 2000-2004, 2005-2009 and 2010-2014 for laboratory measurements and
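The depth-weighted aggregation described above can be sketched as follows. This is a minimal Python illustration (the study itself used R); the function name and the tuple layout of the horizon data are assumptions made for the example.

```python
def aggregate_to_depth(horizons, top, bottom):
    """Depth-weight horizon values o_i into one value o_t for the
    interval [top, bottom) cm. Each horizon is weighted by the product
    of its thickness fraction within the interval and its fine-earth
    bulk density rho_i.

    horizons: list of (upper, lower, value, rho) tuples, depths in cm.
    """
    num, den = 0.0, 0.0
    for upper, lower, value, rho in horizons:
        # thickness of the overlap between horizon and target interval
        overlap = max(0.0, min(lower, bottom) - max(upper, top))
        if overlap > 0:
            w = (overlap / (bottom - top)) * rho
            num += w * value
            den += w
    return num / den if den > 0 else float("nan")

# two horizons covering 0-25 and 25-60 cm, aggregated to 10-30 cm
profile = [(0, 25, 4.0, 1.2), (25, 60, 2.0, 1.5)]
print(aggregate_to_depth(profile, 10, 30))
```

A horizon that fully covers the target interval simply returns its own value, since the weights cancel.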
a further two levels for predictions by PTF or pH measurements on field-moist samples (see Table S1). Data on pH, soil organic matter (SOM) and effective cation exchange capacity (ECEC) that were older (or newer) than reported above were discarded. To compute model predictions for mapping, we used the most recent time period and laboratory measurements as the reference level.

Soil properties
For the agricultural land (Berne, Greifensee) we modelled clay, silt and gravel content, pH, SOM and effective soil depth available to plants (SD). For ZH forest we modelled ECEC, pH and the bulk density of the fine soil fraction (≤ 2 mm; BD). For Berne and Greifensee (possibly incomplete) soil data were available for 1052 and 2050 sites, respectively, and for ZH forest we had 2379 sites with soil data (Figs. S1 to S3 in the Supplement). We used roughly 20 % of the sampled sites for independent model validation. Depending on data availability, this resulted in 120-300 validation sites that were chosen by weighted random sampling. We ensured an even distribution of validation sites over the study regions by assigning to each site a sampling weight that was proportional to the respective forested or agricultural area within its Dirichlet polygon (Dirichlet, 1850).
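The area-weighted random selection of validation sites can be sketched as follows; a Python illustration with made-up polygon areas (in the study the Dirichlet-polygon areas would come from a GIS overlay with the agricultural or forested mask):

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical setup: one Dirichlet (Voronoi) polygon area per sampled site
n_sites = 1000
areas = rng.gamma(2.0, 1.0, size=n_sites)  # stand-in polygon areas

# sampling weight proportional to polygon area -> sites in densely sampled
# clusters (small polygons) are drawn less often, giving an even spread
weights = areas / areas.sum()
n_val = int(round(0.2 * n_sites))
val_idx = rng.choice(n_sites, size=n_val, replace=False, p=weights)
cal_idx = np.setdiff1d(np.arange(n_sites), val_idx)

print(len(val_idx), len(cal_idx))  # 200 800
```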
Models for the properties of agricultural soils were calibrated with data from 700-900 sites. For SOM more topsoil sites were available (1140), but in the subsoil we had only data from 400 (Greifensee) and 530 (Berne) sites, respectively. For ZH forest topsoil chemical properties were available for 1055 (ECEC) to 1470 (pH) sites, but for subsoils data were again scarce (ECEC 380 and pH 690 sites). For modelling BD we had only 550 (topsoil) to 370 (subsoil) sites. On average we calibrated the models with the following spatial data densities: Berne 2.9-3.6, Greifensee 4.2-5.1 and ZH forest 1.2-1.8 observations per km².
Tables S3 to S7 report descriptive statistics for all soil properties. In general, soils in the Greifensee region were richer in clay (mean clay content 26 %) than in Berne (17-19 %) and had larger gravel contents (8-13 % vs. 3-5 %). In both agricultural study regions, large SOM contents (> 40 %) were occasionally found because drained organic soils were sampled at some sites. Topsoil pH in Berne and Greifensee showed similar variation (mean 6.3-6.7 and standard deviation 0.7-0.9), probably because agricultural management evens out pedogenic differences. ZH forest soils were more acidic (mean topsoil pH 4.7), and pH varied more strongly (minimum pH 2.6).

Covariates for statistical modelling
To represent soil-forming factors we used data from 28 sources, totalling roughly 480 covariates for Berne and Greifensee and 330 for ZH forest, where APEX imaging spectrometer data were not available (Tables 3 and S2). The exact number of covariates used depended on the soil property. When the sampling density of the soil data was small, we excluded covariates that showed hardly any spatial variation (e.g. coarse-gridded climate data) or that resulted in few data points per factor level. Wherever possible, we aggregated factor levels based on pedological knowledge to obtain at least 20 observations per level.

Methods
The large number of responses (21 each for Berne and Greifensee and 6 for ZH forest) and of covariates (Table 3) required that statistical models be built automatically without user interaction. Hence, we used five approaches: the lasso (Sect. 3.1), robust external-drift kriging (georob; Sect. 3.2), geoadditive modelling (geoGAM; Sect. 3.3) and two tree-based machine learning procedures, boosted regression trees (BRTs; Sect. 3.4) and random forest (RF; Sect. 3.5). The predictions of the five methods were moreover combined by weighted averaging (MA; Sect. 3.6). To create the final maps we predicted each response at the nodes of a 20 m grid.
For parametric methods (Sects. 3.1 to 3.3) we transformed strongly positively skewed responses Y(s) (see Tables S4, S6 and S7 for skewness). Transformation by natural logarithm was applied to soil organic matter (SOM) and effective cation exchange capacity (ECEC), while gravel content was transformed by square root (sqrt). Predictions of log-transformed data were unbiasedly backtransformed according to Cressie (2006, Eq. 20; see also Nussbaum et al., 2017), and for sqrt-transformed data we used

Ŷ(s) = f̂(x(s))² + σ̂² − Var[f̂(x(s))],

with f̂(x(s)) being the prediction of the sqrt-transformed response, σ̂² the estimated residual variance of the fitted model and Var[f̂(x(s))] the variance of f̂(x(s)) as provided by the final model. Predictions by group lasso (Sect. 3.1) were backtransformed simply by exp(·) or (·)² because Var[f̂(x(s))] was not known.
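A sketch of the two backtransformations in Python; the log form follows the commonly used unbiased lognormal backtransform attributed above to Cressie (2006), and the function names and arguments are choices made for this example:

```python
import numpy as np

def backtransform_log(f_hat, sigma2, var_f):
    """Unbiased backtransform of a prediction of log(Y):
    exp(f + sigma^2/2 - Var[f]/2)."""
    return np.exp(f_hat + 0.5 * (sigma2 - var_f))

def backtransform_sqrt(f_hat, sigma2, var_f):
    """Unbiased backtransform of a prediction of sqrt(Y):
    f^2 + sigma^2 - Var[f]."""
    return f_hat ** 2 + sigma2 - var_f

# the naive backtransforms exp(f) and f^2 ignore the variance terms
print(backtransform_log(1.0, 0.2, 0.05), backtransform_sqrt(2.0, 0.2, 0.05))
```

With σ̂² = Var[f̂] = 0 both functions reduce to the naive backtransforms, which is the simplification used here for the group lasso.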
For tree-based models (Sects. 3.4 and 3.5), responses were not transformed. Clay and silt were modelled independently. Sand was computed as the remainder to 100 % because for field estimates, which made up a substantial part of the texture data used (Table S1), sand had been obtained in the same way (Brunner et al., 1997; Jäggli et al., 1998). Additive log-ratio transformation (ALR) for compositional data (Aitchison, 1986, p. 113ff.) was tested for geoGAM (Sect. 3.3), but as ALR had no advantage, we preferred to model the textural components on their original scale.
To find optimal tuning parameters, we minimized the root mean square error (RMSE; Eq. 4) in 10-fold cross-validation, using the same cross-validation subsets for all methods in Sects. 3.1 to 3.4. For RF (Sect. 3.5) the RMSE was computed for out-of-bag predictions. All computations were done in R (R Core Team, 2016) using the functions reported below.

Group lasso
The lasso (least absolute shrinkage and selection operator) is a shrinkage method that likely excludes non-relevant covariates and is therefore an attractive framework for high-dimensional covariate selection. The lasso estimates the coefficients of a linear model by minimizing a penalized residual sum of squares, with the penalty being equal to the weighted sum of the absolute values of the estimated coefficients. By increasing the weight λ of the penalty term, a kind of continuous subset selection is performed: covariates with coefficients shrunken exactly to zero are excluded from the model (Hastie et al., 2009, Sect. 3.4).
We used the grouped lasso, which jointly shrinks all coefficients of a factor (R package grpreg; Breheny and Huang, 2015).The optimal λ was chosen such that we obtained the least complex model with cross-validation mean square error (MSE) one standard error (SE) larger than the optimal MSE (Hastie et al., 2009, p. 62).
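The 1-SE rule for choosing λ can be illustrated with scikit-learn's plain (ungrouped) lasso, which exposes the per-fold cross-validation errors the rule needs; the study itself used the R packages grpreg and glmnet, and the simulated data here are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LassoCV, Lasso
from sklearn.datasets import make_regression

# hypothetical data: many covariates (p = 300), few informative ones
X, y = make_regression(n_samples=200, n_features=300, n_informative=10,
                       noise=5.0, random_state=0)

cv = LassoCV(cv=10, random_state=0).fit(X, y)
mse_mean = cv.mse_path_.mean(axis=1)                      # per-alpha CV MSE
mse_se = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])

# 1-SE rule: least complex model within one SE of the minimal CV MSE;
# alphas_ are sorted in decreasing order, so the first qualifying index
# corresponds to the largest (most penalizing) alpha
i_min = mse_mean.argmin()
threshold = mse_mean[i_min] + mse_se[i_min]
i_1se = int(np.argmax(mse_mean <= threshold))
lasso_1se = Lasso(alpha=cv.alphas_[i_1se]).fit(X, y)

n_selected = int((lasso_1se.coef_ != 0).sum())
print(n_selected, "covariates selected")
```

The 1-SE model is never less penalized than the MSE-optimal one, which is exactly the "least complex model" criterion described above.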

Robust external-drift kriging (georob)
We applied external-drift kriging (EDK) with robustly estimated trend coefficients and exponential variogram parameters (R package georob; Papritz, 2016; Nussbaum et al., 2014). Building a parsimonious trend model from a large number p of covariates was challenging for georob. We built trend models by concatenating several covariate selection steps. First, we preselected covariates by finding common covariates in repeated lasso cross-validation runs (32 repetitions, optimal λ from argmin_i(MSE_i) + 1 SE; R package glmnet; Friedman et al., 2010). Then we reduced and expanded this initial covariate set by repeated stepwise covariate selection (models were reduced by the step function minimizing the Bayesian information criterion (BIC) and enlarged by adding covariates with p ≤ 0.05 in Wald tests). Covariates with inflated coefficients due to multi-collinearity had to be removed manually from the final models (40 % of responses). We used a robustness parameter ψ equal to 1.75. When the robust algorithm did not find a root of the estimating equations, we first increased ψ and fitted the model non-robustly if this did not help (8 % of responses; see Tables S10 and S11).

Boosted geoadditive model (geoGAM)
Additive models accommodate linear effects and smooth non-linear effects of continuous covariates. Spatial autocorrelation can be represented in geoGAM by a smooth function of the spatial coordinates (smooth spatial surface), and non-stationary effects are modelled by interactions between smooth spatial functions and covariates. We based model building for geoGAM on component-wise gradient boosting, a slow stage-wise additive model-building algorithm. At each stage, base procedures are fitted to the residuals of the previous model, and the best-fitting base procedure is retained to update the model by a small step size ν. We used non-parametric penalized smoothing splines for continuous covariates and linear base procedures for factors. After boosting, further model reduction was achieved by the stepwise removal of covariates and the aggregation of factor levels. The optimal number of boosting iterations m_stop and the parameters for further model reduction were found by minimizing the cross-validation RMSE. For more details on the model-building procedure, see Nussbaum et al. (2017) and the R package geoGAM (Nussbaum, 2017).
Non-stationary effects were added for all continuous covariates, but the cross-validation RMSE did not substantially decrease, and we preferred the simpler stationary models throughout. Maximum boosting iterations m_max were kept at the default of 300 iterations (geoGAM; Nussbaum, 2017).

Boosted regression trees (BRTs)

Classification and regression trees (CARTs) are based on recursive binary partitioning of the covariates and can capture complex interaction structures in a dataset. Generally, single trees tend to be noisy (large variance) but have small bias. Combining trees by ensemble methods aims to reduce their variance (Hastie et al., 2009, Chaps. 9 and 10). One such approach uses regression trees as base procedures in component-wise gradient boosting (Sect. 3.3).
The learning rate was kept small, as for geoGAM, with ν = 0.1 (Sect. 3.3; Hastie et al., 2009, Chap. 10), and the minimal number of observations in each terminal node was set to 5, as in RF (Sect. 3.5).
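A hedged sketch of BRT tuning with scikit-learn's GradientBoostingRegressor; note that sklearn's max_depth is only an approximate stand-in for the interaction depth i_d of the R gbm package, and the data and parameter grids are illustrative:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=50, n_informative=8,
                       noise=10.0, random_state=1)

# learning rate nu = 0.1 and min_samples_leaf = 5 fixed as in the text;
# the number of trees and the tree depth are tuned by 10-fold CV RMSE
grid = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.1, min_samples_leaf=5,
                              random_state=1),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [1, 3, 6]},
    scoring="neg_root_mean_squared_error",
    cv=KFold(10, shuffle=True, random_state=1),
).fit(X, y)

print(grid.best_params_, -grid.best_score_)
```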

Random forest (RF)
RF (Breiman, 2001), another method of balancing the instability of CART, averages a committee of fully grown trees. Two mechanisms are used to de-correlate the trees and consequently reduce the variance in the predictions: (1) bootstrap sampling (bagging) creates a different response vector for each tree, and (2) at each node only m_try < p randomly selected covariates are tested as candidates for binary splitting. Predictions are simple means over all n_tree fitted trees.
Tuning parameters are the number of trees n_tree, the minimal number of observations at terminal nodes n_min and the number of covariates m_try tested at each split. Tests with five different responses confirmed that tuning n_tree and n_min did not reduce the out-of-bag RMSE substantially (Spiess, 2016). Therefore, we used the default values of n_tree = 500 and n_min = 5 for all RF fits (R package randomForest; Liaw and Wiener, 2002). To find the optimal m_try we minimized the out-of-bag RMSE by iterating through m_try = 1, 2, ..., p.
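The m_try tuning by out-of-bag RMSE can be sketched with scikit-learn, where max_features plays the role of m_try; the study used the R package randomForest, and the data and the coarse candidate grid below are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=40, n_informative=10,
                       noise=10.0, random_state=2)
p = X.shape[1]

def oob_rmse(m_try):
    # n_tree = 500 and n_min = 5 fixed as in the text
    rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=5,
                               max_features=m_try, oob_score=True,
                               random_state=2, n_jobs=-1).fit(X, y)
    return float(np.sqrt(np.mean((y - rf.oob_prediction_) ** 2)))

# iterate over m_try candidates (a coarse grid here to keep it quick;
# 13 ~ p/3 is the randomForest default for regression)
candidates = [1, 2, 5, 10, 13, 20, 30, 40]
scores = {m: oob_rmse(m) for m in candidates}
best_mtry = min(scores, key=scores.get)
print(best_mtry, round(scores[best_mtry], 2))
```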

Model averaging (MA)
The five methods described above likely represent different aspects of the covariates and can be seen as different means of reducing the high-dimensional covariate input. Hence, combining the predictions of several models possibly improves predictive performance over single methods, as the large variance in individual models is reduced through averaging (Hastie et al., 2009, Sect. 8.8). We computed weighted sums of the predictions of our five digital soil mapping (DSM) procedures with weights proportional to the inverse cross-validation or out-of-bag RMSE (Tables S8, S10 and S11).
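The inverse-RMSE weighting can be sketched in a few lines; all numbers below are made up for illustration:

```python
# hypothetical cross-validation / out-of-bag RMSEs of the five methods
rmse = {"lasso": 0.90, "georob": 0.80, "geoGAM": 0.85, "BRT": 0.75, "RF": 0.70}

# weights proportional to the inverse RMSE, normalized to sum to 1
w = {m: 1.0 / r for m, r in rmse.items()}
total = sum(w.values())
w = {m: v / total for m, v in w.items()}

# predictions of the five methods at one location (made-up numbers)
pred = {"lasso": 6.1, "georob": 6.4, "geoGAM": 6.0, "BRT": 6.3, "RF": 6.5}
ma = sum(w[m] * pred[m] for m in pred)
print({m: round(v, 3) for m, v in w.items()}, round(ma, 3))
```

Methods with smaller RMSE receive larger weights, and the averaged prediction always lies between the smallest and the largest single prediction.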

Legacy soil map
For the Greifensee region a legacy soil map with scale 1 : 5000 was available, which reported classes of clay and gravel content for topsoil and subsoil and the effective soil depth available to plants (SD; Jäggli et al., 1998). Experienced soil surveyors assigned to each class or combination of classes a typical value of these soil properties (Nussbaum and Papritz, 2017), and we used these as predictions when we computed the statistics for the validation sets (Sect. 3.8).
The map defined topsoil by pedogenetic A horizon without indicating a particular depth.We therefore compared predictions for topsoil to values observed in 0-10 and 10-30 cm of depth and predictions for subsoil to observations in 30-50 and 50-100 cm.Inhomogeneous mapping units (complex polygons with multiple soil units assigned) were excluded from the validation.Since all the sites in the validation sets had been used to create the map, validation statistics give goodness of fit instead of rigorous validation measures for the legacy map.

Evaluating predictive performance
The accuracy of predictions by the six statistical DSM approaches and the legacy soil map was evaluated by comparing predicted Ŷ(s_i) with observed Y(s_i) soil properties for all locations s_i of the validation sets. To rate the methods, we used bias, RMSE and the mean square error skill score (SS_mse; Wilks, 2011, p. 359),

SS_mse = 1 − MSE / (1/n Σ_i (Y(s_i) − Ȳ)²),

where Ȳ is the mean of the observations in the validation set. SS_mse has the same interpretation as the occasionally reported R², with SS_mse = 1 for perfect predictions (RMSE = 0), SS_mse = 0 if the MSE of the predictions equals the variance of the validation data and SS_mse < 0 if the prediction errors exceed that variance. Note, however, that some DSM studies report R² values identical to SS_mse (e.g. Vaysse and Lagacherie, 2015; Viscarra Rossel et al., 2015), while others report R² with R as the Pearson correlation coefficient of Y(s_i) and Ŷ(s_i) (e.g. Behrens et al., 2014; Somarathna et al., 2016). Such R² values differ, except for linear models fitted by ordinary least squares, from SS_mse. Since the computation of reported R² is sometimes not clear, we call the statistic SS_mse, which makes it clear that it is a skill score.
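The three validation statistics can be computed as follows (a minimal Python sketch; the sign convention of the bias and the function name are choices made for the example):

```python
import numpy as np

def validation_stats(y_obs, y_pred):
    """Bias, RMSE and mean square error skill score
    SS_mse = 1 - MSE / Var(obs), where Var(obs) is the variance of the
    validation data around their mean."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    bias = np.mean(y_pred - y_obs)  # here defined as prediction minus observation
    mse = np.mean((y_pred - y_obs) ** 2)
    ss_mse = 1.0 - mse / np.mean((y_obs - y_obs.mean()) ** 2)
    return bias, np.sqrt(mse), ss_mse

obs = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
# predicting the validation-set mean everywhere gives SS_mse = 0
print(validation_stats(obs, np.full(5, obs.mean()))[2])
# perfect predictions give SS_mse = 1
print(validation_stats(obs, obs)[2])
```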

Model building
The grouped lasso, robust external-drift kriging (georob) and boosted geoadditive models (geoGAMs) successfully selected strongly reduced sets of covariates. On average, the final lasso models had 21, georob 27 and geoGAM only 12 covariates. This corresponds to only 3-6 % of all covariates. Boosted regression trees (BRTs) performed only weak covariate selection: the stage-wise forward algorithm selected on average 43 % of all covariates (covariates with importance > 0) for its models. Nonetheless, the complexity of the BRT models varied quite strongly, with 12 % of covariates selected for the smallest and 86 % for the largest model. The numbers of covariates in the final lasso, geoGAM, georob and BRT models were positively correlated over the responses (Pearson correlation between methods 0.43-0.58).
Random forest (RF) included all available covariates in its models (all covariates with importance > 0). Having models that depend only on a reduced set of the initial input covariates is desirable because computing predictions is then less demanding and interpreting the modelled effects of covariates is easier. For three responses we therefore checked whether covariate importance (Hastie et al., 2009, p. 368) can be used to select covariates for RF. We selected the q = 10, 20, ..., 50 most important covariates or selected covariates by stepwise recursive elimination of the least important covariate. For a given q, both approaches selected similar sets (correspondence 60-90 %), and root mean square error (RMSE) computed with independent validation data did not change much through the selection. For example, for effective cation exchange capacity (ECEC) 0-20 cm, RMSE increased only by 0.5 mmol c kg −1 for a model with 50 instead of 325 covariates. This increase was clearly within the normal fluctuations of RMSE caused by bagging and random covariate selection (Spiess, 2016). Brungard et al. (2015) even improved the prediction accuracy of RF by recursive covariate elimination. Optimal values of m try were quite large, and hence trees were not strongly de-correlated. Out of 48 models, 32 tuned fits had m try > p/3, which is the software default (Liaw and Wiener, 2002). However, the gain obtained by optimizing m try was generally small. On average, the RMSEs of models fitted with default m try were only 1.015 times as large as the RMSEs of models with optimized m try . The largest relative benefit of tuning was found for the topsoil bulk density of the fine soil fraction (BD) in ZH forest, where optimal m try reduced out-of-bag RMSE from 0.052 to 0.049 Mg m −3 .
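The recursive elimination scheme described above can be sketched as follows. This illustrative Python version uses the absolute covariance with the response as a toy stand-in for RF covariate importance; the function names and the importance measure are hypothetical, not taken from the study's R code.

```python
def _cov(x, y):
    """Plain covariance, used here as a toy importance measure."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def recursive_elimination(X, y, q):
    """X: dict covariate name -> list of values; y: response list.
    Repeatedly drop the covariate with the smallest importance until q remain."""
    cols = dict(X)
    while len(cols) > q:
        imp = {name: abs(_cov(vals, y)) for name, vals in cols.items()}
        least = min(imp, key=imp.get)  # least important covariate
        del cols[least]
    return sorted(cols)
```

In the study the importance would be recomputed from a refitted RF after each elimination step; the control flow is the same.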
In contrast, BRT profited more from tuning its parameters n trees and i d . In particular, optimizing n trees resulted in some reduction of cross-validation RMSE; 69 % of the fits had smaller optimal n trees than the software default (100). Tuning n trees reduced RMSE on average by a factor of 0.941. Optimizing the interaction depth i d (mean optimal value = 10, default = 1) decreased RMSE on average by a factor of 0.982. Tuning n trees and i d had the largest effect for subsoil ECEC and pH in ZH forest, where cross-validation RMSE was reduced from 47.3 to 40.9 mmol c kg −1 and from 0.91 to 0.75 pH units, respectively.
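The tuning of n trees and i d amounts to a grid search that keeps the parameter combination with the smallest cross-validation RMSE. A generic sketch; the parameter names and the `cv_rmse` callback are illustrative:

```python
from itertools import product

def tune(grid, cv_rmse):
    """grid: dict param name -> list of candidate values.
    cv_rmse: callable taking a dict of parameters, returning CV RMSE.
    Returns the best parameter dict and its RMSE."""
    names = sorted(grid)
    best, best_rmse = None, float("inf")
    for combo in product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        rmse = cv_rmse(params)
        if rmse < best_rmse:
            best, best_rmse = params, rmse
    return best, best_rmse
```

With coarse grids (as recommended later in the text) the number of `cv_rmse` evaluations, and hence computing time, stays manageable.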
The residual spatial autocorrelation of georob models was much weaker than the autocorrelation of the original responses (Tables S4, S6 and S7). Effective ranges for Greifensee and ZH forest were less than 300 m for most models, and for Berne effective ranges varied between 1 and 10 km with rather large nugget effects (around 50 % of the total sill). Only 5 of 48 final geoGAM models contained a smooth spatial surface. These surfaces often seemed too smooth to represent small-scale residual spatial autocorrelation (median effective range of residual variograms: 270 m).
Since cross-validation and out-of-bag RMSE did not vary much between the five methods, model averaging (MA) weights generally did not differ much from 1/5 (interquartile range of weights: 0.18-0.21). Only for subsoil soil organic matter (SOM; 50-100 cm) were the cross-validation RMSEs of the parametric models larger than those of BRT and RF, resulting in somewhat larger differences between MA weights. A complete list of model parameters and MA weights is given in Tables S10 and S11.
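This excerpt does not state the exact weighting formula; a common choice, shown here purely as an assumption, is to weight each method by its inverse squared cross-validation RMSE, normalized so the weights sum to one:

```python
def ma_weights(cv_rmse):
    """cv_rmse: dict method name -> cross-validation RMSE.
    Returns normalized inverse-squared-RMSE model averaging weights."""
    inv = {m: r ** -2 for m, r in cv_rmse.items()}
    total = sum(inv.values())
    return {m: v / total for m, v in inv.items()}
```

With five methods of near-identical CV RMSE, every weight is close to 1/5, matching the behaviour reported above.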
Summing up, lasso, georob, geoGAM and partly BRT effectively selected relevant covariates from a large set. The reduction of covariates in RF, as tested on a few responses, seems promising. The benefit of tuning model parameters was sometimes only small but remained relevant when considering all responses.

General performance
Table 4 reports the RMSE and mean square error skill score (SS mse ) of all models for independent validation data, and Fig. 2 summarizes SS mse by method and study region. Overall, the models accounted for only a moderate part of the variance in the validation data (median SS mse of the best-performing method per response: 0.257). SOM in 10-30 cm of soil depth in study region Berne was predicted best, with SS mse of 0.677. Soil properties in Greifensee were in general more difficult to predict, yielding negative SS mse for all methods for some responses. For pH in 30-50 cm of soil depth, for example, lasso performed "best" with SS mse of −0.089, which is unquestionably a bad result. In general, topsoil properties were predicted more accurately than subsoil properties (Fig. 3). We are aware of the limitation that we did not validate the methods with data collected by a randomized statistical design (Brus et al., 2011). This is a common drawback if digital soil mapping (DSM) is based on legacy soil data and thus represents a typical situation. Other studies that validated DSM methods with independent data on several soil properties from multiple depths -and likely did not suppress evidence of poor performance -reported similar R 2 values: negative up to 0.75 (Vaysse and Lagacherie, 2015), 0.1 to 0.48 (Mulder et al., 2016), 0.6 to 0.68 (Kempen et al., 2011), 0.36 to 0.52 (Viscarra Rossel et al., 2015) and 0.26 to 0.55 (Adhikari et al., 2013). These studies also found that the R 2 values of predictions of topsoil properties were generally larger than the R 2 related to subsoils.

Performance of methods
There was no method that consistently performed best for all soil properties, soil depths and study regions. Although no method consistently outperformed the others, Figs. 2 and 3 suggest that the tree-based methods BRT and in particular RF performed best on average. For 28 out of 48 responses, RF had the maximum SS mse , and it never had the minimum SS mse .
In contrast, georob and geoGAM most often fared worst (for 15 and 14 out of 48 responses, respectively) and were best only for two (georob) and five responses (geoGAM). Lasso ranked between these two methods and BRT. MA further improved on RF: for 14 out of the 28 responses for which RF was best, MA resulted in even larger SS mse , and MA was best for another 9 of the 20 remaining responses. Hence, for 23 out of 48 responses MA had the overall largest SS mse .
Apart from overall accuracy as captured by RMSE and SS mse , bias also matters for choosing a DSM method. In general, marginal bias was small (median bias 2 -to-MSE ratio < 6 %; Fig. 4, Table S9). Bias contributed more to mean square error (MSE) when SS mse was small (methods lasso, georob, geoGAM; study region Greifensee), except for the tree-based methods RF and BRT, which often had very small bias 2 -to-MSE ratios. BRT had slightly lower bias 2 -to-MSE ratios than RF, confirming that boosting reduces bias adaptively, while bagging in RF lowers only variance but not bias (Hastie et al., 2009, p. 588). The largest bias 2 -to-MSE ratios were most often found for lasso, and they were especially large (12 to 17 %; Table S9) for predicting gravel content in Greifensee and SOM in 50-100 cm depth in Berne. Shrinkage methods such as lasso trade reduced variance in the predictions for increased bias (Hastie et al., 2009, Chap. 3). RF also occasionally produced biased predictions, for example for SOM 30-50 cm in Berne. Conditional bias, which is the distortion of predictions conditional on the observed values (Wilks, 2011, p. 304), did not differ between methods. Predictions were only conditionally biased if overall accuracy was small.
Lastly, we evaluated whether the various methods tended to over-fit the data by computing differences between cross-validation (CV) or out-of-bag (OOB, RF) SS mse and independent validation SS mse (Fig. 5). Through repeated cross-validation on the same subsets and the choice of tuning parameters with OOB statistics (RF), the cross-validation and OOB SS mse can be considered conservative goodness-of-fit SS mse . In the sequel we interpret positive or negative differences as indications of over-fitting or under-fitting, respectively, although we cannot exclude the possibility that differences between the calibration and validation datasets contributed to discrepancies in SS mse . In particular, replicated observations from a given site were not always assigned to the same CV subset, and this possibly contributed to overly optimistic CV or OOB results. Except for lasso, all methods (b to f in Fig. 5) often had larger CV or OOB than independent validation SS mse . As also found by Liddicoat et al. (2015), lasso partly under-fitted the data, likely because we penalized the residual sum of squares by the "optimum plus 1 standard error" rule (Sect. 3.1). Why BRT tended to partially under-fit the data remained unclear. Georob and RF tended to over-fit the data most, and geoGAM was intermediate. For all methods, differences in SS mse were largest for poorly performing models (small SS mse in independent validation).
This was most pronounced for georob. For ZH forest ECEC (40-60 cm), CV yielded SS mse of 0.70 but independent validation SS mse of −0.683. Hence, repeated covariate selection steps based on BIC and Wald tests tend to over-fit the data when responses depend only weakly on covariates.
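The bias 2 -to-MSE ratio discussed above decomposes the mean square error into a systematic (bias) part and the rest; a minimal sketch:

```python
def bias2_to_mse(observed, predicted):
    """Fraction of mean square error attributable to marginal (squared) bias."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    bias = sum(errors) / n          # marginal bias
    mse = sum(e ** 2 for e in errors) / n
    return bias ** 2 / mse
```

A ratio near 0 means the error is almost purely random scatter; a ratio near 1 means the predictions are systematically shifted.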

Factors controlling predictive performance
We explored whether characteristics of the (spatial) empirical distributions of the responses were in some way related to the variations in predictive performance observed between responses. We checked whether SS mse and bias 2 -to-MSE ratios depended on spatial sampling density, skewness, (robust) coefficient of variation, strength of spatial autocorrelation and tuning parameters of the methods (Tables S3 to S7 and S10 to S11), but no clear relationships became evident. In particular, we could not find any relationship between predictive performance and the strength of autocorrelation as measured by spatially structured variance ratios (1 − nugget / sill total ; Vaysse and Lagacherie, 2015) or spatial ranges of response variograms.
Only for extremely positively skewed responses (SOM below 30 cm in Greifensee) did we find that BRT and RF were clearly better than lasso, georob and geoGAM, likely because the log transformation was too weak to fully account for skewness. For skewness < 2 the advantage of tree-based methods disappeared.
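The skewness threshold mentioned above can be checked with the usual moment-based sample skewness (third central moment over the 1.5th power of the second); a minimal sketch:

```python
def skewness(x):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n   # second central moment
    m3 = sum((v - m) ** 3 for v in x) / n   # third central moment
    return m3 / m2 ** 1.5
```

Symmetric samples give 0; responses with a long right tail (such as subsoil SOM) give clearly positive values, and values above about 2 flagged the cases where tree-based methods had an advantage.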

Performance of legacy soil map
The RMSE and SS mse of the legacy soil map (Table 4) were mostly within the range of values observed for the DSM methods. Only for subsoil gravel (50-100 cm) and the effective soil depth available to plants (SD) were predictions by the legacy soil map better than DSM predictions. (Note, however, that RMSE and SS mse of the legacy soil map indicate goodness of fit rather than rigorous validation measures because all data in the validation set were used to create the soil map.) Vaysse and Lagacherie (2015) also found that a legacy soil map predicted only SD more accurately than DSM methods. To create the legacy soil map at a scale of 1 : 5000, many auger samples were taken to delineate map units (Jäggli et al., 1998), but these data were not recorded and are therefore unavailable for DSM. This might explain why the legacy map modelled SD substantially better (SS mse 0.76) than the DSM methods (SS mse 0.15-0.32).

Covariate importance
To characterize the "predictive skill" of covariates by topic, we computed weighted averages of RF covariate importance (Hastie et al., 2009, p. 368), weighting the importance of covariates by validation SS mse (Fig. 6). Overall, terrain attributes were important covariates. For Greifensee they were the main source of information for modelling soil properties. None of the other covariate groups was able to capture much of the variation in soil properties for this study region. This likely explains why DSM generally performed poorly here (Sect. 4.2) and indicates that the performance of DSM also depends on region-specific conditions. In the study region Berne, climatic covariates were important for chemical but not for physical soil properties, particularly in topsoil. Additionally, geology, information on soil, and sampling period and type of data had moderate importance for this study region, also for physical properties. Similarly, for ZH forest covariate importance differed between chemical (pH, ECEC) and physical properties (BD). Vegetation was very influential for modelling pH and ECEC, whereas for BD spatial location and sampling period and type of data were important as well.
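The SS mse -weighted aggregation of covariate importance by topic can be sketched as below. Treating negative SS mse values as zero weight is our assumption for the sketch, not something stated in the text, and the dictionary layout is illustrative.

```python
def weighted_group_importance(importances, ss, groups):
    """importances: dict response -> dict covariate -> RF importance;
    ss: dict response -> validation SS_mse (used as weights, clipped at 0);
    groups: dict covariate -> topic (e.g. terrain, climate).
    Returns dict topic -> weighted-average importance."""
    totals = {}
    wsum = sum(max(s, 0.0) for s in ss.values())
    for resp, imp in importances.items():
        w = max(ss[resp], 0.0) / wsum   # responses predicted well count more
        for cov, value in imp.items():
            topic = groups[cov]
            totals[topic] = totals.get(topic, 0.0) + w * value
    return totals
```

Aggregating by topic rather than by single covariate makes the comparison in Fig. 6 robust to the many strongly correlated terrain derivatives.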
The sampling period and type of soil data were important for many responses (legacy data correction; Fig. 6). As also mentioned by Mulder et al. (2016), this emphasizes the necessity of compensating for temporal changes and differences in analytics when using legacy soil data. Topsoil pH and SOM in Berne -among the responses predicted best in this study -were mostly "explained" by maps of mean monthly and yearly precipitation, a geological overview map, an agricultural suitability map and topographic wetness indices smoothed with different radii (7-60 m). The geological and soil overview maps were also important for modelling SD in Berne. Unlike in Greifensee, terrain attributes did not contribute much to modelling SD in Berne. In Greifensee, a map of historic wetlands and distances to water bodies were, in addition to terrain attributes (mostly indicating local depressions), important for modelling SD. Predictions of physical and chemical properties in Greifensee relied mainly on vertical and horizontal distance to water bodies, local topographic indices and curvatures (50-90 m radii), and the multi-resolution valley bottom flatness (MRVBF). For ZH forest, by far the most important covariate was the vegetation map, accounting for nearly half of the predictive skill for pH and ECEC in topsoil and for one-third in subsoil. Terrain attributes were important for ZH forest predictions on both small (variation of slope in 20 m radius) and large scales (topographic indices with radii of 50 and 125 km). Overall, APEX covariates had very small importance (average rank of covariate importance 168 for RF and 48 for BRT). Differences in reflectance intensities between autumn and spring flights and between agricultural land with partly bare soils and various crops most likely obscured relations between the surface reflectance of vegetation and soil properties. Preprocessing using co-kriging with data from bare soil areas could possibly improve predictive capabilities for the present study regions (Lagacherie et al., 2012).

Covariate interpretation
In addition to studying covariate importance, we evaluated the effects of single covariates on the responses by using partial effects or dependence plots. Given the large number of models and covariates, we chose one continuous and one factor covariate to illustrate the effects for one response. Figure 7 shows how MRVBF and the factor for differing sampling period and type of soil data (legacy data correction) affected topsoil clay content (0-10 cm; Greifensee). The effect of MRVBF on clay content was similar for all five methods. Large MRVBF values point to accumulation sites in the landscape (Gallant and Dowling, 2003), and such sites often have larger clay contents. BRT and RF partial dependence plots suggest that the relation is non-linear with a sharp transition at MRVBF equal to 4. Patterns of estimated differences between sampling periods and types of data were similar for the five methods, which further strengthens the evidence that such differences should be compensated for when one uses legacy soil data.
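A partial dependence curve of the kind shown in Fig. 7 averages model predictions over the calibration data while holding one covariate fixed at a grid of values; a minimal sketch (the `predict` callback and the covariate names are illustrative, not from the study's code):

```python
def partial_dependence(predict, X, covariate, grid):
    """predict: callable taking one row (dict) and returning a prediction;
    X: list of data rows (dicts); covariate: name of the covariate to vary;
    grid: values at which to evaluate the partial dependence."""
    out = []
    for g in grid:
        # Fix the covariate at g for every row, average the predictions.
        preds = [predict({**row, covariate: g}) for row in X]
        out.append(sum(preds) / len(preds))
    return out
```

For an additive model the curve recovers the covariate's effect exactly; for BRT and RF it reveals non-linearities such as the sharp MRVBF transition mentioned above.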

Mapping
In addition to the reported analysis, we visually inspected the soil property maps generated by the six DSM methods. Figure 8 shows DSM maps of topsoil clay content (0-10 cm) for a section of the Greifensee study region along with a map of clay content derived from the legacy soil map. All methods, including the soil map, predicted soils rather rich in clay, with clay content > 20 % for most sites, which agrees with the available observations (coloured dots, panel g in Fig. 8). Modelled patterns of the maps were similar, but lasso and particularly RF predictions were very smooth. In contrast, predictions by georob, geoGAM and BRT varied more, with larger clay content on valley bottoms to the east. MA performed best for this response (SS mse 0.28; Table 4) and, being a weighted average of (a) to (e), showed smoother spatial predictions than georob and geoGAM because RF had the highest model averaging weight (0.24; Table S11). The legacy soil map predicted for most polygons the class sandy loam to loam with 10-30 % clay, to which we assigned a typical clay content of 20 %, which is less than most DSM predictions. As is typical for polygon maps, small areas with deviating clay content were delineated. The DSM methods were not able to map clay content with similar detail because calibration data were too scarce (Sect. 4.2.4). According to the legacy soil map, there were organic soils in depressions (dark green polygons, panel g in Fig. 8), but these had not been sampled. Maps of georob and BRT predictions showed artefacts (single pixels in georob, bands in BRT) with very large predicted values. In the MA map, outlying predictions were smoothed out. Outlying georob predictions were caused by the multi-flow specific catchment area (2 m resolution), an extremely positively skewed terrain attribute. This covariate was not chosen for the geoGAM model; in lasso its coefficient was strongly shrunken, and BRT and RF do not create extrapolation errors for extreme values of covariates. The cause of the artefact in BRT was impossible to spot because the BRT model contained 148 covariates (Table S11).
In addition to creating extrapolation errors, the parametric methods (lasso, georob, geoGAM) predicted physically impossible values (e.g. clay content < 0 or > 100 %) that we had to eliminate. In contrast, trees do not extrapolate beyond the range of observed response values when computing predictions.
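Eliminating physically impossible values amounts to clipping predictions to the admissible range; a minimal sketch for a percentage-valued property such as clay content:

```python
def clip_predictions(preds, low=0.0, high=100.0):
    """Clamp each prediction into the physically admissible interval [low, high]."""
    return [min(max(p, low), high) for p in preds]
```

Other responses need other bounds, e.g. a lower bound of 0 only for SOM or gravel content, which is why the limits are parameters.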

Practical use of statistical methods
All tested DSM methods were able to process large sets of factors and continuous covariates. Although RF most often performed best and MA even improved on that, the advantage measured in validation SS mse was small (Sect. 4.2). Hence, reasons other than accuracy might become more decisive for choosing a particular approach.
In our study residual spatial autocorrelation was weak or short ranged. For a response with strong residual autocorrelation a geostatistical approach might still offer an advantage. The smooth spatial surface of geoGAM is possibly too coarse to capture short-ranged autocorrelation. BRT and RF include spatial coordinates as covariates, but if the response depends only weakly on the other covariates, spatial coordinates become overly important. Repeated recursive splitting on coordinates then likely leads to "chessboard-type" artefacts.
All methods allowed for an interpretation of the modelled relationships (Fig. 7), but a large number of remaining covariates in a model hinders the interpretation of partial effects or dependencies. The most parsimonious models were those chosen by geoGAM, with only 12 remaining covariates on average (lasso: 21, georob: 27). For BRT and RF a covariate selection scheme would still need to be implemented and tuned. Preliminary results suggest that this might be well worth the effort (Sect. 4.1). But even without covariate selection, BRT and RF allowed us to analyse the importance of the covariates (Fig. 6).
R packages are readily available for all methods used in this study. Lasso and geoGAM optimize their tuning parameters directly without any further input to the software, while RF and BRT require the specification of parameter ranges to be tested. The number of parameters to tune influences computing times considerably. Using the default m try for RF (Sect. 4.1) and coarse grids for finding optimal BRT parameters (Sect. 3.4) might be a good compromise between computing effort and good predictive performance. Computational effort was especially large for georob, for which there is no established efficient procedure for building models from large sets of covariates. Lasso, based on a coordinate descent algorithm, built models most quickly (see also Fitzpatrick et al., 2016), while the computational effort for geoGAM model building was quite variable, depending mainly on the number of observations and the number of covariates selected by the boosting step (Nussbaum et al., 2017; Sect. 2.2).
Moreover, ease of modelling predictive uncertainty is another factor relevant for the choice of a DSM method. In georob, uncertainties can be derived directly from the kriging variances. For RF, conditional quantiles of the predictive distributions can be estimated directly at the cost of a larger memory requirement (R package quantregForest; Meinshausen, 2015). For lasso, geoGAM, BRT and MA, model-based bootstrapping can be used to simulate predictive distributions (see Nussbaum et al., 2017, for geoGAM uncertainties for topsoil ECEC), but bootstrapping involves considerable computational effort. Given the large number of responses and methods requiring bootstrapping, we were not able to compute and evaluate uncertainty for the presented models within a reasonable amount of time.
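The bootstrap idea can be sketched generically. Note that this is a plain nonparametric bootstrap for illustration only, not the model-based bootstrap used in the cited work, and `fit_predict` is a hypothetical callback that refits a model on a resample and predicts at a new location:

```python
import random

def bootstrap_prediction_interval(fit_predict, X, y, x_new,
                                  n_boot=200, alpha=0.1, seed=1):
    """fit_predict(X_boot, y_boot, x_new) -> prediction of a model refitted
    on a bootstrap resample. Returns empirical (alpha/2, 1-alpha/2) quantiles
    of the bootstrap predictive distribution at x_new."""
    rng = random.Random(seed)
    n = len(y)
    preds = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]          # resample with replacement
        preds.append(fit_predict([X[i] for i in idx],
                                 [y[i] for i in idx], x_new))
    preds.sort()
    lo = preds[int(alpha / 2 * n_boot)]
    hi = preds[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The computational burden mentioned in the text is visible here: every response, method and prediction location multiplies the n_boot model refits.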
Responses for DSM are not always continuous soil properties. Binary, multinomial (e.g. soil types) or ordinal (e.g. drainage classes) responses are sometimes relevant. Grouped lasso is available for binary (R package grpreg; Breheny and Huang, 2015) and nominal responses (R package glmnet; Friedman et al., 2010). Logistic geostatistical models could be fitted in the generalized linear mixed model framework (R package geoRGLM; Christensen and Ribeiro Jr., 2002; Diggle and Ribeiro Jr., 2002; Pringle et al., 2014), but this is practical only for small datasets. INLA (integrated nested Laplace approximation; Rue et al., 2009; Lindgren et al., 2011) could be a viable alternative. GeoGAMs accommodate binary and ordinal responses (Nussbaum, 2017), and an extension to nominal responses would be straightforward. Classification for binary and nominal responses is easily fitted by RF (R package randomForest; Liaw and Wiener, 2002) and BRT (R package gbm; Southworth, 2015), while ordinal-response BRT could be implemented with the R package mboost (Hothorn et al., 2015) with slightly more effort in model specification.

Conclusions
We applied six statistical digital soil mapping (DSM) methods to a total of 48 soil responses observed in three study regions in Switzerland: grouped lasso (least absolute shrinkage and selection operator), robust external-drift kriging (georob), boosted geoadditive models (geoGAMs), boosted regression trees (BRTs), random forest (RF) and model averaging (MA). We used 300-500 environmental covariates as input to each method. Performance was assessed by comparing model predictions with independent validation data.
From this study we conclude the following.
- All methods successfully built models automatically from large sets of covariates. The ad hoc procedure applied to find a parsimonious trend model for georob was, however, very inefficient.
- Except for lasso, cross-validation and out-of-bag accuracy measures were sometimes better than actually observed for the validation data. This suggests that the methods partly tended to over-fit the data and underpins the necessity of model evaluation with independent data.
- The best-performing method frequently did not have a much larger mean square error skill score (SS mse ) than its closest competitors, and the empirical distributions of SS mse did not differ much for BRT, RF and MA (Figs. 2 and 3). Nevertheless, the frequencies of best and worst performance clearly favour RF if only one method is to be used. Applying model averaging (MA) of several approaches even improves on RF.
- Correcting for sampling period and soil data type by adding a factor to the models turned out to be important. Legacy soil data are inherently heterogeneous for various reasons, but one can (and should) compensate for this variation through careful statistical modelling.

Figure 4. Box plots of the bias 2 -to-MSE ratio (for independent validation data) grouped by method and study region. Box plots summarize the ratios of n = 21 soil properties for study regions Berne and Greifensee (20 for georob in Greifensee). For ZH forest the ratios are shown individually for n = 6 soil properties (lasso: grouped least absolute shrinkage and selection operator, georob: robust external-drift kriging, geoGAM: boosted geoadditive model, BRT: boosted regression tree, RF: random forest, MA: model averaging).

Figure 8. Predictions of clay content (%) in 0-10 cm of soil depth computed by six DSM methods (a-f) on a grid of 20 m resolution and by a legacy soil map with scale 1 : 5000 (g) for a section of the Greifensee study region. The legacy soil map (g) predicted texture classes, to each of which we assigned a typical clay content displayed here. For complex polygons the texture class of the main unit is shown. Dots in (g) depict observations of clay content used for calibrating (a) to (f) and for creating the soil map (g).

Table 1. Basic soil properties (Nussbaum et al., 2017) needed for spatial soil function assessment in the three study regions. Most soil functions required data on further, expensive-to-measure soil properties that were inferred by pedotransfer functions (PTFs) from the basic soil properties (see Greiner et al., 2017; BD: fine fraction bulk density, SOM: soil organic matter, SD: soil depth available to plants, d w : depth of stagnic or gleyic horizon, d c : drainage class (d w and d c defined in Nussbaum et al., 2017), BS: base saturation, ECEC: effective cation exchange capacity, BC / Al: ratio of sum of basic cations to aluminium).

Table 3. Overview of geodata sets and derived covariates (for more information see Table S2 in the Supplement; r: pixel size for raster datasets or scale for vector datasets, a: limited to a study region (Be: Berne, Gr: Greifensee, Zf: ZH forest), n: number of covariates per dataset, NDVI: normalized differenced vegetation index, TPI: topographic position index, TWI: topographic wetness index, MRVBF: multi-resolution valley bottom flatness).
unless visual inspection of the sequence of cross-validation RMSE values suggested that RMSE had not yet levelled off (20 % of the responses).

Table 4. Accuracy of the predictions of soil properties by study region and soil depth computed with independent validation data (RMSE: root mean square error, SS mse : mean square error skill score according to Eq. (5), legacy map: legacy soil map 1 : 5000, lasso: grouped least absolute shrinkage and selection operator, georob: robust external-drift kriging, geoGAM: boosted geoadditive model, BRT: boosted regression tree, RF: random forest, MA: model averaging, NA: no convergence of georob algorithm).

SOIL, 4, 1-22, 2018 www.soil-journal.net/4/1/2018/