Building Continental Europe Digital Terrain Model at 30 m resolution using Machine Learning
Development team: OpenGeoHub, MultiOne,
The Geo-harmonizer project requires an up-to-date terrain model of Continental Europe that can be used for land cover, vegetation and soil mapping. Currently at least three public global Digital Elevation Models exist that cover Europe and are available for spatial modeling:
- Digital Surface Model based on the USGS NASADEM,
- Digital Surface Model based on ALOS AW3D,
- Copernicus EU DEM based on the SRTM and ASTER DEMs,
This project obviously needs only a single model of relief, so a logical solution is to derive an Ensemble estimate of terrain height i.e. elevation of the bare earth or land surface.
Since 2018 several space-born missions have been collecting altitude / elevation data using precise and consistent instruments. This include:
Because GEDI is newer mission and more focused on mapping vegetation, we have decided to use it as a reference, in this work, to train a model to predict the most probable terrain elevation. GEDI data can be obtained using the rGEDI package. We have downloaded GEDI Level2B data for whole world and then prepared an extract of points covering continental EU (about 28 million points with "elev.lowestmode" i.e. the estimate of the terrain height). GEDI unfortunately does not cover latitudes above 52°N (see below).
The NASA DEM and ALSO AW3D are surface models hence they include canopy height and this represent a problem for deriving terrain model. UMD GLAD group has recently release a global canopy height map for 2019, also based on GEDI data. This data can be used help predict the terrain model with higher accuracy. There are however two limitations of the GEDI data set and these are: (1) limited coverage to 52 degrees north, (2) systematic clustering of points on the satellite path. These did not seem to cause problems in the modeling but should not be ignored.
In addition, Global Surface Water Explorer and Global Forest Change projects have release number of global products that contain information about the surface water probability and forest tree cover. In total we have prepared 9 auxiliary gridded layers at 30 m that we use to train a model to predict terrain. The summary ensemble model is hence:
m.elev = train_eml(t.var="elev_lowestmode", pr.var = c("eu_AW3D_30m_f", "eu_NASADEM_30m_f", "eu_dem25m_", "eu_GlobalSurfaceWater_2016_30m_", "treecover2010_", "bare2010_", "treecover2000_", "eu_canopy_height_30m_"), X=as.data.frame(ovT.df), out.dir="/data/EU_layers/modelsM/", SL.library = c("regr.ranger", "regr.glmnet", "regr.cvglmnet", "regr.cubist"))
where t.var is the target variable, ovT.df is the regression matrix containing values of the GEDI points and result of spatial overlay with all layers, and SL.library contains the list of integrated learners used in the mlr package framework (mlr::makeStackedLearner) to estimate the ensemble model. Note that, for training the EML, we also use spatial blocks of 30x30 km to ensure that cross-validation protects from over-fitting due to spatial clustering. The summary result we get (note: we randomly subset to max 7 million points for the purpose computational efficiency i.e. to avoid memory problems) show:
Call: stats::lm(formula = f, data = d) Residuals: Min 1Q Median 3Q Max -65.580 -2.630 0.648 3.120 181.769 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -4.1448129 0.4663283 -8.888 < 2e-16 *** regr.ranger 0.2667469 0.0009676 275.677 < 2e-16 *** regr.glmnet -4.7183974 0.6038334 -7.814 5.54e-15 *** regr.cvglmnet 4.6966219 0.6042481 7.773 7.69e-15 *** regr.cubist 0.7643997 0.0012860 594.378 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 6.729 on 6757726 degrees of freedom Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996 F-statistic: 4.644e+09 on 4 and 6757726 DF, p-value: < 2.2e-16
Which indicates that the elevation errors are in average (2/3rd of pixels) between ±2-3 m. This in general demonstrates that combining multiple Digital Surface Models with canopy heights and forest density maps can be used to significantly improve (absolute) accuracy of e.g. SRTM DEM (currently estimated to be around ±12 m).
The final predictions and the training points can be downloaded as Cloud Optimized GeoTIFFS and the code used to prepare data, fit models and generate predictions can be accessed here. Note that the initial results are promising, but there are still number of issue that we plan to resolve:
- In many places canopy height is still visible on the hillshading images, indicating that even after using the canopy height, the true terrain elevation in forests is over-estimated,
- Additional filtering is needed to remove also human built objects in urban areas.
- The EML-based Digital Terrain Model has not be hydrologically corrected / pre-processed, hence additional processing is needed before this DTM could be used for spatial modeling,