
M1 Matrix comparison of technology for automated data harmonization

Geo-harmonizer project

Prepared by: mundialis, OpenGeoHub, CTU in Prague and MultiOne


Executive summary

Interoperability of cross-border data has two dimensions: a) geometric harmonization, so that data are seamlessly compatible across national boundaries, and b) semantic harmonization, so that data are comparable with respect to content. This document provides a systematic review of relevant geospatial data harmonization methods and software packages. Its main objective is to support other “Geo-harmonizer” project activities and to offer guidance on using software for harmonization.


Introduction to the harmonization of geospatial data

Geodata harmonization is the process of combining data that belong to the same topic but originate from heterogeneous data sources, and hence differ in file formats, naming conventions and projections, into one cohesive geospatial data set. Harmonization of geospatial data also requires agreement on the target projection, resolution and attribute table structure. Differing taxonomies (classification systems, legends, nomenclatures; Čerba et al., 2012) are a frequent cause of spatial data heterogeneity, especially in cross-border and sometimes even in federal settings (Parycek et al., 2014; Shvaiko et al., 2012).


Types of harmonization

Basic definition: data harmonization (in the geoscience context) is a process of transforming and re-organizing multi-source data so that it can be seamlessly merged or bound. It can be compared to searching for the lowest common denominator in mathematics. The main objectives of data harmonization are usually:

  • Convert or transform data into common standards/data formats,
  • Reduce bias between the methods (translate values to a common reference),
  • Bind data into a single database.

In other words, data harmonization comprises processes such as standardization, translation, and pre- and post-processing. All of these can be highly complex and can introduce artifacts and failures. Not all data can be harmonized, and sometimes it is not even worth starting the harmonization process. Some examples of incompatible data:

  • Incompatible extent, scale and/or spatio-temporal reference or similar,
  • Incompatible target variables and/or measurement procedures.
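
Such incompatibilities can often be screened for before any processing starts. The following is a minimal sketch of such a pre-check, assuming that each layer is described by a small metadata record; the `LayerInfo` class, its field names and the thresholds (a resolution ratio of 10 and a gap of 5 years) are hypothetical choices made for illustration, not values taken from the project.

```python
from dataclasses import dataclass


@dataclass
class LayerInfo:
    """Hypothetical metadata record describing one input layer."""
    name: str
    crs: str           # coordinate reference system, e.g. an EPSG code
    resolution: float  # cell size in metres
    year: int          # temporal reference of the data


def compatibility_issues(a, b, max_resolution_ratio=10.0, max_year_gap=5):
    """List reasons why two layers may be hard to harmonize.

    The thresholds are illustrative assumptions; an empty list means
    that this simple screening found nothing suspicious.
    """
    issues = []
    if a.crs != b.crs:
        issues.append("different CRS: reprojection needed")
    ratio = max(a.resolution, b.resolution) / min(a.resolution, b.resolution)
    if ratio > max_resolution_ratio:
        issues.append("resolution differs by a factor of %.1f" % ratio)
    if abs(a.year - b.year) > max_year_gap:
        issues.append("temporal reference differs by %d years"
                      % abs(a.year - b.year))
    return issues


# Two hypothetical elevation layers from neighbouring countries
dem_a = LayerInfo("dem_a", "EPSG:3035", 25.0, 2018)
dem_b = LayerInfo("dem_b", "EPSG:25832", 1000.0, 2000)

issues = compatibility_issues(dem_a, dem_b)
for issue in issues:
    print(issue)
```

A check of this kind only flags candidates for closer inspection; whether a flagged pair of layers is worth harmonizing remains a case-by-case decision.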

Standard geodata harmonization operations with multisource data include:

  1. Data format translation (e.g. using GDAL/OGR or similar),
  2. Conversion of values to a common standard/reference,
  3. Combining multi-source data into virtual/physical mosaics,
  4. Resampling of data to ensure a consistent sampling interval,
  5. Filling in gaps/imputation of missing values and making sure that completeness of all layers is consistent.
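
Operations 2 and 4 in the list above can be sketched in a few lines of plain Python. The example converts elevation values from feet to metres and then aggregates the grid to a coarser resolution by block averaging. The grid values, the use of `None` as a no-data marker and the function names are assumptions made for illustration; in practice these steps would typically be carried out with GDAL, GRASS GIS or a similar package.

```python
from statistics import mean

FEET_TO_METRES = 0.3048  # definition of the international foot


def to_metres(grid_ft):
    """Convert a grid of values in feet to metres, keeping gaps (None)."""
    return [[None if v is None else v * FEET_TO_METRES for v in row]
            for row in grid_ft]


def block_average(grid, factor):
    """Resample to a coarser grid by averaging factor x factor blocks.

    Cells marked None are ignored; a block without any valid cell
    stays None. Assumes the grid dimensions are multiples of factor.
    """
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[i][j]
                     for i in range(r, r + factor)
                     for j in range(c, c + factor)
                     if grid[i][j] is not None]
            out_row.append(mean(block) if block else None)
        out.append(out_row)
    return out


# Hypothetical 4 x 4 elevation grid in feet with one missing cell
elevation_ft = [
    [100.0, 100.0, 200.0, 200.0],
    [100.0, 100.0, 200.0, None],
    [300.0, 300.0, 400.0, 400.0],
    [300.0, 300.0, 400.0, 400.0],
]

elevation_m = to_metres(elevation_ft)    # operation 2: common reference
coarse_m = block_average(elevation_m, 2)  # operation 4: common resolution
```

Converting the values before resampling keeps every later step in a single unit, which makes intermediate results easier to validate.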

We distinguish the following groups of harmonization:

  1. Harmonization of geographical entities (upscaling, downscaling, resolving of edge-problems and cross-border problems) so that seamless geographical layers can be produced (Hintz, 2012), 
  2. Harmonization of variables coming from different sources to the same set of variables,
  3. Semantic harmonization allowing for combination of data with different legends and mapping concepts,
  4. Harmonization of the map styling and metadata standards,
  5. Harmonization of data quality, i.e. providing close-to-homogeneous data quality standards.
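
Semantic harmonization (group 3) is often implemented as a crosswalk between source legends and a common target legend. The sketch below assumes two hypothetical national land cover legends; all class names, the `CROSSWALK` table and the function names are invented for illustration. Labels without a mapping are collected rather than silently dropped, so that they can be reviewed by a domain expert.

```python
# Hypothetical crosswalk from two national legends to a common legend
CROSSWALK = {
    "country_a": {
        "coniferous forest": "forest",
        "deciduous forest": "forest",
        "arable land": "cropland",
    },
    "country_b": {
        "woodland": "forest",
        "field": "cropland",
        "meadow": "grassland",
    },
}


def harmonize_class(source, label, crosswalk=CROSSWALK):
    """Return the common class for a source label, or None if unmapped."""
    return crosswalk.get(source, {}).get(label)


def harmonize_records(records):
    """Translate (source, label) pairs and collect the unmapped ones."""
    harmonized, unmapped = [], []
    for source, label in records:
        target = harmonize_class(source, label)
        if target is None:
            unmapped.append((source, label))
        else:
            harmonized.append(target)
    return harmonized, unmapped


records = [
    ("country_a", "coniferous forest"),
    ("country_b", "woodland"),
    ("country_b", "meadow"),
    ("country_a", "peat bog"),  # no mapping defined on purpose
]

harmonized, unmapped = harmonize_records(records)
print(harmonized)
print(unmapped)
```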

Overview of software for data harmonization

Table M1.1: Matrix comparison of commonly used open-source packages for geospatial data harmonization.

| Harmonization objective | Software | Overview | Software license |
|---|---|---|---|
| Automated data harmonization and seamless transition to cloud processing | PODPAC | PODPAC, the Pipeline for Observation Data Processing Analysis and Collaboration, is a Python-based software library for automated data harmonization and seamless transition to cloud processing. Data sources encapsulated by PODPAC are automatically projected and interpolated to a user-specified geospatial reference system. This allows plug-and-play development of processing pipelines using multi-scale and multi-source data. While these pipelines may be developed in Jupyter notebooks on local machines, they can also be exported using an automatically generated text-based description and run on massively distributed remote cloud servers. PODPAC is developed under a permissive open-source license and is available at https://github.com/creare-com/podpac. | Apache License 2.0 |
| Automated data harmonization and seamless transition to cloud processing | HALE studio | hale»studio is a tool for creating and using open-standards data. It is built from the ground up to support rich open standards such as OGC GML and CityGML, INSPIRE, ALKIS/NAS, IFC or any other XML- or JSON-based standard. It also supports PostgreSQL, Oracle, File Geodatabases and many other formats. | GNU Lesser General Public License Version 3 |
| Geospatial database | PostGIS | PostGIS is a spatial database extender for the PostgreSQL object-relational database. | GNU General Public License (GPLv2 or later) |
| Geospatial database | SpatiaLite | SpatiaLite is an open-source library intended to extend the SQLite core to support fully fledged Spatial SQL capabilities. | MPL tri-license terms; choose from: the MPL 1.1, the GPL v2.0 or any subsequent version, the LGPL v2.1 or any subsequent version |
| GIS | GRASS GIS | The Geographic Resources Analysis Support System, commonly referred to as GRASS GIS, is an Open Source Geographic Information System providing powerful raster, vector and geospatial processing capabilities in a single integrated software suite. GRASS GIS includes tools for spatial modeling, visualization of raster and vector data, management and analysis of geospatial data, and the processing of satellite and aerial imagery. It also provides the capability to produce sophisticated presentation graphics and hardcopy maps. GRASS GIS has been translated into about twenty languages and supports a huge array of data formats. It can be used either as a stand-alone application or as a backend for other software packages such as QGIS and R. It is distributed freely under the terms of the GNU General Public License (GPL). GRASS GIS is a founding member of the Open Source Geospatial Foundation (OSGeo). | GNU General Public License (GPLv2 or later) |
| GIS | SAGA | The System for Automated Geoscientific Analyses (SAGA GIS) is a geographic information system (GIS) computer program used to edit spatial data. It is free and open-source software, mainly licensed under the GPL. | SAGA GUI and SAGA CMD: GNU General Public License (GPL); SAGA API: GNU Lesser General Public License (LGPL); SAGA modules: most modules are licensed under the GNU General Public License (GPL) |
| GIS | QGIS | QGIS is a user-friendly Open Source Geographic Information System (GIS) licensed under the GNU General Public License. | GNU General Public License v2.0 |
| Map server | GeoServer | GeoServer implements industry-standard OGC protocols such as Web Feature Service (WFS), Web Map Service (WMS), and Web Coverage Service (WCS). It is implemented in the Java programming language. | GNU General Public License v2.0 |
| Map server | MapServer | MapServer is an Open Source platform for publishing spatial data and interactive mapping applications to the web. It is implemented in the C programming language. | MIT-style license |
| Map server | QGIS Server | QGIS Server is an open-source WMS 1.3, WFS 1.0.0, WFS 1.1.0 and WCS 1.1.1 implementation that, in addition, implements advanced cartographic features for thematic mapping. | GNU General Public License v2.0 |
| Raster and vector processing workbench | GDAL | GDAL is a translator library for raster and vector geospatial data formats that is released under an X/MIT style Open Source License by the Open Source Geospatial Foundation. As a library, it presents a single raster abstract data model and a single vector abstract data model to the calling application for all supported formats. It also comes with a variety of useful command line utilities for data translation and processing. | MIT/X style |
| Raster and vector processing workbench | Rasterio | Rasterio reads and writes geospatial raster datasets. It employs GDAL under the hood for file I/O and raster formatting, and provides a Python API based on NumPy N-dimensional arrays and GeoJSON. | BSD license |
| Statistical language | R | R is a programming language and environment for statistical computing and graphics. It comes with several packages supporting automatic translation of data formats (e.g. “zoo” for time data formats, “sf” and “sp” for spatial data) and “measurements” for units and standards; it can also be used to access GDAL (via rgdal). A number of R packages have been developed specifically to support automated harmonization, e.g. the “soiltexture” package to convert from various national systems to international systems, and “febr” to convert soil properties. | GNU General Public License (GPLv2 or later) |
| Vector data operations | GEOS | GEOS (Geometry Engine - Open Source) is a C++ port of the JTS Topology Suite (JTS). It aims to contain the complete functionality of JTS in C++. This includes all the OpenGIS Simple Features for SQL spatial predicate functions and spatial operators, as well as specific JTS enhanced functions. GEOS provides spatial functionality to many other projects and products. | GNU Lesser General Public License (LGPL) |
| Vector data operations | Boost.Geometry | Boost.Geometry (aka Generic Geometry Library, GGL), part of the collection of the Boost C++ Libraries, defines concepts, primitives and algorithms for solving geometry problems. | Boost Software License - Version 1.0 |
| Vector data operations | PyGEOS | PyGEOS is a C/Python library with vectorized geometry functions. The geometry operations are done in the open-source geometry library GEOS. PyGEOS wraps these operations in NumPy ufuncs, providing a performance improvement when operating on arrays of geometries. | BSD 3-Clause |
| Vector data operations | Shapely | Shapely is a BSD-licensed Python package for manipulation and analysis of planar geometric objects. It is based on the widely deployed GEOS (the engine of PostGIS) and JTS (from which GEOS is ported) libraries. | BSD License (BSD) |
| Machine learning packages | TensorFlow | Open-source C++/Python package allowing fast mathematical operations | Apache License 2.0 |
| Machine learning packages | Scikit-learn | Open-source Python machine learning library | BSD license |

Table M1.2: Matrix comparison of available open-source functionality for geospatial data harmonization as per software package.

| Harmonization functionality | Software | Function names / description | Operational use | Reference |
|---|---|---|---|---|
| Raster reprojection and resampling | GDAL | gdal_translate, gdalwarp | change of spatial resolution; reprojection | manual |
| Raster reprojection and resampling | GRASS GIS | r.resamp*, r.proj | change of spatial resolution; reprojection | manual (resampling); manual (reprojection) |
| Raster reprojection and resampling | SAGA | saga_cmd grid_filter | change of spatial resolution | manual |
| Raster reprojection and resampling | PostGIS | ST_Transform, ST_Resample | change of spatial resolution; reprojection | manual; manual |
| Vector reprojection | GDAL/OGR | ogr2ogr | reprojection | manual |
| Vector reprojection | GRASS GIS | v.proj | reprojection | manual |
| Vector reprojection | SAGA | saga_cmd pj_proj4 | reprojection | manual |
| Vector reprojection | PostGIS | ST_Transform | reprojection | manual |
| Gap filling (spatial) | GRASS GIS | r.fillnulls | bilinear, bicubic, RST-based gap filling | manual |
| Gap filling (spatial) | GRASS GIS | r.fill.stats | IDW-based gap filling | manual |
| Gap filling (spatial) | GDAL | gdal_fillnodata | filling of no-data values in rasters | manual |
| Gap filling (temporal) | GRASS GIS | r.hants | HANTS gap filling of time series | manual |
| Gap filling (temporal) | GRASS GIS | r.series.lwr | local weighted regression gap filling of time series | manual |
| Vector clipping | GRASS GIS | v.clip | | manual |
| Vector clipping | GEOS | geos::operation::intersection | | manual |
| Vector clipping | PostGIS | ST_Intersection | | manual |
| Vector clipping | QGIS | Clip | | manual |
| Vector dissolving | GRASS GIS | v.dissolve | | |
| Vector dissolving | PostGIS | ST_Union, ST_Collect | | manual; manual |
| Vector dissolving | QGIS | Dissolve | | manual |
| Vector topological cleaning | GRASS GIS | v.clean, v.clean.ogr | cleans non-topological polygons in an OGR data source by importing, cleaning, and exporting these polygons | manual |
| Vector topological cleaning | GEOS | (set of different C/C++ classes) | | manual |
| Vector topological cleaning | PostGIS | (set of different types and functions) | | manual |
| Map geometric shifts | GRASS GIS | r.region | linearly corrects the geometric position of a raster map | manual |
| Map geometric shifts | GRASS GIS | v.transform | performs an affine transformation (shift, scale and rotate) on a vector map | manual |
| Map geometric shifts | PostGIS | ST_Translate | returns a new geometry whose coordinates are translated delta x, delta y, delta z units | manual |
| Machine learning | TensorFlow | tensorflow | mathematical operations, building up machine learning models | manual |
| Machine learning | Keras | tensorflow.keras | speeds up deployment of deep learning models in TensorFlow significantly | manual |
| Machine learning | Scikit-learn | sklearn.ensemble | ensemble algorithms based on perturb-and-combine techniques [B1998] specifically designed for trees | manual |
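
To make the idea behind IDW-based spatial gap filling concrete, the following is a minimal pure-Python sketch of inverse distance weighting on a small grid. It is not the implementation used by r.fill.stats or gdal_fillnodata; the function name `idw_fill`, the search radius, the power parameter and the use of `None` as a no-data marker are assumptions made for illustration.

```python
import math


def idw_fill(grid, radius=1, power=2.0):
    """Fill None cells from valid neighbours by inverse distance weighting.

    Only cells within `radius` rows/columns of a gap contribute, and only
    original (not previously filled) values are used. Gaps without any
    valid neighbour stay None. The input grid is not modified.
    """
    rows, cols = len(grid), len(grid[0])
    filled = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is not None:
                continue
            weight_sum = 0.0
            value_sum = 0.0
            for i in range(max(0, r - radius), min(rows, r + radius + 1)):
                for j in range(max(0, c - radius), min(cols, c + radius + 1)):
                    value = grid[i][j]
                    if value is None:
                        continue  # skips the gap itself and other gaps
                    weight = 1.0 / math.hypot(i - r, j - c) ** power
                    weight_sum += weight
                    value_sum += weight * value
            if weight_sum > 0.0:
                filled[r][c] = value_sum / weight_sum
    return filled


# Hypothetical 3 x 3 raster with a single gap in the centre
raster = [
    [1.0, 2.0, 1.0],
    [2.0, None, 2.0],
    [1.0, 2.0, 1.0],
]

filled = idw_fill(raster)
```

In this example the four direct neighbours (distance 1) receive twice the weight of the four diagonal neighbours (distance √2, power 2), so the filled value lies closer to 2 than to 1.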


Benchmarking of technology

In general, benchmarking means comparing the processes and performance metrics of one product with the industry best practices of other products. The dimensions measured are usually quality, time/performance and cost. Data harmonization is a major task within the Geo-harmonizer project. Especially when data harmonization needs to be automated and becomes a recurring process that is regularly applied to newly incoming data, the performance and usability of the tools are among the most important issues. The main goal of benchmarking here is to define the main criteria for the measurements and to select the tools to be investigated. The best-practice methodology to be implemented will indirectly contribute to finding areas for improvement in interoperability (Veeckman et al., 2017).

The main criteria are as follows:

  • Precision of operations - efforts are required to quantify accuracy and to carry out validations,
  • Processing performance in relation to the resources used,
  • Supported maximum map size (raster model: pixels; vector model: number of objects),
  • Flexibility in the use of software tools,
  • Portability of software (different environments and operating systems).
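
The first three criteria can be measured with a small harness. The sketch below times an operation and compares its output with a reference result; the function names, the rounding-based "tool under test" and the synthetic input are hypothetical and only illustrate how precision, run time and data size might be recorded together.

```python
import time


def benchmark(operation, data, reference, repeats=5):
    """Measure run time and precision of an operation.

    Returns the best wall-clock time over `repeats` runs (seconds), the
    maximum absolute error against the reference, and the input size.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = operation(data)
        best = min(best, time.perf_counter() - start)
    max_error = max(abs(a - b) for a, b in zip(result, reference))
    return {"seconds": best, "max_abs_error": max_error,
            "n_values": len(data)}


def feet_to_metres_rounded(values):
    """Hypothetical tool under test: converts feet to metres, rounded to cm."""
    return [round(v * 0.3048, 2) for v in values]


# Synthetic input and an unrounded reference conversion
values_ft = [float(v) for v in range(1000)]
reference_m = [v * 0.3048 for v in values_ft]

report = benchmark(feet_to_metres_rounded, values_ft, reference_m)
print(report)
```

Taking the best of several runs reduces the influence of background load on the timing; the error measure and the number of repeats should be adapted to the operation being compared.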

 

Due to the uncertainties resulting from the expected high complexity of data harmonization, the reliability and robustness of the tools will be difficult to assess. One of the main problems of benchmarking in this context is that we do not expect to receive methods from third parties that would help us compare our results objectively with others. Therefore, the definitions of precision and performance must ultimately fit the needs that we identify in the course of the project. After analysing the results, we will derive possible improvements to the tools mentioned above and then assess, by weighing each improvement against the effort it requires, which of them should ultimately be implemented.


Cited references

  • Čerba, O., Charvát, K., Janečka, J., Jedlička, K., Ježek, J., & Mildorf, T. (2012). The overview of spatial data harmonisation approaches and tools. In Proceedings of the 4th International Conference on Cartography and GIS (Vol. 1, pp. 113-124).
  • Hintz, D. (2012). Data Harmonization Principles and Development Approaches as Applied to INSPIRE SDIs. INSPIRE-GMES Information Brochure. Technische Universität München.
  • Parycek, P., Höchtl, J., & Ginner, M. (2014). Open government data implementation evaluation. Journal of Theoretical and Applied Electronic Commerce Research, 9(2), 80-99.
  • Shvaiko, P., Farazi, F., Maltese, V., Ivanyukovich, A., Rizzi, V., Ferrari, D., & Ucelli, G. (2012, November). Trentino government linked open geo-data: a case study. In International Semantic Web Conference (pp. 196-211). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35173-0_13
  • Veeckman, C., Jedlička, K., De Paepe, D., Kozhukh, D., Kafka, Š., Colpaert, P., & Čerba, O. (2017). Geodata interoperability and harmonization in transport: a case study of open transport net. Open Geospatial Data, Software and Standards, 2(1), 1-11. https://doi.org/10.1186/s40965-017-0015-6 

Software references

List of scientific articles and books related to the software packages presented in this document (software name in bold):
