M1 Matrix comparison of technology for automated data harmonization
Geo-harmonizer project
Prepared by: mundialis, OpenGeoHub, CTU in Prague and MultiOne
Executive summary
Interoperability of cross-border data is understood here in two senses: (a) geometric harmonization, so that data are seamlessly compatible across national boundaries, and (b) semantic harmonization, so that data are comparable with respect to content. This document provides a systematic review of relevant geospatial data harmonization methods and software packages. Its main objective is to support other “Geo-harmonizer” project activities and to offer guidance on using software for harmonization.
Introduction to the harmonization of geospatial data
Geodata harmonization is the process of combining data that belong to the same topic but originate from heterogeneous data sources, and hence differ in file formats, naming conventions, and projections, into one cohesive geospatial data set. Harmonization of geospatial data also requires agreement on the target projections, resolution and attribute table structures. Differing taxonomies (classification systems, legends, nomenclatures; Čerba et al., 2012) are a frequently recurring cause of spatial data heterogeneity, especially in cross-border and sometimes even in federal constellations (Parycek et al., 2014; Shvaiko et al., 2012).
Types of harmonization
Basic definition: data harmonization (in the geoscience context) is a process of transforming and re-organizing multi-source data so that it can be seamlessly merged or bound. It can be compared to searching for the lowest common denominator in mathematics. The main objectives of data harmonization are usually:
- Convert or transform data into common standards/data formats,
- Reduce bias between the methods (translate values to a common reference),
- Bind data into a single database.
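The three objectives above can be illustrated with a minimal Python sketch. The two input sources, their field names and the Fahrenheit/Celsius mismatch are hypothetical; they only show the pattern of translating each source to a common schema and reference unit and then binding the records into one table.

```python
# Hypothetical records from two sources; field names and units are assumptions.
SOURCE_A = [{"station": "A1", "temp_c": 12.5}, {"station": "A2", "temp_c": 9.0}]
SOURCE_B = [{"id": "B7", "temperature_f": 50.0}]


def from_source_a(record):
    """Source A already reports degrees Celsius; only the field names change."""
    return {"station_id": record["station"],
            "temperature_c": record["temp_c"],
            "source": "A"}


def from_source_b(record):
    """Source B reports degrees Fahrenheit; convert to the common reference."""
    celsius = (record["temperature_f"] - 32.0) * 5.0 / 9.0
    return {"station_id": record["id"],
            "temperature_c": round(celsius, 2),
            "source": "B"}


def harmonize(source_a, source_b):
    """Bind both sources into a single list of records with one schema."""
    return ([from_source_a(r) for r in source_a]
            + [from_source_b(r) for r in source_b])


harmonized = harmonize(SOURCE_A, SOURCE_B)
```

Keeping one small translation function per source makes it easy to add further sources without touching the common schema.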
Data harmonization thus comprises processes such as standardization, translation, and pre- and post-processing. All of these can be highly complex and can lead to artifacts and failures. Not all data can be harmonized, and sometimes it is not even worth starting the harmonization process. Some examples of incompatible data:
- Incompatible extent, scale and/or spatio-temporal reference or similar,
- Incompatible target variables and/or measurement procedures.
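A simple pre-check can flag such cases before any harmonization effort is spent. The sketch below is only illustrative: the LayerInfo fields and the thresholds (max_resolution_ratio, max_year_gap) are assumptions, not values defined by the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerInfo:
    """Minimal metadata of one input layer (hypothetical structure)."""
    name: str
    variable: str        # target variable, e.g. "soil_ph"
    method: str          # measurement procedure
    resolution_m: float  # cell size or nominal scale in metres
    reference_year: int  # temporal reference


def find_incompatibilities(first, second,
                           max_resolution_ratio=10.0, max_year_gap=5):
    """Return a list of reasons why two layers may not be worth harmonizing."""
    problems = []
    if first.variable != second.variable:
        problems.append("target variables differ")
    if first.method != second.method:
        problems.append("measurement procedures differ")
    coarse = max(first.resolution_m, second.resolution_m)
    fine = min(first.resolution_m, second.resolution_m)
    if coarse / fine > max_resolution_ratio:
        problems.append("spatial resolutions differ too much")
    if abs(first.reference_year - second.reference_year) > max_year_gap:
        problems.append("temporal references differ too much")
    return problems


# Assumed example layers on either side of a border.
north = LayerInfo("north", "soil_ph", "H2O", 30.0, 2018)
south = LayerInfo("south", "soil_ph", "CaCl2", 1000.0, 2019)
issues = find_incompatibilities(north, south)
```

An empty list means that none of the assumed checks failed; it does not guarantee that the layers can be harmonized.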
Standard geodata harmonization operations with multisource data include:
- Data format translation (e.g. using GDAL/OGR or similar),
- Conversion of values to a common standard/reference,
- Combining multi-source data into virtual/physical mosaics,
- Resampling of data to ensure a consistent sampling interval,
- Filling in gaps/imputation of missing values and making sure that completeness of all layers is consistent.
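Two of these operations, resampling and gap filling, can be sketched on a small in-memory grid. This is a simplified pure-Python illustration (nearest-neighbour upsampling and a single-pass neighbourhood mean), not the algorithm implemented by GDAL, GRASS GIS or any other package; operational work would rely on those tools.

```python
def resample_nearest(grid, factor):
    """Upsample a 2D list-of-lists grid by an integer factor (nearest neighbour)."""
    return [[row[c // factor] for c in range(len(row) * factor)]
            for row in grid
            for _ in range(factor)]


def fill_gaps(grid):
    """Replace None cells by the mean of their valid 8-neighbours (single pass)."""
    rows, cols = len(grid), len(grid[0])
    filled = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None:
                neighbours = [
                    grid[rr][cc]
                    for rr in range(max(0, r - 1), min(rows, r + 2))
                    for cc in range(max(0, c - 1), min(cols, c + 2))
                    if grid[rr][cc] is not None
                ]
                # Cells without any valid neighbour stay empty.
                if neighbours:
                    filled[r][c] = sum(neighbours) / len(neighbours)
    return filled


coarse = [[1, 2], [3, 4]]
fine = resample_nearest(coarse, 2)

with_gap = [[1.0, 2.0, 3.0], [4.0, None, 6.0], [7.0, 8.0, 9.0]]
complete = fill_gaps(with_gap)
```

The gap filler reads only from the input grid, so filled values never feed into other filled cells; larger gaps would need repeated passes or a proper interpolation method.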
We distinguish the following groups of harmonization:
- Harmonization of geographical entities (upscaling, downscaling, resolving of edge-problems and cross-border problems) so that seamless geographical layers can be produced (Hintz, 2012),
- Harmonization of variables coming from different sources to the same set of variables,
- Semantic harmonization allowing for combination of data with different legends and mapping concepts,
- Harmonization of the map styling and metadata standards,
- Harmonization of data quality, i.e. providing close-to-homogeneous data quality standards.
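Semantic harmonization is often implemented as a crosswalk (lookup table) from each source legend to a common target legend. The following sketch assumes hypothetical class codes and a hypothetical target legend; classes without an agreed translation are reported rather than silently guessed.

```python
# Hypothetical crosswalk from a national legend to a common legend.
CROSSWALK_COUNTRY_A = {
    "A11": "forest",    # e.g. coniferous forest in the national legend
    "A12": "forest",    # e.g. broadleaved forest
    "A20": "cropland",
}


def translate_classes(labels, crosswalk):
    """Map source labels to the common legend and collect unmapped labels."""
    translated = []
    unmapped = set()
    for label in labels:
        target = crosswalk.get(label)
        if target is None:
            unmapped.add(label)
        translated.append(target)
    return translated, unmapped


translated, unmapped = translate_classes(["A11", "A20", "X9"], CROSSWALK_COUNTRY_A)
```

Returning the unmapped labels separately lets domain experts extend the crosswalk before the data are merged.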
Overview of software for data harmonization
Harmonization objective | Software | Overview | Software License |
Automated data harmonization and seamless transition to cloud processing | PODPAC | PODPAC, the Pipeline for Observation Data Processing Analysis and Collaboration, is a Python-based software library for automated data harmonization and seamless transition to cloud processing. Data sources encapsulated by PODPAC are automatically projected and interpolated to a user-specified geospatial reference system. This allows plug-and-play development of processing pipelines using multi-scale and multi-source data. While these pipelines may be developed in Jupyter notebooks on local machines, they can also be exported using an automatically-generated text-based description and run on massively distributed remote cloud servers. PODPAC is under development under a permissive open-source license and is available at https://github.com/creare-com/podpac. | Apache License 2.0 |
Automated data harmonization and seamless transition to cloud processing | HALE studio | Create and use Open Standards data to its full potential. hale»studio is built from the ground up to support rich open standards such as OGC GML and CityGML, INSPIRE, ALKIS/NAS, IFC or any other XML- or JSON-based standard. It also supports PostgreSQL, Oracle, File Geodatabases and many other formats. | GNU Lesser General Public License Version 3 |
Geospatial database | PostGIS | PostGIS is a spatial database extender for the PostgreSQL object-relational database. | GNU General Public License (GPLv2 or later) |
Geospatial database | SpatiaLite | SpatiaLite is an open-source library intended to extend the SQLite core to support fully fledged Spatial SQL capabilities. | MPL tri-license terms (including the MPL 1.1) |
GIS | GRASS GIS | The Geographic Resources Analysis Support System, commonly referred to as GRASS GIS, is an Open Source Geographic Information System providing powerful raster, vector and geospatial processing capabilities in a single integrated software suite. GRASS GIS includes tools for spatial modeling, visualization of raster and vector data, management and analysis of geospatial data, and the processing of satellite and aerial imagery. It also provides the capability to produce sophisticated presentation graphics and hardcopy maps. GRASS GIS has been translated into about twenty languages and supports a huge array of data formats. It can be used either as a stand-alone application or as a backend for other software packages such as QGIS and R geostatistics. It is distributed freely under the terms of the GNU General Public License (GPL). GRASS GIS is a founding member of the Open Source Geospatial Foundation (OSGeo). | GNU General Public License (GPLv2 or later) |
GIS | SAGA | System for Automated Geoscientific Analyses (SAGA GIS) is a geographic information system (GIS) computer program, used to edit spatial data. It is free and open-source software, mainly licensed under the GPL license. | SAGA GUI + SAGA CMD: GNU General Public License (GPL); SAGA API: GNU Lesser General Public License (LGPL); SAGA Modules: most modules are licensed under the GNU General Public License (GPL) |
GIS | QGIS | QGIS is a user-friendly Open Source Geographic Information System (GIS) licensed under the GNU General Public License. | GNU General Public License v2.0 |
Map server | GeoServer | GeoServer implements industry-standard OGC protocols such as Web Feature Service (WFS), Web Map Service (WMS), and Web Coverage Service (WCS). Implemented in the Java programming language. | GNU General Public License v2.0 |
Map server | MapServer | MapServer is an Open Source platform for publishing spatial data and interactive mapping applications to the web. Implemented in the C programming language. | MIT-style license |
Map server | QGIS Server | QGIS Server is an open-source WMS 1.3, WFS 1.0.0, WFS 1.1.0 and WCS 1.1.1 implementation that, in addition, implements advanced cartographic features for thematic mapping. | GNU General Public License v2.0 |
Raster and vector processing workbench | GDAL | GDAL is a translator library for raster and vector geospatial data formats that is released under an X/MIT style Open Source License by the Open Source Geospatial Foundation. As a library, it presents a single raster abstract data model and single vector abstract data model to the calling application for all supported formats. It also comes with a variety of useful command line utilities for data translation and processing. | MIT/X style |
Raster and vector processing workbench | Rasterio | Rasterio reads and writes geospatial raster datasets. It employs GDAL under the hood for file I/O and raster formatting, and provides a Python API based on NumPy N-dimensional arrays and GeoJSON. | BSD license |
Statistical language | R | R is a programming language and environment for statistical computing and graphics. It comes with several packages supporting automatic translation of data formats (e.g. “zoo” for time data formats, “sf” and “sp” for spatial data) and “measurements” for units and standards; it can also be used to work with GDAL more efficiently (via rgdal). A number of R packages have been developed specifically to support automated harmonization, e.g. the “soiltexture” package to convert from various national systems to international systems, and “febr” to convert soil properties. | GNU General Public License (GPLv2 or later) |
Vector data operations | GEOS | GEOS (Geometry Engine – Open Source) is a C++ port of the JTS Topology Suite (JTS). It aims to contain the complete functionality of JTS in C++. This includes all the OpenGIS Simple Features for SQL spatial predicate functions and spatial operators, as well as specific JTS enhanced functions. GEOS provides spatial functionality to many other projects and products. | GNU Lesser General Public License (LGPL) |
Vector data operations | Geometry Boost | Boost.Geometry (aka Generic Geometry Library, GGL), part of the collection of the Boost C++ Libraries, defines concepts, primitives and algorithms for solving geometry problems. | Boost Software License – Version 1.0 |
Vector data operations | pyGEOS | PyGEOS is a C/Python library with vectorized geometry functions. The geometry operations are done in the open-source geometry library GEOS. PyGEOS wraps these operations in NumPy ufuncs, providing a performance improvement when operating on arrays of geometries. | BSD 3-Clause |
Vector data operations | Shapely | Shapely is a BSD-licensed Python package for manipulation and analysis of planar geometric objects. It is based on the widely deployed GEOS (the engine of PostGIS) and JTS (from which GEOS is ported) libraries. | BSD License (BSD) |
Machine learning packages | TensorFlow | Open-source C++/Python package allowing fast mathematical operations | Apache License 2.0 |
Machine learning packages | Scikit-learn | Open-source Python machine learning library | BSD license |
Harmonization functionality | Software | Function names / description | Operational use | Reference |
Raster reprojection and resampling | GDAL | gdal_translate, gdalwarp | change of spatial resolution; reprojection | manual |
Raster reprojection and resampling | GRASS GIS | r.resamp*, r.proj | change of spatial resolution; reprojection | manual resampling; manual reprojection |
Raster reprojection and resampling | SAGA | saga_cmd grid_filter | change of spatial resolution | manual |
Raster reprojection and resampling | PostGIS | ST_Transform, ST_Resample | change of spatial resolution; reprojection | manual |
Vector reprojection | GDAL/OGR | ogr2ogr | reprojection | manual |
Vector reprojection | GRASS GIS | v.proj | reprojection | manual |
Vector reprojection | SAGA | saga_cmd pj_proj4 | reprojection | manual |
Vector reprojection | PostGIS | ST_Transform | reprojection | manual |
Gap filling (spatial) | GRASS GIS | r.fillnulls | bilinear, bicubic, RST based gap filling | manual |
Gap filling (spatial) | GRASS GIS | r.fill.stats | IDW based gap filling | manual |
Gap filling (spatial) | GDAL | gdal_fillnodata | filling of no-data values in rasters | manual |
Gap filling (temporal) | GRASS GIS | r.hants | HANTS gap-filling of time series | manual |
Gap filling (temporal) | GRASS GIS | r.series.lwr | local weighted regression gap-filling of time series | manual |
Vector clipping | GRASS GIS | v.clip | | manual |
Vector clipping | GEOS | geos::operation::intersection | | manual |
Vector clipping | PostGIS | ST_Intersection | | manual |
Vector clipping | QGIS | Clip | | manual |
Vector dissolving | GRASS GIS | v.dissolve | | |
Vector dissolving | PostGIS | ST_Union, ST_Collect | | manual |
Vector dissolving | QGIS | Dissolve | | manual |
Vector topological cleaning | GRASS GIS | v.clean, v.clean.ogr | cleans non-topological polygons in an OGR data source by importing, cleaning, and exporting these polygons | manual |
Vector topological cleaning | GEOS | (set of different C/C++ classes) | | manual |
Vector topological cleaning | PostGIS | (set of different types and functions) | | manual |
Map geometric shifts | GRASS GIS | r.region | Correct linearly the geometric position of a raster map | manual |
Map geometric shifts | GRASS GIS | v.transform | Performs an affine transformation (shift, scale and rotate) on a vector map | manual |
Map geometric shifts | PostGIS | ST_Translate | Returns a new geometry whose coordinates are translated delta x, delta y, delta z units | manual |
Machine learning | TensorFlow | tensorflow | Mathematical operations, building up machine learning models | manual |
Machine learning | Keras | tensorflow.keras | Speeds up deployments of deep learning models in TensorFlow significantly | manual |
Machine learning | Scikit-learn | sklearn.ensemble | Algorithms based on perturb-and-combine techniques [B1998] specifically designed for trees | manual |
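The operations listed under "Map geometric shifts" in the table above amount to an affine transformation of coordinates. A minimal sketch for 2D points is given below; the function name, its parameters and the order of operations (scale, then rotate about the origin, then shift) are assumptions made for illustration and do not reproduce the exact behaviour of v.transform or ST_Translate.

```python
import math


def affine_transform(points, dx=0.0, dy=0.0, scale=1.0, rotation_deg=0.0):
    """Scale, rotate counter-clockwise about the origin, then shift 2D points."""
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    transformed = []
    for x, y in points:
        xs, ys = x * scale, y * scale          # scaling
        xr = xs * cos_t - ys * sin_t           # rotation
        yr = xs * sin_t + ys * cos_t
        transformed.append((xr + dx, yr + dy)) # shift
    return transformed


# Pure shift of two vertices by 10 units east and 5 units north.
shifted = affine_transform([(1.0, 0.0), (0.0, 2.0)], dx=10.0, dy=5.0)

# Quarter turn of a single vertex, without shift or scaling.
rotated = affine_transform([(1.0, 0.0)], rotation_deg=90.0)
```

In practice the shift parameters would be estimated from control points shared by both data sets before such a transformation is applied.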
Benchmarking of technology
In general, benchmarking means comparing the processes and performance metrics of one product with the industry best practices of other products. The dimensions measured are usually quality, time/performance and cost. Data harmonization is a major task within the Geo-harmonizer project. Especially when data harmonization is automated and becomes a recurring process that regularly harmonizes newly incoming data, the performance and usability of the tools used are among the most important issues. The main goal of benchmarking is to define the main criteria for the measurements and for the selection of the tools we are investigating. The best-practice methodology to be implemented will indirectly contribute to finding areas for improvement in interoperability (Veeckman et al., 2017).
The main criteria are as follows:
- Precision of operations – efforts are required to quantify accuracy and to carry out validations,
- Processing performance in relation to the resources used,
- Supported maximum map size (raster model: pixels; vector model: number of objects),
- Flexibility in the use of software tools,
- Portability of software (different environments and operating systems).
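The first two criteria, precision and processing performance, could be measured with a small harness such as the one sketched below. The benchmark function, the stand-in tool and the choice of RMSE against a reference result as the precision measure are assumptions for illustration; the project may settle on other metrics.

```python
import math
import time


def rmse(result, reference):
    """Root mean square error between a result and a reference of equal length."""
    squared = [(a - b) ** 2 for a, b in zip(result, reference)]
    return math.sqrt(sum(squared) / len(reference))


def benchmark(tool, data, reference, repeats=3):
    """Run a tool several times; report the best wall-clock time and the RMSE."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = tool(data)
        timings.append(time.perf_counter() - start)
    return {"best_seconds": min(timings), "rmse": rmse(result, reference)}


def double_values(values):
    """Stand-in for a harmonization tool (hypothetical)."""
    return [2.0 * v for v in values]


report = benchmark(double_values, [1.0, 2.0, 3.0], reference=[2.0, 4.0, 6.5])
```

Reporting the best of several runs reduces the influence of one-off system load; memory use and maximum map size would need separate measurements.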
Due to the uncertainties resulting from the expected high complexity of data harmonization, the reliability and robustness of the tools used will be difficult to assess. One of the main problems of benchmarking in this context is that we do not expect to receive methods from third parties that would help us compare our results objectively with others. Above all, therefore, the definitions of precision and performance must ultimately fit the needs that we identify in the course of the project. After analysing the results, we will derive possible improvements to the tools mentioned above and assess, by weighing each improvement against the effort required, which of these suggestions should ultimately be implemented.
Cited references
- Čerba, O., Charvát, K., Janečka, J., Jedlička, K., Ježek, J., & Mildorf, T. (2012). The overview of spatial data harmonisation approaches and tools. In Proceedings of the 4th International Conference on Cartography and GIS (Vol. 1, pp. 113-124).
- Hintz, D. (2012). Data Harmonization Principles and Development Approaches as Applied to INSPIRE SDIs. INSPIRE-GMES Information Brochure. Technische Universität München.
- Parycek, P., Höchtl, J., & Ginner, M. (2014). Open government data implementation evaluation. Journal of Theoretical and Applied Electronic Commerce Research, 9(2), 80-99.
- Shvaiko, P., Farazi, F., Maltese, V., Ivanyukovich, A., Rizzi, V., Ferrari, D., & Ucelli, G. (2012, November). Trentino government linked open geo-data: a case study. In International Semantic Web Conference (pp. 196-211). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35173-0_13
- Veeckman, C., Jedlička, K., De Paepe, D., Kozhukh, D., Kafka, Š., Colpaert, P., & Čerba, O. (2017). Geodata interoperability and harmonization in transport: a case study of open transport net. Open Geospatial Data, Software and Standards, 2(1), 1-11. https://doi.org/10.1186/s40965-017-0015-6
Software references
List of scientific articles and books related to the software packages presented in this document:
- Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., et al. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265-283.
- Casagrande, L., Cavallini, P., Frigeri, A., Furieri, A., Marchesini, I., Neteler, M. (2014): GIS Open Source: GRASS GIS, Quantum GIS and SpatiaLite. ISBN-13: 9788857902906
- Conrad, O., Bechtel, B., Bock, M., Dietrich, H., Fischer, E., Gerlitz, L., Wehberg, J., Wichmann, V., and Boehner, J. (2015): System for Automated Geoscientific Analyses (SAGA) v. 2.1.4. Geosci. Model Dev., 8, 1991-2007, https://doi.org/10.5194/gmd-8-1991-2015
- Graser, A. (2013). Learning QGIS 2.0. Packt Publishing Ltd., 110 pp, ISBN-13: 9781783280001
- Hale Studio (2018), at https://www.wetransform.to/products/halestudio/ [accessed 27 Feb 2020]
- Iacovella, S. (2014). GeoServer Cookbook. Packt Publishing Ltd. ISBN 978-1-78328-961-5
- Lime S. (2008) MapServer. In: Hall G.B., Leahy M.G. (eds) Open Source Approaches in Spatial Data Handling. Advances in Geographic Information Science, vol 2. Springer, Berlin, Heidelberg, https://doi.org/10.1007/978-3-540-74831-1_4
- McInerney, D., & Kempeneers, P. (2015). Introduction to GDAL utilities. In Open Source Geospatial Tools (pp. 61-62). Springer, Cham. https://doi.org/10.1007/978-3-319-01824-9_4
- Neteler, M., Bowman, M. H., Landa, M., & Metz, M. (2012): GRASS GIS: A multi-purpose open source GIS. Environmental Modelling & Software, 31, 124-130, https://doi.org/10.1016/j.envsoft.2011.11.014
- Obe, R., & Hsu, L. (2011). PostGIS in action. GEOInformatics, 14(8), 30-33.
- R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
- Ueckermann, M. P., Bieszczad, J., Entekhabi, D., & Shapiro, M. L. (2018): Use of the Open Source PODPAC Library for Remote, Cloud-Based Data Analysis, Visualization, and Collaboration in a Web Browser. In AGU Fall Meeting Abstracts.
- Pedregosa et al. (2011): Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.