We are continually improving the functionality of Diggernaut, our web scraping and data extraction platform. This time, the enhancement package includes functions for working with geospatial data.
First, we have completely redesigned the function that extracts multi-polygons by OSM relation ID. Previously, we used a third-party service. However, intensive usage showed that this service cannot convert every relation into WKT. It also does not handle multi-polygons with inner rings (holes) correctly: it simply turns all inner rings into outer ones. That is why we wrote our own routine to extract multi-polygons in WKT format. It handles inner rings correctly and can output, in WKT format, any relation that has at least one closed ring (polygon). To use it, call the wkt command as before. Please note that the changes affect only the WKT format; for the GeoJSON format everything remains the same.
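To illustrate why the handling of inner rings matters, here is a minimal Python sketch of how a polygon with a hole is written in WKT, compared with the flattened form in which the hole becomes a second outer ring. The function names and the sample coordinates are hypothetical and are not part of Diggernaut; the sketch only shows the WKT structure, not how our routine is implemented.

```python
def ring_to_wkt(ring):
    """Format one closed ring as '(x y, x y, ...)'."""
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("a ring needs at least 4 points and must be closed")
    return "(" + ", ".join(f"{x} {y}" for x, y in ring) + ")"


def multipolygon_to_wkt(polygons):
    """Build a MULTIPOLYGON string.

    polygons: list of (outer_ring, [inner_ring, ...]) tuples.
    """
    parts = []
    for outer, inners in polygons:
        # Within one polygon, the outer ring comes first, then its holes.
        rings = [ring_to_wkt(outer)] + [ring_to_wkt(r) for r in inners]
        parts.append("(" + ", ".join(rings) + ")")
    return "MULTIPOLYGON (" + ", ".join(parts) + ")"


# Made-up sample data: a square with a square hole in the middle.
outer = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
hole = [(4, 4), (6, 4), (6, 6), (4, 6), (4, 4)]

# Correct: the hole is an inner ring of the same polygon.
with_hole = multipolygon_to_wkt([(outer, [hole])])

# Incorrect flattening: the hole is emitted as a separate outer ring,
# so the "hole" area is described as filled instead of empty.
flattened = multipolygon_to_wkt([(outer, []), (hole, [])])

print(with_hole)
print(flattened)
```

In the first string both rings sit inside one pair of polygon parentheses, so the second ring is read as a hole. In the second string each ring is its own polygon, which is the kind of result the third-party service produced.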
The second major improvement is the addition of address parsing functions for almost any country in the world. We connected the libpostal library as a microservice, and its functionality is now available to scrapers as part of the package for working with geospatial data. The library is written in C and uses statistical NLP to parse and normalize postal addresses, relying on models pre-trained on data from OpenStreetMap, OpenAddresses and other sources. We added two functions: address_parse, which splits an address into its elementary parts, and address_expand, which normalizes an address. Addresses on websites are very often presented as a single block of text, and splitting such a block into parts (street, city, zip, etc.) can be problematic. We therefore believe that this functionality, together with the geocoding command, can be extremely useful for your tasks. If you want to learn more about how libpostal works, we recommend reading the following article: Statistical NLP based on OpenStreetMap data, part two.
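For readers who want to experiment with libpostal outside of Diggernaut, the sketch below uses libpostal's Python bindings (the postal package), whose parse_address function returns a list of (value, label) pairs and whose expand_address function returns a list of normalized variants. It does not show the syntax of our address_parse and address_expand scraper commands. The helper components_to_dict and the sample data are hypothetical and written by hand, so the first part runs even where libpostal is not installed.

```python
def components_to_dict(components):
    """Turn libpostal-style (value, label) pairs into a label -> value dict."""
    result = {}
    for value, label in components:
        # A label may occur more than once; join repeated values with a space.
        result[label] = f"{result[label]} {value}" if label in result else value
    return result


# Hand-written sample in the (value, label) shape; not real libpostal output.
sample = [
    ("781", "house_number"),
    ("franklin ave", "road"),
    ("brooklyn", "city"),
    ("ny", "state"),
    ("11216", "postcode"),
]
parsed = components_to_dict(sample)
print(parsed)

try:
    # Requires the libpostal C library plus the `postal` Python package.
    from postal.parser import parse_address
    from postal.expand import expand_address
except ImportError:
    parse_address = expand_address = None

if parse_address is not None:
    raw = "781 Franklin Ave Brooklyn NY 11216"
    # Parsing: split the single text block into labeled parts.
    print(components_to_dict(parse_address(raw)))
    # Expansion: list normalized forms of the same address.
    print(expand_address(raw))
```

A dictionary keyed by label is convenient when the parts need to be stored in separate fields or passed on to a geocoder.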