Geospatial Data

Working with addresses

When we collect information from websites, the addresses of companies or other objects are often written in free form or stored in a single container. In such cases, it can take serious effort to split an address into its parts: house number, street, apartment, city, area, zip code, and country. To simplify this job, we have added support for libpostal, a well-known address parsing library. It is written in C and uses statistical NLP together with open data sets from OpenStreetMap (OSM) and OpenAddresses to normalize and parse addresses around the globe.

To parse a postal address, use the address_parse command, which supports the following parameter:

Parameter    Description
address      Postal address you need to parse.

Example of address parsing.

# SWITCHING TO THE GEO CONTEXT
- geo:
    do:
    # Parse address
    - address_parse:
        address: 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA
        do:

Time                     Level  Message
2018-07-11 21:05:25:806  info   Scrape is done
2018-07-11 21:05:25:792  debug  Page content: ...
2018-07-11 21:05:24:760  info   Retrieving page (POST/JSON): https://geo.diggernaut.net/parse
2018-07-11 21:05:24:752  debug  Address: 781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA
2018-07-11 21:05:24:739  debug  Parsing address
2018-07-11 21:05:24:728  info   Starting scrape
2018-07-11 21:05:24:690  debug  Setting up surf
2018-07-11 21:05:24:657  info   Starting digger: OSM test [2794]

<html>

<head></head>

<body>
    <body_safe>
        <body_safe>
            <components>
                <label>house_number</label>
                <value>781</value>
            </components>
            <components>
                <label>road</label>
                <value>franklin ave</value>
            </components>
            <components>
                <label>suburb</label>
                <value>crown heights</value>
            </components>
            <components>
                <label>city_district</label>
                <value>brooklyn</value>
            </components>
            <components>
                <label>city</label>
                <value>nyc</value>
            </components>
            <components>
                <label>state</label>
                <value>ny</value>
            </components>
            <components>
                <label>postcode</label>
                <value>11216</value>
            </components>
            <components>
                <label>country</label>
                <value>usa</value>
            </components>
            <status>success</status>
        </body_safe>
    </body_safe>
</body>

</html>
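
The result page above is a flat list of components elements, each holding one label and one value. As an illustration only, here is a minimal Python sketch of how that structure maps to a label-to-value dictionary. It runs outside Diggernaut, the function name components_to_dict is hypothetical, and the sample page is a trimmed copy of the output above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the result page shown above (only three components kept).
RESULT_PAGE = """
<html><head></head><body><body_safe><body_safe>
  <components><label>house_number</label><value>781</value></components>
  <components><label>road</label><value>franklin ave</value></components>
  <components><label>postcode</label><value>11216</value></components>
  <status>success</status>
</body_safe></body_safe></body></html>
"""


def components_to_dict(page: str) -> dict[str, str]:
    """Collect every <components> label/value pair into a dictionary.

    Hypothetical helper for illustration; not part of Diggernaut.
    """
    root = ET.fromstring(page.strip())
    # The page carries a <status> element; stop early if it is not "success".
    if root.findtext(".//status") != "success":
        raise ValueError("address parsing did not succeed")
    return {
        comp.findtext("label"): comp.findtext("value")
        for comp in root.iter("components")
    }


parsed = components_to_dict(RESULT_PAGE)
print(parsed["road"])
```

If a label appeared more than once in a result, this sketch would keep only the last value; collecting the values into a list per label would avoid that.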
            

Example of address normalization with the address_expand command, which takes the same address parameter.

# SWITCHING TO THE GEO CONTEXT
- geo:
    provider: osm
    do:
    # Normalize address
    - address_expand:
        address: One-hundred twenty E 96th St
        do:

Time                     Level  Message
2018-07-12 01:58:42:548  info   Scrape is done
2018-07-12 01:58:42:530  debug  Page content: ...
2018-07-12 01:58:41:317  info   Retrieving page (POST/JSON): https://geo.diggernaut.net/expand
2018-07-12 01:58:41:309  debug  Address: One-hundred twenty E 96th St
2018-07-12 01:58:41:301  debug  Normalizing address
2018-07-12 01:58:41:293  info   Starting scrape
2018-07-12 01:58:41:253  debug  Setting up surf
2018-07-12 01:58:41:221  info   Starting digger: OSM test [2794]

<html>

<head></head>

<body>
    <body_safe>
        <body_safe>
            <expansions>120 east 96th saint</expansions>
            <expansions>120 east 96th street</expansions>
            <expansions>120 e 96th saint</expansions>
            <expansions>120 e 96th street</expansions>
            <expansions>120 east 96 saint</expansions>
            <expansions>120 east 96 street</expansions>
            <expansions>120 e 96 saint</expansions>
            <expansions>120 e 96 street</expansions>
            <status>success</status>
        </body_safe>
    </body_safe>
</body>

</html>
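
Normalization returns several candidate spellings instead of a single answer, because an abbreviation such as "St" can stand for either "street" or "saint". One possible way to use such a list is to treat two address strings as the same place when their expansion lists share at least one entry. The Python sketch below illustrates that idea outside Diggernaut. The helper names read_expansions and same_address are hypothetical, and the second expansion list is made-up sample data, not real service output.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the result page shown above (four of the eight expansions).
RESULT_PAGE = """
<html><head></head><body><body_safe><body_safe>
  <expansions>120 east 96th saint</expansions>
  <expansions>120 east 96th street</expansions>
  <expansions>120 e 96th saint</expansions>
  <expansions>120 e 96th street</expansions>
  <status>success</status>
</body_safe></body_safe></body></html>
"""


def read_expansions(page: str) -> list[str]:
    """Return the text of every <expansions> element, in document order."""
    root = ET.fromstring(page.strip())
    return [node.text for node in root.iter("expansions")]


def same_address(expansions_a, expansions_b) -> bool:
    """Treat two addresses as matching if they share any expansion."""
    return not set(expansions_a).isdisjoint(expansions_b)


expansions = read_expansions(RESULT_PAGE)

# Hypothetical expansions for a second spelling of the address (assumed data).
other = ["120 east 96th street", "120 east 96 street"]
print(same_address(expansions, other))
```

Matching on any shared expansion is deliberately loose; a stricter rule, such as also comparing parsed components like the postcode, may suit data sets where false matches are costly.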
                

Next, we will learn about a number of complementary commands that can be useful in different situations.