Building a New Zealand Address Geocoder (2)

Building a Geocoding Service

In this post we create an address geocoder using an address parser and a database of addresses.
geocoding
postcode
LINZ
akka-http
Author

Chris Hansen

Published

April 2, 2023

Overview

In an earlier post we set out to build an address geocoder for a database of New Zealand addresses. We suggested a simple process as follows:

  • parse an address string into its constituent pieces
  • search an address database for a record with similar components

Building an address parser, the first step, was the focus of the first post. In this post we create a basic geocoding service that builds on this.

Source Code

Source code for the ‘complete’ service can be found on GitHub:

cmhh/linzgeocode

The service is written in Scala, and is provided as an sbt project. The service uses Akka HTTP to provide end-points, and deeplearning4j is used to evaluate a pre-trained address parser.

Creating a LINZ Address Database

The LINZ Data Service is an excellent resource, but it’s a pity that the more common feature classes can’t simply be downloaded via a direct link–a set of GeoPackages dumped in a public S3 bucket, for example. This introduces a needles manual step in what might otherise be automated processes. Alas. Either way, to get a local copy of the NZ Addresses addresses feature, one must:

  • create an account and sign in
  • visit the Addresses layer page and click the ‘add/remove item’ button
  • click the ‘Download or Order’ button
  • wait…
  • wait some more…
  • download using provided URL when ready.

This section will assume that users have downloaded addresses as a GeoPackage. It doesn’t really matter which spatial reference is used, though our geocoding service will use WGS84 exclusively, so we might as well choose that.

Tip

If you are, PLEASE stop using proprietary formats such as shapefile and geodatabase. Open formats such as GeoPackage are every bit as good, if not better (especially true for shapefile). Any decent GIS system, even proprietary ones, will happily import them, so absolutely nothing is lost.

We can load the addresses to a PostgreSQL database using Docker. Assume the addresses have been saved as data/linz/nz-addresses.gpkg, and the following files exist:

FROM ubuntu:22.04

ENV DEBIAN_FRONTEND=noninteractive
ENV SHELL=/bin/bash
ARG POSTGRES_VERSION=15

RUN apt-get update && apt-get -y dist-upgrade && \
  apt-get install -y --no-install-recommends \
    wget curl gnupg2 ca-certificates gdal-bin sudo vim lsb-release && \
  sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' && \
  wget -qO- https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo tee /etc/apt/trusted.gpg.d/postgresql.asc && \
  apt-get update && apt-get install -y --no-install-recommends \
    postgresql-${POSTGRES_VERSION} postgresql-${POSTGRES_VERSION}-postgis-3 postgis && \
  apt-get clean && \
  rm -rf /var/lib/apt/lists/*

RUN service postgresql start && \
  sudo -u postgres psql -c 'create database gis;' && \
  sudo -u postgres psql -d gis -c 'create extension postgis;' && \
  sudo -u postgres psql -d gis -c 'create extension postgis_raster;' && \
  sudo -u postgres psql -d gis -c 'create extension postgis_sfcgal;' && \
  sudo -u postgres psql -d gis -c 'create extension postgis_topology;' && \
  sudo -u postgres psql -d gis -c "SET postgis.gdal_enabled_drivers = 'ENABLE_ALL';" && \
  sudo -u postgres psql -c 'create user gisuser;' && \
  sudo -u postgres psql -c "alter user gisuser with encrypted password 'gisuser';" && \
  sudo -u postgres psql -c 'grant all privileges on database gis to gisuser;' && \
  printf "\tlisten_addresses='*'\t" >> /etc/postgresql/${POSTGRES_VERSION}/main/postgresql.conf && \
  sed -i -E '/local +all +all +peer/ s/peer/scram-sha-256/' /etc/postgresql/${POSTGRES_VERSION}/main/pg_hba.conf && \
  sed -i -E '/host +all +all +127.0.0.1\/32 +scram-sha-256/ s/127.0.0.1\/32/0.0.0.0\/0   /' /etc/postgresql/${POSTGRES_VERSION}/main/pg_hba.conf && \
  sed -i -E '/host +all +all +::1\/128 +scram-sha-256/ s/::1\/128/::0\/0  /' /etc/postgresql/${POSTGRES_VERSION}/main/pg_hba.conf &&\ 
  printf "localhost:5432:gis:gisuser:gisuser" >> /root/.pgpass && \
  chmod 0600 /root/.pgpass

EXPOSE 5432

CMD service postgresql start && \
  tail -f /dev/null  
version: "3.5"
services:
  db:
    image: "postgis"
    container_name: postgis
    build:
      context: ./
      dockerfile: ./Dockerfile
    volumes:
      - ./data:/data:ro
      - pgdata:/var/lib/postgresql
    ports:
      - 5433:5432
volumes:
  pgdata:
#! /usr/bin/env bash

PGPASSWORD=gisuser psql -U gisuser -d gis -c 'create schema linz;'

ogr2ogr \
  -f PostgreSQL PG:"dbname='gis' user='gisuser' password='gisuser'" \
  /data/linz/nz-addresses.gpkg \
  -nln linz.addresses

PGPASSWORD=gisuser psql -U gisuser -d gis \
  -c 'CREATE INDEX linz_address_idx ON linz.addresses USING GIST (geom);'

PGPASSWORD=gisuser psql -U gisuser -d gis \
  -c 'CREATE INDEX addr_suburb_idx ON linz.addresses (lower(suburb_locality));'

PGPASSWORD=gisuser psql -U gisuser -d gis \
  -c 'CREATE INDEX addr_rd_idx ON linz.addresses (lower(road_name));'

PGPASSWORD=gisuser psql -U gisuser -d gis \
  -c 'CREATE INDEX addr_rdtype2_idx ON linz.addresses (lower(road_type_name));'

PGPASSWORD=gisuser psql -U gisuser -d gis \
  -c 'CREATE INDEX addr_num_idx ON linz.addresses (address_number);'

PGPASSWORD=gisuser psql -U gisuser -d gis \
  -c 'CREATE INDEX addr_town2_idx ON linz.addresses (lower(town_city));'

We start our database container by running:

docker compose up -d

At the moment, the database is empty. We gain terminal acccess by running:

docker exec -it postgis /bin/bash

We then load the data by running:

source /data/load.sh

At this point, a feature class called linz.adresses will have been created. Any data we load will persist, so we need to do this only once. If we have access to a postcode feature class, we can also create an address view containing postcode easily enough. Assuming postcodes have also been loaded to nzpost.postcode, with columns code and geom holding the postcodes and geometries, respectively, we can create a materialized view by running the following query:

create materialized view 
  linz.addresses_v 
as select 
  address_id,
  unit_type, 
  unit_value, level_type, level_value, 
  address_number, address_number_suffix, address_number_high, 
  road_name, road_type_name, road_suffix, p.code as postcode, 
  case 
    when town_city is null then s.major_name 
    else town_city 
  end as town_city,
  st_x(st_transform(a.geom, 4326)) as lng,
  st_y(st_transform(a.geom, 4326)) as lat,
  a.geom
from 
  linz.addresses a 
left join 
  nzpost.postcode p 
on 
  st_intersects(a.geom, p.geom)

Searching Our Address Database

Consider the address:

1/45A Memorial Avenue, Ilam, Christchurch 8053

Using the parser we created earlier, we obtain the following:

{
  "unit_type":null,
  "unit_value":"1",
  "level_type":null,
  "level_value":null,
  "address_number":45,
  "address_number_suffix":"45",
  "address_number_high":null,
  "road_name":"Memorial",
  "road_type_name":"Avenue",
  "road_suffix":null,
  "suburb_locality":"Islam",
  "town_city":"Christchurch",
  "postcode":"8053"
}

We can use this to search our address database easily enough:

select 
  *
from 
  linz.addresses_v
where 
  unit_value = '1' 
  and address_number = 45 and address_number_suffix = 'A' 
  and road_name = 'Memorial' and road_type_name = 'Avenue'
  and suburb_locality = 'Ilam' 
  and town_city = 'Christchurch' 
  and postcode = '8053'
Name                 |Value       |
---------------------+------------+
address_id           |1964541     |
unit_type            |            |
unit_value           |1           |
level_type           |            |
level_value          |            |
address_number       |45          |
address_number_suffix|A           |
address_number_high  |            |
road_name            |Memorial    |
road_type_name       |Avenue      |
road_suffix          |            |
suburb_locality      |Ilam        |
postcode             |8053        |
town_city            |Christchurch|
lng                  |172.58617673|
lat                  |-43.51674017|

In practice, we probably want our search to be a touch more forgiving. This sort of query is exact, and even a single minor typo will yield an empty result. For example, had we searched for a road_type_name of Ave instead of Avenue.

We could appeal to a tool such as elaticsearch or solr to provide the flexibility we desire, but let’s keep things simple for now. Let’s instead create functionality to ‘correct’ a set of address components by doing a reverse lookup of the components, replacing each if required with the best match actually found in the address database. For example, let’s say the road name and road type were provided as Memoral and Ave, respectively:

Roads.find(Some("Memoral"), Some("Ave"), None)
Vector[(Option[String], Option[String], Option[String])] = Vector((Some(memorial),Some(avenue),None))

Similarly, say suburb was provided as Ilam, then

Suburbs.find("ilm")
Vector[Option[String]] = Vector(Some(ilam))

And, similarly, say town was provided as Chch, then

Towns.find("Chch")
Vector[Option[String]] = Vector(Some(christchurch))

Currently, these methods all rely on a naive appeal to the Jaro-Winkler distance so results might not always be the best. Regardless, this brings us to this:

val dirty = AddressComponents(
  None, Some("1"), 
  None, None, 
  Some(45), Some("A"), None, 
  Some("Memoral"), Some("Ave"), None, 
  Some("Ilm"), Some("Chch"), Some("8053"), 
)

val clean = dirty.repair
clean.toJson  
{
  "unit_type": null,
  "unit_value": "1",
  "level_type": null,
  "level_value": null,
  "address_number": 45,
  "address_number_suffix": "A",
  "address_number_high": null,
  "road_name": "memorial",
  "road_type_name": "avenue",
  "road_suffix": null,
  "suburb_locality": "ilam",
  "town_city": "christchurch",
  "postcode": "8053"
}

To actually search our database (it goes without saying that you should not await a future in practice, of course 😊):

val res = Await.result(DAO1.search(clean, 5), Duration.Inf)
(1/45A Memorial Avenue, Ilam, Christchurch 8053,1.0)
(45B Memorial Avenue, Ilam, Christchurch 8053,0.9)
(1/45 Memorial Avenue, Ilam, Christchurch 8053,0.9)
(2/45A Memorial Avenue, Ilam, Christchurch 8053,0.9)

The query strategy is essentially to find records matching every non-null field, except we also fetch records that match on all fields except level number and address number suffix. We define a relevance score and then select the top \(n\) best matching records in more than \(n\) are returned. The slick library is used, and in this case the underlying SQL query is:

DAO1.searchQuery(clean, 5).result.statements.mkString
(
  select 
    "address_id" as x2, 
    "unit_type" as x3, "unit_value" as x4, 
    "level_type" as x5, "level_value" as x6, 
    "address_number" as x7, "address_number_suffix" as x8, 
    "address_number_high" as x9, 
    "road_name" as x10, "road_type_name" as x11, 
    "road_suffix" as x12, 
    "suburb_locality" as x13, 
    "town_city" as x14, "postcode" as x15, 
    "gd2000_xcoord" as x16, "gd2000_ycoord" as x17 
  from 
    "linz"."addresses" 
  where 
    (
      (
        (
          ("postcode" = '8053')
          and (
            (lower("town_city") = 'christchurch') 
            and (lower("suburb_locality") = 'ilam')
          )
        ) 
        and (lower("road_name") = 'memorial')
      ) 
      and (lower("road_type_name") = 'avenue')
    )     
    and ("address_number" = 45) 
  limit 5
)
union all 
(
  select 
    "address_id" as x2, 
    "unit_type" as x3, "unit_value" as x4, 
    "level_type" as x5, "level_value" as x6, 
    "address_number" as x7, "address_number_suffix" as x8, 
    "address_number_high" as x9, 
    "road_name" as x10, "road_type_name" as x11, 
    "road_suffix" as x12, 
    "suburb_locality" as x13, 
    "town_city" as x14, "postcode" as x15, 
    "gd2000_xcoord" as x16, "gd2000_ycoord" as x17 
  from 
    "linz"."addresses" 
  where 
    (
      (
        (
          (
            (
              ("postcode" = '8053')
              and (
                (lower("town_city") = 'christchurch') 
                and (lower("suburb_locality") = 'ilam')
              )
            ) 
            and (lower("road_name") = 'memorial')
          ) 
          and (lower("road_type_name") = 'avenue')
        ) 
        and ("address_number" = 45)
      ) 
      and (lower("address_number_suffix") = 'a')
    ) 
    and (lower("unit_value") = '1') 
  limit 5
)

It’s a bit ugly, but we trust PostgreSQL to optimise it for us as required.

Putting it all Together

Our final service consists of a single end-point /linzgeocode/geocode. Under the hood we just conduct the search as outined above, and we return the result as a JSON array. In this case:

curl -s \
  "http://localhost:9001/linzgeocode/geocode?text=1/45A%20Memorial%20Avenue,%20Ilam,%20Christchurch%208053" | jq
[
  {
    "linz_id": 1964541,
    "relevance": 1,
    "formatted_address": "1/45A Memorial Avenue, Ilam, Christchurch 8053",
    "coordinates": {
      "longitude": 172.58617673,
      "latitude": -43.51674017
    }
  },
  {
    "linz_id": 190617,
    "relevance": 0.9,
    "formatted_address": "45B Memorial Avenue, Ilam, Christchurch 8053",
    "coordinates": {
      "longitude": 172.58572588,
      "latitude": -43.51685562
    }
  },
  {
    "linz_id": 190619,
    "relevance": 0.9,
    "formatted_address": "1/45 Memorial Avenue, Ilam, Christchurch 8053",
    "coordinates": {
      "longitude": 172.586439,
      "latitude": -43.516492
    }
  },
  {
    "linz_id": 1964540,
    "relevance": 0.9,
    "formatted_address": "2/45A Memorial Avenue, Ilam, Christchurch 8053",
    "coordinates": {
      "longitude": 172.58600722,
      "latitude": -43.51682523
    }
  },
  {
    "linz_id": 1979184,
    "relevance": 0.9,
    "formatted_address": "2/45 Memorial Avenue, Ilam, Christchurch 8053",
    "coordinates": {
      "longitude": 172.586325,
      "latitude": -43.516594
    }
  },
  {
    "linz_id": 3004259,
    "relevance": 0.9,
    "formatted_address": "45E Memorial Avenue, Ilam, Christchurch 8053",
    "coordinates": {
      "longitude": 172.58662582,
      "latitude": -43.51647956
    }
  }
]

Note that it is up to the user here to encode the query correctly. Many places in New Zealand have Māori place names, so vowels with macrons are relatively common. In this case, it might be more convenient to run the same query using POST instead:

curl -s -X POST \
  -d '{"text": "1/45A Memorial Avenue, Ilam, Christchurch 8053"}' \
  http://localhost:9001/linzgeocode/geocode | jq