Overview
In an earlier post we set out to build an address geocoder for a database of New Zealand addresses. We suggested a simple process as follows:
- parse an address string into its constituent pieces
- search an address database for a record with similar components
Building an address parser, the first step, was the focus of the first post. In this post we create a basic geocoding service that builds on this.
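In Scala terms, the shape of that two-step process might be sketched as follows (all names here are placeholders, not the actual service’s API):

```scala
// Hypothetical interface for the two-step geocoding process:
// parse an address string, then search the database for matches.
case class GeocodeResult(
    formattedAddress: String, // canonical form of the matched record
    relevance: Double,        // 1.0 for an exact match
    longitude: Double,
    latitude: Double)

trait Geocoder[C] {
  def parse(text: String): C                    // step 1: split into components
  def search(components: C): Seq[GeocodeResult] // step 2: query the database
  def geocode(text: String): Seq[GeocodeResult] = search(parse(text))
}
```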
Source Code
Source code for the ‘complete’ service can be found on GitHub:
The service is written in Scala, and is provided as an sbt project. The service uses Akka HTTP to provide end-points, and deeplearning4j is used to evaluate a pre-trained address parser.
Creating a LINZ Address Database
The LINZ Data Service is an excellent resource, but it’s a pity that the more common feature classes can’t simply be downloaded via a direct link (a set of GeoPackages dumped in a public S3 bucket, for example). This introduces a needless manual step in what might otherwise be automated processes. Alas. Either way, to get a local copy of the NZ Addresses feature, one must:
- create an account and sign in
- visit the Addresses layer page and click the ‘add/remove item’ button
- click the ‘Download or Order’ button
- wait…
- wait some more…
- download using provided URL when ready.
This section will assume that users have downloaded addresses as a GeoPackage. It doesn’t really matter which spatial reference is used, though our geocoding service will use WGS84 exclusively, so we might as well choose that.
If you are still using proprietary formats such as shapefile and geodatabase, PLEASE stop. Open formats such as GeoPackage are every bit as good, if not better (especially compared with shapefile). Any decent GIS system, even a proprietary one, will happily import them, so absolutely nothing is lost.
We can load the addresses into a PostgreSQL database using Docker. Assume the addresses have been saved as data/linz/nz-addresses.gpkg, and that the following files exist:
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV SHELL=/bin/bash
ARG POSTGRES_VERSION=15
RUN apt-get update && apt-get -y dist-upgrade && \
apt-get install -y --no-install-recommends \
wget curl gnupg2 ca-certificates gdal-bin sudo vim lsb-release && \
sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' && \
wget -qO- https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo tee /etc/apt/trusted.gpg.d/postgresql.asc && \
apt-get update && apt-get install -y --no-install-recommends \
postgresql-${POSTGRES_VERSION} postgresql-${POSTGRES_VERSION}-postgis-3 postgis && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN service postgresql start && \
sudo -u postgres psql -c 'create database gis;' && \
sudo -u postgres psql -d gis -c 'create extension postgis;' && \
sudo -u postgres psql -d gis -c 'create extension postgis_raster;' && \
sudo -u postgres psql -d gis -c 'create extension postgis_sfcgal;' && \
sudo -u postgres psql -d gis -c 'create extension postgis_topology;' && \
sudo -u postgres psql -d gis -c "SET postgis.gdal_enabled_drivers = 'ENABLE_ALL';" && \
sudo -u postgres psql -c 'create user gisuser;' && \
sudo -u postgres psql -c "alter user gisuser with encrypted password 'gisuser';" && \
sudo -u postgres psql -c 'grant all privileges on database gis to gisuser;' && \
printf "\nlisten_addresses='*'\n" >> /etc/postgresql/${POSTGRES_VERSION}/main/postgresql.conf && \
sed -i -E '/local +all +all +peer/ s/peer/scram-sha-256/' /etc/postgresql/${POSTGRES_VERSION}/main/pg_hba.conf && \
sed -i -E '/host +all +all +127.0.0.1\/32 +scram-sha-256/ s/127.0.0.1\/32/0.0.0.0\/0 /' /etc/postgresql/${POSTGRES_VERSION}/main/pg_hba.conf && \
sed -i -E '/host +all +all +::1\/128 +scram-sha-256/ s/::1\/128/::0\/0 /' /etc/postgresql/${POSTGRES_VERSION}/main/pg_hba.conf && \
printf "localhost:5432:gis:gisuser:gisuser\n" >> /root/.pgpass && \
chmod 0600 /root/.pgpass
EXPOSE 5432
CMD service postgresql start && \
tail -f /dev/null
version: "3.5"
services:
  db:
    image: "postgis"
    container_name: postgis
    build:
      context: ./
      dockerfile: ./Dockerfile
    volumes:
      - ./data:/data:ro
      - pgdata:/var/lib/postgresql
    ports:
      - 5433:5432
volumes:
  pgdata:
#! /usr/bin/env bash
PGPASSWORD=gisuser psql -U gisuser -d gis -c 'create schema linz;'
ogr2ogr \
-f PostgreSQL PG:"dbname='gis' user='gisuser' password='gisuser'" \
/data/linz/nz-addresses.gpkg -nln linz.addresses
PGPASSWORD=gisuser psql -U gisuser -d gis \
-c 'CREATE INDEX linz_address_idx ON linz.addresses USING GIST (geom);'
PGPASSWORD=gisuser psql -U gisuser -d gis \
-c 'CREATE INDEX addr_suburb_idx ON linz.addresses (lower(suburb_locality));'
PGPASSWORD=gisuser psql -U gisuser -d gis \
-c 'CREATE INDEX addr_rd_idx ON linz.addresses (lower(road_name));'
PGPASSWORD=gisuser psql -U gisuser -d gis \
-c 'CREATE INDEX addr_rdtype2_idx ON linz.addresses (lower(road_type_name));'
PGPASSWORD=gisuser psql -U gisuser -d gis \
-c 'CREATE INDEX addr_num_idx ON linz.addresses (address_number);'
PGPASSWORD=gisuser psql -U gisuser -d gis \
-c 'CREATE INDEX addr_town2_idx ON linz.addresses (lower(town_city));'
We start our database container by running:
docker compose up -d
At the moment, the database is empty. We gain terminal access by running:
docker exec -it postgis /bin/bash
We then load the data by running:
source /data/load.sh
At this point, a feature class called linz.addresses will have been created. Any data we load will persist, so we need to do this only once. If we have access to a postcode feature class, we can also create an address view containing postcode easily enough. Assuming postcodes have also been loaded to nzpost.postcode, with columns code and geom holding the postcodes and geometries, respectively, we can create a materialized view by running the following query:
create materialized view linz.addresses_v as
select
  address_id,
  unit_type, unit_value,
  level_type, level_value,
  address_number, address_number_suffix, address_number_high,
  road_name, road_type_name, road_suffix,
  suburb_locality,
  p.code as postcode,
  case
    when town_city is null then s.major_name
    else town_city
  end as town_city,
  st_x(st_transform(a.geom, 4326)) as lng,
  st_y(st_transform(a.geom, 4326)) as lat,
  a.geom
from
  linz.addresses a left join
  nzpost.postcode p on
  st_intersects(a.geom, p.geom)
Searching Our Address Database
Consider the address:
1/45A Memorial Avenue, Ilam, Christchurch 8053
Using the parser we created earlier, we obtain the following:
{
"unit_type":null,
"unit_value":"1",
"level_type":null,
"level_value":null,
"address_number":45,
"address_number_suffix":"A",
"address_number_high":null,
"road_name":"Memorial",
"road_type_name":"Avenue",
"road_suffix":null,
"suburb_locality":"Ilam",
"town_city":"Christchurch",
"postcode":"8053"
}
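For reference, the component record implied by this JSON can be modelled as a simple case class; the field order below follows the constructor calls used later in this post, though the real definition may differ:

```scala
// Sketch of the parsed-address record; Option encodes the nullable fields.
case class AddressComponents(
    unitType: Option[String], unitValue: Option[String],
    levelType: Option[String], levelValue: Option[String],
    addressNumber: Option[Int], addressNumberSuffix: Option[String],
    addressNumberHigh: Option[Int],
    roadName: Option[String], roadTypeName: Option[String],
    roadSuffix: Option[String],
    suburbLocality: Option[String], townCity: Option[String],
    postcode: Option[String])
```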
We can use this to search our address database easily enough:
select
  *
from
  linz.addresses_v
where
  unit_value = '1'
  and address_number = 45
  and address_number_suffix = 'A'
  and road_name = 'Memorial'
  and road_type_name = 'Avenue'
  and suburb_locality = 'Ilam'
  and town_city = 'Christchurch'
  and postcode = '8053'
Name |Value |
---------------------+------------+
address_id |1964541 |
unit_type | |
unit_value |1 |
level_type | |
level_value | |
address_number |45 |
address_number_suffix|A |
address_number_high | |
road_name |Memorial |
road_type_name |Avenue |
road_suffix | |
suburb_locality |Ilam |
postcode |8053 |
town_city |Christchurch|
lng                  |172.58617673|
lat                  |-43.51674017|
In practice, we probably want our search to be a touch more forgiving. This sort of query is exact, and even a single minor typo will yield an empty result. For example, had we searched for a road_type_name of Ave instead of Avenue, we would have received no rows at all.
We could appeal to a tool such as Elasticsearch or Solr to provide the flexibility we desire, but let’s keep things simple for now. Let’s instead create functionality to ‘correct’ a set of address components by doing a reverse lookup of the components, replacing each, if required, with the best match actually found in the address database. For example, let’s say the road name and road type were provided as Memoral and Ave, respectively:
Roads.find(Some("Memoral"), Some("Ave"), None)
Vector[(Option[String], Option[String], Option[String])] = Vector((Some(memorial),Some(avenue),None))
Similarly, say the suburb was provided as Ilm, then
Suburbs.find("ilm")
Vector[Option[String]] = Vector(Some(ilam))
And, similarly, say the town was provided as Chch, then
Towns.find("Chch")
Vector[Option[String]] = Vector(Some(christchurch))
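The finder objects above are not reproduced here, but the string-similarity measure they lean on is easy to sketch. Below is a minimal Jaro-Winkler implementation written from the standard definition (it is illustrative, not the service’s actual code; the `Similarity` name is made up):

```scala
object Similarity {

  /** Jaro similarity between two strings (0.0 = nothing alike, 1.0 = equal). */
  def jaro(s1: String, s2: String): Double = {
    if (s1.isEmpty && s2.isEmpty) return 1.0
    if (s1.isEmpty || s2.isEmpty) return 0.0
    // characters "match" if equal and within half the longer length of each other
    val window = math.max(0, math.max(s1.length, s2.length) / 2 - 1)
    val matched1 = Array.fill(s1.length)(false)
    val matched2 = Array.fill(s2.length)(false)
    var matches = 0
    for (i <- s1.indices) {
      val lo = math.max(0, i - window)
      val hi = math.min(s2.length - 1, i + window)
      var j = lo
      while (j <= hi && !matched1(i)) {
        if (!matched2(j) && s1(i) == s2(j)) {
          matched1(i) = true
          matched2(j) = true
          matches += 1
        }
        j += 1
      }
    }
    if (matches == 0) return 0.0
    // transpositions: matched characters appearing in a different order
    val m1 = s1.indices.filter(i => matched1(i)).map(s1.charAt)
    val m2 = s2.indices.filter(i => matched2(i)).map(s2.charAt)
    val t = m1.zip(m2).count { case (a, b) => a != b } / 2.0
    (matches.toDouble / s1.length +
      matches.toDouble / s2.length +
      (matches - t) / matches) / 3.0
  }

  /** Jaro-Winkler: rewards a shared prefix of up to four characters. */
  def jaroWinkler(s1: String, s2: String, scaling: Double = 0.1): Double = {
    val j = jaro(s1, s2)
    val prefix = s1.zip(s2).take(4).takeWhile { case (a, b) => a == b }.length
    j + prefix * scaling * (1.0 - j)
  }
}
```

A reverse lookup then reduces to something like `candidates.maxBy(c => Similarity.jaroWinkler(input, c))`; for the input "ave" against road types such as avenue, street, road and crescent, the shared prefix makes avenue the clear winner.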
Currently, these methods all rely on a naive appeal to the Jaro-Winkler distance, so results might not always be the best. Regardless, this brings us to the following:
val dirty = AddressComponents(
None, Some("1"),
None, None,
Some(45), Some("A"), None,
Some("Memoral"), Some("Ave"), None,
Some("Ilm"), Some("Chch"), Some("8053"),
)
val clean = dirty.repair
clean.toJson
{
"unit_type": null,
"unit_value": "1",
"level_type": null,
"level_value": null,
"address_number": 45,
"address_number_suffix": "A",
"address_number_high": null,
"road_name": "memorial",
"road_type_name": "avenue",
"road_suffix": null,
"suburb_locality": "ilam",
"town_city": "christchurch",
"postcode": "8053"
}
To actually search our database (it goes without saying that you should not await a future in practice, of course 😊):
val res = Await.result(DAO1.search(clean, 5), Duration.Inf)
(1/45A Memorial Avenue, Ilam, Christchurch 8053,1.0)
(45B Memorial Avenue, Ilam, Christchurch 8053,0.9)
(1/45 Memorial Avenue, Ilam, Christchurch 8053,0.9)
(2/45A Memorial Avenue, Ilam, Christchurch 8053,0.9)
The query strategy is essentially to find records matching every non-null field, except we also fetch records that match on all fields except unit value and address number suffix. We define a relevance score and then select the top \(n\) best matching records if more than \(n\) are returned. The Slick library is used, and in this case the underlying SQL query is:
DAO1.searchQuery(clean, 5).result.statements.mkString
(select
"address_id" as x2,
"unit_type" as x3, "unit_value" as x4,
"level_type" as x5, "level_value" as x6,
"address_number" as x7, "address_number_suffix" as x8,
"address_number_high" as x9,
"road_name" as x10, "road_type_name" as x11,
"road_suffix" as x12,
"suburb_locality" as x13,
"town_city" as x14, "postcode" as x15,
"gd2000_xcoord" as x16, "gd2000_ycoord" as x17
from
"linz"."addresses"
where
  (
    (
      (
        (
          ("postcode" = '8053')
          and (lower("town_city") = 'christchurch')
        ) and (lower("suburb_locality") = 'ilam')
      ) and (lower("road_name") = 'memorial')
    ) and (lower("road_type_name") = 'avenue')
  ) and ("address_number" = 45)
limit 5
)
union all
(select
"address_id" as x2,
"unit_type" as x3, "unit_value" as x4,
"level_type" as x5, "level_value" as x6,
"address_number" as x7, "address_number_suffix" as x8,
"address_number_high" as x9,
"road_name" as x10, "road_type_name" as x11,
"road_suffix" as x12,
"suburb_locality" as x13,
"town_city" as x14, "postcode" as x15,
"gd2000_xcoord" as x16, "gd2000_ycoord" as x17
from
"linz"."addresses"
where
  (
    (
      (
        (
          (
            (
              ("postcode" = '8053')
              and (lower("town_city") = 'christchurch')
            ) and (lower("suburb_locality") = 'ilam')
          ) and (lower("road_name") = 'memorial')
        ) and (lower("road_type_name") = 'avenue')
      ) and ("address_number" = 45)
    ) and (lower("address_number_suffix") = 'a')
  ) and (lower("unit_value") = '1')
limit 5
)
It’s a bit ugly, but we trust PostgreSQL to optimise it for us as required.
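The relevance values seen earlier fall directly out of this two-branch strategy. A hypothetical scoring function consistent with those results (the `Relevance` name and signature are illustrative, not the service’s actual code) might be:

```scala
object Relevance {
  // Hypothetical scoring consistent with the results shown in this post:
  // a record found by the strict branch (unit value and address number
  // suffix both matched) scores 1.0; a record found only by the relaxed
  // branch scores 0.9.
  def score(
      queryUnit: Option[String], querySuffix: Option[String],
      recordUnit: Option[String], recordSuffix: Option[String]): Double = {
    val strict =
      queryUnit.map(_.toLowerCase) == recordUnit.map(_.toLowerCase) &&
        querySuffix.map(_.toLowerCase) == recordSuffix.map(_.toLowerCase)
    if (strict) 1.0 else 0.9
  }
}
```

This explains why 45B and 2/45A both score 0.9 above: each was found only by the relaxed branch.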
Putting it all Together
Our final service consists of a single end-point, /linzgeocode/geocode. Under the hood we just conduct the search as outlined above, and we return the result as a JSON array. In this case:
curl -s \
"http://localhost:9001/linzgeocode/geocode?text=1/45A%20Memorial%20Avenue,%20Ilam,%20Christchurch%208053" | jq
[
{
"linz_id": 1964541,
"relevance": 1,
"formatted_address": "1/45A Memorial Avenue, Ilam, Christchurch 8053",
"coordinates": {
"longitude": 172.58617673,
"latitude": -43.51674017
}
},
{
"linz_id": 190617,
"relevance": 0.9,
"formatted_address": "45B Memorial Avenue, Ilam, Christchurch 8053",
"coordinates": {
"longitude": 172.58572588,
"latitude": -43.51685562
}
},
{
"linz_id": 190619,
"relevance": 0.9,
"formatted_address": "1/45 Memorial Avenue, Ilam, Christchurch 8053",
"coordinates": {
"longitude": 172.586439,
"latitude": -43.516492
}
},
{
"linz_id": 1964540,
"relevance": 0.9,
"formatted_address": "2/45A Memorial Avenue, Ilam, Christchurch 8053",
"coordinates": {
"longitude": 172.58600722,
"latitude": -43.51682523
}
},
{
"linz_id": 1979184,
"relevance": 0.9,
"formatted_address": "2/45 Memorial Avenue, Ilam, Christchurch 8053",
"coordinates": {
"longitude": 172.586325,
"latitude": -43.516594
}
},
{
"linz_id": 3004259,
"relevance": 0.9,
"formatted_address": "45E Memorial Avenue, Ilam, Christchurch 8053",
"coordinates": {
"longitude": 172.58662582,
"latitude": -43.51647956
}
}
]
Note that it is up to the user here to encode the query correctly. Many places in New Zealand have Māori place names, so vowels with macrons are relatively common. In such cases, it might be more convenient to run the same query using POST instead:
curl -s -X POST \
-d '{"text": "1/45A Memorial Avenue, Ilam, Christchurch 8053"}' \
http://localhost:9001/linzgeocode/geocode | jq
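On the encoding point: from Scala, the standard library’s java.net.URLEncoder handles macrons (and everything else) for you when building a GET query. A small sketch, with a purely illustrative address:

```scala
import java.net.URLEncoder

object EncodeExample {
  // Percent-encode a free-text address for use in a GET query string.
  // Note URLEncoder produces application/x-www-form-urlencoded output,
  // so spaces become '+' rather than '%20'.
  val text    = "1 Kākāriki Street, Ōtaki"
  val encoded = URLEncoder.encode(text, "UTF-8")
  // macron vowels become percent-encoded UTF-8 bytes, e.g. ā -> %C4%81
}
```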