Rough Coarse Geocoding

data
Published

August 26, 2020

A coarse geocoder takes a human description of a large area, like a city, region or country, and returns the details of that location. I’ve been looking into the source of the excellent Placeholder (a component of the Pelias geocoder) to understand how this works. The overall approach is straightforward, but it takes a lot of work to make it reliable.

A key component of a geocoder is a gazetteer that contains the names of locations. Placeholder uses Who’s on First, a large open dataset that captures locations as GeoJSON based on how people describe them (including names in many languages). The locations it returns are Who’s on First entities. Placeholder stores these as tables in a SQLite database, which can be used to refine locations in Placeholder.

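The overall approach of Placeholder is to normalise the input text, split it into place-name tokens, look those tokens up in the gazetteer, and sort the matches.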

For example, consider the text “Saint Albans, Australia”. This gets normalised to “st albans australia”, which is then tokenised to “st albans” and “australia”. Next “australia” is matched to the country. Then “st albans” is searched for in Australia, which finds a few results in Victoria and New South Wales. It also does an R-tree search for “st albans” in locations within 2 degrees of Australia and finds another location in New Zealand. These are then returned ordered by Who’s on First id.

Normalisation

Normalisation is the process of taking the input text and putting it in a format that makes it easy to match with Who’s on First data. This includes adding synonyms to expand the search. This is really important in making the geocoder work in practice.

In Placeholder most of this work is done by the function normalize in /lib/analysis.js. This function does a lot; I’ll just show a few transformations to give an idea of what it is doing.

All the separating punctuation is stripped away.

  // remove certain punctuation
  input = input.replace(/[\.]+/g,'');

  // replace certain punctuation with spaces
  input = input.replace(/[",]+/g,' ');

I’m guessing Who’s on First tends to use contracted forms, because Placeholder replaces e.g. “saint” with “st”.

  // generic synonym contractions
  input = input.replace(/\b(sainte)\b/gi, 'ste')
               .replace(/\b(saint)\b/gi, 'st')
               .replace(/\b(mount)\b/gi, 'mt')
               .replace(/\b(fort)\b/gi, 'ft');
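To see the effect of these rules on a real query, here is a minimal sketch that chains the punctuation and contraction snippets together. The function name `contractSketch` and the final whitespace-squeezing step are my own assumptions for illustration, not part of Placeholder.

```javascript
// Illustrative only: chains the punctuation and contraction rules quoted
// from Placeholder. The name contractSketch is hypothetical.
function contractSketch(input) {
  return input
    .replace(/[\.]+/g, '')              // remove full stops
    .replace(/[",]+/g, ' ')             // quotes and commas become spaces
    .replace(/\b(sainte)\b/gi, 'ste')   // generic synonym contractions
    .replace(/\b(saint)\b/gi, 'st')
    .replace(/\b(mount)\b/gi, 'mt')
    .replace(/\b(fort)\b/gi, 'ft')
    .replace(/\s+/g, ' ')               // assumption: squeeze repeated spaces
    .trim();
}

console.log(contractSketch('Saint Albans, Australia')); // st Albans Australia
console.log(contractSketch('Mount Isa, Qld.'));         // mt Isa Qld
```

Note the capitals survive at this point; lowercasing happens in a separate step.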

The synonyms are actually a list because there can be multiple ways to describe a place. For example if we have “city of sydney” it will try both “city of sydney” and “sydney”.

  // synonymous representations of official designations
  if (input.match(/county|city|township/i) ){
    synonyms = synonyms.concat(
      synonyms.map(synonym => {
        return synonym
          .replace(/^county\s(of\s)?(.*)$/gi, '$2')
          .replace(/^(.*)\scounty$/gi, '$1')
          .replace(/^city\sof(?!\s?the)\s?(.*)$/gi, '$1')
          .replace(/^(.*\s)charter\s(township)$/gi, '$1$2');
      })
    );
  }
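To check how this behaves, the snippet can be wrapped in a small function and run on a few inputs. This is only a sketch: the wrapper name `designationSynonyms`, the single-element starting list, and the de-duplication at the end are assumptions I’ve added, not something taken from Placeholder.

```javascript
// Hypothetical wrapper around the quoted snippet, for experimentation only.
function designationSynonyms(input) {
  let synonyms = [input];
  if (input.match(/county|city|township/i)) {
    synonyms = synonyms.concat(
      synonyms.map(synonym =>
        synonym
          .replace(/^county\s(of\s)?(.*)$/gi, '$2')
          .replace(/^(.*)\scounty$/gi, '$1')
          .replace(/^city\sof(?!\s?the)\s?(.*)$/gi, '$1')
          .replace(/^(.*\s)charter\s(township)$/gi, '$1$2')
      )
    );
  }
  // Assumption: drop the duplicate left behind when no pattern applies.
  return [...new Set(synonyms)];
}

console.log(designationSynonyms('city of sydney')); // [ 'city of sydney', 'sydney' ]
console.log(designationSynonyms('orange county'));  // [ 'orange county', 'orange' ]
console.log(designationSynonyms('kansas city'));    // [ 'kansas city' ]
```

The last case shows that a name merely containing “city” is left alone, because none of the anchored patterns match it.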

Finally, all text is converted to lowercase and Unicode accents are removed.
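I haven’t reproduced Placeholder’s code for this step, but one common way to do lowercasing and accent removal in JavaScript is to decompose the string with Unicode NFD normalisation and strip the combining marks. The sketch below assumes that technique; the name `foldSketch` is hypothetical, and Placeholder may well do it differently (for example with a library).

```javascript
// Assumption: NFD decomposition plus stripping combining diacritical marks
// (U+0300 to U+036F). This is a generic technique, not Placeholder's code.
function foldSketch(input) {
  return input
    .toLowerCase()
    .normalize('NFD')                 // split "é" into "e" + combining accent
    .replace(/[\u0300-\u036f]/g, ''); // drop the combining accents
}

console.log(foldSketch('Montréal'));  // montreal
console.log(foldSketch('São Paulo')); // sao paulo
```

Letters that have no decomposed form, such as “ø”, pass through unchanged with this approach, so a real implementation would need extra rules for them.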

These kinds of transformations are really important for real world performance, but require a lot of experience to get right. If you wanted to write your own geocoder based on Who’s on First, I’d seriously consider using Placeholder’s tests for analysis.js.

Tokenize

Tokenisation is a little difficult because place names can contain multiple words. The approach in Placeholder, in prototypes/tokenize.js, is relatively simple. First break the query into words and start at the leftmost word. Take the span from that word to the last word; if the span is in the gazetteer, use it as the token. Otherwise remove the last word from the span and repeat until you find a token or get down to a single word. Then continue to tokenize the rest of the text in the same way.

For example, consider “Port of Spain Trinidad and Tobago”. The whole string isn’t in the gazetteer, nor is “Port of Spain Trinidad and” or “Port of Spain Trinidad”, but “Port of Spain” is, so that’s our first token. The remaining text, “Trinidad and Tobago”, is in our gazetteer and so is a token. So we get two tokens: “Port of Spain” and “Trinidad and Tobago”.

As another example “Melbourne CBD Australia” tokenizes to “Melbourne CBD” and “Australia”, since “Melbourne CBD” is in Who’s on First. But “Sydney CBD Australia” (currently) tokenizes to “Sydney”, “CBD” and “Australia” since “Sydney CBD” is not in Who’s on First.
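The greedy longest-match strategy described above can be sketched in a few lines. This is my own illustration rather than the code in prototypes/tokenize.js: the function name `tokenizeSketch` is hypothetical, and a small `Set` of names stands in for the real Who’s on First lookup.

```javascript
// Toy gazetteer standing in for the Who's on First lookup (assumption).
const toyGazetteer = new Set([
  'port of spain', 'trinidad and tobago', 'spain', 'trinidad', 'tobago',
  'melbourne cbd', 'melbourne', 'sydney', 'australia',
]);

// Greedy longest-match tokenizer: for each start word, try the longest span
// first and shrink it from the right until it is a known name or one word.
function tokenizeSketch(text, hasName) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const tokens = [];
  let start = 0;
  while (start < words.length) {
    let end = words.length;
    while (end > start + 1 && !hasName(words.slice(start, end).join(' '))) {
      end--;
    }
    tokens.push(words.slice(start, end).join(' '));
    start = end; // continue with the rest of the text
  }
  return tokens;
}

const inGazetteer = name => toyGazetteer.has(name);
console.log(tokenizeSketch('Port of Spain Trinidad and Tobago', inGazetteer));
// [ 'port of spain', 'trinidad and tobago' ]
console.log(tokenizeSketch('Sydney CBD Australia', inGazetteer));
// [ 'sydney', 'cbd', 'australia' ]
```

A single word that isn’t in the gazetteer still becomes a token, which is why “cbd” appears on its own in the second example.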

This is a simple strategy but works pretty well.

Sort

The sorting is very important: for example, if I’m searching for “Paris” without any context I’m more likely to be searching for Paris, France than Paris, USA. However, as far as I can tell Placeholder just sorts in order of the Who’s on First id. In practice this seems to work remarkably well; the larger and more populous places tend to occur first. I don’t know why this is; maybe it’s because the dataset was built up starting with the most common places first.
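Sorting by id is about as simple as ranking gets. As a minimal sketch, assuming each result is an object with a numeric `id` field (the record shape, the function name and the ids below are made up for illustration, not real Who’s on First data):

```javascript
// Hypothetical result records; the ids are invented for the example.
const candidates = [
  { id: 30, name: 'place c' },
  { id: 10, name: 'place a' },
  { id: 20, name: 'place b' },
];

// Return a new array ordered by ascending id, leaving the input untouched.
function sortByIdSketch(places) {
  return [...places].sort((a, b) => a.id - b.id);
}

console.log(sortByIdSketch(candidates).map(p => p.name));
// [ 'place a', 'place b', 'place c' ]
```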

Putting it together

The Placeholder geocoder takes a relatively straightforward approach, but it’s pretty effective. I’ve been using it to geocode Australian locations and it’s really easy to use through Docker. However, I’m finding I want to be able to customise it and make it fit better with the rest of my Python code. I don’t think it would be tremendously difficult to port to Python, although it would take care to get exactly the same results.