Trifork Blog

Axon Framework, DDD, Microservices

Server-side clustering of geo-points on a map using Elasticsearch

August 1st, 2013 by
|

Plotting markers on a map is easy using the tooling that is readily available. However, what if you want to add a large number of markers to a map when building a search interface? The problem is that things start to clutter and it’s hard to view the results. The solution is to group results together into one marker. You can do that on the client using client-side scripting, but as the number of results grows, this might not be the best option from a performance perspective.

This blog post describes how to do server-side clustering of those markers, combining them into one marker (preferably with a counter indicating the number of grouped results). It provides a solution to the “too many markers” problem with an Elasticsearch facet.

The Problem

The image below renders quite well the problem we were facing in a project:

serversideclustering_elasticsearch_1

The mass of markers is so dense that it replicates the shape of the Netherlands! These items represent monuments and other things of general interest in the Netherlands; for an application we  developed for a customer we need to manage about 200,000 of them and they are especially concentrated in the cities, as you can see in this case in Amsterdam: The  “draw everything” strategy doesn’t help much here.

serversideclustering_elasticsearch_2

Looking for a solution

This article by Google gives a good overview of the problem, but it presents only client side solutions. As the start of this post anticipates, client side processing is not good in our case: if the browser received all of the 200,000 points of interest the webpage would freeze every time the user drags or zooms.

Therefore we looked at means for doing this server side.

Since we are using Elasticsearch as our storage, we searched for a solution which was either built-in or available in a plugin, and found https://github.com/zenobase/geocluster-facet. It is an Elasticsearch plugin which uses a clustering algorithm to group nearby documents.

We installed the plugin and tried some use cases, and although liking the base idea we concluded that we needed more control over the grouping: in our tests, the way groups were calculated by the plugin varied a lot depending on the particular bounding box used in the query. This would result in unstable group icons while a user pans around the map.

The “grid” idea mentioned in Google’s article would allow us to fine control the level of grouping. It basically consists in dividing the earth’s surface in cells of a rectangular grid, with the grid lines following meridians and parallels, and grouping the items according to the cell they fall into. Since we want the grouping level to vary with the zoom level, there is no single grid which accommodates all cases but instead grids of different granularity are needed. If the map is showing the whole world, the surface should be divided in a few ‘mega’ cells, spanning big portions of the continents; on the other extreme, if the map is set on a single neighbourhood the cells should be as small as buildings, in order to group only very close documents.

So we put together 2 good ideas: a grid-based grouping algorithm, packed as an Elasticsearch plugin.

Enter GeoHash

In the end, we didn’t have to deal directly with calculations on grids because a nice idea called geohash handles it in a concise and elegant way.

A geohash is a way to encode a pair of geographical coordinates as a string, where the encoded value has an interesting property: the more two locations are close to each other, the more chars their respective geohashes have in common. For example, the Dam in Amsterdam (geohash u173zq4xmc) and the Nieuwemarkt, also in Amsterdam (geohash u173zw0m3wes) share the first 5 chars, while two places as distant as the Colosseum (sr2yk3bse2q) and the Liberty Statue (dr5r7p4rn) have no chars in common (shared chars are intended as shared prefixes).

In other words, a geohash prefix (which, by the way, is a valid geohash in itself) identifies a region; the shorter the prefix, the bigger the region.

You can look on http://www.bigdatamodeling.org/2013/01/intuitive-geohash.html for a nice visual exposition of the concept; the article shows cells corresponding to different geohash lengths.

Grouping using this technique boils down to group together documents with identical geohashes to the nth char. The smaller n, the more “aggressive” the grouping, because a short prefix specifies a wide region, which includes more items. On the other end, if you consider the whole geohash length for the matching (elasticsearch uses geohashes of 12 chars) then the only items which end up grouped together are the ones with identical coordinates.

The Plugin

The plugin we made is a facet plugin: this means that the output of the grouping will be available as the result of a facet request, where a single facet instance is one group. The code relies on the geohash calculated by Elasticsearch and available through the GeoPoint#geohash() method. We’ve made this available for your use.

Interested?

If so then first; Download plugin

Use

The project started as a fork of the original geocluster-facet mentioned before. The plugin infrastructure and domain classes are still there, but the base algorithm relies now on the geohashes. So, go and clone the project if you want to know how it works! In the following I’ll go into details of the usage of the plugin.Once you you have installed the plugin: launch a mvn clean package to build the jar, copy it to $ES_HOME/plugins/somefolder/ and restart Elasticsearch.

Let’s assume that the documents in the index contain a field “location” of type geopoint; then an example query to see the plugin in action might be:

curl -XPOST http://localhost:9200/myindex/venues -d ‘{
    "query" : { ... }
    "facets" : {
        "places" : {
            "geohash" : {
                "field" : "location",
                "factor" : 0.9
            }
        }
    }
}’

The factor parameter must have a value in the interval [0, 1], where 0 mean “don’t do any grouping” and 1 corresponds to the most aggressive grouping. The value of factor is translated in the number of geohash chars to consider in the grouping algorithm.

The response would be something like:

{
    ...,

  "facets" : {
	"places" : {
  	"_type" : "geohash",
  	"factor" : 0.9,
  	"clusters" : [ {
    	"total" : 8,
    	"center" : {
      	"lat" : 16.95292075,
      	"lon" : 122.036081375
    	},
    	"top_left" : {
      	"lat" : 33.356026,
      	"lon" : 121.00589
    	},
    	"bottom_right" : {
      	"lat" : 14.60962,
      	"lon" : 129.247421
    	}
  	}, {
    	"total" : 5,
    	"center" : {
      	"lat" : 37.12715661612766,
      	"lon" : -83.15270566
    	},
    	"top_left" : {
      	"lat" : 40.78255308063828,
      	"lon" : -85.50469199999998
    	},
    	"bottom_right" : {
      	"lat" : 36.177174,
      	"lon" : -73.96547029999999
    	}
  	} ]
	}
  }
}

The first element of the “clusters” array is a group made of 8 documents, with its bounding box and center coordinates (the center is calculated as the mean of the coordinates of the documents composing the group).

A map done using the Plugin

Time again for some map screenshots! Here is the center of Amsterdam and its points of interest:

serversideclustering_elasticsearch_3

Most of the icons are now groups, collapsing together 2 or more documents (the number of documents is shown inside the icon). You can also see a couple of single documents at the bottom right.

The work of the plugin becomes more visible when we zoom in two levels from the previous view:

serversideclustering_elasticsearch_4

The big groups we had before have bursted in smaller ones, better representing the distribution of the documents in the space. And if we get even closer we start seeing more single documents, each one in its real location:

serversideclustering_elasticsearch_5

That’s it, I hope this will help someone else with a use case like ours. I have a couple of closing thoughts about the topic:

  • The latest Elasticsearch (0.90.2 as of writing) allows you to do powerful things with geohashes, and I don’t exclude that some future versions might even make this plugin superfluous! Which would make me happy, having one plugin less to install.
  • But while this plugin is still needed, I’m thinking about improvements on the fine tuning of the grouping at various zoom levels. A single char of difference in a geohash represents a big change in scale (32x in the area of the cell); going to a sub-character level would probably help finding the most appropriate grouping for a certain map view.

33 Responses

  1. August 2, 2013 at 19:37 by Eric Jain

    Neat! I had considered using geohashes, but didn’t like the fact that nearby points can end up in clusters that are very far away. On the other hand, this results in stable clusters, so you can pan… I’m curious about other approaches; most algorithms I found weren’t suitable for elasticsearch (e.g. because they require keeping all points in memory and doing multiple passes).

    • August 5, 2013 at 10:23 by Gianluca Ortelli

      Yes, there is the side effect of cells potentially pulling apart nearby points. But for this use case I don’t need the clusters to absolutely “tell the truth” about the data; they are just supposed to give a clue to the eye.

      As I say in the closing part, I’m more interested in fine tuning the transition between a level of geohash and the next, since it’s a bit jumpy now. I would like to be able to switch from a certain cell size to one 4 times bigger or smaller, instead of 32. If I improve this part, the problem you mention should be alleviated as well.

      I didn’t make a pull request out of my work because your project doesn’t support multiple algorithms (neither mine), but maybe we could end up with a single plugin where the algorithm is chosen by the client?

      • August 5, 2013 at 21:13 by Eric Jain

        Having a single facet that supports different algorithms could make sense, as long as the parameters are similar enough. I know the elasticsearch guys want to include a facet to cluster geo points, but this is pending a major refactoring of the facet code. Ask them about it next time you bump into one of them in Amsterdam 🙂

  2. August 17, 2013 at 02:34 by John Hinnegan

    This is pretty cool. I’m trying to get this running, but it just seems to hang. I’m running on a dataset of 80K points in the US. The geo_cluster one comes back in a few seconds. :S

    • August 19, 2013 at 14:28 by Gianluca Ortelli

      Are you using elasticsearch 0.90.2? I haven’t tested the plugin with 0.90.3.
      If you are already on 0.90.2 I would like to see the JSON of your request. 80K points shouldn’t be a problem.

  3. August 19, 2013 at 18:48 by John Hinnegan

    I’m using 0.90.3. Looking at the original plugin, I’m guessing something broke on upgrade: https://github.com/zenobase/geocluster-facet/commit/fcea3cf57b70955ca85b57433a83b47f3142d64d

  4. August 19, 2013 at 19:37 by John Hinnegan

    Got it working. It’s awesome! Pull request submitted for 0.9.3 upgrade

    • August 20, 2013 at 15:44 by Gianluca Ortelli

      Merged. Many thanks!

  5. September 20, 2013 at 07:48 by Joe

    i got the 0.0.7-SNAPSHOT version and am running with elasticsearch 0.90.5. have the following problem performing the geohash facet:

    SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; ElasticSearchException[Found class org.elasticsearch.index.fielddata.GeoPointValues, but interface was expected]; nested: IncompatibleClassChangeError[Found class org.elasticsearch.index.fielddata.GeoPointValues, but interface was expected];

  6. October 17, 2013 at 21:02 by David

    I am having the same issue as Joe.

    SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; ElasticSearchException[Found class org.elasticsearch.index.fielddata.GeoPointValues, but interface was expected]; nested: IncompatibleClassChangeError[Found class org.elasticsearch.index.fielddata.GeoPointValues, but interface was expected];

    The original project, https://github.com/zenobase/geocluster-facet, have upgraded to the latest version of ES , 0.90.5. I tried their suggestion and recompiled the project but was not able to get it work. I have never programmed in Java so I am a bit at a lose. Any help would be greatly appreciated as I feel that geo-hashing is better way of clustering data.

    • November 1, 2013 at 17:28 by Gianluca Ortelli

      Upgraded to 0.90.5. It turned out to be just a one character change in pom.xml. Lazy me!

  7. November 13, 2013 at 11:48 by acreux

    Hello everyone,

    this plugin is very interesting. Just one comment :

    at https://github.com/triforkams/geohash-facet/blob/master/src/main/java/nl/trifork/elasticsearch/facet/geohash/ClusterReducer.java#L16

    The merge returns a new Cluster, should not we have map.put(clusterGeohash, map.get(clusterGeohash).merge(cluster));
    instead ? It does not update the cluster otherwise.

  8. November 14, 2013 at 14:46 by denyil

    Hi,where can i download 0.0.9-SNAPSHOT its seems that 0.0.7 not compatible with 0.90.5

    • November 21, 2013 at 10:04 by Gianluca Ortelli

      Hi Denyil, the contact form was lagging behind the project. Now it’s updated, you can check it out again

  9. November 15, 2013 at 18:19 by Tim Bunce

    Hi Gianluca.

    Is there any chance you could include a pre-built .jar file in your releases? I’d love to use this plugin but don’t have any Java building experience, so the need to build it myself represents a significant hurdle.

    The zenobase/geocluster-facet repo includes a .jar file in releases and instructions for installing it with a single bin/plugin command. It would be a big help if your geohash-facet plugin could be as easy to install.

    Thanks.

    p.s. Any reason issues are disabled on the github repo?

    • November 21, 2013 at 10:06 by Gianluca Ortelli

      Hi Tim,

      I added instructions for installation from command line. I’m so used to install by file copy that I forgot about that method.

      I enabled the issues on github, but I see that you already found it out 🙂

  10. November 26, 2013 at 22:23 by Peter Marklund

    Hi!
    This looks really useful, thanks for sharing! Is there a demo of this running somewhere where you can get a feel for what the clustering is like? We are working on http://www.naturkartan.se in Sweden and we are considering the Maptoolkit product that comes with really nice server side clustering with animation effects, i.e. see http://www.bikemap.net However, it seems that even with the Maptoolkit clustering there may be some odd behaviours (the cluster numbers are not always correct and markers don’t always merge with the closest cluster) but I guess it’s hard to find a clustering algorithm that “works perfectly”.

    Cheers

    Peter

    • November 27, 2013 at 11:56 by Gianluca Ortelli

      We don’t have a running demo server, but you can go on the github project and have a look at the example code in https://github.com/triforkams/geohash-facet/tree/master/examples. It contains the web page and javascript which I used to produce the images in the post. To make some use of it you need to provide an elasticsearch instance (with the plugin installed and your own geolocalized documents) which the page can query to populate the map.

      The map on http://www.bikemap.net/ is catchy! The animation helps a lot keeping track of how the clusters decompose across zoom levels. Obviously, if you go with elasticsearch you will need to code the animation yourself, although the plugin gives you all the data you need to realize it.

      If http://www.bikemap.net is based on Maptoolkit, I would say that it too uses some cell based approach; when you zoom out more and more, you start seeing a grid pattern in the way icons are located. You will see a similar artifact with the geohash-facet plugin.

      I don’t know how “perfect” our approach is, but in principle the algorithm always calculates precise cluster totals and merges the markers in a predictable way. If you start playing with the plugin and find out that this is not always the case, please go to https://github.com/triforkams/geohash-facet/issues
      and help yourself!

  11. November 27, 2013 at 13:48 by Martin

    Hey Gianluca!

    A view days before I stumbled upon this article cause I searched for a server side clustering.
    I installed elasticsearch (0.90.7) with your plugin (geohash-facet-0.0.11) for the first time.
    It works really good and fast, thank you for sharing this plugin!

    But I have a problem with the bounding box:
    If I use the example query with match_all and the filter “geo_bounding_box” the facet will return all clusters of the type, but I only want to display the clusters in the bounding box.
    What am I doing wrong?

    Thanks.

    • November 28, 2013 at 13:28 by Gianluca Ortelli

      Hi,

      can you post your query? Or a simplified version of it, just to see how you are using the filter.

  12. November 29, 2013 at 12:41 by Martin

    My query:
    http://pastebin.com/m4Vz8bxL

    • November 29, 2013 at 14:18 by Gianluca Ortelli

      I see that you are using a “normal” filter: it restricts the documents returned by the query, but it doesn’t affect the facet count. I guess everybody needs to stumble on this sooner or later 🙂
      To obtain only the facets inside the bounding box, you might use a filtered query. Another suggestion which is given here is to use a facet filter.

  13. December 13, 2013 at 12:05 by Martin

    Thanks, it works 🙂
    I will post the final result in a view weeks.

  14. January 23, 2014 at 13:21 by Sapan

    How do I use this facet from within the Java client?

    • February 14, 2014 at 18:54 by Gianluca Ortelli

      I have just published on https://github.com/triforkams/geohash-facet/releases/download/geohash-facet-0.0.13/geohash-facet-0.0.13.jar version 0.0.13 of the plugin, which contains a facet builder for the geohash facet.

      To use the builder, beside having the plugin installed on you elasticsearch cluster you must also make it available to you client code as a library. Then you can write code like this:

      Settings settings = ImmutableSettings.settingsBuilder()
      .put("cluster.name", "my-cluster").build();

      Client client = new TransportClient(settings).addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

      SearchResponse response = client.prepareSearch("myindex").setTypes("mytype")
      .setQuery(QueryBuilders.matchAllQuery())
      .addFacet(new GeoFacetBuilder("places").field("location").factor(0.9).showGeohashCell(true))
      .execute().actionGet();

    • February 19, 2014 at 10:04 by Gianluca Ortelli

      That’s nice!

      I expected that the aggregation framework was fit to replace the plugin, but now I see that they even defined a specialized aggregation for this use case.

      I wonder how easy it is to provide all data that are currently returned by the plugin, namely cluster centroid and bounding box. It probably requires some scripting on top of the call to geohash_grid.

  15. January 9, 2015 at 10:53 by Dan

    What is the limit of points to cluster? for example http://www.clustermash.com has been able to cluster 8 million points of the geonames database on one map

  16. March 18, 2015 at 16:29 by Dane

    Elasticsearch now provides this with the “geohash grid” aggregation. It buckets documents based on the geohash with a specified precision.

    You can also calculate additional aggregations for each bucket, here’s an example for the average lat/long and min price for the bucket:

    “aggregations” : {
    “price_grid” : {
    “geohash_grid” : {
    “field” : “location”,
    “precision” : 4
    },

    “aggregations”: {
    “min_price”: {
    “min”: {
    “field”: “price”
    }
    },
    “lat”: {
    “avg”: {
    “field”: “location.lat”
    }
    },
    “lon”: {
    “avg”: {
    “field”: “location.lon”
    }
    }
    }
    }
    }

    Note that the location must be of field type geo_point and must hav lat_lon fields enabled:

    “properties” : {
    “location”: {
    “type”: “geo_point”,
    “lat_lon”: true
    },

  17. March 24, 2015 at 14:53 by Luka

    Hello. When you update your plugin to ES 1.5.0?

  18. May 6, 2015 at 16:15 by Martin

    Hey!
    The downloadlink is broken, after sending the form there is a 404 error.
    Is it possible to use this plugin with the current version of ES?
    Thanks!

  19. May 18, 2015 at 18:01 by Martin

    The current version is on github:
    https://github.com/triforkams/geohash-facet