Trifork Blog

Server-side clustering of geo-points on a map using Elasticsearch – continued

March 26th, 2014 by Gianluca Ortelli

In a previous post I described a data visualization problem and a possible solution based on an elasticsearch plugin. I also noted that elasticsearch might one day evolve to make the plugin unnecessary. That day seems to have come: starting from version 1.0.0, elasticsearch includes Aggregations, a new API for data mining. In this post I’ll show you how to use aggregations to reproduce the functionality of the plugin.

Note: if you want to be sure you understand what follows, you’d better read the original post first. After reading it, you will know:

  • what exactly the “too many markers on a map” problem is;
  • how you can use geohashes to solve it;
  • a plugin-based solution for the elasticsearch ecosystem.

Elasticsearch aggregations, described in the reference documentation, are a tool to extract aggregated (hence the name) information from a set of documents. They are similar to the GROUP BY clause and aggregate functions in SQL.
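
As a minimal illustration of the analogy (the venues index and the city field are made up for this example), the SQL query SELECT city, COUNT(*) FROM venues GROUP BY city corresponds roughly to the following terms aggregation:

{
    "size": 0,
    "aggs": {
        "by_city": {
            "terms": {
                "field": "city"
            }
        }
    }
}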

The authors of the API also defined a very specific type of aggregation, geohash_grid. It does something similar to what the plugin does: it groups documents that have a location according to the similarity of their geohash strings. All I’m going to do is add a small piece of functionality that the plain geohash_grid aggregation doesn’t offer and that helps in displaying the documents on a map. This missing part is the average position of the documents belonging to the same bucket.

In short, this will be a verbose post showing a small enhancement to a powerful elasticsearch feature.

Let’s proceed in order:

1. Filtering

Since we are processing the documents with the final purpose of showing them on a map, it’s a good idea to immediately filter out the documents that fall outside the visible region of the map. This is the first version of our query:

{
    "query": {
        "match_all": {}
    },
    "size": 0,
    "aggs": {
        "filtered_cells": {
            "filter": {
                "geo_bounding_box": {
                    "location": {
                        "top_left": "50.1234, 4.0000",
                        "bottom_right": "52.1234, 4.5555"
                    }
                }
            }
        }
    }
}
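
If you want to try the queries in this post, you can send the request body to the _search endpoint. For example, assuming elasticsearch runs locally and the documents live in a hypothetical index named venues, with the body above saved in a file called query.json:

curl -XPOST 'http://localhost:9200/venues/_search' -d @query.json

Note that the geo_bounding_box filter sits inside an aggregation rather than in the query: the top-level query remains match_all, and the filter only restricts the set of documents that gets aggregated.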

2. Aggregate by geohash

Now we want to use geohash_grid to do the actual geohash-based grouping, but in the context of the filter just defined; therefore, we declare the next aggregation nested inside the “filtered_cells” aggregation.

{
    "query": {
        "match_all": {}
    },
    "size": 0, "aggs": {
        "filtered_cells": {
            "filter": {
                "geo_bounding_box": {
                    "location": {
                        "top_left": "50.1234, 4.0000",
                        "bottom_right": "52.1234, 4.5555"
                    }
                }
            },
            "aggs": {
                "cells": {
                    "geohash_grid": {
                        "field": "location",
                        "precision": 3
                    }
                }
            }
        }
    }
}

In this sample request, we want to group together all the documents that share a common 3-character prefix in their geohash.
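
To make this concrete: a point in Amsterdam at latitude 52.37 and longitude 4.90 encodes to a geohash beginning with "u173z", so at precision 3 it ends up in cell "u17", together with every other document whose geohash starts with those 3 characters; that is the same key you will find in the sample response further down. A 3-character cell measures roughly 156 by 156 km, so in practice you would derive the precision from the current zoom level of the map.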

3. Add group centroid

EDIT: the following part makes use of dynamic scripting, a feature that is disabled by default starting from ES 1.2.0. If you are on that version or a more recent one, you will need to enable dynamic scripting by setting script.disable_dynamic: false in your elasticsearch.yml file.
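
For reference, enabling it is a single line in the node configuration (each node needs a restart to pick up the change):

# elasticsearch.yml
script.disable_dynamic: false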

Lastly, since we eventually need to draw each bucket on a map, it’s a good idea to calculate the average coordinate of the documents included in each cluster. We could also position the marker at the center of the cell identified by the geohash, but that would result in all the markers being perfectly aligned in a grid, losing some visual information. The only way I could add this data was by nesting two further aggregations inside the “cells” aggregation, one to calculate the average latitude and the other the average longitude:

{
    "query": {
        "match_all": {}
    },
    "size": 0,
    "aggs": {
        "filtered_cells": {
            "filter": {
                "geo_bounding_box": {
                    "location": {
                        "top_left": "50.1234, 4.0000",
                        "bottom_right": "52.1234, 4.5555"
                    }
                }
            },
            "aggs": {
                "cells": {
                    "geohash_grid": {
                        "field": "location",
                        "precision": "precision"
                    },
                    "aggs": {
                        "center_lat": {
                            "avg": {
                                "script": "doc['location'].lat"
                            }
                        },
                        "center_lon": {
                            "avg": {
                                "script": "doc['location'].lon"
                            }
                        }
                    }
                }
            }
        }
    }
}

You can see two aggregations, both nested inside “cells” and both of type avg. Each one calculates the average of values extracted by a script: the latitude of a document in one case and the longitude in the other.
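
One caveat: averaging raw coordinates works fine for a region like the Netherlands, but it breaks down for cells that straddle the antimeridian, where longitudes jump from +180 to -180 and a naive average ends up on the wrong side of the globe. If your data lives near that line, you will need a smarter centroid calculation.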

The previous query will be answered with a JSON response like the one below:

{
    ...
    "aggregations": {
        "filtered-cells": {
            "cells": {
                "buckets": [
                    {
                        "center_lat": {
                            "value": 51.82441700086297
                        },
                        "center_lon": {
                            "value": 4.698991342820276
                        },
                        "doc_count": 116950,
                        "key": "u15"
                    },
                    {
                        "center_lat": {
                            "value": 52.47565999229339
                        },
                        "center_lon": {
                            "value": 4.977245411610905
                        },
                        "doc_count": 46845,
                        "key": "u17"
                    },
                    ...
                ]
            },
            "doc_count": 191775
        }
    },
    ...
}

which is equivalent to what you can obtain with the plugin.
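
In practice, each bucket translates directly into a marker: place it at (center_lat.value, center_lon.value), label it with doc_count, and use the key to identify the cell, for example to zoom in on it when the marker is clicked.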

To recapitulate, the good news is that elasticsearch has become more powerful with these data mining operations, which allow us to solve the use case described in the previous post with just base functionality.

There’s one thing I couldn’t get to work, though: showing the ID of the single document in a bucket, for all the buckets of size 1. It’s a feature that I was asked to add to the plugin, and it wasn’t difficult to implement there. If anyone finds a way to achieve this with aggregations, comments on this post are welcome!

6 Responses

  1. July 10, 2014 at 21:50 by Antoine

    Hi Gianluca,

    thanks for this great tutorial. Easy to set up, except that I struggled because ES disables dynamic scripting by default.

    Did you find a solution for getting the ID of the single document in a bucket? I still cannot find out how to do that.

    Cheers !

    Antoine

    • July 14, 2014 at 17:39 by Gianluca Ortelli

      Hi Antoine,

      thanks for reporting this, I’ll update the post to save other people the same struggle 🙂

      About the problem of showing the ID inside a single-document bucket, I haven’t investigated the API further. I’m thinking more about writing a custom aggregation, which should replace the current, facet-based plugin. This will take some time though, I have just started looking into the internals of the aggregations and I’m still lost.

      Greetings,
      Gianluca

      • July 14, 2014 at 17:43 by Antoine

        Thanks for the feedback.

        On my side, I decided to deal with it at a higher level of my technical stack. I go through the buckets, and if one has a single element I look it up in my database using its latitude/longitude as an index. Not perfect, but that’s the only workaround I found ’til now.

        Best regards,

        Antoine

  2. September 23, 2014 at 18:06 by Hakan

    Hi,

    nice tutorial 😉 I must say that I was also stuck on finding the ID in the case of a single-item bucket. But then I found the top_hits aggregation:

    "aggregations": {
        "my_grid": {
            "geohash_grid": {
                "field": "location",
                "precision": 6
            },
            "aggs": {
                "top_points_hits": {
                    "top_hits": {
                        "size": 1
                    }
                }
            }
        }
    }

    Hope this helps you out 🙂

    Hakan.

  3. January 9, 2015 at 11:40 by Dan

    Have you looked at http://clustermash.com? They have been able to cluster at least 8 million points from the GeoNames DB in one of their demos. They also show integration with MapQuest and its geocoding and directions APIs.

  4. June 8, 2015 at 16:09 by Ayush

    Can anyone tell me how the geohash grid aggregation works in the case of nested documents?

    Thanks