In a previous post I described a data visualization problem and a possible solution based on an elasticsearch plugin. I also noted that elasticsearch might one day evolve to make the plugin unnecessary. That day seems to have come: starting with version 1.0.0, elasticsearch includes Aggregations, a new API for data mining. In this post I’ll show you how to use aggregations to reproduce the plugin’s functionality.

Note: if you want to be sure to understand what follows, you’d better read the original post first. After reading it, you will know:

  • what exactly the “too many markers on a map” problem is;
  • how you can use geohashes to solve it;
  • a plugin-based solution for the elasticsearch ecosystem.

Elasticsearch aggregations, described here, are a tool to extract aggregated (hence the name) information from a set of documents; they are similar to the GROUP BY clause and aggregate functions in SQL.

The authors of the API also defined a very specific type of aggregation, geohash_grid. It does something similar to what the plugin does: it groups documents that have a location according to the similarity of their geohash strings. All I’m going to do is add a little functionality that the plain geohash_grid aggregation doesn’t offer and that helps in displaying the documents on a map. This missing part is the average position of the documents belonging to the same bucket.

In short, this will be a verbose post to show a small enhancement on a powerful feature of elasticsearch.

Let’s proceed in order:

1. Filtering

Since we are processing the documents with the final purpose of showing them on a map, it’s a good idea to immediately filter out the documents which fall outside the visible region of the map. This is the first version of our query:

{
    "query": {
        "match_all": {}
    },
    "size": 0,
    "aggs": {
        "filtered_cells": {
            "filter": {
                "geo_bounding_box": {
                    "location": {
                        "top_left": "52.1234, 4.0000",
                        "bottom_right": "50.1234, 4.5555"
                    }
                }
            }
        }
    }
}
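To make the filter’s semantics concrete, here is a rough client-side sketch of the same check in Python. Note that in elasticsearch top_left is the north-west corner of the box (the higher latitude) and bottom_right the south-east corner; the function name and the example coordinates are just illustrative:

```python
def in_bounding_box(lat, lon, top_left, bottom_right):
    """Rough equivalent of the geo_bounding_box filter:
    top_left is the north-west corner (max lat, min lon),
    bottom_right the south-east corner (min lat, max lon)."""
    tl_lat, tl_lon = top_left
    br_lat, br_lon = bottom_right
    return br_lat <= lat <= tl_lat and tl_lon <= lon <= br_lon

# A point near Rotterdam falls inside the box, one near Hamburg does not.
print(in_bounding_box(51.92, 4.48, (52.1234, 4.0000), (50.1234, 4.5555)))  # True
print(in_bounding_box(53.55, 9.99, (52.1234, 4.0000), (50.1234, 4.5555)))  # False
```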

2. Aggregate by geohash

Now we want to use geohash_grid to do the actual geohash-based grouping, but in the context of the filter just defined; therefore, we declare the next aggregation nested inside the “filtered_cells” aggregation.

{
    "query": {
        "match_all": {}
    },
    "size": 0,
    "aggs": {
        "filtered_cells": {
            "filter": {
                "geo_bounding_box": {
                    "location": {
                        "top_left": "52.1234, 4.0000",
                        "bottom_right": "50.1234, 4.5555"
                    }
                }
            },
            "aggs": {
                "cells": {
                    "geohash_grid": {
                        "field": "location",
                        "precision": 3
                    }
                }
            }
        }
    }
}

In this sample request, we group together all the documents that share a common 3-character prefix in their geohash.
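For intuition about what that grouping means, here is a minimal geohash encoder in Python (the standard algorithm: interleave longitude and latitude bits, then base32-encode). Nothing elasticsearch-specific; just a sketch showing that nearby points share a prefix and therefore land in the same bucket:

```python
# Alphabet used by the geohash base32 encoding (no 'a', 'i', 'l', 'o').
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=3):
    """Encode a (lat, lon) pair into a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    even = True  # bits are interleaved, starting with longitude
    while len(bits) < precision * 5:  # each base32 character encodes 5 bits
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half of the interval
        even = not even
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = n * 2 + b
        chars.append(BASE32[n])
    return "".join(chars)

# Rotterdam and Amsterdam fall into two different precision-3 cells:
print(geohash(51.92, 4.48))  # u15
print(geohash(52.37, 4.90))  # u17
```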

3. Add group centroid

EDIT: the following part makes use of dynamic scripting, a feature that is disabled by default starting from ES 1.2.0. If you run that version or a more recent one, you will need to enable scripting by setting script.disable_dynamic: false in your elasticsearch.yml file.

Lastly, since we eventually need to draw each bucket on a map, it’s a good idea to calculate the average coordinate of the documents included in the bucket. We could instead position the marker at the center of the cell identified by the geohash, but then all the markers would be perfectly aligned in a grid, losing some visual information. The only way I could add this data was by nesting 2 more aggregations inside the “cells” aggregation: one to calculate the average latitude, the other to do the same with the longitude:

{
    "query": {
        "match_all": {}
    },
    "size": 0,
    "aggs": {
        "filtered_cells": {
            "filter": {
                "geo_bounding_box": {
                    "location": {
                        "top_left": "52.1234, 4.0000",
                        "bottom_right": "50.1234, 4.5555"
                    }
                }
            },
            "aggs": {
                "cells": {
                    "geohash_grid": {
                        "field": "location",
                        "precision": 3
                    },
                    "aggs": {
                        "center_lat": {
                            "avg": {
                                "script": "doc['location'].lat"
                            }
                        },
                        "center_lon": {
                            "avg": {
                                "script": "doc['location'].lon"
                            }
                        }
                    }
                }
            }
        }
    }
}

You can see 2 aggregations, both nested inside “cells” and both of type avg. Each one calculates the average of values produced by a script: respectively, the latitude and the longitude of a document.
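Per bucket, the two avg sub-aggregations amount to nothing more than this (a sketch, with points as (lat, lon) tuples):

```python
def bucket_centroid(points):
    """What center_lat/center_lon compute for one bucket: the arithmetic
    mean of the latitudes and of the longitudes of its documents."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return sum(lats) / len(lats), sum(lons) / len(lons)

print(bucket_centroid([(51.0, 4.0), (52.0, 5.0)]))  # (51.5, 4.5)
```

Note that naively averaging longitudes would misbehave for cells straddling the antimeridian; for a bounding box over the Netherlands this is not a concern.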

The previous query will be answered with a JSON document like the one below:

{
    ...
    "aggregations": {
        "filtered_cells": {
            "cells": {
                "buckets": [
                    {
                        "center_lat": {
                            "value": 51.82441700086297
                        },
                        "center_lon": {
                            "value": 4.698991342820276
                        },
                        "doc_count": 116950,
                        "key": "u15"
                    },
                    {
                        "center_lat": {
                            "value": 52.47565999229339
                        },
                        "center_lon": {
                            "value": 4.977245411610905
                        },
                        "doc_count": 46845,
                        "key": "u17"
                    },
                    ...
                ]
            },
            "doc_count": 191775
        }
    },
    ...
}

which is equivalent to what you can obtain with the plugin.
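On the client side, turning such a response into map markers is a simple walk of the nested structure. A sketch, looking up the aggregations under the names given in the request (elasticsearch echoes them back in the response):

```python
def buckets_to_markers(response):
    """Extract (geohash_key, lat, lon, doc_count) tuples, one per bucket."""
    cells = response["aggregations"]["filtered_cells"]["cells"]
    return [
        (b["key"], b["center_lat"]["value"], b["center_lon"]["value"], b["doc_count"])
        for b in cells["buckets"]
    ]

# Abbreviated version of the sample response above:
sample = {
    "aggregations": {
        "filtered_cells": {
            "doc_count": 191775,
            "cells": {
                "buckets": [
                    {"key": "u15", "doc_count": 116950,
                     "center_lat": {"value": 51.824}, "center_lon": {"value": 4.699}},
                ]
            }
        }
    }
}
print(buckets_to_markers(sample))  # [('u15', 51.824, 4.699, 116950)]
```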

To recapitulate, the good news is that elasticsearch has become more powerful with these data mining operations, which let us solve the use case described in the previous post using just core functionality.

There’s one thing I couldn’t get to work though: showing the ID of the single document in a bucket, for all the buckets with size 1. It’s a feature that I was asked to add to the plugin, and it wasn’t difficult to implement there. If anyone finds a way to achieve this with aggregations, comments to the post are welcome!