Trifork Blog

Category ‘Elasticsearch’

Refactoring from Elasticsearch version 1 with Java Transport client to version 6 with High Level REST client

February 27th, 2018
(https://blog.trifork.com/2018/02/27/refactoring-from-elasticsearch-version-1-with-java-transport-client-to-version-6-with-high-level-rest-client/)

Every long-running project accrues technical debt. It may be that the requirements today have evolved in a different direction from what was foreseen when the project was designed, or it may be that difficult infrastructure tasks have been put off in favor of new functionality. From time to time, you need to refactor your code to clean up this technical debt. I recently finished such a refactoring task for a customer, so in the category ‘from the trenches’, I would like to share the story here.

Elasticsearch exposes both a REST interface and its internal Java API, via the binary transport client, for connecting with the search engine. Just over a year ago, Elastic announced its plan to deprecate the transport client in favor of the high level REST client, “as soon as the REST client is feature complete and is mature enough to replace the Java API entirely”. The reasons for this are clearly explained in Luca Cavanna’s blog post, but the most important disadvantage is that the transport client tightly couples your application to the exact major and minor release of your ES cluster. As long as Elasticsearch exposes its internal API, it has to worry about breaking thousands of applications all over the world that depend on it.

The “as soon as…” timetable sounds somewhat vague and long term, but there may be good reasons to migrate your search functionality now rather than later. In the case of our customer, the reason was the wish to use the AWS Elasticsearch service. The entire codebase already runs in AWS, and for the past few years they have been managing their own Elasticsearch cluster on EC2 instances. This turned out to be labor intensive when updates had to be applied to these VMs. It would be easier and probably cheaper to let Amazon manage the cluster. As the AWS Elasticsearch service only exposes the REST API, the dependence on the transport protocol had to be removed.

Action plan

The starting situation was a dependency on Elasticsearch 1.4.5, using the Java API. The goal was the most recent Elasticsearch version available in the Amazon Elasticsearch Service, which at the time was 6.0.2, using the REST API.

In order to reduce the complexity of the refactoring operation, we decided early on to reindex the data, rather than trying to convert the existing indices. Every Elasticsearch release comes with a handy list of breaking changes. Going through these, we compiled a list of the breaking changes that would likely affect the search implementation of our customer. There are more potential breaking changes than listed here, but these are the ones that an initial investigation suggested might have an impact:

1.x – 2.x:

  • Facets replaced by aggregations
  • Field names can’t contain dots

2.x – 5.x:

  • The ‘string’ field type replaced by ‘text’ and ‘keyword’
  • The count API replaced by a search with size 0
  • The suggest API merged into the search API

5.x – 6.0:

  • Support for indices with multiple mapping types dropped

The plan was first to convert the existing code to work with ES 6, and only then migrate from the transport client to the High Level REST client.

Implementation 

The entire search functionality, originally written by our former colleague Frans Flippo, was exhaustively covered by unit and integration tests, so the first step was to update the Maven dependency to the current version, run the tests, and see what broke. First there were compilation errors that were easily fixed. Some examples:

  • FilterBuilder became QueryBuilder: RangeFilterBuilder was replaced with RangeQueryBuilder, TermsFilterBuilder with TermsQueryBuilder, PercolateRequestBuilder with PercolateQueryBuilder, and so on;
  • highlighters now use HighlightBuilder;
  • ‘fields’ was replaced with ‘storedFields’.

Facets had already been replaced by aggregations by our colleague Attila Houtkooper, so we didn’t have to worry about that. The count API was removed in version 5.0, and its use had to be replaced by executing a search with size 0.
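A hedged sketch of that replacement (the index name, field and values are invented here, not taken from the customer’s code):

// ES 1 had a dedicated count API:
// long count = client.prepareCount("products").setQuery(query).get().getCount();

// ES 6: run a search with size 0 and read the total hit count instead
SearchResponse response = client.prepareSearch("products")
        .setQuery(QueryBuilders.termQuery("status", "published"))
        .setSize(0) // we only need the count, not the documents
        .get();
long count = response.getHits().getTotalHits();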

In ES 5, the suggest API was removed, and became part of the search API. This turned out not to have an impact on our project, because the original developer of the search functionality implemented a custom suggestions service based on aggregation queries. It looks like he wanted the suggestions to be ordered by the number of occurrences in a ‘bucket’, which couldn’t be implemented using the suggest API at the time. We decided that refactoring this to use Elasticsearch suggesters would be new functionality, and outside the scope of this upgrade, so we would continue to use aggregations for now.
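As a sketch of the idea only (not the actual implementation; the field name and the prefix variable are invented), a terms aggregation can produce suggestion candidates ordered by the number of occurrences in each bucket:

// Suggest terms starting with a given prefix, most frequent terms first (ES 6 syntax)
TermsAggregationBuilder suggestions = AggregationBuilders.terms("suggestions")
        .field("title.not_analyzed")
        .includeExclude(new IncludeExclude(prefix + ".*", null))
        .order(BucketOrder.count(false)); // descending by bucket doc count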

Some updates were required to the index mappings. The most obvious one was replacing ‘string’ with either ‘text’ or ‘keyword’. The old ‘analyzer’ setting became ‘search_analyzer’, while ‘index_analyzer’ became ‘analyzer’.

Syntax ES 1:

"fields": {
    "analyzed": {
        "type": "string",
        "analyzer" : "dutch",
        "index_analyzer": "default_min_word_length_2"
    },
    "not_analyzed": {
        "type": "string",
        "index": "not_analyzed"
    }
}

Syntax ES 6:

"fields": {
  "analyzed": {
    "type": "text",
    "search_analyzer": "dutch",
    "analyzer": "default_min_word_length_2"
  },
  "not_analyzed": {
    "type": "keyword",
    "index": true
  }
}

Document ids were associated with a path:

"_id": {
    "path": "id"
},

The _id field is no longer configurable, so in order to have document ids in Elasticsearch match the ids in the database, the id has to be set explicitly on the index request, or Elasticsearch will generate a random one.

All in all, it was roughly a day of work to get the project to compile and ready to run the unit tests. All of them were red.

Running Elasticsearch integration tests

The integration tests depended on a framework that spun up an embedded Elasticsearch node to run the tests against. This mechanism is no longer supported. Though with some effort it is still possible to get it to work, the main point of the operation was to move the search implementation back into ‘supported’ territory, so we decided to abandon this approach.

First we tried out the integration test framework offered by Elasticsearch: ESIntegTestCase. Unfortunately, this framework turned out to be highly opinionated: it wants the tests to run under the RandomizedRunner, which is also used to test Lucene. In order to work with ESIntegTestCase, we would have had to partially rewrite most of the existing integration tests. Ideally, you can rely on the unit tests to prove that your refactoring preserved the expected functionality; if you have to change the tests, you run the risk of ‘fixing’ a test just to make it green. We decided to go with the Elasticsearch Maven plugin instead, which required no code changes to the tests other than configuring the ES client.

<plugin>
    <groupId>com.github.alexcojocaru</groupId>
    <artifactId>elasticsearch-maven-plugin</artifactId>
    <version>6.0</version> <!--Plugin version-->
    <configuration>
        <clusterName>testCluster</clusterName>
        <transportPort>9500</transportPort>
        <httpPort>9400</httpPort>
        <version>6.0.2</version> <!--Elasticsearch version-->
        <timeout>20</timeout>
    </configuration>
    <executions>
        <execution>
            <id>start-elasticsearch</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>runforked</goal>
            </goals>
        </execution>
        <execution>
            <id>stop-elasticsearch</id>
            <phase>post-integration-test</phase>
            <goals>
                <goal>stop</goal>
            </goals>
        </execution>
    </executions>
</plugin>

That’s all there is to it. With this configuration, when Maven enters the pre-integration-test phase, the plugin starts a single-node Elasticsearch cluster of version 6.0.2 in a new process, listening on ports 9500 (transport) and 9400 (HTTP); after the integration tests have run, it stops the cluster and cleans up. The plugin allows a more fine-grained configuration of the cluster, including ES plugins, but we were trying to replace a simple embedded ES node, so this was unnecessary. Now all the tests were yellow: the project compiles, tests run and assertions fail.
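For reference, pointing the ES client in the tests at this cluster was the only code change. A minimal sketch, assuming the transport client that the tests still used at this stage (the cluster name and transport port match the plugin configuration above):

// Connect the test client to the cluster started by the Maven plugin
Settings settings = Settings.builder()
        .put("cluster.name", "testCluster")
        .build();
TransportClient client = new PreBuiltTransportClient(settings)
        .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9500));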

Fixing the pitfalls

Some search results were missing data. In version 1.x, fields added to your query were retrieved from the _source field and returned in your search result. Where we replaced these with storedFields, the fields in question had to be explicitly marked as stored fields in the mapping. Fields that are not stored are included in your search but not returned in the search result. Stored fields can be useful in queries where you want to retrieve just a few fields from a large document.
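For illustration (the field name is invented), marking a field as stored looks like this in the mapping:

"title": {
    "type": "text",
    "store": true
}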

Some aggregations were failing with the message ‘Fielddata is disabled on text fields by default’. In ES 1 there were only ‘string’ fields, not ‘text’ and ‘keyword’ fields. By default, operations like sorting and aggregations are not allowed on ‘text’ fields, unless you explicitly enable them with “fielddata”: true. This is generally not a good idea on analyzed fields, as it can cause a substantial performance hit and may return results you were probably not expecting. We decided to use “copy_to” to make ‘keyword’ type copies of the fields in question to run the aggregations on.
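A sketch of that approach, with invented field names; the analyzed field is copied to a keyword field that aggregations can safely use:

"brand": {
    "type": "text",
    "copy_to": "brand_raw"
},
"brand_raw": {
    "type": "keyword"
}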

It seems that ES 1.4 supported the Java regex engine, while ES 6.x uses Lucene regular expressions, which don’t support boundary matchers such as \b. Luckily there was a workaround, which should work in most cases.

In the REST API, a query field can be boosted with a caret: “fields”: [“title^5”]. In this search implementation based on the transport client, boosting was implemented by appending a caret and the boost to the field name: “title^” + titleBoost. In ES 6, this approach no longer worked. There was a clear difference in results between a query through the Java API and the exact same query, obtained via the toString() method of the QueryBuilder, executed against the REST API. The correct approach is myQueryStringQueryBuilder.field(fieldName, boost). Because the query looked fine in the console, and worked fine when copied and run against the REST API, this pitfall was not immediately obvious.
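A minimal sketch of the difference (the field name and the boost variable are invented):

// Worked in ES 1, but the boost is no longer interpreted when sent through the Java client:
QueryStringQueryBuilder oldStyle = QueryBuilders.queryStringQuery(searchTerms)
        .field("title^" + titleBoost); // caret notation appended to the field name

// The correct approach in ES 6: pass the boost as an explicit parameter
QueryStringQueryBuilder newStyle = QueryBuilders.queryStringQuery(searchTerms)
        .field("title", titleBoost);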

Differences in search results

A fair number of the tests failed because the order of search results by relevance had changed between ES 1 and 6. From the way the tests were written, we got the impression that in ES 1, documents with the exact same score were returned in the order in which they were indexed, while in ES 6 this no longer seems to be the case. We could have made the tests ‘green’ by adding an additional sort by id after the _score, which would make the tests perform consistently over our tiny artificial test data set, but this didn’t ‘feel’ right. We found no explicit mention of this behavior, or of any change to it between ES versions, in the documentation, and there wasn’t a good use case for it: the numerical value of the document id has no special relevance to the users. In real world data, search results with the exact same _score would probably be equally relevant, and the customer agreed with us not to second-guess Elasticsearch and Lucene on the matter of search result relevance.

A more interesting issue was a test where documents that looked more relevant to us were now getting a lower score than seemingly less relevant ones. The tiny data set for these tests contained a large number of mentions of the search term. Somehow, documents that contained an additional, boosted field matching the search term were scoring lower than otherwise identical documents where that field was left blank. The weight of a search term is the product of a function that combines the term frequency, the inverse document frequency and the boost on a field. Artificially packing a document with more mentions of the search term was actually making a match on that term less relevant to Lucene. We could have spent considerable time tweaking the queries until our very artificial test data set looked as relevant to ES 6 as it did to ES 1, but considerable effort in the development of search engines goes into making results more relevant. It seemed like a good idea to trust the experts at Elastic and Lucene on the matter of relevance, and the customer agreed that we shouldn’t spend a lot of time trying to replicate the behavior of a more primitive version of the search engine, fine-tuned to a tiny, artificial test data set. All tests were now green. On to the next stage.

Migrating to the High Level REST Client

In order to work with the Amazon Elasticsearch Service, we had to remove the dependence on the transport client. In the long term this is a good idea anyway, as Elastic intends to deprecate and ultimately remove this client. To facilitate the move away from the transport client, ES has been working on the so-called High Level REST Client. This client uses the low level REST client to send requests, but accepts the existing query builders from the Java API and returns the same response objects. At least in theory, this should allow you to move your search functionality to the new client with minimal code changes. When it came to searching, practice matched this theory quite nicely. Here is a pseudocode representation of the syntax using the transport client:

// Query
SearchRequestBuilder searchRequestBuilder = client.prepareSearch(indexName);
searchRequestBuilder.setTypes(objectType);
searchRequestBuilder.setQuery(buildMainQuery(formObject));

// Paging
searchRequestBuilder.setFrom(maxSize * formObject.getPage());
searchRequestBuilder.setSize(maxSize);

// Fields to return
for (String field : formObject.getRequestedFields()) {
    searchRequestBuilder.addStoredField(field);
}

// Filters
searchRequestBuilder.setPostFilter(buildFilterQuery(formObject));

// Sorting
searchRequestBuilder.addSort(formObject.getSortField(), toSortOrder(formObject.getSortOrder()));

(...)

SearchResponse searchResponse = searchRequestBuilder.execute().actionGet();

And here is the syntax using the High Level REST Client:

// Query
SearchSourceBuilder sb = new SearchSourceBuilder();
sb.query(buildMainQuery(formObject));

// Paging
sb.from(maxSize * formObject.getPage());
sb.size(maxSize);

// Fields to return
sb.storedFields(asList(formObject.getRequestedFields()));

// Filters
sb.postFilter(buildFilterQuery(formObject));

// Sorting
sb.sort(formObject.getSortField(), toSortOrder(formObject.getSortOrder()));

SearchRequest searchRequest = new SearchRequest(indexName);
searchRequest.types(objectType);
searchRequest.source(sb);

(...)

SearchResponse searchResponse = client.search(searchRequest);
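For completeness: the high level client itself wraps a low level REST client. A minimal construction sketch for 6.0.x (the host and HTTP port are examples, matching the test cluster configuration used earlier):

RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9400, "http")));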

 

The query builders were accepted as they were, and the search response object was the same. We ran the tests, and they were still green. The search service no longer depended on the transport client. For indexing operations, the conversion was similarly straightforward:

BulkRequestBuilder indexBulkRequestBuilder = client.prepareBulk();
(...)
IndexRequestBuilder indexRequestBuilder = client.prepareIndex(indexName, objectType);
indexRequestBuilder.setId(domainObject.getId());
String json = jsonConverter.convertToJson(domainObject);
indexRequestBuilder.setSource(new BytesArray(json), XContentType.JSON);
indexBulkRequestBuilder.add(indexRequestBuilder);
(...)
BulkResponse response = indexBulkRequestBuilder.execute().actionGet();

And with the High Level REST client:

BulkRequest bulkRequest = new BulkRequest();
(...)
IndexRequest indexRequest = new IndexRequest();
String json = jsonConverter.convertToJson(domainObject);
indexRequest.index(indexName);
indexRequest.id(domainObject.getId());
indexRequest.source(new BytesArray(json), XContentType.JSON);
indexRequest.type(objectType);
bulkRequest.add(indexRequest);
(...)
BulkResponse response = client.bulk(bulkRequest);

Now for the admin operations that manage indices. In our target release of Elasticsearch, 6.0.2, the High Level REST Client did not yet support the indices API. The current release (6.2 at the time of writing this blog) does support it, so at first we tried using the 6.2 version of the ES libraries against a cluster running 6.0.2. This turned out to break the search functionality: the 6.2 version of the query builders was including keywords in some queries that were not yet supported in 6.0. Elastic had warned about this in the announcement mentioned earlier:

“… the high-level client still depends on Elasticsearch, like the Java API does today. This may not be ideal, as it still ties users of the client to depend on a certain version of Elasticsearch, but this decision allows users to migrate away more easily from the transport client. We would like to get rid of this direct dependency in the future, but since this is a separate long-term project, we didn’t want this to affect the timing of the client’s first release.”

As the High Level REST Client wasn’t directly compatible with the admin operations in our application anyway, we decided to implement these features using the Low Level REST client. Here is an example of creating an index template using the transport client:

String template = readFile(templateFileName);

elasticClient.admin()
            .indices()
            .preparePutTemplate(templateName)
            .setPatterns(indexNamePattern)
            .setSource(new BytesArray(template), XContentType.JSON)
            .get();

And here is the same operation using the low level REST client:

String template = readFile(templateFileName);

HttpEntity entity = new NStringEntity(template, ContentType.APPLICATION_JSON);

client.performRequest("PUT", "_template/" + templateName, emptyMap(), entity);

The index name pattern now had to be included in the template JSON and was no longer dynamically configurable in the code, but this was not a problem for our application. With all the tests nice and green, the project was done!

Conclusion

Software problems always seem easy once you know how to solve them. This report from the trenches may give the impression that this was a simple, straightforward migration that only took a few days, but in reality the whole project involved several weeks of investigation, backtracking from dead ends, and a lot of trial and error. Hopefully this blog can help fellow Java developers who are tasked with the same problem save a lot of time, or you could contact Trifork to provide a helping hand.

Kibana Histogram on Day of Week

September 4th, 2017
(https://blog.trifork.com/2017/09/04/kibana-histogram-on-day-of-week/)

I keep track of my daily commutes to and from the office. One thing I want to know is how the different days of the week affect my travel duration. But when indexing all my commutes into Elasticsearch, I cannot create a histogram on the day of the week out of the box. My first visualization will look like this:

Read the rest of this entry »

Smart energy consumption insights with Elasticsearch and Machine Learning

August 21st, 2017
(https://blog.trifork.com/2017/08/21/smart-energy-consumption-insights-with-elasticsearch-and-machine-learning/)

At home we have a Youless device which can be used to measure energy consumption. You have to mount it on your energy meter so it can monitor energy consumption. The device then provides energy consumption data via a RESTful API. We can use this API to index energy consumption data into Elasticsearch every minute, and then gather energy consumption insights by using Kibana and X-Pack Machine Learning.

The goal of this blog is to give a practical guide on how to set up and understand X-Pack Machine Learning, so you can use it in your own projects! After completing this guide, you will have the following up and running:

  • A complete data pre-processing and ingestion pipeline, based on:
    • Elasticsearch 5.4.0 with ingest node;
    • Httpbeat 3.0.0.
  • An energy consumption dashboard with visualizations, based on:
    • Kibana 5.4.0.
  • Smart energy consumption insights with anomaly detection, based on:
    • Elasticsearch X-Pack Machine Learning.

The following diagram gives an architectural overview of how all components are related to each other:

Read the rest of this entry »

Simulating an Elasticsearch Ingest Node pipeline

February 2nd, 2017
(https://blog.trifork.com/2017/02/02/elasticsearch-ingest-node/)

Indexing documents into your cluster can be done in a couple of ways:

  • using Logstash to read your source and send documents to your cluster;
  • using Filebeat to read a log file, send documents to Kafka, let Logstash connect to Kafka and transform the log event and then send those documents to your cluster;
  • using curl and the Bulk API to index a pre-formatted file;
  • using the Java Transport Client from within a custom application;
  • and many more…

Before version 5, however, there were only two ways to transform your source data into the document you wanted to index: using Logstash filters, or doing it yourself.

In Elasticsearch 5 the concept of the Ingest Node was introduced: just a node in your cluster like any other, but with the ability to create a pipeline of processors that can modify incoming documents. The most frequently used Logstash filters have been implemented as processors.

For me, the best part of pipelines is that you can simulate them. Especially in Console, simulating your pipelines makes creating them very fast; the feedback loop on testing your pipeline is very short. This makes pipelines a very convenient way to index data.
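As a small, hedged example (the pipeline and documents are invented), simulating a pipeline in Console looks roughly like this:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "grok": { "field": "message", "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request}"] } }
    ]
  },
  "docs": [
    { "_source": { "message": "127.0.0.1 GET /index.html" } }
  ]
}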

Read the rest of this entry »

Public Elasticsearch clusters are being held ransom

January 18th, 2017
(https://blog.trifork.com/2017/01/18/public-elasticsearch-clusters-are-being-held-ransom/)

Last week several news sites and researchers reported that Elasticsearch clusters that are connected to the internet without proper security are being held ransom.

You can use shodan.io to search for Elasticsearch clusters: https://www.shodan.io/search?query=port%3A9200+json&language=en.

The first hit is actually a cluster that is ‘infected’:

There are some secured clusters as well:

But the default ‘root’ account with username “elastic” and password “changeme” (docs) will grant access. So not much security here… But at least your data is still there. For now.

Please do not connect your cluster to the internet without securing it. Use X-Pack Security for authentication and authorization.

Elastic Cloud could also be something for you: security is enabled by default in Elastic Cloud.

Handling a massive amount of product variations with Elasticsearch

December 22nd, 2016
(https://blog.trifork.com/2016/12/22/handling-a-massive-amount-of-product-variations-with-elasticsearch/)

In this blog we will review different techniques for modelling data structures in Elasticsearch. A project case is used to describe our approach on handling a small product data set combined with a large data set of related product variations. Furthermore, we will show how certain modelling decisions resulted in a factor 1000 query performance gain!

The flat world

Elasticsearch is a great product if you want to index and search through a large number of documents. Functionality like term and range queries, full-text search and aggregations on large data sets are very fast and powerful. But Elasticsearch prefers to treat the world as if it were flat. This means that an index is a flat collection of documents. Furthermore, when searching, a single document should contain all of the information that is required to decide whether it matches the search request.

In practice, however, domains often are not flat and contain a number of entities which are related to each other. These can be difficult to model in Elasticsearch in such a way that the following conditions are met:

  • Multiple entities can be aggregated from a single query;
  • Query performance is stable with low response times;
  • Large numbers of documents can easily be mutated or removed.

The project case

This blog is based on a project case. In the project, two data sets were used. The data sets have the following characteristics:

  • Products:
    • Number of documents: ~ 75000;
    • Document characteristics: A product contains a set of fields which contains the primary information of a product;
    • Mutation frequency: Updates on product attributes can occur fairly often (e.g. every 15 minutes).
  • Product variations:
    • Number of documents: ~ 500 million;
    • Document characteristics: A product variation consists of a set of additional attributes which contain extra information on top of the corresponding product. The number of product variations per product varies a lot, and can go up to 50000;
    • Mutation frequency: During the day, there is a continuous stream of updates and new product variations.

Read the rest of this entry »

Collecting data from a private LoRaWAN sensor network into Elastic

May 20th, 2016
(https://blog.trifork.com/2016/05/20/collecting-data-from-a-private-lorawan-sensor-network-into-elastic/)

Introduction to LoRaWAN and ELK

Why LoRaWAN, and what makes it different from other types of low power consumption, high range wireless protocols like ZigBee, Z-Wave, etc … ?

LoRa is a wireless modulation for long-range, low-power, low-data-rate applications developed by Semtech. The main features of this technology are the large number of devices that can connect to one network and the relatively large range that can be covered with one LoRa router: one gateway can coordinate around 20,000 nodes in a range of 10–30 km. It’s a very flexible protocol and allows developers to build various types of network architectures according to the demands of the client. A general description of the LoRaWAN protocol together with a small tutorial are available in my previous post.

What is the ELK stack, and why use it with LoRaWAN?

In the figure above, you can see a simplified model of what a typical LoRaWAN network looks like.
As you can see, the data from the LoRa endpoints has to go through several devices before it reaches the back end application. Nowadays there are a lot of tools that allow us to gather and manipulate the data. A very good solution is the ELK stack, which consists of Elasticsearch, Logstash and Kibana; these three tools allow you to gather, store and analyze large amounts of data. More information and details can be found on the official website: https://www.elastic.co/.

Read the rest of this entry »

Elastic{ON} 2016

February 20th, 2016
(https://blog.trifork.com/2016/02/20/elasticon-2016/)

Last week a colleague and I attended Elastic{ON} in San Francisco. The venue at Pier 48 gave a nice view of (among others) the Oakland Bay Bridge. Almost 2000 Elastic fanatics converged to listen to and talk about everything in the Elastic Stack.

I have been to a lot of sessions. I think the two most important things that I will take home are “5.0” and “graphs”.

5.0

The next version of the Elastic Stack will be 5.0. This means that all main Elastic products (Elasticsearch, Logstash, Kibana and Beats) will have the same version number in all following releases. This will be easier for all customers and clients.

I mentioned the Elastic Stack: this is a slight rebranding of the ELK Stack plus Beats. Another rebranding is the renaming of Found, the Elasticsearch-as-a-service solution, to Elastic Cloud. I think those are simple but good changes.

Elastic also created the concept of packs to combine extensions. Most notable is the X-Pack, with all the monitoring, alerting and security (and more) goodies wrapped together.

More about 5.0 on the Elastic blog.

Graphs

The other main take-away is the graph capabilities (Graph API) that will be added to Elasticsearch (through the X-Pack). It is still in an early phase but it looks awesome! It looks very easy to use and it is very fast. The UI is written as a Kibana plugin.

Actually there will be some more Kibana plugins. Managing users and roles via the Security API, for example.

Talks

Of course there were a lot of talks. Common subjects were security and recommendations. Graphs could play an important role there!

Some talks were cool user stories of companies that implemented (parts of) the Elastic Stack. Other talks dove deep into the different Elastic products. Some of those turned out to be a little out of my league, for example the math behind the new default BM25 scoring algorithm.

The talks will be put online in the next couple of weeks. So be sure to check them out! Maybe I will see you next year!

Dealing with NodeNotAvailableExceptions in Elasticsearch

April 8th, 2015
(https://blog.trifork.com/2015/04/08/dealing-with-nodenotavailableexceptions-in-elasticsearch/)

tl;dr

Elasticsearch provides distributed search with minimal setup and configuration. Now the nice thing about it is that, most of the time, you don’t need to be particularly concerned about how it does what it does. You give it some parameters – “I want 3 nodes”, “I want 3 shards”, “I want every shard to be replicated so it’s on at least two nodes” – and Elasticsearch figures out how to move stuff around so you get the situation you asked for. If a node becomes unreachable, Elasticsearch tries to keep things going, and when the lost node reappears and rejoins, the administration is updated so everything is hunky-dory again.

The problem is when things don’t work the way you expect…

Computer says “no node available”

Read the rest of this entry »

Shield your Kibana dashboards

March 5th, 2015
(https://blog.trifork.com/2015/03/05/shield-your-kibana-dashboards/)

You work with sensitive data in Elasticsearch indices that you do not want everyone to see in their Kibana dashboards, like a hospital with patient names. You could, for example, give each department its own Elasticsearch cluster in order to prevent all departments from seeing the patients’ names.

But wouldn’t it be great if there was only one Elasticsearch cluster and every department could manage their own Kibana dashboards? And still have the security in place to prevent leaking of private data?

With Elasticsearch Shield, you can create a configurable layer of security on top of your Elasticsearch cluster. In this article, we will explore a small example setup with Shield and Kibana.

Read the rest of this entry »