Trifork Blog

Category ‘Custom Development’

Refactoring from Elasticsearch version 1 with Java Transport client to version 6 with High Level REST client

February 27th, 2018 by

Every long running project accrues technical debt. It may be that the requirements today have evolved in a different direction from what was foreseen when the project was designed, or it may be that difficult infrastructure tasks have been put off in favor of new functionality. From time to time, you need to refactor your code to clean up this technical debt. I recently finished such a refactoring task for a customer, so in the category ‘from the trenches’, I would like to share the story here.

Elasticsearch exposes both a REST interface and the internal Java API, via the binary transport client, for connecting with the search engine. Just over a year ago, Elastic announced to the world that it plans to deprecate the transport client in favor of the high level REST client, “as soon as the REST client is feature complete and is mature enough to replace the Java API entirely”. The reasons for this are clearly explained in Luca Cavanna’s blogpost, but the most important disadvantage is that using the transport client, you introduce a tight coupling between your application and the exact major and minor release of your ES cluster. As long as Elasticsearch exposes its internal API, it has to worry about breaking thousands of applications all over the world that depend on it.

The “as soon as…” timetable sounds somewhat vague and long term, but there may be good reasons to migrate your search functionality now rather than later. In the case of our customer, the reason is that they want to use the AWS Elasticsearch service. The entire codebase is already running in AWS, and for the past few years they have been managing their own Elasticsearch cluster running on EC2 instances. This turns out to be labor intensive when updates have to be applied to these VMs. It would be easier, and probably cheaper, to let Amazon manage the cluster. As the AWS Elasticsearch service only exposes the REST API, the dependence on the transport protocol will have to be removed.

Action plan

The starting situation was a dependency on Elasticsearch 1.4.5, using the Java API. The goal was the most recent Elasticsearch version available in the Amazon Elasticsearch Service, which at the time was 6.0.2, using the REST API.

In order to reduce the complexity of the refactoring operation, we decided early on to reindex the data, rather than trying to convert the indices. Every Elasticsearch release comes with a handy list of breaking changes. Looking through these lists, we tried to identify the breaking changes that would be likely to affect our customer’s search implementation. There are more potential breaking changes than listed here, but these are the ones that an initial investigation suggested might have an impact:

1.x – 2.x:

  • Facets replaced by aggregations
  • Field names can’t contain dots

2.x – 5.x:

  • The ‘string’ field type replaced by ‘text’ and ‘keyword’ fields
  • The suggest API merged into the search API

5.x – 6.0:

  • Support for indices with multiple mapping types dropped

The plan was first to convert the existing code to work with ES 6, and only then migrate from the transport client to the High Level REST client.


The entire search functionality, originally written by our former colleague Frans Flippo, was exhaustively covered by unit and integration tests, so the first step was to update the Maven dependency to the current version, run the tests, and see what broke. First there were compilation errors that were easily fixed. Some examples:

  • Replace FilterBuilder with QueryBuilder, RangeFilterBuilder with RangeQueryBuilder, TermsFilterBuilder with TermsQueryBuilder, PercolateRequestBuilder with PercolateQueryBuilder, etc.
  • Switch to HighlightBuilder for highlighters.
  • Replace ‘fields’ with ‘storedFields’.
  • The count API was removed in version 5.5; its use had to be replaced by executing a search with size 0 (see the sketch below).

Facets had already been replaced by aggregations by our colleague Attila Houtkooper, so we didn’t have to worry about that.
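The count replacement, for example, looks roughly like this, still on the transport client at this stage (a minimal sketch; the variable names are made up):

// Before (removed API): client.prepareCount(indexName).setQuery(query).get().getCount()
// After: run a search that returns no documents and read the total hit count instead.
SearchResponse countResponse = client.prepareSearch(indexName)
        .setQuery(query)
        .setSize(0)
        .get();
long count = countResponse.getHits().getTotalHits();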

In ES 5, the suggest API was removed, and became part of the search API. This turned out not to have an impact on our project, because the original developer of the search functionality implemented a custom suggestions service based on aggregation queries. It looks like he wanted the suggestions to be ordered by the number of occurrences in a ‘bucket’, which couldn’t be implemented using the suggest API at the time. We decided that refactoring this to use Elasticsearch suggesters would be new functionality, and outside the scope of this upgrade, so we would continue to use aggregations for now.
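The idea behind such an aggregation-based suggester is roughly the following (a sketch with hypothetical field names, not the actual production code):

// A terms aggregation on a keyword field, ordered by the number of documents per bucket,
// so the most frequently occurring terms are suggested first.
TermsAggregationBuilder suggestions = AggregationBuilders
        .terms("suggestions")
        .field("title.keyword")
        .size(10)
        .order(BucketOrder.count(false)); // descending by document count
searchRequestBuilder.addAggregation(suggestions);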

Some updates were required to the index mappings. The most obvious one was replacing ‘string’ with either ‘text’ or ‘keyword’. Analyzer became search_analyzer, while index_analyzer became analyzer.

Syntax ES 1:

"fields": {
    "analyzed": {
        "type": "string",
        "analyzer" : "dutch",
        "index_analyzer": "default_min_word_length_2"
    "not_analyzed": {
        "type": "string",
        "index": "not_analyzed"

Syntax ES 6:

"fields": {
  "analyzed": {
    "type": "text",
    "search_analyzer": "dutch",
    "analyzer": "default_min_word_length_2"
  "not_analyzed": {
    "type": "keyword",
    "index": true

Document id’s were associated with a path:

"_id": {
    "path": "id"

The _id field is no longer configurable, so in order to have document ids in Elasticsearch match ids in the database, the id has to be set explicitly, or Elasticsearch will generate a random one.
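With the transport client that is a one-line change when building the index request. A sketch, assuming the domain object exposes its database id:

// Use the database id as the Elasticsearch document id instead of relying on the old _id path mapping.
IndexRequestBuilder indexRequestBuilder =
        client.prepareIndex(indexName, objectType, domainObject.getId().toString());
indexRequestBuilder.setSource(new BytesArray(json), XContentType.JSON);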

All in all, it was roughly a day of work to get the project to compile and ready to run the unit tests. All of them were red.

Running Elasticsearch integration tests

The integration tests depended on a framework that spun up an embedded Elasticsearch node to run the tests against. This mechanism is no longer supported. Though with some effort it is still possible to get this to work, the main point of the operation was to move the search implementation back into ‘supported’ territory, so we decided to abandon this approach.

First we tried out the integration test framework offered by Elasticsearch: ESIntegTestCase. Unfortunately, this framework turned out to be highly opinionated. It wants the tests to run under the RandomizedRunner, which is also used to test Lucene. In order to work with ESIntegTestCase, we would have had to partially rewrite most of the existing integration tests. Ideally, you can rely on the unit tests to prove that your refactoring preserved the expected functionality. If you have to change the tests, you run the risk that you end up ‘fixing’ a test to make it green. We decided to go with the Elasticsearch maven plugin instead, which required no code changes to the tests other than configuring the ES client.

<plugin>
    ...
    <version>6.0</version> <!--Plugin version-->
    <configuration>
        <version>6.0.2</version> <!--Elasticsearch version-->
        ...
    </configuration>
</plugin>

That’s all there is to it. With this configuration, when Maven enters the pre-integration-test phase, the plugin starts a single-node Elasticsearch cluster of version 6.0.2 in a new process, listening on ports 9500 and 9400, runs the tests, and then stops the cluster and cleans up. The plugin allows a more fine-grained configuration of the cluster, including ES plugins, but we were trying to replace a simple embedded ES node, so this was unnecessary. Now all the tests were yellow: the project compiles, the tests run and the assertions fail.
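Configuring the ES client in the tests then comes down to pointing it at the ports the plugin opens. A minimal sketch for the transport client still in use at this stage (the cluster name is an assumption and has to match the plugin configuration):

// Connect the tests to the cluster started by the maven plugin (transport port 9500).
Settings settings = Settings.builder()
        .put("cluster.name", "test") // assumed name, must match the plugin configuration
        .build();
TransportClient client = new PreBuiltTransportClient(settings)
        .addTransportAddress(new TransportAddress(new InetSocketAddress("localhost", 9500)));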

Fixing the pitfalls

Some search results were missing data. In version 1.x, fields added to your query were retrieved from the _source field and returned in your search result. Where we replaced these with storedFields, the fields in question had to be explicitly marked as stored fields in the mapping. Fields that are not stored are included in your search, but not returned in the search result. This can be useful in queries where you want to retrieve just a few fields from a large document.

Some aggregations were failing with the message ‘Fielddata is disabled on text fields by default’. In ES 1, there were ‘string’ fields, not ‘text’ and ‘keyword’ fields. By default, operations like sorting and aggregations are not allowed on ‘text’ fields, unless you explicitly enable fielddata on the field with “fielddata”: true. This is generally not a good idea on analyzed fields, as it can cause a substantial performance hit, and may return results you were probably not expecting. We decided to use “copy_to” to make ‘keyword’ type copies of the fields in question to run the aggregations on.
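With a hypothetical text field whose values are copied into a keyword field via “copy_to” in the mapping, the aggregation then simply targets the keyword copy. A sketch:

// 'category' is an analyzed text field; its values are copied (copy_to) into the
// keyword field 'category_raw', which is safe to sort and aggregate on.
TermsAggregationBuilder byCategory = AggregationBuilders
        .terms("by_category")
        .field("category_raw");
searchRequestBuilder.addAggregation(byCategory);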

It seems that ES 1.4 supported the Java regex engine, while ES 6.x uses a different one, which doesn’t support boundary matchers such as \b. Luckily there was a workaround, which should work in most cases.

In the REST API, a query field can be boosted with a caret: “fields”: [“title^5”]. In this search implementation based on the transport client, this was implemented by appending ^ and the boost to the field name: “title^” + titleBoost. In ES 6, this approach no longer seemed to work. There was a clear difference in results between a query through the Java API and executing the exact same query, via the toString() method of the QueryBuilder, using the REST API. The correct approach is myQueryStringQueryBuilder.field(fieldName, boost). Because the query looked fine in the console, and worked fine when copied and run against the REST API, this pitfall was not immediately obvious.
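In code, the difference looks roughly like this (a sketch; the field name and boost variable are made up):

// Worked with ES 1, but no longer has the intended effect in ES 6:
// queryStringQueryBuilder.field("title^" + titleBoost);

// Correct way to boost a field on the query builder:
QueryStringQueryBuilder query = QueryBuilders
        .queryStringQuery(searchTerms)
        .field("title", titleBoost);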

Differences in search results

A fair number of the tests failed because the order of search results by relevance had changed between ES 1 and 6. From the way the tests were written, we got the impression that in ES 1, documents with the exact same score were returned in the order in which they were indexed, while in ES 6 this doesn’t seem to be the case. We could have made the tests ‘green’ by adding an additional sort by id after the _score, which would make the tests perform consistently over our tiny artificial test data set, but this didn’t ‘feel’ right. We didn’t find an explicit mention of this behavior, or of a change to it between ES versions, in the documentation, and there wasn’t a good use case for it: the numerical value of the document id has no special relevance to the users. In real world data, search results with the exact same _score would probably be equally relevant, and the customer agreed with us not to second-guess Elasticsearch and Lucene on the matter of search result relevance.

A more interesting issue was a test where documents that looked more relevant to us were now getting a lower score than seemingly less relevant ones. The tiny data set for these tests contained a large number of mentions of the search term. Somehow, documents that contained an additional, boosted field matching the search term were scoring lower than otherwise identical documents where that field was left blank. The weight of a search term is the product of a function that combines the term frequency, the inverse document frequency and the boost on a field. Artificially packing a document with more mentions of the search term was actually making a match on this search term less relevant to Lucene. We could have spent considerable time tweaking the queries to make documents in our very artificial test data set score the same in ES 6 as they did in ES 1, but this wasn’t the way to go either. Considerable effort in the development of search engines goes into making search results more relevant, and it seemed like a good idea to trust the experts at Elastic and Lucene on search result relevance, so the customer agreed we shouldn’t spend a lot of time trying to replicate the behavior of a more primitive version of the search engine, fine-tuned to a tiny, artificial test data set. All tests were now green. On to the next stage.

Migrating to the High Level REST Client

In order to work with the Amazon Elasticsearch Service, we had to remove the dependence on the transport client. In the long term, this is a good idea anyway, as Elastic intends to deprecate and ultimately remove this client. To facilitate the move away from the transport client, Elastic has been working on the so-called High Level REST Client. This client uses the low level REST client to send requests, but accepts the existing query builders from the Java API and returns the same response objects. At least in theory, this should allow you to move your search functionality to the new client with minimal code changes. When it came to searching, practice matched this theory quite nicely. Here is a pseudocode representation of the syntax using the transport client:

// Query
SearchRequestBuilder searchRequestBuilder = client.prepareSearch(indexName);
searchRequestBuilder.setQuery(queryBuilder);

// Paging
searchRequestBuilder.setFrom(maxSize * formObject.getPage());
searchRequestBuilder.setSize(maxSize);

// Fields to return
for (String field : formObject.getRequestedFields()) {
    searchRequestBuilder.addStoredField(field);
}

// Filters
searchRequestBuilder.setPostFilter(filterQueryBuilder);

// Sorting
searchRequestBuilder.addSort(formObject.getSortField(), toSortOrder(formObject.getSortOrder()));

SearchResponse searchResponse = searchRequestBuilder.execute().actionGet();

And here is the syntax using the High Level REST Client:

// Query
SearchSourceBuilder sb = new SearchSourceBuilder();
sb.query(queryBuilder);

// Paging
sb.from(maxSize * formObject.getPage());
sb.size(maxSize);

// Fields to return
for (String field : formObject.getRequestedFields()) {
    sb.storedField(field);
}

// Filters
sb.postFilter(filterQueryBuilder);

// Sorting
sb.sort(formObject.getSortField(), toSortOrder(formObject.getSortOrder()));

SearchRequest request = new SearchRequest(indexName);
request.source(sb);

SearchResponse searchResponse = client.search(request); // client is a RestHighLevelClient here


The query builders were accepted as they were, and the search response object was the same. We ran the tests, and they were still green. The search service no longer depended on the transport client. For indexing operations, the conversion was similarly straightforward:

BulkRequestBuilder indexBulkRequestBuilder = client.prepareBulk();
IndexRequestBuilder indexRequestBuilder = client.prepareIndex(indexName, objectType);
String json = jsonConverter.convertToJson(domainObject);
indexRequestBuilder.setSource(new BytesArray(json), XContentType.JSON);
indexBulkRequestBuilder.add(indexRequestBuilder);
BulkResponse response = indexBulkRequestBuilder.execute().actionGet();

And with the High Level REST client:

BulkRequest bulkRequest = new BulkRequest();
IndexRequest indexRequest = new IndexRequest(indexName, objectType);
String json = jsonConverter.convertToJson(domainObject);
indexRequest.source(new BytesArray(json), XContentType.JSON);
bulkRequest.add(indexRequest);
BulkResponse response = client.bulk(bulkRequest);
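For completeness: the ‘client’ in these last snippets is a RestHighLevelClient, which wraps the low level REST client. A minimal construction sketch (host and port are assumptions):

RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));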

Now for the admin operations that manage indices. In our target release of Elasticsearch, 6.0.2, the indices API was not yet supported by the High Level REST Client. In the current release (6.2 at the time of writing this blog), there is support for this, so at first we tried using the 6.2 version of the ES libraries against a cluster running 6.0.2. This turned out to break the search functionality. The 6.2 version of the query builders was including keywords in some queries that are not supported in 6.0. Elastic had warned about this in the announcement mentioned earlier:

“… the high-level client still depends on Elasticsearch, like the Java API does today. This may not be ideal, as it still ties users of the client to depend on a certain version of Elasticsearch, but this decision allows users to migrate away more easily from the transport client. We would like to get rid of this direct dependency in the future, but since this is a separate long-term project, we didn’t want this to affect the timing of the client’s first release.”

As the High Level REST Client implementation wasn’t directly compatible with the admin operations in our application anyway, we decided to just implement these features using the Low Level REST client. Here is an example of creating an index template using the transport client:

String template = readFile(templateFileName);

client.admin().indices()
        .preparePutTemplate(templateName)
        .setSource(new BytesArray(template), XContentType.JSON)
        .execute().actionGet();

And here is the same operation using the low level REST client:

String template = readFile(templateFileName);

HttpEntity entity = new NStringEntity(template, ContentType.APPLICATION_JSON);

client.performRequest("PUT", "_template/" + templateName, emptyMap(), entity);

The index name pattern now had to be included in the template JSON and was no longer dynamically configurable in the code, but this was not a problem for our application. With all the tests nice and green, the project was done!


Software problems always seem easy once you know how to solve them. This report from the trenches may give the impression that this was a simple, straightforward migration that only took a few days, but in reality the whole project involved several weeks of investigation, backtracking from dead ends, and a lot of trial and error. Hopefully this blog can help fellow Java developers who are tasked with the same problem save a lot of time, or you can always contact Trifork for a helping hand.

How to manage Database Migrations with Flyway?

December 5th, 2017 by

Joris Kuipers, CTO at Trifork, presented a webinar on some usage patterns for Flyway. You can find the recording on our Trifork YouTube channel.

Tools like Flyway address a common concern for many people, which quickly leads to questions on how to pick a tool and then apply it in the best manner for one’s particular situation.

In this blog post, Joris has summarized the Q&A session – he provides the readers with his answers and ideas for managing database migrations.

1. How does Flyway compare to Liquibase?

When I was choosing a DB schema migration tool 4 or 5 years ago, I looked at both Liquibase and Flyway. In general, I’d say that Flyway is a bit more lightweight: Liquibase addresses some requirements that Flyway explicitly chooses not to support. These include support for defining migrations declaratively (e.g. in XML) and then generating the correct SQL DDL statements for your particular DBMS, and support for generating migration rollbacks (countering operations).
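For readers new to Flyway: it applies plain versioned SQL scripts (e.g. V1__create_tables.sql) in order and records them in a schema history table. A minimal programmatic sketch (in practice the Spring Boot integration or the Maven plugin is more common; the connection details are placeholders):

// Scripts on the classpath under db/migration are applied in version order.
Flyway flyway = new Flyway();
flyway.setDataSource("jdbc:postgresql://localhost:5432/mydb", "user", "password");
flyway.migrate();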
Read the rest of this entry »

Using Axon with PostgreSQL without TOAST

October 9th, 2017 by

The client I work for at this time is leveraging Axon 3. The events are stored in a PostgreSQL database. PostgreSQL uses a thing called TOAST (The Oversized-Attribute Storage Technique) to store large values.

From the PostgreSQL documentation:

“PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows”

As it happens, in our setup using JPA (Hibernate) to store events, the DomainEventEntry entity has a @Lob annotation on the payload and the metaData fields (via extension of the AbstractEventEntry class).
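In simplified form, the relevant part of that mapping looks something like this (a sketch; field types are abbreviated and the remaining columns and annotations are omitted):

// Simplified excerpt of the JPA mapping inherited from AbstractEventEntry:
// both columns end up as large objects, which Hibernate maps to OID on PostgreSQL.
@Lob
@Column(nullable = false)
private byte[] payload;

@Lob
private byte[] metaData;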

For PostgreSQL this will result in events that are not easily readable:

SELECT payload FROM domainevententry;

| payload |
| 24153   |

The data type of the payload column of the domainevententry table is OID.

The PostgreSQL JDBC driver obviously knows how to deal with this; the real content is deTOASTed lazily. Using PL/pgSQL it is possible to store a value in a file, but this needs to be done value by value. When you are debugging your application and want a quick look at its events, this is not a fun route to take.

So we wanted to change the data type in our database to something more human readable, for example BYTEA: still able to store large values, yet readable. As it turned out, a couple of changes were needed to get this working.

It took me a while to get all the pieces I needed. Although the solution I present here works for us, it may not be the most elegant or even the best solution for everyone.
Read the rest of this entry »

Heterogeneous microservices

July 13th, 2017 by

Heterogeneous microservices

Microservices architecture is increasingly popular nowadays. One of its promises is flexibility and easier working in larger organizations by reducing the amount of communication and coordination between teams. The thinking is that teams have their own service(s) and don’t depend on other teams, meaning they can work independently, thereby reducing coordination efforts.

Especially with multiple teams and multiple services per team, this can mean there are quite a few services with quite different usage. Different teams can have different technology preferences, for example because they are more familiar with one or the other. Similarly, different usage can mean quite different requirements, which might be easier to fulfill with one technology than with another.

The question I’m going to discuss in this blog post is: how free or constrained should technology choices be in such an environment?

Read the rest of this entry »

How to send your Spring Batch Job log messages to a separate file

April 14th, 2017 by

In one of my current projects we’re developing a web application which also has a couple of dozen batch jobs that perform all sorts of tasks at particular times. These jobs produce quite a bit of logging output when they run, which is important for seeing exactly what happened during a job. What we noticed, however, is that the batch logging would make it hard to quickly spot the other logging performed by the application while a batch job was running. In addition to that, it wasn’t always clear in the context of what job a log statement was issued.
To address these issues I came up with a simple solution based on Logback Filters, which I’ll describe in this blog.

Logback Appenders

We’re using Logback as a logging framework. Logback defines the concept of appenders: appenders are responsible for handling the actual log messages emitted by the loggers in the application by writing them to the console, to a file, to a socket, etc.
Many applications define one or more appenders and then simply list them all as part of their root logger section in the logback.xml configuration file:

<configuration scan="true">

  <appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>

  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %mdc %-5level %logger{36} - %msg%n</pattern>
  <root level="info">
    <appender-ref ref="LOGSTASH"/>
    <appender-ref ref="FILE"/>


This setup will send all log messages to both of the configured appenders. Read the rest of this entry »

Machine Learning: Predicting house prices

February 16th, 2017 by

Recently I followed an online course on machine learning to understand the current hype better. As with any subject though, only practice makes perfect, so I was looking to apply this new knowledge.

While looking to sell my house I found a nice opportunity to do just that: check whether the prices real estate agents estimate are in line with what the data suggests.

Linear regression should be a nice algorithm here: it tries to find the best linear prediction (y = a + b·x1 + c·x2, where y is the prediction and x1, x2 are the variables). So, for example, this algorithm can estimate a price per square meter of floor space or per square meter of garden. For a more detailed explanation, check out the Wikipedia page.

In the Netherlands, funda is the main website for selling your house, so I started by collecting some data: I used data on the 50 houses closest to my house. I excluded apartments to try and limit the data to properties similar to mine. For each house I collected the advertised price, usable floor space, lot size, number of (bed)rooms, type of house (row house, corner house, or detached) and year of construction (..-1930, 1931-1940, 1941-1950, 1950-1960, etc.). These are the (easily available) variables I expected would influence the house price the most. Type of house is a categorical variable; to use it in regression I modeled it as several binary (0/1) variables.

As preparation, I checked for relations between the variables using correlation. This showed me that much of the collected data does not seem to affect the price: only the floor space, lot size and number of rooms showed a significant correlation with the house price.

For the regression analysis I only used the variables that had a significant correlation. Variables without correlation would not produce meaningful results anyway.
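As an illustration of the kind of fit involved (a sketch with made-up numbers, not the actual analysis), such a multiple linear regression can be run with Apache Commons Math:

// y = advertised price; the columns of x are floor space (m2), lot size (m2) and number of rooms.
double[] y = {285000, 310000, 365000, 420000, 495000};
double[][] x = {
        {105, 120, 4},
        {110, 140, 4},
        {125, 180, 5},
        {150, 300, 6},
        {170, 450, 6}
};

OLSMultipleLinearRegression regression = new OLSMultipleLinearRegression();
regression.newSampleData(y, x);
// beta[0] is the intercept; beta[1..3] are the prices per m2 of floor space, per m2 of lot and per room.
double[] beta = regression.estimateRegressionParameters();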

Read the rest of this entry »

Simulating an Elasticsearch Ingest Node pipeline

February 2nd, 2017 by

Indexing documents into your cluster can be done in a couple of ways:

  • using Logstash to read your source and send documents to your cluster;
  • using Filebeat to read a log file, send documents to Kafka, let Logstash connect to Kafka and transform the log event and then send those documents to your cluster;
  • using curl and the Bulk API to index a pre-formatted file;
  • using the Java Transport Client from within a custom application;
  • and many more…

Before version 5, however, there were only two ways to transform your source data into the document you wanted to index: using Logstash filters, or doing it yourself.

In Elasticsearch 5 the concept of the Ingest Node was introduced: just a node in your cluster like any other, but with the ability to create a pipeline of processors that can modify incoming documents. The most frequently used Logstash filters have been implemented as processors.

For me, the best part of pipelines is that you can simulate them. Especially in Console, simulating your pipelines makes creating them very fast; the feedback loop on testing your pipeline is very short. This makes pipelines a very convenient way to index data.
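The same simulation can also be run from Java through the low level REST client, by posting a pipeline definition together with a few sample documents to the _simulate endpoint. A sketch (the file name and JSON body are placeholders):

// The body contains a "pipeline" definition and a "docs" array with sample documents;
// readFile is a placeholder for however you load that JSON.
String body = readFile("simulate-pipeline.json");
HttpEntity entity = new NStringEntity(body, ContentType.APPLICATION_JSON);
Response response = restClient.performRequest(
        "POST", "_ingest/pipeline/_simulate", emptyMap(), entity);
System.out.println(EntityUtils.toString(response.getEntity()));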

Read the rest of this entry »

Service Discovery using Consul & Spring Cloud

December 14th, 2016 by


In one of our customer projects we are heavily using Spring Boot in combination with other Spring projects for our microservices.

One of the more complex parts of microservices, especially when you use them as fine-grained as they are meant to be, is the fact that you need to set up and maintain the connections between all those services. In Spring you typically do this using some form of externalized configuration, like property files. But even then, it can become quite a challenge when you need to connect with, for instance, 20 other microservices.

To make it even more complex, you definitely want something like scalability, especially in cloud-based solutions. This should be accomplished by simply running another instance of your microservice. They are self-contained, so they just need some basic configuration, like setting the port number. But then you also need something to load-balance the different microservices serving the same purpose. And to be honest: I don’t care about the location and port! I just want a service which offers me a certain contract. And at runtime, when needed, I want them to behave differently depending on configuration.

So what we are actually looking for is a solution which provides an easy way to do service discovery, can act as a load balancer and, even better, can also provide my services with their configuration.

This is where Consul comes to the rescue. According to their website, Consul is a solution which makes service discovery and service configuration easy, and is distributed, highly available and datacenter-aware.

So let’s discover how Consul plays nicely with Spring Boot!
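To give an idea of the direction (a sketch, not the full setup from this post; the service and class names are made up): with the Spring Cloud Consul starter on the classpath, a Spring Boot service registers itself with Consul and can call other services by name through a load-balanced RestTemplate.

@SpringBootApplication
@EnableDiscoveryClient // register this service with Consul and allow lookups of other services
public class CustomerServiceApplication {

    public static void main(String[] args) {
        SpringApplication.run(CustomerServiceApplication.class, args);
    }

    @Bean
    @LoadBalanced // lets "http://order-service/..." resolve via service discovery instead of DNS
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}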

Read the rest of this entry »

Writing less code

November 23rd, 2016 by

Have you ever had that feeling that you have to write too much code to build simple functionality? Some things just feel repetitive; it feels like you should not have to write them yourself, and that instead a framework should make your life easier.

Recently I’ve been building a project in Java/Spring, and after some time I started wondering about alternatives and how to build the same functionality with less code.

There are lots of alternative frameworks and multiple ways of building REST endpoints in Java/Spring:

  • Building the controller/service/DAO layers manually in Spring;
  • Using spring-data-rest to export your spring-data repositories;
  • Groovy/Grails RestfulController;
  • Python/Django django-rest-framework;
  • etc.


Below are some abbreviated examples of how a simple REST endpoint looks in each approach. To actually run the examples, you’ll need to check out the tutorials mentioned earlier. My goal here is a quick comparison of how you do things in each framework.
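As a taste of the second option (a sketch; the entity and repository names are made up): with spring-data-rest, exporting a repository interface is enough to get a full CRUD endpoint, without writing controller or service code by hand.

// spring-data-rest exposes this repository as a paged /houses endpoint with GET/POST/PUT/DELETE.
@RepositoryRestResource(collectionResourceRel = "houses", path = "houses")
public interface HouseRepository extends PagingAndSortingRepository<House, Long> {

    // also exposed automatically, under /houses/search/findByCity?city=...
    List<House> findByCity(@Param("city") String city);
}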

Read the rest of this entry »

Measure the Adequacy of Android Unit Tests with Mutation Testing

September 7th, 2016 by

Unit tests are an essential tool in a trustworthy test suite for an Android application, or any other software system for that matter. But unit tests themselves don’t guarantee that the right features or requirements are tested, even if you made a thorough effort to cover as much code as possible in your entire code base. They only prove that the system is actually tested, but say nothing about the quality of the tests. Mutation tests can help with this issue by measuring the quality of your unit tests through manipulating your code under test. Mutation tests can be seen as the tests of your unit tests.

So, what Exactly is Mutation Testing?

In a nutshell, mutation testing is a mechanism to inject different kinds of errors (mutants) into your code base while running your tests. A mutant could be changing a simple conditional in your Java code from == to !=. Generally, mutants try to simulate common programming errors, like accidentally inverting an if statement, returning null instead of a real object, etc.
If the test covering the piece of code where a mutant was injected still passes, then the mutant survived. If, on the other hand, the test fails, then the mutant was killed. As you might already have guessed, the terminology in mutation testing is somewhat the opposite of normal test results: killed mutants are a good thing, while surviving mutants are a bad thing, since they indicate that we didn’t write our tests well enough.
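A tiny example of the idea (a sketch, not tied to any particular mutation testing tool; the class is made up):

// Production code under test:
public class AgeChecker {
    public boolean isAdult(int age) {
        return age >= 18;
    }
}

// A mutation testing tool might mutate '>=' into '>'. This test kills that mutant,
// because the mutated isAdult(18) would return false:
@Test
public void adultAtExactlyEighteen() {
    assertTrue(new AgeChecker().isAdult(18));
}

// A test that only checks isAdult(40) would let the mutant survive,
// even though it covers the same line of code.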

Read the rest of this entry »