Trifork Blog

Posts by Rik van den Ende

Refactoring from Elasticsearch version 1 with Java Transport client to version 6 with High Level REST client

February 27th, 2018 by Rik van den Ende
(https://blog.trifork.com/2018/02/27/refactoring-from-elasticsearch-version-1-with-java-transport-client-to-version-6-with-high-level-rest-client/)

Every long-running project accrues technical debt. It may be that the requirements today have evolved in a different direction from what was foreseen when the project was designed, or it may be that difficult infrastructure tasks have been put off in favor of new functionality. From time to time, you need to refactor your code to clean up this technical debt. I recently finished such a refactoring task for a customer, so in the category ‘from the trenches’, I would like to share the story here.

Elasticsearch exposes both a REST interface and the internal Java API, via the binary transport client, for connecting with the search engine. Just over a year ago, Elastic announced that it planned to deprecate the transport client in favor of the high level REST client, “as soon as the REST client is feature complete and is mature enough to replace the Java API entirely”. The reasons for this are clearly explained in Luca Cavanna’s blog post, but the most important disadvantage is that by using the transport client, you introduce a tight coupling between your application and the exact major and minor release of your ES cluster. As long as Elasticsearch exposes its internal API, it has to worry about breaking thousands of applications all over the world that depend on it.

The “as soon as…” timetable sounds somewhat vague and long term, but there may be good reasons to migrate your search functionality now, rather than later. In the case of our customer, their reason is wanting to use the AWS Elasticsearch service. The entire codebase is already running in AWS, and for the past few years they have been managing their own Elasticsearch cluster running in EC2 instances. This turns out to be labor intensive when updates have to be applied to these VMs. It would be easier and probably cheaper to let Amazon manage the cluster. As the AWS Elasticsearch service only exposes the REST API, the dependence on the transport protocol will have to be removed.

Action plan

The starting situation was a dependency on Elasticsearch 1.4.5, using the Java API. The goal was the most recent Elasticsearch version available in the Amazon Elasticsearch Service, which at the time was 6.0.2, using the REST API.

In order to reduce the complexity of the refactoring operation, we decided early on to reindex the data rather than trying to convert the indices. Every Elasticsearch release comes with a handy list of breaking changes. Looking through these lists, we tried to pick out the breaking changes that would likely affect the search implementation of our customer. There are more potential breaking changes than listed here, but these are the ones that an initial investigation suggested might have an impact:

1.x – 2.x:

  • Facets replaced by aggregations
  • Field names can’t contain dots

2.x – 5.x:

5.x – 6.0:

  • Support for indices with multiple mapping types dropped

The plan was first to convert the existing code to work with ES 6, and only then migrate from the transport client to the High Level REST client.

Implementation 

The entire search functionality, originally written by our former colleague Frans Flippo, was exhaustively covered by unit and integration tests, so the first step was to update the Maven dependency to the current version, run the tests, and see what broke. First there were compilation errors that were easily fixed. Some examples:

Replace FilterBuilder with QueryBuilder, RangeFilterBuilder with RangeQueryBuilder, TermsFilterBuilder with TermsQueryBuilder, PercolateRequestBuilder with PercolateQueryBuilder, etc.; switch to HighlightBuilder for highlighters; and replace ‘fields’ with ‘storedFields’. The count API was removed in version 5.5, and its use had to be replaced by executing a search with size 0. Facets had already been replaced by aggregations by our colleague Attila Houtkooper, so we didn’t have to worry about that.
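
To illustrate the count replacement, here is a minimal sketch in Python of what a "search with size 0" looks like at the REST level. It is not the code from the project: the function names and the example query are made up for illustration, and the sample response is a hand-written stand-in with the shape of an Elasticsearch 6 search response, where hits.total is a plain number.

```python
import json


def count_request_body(query):
    """Build a search body that asks for the total only, without any hits."""
    return {"size": 0, "query": query}


def total_from_response(response):
    """Read the number of matching documents from an ES 6 search response."""
    return response["hits"]["total"]


# Hypothetical query, for illustration only.
body = count_request_body({"term": {"status": "published"}})
print(json.dumps(body))

# Hand-written stand-in for a search response; not real cluster output.
sample_response = {
    "took": 1,
    "timed_out": False,
    "hits": {"total": 42, "max_score": 0.0, "hits": []},
}
print(total_from_response(sample_response))
```

The body would be sent to the _search endpoint of the index; because size is 0, no documents are returned and only the total is read from the response.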

In ES 5, the suggest API was removed, and became part of the search API. This turned out not to have an impact on our project, because the original developer of the search functionality implemented a custom suggestions service based on aggregation queries. It looks like he wanted the suggestions to be ordered by the number of occurrences in a ‘bucket’, which couldn’t be implemented using the suggest API at the time. We decided that refactoring this to use Elasticsearch suggesters would be new functionality, and outside the scope of this upgrade, so we would continue to use aggregations for now.

Some updates were required to the index mappings. The most obvious one was replacing ‘string’ with either ‘text’ or ‘keyword’. In addition, ‘analyzer’ became ‘search_analyzer’, while ‘index_analyzer’ became ‘analyzer’.

Syntax ES 1:

"fields": {
    "analyzed": {
        "type": "string",
        "analyzer" : "dutch",
        "index_analyzer": "default_min_word_length_2"
    },
    "not_analyzed": {
        "type": "string",
        "index": "not_analyzed"
    }
}

Syntax ES 6:

"fields": {
  "analyzed": {
    "type": "text",
    "search_analyzer": "dutch",
    "analyzer": "default_min_word_length_2"
  },
  "not_analyzed": {
    "type": "keyword",
    "index": true
  }
}
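
The conversion between the two snippets above is mechanical enough to sketch as a small function. This is an illustration in Python, not the code used in the project; convert_field is a made-up name, and it only covers the cases shown in this post (analyzed and not_analyzed string fields).

```python
def convert_field(es1_field):
    """Convert one ES 1 field definition to ES 6 syntax (cases from this post only)."""
    field = dict(es1_field)  # copy, so the input mapping is left untouched
    if field.get("type") != "string":
        return field
    if field.get("index") == "not_analyzed":
        # Exact-value strings become keyword fields.
        field["type"] = "keyword"
        field["index"] = True
        return field
    # Analyzed strings become text fields.
    field["type"] = "text"
    if "index_analyzer" in field:
        index_time = field.pop("index_analyzer")
        if "analyzer" in field:
            # With both present in ES 1, 'analyzer' was the search-time one.
            field["search_analyzer"] = field.pop("analyzer")
        field["analyzer"] = index_time
    return field


es1_fields = {
    "analyzed": {
        "type": "string",
        "analyzer": "dutch",
        "index_analyzer": "default_min_word_length_2",
    },
    "not_analyzed": {"type": "string", "index": "not_analyzed"},
}

es6_fields = {name: convert_field(definition) for name, definition in es1_fields.items()}
print(es6_fields)
```

A field that only specifies ‘analyzer’ is left alone apart from its type, since that setting keeps its name in ES 6.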

In ES 1, document ids could be taken from a field in the document by configuring a path:

"_id": {
    "path": "id"
},

The _id field is no longer configurable, so in order to have document ids in Elasticsearch match ids in the database, the id has to be set explicitly, or Elasticsearch will generate a random one.
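
As a rough sketch of what "setting the id explicitly" means at the REST level: in ES 6 the id is part of the request path, and a request without one gets a generated id. The Python below only builds the request, it does not send it. The function name, the index name "articles" and the type name "doc" are assumptions for illustration, not names from the project.

```python
import json
from urllib.parse import quote


def index_request(index, doc_type, row, id_field="id"):
    """Return (method, path, body) for indexing a database row in ES 6."""
    body = json.dumps(row)
    doc_id = row.get(id_field)
    if doc_id is None:
        # No id available: POST lets Elasticsearch generate one.
        return ("POST", f"/{quote(index)}/{quote(doc_type)}", body)
    # Explicit id: PUT to a path that ends in the database id.
    return ("PUT", f"/{quote(index)}/{quote(doc_type)}/{quote(str(doc_id), safe='')}", body)


# Hypothetical row, index and type names.
method, path, body = index_request("articles", "doc", {"id": 17, "title": "Netem"})
print(method, path)
```

Reindexing the same row twice with an explicit id overwrites the existing document, whereas a generated id would create a duplicate.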

All in all, it was roughly a day of work to get the project to compile and ready to run the unit tests. All of them were red.


Server side applications in Apple’s Swift

May 2nd, 2016 by Rik van den Ende
(https://blog.trifork.com/2016/05/02/server-side-applications-in-apples-swift/)

In 2014, Apple announced the release of Swift, a new programming language for all their platforms. Their programming language of choice on iOS and OS X has always been Objective-C, a language which is a bit dated (it predates C++), and as it has had new features (and syntaxes) bolted onto it every few years, it carries quite a bit of baggage. It seems I wasn’t the only one with this opinion, as the release of Swift was greeted with great enthusiasm, and the language has been adopted very rapidly.

Swift combines all the features that are fashionable in a general purpose language today, without the feeling that they were bolted on after the fact. While building an iOS client for our customer Gerimedica in Swift, I found myself wishing I could use this language on the server side as well as in the client. At WWDC 2015, Apple announced the intention to open source the language and release a Linux version, so it looked like it could become a reality. Since December 2015, the sources have been available on GitHub, and builds for OS X and Ubuntu are made available roughly twice per month.

PerfectLib

A number of groups and companies saw an opportunity to be among the first with something that was obviously going to be big. One of the first was PerfectSoft, a startup that aims to be the one big framework for all your server side development in Swift. They started building their framework as soon as the open source release of Swift was announced, and have been advertising their product everywhere. Because they started development before anyone outside Cupertino had a good idea what the release would look like, it only worked on OS X at first, and it didn’t use the Swift Package Manager, the intended default build and dependency management tool for Swift. At the time, the framework compiled to one big binary that you had to include in your build manually. They have a beautiful website and good documentation, but it just wasn’t working when I tried it. I intend to try this framework out again at a later date.

IBM

The biggest player (other than Apple) to openly jump on the Swift bandwagon is IBM. As soon as the open source release of Swift was announced, IBM announced the Swift Sandbox, their Swift based version of Google’s Go playground. It is a web based REPL that can be shared online by sharing a URL. Cool, but not extremely useful, as unlike Go, Swift already comes with a REPL. The real significance of this is not the Swift Sandbox itself, but the message that IBM is interested in this technology and intends to be involved. IBM isn’t the kind of company to back technologies just because they like them, so they either see an opportunity, or a potential strategic interest. At the moment, IBM’s Swift related activities seem to be associated with their PaaS solution BlueMix, so they are likely working on the Swift / IBM version of Google’s App Engine for Go. IBM offers its own web framework for Swift: Kitura. Kitura turns out to be less than trivial to install and for now somewhat bare bones, but as this is IBM, it is worth dedicating another blog post to it at a later date. Also check out their overview of the most popular, most active and most essential open source projects on GitHub for Swift.


Simulating a bad network for testing

July 3rd, 2012 by Rik van den Ende
(https://blog.trifork.com/2012/07/03/simulating-a-bad-network-for-testing/)

In a development environment, and often in the test and QA environments as well, we are thankfully blessed with a network that is for all intents and purposes infinitely fast, infinitely reliable and not shared with anyone else. Sometimes this causes you to miss a bug that only becomes apparent once your application has been released into the wild, where it has to deal with latency, packet loss and protocol violations.

To reproduce such bugs, it would be nice to have a network that is bad in a precisely controlled way. On a Linux machine, you can simulate one with netem. There is a wide range of possibilities with this tool, most of which are more useful to a network engineer than to a programmer or software tester, but I’ll give some simple examples, and demonstrate their effect with mtr.

First let’s take a look at the normal state of the network:

$ mtr -c 100 --report orange11.nl
HOST: cartman                    Loss%  Snt   Last   Avg  Best  Wrst StDev
1.|-- lobby                      0.0%   100    0.2   0.2   0.1   0.2   0.0
2.|-- backup1.orange11.nl        0.0%   100    2.4   4.0   2.0   9.1   1.7
3.|-- 10.0.0.30                  0.0%   100    5.0   4.0   2.2  10.4   1.6

That’s not too bad. Now we’ll simulate an average packet delay of 100ms with a variability of 50ms, and a packet loss of 5%:

$ sudo tc qdisc add dev eth0 root netem delay 100ms 50ms loss 5%
$ mtr -c 100 --report orange11.nl
HOST: cartman                    Loss%  Snt   Last   Avg  Best  Wrst StDev
1.|-- lobby                      8.0%   100  129.3  96.6  50.2 147.8  26.0
2.|-- backup1.orange11.nl        3.0%   100  120.1 103.9  54.4 157.5  27.8
3.|-- 10.0.0.30                  4.0%   100   90.3 103.4  53.9 154.3  29.3

Pretty much as we would expect, the best ping times are around 50ms, the worst around 150ms, with an average around 100ms. The packet loss is a bit more random than I expected, but it should average out around 5% if we left mtr running for much longer than 100 cycles.
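
The spread in the loss column is easier to accept with a little arithmetic. The sketch below, in Python, assumes that every packet is dropped independently with the configured probability (no loss correlation was configured in the netem command above), so the number of lost packets follows a binomial distribution. The function name is made up for illustration.

```python
import math


def loss_spread(packets, loss_probability):
    """Expected number of lost packets and its standard deviation (binomial model)."""
    expected = packets * loss_probability
    std_dev = math.sqrt(packets * loss_probability * (1 - loss_probability))
    return expected, std_dev


# 100 mtr cycles with a configured loss of 5%.
expected, std_dev = loss_spread(100, 0.05)
print(f"expected lost packets: {expected:.1f}, standard deviation: {std_dev:.2f}")
```

With a standard deviation of a bit over 2 packets on 100 probes, measurements of 3%, 4% and 8% are all within about one and a half standard deviations of the configured 5%. Because the standard deviation grows only with the square root of the number of probes, the measured percentage tightens around 5% as mtr runs for more cycles.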

I can recommend trying out whatever project you are working on now, with a packet delay of 500ms to see if strange things happen in a reasonable worst case. It is important to realize that this tool can only shape the traffic that we’re sending, not receiving, so if the networked application is running on a different server, only your uploads and ACK packets should be affected.

You don’t have to reboot to get your network back to normal:

$ sudo tc qdisc del dev eth0 root

A great deal more can be done to shape your network traffic for better or for worse, such as rate control, prioritizing one destination over another, or introducing packet corruption, duplication or reordering, but these are outside the scope of this post.

A nice tutorial with examples can be found at linuxfoundation.org, and if you are interested in reading more about the background of network traffic control in the Linux kernel, I can recommend the Linux Advanced Routing & Traffic Control HOWTO.

Let me know how you get on, won’t you?