Trifork Blog

Posts Tagged ‘Elasticsearch’

Refactoring from Elasticsearch version 1 with Java Transport client to version 6 with High Level REST client

February 27th, 2018 by
(https://blog.trifork.com/2018/02/27/refactoring-from-elasticsearch-version-1-with-java-transport-client-to-version-6-with-high-level-rest-client/)

Every long-running project accrues technical debt. The requirements may have evolved in a different direction from what was foreseen when the project was designed, or difficult infrastructure tasks may have been put off in favor of new functionality. From time to time, you need to refactor your code to clean up this technical debt. I recently finished such a refactoring task for a customer, so in the category ‘from the trenches’, I would like to share the story here.

Elasticsearch exposes both a REST interface and the internal Java API, via the binary transport client, for connecting with the search engine. Just over a year ago, Elastic announced that it plans to deprecate the transport client in favor of the high level REST client, “as soon as the REST client is feature complete and is mature enough to replace the Java API entirely”. The reasons for this are clearly explained in Luca Cavanna’s blog post, but the most important disadvantage is that by using the transport client you introduce a tight coupling between your application and the exact major and minor release of your ES cluster. As long as Elasticsearch exposes its internal API, it has to worry about breaking thousands of applications all over the world that depend on it.

The “as soon as…” timetable sounds somewhat vague and long term, but there may be good reasons to migrate your search functionality now, rather than later. In the case of our customer, their reason is wanting to use the AWS Elasticsearch service. The entire codebase is already running in AWS, and for the past few years they have been managing their own Elasticsearch cluster running in EC2 instances. This turns out to be labor intensive when updates have to be applied to these VMs. It would be easier and probably cheaper to let Amazon manage the cluster. As the AWS Elasticsearch service only exposes the REST API, the dependence on the transport protocol will have to be removed.

Action plan

The starting situation was a dependency on Elasticsearch 1.4.5, using the Java API. The goal was the most recent Elasticsearch version available in the Amazon Elasticsearch Service, which at the time was 6.0.2, using the REST API.

To reduce the complexity of the refactoring operation, we decided early on to reindex the data rather than trying to convert the indices. Every Elasticsearch release comes with a handy list of breaking changes. Looking through these lists, we compiled the breaking changes that would likely affect the search implementation of our customer. There are more potential breaking changes than listed here, but these are the ones that an initial investigation suggested might have an impact:

1.x – 2.x:

  • Facets replaced by aggregations
  • Field names can’t contain dots

2.x – 5.x:

5.x – 6.0:

  • Support for indices with multiple mapping types dropped

The plan was first to convert the existing code to work with ES 6, and only then migrate from the transport client to the High Level REST client.

Implementation 

The entire search functionality, originally written by our former colleague Frans Flippo, was exhaustively covered by unit and integration tests, so the first step was to update the Maven dependency to the current version, run the tests, and see what broke. First there were compilation errors that were easily fixed. Some examples:

Replace FilterBuilder with QueryBuilder, RangeFilterBuilder with RangeQueryBuilder, TermsFilterBuilder with TermsQueryBuilder, PercolateRequestBuilder with PercolateQueryBuilder, etc.; switch to HighlightBuilder for highlighters; replace ‘fields’ with ‘storedFields’. The count API was removed in version 5.5, and its use had to be replaced by executing a search with size 0. Facets had already been replaced by aggregations by our colleague Attila Houtkooper, so we didn’t have to worry about that.
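To illustrate the last point, here is a minimal sketch of a count expressed as a search with size 0. It is written in Python for brevity and only builds the request body; the project itself used the Java client. The function names and the `status` field are hypothetical, and the response shape assumes ES 6, where `hits.total` is a plain number.

```python
import json


def count_request_body(query):
    """Build a search body that returns only the hit count, no documents."""
    # size 0 tells Elasticsearch not to return any hits; the number of
    # matching documents is still reported in the response.
    return {"size": 0, "query": query}


def extract_count(response):
    """Read the total from an ES 6 style search response."""
    return response["hits"]["total"]


# Hypothetical query on a hypothetical field.
body = count_request_body({"term": {"status": "published"}})
print(json.dumps(body))
```

The body can be sent to the `_search` endpoint of an index with any HTTP client.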

In ES 5, the suggest API was removed, and became part of the search API. This turned out not to have an impact on our project, because the original developer of the search functionality implemented a custom suggestions service based on aggregation queries. It looks like he wanted the suggestions to be ordered by the number of occurrences in a ‘bucket’, which couldn’t be implemented using the suggest API at the time. We decided that refactoring this to use Elasticsearch suggesters would be new functionality, and outside the scope of this upgrade, so we would continue to use aggregations for now.
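The original suggestions service is not shown here, so the following is only a sketch of how suggestions ordered by occurrence could be built on a terms aggregation, which returns its buckets sorted by document count by default. The function names and the field `title.not_analyzed` are assumptions, as is the choice to filter by prefix with the aggregation's `include` pattern.

```python
def suggestion_request_body(field, prefix, max_suggestions=10):
    """Build a search body whose terms aggregation acts as a suggester."""
    # Assumption for this sketch: the prefix is plain letters and digits, so
    # it can be placed in the include pattern without any escaping.
    if not prefix.isalnum():
        raise ValueError("prefix must be alphanumeric in this sketch")
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "suggestions": {
                "terms": {
                    "field": field,
                    "include": prefix + ".*",
                    "size": max_suggestions,
                }
            }
        },
    }


def suggestions_from_response(response):
    """Return the bucket keys in the order Elasticsearch delivered them."""
    buckets = response["aggregations"]["suggestions"]["buckets"]
    return [bucket["key"] for bucket in buckets]


body = suggestion_request_body("title.not_analyzed", "ams")
```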

Some updates were required to the index mappings. The most obvious one was replacing ‘string’ with either ‘text’ or ‘keyword’. Analyzer became search_analyzer, while index_analyzer became analyzer.

Syntax ES 1:

"fields": {
    "analyzed": {
        "type": "string",
        "analyzer" : "dutch",
        "index_analyzer": "default_min_word_length_2"
    },
    "not_analyzed": {
        "type": "string",
        "index": "not_analyzed"
    }
}

Syntax ES 6:

"fields": {
  "analyzed": {
    "type": "text",
    "search_analyzer": "dutch",
    "analyzer": "default_min_word_length_2"
  },
  "not_analyzed": {
    "type": "keyword",
    "index": true
  }
}
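The renames above are mechanical, so they lend themselves to a small script. The sketch below (Python, with a hypothetical function name) handles only the two shapes shown in the example: an analyzed string and a not_analyzed string. Anything else would need its own rule.

```python
def convert_string_field(old):
    """Convert one ES 1 field mapping to its ES 6 form (sketch, limited cases)."""
    if old.get("type") != "string":
        return dict(old)  # non-string fields are left untouched here
    if old.get("index") == "not_analyzed":
        return {"type": "keyword", "index": True}
    new = {"type": "text"}
    if "index_analyzer" in old:
        # With index_analyzer present, the old 'analyzer' was the search-time one.
        if "analyzer" in old:
            new["search_analyzer"] = old["analyzer"]
        new["analyzer"] = old["index_analyzer"]
    elif "analyzer" in old:
        new["analyzer"] = old["analyzer"]
    return new


es1_fields = {
    "analyzed": {
        "type": "string",
        "analyzer": "dutch",
        "index_analyzer": "default_min_word_length_2",
    },
    "not_analyzed": {"type": "string", "index": "not_analyzed"},
}
es6_fields = {name: convert_string_field(m) for name, m in es1_fields.items()}
```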

Document ids were associated with a path:

"_id": {
    "path": "id"
},

The _id field is no longer configurable, so in order to have document ids in Elasticsearch match ids in the database, the id has to be set explicitly, or Elasticsearch will generate a random one.
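As a sketch of what setting the id explicitly can look like, the following Python snippet builds a bulk request body in which each action line carries the database id as `_id`. The index name, type name, and row contents are made up for the example, and the project's real indexing code (Java) is not shown in this post.

```python
import json


def bulk_index_lines(index, doc_type, rows, id_field="id"):
    """Build a newline-delimited bulk body with explicit document ids."""
    lines = []
    for row in rows:
        action = {
            "index": {
                "_index": index,
                "_type": doc_type,
                # Reuse the database id so both stores agree on identity.
                "_id": str(row[id_field]),
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    # The bulk API expects every line, including the last, to end in a newline.
    return "\n".join(lines) + "\n"


rows = [{"id": 17, "title": "first"}, {"id": 18, "title": "second"}]
payload = bulk_index_lines("products", "doc", rows)  # hypothetical names
```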

All in all, it was roughly a day of work to get the project to compile and ready to run the unit tests. All of them were red.


Smart energy consumption insights with Elasticsearch and Machine Learning

August 21st, 2017 by
(https://blog.trifork.com/2017/08/21/smart-energy-consumption-insights-with-elasticsearch-and-machine-learning/)

At home we have a Youless device which can be used to measure energy consumption. You have to mount it to your energy meter so it can monitor energy consumption. The device then provides energy consumption data via a RESTful api. We can use this api to index energy consumption data into Elasticsearch every minute and then gather energy consumption insights by using Kibana and X-Pack Machine Learning.

The goal of this blog is to give a practical guide on how to set up and understand X-Pack Machine Learning, so you can use it in your own projects! After completing this guide, you will have the following up and running:

  • A complete data pre-processing and ingestion pipeline, based on:
    • Elasticsearch 5.4.0 with ingest node;
    • Httpbeat 3.0.0.
  • An energy consumption dashboard with visualizations, based on:
    • Kibana 5.4.0.
  • Smart energy consumption insights with anomaly detection, based on:
    • Elasticsearch X-Pack Machine Learning.

The following diagram gives an architectural overview of how all components are related to each other:


Handling a massive amount of product variations with Elasticsearch

December 22nd, 2016 by
(https://blog.trifork.com/2016/12/22/handling-a-massive-amount-of-product-variations-with-elasticsearch/)

In this blog we will review different techniques for modelling data structures in Elasticsearch. A project case is used to describe our approach to handling a small product data set together with a large data set of related product variations. Furthermore, we will show how certain modelling decisions resulted in a query performance gain of a factor of 1000!

The flat world

Elasticsearch is a great product if you want to index and search through a large number of documents. Functionality like term and range queries, full-text search and aggregations on large data sets are very fast and powerful. But Elasticsearch prefers to treat the world as if it were flat. This means that an index is a flat collection of documents. Furthermore, when searching, a single document should contain all of the information that is required to decide whether it matches the search request.
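One common way to satisfy the "single document contains everything" rule is to denormalize: copy the parent's fields into every child document before indexing. The sketch below shows the idea in Python with made-up field names. It illustrates the flat model only and is not necessarily the approach chosen in the project described in this post.

```python
def denormalize(product, variations):
    """Return one flat document per variation, with product fields copied in."""
    docs = []
    for variation in variations:
        # Prefix the parent's fields so they cannot clash with variation fields.
        doc = {"product_" + key: value for key, value in product.items()}
        doc.update(variation)
        docs.append(doc)
    return docs


# Hypothetical data, only to show the shape of the result.
product = {"id": 1, "name": "hotel"}
variations = [
    {"variation_id": "a", "price": 100},
    {"variation_id": "b", "price": 120},
]
flat_docs = denormalize(product, variations)
```

The trade-off is visible even in this small example: every product attribute is duplicated in each variation, so a change to one product means rewriting all of its variation documents.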

In practice, however, domains often are not flat and contain a number of entities which are related to each other. These can be difficult to model in Elasticsearch in such a way that the following conditions are met:

  • Multiple entities can be aggregated from a single query;
  • Query performance is stable with low response times;
  • Large numbers of documents can easily be mutated or removed.

The project case

This blog is based on a project case. In the project, two data sets were used. The data sets have the following characteristics:

  • Products:
    • Number of documents: ~ 75000;
    • Document characteristics: A product contains a set of fields which contain the primary information of a product;
    • Mutation frequency: Updates on product attributes can occur fairly often (e.g. every 15 minutes).
  • Product variations:
    • Number of documents: ~ 500 million;
    • Document characteristics: A product variation consists of a set of additional attributes which contain extra information on top of the corresponding product. The number of product variations per product varies a lot, and can go up to 50000;
    • Mutation frequency: During the day, there is a continuous stream of updates and new product variations.


Collecting data from a private LoRaWAN sensor network into Elastic

May 20th, 2016 by
(https://blog.trifork.com/2016/05/20/collecting-data-from-a-private-lorawan-sensor-network-into-elastic/)

Introduction to LoRaWAN and ELK

Why LoRaWAN, and what makes it different from other low-power, long-range wireless protocols like ZigBee, Z-Wave, etc.?

LoRa is a wireless modulation for long-range, low-power, low-data-rate applications developed by Semtech. The main features of this technology are the large number of devices that can connect to one network and the relatively large range that can be covered with one LoRa router. One gateway can coordinate around 20,000 nodes in a range of 10–30 km. It’s a very flexible protocol and allows developers to build various types of network architectures according to the demands of the client. A general description of the LoRaWAN protocol, together with a small tutorial, is available in my previous post.

What is the ELK stack, and why use it with LoRaWAN?

In the figure above, you can see a simplified model of what a typical LoRaWAN network looks like.
As you can see, the data from the LoRa endpoints has to go through several devices before it reaches the back-end application. Nowadays there are a lot of tools that allow us to gather and manipulate the data. A very good solution is the ELK stack, which consists of Elasticsearch, Logstash and Kibana; these three tools allow you to gather, store and analyze large amounts of data. More information and details can be found on the official website: https://www.elastic.co/.


Dealing with NodeNotAvailableExceptions in Elasticsearch

April 8th, 2015 by
(https://blog.trifork.com/2015/04/08/dealing-with-nodenotavailableexceptions-in-elasticsearch/)

tl;dr

Elasticsearch provides distributed search with minimal setup and configuration. Now the nice thing about it is that, most of the time, you don’t need to be particularly concerned about how it does what it does. You give it some parameters – “I want 3 nodes”, “I want 3 shards”, “I want every shard to be replicated so it’s on at least two nodes”, and Elasticsearch figures out how to move stuff around so you get the situation you asked for. If a node becomes unreachable, Elasticsearch tries to keep things going, and when the lost node appears and rejoins, the administration is updated so everything is hunky-dory again.
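As a small illustration of "giving it some parameters", the sketch below turns the wishes quoted above into the settings body used when creating an index. It is written in Python with a hypothetical function name. The number of nodes is not an index setting, and "on at least two nodes" is read here as two copies of each shard, i.e. one replica.

```python
import json


def index_settings(shards, copies_per_shard):
    """Build an index-creation body from a shard count and copies per shard."""
    if shards < 1 or copies_per_shard < 1:
        raise ValueError("need at least one shard and one copy")
    return {
        "settings": {
            "number_of_shards": shards,
            # Replicas are counted in addition to the primary copy.
            "number_of_replicas": copies_per_shard - 1,
        }
    }


settings_body = index_settings(shards=3, copies_per_shard=2)
print(json.dumps(settings_body))
```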

The problem is when things don’t work the way you expect…

Computer says “no node available”


Server-side clustering of geo-points on a map using Elasticsearch – continued

March 26th, 2014 by
(https://blog.trifork.com/2014/03/26/server-side-clustering-of-geo-points-on-a-map-using-elasticsearch-continued/)

In a previous post I described a problem of data visualization and a possible solution provided by a plugin of elasticsearch. I noticed that elasticsearch might one day evolve to make the plugin unnecessary. That day seems to have come: starting from version 1.0.0, elasticsearch includes Aggregations, a new API for data mining. In this post I’ll show you how to use aggregations to reproduce the functionality of the plugin.
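The excerpt does not say which aggregation the post uses, so the following is a sketch under the assumption that a `geohash_grid` aggregation does the clustering: it groups geo-points into grid cells and reports a document count per cell. The function names and the `location` field are hypothetical, and the code is Python that only builds the request body and reads a response.

```python
def geo_cluster_request_body(field, precision):
    """Build a search body that buckets geo-points into geohash cells."""
    if not 1 <= precision <= 12:
        raise ValueError("geohash precision must be between 1 and 12")
    return {
        "size": 0,  # only the buckets are needed, not the individual points
        "aggs": {
            "clusters": {
                "geohash_grid": {"field": field, "precision": precision}
            }
        },
    }


def clusters_from_response(response):
    """Return (geohash cell, number of points) pairs from the response."""
    buckets = response["aggregations"]["clusters"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


body = geo_cluster_request_body("location", 5)
```

A lower precision gives larger cells, so the precision would typically follow the zoom level of the map.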


Summer time…

September 4th, 2012 by
(https://blog.trifork.com/2012/09/04/summer-time/)

For those of you who may have missed our newsletter last week, I’d like to take this opportunity to give you a quick lowdown of what we’ve been up to. The summer months have been far from quiet and I’m pretty excited to share lots of news on projects, products & upcoming events in this month’s edition.

Hippo & Orange11

The countdown has begun for the launch of The University of Amsterdam online platform. Built by Orange11 with Hippo CMS and developed with multi-platform functionality in mind, the website is a masterpiece of technology all woven together. We’ll keep you posted about the tips & tricks we implemented.

If you can’t wait until then and want more information, contact Jan-Willem van Roekel.

Mobile Apps; just part of the service!

We mentioned in our last newsletter the launch of Learn to write with Tracy. Since then we’ve been working on apps for many customers, including The New Motion, a company dedicated to the use of Electric Vehicles. Orange11 has developed an iPhone app that allows users to view or search charging locations (in list or map form) and even check their real-time availability.

Another example is the app for GeriMedica: Ysis Mobiel, a mobile addition to their existing Electronic Health Record database used largely in geriatric care (also an Orange11 project). The mobile app supports specific work processes and allows registered users to document (in line with the strict regulations) all patient-related interaction through a simple 3-step logging process. A registration overview screen also shows the latest activities registered, which prevents co-workers from accidentally registering the same activity twice.

Visit our website for more on our mobile expertise.

Orange11 & MongoDB 

We’ve got tons of exciting things going on with MongoDB as a trusted implementation partner, so here are a few highlights:

Brown bag sessions

Since the launch of our brown bag sessions we’re excited that so many companies are interested in finding out more about this innovative open source document database. What we offer is a 60-minute slot with an Orange11 & MongoDB expert, who can demonstrate MongoDB best practices & cover how it can be used in practice. It’s our sneak preview of the host of opportunities there are with MongoDB.

Sign up now!

Tech meeting / User Group Meeting

As a partner we’re also proud to host the next user group session on Thursday 6th September, where Chief Technical Director Alvin Richards will be here to cover all the product ins & outs and share some use cases.

Don’t miss out & join us; as always it’s free, with pizza and cold beer on the house!

Coffee, Cookies, Conversation & Customers

Last week we invited some of our customers to a brainstorm session around the new Cookie Law in the Netherlands. Together with Eric Verhelst, a lawyer specialized in the IT industry, Intellectual Property, Internet and Privacy, we provided our customers with legal insight and discussed their concerns & ideas around solutions. If you have any questions around the new cookie law and are looking for advice, answers & solutions, contact Peter Meijer.

ElasticSearch has just got bigger

Congratulations to our former CEO, Steven Schuurman, who announced his new venture: ElasticSearch, the company. The company’s product, “elasticsearch”, is an innovative and advanced open source distributed search engine. With Steven, whose background includes co-founding SpringSource, the company behind the popular Spring Framework (also close to our heart at Orange11), joining forces with elasticsearch founder & originator Shay Banon, it’s bound to be a great success. The company offers users and potential users of elasticsearch a definitive source for support, education and guidance with respect to developing, deploying and running elasticsearch in production environments. As Search remains a key focus area for Orange11, with our experience in both Solr and elasticsearch, our customers are guaranteed the best search solution available. For more info contact Bram Smeets.

Our team is getting bigger & better

We’re happy to welcome Michel Vermeulen to the team this month. Michel is an experienced Project Manager and will further professionalize our agile development organization. We also have new talent starting next month, BUT there is room for more.

So if you’re a developer and wanna work on great projects with a fun team (left: snapshot from our company beach event), then call Bram Smeets now.

That’s all for now folks….