Trifork Blog

Using Axon with PostgreSQL without TOAST

The client I currently work for uses Axon 3. The events are stored in a PostgreSQL database. PostgreSQL uses a technique called TOAST (The Oversized-Attribute Storage Technique) to store large values.

From the PostgreSQL documentation:

“PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows”

As it happens, in our setup, which uses JPA (Hibernate) to store events, the DomainEventEntry entity has a @Lob annotation on its payload and metaData fields (inherited from the AbstractEventEntry class):
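
A simplified sketch of that mapping (the real Axon class has a few more columns and annotations, and details differ per Axon version) looks roughly like this:

@MappedSuperclass
public abstract class AbstractEventEntry<T> {

    // identifier, sequence number, timestamp and type columns omitted

    @Basic(optional = false)
    @Lob
    private T payload;

    @Basic
    @Lob
    private T metaData;
}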

For PostgreSQL this will result in events that are not easily readable:

SELECT payload FROM domainevententry;

| payload |
| 24153   |

The data type of the payload column of the domainevententry table is OID.

The PostgreSQL JDBC driver obviously knows how to deal with this; the actual content is deTOASTed lazily. Using PL/pgSQL it is possible to export a value to a file, but that has to be done value by value. When you are debugging your application and just want a quick look at its events, this is not a fun route to take.

So we wanted to change the data type in our database to something more readable, for example BYTEA: still able to hold large values, yet easy to inspect. As it turned out, a couple of changes are needed to get this working.
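
To give an idea of the kind of change involved: a common way to make Hibernate map @Lob fields to BYTEA instead of OID is a custom PostgreSQL dialect, registered via the hibernate.dialect property. The sketch below illustrates that idea; it is only one piece of the puzzle (existing columns still need to be migrated) and not necessarily the exact solution described in the full post.

import java.sql.Types;

import org.hibernate.dialect.PostgreSQL92Dialect;
import org.hibernate.type.descriptor.sql.BinaryTypeDescriptor;
import org.hibernate.type.descriptor.sql.SqlTypeDescriptor;

// Illustrative dialect that creates BYTEA columns for @Lob byte[] fields and
// reads/writes them as plain binary instead of via the large object API.
public class NoToastPostgresDialect extends PostgreSQL92Dialect {

    public NoToastPostgresDialect() {
        registerColumnType(Types.BLOB, "bytea");
    }

    @Override
    public SqlTypeDescriptor remapSqlTypeDescriptor(SqlTypeDescriptor sqlTypeDescriptor) {
        if (sqlTypeDescriptor.getSqlType() == Types.BLOB) {
            return BinaryTypeDescriptor.INSTANCE;
        }
        return super.remapSqlTypeDescriptor(sqlTypeDescriptor);
    }
}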

It took me a while to get all the pieces together. Although the solution I present here works for us, it may not be the most elegant or even the best solution for everyone.
Read the rest of this entry »

Posted in: Java

Kibana Histogram on Day of Week

I keep track of my daily commutes to and from the office. One thing I want to know is how the different days of the week affect my travel duration. But when indexing all my commutes into Elasticsearch, I cannot (out of the box) create a histogram on the day of the week. My first visualization looks like this:

Read the rest of this entry »

Posted in: Elasticsearch

Smart energy consumption insights with Elasticsearch and Machine Learning

At home we have a Youless device that measures energy consumption. You mount it on your energy meter, and it then exposes the readings via a RESTful API. We can use this API to index a consumption reading into Elasticsearch every minute and then gather insights by using Kibana and X-Pack Machine Learning.

The goal of this blog is to give a practical guide on how to set up and understand X-Pack Machine Learning, so you can use it in your own projects! After completing this guide, you will have the following up and running:

  • A complete data pre-processing and ingestion pipeline, based on:
    • Elasticsearch 5.4.0 with ingest node;
    • Httpbeat 3.0.0.
  • An energy consumption dashboard with visualizations, based on:
    • Kibana 5.4.0.
  • Smart energy consumption insights with anomaly detection, based on:
    • Elasticsearch X-Pack Machine Learning.

The following diagram gives an architectural overview of how all components are related to each other:

Read the rest of this entry »

Posted in: Docker | Elasticsearch | Machine Learning

Heterogeneous microservices

The microservices architecture is increasingly popular nowadays. One of its promises is flexibility and easier working in larger organizations, achieved by reducing the amount of communication and coordination between teams. The thinking is that teams have their own service(s) and don’t depend on other teams, meaning they can work independently, thereby reducing coordination efforts.

Especially with multiple teams and multiple services per team, this can mean there are quite a few services with quite different usage. Different teams can have different technology preferences, for example because they are more familiar with one technology than with another. Similarly, different usage can mean quite different requirements, which might be easier to fulfill with one technology than with another.

The question I’m going to discuss in this blog post is: how free or constrained should technology choices be in such an environment?

Read the rest of this entry »

Posted in: Custom Development

Interview with Sam Newman, author of Building Microservices

After living in Australia for the last five years, the Londoner and author of the well-received Building Microservices has returned home to focus on his business as an independent consultant. We caught up via Skype to discuss his upcoming visit to Amsterdam and the tech trends he is keeping an eye on.

Reading time: Less than 5 minutes

Read the rest of this entry »

Posted in: Knowledge | Newsletter | Training

How to send your Spring Batch Job log messages to a separate file

In one of my current projects we’re developing a web application which also has a couple of dozen batch jobs that perform all sorts of tasks at particular times. These jobs produce quite a bit of logging output when they run, which is important for seeing exactly what happened during a job. What we noticed, however, is that while a batch job is running, its logging makes it hard to quickly spot the other logging performed by the application. In addition, it wasn’t always clear in the context of which job a log statement was issued.
To address these issues I came up with a simple solution based on Logback Filters, which I’ll describe in this blog.

Logback Appenders

We’re using Logback as a logging framework. Logback defines the concept of appenders: appenders are responsible for handling the actual log messages emitted by the loggers in the application by writing them to the console, to a file, to a socket, etc.
Many applications define one or more appenders and then simply list them all in the root logger section of their logback.xml configuration file:

<configuration scan="true">

  <appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
    <destination>logstash-server</destination>
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>

  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>log/server.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>log/server.%d{yyyy-MM-dd}.log</fileNamePattern>
      <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %mdc %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="info">
    <appender-ref ref="LOGSTASH"/>
    <appender-ref ref="FILE"/>
  </root>

</configuration>

This setup will send all log messages to both of the configured appenders.

Read the rest of this entry »
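
As a taste of the filter-based solution: a Logback filter that separates batch-job logging can decide based on an MDC entry that is set while a job runs. The sketch below assumes a hypothetical jobName MDC key; the full post may use a different mechanism.

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.filter.Filter;
import ch.qos.logback.core.spi.FilterReply;

// Lets only events through that were logged from within a batch job,
// recognized here by a (hypothetical) "jobName" entry in the MDC.
public class BatchJobLogFilter extends Filter<ILoggingEvent> {

    @Override
    public FilterReply decide(ILoggingEvent event) {
        if (event.getMDCPropertyMap().containsKey("jobName")) {
            return FilterReply.ACCEPT;
        }
        return FilterReply.DENY;
    }
}

Attached to a dedicated batch-log appender via a <filter> element, a filter like this lets only batch output through; the inverse check (DENY when the key is present) keeps the batch noise out of the regular server.log.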

Posted in: DevOps | From The Trenches | Java

Machine Learning: Predicting house prices

Recently I followed an online course on machine learning to better understand the current hype. As with any subject, though, only practice makes perfect, so I was looking to apply this new knowledge.

While looking to sell my house I found a nice opportunity to do just that: check whether the prices real estate agents estimate are in line with what the data suggests.

Linear regression should be a nice algorithm here: it tries to find the best linear prediction (y = a + b·x1 + c·x2, where y is the prediction and x1, x2 are the variables). So, for example, this algorithm can estimate a price per square meter of floor space or a price per square meter of garden. For a more detailed explanation, check out the Wikipedia page.
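
As a toy illustration of what fitting such a model amounts to (using Apache Commons Math here; the numbers are made up and only serve to show the mechanics):

import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

// Fits y = a + b*x1 + c*x2 with ordinary least squares on made-up data.
public class HousePriceRegression {

    public static void main(String[] args) {
        double[] prices = { 250_000, 300_000, 410_000, 380_000 };   // y: advertised price
        double[][] features = {                                     // x1: floor space (m2), x2: lot size (m2)
            { 90, 120 },
            { 110, 150 },
            { 140, 300 },
            { 130, 220 }
        };

        OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
        ols.newSampleData(prices, features);

        double[] beta = ols.estimateRegressionParameters();         // [a, b, c]
        System.out.printf("intercept=%.0f, per m2 floor space=%.0f, per m2 lot=%.0f%n",
                beta[0], beta[1], beta[2]);
    }
}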

In the Netherlands, funda is the main website for selling your house, so I started by collecting some data: I used the 50 houses closest to my house. I excluded apartments to limit the data to properties similar to mine. For each house I collected the advertised price, usable floor space, lot size, number of (bed)rooms, type of house (row house, corner house, or detached) and year of construction (..-1930, 1931-1940, 1941-1950, 1950-1960, etc.). These are the (easily available) variables I expected to influence the house price the most. Type of house is a categorical variable; to use it in regression I modeled it as several binary (0/1) variables.

As preparation, I checked for relations between the variables using correlation. This showed me that much of the collected data does not seem to affect the price: only the floor space, lot size and number of rooms showed a significant correlation with the house price.

For the regression analysis I only used the variables that had a significant correlation. Variables without correlation would not produce meaningful results anyway.

Read the rest of this entry »

Posted in: Custom Development | General | Machine Learning

Simulating an Elasticsearch Ingest Node pipeline

Indexing documents into your cluster can be done in a couple of ways:

  • using Logstash to read your source and send documents to your cluster;
  • using Filebeat to read a log file, send documents to Kafka, let Logstash connect to Kafka and transform the log event and then send those documents to your cluster;
  • using curl and the Bulk API to index a pre-formatted file;
  • using the Java Transport Client from within a custom application;
  • and many more…

Before version 5, however, there were only two ways to transform your source data into the document you wanted to index: using Logstash filters, or doing it yourself.

Elasticsearch 5 introduced the concept of the Ingest Node: just a node in your cluster like any other, but with the ability to run a pipeline of processors that can modify incoming documents. The most frequently used Logstash filters have been implemented as processors.

For me, the best part of pipelines is that you can simulate them. Especially in Console, simulating your pipelines makes creating them very fast; the feedback loop while testing a pipeline is very short. This makes pipelines a very convenient way to index data.
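
In Console this is a single POST to the _ingest/pipeline/_simulate endpoint. The same call from Java, using the low-level REST client against a cluster assumed to run on localhost:9200, might look roughly like this (with a trivial uppercase processor as the pipeline):

import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Simulates an inline pipeline against one test document, without indexing anything.
public class SimulatePipeline {

    public static void main(String[] args) throws Exception {
        String body = "{"
                + "\"pipeline\": { \"processors\": [ { \"uppercase\": { \"field\": \"message\" } } ] },"
                + "\"docs\": [ { \"_source\": { \"message\": \"hello ingest node\" } } ]"
                + "}";

        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Response response = client.performRequest(
                    "POST", "/_ingest/pipeline/_simulate",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(body, ContentType.APPLICATION_JSON));
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}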

Read the rest of this entry »

Posted in: Custom Development | Elasticsearch

Public Elasticsearch clusters are being held ransom

Last week several news sites and researchers reported that Elasticsearch clusters that are connected to the internet without proper security are being held ransom.

You can use shodan.io to search for Elasticsearch clusters: https://www.shodan.io/search?query=port%3A9200+json&language=en.

The first hit is actually a cluster that is ‘infected’:

There are some secured clusters as well:

But the default ‘root’ account with username “elastic” and password “changeme” (docs) will grant access. So not much security here… But at least your data is still there. For now.

Please do not connect your cluster to the internet without securing it. Use X-Pack Security for authentication and authorization.

Elastic Cloud could also be an option for you; there, security is enabled by default.

Posted in: Elasticsearch

Handling a massive amount of product variations with Elasticsearch

In this blog we will review different techniques for modelling data structures in Elasticsearch. A project case is used to describe our approach to handling a small product data set combined with a large data set of related product variations. Furthermore, we will show how certain modelling decisions resulted in a 1000-fold query performance gain!

The flat world

Elasticsearch is a great product if you want to index and search through a large number of documents. Functionality like term and range queries, full-text search and aggregations on large data sets are very fast and powerful. But Elasticsearch prefers to treat the world as if it were flat. This means that an index is a flat collection of documents. Furthermore, when searching, a single document should contain all of the information that is required to decide whether it matches the search request.

In practice, however, domains often are not flat and contain a number of entities which are related to each other. These can be difficult to model in Elasticsearch in such a way that the following conditions are met:

  • Multiple entities can be aggregated from a single query;
  • Query performance is stable with low response times;
  • Large numbers of documents can easily be mutated or removed.

The project case

This blog is based on a project case. In the project, two data sets were used. The data sets have the following characteristics:

  • Products:
    • Number of documents: ~ 75000;
    • Document characteristics: A product contains a set of fields which contains the primary information of a product;
    • Mutation frequency: Updates on product attributes can occur fairly often (e.g. every 15 minutes).
  • Product variations:
    • Number of documents: ~ 500 million;
    • Document characteristics: A product variation consists of a set of additional attributes which contain extra information on top of the corresponding product. The number of product variations per product varies a lot, and can go up to 50000;
    • Mutation frequency: During the day, there is a continuous stream of updates and new product variations.

Read the rest of this entry »

Posted in: Elasticsearch