Trifork Blog

Category ‘Software Development’

Spring Data Native Queries and Projections in Kotlin

August 28th, 2018 by
(https://blog.trifork.com/2018/08/28/spring-data-native-queries-and-projections-in-kotlin/)

Kotlin, Spring Boot and JPA

This blog describes a solution for mapping native queries to objects. This is useful because sometimes you want to use a feature of the underlying database implementation (such as PostgreSQL) that is not part of the JPQL standard. By the end of this blog you should be able to confidently use native queries and consume their results in a type-safe way.

In creating great applications based on Machine Learning solutions, we often come across uses for frameworks and databases that aren’t exactly standard. We sometimes need to build functionality that is either so new or so specific that it hasn’t been adopted into JPA implementations yet.

Working on a project with Spring Data is usually simple, albeit somewhat opaque. Write a repository, annotate methods with the @Query annotation and presto! You have mapped your database entities to Kotlin objects. Especially since Spring Framework 5, many of the interoperability issues (such as nullable values that are never null) have been alleviated.

Confucius wrote “Real knowledge is to know the extent of one’s ignorance”. So, to gauge the extent of our ignorance, let’s have a look at what happens when we cannot use the JPA abstraction layer in full and instead need to work with native queries.

Setting up the entity

When you use non-JPA features of the underlying database store, things can become complex.
Let’s say we have the following PostgreSQL table for storing people:

CREATE TABLE person (
  id BIGSERIAL NOT NULL UNIQUE PRIMARY KEY,
  first_name VARCHAR(20),
  last_name VARCHAR(20)
);

Given we represent an individual person like this:

import javax.persistence.Entity
import javax.persistence.GeneratedValue
import javax.persistence.Id
import javax.persistence.Table
@Entity
@Table(name = "person")
class PersonEntity {
  @Id
  @GeneratedValue
  var id: Long? = null
  var firstName: String? = null
  var lastName: String? = null
}

We can access that using a Repository:

import org.springframework.data.jpa.repository.JpaRepository
import org.springframework.stereotype.Repository
@Repository interface PersonRepo : JpaRepository<PersonEntity, Long>

We could now implement a custom query on the repository as follows:

import org.springframework.data.jpa.repository.Query
import org.springframework.data.repository.query.Param

@Repository interface PersonRepo : JpaRepository<PersonEntity, Long> {

  @Query("FROM PersonEntity WHERE firstName = :firstName")
  fun findAllByFirstName(@Param("firstName") firstName: String):
    List<PersonEntity>
}

So far so good. It uses JPQL syntax to form database-agnostic queries, which is nice because we get some validation of these queries when the application starts, plus the added benefit that the syntax is independent of the underlying database.

Adding a native query

Sometimes however, we want to use syntax that is specific to the database that we are using. We can do that by adding the boolean nativeQuery attribute to the @Query annotation and using Postgres’ SQL instead of JPQL:

  @Query("SELECT first_name, random() AS luckyNumber FROM person",
    nativeQuery = true)
  fun getPersonsLuckyNumber(): LuckyNumberProjection?

Obviously this example is kept simple for the sake of this context; more practical applications lie in using the extra data types that Postgres offers, such as the cube data type for storing matrices.

You may be, as I was at first, tempted to write a class for LuckyNumberProjection.

class LuckyNumberProjection {
  var firstName: String? = null
  var luckyNumber: Float? = null
}

You will then run into the following error:

org.springframework.core.convert.ConverterNotFoundException: No converter found
capable of converting from type
[org.springframework.data.jpa.repository.query.AbstractJpaQuery$TupleConverter$TupleBackedMap]
to type
[com.trifork.machinelearning.PersonRepo$LuckyNumberProjection]

The accompanying stack trace points in the direction of converters, which suggests you need to add one. That seems like it shouldn’t be this hard. Luckily for us, it turns out it isn’t!

It turns out that, contrary to Entities, Projections, like Repositories, are expected to be interfaces. So let’s do that instead:

interface LuckyNumberProjection {
  val firstName: String?
  val luckyNumber: Float
}
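
With the interface in place, the repository method from earlier can return the projection directly and Spring Data backs it with a proxy over the query result. As a minimal usage sketch, assuming the repository and query shown above are wired up:

val luckyNumber: LuckyNumberProjection? = personRepo.getPersonsLuckyNumber()
luckyNumber?.let {
  // The projection accessors map onto the selected columns/aliases
  println("${it.firstName} drew lucky number ${it.luckyNumber}")
}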

This should set you straight next time you want to get custom objects mapped out of your JPA queries.

At Trifork Amsterdam, we are currently doing multiple projects in Kotlin, using frameworks such as Spring Boot, Axon Framework and Project Reactor on top of Kubernetes clusters with Helm, to build small and smart microservices. More and more of those microservices contain our Machine Learning based solutions. These are in a variety of areas, ranging from natural language processing (NLP) to time-series analysis and clustering data for recommender systems and predictive monitoring.

Integrating the AWS Parameter Store with Spring Cloud

July 20th, 2018 by
(https://blog.trifork.com/2018/07/20/integrating-the-aws-parameter-store-with-spring-cloud/)

I’ll tell you all my secrets (but I lie about my past)
— Tom Waits – Tango till they’re sore

tl;dr

We’ve integrated the AWS Parameter Store with Spring Cloud so that it can be used as a secure configuration backend for services deployed to EC2, including ECS. This code has recently been merged in Spring Cloud AWS and is available in its 2.0 release.

Introduction

At the moment I’m working on a project where we’re developing a microservices-based system based on Spring Cloud for the Dutch Lotteries. The services are deployed on Amazon Web Services using Amazon’s current Docker support (ECS).

When we started late last year, we decided to use Consul, both as a service registry and as a key-value store for configuration. Spring Cloud has excellent built-in integration with Consul, both for service discovery as well as for using it as a shared configuration backend.

However, we quickly found out that we needed an internal load balancer to allow ECS to perform health checks on the services, so we might as well use that for server-side load balancing. This eliminated the need for client-side routing and service discovery. Furthermore, we weren’t too happy with the options to easily restrict access to secrets stored as config in Consul and were looking for a configuration service provided by AWS (rather than e.g. Vault) so that we’d no longer need to operate our own Consul cluster or other middleware.

AWS Parameter Store

When we looked for alternative solutions we soon found the AWS Parameter Store: it’s an option provided by EC2 to store all sorts of configuration parameters, including secrets that are encrypted at rest. Using IAM roles you can restrict access to parameters, which can have nested paths that can be used to define ACL-like access constraints. It also integrates with ECS quite nicely, by allowing containers to retrieve credentials to access the store, and provides versioning of parameter values.
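
To give an idea of what the store offers, here is a minimal sketch of reading a parameter directly with the AWS SDK for Java, which is roughly what a configuration integration has to do behind the scenes; the parameter path is made up for illustration:

import com.amazonaws.services.simplesystemsmanagement.AWSSimpleSystemsManagementClientBuilder
import com.amazonaws.services.simplesystemsmanagement.model.GetParameterRequest

fun readDbPassword(): String {
    // Credentials come from the usual AWS provider chain (EC2/ECS instance role, env vars, ...)
    val ssm = AWSSimpleSystemsManagementClientBuilder.defaultClient()

    // Hypothetical parameter path; withWithDecryption(true) decrypts SecureString values via KMS
    val result = ssm.getParameter(
        GetParameterRequest()
            .withName("/config/orders-service/db.password")
            .withWithDecryption(true)
    )
    return result.parameter.value
}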

This screenshot provides an impression of the corresponding console:

However, when looking for integration with Spring Cloud I just found some open tickets, so I decided to try to develop some integration myself. This blog post describes the result of that effort.

Read the rest of this entry »

Exposing asynchronous communication through a synchronous REST API with Spring 5

April 13th, 2018 by
(https://blog.trifork.com/2018/04/13/exposing-asynchronous-communication-through-a-synchronous-rest-api-with-spring-5/)

On my current project, we opted not to use REST for the communication between our services. Instead we make use of AxonIQ’s AxonHub, which acts as a specialized message broker. Messages can be of three types:

  • Command – You want to change something
  • Event – You want to inform others of something that happened
  • Query – You want to know something

The communication is asynchronous and we also have to deal with eventual consistency. If we were to create an order by sending a CreateOrderCommand, this would result in various events which update the state of the Order. We then need to send a query, of which we also receive the result asynchronously.

Our web and mobile frontend communicate with a microservices backend through a REST API. When the user creates an order (clicks the ‘buy’ button), they expect a result immediately. For both frontends, an API which models this user experience closely makes the most sense. This means that we want a more synchronous API, where we send a ‘create order’ request and immediately receive a response.

We implemented this using a small REST facade which translates our asynchronous communication in the backend to the synchronous communication for the frontend. This is a good use case for Spring 5’s reactive Webflux and Project Reactor. Using Project Reactor’s reactive API makes it possible to combine multiple asynchronous calls and operate on their result. Webflux handles the conversion of the reactive types (Mono, Flux) to REST responses. It optimizes the use of threads; by writing non-blocking code, we can reuse threads between asynchronous calls for handling other requests.

Diagram 1 gives us an overview of this approach.

Diagram 1

 

Implementation

Let’s have a more detailed look at the code for the create order example. Listing 1 shows the (slightly) simplified implementation of our REST controller method.

@PostMapping
public Mono<ResponseEntity<OrderResponse>> createOrder(CreateOrderRequest request) {
 CreateOrderCommand command = CreateOrderCommand.fromRequest(request); //1

 return this.commandGateway.send(command) //2
  .flatMap(id -> queryGateway.send(new FindOrderSummaryQuery(id)) //3
   .retryWhen(errors -> errors.delayElements(Duration.of(100, MILLIS)) //4
    .take(10)).concatWith(Mono.error(new RuntimeException())).next() //5
   .onErrorReturn(new OrderResponse(id, OrderStatus.CREATED))) //6
  .map(orderResponse -> ResponseEntity.ok().body(orderResponse)); //7
}

Listing 1

We first create a command out of the REST request (line 1). A command is a message with the specific intent to change something in our domain. In this specific case, we want to create a new order.

After creating the command, the two asynchronous calls we make are:

Mono<String> id = this.commandGateway.send(command);

and:

Mono<OrderResponse> orderResponse = queryGateway.send(new FindOrderSummaryQuery(id));

Both calls return a single value by using a Mono. A Mono is a reactive type, comparable to Java’s CompletableFuture. It has zero or one element and can represent an error. As with all reactive types, the value (or error) is delivered over time.

The second call takes the result of the first call as its input. We need the returned id of the command to query for the order. We use the flatMap operator to achieve this (line 3). The flatMap takes the asynchronous result of call 1 and passes this as a parameter to the lambda of call 2. The callback version can be seen in Listing 2: notice the nested lambda, which makes the code complex and less readable.

this.commandGateway.send(command, id -> {
  queryGateway.send(new FindOrderSummaryQuery(id));
});

Listing 2

There is a delay between sending the command and being able to query the result. When the created order cannot be found (e.g. it isn’t created yet or something has gone wrong), an exception is thrown. In this chain, this is represented as a Mono.error(throwable). We use the retryWhen method to retry the query (line 4). We do this 10 times with a delay of 100 ms. When we still don’t get a result, we throw an error (line 5). We don’t expose the error to the client, but pass an OrderResponse with the id and status CREATED (line 6). The client can then query the status of the order later by using this id. This is a form of graceful degradation.

Finally we map the order response from the query to a response entity which can be returned by Spring. Spring actually subscribes to this whole chain and sends out the REST response for us.

Conclusion

Spring 5 and Project Reactor allow us to handle asynchronous communication with concise and readable code. We can do retries, error handling and the combination of multiple asynchronous calls in just a few lines.
The integration of Webflux with Project Reactor allows the use of reactive paradigms in a REST controller. Webflux uses an asynchronous approach: while we wait for a backend query to return, we don’t block the main thread, which allows it to be used for other requests.
Our specific use case is a good example of one of the applications of Spring 5 Webflux and Project Reactor.

Refactoring from Elasticsearch version 1 with Java Transport client to version 6 with High Level REST client

February 27th, 2018 by
(https://blog.trifork.com/2018/02/27/refactoring-from-elasticsearch-version-1-with-java-transport-client-to-version-6-with-high-level-rest-client/)

Every long running project accrues technical debt. It may be that the requirements today have evolved in a different direction from what was foreseen when the project was designed, or it may be that difficult infrastructure tasks have been put off in favor of new functionality. From time to time, you need to refactor your code to clean up this technical debt. I recently finished such a refactoring task for a customer, so in the category ‘from the trenches’, I would like to share the story here.

Elasticsearch exposes both a REST interface and the internal Java API, via the binary transport client, for connecting with the search engine. Just over a year ago, Elastic announced to the world that it plans to deprecate the transport client in favor of the high level REST client, “as soon as the REST client is feature complete and is mature enough to replace the Java API entirely”. The reasons for this are clearly explained in Luca Cavanna’s blogpost, but the most important disadvantage is that using the transport client, you introduce a tight coupling between your application and the exact major and minor release of your ES cluster. As long as Elasticsearch exposes its internal API, it has to worry about breaking thousands of applications all over the world that depend on it.

The “as soon as…” timetable sounds somewhat vague and long term, but there may be good reasons to migrate your search functionality now, rather than later. In the case of our customer, the reason was wanting to use the AWS Elasticsearch service. The entire codebase is already running in AWS, and for the past few years they have been managing their own Elasticsearch cluster running on EC2 instances. This turns out to be labor-intensive when updates have to be applied to these VMs. It would be easier and probably cheaper to let Amazon manage the cluster. As the AWS Elasticsearch service only exposes the REST API, the dependence on the transport protocol will have to be removed.

Action plan

The starting situation was a dependency on Elasticsearch 1.4.5, using the Java API. The goal was the most recent Elasticsearch version available in the Amazon Elasticsearch Service, which at the time was 6.0.2, using the REST API.

In order to reduce the complexity of the refactoring operation, we decided early on to reindex the data rather than trying to convert the indices. Every Elasticsearch release comes with a handy list of breaking changes. Looking through these lists, we tried to identify the breaking changes that would likely affect the search implementation of our customer. There are more potential breaking changes than listed here, but these are the ones that an initial investigation suggested might have an impact:

1.x – 2.x:

  • Facets replaced by aggregations
  • Field names can’t contain dots

2.x – 5.x:

  • The string field type replaced by text and keyword

5.x – 6.0:

  • Support for indices with multiple mapping types dropped

The plan was first to convert the existing code to work with ES 6, and only then migrate from the transport client to the High Level REST client.

Implementation 

The entire search functionality, originally written by our former colleague Frans Flippo, was exhaustively covered by unit and integration tests, so the first step was to update the Maven dependency to the current version, run the tests, and see what broke. First there were compilation errors that were easily fixed. Some examples:

  • Replace FilterBuilder with QueryBuilder, RangeFilterBuilder with RangeQueryBuilder, TermsFilterBuilder with TermsQueryBuilder, PercolateRequestBuilder with PercolateQueryBuilder, etc.
  • Switch to HighlightBuilder for highlighters.
  • Replace ‘fields’ with ‘storedFields’.
  • The count API was removed in version 5.5, and its use had to be replaced by executing a search with size 0 (sketched below).

Facets had already been replaced by aggregations by our colleague Attila Houtkooper, so we didn’t have to worry about that.
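
As a sketch of that count replacement: a count becomes a search that returns no hits, only metadata. A rough example against the 6.x transport client, with index and field names made up for illustration:

import org.elasticsearch.client.Client
import org.elasticsearch.index.query.QueryBuilders

// Hypothetical index/field names; client is the existing transport client.
fun countPersons(client: Client): Long {
    val response = client.prepareSearch("persons")
        .setQuery(QueryBuilders.termQuery("lastName", "flippo"))
        .setSize(0)                 // return no documents, only metadata
        .get()
    return response.hits.totalHits  // total number of matching documents
}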

In ES 5, the suggest API was removed, and became part of the search API. This turned out not to have an impact on our project, because the original developer of the search functionality implemented a custom suggestions service based on aggregation queries. It looks like he wanted the suggestions to be ordered by the number of occurrences in a ‘bucket’, which couldn’t be implemented using the suggest API at the time. We decided that refactoring this to use Elasticsearch suggesters would be new functionality, and outside the scope of this upgrade, so we would continue to use aggregations for now.
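
To sketch what such an aggregation-based suggestion query can look like (index and field names are made up, and this is not the customer’s actual implementation): a terms aggregation returns its buckets ordered by document count by default, which gives the “ordered by number of occurrences” behaviour described above.

import org.elasticsearch.client.Client
import org.elasticsearch.search.aggregations.AggregationBuilders
import org.elasticsearch.search.aggregations.bucket.terms.Terms

// Hypothetical index/field names; buckets come back ordered by doc count (descending) by default.
fun suggestTerms(client: Client): List<String> {
    val response = client.prepareSearch("documents")
        .setSize(0)
        .addAggregation(
            AggregationBuilders.terms("suggestions")
                .field("title.keyword")
                .size(10)
        )
        .get()
    return response.aggregations.get<Terms>("suggestions")
        .buckets.map { it.keyAsString }
}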

Some updates were required to the index mappings. The most obvious one was replacing ‘string’ with either ‘text’ or ‘keyword’. Analyzer became search_analyzer, while index_analyzer became analyzer.

Syntax ES 1:

"fields": {
    "analyzed": {
        "type": "string",
        "analyzer" : "dutch",
        "index_analyzer": "default_min_word_length_2"
    },
    "not_analyzed": {
        "type": "string",
        "index": "not_analyzed"
    }
}

Syntax ES 6:

"fields": {
  "analyzed": {
    "type": "text",
    "search_analyzer": "dutch",
    "analyzer": "default_min_word_length_2"
  },
  "not_analyzed": {
    "type": "keyword",
    "index": true
  }
}

Document ids were associated with a path:

"_id": {
    "path": "id"
},

The _id field is no longer configurable, so in order to have document ids in Elasticsearch match ids in the database, the id has to be set explicitly, or Elasticsearch will generate a random one.
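
A minimal sketch of what that looks like with the transport client, with index, type and payload made up for illustration:

import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Client
import org.elasticsearch.common.xcontent.XContentType

// Hypothetical index/type names; the database id is reused as the document _id.
fun indexPerson(client: Client, id: Long, json: String) {
    val request = IndexRequest("persons", "person", id.toString())
        .source(json, XContentType.JSON)
    client.index(request).actionGet()
}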

All in all, it was roughly a day of work to get the project to compile and ready to run the unit tests. All of them were red.

Read the rest of this entry »

How to manage Database Migrations with Flyway?

December 5th, 2017 by
(https://blog.trifork.com/2017/12/05/how-to-do-database-migration-with-flyway/)

Joris Kuipers, CTO at Trifork, presented a webinar on some usage patterns for Flyway. You can find the recording on our Trifork YouTube channel.

Tools like Flyway address a common concern for many people, which quickly leads to questions on how to pick a tool and then apply it in the best manner for one’s particular situation.

In this blog post, Joris has summarized the Q&A session – he provides the readers with his answers and ideas for managing database migrations.

1. How does Flyway compare to Liquibase?

When I was choosing a DB schema migration tool 4 or 5 years ago, I looked at both Liquibase and Flyway. In general, I’d say that Flyway is a bit more light-weight: Liquibase addresses some requirements that Flyway explicitly chooses not to support. These include support for defining migrations declaratively (e.g. in XML) and then generating the correct SQL DDL statements for your particular DBMS, and support for generating migration rollbacks (countering operations).
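
To make that difference concrete: with Flyway you write plain, versioned SQL scripts (by convention V1__create_tables.sql, V2__..., and so on on the classpath) and Flyway only keeps track of which ones have been applied. A minimal sketch of running it programmatically with the Flyway 4/5-era API and hypothetical connection details (Spring Boot does the equivalent automatically at startup):

import org.flywaydb.core.Flyway

fun migrate() {
    // Picks up classpath:db/migration/V1__*.sql, V2__*.sql, ... by convention
    val flyway = Flyway()
    flyway.setDataSource("jdbc:postgresql://localhost:5432/app", "app", "secret")
    flyway.migrate()
}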
Read the rest of this entry »

Using Axon with PostgreSQL without TOAST

October 9th, 2017 by
(https://blog.trifork.com/2017/10/09/axon-postgresql-without-toast/)

The client I work for at this time is leveraging Axon 3. The events are stored in a PostgreSQL database. PostgreSQL uses a thing called TOAST (The Oversized-Attribute Storage Technique) to store large values.

From the PostgreSQL documentation:

“PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows”

As it happens, in our setup using JPA (Hibernate) to store events, the DomainEventEntry entity has a @Lob annotation on the payload and the metaData fields (via extension of the AbstractEventEntry class).

For PostgreSQL this will result in events that are not easily readable:

SELECT payload FROM domainevententry;

| payload |
| 24153   |

The data type of the payload column of the domainevententry table is OID.

The PostgreSQL JDBC driver obviously knows how to deal with this: the real content is deTOASTed lazily. Using PL/pgSQL it is possible to store a value in a file, but this needs to be done value by value. When you are debugging your application and want a quick look at its events, this is not a fun route to take.

So we wanted to change the data type in our database to something more human-readable, BYTEA for example: able to store large values, yet still readable. As it turned out, a couple of changes were needed to get it working.
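
One piece of that puzzle, as a hedged sketch and not necessarily identical to the solution in the full post, is telling Hibernate to map BLOB columns to BYTEA instead of OID with a custom dialect (the class name is hypothetical):

import java.sql.Types
import org.hibernate.dialect.PostgreSQL94Dialect
import org.hibernate.type.descriptor.sql.BinaryTypeDescriptor
import org.hibernate.type.descriptor.sql.SqlTypeDescriptor

// Hypothetical dialect: store @Lob/BLOB columns as plain BYTEA so the payload stays readable.
class NoToastPostgresDialect : PostgreSQL94Dialect() {

    init {
        // Generate BYTEA columns instead of OID when the schema is created
        registerColumnType(Types.BLOB, "bytea")
    }

    // Read and write BLOBs as plain binary instead of as large objects
    override fun remapSqlTypeDescriptor(sqlTypeDescriptor: SqlTypeDescriptor): SqlTypeDescriptor =
        if (sqlTypeDescriptor.sqlType == Types.BLOB) BinaryTypeDescriptor.INSTANCE
        else super.remapSqlTypeDescriptor(sqlTypeDescriptor)
}

Such a dialect would then be configured through the hibernate.dialect property (or spring.jpa.database-platform in a Spring Boot application).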

It took me a while to get all the pieces I needed. Although the solution I present here works for us, it might not be the most elegant or even the best solution for everyone.
Read the rest of this entry »

Heterogeneous microservices

July 13th, 2017 by
(https://blog.trifork.com/2017/07/13/heterogeneous-microservices/)


Microservices architecture is increasingly popular nowadays. One of the promises is flexibility and easier collaboration in larger organizations by reducing the amount of communication and coordination between teams. The thinking is: teams have their own service(s) and don’t depend on other teams, meaning they can work independently, thereby reducing coordination efforts.

Especially with multiple teams and multiple services per team, this can mean there are quite a few services with quite different usage. Different teams can have different technology preferences, for example because they are more familiar with one technology or another. Similarly, different usage can mean quite different requirements, which might be easier to fulfill with one technology than with another.

The question I’m going to discuss in this blog post is: how free or constrained should technology choices be in such an environment?

Spaghetti freedom

If you’ve ever worked in a hectic startup environment where quickly building features is more important than clean, sustainable code, you probably know how that ends. People just build things and it all becomes a big spaghetti mess. There is lots of technical debt.

The combination of freedom and high pressure means choices are short-term focused and there is little attention to keeping things tidy.

Over-standardization

Large organizations can have the opposite policy: any project or change needs to be approved by multiple boards, architects and committees. There are strict rules about exactly which technologies should be used.

This can mean there might be some shoe-horning to implement things with a sub-optimal technology, or over-engineering because complex technology is used for simple problems.

This can also be de-motivating to engineers, as they’re not allowed to pick their favorite technology.

Inspire!

I think the right approach is a middle ground. Some standardization and preferred languages, frameworks and solutions are very helpful in promoting software being built the same way. This helps engineers working on new services, for example when switching teams or when onboarding new people.

I think this preferred way of working should also evolve and be open for discussion: new insights or new ways of building can be integrated if they prove beneficial.

In my opinion such a preferred approach should serve as a starting point for teams building new services. Teams should try to build software in a common way, but when needed it should be fine to diverge and do things differently.

Concluding

I’ve shared my opinion on how technology can be standardized in larger microservice environments. How does this work in your company? What’s your insight on this topic?

How to send your Spring Batch Job log messages to a separate file

April 14th, 2017 by
(https://blog.trifork.com/2017/04/14/how-to-send-your-spring-batch-job-log-messages-to-a-separate-file/)

In one of my current projects we’re developing a web application which also has a couple of dozen batch jobs that perform all sorts of tasks at particular times. These jobs produce quite a bit of logging output when they run, which is important for seeing what exactly happened during a job. What we noticed, however, is that the batch logging made it hard to quickly spot the other logging performed by the application while a batch job was running. In addition to that, it wasn’t always clear in the context of which job a log statement was issued.
To address these issues I came up with a simple solution based on Logback Filters, which I’ll describe in this blog.

Logback Appenders

We’re using Logback as a logging framework. Logback defines the concept of appenders: appenders are responsible for handling the actual log messages emitted by the loggers in the application by writing them to the console, to a file, to a socket, etc.
Many applications define one or more appenders and then simply list them all as part of their root logger section in the logback.xml configuration file:

<configuration scan="true">

  <appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
    <destination>logstash-server</destination>
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>

  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>log/server.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>log/server.%d{yyyy-MM-dd}.log</fileNamePattern>
      <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %mdc %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="info">
    <appender-ref ref="LOGSTASH"/>
    <appender-ref ref="FILE"/>
  </root>

</configuration>

This setup will send all log messages to both of the configured appenders. Read the rest of this entry »

Machine Learning: Predicting house prices

February 16th, 2017 by
(https://blog.trifork.com/2017/02/16/machine-learning-predicting-house-prices/)

Recently I have followed an online course on machine learning to understand the current hype better. As with any subject though, only practice makes perfect, so I was looking to apply this new knowledge.

While looking to sell my house I found a nice opportunity: check if the price a real estate agent estimates is in line with what the data suggests.

A linear regression algorithm should be a nice fit here: it will try to find the best linear prediction (y = a + b·x1 + c·x2, where y is the prediction and x1, x2 are the variables). So, for example, this algorithm can estimate a price per square meter of floor space or a price per square meter of garden. For a more detailed explanation, check out the Wikipedia page.
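
As a small illustration of the technique (not the actual analysis from this post), fitting such a model takes only a few lines with a library like Apache Commons Math; the numbers below are made-up toy data:

import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression

fun main() {
    // Toy data: advertised price in euros (y) versus floor space and lot size in m2 (x1, x2)
    val prices = doubleArrayOf(250_000.0, 315_000.0, 279_000.0, 410_000.0, 365_000.0)
    val features = arrayOf(
        doubleArrayOf(95.0, 120.0),
        doubleArrayOf(120.0, 150.0),
        doubleArrayOf(105.0, 130.0),
        doubleArrayOf(160.0, 220.0),
        doubleArrayOf(140.0, 180.0)
    )

    val regression = OLSMultipleLinearRegression()
    regression.newSampleData(prices, features)

    // Parameters come back as [a, b, c]: the intercept plus one coefficient per variable
    val (a, b, c) = regression.estimateRegressionParameters().toList()
    println("price ~ %.0f + %.0f * floorSpace + %.0f * lotSize".format(a, b, c))
}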

In the Netherlands, funda is the main website for selling your house, so I started by collecting some data: I used data on the 50 houses closest to my house. I’ve excluded apartments to try and limit the data to properties similar to my house. For each house I collected the advertised price, usable floor space, lot size, number of (bed)rooms, type of house (row-house, corner-house, or detached) and year of construction (..-1930, 1931-1940, 1941-1950, 1950-1960, etc). These are the (easily available) variables I expected would influence house price the most. Type of house is a categorical variable; to use it in regression I modelled it as several binary (0/1) variables.

As preparation, I checked for relations between the variables using correlation. This showed me that much of the collected data does not seem to affect price: Only the floor space, lot size and number of rooms showed a significant correlation with house price.

For the regression analysis, I only used the variables that had a significant correlation. Variables without correlation would not produce meaningful results anyway.

Read the rest of this entry »

Simulating an Elasticsearch Ingest Node pipeline

February 2nd, 2017 by
(https://blog.trifork.com/2017/02/02/elasticsearch-ingest-node/)

Indexing documents into your cluster can be done in a couple of ways:

  • using Logstash to read your source and send documents to your cluster;
  • using Filebeat to read a log file, send documents to Kafka, let Logstash connect to Kafka and transform the log event and then send those documents to your cluster;
  • using curl and the Bulk API to index a pre-formatted file;
  • using the Java Transport Client from within a custom application;
  • and many more…

Before version 5, however, there were only two ways to transform your source data into the document you wanted to index: using Logstash filters, or doing it yourself.

In Elasticsearch 5 the concept of the Ingest Node was introduced: just a node in your cluster like any other, but with the ability to create a pipeline of processors that can modify incoming documents. The most frequently used Logstash filters have been implemented as processors.

For me, the best part of pipelines is that you can simulate them. Especially in Console, simulating your pipelines makes creating them very fast; the feedback loop on testing your pipeline is very short. This makes pipelines a very convenient way to index data.
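
As a rough sketch of what a simulate call looks like from code, here is the same kind of request you would run in Console, sent through the low-level REST client; the host, processor and document are made up for illustration:

import org.apache.http.HttpHost
import org.apache.http.entity.ContentType
import org.apache.http.nio.entity.NStringEntity
import org.elasticsearch.client.RestClient

fun simulatePipeline() {
    val client = RestClient.builder(HttpHost("localhost", 9200, "http")).build()

    // A one-processor pipeline plus a test document, simulated without indexing anything
    val body = """
        {
          "pipeline": {
            "processors": [ { "lowercase": { "field": "message" } } ]
          },
          "docs": [ { "_source": { "message": "Hello Ingest Node" } } ]
        }
    """.trimIndent()

    val response = client.performRequest(
        "POST", "/_ingest/pipeline/_simulate",
        emptyMap(), NStringEntity(body, ContentType.APPLICATION_JSON)
    )
    println(response.entity.content.bufferedReader().readText())
    client.close()
}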

Read the rest of this entry »