Trifork Blog


Elasticsearch beyond “Big Data” - running elasticsearch embedded

September 13th, 2012

Trifork has a long track record in projects, training and consulting around open source search technologies. Currently we are working on several interesting search projects using elasticsearch. Elasticsearch is an open source, distributed, RESTful search engine built on top of Apache Lucene. In contrast to, for instance, Apache Solr, elasticsearch is built from the ground up as a highly scalable distributed system, allowing you to shard and replicate multiple indices over a large number of nodes. This architecture makes scaling from one server to several hundred a breeze. It turns out, however, that elasticsearch is not only good for what everyone calls “Big Data”: it is also very well suited to indexing small numbers of documents, and even to running embedded within an application, while still providing the flexibility to scale up later when needed.

As most developers know, most databases offer full-text search capabilities on the data they store. In our experience, however, more is often needed, and that is where Lucene-based solutions come in. Elasticsearch is currently our technology of choice for greenfield projects, as it provides all the features you typically need and combines them with scalability.

In the case described here, we used elasticsearch as part of a bigger project for the University of Amsterdam (UvA). We use elasticsearch to “cache” course information that is retrieved from a Peoplesoft SiS system and make it searchable. We decided to fire up a local elasticsearch node within an existing Spring web application, using it as an embedded search engine.

We already had a Spring web application in place that was able to get data from Peoplesoft SiS through web services. One of the goals of the project was to allow users to search through the course data. The application was meant to run on a single Tomcat instance, and we wanted to avoid running a separate application server just to provide the search features. Since the application was single instance, it was feasible to store the index on the same server, on the local file system. For these reasons we started to look into an embedded solution, which would also avoid any HTTP traffic for indexing and searching. We first thought about using plain Lucene: the search requirements looked pretty easy at first, but using Lucene directly requires quite some code, while the search servers built on top of it allow you to concentrate more on your data flow. With such a server you don’t need to write Lucene code, but you still need to know quite a lot about Lucene in order to use it properly. Elasticsearch, however, makes your life easier and provides additional features as well, e.g. caching. That's why we looked into embedding elasticsearch in our application, which is pretty easy to do thanks to its Java API. All the REST APIs provided by elasticsearch are in fact exposed through Java APIs, since that is effectively how elasticsearch itself processes every request internally.

The first thing you need to do is add the elasticsearch dependency to your POM (assuming you are using Maven). Note that the elasticsearch artifacts are hosted on the Sonatype repository:

<dependency>
   <groupId>org.elasticsearch</groupId>
   <artifactId>elasticsearch</artifactId>
   <version>0.19.9</version>
</dependency>

After that you can create either a Client object in your application code, in order to send requests to an existing elasticsearch instance, or a Node object, in order to start a new node and possibly join an existing cluster.

In order to create and start our embedded elasticsearch node within the existing Spring application, we created a FactoryBean and combined it with the InitializingBean and DisposableBean interfaces. The following is the afterPropertiesSet method, which effectively fires up the node:

@Override
public void afterPropertiesSet() throws Exception {
    ImmutableSettings.Builder settings =
        ImmutableSettings.settingsBuilder();
    settings.put("node.name", "orange11-node");
    settings.put("path.data", "/data/index");
    settings.put("http.enabled", false);
    node = NodeBuilder.nodeBuilder()
        .settings(settings)
        .clusterName("orange11-cluster")
        .data(true).local(true).node();
}

Our cluster will be called orange11-cluster rather than the default elasticsearch. Our node will hold data and will be a local node, which means that other nodes can join the same cluster only if they run in the same Java process. Because of the local setting, the transport port used for inter-node communication (9300 by default) will not be used. Through the settings we provide the name of the node, the directory where the index is stored, and a boolean parameter that controls whether the http connector is enabled. We don't need http in production, but it's useful during development to have access to our elasticsearch instance. Let's not forget to stop our node together with the application:

@Override
public void destroy() throws Exception {
    node.close();
}
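
Since the http connector is only wanted during development, its value can be derived from a flag instead of being hard-coded. What follows is a minimal sketch using only the JDK, not the code from our application: the class EmbeddedNodeSettings and the development flag are hypothetical names introduced for illustration. Each entry would still be copied into the settings builder with settings.put(key, value), as in afterPropertiesSet above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of elasticsearch or of the original application.
class EmbeddedNodeSettings {

    // Returns the node settings shown earlier as plain key/value pairs.
    // The development flag is an assumption for illustration: it enables
    // the http connector during development and disables it in production.
    static Map<String, String> forEnvironment(boolean development) {
        Map<String, String> settings = new LinkedHashMap<String, String>();
        settings.put("node.name", "orange11-node");
        settings.put("path.data", "/data/index");
        settings.put("http.enabled", String.valueOf(development));
        return settings;
    }

    public static void main(String[] args) {
        // Print the production settings, one per line.
        for (Map.Entry<String, String> entry : forEnvironment(false).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Keeping the values in one place like this makes it harder to ship a build with the http connector accidentally left on.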

Once our node is started, we need to connect to it in order to index and search data. To do so, we create a Client object out of the existing Node, again using a FactoryBean. The elasticsearch Client object is thread safe and its lifecycle is meant to match that of the application itself, so you don't need to create an instance for each request: a singleton client for the whole application is fine. We'll inject the Node into the ClientFactoryBean and create the Client object like this:

@Override
public void afterPropertiesSet() throws Exception {
    client = node.client();
}

Again, let's not forget the destroy method to close the client when the application is stopped:

@Override
public void destroy() throws Exception {
    client.close();
}

We are now ready to use the Client object to submit requests to the elasticsearch node using the Java API. For example, we can create the orange11 index like this, providing our own settings and mapping:

CreateIndexRequest request =
    Requests.createIndexRequest("orange11")
        .settings(yourSettings)
        .mapping(yourMapping);
CreateIndexResponse response =
    client.admin().indices().create(request).actionGet();

We can then index a document like this:

IndexRequest indexRequest =
    Requests.indexRequest("orange11")
        .type("blog")
        .id("1")
        .source(jsonDocument);
IndexResponse indexResponse =
    client.index(indexRequest).actionGet();
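
The jsonDocument passed to source(...) above is simply a JSON string. In a real application a JSON library would normally produce it; the sketch below only illustrates the shape of such a document, using nothing but the JDK. The JsonDocuments class is a hypothetical helper and handles flat string fields only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; supports flat string fields only.
class JsonDocuments {

    // Escapes the characters that must be escaped inside a JSON string.
    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Serializes the fields as one JSON object, keeping insertion order.
    static String toJson(Map<String, String> fields) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (!first) {
                json.append(',');
            }
            json.append('"').append(escape(field.getKey())).append("\":\"")
                .append(escape(field.getValue())).append('"');
            first = false;
        }
        return json.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("title", "Running elasticsearch embedded");
        doc.put("content", "A local node inside a Spring application");
        System.out.println(toJson(doc));
    }
}
```

The title and content fields used in main match the fields queried in the search example further on.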

And here is how we can search using a query string, one of the many queries provided by the powerful elasticsearch query DSL:

QueryBuilder queryStringBuilder =
    QueryBuilders.queryString(query)
        .field("title", 2)
        .field("content");
SearchRequestBuilder requestBuilder =
    client.prepareSearch("orange11")
        .setTypes("blog")
        .setQuery(queryStringBuilder);
SearchResponse response = requestBuilder.execute().actionGet();

We are really happy with this solution: it is exactly what we were looking for, and it performs really well too. We are using a single shard with no replicas and the index is stored only locally, which can be dangerous. In our case that's not a problem, since a complete re-index that retrieves all the data from the external web service takes only a couple of minutes. That's why we are not that worried about losing our data. And if one day we need to scale up, it will be pretty simple: it's just a matter of installing an external elasticsearch cluster, slightly modifying the code that creates the Client object, removing the embedded Node from our codebase and doing a complete re-index with a higher number of shards. Our data will then be distributed automatically over the nodes belonging to our cluster.

Even though we are only using a subset of the elasticsearch features for this project, it still adds a lot of value. Conclusion: running elasticsearch embedded can be an easy way to add powerful search capabilities to your application, without having to run a separate process.

7 Responses

  1. September 13, 2012 at 13:00 by David

    Hi Luca,
    May I suggest that you give a try to the elasticsearch spring factories project? It helps to build nodes and clients and to perform some initialization jobs (create index, mappings, ...).
    I use it in production for some months.

  2. September 13, 2012 at 13:02 by David

    Sorry, I forgot the link: https://github.com/dadoonet/spring-elasticsearch

  3. September 13, 2012 at 15:16 by Luca Cavanna

    Hi David,
    I am familiar with your project and had a look at it - nice work.
    On the other hand, in this blogpost I wanted to show how simple it is to fire a node and use the elasticsearch Java API without additional dependencies. Thanks for your interest.

  4. January 9, 2013 at 23:37 by kops

    Hi Luca,
    Just wanted to thank you for this idea. 10 minutes and it worked like a charm. I created the node and the client from within the same factory because I only need the client at the moment and have no need to expose node as a separate bean.
    cheers,
    kops

  5. February 12, 2013 at 14:31 by Mohsin

    Hi David and Luca,
    we just developed Spring Data Elasticsearch, Spring Data implementation of elasticsearch.
    https://github.com/BioMedCentralLtd/spring-data-elasticsearch

    also sample code is at

    https://github.com/BioMedCentralLtd/spring-data-elasticsearch-sample-application

  6. April 12, 2013 at 17:02 by glenn

    Ha, ha, I think that you are getting Lucene and Solr confused. That is easy to do since both projects are under Apache control and they are closely tied together. Solr is similar to elasticsearch in that it is a highly scalable web service built around Lucene. Lucene is still the core technology in both projects. There is even a way to embed Solr in your projects similar to what you have documented here for elasticsearch. http://www.dynamicalsoftware.com/embedded/lucene is where I cover an example of embedding lucene directly into your projects.

  7. October 4, 2013 at 06:23 by Paul

    Hi Luca
    Helpful article for beginners.

    Mohsin i checked out the project from github but it fails to build
