If you read technical blogs, perhaps about search or data analysis, chances are good you have come across Kibana. You have seen stories about how easy it is to use. Most of the blogging effort deals with getting data into Kibana, using logstash for instance. Maybe some of you have installed Kibana and are using it in combination with logstash. But what if you want to analyze other data? With the most recent release, M4, Kibana is better than ever at analyzing other sorts of data. In this blog I am going to show you how to create your own dashboard in Kibana. In order to do something useful with Kibana we need data. Peter Meijer had a very nice idea: index the metadata from all of your images to learn about the type of photos that you take. I decided to put this into practice. I used Node.js and ExifTool to obtain metadata from images and store it in elasticsearch.
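The extraction step can be sketched as a small shell pipeline (a sketch only: it assumes `exiftool` and `jq` are installed, a local elasticsearch node on port 9200, and a hypothetical `photos` index):

```shell
# exiftool emits a JSON array of metadata objects for the given image;
# jq picks the first element, and curl indexes it into a hypothetical
# "photos" index on a local elasticsearch node
exiftool -json photo.jpg | jq '.[0]' | curl -XPOST 'http://localhost:9200/photos/photo' -d @-
```

In the actual project a Node.js script drove ExifTool and the indexing, but the data flow is the same.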
Category ‘Enterprise Search’
In my last blog post on the subject, I tried to find the maximum shard size in elasticsearch. But in the end, all I could say was that elasticsearch can index the whole English Wikipedia dump into a single shard without any problem, but that queries against it are painfully slow. I couldn't find any hard limit because I didn't know exactly where the problem would appear. I expected indexing to slow down before querying did, so I couldn't run a meaningful query test against a smaller index. Armed with the knowledge from my previous experiment, in this post I will try to show what the maximum shard size is for a given set of conditions.
In my previous blog post I explained what the split-brain problem is in elasticsearch and how to avoid it, but I only briefly touched on how it manifests. In this post I'm going to expand on what actually happens to your indexing and query requests after a split-brain has occurred. As I'm sure you're already aware: it depends! It depends on the type of client you use. Because Java is my specialty, I'm going to write about the two types of clients elasticsearch supports through the Java API: the transport client and the node client.
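For reference, this is roughly how the two client types are created with the era-appropriate (0.90.x) Java API. This is a minimal sketch, not a complete program: the cluster name and host are placeholders, and it needs the elasticsearch jar and a running cluster to do anything.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class ClientTypes {
    public static void main(String[] args) {
        // Transport client: stays outside the cluster and round-robins
        // requests over the configured transport addresses
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "mycluster").build();
        Client transportClient = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));

        // Node client: joins the cluster as a client-only node (no data),
        // so it takes part in cluster state, including master election
        Node node = nodeBuilder().clusterName("mycluster").client(true).node();
        Client nodeClient = node.client();

        // ... use the clients, then close them
        transportClient.close();
        node.close();
    }
}
```

The difference in how each client participates in the cluster is exactly why their behavior diverges after a split-brain.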
We've all been there - we start planning for an elasticsearch cluster and one of the first questions that comes up is "How many nodes should the cluster have?". As I'm sure you already know, the answer to that question depends on a lot of factors, like expected load, data size, hardware, etc. In this blog post I'm not going to go into detail on how to size your cluster, but instead will talk about something equally important - how to avoid the split-brain problem.
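The standard safeguard against split-brain is to require a quorum of master-eligible nodes before a master can be elected, set in `elasticsearch.yml`:

```yaml
# Require a majority of master-eligible nodes before electing a master;
# for a cluster with 3 master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```

With this setting, a partitioned minority cannot elect its own master, so you cannot end up with two independent halves of the cluster both accepting writes.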
Whenever people start working with elasticsearch they have to make important configuration decisions. Most of those decisions can be altered along the way (refresh interval, number of replicas), but one stands out as permanent - the number of shards. When you create an index in elasticsearch, you specify how many shards that index will have, and you cannot change this setting without reindexing all the data from scratch. In some cases reindexing is not a time-consuming task, but there are situations where rebuilding an elasticsearch index can take days.
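To illustrate, the shard count is fixed in the index settings at creation time (a sketch with a hypothetical index name, against a local node):

```shell
# Create an index with 4 primary shards; number_of_replicas can still
# be changed later, but number_of_shards cannot without a full reindex
curl -XPUT 'http://localhost:9200/myindex' -d '{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}'
```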
Many developers feel the pressure of making the right choice about the number of shards to use when creating an index. But with a baseline for the maximum shard size, and knowing how much data needs to be stored in elasticsearch, the choice becomes much easier. For example, if your tests show that a single shard performs well up to 30 GB and you expect 300 GB of data, ten primary shards is a reasonable starting point.
When I started working with elasticsearch a while ago, I was fortunate enough to work alongside a very talented engineer, a true search expert. I would often ask him questions like "So how many shards can one elasticsearch node support?" or "What should the refresh interval be?". He would pause, think for a while, but in the end his answer would always be "Well, it depends". This answer irked me in the beginning, especially because we're in IT, where everything is 0s and 1s, right? In this blog post I will show what the answer to the question "How much data can a single-shard index hold?" depends on and how to find the best setting for your environment.
Plotting markers on a map is easy using the tooling that is readily available. But what if you want to add a large number of markers to a map when building a search interface? The markers start to clutter the map and it becomes hard to view the results. The solution is to group nearby results together into one marker. You can do that on the client using client-side scripting, but as the number of results grows, this might not be the best option from a performance perspective.
This blog post describes how to do server-side clustering of those markers, combining them into one marker (preferably with a counter indicating the number of grouped results). It provides a solution to the “too many markers” problem with an Elasticsearch facet.
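One possible shape of such a request (a hypothetical sketch, not the plugin described in the post: it assumes each document was indexed with a precomputed geohash-prefix field named `location.geohash3`, and uses a plain terms facet, era-appropriate before aggregations existed):

```shell
# Group hits per geohash cell via a terms facet on a precomputed
# geohash-prefix field; each facet entry's count becomes the number
# shown on the corresponding cluster marker
curl -XPOST 'http://localhost:9200/places/_search' -d '{
  "size": 0,
  "facets": {
    "clusters": {
      "terms": { "field": "location.geohash3", "size": 100 }
    }
  }
}'
```

The geohash prefix length controls the cluster granularity: shorter prefixes mean larger cells, so fewer, bigger markers.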
We have done multiple big Hippo projects. A typical Hippo project consists of multiple components, like the website, the content management system, and a repository for the documents. In most of the projects we also introduce an integration component. This component is used to pull other data sources into Hippo, but we also use it to expose data to third parties.
By default, the Hippo Site Toolkit delegates searches to the Hippo Repository, which in turn delegates them to Jackrabbit, the repository Hippo uses to store its documents. Jackrabbit integrates Lucene, which can be used for search. This is a domain-specific Lucene implementation aiming to be compatible with the Java Content Repository specification. That compatibility comes at a price, however: searches are more expensive and less customizable than what search engines like Solr or Elasticsearch provide. For typical Solr/Elasticsearch features like highlighting, suggestions, boosting, full control over indexing, or searching external content, Hippo Repository search is limited.
This search problem can be overcome by using a specialized search solution like Elasticsearch. We have implemented such a solution for multiple customers, some based on Solr and others on Elasticsearch.
In this blog post I describe the solution we have created. I'll discuss the requirements for (near) real-time search using Hippo workflow events, as well as the integration component that reads the documents from Hippo and pushes them to Elasticsearch.
Nederlands Instituut voor Beeld & Geluid: Beeld & Geluid is not only a very interesting museum of media and television, located in the colorful building next to the Hilversum Noord train station, but is also responsible for archiving the audio-visual content of all the Dutch radio and television broadcasters. Around 800,000 hours of material is available in the Beeld & Geluid archives - and this grows every day as new programs are broadcast.
This blog entry describes the project Trifork Amsterdam is currently doing at Beeld & Geluid, replacing the current Verity search solution with one that is based on Elasticsearch.
I recently read the ElasticSearch Server book published by Packt Publishing. It was a pleasant read, really interesting even though I was already familiar with the product. So here is a quick synopsis of the book and its content. Not one of my usual blogs, but nonetheless something I wanted to share.
Writing a book about Elasticsearch turns out not to be easy. There are in fact lots of features and gems that would need to be discussed, something that's really hard to do in a book with a reasonable number of pages. Also, the product is rapidly evolving, which makes it extremely hard to keep up with it and come up with up-to-date content.
I think this book brings something that was missing until now in the Elasticsearch ecosystem, since it goes from installing and setting up the product to using it in real life, also describing potential issues and their solutions. Nor does it neglect the necessary technical details about the underlying Lucene library and search in general.