This blog post shows you how to run Mahout in the cloud, using Apache Whirr. Apache Whirr is a promising Apache incubator project for quickly launching cloud instances, from Hadoop to Cassandra, HBase, ZooKeeper and so on. I will show you how to set up a Hadoop cluster and run Mahout jobs both via the command line […]
Search Result Grouping / Field Collapsing in Lucene / Solr
Grouping of search results, also known as field collapsing, is often a requirement for search projects. As described earlier, this functionality was added to Solr and happens to be one of the most wanted features in Solr. Recently result grouping was added to Lucene as a contrib in Lucene 3.1 and a module in 4.0. […]
The State and Future of Spatial Search
The release of Solr 3.1, containing Solr’s official spatial search support, has coincided with a new debate about the future of spatial search in Solr and Lucene. JTeam has been involved in the development of spatial search support for a number of years and we maintain our own spatial search plugin for Solr. Consequently this […]
Indexing your Samba/Windows network shares using Solr
Many of JTeam’s clients want to search the content of their existing network shares as part of their Enterprise Search infrastructure. Over the last couple of years, more and more people have been switching to Apache Lucene / Solr as their preferred open source search solution. However, many still have the misconception that it is not […]
Lucene indexing gains concurrency
Imagine you are a kindergarten teacher and a whole bunch of kids are playing with Lego. Suddenly it’s almost 4pm and the big mess needs to be cleaned up, so you ask each kid to pick up one Lego brick and put it in your hands. They all run around, bringing bricks to you one […]
SSP 1.0 Video Tutorial
Although SSP v1.0 has been replaced by the simpler 2.0 version, some of you out there are probably still using the 1.0 version. Because we like to provide as much assistance as we can to our users, we’ve decided to publish a video tutorial I created on how to configure and use SSP v1.0. It walks […]
Solr and Lucene 3.1 Release
The new release of Solr and Lucene 3.1, available here and here, is the first major release for Solr in almost two years and the first joint release of both projects. With each project having resolved several hundred issues leading to the release, let’s take a look at the major improvements and new features including […]
How to cluster Seinfeld episodes with Mahout
This February I gave a talk on Mahout clustering at FOSDEM 2011 where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up a short blog post about it. In just a few minutes you can run the Seinfeld demo on your own machine.
Gimme all resources you have – I can use them!
Exploiting full IO and CPU concurrency when indexing with Apache Lucene. During the last year Apache Lucene has improved enormously, with advances such as 100 times faster FuzzyQueries, a new Term-Dictionary implementation, enhanced Segment-Merging and the famous Flexible-Indexing API. Recently I started working on another fundamental change referred to as DocumentsWriterPerThread, an […]
Mahout at FOSDEM 2011 DataDevRoom
Last Saturday, February 5th, FOSDEM 2011 hosted the DataDevRoom, where talks were given on topics surrounding data analysis with free and open source software. I was there and gave an introductory talk on clustering with Apache Mahout. In case you missed the conference, read on to learn about some of the talks or check out the […]