Trifork Blog

Mahout - Taste :: Part 1 - Introduction

December 9th, 2009 by
| Reply

This post is the first in a series on Taste, a Java framework for providing personalized recommendations. Taste is part of the larger Mahout framework, which features various scalable machine-learning algorithms. In this post I introduce you to the concepts of personalized recommendations, also known as collaborative filtering. After this introduction, Taste's architecture and extension points are explained. I finish this post by demonstrating and explaining the TanimotoCoefficientSimilarity, one of Taste's implementations used for computing recommendations.

Personalized Recommendations

Today the web is full of services for recommending books, websites, music, applications, movies and so on. Amazon, Last.fm and StumbleUpon, all provide these personalized recommendations for internet users. These features can be quite useful for customers and profitable for many e-commerce sites these days.

Collaborative Filtering

Let's first review some basic concepts. The theory that powers all these useful websites mentioned above is the process of Collaborative Filtering, or pattern recognition in large datasets of multiple users. These datasets can contain preferences of users for certain items. For example, Youtube members can rate a video by assigning a number of stars. The number of stars is a user's preference value, a value from 1 to 5. Based on this collection of personal preferences and a 'similarity function' you can recommend videos to users or determine similar users, users with similar taste in videos. In this case, recommending videos is an example of an 'item-based recommendation' and determining users with similar tastes is an example of an 'user-based recommendation'.

Introducing Mahout - Taste

So how can we use this theory to build recommenders? This is where Mahout - Taste comes in. Mahout is a Java framework for running scalable machine learning algorithms on top of Hadoop. Taste is a sub-framework of Mahout for building recommendation engines. Since April 4th 2008, Taste has become part of Mahout. Below is a figure with Taste's architecture and the building blocks you need to configure a recommender.

taste-architecture

Taste architecture diagram

The main building block in Taste is the Recommender. The Recommender recommends items based on a given item or it determines users with similar tastes. It works as follows: the recommender applies a similarity function on a subset of pairs of items (or users) in the dataset. A similarity function usually returns a value between 0 and 1, with 1 representing two completely similar items and 0 completely dissimilar items. When the similarity function processes pairs in the dataset the resulting similarity values are collected and are either kept in memory or stored on the filesystem or a database. When the Java application requests a few recommendations for a given item, the Recommender returns the items with the highest similarity.

The Recommender retrieves items and users through the DataModel abstraction. Taste contains DataModel implementations for retrieving and storing your dataset through the filesystem or a database. In addition, the DataModel provides methods that count the total number of users, total number of items, number of users that prefer a certain item, and many more functions. Similarity functions use these numbers to compute a similarity value for pairs of items or users. This will be shown later in the example of the TanimotoCoefficientSimilarity, which determines a ratio based on some of these figures. You can build a recommender with Taste by adding a DataModel and a similarity function to a Recommender. You can also define your own similarity function by extending UserSimilarity or ItemSimilarity to recommend users or items, respectively.

Tanimoto coefficent similarity

Taste contains around a dozen similarity algorithms you can choose from to build a recommender. For this introductory post I will explain Taste's TanimotoCoefficientSimilarity, a relatively straightforward similarity algorithm that is widely used in chemo-informatics for discovering similarities between molecules. Let's illustrate the algorithm in the context of a webshop. Suppose there are 3 customers, A, B and C and 5 products, numbered 1 up to 5. Say each customer has bought a few products. For this algorithm it does not matter how many products are purchased, only which products are purchased by which customer. See the table below.

Customer purchases

Customer purchases

Intuitively you may see that the similarity between two products can be expressed by some ratio of purchases of customers. To be more precise, the tanimoto coefficient is computed by the following formula:

\[\frac{c}{a + b - c}\]

\(c = \) Number of customers that purchased p1 and p2
\(a = \) Number of customers that purchased p1
\(b = \) Number of customers that purchased p2

This means that if many customers have bought products, the numerator will be higher and so will be the similarity value. Alternatively, if many people have bought p1 and many have bought p2, but very few people bought both, p1 and p2 are probably dissimilar. Below is a table with calculated tanimoto coefficients for each product pair:

tanimoto_calculations

Tanimoto coefficients for all product pairs

Demonstrating Taste

In this section I demonstrate how to express the previous example with Taste.  If you like to try this example at home, download the brand new 0.2 mahout jar here and add it to your classpath. If you are using maven, add the following snippet to your pom.xml

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.2</version>
</dependency>

Below is a code snippet that shows how to build the DataModel and how to compute similarities with the TanimotoCoefficientSimilarity.  In the setup() method I create a BooleanPreferenceArray for each user. I then fill these arrays, put all of them in a FastByIdMap and put that in the DataModel. Next I create the TanimotoCoefficientSimilarity and pass in the DataModel. I then write a few tests that check whether the similarities computed by Taste return the values I expect, given the formula above.

/**
 * Demonstrates TanimotoCoefficientSimilarity + recommender.
 *
 * @author Frank Scholten
 */
public class TanimotoDemo {

 private DataModel dataModel;
 private ItemSimilarity tanimoto;

 private long CUSTOMER_A = 0;
 private long CUSTOMER_B = 1;
 private long CUSTOMER_C = 2;

 private long productOne = 0;
 private long productTwo = 1;
 private long productThree = 2;
 private long productFour = 3;
 private long productFive = 4;

 @Before
 public void setup() {
   FastByIDMap<PreferenceArray> userIdMap = new FastByIDMap<PreferenceArray>();

   BooleanUserPreferenceArray customerAPrefs = new BooleanUserPreferenceArray(4);
   customerAPrefs.set(0, new BooleanPreference(CUSTOMER_A, productOne));
   customerAPrefs.set(1, new BooleanPreference(CUSTOMER_A, productTwo));
   customerAPrefs.set(2, new BooleanPreference(CUSTOMER_A, productFour));
   customerAPrefs.set(3, new BooleanPreference(CUSTOMER_A, productFive));

   BooleanUserPreferenceArray customerBPrefs = new BooleanUserPreferenceArray(3);
   customerBPrefs.set(0, new BooleanPreference(CUSTOMER_B, productTwo));
   customerBPrefs.set(1, new BooleanPreference(CUSTOMER_B, productThree));
   customerBPrefs.set(2, new BooleanPreference(CUSTOMER_B, productFive));

   BooleanUserPreferenceArray customerCPrefs = new BooleanUserPreferenceArray(2);
   customerCPrefs.set(0, new BooleanPreference(CUSTOMER_C, productOne));
   customerCPrefs.set(1, new BooleanPreference(CUSTOMER_C, productFive));

   userIdMap.put(CUSTOMER_A, customerAPrefs);
   userIdMap.put(CUSTOMER_B, customerBPrefs);
   userIdMap.put(CUSTOMER_C, customerCPrefs);

   dataModel = new GenericDataModel(userIdMap);
   tanimoto = new TanimotoCoefficientSimilarity(dataModel);
 }

 @Test
 public void testSimilarities() throws TasteException {
   assertEquals((double) 1, tanimoto.itemSimilarity(productOne, productOne), 0.01);
   assertEquals((double) 1 / 3, tanimoto.itemSimilarity(productOne, productTwo), 0.01);
   assertEquals((double) 0, tanimoto.itemSimilarity(productOne, productThree), 0.01);
   assertEquals((double) 1 / 2, tanimoto.itemSimilarity(productOne, productFour), 0.01);
   assertEquals((double) 2 / 3, tanimoto.itemSimilarity(productOne, productFive), 0.01);

   assertEquals((double) 1 / 1, tanimoto.itemSimilarity(productTwo, productTwo), 0.01);
   assertEquals((double) 1 / 2, tanimoto.itemSimilarity(productTwo, productThree), 0.01);
   assertEquals((double) 1 / 2, tanimoto.itemSimilarity(productTwo, productFour), 0.01);
   assertEquals((double) 2 / 3, tanimoto.itemSimilarity(productTwo, productFive), 0.01);

   assertEquals((double) 1, tanimoto.itemSimilarity(productThree, productThree), 0.01);
   assertEquals((double) 0, tanimoto.itemSimilarity(productThree, productFour), 0.01);
   assertEquals((double) 1 / 3, tanimoto.itemSimilarity(productThree, productFive), 0.01);

   assertEquals((double) 1, tanimoto.itemSimilarity(productFour, productFour), 0.01);
   assertEquals((double) 1 / 3, tanimoto.itemSimilarity(productFour, productFive), 0.01);

   assertEquals((double) 1, tanimoto.itemSimilarity(productFive, productFive), 0.01);
 }

 @Test
 public void testRecommendProducts() throws TasteException {
   ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, tanimoto);

   List<RecommendedItem> similarToProductThree = recommender.mostSimilarItems(productThree, 2);

   assertEquals(productTwo, similarToProductThree.get(0).getItemID());
   assertEquals(productFive, similarToProductThree.get(1).getItemID());
 }
}

I also added tests for the recommendations themselves. The testRecommendProducts() method uses mostSimilarItems to determine items similar to product 3, in terms of customer preferences. The second parameter of this method is the number of similar items to compute. The result of this method is a list of items ordered by their similarity value, descendingly. We can now predict and see that product two is most similar to product three and product five is second most similar. Note that the code above is only suitable for demonstrative purposes to demonstrate and understand the algorithm. For instance, in an actual a production ready implementation a database backed DataModel can be used, or the algorithm can be ran on Hadoop.

This concludes this introduction to Taste. There are a lot of interesting parts that I haven't explored myself, I am learning this one piece at a time, including the algorithms and mathematics behind it. Some things to check out: the similarity Javadocs for similarity algorithms. And of course there are classes to run your recommender on Hadoop and to evaluate the quality of your recommender with training sets. In the next few posts I will go into some of these subjects and use examples with a larger dataset, look into Hadoop and do a comparison of algorithms.

References

14 Responses

  1. December 9, 2009 at 12:11 by Andrew Phillips

    For the lazy amongst us...it would be convenient if you added the calculated tanimoto coefficients to your "customer purchases" example ;-) Thanks!

  2. [...] Mahout – Taste :: Part 1 – Introduction « JTeam Blog / JTeam: Enterprise Java, Open Source, sof.... [...]

  3. December 9, 2009 at 15:55 by Frank Scholten

    @Andrew, yes I can do that when I get around to it.

  4. December 9, 2009 at 20:00 by Frank Scholten

    Just added a table with tanimoto coefficients for all product pairs

  5. December 9, 2009 at 21:06 by Wouter Heijke

    Thanks for this informative Mahout introduction! I've been wishing to get started with Mahout for quite a while now and your article made me do it today!

  6. December 10, 2009 at 09:58 by Frank Scholten

    @Wouter: Great, let us know what you learned.

  7. December 10, 2009 at 13:14 by Wouter Heijke

    What i've noticed that it is easy to get an:

    org.apache.mahout.cf.taste.common.NoSuchUserException

    To clarify I think you should change the following:

    userIdMap.put(0, customerAPrefs);
    userIdMap.put(1, customerBPrefs);
    userIdMap.put(2, customerCPrefs);

    into:

    userIdMap.put(CUSTOMER_A, customerAPrefs);
    userIdMap.put(CUSTOMER_B, customerBPrefs);
    userIdMap.put(CUSTOMER_C, customerCPrefs);

  8. December 11, 2009 at 12:57 by Frank Scholten

    @Wouter: Yes, good point. I updated the code.

  9. December 17, 2009 at 12:00 by John

    Thank you for the detailed write-up.

    My company is developing a recommendation engine. Do you recommend using Mahout as our core or are there better open source collaborative filtering systems out there? Also, do you know of any websites that use Mahout to provide recommendations? I would like to see it in action.

  10. December 18, 2009 at 15:09 by Frank Scholten

    @John: Mahout is the only open source collaborative filtering system I am researching at the moment.

    There is an active and growing community behind it, so I'd say give it a try.

    For a real-life example of Mahout, look into Mippin, a search engine with recommendation features. Check out http://blog.mippin.com/2009/09/may-we-recommend.html

  11. January 12, 2010 at 16:41 by Nishant

    Nice article. Is there any real life dataset representing the boolean preferences?

  12. July 9, 2010 at 18:31 by neil kodner

    Any chance of you updating the document to aid people who are unfamiliar with java? I've followed the instructions on downloading and building mahout from svn but am unsure how to get your example code to run. I believe I'm down to classpath issues but I can never be sure.

    My mahout trunk is in /Users/nkodner/development/mahout/trunk/

    I'm not sure how to set up my classpath. I'm also not sure where to save TanimotoDemo.java.

    Any help is greatly appreciated! I'm looking to try Mahout, specifically TanimotoSimilarity on top of my Hadoop cluster to see how it performs against my Python implementation.

  13. July 12, 2010 at 17:11 by Frank Scholten

    @Neil:

    It's best to run this JUnit test from either IntelliJ, http://www.jetbrains.com/idea/download/, or Eclipse, http://www.eclipse.org/downloads/. Create a Java project, create a package taste and a class TanimotoDemo. Now add the mahout jars as dependencies to your project and you can run TanimotoDemo as a JUnit test.

    It's possible to run it from the command line with

    java -cp all:your:dependencies:here org.junit.runner.JunitCore package.to.TanimotoDemo

    but you need to setup your classpath and an IDE makes that a lot easier.

    Note btw that this a non-distributed recommendation, it does not run on Hadoop. I will dive into this subject in the future and blog about it.

  14. [...] Mahout taste part 1 [...]

Leave a Reply