Mahout – Taste :: Part 1 – Introduction

by frankDecember 9, 2009

This post is the first in a series on Taste, a Java framework for providing personalized recommendations. Taste is part of the larger Mahout framework, which features various scalable machine-learning algorithms. In this post I introduce you to the concepts of personalized recommendations, also known as collaborative filtering. After this introduction, Taste’s architecture and extension points are explained. I finish this post by demonstrating and explaining the TanimotoCoefficientSimilarity, one of Taste’s implementations used for computing recommendations.

Personalized Recommendations

Today the web is full of services for recommending books, websites, music, applications, movies and so on. Amazon, Last.fm and StumbleUpon, all provide these personalized recommendations for internet users. These features can be quite useful for customers and profitable for many e-commerce sites these days.

Collaborative Filtering

Let’s first review some basic concepts. The theory that powers all these useful websites mentioned above is the process of Collaborative Filtering, or pattern recognition in large datasets of multiple users. These datasets can contain preferences of users for certain items. For example, Youtube members can rate a video by assigning a number of stars. The number of stars is a user’s preference value, a value from 1 to 5. Based on this collection of personal preferences and a ‘similarity function’ you can recommend videos to users or determine similar users, users with similar taste in videos. In this case, recommending videos is an example of an ‘item-based recommendation’ and determining users with similar tastes is an example of an ‘user-based recommendation’.

Introducing Mahout – Taste

So how can we use this theory to build recommenders? This is where Mahout – Taste comes in. Mahout is a Java framework for running scalable machine learning algorithms on top of Hadoop. Taste is a sub-framework of Mahout for building recommendation engines. Since April 4th 2008, Taste has become part of Mahout. Below is a figure with Taste’s architecture and the building blocks you need to configure a recommender.

taste-architecture
Taste architecture diagram

The main building block in Taste is the Recommender. The Recommender recommends items based on a given item or it determines users with similar tastes. It works as follows: the recommender applies a similarity function on a subset of pairs of items (or users) in the dataset. A similarity function usually returns a value between 0 and 1, with 1 representing two completely similar items and 0 completely dissimilar items. When the similarity function processes pairs in the dataset the resulting similarity values are collected and are either kept in memory or stored on the filesystem or a database. When the Java application requests a few recommendations for a given item, the Recommender returns the items with the highest similarity.

The Recommender retrieves items and users through the DataModel abstraction. Taste contains DataModel implementations for retrieving and storing your dataset through the filesystem or a database. In addition, the DataModel provides methods that count the total number of users, total number of items, number of users that prefer a certain item, and many more functions. Similarity functions use these numbers to compute a similarity value for pairs of items or users. This will be shown later in the example of the TanimotoCoefficientSimilarity, which determines a ratio based on some of these figures. You can build a recommender with Taste by adding a DataModel and a similarity function to a Recommender. You can also define your own similarity function by extending UserSimilarity or ItemSimilarity to recommend users or items, respectively.

Tanimoto coefficent similarity

Taste contains around a dozen similarity algorithms you can choose from to build a recommender. For this introductory post I will explain Taste’s TanimotoCoefficientSimilarity, a relatively straightforward similarity algorithm that is widely used in chemo-informatics for discovering similarities between molecules. Let’s illustrate the algorithm in the context of a webshop. Suppose there are 3 customers, A, B and C and 5 products, numbered 1 up to 5. Say each customer has bought a few products. For this algorithm it does not matter how many products are purchased, only which products are purchased by which customer. See the table below.

Customer purchases
Customer purchases

Intuitively you may see that the similarity between two products can be expressed by some ratio of purchases of customers. To be more precise, the tanimoto coefficient is computed by the following formula:

$$!\frac{c}{a + b – c}$$

$$c = $$ Number of customers that purchased p1 and p2
$$a = $$ Number of customers that purchased p1
$$b = $$ Number of customers that purchased p2

This means that if many customers have bought products, the numerator will be higher and so will be the similarity value. Alternatively, if many people have bought p1 and many have bought p2, but very few people bought both, p1 and p2 are probably dissimilar. Below is a table with calculated tanimoto coefficients for each product pair:

tanimoto_calculations
Tanimoto coefficients for all product pairs

Demonstrating Taste

In this section I demonstrate how to express the previous example with Taste.  If you like to try this example at home, download the brand new 0.2 mahout jar here and add it to your classpath. If you are using maven, add the following snippet to your pom.xml

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.2</version>
</dependency>

Below is a code snippet that shows how to build the DataModel and how to compute similarities with the TanimotoCoefficientSimilarity.  In the setup() method I create a BooleanPreferenceArray for each user. I then fill these arrays, put all of them in a FastByIdMap and put that in the DataModel. Next I create the TanimotoCoefficientSimilarity and pass in the DataModel. I then write a few tests that check whether the similarities computed by Taste return the values I expect, given the formula above.

/**
 * Demonstrates TanimotoCoefficientSimilarity + recommender.
 *
 * @author Frank Scholten
 */
public class TanimotoDemo {

 private DataModel dataModel;
 private ItemSimilarity tanimoto;

 private long CUSTOMER_A = 0;
 private long CUSTOMER_B = 1;
 private long CUSTOMER_C = 2;

 private long productOne = 0;
 private long productTwo = 1;
 private long productThree = 2;
 private long productFour = 3;
 private long productFive = 4;

 @Before
 public void setup() {
   FastByIDMap<PreferenceArray> userIdMap = new FastByIDMap<PreferenceArray>();

   BooleanUserPreferenceArray customerAPrefs = new BooleanUserPreferenceArray(4);
   customerAPrefs.set(0, new BooleanPreference(CUSTOMER_A, productOne));
   customerAPrefs.set(1, new BooleanPreference(CUSTOMER_A, productTwo));
   customerAPrefs.set(2, new BooleanPreference(CUSTOMER_A, productFour));
   customerAPrefs.set(3, new BooleanPreference(CUSTOMER_A, productFive));

   BooleanUserPreferenceArray customerBPrefs = new BooleanUserPreferenceArray(3);
   customerBPrefs.set(0, new BooleanPreference(CUSTOMER_B, productTwo));
   customerBPrefs.set(1, new BooleanPreference(CUSTOMER_B, productThree));
   customerBPrefs.set(2, new BooleanPreference(CUSTOMER_B, productFive));

   BooleanUserPreferenceArray customerCPrefs = new BooleanUserPreferenceArray(2);
   customerCPrefs.set(0, new BooleanPreference(CUSTOMER_C, productOne));
   customerCPrefs.set(1, new BooleanPreference(CUSTOMER_C, productFive));

   userIdMap.put(CUSTOMER_A, customerAPrefs);
   userIdMap.put(CUSTOMER_B, customerBPrefs);
   userIdMap.put(CUSTOMER_C, customerCPrefs);

   dataModel = new GenericDataModel(userIdMap);
   tanimoto = new TanimotoCoefficientSimilarity(dataModel);
 }

 @Test
 public void testSimilarities() throws TasteException {
   assertEquals((double) 1, tanimoto.itemSimilarity(productOne, productOne), 0.01);
   assertEquals((double) 1 / 3, tanimoto.itemSimilarity(productOne, productTwo), 0.01);
   assertEquals((double) 0, tanimoto.itemSimilarity(productOne, productThree), 0.01);
   assertEquals((double) 1 / 2, tanimoto.itemSimilarity(productOne, productFour), 0.01);
   assertEquals((double) 2 / 3, tanimoto.itemSimilarity(productOne, productFive), 0.01);

   assertEquals((double) 1 / 1, tanimoto.itemSimilarity(productTwo, productTwo), 0.01);
   assertEquals((double) 1 / 2, tanimoto.itemSimilarity(productTwo, productThree), 0.01);
   assertEquals((double) 1 / 2, tanimoto.itemSimilarity(productTwo, productFour), 0.01);
   assertEquals((double) 2 / 3, tanimoto.itemSimilarity(productTwo, productFive), 0.01);

   assertEquals((double) 1, tanimoto.itemSimilarity(productThree, productThree), 0.01);
   assertEquals((double) 0, tanimoto.itemSimilarity(productThree, productFour), 0.01);
   assertEquals((double) 1 / 3, tanimoto.itemSimilarity(productThree, productFive), 0.01);

   assertEquals((double) 1, tanimoto.itemSimilarity(productFour, productFour), 0.01);
   assertEquals((double) 1 / 3, tanimoto.itemSimilarity(productFour, productFive), 0.01);

   assertEquals((double) 1, tanimoto.itemSimilarity(productFive, productFive), 0.01);
 }

 @Test
 public void testRecommendProducts() throws TasteException {
   ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, tanimoto);

   List<RecommendedItem> similarToProductThree = recommender.mostSimilarItems(productThree, 2);

   assertEquals(productTwo, similarToProductThree.get(0).getItemID());
   assertEquals(productFive, similarToProductThree.get(1).getItemID());
 }
}

I also added tests for the recommendations themselves. The testRecommendProducts() method uses mostSimilarItems to determine items similar to product 3, in terms of customer preferences. The second parameter of this method is the number of similar items to compute. The result of this method is a list of items ordered by their similarity value, descendingly. We can now predict and see that product two is most similar to product three and product five is second most similar. Note that the code above is only suitable for demonstrative purposes to demonstrate and understand the algorithm. For instance, in an actual a production ready implementation a database backed DataModel can be used, or the algorithm can be ran on Hadoop.

This concludes this introduction to Taste. There are a lot of interesting parts that I haven’t explored myself, I am learning this one piece at a time, including the algorithms and mathematics behind it. Some things to check out: the similarity Javadocs for similarity algorithms. And of course there are classes to run your recommender on Hadoop and to evaluate the quality of your recommender with training sets. In the next few posts I will go into some of these subjects and use examples with a larger dataset, look into Hadoop and do a comparison of algorithms.

References