Mahout – Taste :: Part 1 – Introduction
This post is the first in a series on Taste, a Java framework for providing personalized recommendations. Taste is part of the larger Mahout framework, which features various scalable machine-learning algorithms. In this post I introduce you to the concepts of personalized recommendations, also known as collaborative filtering. After this introduction, Taste’s architecture and extension points are explained. I finish this post by demonstrating and explaining the TanimotoCoefficientSimilarity
, one of Taste’s implementations used for computing recommendations.
Personalized Recommendations
Today the web is full of services for recommending books, websites, music, applications, movies and so on. Amazon, Last.fm and StumbleUpon, all provide these personalized recommendations for internet users. These features can be quite useful for customers and profitable for many e-commerce sites these days.
Collaborative Filtering
Let’s first review some basic concepts. The theory that powers all these useful websites mentioned above is the process of Collaborative Filtering, or pattern recognition in large datasets of multiple users. These datasets can contain preferences of users for certain items. For example, Youtube members can rate a video by assigning a number of stars. The number of stars is a user’s preference value, a value from 1 to 5. Based on this collection of personal preferences and a ‘similarity function’ you can recommend videos to users or determine similar users, users with similar taste in videos. In this case, recommending videos is an example of an ‘item-based recommendation’ and determining users with similar tastes is an example of an ‘user-based recommendation’.
Introducing Mahout – Taste
So how can we use this theory to build recommenders? This is where Mahout – Taste comes in. Mahout is a Java framework for running scalable machine learning algorithms on top of Hadoop. Taste is a sub-framework of Mahout for building recommendation engines. Since April 4th 2008, Taste has become part of Mahout. Below is a figure with Taste’s architecture and the building blocks you need to configure a recommender.
The main building block in Taste is the Recommender
. The Recommender
recommends items based on a given item or it determines users with similar tastes. It works as follows: the recommender applies a similarity function on a subset of pairs of items (or users) in the dataset. A similarity function usually returns a value between 0 and 1, with 1 representing two completely similar items and 0 completely dissimilar items. When the similarity function processes pairs in the dataset the resulting similarity values are collected and are either kept in memory or stored on the filesystem or a database. When the Java application requests a few recommendations for a given item, the Recommender
returns the items with the highest similarity.
The Recommender retrieves items and users through the DataModel
abstraction. Taste contains DataModel
implementations for retrieving and storing your dataset through the filesystem or a database. In addition, the DataModel
provides methods that count the total number of users, total number of items, number of users that prefer a certain item, and many more functions. Similarity functions use these numbers to compute a similarity value for pairs of items or users. This will be shown later in the example of the TanimotoCoefficientSimilarity
, which determines a ratio based on some of these figures. You can build a recommender with Taste by adding a DataModel
and a similarity function to a Recommender
. You can also define your own similarity function by extending UserSimilarity
or ItemSimilarity
to recommend users or items, respectively.
Tanimoto coefficent similarity
Taste contains around a dozen similarity algorithms you can choose from to build a recommender. For this introductory post I will explain Taste’s TanimotoCoefficientSimilarity
, a relatively straightforward similarity algorithm that is widely used in chemo-informatics for discovering similarities between molecules. Let’s illustrate the algorithm in the context of a webshop. Suppose there are 3 customers, A, B and C and 5 products, numbered 1 up to 5. Say each customer has bought a few products. For this algorithm it does not matter how many products are purchased, only which products are purchased by which customer. See the table below.
Intuitively you may see that the similarity between two products can be expressed by some ratio of purchases of customers. To be more precise, the tanimoto coefficient is computed by the following formula:
$$!\frac{c}{a + b – c}$$
$$c = $$ Number of customers that purchased p1 and p2
$$a = $$ Number of customers that purchased p1
$$b = $$ Number of customers that purchased p2
This means that if many customers have bought products, the numerator will be higher and so will be the similarity value. Alternatively, if many people have bought p1 and many have bought p2, but very few people bought both, p1 and p2 are probably dissimilar. Below is a table with calculated tanimoto coefficients for each product pair:
Demonstrating Taste
In this section I demonstrate how to express the previous example with Taste. If you like to try this example at home, download the brand new 0.2 mahout jar here and add it to your classpath. If you are using maven, add the following snippet to your pom.xml
<dependency> <groupId>org.apache.mahout</groupId> <artifactId>mahout-core</artifactId> <version>0.2</version> </dependency>
Below is a code snippet that shows how to build the DataModel
and how to compute similarities with the TanimotoCoefficientSimilarity
. In the setup()
method I create a BooleanPreferenceArray
for each user. I then fill these arrays, put all of them in a FastByIdMap
and put that in the DataModel
. Next I create the TanimotoCoefficientSimilarity
and pass in the DataModel
. I then write a few tests that check whether the similarities computed by Taste return the values I expect, given the formula above.
/** * Demonstrates TanimotoCoefficientSimilarity + recommender. * * @author Frank Scholten */ public class TanimotoDemo { private DataModel dataModel; private ItemSimilarity tanimoto; private long CUSTOMER_A = 0; private long CUSTOMER_B = 1; private long CUSTOMER_C = 2; private long productOne = 0; private long productTwo = 1; private long productThree = 2; private long productFour = 3; private long productFive = 4; @Before public void setup() { FastByIDMap<PreferenceArray> userIdMap = new FastByIDMap<PreferenceArray>(); BooleanUserPreferenceArray customerAPrefs = new BooleanUserPreferenceArray(4); customerAPrefs.set(0, new BooleanPreference(CUSTOMER_A, productOne)); customerAPrefs.set(1, new BooleanPreference(CUSTOMER_A, productTwo)); customerAPrefs.set(2, new BooleanPreference(CUSTOMER_A, productFour)); customerAPrefs.set(3, new BooleanPreference(CUSTOMER_A, productFive)); BooleanUserPreferenceArray customerBPrefs = new BooleanUserPreferenceArray(3); customerBPrefs.set(0, new BooleanPreference(CUSTOMER_B, productTwo)); customerBPrefs.set(1, new BooleanPreference(CUSTOMER_B, productThree)); customerBPrefs.set(2, new BooleanPreference(CUSTOMER_B, productFive)); BooleanUserPreferenceArray customerCPrefs = new BooleanUserPreferenceArray(2); customerCPrefs.set(0, new BooleanPreference(CUSTOMER_C, productOne)); customerCPrefs.set(1, new BooleanPreference(CUSTOMER_C, productFive)); userIdMap.put(CUSTOMER_A, customerAPrefs); userIdMap.put(CUSTOMER_B, customerBPrefs); userIdMap.put(CUSTOMER_C, customerCPrefs); dataModel = new GenericDataModel(userIdMap); tanimoto = new TanimotoCoefficientSimilarity(dataModel); } @Test public void testSimilarities() throws TasteException { assertEquals((double) 1, tanimoto.itemSimilarity(productOne, productOne), 0.01); assertEquals((double) 1 / 3, tanimoto.itemSimilarity(productOne, productTwo), 0.01); assertEquals((double) 0, tanimoto.itemSimilarity(productOne, productThree), 0.01); assertEquals((double) 1 / 2, tanimoto.itemSimilarity(productOne, productFour), 0.01); assertEquals((double) 2 / 3, tanimoto.itemSimilarity(productOne, productFive), 0.01); assertEquals((double) 1 / 1, tanimoto.itemSimilarity(productTwo, productTwo), 0.01); assertEquals((double) 1 / 2, tanimoto.itemSimilarity(productTwo, productThree), 0.01); assertEquals((double) 1 / 2, tanimoto.itemSimilarity(productTwo, productFour), 0.01); assertEquals((double) 2 / 3, tanimoto.itemSimilarity(productTwo, productFive), 0.01); assertEquals((double) 1, tanimoto.itemSimilarity(productThree, productThree), 0.01); assertEquals((double) 0, tanimoto.itemSimilarity(productThree, productFour), 0.01); assertEquals((double) 1 / 3, tanimoto.itemSimilarity(productThree, productFive), 0.01); assertEquals((double) 1, tanimoto.itemSimilarity(productFour, productFour), 0.01); assertEquals((double) 1 / 3, tanimoto.itemSimilarity(productFour, productFive), 0.01); assertEquals((double) 1, tanimoto.itemSimilarity(productFive, productFive), 0.01); } @Test public void testRecommendProducts() throws TasteException { ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, tanimoto); List<RecommendedItem> similarToProductThree = recommender.mostSimilarItems(productThree, 2); assertEquals(productTwo, similarToProductThree.get(0).getItemID()); assertEquals(productFive, similarToProductThree.get(1).getItemID()); } }
I also added tests for the recommendations themselves. The testRecommendProducts()
method uses mostSimilarItems
to determine items similar to product 3, in terms of customer preferences. The second parameter of this method is the number of similar items to compute. The result of this method is a list of items ordered by their similarity value, descendingly. We can now predict and see that product two is most similar to product three and product five is second most similar. Note that the code above is only suitable for demonstrative purposes to demonstrate and understand the algorithm. For instance, in an actual a production ready implementation a database backed DataModel
can be used, or the algorithm can be ran on Hadoop.
This concludes this introduction to Taste. There are a lot of interesting parts that I haven’t explored myself, I am learning this one piece at a time, including the algorithms and mathematics behind it. Some things to check out: the similarity Javadocs for similarity algorithms. And of course there are classes to run your recommender on Hadoop and to evaluate the quality of your recommender with training sets. In the next few posts I will go into some of these subjects and use examples with a larger dataset, look into Hadoop and do a comparison of algorithms.
References
- http://lucene.apache.org/mahout/taste.html – Taste homepage
- http://lucene.apache.org/mahout/mailinglists.html – Mahout mailing list
- http://www.ibm.com/developerworks/java/library/j-mahout/ – Mahout Article by Grant Ingersoll