Mahout – Taste :: Part Two – Getting started

by frank, April 15, 2010
This post is a ‘getting started’ article that shows you how to build a simple web-based movie recommender with Mahout / Taste, Wicket and the Movielens dataset from the Grouplens research group at the University of Minnesota. I will discuss which components you need, how to wire them up in Spring, and how to create a Wicket frontend for displaying movies and their recommendations. Along the way I give some tips and pointers about developing a recommender. Additionally I show the ResourceDataModel, a Mahout DataModel implementation which reads preferences from a Spring Resource.

Online movie store

Our running example for this post is an online DVD shop in which you can view and rent movies. Visitors go to a movie’s page to check out the plot description and will be shown a list of similar movies. These are other movies that were watched by people who also rented that specific movie. Because of space and time constraints, this application only provides a view on the dataset and the recommended movies. Other features expected of an online movie rental service, such as registration and payment, are left out.

Movielens Dataset

The movies and their ratings originate from the Movielens dataset of the Grouplens research group at the University of Minnesota. There are datasets containing 100,000, 1 million and 10 million ratings. Note that these ratings are explicit, ranging from 1 to 5. This is different from the example in my earlier blog post, which featured implicit ratings. Implicit ratings only indicate whether a user purchased or liked an item, not how much they liked it. In this example I used the 100,000 ratings file.
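To make the explicit ratings concrete, here is a minimal sketch of parsing one rating line. It assumes the tab-separated layout of the u.data file in the 100K dataset (user id, item id, rating, timestamp). The class name RatingLineSketch and the sample line are illustrative only and are not part of the demo code.

```java
/** Minimal sketch: one explicit rating parsed from a tab-separated line. */
final class RatingLineSketch {
    final long userId;
    final long itemId;
    final int rating; // explicit rating, 1 to 5

    private RatingLineSketch(long userId, long itemId, int rating) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
    }

    /** Parses "userId TAB itemId TAB rating TAB timestamp"; the timestamp is ignored. */
    static RatingLineSketch parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 tab-separated fields: " + line);
        }
        int rating = Integer.parseInt(fields[2]);
        if (rating < 1 || rating > 5) {
            throw new IllegalArgumentException("Rating out of range 1..5: " + rating);
        }
        return new RatingLineSketch(Long.parseLong(fields[0]), Long.parseLong(fields[1]), rating);
    }

    public static void main(String[] args) {
        // Illustrative input line, not necessarily taken from the dataset.
        RatingLineSketch r = parse("196\t242\t3\t881250949");
        System.out.println("user " + r.userId + " rated item " + r.itemId + " with " + r.rating);
    }
}
```

An implicit dataset would carry only the first two fields, so the range check on the rating would have nothing to validate.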

Item-based or user-based algorithms

For this application we use an item-based recommender, which is a good choice performance-wise if the number of items is smaller than the number of users. This is likely to be the case for an online movie rental store. Item-based recommendation differs from user-based recommendation, which works by identifying a neighbourhood of similar users and recommending items from that neighbourhood to the user. User-based recommendations are personalized, while with item-based recommendations each user gets the same recommendations for a given item. In our case, all users that visit a specific movie page will see the same recommended movies for that movie. An example of user-based recommendation is Stumbleupon, which provides personalized website recommendations. Stumbleupon requires you to log in first so that it can perform its user-based recommendation based on your profile of likes and dislikes of certain sites.
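The following sketch illustrates why item-based results are the same for every visitor: the lookup takes only an item ID, never a user ID. As a stand-in similarity it uses a plain co-occurrence count (how many users rated both items), which is an assumption made for brevity and not what Taste does. The class ItemBasedSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Minimal sketch of item-based "more like this" using co-occurrence counts. */
final class ItemBasedSketch {

    /** Item ID mapped to the IDs of the users who rated that item. */
    private final Map<Long, Set<Long>> usersByItem = new HashMap<>();

    void addRating(long userId, long itemId) {
        usersByItem.computeIfAbsent(itemId, k -> new HashSet<>()).add(userId);
    }

    /** Number of users who rated both items. */
    int coOccurrence(long itemA, long itemB) {
        Set<Long> shared = new HashSet<>(usersByItem.getOrDefault(itemA, Collections.<Long>emptySet()));
        shared.retainAll(usersByItem.getOrDefault(itemB, Collections.<Long>emptySet()));
        return shared.size();
    }

    /** Items most often co-rated with itemId; note that no user ID is involved. */
    List<Long> mostSimilarItems(long itemId, int howMany) {
        List<Long> candidates = new ArrayList<>();
        for (Long other : usersByItem.keySet()) {
            if (other != itemId && coOccurrence(itemId, other) > 0) {
                candidates.add(other);
            }
        }
        // Highest co-occurrence first; ties are broken by item ID for a stable order.
        candidates.sort((a, b) -> {
            int byCount = Integer.compare(coOccurrence(itemId, b), coOccurrence(itemId, a));
            return byCount != 0 ? byCount : Long.compare(a, b);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(howMany, candidates.size())));
    }

    public static void main(String[] args) {
        ItemBasedSketch sketch = new ItemBasedSketch();
        sketch.addRating(1, 10); sketch.addRating(1, 20); sketch.addRating(1, 30);
        sketch.addRating(2, 10); sketch.addRating(2, 20);
        sketch.addRating(3, 20); sketch.addRating(3, 40);
        System.out.println(sketch.mostSimilarItems(10, 2)); // prints [20, 30]
    }
}
```

A user-based recommender would instead need the current user's ID as input, which is why it requires a login.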

EuclideanDistanceSimilarity

Now we need to select an item-based algorithm that fits our dataset. The EuclideanDistanceSimilarity is one of Taste’s algorithms that is suitable for explicit ratings and will be used for our demonstration. Before we construct our recommender we look at the Euclidean distance similarity in more detail. The algorithm computes the Euclidean distance between the preference vectors of each pair of items. The shorter the distance between two vectors, the greater the similarity. For instance, suppose we have users u1, u2 and u3 and item i1. Let’s say the preferences of these users for i1 are 2, 4 and 5 respectively. We now have a preference vector [2,4,5] for item i1. The Euclidean distance between two such vectors can now be computed and used as a measure of their similarity. The Euclidean distance between two vectors i and j equals the square root of the sum of squared differences between their corresponding coordinates. See the formula below:
$$d_{ij}=\sqrt{\sum\limits_{k=1}^n \left(x_{ik} - x_{jk}\right)^2}$$

The EuclideanDistanceSimilarity calculates this distance for each pair of items and then returns

$$\frac{1}{1 + d_{ij}}$$

which results in a value greater than 0 and at most 1, where 1 means the two preference vectors are identical. The EuclideanDistanceSimilarity can also be weighted. If you pass the Weighting.WEIGHTED enum value to the constructor of EuclideanDistanceSimilarity, the algorithm will weight the values based on the number of users and the number of co-occurring preferences.
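As a worked example of the two formulas, here is a minimal sketch in plain Java. It is not Mahout's implementation: it ignores weighting and assumes every user rated both items. The vector for i2 and the class name EuclideanSimilaritySketch are made up for illustration.

```java
/** Minimal sketch of the distance-based similarity; not Mahout's implementation. */
final class EuclideanSimilaritySketch {

    /** Euclidean distance between two preference vectors of equal length. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Maps a distance to a similarity: 1 for identical vectors, approaching 0 as they diverge. */
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        double[] i1 = {2, 4, 5}; // ratings of u1, u2 and u3 for item i1, as in the text
        double[] i2 = {2, 1, 1}; // hypothetical ratings of the same users for a second item i2
        System.out.println("distance   = " + distance(i1, i2));
        System.out.println("similarity = " + similarity(i1, i2));
    }
}
```

For these two vectors the coordinate differences are 0, 3 and 4, so the distance is the square root of 25, which is 5, and the similarity is 1/6, roughly 0.17.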

Creating a web-based recommender

Below is a list of components we need to create a small web-based recommendation engine, using Taste, Wicket and JPA/Hibernate. I won’t cover all the details of building this webapp, just the main building blocks. You can download the code for this example here and look at the specific details. We need the following components:

  • Datamodel
    A FileDataModel that reads movie ids, user ids and ratings from the Movielens dataset directly into memory.
  • EuclideanDistanceSimilarity
    Computes item similarities for each pair of items in the dataset.
  • GenericItemBasedRecommender
    Uses both the datamodel and the similarity algorithm to compute similar items in memory.
  • MovieRepository
    JPA repository for retrieving Movie objects
  • MovieService
    Uses the recommender and the MovieRepository to retrieve most similar movies for a given movie id.
  • Wicket MoviePage, HTML + CSS
    This includes a page for viewing a movie along with similar movies, a few model classes, some HTML and CSS, and a few code tweaks to the original wicket quickstart project.

Note that Taste ships with a preconfigured MovielensRecommender. For the purpose of this article however, I wanted to show you how to build a recommender from the ground up.

Configuring the ResourceDataModel, EuclideanDistanceSimilarity and GenericItemBasedRecommender

Because of license restrictions the Movielens data cannot be shipped with this demo, so you need to download it here. Place the u.data file under src/main/resources/grouplens/100K/ratings/ and place the u.item file under src/main/resources/grouplens/100K/data/. Now that you have the ratings data set up, you need to feed it into a DataModel class. You can use a FileDataModel directly, but then you need to use an absolute path. Instead I implemented a ResourceDataModel which reads ratings files from the classpath. Below is the implementation of the ResourceDataModel, which is a wrapper around a FileDataModel. More on how to wire it up below.

package nl.jteam.mahout.gettingstarted.datamodel;

// Imports omitted.

/**
 * DataModel implementation which reads a Spring {@link org.springframework.core.io.Resource} into a
 * {@link org.apache.mahout.cf.taste.impl.model.file.FileDataModel} delegate.
 *
 * @author Frank Scholten
 */
public class ResourceDataModel implements DataModel {
    FileDataModel delegate;

    ResourceDataModel() { // For testing
    }

    /**
     * Reads the preferences from the given {@link org.springframework.core.io.Resource}
     *
     * @param resource with user IDs, items IDs and preferences
     */
    public ResourceDataModel(Resource resource) {
        try {
            this.delegate = new FileDataModel(resource.getFile());
        } catch (IOException e) {
            throw new RuntimeException("Could not read resource " + resource.getDescription(), e);
        }
    }

    @Override
    public LongPrimitiveIterator getUserIDs() throws TasteException {
        return delegate.getUserIDs();
    }

    @Override
    public PreferenceArray getPreferencesFromUser(long userID) throws TasteException {
        return delegate.getPreferencesFromUser(userID);
    }

    @Override
    public FastIDSet getItemIDsFromUser(long userID) throws TasteException {
        return delegate.getItemIDsFromUser(userID);
    }

    @Override
    public LongPrimitiveIterator getItemIDs() throws TasteException {
        return delegate.getItemIDs();
    }

    @Override
    public PreferenceArray getPreferencesForItem(long itemID) throws TasteException {
        return delegate.getPreferencesForItem(itemID);
    }

    @Override
    public Float getPreferenceValue(long userID, long itemID) throws TasteException {
        return delegate.getPreferenceValue(userID, itemID);
    }

    @Override
    public int getNumItems() throws TasteException {
        return delegate.getNumItems();
    }

    @Override
    public int getNumUsers() throws TasteException {
        return delegate.getNumUsers();
    }

    @Override
    public int getNumUsersWithPreferenceFor(long... itemIDs) throws TasteException {
        return delegate.getNumUsersWithPreferenceFor(itemIDs);
    }

    @Override
    public void setPreference(long userID, long itemID, float value) throws TasteException {
        delegate.setPreference(userID, itemID, value);
    }

    @Override
    public void removePreference(long userID, long itemID) throws TasteException {
        delegate.removePreference(userID, itemID);
    }

    @Override
    public boolean hasPreferenceValues() {
        return delegate.hasPreferenceValues();
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
        delegate.refresh(alreadyRefreshed);
    }
}

Next you need to initialize the movie database. Run the initialize_movielens_db.sql script from the src/main/resources/sql folder. This will create the movielens database and a user whose username and password are both movielens. Additionally, it creates the movie database and loads it with the movie titles from the u.item file.

Now we wire up the GenericItemBasedRecommender and its dependencies, the EuclideanDistanceSimilarity and the ResourceDataModel, in the following Spring context:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- Recommender -->

    <bean id="euclidianDistanceRecommender" class="org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender">
        <constructor-arg index="0" ref="movielensDataModel100K"/>
        <constructor-arg index="1" ref="euclidianDistanceSimilarity"/>
    </bean>

    <!-- DataModel -->

    <bean id="movielensDataModel100K" class="nl.jteam.mahout.gettingstarted.datamodel.ResourceDataModel">
        <constructor-arg value="classpath:/grouplens/100K/ratings/u.data"/>
    </bean>

    <!-- Similarity -->

    <bean id="euclidianDistanceSimilarity" class="org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity">
        <constructor-arg ref="movielensDataModel100K"/>
        <constructor-arg value="WEIGHTED"/>
    </bean>

</beans>

MovieRepository, MovieService & MoviePage

Our recommender is now ready to determine similar movies for a given movie ID. However, the GenericItemBasedRecommender only returns item IDs of type long, not the movies themselves. In order to display the actual movie information to users we need to create a MovieRepository which fetches recommended Movie objects. Additionally, we need a MovieService which coordinates the MovieRepository and the GenericItemBasedRecommender so that we can retrieve recommended Movie objects for a given movie ID. Below is a snippet of a JPA implementation of a MovieRepository.

package nl.jteam.mahout.gettingstarted.repository;

// Imports omitted.

/**
 * Repository for retrieving {@link Movie}s
 *
 * @author Frank Scholten
 */
@Repository
public class JpaMovieRepository implements MovieRepository {

    @PersistenceContext
    private EntityManager entityManager;

    /** {@inheritDoc} */
    @Override
    public Movie getMovieById(long id) {
        return entityManager.find(Movie.class, id);
    }

    /** {@inheritDoc} */
    @Override
    @SuppressWarnings("unchecked")
    public List<Movie> getMoviesById(List<Long> movieIds) {
        return (List<Movie>) entityManager.createQuery("SELECT m FROM movie m WHERE m.id IN (:movieIds)")
                .setParameter("movieIds", movieIds)
                .getResultList();
    }
}

Below is a code snippet of a default implementation of the MovieService.

package nl.jteam.mahout.gettingstarted.service;

// Imports omitted.

/**
 * Service for retrieving and recommending {@link Movie}s.
 *
 * @author Frank Scholten
 */
@Transactional
@Service
public class DefaultMovieService implements MovieService {

    @Autowired
    private MovieRepository movieRepository;

    @Autowired
    private ItemBasedRecommender movieRecommender;

    public Movie getMovieById(long id) {
        return movieRepository.getMovieById(id);
    }

    public List<Movie> moreLikeThis(long movieId) {
        try {
            // Ask Taste for the five items most similar to this movie.
            List<RecommendedItem> recommendedItems = movieRecommender.mostSimilarItems(movieId, 5);

            List<Long> ids = new ArrayList<Long>();
            for (RecommendedItem r : recommendedItems) {
                ids.add(r.getItemID());
            }

            return movieRepository.getMoviesById(ids);

        } catch (TasteException e) {
            // No recommendations are shown when the recommender fails.
            return Collections.emptyList();
        }
    }
}
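One detail to keep in mind with this approach: a query with an IN clause and no ORDER BY does not guarantee that rows come back in the order of the supplied IDs, so the ranking returned by mostSimilarItems may get lost on the way to the page. The sketch below shows one possible way to restore it in memory. The class RankedLookupSketch and its nested Movie class are hypothetical stand-ins, not part of the demo code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch: put fetched movies back into the order given by the recommender. */
final class RankedLookupSketch {

    /** Hypothetical stand-in for the Movie entity of the demo. */
    static final class Movie {
        private final long id;
        private final String title;

        Movie(long id, String title) {
            this.id = id;
            this.title = title;
        }

        long getId() { return id; }
        String getTitle() { return title; }
    }

    /** Returns the fetched movies sorted by the position of their ID in rankedIds. */
    static List<Movie> inRecommendedOrder(List<Long> rankedIds, List<Movie> fetched) {
        Map<Long, Movie> byId = new HashMap<>();
        for (Movie movie : fetched) {
            byId.put(movie.getId(), movie);
        }
        List<Movie> ordered = new ArrayList<>();
        for (Long id : rankedIds) {
            Movie movie = byId.get(id);
            if (movie != null) { // skip IDs that were not found in the database
                ordered.add(movie);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<Long> rankedIds = Arrays.asList(5L, 3L, 7L); // most similar first
        List<Movie> fetched = Arrays.asList(new Movie(7, "C"), new Movie(3, "A"), new Movie(5, "B"));
        for (Movie movie : inRecommendedOrder(rankedIds, fetched)) {
            System.out.println(movie.getId() + " " + movie.getTitle());
        }
    }
}
```

Whether this step is needed depends on how the database happens to return the rows, so treat it as an optional safeguard.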

Finally, below is a snippet of the Wicket MoviePage, which displays the current movie and similar movies fetched through specially created Wicket models.

package nl.jteam.mahout.gettingstarted.web.page;

// Imports omitted.

/**
 * Page for showing a single {@link Movie} from the Movielens dataset along
 * with recommended movies i.e. 'more like this'.
 *
 * @author Frank Scholten
 */
public class MoviePage extends WebPage {

    private static final String MOVIE_ID = "0";

    public MoviePage(PageParameters pageParameters) {
        final long movieId = pageParameters.getLong(MOVIE_ID, 1);

        MovieModel model = new MovieModel(movieId);
        add(new Label("title", model.getObject().getTitle()));

        PropertyListView<Movie> recommendedMovies = new PropertyListView<Movie>("moreLikeThis", new RecommendedMoviesModel(movieId)) {
            @Override
            protected void populateItem(ListItem<Movie> listItem) {
                Movie movie = listItem.getModelObject();
                // Link each recommended movie back to its own MoviePage.
                PageParameters linkParameters = new PageParameters();
                linkParameters.put(MOVIE_ID, movie.getId());

                BookmarkablePageLink<MoviePage> movieLink = new BookmarkablePageLink<MoviePage>("link", MoviePage.class, linkParameters);
                listItem.add(movieLink);
                Label movieTitle = new Label("title");
                movieTitle.setRenderBodyOnly(true);
                movieLink.add(movieTitle);
            }
        };
        add(recommendedMovies);
    }
}

Running the web application

First you need to download the Movielens dataset and add the ratings file to the classpath under grouplens/100K/ratings, as configured in the Spring context above. Since this example is based on the Wicket quickstart project, you can start the application via Jetty by running the Start class from your favourite IDE. Go to http://localhost:9090/ and you can browse through movies via recommendations. Alternatively you can build the WAR and drop it into Tomcat.

Performance tweaks

  • If you want to experiment with the larger datasets, you need to add -Xmx512m as a VM parameter to run this application with the 10 million ratings dataset. Also, Taste’s FileDataModel uses commas and tabs as delimiters. You may need to run the Movielens files through sed before feeding them into a FileDataModel.
  • The FileDataModel reads everything into memory before computation. This is much faster than using a MySQLJDBCDataModel, since the latter requires around O(n²) database queries to compute similarities between all pairs of items. Reading all data into memory is not always feasible, however. An alternative is to precompute the similarities for the item pairs, store the results in the database and read them via the MySQLJDBCItemSimilarity.
  • You can also sample the dataset and/or remove noise elements to speed things up a little more.

These performance aspects are an interesting topic for a later blog post. If you have experience with these types of issues, please post a comment and we’ll discuss them.
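As an alternative to sed for the delimiter issue mentioned in the first tip, here is a minimal Java sketch of the conversion. It assumes the ratings.dat layout of the 1 million and 10 million datasets, where fields are separated by a double colon (UserID::MovieID::Rating::Timestamp). The class name DelimiterSketch is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Minimal sketch: rewrite "::"-delimited rating lines as comma-delimited lines. */
final class DelimiterSketch {

    /** Replaces every "::" field separator with a comma and leaves the rest untouched. */
    static String toCommaSeparated(String line) {
        return line.replace("::", ",");
    }

    /** Reads lines from standard input and writes the converted lines to standard output. */
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(toCommaSeparated(line));
        }
    }
}
```

After compiling, it could be run as java DelimiterSketch < ratings.dat > ratings.csv, with the resulting file then used as input for the data model.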

Conclusions

This concludes the getting started post on Mahout / Taste. What we didn’t cover was how to update the recommender and how to customize your recommender with boosting of items. These are all subjects for future blog posts.

References

  • Getting started demo – source code
  • You can download the source code of this example here.

  • Grouplens datasets
  • The Grouplens research group of the University of Minnesota have made a few datasets publicly available for research purposes.

  • Mahout in Action EAP
  • This is a great resource on Mahout. It explains a lot about performance issues and details how the algorithms stack up against each other. It also provides a lot of examples and case studies on how to use Mahout in practice.