Searching your Java CMS using Apache Solr: Introduction

by Ralph Benjamin Ruijs, March 31, 2010

All Content Management Systems (CMS) provide the capability for users to search the content and browse the results. However, this functionality commonly turns out to be insufficient. This can be because you want to allow users to search over multiple sources (the content repository, but also some external system) and combine the results, or because you want to offer your users more advanced search functionality like “Did you mean…” suggestions or faceted navigation. Therefore, you might want to consider using an advanced, open source search solution like Apache Solr. This blog post is the first in a series that will introduce searching different CMS solutions using Apache Solr.

Introduction

To finish the last part of my Bachelor of ICT, I recently started my internship at JTeam. JTeam has been doing a lot of projects that either use a CMS (e.g. Magnolia and Hippo) or require a search solution, typically using Apache Lucene and Apache Solr. My assignment is to investigate the problem of making the information from different CMS solutions searchable using Solr and hopefully come up with a good solution.

In order for the content in a CMS to be available for searching, it needs to be indexed by Solr. The problem is that many of the Java-based CMS solutions do not provide an easy way to get their content into Solr. We need to get the information out of the CMS and feed it to Solr. In this blog post I will discuss the synchronization problem at a more generic level and look at some possible solutions.

Synchronization problem

Before diving into CMS solutions specifically, let’s look at a more classic problem: the data replication problem. When we get information from a specific source and replicate it into another system, the data may get out of date when changes in the original source are not immediately propagated. This is a common problem that has been studied extensively. Junghoo Cho and Hector Garcia-Molina studied this problem with a focus on web crawlers. They started their study by looking at how to measure the problem and defined the freshness and age of a database.

Intuitively, we consider a database “fresher” when the database has more up-to-date elements. For instance, when database A has 10 up-to-date elements out of 20 elements, and when database B has 15 up-to-date elements, we consider B to be fresher than A. Also, we have a notion of “age”: Even if all elements are obsolete, we consider database A “more current” than B, if A was synchronized 1 day ago, and B was synchronized 1 year ago.

With this notion of Freshness and Age we can look at the potential quality of the solutions to this problem.
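To make the two notions concrete, here is a minimal sketch in Java. It assumes the usual reading of the study's definitions: freshness is the fraction of elements that are up to date, and age is the average time elements have been out of date, counting up-to-date elements as zero. The class names and the `sourceChangedAt` field are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class ReplicaQuality {
    /** Freshness: fraction of elements whose copy still matches the source. */
    static double freshness(List<ReplicatedElement> elements) {
        if (elements.isEmpty()) return 1.0;
        long upToDate = elements.stream().filter(ReplicatedElement::isUpToDate).count();
        return (double) upToDate / elements.size();
    }

    /** Age: average time in ms that elements have been out of date; up-to-date elements count as 0. */
    static double averageAgeMillis(List<ReplicatedElement> elements, long now) {
        if (elements.isEmpty()) return 0.0;
        long total = 0;
        for (ReplicatedElement element : elements) {
            if (!element.isUpToDate()) total += now - element.sourceChangedAt;
        }
        return (double) total / elements.size();
    }

    public static void main(String[] args) {
        long now = 10_000L;
        List<ReplicatedElement> replica = Arrays.asList(
            new ReplicatedElement("a", null),
            new ReplicatedElement("b", 4_000L),
            new ReplicatedElement("c", null),
            new ReplicatedElement("d", 8_000L));
        System.out.println(freshness(replica));             // 0.5: 2 of 4 elements are up to date
        System.out.println(averageAgeMillis(replica, now)); // 2000.0: (6000 + 2000) / 4
    }
}

/** Hypothetical replicated element; sourceChangedAt is null while the copy is up to date. */
final class ReplicatedElement {
    final String id;
    final Long sourceChangedAt; // time (ms) of the first source change not yet replicated, or null

    ReplicatedElement(String id, Long sourceChangedAt) {
        this.id = id;
        this.sourceChangedAt = sourceChangedAt;
    }

    boolean isUpToDate() {
        return sourceChangedAt == null;
    }
}
```

A perfectly synchronized replica has a freshness of 1 and an age of 0; every solution below can be judged by how close it stays to that.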

Possible ways

Basically, there are three ways of replicating the content in a CMS to a search solution like Apache Solr:

- Look at the CMS’s data to see what has changed
- Listen to the CMS and hear what has changed
- Let the CMS update Solr

The first option, looking at the CMS’s data to see what has changed, can be done by creating a crawler. The crawler will iterate over all the content in the CMS and compare the last known version with the current version of the content in the CMS. Such a crawler is typically implemented by repeating the following steps:

Put all the known elements in a Queue
While the Queue is not Empty
- Take an element from the Queue
- Check if the element has changed and, if so, update Solr
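The steps above can be sketched in Java as follows. This is only an illustration under stated assumptions: `ContentSource` and `Indexer` are hypothetical interfaces standing in for the CMS and for whatever sends documents to Solr, and a change is detected by comparing a version marker with the last one the crawler saw.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Queue;

final class CrawlerSketch {
    /** Hypothetical view on the CMS: returns the current version marker of an element. */
    interface ContentSource {
        String currentVersion(String elementId);
    }

    /** Hypothetical indexer, standing in for the code that updates Solr. */
    interface Indexer {
        void reindex(String elementId);
    }

    private final Map<String, String> lastKnownVersion = new HashMap<>();

    /** One crawl pass over all known elements; returns how many were reindexed. */
    int crawl(Iterable<String> knownIds, ContentSource cms, Indexer indexer) {
        Queue<String> queue = new ArrayDeque<>();
        for (String id : knownIds) queue.add(id);      // put all known elements in a queue
        int reindexed = 0;
        while (!queue.isEmpty()) {
            String id = queue.poll();                  // take an element from the queue
            String current = cms.currentVersion(id);
            if (!Objects.equals(current, lastKnownVersion.get(id))) { // has it changed?
                indexer.reindex(id);
                lastKnownVersion.put(id, current);
                reindexed++;
            }
        }
        return reindexed;
    }

    public static void main(String[] args) {
        Map<String, String> cmsData = new HashMap<>();
        cmsData.put("page-1", "v1");
        cmsData.put("page-2", "v1");
        CrawlerSketch crawler = new CrawlerSketch();
        Indexer indexer = id -> System.out.println("reindex " + id);

        crawler.crawl(cmsData.keySet(), cmsData::get, indexer); // first pass: everything is new
        cmsData.put("page-2", "v2");
        crawler.crawl(cmsData.keySet(), cmsData::get, indexer); // second pass: only page-2 changed
    }
}
```

Note that every pass still visits every element, even when nothing has changed, which is where the growing resource cost comes from.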

This is fairly straightforward to implement. Luckily, the same study shows there is little to gain in trying to optimize the order in which we look at the elements, or the amount of resources we use for specific elements (e.g. the rate at which specific elements are checked). However, in order to keep your Solr data fresh and young, a crawler will need more and more resources as the CMS grows.

The second option, listening for changes in the CMS, would be better. In order to listen for changes, the CMS should push the changes to external systems. A push, or observation, mechanism can be implemented in one of two ways: asynchronous or journaled. The simplest form, asynchronous observation, will send a notification on every change that occurs in the CMS. Getting notified of the change and being able to incorporate that change will give us a fresh and young Solr index. Although asynchronous observation looks perfect at first sight, it has two flaws: it guarantees neither that all changes are actually sent nor that they arrive in the right order. Missing a change, or getting two updates in the wrong order, leaves an element out of date and the Solr index no longer fresh.
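The ordering flaw is easy to reproduce with a small sketch. The `ChangeEvent` class and the `onChange` callback are hypothetical, and a plain map stands in for the Solr index; the listener simply applies each notification as it arrives.

```java
import java.util.HashMap;
import java.util.Map;

final class AsyncObservationSketch {
    /** Hypothetical change notification as a CMS might send it. */
    static final class ChangeEvent {
        final String elementId;
        final String content;

        ChangeEvent(String elementId, String content) {
            this.elementId = elementId;
            this.content = content;
        }
    }

    /** Stand-in for the Solr index: element id -> indexed content. */
    final Map<String, String> index = new HashMap<>();

    /** Called for every notification, in whatever order the notifications arrive. */
    void onChange(ChangeEvent event) {
        index.put(event.elementId, event.content);
    }

    public static void main(String[] args) {
        AsyncObservationSketch listener = new AsyncObservationSketch();
        ChangeEvent first = new ChangeEvent("page-1", "draft");
        ChangeEvent second = new ChangeEvent("page-1", "final");

        // The CMS made the changes in the order first, second, but delivery swapped them.
        listener.onChange(second);
        listener.onChange(first);
        System.out.println(listener.index.get("page-1")); // prints "draft": the index is stale
    }
}
```

The CMS holds "final" while the index holds "draft", and nothing will correct that until the element changes again.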

Journal-based observation solves the problems of asynchronous observation, as it can guarantee complete information and ordering. Instead of being informed of every change, we get a list, a journal, of changes since the last time we checked. Using journal-based observation, the age is only influenced by the time it takes to process the changes and the freshness by the number of changes in each journal.
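A minimal sketch of a journal-based observer, assuming the CMS exposes an ordered journal with increasing sequence numbers. The `Journal` interface, `JournalEntry` class and the in-memory journal are hypothetical and only serve the illustration; a map again stands in for the Solr index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JournalObservationSketch {
    /** Hypothetical journal entry: a change plus its position in the journal. */
    static final class JournalEntry {
        final long sequence;
        final String elementId;
        final String content;

        JournalEntry(long sequence, String elementId, String content) {
            this.sequence = sequence;
            this.elementId = elementId;
            this.content = content;
        }
    }

    /** Hypothetical CMS journal: entries with a higher sequence than the argument, oldest first. */
    interface Journal {
        List<JournalEntry> entriesAfter(long sequence);
    }

    /** Stand-in for the Solr index: element id -> indexed content. */
    final Map<String, String> index = new HashMap<>();
    private long lastProcessed = 0; // a real observer would persist this across restarts

    /** Polls the journal once and applies every change not seen before; returns the count. */
    int poll(Journal journal) {
        List<JournalEntry> entries = journal.entriesAfter(lastProcessed);
        for (JournalEntry entry : entries) {
            index.put(entry.elementId, entry.content);
            lastProcessed = entry.sequence;
        }
        return entries.size();
    }

    /** In-memory journal backed by a list, for demonstration only. */
    static Journal inMemoryJournal(List<JournalEntry> log) {
        return sequence -> {
            List<JournalEntry> result = new ArrayList<>();
            for (JournalEntry entry : log) {
                if (entry.sequence > sequence) result.add(entry);
            }
            return result;
        };
    }

    public static void main(String[] args) {
        List<JournalEntry> log = new ArrayList<>();
        Journal journal = inMemoryJournal(log);
        JournalObservationSketch observer = new JournalObservationSketch();

        log.add(new JournalEntry(1, "page-1", "draft"));
        log.add(new JournalEntry(2, "page-1", "final"));
        System.out.println(observer.poll(journal));       // 2: both changes applied in order
        System.out.println(observer.index.get("page-1")); // final
        System.out.println(observer.poll(journal));       // 0: nothing new since the last poll
    }
}
```

Because the observer remembers how far it has read, a missed poll only delays the changes; they are still applied, in order, on the next poll.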

The third option, letting the CMS update Solr, is the ideal one. Instead of crawling or observing the CMS and updating Solr when something changes, the CMS could directly propagate all changes to Solr. The age and freshness of Solr would be perfect, as changes to data can be made available at the same time to both the CMS and Solr. Having a CMS that is built using an event-driven, highly decoupled architecture, for instance using a CQRS framework like Axon Framework, would make extending the CMS with the capability to update Solr very easy. However, currently no Java-based CMS provides this capability out of the box.
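To illustrate the idea, here is a toy sketch of a content store that publishes every save to its subscribers. It does not use the Axon Framework API or any real CMS; the `Cms` class and its methods are invented for this example, and a map stands in for the Solr index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

final class DirectUpdateSketch {
    /** Toy content store that hands every saved element to its subscribers. */
    static final class Cms {
        private final Map<String, String> content = new HashMap<>();
        private final List<BiConsumer<String, String>> subscribers = new ArrayList<>();

        void subscribe(BiConsumer<String, String> subscriber) {
            subscribers.add(subscriber);
        }

        /** Stores the content and propagates the change as part of the same call. */
        void save(String elementId, String body) {
            content.put(elementId, body);
            for (BiConsumer<String, String> subscriber : subscribers) {
                subscriber.accept(elementId, body);
            }
        }

        String get(String elementId) {
            return content.get(elementId);
        }
    }

    public static void main(String[] args) {
        Cms cms = new Cms();
        Map<String, String> index = new HashMap<>(); // stand-in for the Solr index
        cms.subscribe(index::put);                   // the "update Solr" extension

        cms.save("page-1", "draft");
        cms.save("page-1", "final");
        System.out.println(index.get("page-1").equals(cms.get("page-1"))); // true
    }
}
```

Adding the search integration takes a single `subscribe` call here, which is the kind of extension point an event-driven CMS could offer.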

Conclusion

Trying to solve the problem of making the information from different CMS solutions searchable using Solr gave some interesting insights into the classic data replication problem. Using age and freshness as a guideline, letting the CMS update Solr looks like the best solution, but is currently not usable when looking at different CMS solutions. Both asynchronous observation and crawling have upsides and downsides, but seem to be the most generic solutions. Journaled observation looks the most promising when weighing age and freshness against support across different CMS solutions.

In the next blog posts in this series, I will go into more detail on actually implementing this integration for a number of widely used CMS solutions like Magnolia and Hippo. Stay tuned…

References

Junghoo Cho and Hector Garcia-Molina, Effective Page Refresh Policies For Web Crawlers
Axon Framework