Language analysis comparable to Fast / Endeca for Solr

by Martijn van Groningen, March 30, 2010

Good, solid language analysis is a very important asset for the quality of your search results. Microsoft Fast and Endeca, for instance, present it as one of their unique selling points. However, you can get the same powerful analysis when you implement your search with Apache Solr.

The thing is that neither Microsoft Fast nor Endeca implemented its language analysis itself. Under the hood, both use an existing commercial solution, the Rosette Linguistics Platform (RLP) from Basis Technology, to provide their sophisticated language analysis capabilities. This is good news, because RLP also provides integration components for Apache Solr, which allow anyone using Solr to plug RLP’s advanced language capabilities into their solution.

What is RLP?
The Rosette Linguistics Platform (RLP) is a commercial solution for linguistic analysis of text in many languages (English as well as dozens of major European, Asian, and Middle Eastern languages). In addition, RLP supports advanced entity extraction, base noun detection, sentence boundary detection and even part-of-speech tagging. As RLP is a commercial product, it comes with a price tag.
Even so, we feel that RLP has a lot of potential when you need sophisticated language capabilities. That’s why we want to show you how to integrate RLP into your solution based on Apache Solr.

Installing the RLP platform for Solr
RLP and Solr are two separate systems, but integrating them is quite straightforward. Basis provides extensive documentation on how to set up RLP on your machine. The actual language analysis is customized through the RLP configuration, not directly via Solr. Installing and configuring RLP is explained in the documentation that is included with the RLP bundle.

Configuring Solr to use RLP for analyzing your documents is also relatively easy to set up. To use RLP’s language analysis in Solr, you configure the RLPTokenizerFactory as the tokenizer for the specific fields you want to use it on. As RLP is not written in Java, JNI is used to integrate a Java application with RLP. For Solr to integrate with RLP, you must set certain environment variables that point to the RLP installation. Basis also has quite extensive documentation for the Solr integration, which is included in the RLP Solr bundle.

RLP’s stemming capabilities
Terms like verbs, nouns and adjectives appear in different forms. A search for one form of a term can miss documents that contain another form. For example, if you search for the term customize, but the token in the index is customization, you are unlikely to find results with this term. The solution to this is stemming, which reduces a term to a base form. There are many ways to do this, but the most common way is to remove the term’s suffix according to some basic rules. In our case customize and customization would both be stemmed to customiz. Stemming is usually applied both during indexing and during searching (stemming the search query), so it does not matter whether we search for customization or customize: both terms give the same result. Another example is go and going; both forms are stemmed to go.
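To make the idea concrete, here is a minimal sketch of suffix stripping applied at both index time and query time. The three-rule suffix list, the minimum stem length and the function names are made up for this illustration; this is not a real stemming algorithm and not how Solr or RLP work internally.

```python
# Toy suffix-stripping stemmer. The rules below are illustrative assumptions,
# chosen only to reproduce the examples in the text.
SUFFIXES = ("ation", "ing", "e")  # hypothetical rule set, checked in order
MIN_STEM_LENGTH = 2               # never strip a token down to fewer characters


def suffix_stem(token):
    """Lowercase the token and strip the first matching suffix, if any."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LENGTH:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each stemmed token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(suffix_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Stem the query term the same way as the indexed tokens, then look it up."""
    return index.get(suffix_stem(query), set())


docs = {1: "customization guide", 2: "going home"}
index = build_index(docs)

# The document contains "customization", the query is "customize":
# both reduce to "customiz", so document 1 is found.
print(search(index, "customize"))
```

Because the same function is applied to the documents and to the query, it does not matter that customiz is not a real word; it only has to be the same string on both sides.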

All languages have irregular terms. Take the verb to buy: the present tense is buy and the past tense is bought. A suffix-based stemmer will not create a common base form for these tokens. To create a proper base form, the stemmer needs some basic knowledge of the language. Stemmers that do this usually use lemmatisation. With this approach, buy and bought are both stemmed to buy. Lemmatisation-based stemmers stem more terms to a common base form and therefore increase the findability of documents in your index.
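The difference can be sketched with a tiny lookup table. The dictionary and the suffix rules below are hand-written assumptions for this example only; a real lemmatiser such as the one in RLP relies on far larger language resources, and nothing here reflects its actual implementation.

```python
# Hypothetical, hand-written lemma dictionary for illustration only.
LEMMAS = {
    "bought": "buy",
    "buys": "buy",
    "buying": "buy",
}


def lemmatize(token):
    """Return the dictionary base form, or the lowercased token if unknown."""
    token = token.lower()
    return LEMMAS.get(token, token)


def strip_suffix(token):
    """Naive suffix stemmer for comparison; it knows nothing about the language."""
    token = token.lower()
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Suffix stripping leaves the irregular past tense untouched,
# so "buy" and "bought" end up as different index terms.
print(strip_suffix("buy"), strip_suffix("bought"))

# The dictionary lookup maps both forms to the same base form.
print(lemmatize("buy"), lemmatize("bought"))
```

The price of the dictionary approach is that it is only as good as the language data behind it, which is exactly what a product like RLP supplies.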

RLP’s stemming capabilities fall into the latter category of stemming algorithms and, as you can see in the table below, the stemming is quite powerful.

Unstemmed token    Stemmed token
mice               mouse
mouse              mouse
been               be
is                 be

Advantages of using RLP
Why would you use RLP in conjunction with Solr? RLP provides you with really powerful language analysis, and Solr is an extensible open-source search engine. By integrating the two, RLP will complement the language analysis provided by Solr and increase the quality of your search results. As seen in the previous table, the better stemming capabilities increase the likelihood that relevant documents are found.

If you are considering RLP for your project or solution, please feel free to contact us, so we can help you both make the decision and implement it.