Analysing European Languages With Lucene

by Chris Male, December 7, 2011

Search applications increasingly need to support a wide range of European languages. Part of supporting a language is analysing words to find their stem or root form. An example of stemming is the reduction of the words “run”, “running”, “runs” and “ran” to their stem “run”. In the past, all this required in Lucene was using the Analyzer for the desired language. Nowadays, however, developers are presented with a plethora of TokenFilter alternatives: the traditional Snowball-based filters, the recently added Hunspell filter, and the vaguely named Light and Minimal filters.

In this post I want to compare and contrast these options, focusing on their algorithms and how they achieve their goals, and hopefully give you enough information to make an informed decision about which option fits your use case.

Snowball – De facto no more?

For most of Lucene’s history, the de facto analysis framework for European languages has been driven by Snowball. Snowball is a domain-specific language for defining stemming algorithms for European languages, from which ANSI C and Java implementations can be generated. Started by Martin Porter, the creator of the Porter algorithm for stemming English, Snowball provides Stemmers for commonly used European languages such as French, Dutch, German, Spanish and Italian.

Snowball Stemmers use an algorithm which allows them to stem any word in a language, whether or not the word would appear in an official dictionary. Although the algorithms for each language vary, Snowball Stemmers are considered very aggressive, often removing large suffixes to reduce words to a stem. For usage with Lucene, this can have the positive effect of reducing the number of unique terms in the index. Yet it can also heavily impact the quality of your search results. Consider the query string “international”. The Snowball Stemmer for English will strip the suffix “ational” from the word, creating the stem “intern”. As you can see, this could lead to search results that have no relation to the original query.
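To make that trade-off concrete, here is a toy, purely algorithmic suffix stripper in plain Java. It is not the Snowball algorithm and not part of Lucene; the class name ToySuffixStripper and its three-entry suffix list are made up for this illustration. All it shows is how blind suffix removal shrinks the set of unique terms while conflating unrelated words.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: NOT the Snowball/Porter algorithm.
final class ToySuffixStripper {

    // Suffixes are tried in order, longest first; the list is an assumption for this example.
    private static final List<String> SUFFIXES = List.of("ational", "ing", "s");

    // Strips the first matching suffix, provided at least 3 characters remain.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Collects the distinct stems produced for a list of words.
    static Set<String> uniqueStems(List<String> words) {
        Set<String> stems = new LinkedHashSet<>();
        for (String w : words) {
            stems.add(stem(w));
        }
        return stems;
    }

    public static void main(String[] args) {
        List<String> words = List.of("international", "intern", "interns", "interning");
        // Four distinct words collapse into a single index term.
        System.out.println(uniqueStems(words)); // prints [intern]
    }
}
```

Four input words become one term, which is great for index size, but a query for “international” now matches documents about interns.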

Despite their potentially negative impact, Snowball Stemmers continue to be popular due to their reliability, long lifespan and development by academic linguistic communities.

SnowballFilter is the primary entry point for Snowball Stemmers in Lucene.

Hunspell for Lucene – Beyond just Hungarian

Hunspell for Lucene, a pure Java port of the Hunspell analytical system found in major applications like OpenOffice and Firefox, is a new stemming system for Lucene which supports every European language you’ve ever heard of and many you haven’t.

Unlike Snowball, Hunspell is not a purely algorithmic system. Instead, Hunspell uses a file containing the grammatical rules for a language, encoded in a standardised format, and a dictionary file containing the language’s valid stems. During the analysis of a word, Hunspell loops over the grammatical rules, applies those that are applicable, and then checks whether a valid stem has been found. The advantage of this process is that very complicated grammatical rules can be applied, such as the removal of multiple suffixes and prefixes, and the addition of new characters. Furthermore, Hunspell will continue to stem a word until all valid stems have been found, rather than just one. However, unlike Snowball, which can analyse any word, Hunspell will only find stems for those words which match its rules or have stems found in its dictionary.
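The rules-plus-dictionary idea can be sketched in a few lines of plain Java. This is only a rough model: it does not read Hunspell's file formats, it ignores prefixes and the flags that tie rules to particular dictionary entries, and it is not how HunspellStemFilter is implemented. The class ToyDictionaryStemmer, its SuffixRule type and the tiny English rule set are all invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: not Hunspell's file format, nor Lucene's implementation.
final class ToyDictionaryStemmer {

    // A suffix rule: if the word ends with 'strip', replace that ending with 'add'.
    static final class SuffixRule {
        final String strip;
        final String add;

        SuffixRule(String strip, String add) {
            this.strip = strip;
            this.add = add;
        }
    }

    private final List<SuffixRule> rules;
    private final Set<String> dictionary;

    ToyDictionaryStemmer(List<SuffixRule> rules, Set<String> dictionary) {
        this.rules = rules;
        this.dictionary = dictionary;
    }

    // Returns every dictionary stem reachable by applying one rule (or none).
    List<String> stems(String word) {
        List<String> found = new ArrayList<>();
        if (dictionary.contains(word)) {
            found.add(word);
        }
        for (SuffixRule rule : rules) {
            if (word.length() > rule.strip.length() && word.endsWith(rule.strip)) {
                String candidate =
                    word.substring(0, word.length() - rule.strip.length()) + rule.add;
                // A candidate only counts if the dictionary lists it as a valid stem.
                if (dictionary.contains(candidate) && !found.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    // Builds a stemmer with a tiny made-up English rule set and dictionary.
    static ToyDictionaryStemmer example() {
        List<SuffixRule> rules = List.of(
            new SuffixRule("s", ""),     // "cats" -> "cat"
            new SuffixRule("ies", "y"),  // "ponies" -> "pony"
            new SuffixRule("ves", "f"),  // "leaves" -> "leaf"
            new SuffixRule("ing", "e")); // "making" -> "make"
        Set<String> dictionary = Set.of("cat", "pony", "leaf", "leave", "make");
        return new ToyDictionaryStemmer(rules, dictionary);
    }

    public static void main(String[] args) {
        ToyDictionaryStemmer stemmer = example();
        System.out.println(stemmer.stems("leaves"));   // prints [leave, leaf]
        System.out.println(stemmer.stems("ponies"));   // prints [pony]
        System.out.println(stemmer.stems("blogging")); // prints []
    }
}
```

“leaves” yields two stems because two rules both lead to dictionary entries, while “blogging” yields none because no rule produces a word the dictionary knows.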

Because the linguistic logic of Hunspell is separated from the programmatic algorithm, adding support for new languages is remarkably easy. As such, hundreds of Hunspell dictionaries have been created. Unfortunately, the quality of the dictionaries varies considerably. Therefore, although Hunspell can be used to analyse Maltese, for example, the quality of the analysis will most likely be much lower than for English. As such, the impact on your Lucene index and search results is hard to predict.

Hunspell for Lucene has only recently been developed and added to the Lucene codebase.  As such I recommend thorough testing before putting it into production.  Equally, there are some dictionary formats which are not yet supported.

HunspellStemFilter is the primary entry point for Hunspell for Lucene.

Light & Minimal – Analysis made easy

While both Snowball and Hunspell for Lucene aggressively try to unravel complex grammatical rules to find ‘true’ stems of words, the rather vaguely named Light and Minimal Stemmers found in Lucene take a very different direction. Developed from the work done by Jacques Savoy and discussed in detail here, both Stemmer forms apply few grammatical rules using incredibly simple and efficient algorithms. Most of the rules applied are the removal of plurals and common suffixes from nouns and the normalisation of special characters. The effect on a Lucene index, when compared against both Snowball and Hunspell for Lucene, is many more unique terms but a considerably reduced likelihood of unrelated words being reduced to the same stem.

The difference between the Light and Minimal forms of a Stemmer for, say, French comes down to how many rules are applied. The FrenchLightStemmer, a mere 203 lines of code vs the Snowball-derived FrenchStemmer’s 608 lines, removes both plurals and some common suffixes, as well as doing character normalisation. In comparison, the FrenchMinimalStemmer, coming in at a tiny 18 lines of code, just removes plurals.
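To give a feel for how little logic is involved, here is a rough sketch of the two flavours in plain Java. It is not the code of Lucene's FrenchMinimalStemmer or FrenchLightStemmer; the class ToyFrenchStemmers, the length thresholds and the single “ment” suffix rule are assumptions chosen to keep the example short.

```java
// Hypothetical illustration only: not the code of Lucene's
// FrenchMinimalStemmer or FrenchLightStemmer.
final class ToyFrenchStemmers {

    // "Minimal" flavour: only drop a plural marker ('s' or 'x')
    // from words of five or more characters.
    static String minimal(String word) {
        if (word.length() >= 5 && (word.endsWith("s") || word.endsWith("x"))) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // "Light" flavour: normalise a few accented characters, drop the plural,
    // then remove one common suffix (only "ment" in this sketch).
    static String light(String word) {
        String normalised = word
            .replace('\u00e9', 'e')  // e with acute accent
            .replace('\u00e8', 'e')  // e with grave accent
            .replace('\u00ea', 'e')  // e with circumflex
            .replace('\u00e0', 'a')  // a with grave accent
            .replace('\u00e2', 'a'); // a with circumflex
        String singular = minimal(normalised);
        if (singular.length() >= 8 && singular.endsWith("ment")) {
            return singular.substring(0, singular.length() - 4);
        }
        return singular;
    }

    public static void main(String[] args) {
        System.out.println(minimal("maisons"));    // prints maison
        System.out.println(minimal("rapidement")); // prints rapidement
        System.out.println(light("rapidement"));   // prints rapide
    }
}
```

The minimal variant leaves “rapidement” alone, so it stays a separate term in the index, whereas the light variant folds it into “rapide”.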

Due to their less aggressive analysis and incredible efficiency, the Light and Minimal Stemmers provide a very compelling alternative to the comparatively slow and monolithic Snowball and Hunspell stemmers.  Unfortunately, Light and Minimal Stemmers only exist for a few European languages.

Light/MinimalStemFilter implementations for Lucene can be found under the appropriate language packages in org.apache.lucene.analysis.*.

Conclusion

With multiple options for analysing European languages, deciding which to use can seem quite daunting. However, with each approaching the same problem in a very different way and the range of supported languages growing every day, it’s well worth experimenting before settling on the one that fits your use case.