The Lucene Sandbox

by Chris Male, September 15, 2011

Few people are aware that Apache Lucene has been part of the ASF since 2001, becoming a Top Level Project in 2005. Ten years is an eternity in IT, where ideas tend to evolve in leaps and bounds. Over that 10-year period many users, contributors and committers have come and gone from Lucene, each shaping in their own way what has become the de facto Open Source Search library. But of course good ideas from 10 years ago are not necessarily so good today.

Contribs and Module Consolidation

Before I talk about the Sandboxing process in more detail, it’s perhaps best to understand what the Lucene codebase has been like until recently and, moreover, where we want to go.

For some time now, Lucene has had a section in its codebase known as contrib, which is home to code that is related to Lucene but, for whatever reason, has been deemed not to belong in the core code. Examples of commonly used code that has been or remains part of the contrib are Lucene’s Highlighters, its large array of analysis tools and the flexible QueryParser framework.

One of the unfortunate features of the contrib is that code quality and algorithmic efficiency vary greatly and can go unaddressed for many years. I recently saw a class where the last SVN commit was in 2005. I can assure you that many parts of Lucene have changed since then.

With Lucene 4 and the merger of Lucene and Solr’s development, the idea was put forward that the analysis tools in the contrib and from around the rest of the codebase should be consolidated together as a module alongside Lucene and Solr. Once consolidated, the code would be held to a high standard comparable to that of Lucene’s core. With the success of this consolidation, it was decided that other concepts in Lucene should also be pulled together into modules. Two concepts that were immediately thrown around were Lucene’s wide variety of QueryParsers and its many exotic Query implementations.

It works but it’s kind of sandy

Consolidating Lucene’s QueryParsers was not such a problem. While they do suffer from their fair share of flaws, because they are so commonly used they are well tested and operate efficiently. The same cannot be said, however, about all the exotic Query implementations.

The problem confronting the consolidation of the Query impls was not that they didn’t work. If that had been the problem then it would have been fine to remove them. The problem was that they did work, but were poorly written, documented or tested, or operated in inefficient ways. To put it bluntly, the code was not necessarily up to the standard that we expected for the new queries module.

Deleting code which exists in such a grey area would be a bold decision and could prove a mistake. It was obviously added for a reason and no doubt has some users. Some code could, with some effort, probably make its way to being module worthy. Therefore the idea of a sandbox for Lucene was floated: a place of lesser standards where ‘sandy’ code that is not deemed module worthy can go, so that it can continue to be used and maybe even improved before the decision about its ultimate fate is made.

Sandy Code and The Future

So what code in Lucene is sandy? It’s very hard to say right now and is definitely subjective. Long-deprecated or poorly written code is an obvious sign, but what about a lack of testing? As such, only code which is already in the sandbox can be called sandy for sure. However, as we continue to consolidate the various parts of Lucene into modules and slowly shut down the contrib, the sandbox will no doubt grow further.

Do you have any thoughts on what perhaps belongs in the sandbox? Do share them with us; it would be good to hear from you.