Simon says: optimize is bad for you…

by Simon Willnauer, November 21, 2011

In the upcoming Apache Lucene 3.5 release we deprecated an old and long-standing method on IndexWriter that almost everyone who has ever used Lucene knows: IndexWriter#optimize(). I expect a lot of users to ask why we did this, and that is one of the reasons I wrote this blog post. Let me go back a couple of steps and first explain what optimize did and, even more importantly, why previous versions of Lucene had this option in the first place.

Lucene writes segments?

One of the principles in Lucene since day one is the write-once policy: we never write a file twice. When you add a document via IndexWriter it gets indexed into memory, and once we have reached a certain threshold (max buffered documents or RAM buffer size) we write all the documents from main memory to disk; you can find out more about this here and here. Writing the buffered documents to disk produces an entirely new index called a segment. When you index a bunch of documents or run incremental indexing in production, you can see the number of segments changing frequently. However, once you call commit, Lucene flushes its entire RAM buffer into segments, syncs them and writes pointers to all segments belonging to this commit into the SEGMENTS file.
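To make this concrete, here is a minimal sketch of that flow against the Lucene 3.x API (the directory path, analyzer choice, buffer size and field names are made-up examples, not from this post):

 import java.io.File;

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;
 import org.apache.lucene.util.Version;

 public class IndexingSketch {
   public static void main(String[] args) throws Exception {
     Directory dir = FSDirectory.open(new File("/tmp/index"));
     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35,
         new StandardAnalyzer(Version.LUCENE_35));
     config.setRAMBufferSizeMB(32.0); // flush a new segment once ~32 MB of RAM is used

     IndexWriter writer = new IndexWriter(dir, config);
     Document doc = new Document();
     doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
     doc.add(new Field("body", "hello segments", Field.Store.NO, Field.Index.ANALYZED));
     writer.addDocument(doc); // indexed into the RAM buffer, not yet searchable from disk
     writer.commit();         // flush, sync and record the segments of this commit
     writer.close();
   }
 }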

So far so good. But since Lucene never changes files, how does it update documents? The truth is, it doesn’t. In Lucene an update is just an atomic add & delete: Lucene adds the updated document to the index and marks all previous versions as deleted. Alright, but how do we get rid of deleted documents then?
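IndexWriter exposes this directly via IndexWriter#updateDocument(Term, Document), which deletes every document matching the term and adds the new version in one atomic step. A small sketch (the field names are again made up):

 import java.io.IOException;

 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;

 public class UpdateSketch {
   static void updateById(IndexWriter writer) throws IOException {
     Document updated = new Document();
     updated.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
     updated.add(new Field("body", "hello again", Field.Store.NO, Field.Index.ANALYZED));
     // delete all documents whose "id" term is "1", then add the new version;
     // the old versions are only marked as deleted, the existing files stay untouched
     writer.updateDocument(new Term("id", "1"), updated);
   }
 }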

Housekeeping & Per-Segment Search?

Obviously, Lucene needs to do some housekeeping here. What happens under the hood is that from time to time segments are merged into (usually) bigger segments to:

  • reduce the number of segments to be searched
  • expunge deleted documents (until they are merged away, deleted documents still contribute to document frequency and thus influence scoring)
  • keep the number of open file handles small (Lucene tends to have a fair few files open)
  • reduce disk space usage

All this happens in the background, controlled by a configurable MergePolicy. The MergePolicy takes care of keeping the number of segments balanced and merges them together when needed. I don’t want to go into the details of merging here, which is clearly out of scope for this post – maybe I or someone else will come back to it another time. Yet, there is another way of forcing merges to happen: you can call IndexWriter#optimize(), which merges all existing segments into one large segment.
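To give you an idea what this looks like in code, here is a sketch of configuring a TieredMergePolicy (the default policy in recent 3.x releases) on an IndexWriterConfig; the numbers are arbitrary examples, not recommendations:

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.IndexWriterConfig;
 import org.apache.lucene.index.TieredMergePolicy;
 import org.apache.lucene.util.Version;

 public class MergePolicySketch {
   public static IndexWriterConfig newConfig() {
     TieredMergePolicy mergePolicy = new TieredMergePolicy();
     mergePolicy.setSegmentsPerTier(10.0);     // allow ~10 segments per tier before merging
     mergePolicy.setMaxMergedSegmentMB(512.0); // don't produce segments larger than ~512 MB
     return new IndexWriterConfig(Version.LUCENE_35,
         new StandardAnalyzer(Version.LUCENE_35)).setMergePolicy(mergePolicy);
   }
 }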

Optimize sounds like a very powerful operation, right? It certainly is powerful, but “if all you have is a hammer, everything looks like a nail”. In earlier versions of Lucene (before 2.9) the underlying index was treated as one big index, and reopening the IndexReader invalidated all data structures & caches. This has changed quite a lot towards a per-segment orientation. Almost all structures in Lucene now work on a per-segment basis, which means that we only load or reopen what has changed instead of the entire index. As a user it might still look like one big index, but once you look a little under the hood you see that everything works per-segment, like this IndexSearcher snippet:

 public void search(Weight weight, Filter filter, Collector collector)
      throws IOException {
    // iterate through all segment readers & execute the search
    for (int i = 0; i < subReaders.length; i++) {
      // pass the reader to the collector
      collector.setNextReader(subReaders[i], docBase + docStarts[i]);
      final Scorer scorer = ...; // create a scorer for this segment from the weight
      if (scorer != null) { // score documents on this segment
        scorer.score(collector);
      }
    }
  }
Figure 1. Executing searches across segments in IndexSearcher
Each search you execute in Lucene runs on each segment in the index sequentially, unless you have an optimized index. Well, this sounds like you should optimize all the time, right? Wrong! Think about it again: optimizing your index will build one large segment out of your maybe 5 or 10 or N segments, and this has several side-effects:
  • an enormous amount of IO when merging into one big segment
  • merging can take hours when your index is large
  • reopening can cause unexpected memory peaks

You say this doesn’t sound that bad? Well, if you run this in production with a large index, optimizing can have a large impact on your system's performance. Lucene’s bread and butter is the filesystem cache it relies on for searching. During a merge you invalidate lots of disk cache, which is in turn not available to the segments currently being searched. Once you reopen the index all data needs to be loaded into the disk cache again, field caches need to be created, term dictionaries loaded, etc. And last but not least, you are likely doubling the disk space required to hold your index, as the old segments are still referenced while optimize is running.
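For contrast, the regular per-segment reopen path is cheap: only newly written segments are actually opened. A minimal sketch using IndexReader.openIfChanged, which replaces the now-deprecated IndexReader#reopen() in 3.5 (assuming the old reader has no other users, otherwise you need reference counting):

 import java.io.IOException;

 import org.apache.lucene.index.IndexReader;

 public class ReopenSketch {
   // returns a fresh reader if the index changed, otherwise the old one;
   // unchanged segments are shared between the old and the new reader
   static IndexReader reopenIfNeeded(IndexReader oldReader) throws IOException {
     IndexReader newReader = IndexReader.openIfChanged(oldReader);
     if (newReader == null) {
       return oldReader; // nothing changed, keep using the old reader
     }
     oldReader.close();  // release the old reader (assumes no other users)
     return newReader;
   }
 }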

Death to optimize, here comes forceMerge(1)

As I mentioned above, there is no IndexWriter#optimize() anymore in Lucene 3.5. If you still want to explicitly invoke an optimize-like merge, you can use IndexWriter#forceMerge(int), where you specify the maximum number of segments left after the merge finishes. The functionality is still there, but we hope that fewer people feel like calling this together with each commit. If you use optimize extensively, think about it again and give your disks a break.
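For the rare cases where an explicit merge is justified – say, an index that will never change again – this is what the new call looks like:

 import java.io.IOException;

 import org.apache.lucene.index.IndexWriter;

 public class ForceMergeSketch {
   // use sparingly, e.g. on an index that will never be modified again
   static void mergeDown(IndexWriter writer) throws IOException {
     writer.forceMerge(1); // merge down to at most one segment, like optimize() did
     writer.commit();      // make the merged segment part of a commit point
   }
 }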