On Schemas and Lucene

by Chris Male, April 17, 2012

One of the very first thing users encounter when using Apache Solr is its schema. Here they configure the fields that their Documents will contain and the field types which define amongst other things, how field data will be analyzed. Solr’s schema is often touted as one of its major features and you will find it used in almost every Solr component. Yet at the same time, users of Apache Lucene won’t encounter a schema. Lucene is schemaless, letting users index Documents with any fields they like.

To me this schemaless flexibility comes at a cost. For example, Lucene’s QueryParsers cannot validate that a field being queried even exists or use NumericRangeQuerys when a field is numeric. When indexing, there is no way to automate creating Documents with their appropriate fields and types from a series of values. In Solr, the most optimal strategies for faceting and grouping different fields can be chosen based on field metadate retrieved from its schema.

Consequently as part of the modularisation of Solr and Lucene, I’ve always wondered whether it would be worth creating a schema module so that Lucene users can benefit from a schema, if they so choose. I’ve talked about this with many people over the last 12 months and have had a wide variety of reactions, but inevitably I’ve always come away more unsure. So in this blog I’m going ask you a lot of questions and I hope you can clarify this issue for me.

So what is a schema anyway?

Before examining the role of a schema, it’s worthwhile first defining what a schema is. So to you, what is a schema? and what makes something schemaless?

According to Wikipedia, a schema in regards to a database is “a set of formulas called integrity constraints imposed on a database”. This of course can be seen in Solr. A Solr schema defines constraints on what fields a Document can contain and how the data for those fields must be analyzed. Lucene, being schemaless, doesn’t have those constraints. Nothing in Lucene constrains what fields can be indexed and a field could be analyzed in different ways in different Documents.

Yet there is something in this definition that troubles me. Must a schema constrain? or can it simply be informative? Or put another way, if I index a field that doesn’t exist in my schema, must I get an error? If a schema doesn’t constrain, is it even a schema at all?

Field Name Driven vs. Data Type Driven

Assuming we have a schema, whether it constrains or not, how should it be oriented? Should it follow the style of databases where you state per field name the definition of that field, or should it use datatypes instead where you configure for, say, numeric fields, their definition?

The advantage of being field name driven is that it gives you fine grained control over each field. Maybe field X is text but should be handled differently to another text field Y. If you only have a single text datatype then you wouldn’t be able to handle the fields differently. It also simplifies the interaction with the schema. Anything needing access to how a field should be handled can look up the information directly using the field’s name.

The disadvantage of the field name driven approach is that it is the biggest step away from the schemaless world. A definition must be provided for every field and that can be cumbersome for indexes containing hundreds of fields, when the schema must be defined upfront (see below) or when new fields need to be constantly defined.

The datatype driven approach is more of a middle ground. Yes the definition for each datatype must be defined, but it wouldn’t matter how many actual fields were indexed as long as they mapped to a datatype in the schema. At the same time this could increase the difficulty of using the schema. There wouldn’t be any list of field names stored in the schema. Instead users of the schema would need to infer the datatype of a field before they could access how the field should be handled. Note, work on adding something along these lines to Solr has begun in SOLR-3250.

What do you think is best? Do you have other ideas how a schema could be structured?

Upfront vs. Incremental

Again assuming we have a schema, whether it be field name or datatype driven, should we expect the schema to be defined upfront before it’s used, or should it be able to be built incrementally over time?

The advantage of the schema being upfront is that is considerably reduces the complexity of the schema implementation. There is no need to support multi-threaded updates or incompatible changes. However it is also very inflexible, requiring all the fields ever to be used in the index be known before any indexing begins.

An incrementally created schema is the opposite of course since you can start from a blank slate and add definitions when you know them. This means a schema can evolve along with an index. Yet as mentioned above, it can be more complex to implement. Issues of how to handle multiple threads updating the schema and incompatible changes to the schema arise. Furthermore, where with an upfront schema you could ensure that when a field is used it will have a definition in the schema, with an incremental schema it may be that a field is accidentally used before its definition is added. Should this result in an error?

It may seem as though Solr requires its schemas be defined upfront. However in reality, Solr only requires a schema be defined when it brings a core online and prevents any changes while its online. When the core is taken offline, its schema can be edited. In SOLR-3251 ideas on how to add full incremental schema support to Solr are being discussed.

Storage: External vs. Index

No matter whether a schema is defined upfront or incrementally built, it will need to be stored somewhere. Solr stores its schema externally in its schema.xml file. This decouples the schema from the index itself since the same schema could, in theory, be used for multiple indexes. Changes to an external schema do not necessarily have to be impact an index (and vice versa), and an external schema doesn’t impact the loading of an index.

At the same time, the disconnect between an externally stored schema and an index means that they could fall out of sync. An index could be opened and new fields added without the schema being notified. Removing the definition of a field in the schema wouldn’t necessarily mean that field would be removed from the index.

One way to address this is to store the schema inside the index itself. Lucene already partially does this, having a very simple notion of FieldInfo. There has been considerable reluctance to increasing what’s stored in FieldInfo since it will slow down the incredibly efficient loading of indexes. How slow it would become would depend on how much data was stored in the schema. Yet this would ensure that the schema and the index were synchronized. Any changes to one would be represented in the other.

Given how controversial storing a schema in an index would be, do you think its worthwhile? Have you encountered synchronisation issues between your indexes and your schemas? Would you prefer control over where your schema were stored, allowing you to choose Cloud storage or maybe another database?

Does any of this even matter?

You might be thinking that I’ve totally wasted my time here and that actually there is no need for a schema module in Lucene. It could be argued that while having a schema is one of Solr’s major features, being schemaless is one of Lucene’s and that it should stay that way. Maybe that it’s best left up to Lucene users to create their own custom schema solutions if they need one. What do you think? Do you have some sort of Schema notion in your Lucene applications? If so, how does it work? If you use Solr, do you like how its schema works? If you could change anything, what would you change? I’d love to hear your thoughts.