Enterprise Search using Solr and Lucene

by Bram Smeets, April 1, 2010

The Enterprise Search market has long been dominated by commercial vendors and their products (e.g. Autonomy and FAST). We at JTeam feel that this era is over. At least for certain customers and requirements, there is finally a good Open Source alternative: Apache Solr, the Enterprise Search server based on Apache Lucene. In this blog post we’ll give our view on enterprise search and explain how Lucene and Solr can help you realize your projects.

Enterprise Search

Enterprise Search is the application of Information Retrieval (IR) technology to the information within an organization (as opposed to Web Search and Desktop Search). That information can be all sorts of data, and is usually a combination of several sorts: for example products in a database, documents on a file system and emails from a company mail server. An enterprise search project typically involves several distinct steps: gathering the content, indexing the retrieved content and making the content searchable.
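To make these steps concrete, here is a minimal toy sketch in plain Python. It is not how Lucene or Solr work internally; the sample documents and the function names gather, build_index and search are made up for illustration, and the "index" is just a dictionary from terms to document ids.

```python
from collections import defaultdict


def gather():
    """Step 1: gather content (hypothetical stand-in for real connectors)."""
    return {
        "product-1": "Red running shoes",
        "doc-7": "Quarterly report on shoes sales",
        "mail-3": "Meeting about the quarterly report",
    }


def build_index(documents):
    """Step 2: index. Map each lowercased term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Step 3: search. Return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(gather())
print(sorted(search(index, "quarterly report")))  # ['doc-7', 'mail-3']
```

A real search engine adds text analysis, relevance ranking and persistent storage on top of this basic idea, which is exactly the part Solr and Lucene take off your hands.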

Gathering content

The first thing to do is make sure the data that needs to be searchable is retrieved from its respective location. This usually means using and/or building connectors that can access the content and retrieve it. There are currently several open source crawling tools out there, but Lucene Connector Framework (LCF) is by far the most promising one. Luckily for us, LCF integrates with Solr out-of-the-box. You can use LCF to crawl a website or intranet site, but also other sources of information (e.g. a database, a file system, SharePoint and many more) and pass all the found documents (before or after extracting the useful content) to Solr.
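Whatever tool gathers the content, the documents eventually have to be handed to Solr. As a rough sketch of what that can look like if you write a small connector yourself (this is not LCF's API), the snippet below builds a Solr XML add message. The helper name build_add_message and the field names id and title are assumptions; the fields must exist in your own Solr schema.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Build a Solr XML <add> message from a list of field dictionaries."""
    add = ET.Element("add")
    for fields in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# The field names below are assumptions and must match your schema.
message = build_add_message([
    {"id": "product-1", "title": "Red running shoes"},
])
print(message)
# <add><doc><field name="id">product-1</field><field name="title">Red running shoes</field></doc></add>
```

Such a message is typically POSTed to Solr's update handler (for example http://localhost:8983/solr/update on a default local install, which is an assumption about your setup), followed by a commit to make the documents visible to searches.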

Extracting data

Once we get to the data we want to make searchable, we usually have to transform it into a format that the indexing / search engine understands. There are several alternatives for this. Which one suits you best depends on your specific requirements.

The first option is using Solr’s built-in support for Apache Tika, a toolkit for extracting metadata and structured text content from various document formats (e.g. Word and PDF documents). This allows you to send the raw document content to Solr and then have Tika extract the right content, which is subsequently indexed.
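As an illustration of the Tika option, the sketch below sends a raw file to Solr's extracting request handler, which is commonly mapped to /update/extract. The base URL, the helper names extract_url and index_file, and the default content type are assumptions; the handler also has to be enabled in your solrconfig.xml.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed location of a local Solr instance; adjust to your own setup.
SOLR_BASE = "http://localhost:8983/solr"


def extract_url(doc_id, base=SOLR_BASE, commit=True):
    """Build the URL for the extracting handler; literal.id sets the document's id field."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return base + "/update/extract?" + urlencode(params)


def index_file(path, doc_id, content_type="application/pdf"):
    """POST the raw file bytes to Solr and let Tika extract the text on the server."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = Request(
        extract_url(doc_id),
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": content_type},
    )
    with urlopen(request) as response:
        return response.status


print(extract_url("doc1"))
# http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true
```

Calling index_file("manual.pdf", "doc1") would then index a hypothetical manual.pdf, provided a Solr server is actually running at the assumed address.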

If you need broader support for document formats, there are also several commercial file readers available, the most important being the ISYS file readers.

Indexing / Searching

The last part is making sure users can actually search the indexed data sources. The search algorithms and relevance rankings are handled by Solr (with Lucene under the hood) and are highly configurable and extensible. The main advantage of using an open source search solution is that it allows you to easily extend the default functionality with custom logic.
The good thing about using Apache Solr is that it provides a whole lot of functionality out of the box, including faceted navigation, “Did you mean…” and “More like this” functionality. This allows you to impress your users with minimal effort.
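These features are switched on through request parameters on a normal search request. The sketch below builds such a request URL. The parameters q, wt, facet, facet.field, spellcheck, mlt and mlt.fl are standard Solr parameters, but the base URL, the helper name search_url and the field names category, title and description are assumptions, and whether each feature responds depends on how the request handler is configured on your server.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust to your own setup.
SOLR_BASE = "http://localhost:8983/solr"


def search_url(query, facet_fields=(), spellcheck=False,
               more_like_this_fields=(), base=SOLR_BASE):
    """Build a Solr select URL with optional faceting, spellcheck and MoreLikeThis."""
    params = [("q", query), ("wt", "json")]
    if facet_fields:
        # Faceted navigation: one facet.field parameter per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    if spellcheck:
        # "Did you mean...": ask the spellcheck component for suggestions.
        params.append(("spellcheck", "true"))
    if more_like_this_fields:
        # "More like this": fields used to find similar documents.
        params.append(("mlt", "true"))
        params.append(("mlt.fl", ",".join(more_like_this_fields)))
    return base + "/select?" + urlencode(params)


url = search_url(
    "running shoos",
    facet_fields=["category"],            # assumed field name
    spellcheck=True,
    more_like_this_fields=["title", "description"],  # assumed field names
)
print(url)
```

Fetching this URL from a running Solr server returns the matching documents together with facet counts, spelling suggestions and similar documents in a single response, for each feature that is enabled on the server.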

Conclusion

As mentioned in the introduction, we at JTeam feel that organisations considering an Enterprise Search solution should at least consider Apache Solr and its related technology stack. The main advantages are that these technologies are open source, which means no license fees, and that they are built on a proven, high-performance technology: Apache Lucene.
If you are considering Solr, have any questions about using Solr for your project or need help convincing others to use Solr, contact us!