Enterprise Search: Introduction to Solr
From day one, we at JTeam have been very much occupied with pushing new, revolutionary open source technologies that can bring real value to us and to our customers. We were there when Spring had just started and we helped make it what it is today. We were one of the first companies to use Hibernate in real world projects (I reckon the first version we used was 0.4), and contributed to (back then) innovative new front end technologies like Ajax and DWR. With time, these technologies became mainstream and for a while it seemed that they fulfilled every bit of our needs where JEE development is concerned. Yet something was still missing. About 3 years ago, we started noticing a new and growing trend in the market – a new demand – demand for search. Customers started paying more attention to the “findability” aspect of their offerings, be it an e-commerce website offering faceted navigation to its users, or proprietary search solutions on top of large service management systems. The trend was obvious, the demand was there, and we had to deliver. We started by implementing our own custom solutions based on the brilliant Lucene library, but then came Solr and once again revolutionized our JEE development.
My goal in this post is to introduce you to Solr. Nothing too fancy, just a taste and enough information to at least get started with it. In future posts, I hope to expand on this and show you how you can leverage some of Solr’s features to implement some really cool stuff.
So what is Solr?
Solr is an Apache project which can be described in many ways. Some like to see it as an open source enterprise search platform rivaling well established commercial offerings like Autonomy, Fast, and Dieselpoint. But it’s very hard to get a really good grasp of what Solr is and what it does from this definition, especially as the phrase “Enterprise Search Platform” has different meanings for different people. I like to view Solr as an attempt to gather all the experience people have had over the years developing search solutions with Lucene and build a search platform based on these experiences and all the known (and unknown) best practices.
NOTE: For those who are not familiar with Lucene, it is a low level IR (Information Retrieval) library implemented in Java. With Lucene you can index textual data and execute free text search on it in a highly performant manner.
So what’s the purpose of it all? Well, I guess it depends on who you’re asking and on the context in which search is applied. For example, if you’re a web developer looking to integrate search in a web site, Solr can be a perfect fit for you. It can be installed as a standalone server and exposes its search functionality via a REST-like API (which also makes it language independent, so it doesn’t matter whether you develop in Java, PHP, .NET or any other preferred language/platform). Also note that the core search functionality that Solr supports out of the box is probably more than enough for what most websites require.
That’s all well, but Solr is by no means limited to website search. It is not for nothing that it is often compared with Fast, Dieselpoint or other enterprise search platforms. You can build quite large scale and complex search solutions with it, and the list of high profile companies (CNet, AOL, Digg, and more) already using it to power their search requirements is testimony to that.
Getting Solr up and running
You can get Solr up and running in practically no time. You’ll first need to download it from the following site: http://www.apache.org/dyn/closer.cgi/lucene/solr.
Once downloaded, you can extract the compressed archive (zip or gzip – depending on your platform) anywhere you want on your file system. I would now like you to pay attention to two folders in the extracted directory – dist and example.
As I already mentioned, Solr is essentially a search server implemented in Java. It is actually a standard web application that can be deployed in a normal servlet container such as Tomcat. In the dist folder, you can find the war file for this server (along with several other jar files which serve different purposes… well… I’ll cover them in a later post). But as you probably know, it’s always a bit of a hassle to work with war files – you first need to set up a servlet container, then you need to set up a few environment variables, then deploy the war, etc, etc, etc… Luckily, the developers of Solr acknowledged that and decided to make it even easier for you to get started with it, hence the example folder. This folder contains a Solr distribution bundled with a Jetty server. Here is the layout of the example folder:
Now, all you need to do to start the Solr server is run the following command from this folder:
java -jar start.jar
Congrats! You are now running a Solr server.
Interacting with Solr
Now that you have Solr up and running, it’s time to do something with it. As a search service, the two main operations that Solr supports are indexing and searching. First, you need to send Solr data to index, after which you’ll be able to perform free text search on this data. But what is this data and what does it look like?
A world of Documents
In Java, we’re used to modeling the world in terms of objects and properties. In the IR world, the world is modeled as Documents and fields. A Document represents a unit of data and is made up of one or more fields. A field is a simple text based name-value pair which holds the actual data. For example, a web page can be represented as a document with 3 fields – URL, title, and body:
And here’s how you can model a person as a Document:
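Since a Document is nothing more than a set of named text fields, both examples can be sketched in plain Java as ordered maps from field name to value. This is only an illustration of the data model, not Solr's own API: the class name DocumentSketch and all the values are made up, and the person's field names (firstName, lastName, age) are an assumption borrowed from the SolrJ snippet later in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a Document modeled as an ordered map of field name -> text value.
final class DocumentSketch {

    // A web page as a document with three fields: url, title and body.
    static Map<String, String> webPage(String url, String title, String body) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        doc.put("title", title);
        doc.put("body", body);
        return doc;
    }

    // A person as a document (the field names are assumptions).
    static Map<String, String> person(String firstName, String lastName, int age) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("firstName", firstName);
        doc.put("lastName", lastName);
        doc.put("age", String.valueOf(age)); // field values are plain text
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(webPage("http://example.com/", "Example page", "Some body text"));
        System.out.println(person("John", "Doe", 25));
    }
}
```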
Indexing Documents
Now that you understand how data is represented in Solr, it is time to send Solr a few documents to be indexed. For that, we’ll use yet another useful tool that Solr ships with. In the example/exampledocs directory you will find several XML files and a post.jar file. The latter is a tool which can post document files to Solr to be indexed. If you open one of the XML files, you’ll see that each file actually holds an XML structure that represents an “add” command. The post.jar tool accepts a list of files as arguments and sends these files to Solr using HTTP POST requests to a dedicated “update” URL. Make sure that Solr is running, and execute the following command:
java -jar post.jar *.xml
When executed, this command will send all documents in all XML files to Solr. After they’ve all been sent, a “commit” command is sent which makes these documents available for search.
NOTE: without a commit, the documents will still be indexed, but they will not be available for search until either a “commit” is executed or the Solr server is restarted. Luckily, the post.jar tool sends a “commit” request automatically after sending the documents.
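To get a feel for what those XML files contain, here is a rough sketch that renders a single document as an “add” command. The add, doc and field elements follow the structure you’ll see in the example files; the class name AddCommandSketch, the field names and the values are hypothetical, and actually posting the result to the “update” URL is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: render one document (field name -> value) as an XML "add" command.
final class AddCommandSketch {

    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Wrap every field of the document in a <field name="..."> element.
    static String toAddCommand(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "example-1");       // hypothetical values
        doc.put("name", "Tom & Jerry");
        // prints <add><doc><field name="id">example-1</field><field name="name">Tom &amp; Jerry</field></doc></add>
        System.out.println(toAddCommand(doc));
    }
}
```

Escaping matters here: a raw “&” or “<” in a field value would make the whole command invalid XML.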
Searching for documents
Now that we have indexed some documents in Solr, all that is left is to search for them. Just like Solr has a dedicated URL for indexing documents, it also has a dedicated URL for searching for documents (actually, there can be more than one such URL but I won’t get into that right now). By default, the search URL is:
http://localhost:8983/solr/select
But (as you may already have tried and realized) opening this URL in the browser results in an exception. The reason is simply that no search query was specified. To specify one, just add a “q” request parameter which holds the free text you want to search for, for example:
http://localhost:8983/solr/select?q=*:*
NOTE: the “*:*” represents a “match all” query. When executing this query, it’s as if you’re asking Solr to return all the documents it has indexed.
The result returned from Solr for this query is an XML document containing some metadata about the request and query execution (e.g. how long the search took), and also a list of the matched documents (also referred to as “search hits”). Notice that it’s only a partial list of the documents – while the “numFound” attribute of the <result> element shows that 26 documents were found, only 10 are actually returned. There’s a very good reason for that – as Solr is designed to index millions of documents, it makes very little sense to return such large search results at once. Therefore, Solr returns the results one “page” at a time. By default the page size is 10 and, if not specified otherwise, the first page is returned (that is, the first 10 documents). You can control this behavior by providing 2 extra parameters – rows (determines the page size) and start (determines the zero-based index of the first document in the page):
http://localhost:8983/solr/select?q=*:*&rows=5&start=1
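If you want to page through results from code, the arithmetic is simple: the start value of a page is its zero-based page number multiplied by the page size. Here is a small sketch of that, assuming the default search URL shown above; the class and method names are made up for the example, and the query is expected to be URL-safe already.

```java
// Sketch: translate a zero-based page number into Solr's rows/start parameters.
final class PagingSketch {

    // Default search URL from the text above.
    static final String SELECT_URL = "http://localhost:8983/solr/select";

    // start is the zero-based index of the first hit on the requested page.
    static String pageUrl(String encodedQuery, int page, int pageSize) {
        int start = page * pageSize;
        return SELECT_URL + "?q=" + encodedQuery + "&rows=" + pageSize + "&start=" + start;
    }

    // Number of pages needed to show numFound hits (rounding up).
    static int pageCount(long numFound, int pageSize) {
        return (int) ((numFound + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        // Third page of 10 hits: prints http://localhost:8983/solr/select?q=*:*&rows=10&start=20
        System.out.println(pageUrl("*:*", 2, 10));
        // 26 hits at 10 per page: prints 3
        System.out.println(pageCount(26, 10));
    }
}
```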
Understanding and controlling the returned result
As you can see in the returned XML, each document in the result is returned with its fields. When configuring Solr, one can specify the schema of the index. I will not go into it right now, but in general, the schema determines what fields (name and type) a document is expected to have and also how Solr should handle them. Fields can be handled in 3 ways:
- Indexed – the field value is broken into tokens which are filtered and indexed, so that it is possible to search on it
- Stored – the field value is stored as a whole, so that when the document is read from the index the original field value can be restored
- Indexed and Stored – the two combined
The fields that are returned in the result set document are only those fields that are stored. If a field is not stored, Solr has no way of reconstructing its original value from the index. It is also possible to explicitly specify which fields you’d like to be returned. This feature is quite handy when you have very high performance requirements. Reading the stored field values from the index takes time, and limiting this operation to the fields you really need may help you boost performance a bit. This fine control over returned fields is done using the fl parameter. This parameter accepts a comma (or space) separated list of the fields that you’d like to be returned. There is one special field called score which is not part of the original document, but instead holds the search score of the document for the performed search. Here’s an example of the same request we executed above, only this time we return just the id and the score of each document:
http://localhost:8983/solr/select?q=*:*&rows=5&start=1&fl=id,score
NOTE: For the “match all” query, the score for all documents is identical and equal to 1.0.
Sorting
By default, the search hits are sorted by their score in a descending manner (highest score first). It is however possible to request sorting based on other field values. To do this, you can add the sort parameter to the request. This parameter can hold a comma separated list of sort “specifications”, where each specification defines the field to sort on and the direction of the sort (ascending/descending). When multiple sort specifications are set, the search result will be sorted on each specification in turn. For example, let’s say the following request is sent:
http://localhost:8983/solr/select?q=monitor&sort=score+desc,name+asc
The search hits will first be sorted on their score, then all hits with the same score will be sorted on their name in an ascending manner.
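If it helps, you can think of a multi-field sort as a chained comparator. The sketch below mimics “score desc, name asc” on a few in-memory hits; it only illustrates the ordering and says nothing about how Solr sorts internally. The Hit class and the sample data are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch: the ordering requested by sort=score desc,name asc, as a Java comparator.
final class SortSketch {

    // Minimal stand-in for a search hit; this is not a Solr class.
    static final class Hit {
        final String name;
        final float score;

        Hit(String name, float score) {
            this.name = name;
            this.score = score;
        }
    }

    // First by score (highest first), then by name (ascending) for equal scores.
    static final Comparator<Hit> SCORE_DESC_NAME_ASC =
            Comparator.comparingDouble((Hit h) -> h.score)
                      .reversed()
                      .thenComparing(h -> h.name);

    // Returns the hit names in sorted order, leaving the input list untouched.
    static List<String> sortedNames(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(SCORE_DESC_NAME_ASC);
        List<String> names = new ArrayList<>();
        for (Hit hit : copy) {
            names.add(hit.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("beta", 1.0f), new Hit("alpha", 1.0f), new Hit("gamma", 2.0f));
        // prints [gamma, alpha, beta]
        System.out.println(sortedNames(hits));
    }
}
```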
Query Syntax
So far, we’ve seen two types of queries – free text queries (like the “monitor” in the example above) and the “match all” query (*:*). The truth is that Solr supports a much richer query syntax than this. Here are a few examples of more advanced search queries:
- q=name:john – perform a free text search on a specific field
- q=age:[0 TO 20] – perform a range query on the age field
- q=name:john AND age:[0 TO 20] – compose queries with boolean constructs (AND, OR, NOT)
For the full specification of the search syntax, please visit: http://wiki.apache.org/solr/SolrQuerySyntax
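One practical note: these queries contain colons, spaces and brackets. The browser usually takes care of encoding them for you, but when you build the request URL in code you have to URL-encode the query yourself. Here is a minimal sketch using only the JDK; the class and method names are made up, and the base URL is the default search URL from above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: URL-encode a query before putting it in the q parameter.
final class QueryEncodingSketch {

    static String encode(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should never happen.
            throw new IllegalStateException(e);
        }
    }

    static String searchUrl(String query) {
        return "http://localhost:8983/solr/select?q=" + encode(query);
    }

    public static void main(String[] args) {
        // prints http://localhost:8983/solr/select?q=name%3Ajohn+AND+age%3A%5B0+TO+20%5D
        System.out.println(searchUrl("name:john AND age:[0 TO 20]"));
    }
}
```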
SolrJ
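Up until now, we used the browser to interact with Solr. But it is more likely that you’ll be using Solr services from another application. To simplify this type of communication, client libraries were developed for the most common development languages. SolrJ is such a library developed in Java (you can find the SolrJ jar in the dist directory as described above). Here’s a small snippet of code that demonstrates how you can use SolrJ to connect to a Solr server, index documents and search on them:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

// Connect to the Solr server (note the /solr context path)
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

// Index a document and commit to make it searchable
SolrInputDocument doc = new SolrInputDocument();
doc.addField("firstName", "John");
doc.addField("lastName", "Doe");
doc.addField("age", "25");
server.add(doc);
server.commit();

// Search on the firstName field indexed above, return only score and id, sort by age
SolrQuery query = new SolrQuery("firstName:John")
        .addField("score")
        .addField("id")
        .addSortField("age", SolrQuery.ORDER.asc);
QueryResponse response = server.query(query);

// Print every returned field of every hit
for (SolrDocument hit : response.getResults()) {
    for (String fieldName : hit.getFieldNames()) {
        System.out.println(fieldName + ": " + hit.get(fieldName));
    }
    System.out.println();
}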
Admin Site
Before I end this post and let you have a go at it on your own, there’s one more thing you might like to play around with. Solr comes with an out of the box administration site from which you can monitor its workings at runtime. It also enables you to perform simple searches (without having to build the request URL manually yourself) and provides other cool functionality which we haven’t covered yet (for example, you can view the different stages of the analysis process for field values that are being indexed – this can help you fine tune and debug the schema configuration). Assuming Solr is still running, the following URL will take you to the admin site:
http://localhost:8983/solr/admin
Final words
Well... this is it for this post. I tried to give you a short introduction to Solr and some of its very basic features. Nowadays, I find it hard to point to a project we’re implementing at JTeam that doesn’t have something to do with search. Simply put, Solr is integrated in practically every project that we’re doing, and the interesting thing is that we always start small (only using its basic functionality) but as the customer realizes the possibilities, Solr ends up as one of the main driving forces behind the systems we develop. If you haven’t tried it until now, I truly recommend that you do so in your next project… you’ll be pleasantly surprised.
As for me and this post, this is just the beginning. We’ve been doing so much with Solr and it has given us so much in return that we’re very eager to share this with the rest of the world. So if you’re interested, keep a close watch on JTeam’s blog and, more specifically, on our Enterprise Search category.