Indexing your Samba/Windows network shares using Solr
Many of JTeam’s clients want to search the content of their existing network shares as part of their enterprise search infrastructure. Over the last couple of years, more and more people have switched to Apache Lucene / Solr as their preferred open source search solution. However, many still have the misconception that Solr cannot index the content of other enterprise content systems, such as Microsoft Sharepoint and Samba / Windows shares. This blog entry shows how to index the content of your network shares with Solr, reusing some of the components JTeam built to make this easy to do.
In some present-day product selections, Apache Solr is dismissed as a serious option, partly because it does not provide out-of-the-box support for indexing content from non-standard repositories such as Microsoft Sharepoint and Samba / Windows shares. Also, support for scheduling tasks (needed to periodically index the content from these repositories) is not built into Solr. However, there are several open source solutions available that do exactly this and are easily combined with a typical Solr installation. The two main open source solutions that can be used to index content from Sharepoint and/or network shares are Apache ManifoldCF and the Google Enterprise Connector Manager. In this blog I will show you how you can index files from your network shares using Solr and the Google Enterprise Connector Manager.
Connector manager
The connector manager is the central part of the connector framework for the Google Search Appliance (GSA). It enables searching documents that are stored in non-web repositories, such as enterprise content management systems like Microsoft Sharepoint and Samba / Windows network shares, but it also supports JDBC databases.
The connector manager is a web application that runs inside a Java Servlet container. It is the entry point for the creation, instantiation, scheduling and monitoring of connectors that supply content, as well as authentication and authorization services over that content. Interacting with the connector manager is done over HTTP using an XML interface; there is no graphical user interface.
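For example, listing the installed connector types (a servlet we will use again later in this entry) is a plain HTTP GET that returns an XML response:

curl http://localhost:8080/connector-manager/getConnectorList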
By default, the connector manager is used in conjunction with the Google Search Appliance (GSA). However, we want to use it to index documents into Apache Solr. This is easy to do by means of a connector manager concept called a Pusher. A Pusher is responsible for sending a crawled document to an external system. By default, the only implementation sends crawled documents to the GSA, but implementing your own Pusher is not difficult. At JTeam we already created a Pusher that sends documents to Solr using the SolrJ client library. This SolrDocPusher integrates the connector manager with Solr by sending every crawled document to Solr. The default implementation of the SolrDocPusher sends the documents to the ExtractingRequestHandler (a.k.a. Solr Cell), which is part of Solr. The ExtractingRequestHandler uses Apache Tika to extract text content from documents in various file formats.
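To give an impression of what such a Pusher boils down to, here is a minimal, hypothetical sketch that sends a crawled file to Solr's ExtractingRequestHandler using SolrJ. The class and method names are illustrative only and do not reflect JTeam's actual SolrDocPusher; the real Pusher SPI receives connector manager Document objects rather than plain files.

import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

// Hypothetical sketch, not the actual SolrDocPusher.
public class SolrPusherSketch {

    private final SolrServer solr;

    public SolrPusherSketch(String solrBaseUrl) throws IOException {
        // e.g. http://localhost:8080/solr
        this.solr = new CommonsHttpSolrServer(solrBaseUrl);
    }

    // Sends one crawled file to the ExtractingRequestHandler (Solr Cell),
    // which extracts the text content with Apache Tika.
    public void push(String docId, File file) throws IOException, SolrServerException {
        ContentStreamUpdateRequest request =
                new ContentStreamUpdateRequest("/update/extract");
        request.addContentStream(new ContentStreamBase.FileStream(file));
        // Pass the unique key as a literal field and prefix unknown
        // (Tika metadata) fields with metadata_, as described later in this entry.
        request.setParam("literal.id", docId);
        request.setParam("uprefix", "metadata_");
        solr.request(request);
    }
}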
Connector
A connector is the bridge between the repository and the connector manager. A connector knows the details for retrieving content (including authentication and authorization) from a specific repository. It implements the connector SPI interface, which is the only thing the connector manager knows about.
Currently, there are connectors available for all repositories that are supported by the GSA, which includes Microsoft Sharepoint, file systems, Samba / Windows shares and relational databases, and there is also experimental support for Salesforce. A full list of available connectors can be found on the connector framework's project site.
Traversals
Traversal is an important concept when working with the Google Enterprise connector framework. A traversal is the process that makes sure that all new and changed documents are added to the indexer (in our case Solr) and that deleted documents are removed from it. Each connector uses its own heuristics to determine whether a document is new, changed or deleted. A traversal is executed in batches and starts at the configured time intervals. After each batch a checkpoint is saved that records the progress of that traversal.
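To make the batch and checkpoint mechanics more concrete, the sketch below shows a simplified version of the loop the connector manager runs against a connector. The SPI types come from the connector framework; error handling and scheduling are omitted, so treat this as an illustration rather than the actual manager code.

import com.google.enterprise.connector.spi.Document;
import com.google.enterprise.connector.spi.DocumentList;
import com.google.enterprise.connector.spi.RepositoryException;
import com.google.enterprise.connector.spi.TraversalManager;

// Simplified illustration of one traversal batch.
public class TraversalLoopSketch {

    // Runs one batch and returns the checkpoint to persist for the next batch.
    static String runBatch(TraversalManager traversalManager, String checkpoint)
            throws RepositoryException {
        // The first batch starts from scratch; later batches resume from
        // the saved checkpoint.
        DocumentList batch = (checkpoint == null)
                ? traversalManager.startTraversal()
                : traversalManager.resumeTraversal(checkpoint);
        if (batch == null) {
            return checkpoint; // nothing new, changed or deleted
        }
        Document document;
        while ((document = batch.nextDocument()) != null) {
            // In our setup the document is handed to the SolrDocPusher here.
        }
        // The checkpoint records how far this traversal got.
        return batch.checkpoint();
    }
}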
In order for a traversal to start and run correctly, the following actions should have been executed:
- Connector has to be installed. The jar containing the connector and any dependencies should be on the classpath of the connector manager.
- Connector instance has to be configured. Each connector has its own options that have to be specified before crawling can happen. For example, the DataBaseConnector needs a query and JDBC details, and the FileSystemConnector needs a path and, optionally, account details when a Samba share has to be crawled. You can query the connector manager for the options of a specific connector; in the last section of this blog entry I will show how to do this.
- Connector instance has to be scheduled. Scheduling a connector is a separate step. In this step you'll need to specify the time intervals, the load and the retry delay time.
The above image shows an overview of the Google connector manager and the different traversals. In our case the retrieved documents are pushed to Solr, and the configuration / scheduling happens over HTTP from the command line using curl.
When scheduling a connector instance there are two options that need some more explanation: the load and the retry delay time. The connector manager has a notion of load per connector instance, which determines the number of documents that can be processed per minute. The retry delay time is simply the time a connector instance should wait during a traversal when an error has occurred.
Authentication and Authorization
Besides indexing data, the connector manager can also assist you with authenticating users and authorizing access to documents. The connector manager provides web interfaces to perform authentication and authorization. As shown in the image below, the connector manager delegates authentication and authorization requests to one or more connectors.
Please note that not all connectors support authentication and authorization. In this blog I will focus on indexing, but this shows that the connector manager is designed with authentication and authorization in mind.
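As an illustration, an authentication check is an HTTP POST of an XML body to the connector manager's /authenticate servlet. The element names below follow the connector manager's XML protocol as far as I know it and may differ per version, so treat this as a sketch rather than a reference:

<AuthnRequest>
  <Credentials>
    <Username>jdoe</Username>
    <Password>secret</Password>
  </Credentials>
</AuthnRequest>

The response indicates, per connector, whether the credentials could be verified.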
Using the connector manager with Solr
The following paragraphs show how to use the connector manager with the file system connector to crawl a Samba share. Most of the steps are similar when using different connectors. In this small tutorial I will use Tomcat as the Servlet container for running the connector manager, but any other Servlet container should work.
The connector manager's binary distribution can be downloaded from Google's download page. The binary distribution contains the connector manager as a war file. However, this distribution doesn't contain any connectors and it relies on the GSA instead of Solr. Adding connectors isn't a big deal (just put the connector jars on the connector manager's classpath). There is even an installer for the connector manager that bundles many connectors.
We want the connector manager to work with Apache Solr. This requires some changes to the connector manager's configuration, which is packaged inside the war file, and the connector manager needs the SolrDocPusher. Obviously the file system connector is also required. To make it easy we have provided an archive with a Tomcat 6 distribution containing Solr version 3.1.0 and connector manager version 2.6.6 with the required changes.
In this setup Solr is configured under the solr context and the connector manager under the connector-manager context. The connector manager is configured to send documents to this local Solr instance. Logging is also properly configured: Solr logs to the solr-info and solr-error files in the Tomcat log directory, and the connector manager logs to the gcf-info and gcf-error files. The error log files only contain error messages; the info log files contain both info and error messages.
Follow these steps to index data from your network shares or local disk:
1. Unpack the archive into a convenient location on your local system.
2. Start Tomcat by running the start script in the bin directory and check via the log files whether Solr and the connector manager are running without errors. Browse to the connector manager context and check that the XML element StatusId contains the value 0. If that is the case then everything is fine; any other value means that there is an error (this shouldn't happen with the provided distribution). Browse to the Solr context and execute the example query in the Solr admin menu. You should see an empty result.
3. Check if the file system connector is available in the connector manager by going to the following URL: http://localhost:8080/connector-manager/getConnectorList
You should see a response like this in your browser:
<CmResponse>
  <Info>Google Search Appliance Connector Manager 2.6.6 (build 2658 December 7 2010); Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM 1.6.0_22; Linux 2.6.35-28-generic (amd64)</Info>
  <StatusId>0</StatusId>
  <ConnectorTypes>
    <ConnectorType>FileConnectorType</ConnectorType>
  </ConnectorTypes>
</CmResponse>
4. To configure a connector instance you'll need to send an XML snippet with the connector configuration to the connector manager. Take a look at the following connector instance configuration (the Param names below are the file system connector's configuration fields; you can retrieve the exact fields for any connector via getConfigForm, as described at the end of this entry):
<ConnectorConfig>
  <Lang>en</Lang>
  <ConnectorName>test</ConnectorName>
  <ConnectorType>FileConnectorType</ConnectorType>
  <Update>false</Update>
  <Param name="start_0" value="smb://[YOUR-HOST]/[YOUR-PATH]"/>
  <Param name="start_1" value=""/>
  <Param name="include_0" value="smb://[YOUR-HOST]/[YOUR-PATH]"/>
  <Param name="include_1" value=""/>
  <Param name="exclude_0" value=""/>
  <Param name="user" value="[YOUR-USER]"/>
  <Param name="password" value="[YOUR-PASSWORD]"/>
  <Param name="domain" value="[YOUR-DOMAIN]"/>
</ConnectorConfig>
The above configuration tells the connector manager to crawl certain paths on a Samba / Windows share. The start options tell the file system connector which paths to crawl recursively. Replace [YOUR-PATH] with some path in your network shares and replace [YOUR-HOST] with the hostname or IP of your share server. Also replace [YOUR-USER] and [YOUR-PASSWORD] with a username / password combination of a user that has read access to the specified path(s). The username should be specified without a domain (so not domain@username) when crawling a Windows share. Finally, if you're crawling a Windows share, replace [YOUR-DOMAIN] with the domain the share is located in.
The include and exclude options specify which files should be included in or excluded from the traversal. URL patterns are used as the syntax to define includes and excludes. The first include option in the above XML snippet specifies that all files encountered under the start path(s) are included.
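The empty include_1 and exclude_0 parameters in the snippet above could, for example, be filled like this (hypothetical values):

<Param name="include_1" value="smb://[YOUR-HOST]/[YOUR-PATH]/reports/"/>
<Param name="exclude_0" value="regexp:.*\.tmp$"/>

The first pattern additionally includes everything under a reports directory; the second excludes temporary files via a regular expression pattern.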
You can also crawl a path on your local file system: omit the smb:// prefix and replace the paths in the start configuration option with paths from your local file system. When crawling files from your local file system you can also omit the credentials and domain configuration options.

Save the above XML snippet as a file on your disk, for example setConnectorConfig.xml. To create the connector instance you'll need to send an HTTP POST to the following URL: http://localhost:8080/connector-manager/setConnectorConfig

A command line utility like curl will do the job:
curl http://localhost:8080/connector-manager/setConnectorConfig -X POST -d @setConnectorConfig.xml
Check whether the connector instance is properly configured via the following URL: http://localhost:8080/connector-manager/getConnectorInstanceList
You should see the following XML snippet in your browser:

<CmResponse>
  <Info>Google Search Appliance Connector Manager 2.6.6 (build 2658 December 7 2010); Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM 1.6.0_22; Linux 2.6.35-28-generic (amd64)</Info>
  <StatusId>0</StatusId>
  <ConnectorInstances>
    <ConnectorInstance>
      <ConnectorName>test</ConnectorName>
      <ConnectorType>FileConnectorType</ConnectorType>
      <Version>2.6.0 September 28 2010</Version>
      <Status>0</Status>
    </ConnectorInstance>
  </ConnectorInstances>
</CmResponse>
5. The next step is to schedule the connector instance. This is also done by sending an HTTP POST. Take a look at the following XML snippet:
<ConnectorSchedules>
  <ConnectorName>test</ConnectorName>
  <load>1000</load>
  <RetryDelayMillis>30000</RetryDelayMillis>
  <TimeIntervals>0-23</TimeIntervals>
</ConnectorSchedules>
The above XML snippet tells the connector manager that the connector instance with the name test should traverse the paths specified in the previous step at all hours of the day. It also specifies the load (1000 documents per minute) and the retry delay (30000 milliseconds). Save the XML snippet to a file (for example setSchedule.xml) and post it with curl to the following URL: http://localhost:8080/connector-manager/setSchedule
With curl it would look like this:

curl http://localhost:8080/connector-manager/setSchedule -X POST -d @setSchedule.xml
If the call was successful you should see the following response in your prompt:
<CmResponse>
  <StatusId>0</StatusId>
</CmResponse>
The connector manager will now begin to crawl the specified path(s) in step 4. You should see documents being added to your Solr instance.
If you want to remove the connector instance you can do so by opening the following URL in, for example, your browser:
http://localhost:8080/connector-manager/removeConnector?ConnectorName=test
The result is that the crawling stops and the connector loses the state it has maintained, meaning that if you reconfigure a connector instance with the same name it will start crawling from scratch.
The XML configuration snippet in step 4 seems verbose. The reason you need to send empty properties is that all connector properties must be specified when configuring a connector, otherwise an exception occurs. Sending many empty configuration options seems overkill, but that is just how it works. To find out which options you need to specify when configuring a connector instance, send an HTTP request to the following URL:
http://localhost:8080/connector-manager/getConfigForm?ConnectorType=[FileConnectorType]
Replace [FileConnectorType] by the connector type you want to configure. This returns an XML response with an HTML form inside. From this form you can determine the fields to send, or you can render the form in a front-end application and let an end user supply the configuration options.
The connector manager can easily be changed to index documents into a different Solr instance. You can do so by changing the following file in the connector manager's exploded directory: /WEB-INF/applicationContext.properties. Change the solr.baseUrl and solr.core properties to point to the other Solr instance.
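For example (the values are illustrative; only the two property names come from the actual configuration file):

# /WEB-INF/applicationContext.properties
solr.baseUrl=http://searchserver:8080/solr
solr.core=documents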
In Solr you'll see that the documents have a number of fields with the google: prefix. The google:aclgroups field defines which user groups are allowed to read a specific document. An important note is that the file system connector does not retrieve nested groups. The google:aclusers field defines which users are allowed to read a document; usually these are users who don't belong to one of the groups in the previous field and have direct read privileges. The google:ispublic field defines whether the document can be viewed by public users. The content field contains the parsed content of a file: the binary content is parsed by Tika and put in the content field. The ExtractingRequestHandler on the Solr side puts all metadata Tika can collect during parsing in fields with the prefix metadata_. Not all files can be parsed by Tika, for example because Tika doesn't have a parser for that specific content type or because an error occurred during parsing. The latter happens quite often, and the behavior of the SolrDocPusher in that case is to send the document to Solr without parsing the content with Tika. This means that all the metadata fields and the content field are empty.
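A quick way to inspect these fields is a plain Solr query against the setup from this tutorial, for example:

curl "http://localhost:8080/solr/select?q=*:*&fl=id,content,google:ispublic,google:aclusers,google:aclgroups&rows=5"

This returns the first five indexed documents together with their access control fields.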
Conclusion
Hopefully, this blog entry has shown you that it's fairly easy to index your data sources using Apache Solr, with little custom development and a lot of help from Google's connector framework. The connector manager and its connectors are a good solution for crawling many data stores. Especially the connectors are typically good pieces of software that are easy to reuse. This blog also shows that open source products from different communities (Apache and Google) can be integrated with little effort. A GUI for managing and configuring the connectors and the manager would be nice, though, but that is one of the features the GSA provides. If you want more background information about the connector manager you can check out my presentation, which I gave at JTeam last February. Or you can always contact us to help you index your data!