Introduction to Lucene Connectors Framework – Part 1

by Ralph Benjamin Ruijs, April 16, 2010

In my previous blog post, Searching your Java CMS using Apache Solr: Introduction, I looked at how to synchronize the information in a Java CMS with a Solr index. This post is an introduction to the Lucene Connectors Framework (LCF), a crawler framework I will use to make the information from a Java CMS searchable using Solr. I will show you how to build it, deploy it and get it running as a web crawler. In part 2 of this introduction I will extend LCF with a new Connector.

The Lucene Connectors Framework, an incubator project at Apache, provides a framework for connecting a source content repository to target repositories or indexes, such as Apache Solr. Last month the project published its first buildable sources.

Basic Introduction to LCF

The Lucene Connectors Framework has been developed and successfully deployed for the last five years by MetaCarta. This gives us a proven and usable system, but one with a rather steep learning curve. There is a lot of documentation already available on the project’s wiki; in this blog post we will take a look at how to build and deploy LCF.

The Lucene Connectors Framework consists of three main components:

Configuration UI (Web Application)
Authority Service (Web Application)
Agent (Java Process)

Within LCF, the Agent process does the actual work: it crawls documents and ingests them. The Authority Service webapp is used to get authorization tokens for a given username. The Configuration UI (crawler-ui) is the webapp in which you configure the system, start, pause or abort the crawler, and get statistics about the documents.
The agent, authority service and configuration UI are built on top of a PostgreSQL database. The database is used to store state, e.g. crawled documents, and configuration, e.g. scheduling and connection information. The synchronization folder is used to keep the processes in sync by providing a locking mechanism across JVM instances.
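
The general idea of using a shared folder as a lock between separate processes can be illustrated with a small shell sketch. This is not how LCF itself implements its locks (this post does not show that); it only relies on directory creation being atomic on a local file system, and the names SYNCH_DIR, acquire_lock and release_lock are made up for the illustration.

```shell
#!/bin/sh
# Illustration only: a lock shared between processes through a folder.
# SYNCH_DIR, acquire_lock and release_lock are hypothetical names.
SYNCH_DIR="${SYNCH_DIR:-$(mktemp -d)}"

# mkdir either creates the directory or fails because it already exists,
# so only one process can succeed for a given lock name.
acquire_lock() {
  mkdir "$SYNCH_DIR/$1.lock" 2>/dev/null
}

# Removing the directory makes the lock available again.
release_lock() {
  rmdir "$SYNCH_DIR/$1.lock"
}

if acquire_lock crawl-job; then
  echo "lock acquired"
  # A second attempt fails while the lock is still held.
  acquire_lock crawl-job || echo "lock is already held"
  release_lock crawl-job
fi
```

In a scheme like this, a process that is killed before it calls release_lock leaves its lock directory behind, and every later attempt to take that lock fails until someone cleans it up.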

Build

There are currently no releases of LCF available, but the project sources are in svn. Get the project sources using subversion from:
https://svn.apache.org/repos/asf/incubator/lcf/trunk

After a checkout, you will find the following directories:

documentation
- Part of the end-user documentation, written in LaTeX.
modules
- This directory contains the sources for the framework and the different supported connectors.
tests
- Part of the test files used by MetaCarta

Building LCF is done using ant and requires ant 1.7 or greater. Run ant from the modules directory to build the system. The output is produced in the modules/dist folder. The build creates two war files in the tomcat folder and a set of jar files in the processes folder.
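
The checkout and build steps can be sketched as one small shell helper. The function name checkout_and_build and its target directory argument are my own; the repository URL and the plain ant invocation from the modules directory are the ones described above.

```shell
#!/bin/sh
# Sketch: check out the LCF trunk and build it.
# checkout_and_build is a hypothetical helper; its argument is the
# directory to check out into.
checkout_and_build() {
  target="$1"
  # Both tools must be on the PATH; ant needs to be version 1.7 or newer.
  for tool in svn ant; do
    command -v "$tool" >/dev/null 2>&1 || {
      echo "missing required tool: $tool" >&2
      return 1
    }
  done
  svn checkout https://svn.apache.org/repos/asf/incubator/lcf/trunk "$target" || return 1
  # Running ant without arguments in modules writes to modules/dist.
  (cd "$target/modules" && ant)
}
```

Calling checkout_and_build lcf-trunk would leave the build output in lcf-trunk/modules/dist.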

Setup

LCF is implemented on top of a PostgreSQL database and uses a folder on the file system for synchronization. Both need to be configured in the LCF configuration file. By default LCF looks for a configuration file at /lcf/properties.ini. The following code block shows the minimal configuration needed.

org.apache.lcf.synchdirectory=/home/user/lcf_work/
org.apache.lcf.database.username=postgres
org.apache.lcf.database.password=postgres
org.apache.lcf.database.name=lcf
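
A minimal sketch of putting this configuration in place from the shell. The variables LCF_CONF_DIR and LCF_WORK_DIR are hypothetical; they default to a scratch directory so the script can be tried without root access. For a real setup LCF_CONF_DIR would be /lcf, the default location mentioned above.

```shell
#!/bin/sh
# Sketch: create the synchronization folder and write the minimal
# configuration. LCF_CONF_DIR and LCF_WORK_DIR are hypothetical variables.
SCRATCH="$(mktemp -d)"
LCF_CONF_DIR="${LCF_CONF_DIR:-$SCRATCH/lcf}"
LCF_WORK_DIR="${LCF_WORK_DIR:-$SCRATCH/lcf_work}"

# The synchronization folder must exist before LCF uses it.
mkdir -p "$LCF_CONF_DIR" "$LCF_WORK_DIR"

# Same four properties as above, with the folder path filled in.
cat > "$LCF_CONF_DIR/properties.ini" <<EOF
org.apache.lcf.synchdirectory=$LCF_WORK_DIR/
org.apache.lcf.database.username=postgres
org.apache.lcf.database.password=postgres
org.apache.lcf.database.name=lcf
EOF

echo "Wrote $LCF_CONF_DIR/properties.ini"
```

The database user and password here are the example values from this post; use your own credentials on anything other than a local test machine.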

When not specified in the configuration file, LCF expects a standard commons-logging property file at /lcf/logging.ini.

org.apache.commons.logging.Log=org.apache.commons.logging.impl.Log4JLogger

# Configuration file of the log
log4j.configuration=log4j.xml
log4j.rootCategory=info
# Set root logger level to DEBUG and its only appender to A1.
log4j.rootLogger=DEBUG, A1

# A1 is set to be a FileAppender.
log4j.appender.A1=org.apache.log4j.FileAppender
log4j.appender.A1.Threshold=DEBUG
log4j.appender.A1.File=ApplicationLog.log
log4j.appender.A1.Append=true

# A1 uses PatternLayout.
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

Create the database and register the different components using the following commands from the modules/dist directory.

java -Djava.ext.dirs=processes org.apache.lcf.core.DBCreate postgres postgres
java -Djava.ext.dirs=processes org.apache.lcf.agents.Install
java -Djava.ext.dirs=processes org.apache.lcf.agents.Register org.apache.lcf.crawler.system.CrawlerAgent
java -Djava.ext.dirs=processes org.apache.lcf.agents.RegisterOutput org.apache.lcf.agents.output.solr.SolrConnector "Solr Connector"
java -Djava.ext.dirs=processes org.apache.lcf.crawler.Register org.apache.lcf.crawler.connectors.webcrawler.WebcrawlerConnector "Web Crawler"

Run

With the configuration in place you need to deploy the lcf-crawler-ui web application and start the Agent. The web application can be deployed in any servlet container like Tomcat or Jetty, but needs to be run as the same user as the Agent. You can find the war file needed in the modules/dist/web directory.

The Agent is started by using the following command from the modules/dist directory.

java -Djava.ext.dirs=processes org.apache.lcf.agents.AgentRun &

Stopping the agent is done using the AgentStop class. Do not stop the Agent in any other way, because that may leave dangling locks in the synchronization directory.
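
Starting and stopping can be wrapped in two small shell helpers, meant to be used from the modules/dist directory. This is a sketch under two assumptions: that AgentStop lives in the same package as AgentRun (org.apache.lcf.agents) and takes no arguments, and that a made-up variable LCF_JAVA may override the java executable.

```shell
#!/bin/sh
# Sketch of start/stop helpers, to be used from modules/dist.
# Assumptions: AgentStop is in the same package as AgentRun and needs
# no arguments; LCF_JAVA is a hypothetical override for the java binary.
LCF_JAVA="${LCF_JAVA:-java}"

start_agent() {
  # Run the agent in the background and report its process id.
  "$LCF_JAVA" -Djava.ext.dirs=processes org.apache.lcf.agents.AgentRun &
  echo "Agent started with pid $!"
}

stop_agent() {
  # Ask the agent to shut down cleanly instead of killing the process,
  # so that no locks are left behind in the synchronization directory.
  "$LCF_JAVA" -Djava.ext.dirs=processes org.apache.lcf.agents.AgentStop
}
```

After sourcing the file, start_agent and stop_agent can be called from the same shell or from an init script.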

The lcf-crawler-ui should now be working at http://hostname/lcf-crawler-ui

Configure

Using the lcf-crawler-ui web application you can configure the web crawler. The web crawler needs an output connection, a repository connection and a job. The authority connection will not be used by the web crawler.

Output Connection	is an output connector with its configuration parameters. LCF supports different output connectors, such as the Solr connector.
Authority Connection	is an authority connector with its configuration parameters. Its only function is to convert a user name (which is often a Kerberos principal name) into a set of access tokens.
Repository Connection	is a repository connector with its configuration parameters. Currently LCF supports 11 connectors.
Job	ties the different connections together and holds additional, job specific, configuration. Jobs are scheduled and run by the Agent.

Create an output connection of the type “Solr Connector” and specify the details of your Solr installation. The update handler defaults to /update/extract, as the Solr connector uses the extraction handler to send the data. Make sure the Solr extraction handler is correctly configured in your Solr installation.
Screenshot-Lucene Connector Framework: Edit Output Connection
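
One way to check the extraction handler by hand, before pointing LCF at it, is to post a single local file to it with curl. The sketch below assumes a Solr base URL such as http://localhost:8983/solr; the helper names, the document id and the file are placeholders. literal.id and commit are request parameters of Solr's extraction handler.

```shell
#!/bin/sh
# Sketch: post one local file to Solr's extraction handler.
# extract_url and post_test_document are hypothetical helper names.

# Build the handler URL for a given Solr base URL and document id.
extract_url() {
  printf '%s/update/extract?literal.id=%s&commit=true\n' "$1" "$2"
}

# Upload a file as multipart form data; -f makes curl return a
# non-zero exit code when Solr answers with an HTTP error.
post_test_document() {
  curl -f -F "myfile=@$3" "$(extract_url "$1" "$2")"
}
```

For example, post_test_document http://localhost:8983/solr test-doc-1 ./sample.html should come back without an error if the handler is set up, and the document should then be findable in the index.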

Create a repository connection of the type “Web Crawler” and specify at least an email address. This email address is sent in the headers of every request and can be used by system administrators to contact you about the crawling.
Screenshot-LCF_WebCrawler_Email

Create a Job for crawling the web by selecting the two created connections. The web crawler uses the Seeds you specify as its starting point, so you need to specify at least one URL.
Screenshot-Lucene Connector Framework: Edit Job

Crawl

You can then look at the status of your job by clicking “Status and Job Management” on the sidebar. You can start any crawl you like immediately from this interface by clicking “Start” next to the name of the job. This interface also allows you to see how many documents have been crawled.
Screenshot-Lucene Connector Framework: Status of all jobs

Conclusion

The Lucene Connectors Framework provides a solid and generic infrastructure for connecting source repositories to target repositories or indexes. Because it is built on top of PostgreSQL, it is robust against system crashes and restarts.

Of course there is still a lot of work to be done. The Lucene/LCF authorization model has yet to be worked out and more connectors are needed to support different source and output repositories. Maybe a new UI?