Using your Lucene index as input to your Mahout job – Part I
This blog shows you how to use an upcoming Mahout feature, the lucene2seq program (https://issues.apache.org/jira/browse/MAHOUT-944). This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and a MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia articles.
Introduction
When working with Mahout text clustering or classification you first preprocess your data so it can be understood by Mahout. Mahout contains input tools such as seqdirectory and seqemailarchives for fetching data from different input sources and transforming it into text sequence files. The resulting sequence files are then fed into seq2sparse to create Mahout vectors. Finally you can run one of Mahout's clustering algorithms on these vectors to do text clustering.
The lucene2seq program
Recently a new input tool has been added, lucene2seq, which allows you to read the stored fields of a Lucene index and create text sequence files from them. This is different from the existing lucene.vector program, which reads term vectors from a Lucene index and transforms them into Mahout vectors straight away. By working from the original text content you can take full advantage of Mahout's collocation identification algorithm, which improves clustering results.
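For comparison, a lucene.vector run looks something like the command below. I'm quoting the flag names from memory, so treat them as an assumption and check bin/mahout lucene.vector --help before relying on them:
$ bin/mahout lucene.vector --dir index --idField docid --field body --dictOut dictionary.txt --output wikipedia-term-vectors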
Let’s look at the lucene2seq program in more detail by running
$ bin/mahout lucene2seq --help
This will print out all the program’s options.
Job-Specific Options:
  --output (-o) output          The directory pathname for output.
  --dir (-d) dir                The Lucene directory
  --idField (-i) idField        The field in the index containing the id
  --fields (-f) fields          The stored field(s) in the index containing text
  --query (-q) query            (Optional) Lucene query. Defaults to MatchAllDocsQuery
  --maxHits (-n) maxHits        (Optional) Max hits. Defaults to 2147483647
  --method (-xm) method         The execution method to use: sequential or mapreduce. Default is mapreduce
  --help (-h)                   Print out help
  --tempDir tempDir             Intermediate output directory
  --startPhase startPhase       First phase to run
  --endPhase endPhase           Last phase to run
The required parameters are the Lucene directory path(s), the output path, the id field and the list of stored fields. The tool fetches all documents and creates a key-value pair for each one, where the key equals the value of the id field and the value equals the concatenated values of the stored fields. The optional parameters are a Lucene query, a maximum number of hits and the execution method, sequential or MapReduce. The tool can be run like any other Mahout tool.
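To make the key-value layout concrete, here is a minimal sketch that dumps such a sequence file with Hadoop's SequenceFile.Reader (the file name under the output directory is a placeholder, and the Text/Text key-value types are an assumption):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpTextSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("seqfiles/part-r-00000"); // placeholder: one of the output files

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();   // the value of the id field
      Text value = new Text(); // the concatenated stored fields
      while (reader.next(key, value)) {
        System.out.println(key + " => " + value);
      }
    } finally {
      reader.close();
    }
  }
}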
Converting an index of Wikipedia articles to sequence files
To demonstrate lucene2seq we will convert an index of Wikipedia articles to sequence files. Check out the Lucene 3.x branch, download a part of the Wikipedia articles dump and run a benchmark algorithm to create an index of the articles in the dump.
$ svn checkout http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x lucene_3x
$ cd lucene_3x/lucene/contrib/benchmark
$ mkdir temp work
$ cd temp
$ wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
$ bunzip2 enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
The next step is to run a benchmark ‘algorithm’ to index the Wikipedia dump. Contrib benchmark contains several of these algorithms in the conf directory. For this demo we only index a small part of the Wikipedia dump, so edit the conf/wikipediaOneRound.alg file so it points to enwiki-latest-pages-articles1.xml-p000000010p000010000. For an overview of the syntax of these benchmarking algorithms, check out the benchmark.byTask package-summary Javadocs.
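The edit usually boils down to pointing the docs.file property at the extracted dump, along these lines (property name as used in the stock benchmark .alg files; double-check it against your checkout):
docs.file=temp/enwiki-latest-pages-articles1.xml-p000000010p000010000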
Now it’s time to create the index
$ cd ..
$ ant run-task -Dtask.alg=conf/wikipediaOneRound.alg -Dtask.mem=2048M
The next step is to run lucene2seq on the generated index under work/index. Check out the lucene2seq branch from GitHub and build it
$ git clone https://github.com/frankscholten/mahout
$ cd mahout
$ git checkout lucene2seq
$ mvn clean install -DskipTests=true
Change back to the Lucene 3.x contrib/benchmark work dir and run
$ <path/to>/bin/mahout lucene2seq -d index -o wikipedia-seq -i docid -f title,body -q 'body:java' -xm sequential
to create sequence files of all documents that contain the term ‘java’ in their body field. From here you can run seq2sparse followed by a clustering algorithm to cluster the text contents of the articles.
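For example, a follow-up along these lines should do the trick (output paths and the k-means parameters are placeholders; verify the flags against your Mahout version):
$ <path/to>/bin/mahout seq2sparse -i wikipedia-seq -o wikipedia-vectors
$ <path/to>/bin/mahout kmeans -i wikipedia-vectors/tfidf-vectors -c wikipedia-initial-clusters -o wikipedia-kmeans -k 20 -x 10 -cl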
Running the sequential version in Java
The lucene2seq program can also be run from Java. First create a LuceneStorageConfiguration bean and pass the list of index paths, the sequence files output path, the id field and the list of stored fields to the constructor.
LuceneStorageConfiguration luceneStorageConf = new LuceneStorageConfiguration(configuration, asList(index), seqFilesOutputPath, "id", asList("title", "description"));
You can then optionally set a Lucene query and max hits via setters
luceneStorageConf.setQuery(new TermQuery(new Term("body", "Java")));
luceneStorageConf.setMaxHits(10000);
Now you can run the tool by calling the run method with the configuration as a parameter
LuceneIndexToSequenceFiles lucene2seq = new LuceneIndexToSequenceFiles();
lucene2seq.run(luceneStorageConf);
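Putting the snippets together, a self-contained version looks roughly like this. The package names of the lucene2seq classes and the index and output paths are assumptions based on the branch, so adjust them to your setup:

import static java.util.Arrays.asList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.mahout.text.LuceneIndexToSequenceFiles;   // assumed package
import org.apache.mahout.text.LuceneStorageConfiguration;   // assumed package

public class Lucene2SeqExample {
  public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    Path index = new Path("index");                      // Lucene index directory (assumption)
    Path seqFilesOutputPath = new Path("wikipedia-seq");  // sequence files output (assumption)

    // Index paths, output path, id field and stored fields go into the constructor
    LuceneStorageConfiguration luceneStorageConf = new LuceneStorageConfiguration(
        configuration, asList(index), seqFilesOutputPath, "id", asList("title", "description"));

    // Optional: restrict the documents with a Lucene query and cap the number of hits
    luceneStorageConf.setQuery(new TermQuery(new Term("body", "Java")));
    luceneStorageConf.setMaxHits(10000);

    LuceneIndexToSequenceFiles lucene2seq = new LuceneIndexToSequenceFiles();
    lucene2seq.run(luceneStorageConf);
  }
}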
Conclusions
In this post I showed you how to use lucene2seq on an index of Wikipedia articles. I hope this tool will make it easier for you to start using Mahout text clustering. In a future blog post I will discuss how to run the MapReduce version on very large indexes. Feel free to post comments or feedback below.