Gather content for Lucene from WordPress using Groovy

by Jettro Coenradie, August 16, 2011

I am learning about the capabilities of Lucene. Here at JTeam we have a few people who specialize in search, using technologies like Lucene and Solr, so I want to know more about Lucene than I do now. I started reading the Lucene in Action book, and when I read a book I like to create some samples. To learn Lucene you need content, so I decided to gather content from my own website and use it for my experiments.

The first challenge: how do I get the content from my website and give that content meaning? That is what this blog post is about. I take you on my journey from one end of the Groovy spectrum (using the XmlSlurper) to the other end (using the XMLRPCServerProxy). During this journey I will also explain some of the basics of the XML-RPC API of WordPress.

Using the XmlSlurper

In the past I gained some experience with Groovy's XmlSlurper. You can check that post or just go to the XmlSlurper documentation. Reading an XML document from a URL is not hard, and with the XmlSlurper it is easy to get to the content: you walk the tree of XML nodes, although a lot of other options are available. The problem with websites is that they are often not well-formed XML. Therefore the people behind the TagSoup project created a parser that lets the XmlSlurper handle such pages. I found a reference to the TagSoup parser in this blog. Using this technology I was able to read the gridshore homepage and obtain the title, link and summary of each blog post. Of course I had to analyze the structure of the blog first. The following code sample shows the class that you can use to obtain a set of objects containing this information.

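// XmlSlurper itself lives in groovy.util, which Groovy imports by default
import groovy.util.slurpersupport.GPathResult
import org.ccil.cowan.tagsoup.Parser

class GridshoreHomePage {
    // TagSoup presents real-world html to the slurper as well-formed xml
    private Parser tagsoupParser = new Parser()
    private XmlSlurper slurper = new XmlSlurper(tagsoupParser)
    private GPathResult page

    GridshoreHomePage() {
        page = slurper.parse("http://www.gridshore.nl")
    }

    def obtainBlogItems() {
        def blogItems = []
        // "**" visits every element below body, depth first
        page.body."**".findAll({it.@class == "post-headline"}).each {
            def blogItem = new Expando()
            blogItem.title = it.h2.a.text()
            blogItem.link = it.h2.a.@href.text()
            // summary and byline are sibling divs of the headline (theme specific)
            blogItem.summary = it.parent().div[2].p.text()
            blogItem.byline = it.parent().div[1].text()
            blogItems.add blogItem
        }
        return blogItems
    }
}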

As you can see from the sample, I analyzed the html content and therefore know exactly where to take the required content from. In the findAll call of obtainBlogItems I take the parsed page, browse to the body tag and find all elements that have a class attribute with the value post-headline. Those are the blog posts we are looking for. The text of the anchor element gives us the title, and its href attribute gives us the link.
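To try the same GPath navigation without hitting the network, here is a minimal sketch that runs against an inline document. The markup, class names and URLs are made up for illustration and only mimic the structure described above. Because the snippet is already well-formed, it skips TagSoup. The import assumes Groovy 3 or later; on older versions XmlSlurper comes from groovy.util and needs no import.

```groovy
// Assumption: Groovy 3+, where XmlSlurper lives in groovy.xml.
// On older versions remove this import and rely on groovy.util.
import groovy.xml.XmlSlurper

// Hypothetical, well-formed stand-in for a blog homepage.
def html = '''
<html>
  <body>
    <div class="post">
      <div class="post-headline"><h2><a href="http://example.org/first">First post</a></h2></div>
      <div class="post-byline">by someone</div>
      <div class="post-bodycopy"><p>Summary of the first post.</p></div>
    </div>
    <div class="post">
      <div class="post-headline"><h2><a href="http://example.org/second">Second post</a></h2></div>
      <div class="post-byline">by someone else</div>
      <div class="post-bodycopy"><p>Summary of the second post.</p></div>
    </div>
  </body>
</html>
'''

def page = new XmlSlurper().parseText(html)

// "**" walks all descendants depth first; findAll keeps the headline divs.
def items = page.body.'**'.findAll { it.@class == 'post-headline' }.collect { headline ->
    [
        title  : headline.h2.a.text(),
        link   : headline.h2.a.@href.text(),
        // byline and summary are the second and third div of the parent
        byline : headline.parent().div[1].text(),
        summary: headline.parent().div[2].p.text()
    ]
}

items.each { println "${it.title} -> ${it.link}" }
```

The index-based access (div[1], div[2]) is what ties the code to one particular page layout: reorder the divs and the byline and summary silently swap.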

As you can see, this kind of works. The approach is brittle, though: it will fail with a different theme, and it will likely break when the site gets an update. Since gridshore is using WordPress, I wanted to see whether there is an easier way to obtain the content. I knew WordPress has an XML-RPC option to obtain content, so I decided to have a look at that.

WordPress xmlrpc api

WordPress uses the XML-RPC specification to expose content. Be sure to enable the API in your WordPress installation if you want to use it. The API is defined by a few different specifications. Check this page to get an overview of what is possible. I think it is a bit strange that three different specifications are implemented, with a few other methods added on top. The WordPress-specific methods are limited, therefore I had to use the MetaWeblog methods. The WordPress methods are well documented on the WordPress site, but I found that reading the source code of WordPress is the best way to see which methods are available. If you have the sources available, check the following file.

wp-includes > class-wp-xmlrpc-server.php

At the beginning of this file you can see a mapping between the method names of the API and the actual implementations. You can also see a clear separation between the methods exposed through the different APIs. All WordPress methods start with wp., all MetaWeblog methods start with metaWeblog. I do not want to show the whole PHP file, but I do want to show how a method is registered and how it gathers its input and output parameters. That way you know what to look for when using the methods.

The first part is within the constructor of the class.

class wp_xmlrpc_server extends IXR_Server {
	function __construct() {
		$this->methods = array(
			'metaWeblog.getPost' => 'this:mw_getPost',
			'metaWeblog.getRecentPosts' => 'this:mw_getRecentPosts',
		);
	}
}

As you can see we map the external method names to the internal method names. The following code block shows the beginning and the end of an internal function.

	function mw_getRecentPosts($args) {

		$this->escape($args);

		$blog_ID     = (int) $args[0];
		$username  = $args[1];
		$password   = $args[2];
		if ( isset( $args[3] ) )
			$query = array( 'numberposts' => absint( $args[3] ) );
		else
			$query = array();

		// other part of the method

		foreach ($posts_list as $entry) {
			$struct[] = array(
				'dateCreated' => new IXR_Date($post_date),
				'userid' => $entry['post_author'],
				'postid' => (string) $entry['ID'],
				'description' => $post['main'],
				'title' => $entry['post_title'],
				'link' => $link,
				'permaLink' => $link,
				// commented out because no other tool seems to use this
				// 'content' => $entry['post_content'],
				'categories' => $categories,
				'mt_excerpt' => $entry['post_excerpt'],
				'mt_text_more' => $post['extended'],
				'mt_allow_comments' => $allow_comments,
				'mt_allow_pings' => $allow_pings,
				'mt_keywords' => $tagnames,
				'wp_slug' => $entry['post_name'],
				'wp_password' => $entry['post_password'],
				'wp_author_id' => $author->ID,
				'wp_author_display_name' => $author->display_name,
				'date_created_gmt' => new IXR_Date($post_date_gmt),
				'post_status' => $entry['post_status'],
				'custom_fields' => $this->get_custom_fields($entry['ID']),
				'wp_post_format' => $post_format
			);
		}
	}

From the previous code block you can see the input parameters: blogId, username, password and number of posts. The output is an array, and each item in the array has the fields shown, like dateCreated, userid, etc. These are the fields we read in the Groovy code that I will show later on.
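To make those input parameters concrete, the following sketch builds the XML-RPC request document that a client would post to xmlrpc.php for this method. It follows the general XML-RPC methodCall format; the helper name buildGetRecentPostsCall and the argument values are placeholders of my own, and a real client library normally builds this document for you.

```groovy
import groovy.xml.MarkupBuilder

// Hypothetical helper: builds the <methodCall> body for metaWeblog.getRecentPosts.
// The parameter order matches $args[0]..$args[3] in the php function.
String buildGetRecentPostsCall(int blogId, String username, String password, int numberOfPosts) {
    def writer = new StringWriter()
    def xml = new MarkupBuilder(writer)
    xml.methodCall {
        methodName('metaWeblog.getRecentPosts')
        params {
            param { value { 'int'(blogId) } }
            param { value { string(username) } }
            param { value { string(password) } }
            param { value { 'int'(numberOfPosts) } }
        }
    }
    writer.toString()
}

// Placeholder credentials, not real ones.
println buildGetRecentPostsCall(1, 'user', 'password', 10)
```

Seeing the raw request makes clear that the parameters are positional: the server only knows them by index, which is why the PHP code reads them from $args[0] to $args[3].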

Enough php, let us move back to the groovy side of life.

Using XMLRPCServerProxy

Groovy comes with a module that enables XML-RPC integration. Be aware of the version of Groovy you use: with 1.8.x you have a problem, because the latest version of the XML-RPC module does not support 1.8 and you will get exceptions. Somebody filed a bug already, so you can follow it and hope it gets fixed. The source code below shows the complete code to print information about the latest 10 blog posts.

import groovy.net.xmlrpc.XMLRPCServerProxy

def serverProxy = new XMLRPCServerProxy("http://www.gridshore.nl/xmlrpc.php")
serverProxy.setBasicAuth("user", "password")
// arguments: blog id, username, password, number of posts
def posts = serverProxy.metaWeblog.getRecentPosts(1, "user", "password", 10)
posts.each {post ->
    println "___________________________________"
    println "ID : " + post['postid']
    println "Link : " + post['permaLink']
    println "Post status : " + post['post_status']
    println "Keywords : " + post['mt_keywords']
    println "Title : " + post['title']
    println "Created on : " + post['dateCreated']
    println "Content : "  + post['description']
    println "Categories : " + post['categories']
    println "Author : " + post['wp_author_display_name']
    println "Slug : " + post['wp_slug']
}

You can see that the name of the API method that we call is the same as the one specified in the PHP file. You can also recognize the properties returned by the PHP method. The next block shows the output for the first blog post.

ID : 1165
Link : http://www.gridshore.nl/2011/08/01/cleaning-up-your-maven-repository-with-groovy/
Post status : publish
Keywords : maven
Title : Cleaning up your maven repository with groovy
Created on : Mon Aug 01 10:25:21 CEST 2011
Content : 

Ever looked at the space used by your maven repository? Think that this is to much? Start reading the blog post I wrote on my employers blog about a groovy script that you can use to clean your local maven repository. It removes old snapshots stored in your repo as well as old versions of artifacts.

http://blog.jteam.nl/2011/08/01/cleaning-up-your-maven-repository
Categories : 
Author : jettro
Slug : cleaning-up-your-maven-repository-with-groovy
___________________________________
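With the posts available as plain maps, a next step towards Lucene could be to reduce each post to the handful of fields worth indexing. The sketch below is only one way to do that. The helper toIndexFields and the target field names are my own invention, the sample values are copied from the output above with the content shortened, and I assume that mt_keywords is a comma separated string of tags.

```groovy
// Sample entry shaped like one element of the getRecentPosts result.
// Values are taken from the output above; the description is shortened.
def post = [
    postid     : '1165',
    permaLink  : 'http://www.gridshore.nl/2011/08/01/cleaning-up-your-maven-repository-with-groovy/',
    title      : 'Cleaning up your maven repository with groovy',
    mt_keywords: 'maven',
    description: '\n\nEver looked at the space used by your maven repository?\n'
]

// Hypothetical helper: keeps the fields to index and normalizes them.
// Assumption: mt_keywords holds tags separated by commas.
Map toIndexFields(Map post) {
    [
        id      : post.postid,
        url     : post.permaLink,
        title   : post.title,
        content : (post.description ?: '').trim(),
        keywords: (post.mt_keywords ?: '').split(',').collect { it.trim() }.findAll { it }
    ]
}

def fields = toIndexFields(post)
println "${fields.id}: ${fields.title} ${fields.keywords}"
```

Keeping this mapping in one small function means the indexing code never needs to know the XML-RPC field names.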