Parsing HTML with Jericho

by Roberto van der Linden, July 14, 2010

In one of our projects I had to parse and manipulate HTML. After searching for a good HTML parser, I ended up using the open source library Jericho HTML Parser. Jericho offers a lot of features, including text extraction from HTML markup, rendering, and formatting or compacting HTML. In this post I will show you a few of the features I have used.

Maven dependency

If you use Maven, you can simply add the following dependency to use the library.

<dependency>
    <groupId>net.htmlparser.jericho</groupId>
    <artifactId>jericho-html</artifactId>
    <version>3.1</version>
</dependency>

API

I don’t want to explain all classes, but the following classes are basically the starting point of all your parsing.

Source – Represents a source HTML document. This is always the first step in parsing an HTML document.
OutputDocument – Represents a modified version of an original Source document or Segment.
Element – Represents an element in a specific source document, which encompasses a start tag, an optional end tag and all content in between.

For a complete overview of all classes you can view the javadoc.

Extract all text

To extract all the text from the HTML markup, all you have to do is the following:

    public String extractAllText(String htmlText) {
        Source source = new Source(htmlText);
        return source.getTextExtractor().toString();
    }

You create a new Source object, which in our case takes a String as input, but it also accepts, for example, an InputStream or a URL. The Source object has a method getTextExtractor that, unsurprisingly, lets you extract the text. The TextExtractor class gives you a few options to configure the extraction. One option is to exclude the text of specific elements. Another is to include attributes, in which case the values of those attributes appear in the output.
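As a rough sketch of those two options: the class and method names below (TextExtractionOptions, extractWithoutPre, extractWithAttributes) are made up for illustration, and skipping pre elements is just an assumed use case. The first method overrides excludeElement to skip elements, the second switches on attribute inclusion, which by default covers attributes such as alt and title.

```java
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.TextExtractor;

// Illustrative helper class, not part of Jericho or of the original project.
final class TextExtractionOptions {

    private TextExtractionOptions() {
    }

    // Extracts all text except the content of <pre> elements.
    static String extractWithoutPre(String html) {
        Source source = new Source(html);
        TextExtractor extractor = new TextExtractor(source) {
            @Override
            public boolean excludeElement(StartTag startTag) {
                // Returning true drops the whole element from the extracted text.
                return HTMLElementName.PRE.equals(startTag.getName());
            }
        };
        return extractor.toString();
    }

    // Extracts all text and also the values of descriptive attributes such as alt.
    static String extractWithAttributes(String html) {
        Source source = new Source(html);
        TextExtractor extractor = source.getTextExtractor();
        extractor.setIncludeAttributes(true);
        return extractor.toString();
    }
}
```

Which attributes are included can be narrowed further by overriding includeAttribute in the same way as excludeElement.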

Manipulating HTML

Manipulating HTML is very easy with Jericho. In the code example below I want to add an id attribute to all H2 elements to create anchor navigation. Once again I create a Source document. From this Source document I create an OutputDocument.

The OutputDocument represents a modified version of the original Source document. With the list of all H2 elements retrieved from the Source, we can ask for all the attributes of a single H2 element. If the id attribute already exists we do nothing. If it does not, we recreate the start tag with a new id attribute followed by all the existing attributes of that H2 element.

As you can see in the example below, it is relatively easy to manipulate the attributes of an element. The Attributes object gives you a List of the Attribute objects found in the source document or in a start tag. These attributes are not modifiable. The OutputDocument has a convenience method that lets us replace the original start tag with our newly created H2 start tag in order to add the id attribute.

    public String addIdAttributeToH2Elements(String html) {
        Source source = new Source(html);
        OutputDocument outputDocument = new OutputDocument(source);
        List<Element> h2Elements = source.getAllElements("h2");

        for (Element element : h2Elements) {
            StartTag startTag = element.getStartTag();
            Attributes attributes = startTag.getAttributes();
            Attribute idAttribute = attributes.get("id");

            // Only touch headings that do not have an id yet.
            if (idAttribute == null) {
                String elementValue = element.getTextExtractor().toString();
                String validAnchorId = AnchorUtils.getLowerCasedValidAnchorTitle(elementValue);

                // Rebuild the start tag: the new id first, then the existing attributes.
                StringBuilder builder = new StringBuilder();
                builder.append("<h2").append(" ").append("id=\"").append(validAnchorId).append("\"");
                for (Attribute attribute : attributes) {
                    builder.append(" ");
                    builder.append(attribute);
                }
                builder.append(">");

                outputDocument.replace(startTag, builder);
            }
        }

        return outputDocument.toString();
    }
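The AnchorUtils helper called in the example is project code that is not shown in this post. The following is only a guess at what such a helper could look like, assuming the goal is a lower-cased id that starts with a letter and contains only letters, digits and hyphens. The "section" prefix is an arbitrary choice for this sketch.

```java
// Hypothetical stand-in for the AnchorUtils class used above; the real one is not shown.
final class AnchorUtils {

    private AnchorUtils() {
    }

    // Turns a heading text such as "Getting Started!" into an id such as "getting-started".
    static String getLowerCasedValidAnchorTitle(String title) {
        String lower = title.trim().toLowerCase(java.util.Locale.ROOT);
        // Replace every run of characters other than a-z and 0-9 with a single hyphen.
        String slug = lower.replaceAll("[^a-z0-9]+", "-");
        // Drop hyphens left over at either end.
        slug = slug.replaceAll("^-+|-+$", "");
        if (slug.isEmpty()) {
            return "section";
        }
        // HTML 4 ids must start with a letter, so prefix ids that start with a digit.
        if (Character.isDigit(slug.charAt(0))) {
            slug = "section-" + slug;
        }
        return slug;
    }
}
```

Note that this sketch does not make ids unique: two headings with the same text would get the same id, so a real implementation would probably need to track the ids it has already handed out.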

Remove Elements

Just like me, you may want to remove a few tags from your HTML. Here is an example that shows you how you can achieve that.

    private static final Set<String> ALLOWED_HTML_TAGS = new HashSet<String>(Arrays.asList(
            HTMLElementName.ABBR,
            HTMLElementName.ACRONYM,
            HTMLElementName.SPAN,
            HTMLElementName.SUB,
            HTMLElementName.SUP)
    );

    private static String removeNotAllowedTags(String htmlFragment) {
        Source source = new Source(htmlFragment);
        OutputDocument outputDocument = new OutputDocument(source);
        List<Element> elements = source.getAllElements();

        for (Element element : elements) {
            if (!ALLOWED_HTML_TAGS.contains(element.getName())) {
                // Remove only the tags so the text between them is kept.
                outputDocument.remove(element.getStartTag());
                // Self-closing tags and elements with an omitted end tag have nothing more to remove.
                EndTag endTag = element.getEndTag();
                if (!element.getStartTag().isSyntacticalEmptyElementTag() && endTag != null) {
                    outputDocument.remove(endTag);
                }
            }
        }

        return outputDocument.toString();
    }

In the example above you can see that, after checking whether the tag is allowed, we remove the start tag and the end tag separately. If you removed the complete element, you would also remove the text between the tags. The API also allows you to check for empty elements. This can be handy for removing redundant empty elements or, in my case, for checking whether the start tag is a self-closing tag.
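To make the difference between removing only the tags and removing the complete element concrete, here is a small side-by-side sketch. The class and method names are invented for this illustration, and it assumes that elements with the given name are not nested inside each other, since removing overlapping segments from an OutputDocument is not something this sketch handles.

```java
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.EndTag;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;

// Illustrative comparison, not part of Jericho or of the original project.
final class RemoveComparison {

    private RemoveComparison() {
    }

    // Removes the start and end tags of the named elements but keeps their content.
    static String removeTagsOnly(String html, String name) {
        Source source = new Source(html);
        OutputDocument outputDocument = new OutputDocument(source);
        for (Element element : source.getAllElements(name)) {
            outputDocument.remove(element.getStartTag());
            EndTag endTag = element.getEndTag();
            if (endTag != null) {
                outputDocument.remove(endTag);
            }
        }
        return outputDocument.toString();
    }

    // Removes the named elements completely, including the text inside them.
    static String removeWholeElement(String html, String name) {
        Source source = new Source(html);
        OutputDocument outputDocument = new OutputDocument(source);
        for (Element element : source.getAllElements(name)) {
            outputDocument.remove(element);
        }
        return outputDocument.toString();
    }
}
```

For an input such as `<p>Hello <b>bold</b> world</p>` and the name "b", the first method keeps the word "bold" in the paragraph while the second drops it along with the tags.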

Conclusion

In this post I showed you how I have used Jericho, but Jericho has a lot more interesting features. On their webpage they provide more examples on how to use those features. Jericho provides a nice and clean API and makes the parsing of HTML really easy!