{"id":2368,"date":"2010-07-14T13:00:59","date_gmt":"2010-07-14T11:00:59","guid":{"rendered":"http:\/\/blog.jteam.nl\/?p=2368"},"modified":"2010-07-14T13:00:59","modified_gmt":"2010-07-14T11:00:59","slug":"parsing-html-with-jericho","status":"publish","type":"post","link":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/","title":{"rendered":"Parsing HTML with Jericho"},"content":{"rendered":"<p>In one of our projects I had to parse and manipulate HTML. After searching for a nice HTML parser, I ended up using the open source library <a href=\"http:\/\/jericho.htmlparser.net\/docs\/index.html\" target=\"_blank\" rel=\"noopener\">Jericho HTML Parser<\/a>. Jericho offers many features, including text extraction from HTML markup, rendering, formatting, and compacting HTML. In this post I will show you a few of the features I have used.<br \/>\n<!--more--><\/p>\n<h2>Maven dependency<\/h2>\n<p>If you use Maven, you can simply add the following dependency to use the library.<\/p>\n<p><pre class=\"brush: xml; title: ; notranslate\" title=\"\">\n&lt;dependency&gt;\n    &lt;groupId&gt;net.htmlparser.jericho&lt;\/groupId&gt;\n    &lt;artifactId&gt;jericho-html&lt;\/artifactId&gt;\n    &lt;version&gt;3.1&lt;\/version&gt;\n&lt;\/dependency&gt;\n<\/pre><\/p>\n<h2>API<\/h2>\n<p>I won\u2019t explain every class, but the following classes are the starting point for most parsing tasks.<\/p>\n<ul>\n<li><strong>Source<\/strong> \u2013 Represents a source HTML document. 
This is always the first step in parsing an HTML document.<\/li>\n<li><strong>OutputDocument<\/strong> &#8211; Represents a modified version of an original Source document or Segment.<\/li>\n<li><strong>Element<\/strong> &#8211; Represents an element in a specific source document, which encompasses a start tag, an optional end tag and all content in between.<\/li>\n<\/ul>\n<p>For a complete overview of all classes you can view the <a href=\"http:\/\/jericho.htmlparser.net\/docs\/javadoc\/index.html\" target=\"_blank\" rel=\"noopener\">javadoc<\/a>.<\/p>\n<h2>Extract all text<\/h2>\n<p>To extract all the text from the HTML markup, all you have to do is the following:<\/p>\n<p><pre class=\"brush: java; title: ; notranslate\" title=\"\">\n    public String extractAllText(String htmlText) {\n        Source source = new Source(htmlText);\n        return source.getTextExtractor().toString();\n    }\n<\/pre><\/p>\n<p>You create a new Source object, which in our case takes a String as input. It also accepts, for example, an InputStream or a URL. The Source object has a method getTextExtractor that allows you to, how surprising, extract the text. The TextExtractor class gives you a few options to configure the extraction. One of the options is to exclude the text of a specified Element. You can also include an attribute, in which case the value of that attribute is included in the output.<\/p>\n<h2>Manipulating HTML<\/h2>\n<p>Manipulating HTML is very easy with Jericho. In the code example below I want to add an id attribute to all H2 elements to create anchor navigation. Once again I create a Source document. From this Source document I create an OutputDocument.<\/p>\n<p>The OutputDocument represents a modified version of the original Source document. With the list of all H2 elements retrieved from the Source, we can now ask for all the attributes of a single H2 element. 
If the attribute <em>id<\/em> already exists we do nothing, but if it does not we recreate the start tag with a new id attribute and all the other existing attributes from that H2 element.<\/p>\n<p>As you can see in the example, it is relatively easy to manipulate attributes of an element. With the <a href=\"http:\/\/jericho.htmlparser.net\/docs\/javadoc\/net\/htmlparser\/jericho\/Attributes.html\">Attributes<\/a> object you can get a List of Attribute objects that are found in the source document or in a start tag. These attributes are not modifiable. The OutputDocument has a convenience method that allows us to replace the specific start tag with our newly created H2 start tag in order to add our <em>id<\/em> attribute.<\/p>\n<p><pre class=\"brush: java; title: ; notranslate\" title=\"\">\n    public String addIdAttributeToH2Elements(String html) {\n        Source source = new Source(html);\n        OutputDocument outputDocument = new OutputDocument(source);\n        List&lt;Element&gt; h2Elements = source.getAllElements(&quot;h2&quot;);\n\n        for (Element element : h2Elements) {\n            StartTag startTag = element.getStartTag();\n            Attributes attributes = startTag.getAttributes();\n            Attribute idAttribute = attributes.get(&quot;id&quot;);\n\n            if (idAttribute == null) {\n                String elementValue = element.getTextExtractor().toString();\n                \/\/ AnchorUtils is presumably a project-specific helper, not part of Jericho\n                String validAnchorId = AnchorUtils.getLowerCasedValidAnchorTitle(elementValue);\n\n                \/\/ Rebuild the start tag: the new id first, then the existing attributes\n                StringBuilder builder = new StringBuilder();\n                builder.append(&quot;&lt;h2&quot;).append(&quot; &quot;).append(&quot;id=\\&quot;&quot;).append(validAnchorId).append(&quot;\\&quot;&quot;);\n                for (Attribute attribute : attributes) {\n                    
builder.append(&quot; &quot;);\n                    builder.append(attribute);\n                }\n                builder.append(&quot;&gt;&quot;);\n\n                outputDocument.replace(startTag, builder);\n            }\n        }\n\n        return outputDocument.toString();\n    }\n<\/pre><\/p>\n<h2>Remove Elements<\/h2>\n<p>Just like me, you may want to remove a few tags from your HTML. Here is an example that shows you how you can achieve that.<\/p>\n<p><pre class=\"brush: java; title: ; notranslate\" title=\"\">\n    private static final Set&lt;String&gt; ALLOWED_HTML_TAGS = new HashSet&lt;String&gt;(Arrays.asList(\n            HTMLElementName.ABBR,\n            HTMLElementName.ACRONYM,\n            HTMLElementName.SPAN,\n            HTMLElementName.SUB,\n            HTMLElementName.SUP)\n    );\n\n    private static String removeNotAllowedTags(String htmlFragment) {\n        Source source = new Source(htmlFragment);\n        OutputDocument outputDocument = new OutputDocument(source);\n        List&lt;Element&gt; elements = source.getAllElements();\n\n        for (Element element : elements) {\n            if (!ALLOWED_HTML_TAGS.contains(element.getName())) {\n                \/\/ Remove only the tags so the text between them is kept\n                outputDocument.remove(element.getStartTag());\n                \/\/ getEndTag() returns null when the element has no end tag\n                if (!element.getStartTag().isSyntacticalEmptyElementTag() &amp;&amp; element.getEndTag() != null) {\n                    outputDocument.remove(element.getEndTag());\n                }\n            }\n        }\n\n        return outputDocument.toString();\n    }\n<\/pre><\/p>\n<p>In the example above you see that after checking if the tag is allowed, we need to remove the start and end tags. 
If you removed the complete element, you would also remove the text within these tags. The API also lets you check whether an element is empty. This can be handy for removing redundant empty elements or, in my case, for checking whether the start tag is a self-closing tag.<\/p>\n<h2>Conclusion<\/h2>\n<p>In this post I showed how I have used Jericho, but the library has many more interesting features. The project website provides more examples of how to use them. Jericho provides a nice, clean API and makes parsing HTML really easy!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In one of our projects I had to parse and manipulate HTML. After searching for a nice HTML parser, I ended up using the open source library Jericho HTML Parser. Jericho provides you a lot of features including text extraction from HTML markup, rendering, formatting or compacting HTML. 
In this post I will show you [&hellip;]<\/p>\n","protected":false},"author":102,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"content-type":"","footnotes":""},"categories":[10],"tags":[79,203,204],"class_list":["post-2368","post","type-post","status-publish","format-standard","hentry","category-development","tag-html","tag-jericho","tag-parsing"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Parsing HTML with Jericho - Trifork Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Parsing HTML with Jericho - Trifork Blog\" \/>\n<meta property=\"og:description\" content=\"In one of our projects I had to parse and manipulate HTML. After searching for a nice HTML parser, I ended up using the open source library Jericho HTML Parser. Jericho provides you a lot of features including text extraction from HTML markup, rendering, formatting or compacting HTML. 
In this post I will show you [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/\" \/>\n<meta property=\"og:site_name\" content=\"Trifork Blog\" \/>\n<meta property=\"article:published_time\" content=\"2010-07-14T11:00:59+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/trifork.nl\/articles\/wp-content\/uploads\/sites\/3\/2022\/02\/Blog-Banner-1-1024x256.png\" \/>\n<meta name=\"author\" content=\"Roberto van der Linden\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Roberto van der Linden\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/\",\"url\":\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/\",\"name\":\"Parsing HTML with Jericho - Trifork 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/trifork.nl\/articles\/wp-content\/uploads\/sites\/3\/2022\/02\/Blog-Banner-1-1024x256.png\",\"datePublished\":\"2010-07-14T11:00:59+00:00\",\"author\":{\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/037974cf3e24a7b09a93770b190d6e35\"},\"breadcrumb\":{\"@id\":\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#primaryimage\",\"url\":\"https:\/\/trifork.nl\/articles\/wp-content\/uploads\/sites\/3\/2022\/02\/Blog-Banner-1-1024x256.png\",\"contentUrl\":\"https:\/\/trifork.nl\/articles\/wp-content\/uploads\/sites\/3\/2022\/02\/Blog-Banner-1-1024x256.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/trifork.nl\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Parsing HTML with Jericho\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/trifork.nl\/blog\/#website\",\"url\":\"https:\/\/trifork.nl\/blog\/\",\"name\":\"Trifork Blog\",\"description\":\"Keep updated on the technical solutions Trifork is working 
on!\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/trifork.nl\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/037974cf3e24a7b09a93770b190d6e35\",\"name\":\"Roberto van der Linden\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/afe49faf7ef8dd3753baefb334568b10?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/afe49faf7ef8dd3753baefb334568b10?s=96&d=mm&r=g\",\"caption\":\"Roberto van der Linden\"},\"url\":\"https:\/\/trifork.nl\/blog\/author\/roberto\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Parsing HTML with Jericho - Trifork Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/","og_locale":"en_US","og_type":"article","og_title":"Parsing HTML with Jericho - Trifork Blog","og_description":"In one of our projects I had to parse and manipulate HTML. After searching for a nice HTML parser, I ended up using the open source library Jericho HTML Parser. Jericho provides you a lot of features including text extraction from HTML markup, rendering, formatting or compacting HTML. 
In this post I will show you [&hellip;]","og_url":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/","og_site_name":"Trifork Blog","article_published_time":"2010-07-14T11:00:59+00:00","og_image":[{"url":"https:\/\/trifork.nl\/articles\/wp-content\/uploads\/sites\/3\/2022\/02\/Blog-Banner-1-1024x256.png","type":"","width":"","height":""}],"author":"Roberto van der Linden","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Roberto van der Linden","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/","url":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/","name":"Parsing HTML with Jericho - Trifork Blog","isPartOf":{"@id":"https:\/\/trifork.nl\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#primaryimage"},"image":{"@id":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#primaryimage"},"thumbnailUrl":"https:\/\/trifork.nl\/articles\/wp-content\/uploads\/sites\/3\/2022\/02\/Blog-Banner-1-1024x256.png","datePublished":"2010-07-14T11:00:59+00:00","author":{"@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/037974cf3e24a7b09a93770b190d6e35"},"breadcrumb":{"@id":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#primaryimage","url":"https:\/\/trifork.nl\/articles\/wp-content\/uploads\/sites\/3\/2022\/02\/Blog-Banner-1-1024x256.png","contentUrl":"https:\/\/trifork.nl\/articles\/wp-content\/uploads\/sites\/3\/2022\/02\/Blog-Banner-1-1024x256.png"},{"@type":"BreadcrumbList","@id":"https:\/\/trifork.nl\/blog\/parsing-html-with-jericho\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","it
em":"https:\/\/trifork.nl\/blog\/"},{"@type":"ListItem","position":2,"name":"Parsing HTML with Jericho"}]},{"@type":"WebSite","@id":"https:\/\/trifork.nl\/blog\/#website","url":"https:\/\/trifork.nl\/blog\/","name":"Trifork Blog","description":"Keep updated on the technical solutions Trifork is working on!","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/trifork.nl\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/037974cf3e24a7b09a93770b190d6e35","name":"Roberto van der Linden","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/trifork.nl\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/afe49faf7ef8dd3753baefb334568b10?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/afe49faf7ef8dd3753baefb334568b10?s=96&d=mm&r=g","caption":"Roberto van der 
Linden"},"url":"https:\/\/trifork.nl\/blog\/author\/roberto\/"}]}},"_links":{"self":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/2368","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/users\/102"}],"replies":[{"embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/comments?post=2368"}],"version-history":[{"count":0,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/posts\/2368\/revisions"}],"wp:attachment":[{"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/media?parent=2368"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/categories?post=2368"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/trifork.nl\/blog\/wp-json\/wp\/v2\/tags?post=2368"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}