Trifork Blog

Parsing HTML with Jericho

July 14th, 2010

In one of our projects I had to parse and manipulate HTML. After searching for a good HTML parser, I ended up using the open source Jericho HTML Parser library. Jericho offers many features, including text extraction from HTML markup, rendering, formatting and compacting HTML. In this post I will show you a few of the features I have used.

Maven dependency

If you use Maven, you can simply add the following dependency to use the library.

<dependency>
    <groupId>net.htmlparser.jericho</groupId>
    <artifactId>jericho-html</artifactId>
    <version>3.1</version>
</dependency>

API

I won’t explain every class, but the following classes are the starting point for most of your parsing.

  • Source – Represents a source HTML document. This is always the first step in parsing an HTML document.
  • OutputDocument – Represents a modified version of an original Source document or Segment.
  • Element – Represents an element in a specific source document, which encompasses a start tag, an optional end tag and all content in between.

For a complete overview of all classes you can view the javadoc.

Extract all text

To extract all the text from the HTML markup, all you have to do is the following:

    public String extractAllText(String htmlText){
        Source source = new Source(htmlText);
        return source.getTextExtractor().toString();
    }

We create a new Source object, which in our case takes a String as input; it also accepts, for example, an InputStream or a URL. The Source object has a getTextExtractor method that, unsurprisingly, lets you extract the text. The TextExtractor class gives you a few options to configure the extraction. For example, you can exclude the text of specific elements, and you can include attributes so that their values appear in the output.
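The two TextExtractor options mentioned above can be sketched as follows. This is a minimal sketch rather than code from our project: the class name, the method name and the rule "skip everything with class nav" are made up for illustration. It assumes Jericho 3.1, where excludeElement can be overridden and setIncludeAttributes switches on attribute values such as alt and title.

```java
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.TextExtractor;

// Hypothetical helper class, not part of Jericho or of the original project.
final class TextExtractionOptions {

    private TextExtractionOptions() {
    }

    /**
     * Extracts the text of the given HTML, skipping every element whose
     * class attribute is "nav" (an illustrative rule) and including the
     * attribute values that Jericho selects when attributes are enabled.
     */
    static String extractWithoutNavigation(String html) {
        Source source = new Source(html);
        TextExtractor extractor = new TextExtractor(source) {
            @Override
            public boolean excludeElement(StartTag startTag) {
                // Returning true drops the text inside this element.
                return "nav".equalsIgnoreCase(startTag.getAttributeValue("class"));
            }
        };
        return extractor.setIncludeAttributes(true).toString();
    }
}
```

Which attributes end up in the output is decided by the extractor itself; as far as I can tell from the javadoc, you can override includeAttribute in the same way if you need a different selection.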

Manipulating HTML

Manipulating HTML is very easy with Jericho. In the code example below I want to add an id attribute to all H2 elements to create anchor navigation. Once again I create a Source document. From this Source document I create an OutputDocument.

The OutputDocument represents a modified version of the original Source document. With the list of all H2 elements retrieved from the Source, we can ask for the attributes of each H2 element. If the id attribute already exists we do nothing; if it does not, we recreate the start tag with a new id attribute plus all the existing attributes of that H2 element.

As you can see in the example, it is relatively easy to manipulate the attributes of an element. The Attributes object gives you a List of the Attribute objects found in the source document or in a start tag. These attributes are not modifiable. The OutputDocument has a convenience method that lets us replace the original start tag with our newly created H2 start tag in order to add our id attribute.

    public String addIdAttributeToH2Elements(String html) {
        Source source = new Source(html);
        OutputDocument outputDocument = new OutputDocument(source);
        List<Element> h2Elements = source.getAllElements("h2");

        for (Element element : h2Elements) {
            StartTag startTag = element.getStartTag();
            Attributes attributes = startTag.getAttributes();
            Attribute idAttribute = attributes.get("id");

            if (idAttribute == null) {
                String elementValue = element.getTextExtractor().toString();
                // AnchorUtils is a helper from our own project and is not shown in this post.
                String validAnchorId = AnchorUtils.getLowerCasedValidAnchorTitle(elementValue);

                StringBuilder builder = new StringBuilder();
                builder.append("<h2").append(" ").append("id=\"").append(validAnchorId).append("\"");
                for (Attribute attribute : attributes) {
                    builder.append(" ");
                    builder.append(attribute);
                }
                builder.append(">");

                outputDocument.replace(startTag, builder);
            }
        }

        return outputDocument.toString();
    }
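The AnchorUtils helper used above is not part of Jericho and is not shown in this post. If you want to try the example, something along these lines could stand in for it. This is only a guess at reasonable behaviour, based on the method name: lower-case the heading text and replace everything that is not a letter or a digit with a hyphen. The fallback value "section" is an arbitrary choice.

```java
import java.util.Locale;

// Hypothetical stand-in for the project-specific AnchorUtils helper;
// the original implementation is not shown in the post.
final class AnchorUtils {

    private AnchorUtils() {
    }

    /**
     * Turns heading text into a lower-cased id: every run of characters
     * other than a-z and 0-9 becomes a single hyphen, and hyphens at the
     * start or end are dropped.
     */
    static String getLowerCasedValidAnchorTitle(String title) {
        String lowerCased = title.trim().toLowerCase(Locale.ROOT);
        String hyphenated = lowerCased.replaceAll("[^a-z0-9]+", "-");
        String trimmed = hyphenated.replaceAll("^-+|-+$", "");
        // An id must not be empty, so fall back to a fixed value.
        return trimmed.isEmpty() ? "section" : trimmed;
    }
}
```

With this sketch, a heading such as "Getting Started: Maven & API" becomes getting-started-maven-api. It does not make ids unique when two headings have the same text, and it does not force the id to start with a letter, so you may need to extend it for your own pages.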

Remove Elements

Just like me, you may want to remove a few tags from your HTML. Here is an example that shows you how you can achieve that.

    private static final Set<String> ALLOWED_HTML_TAGS = new HashSet<String>(Arrays.asList(
            HTMLElementName.ABBR,
            HTMLElementName.ACRONYM,
            HTMLElementName.SPAN,
            HTMLElementName.SUB,
            HTMLElementName.SUP)
    );

    private static String removeNotAllowedTags(String htmlFragment) {
        Source source = new Source(htmlFragment);
        OutputDocument outputDocument = new OutputDocument(source);
        List<Element> elements = source.getAllElements();

        for (Element element : elements) {
            if (!ALLOWED_HTML_TAGS.contains(element.getName())) {
                outputDocument.remove(element.getStartTag());
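                // Elements such as <br> may have no end tag, in which case getEndTag() returns null.
                if (!element.getStartTag().isSyntacticalEmptyElementTag()
                        && element.getEndTag() != null) {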
                    outputDocument.remove(element.getEndTag());
                }
            }
        }

        return outputDocument.toString();
    }

In the example above, after checking whether the tag is allowed, we remove only the start tag and the end tag. If you removed the complete element, you would also remove the text within it. The API also lets you check whether an element is empty. This can be handy for removing redundant empty elements or, in my case, for checking whether the start tag is a self-closing tag.
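To make the difference between removing a complete element and removing only its tags concrete, here is a small side-by-side sketch. The class and method names are made up for illustration, and it assumes the matching elements are not nested inside each other.

```java
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;

// Hypothetical helper class for illustration only.
final class RemoveElementVersusTags {

    private RemoveElementVersusTags() {
    }

    /** Removes each matching element completely, including the text inside it. */
    static String removeWholeElement(String html, String tagName) {
        Source source = new Source(html);
        OutputDocument outputDocument = new OutputDocument(source);
        for (Element element : source.getAllElements(tagName)) {
            outputDocument.remove(element);
        }
        return outputDocument.toString();
    }

    /** Removes only the start and end tags, so the enclosed text stays in place. */
    static String removeTagsOnly(String html, String tagName) {
        Source source = new Source(html);
        OutputDocument outputDocument = new OutputDocument(source);
        for (Element element : source.getAllElements(tagName)) {
            outputDocument.remove(element.getStartTag());
            // getEndTag() returns null when the element has no end tag.
            if (element.getEndTag() != null) {
                outputDocument.remove(element.getEndTag());
            }
        }
        return outputDocument.toString();
    }
}
```

For the input <p>Keep <b>this</b> text</p> and the tag name b, the first method drops the word "this" together with its tags, while the second keeps the word and returns <p>Keep this text</p>.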

Conclusion

In this post I showed you how I have used Jericho, but Jericho has a lot more interesting features. On their webpage they provide more examples on how to use those features. Jericho provides a nice and clean API and makes the parsing of HTML really easy!

12 Responses

  1. August 10, 2010 at 18:52 by k

    You can also do the following to update an attribute:

    Attributes attrs = element.getAttributes();
    Map<String, String> attrsUpdated = outputDocument.replace(attrs, true);
    // attrsUpdated.put(attributeName, attrValueUpdated);
    attrsUpdated.put("id", "idForSomeElement");
    String modifiedHtml = outputDocument.toString();

  2. August 10, 2010 at 19:01 by k

    Is there a way to make changes to a particular Element cumulative? I have 2 methods in my program that do different things, but they wind up modifying the same Element. However, when outputDocument is written out to String, only the first change to the Element (made in the first method) has taken effect in the outputDocument.

    The only solution that I can think of right now is after each method performs its logic is to get the String HTML from the outputDocument and re-create the Source and OutputDocument objects again (before the next method is called), which seems to be very inefficient. 🙁

    Thanks.

  3. August 11, 2010 at 09:24 by Roberto van der Linden

    Can’t you use the output of the two methods to rebuild the Element? Or is this not possible?

  4. August 11, 2010 at 21:54 by k

    Thanks for your reply.

    The methods do not return any result, but rather share instance-level Source and OutputDocument objects. I was able to resolve the issue by re-initializing the Source object to the current OutputDocument object’s HTML:

    // run method 1
    // at the beginning of method 2, do the following:
    this.source = new Source(this.outputDocument.toString());

    This solved my problem.

    Thanks again.

  5. September 20, 2010 at 00:49 by Kike

    Is there a way to show the tag text content avoiding to show nested tag text???

    For example: "<p>this is a piece of text with <a>a link</a> in it</p>"

    And I want to get the content of p (only of p): "this is a piece of text with in it"

    Thanks!!!

  6. September 20, 2010 at 15:13 by Roberto van der Linden

    Hi Kike,

    I have created a test that will retrieve the content you want:

    @Test
    public void removeSpecificSegments() throws Exception {
        Source bodySource = new Source("this is a piece of text with <a>a link</a> in it");
        List<Segment> segmentsToRemove = new ArrayList<Segment>();
        segmentsToRemove.addAll(bodySource.getAllElements("a"));

        OutputDocument outputDocument = new OutputDocument(bodySource);
        outputDocument.remove(segmentsToRemove);

        String output = outputDocument.toString();
        // Removing the link leaves two adjacent spaces; collapse them into one.
        String removedDoubleWhitespaces = output.replace("  ", " ");
        assertEquals("this is a piece of text with in it", removedDoubleWhitespaces);
    }

    As you can see, I have to remove double whitespaces at the end, because Jericho does not remove the whitespace before and/or after the segment.

    Hope this helps,

    Roberto

  7. September 24, 2010 at 20:53 by k

    A client of mine, who is considering using Jericho in a project, would like to know if any companies (i.e. SpringSource or other) provide any type of support model for Jericho (for legal purposes)?

    Thanks.

  8. September 27, 2010 at 16:10 by Roberto van der Linden

    Hi K,

    Maybe you can contact our COO Peter Meijer (peter.meijer@jteam.nl) and explain to him what kind of support you are looking for. Perhaps he can help you any further.

    Roberto

  9. April 11, 2012 at 20:05 by sadiruddin

    I want include only links(anchor tags) in my parsed output data. How can I achieve this?

  10. July 4, 2014 at 07:18 by LahiruR

    really helpful stuff! thank you.

  11. January 5, 2015 at 13:01 by Parth

    Hi
    I just want to store the text present between the tag for example in an excel sheet. How can I achieve this ?

  12. January 5, 2015 at 13:13 by Parth

    Hi
    I just want to save the text in between some tag say for example in an Excel sheet. How can I achieve this?