Wednesday, 2 May 2012

HtmlAgilityPack incorrectly handles form tags

It's a horribly old bug, something that has been reported on their page since 2007 and has been in the issue list for HtmlAgilityPack since 2011. You want to parse a string as an HTML document and then get it back as a string from the DOM that the pack generates. Instead, it closes the form tag as if it had no children.
Example: <form></form> gets transformed into <form/></form>

The problem lies in the HtmlNode class of the HtmlAgilityPack project. It defines the form tag as empty in this line:
ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);
One can download the sources and remove the Empty value in order to fix the problem or, if one does not want to change the sources of the pack, use a workaround:
HtmlNode.ElementsFlags["form"] = HtmlElementFlag.CanOverlap;
Be careful, though: the ElementsFlags dictionary is a static property, so this change applies to the entire application.
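To see the effect of the workaround, here is a minimal sketch that round-trips the same markup before and after changing the flag. The class name FormRoundTrip, the RoundTrip helper and the sample markup are my own, made up for illustration; only ElementsFlags, LoadHtml and Save come from the pack itself. The exact output may differ between versions of the pack, so compare the two printed lines yourself.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

static class FormRoundTrip
{
    // Hypothetical helper: parses the HTML, then writes the DOM back to a string.
    static string RoundTrip(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    static void Main()
    {
        // Sample markup, assumed for illustration only.
        const string html = "<form><input type=\"text\" name=\"q\"></form>";

        // With the default flags the form element is treated as empty.
        Console.WriteLine(RoundTrip(html));

        // The workaround: the flag is static, so every document parsed
        // after this line, anywhere in the application, is affected.
        HtmlNode.ElementsFlags["form"] = HtmlElementFlag.CanOverlap;

        Console.WriteLine(RoundTrip(html));
    }
}
```

Because the flag is global, it probably makes sense to set it once at application startup rather than toggling it around individual parse calls.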

6 comments:

  1. It's not a "bug". It was designed this way, and it's configurable via the ElementsFlags, like you said (it's not a "workaround" either; again, it was designed this way as well).

  2. Even if static flags that affect the behaviour of every class that was or ever will be instantiated were good design (and they aren't!), one can hardly call defining the form element as childless, akin to br or meta, any form of design. I am referencing here the form definition page at the w3 site, just in case you feel like arguing about it: http://www.w3.org/TR/html4/interact/forms.html

  3. Html Agility Pack was designed well before HTML4. Read here for an explanation of FORM handling: http://stackoverflow.com/questions/4218847/htmlagilitypack-does-form-close-itself-for-some-reason

  4. http://ftp.ics.uci.edu/pub/ietf/html/rfc1866.txt - the RFC from November 1995. I quote: "The <FORM> element contains a sequence of input elements, along with document structuring elements."

    Even if Simon Mourier considered some elements as possibly overlapping, that doesn't excuse assuming they always overlap or saving them with the closed format when transformed back to string.

    Even Simon writes in that StackOverflow issue: "you can save them back without breaking the original HTML", which is exactly what this bug is about.

    Errare humanum est, perseverare diabolicum

  5. Is anybody maintaining the HtmlAgilityPack now?

  6. I don't know. The Codeplex site as well as the Twitter feed seem to have been inactive since August 2012. But HAP is widely used in a lot of projects.
