I've been trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. But I quickly hit an edge case where it fails, because parsing HTML is surprisingly difficult.

I was parsing some HTML that looked like this:

Some text.<br /><i><b>Title</b></i><br />...

When I ran html2text it produced an output like this:

Some text.
 _ **Title**_
...

This isn't correct markdown; there shouldn't be a space after the underscore.

I was pretty unlucky to hit this so quickly. Having only one type of emphasis or whitespace between the . and <i> tag would produce correct output. So once I found a minimum reproducible example I raised an issue.

Understanding the issue

The html2text library was originally written by Aaron Swartz, the inventor of Markdown. The bulk of it is implemented as a very large Python html.parser object.

The issue comes in how whitespace is handled in the following code:

def no_preceding_space(self: HTML2Text) -> bool:
    return bool(
        self.preceding_data and re.match(r"[^\s]", self.preceding_data[-1])
    )

if tag in ["em", "i", "u"] and not self.ignore_emphasis:
    if start and no_preceding_space(self):
        emphasis = " " + self.emphasis_mark
    else:
        emphasis = self.emphasis_mark

    self.o(emphasis)
    if start:
        self.stressed = True

if tag in ["strong", "b"] and not self.ignore_emphasis:
    if start and no_preceding_space(self):
        strong = " " + self.strong_mark
    else:
        strong = self.strong_mark

    self.o(strong)
    if start:
        self.stressed = True

When there's not a preceeding space one is added. However this inserted space in the output should count as preceeding, but it doesn't, and so another space is added.

Since this is the only usage of preceding_data a safe workaround is to append a space to self.preceding_data. There are many other ways we could possibly preserve the state if altering this may break existing code. I've tried submitting this as a pull request.

Parsing HTML in a meaningful way is really hard! The html2text library is very mature and does very well, but even it has weird edge cases.