I've been thinking about how to convert HTML to text for NLP. We want to at least extract the text, but preserving some of the formatting can make it easier to extract information down the line. Unfortunately it's a little tricky to get the segmentation right.

The standard answer on Stack Overflow is to use Beautiful Soup's getText method. Unfortunately this just replaces every tag with the separator argument, whether the tag is block level or inline. For a lot of compact HTML this changes the meaning.

The pragmatic answer I've ended up with is to convert the HTML to Markdown with html2text, parse the Markdown back into HTML, and then convert that HTML to text. This is ridiculously inefficient, but it lets me offload the processing logic to external tools and does a good enough job.

The final solution looks like this:

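import mistletoe
from bs4 import BeautifulSoup

def html2plain(html):
    # HTML -> Markdown -> simplified HTML -> plain text
    md = html2md(html)
    md = normalise_markdown_lists(md)
    html_simple = mistletoe.markdown(md)
    # Name the parser explicitly so Beautiful Soup doesn't have to guess one
    text = BeautifulSoup(html_simple, 'html.parser').getText()
    text = fixup_markdown_formatting(text)
    return text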

# The problem

For example the following HTML document:

<b>Section</b><br />A list<ul><li>Item <b>1</b></li></ul>

Would be converted to something where we lose all sentence and section structure:

Section A list Item 1

We can ask BeautifulSoup to separate the text with newlines instead, but that also breaks lines at inline tags:

Section
A list
Item
1

The best option is to write your own HTML parser, but that's hard because you have to decide what to do with every case and deal with the complexities of real HTML. Another way is to first convert it to Markdown with html2text. Then we would get something we may be able to parse:

Section

A list

* Item *1*

We can then convert that back into simple HTML or plain text.

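from bs4 import BeautifulSoup
from mistletoe import markdown
from html2text import HTML2Text

# html holds the source HTML string
md = HTML2Text().handle(html)
html2 = markdown(md)
# Name the parser explicitly so Beautiful Soup doesn't have to guess one
text = BeautifulSoup(html2, 'html.parser').getText()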
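
For comparison, here is a rough sketch of where the write-your-own-parser route starts, using only the standard library's html.parser. The BLOCK_TAGS set, the class name and the helper function are my own illustrative choices, not part of the pipeline, and it ignores almost all of the complexity of real HTML (scripts, tables, whitespace rules, entities in odd places).

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of block-level tags, chosen for illustration.
BLOCK_TAGS = {"p", "div", "br", "ul", "ol", "li", "h1", "h2", "h3", "table", "tr"}


class BlockAwareExtractor(HTMLParser):
    """Collect text, breaking lines at block-level tags but not inline ones."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Drop empty lines and surrounding whitespace left by adjacent block tags.
        lines = (line.strip() for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


def block_aware_text(html):
    extractor = BlockAwareExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered text
    return extractor.text()


example = "<b>Section</b><br />A list<ul><li>Item <b>1</b></li></ul>"
print(block_aware_text(example))
```

On the example this keeps "Item 1" on one line because b is not in BLOCK_TAGS. Every extra case (nested lists, tables, headings that need emphasis) means more code, which is why I'd rather lean on html2text.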

# Converting the HTML to Markdown

The html2text library does a good job of converting HTML to Markdown. We need to give it a little configuration to get the output we want. In particular, to turn off line wrapping we need to set body_width to 0. I also ignore links and images since they are rare and I have no way of dealing with them.

def html2md(html):
    parser = HTML2Text()
    parser.ignore_images = True
    # The HTML2Text attribute that drops <a> tags is ignore_links
    parser.ignore_links = True
    # 0 disables line wrapping
    parser.body_width = 0
    md = parser.handle(html)
    return md

# Normalising Lists

HTML has a standard way of creating lists: <ul> and <li> tags. However, surprisingly often I find custom lists with formats like List<br />- Item 1. We can convert these kinds of lists to look the same as a Markdown list with a little bit of regex:

import re

def normalise_markdown_lists(md):
    # The hyphen goes last in the character class so it is a literal, not a range
    return re.sub(r'(^|\n) ? ? ?\\?[·*-]( \w)', r'\1  *\2', md)
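
To see what the substitution does, here is a small check on a made-up snippet. The function is restated so the example runs on its own, with the hyphen placed last in the character class so that it is read literally; the sample string is invented for illustration and is not real html2text output.

```python
import re


def normalise_markdown_lists(md):
    # Up to three leading spaces, an optional backslash, then ·, * or - as the bullet.
    return re.sub(r'(^|\n) ? ? ?\\?[·*-]( \w)', r'\1  *\2', md)


# Made-up input: an escaped hyphen, a middle dot, an indented hyphen,
# and a hyphen in running text that should be left alone.
sample = "Requirements  \n\\- Python\n· SQL\n - Git\nwell - known\n"
print(normalise_markdown_lists(sample))
```

The three list-like lines all come out as "  * Item" bullets, while the hyphen in "well - known" is untouched because it does not start a line.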

# Converting the Markdown back to Text

There are a bunch of Markdown parsers, but mistletoe seems to be a good one. The main benefit of going through Markdown is that irrelevant tags are stripped off, and the mistletoe HTML output is consistently formatted. In particular there are line breaks around block-level elements, which may not be true for the source HTML.

html_simple = mistletoe.markdown(md)
text = BeautifulSoup(html_simple, 'html.parser').getText()

# Cleaning up processing errors

As nice as html2text is, it has issues with multiple kinds of emphasis and repeated emphasis. For the repeated emphasis I remove any leftover double stars. Sometimes tables seem to leave an extra vertical bar, |, at the start of a line. I also clean up final whitespace, runs of blank lines and leading newlines.

def fixup_markdown_formatting(text):
    # Strip off table formatting
    text = re.sub(r'(^|\n)\|\s*', r'\1', text)
    # Strip off extra emphasis
    text = re.sub(r'\*\*', '', text)
    # Remove trailing spaces ($ must be unescaped to anchor at the end of the text)
    text = re.sub(r' *$', '', text)
    # Collapse runs of blank lines and drop leading newlines
    text = re.sub(r'\n\n+', r'\n\n', text)
    text = re.sub(r'^\n+', '', text)
    return text
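
Here is a quick check of the cleanup on an invented string. The function is restated so the snippet runs on its own, with `$` left unescaped so that it anchors at the end of the text rather than matching a literal dollar sign; the input is made up to trigger each rule once.

```python
import re


def fixup_markdown_formatting(text):
    text = re.sub(r'(^|\n)\|\s*', r'\1', text)  # leading table bar
    text = re.sub(r'\*\*', '', text)            # leftover double stars
    text = re.sub(r' *$', '', text)             # spaces at the end of the text
    text = re.sub(r'\n\n+', r'\n\n', text)      # runs of blank lines
    text = re.sub(r'^\n+', '', text)            # leading newlines
    return text


# Made-up input: a stray table bar, leftover emphasis, extra blank lines, trailing spaces.
messy = "\n| Role\n**Data** engineer\n\n\n\nApply now   "
print(fixup_markdown_formatting(messy))
```

Note that single stars are left alone, so only the doubled emphasis markers disappear.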

# Testing it out

This pipeline was actually developed by trying it out on some example job ads. The next step would be to create some formal tests based on these examples, but I'm happy to start with this until there are enough issues to justify improving it.
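
If I do get around to formal tests, a minimal starting point might look like the sketch below: a table of input and expected output pairs, and a helper that reports the ones a converter gets wrong. Everything here is hypothetical. The failing_cases helper, the stand-in converter and the cases are invented so the snippet runs without the pipeline's dependencies; in practice the converter would be html2plain and the cases would come from hand-checked job ads.

```python
def failing_cases(convert, cases):
    """Return (html, expected, actual) for each case where convert(html) != expected."""
    failures = []
    for html, expected in cases:
        actual = convert(html)
        if actual != expected:
            failures.append((html, expected, actual))
    return failures


# Stand-in converter so the snippet is self-contained; pass html2plain in practice.
def strip_only(html):
    return html.strip()


# Hypothetical cases, not real job-ad data.
CASES = [
    ("  plain text  ", "plain text"),
    ("<b>bold</b>", "bold"),  # the stand-in fails this one because it keeps the tags
]

for html, expected, actual in failing_cases(strip_only, CASES):
    print(f"FAIL {html!r}: expected {expected!r}, got {actual!r}")
```

Keeping the cases as plain data makes it cheap to add a new one every time the pipeline mangles a real document.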