Python HTML Parser
A lot of information is embedded in HTML pages, which contain both human-readable text and markup. If you ever want to extract this information, don’t use regex; use a parser. Python has a built-in library, html.parser, to do just that.
The excellent html2text library uses it to parse HTML into markdown, which you can use for removing formatting. However, for your own purposes you can use a similar approach and build a custom parser by subclassing HTMLParser.
Here’s a simple example of a parser that tries to convert HTML to plain text. You would use it like this:
converter = HTMLTextConverter()
plain_text = converter('<html><h1>Example</h1><p>Hello world!</p></html>')
plain_text == 'Example\nHello world!'  # True
When you feed HTML to an HTMLParser it calls handle_starttag whenever it encounters an opening tag, handle_endtag whenever it encounters a closing tag, and handle_data whenever it encounters text between tags.
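To see the order in which these callbacks fire, here is a minimal sketch of a parser that only records what it is told. The EventLogger class and its events list are hypothetical names made up for this illustration; only HTMLParser and its three handler methods come from the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback the parser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))


logger = EventLogger()
logger.feed('<p>Hello <b>world</b>!</p>')
logger.close()  # flush any text still held in the parser's buffer
for event in logger.events:
    print(event)
```

Each tag and each run of text shows up as its own event, in document order, which is all the information our converter has to work with.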
To insert newlines whenever we hit a block-level tag we can implement a custom handle_starttag that passes a newline to an output method.
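def handle_starttag(self, tag, attrs):
    if tag in BLOCK_TAGS:
        # Block-level tags start a new line in the plain text.
        self.output('\n')
    elif tag in INLINE_TAGS:
        pass
    else:
        # Format the message ourselves; ValueError does not apply %s.
        raise ValueError('Unexpected tag %s' % tag)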
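Our converter ignores the attrs argument, but it is worth knowing what it holds: the parser passes the tag's attributes as a list of (name, value) tuples. As a rough sketch, here is a separate parser that uses it to collect link targets. The LinkCollector class and its links list are hypothetical names for this illustration and are not part of the converter.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical example: gathers the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, e.g. [('href', '/about')]
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<p><a href="/about">About</a> and <a class="x" href="/faq">FAQ</a></p>')
collector.close()
print(collector.links)
```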
In this case we don’t need to do anything special with end tags, but we do need to output all data. We strip leading and trailing newlines from each piece of data, because newlines in the HTML source are not rendered as line breaks.
def handle_data(self, data):
    self.output(data.strip('\n'))
The output method is one we need to add ourselves; it appends each piece of output to a list called outdata that we keep as internal state. We collect the pieces in a list rather than appending to a string because Python strings are immutable: every append can create a whole new string object, which gets very inefficient as the string grows.
def output(self, data):
    self.outdata.append(data)
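The collect-then-join pattern can be shown on its own, outside the parser. This is only a sketch: join_chunks is a hypothetical helper name, and how much time it saves in practice depends on the interpreter and on the size of the input.

```python
def join_chunks(chunks):
    """Collect pieces in a list, then build the final string once."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)   # appending mutates the list in place
    return ''.join(parts)     # one call builds the final string


text = join_chunks(['Example', '\n', 'Hello world!'])
print(repr(text))
```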
Of course we need to initialise self.outdata to an empty list.
def __init__(self) -> None:
    self.outdata = []
    super().__init__()
Finally we can provide a nice interface that does all the work when we call converter(html) by implementing the __call__ magic method.
def __call__(self, html):
    self.feed(html)
    output = ''.join(self.outdata).strip()
    # The full listing overrides reset() so that it also clears outdata.
    self.reset()
    return output
That’s all there is to implementing a simple HTML transformation in Python. If you wanted more complex transformations you would need to track more pieces of state; the html2text code is a good example of how this can work.
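As a small taste of what tracking extra state looks like, here is a sketch of a parser that keeps a counter so it can drop the contents of script and style tags. The VisibleTextExtractor class, SKIPPED_TAGS and skip_depth are hypothetical names for this illustration; this is not how html2text does it.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical sketch: skips text inside <script> and <style>."""

    SKIPPED_TAGS = ('script', 'style')

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # extra state: how many skipped tags are open
        self.outdata = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text when we are not inside a skipped tag.
        if self.skip_depth == 0:
            self.outdata.append(data)


extractor = VisibleTextExtractor()
extractor.feed('<p>Visible</p><style>p { color: red; }</style><script>var x = 1;</script>')
extractor.close()
print(''.join(extractor.outdata))
```

The start and end handlers only update the counter, and handle_data consults it. More complex transformations follow the same shape with more state.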
Full example listing
Here’s an example listing of the HTML parser. The functionality is very basic; it’s likely to produce far too much whitespace in certain cases, and it will fail on many HTML documents, because any tag missing from the two tuples raises a ValueError. However, it’s a reasonable starting point for building a custom HTML transformation function.
from html.parser import HTMLParser

# Tags that start a new line in the plain-text output.
BLOCK_TAGS = (
    'html', 'p', 'br',
    'li', 'ul', 'ol',
    'blockquote',
    'table', 'tbody', 'tr',
)

# Tags that are accepted without adding a newline.
# Note: 'ul' is also in BLOCK_TAGS, which is checked first.
INLINE_TAGS = (
    'strong', 'ul', 'em', 'i', 'b',
    'a', 'figure', 'img',
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
    'td',
)


class HTMLTextConverter(HTMLParser):
    def __init__(self) -> None:
        self.outdata = []
        super().__init__()

    def __call__(self, html):
        self.feed(html)
        output = ''.join(self.outdata).strip()
        self.reset()
        return output

    def reset(self):
        # Clear our own state along with the parser's buffers.
        super().reset()
        self.outdata = []

    def output(self, data):
        self.outdata.append(data)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.output('\n')
        elif tag in INLINE_TAGS:
            pass
        else:
            # Format the message ourselves; ValueError does not apply %s.
            raise ValueError('Unexpected tag %s' % tag)

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        self.output(data.strip('\n'))
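One way to tame the extra whitespace is to post-process the converter's output. The following is a sketch of one possible cleanup step, not part of the listing above; collapse_whitespace is a hypothetical helper, and it assumes you never want blank lines or repeated spaces in the result.

```python
import re


def collapse_whitespace(text):
    """Hypothetical cleanup step for the converter's plain-text output."""
    # Squeeze runs of spaces and tabs into a single space.
    text = re.sub(r'[ \t]+', ' ', text)
    # Remove spaces left next to a newline.
    text = re.sub(r' ?\n ?', '\n', text)
    # Collapse runs of newlines into a single newline.
    text = re.sub(r'\n+', '\n', text)
    return text.strip()


messy = 'Example\n\n\n  Hello   world!\n \nBye'
print(repr(collapse_whitespace(messy)))
```

You could call this on the string returned by the converter, or fold similar logic into handle_data if you prefer to clean up as you go.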