Extracting Links From HTML


April 13, 2021

Sometimes you have an HTML webpage or email that you want to extract all the links from. There are lots of ways to do this, but there’s a simple solution in Python with BeautifulSoup:

from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag in html that has a non-empty href."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a.get('href') for a in soup.find_all('a') if a.get('href')]

Other options include regular expressions (faster than parsing, but a little harder to get right), walking a parse tree directly, or using lxml. These would likely be a bit faster, but I like the flexibility of BeautifulSoup (especially with its select method for CSS selectors).
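
For comparison, here's a rough sketch of two of those alternatives using only the standard library: a regular expression, and the event-based html.parser module (which handles tags as they stream past rather than building a full tree). The names HREF_RE, LinkCollector, extract_links_regex and extract_links_parser are my own for illustration. The regex is deliberately simple and is an assumption about what the input looks like: it only handles quoted href values and will happily match anchors inside comments or scripts.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: an <a ...> tag containing href="..." or href='...'.
# It ignores unquoted values and does not decode entities such as &amp;.
HREF_RE = re.compile(
    r'<a\b[^>]*?\shref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)


def extract_links_regex(html):
    """Return quoted, non-empty href values found in <a> tags via a regex."""
    return [m.group(2) for m in HREF_RE.finditer(html) if m.group(2)]


class LinkCollector(HTMLParser):
    """Collect non-empty href attributes from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def extract_links_parser(html):
    """Return non-empty href values from <a> tags using html.parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = (
    '<p><a href="https://example.com">Example</a> '
    "<a class=\"nav\" href='/about'>About</a> "
    '<a name="top">no link</a></p>'
)
print(extract_links_regex(sample))
print(extract_links_parser(sample))
```

On this sample both functions return the same two links and skip the anchor without an href. They can disagree on messier input, which is the "harder to get right" part of the regex approach.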