Extracting Links From HTML
python
Sometimes you have an HTML webpage or email that you want to extract all the links from. There are lots of ways to do this, but there’s a simple solution in Python with BeautifulSoup:
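from bs4 import BeautifulSoup

def extract_links(html):
    # Parse with Python's built-in parser, so no extra parser package is needed
    soup = BeautifulSoup(html, 'html.parser')
    # Keep only <a> tags that have a non-empty href attribute
    return [a.get('href') for a in soup.find_all('a') if a.get('href')]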
Other options include regular expressions (faster than parsing, but a little harder to get right), walking a parse tree directly, or using lxml. These would likely be a bit faster, but I like the flexibility of BeautifulSoup (especially its select method for CSS selectors).
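For comparison, here is a rough sketch of two of the alternatives mentioned above, using only the standard library. The names extract_links_regex, LinkCollector and extract_links_stdlib are made up for this illustration. The regular expression assumes quoted attribute values and is deliberately naive, and html.parser is an event-based parser rather than a full parse tree, so treat it as a stand-in for that approach.

```python
import re
from html.parser import HTMLParser

# Naive pattern: finds href="..." or href='...' inside an <a ...> start tag.
# Assumption: attribute values are quoted; unquoted values are missed, and
# look-alike attributes such as data-href can produce false matches.
HREF_RE = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)


def extract_links_regex(html):
    """Return non-empty href values found by pattern matching (no parsing)."""
    return [m.group(2) for m in HREF_RE.finditer(html) if m.group(2)]


class LinkCollector(HTMLParser):
    """Collect non-empty href values from <a> start tags as they are parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links_stdlib(html):
    """Return non-empty href values using the standard library's HTML parser."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Both functions take an HTML string and return a list of href strings, like extract_links does. The regex version shows why that route is harder to get right: every edge case (unquoted values, similar attribute names, links inside comments) has to be handled by hand.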