## Automated Refactoring in Python

I am a very recent convert to automatic refactoring tools. I thought they were something for languages like Java that have a lot of boilerplate, and overkill for something like Python. I still liked the concept of refactoring, but I just moved the code around with Vim motions or sed. But then I came up against a giant Data Science codebase that was a wall of instructions like this: `import pandas as pd import datetime df = pd.`

• Edward Ross
pandas

## Writing Pandas Dataframes to S3

To write a Pandas (or Dask) dataframe to Amazon S3 or Google Cloud Storage, all you need to do is pass an S3 or GCS path to a serialisation function, e.g. `df.to_csv(f's3://{bucket}/{key}')` where `df` is a Pandas dataframe. Under the hood Pandas uses fsspec, which lets you work easily with remote filesystems, and abstracts over s3fs for Amazon S3 and gcsfs for Google Cloud Storage (and other backends such as (S)FTP, SSH or HDFS).

• Edward Ross
pandas

## Fast Pandas DataFrame to Dictionary

Tabular data in Pandas is very flexible, but sometimes you just want a key-value store for fast lookups. Because Python is slow, but Pandas and Numpy often have fast C implementations under the hood, the way you do something can have a large impact on its speed. The fastest way I've found to convert a dataframe to a dictionary from one column's keys to another column's values is `df.set_index(keys)[value].to_dict()`. The rest of this article will discuss how I used this to speed up a function by a factor of 20.
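A minimal sketch of the pattern (the column names and data here are made up):

```python
import pandas as pd

# A small frame mapping product codes to prices (hypothetical example data)
df = pd.DataFrame({'code': ['a', 'b', 'c'], 'price': [10, 20, 30]})

# Index by the key column, select the value column, convert in one chained call
lookup = df.set_index('code')['price'].to_dict()
# lookup == {'a': 10, 'b': 20, 'c': 30}
```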

• Edward Ross
python

## Chompjs for parsing tricky Javascript Objects

Modern Javascript web frameworks often embed the data used to render each webpage in the HTML. This means an easy way of extracting data is capturing the string representation of the object with a pushdown automaton and then parsing it. Sometimes Python's json.loads won't cut it for dynamic JSON; one option is demjson, but another much faster option is chompjs. Chompjs converts a Javascript string into something that json.loads can handle. It's a little less strict than demjson; for example `{"key": undefined}` will be converted by chompjs.

• Edward Ross
python

## Aggregating Quantiles with Pandas

One of my favourite tools in Pandas is agg for aggregation (it's a worse version of dplyr's summarise). Unfortunately it can be difficult to work with for custom aggregates, like the nth largest value. If your aggregate is parameterised, like quantile, you potentially have to define a function for every parameter you use. A neat trick is to use a class to capture the parameters, making it much easier to try out variations.
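A sketch of the trick described (the class name, `__name__` labelling, and data are my own illustration, not necessarily the article's exact code):

```python
import pandas as pd

# A callable class capturing the quantile parameter; each instance can be
# passed to agg like a named function
class Quantile:
    def __init__(self, q):
        self.q = q
        self.__name__ = f'quantile_{q}'  # agg uses __name__ to label the column

    def __call__(self, series):
        return series.quantile(self.q)

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'], 'v': [1.0, 3.0, 2.0, 4.0]})
result = df.groupby('g')['v'].agg([Quantile(0.25), Quantile(0.75)])
```

Trying a new quantile is then just instantiating `Quantile(0.9)` rather than defining another function.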

• Edward Ross
python

## A Command Line Interface for HTML With parsel-cli

There are many great command line tools for searching and manipulating text (like grep), columnar data (like awk) and JSON data (like jq). With HTML there's parsel-cli, built on top of the wonderful parsel Python library. Parsel is a fantastic library that gives a simple and powerful interface for extracting data from HTML documents using CSS selectors, XPath and regular expressions. Parsel-cli is a very small utility that lets you use parsel from the command line (and can be installed with `pip install parsel-cli`).

• Edward Ross
python

## Persistent Dictionaries in Python

Dictionaries in Python (called maps or hashmaps in other languages) are a useful and flexible data structure that can be used to solve lots of problems. Part of their charm is the affordances in the language for them; setting and accessing with square brackets [], deleting with del. But sometimes you want a dictionary that persists across sessions, or can handle more data than you can fit into memory - and there's a solution: persistent dictionaries.
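One inbuilt option is `shelve`, a dict-like object backed by a file; a quick sketch (the keys and values here are made up):

```python
import os
import shelve
import tempfile

# A scratch location for the backing file
path = os.path.join(tempfile.mkdtemp(), 'cache')

with shelve.open(path) as db:
    db['answer'] = 42          # set with square brackets, just like a dict
    db['scores'] = [1, 2, 3]   # values can be any picklable object

with shelve.open(path) as db:  # reopening in a new session keeps the data
    assert db['answer'] == 42
```

Other approaches include the lower-level `dbm` modules or third-party stores, depending on how much data you need to handle.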

• Edward Ross
python

## Not Using Scrapy for Web Scraping

Scrapy is a fast high-level web crawling and web scraping framework. But as much as I want to like it, I find it very constraining, and there's a lot of added complexity and magic. If you don't fit the typical use case, doing things with Scrapy feels like a lot more work and learning than doing them without. I really like Zyte (formerly ScrapingHub), the team behind Scrapy. They really know what they're talking about, with great blogs about QA of Data Crawls, a guide to browser tools, and how bots are tracked, and Scrapy's documentation has a very useful page on selecting dynamically-loaded content.

• Edward Ross
python

Sometimes you have an HTML webpage or email that you want to extract all the links from. There are lots of ways to do this, but there's a simple solution in Python with BeautifulSoup: `from bs4 import BeautifulSoup` then `def extract_links(html): soup = BeautifulSoup(html, 'html.parser'); return [a.get('href') for a in soup.find_all('a') if a.get('href')]`. Some other methods would be to use regular expressions (which would be faster than parsing, but a little harder to get right), directly going through a parse tree, or using lxml.

• Edward Ross
python

## Reading Email in Python with imap-tools

You can use Python to read, process and manage your emails. While most email providers provide autoreplies and filter rules, you can do so much more with Python. You could download all your PDF bills from your electricity provider, parse structured data from emails (using e.g. BeautifulSoup), sort or filter by sentiment, or even do your own personal analytics like Stephen Wolfram. The easiest tool I've found for reading emails in Python is imap_tools.

• Edward Ross
python

## Double emphasis error in html2text

I'm trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. I've already resolved an issue with multiple types of emphasis. However HTML in the wild has all sorts of weird edge cases that the library has trouble with. In this case I found a term that was emphasised twice: `<strong><strong>word</strong></strong>`. I'm pretty sure for a browser this is just the same as doing it once: `<strong>word</strong>`.

• Edward Ross
python

## An edge bug in html2text

I've been trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. But I quickly hit an edge case where it fails, because parsing HTML is surprisingly difficult. I was parsing some HTML that looked like this: `Some text.<br /><i><b>Title</b></i><br />...` When I ran html2text it produced an output like this:

• Edward Ross
writing

## Writing Blog Posts with Jupyter and Hugo

It can be convenient to directly publish a mixture of prose, source code and graphs. It ensures the published code actually runs and makes it much easier to rerun at a later point. I've done this before in Hugo with R Blogdown, and now I'm experimenting with Jupyter notebooks. The best available option seems to be nb2hugo, which converts the notebook to markdown, keeping the front matter and exporting the images.

• Edward Ross
python

## Raising Exceptions in Python Futures

Python concurrent.futures are a handy way of dealing with asynchronous execution. However if you're not careful they will swallow your exceptions, leading to difficult-to-debug errors. While you can perform concurrent downloads with multiprocessing, it means starting up multiple processes and sending data between them as pickles. One problem with this is that you can't pickle some kinds of objects and often have to refactor your code to use multiprocessing.
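A small illustration of the swallowing behaviour (the failing function is made up): submitting a task that raises does not raise at the call site, only when `.result()` is requested.

```python
from concurrent.futures import ThreadPoolExecutor

def fails():
    raise ValueError('boom')

with ThreadPoolExecutor() as pool:
    future = pool.submit(fails)
    # No exception here: it is captured inside the future object

# The exception only surfaces when the result is requested
try:
    future.result()  # re-raises the ValueError from the worker
except ValueError as error:
    print('caught:', error)
```

If you never call `.result()` (or check `.exception()`), the failure passes silently.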

• Edward Ross
python

## Stanza for NLP

Working with unstructured text is much easier if we add structure to it. Stanza is a state-of-the-art library for doing this in over 60 languages. Given some text it will tokenize, sentencize, tag parts of speech and morphological features, parse syntactic dependencies, and in a few languages perform NER. It's easy to use and gets extremely good results on benchmarks for each of these tasks in a large number of languages.

• Edward Ross
python

## pyBART: Better Dependencies for Information Extraction

Dependency trees are a remarkably powerful tool for information extraction. Neural-based taggers are very good, and Universal Dependencies means the approach can be used for almost any language (although the rules are language specific). However syntax can get really strange, requiring increasingly complex rules to extract information. The pyBART system solves this by rewriting the rules to be half a step closer to semantics than syntax. I've seen that dependency based rules are useful for extracting skills from noun phrases and adpositions.

• Edward Ross
python

## Demjson for parsing tricky Javascript Objects

Modern Javascript web frameworks often embed the data used to render each webpage in the HTML. This means an easy way of extracting data is capturing the string representation of the object with a pushdown automaton and then parsing it. Python's inbuilt json.loads is effective but won't handle very dynamic Javascript; demjson will (another, much faster alternative is Chompjs). The problem shows up when using json.loads as the following obscure error:

• Edward Ross
python

## Tips for Extracting Data with Beautiful Soup

Beautiful Soup can be a useful library for extracting information from HTML. Unfortunately there are a lot of little issues I hit working with it to extract data from a careers webpage using Common Crawl. The library is still useful enough to work with, but the issues make me want to look at alternatives like lxml (via html5-parser). The source data can be obtained at the end of the article. Use a good HTML parser: Python has an inbuilt html.

• Edward Ross
python

## Saving Requests and Responses in WARC

When fetching large amounts of data from the internet a best practice is caching all the data. While it might seem easy to extract just the information you need, it's easy to hit edge cases or changing structure, and you can never use the data you throw away. This is easy to do in the Web ARChive (WARC) format with warcio used by the Internet Archive and Common Crawl. from warcio.

• Edward Ross
python

## Only write file on success

When writing data pipelines it can be useful to cache intermediate results to recover more quickly from failures. However if a corrupt or incomplete file is written then you could end up caching that broken file. The solution is simple: only write the file on success. A strategy for this is to write to some temporary file, and then move the temporary file into place on completion. I've wrapped this in a Python context manager called AtomicFileWriter which can be used in a with statement in place of open:
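A sketch of the write-then-move strategy (the article's AtomicFileWriter may differ in detail; this is one possible implementation):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_writer(path, mode='w'):
    # Write to a temporary file in the destination directory, so the final
    # move stays on the same filesystem and is atomic on POSIX
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, mode) as f:
            yield f
        os.replace(tmp_path, path)  # only runs if the with-body succeeded
    except BaseException:
        os.unlink(tmp_path)         # discard the partial file on failure
        raise
```

Used as `with atomic_writer('out.csv') as f: ...` in place of `open`, a crash mid-write leaves no file at the destination.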

• Edward Ross
python

## Analysing Web Data Commons with SPARQL

I am trying to understand how the JobPosting schema is used in Web Data Commons structured data extracts from Common Crawl. I wrote a lot of ad hoc Python to get usage statistics on JobPosting. However SPARQL is a tool that makes it much easier to answer these kinds of questions. After reading in the graphs individually they can be combined into a rdflib.Dataset so we can query them all together.

• Edward Ross
python

I've been using RDFLib to parse job posts extracted from Common Crawl. RDF Literals: it automatically parses XML Schema Datatypes into Python data structures, but doesn't handle the <http://schema.org/Date> datatype that commonly occurs in JSON-LD. It's easy to add with the rdflib.term.bind command, but this kind of global binding could lead to problems. When RDFLib parses a literal it will create a rdflib.term.Literal object, and the value field will contain the Python type if it can be successfully converted, otherwise it will be None.

• Edward Ross
python

## Converting RDF to Dictionary

The Web Data Commons has a vast repository of structured RDF Data about local businesses, hostels, job postings, products and many other things from the internet. Unfortunately it's not in a format that's easy to do analysis on. We can stream the nquad format to get RDFlib Graphs, but we still need to convert the data into a form we can do analysis on. We'll do this by turning the relations into dictionaries of properties to the list of objects they contain.

• Edward Ross
data

The Web Data Commons extracts structured RDF Data from about one monthly Common Crawl per year. These contain a vast amount of structured information about local businesses, hostels, job postings, products and many other things from the internet. Python's RDFLib can read the n-quad format the data is stored in, but by default requires reading all of the millions to billions of relations into memory. However it's possible to process this data in a streaming fashion, allowing it to be processed much faster.

• Edward Ross
python

## Parsing Escaped Strings

Sometimes you may have to parse a string with backslash escapes; for example "this is a \"string\"". This is quite straightforward to parse with a state machine. The idea of a state machine is that the action we need to take will change depending on what we have already consumed. This can be used for proper regular expressions (without special things like lookahead), and the ANTLR4 parser generator can maintain a stack of "modes" that can be used similarly.
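A minimal sketch of such a state machine (the function name is my own): the state records whether the previous character was a backslash, which changes how the next character is handled.

```python
def parse_escaped(text):
    """Parse a double-quoted string with backslash escapes, e.g. "a \"b\"".

    State machine with two states: normal, and "just saw a backslash".
    """
    assert text[0] == '"'
    result = []
    escaped = False
    for char in text[1:]:
        if escaped:
            result.append(char)       # escaped character is taken literally
            escaped = False
        elif char == '\\':
            escaped = True            # switch state: next char is escaped
        elif char == '"':
            return ''.join(result)    # an unescaped quote closes the string
        else:
            result.append(char)
    raise ValueError('unterminated string')
```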

• Edward Ross
python

## Minibatching in Python

Sometimes you have a long sequence you want to break into smaller sized chunks. This is generally because you want to use some downstream process that can only handle so much data at a time. This is common in stochastic gradient descent in deep learning where you are constrained by the memory on the GPU. But this is also useful for API calls that can take a list, but can't handle all the data at once.
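One common recipe for this chunking, using itertools (the function name is my own):

```python
from itertools import islice

def minibatch(iterable, n):
    # Repeatedly slice off up to n items; works on any iterable, including
    # generators, without loading the whole sequence into memory
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch
```

For example, `list(minibatch(range(7), 3))` yields `[[0, 1, 2], [3, 4, 5], [6]]`: the last batch is simply smaller.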

• Edward Ross
web

## Serving Static Assets with Python Simple Server

I was trying to load a local file in an HTML page and got a Cross-Origin Request Blocked error in my browser. The solution was to start a Python web server with `python3 -m http.server`. I had a JSON file I wanted to load into Javascript in an HTML page. Looking at StackOverflow I found fetch could do this: `fetch("test.json").then(response => response.json()).then(json => process(json))` where process is some function that acts on the data; console.

• Edward Ross
python

## Cartesian Product in R and Python

You've got a couple of groups and you want to get every possible combination of them. This is called the Cartesian product of the groups. There are standard ways of doing this in R and Python. Python: List Comprehensions. Concretely we've got (in Python notation) the vectors x = [1, 2, 3] and y = [4, 5] and we want to get all possible pairs: [(1, 4), (2, 4), (3, 4), (1, 5), (2, 5), (3, 5)].
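Both standard Python approaches in one sketch:

```python
from itertools import product

x = [1, 2, 3]
y = [4, 5]

# List comprehension: the outer loop is over y, so x varies fastest,
# matching the pair order above
pairs = [(a, b) for b in y for a in x]

# itertools.product yields the same set of pairs (in a different order:
# its last argument varies fastest)
pairs_product = list(product(x, y))
assert set(pairs_product) == set(pairs)
```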

• Edward Ross
jobs

## Minhash Sets

We've found pairs of near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using Minhash. But many pairs will be part of the same group; in an extreme case there could be a group of 5 job ads with identical texts, which produces 10 pairs. Both for interpretability and usability it makes sense to extract these groups from the pairs. Extracting the Groups Directly with Union Find: each band of the LSH consists of buckets of items that may be similar; you could view the buckets as a partition of the corpus of all documents.
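A minimal union-find (disjoint set) sketch for merging pairs into groups; the pair data here is hypothetical, standing in for the near-duplicate pairs from the LSH buckets.

```python
from collections import defaultdict

parent = {}

def find(x):
    # Follow parent pointers to the root, halving paths as we go
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

pairs = [(1, 2), (2, 3), (4, 5)]  # hypothetical near-duplicate pairs
for a, b in pairs:
    union(a, b)

# Collect the connected components: {1, 2, 3} and {4, 5}
groups = defaultdict(set)
for x in parent:
    groups[find(x)].add(x)
```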

• Edward Ross
nlp

## Searching for Near Duplicates with Minhash

I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. In the last article I built a collection of MinHashes of the 400,000 job ads in half an hour in a 200MB file. Now I need to efficiently search through these minhashes to find the near duplicates, because brute force search through them would take a couple of days on my laptop. MinHash was designed to approach this problem, as outlined in the original paper.

• Edward Ross
nlp

## Detecting Near Duplicates with Minhash

I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. I've found that the Jaccard index on n-grams is effective for finding these. Unfortunately it would take about 8 days to calculate the Jaccard index on all pairs of the 400,000 ads, and about 640GB of memory to store it. While this is tractable, we can find almost all pairs with a significant overlap in half an hour in-memory using MinHash.

• Edward Ross
nlp

## Near Duplicates with TF-IDF and Jaccard

I've looked at finding near duplicate job ads using the Jaccard index on n-grams. I wanted to see whether using the TF-IDF to weight the ads would result in a clearer separation. It works, but the results aren't much better, and there are some complications in using it in practice. When trying to find similar ads with the Jaccard index we looked at the proportion of n-grams they have in common relative to all the n-grams between them.

• Edward Ross
nlp

## Near Duplicates with Jaccard

Finding near-duplicate texts is a hard problem, but the Jaccard index on n-grams is an effective measure that's efficient on small sets. I've tried it on the Adzuna Job Salary Predictions Kaggle Competition with good success. This works pretty well at finding near-duplicates and even ads from the same company; although by itself it can't detect duplicates. I've looked before at using the edit distance, which looks for the minimum number of changes to transform one text into another, but it's slow to calculate.
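The measure itself is short to sketch: the Jaccard index of two texts' n-gram sets is the size of the intersection over the size of the union (function names are my own).

```python
def ngrams(text, n=3):
    # The set of word n-grams in a text
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    x, y = ngrams(a, n), ngrams(b, n)
    if not x and not y:
        return 1.0  # two empty texts are identical by convention
    return len(x & y) / len(x | y)
```

Near-duplicates score close to 1; unrelated texts close to 0.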

• Edward Ross
nlp

## Edit Distance

Edit distance, also known as Levenshtein distance, is a useful way of measuring the similarity of two sequences. It counts the minimum number of substitutions, insertions and deletions you need to make to transform one sequence into another. I had a look at using this to compare duplicate ads, with reasonable results, but it's a little slow to run on many ads. I've previously looked at finding ads with exactly the same text in the Adzuna Job Salary Predictions Kaggle Competition, but there are a lot of ads that are slight variations.
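The standard dynamic programming implementation, keeping only one row of the table at a time:

```python
def edit_distance(a, b):
    # prev[j] is the distance from the current prefix of a to b[:j]
    prev = list(range(len(b) + 1))          # from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]
```

This is O(len(a) × len(b)) per pair, which is why running it across all pairs of a large corpus is slow.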

• Edward Ross
python

## Finding Exact Duplicate Text

Finding exact duplicate texts is quite straightforward and fast in Python. This can be useful for removing duplicate entries in a dataset. I tried this on the Adzuna Job Salary Predictions Kaggle Competition job ad texts and found it worked well. Naively finding exact duplicates by comparing every pair would be O(N^2), but if we sort the input, which is O(N log N), then duplicate items are adjacent. This scales really well to big datasets, and the duplicate entries can then be handled efficiently with itertools groupby to do something like uniq.
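The sort-then-groupby idea in miniature (the texts here are made-up stand-ins for ad texts):

```python
from itertools import groupby

texts = ['b', 'a', 'b', 'c', 'a', 'b']  # hypothetical ad texts

# Sorting makes equal items adjacent (O(N log N)); groupby then collects
# each run of equal items, like uniq
duplicates = [key for key, group in groupby(sorted(texts))
              if len(list(group)) > 1]
# → ['a', 'b']
```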

• Edward Ross
python

## Showing Side-by-Side Diffs in Jupyter

When comparing two texts it's useful to have a side-by-side comparison highlighting the differences. This is straightforward using HTML in Jupyter Notebooks with Python, and the inbuilt DiffLib. I used this to display job ads duplicated between different sites. For a long document it's important to align the sentences (otherwise it's hard to compare the differences), and highlight the individual differences at a word level. Overall the problems are breaking up a text into sentences and words, aligning the sentences, finding word level differences and displaying them side-by-side.

• Edward Ross
nlp

## Creating a Diff Recipe in Prodigy

I created a simple custom recipe to show diffs between two texts in Prodigy. I intend to use this to annotate near-duplicates. The process was pretty easy, but I got tripped up a little. I've been extracting job titles and skills from the job ads in the Adzuna Job Salary Predictions Kaggle Competition. One thing I noticed is there are a lot of job ads that are almost exactly the same; sometimes between the train and test set, which is a data leak.

• Edward Ross
python

## Counting n-grams with Python and with Pandas

Sequences of words are useful for characterising text and for understanding text. If two texts have many similar sequences of 6 or 7 words it's very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like "New York" together rather than treating them as the separate words "New" and "York". To do this we need a way of extracting and counting sequences of words.
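In pure Python, extracting and counting word n-grams is a short recipe with zip and Counter (a sketch; the Pandas version in the article works on whole columns):

```python
from collections import Counter

def count_ngrams(text, n=2):
    # zip the word list against itself shifted by 1..n-1 to form n-grams
    words = text.split()
    return Counter(zip(*(words[i:] for i in range(n))))

counts = count_ngrams('new york is in new york state')
# ('new', 'york') occurs twice
```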

• Edward Ross
nlp

## Not using NER for extracting Job Titles

I've been trying to use a Named Entity Recogniser (NER) to extract job titles from the titles of job ads, to better understand a collection of job ads. While NER is great, it's not the right tool for this job, and I'm going to switch to a counting-based approach. NER models try to extract things like the names of people, places or products. SpaCy's NER model, which I used, is optimised for these cases (looking at things like capitalisation of words).

• Edward Ross
nlp

## Active NER with Prodigy Teach

Active learning reduces the number of annotations you have to make by selecting for annotation the items that will have the biggest impact on model retraining. Active learning for NER is built into Prodigy, but I failed to use it to improve my job title recogniser. Having built a reasonable NER model for recognising job titles, I wanted to see if I could easily improve it with Prodigy's active learning.

• Edward Ross
python

## Python Inequality Chaining

In Python the comparison a <= b == c < d does the mathematically correct thing. This is a handy notational trick. This wasn't obvious to me because a lot of programming languages treat these associatively, so that a <= b < c may resolve to (a <= b) < c. This is very dangerous if booleans (True or False) are coerced to integers (1 or 0), because it may look like it works but give the wrong results.
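A quick demonstration of the chaining (the values are arbitrary):

```python
a, b, c, d = 1, 2, 2, 3

# The chained form is evaluated as (a <= b) and (b == c) and (c < d),
# with each middle expression evaluated only once
assert a <= b == c < d

# A chain fails as soon as any link is false
assert not (1 <= 2 == 3 < 4)  # because 2 == 3 is False

# The dangerous left-associative reading: the boolean coerces to an int
print((a <= b) < c)  # True, but only because True -> 1 and 1 < 2
```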

• Edward Ross
python

## Getting Started Debugging with pdb

When there's something unexpected happening in your Python code the first thing you want to do is to get more information about what's going wrong. While you can use print statements or logging it may take a lot of iterations of rerunning and editing your statements to capture the right information. You could use a REPL but sometimes it's challenging to capture all the state at the point of execution. The most powerful tool for this kind of problem is a debugger, and it's really easy to get started with Python's pdb.

• Edward Ross
python

## Second most common value with Pandas

I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignments out of order. However some things are really difficult to do with method chaining in Pandas; in particular getting the second most common value of each group. This is much easier to do in R's dplyr, with its consistent and flexible syntax, than it is with Pandas. Problem: for the table below, find the total frequency and the second most common value of y by frequency for each x (in the case of ties any second most common value will suffice).
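One method-chained sketch of the problem (the data is hypothetical and ties are broken arbitrarily; the article's own solution may differ):

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'a', 'b', 'b'],
                   'y': ['p', 'p', 'q', 'r', 's'],
                   'freq': [3, 2, 1, 4, 2]})

# Sum frequency per (x, y), sort so the most frequent y comes first in each
# group, then take the total and the second entry's y label
result = (df.groupby(['x', 'y'])['freq'].sum()
            .sort_values(ascending=False)
            .groupby('x')
            .agg(total='sum',
                 second=lambda s: s.index.get_level_values('y')[1]
                                  if len(s) > 1 else None))
```

The awkward part is exactly what the article points out: reaching into the group's index for the second row is clumsy compared to dplyr's `nth`.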

• Edward Ross