Automated Refactoring in Python
python

Automated Refactoring in Python

I am a very recent convert to automatic refactoring tools. I thought they were something for languages like Java that have a lot of boilerplate, and overkill for something like Python. I still liked the concept of refactoring, but I just moved the code around with Vim keymotions or sed. But then I came up against a giant Data Science codebase that was a wall of instructions like this: import pandas as pd import datetime df = pd.

  • Edward Ross
Writing Pandas Dataframes to S3
pandas

Writing Pandas Dataframes to S3

To write a Pandas (or Dask) dataframe to Amazon S3 or Google Cloud Storage, all you need to do is pass an S3 or GCS path to a serialisation function, e.g. # df is a pandas dataframe df.to_csv(f's3://{bucket}/{key}') Under the hood Pandas uses fsspec, which lets you work easily with remote filesystems, and abstracts over s3fs for Amazon S3 and gcsfs for Google Cloud Storage (and other backends such as (S)FTP, SSH or HDFS).
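A minimal sketch of the idea (it assumes s3fs is installed and credentials are available to fsspec; the bucket and key are placeholders):

```python
# Assumes s3fs is installed and AWS credentials are configured for fsspec;
# the bucket and key are placeholders.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.to_csv("s3://my-bucket/data/example.csv", index=False)

# Reading back goes through the same fsspec machinery
df2 = pd.read_csv("s3://my-bucket/data/example.csv")
```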

  • Edward Ross
Fast Pandas DataFrame to Dictionary
pandas

Fast Pandas DataFrame to Dictionary

Tabular data in Pandas is very flexible, but sometimes you just want a key-value store for fast lookups. Because Python is slow but Pandas and Numpy often have fast C implementations under the hood, the way you do something can have a large impact on its speed. The fastest way I've found to convert a dataframe to a dictionary from one column's keys to another column's values is: df.set_index(keys)[value].to_dict() The rest of this article will discuss how I used this to speed up a function by a factor of 20.
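A minimal sketch with toy column names:

```python
# Build a key -> value lookup from two dataframe columns.
# The column names 'key' and 'value' are placeholders.
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
lookup = df.set_index("key")["value"].to_dict()
assert lookup == {"a": 1, "b": 2, "c": 3}
```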

  • Edward Ross
Chompjs for parsing tricky Javascript Objects
python

Chompjs for parsing tricky Javascript Objects

Modern Javascript web frameworks often embed the data used to render each webpage in the HTML. This means an easy way of extracting data is capturing the string representation of the object with a pushdown automaton and then parsing it. Sometimes Python's json.loads won't cut it for dynamic JSON; one option is demjson but another much faster option is chompjs. Chompjs converts a javascript string into something that json.loads can handle. It's a little less strict than demjson; for example {"key": undefined} will be parsed by chompjs.
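A minimal sketch, assuming chompjs is installed (pip install chompjs):

```python
# chompjs parses Javascript-flavoured objects that json.loads rejects,
# such as undefined values and unquoted keys.
import chompjs

data = chompjs.parse_js_object('{"key": undefined, other: 3}')
print(data)
```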

  • Edward Ross
Aggregating Quantiles with Pandas
python

Aggregating Quantiles with Pandas

One of my favourite tools in Pandas is agg for aggregation (it's a worse version of dplyr's summarise). Unfortunately it can be difficult to work with for custom aggregates, like the nth largest value. If your aggregate is parameterised, like quantile, you potentially have to define a function for every parameter you use. A neat trick is to use a class to capture the parameters, making it much easier to try out variations.
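A minimal sketch of the class trick (the names and data are illustrative, not necessarily the post's exact code):

```python
# Capture the quantile level in __init__ so each instance is a ready-to-use aggregate.
import pandas as pd

class Quantile:
    def __init__(self, q):
        self.q = q
        self.__name__ = f"q{int(q * 100)}"  # used by agg to name the output column

    def __call__(self, series):
        return series.quantile(self.q)

df = pd.DataFrame({"g": ["a", "a", "b", "b"], "x": [1, 2, 3, 4]})
print(df.groupby("g")["x"].agg([Quantile(0.25), Quantile(0.75)]))
```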

  • Edward Ross
A Command Line Interface for HTML With parsel-cli
python

A Command Line Interface for HTML With parsel-cli

There are many great command line tools for searching and manipulating text (like grep), columnar data (like awk) and JSON data (like jq). With HTML there's parsel-cli, built on top of the wonderful parsel Python library. Parsel is a fantastic library that gives a simple and powerful interface for extracting data from HTML documents using CSS selectors, XPath and regular expressions. Parsel-cli is a very small utility that lets you use parsel from the command line (and can be installed with pip install parsel-cli).
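A minimal sketch of the underlying parsel library from Python (the sample HTML is a placeholder):

```python
from parsel import Selector

html = "<html><body><h1>Title</h1><a href='/about'>About</a></body></html>"
sel = Selector(text=html)
print(sel.css("h1::text").get())     # CSS selector
print(sel.xpath("//a/@href").get())  # XPath
```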

  • Edward Ross
Persistent Dictionaries in Python
python

Persistent Dictionaries in Python

Dictionaries in Python (called maps or hashmaps in other languages) are a useful and flexible data structure that can be used to solve lots of problems. Part of their charm is the affordances in the language for them; setting and accessing with square brackets [], deleting with del. But sometimes you want a dictionary that persists across sessions, or can handle more data than you can fit into memory - and there's a solution: persistent dictionaries.
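One option is a minimal sketch with the standard library's shelve module (the filename is a placeholder):

```python
# shelve gives a dict-like object backed by a file on disk.
import shelve

with shelve.open("cache.db") as db:
    db["answer"] = 42        # set with square brackets, just like a dict
    print(db.get("answer"))  # the value persists across sessions
```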

  • Edward Ross
Not Using Scrapy for Web Scraping
python

Not Using Scrapy for Web Scraping

Scrapy is a fast high-level web crawling and web scraping framework. But as much as I want to like it, I find it very constraining and there's a lot of added complexity and magic. If you don't fit the typical use case it feels like a lot more work and learning to do things with Scrapy than without it. I really like Zyte (formerly ScrapingHub), the team behind Scrapy. They really know what they're talking about, with great blogs about QA of data crawls, a guide to browser tools, and how bots are tracked, and Scrapy's documentation has a very useful page on selecting dynamically-loaded content.

  • Edward Ross
Extracting Links From HTML
python

Extracting Links From HTML

Sometimes you have a HTML webpage or email that you want to extract all the links from. There's lots of ways to do this, but there's a simple solution in Python with BeautifulSoup: from bs4 import BeautifulSoup def extract_links(html): soup = BeautifulSoup(html, 'html.parser') return [a.get('href') for a in soup.find_all('a') if a.get('href')] Some other methods would be to use regular expressions (which would be faster than parsing, but a little harder to get right), directly going through a parse tree or using lxml.
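The snippet above, reformatted so it can be run directly (the sample HTML is a placeholder):

```python
from bs4 import BeautifulSoup

def extract_links(html):
    soup = BeautifulSoup(html, 'html.parser')
    return [a.get('href') for a in soup.find_all('a') if a.get('href')]

print(extract_links('<p><a href="https://example.com">Example</a> and <a>no link</a></p>'))
```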

  • Edward Ross
Reading Email in Python with imap-tools
python

Reading Email in Python with imap-tools

You can use Python to read, process and manage your emails. While most email providers offer autoreplies and filter rules, you can do so much more with Python. You could download all your PDF bills from your electricity provider, you could parse structured data from emails (using e.g. BeautifulSoup), sort or filter by sentiment, or even do your own personal analytics like Stephen Wolfram. The easiest tool I've found for reading emails in Python is imap_tools.
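A minimal sketch with imap_tools (the server, credentials and limit are placeholders):

```python
# Connect over IMAP and print the subjects of the most recent messages.
from imap_tools import MailBox

with MailBox("imap.example.com").login("user@example.com", "password") as mailbox:
    for msg in mailbox.fetch(limit=10, reverse=True):
        print(msg.date, msg.subject)
```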

  • Edward Ross
Machine Learning Serving on Google CloudRun
python

Machine Learning Serving on Google CloudRun

I sometimes build hobby machine learning APIs that I want to show off, like whatcar.xyz. Ideally I want these to be cheap and low maintenance; I want them to be available most of the time, but I don't want to spend much time or money maintaining them, and I may have many of them running at the same time. My current solution is a cloud virtual machine (e.g. Digital Ocean or Linode), which has low fixed costs (around $5 per month).

  • Edward Ross
Jupyter Notebooks as Logs for Batch Processes
jupyter

Jupyter Notebooks as Logs for Batch Processes

When creating a batch process you typically add logging statements so that when something goes wrong you can more quickly debug the issue. Then when something goes wrong you either try to fix and rerun it, or otherwise run the process in a debugger to get more information. For many tasks Jupyter Notebooks are better for these kinds of batch processes. Jupyter Notebooks allow you to write your code sequentially as you usually would in a batch script; importing libraries, running functions and having assertions.

  • Edward Ross
Jupyter Notebook Preamble
jupyter

Jupyter Notebook Preamble

Whenever I use Jupyter Notebooks for analysis I tend to set a bunch of options at the top of every file to make them more pleasant to use. Here they are for Python and for R with IRKernel. Python: # Automatically reload code from dependencies when running cells # This is indispensable when importing code you are actively modifying. %load_ext autoreload %autoreload 2 # I almost always use pandas and numpy import pandas as pd import numpy as np # Set the maximum rows to display in a dataframe pd.
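The Python preamble from the excerpt, reformatted as a notebook cell; the max_rows value completing the truncated last line is a placeholder:

```python
# Automatically reload code from dependencies when running cells
%load_ext autoreload
%autoreload 2

# I almost always use pandas and numpy
import pandas as pd
import numpy as np

# Set the maximum rows to display in a dataframe (the limit is a placeholder)
pd.set_option("display.max_rows", 100)
```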

  • Edward Ross
Offline SQL Formatting with sqlformat
sql

Offline SQL Formatting with sqlformat

It's polite to format your SQL before you share it around. You want to be able to do it in context, and not upload your private SQL to some random website. The sqlformat command of the Python package sqlparse is a great tool for the job. You can install sqlformat on Debian derivatives such as Ubuntu with sudo apt install sqlformat. Alternatively, on any system with Python you can install it via pip install sqlparse, just make sure you have the binary in your path (e.
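A minimal sketch calling the same library directly from Python (the formatting options are illustrative):

```python
import sqlparse

sql = "select id, name from users where active = 1 order by name"
print(sqlparse.format(sql, reindent=True, keyword_case="upper"))
```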

  • Edward Ross
Flattening Nested Objects in Python
python

Flattening Nested Objects in Python

Sometimes I have a nested object of dictionaries and lists, frequently from a JSON object, that I need to deal with in Python. Often I want to load this into a Pandas dataframe, but accessing and mutating a dictionary column is a pain, with a whole bunch of expressions like .apply(lambda x: x[0]['a']['b']). A simple way to handle this is to flatten the objects before I put them into the dataframe, and then I can access them directly.
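A minimal sketch of flattening nested dicts and lists into a single-level dict; the key separator is an arbitrary choice, not necessarily the post's:

```python
def flatten(obj, prefix="", sep="_"):
    """Flatten nested dicts/lists into a single dict with joined keys."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    result = {}
    for key, value in items:
        new_key = f"{prefix}{sep}{key}" if prefix else str(key)
        result.update(flatten(value, new_key, sep))
    return result

assert flatten({"a": [{"b": 1}]}) == {"a_0_b": 1}
```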

  • Edward Ross
Extracting Fields from JSON with a Python DSL
python

Extracting Fields from JSON with a Python DSL

Indexing into nested objects of dictionaries and lists in Python is painful. I commonly come up against this when reading JSON objects, and often fields can be omitted. I haven't found an existing solution to this, so I've invented a tiny DSL to do it. It works like this: d = [{'a': [{'b': 'c'}, {'d': ['e']}]}] assert extract(d, '0.a.1.d.0') == d[0]['a'][1]['d'][0] assert extract(d, '1.a.1.d.0') == None You can specify a path into an object, separated by periods, and it will extract the value, returning None if that path doesn't exist.
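A minimal sketch of one way such an extract function could look (not necessarily the post's implementation), satisfying the assertions above:

```python
def extract(obj, path):
    """Follow a period-separated path, returning None if any step is missing."""
    for part in path.split("."):
        try:
            key = int(part) if isinstance(obj, list) else part
            obj = obj[key]
        except (KeyError, IndexError, TypeError, ValueError):
            return None
    return obj

d = [{'a': [{'b': 'c'}, {'d': ['e']}]}]
assert extract(d, '0.a.1.d.0') == d[0]['a'][1]['d'][0]
assert extract(d, '1.a.1.d.0') == None
```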

  • Edward Ross
Getting Started with nbdev
nbdev

Getting Started with nbdev

Nbdev is a tool that makes it possible to develop Python libraries in Jupyter notebooks. At first I found this idea scary, but after watching the talk "I Like Notebooks" and seeing how it works I think it's got the best of all worlds. It lets you put code, documentation, examples and tests all together in context and provides tooling to extract the code into an installable library, run the tests and produce great hyperlinked documentation.

  • Edward Ross
Fixing repr errors in Jupyter Notebooks
python

Fixing repr errors in Jupyter Notebooks

When running the Kaggle API method dataset_list_files in a Jupyter notebook I got an error about __repr__ returning a non-string. At first I thought the function was broken, but then I realised it was just how it was displaying in Jupyter that was breaking because the issues were all in IPython: --------------------------------------------------------------------------- TypeError Traceback (most recent call last) IPython/core/formatters.py in __call__(self, obj) 700 type_pprinters=self.type_printers, 701 deferred_pprinters=self.deferred_printers) --> 702 printer.pretty(obj) 703 printer.

  • Edward Ross
Pip Can Now Resolve Dependencies
python

Pip Can Now Resolve Dependencies

Something that has always bothered me about pip in Python is that you would get errors about inconsistent packages. Things still seemed to work surprisingly often, but it meant that the order you installed packages in could lead to very different results (and one ordering may cause your tests to fail while another succeeds). Now there is a new resolver in pip 20.3 that checks the dependencies and tries to find versions that meet all constraints.

  • Edward Ross
Why use Tox for Python Libraries
python

Why use Tox for Python Libraries

I have been surprised how hard it is to maintain an internal library in Python. There are constantly issues for end users where something doesn't work. It turns out one feature used was introduced in Python 3.8, but someone was stuck on Python 3.6. Changes to Pandas and PyArrow meant some combinations of those libraries broke. It's really hard to build confidence in your system when lots of people end up with breakages.

  • Edward Ross
Managing Python Versions with asdf
programming

Managing Python Versions with asdf

I was recently trying to run a pipenv script, but it gave an error that it required Python 3.7, which wasn't installed. Unfortunately I was on Ubuntu 20.04, which has Python 3.8 as default and no access to earlier versions in the repositories. However pipenv gave a useful hint: pyenv and asdf not found. The asdf tool allows you to configure multiple versions of applications in common interactive shells (Bash, Zsh, and Fish).

  • Edward Ross
Composing Functions
programming

Composing Functions

R core looks like it's getting a new pipe operator |> for composing functions. It's just like the existing magrittr pipe %>%, but has been implemented as a syntax transformation so that it is more computationally efficient and produces better stack traces. The pipe means instead of writing f(g(h(x))) you can write x |> h |> g |> f, which can be really handy when transforming dataframes. Python's Pandas library doesn't have this kind of convenience, which opens up a class of errors that won't happen in the equivalent R code.

  • Edward Ross
Chaining with Pandas Pipe function
python

Chaining with Pandas Pipe function

I often use method chaining in pandas, although certain problems like calculating the second most common value are hard. A really good solution for adding custom functionality in a chain is Pandas' pipe function. For example, to raise a column to the 3rd power with numpy you could use np.power(df['x'], 3) But another way with pipe is: df['x'].pipe(np.power, 3) Note that you can pass any positional or keyword arguments and they'll get passed along.
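A minimal sketch showing pipe inside a chain (the data and cutoff are toy values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

result = (
    df["x"]
    .pipe(np.power, 3)       # same as np.power(df['x'], 3)
    .loc[lambda s: s > 10]   # and you can keep chaining on the result
)
print(result)
```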

  • Edward Ross
Type Checking Beautiful Soup
python

Type Checking Beautiful Soup

Static type checking in Python can quickly verify whether your code is open to certain bugs. But it only works if it knows the types of external libraries. I've already introduced how to add type stubs for libraries without type annotations. But what if we have a complex library like BeautifulSoup that uses a lot of recursion, magic methods and operates on unknown data? With some small changes to your code you can make it typecheck with BeautifulSoup.
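A minimal sketch of one such small change (not necessarily the post's exact approach): narrow the union types BeautifulSoup returns with isinstance checks.

```python
from typing import Optional
from bs4 import BeautifulSoup
from bs4.element import Tag

def first_link(html: str) -> Optional[str]:
    soup = BeautifulSoup(html, "html.parser")
    a = soup.find("a")
    if isinstance(a, Tag):           # find() may return a Tag, a NavigableString or None
        href = a.get("href")
        if isinstance(href, str):    # get() may return a list for multi-valued attributes
            return href
    return None
```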

  • Edward Ross
Typechecking with a Python Library That Has No Type Hints
python

Typechecking with a Python Library That Has No Type Hints

Type hints in Python allow statically verifying the code is correct, with tools like mypy, efficiently eliminating a whole class of bugs. However sometimes you get the message found module but no type hints or library stubs, because that library doesn't have any type information. It's easy to work around this by adding type stubs. When you see this error it's worth first checking that there aren't any types already available.
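A minimal sketch of a stub file; the library and function names here are made up for illustration:

```python
# Contents of a hypothetical stub, e.g. stubs/somelib/__init__.pyi;
# somelib and fetch_data are placeholder names.
def fetch_data(url: str, timeout: float = ...) -> dict: ...
```

Pointing mypy at the stubs directory (for example via the MYPYPATH environment variable or the mypy_path config option) makes the "no type hints or library stubs" error go away.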

  • Edward Ross
Setting the Icon in Jupyter Notebooks
jupyter

Setting the Icon in Jupyter Notebooks

I often have way too many Jupyter notebook tabs open, and I have to distinguish them by the first couple of letters of the notebook name in front of the orange Jupyter book icon. What if we could change the icons to visually distinguish different notebooks? I thought I found a really easy way to set the icon in Jupyter notebooks... but it works in Firefox and not Chrome. I'll go through both the easy solution, which only works in some browsers, and the harder solution.

  • Edward Ross
Retrying Python Requests
python

Retrying Python Requests

The computer networks that make up the internet are complex and handle an immense amount of traffic. So sometimes when you make a request it will fail intermittently, and you want to retry until it succeeds. This is easy in requests using urllib3's Retry. I was trying to download data from Common Crawl's S3 exports, but occasionally the process would fail due to a network or server error. My process would keep the successful downloads using an AtomicFileWriter, but I'd have to restart the process.
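A minimal sketch (the retry counts, status codes and URL are illustrative choices):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

response = session.get("https://example.com/data")
```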

  • Edward Ross
Decorating Pandas Tables
python

Decorating Pandas Tables

When looking at Pandas dataframes in a Jupyter notebook it can be hard to find what you're looking for in a big mess of numbers. Something that can help is formatting the numbers, making them shorter and using graphics to highlight points of interest. Using Pandas style you can make the story of your dataframe stand out in a Jupyter notebook, and even export the styling to Excel. The Pandas style documentation gives pretty clear examples of how to use it.
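A minimal sketch (the format string and colormap are illustrative; background_gradient needs matplotlib installed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3) * 1000, columns=["a", "b", "c"])

styled = (
    df.style
    .format("{:,.0f}")                  # shorter numbers
    .background_gradient(cmap="Blues")  # highlight larger values
)
styled  # renders in a Jupyter notebook; styled.to_excel("out.xlsx") exports the styling
```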

  • Edward Ross
Building a Job Extraction Pipeline
jobs

Building a Job Extraction Pipeline

I've been trying to extract job ads from Common Crawl. However I was stuck for some time on how to actually write transforms for all the different data sources. I've finally come up with an architecture that works: download, extract and normalise. I need a way to extract the job ads from heterogeneous sources that allows me to extract different kinds of data, such as the title, location and salary. I got stuck in code for a long time trying to do all this together and getting a bit confused about how to make changes.

  • Edward Ross
Updating a Python Project: Whatcar
whatcar

Updating a Python Project: Whatcar

The hardest part of programming isn't learning the language itself, it's getting familiar with the gotchas of the ecosystem. I recently updated my whatcar car classifier in Python after leaving it for a year and hit a few roadblocks along the way. Because I'm familiar with Python I knew enough heuristics to work through them quickly, but it takes experience with running into problems to get there. I thought I had done a good job of making it reproducible by creating a Dockerfile for it.

  • Edward Ross
R: Keeping Up With Python
r

R: Keeping Up With Python

About 5 years ago a colleague told me that the days were numbered for R and Python had won. From his perspective he is probably right; in software engineering companies Python has got increasing adoption in programmatic analytics. However R has its own set of unique strengths which make it more appealing for the stats people, and it has kept up surprisingly well with Python. Python has a wider audience than R, and lives up to its reputation as "not the best language for anything but the second best language for everything".

  • Edward Ross
From Multiprocessing to Concurrent Futures in Python
python

From Multiprocessing to Concurrent Futures in Python

Waiting for independent I/O can be a performance bottleneck. This can be things like downloading files, making API calls or running SQL queries. I've already talked about how to speed this up with multiprocessing. However it's easy to move to the more recent concurrent.futures library which allows running on threads as well as processes, and allows handling more complicated asynchronous flows. From the previous post suppose we have this multiprocessing code:
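A minimal sketch of the concurrent.futures version (the URLs and worker count are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def download(url):
    return requests.get(url).status_code

urls = ["https://example.com", "https://example.org"]

# Threads suit I/O-bound work; swap in ProcessPoolExecutor for CPU-bound work.
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(download, urls))
print(results)
```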

  • Edward Ross
Python HTML Parser
python

Python HTML Parser

A lot of information is embedded in HTML pages, which contain both human text and markup. If you ever want to extract this information, don't use regex, use a parser. Python has an inbuilt html.parser library to do just that. The excellent html2text library uses it to parse HTML into markdown, which you can use for removing formatting. However for your own purposes you can use a similar approach to build a custom parser by subclassing HTMLParser.
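A minimal sketch of subclassing HTMLParser to collect a page's text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

parser = TextExtractor()
parser.feed("<p>Hello <b>world</b></p>")
print(parser.text())  # Hello world
```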

  • Edward Ross
Maybe Monad in Python
python

Maybe Monad in Python

A monad in languages like Haskell is used as a particular way to extend a function beyond its original domain. You can think of them as a generalised form of function composition; they are a way of taking one type of function and getting another function. A very useful case is the maybe monad, used for dealing with missing data. Suppose you've got some useful function that parses a date: parse_date('2020-08-22') == datetime(2020,8,22).
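A minimal sketch of the idea (a maybe-style wrapper; parse_date here is a toy stand-in for the function in the excerpt):

```python
from datetime import datetime

def maybe(f):
    """Wrap f so that None inputs pass through as None instead of raising."""
    def wrapped(x, *args, **kwargs):
        if x is None:
            return None
        return f(x, *args, **kwargs)
    return wrapped

def parse_date(s):
    return datetime.strptime(s, "%Y-%m-%d")

assert maybe(parse_date)('2020-08-22') == datetime(2020, 8, 22)
assert maybe(parse_date)(None) is None
```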

  • Edward Ross
Python is not a Functional Programming Language
python

Python is not a Functional Programming Language

Python is a very versatile multiparadigm language with a great ecosystem of libraries. However it is not a functional programming language, as I know some people have described it. While you can write it in a functional style, this goes against common practice and has some practical issues. There is no fundamental definition of a functional programming language, but two core concepts are immutable data and higher order functions.

  • Edward Ross
Test Driven Salary Extraction
python

Test Driven Salary Extraction

Even when there's a specific field for a price there's a surprising number of ways people write it. This is what the tool price-parser solves. Unfortunately it doesn't work too well on salaries, which tend to be ranges and much higher, but the approach works. Price parser has a very large set of tests covering different ways people write prices. The solution is a simple process involving a basic regular expression, but it solves all these different cases.
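A minimal sketch of the test-driven shape of the approach; the cases, regex and function are illustrative, not price-parser's or the post's implementation:

```python
import re
import pytest

def parse_salary(text):
    """Toy extractor: take the largest number mentioned, or None if there isn't one."""
    amounts = [float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*(?:\.\d+)?", text)]
    return max(amounts) if amounts else None

@pytest.mark.parametrize("text,expected", [
    ("30000 - 40000 per annum", 40000),
    ("£25,000", 25000),
    ("Competitive", None),
])
def test_parse_salary(text, expected):
    assert parse_salary(text) == expected
```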

  • Edward Ross
Finding Australian Locations with Placeholder
python

Finding Australian Locations with Placeholder

People write locations in many different ways. This makes them really hard to analyse, so we need a way to normalise them. I've already discussed how Placeholder is useful for coarse geocoding. Now I'm trying to apply it to normalising locations from Australian Job Ads in Common Crawl. The best practices when using Placeholder are: Go from the most specific location information (e.g. street address) to the most general (e.

  • Edward Ross
Converting HTML to Text
python

Converting HTML to Text

I've been thinking about how to convert HTML to text for NLP. We want to at least extract the text, but if we can preserve some of the formatting it can make it easier to extract information down the line. Unfortunately it's a little tricky to get the segmentation right. The standard answers on Stack Overflow are to use Beautiful Soup's getText method. Unfortunately this just replaces every tag boundary with the separator argument, whether the tag is block level or inline.
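A minimal sketch of the problem (the sample HTML is a placeholder):

```python
from bs4 import BeautifulSoup

html = "<p>An <b>inline</b> tag</p><p>A new paragraph</p>"
soup = BeautifulSoup(html, "html.parser")
# Every tag boundary becomes the separator, so the inline <b> is split
# onto its own line just like the block-level <p>.
print(soup.get_text("\n"))
```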

  • Edward Ross
How to turn off LaTeX in Jupyter
jupyter

How to turn off LaTeX in Jupyter

When showing money in Jupyter notebooks the dollar signs can disappear and turn into LaTeX through MathJax. This is annoying if you really want to print monetary amounts and not typeset mathematical equations. However this is easy to fix in Pandas dataframes, Markdown or HTML output. For Pandas dataframes this is especially annoying because it's much more likely you would want to be showing $ signs than displaying math. Thankfully it's easy to fix by setting the display option pd.
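A minimal sketch for the dataframe case, assuming the relevant option is display.html.use_mathjax:

```python
import pandas as pd

# Stop Jupyter/MathJax from treating $ in dataframe output as math delimiters
pd.set_option("display.html.use_mathjax", False)
```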

  • Edward Ross
Double emphasis error in html2text
python

Double emphasis error in html2text

I'm trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. I've already resolved an issue with multiple types of emphasis. However HTML in the wild has all sort of weird edge cases that the library has trouble with. In this case I found a term that was emphasised twice: <strong><strong>word</strong></strong>. I'm pretty sure for a browser this is just the same as doing it once; <strong>word</strong>.

  • Edward Ross
An edge bug in html2text
python

An edge bug in html2text

I've been trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. But I quickly hit an edge case where it fails, because parsing HTML is surprisingly difficult. I was parsing some HTML that looked like this: Some text.<br /><i><b>Title</b></i><br />... When I ran html2text it produced an output like this:

  • Edward Ross
Writing Blog Posts with Jupyter and Hugo
writing

Writing Blog Posts with Jupyter and Hugo

It can be convenient to directly publish a mixture of prose, source code and graphs. It ensures the published code actually runs and makes it much easier to rerun at a later point. I've done this before in Hugo with R Blogdown, and now I'm experimenting with Jupyter notebooks. The best available option seems to be nb2hugo, which converts the notebook to markdown, keeping the front matter and exporting the images.

  • Edward Ross
Raising Exceptions in Python Futures
python

Raising Exceptions in Python Futures

Python concurrent.futures are a handy way of dealing with asynchronous execution. However if you're not careful they will swallow your exceptions, leading to difficult-to-debug errors. While you can perform concurrent downloads with multiprocessing, it means starting up multiple processes and sending data between them as pickles. One problem with this is that you can't pickle some kinds of objects and often have to refactor your code to use multiprocessing.
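A minimal sketch of the gotcha: an exception raised in a worker only surfaces when you call .result() on the future.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def risky(x):
    if x == 2:
        raise ValueError("bad input")
    return x * x

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(risky, x) for x in range(4)]
    for future in as_completed(futures):
        result = future.result()  # re-raises any exception from the worker
```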

  • Edward Ross
pyBART: Better Dependencies for Information Extraction
python

pyBART: Better Dependencies for Information Extraction

Dependency trees are a remarkably powerful tool for information extraction. Neural based taggers are very good and Universal Dependencies means the approach can be used for almost any language (although the rules are language specific). However syntax can get really strange requiring increasingly complex rules to extract information. The pyBART system solves this by rewriting the rules to be half a step closer to semantics than syntax. I've seen that dependency based rules are useful for extracting skills from noun phrases and adpositions.

  • Edward Ross
Demjson for parsing tricky Javascript Objects
python

Demjson for parsing tricky Javascript Objects

Modern Javascript web frameworks often embed the data used to render each webpage in the HTML. This means an easy way of extracting data is capturing the string representation of the object with a pushdown automaton and then parsing it. Python's inbuilt json.loads is effective but won't handle very dynamic Javascript; demjson will (another, much faster alternative is Chompjs). The problem shows up when using json.loads as the following obscure error:
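A minimal sketch, assuming demjson is installed (exactly how lenient it is depends on the options you pass):

```python
import demjson

# json.loads would reject the unquoted key and single quotes; demjson accepts them.
data = demjson.decode("{key: 'value'}")
print(data)
```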

  • Edward Ross
Tips for Extracting Data with Beautiful Soup
python

Tips for Extracting Data with Beautiful Soup

Beautiful Soup can be a useful library for extracting information from HTML. Unfortunately there are a lot of little issues I hit working with it to extract data from a careers webpage using Common Crawl. The library is still useful enough to work with, but the issues make me want to look at alternatives like lxml (via html5-parser). The source data can be obtained at the end of the article. Use a good HTML parser: Python has an inbuilt html.

  • Edward Ross
Only write file on success
python

Only write file on success

When writing data pipelines it can be useful to cache intermediate results to recover more quickly from failures. However if a corrupt or incomplete file was written then you could end up caching that broken file. The solution is simple; only write the file on success. A strategy for this is to write to some temporary file, and then move the temporary file on completion. I've wrapped this in a Python context manager called AtomicFileWriter which can be used in a with statement in place of open:
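A minimal sketch of the idea behind such a context manager (not necessarily AtomicFileWriter's exact implementation):

```python
import os
from contextlib import contextmanager

@contextmanager
def atomic_writer(path, mode="w"):
    """Write to a temporary file and only move it into place if no exception occurs."""
    tmp_path = path + ".tmp"
    with open(tmp_path, mode) as f:
        yield f
    os.replace(tmp_path, path)  # only reached on success

with atomic_writer("output.csv") as f:
    f.write("a,b,c\n")
```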

  • Edward Ross
Adding Types to Rdflib
python

Adding Types to Rdflib

I've been using RDFLib to parse job posts extracted from Common Crawl. RDF Literals: it automatically parses XML Schema Datatypes into Python data structures, but doesn't handle the <http://schema.org/Date> datatype that commonly occurs in JSON-LD. It's easy to add with the rdflib.term.bind command, but this kind of global binding could lead to problems. When RDFLib parses a literal it will create a rdflib.term.Literal object, and the value field will contain the Python type if it can be successfully converted, otherwise it will be None.

  • Edward Ross
Converting RDF to Dictionary
python

Converting RDF to Dictionary

The Web Data Commons has a vast repository of structured RDF Data about local businesses, hostels, job postings, products and many other things from the internet. Unfortunately it's not in a format that's easy to do analysis on. We can stream the nquad format to get RDFlib Graphs, but we still need to convert the data into a form we can do analysis on. We'll do this by turning the relations into dictionaries of properties to the list of objects they contain.
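A minimal sketch of that conversion for a single subject (the helper name is mine, not necessarily the post's):

```python
from collections import defaultdict

def subject_to_dict(graph, subject):
    """Map each predicate of a subject to the list of objects it points to."""
    result = defaultdict(list)
    for predicate, obj in graph.predicate_objects(subject):
        result[predicate].append(obj)
    return dict(result)
```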

  • Edward Ross
Streaming n-quads as RDF
data

Streaming n-quads as RDF

The Web Data Commons extracts structured RDF data from about one monthly Common Crawl per year. These contain a vast amount of structured information about local businesses, hostels, job postings, products and many other things from the internet. Python's RDFLib can read the n-quad format the data is stored in, but by default it requires reading all of the millions to billions of relations into memory. However it's possible to process this data in a streaming fashion, allowing it to be processed much faster.

  • Edward Ross
Parsing Escaped Strings
python

Parsing Escaped Strings

Sometimes you may have to parse a string with backslash escapes; for example "this is a \"string\"". This is quite straightforward to parse with a state machine. The idea of a state machine is that the action we need to take will change depending on what we have already consumed. This can be used for proper regular expressions (without special things like lookahead), and the ANTLR4 parser generator can maintain a stack of "modes" that can be used similarly.
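A minimal sketch of such a state machine (a toy version, not necessarily the post's code): the state is simply whether the previous character was a backslash.

```python
def parse_quoted_string(text, start=0):
    """Parse a double-quoted string with backslash escapes starting at text[start]."""
    assert text[start] == '"'
    chars = []
    escaped = False
    for i in range(start + 1, len(text)):
        c = text[i]
        if escaped:
            chars.append(c)       # take the escaped character literally
            escaped = False
        elif c == "\\":
            escaped = True        # switch state: next character is escaped
        elif c == '"':
            return "".join(chars)  # closing quote ends the string
        else:
            chars.append(c)
    raise ValueError("unterminated string")

assert parse_quoted_string(r'"this is a \"string\""') == 'this is a "string"'
```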

  • Edward Ross
Serving Static Assets with Python Simple Server
web

Serving Static Assets with Python Simple Server

I was trying to load a local file in an HTML page and got a Cross-Origin Request Blocked error in my browser. The solution was to start a Python web server with python3 -m http.server. I had a JSON file I wanted to load into Javascript in an HTML page. Looking at Stack Overflow I found fetch could do this: fetch("test.json") .then(response => response.json()) .then(json => process(json)) Where process is some function that acts on the data; console.

  • Edward Ross
Minhash Sets
jobs

Minhash Sets

We've found pairs of near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using Minhash. But many pairs will be part of the same group; in an extreme case there could be a group of 5 job ads with identical texts, which produces 10 pairs. Both for interpretability and usability it makes sense to extract these groups from the pairs. Extracting the groups directly with union-find: each band of the LSH consists of buckets of items that may be similar; you could view the buckets as a partition of the corpus of all documents.
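A minimal sketch of merging pairs into groups with union-find (a toy version of the technique, not necessarily the post's code):

```python
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def group_pairs(pairs):
    """Merge overlapping pairs into connected groups."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for x in parent:
        groups.setdefault(find(parent, x), []).append(x)
    return list(groups.values())

assert sorted(map(sorted, group_pairs([(1, 2), (2, 3), (4, 5)]))) == [[1, 2, 3], [4, 5]]
```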

  • Edward Ross
Searching for Near Duplicates with Minhash
nlp

Searching for Near Duplicates with Minhash

I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. In the last article I built a collection of MinHashes of the 400,000 job ads in half an hour in a 200MB file. Now I need to efficiently search through these minhashes to find the near duplicates, because a brute force search through them would take a couple of days on my laptop. MinHash was designed to approach this problem, as outlined in the original paper.

  • Edward Ross
Detecting Near Duplicates with Minhash
nlp

Detecting Near Duplicates with Minhash

I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. I've found that the Jaccard index on n-grams is effective for finding these. Unfortunately it would take about 8 days to calculate the Jaccard index on all pairs of the 400,000 ads, and about 640GB of memory to store it. While this is tractable, we can find almost all pairs with a significant overlap in half an hour in-memory using MinHash.
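A minimal sketch of MinHash on word 3-grams using the datasketch library (an assumption; the post may implement MinHash differently):

```python
from datasketch import MinHash

def ngrams(text, n=3):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for gram in ngrams(text):
        m.update(gram.encode("utf8"))
    return m

a = minhash("senior data scientist role in a growing team in London")
b = minhash("data scientist role in a growing team based in London")
print(a.jaccard(b))  # approximate Jaccard similarity of the two ads
```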

  • Edward Ross
Near Duplicates with TF-IDF and Jaccard
nlp

Near Duplicates with TF-IDF and Jaccard

I've looked at finding near duplicate job ads using the Jaccard index on n-grams. I wanted to see whether using the TF-IDF to weight the ads would result in a clearer separation. It works, but the results aren't much better, and there are some complications in using it in practice. When trying to find similar ads with the Jaccard index we looked at the proportion of n-grams they have in common relative to all the n-grams between them.

  • Edward Ross
Near Duplicates with Jaccard
nlp

Near Duplicates with Jaccard

Finding near-duplicate texts is a hard problem, but the Jaccard index for n-grams is an effective measure that's efficient on small sets. I've tried it on the Adzuna Job Salary Predictions Kaggle Competition with good success. This works pretty well at finding near-duplicates and even ads from the same company; although by itself it can't detect duplicates. I've looked before at using the edit distance, which looks for the minimum number of changes to transform one text to another, but it's slow to calculate.
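A minimal sketch of the Jaccard index on word 3-grams:

```python
def ngrams(text, n=3):
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b, n=3):
    """Proportion of n-grams the two texts share, out of all their n-grams."""
    x, y = ngrams(a, n), ngrams(b, n)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)
```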

  • Edward Ross
Edit Distance
nlp

Edit Distance

Edit distance, also known as Levenshtein distance, is a useful way of measuring the similarity of two sequences. It counts the minimum number of substitutions, insertions and deletions you need to transform one sequence into another. I had a look at using this to compare duplicate ads, with reasonable results, but it's a little slow to run on many ads. I've previously looked at finding ads with exactly the same text in the Adzuna Job Salary Predictions Kaggle Competition, but there are a lot of ads that are slight variations.
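A minimal sketch of the standard dynamic programming formulation:

```python
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

assert edit_distance("kitten", "sitting") == 3
```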

  • Edward Ross
Finding Exact Duplicate Text
python

Finding Exact Duplicate Text

Finding exact duplicate texts is quite straightforward and fast in Python. This can be useful for removing duplicate entries in a dataset. I tried this on the Adzuna Job Salary Predictions Kaggle Competition job ad texts and found it worked well. Naively finding exact duplicates by comparing every pair would be O(N^2), but if we sort the input, which is O(N log(N)), then duplicate items are adjacent. This scales really well to big datasets, and then the duplicate entries can be handled efficiently with itertools groupby to do something like uniq.
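A minimal sketch of the sort-then-group approach (the texts are placeholders):

```python
from itertools import groupby

texts = ["a job ad", "another ad", "a job ad"]

# Sorting puts identical texts next to each other, so groupby can collect them.
groups = [list(g) for _, g in groupby(sorted(texts))]
duplicates = [g for g in groups if len(g) > 1]
print(duplicates)  # [['a job ad', 'a job ad']]
```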

  • Edward Ross
Showing Side-by-Side Diffs in Jupyter
python

Showing Side-by-Side Diffs in Jupyter

When comparing two texts it's useful to have a side-by-side comparison highlighting the differences. This is straightforward using HTML in Jupyter Notebooks with Python, and the inbuilt DiffLib. I used this to display job ads duplicated between different sites. For a long document it's important to align the sentences (otherwise it's hard to compare the differences), and highlight the individual differences at a word level. Overall the problems are breaking up a text into sentences and words, aligning the sentences, finding word level differences and displaying them side-by-side.
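One simple way to get a side-by-side view with the inbuilt difflib (a sketch, not necessarily the post's word-level approach):

```python
import difflib
from IPython.display import HTML, display

a = ["The quick brown fox", "jumps over the dog"]
b = ["The quick brown fox", "leaps over the lazy dog"]

# HtmlDiff renders a two-column table highlighting the differences
display(HTML(difflib.HtmlDiff().make_table(a, b)))
```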

  • Edward Ross
Creating a Diff Recipe in Prodigy
nlp

Creating a Diff Recipe in Prodigy

I created a simple custom recipe to show diffs between two texts in Prodigy. I intend to use this to annotate near-duplicates. The process was pretty easy, but I got tripped up a little. I've been extracting job titles and skills from the job ads in the Adzuna Job Salary Predictions Kaggle Competition. One thing I noticed is there are a lot of job ads that are almost exactly the same; sometimes between the train and test set, which is a data leak.

  • Edward Ross
Counting n-grams with Python and with Pandas
python

Counting n-grams with Python and with Pandas

Sequences of words are useful for characterising text and for understanding text. If two texts have many similar sequences of 6 or 7 words it's very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like "New York" together rather than treating them as the separate words "New" and "York". To do this we need a way of extracting and counting sequences of words.
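A minimal sketch of counting word bigrams in plain Python and with Pandas (the sample text is a placeholder):

```python
from collections import Counter
import pandas as pd

tokens = "new york is bigger than new york state".split()
bigrams = list(zip(tokens, tokens[1:]))

print(Counter(bigrams).most_common(3))            # plain Python
print(pd.Series(bigrams).value_counts().head(3))  # Pandas
```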

  • Edward Ross
Not using NER for extracting Job Titles
nlp

Not using NER for extracting Job Titles

I've been trying to use a Named Entity Recogniser (NER) to extract job titles from the titles of job ads to better understand a collection of job ads. While NER is great, it's not the right tool for this job, and I'm going to switch to a counting based approach. NER models try to extract things like the names of people, places or products. SpaCy's NER model, which I used, is optimised for these cases (looking at things like capitalisation of words).

  • Edward Ross
Getting Started Debugging with pdb
python

Getting Started Debugging with pdb

When there's something unexpected happening in your Python code the first thing you want to do is to get more information about what's going wrong. While you can use print statements or logging it may take a lot of iterations of rerunning and editing your statements to capture the right information. You could use a REPL but sometimes it's challenging to capture all the state at the point of execution. The most powerful tool for this kind of problem is a debugger, and it's really easy to get started with Python's pdb.
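A minimal sketch of dropping into the debugger at the point of interest (the function is a placeholder):

```python
def process(items):
    total = 0
    for item in items:
        breakpoint()  # pauses here in pdb with access to all local state (Python 3.7+)
        total += item
    return total

# Alternatively, run a whole script under the debugger: python -m pdb myscript.py
```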

  • Edward Ross
Second most common value with Pandas
python

Second most common value with Pandas

I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignments out of order. However some things are really difficult to do with method chaining in Pandas; in particular getting the second most common value of each group. This is much easier to do in R's dplyr, with its consistent and flexible syntax, than it is with Pandas. Problem: for the table below, find the total frequency and the second most common value of y by frequency for each x (in the case of ties any second most common value will suffice).
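A minimal non-chained sketch of the problem for reference (toy data; the post develops a chained solution instead):

```python
import pandas as pd

df = pd.DataFrame({
    "x": ["a", "a", "a", "b", "b"],
    "y": ["p", "p", "q", "r", "s"],
    "freq": [3, 2, 4, 1, 2],
})

def summarise(group):
    by_y = group.groupby("y")["freq"].sum().sort_values(ascending=False)
    return pd.Series({
        "total": by_y.sum(),
        "second": by_y.index[1] if len(by_y) > 1 else None,
    })

print(df.groupby("x").apply(summarise))
```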

  • Edward Ross