skeptric - Skeptric

Exporting Nomic’s Mixture of Experts model to ONNX

onnx

Sparse mixture of experts models can contain the parameters and…

Dataset of 2024 Victorian Council Election Candidates

data

In local elections what the candidates campaign on tells us a lot about what matters to them, and by extension to their communities. To understand what is…

Measuring a Language Model

nlp

makemore

Implementing Lempel-Ziv 77 Paper in Python

entropy

Lossless compression is the process of taking some data and squishing it into a smaller space without losing any information. It is…

Lossless Compression with Huffman Codes

entropy

Lossless compression algorithms are almost magical; you can take some data source, squish it down into a smaller space, and then restore it back to its full size later. They…

Stanford AI Professional Program Review

In 2023 I completed the Stanford AI Professional Program to deepen my understanding of Artificial Intelligence, especially with natural language. The courses I took were great and…

Mamba Language Model Inference

nlp

llm

Mamba is an alternative kind of neural network model to Transformers inspired by State Space Models. It can train as efficiently as…

A dataset of P. G. Wodehouse books

gutenberg

We are going to create a text dataset from the books of P. G. Wodehouse for language modelling. I enjoy his style of writing…

Roman Numerals with Python and Regular Expressions

python

We’re going to convert integers to roman numerals in standard form and back again in Python, as well as detect them with regular expressions.
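
As a rough sketch of the idea (not the article’s own code), the conversion in both directions and a detection pattern could look like the following. The names `ROMAN_PAIRS`, `ROMAN_RE`, `int_to_roman` and `roman_to_int` are hypothetical, and "standard form" is assumed here to mean the subtractive notation covering 1 to 3999.

```python
import re

# Value/symbol pairs in descending order, including the subtractive forms.
ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

# Assumed definition of standard form (1 to 3999).
# The lookahead rejects the empty string, which the rest would otherwise match.
ROMAN_RE = re.compile(
    r"(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def int_to_roman(n: int) -> str:
    """Greedily take the largest symbol that still fits."""
    if not 0 < n < 4000:
        raise ValueError("standard form only covers 1 to 3999")
    parts = []
    for value, symbol in ROMAN_PAIRS:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def roman_to_int(s: str) -> int:
    """Validate with the regex, then consume symbols from left to right."""
    if not ROMAN_RE.fullmatch(s):
        raise ValueError(f"not a standard-form roman numeral: {s!r}")
    total, i = 0, 0
    for value, symbol in ROMAN_PAIRS:
        while s.startswith(symbol, i):
            total += value
            i += len(symbol)
    return total


print(int_to_roman(1994))       # MCMXCIV
print(roman_to_int("MCMXCIV"))  # 1994
```

Validating before parsing keeps `roman_to_int` simple: once the string is known to be in standard form, the greedy left-to-right scan cannot misread it.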

Downloading books from Project Gutenberg with Python

nlp

gutenberg

Project Gutenberg is a…

Makemore Subreddits - Part 4 Backprop

nlp

makemore

Let’s manually backpropagate through a Multi-Layer Perceptron language model for subreddit names.

Makemore Subreddits - Part 3 Activations and Gradients

nlp

makemore

Let’s dive deep into the Activations and Gradients in a Multi-Layer Perceptron language model for subreddit names.

Building a Deep Learning Computer for AU$2000

computing

nlp

Building a reliable Deep Learning workstation for around AU$2,000 is achievable even for beginners. In my case, I constructed a PC with a used RTX 3090, a 6-core 3.9GHz CPU…

Starting AWS EC2 Compute Instances from the Command Line

aws

The ability to rent compute resources on demand is incredibly useful as a data scientist. There are often batch processing jobs, such as training a machine learning model…

Makemore Subreddits - Part 2 Multilayer Perceptron

nlp

makemore

Let’s create a Multi-Layer Perceptron language model for subreddit names.

Makemore Subreddits - Part 1 Bigram Model

nlp

makemore

Let’s create a bigram character level language model for subreddit names. This comes at a challenging time when Reddit is changing the terms of its API, making most third party apps infeasible (and data extracts like the one this analysis relies on prohibitively expensive). However the Reddit community has been an interesting place…

Centroid Spherical Polygon

maths

You’re organising a conference of operations research analysts from all over the world, but their…

Thumbs Up? Sentiment Classification Like it’s 2002

nlp

sentiment

In July 2002 Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan published Thumbs up? Sentiment Classification using Machine Learning Techniques at EMNLP, one of the earliest works using machine learning for Sentiment…

Learning Natural Language Processing through Sentiment Classification

nlp

sentiment

There is currently a lot of hype in Natural Language Processing, much of it driven by Open AI’s effective marketing of its GPT systems. As someone who has only been working…

Linear Stacking Cosine Embeddings

Why pretrain your own language model?

nlp

I think now is a great time to pretrain your own language model from scratch. This may be a strange statement when all the best performing models today are hundreds of…

How not to Evaluate NER Systems

nlp

hnbooks

ner

A typical way to evaluate NER (Named Entity Recognition) Systems is to look at the F1 score; however this is a bad idea, as stated in Chris Manning’s 2006 blog post Doing Named Entity Recognition? Don’t…

Conda Environment YAML for running TensorFlow on GPU

python

Getting TensorFlow to work on a GPU can be tricky, but conda can make it relatively easy. Here’s a configuration that I find works on TensorFlow 2.11 with CUDA 11.7:

Convert Hugo mmark LaTeX into Pandoc

blog

I’ve recently migrated from Hugo to Quarto and one of the hardest steps was converting the equations in Hugo’s legacy mmark format to Quarto. This notebook shows how I converted the…

Sinusoidal Functions as Harmonic Oscillations

maths

The sinusoidal functions (like sine and cosine) normally first come up when students are learning about angles of right angled triangles. However there are many…

Migrating from Hugo to Quarto

blog

I’ve just moved this website from Hugo to Quarto, and I am very happy. Quarto is much better for a mathematics, data, and…

Automatically updating SSH Config for Jarvislabs

python

Jarvislabs is a very cost efficient cloud GPU provider for deep learning. One slight issue I ran into is that every time you resume an instance it gets a different SSH…

Preventing SpaCy from re-downloading large models

python

annotation

SpaCy and Prodigy a…

How Not to Do Book Named Entity Recognition

hnbooks

I’m working on a project to extract books from Hacker News. I’ve found some heuristics to find books and want to train an NER model to recognise the names of books.…

Duplicate Record Detection in Tabular Data

data

How do you deal with near duplicate data, or join two datasets with some errors? For example when a book is added to Open Library it’s easy for accidental duplicates to occur, and there are many in practice. There are often small differences between duplicates, such as abbreviating author’s names…

Bootstrapping a book classifier

hnbooks

nlp

I’m working on a project to extract books from Hacker News. Most HackerNews posts aren’t about books, and it would be extremely tedious to manually annotate examples when most of them are negative.…

Converting SentenceTransformers to Tensorflow

python

nlp

SentenceTransformers provides a convenient interface for creating embeddings of text (and images) in PyTorch, which can be used for neural retrieval and ranking. But what if you want to…

Displaying Hacker News Book Comments in HTML

hnbooks

I’m currently working on a project to extract books from Hacker News to help find interesting books. Having extracted ASINs from Hacker News posts…

Human-in-the-Loop: Finding Hazards in Food

books

annotation

The book Human-in-the-Loop Machine Learning by Robert (Munro) Monarch has a code example of finding hazards in food safety reports. Here’s the description of the problem:

Human-in-the-Loop: Finding Bicycles in Images

books

annotation

The book Human-in-the-Loop Machine Learning by Robert (Munro) Monarch has a code example of annotating bicycles. Here’s the description of the problem:

Open Library ISBN Lookup

hnbooks

I’ve been looking at Open Library as a knowledge base for books. After extracting ASINs from Hacker News posts, I want to link them to Open Library records.

Human-in-the-Loop: Finding Topics in Headlines

nlp

books

annotation

The book Human-in-the-Loop Machine Learning by Robert (Munro) Monarch has a code example of annotating headlines for a data analyst.

Human-in-the-Loop Machine Learning: Book review

books

Most machine learning models are guided by human examples, but most machine learning texts and courses focus only on the…

What’s in Open Library Data

hnbooks

nlp

ner

python

I’m working on a project to extract books from Hacker News, and want to link the books to records from Open Library. I’ve already looked at the process of adding…

Adding a Book to Open Library

hnbooks

I’ve been looking at Open Library as a knowledge base for books. A lot of the data here is manually uploaded by people, and the user…

Training SentenceTransformers Using Memory Mapping with PyArrow

python

nlp

SentenceTransformers provides a convenient interface for training linguistic embeddings using Transformers, which can be used for example with approximate nearest neighbours for search. However…

Importing Open Library into SQLite

nlp

ner

hnbooks

python

I’m working on a project to extract books from Hacker News, and want to link the books to records from Open Library. The Open Library data dumps are several gigabytes of compressed…

Open Library: A Book Knowledge Base

nlp

ner

hnbooks

I’m working on a project to extract books from Hacker News. Once I have extracted book titles (e.g. with NER or Question Answering) I need a way to disambiguate them to an entity, and potentially link…

Human in the Loop for Disaster Annotation

annotation

I’ve been reading Robert Monarch’s Human-in-the-Loop Machine Learning and the second chapter has a great practical example of human-in-the-loop machine learning; identifying whether a news headline is about a disaster. A lot of the Data…

Evaluating Book Retrieval from Hacker News

nlp

ner

hnbooks

I’m working on a project to extract books from Hacker News. I’ve been thinking about ways to bootstrap this…

Question Answering as Zero Shot NER for Books

nlp

ner

hnbooks

I’m working on a project to extract books from Hacker News. I’ve previously found book recommendations for Ask HN Books, and have used the Work of Art named entity from Ontonotes to detect the titles. Another approach is to use extractive question answering as a sort of zero-shot NER. This works amazingly well, at least…

Book NER as a Work of Art

nlp

ner

hnbooks

I’m working on a project to extract books from Hacker News. I’ve previously found book recommendations for Ask HN Books. Now I want a way to extract the book titles and authors. The Ontonotes…

Ask HN Book Recommendations

nlp

ner

hnbooks

I’m working on a project to extract books from Hacker News. Most…

Finding ASINs in HackerNews

nlp

ner

hnbooks

I’m currently working on a project to extract books from Hacker News. After exporting all 2021 posts from the Google Bigquery dataset in a Kaggle Notebook and doing an exploratory data…

Hacker News Dataset EDA

python

hnbooks

A mystery! A riddle! A puzzle! A quest! This was the moment that Ada loved best.

Side Project Outline: Book Title NER

nlp

ner

hnbooks

I’m starting a month long project to extract book titles from Hacker News using Named Entity Recognition. I’ve been thinking lately about how…

Source Map HTML Tags in Python

python

html

data

nlp

When using NLP in HTML it’s useful to extract the HTML tags which change the meaning of the text. I’ve already shown how to source map the text of HTML using Python’s inbuilt HTMLParser. We’ll now adapt this to the task of getting the tags, which could be…

Finding Meaning in HTML

html

nlp

HTML is one of the most common forms of communication today. Emails, wikis, blogs, and many forums ultimately use some sort of HTML to communicate things such as emphasis…

Regular Expression for HTML Comments

python

html

testing

I started trying to write a grammar for generating HTML and quickly got stumped by how to represent HTML comments. From the whatwg specification:

Source Mapping Text HTML in Python

python

html

data

nlp

Sometimes I want to extract text from HTML for processing, but I don’t want to lose the context. This is useful in NLP with HTML because sometimes the context (is it emphasised, or in a header or list item) may be relevant for…

Low abstraction software

programming

I spend most of my time with software barely aware of the towers of abstraction below. When I click a link in my web browser it sends an HTTP request down through to the…

Building a layered API with Fashion MNIST

python

data

fastai

We’re going to build up a simple Deep Learning API inspired by fastai on Fashion MNIST from scratch. Humans can only fit so many things in their head at once (somewhere between 3 and 7); trying to grasp all the details of the training loop at…

Peeling back the fastai layered API with Fashion MNIST

python

data

fastai

Chapter 4 of the fastai book covers how to build a Neural Network for distinguishing 3s and…

Fashion MNIST with Prototype Methods

python

Prototype methods classify objects by finding their proximity to a prototype in the…

Point-in-time joins and real time feature stores

sql

data

Going from batch processing to near-real time applications is a big conceptual leap for data scientists. Data scientists are often familiar with big SQL analytics databases…

Building an AI Driven Game at no cost

python

fastai

I built Emotion Escape, an adventure game with the twist that you navigate using facial expressions. Specifically you upload a photo of a face that is one of happy, sad…

Read Common Crawl Parquet Metadata with Python

python

commoncrawl

Common Crawl releases columnar indexes of their web crawls in the Apache Parquet file format. This can be efficiently queried in a distributed manner using Amazon Athena or Spark, and the Common Crawl team have…

Training Recipe Ingredient NER with Transformers

nlp

python

ner

I trained a Transformer model to predict the components of an ingredient, such as the name of the ingredient, the quantity and the unit. It performed better than the…

Training a Stanford NER Model in Python

nlp

python

ner

Stanford NER is a good implementation of a Named Entity Recognizer (NER) using Conditional Random Fields (CRFs). CRFs are no longer near state of the art for NER, having been overtaken…

Dictionary to Dataclass

python

Dataclasses are a really lightweight way to make classes. When I’m programming I’ll often start out with a dictionary of data and specific functions to manipulate…
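
To illustrate the move from a dictionary to a dataclass, here is a minimal sketch. The `Book` class and the example record are hypothetical, chosen only for illustration, and are not taken from the article.

```python
from dataclasses import asdict, dataclass


@dataclass
class Book:
    title: str
    author: str
    year: int

    def citation(self) -> str:
        # A function that previously took the dictionary becomes a method.
        return f"{self.author} ({self.year}). {self.title}."


# Illustrative record, as might come from a JSON payload.
record = {"title": "Right Ho, Jeeves", "author": "P. G. Wodehouse", "year": 1934}

book = Book(**record)        # unpack matching keys into fields
print(book.citation())
print(asdict(book) == record)  # True: the conversion round-trips
```

Unpacking with `**record` raises a `TypeError` on missing or unexpected keys, which is often a useful early check on the shape of the data.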

TextRank

nlp

TextRank (Mihalcea and Tarau, 2004) is the idea of using graph ranking algorithms, like PageRank, as an unsupervised way of extracting key units of text from a document. The interesting part is how they define graphs on the units of text; filtering text units…

Pooling Proportions with Empirical Bayes

statistics

Calculating binomial proportions by group is inherently noisy for small groups. The standard deviation for a group is \(\sqrt{\frac{p(1-p)}{N}}\), where \(p\) is the true…
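
A small sketch of how quickly that noise shrinks with group size, using the formula above. The function name `proportion_sd` and the example rate of 10% are assumptions for illustration.

```python
import math


def proportion_sd(p: float, n: int) -> float:
    """Standard deviation of an observed proportion: sqrt(p(1-p)/N)."""
    return math.sqrt(p * (1 - p) / n)


# With a true rate of 10%, compare groups of different sizes.
for n in (10, 100, 1000, 10000):
    print(f"N={n:>5}: sd={proportion_sd(0.1, n):.4f}")
```

For a group of 10 the standard deviation is about as large as the rate itself, which is why small groups need some form of pooling.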

Test Driven Development in Machine Learning

testing

Last weekend I read the first part of Kent Beck’s Test Driven Development: By Example. He works through a simple example of programming with Test Driven Development in excruciating detail. It shows how you can move…

Restoring Wayback Machine HTML

The Internet Archive’s Wayback Machine is a digital archive of a large portion of the internet (hundreds of billions of web pages). However they don’t store the webpage in its original form, but make some changes…

Automatically changing display settings with Autorandr

linux

I’ve been using Linux on laptops for over a decade and have got used to some of the rough edges. When I boot my current laptop connected to an external monitor xrandr will…

Common Crawl Time Ranges

commoncrawl

Common Crawl provides a huge open web dataset going back to around 2009. Unfortunately it’s not easy to find out the…

Pagination in Internet Archive’s Wayback Machine with CDX

data

I’ve been trying to use pagination with the Internet Archive’s CDX for the Wayback Machine but have been getting lots of empty results. The reason is that filters are applied afte…

Hugo Readdir Error with Emacs

emacs

blog

Every now and then when previewing Hugo (via hugo serve) as I’m editing it in Emacs I’ll get a strange error like:

Fast Web Dataset Extraction Workflow

data

I’m currently streamlining the process of building a dataset from web data. I want to make it easy for anyone to build their own dataset in a few hours, which requires…

Unique Key for Web Captures

data

I’m currently developing a workflow for extracting data from captures of web pages, leveraging large archives like Common Crawl and the Wayback Machine. A pain point in…

On Applying Optimisation

maths

I recently watched the ACEMS public lecture Optimal Decision Making: A tribute to female ingenuity, where Alison Harcourt (née Doig)…

Estimating Group Means with Empirical Bayes

maths

data

When calculating the averages of lots of different groups it doesn’t make sense to treat the groups as independent, but to pool information across groups, especially on groups with little data. One way to do this is to build a…

A Reading Guide to Stein’s Paradox

maths

data

Stein’s Paradox states that when trying to estimate 3 or more means of normally distributed data together, it’s always better (on average) to shrink the estimates. Specifically if…

Bernoulli Trials and the Beta Distribution

data

statistics

Suppose we want to know the probability of an event occurring; it could be a…

Learning about Multilevel Models

statistics

The concept of a multilevel model, also called a mixed effects model or a hierarchical model, is reasonably new to me. It’s not the kind of thing typically taught…

Offline Translation in Python

nlp

python

Suppose you want to translate text from one language to another. Most people’s first port of call is an online translation service from one of the big cloud providers, and…

Building Categorical Embeddings

data

High cardinality categorical data are tricky for machine learning models to deal with. A linear model tries to estimate a different coefficient for every category, treating…

Tobit Regression in Stan and R

stan

r

data

This article shows coding Tobit Regression in Stan, integrating it into R, showing it…

Making Bayesian Predictions with Stan and R

stan

r

data

statistics

This is the third in a series of articles showing the basics of building models in Stan and accessing them in R. Now that we can specify a linear model and fit it…

Stan Linear Priors

stan

r

data

statistics

This is the second in a series of articles showing the basics of building models in Stan and accessing them in R. In the previous article I showed how to specify a simple linear model with flat priors in Stan, and fit it in R with a formula syntax. In this article we extend this to specify priors; defaulting…

Getting Started with RStan

r

stan

data

I wanted to fit a Bayesian Tobit model, but I couldn’t find one (probably because I didn’t know how to look). So I decided to build one in Stan, which I had never…

Fixing sampler errors in probit regression with rstanarm

stan

r

I was working through problem 15.5 of Regression and Other Stories, which asks…

Changing Python Analytics Code

python

programming

legacy code

This is the essence of the refactoring process: small changes and testing after each change. If I try to do too…

Calculus of the Inverse Logit Function

maths

I was recently doing some logistic regression, and calculated the derivative of the Inverse Logit function (sometimes known as expit), to understand…
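
As a quick numerical check of the standard identity (a sketch, not the article’s derivation), the derivative of the inverse logit can be written in terms of the function itself, expit'(x) = expit(x)(1 - expit(x)). The helper names below are hypothetical.

```python
import math


def expit(x: float) -> float:
    """Inverse logit: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def expit_derivative(x: float) -> float:
    """Closed form: expit(x) * (1 - expit(x))."""
    p = expit(x)
    return p * (1.0 - p)


def central_difference(f, x: float, h: float = 1e-6) -> float:
    """Numerical derivative, used only to check the closed form."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-2.0, 0.0, 1.5):
    print(x, expit_derivative(x), central_difference(expit, x))
```

The derivative peaks at 0.25 when x = 0, so on the probability scale a unit change in the linear predictor moves the prediction by at most about a quarter.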

Priors as Regularisation

data

In Bayesian statistics you have to choose a prior distribution for the parameters to combine with the data to get a posterior…

Binning Binary Predictions

data

r

When understanding how a binary prediction depends on a continuous input I find a very useful way is to bin it into quantiles and plot the average probability.

Finding Open Datasets

data

The best way to build a skill is to practice it, and data analytics is no different. That means you…

Burnout from Creeping Commitments

general

A couple of months ago I transitioned…

Plotting Bayesian Parameter Distributions with R Tidyverse

I’m currently reading Regression and Other Stories which contains lovely plots…

Spellchecking Articles with Aspell

I write enough text in Emacs that it’s worth using a spellchecker. As I started comparing options I ended up spellchecking all my…

Interpretable Parameterisations

data

maths

Interpretable models are incredibly useful for getting users to trust your model (and understand when not to trust it). In projects helping subject matter experts make…

Getting Sentencing Data

python

In Noise by Kahneman, Sibony and Sunstein, they discuss how much variation there is in…

Composition Over Inheritance

programming

jobs

I have been trying to extract job ads from Common Crawl, and have designed a pipeline with 3 phases; fetching the data, extracting the content and normalising the data. However the way I implemented this is using inheritance to remove some of the duplication…

Structuring Python Analytics Codebases

python

data

legacy code

Many analytics codebases consist of a pipeline of steps, doing things like getting data, extracting features, training models and evaluating results and diagnostics. The…

Setting the Order of Commands in Typer

python

Typer is a nice library, built on top of Click, for succinctly building Python CLIs. However when you’ve got subcommands they’re listed in alphabetical order. It would be nice to have the commands ordered in the same…

Testing Pandas transformations with Hypothesis

pandas

python

testing

Pandas and numpy let you perform fast transformations on large datasets…

Property Based Testing with Regular Expressions

python

testing

Property based testing is a really useful technique where you state a property about your code and then verify it with random data. The difficulty is…

Constrained Gradient Descent

maths

python

data

Gradient descent is an effective algorithm for finding the local extrema of functions, and the global extrema of convex functions. It’s very useful in machine learning for…

Making Changes Faster with Tests

legacy code

I used to think the whole point of software verifications like types and tests was to ensure a piece of software worked as specified. Consequently if a piece of software…

Installing Tidyverse in WSL without Timedatectl Status 1 Issue

r

wsl

When I tried to install tidyverse in WSL2 I ran into issues with timedatectl and xml2. The simple solution is:

Automated Refactoring in Python

python

legacy code

I am a very recent convert on automatic refactoring tools. I thought it was something for languages like…

Writing Pandas Dataframes to S3

pandas

python

To write a Pandas (or Dask) dataframe to Amazon S3 or Google Cloud Storage, all you need to do is pass an S3 or GCS path to a serialisation function, e.g.

Code Optimisation as a Trade Off

programming

I’ve been writing some ARM Assembly as part of a Raspberry Pi Operating System Tutorial, and writing in Assembly really forces me to think about performance in terms of registers and instructions. When I’m writing Python trying to write concise code leads to…

Hardware Is Hard

programming

I’ve been revisiting baremetal Raspberry Pi programming (with Alex Chadwick’s Baking Pi tutorial, although there are plenty of others). It really highlights how much I take for granted. I spend a lot of time in languages like Python and R processing data, without much understanding of the interpreters…

Fast Pandas DataFrame to Dictionary

pandas

python

Tabular data in Pandas is very flexible, but sometimes you just want a key value store for fast lookups. Because Python is slow, but Pandas and Numpy often have fast C…

Chompjs for parsing tricky Javascript Objects

python

data

Aggregating Quantiles with Pandas

python

pandas

One of my favourite tools in Pandas is agg for aggregation (it’s a worse version of dplyr’s summarise).…

A Command Line Interface for HTML With parsel-cli

python

linux

There are many great command line tools for searching and manipulating text (like grep), columnar data (like awk), and JSON data (like jq). With HTML there’s parsel-cli built on top of…

Pasting text from long ago in Emacs and Vim

emacs

I use Vim keybindings in Emacs through Evil Mode and Evil Collection. Often I’ll copy something, make some edits, and then want to…

Persistent Dictionaries in Python

python

Dictionaries in Python (in other languages called maps or hashmaps) are a useful and…

Not Using Scrapy for Web Scraping

python

Scrapy is a fast high-level web crawling and web scraping framework. But as much as I want to like it I find it very…

Select, Fetch, Extract, Watch: Web Scraping Architecture

data

The internet is full of useful information just waiting to be collected, processed and analysed. While getting started scraping data from the web is straightforward, it’s…

Taking Screenshots in Firefox

emacs

linux

I find taking screenshots in Linux a bit painful. My current way is to use GIMP to create an image from a…

Extracting Links From HTML

python

Sometimes you have a HTML webpage or email tha…

Reading Email in Python with imap-tools

python

You can use Python to read, process and manage your emails. While most email providers provide autoreplies and filter rules, you can do so much more with Python. You could…

Machine Learning Serving on Google CloudRun

python

I sometimes build hobby machine learning APIs that I want to show off, like whatcar.xyz. Ideally I want these to be cheap and low maintenance; I want them to be available most of the time but I don’t want to spend much time or money maintaining them and I can…

How big a sample to measure conversion?

data

statistics

A common question with conversions and other rates, is how big a sample do you need to measure the conversion accurately? To get an estimate with standard error \(\sigma\) you need at…

How to Sum Random Variables

data

statistics

Suppose you’ve got two dice; what’s the probability the sum of their rolls will add up to 4? You simply look at all the ways…
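
The counting argument for two dice can be sketched by brute-force enumeration. The function name `sum_distribution` is hypothetical, and the dice are assumed to be fair and independent.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sum_distribution(faces=range(1, 7)):
    """Probability of each total when rolling two fair dice with these faces."""
    faces = list(faces)
    counts = Counter(a + b for a, b in product(faces, repeat=2))
    total = len(faces) ** 2
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


dist = sum_distribution()
print(dist[4])  # 1/12, from the three rolls (1,3), (2,2), (3,1) out of 36
```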

Statistical Testing: 2.8 Standard Deviations

data

statistics

What sample size do you need to capture an effect with 95% confidence 80% of the time? For a normal/binomial distribution the answer is roughly \(\left(2.8…

Price Hysteresis

data

The demand curve in economics represents the relationship between price and quantity sold. It’s generally not possible to know the demand curve…

More Profitable A/B with Test and Roll

data

When running an A/B test the sample sizes can seem insane. For example to observe a 2 percentage point uplift on a 60% conversion rate requires over 9,000 people in each group to get the standard 95% confidence level with 80% power. If you’ve got fewer than 18,000 customers you can reach, which is very common in business to business…
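
The figure of over 9,000 per group can be reproduced with the usual normal approximation for comparing two proportions. This is a sketch under that assumption, not necessarily the calculation used in the article, and `sample_size_per_group` is a hypothetical name.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided test of two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_power = z.inv_cdf(power)          # about 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


n = sample_size_per_group(0.60, 0.62)
print(n)
```

With these inputs the result is roughly 9,300 per group. The two z-values sum to about 2.8, and because the effect size is squared in the denominator, halving the uplift you want to detect roughly quadruples the sample you need.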

Docker Dependency Management

I have a personal CV written in TeX. I wouldn’t use TeX again today, but I’ve kept it maintained and haven’t had reason to migrate it. Unfortunately the dependencies can be painful; whenever I…

Previewing changes to LaTeX documents with inotify

linux

Sometimes it’s useful to rerun a task whenever…

Reference Sets as Pervasive Models

data

Suppose you have a long standing heart condition, and are considering undergoing a surgical procedure that could alleviate the condition, but has its own set of risks. You…

The Way of the Physicist

A large number of the physicists I trained with are now data scientists, and it’s not uncommon to meet a data scientist who trained in Physics. Part of this is…

Probability Distributions Between the Mean and the Median

maths

data

The normal distribution is used throughout statistics, partly because the Central Limit Theorem means it occurs in many applications, but also because it’s computationally convenient. The expectation value of the normal distribution is the mean, which has many nice…

Integrating Powers of Exponentials

maths

When working with distributions that were powers of exponentials, of which the normal and exponential distributions are special…

Metrics for Binary Classification

data

When evaluating a binary classifier (e.g. will this user convert?) the most obvious metric is accuracy; what’s the probability a random prediction is correct. One issue with this metric is if 90% of the cases are one…

Building a Reputation in Data Science

data

As a professional your reputation is very important to your career success. To get people to offer you work, to pay for your advice or to buy products from you they need to…

Jupyter Notebooks as Logs for Batch Processes

jupyter

python

When creating a batch process you typically add logging statements so that when something goes wrong you can more quickly debug the issue. Then when something goes wrong you…

Jupyter Notebook Preamble

jupyter

python

r

Offline SQL Formatting with sqlformat

sql

python

emacs

It’s polite to format your SQL before you share it around. You want to be able to do it in context, and not upload your private SQL to some random website. The sqlformat command of the Python…

Flattening Nested Objects in Python

python

pandas

Sometimes I have a nested object of dictionaries and lists, frequently from a JSON object, that…

Extracting Fields from JSON with a Python DSL

python

Indexing into nested objects of dictionaries and lists in Python is…

Getting Started with nbdev

jupyter

python

Nbdev is a tool to make it possible to develop Python libraries in Jupyter notebooks. At first I found this idea scary, but after watching the talk I like Notebooks and seeing how it works I think it’s got the best…

Language Models as Classifiers

nlp

Probabilistic language…

Sentence Boundaries in N-gram Language Models

nlp

An N-gram language model guesses the next possible word by looking at how frequently it has previously occurred after the previous N-1 words. I think this is how my mobile…
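
A minimal sketch of that counting idea follows. It uses whitespace tokens and a toy sentence, with no smoothing or sentence boundary markers; `train_ngram` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict


def train_ngram(tokens, n=2):
    """Count how often each word follows each context of n-1 words."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts


def predict_next(counts, context):
    """Most frequent continuation of the context, or None if it was never seen."""
    options = counts.get(tuple(context))
    return options.most_common(1)[0][0] if options else None


tokens = "the cat sat on the mat and the cat slept".split()
model = train_ngram(tokens, n=2)
print(predict_next(model, ["the"]))  # cat ("cat" follows "the" twice, "mat" once)
```

Unseen contexts return `None` here; a practical model would back off to a shorter context or apply smoothing instead.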

Hurdles in Contributing to Open Source

programming

The Gulag Archipelago: Audiobook Review

books

The Gulag Archipelago is a singular piece of literature about the horrors of arbitrary arrest, inhumane interrogations, prolonged imprisonments, deadly work camps and exile…

Rod Crewther: 23/09/1945 - 17/12/2020

general

Today I learned that Dr Rodney James Crewther, known…

Moral Justification

general

Once a problem becomes moral, the acceptable solution space collapses. I’m reading Sylvia Nasar’s book Grand Pursuit, and she talks about how Malthus’ An Essay on…

Normalising Salary

jobs

Salary ranges come in many forms; how can we convert them to a common form? A first approximation is to annualise them; it ignores…

The Righteous Mind: Book Review

books

Jonathan Haidt’s The Righteous Mind: Why Good People are Divided by Politics and Religion is about the moral norms of groups. As someone not familiar with moral psychology I found the book discussed many interesting ideas I wasn’t aware of, but didn’t provide much evidence for…

Language Through Prism

nlp

Neural language models, which have advanced the state of the art for Natural Language Processing by a huge leap over previous methods, represent the individual tokens as a sequence of vectors. This sequence of vectors can be thought of explicitly as a discrete time…

Open Source Licenses for Data Processing Code

programming

When a program primarily sources and transforms data then copyleft licenses add very little protection over other open source…

Fixing repr errors in Jupyter Notebooks

python

jupyter

When running the Kaggle API method dataset_list_files in a Jupyter notebook I got an error about __repr__ returning a non-string. At first I thought the function was broken, but then I realised it was just how it was…

Life Optimisation

general

When all you’ve got is the hammer of mathematics, everything looks like an optimisation problem, you just need to choose the right objective function. So what should the…

What Is a Better Programming Approach?

programming

When you solve a problem in code you will use some programming approach, and the approach you choose can make a big impact on your efficiency. I talk about approach rather than language because it’s more than just the language. A project will typically only use a subset of the language (especially for massive languages like C++), some…

Pip Can Now Resolve Dependencies

python

Something that has always bothered me about pip in Python is that you would get errors about inconsistent packages. Things still seemed to work…

Why use Tox for Python Libraries

python

I have been surprised how hard it is to maintain an internal library in Python. There are constantly issues for end users where something doesn’t work. It turns out one…

Mere Exposure

The Mere Exposure Effect says that people will prefer something they’ve seen or heard multiple times before over something less familiar. The Robert Zajonc paper Attitudinal Effects…

Managing Python Versions with asdf

programming

python

I was recently trying to run a pipenv script, but it gave an error that it required Python 3.7 which wasn’t installed.…

Cosine Similarity is Euclidean Distance

maths

data

In mathematics it’s surprising how often something that’s obvious (or trivial) to one person can be revolutionary (or weeks of work) to someone else. I was looking at the annoy (Approximate Nearest Neighbours, Oh…

Git: One VCS to Rule Them All

programming

When I started as a professional developer there were a number of competing version control systems. However Git seems to have almost entirely won this battle.

Using find and xargs

programming

Sometimes you want to feed a bunch of files to a program, and this is often easily done with find and xargs.

Templates for Excel Charts

excel

Sometimes I work with Excel and need to make attractive looking charts. The default charts look awful and this is often a time consuming…

Centroid of Points on the Surface of a Sphere

maths

python

There is a fundamental mistake here where we minimise the average distance, but the centroid should minimise the average squared distance.…

Automation through Documentation

programming

You join a new team and your first task is to run the monthly batch process. It transforms some data, trains a business critical model, and outputs some reporting. Your…

Glassbox Machine Learning

data

Can we have an interpretable model that has as good performance as blackbox models like gradient boosted trees and neural networks? In a 2020 Empirical Methods…

Finding Hugo Blogs with BigQuery

sql

blog

I want to find examples of other Hugo blogs, but they’re not really easy to search for. Unless someone put “Hugo” in…

Composing Functions

programming

python

r

R core looks like it’s getting a new pipe operator |> for composing functions. It’s just like the existing magrittr pipe %>%, but has been implemented as a syntax transformation so that it is more computationally efficient and produces…

Centroid for Cosine Similarity

data

maths

Cosine similarity is often used as a similarity measure in machine learning. Suppose you have a group of points (like a cluster); you want to represent the group by a single point - the…

Checking With Calculation

general

maths

Using Behaviour to Understand Items

data

When people access products online their behaviour gives lots of information about both the people and the products. This information deeply enriches understanding of how to…

AlphaFold: Predicting protein shape from its composition

general

The Critical Assessment of protein Structure Prediction (CASP) runs every two years to predict the shape of a protein, the building blocks of life, from its sequence of amino acids. We know the shape of a bunch (around 170,000) of…

Chaining with Pandas Pipe function

python

I often use method chaining in pandas, although certain problems like calculating the second most common value are hard. A really…

Type Checking Beautiful Soup

python

programming

Static type checking in Python can quickly verify whether your code is open to certain bugs. But it only works if it knows the types of external libraries. I’ve already introduced how to add…

Truly Independent Thinkers

general

I was reading Paul Graham’s How to Think For Yourself where he talks about independent-mindedness. His examples are scientists, investors, startup founders and essayists as professions where you can’t do well without thinking differently from your peers. While there’s…

Structuring a Project Like a Kaggle Competition

data

programming

Analytics projects are messy. It’s rarely clear at the start how to frame the business problem, whether a given approach will actually work…

Typechecking with a Python Library That Has No Type Hints

python

Type hints in Python allow statically verifying the code is correct, with tools like mypy…

Code Structure Reflecting Function

programming

I’ve been trying to extract job ads from Common Crawl. However I’ve been stuck on how to structure the code. Thinking through the relationships really helped me do this.

Setting the Icon in Jupyter Notebooks

jupyter

python

r

I often have way too many Jupyter notebook tabs open and I have to…

Retrying Python Requests

python

The computer networks that make up the internet…

Decorating Pandas Tables

python

data

pandas

jupyter

When looking at Pandas dataframes in a Jupyter notebook it can be hard to find what…

A First Cut of Job Extraction

jobs

I’ve finally built a first iteration of a job extraction pipeline in my job-advert-analysis repository. There’s nothing in there that I haven’t written about, but it’s simply doing the work to bring it all together. I’m really happy to have a full pipeline…

Which /bin/sh

programming

I tried to run a shell script and got this error:

Operating a Tower of Hacks

programming

Remember after you run the update process to run the fix script on the production database. But run it twice because it only fixes some of the rows the first time. Oh, and…

Packaging your Expertise in a Tiny Product

I was listening to the $100 MBA Podcast about How to Easily Create a Small Information Product. I really like the idea of building a tiny…

Energy to Orbit vs Launch into Deep Space

insight

This is from Sanjoy Mahajan’s The Art of Insight Problem 1.11

Energy Density to Launch into Space

insight

This is from Sanjoy Mahajan’s The Art of Insight Problem 1.11

How Much Energy is there in a 9V Battery

insight

This is from Sanjoy Mahajan’s The Art of Insight Problem 1.11

Success in Small Steps

general

A lot of times I’ve…

Why is Vmmem Using All My Memory?

wsl

My Windows laptop was slowing to a crawl; I was waiting seconds to switch windows and even typing took a couple of seconds…

Run Webserver Without Root

programming

You’ve written your web application or API and you now want to deploy it to a server. You don’t want to run it as root, because…

Myth of the Hawthorne Effect

general

The Hawthorne effect comes from an experiment measuring the effect of lighting changes on worker output in an electrical factory, where any change increased output, even a change back to the original lighting conditions. I’ve heard this explained as running the experiment caused the employees to be observed more closely…

Running out of Resources on AWS Athena

athena

presto

AWS Athena is a managed version of Presto, a distributed database. It’s very convenient to be able to run SQL queries on large datasets, such as Common Crawl’s Index, without having to deal with managing the infrastructure of big data. However the downside of a managed service is when you hit its limits there’s no way of…

Building a Job Extraction Pipeline

jobs

python

commoncrawl

I’ve been trying to extract job ads from Common Crawl. However I was stuck for some time on how to actually write…

Insights From Google Analytics for a Small Blog

I started regularly writing this website to get better at writing, to build a portfolio and share my learnings. Because of this I haven’t been focused on…

Importance of Collecting Your Own Training Data

data

whatcar

A couple years ago I built whatcar.xyz which predicts the make and model of Australian cars. It was built mainly with…

Unhappy Path Programming

programming

When programming it’s easy to think about the happy path. The path along which you get well-formed valid data, all your requests return successfully and everything works on…

Updating a Python Project: Whatcar

whatcar

python

programming

The hardest part of programming isn’t learning the language itself, it’s getting familiar with the gotchas of the ecosystem. I recently updated my whatcar car classifier in Python after leaving it for a year and hit a few roadblocks along the way. Because I’m familiar with Python I knew enough heuristics to work through them quickly, but it…

Activating Mobile Phone Camera from HTML

programming

whatcar

Building a web application is great because, if it is well built, it can be accessed across many operating…

Building NLP Datasets from Scratch

nlp

data

There’s a common misconception…

Orderly Life for Original Work

general

Be settled in your life and as ordinary as the bourgeois, in order to be fierce and original in your works.

Experimental Generalisability

statistics

data

Experiments reveal the relationship between inputs and outcomes. With statistical methods you can often, with enough observations…

Choosing a Static Site Generator

programming

blog

Static website generators fill a useful niche between handcoding all your HTML and running a server. However there’s a plethora of site generators and it’s hard to choose between them. Still, I’ve got a simple recommendation: if you’re writing a blog use Jekyll…

Social Flashcards

general

I’m terrible at remembering names. When someone introduces themselves I’m…

Overperforming

general

There’s a common misconception that the best way to get promoted…

Sleep

Sleep is really important. When people don’t sleep they get less effective…

Can I? Must I? Should I?

general

Whenever someone gets an idea in their head they start filtering out evidence that contradicts that idea. This is called confirmation bias: people start looking for evidence that confirms their current idea and neglecting evidence that challenges it. There’s no way to completely beat a bias, but something that…

Learning Hugo by Editing Themes

programming

blog

One of the hardest parts of learning something new is motivation. This is why one of the best ways to learn programming is editing code; it’s goal driven so motivation is built in. I’ve successfully used this…

Manually Triggering Github Actions

programming

I have been publishing this website using Github Actions with Hugo on push and on a daily schedule. I recently received an error notification via email from Github, and wanted to check whether it was an intermittent error. Unfortunately I couldn’t find any way to rerun…

R: Keeping Up With Python

r

python

About 5 years ago a colleague told me that the days were numbered for R and Python had won. From his perspective he is probably right; in software engineering companies…

Population Density Australia

insight

How dense is the population in Australia? I’ve looked at the Gridded Population of the World and you can see that the population is concentrated around the few capital cities on the…

Gridded Population of the World

data

I’ve spent the last few hours looking at the Gridded Population of the World which consistently estimates the population density consistent with national censuses and…

Bluetooth Headphones in 2020

I’ve been looking for some bluetooth headphones that I can use both on a mobile phone and a computer at the same time. I want something portable enough to take with me, but…

Implicit Bias

general

I like to think of myself as an egalitarian, but I know I have implicit bias. I’ve done some tests on Project Implicit and have roughly the implicit biases you would expect for my demographic. This makes me feel a bit sad, but you can’t really control your implicit biases…

Finding Files Installed in Ubuntu and Debian

programming

The Fifth Risk

books

Michael Lewis’ The Fifth Risk promotes parts of the US public service and some people who work in it. The public service is culturally opposed, if not legally prevented, from promoting itself which…

Fuel Efficiency of a 747

insight

This is from Sanjoy Mahajan’s The Art of Insight Problem 1.6

Endurance Counting

general

Counting is a strangely…

Moving Away From Keepass

A password manager is one of the best ways for the majority of people to keep their logins secure. After using KeePass and its derivatives for years, the Kee Firefox Addon dropped support for Keepass and it’s now less convenient to use. After looking at the alternatives I’m going to switch to an online…

Finding Analytics in Melbourne

My first job in analytics was in large part luck. I had an academic background in Physics and Mathematics, some professional programming experience building applications and…

Devil Take The Hindmost: Book Summary

books

Edward Chancellor’s Devil Take the Hindmost: A History of Financial Speculation is a history of several market bubbles and crashes. It covers bubbles such as the So…

Don’t Not Avoid Being Indirect

general

Australian Deathographics

insight

I’ve recently tried to estimate Australian Deaths using life expectancy. This failed badly and I think the reason is demographics; this article looks more into this.

Checking Australian Oil Imports

insight

I’ve estimated Australian oil imports; here I check the data to see how reasonable my estimates are.

Australian Oil Imports

insight

This is a variation of Sanjoy Mahajan’s The Art of Insight Section 1.4 (and problem 1.6)

Redundancy on Phone Power Button

general

My 5 year old OnePlus One’s power button has finally…

How many People in Australia Die?

insight

How many people die in Australia each year?

Checking Australian Births Estimates

insight

I estimated the number of Australian births as 250,000. The actual number of births, according to the Australian Institute of Family Studies, is…

Australian Births

insight

How many babies are born in Australia?

SICP Exercise 1.5

sicp

Exercise from SICP:

SICP Exercise 1.4

sicp

Exercise from SICP:

SICP Exercise 1.3

sicp

Exercise from SICP:

SICP Exercise 1.2

sicp

Exercise from SICP:

SICP Exercise 1.1

sicp

Exercise from SICP:

Tree Diagram Bills

insight

This is from Sanjoy Mahajan’s The Art of Insight Problem 1.5

Mixing Warm Water

insight

I used to have a fancy kettle that came with settings for…

Gold or Bills

insight

This is from Sanjoy Mahajan’s The Art of Insight Problem 1.4

Estimating Weight with Body Mass Index

insight

When estimating…

Value of Gold

insight

How much is one cubic centimetre of gold worth?

Diagrams in Hugo with Mermaid

blog

Being able to write simple diagrams with text is very convenient. We can do this in Hugo by rendering with mermaid.js.

How much Money is in a Suitcase?

insight

This is from Sanjoy Mahajan’s The Art of Insight Problem 1.3

Mass of Air in Bedroom

insight

This is from Sanjoy Mahajan’s The Art of Insight Problem 1.2

Programming Languages to Learn in 2020

programming

A language that doesn’t affect the way you think about programming, is not worth knowing.

How Much Does a Box of Books Weigh?

insight

This is from Sanjoy Mahajan’s The Art of Insight Problem 1.1

Some Ideas for Recurring Articles

Radio shows, comedy sketch shows and talk shows have the…

Diffing in SQL

sql

legacy code

r

One way of refactoring legacy code is to use diff tests; checking what changes when you change the code. While it can be easy to diff files, it’s a little less obvious how to do this with SQL pipelines. Fortunately…

Diff Tests

legacy code

When making changes to code tests are a great way to make sure you haven’t inadvertently introduced regressions. This means that you can make changes much faster with more…

Dataflow Chasing

data

legacy code

When making changes to a new model training pipeline I find it really useful to understand the dataflow. Analytics workflows are done as a series of transformations, taking…

Comment to Function

programming

legacy code

A lot of analytics code I’ve read is a very long procedural chain. These can be hard to follow because the only way to really know what’s going on in any point is to…

Tidy Time

general

I love having a clean desk and empty inbox. But I hate…

From Multiprocessing to Concurrent Futures in Python

python

Approximate Percentiles in Presto and Athena

presto

athena

Calculating percentiles and quantiles is a common operation in analytics. While they can be done in vanilla SQL with window functions and…

Downloading files from Jupyter Notebook

python

jupyter

You’ve done an analysis and generated an output file in a Jupyter notebook. How…

Git Stash Changesets

linux

Pretty frequently I start writing some code, when I realise there’s another change I need to make before I can continue. I like to make lots of small atomic changes to a…

Calculating Logs

maths

Logarithms are a handy…

Solving Solved Problems

general

maths

A good technique for deeply understanding something is to try to solve it yourself first. Sometimes this can even lead to better methods or new discoveries.

Contact Tracing in Fighting Epidemics

data

maths

The state government of Victoria, Australia has recently announced a plan on how to respond to the current Covid-19 pandemic. Based on epidemiological modelling they have set to reduce restrictions based on 14 day averages of new case…

Modelling the Spread of Infectious Disease

maths

data

Understanding the spread of infectious disease is very important for policies around public health. Whether it’s the seasonal flu, HIV or a novel pandemic the health…

Time Budgeting

general

It’s worthwhile spending some time thinking about how you spend your…

Fixing suddenly unable to connect to X server in WSL2

wsl

Today when I tried to connect to VcXsrv after running it with XLaunch it didn’t work. I’d…

Exceed Expectations

general

Today I saw a picture in someone’s window: “Always Exceed Everyone’s Expectations”. My initial reaction was that this was a quick way to burn out - trying to always exceed…

South Sea Bubble

I’ve been surprised to learn that financial bubbles and collapses are actually hundreds of years old. I learned this reading the book Devil Take the…

Embeddings for categories

data

Categorical objects with a large number of categories are quite problematic for modelling. While many models can work with them it’s really hard to learn…

From Descriptive to Predictive Analytics

data

The starting point for an analysis is often summary statistics, such as the mean or the median. For some of these you’re going to want it more…

Teaching Programming by Editing Code

programming

I’ve had a few discussions with people, especially analysts, about how to learn programming. Generally I encourage them to find a project they want to accomplish and try to…

Interpretable models with Cynthia Rudin

data

A while ago I came across Cynthia Rudin through her work on the FICO Explainable Machine Learning Challenge. Her team got an honourable mention and she…

Topic Modelling to Bootstrap a Classifier

data

nlp

Sometimes you want to classify documents, but you don’t have an existing classification.…

Filtering a left join in SQL

sql

When doing a left join in SQL any filtering of the table after the join will turn it into an inner join. However there are some easy ways to do the filtering first.

Rough Coarse Geocoding

data

A coarse geocoder takes…

Legality of Publishing Web Crawls

general

As a data analyst I rely on open code and open data to inform decisions. There’s a lot of data available…

Python HTML Parser

python

html

A lot of information is embedded in HTML pages, which contain both human text and markup. If you ever want to extract this information, don’t use regex use a parser.…

Refining Location with Placeholder

data

Placeholder is a great library for Coarse Geocoding, and I’m using it for finding locations in Australia. In my application I want to get the location to a similar level of granularity; however the input may be for a higher level of granularity. Placeholder doesn’t directly…

Maybe Monad in Python

python

programming

A monad in languages like Haskell is used as a particular way to raise the domain of a function beyond where it was defined. You can think of them as a generalised form of function…

Dip Statistic for Multimodality

maths

data

If you’ve got a distribution you may want a way to tell if it has multiple components. For example a sample of heights may have a couple of…

Priorities mean saying No

general

There are always more things you can be doing than the…

My next monitor in 2020

My primary monitor has recently died, and my backup is showing its age. I need to buy a new one and am trying to decide what to buy.

Create User Sessions with SQL

sql

data

presto

athena

Sometimes you may want to experiment with sessions and need to hand-roll your own in SQL. There’s a good Mode blog on how to do this. If you’re using…

Removing Timezone in Athena

presto

athena

When creating a table in Athena I got the error: Invalid column type for column: Unsupported Hive type: timestamp with time zone. Unfortunately it can’t support timestamps with timezone. In my case all the data was in UTC so I just needed to remove the timezone to create the table. The easiest way to…

Python is not a Functional Programming Language

python

Python is a very versatile multiparadigm language with a great ecosystem of libraries. However it is not a functional programming language, as I know some people have described it. While you can write it in a…

Differentiation is Linear Approximations

maths

data

Differentiation is the process of creating a local linear approximation of a function. This is…

Classifying Finite Groups

maths

Groups can be thought of as a mathematical realisation of symmetry. For example the symmetric…

Complex Analysis

maths

Data Tests with SQL

data

sql

A challenge of data analytics is that the data can change as well as the code. The systems producing and collecting data are often changed and can lead to missing or corrupt data. These can easily corrupt reports and…

Sessionisation Experiments

data

You don’t need a lot of data to prove a point.…

Test Driven Salary Extraction

python

Even when there’s a specific field for a price there’s a surprising number of ways people write it. This is what the tool price-parser solves. Unfortunately it doesn’t work too well on salaries, which tend to be…

Finding Australian Locations with Placeholder

python

nlp

jobs

People write locations in many different ways. This makes them really hard to analyse, so we need a way to normalise them. I’ve already discussed how Placeholder is useful for…

Converting HTML to Text

python

nlp

I’ve been thinking about how to convert HTML to Text for NLP. We want to at least extract the text, but if we can preserve some of the formatting it can…

How to turn off LaTeX in Jupyter

jupyter

python

When showing money in Jupyter notebooks the dollar signs can disappear and turn into LaTeX through Mathjax.…

Double emphasis error in html2text

python

I’m trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown…

An edge bug in html2text

python

I’ve been trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. But I quickly…

Symmetry in probability

maths

The simplest way to model probability of a system is through symmetry. For…

Sunk Cost of Pure Mathematics

maths

Today I went through the painful exercise of culling my notebooks. My honours notebooks, independent research and…

Writing Blog Posts with Jupyter and Hugo

python

blog

It can be convenient to directly…

Searching within a Website

general

Some websites, like this one, have a lot of content but have no search function. Others…

NLP Learning Resources in 2020

nlp

There’s a lot of…

Speaking Quota

I often find listening more productive than talking, but still find it easy to spend a lot of meetings talking. When I get curious I ask lots of questions in a meeting that can take it off on a tangent…

Being Patient with People

general

I’m sitting in a meeting listening to an update.…

Don’t Stop Pretraining

nlp

In the past two years the best performing NLP models have been based on transformer models trained on an enormous corpus of text. By understanding how language in general works they are much more effective at detecting…

Tangled up in BLEU

nlp

How can we evaluate how good a machine generated translation is? We could get bilingual readers to score the translation, and average their scores. However this is expensive…

Hugo Casper 2 to 3

blog

I’ve been wanting to upgrade my version of Hugo, but the Casper 2 theme I was using didn’t support it. A first step in this transition is to use Casper 3. It looks similar to my old theme, is easy to set up, but seems to be…

Running an X server with WSL2

wsl

linux

emacs

I’ve recently started working with WSL2 on my Windows machine, but have had trouble getting an X server to run. This is an issue for me because running Emacs with Evil keybindings under Windows Terminal I often…

Metrics you can Drive

data

Tracking a metric can help to drive dramatic…

Customising Portable Dotfiles

emacs

linux

I keep my personal configuration files in a public dotfiles repository. This means that whenever I’m on a new machine it’s very easy to get comfortable in a new environment.…

Git Folder Identities

linux

programming

Sometimes you want a different git configuration in different contexts. For example you might want different author information, or to exclude files for only some kinds of projects, or to have a specific template for certain…

Raising Exceptions in Python Futures

python

Python concurrent.futures are…

Getting Started with WSL2

wsl

I’ve finally started trying out Windows Subsystem for Linux version 2. When comparing with WSL1 it’s much faster because it works on a Virtual Machine rather than translating syscalls, but is slower when working on Windows filesystems. The speed up is significant when…

Targeting my brand

general

My friend has four different magnets for plumbers on his fridge. Three of them are generic rectangular magnets that have generic information and contact details. One of them…

Embrace, Extend and Extinguish

general

In the 90s Microsoft famously used a strategy of embracing other protocols, then adding extensions to their implementation until it’s no longer compatible and utilising…

Mace-Bearer

general

The University of Adelaide, being a sandstone Group of Eight University, has the archaic ceremony of a mace-bearer leading the procession carrying a…

Finding a direction

general

When I don’t have clear goals, I’m more likely to spread myself too thin. It’s easy to get…

Data Models

data

Information is useful in that it helps make better decisions. This is much easier if the data is represented in a way that closely matches the conceptual model of the business. Building a useful view of the data can dramatically…

Filling Gaps in SQL

sql

It’s common for there to be gaps or missing values in an SQL table. For example you may have…

Directions of Delegation

general

For any actionable item there are four ways to handle it: do it, defer it, delegate it or delete it. Delegation is an often overlooked powerful option to handle things. It’s…

A Checklist for NLP models

data

nlp

When training machine learning models typically you get a training dataset for fitting the model and a test dataset for evaluating the model (on small datasets techniques…

Deep Neural Networks as a Building Block

data

Deep Neural Networks have transformed dealing with unstructured data like images and text, making totally new things possible. However they are difficult to train, require a…

Mean Value Theorem

maths

I remember three things from lectures in my first…

Sequential Weak Labelling for NER

data

nlp

The traditional way to train an NER model on a new domain is to annotate a whole bunch of data. Techniques like active learning can speed this up, but especially…

Stanza for NLP

python

nlp

Working with unstructured text is much easier if we add structure to it. Stanza is a state of the art library for doing this in over 60 languages. Given some text it will tokenize, sentencize, tag parts of speech and morphological features, parse…

pyBART: Better Dependencies for Information Extraction

python

nlp

Dependency trees are a remarkably powerful tool for information extraction. Neural based taggers are very good and Universal Dependencies means the approach can be used for…

Using HTML in NLP

data

nlp

Many documents available on the web have meaningful markup. Headers, paragraph breaks, links, emphasis and lists…

Demjson for parsing tricky Javascript Objects

python

data

Tips for Extracting Data with Beautiful Soup

python

data

jobs

Beautiful Soup can be a useful library for extracting information from HTML. Unfortunately there are a lot of little issues I hit working with it to…

Saving Requests and Responses in WARC

python

When fetching large amounts of data from the internet a best practice is caching all the data. While it might seem easy to extract just the…

Only write file on success

python

When writing data pipelines it can be useful to cache intermediate results to recover more quickly from failures. However if a corrupt or incomplete file was written…

Diverge then Converge

general

It’s very useful to diverge on ideas before converging on a solution. Trying to do both at the same time tends to stifle creativity and lead to less innovative solutions.

Accelerating downloads with Multiprocessing

data

Downloading files can often be a bottleneck in a data pipeline because network I/O is slow. A really simple way to handle this is to run multiple…

Caching Data Pipelines

data

Data pipelines can often be thought of as a chain of pure functions passing data between them, even if they are not implemented that way. As long as you can access the…

Processing RDF nquads with grep

data

commoncrawl

I am trying to extract Australian Job Postings from Web Data Commons which extracts structured data from Common Crawl. I previously came up with a SPARQL query to extract the Australian jobs from the domain, country and currency. Unfortunately it’s quite slow, but we can speed it up dramatically by replacing it with a…

Coarse Geocoding

data

Sometimes you have some description of a location and want to…

Extracting Australian Job Postings with SPARQL

jobs

commoncrawl

rdf

I am trying to extract Australian Job Postings from Web Data Commons which extracts structured data from Common Crawl. I have previously written scripts to read in the graphs…

Analytics Web Data Commons with SPARQL

python

rdf

commoncrawl

data

I am trying to understand how the JobPosting schema is used in Web Data Commons structured data extracts from Common Crawl. I wrote a lot of ad hoc Python to get usage…

Adding Types to Rdflib

python

rdf

I’ve been using RDFLib to parse Job posts extracted from Common Crawl.…

Schemas for JobPostings in Practice

jobs

A job posting has a description, a company, sometimes a salary, … and what else? Schema.org have a detailed JobPosting schema, but it’s not immediately…

Converting RDF to Dictionary

python

data

The Web Data Commons has a vast repository of structured RDF Data about local businesses, hostels, job postings, products…

Streaming n-quads as RDF

data

python

The Web Data Commons extracts structured RDF Data from about one monthly Common Crawl per year. These contain a vast amount of structured information about local…

Scheduling Github Actions

programming

blog

I use Github actions to publish daily articles via Hugo. I had set it up to publish on push, but sometimes I future date articles…

Using Local Github Actions

programming

blog

I’ve been using Github Actions to publish this website for almost a month. The…

Checking for Uniques in SQL

sql

When checking my work in SQL one of the first things I do is confirm a column I expect to be unique is. Many tables have a unique key at the level they are at; for session level data it’s a…

Checking your Work

general

One of the most important abilities of an analyst is to be able…

Parsing Escaped Strings

python

Sometimes you may have to parse a string with backslash escapes; for example "this is a \"string\"". This is quite straightforward to parse with a state machine.
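
As a minimal sketch of such a state machine, assume the input is a double-quoted string in which a backslash makes the next character literal. The function name `parse_escaped` and the error handling are illustrative assumptions, not taken from the post.

```python
def parse_escaped(text: str) -> str:
    """Parse a double-quoted string with backslash escapes (illustrative sketch)."""
    if len(text) < 2 or text[0] != '"':
        raise ValueError("expected an opening double quote")
    out = []
    escaped = False  # state flag: was the previous character a backslash?
    for i, ch in enumerate(text[1:], start=1):
        if escaped:
            out.append(ch)  # take the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True  # switch to the "escaped" state
        elif ch == '"':
            # An unescaped quote must be the final character
            if i != len(text) - 1:
                raise ValueError("unexpected text after closing quote")
            return "".join(out)
        else:
            out.append(ch)
    raise ValueError("unterminated string")
```

The only state carried between characters is whether the previous character was an unescaped backslash, which is what makes this a two-state machine.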

Extracting Job Ads from Common Crawl

commoncrawl

jobs

I’ve been using data from the Adzuna Job Salary Predictions Kaggle Competition to extract skills, find near duplicate job ads and understand seniority of job titles. But the dataset has heavily processed ad text which makes it harder to do natural language processing on. Instead I’m going to…

Excel Completion Count

excel

I was recently running some simple, but tedious, annotation in Excel. While it’s not a good tool for complex annotation for a problem with a simple textual annotation where…

Common Crawl Index Athena

commoncrawl

data

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. There are petabytes of data archived so directly searching through them is very expensive and slow. To search…

Extracting Text, Metadata and Data from Common Crawl

commoncrawl

data

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. You can search the index to find where pages from a particular website are archived, but you still need a way to access the…

Searching 100 Billion Webpages With Capture Index

commoncrawl

data

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. Every month they use Apache Nutch to follow links across the web and download over a billion unique items to Amazon S3, and have data back to 2008. This is like what Google and Bing do to build their…

Understanding Job Ad Titles with Salary

jobs

data

Different industries have different ways of distinguishing seniority in a job title.…

Discovering Job Titles

jobs

nlp

A job ad title can contain a lot of things like location, skills or benefits. I want a list of just the job titles, without the rest of those things. This is a key piece of information…

Normalise Job Title Words

jobs

nlp

I’m trying to find job titles in job ads, but…

Heuristics for Active Open Source Project

programming

When evaluating whether to use an open source project I generally want to know how active the project is. A project doesn’t need to be active to be usable; mature and stable…

Making Words Singular

nlp

jobs

Trying to normalise text in job titles I need a way to convert plural words into their singular form. For example a job for “nurses” is about a “nurse”, a job for “salespeople” is about a “salesperson”, a job…

Rewriting A of B

nlp

When examining words in job titles I noticed that it was common to see titles written as “head of …” or “director of …”. This is unusual because most role titles go from specific to general (e.g. finance…

Mail merge to PDF Files

excel

A friend needed to generate a hundred contracts and their HR information system wasn’t working properly. I helped them implement a workaround solution by using mail merge to…

Minibatching in Python

python

Sometimes you have a long sequence you want to break into smaller sized chunks. This is generally because you want to use some downstream process that can only handle so…

Job Title Words

jobs

nlp

I found NER wasn’t the right tool for extracting job titles, and a frequency based approach is going to work better. The first step for this is to identify words that signify a job title, like “manager”…

Simple Metrics

data

I have a tendency to create really complex metrics. Sometimes when I’m analysing data I’ll need to transform the data to understand it. I often calculate the ratio of common…

Summary of Finding Near Duplicates in Job Ads

nlp

data

jobs

I’ve been trying to find near duplicate job ads in the Adzuna Job Salary Predictions Kaggle Competition. Job ads can be duplicated because a hirer posts the…

Finding Duplicate Companies with Cliques

jobs

nlp

data

We’ve found pairs of near duplicate texts in 400,000 job ads from the Adzuna Job Salary Predictions Kaggle Competition. We tried to extract groups of similar ads by finding connected components in the graph of similar ads. Unfortunately with a low threshold of similarity we ended up with a chain of ads that were each similar, but…

Market for Highschool Maths Textbooks

general

My first professional job was for Haese mathematics which is a small family-owned South Australian business that writes and publishes mathematics textbooks. Working for a small company was a really interesting experience, I…

Pain gain matrix for discussing approaches

general

Placing options on a scatterplot of costs versus benefits is a common practice for prioritising opportunities and solutions. The primary benefit of this approach is it can…

Spreadsheets as a Rough Annotation Tool

excel

data

I needed to design some heuristic thresholds for grouping together items. In my first attempt I iteratively tried to guess the thresholds by trying them on different…

Bridging Bipartite Graph

data

presto

athena

When you have behavioural data between actors and events you naturally get a bipartite graph. For example you can have the actors as customers and events as products that…

Clustering for Exploration

data

Suppose you’re running a website with tens of thousands of different products, and no satisfactory way to group them up. Even a mediocre clustering can…

Less Is Better

general

Today I was picking grapes from their vine for my partner’s grandmother. They had been left…

Using Github Actions with Hugo

programming

blog

I really like the idea of having a process triggered automatically when I push code. Github actions gives a way to do this with Github repositories, and this…

Project Estimation

general

Estimating projects is notoriously difficult, and the larger the project the harder to estimate. But even small pieces of work for a single person are easy to underestimate.…

Probability Jaccard

maths

I don’t like the Jaccard index for clustering because it doesn’t work well on sets of different sizes. Instead I find the concepts from Association Rule Learning (a.k.a. market basket analysis) very useful. It turns out Jaccard Similarity can be written in terms of these concepts so they…

Community detection in Graphs

data

People using a website or app will have different patterns of behaviours. It can be useful to cluster the customers or products to help understand the…

Serving Static Assets with Python Simple Server

python

I was trying to load a local file in a HTML page and got a Cross-Origin Request Blocked error in my browser. The solution was to start a Python web server with python…

Listening

When I’m in a comfortable environment I love to talk. This can…

Finding Common Substrings

data

nlp

jobs

I’ve found pairs of near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using Minhash. One thing that would be useful to know is what the common sections of the ads are. Typically if they have a…

Simple Models

data

My first instinct when dealing with a new problem is to try to find a complex technique to solve it. However I’ve almost always found it more useful to start with a simple…

Power of Easy

Something being easy makes a huge difference in how often it is used. Even small frictions can add up and make a task less desirable.

Cartesian Product in R and Python

python

r

You’ve got a couple of groups and you want to get every possible combination of them. This is called the Cartesian Product of the groups.…
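
For the Python side, a minimal sketch using the standard library's `itertools.product`; the `sizes` and `colours` lists are made-up example data, not from the post.

```python
from itertools import product

# Hypothetical example groups
sizes = ["small", "large"]
colours = ["red", "green", "blue"]

# Every (size, colour) combination; the rightmost group varies fastest
pairs = list(product(sizes, colours))

for size, colour in pairs:
    print(size, colour)
```

The number of combinations is the product of the group sizes, here 2 × 3 = 6.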

Beta Function

maths

The Beta Function comes up in the likelihood of the binomial distribution. Understanding its properties is useful for understanding the binomial distribution.

From Bernoulli to Binomial Distributions

data

maths

Suppose that you flip a fair coin 10 times, how many heads will you get? You’d think…

Minhash Sets

jobs

nlp

python

We’ve found pairs of near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using Minhash. But many pairs will be part of the same group; in an extreme case there could be a group of 5 job ads with identical texts which produces 10 pairs. Both for interpretability and usability it makes sense to extract…

Searching for Near Duplicates with Minhash

nlp

jobs

python

I’m trying to find near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. In the last article I built a collection of MinHashes of the 400,000 job ads in half an hour in a 200MB file. Now I need to efficiently search through these minhashes to find the near duplicates because brute force search…

Considering VS Code from Emacs

emacs

I’ve been using Emacs as my primary editor for around 5 years now (after 4 years of Vim). I’m very comfortable in it, having spent a long time configuring my init.el. But once in a while I’m slowed down by some strange issue, so I’m going to put aside my sunk configuration costs and have a look at…

Estimating Bias in a Coin with Bayes Rule

I wanted to work through an example of applying Bayes rule to update model parameters based on toy data. This example comes from Kruschke’s Doing Bayesian Data Analysis…

Detecting Near Duplicates with Minhash

nlp

jobs

python

I’m trying to find near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. I’ve found that the Jaccard index on n-grams is effective for finding these. Unfortunately it would take about 8 days to calculate the Jaccard index on all pairs of the 400,000 ads, and take about 640GB of memory to…

Lessons from a mathematician on building a community

maths

Mathematicians and software developers have a lot in common. They both build structures of ideas, typically working in small groups or alone, but leveraging structures built…

Clustering for Segmentation

data

Dealing with thousands of different items is difficult. When you’ve got a couple of dozen you can view them together, but as you get into the hundreds, thousands and beyond…

Representing Decision Trees on a grid

data

A decision tree is a series of conditional rules leading to an outcome. When stated as a chain of if-then-else rules it can be really…

Writing 50 Daily Articles

blog

I’ve been writing an…

Four Competencies of an Effective Analyst

data

Analysts tend to be natural problem solvers, good at reasoning and adept with numbers. But to know how to frame the problem and what to look for they need to understand the…

4am Rule for timeseries

data

When you’ve got a timeseries that doesn’t have a timezone attached to it the natural question is “what timezone is this data from?” Sometimes it’s UTC, sometimes it’s the…

Locating Addresses with G-NAF

data

A very useful open dataset the Australian Government provides is the Geocoded National Address File (G-NAF). This is a database mapping addresses to locations. This is really useful for applications that want to provide information or services based on someone’s location. For…

Pipetable to CSV

emacs

data

Sometimes I get out pipe tables in Emacs that I want to convert into a CSV to put somewhere else. This is really easy with regular expressions.

Binning data in SQL

sql

data

excel

Generally when combining datasets you want to join them on some key. But sometimes you really want a range lookup like Excel’s VLOOKUP. A common example is binning values; you want to group values into custom ranges. While you could do this with a…

A Mixture of Bernoullis is Bernoulli

maths

Suppose you are analysing email conversion through rates. People either follow the call to action or they don’t, so it’s a Bernoulli Distribution whose probability is the actual probability that a random person will follow the email’s call to action. But in actuality your email list will be made up of different groups; for example people who have…

Probability Squares

maths

A geometric way to represent combining two independent discrete random variables is as a probability square. On each side of the square we have the distributions of the…

Representing Interaction Networks

data

Behavioural data can illuminate the…

Excel Binning

excel

Putting numeric data into bins is a useful technique for summarising, especially for continuous…

Powershell Debugging with Write-Warning

programming

I had to debug some Powershell, without knowing anything about it. I found Write-Warning was the right tool for printline debugging. This was enough to resolve my issue.

Analysis Needs to Change A Decision

data

Any analysis where the results won’t change a decision is worthless. Before even thinking of getting any data it’s worth being clear on how it impacts the decision.

SQL Views for hiding business logic

sql

data

Near Duplicates with TF-IDF and Jaccard

nlp

jobs

python

I’ve looked at finding near duplicate job ads using the Jaccard index on n-grams. I wanted to see whether using the TF-IDF to weight the ads would result in a clearer separation. It works, but the results aren’t much better…

Near Duplicates with Jaccard

nlp

jobs

python

Finding near-duplicate texts is a hard problem, but the Jaccard index for n-grams is an effective measure that’s efficient on small sets. I’ve tried it on the Adzuna Job Salary Predictions…
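
As a rough sketch of the measure, assuming whitespace-tokenised text and word n-grams stored as sets: the helper names `ngrams` and `jaccard`, and the convention that two empty sets have similarity 1.0, are assumptions made for illustration rather than details from the post.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


# Made-up example texts
text_a = "the quick brown fox".split()
text_b = "the quick brown dog".split()
similarity = jaccard(ngrams(text_a, 2), ngrams(text_b, 2))
```

Here the two texts share 2 of the 4 distinct bigrams, so the similarity is 0.5.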

From Hugo to R Blogdown

r

blog

R blogdown gives a really easy way to post blogs containing evaluated R (and Python!) code chunks and plots. This…

Edit Distance

nlp

python

jobs

Edit distance, also known as Levenshtein Distance, is a useful way of measuring the similarity of two sequences. It counts the minimum number of substitutions, insertions and deletions you need to make to transform one…
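
One common way to compute this is the standard dynamic programming recurrence, keeping only the previous row of the table. This is a generic textbook sketch, not necessarily the implementation used in the post, and the name `edit_distance` is illustrative.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences using row-by-row dynamic programming."""
    # previous[j] is the distance between the first i-1 items of a and the first j items of b
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # transforming the first i items of a into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        previous = current
    return previous[-1]
```

For example, turning "kitten" into "sitting" takes two substitutions and one insertion, so the distance is 3.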

Using Emacs under WSL

emacs

wsl

Getting Emacs to work nicely on a Windows system can be a challenge. You can install it natively (although getting all the dependencies is a challenge), but many packages require libraries or utilities that are hard to install or don’t exist on Windows. The best…

The Problem with Jaccard for Clustering

data

maths

The Jaccard Index is a useful measure of similarity between two sets. It makes sense for any two sets, is efficient to compute at scale and its arithmetic complement is a…

Jaccard Shingle Inequality

maths

data

nlp

Two similar documents are likely to have many similar phrases relative to the number of words in the document. In particular if…

Finding Exact Duplicate Text

python

jobs

nlp

Finding exact duplicate texts is quite straightforward and fast in Python. This can be useful for removing duplicate entries in a dataset. I tried this on the Adzuna Job Salary Predictions…

Showing Side-by-Side Diffs in Jupyter

python

jupyter

data

nlp

jobs

Creating a Diff Recipe in Prodigy

nlp

jobs

python

data

All of Statistics

data

For anyone who wants to learn Statistics and has a maths or physics background I highly recommend Larry Wasserman’s All of Statistics. It covers a wide range of statistics with enough mathematical detail to really understand what’s going on, but not so much that the machinery is overwhelming. What I…

Remote social catchups are less intimate

Counting n-grams with Python and with Pandas

python

data

nlp

Sequences of words are useful for characterising text and for understanding text. If two texts have many similar sequences of 6 or 7 words it’s very likely they have a…

Waiting for System clock to synchronise

linux

When trying to install packages with apt on a new Ubuntu AWS EC2 instance I had issues where the signature would fail to verify. The reason was the system clock was far in the past and so it looked like the…

Not using NER for extracting Job Titles

nlp

data

jobs

python

I’ve been trying to use Named Entity Recogniser (NER) to extract job titles from the titles of job ads to better understand a collection of job ads. While NER is great, it’s not the…

Rules, Pipelines and Models

nlp

data

Over the past decade deep neural networks have revolutionised dealing with unstructured data. Problems like identifying what objects are in a video through gener…

Active NER with Prodigy Teach

nlp

python

jobs

Active learning…

Python Inequality Chaining

python

In Python the comparison a <= b == c < d does the mathematically correct thing. This is a handy notational trick.
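
A small illustration of what the chain expands to; the values and the helper `in_unit_interval` are made up for the example.

```python
a, b, c, d = 1, 2, 2, 3

# A chained comparison is evaluated as pairwise comparisons joined with `and`,
# with each operand evaluated at most once.
chained = a <= b == c < d
expanded = (a <= b) and (b == c) and (c < d)
assert chained == expanded


def in_unit_interval(x: float) -> bool:
    """Range check written the way it would be on paper: 0 <= x <= 1."""
    return 0 <= x <= 1
```

This is what makes range checks such as `0 <= x <= 1` read like the mathematical notation.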

Training a job title NER with Prodigy

nlp

annotation

data

In a couple of hours I trained a reasonable job title Named Entity Recogniser for job ad titles using Prodigy, with over 70% accuracy. While 70% doesn’t sound great it’s a bit ambiguous what a job title is, and getting exactly the bounds of the job title can be a hard…

Annotating Job Titles

nlp

data

jobs

When doing Named Entity Recognition it’s important to think about how to set up the problem. There’s a balance between what you’re trying to achieve and what the algorithm…

What’s in a Job Ad Title?

nlp

data

jobs

The job title should succinctly summarise what the role is about, so it should tell you a lot about the role. However in practice job titles can range from…

Disk Usage in Linux with du

linux

When your harddrive is filling up the du utility is a great way of…

Getting Started Debugging with pdb

python

When there’s something unexpected happening in your Python code the first thing you want to do is to get more information about what’s going wrong. While you can use print…

Calculating percentages in Presto

sql

presto

athena

One trick I use all the time is calculating percentages in SQL by dividing with the count. Percentages quickly tell me how much coverage I’ve got…

Moving Averages in SQL

sql

Moving averages can help smooth out the noise to reveal the underlying signal in a dataset. As they lag behind the actual signal they tradeoff timeliness for increased…

Getting most recent value in Presto with max_by

presto

athena

sql

Presto and the AWS managed alternative Amazon Athena have some powerful aggregation functions that can make writing SQL much…

Syncing Calendars and Contacts to Android with DAVx5

I find it really handy to have my calendar and contacts from my email client on my mobile phone. DAVx5 is a fantastic free (GPLv3) app to do this on Android. This lets me organise my life…

Don’t manage work email with Emacs

emacs

linux

I do a lot of work in Emacs and at the command line, and I…

Data Transformations in the Shell

data

linux

There are many great tools for filtering, transforming and aggregating data like SQL, R dplyr and Python Pandas (not to mention Excel). But sometimes when I’m working on a remote server I want to quickly extract some information from a file without switching to one of these…

Second most common value with Pandas

python

pandas

r

I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignment out of…

Property Based Testing - A thousand test cases in a single line

Property based testing lets you specify…

Using emacs dumb-jump with evil

emacs

programming

Dumb-jump is a…

Presto and Athena CLI in Emacs

I find having Emacs as a unified programming environment really useful. When writing an SQL pipeline I can iteratively develop my SQL in emacs, running it against the…

Fastai Callbacks as Lisp Advice

Creating state of the art deep learning algorithms often requires changing the details of the training process. Whether it’s scheduling hyperparameters, running on multiple…

94% confidence with 5 measurements

There are many things that are valuable to know in business but are hard to measure. For example the time from when a customer has a need to purchase, the number of related…

How to Display All Columns in R Jupyter

I like to do one-off analyses in R because tidyverse makes it really easy and beautiful. I also like to do them in Jupyter Notebooks because they…

Exporting data to Python with Amazon Athena

One necessary hurdle in doing data analysis or machine learning is loading the data. In many businesses larger datasets live in databases, in an object store (like Amazon…

Extracting Skills from Job Ads: Part 3 Conjugations

nlp

jobs

I’m trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition.

Extracting Skills from Job Ads: Part 2 - Adpositions

nlp

jobs

I’m trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition.

Extracting Skills from Job Ads: Part 1 - Noun Phrases

nlp

jobs

I’m trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition.…

Leading the Product 2019

I attended the excellent 2019 Leading the Product conference in Melbourne with around 500 other Product Managers and Enthusiasts. The conference had a broad range of great talks, a stimulating networking event where we…

Data Blockless: A better way to create data

Before you can do any machine learning you need to be able to read the data, create test and training splits and convert it into the right format. Fastai has a generic data…

Constant Models

When predicting outcomes using machine learning it’s always useful to have a baseline to compare results against. A simple baseline is the best constant model; that is a model that gives the same prediction for any input. This is a really simple check to perform against any dataset, and can be informative to check across…

A programmer using Excel

excel

When I was 15 I did a week of work experience with my neighbour, who was an agricultural economist running his own one person business. I’m still not really sure what an…

Spectra of atoms

Why is a sodium lamp yellow? How can we determine the elemental composition of the sun? How does a Helium-neon laser work?

Regular expressions, automata and monoids

In formal language theory the…

DVI by example

The Device Independent File Format (DVI) is the output format of Knuth’s TeX82; modern TeX engines (pdfTeX, luaTeX) output…

Algorithms for finding the real roots of polynomials

maths

Given a degree n polynomial over the…

Non-desarguesian projective planes

There are two main constructions of a projective space.

Geometry and topology of division rings

Following from my last post (and Veblen and Young’s Projective Geometry) consider a projective plane satisfying the axioms:

Geometry of division rings

maths

It is fairly easy to construct a geometry from algebra: given a division ring K we form an n-dimensional vector space, the points being the elements of the…

Some history of integration

This post is based mainly on a chapter in A Radical Approach to Lebesgue’s Theory of Integration by David Bressoud in which he explores the history of the Lebesgue integral. The story I will…

Linear representation of additive groups and the Fourier Transform: Part 1

In this article I will show that the cyclic group of order n, that is the set \(\{0,1,2,\ldots,n-1\}\) under addition modulo n motivates the discrete Fourier transform on a particular finite dimensional complex inner product space, and gives many of its properties. In a…

From polynomials to transcendental numbers

In a previous post I discussed finding the zeros of low degree polynomials; I…

Symmetry, Lie Algebras and Differential Equations Part 3

There is a deep relationship between the technique of separation of variables for solving partial differential equations and the symmetries of the…

Symmetry, Lie Algebras and Quantum Differential Equations Part 2

In this article I will apply the ideas from part 1 to the theory of…

Symmetry, Lie Algebras and Differential Equations Part 1

There is a deep relationship between being able to solve a differential equation and its symmetries. Much of the theory of second order linear differential equations is…

Do you really mean S¹?

This is a follow up post to my previous post on \(\mathbb{R}^n\). Mathematicians will often write \(S^1\) without being clear about the context and structure associated with it.

Do you really mean ℝⁿ?

In mathematics and physics it is common to talk about \(\mathbb{R}^n\) when really we mean something else that can be represented by \(\mathbb{R}^n\).

Tensor notation

Language affects the way you think, often subconsciously. The easier…

LaTeXing Multiple Equations

In mathematics and the (hard) sciences it’s important to be able to write documents with lots of…

Closure Operators

maths

Often in mathematics there is the idea of taking the closure of some elements under a particular operation. For instance if you have several vectors you may take the span of them, if you have a field and want to add the zeros of a…

Solving polynomials of degree 2,3 and 4

The point of computer algebra systems

I wanted to do the contour integral of \(\frac{1}{z-a}\) around the unit circle on a computer for kicks. So I parameterised it with \(e^{it}\) and…

Local Lie Groups and Hilbert’s Fifth Problem

Lie Groups are mathematically “very nice” structures – they are analytic manifolds (real or complex) with a group structure such that multiplication and inversion are continuous. They are deeply related to infinitesimal symmetries; a group…