## Probability Distributions Between the Mean and the Median

The normal distribution is used throughout statistics, partly because the Central Limit Theorem makes it occur in many applications, but also because it's computationally convenient. The expectation value of the normal distribution is the mean, which has many nice arithmetic properties, but has the drawback of being sensitive to outliers. When discussing constant models I noted that the minimiser of the Lᵖ error is a generalisation of the mean; for $$p = 2$$ it's the mean, for $$p = 1$$ it's the median, and for $$p = \infty$$ it's the midrange (halfway between the maximum and minimum points).

## Integrating Powers of Exponentials

When working with distributions that were powers of exponentials, of which the normal and exponential distributions are special cases, I had to calculate the integrals of exponentials. It's possible to transform these into expressions involving the Gamma function. Specifically I found that for all positive p: $\int_{0}^{\infty} x^m e^{-x^p}\, \rm{d}x = \frac{\Gamma\left(\frac{m+1}{p}\right)}{p}$. This is useful for calculating moments of powers of exponentials, namely for positive p and k:
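As a sanity check on the Gamma function identity stated above, here is a minimal numerical sketch. The function names, the midpoint rule, and the cut-off at `upper=40` are my own assumptions for illustration (they are not from the original post), and the truncation is only reasonable for p of about 1 or larger, where the tail beyond 40 is negligible.

```python
import math


def integral_power_exponential(m, p, upper=40.0, steps=100_000):
    """Midpoint-rule estimate of the integral of x**m * exp(-x**p) on [0, upper].

    Assumes the tail beyond `upper` is negligible, which holds for p >= 1.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += x**m * math.exp(-(x**p))
    return total * h


def gamma_formula(m, p):
    """Closed form from the text: Gamma((m + 1) / p) / p."""
    return math.gamma((m + 1) / p) / p


# Compare the two for a few (m, p) pairs
for m, p in [(0, 1), (2, 1), (0, 2), (1, 4)]:
    numeric = integral_power_exponential(m, p)
    exact = gamma_formula(m, p)
    print(f"m={m}, p={p}: numeric={numeric:.8f}, formula={exact:.8f}")
```

For m = 0 and p = 2 the formula reduces to the familiar half-Gaussian integral, √π / 2.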

data

## Metrics for Binary Classification

When evaluating a binary classifier (e.g. will this user convert?) the most obvious metric is accuracy: the probability that a randomly chosen prediction is correct. One issue with this metric is that if 90% of the cases are one class, a high accuracy isn't really impressive; you need to contrast it with a constant model predicting the most frequent class. More subtly, it's not a very sensitive measure; by measuring the cross-entropy of predicted probabilities you get a much better idea of how well your model is working.
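To make the contrast concrete, here is a small sketch with made-up toy data (the labels and predicted probabilities are purely illustrative assumptions). Two models have the same accuracy as the constant majority-class baseline, yet cross-entropy separates them.

```python
import math


def accuracy(labels, probabilities, threshold=0.5):
    """Fraction of predictions that match the label after thresholding."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    return sum(int(y == y_hat) for y, y_hat in zip(labels, predictions)) / len(labels)


def majority_baseline_accuracy(labels):
    """Accuracy of a constant model that always predicts the most frequent class."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def cross_entropy(labels, probabilities, eps=1e-12):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


# Toy data: 90% of cases are the negative class
labels = [0] * 9 + [1]
confident = [0.1] * 9 + [0.4]    # low probabilities on the negatives
hesitant = [0.45] * 9 + [0.49]   # barely below the threshold everywhere

print(majority_baseline_accuracy(labels))
print(accuracy(labels, confident), accuracy(labels, hesitant))
print(cross_entropy(labels, confident), cross_entropy(labels, hesitant))
```

Both models score the same accuracy as the baseline, but the confident model has a much lower cross-entropy because its probabilities are closer to the truth.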

data

## Building a Reputation in Data Science

As a professional your reputation is very important to your career success. To get people to offer you work, to pay for your advice or to buy products from you they need to trust that you will deliver them value. The most common heuristic for this is your reputation; what other people say about you, what you have done and what certifications you have. It's important that your reputation is very specific to the kind of work you want others to buy.

jupyter

## Jupyter Notebooks as Logs for Batch Processes

When creating a batch process you typically add logging statements so that when something goes wrong you can more quickly debug the issue. Then when something goes wrong you either try to fix and rerun it, or otherwise run the process in a debugger to get more information. For many tasks Jupyter Notebooks are better for these kinds of batch processes. Jupyter Notebooks allow you to write your code sequentially as you usually would in a batch script; importing libraries, running functions and having assertions.

jupyter

## Jupyter Notebook Preamble

Whenever I use Jupyter Notebooks for analysis I tend to set a bunch of options at the top of every file to make them more pleasant to use. Here they are for Python and R with IRKernel. Python: # Automatically reload code from dependencies when running cells # This is indispensable when importing code you are actively modifying. %load_ext autoreload %autoreload 2 # I almost always use pandas and numpy import pandas as pd import numpy as np # Set the maximum rows to display in a dataframe pd.

sql

## Offline SQL Formatting with sqlformat

It's polite to format your SQL before you share it around. You want to be able to do it in context, and not upload your private SQL to some random website. The sqlformat command of the Python package sqlparse is a great tool for the job. You can install sqlformat in Debian derivatives such as Ubuntu with sudo apt install sqlformat. Alternatively, on any system with Python you can install it via pip install sqlparse; just make sure you have the binary in your path (e.

python

## Flattening Nested Objects in Python

Sometimes I have a nested object of dictionaries and lists, frequently from a JSON object, that I need to deal with in Python. Often I want to load this into a Pandas dataframe, but accessing and mutating a dictionary column is a pain, with a whole bunch of expressions like .apply(lambda x: x[0]['a']['b']). A simple way to handle this is to flatten the objects before I put them into the dataframe, and then I can access them directly.
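A minimal sketch of what such flattening could look like is below. The function name `flatten`, the dotted key format, and the choice to index lists by position are assumptions for illustration, not necessarily the approach the full post takes.

```python
def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts and lists into a single dict with dotted keys.

    List items are keyed by their position. Empty dicts and lists produce
    no keys at all, which may or may not be what you want.
    """
    if isinstance(obj, dict):
        children = obj.items()
    elif isinstance(obj, list):
        children = enumerate(obj)
    else:
        # A leaf value: store it under the path built so far
        return {parent_key: obj}

    items = {}
    for key, value in children:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        items.update(flatten(value, new_key, sep))
    return items


record = {"id": 7, "user": {"name": "Ada", "tags": ["x", "y"]}}
print(flatten(record))
```

Each flattened record is a plain dictionary of scalars, so a list of them can be handed to a dataframe constructor and the columns accessed directly.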

python

## Extracting Fields from JSON with a Python DSL

Indexing into nested objects of dictionaries and lists in Python is painful. I commonly come up against this when reading JSON objects, and often fields can be omitted. I haven't found a solution to this and so I've invented a tiny DSL to do this. It works like this: d = [{'a': [{'b': 'c'}, {'d': ['e']}]}] assert extract(d, '0.a.1.d.0') == d[0]['a'][1]['d'][0] assert extract(d, '1.a.1.d.0') == None You can specify a path into an object, separated by periods, and it will extract it returning None if that path doesn't exist.
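The behaviour described above can be sketched in a few lines. This is my own hypothetical implementation of `extract`, written only to satisfy the two assertions in the text; the author's actual version may differ (for example in how it treats dictionaries with integer keys).

```python
def extract(obj, path, sep="."):
    """Follow a dotted path through nested lists and dicts.

    Returns None as soon as any step of the path is missing.
    """
    current = obj
    for part in path.split(sep):
        if isinstance(current, list):
            try:
                current = current[int(part)]
            except (ValueError, IndexError):
                # Not an integer index, or the index is out of range
                return None
        elif isinstance(current, dict):
            if part not in current:
                return None
            current = current[part]
        else:
            # Reached a leaf value but the path continues
            return None
    return current


d = [{'a': [{'b': 'c'}, {'d': ['e']}]}]
print(extract(d, '0.a.1.d.0'))
print(extract(d, '1.a.1.d.0'))
```

One limitation of this sketch is that path parts are always strings for dictionaries, so a dictionary keyed by integers cannot be addressed.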

nbdev

## Getting Started with nbdev

Nbdev is a tool to make it possible to develop Python libraries in Jupyter notebooks. At first I found this idea scary, but after watching the talk "I Like Notebooks" and seeing how it works I think it's got the best of all worlds. It lets you put code, documentation, examples and tests all together in context and provides tooling to extract the code into an installable library, run the tests and produce great hyperlinked documentation.

nlp

## Language Models as Classifiers

Probabilistic language models can be used directly as a classifier. I'm not sure if this is a good idea; in particular it seems less efficient than building a classifier, but it's an interesting idea. A language model can give the probability of a given text under the model. Suppose we have multiple language models each trained on a distinct corpus representing a class (e.g. genre or author, or even sentiment). Then we can calculate the probability of the text under each model and compare them to pick the class.
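Here is a toy sketch of the idea, assuming the simplest possible language model: a unigram model with add-one smoothing, and equal prior probability for each class. The class names, training sentences, and the `UnigramModel` and `classify` names are all illustrative assumptions; a real language model would be far richer.

```python
import math
from collections import Counter


class UnigramModel:
    """A unigram language model with add-one smoothing."""

    def __init__(self, texts):
        self.counts = Counter(word for text in texts for word in text.lower().split())
        self.total = sum(self.counts.values())

    def log_probability(self, text, vocabulary_size):
        """Log probability of the text, treating words as independent."""
        return sum(
            math.log((self.counts[word] + 1) / (self.total + vocabulary_size))
            for word in text.lower().split()
        )


def classify(text, models):
    """Pick the class whose model gives the text the highest probability.

    Assumes equal priors for every class.
    """
    vocabulary = set().union(*(model.counts for model in models.values()))
    vocabulary_size = len(vocabulary) + 1  # one extra slot for unseen words
    return max(
        models,
        key=lambda label: models[label].log_probability(text, vocabulary_size),
    )


models = {
    "positive": UnigramModel(["what a great film", "great acting and a great story"]),
    "negative": UnigramModel(["what a terrible film", "terrible acting and a dull story"]),
}
print(classify("a great story", models))
print(classify("a dull film", models))
```

With unequal class frequencies you would add the log prior of each class to its score before comparing.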

nlp

## Sentence Boundaries in N-gram Language Models

An N-gram language model guesses the next possible word by looking at how frequently it has previously occurred after the previous N-1 words. I think this is how my mobile phone suggests completions of text; if I type "I am" it suggests "glad", "not" or "very" which are likely occurrences. To make everything add up you have to have special markers for the start and end of the sentence, and I think the best way is to make them the same marker.
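A minimal bigram (N = 2) sketch of this is below, using one shared marker for both the start and the end of each sentence. The marker string `<s>`, the function names and the three training sentences are assumptions for illustration.

```python
from collections import Counter, defaultdict

BOUNDARY = "<s>"  # hypothetical marker used for both sentence start and end


def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [BOUNDARY] + sentence.split() + [BOUNDARY]
        for previous, current in zip(tokens, tokens[1:]):
            counts[previous][current] += 1
    return counts


def next_word_probabilities(counts, previous):
    """Relative frequency of each word seen after `previous`."""
    total = sum(counts[previous].values())
    return {word: n / total for word, n in counts[previous].items()}


counts = train_bigram(["I am glad", "I am not", "I am very glad"])
print(next_word_probabilities(counts, "am"))
print(next_word_probabilities(counts, BOUNDARY))
```

Because every sentence both starts and ends with the same marker, the number of transitions out of the marker equals the number of transitions into it, so the counts balance like those of any other word.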

programming

## Hurdles in Contributing to Open Source

Often in programming it's not the code itself that is hard, it's all the environment and systems around it. I found that today when trying to contribute to an open source repository. Today I was working on some code and using the excellent data-science-types to type check some Pandas code with mypy. But for some reason I was getting a weird error when reading with read_feather some data I just wrote with to_feather, and so I switched my to_feather to be to_pickle which doesn't do as much conversion.

books

## The Gulag Archipelago: Audiobook Review

The Gulag Archipelago is a singular piece of literature about the horrors of arbitrary arrest, inhumane interrogations, prolonged imprisonments, deadly work camps and exile that impacted tens of millions of people in the first half of the twentieth century. Aleksandr Solzhenitsyn has a way of conveying these horrors in a truly compelling way; somehow the descriptions of torture are close and detailed enough to be vivid, varied enough to be somewhat comprehensive, yet not repetitive nor gratuitous.

general

## Rod Crewther: 23/09/1945 - 17/12/2020

Today I learned that Dr Rodney James Crewther, known affectionately by colleagues and students as Rod, died last week. I only knew Rod through my education at the University of Adelaide in the late noughties, where he was a senior lecturer in the Physics department specialising in particle physics, but he had a huge impact on me. My thoughts are with those close to him in this difficult time. Rod truly cared about educating students.

general

## Moral Justification

Once a problem becomes moral, the acceptable solution space collapses. I'm reading Sylvia Nasar's book Grand Pursuit, and she talks about how Malthus' An Essay on the Principle of Population made the argument that as the poor classes gained more money they would always reproduce in greater numbers until they reached their previous state of poverty. This and other contemporary works of political economy posed these kinds of arguments as natural laws.

jobs

## Normalising Salary

Salary ranges come in many forms; how can we convert them to a common form? A first approximation is to annualise them; it ignores the difference between full-time, part-time, and temporary work. The other question is how to pick a point in the range, for jobs with a very large range. I started with the minimum because the maximum is often an aspirational number (especially in commission sales roles). The way I approached this was:

books

## The Righteous Mind: Book Review

Jonathan Haidt's The Righteous Mind: Why Good People are Divided by Politics and Religion is about the moral norms of groups. As someone not familiar with moral psychology I found the book discussed many interesting ideas I wasn't aware of, but didn't provide much evidence for Haidt's own theories and claims. The book is reasonably well written with good structure and some excellent metaphors, but sometimes goes on unnecessarily long detours into the author's personal life and the repetition can be wearing.

nlp

## Language Through Prism

Neural language models, which have advanced the state of the art for Natural Language Processing by a huge leap over previous methods, represent the individual tokens as a sequence of vectors. This sequence of vectors can be thought of explicitly as a discrete time varying signal in each dimension, and you could decompose this signal into low frequency components, representing the information at the document level, and high frequency components, representing information at the token level and discarding higher level information.

programming

## Open Source Licenses for Data Processing Code

When a program primarily sources and transforms data then copyleft licenses add very little protection over other open source licenses. Because of this I've licensed my open data processing code as MIT because more complex licenses would prevent other people from using it, without adding much sharing. There are three main license types that are used in Open Source; MIT, Apache and GPL (with BSD family somewhere between MIT and Apache).

python

## Fixing repr errors in Jupyter Notebooks

When running the Kaggle API method dataset_list_files in a Jupyter notebook I got an error about __repr__ returning a non-string. At first I thought the function was broken, but then I realised it was just how it was displaying in Jupyter that was breaking because the issues were all in IPython: --------------------------------------------------------------------------- TypeError Traceback (most recent call last) IPython/core/formatters.py in __call__(self, obj) 700 type_pprinters=self.type_printers, 701 deferred_pprinters=self.deferred_printers) --> 702 printer.pretty(obj) 703 printer.

general

## Life Optimisation

When all you've got is the hammer of mathematics, everything looks like an optimisation problem; you just need to choose the right objective function. So what should the objective function be for life? People today in general have much more means and much more freedom (i.e. fewer constraints in the solution space) than many of their ancestors, so what should we optimise? A study from Daniel Kahneman and the economist Angus Deaton says High income improves evaluation of life but not emotional well being.

programming

## What Is a Better Programming Approach?

When you solve a problem in code you will use some programming approach, and the approach you choose can make a big impact on your efficiency. I talk about approach rather than language because it's more than just the language. A project will typically only use a subset of the language (especially for massive languages like C++), some set of libraries, and develop patterns in the language for working with those libraries.

python

## Pip Can Now Resolve Dependencies

Something that has always bothered me about pip in Python is that you would get errors about inconsistent packages. Things still seemed to work surprisingly often, but it meant that the order you installed packages in could lead to very different results (one ordering may cause your tests to fail even when another succeeds). Now there is a new resolver in pip 20.3 that checks the dependencies and tries to find versions that meet all constraints.

python

## Why use Tox for Python Libraries

I have been surprised how hard it is to maintain an internal library in Python. There are constantly issues for end users where something doesn't work. It turns out one feature used was introduced in Python 3.8, but someone was stuck on Python 3.6. Changes to Pandas and PyArrow meant some combinations of those libraries broke. It's really hard to build confidence in your system when lots of people end up with breakages.

psychology

## Mere Exposure

The Mere Exposure Effect says that people will prefer something they've seen or heard multiple times before over something less familiar. The Robert Zajonc paper Attitudinal Effects of Mere Exposure covers this topic comprehensively. It includes experiments showing people like foreign words, Chinese characters, photographs and nonsense words more the more often they have been exposed to them. There are other aspects that impact affect more than mere exposure, such as people preferring pronounceable words or photographs of smiling people, but exposure almost always helps.

programming

## Managing Python Versions with asdf

I was recently trying to run a pipenv script, but it gave an error that it required Python 3.7 which wasn't installed. Unfortunately I was on Ubuntu 20.04 which has Python 3.8 as default, and no access to earlier versions in the repositories. However pipenv gave a useful hint; pyenv and asdf not found. The asdf tool allows you to configure multiple versions of applications in common interactive shells (Bash, Zsh, and Fish).

maths

## Cosine Similarity is Euclidean Distance

In mathematics it's surprising how often something that's obvious (or trivial) to one person can be revolutionary (or weeks of work) to someone else. I was looking at the annoy (Approximate Nearest Neighbours, Oh Yeah) library and saw this comment: "Cosine distance is equivalent to Euclidean distance of normalized vectors". I hadn't realised it at all, but once the claim was made I could immediately verify it. Given two vectors u and v their distance is given by the length of the vector between them: $$d = \| u - v \| = \sqrt{(u - v) \cdot (u - v)}$$.
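The claim can also be checked numerically. Expanding the dot product for unit vectors gives $$\| u - v \|^2 = 2 - 2\, u \cdot v$$, so the squared Euclidean distance between normalised vectors is twice one minus the cosine similarity. The sketch below verifies this on two arbitrary example vectors (the vectors and function names are my own choices for illustration).

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Length of the vector between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


u = [1.0, 2.0, 3.0]
v = [-2.0, 0.5, 4.0]

squared_distance = euclidean_distance(normalise(u), normalise(v)) ** 2
from_cosine = 2 * (1 - cosine_similarity(u, v))
print(squared_distance, from_cosine)
```

Since the distance is a decreasing function of the cosine similarity, ranking neighbours by either measure gives the same order.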

programming

## Git: One VCS to Rule Them All

When I started as a professional developer there were a number of competing version control systems. However Git seems to have almost entirely won this battle. One of the most popular centralised version control systems is Subversion (SVN), which was largely an improvement of Concurrent Versions System (CVS). But Distributed Version Control Systems, starting with Git, became really popular. With a centralised system you have to lock files on the central server when editing and unlock them when you're finished, to make sure no one else interferes with your work.

programming

## Using find and xargs

Sometimes you want to feed a bunch of files to a program, and this is often easily done with find and xargs. Suppose you have an executable doit that you want to execute on all Python files in src/; you can do this directly with find: find src/ -name '*.py' -exec doit {} \; You can use xargs for this as well; but if there's a chance that a path could contain a space somewhere it's best to use -print0 with find and -0 with xargs to separate all arguments with nulls (rather than spaces):

excel

## Templates for Excel Charts

Sometimes I work with Excel and need to make attractive looking charts. The default charts look awful and this is often a time consuming exercise. I can spend hours recreating the same formatting in different charts by pointing and clicking. But there's a better way to do it: using templates. Once you've created a chart that you're happy with, right click and select "Save as Template..." and give it a clear name, so when you go to select it you immediately know what it's for.

maths

## Centroid of Points on the Surface of a Sphere

I’ve written a derivation of how to find the centroid of a polygon on a sphere. This post shows it explicitly in numerical computations, and also looks at the solution in Spherical Averages and Applications to Spherical Splines and Interpolation, by Buss and Fillmore, ACM Transactions on Graphics 20, 95–126 (2001). Explicitly coding mathematics is a great exercise; having to concretely represent everything unearthed gaps in my understanding and found errors in both drafts of my derivation and the paper.

productivity

## Automation through Documentation

You join a new team and your first task is to run the monthly batch process. It transforms some data, trains a business critical model, and outputs some reporting. Your coworker who is leaving the team talks you through the process and what you need to do. The problem is that it's a bunch of scripts and ad hoc SQL that breaks all the time and has to be manually patched over.

data

## Glassbox Machine Learning

Can we have an interpretable model that has as good performance as blackbox models like gradient boosted trees and neural networks? In a 2020 Empirical Methods in Natural Language Processing keynote, Rich Caruana says yes. He calls interpretable models glassbox machine learning, in contrast to blackbox machine learning. These are models in which a person can explicitly see how they work, and follow the steps from inputs to outputs. This interpretability is subtly different from explainable (explainable to who?

sql

## Finding Hugo Blogs with BigQuery

I want to find examples of other Hugo blogs, but they're not really easy to search for. Unless someone put "Hugo" in the description (which is actually common) there are no real defining files. However there's a set of files that are in a lot of Hugo blogs and we can search them in GitHub with the GH Archive BigQuery Export. The strategy is that most Hugo blogs will contain a /themes folder, a /content folder, a /static folder and a config.

programming

## Composing Functions

R core looks like it's getting a new pipe operator |> for composing functions. It's just like the existing magrittr pipe %>%, but has been implemented as a syntax transformation so that it is more computationally efficient and produces better stack traces. The pipe means instead of writing f(g(h(x))) you can write x |> h |> g |> f, which can be really handy when changing dataframes. Python's Pandas library doesn't have this kind of convenience and it opens up a class of error that won't happen in that R code.

maths

## Centroid Spherical Polygon

You're organising a conference of operations research analysts from all over the world, but their time is very valuable and they only agree to meet if you minimise the average distance they need to travel (even if they have to have it on a boat in the middle of the ocean). Where do you put the conference? Let's model the world as a unit sphere in 3 dimensional space, and have the N people at cartesian coordinates $$\{ p_i\}_{i=1}^{N}$$.

data

## Centroid for Cosine Similarity

Cosine similarity is often used as a similarity measure in machine learning. Suppose you have a group of points (like a cluster); you want to represent the group by a single point, the centroid. Then you can talk about how well formed the group is by the average distance of points from the centroid, or compare it to other centroids. Surprisingly it's not much more complex than finding the geometric centre in Euclidean space, if you pick the right coordinate system.
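One way to make this concrete, assuming "centroid" means the unit vector with the highest average cosine similarity to the points: normalise every point, sum the unit vectors, and normalise the result. This is a sketch under that assumption (the function names and example points are mine), and it breaks down when the unit vectors cancel out to the zero vector.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine_centroid(points):
    """Unit vector maximising the average cosine similarity to the points.

    Raises ZeroDivisionError if the normalised points sum to zero.
    """
    units = [normalise(p) for p in points]
    summed = [sum(coordinates) for coordinates in zip(*units)]
    return normalise(summed)


def average_cosine(candidate, points):
    """Mean cosine similarity between a candidate vector and the points."""
    c = normalise(candidate)
    return sum(
        sum(a * b for a, b in zip(c, normalise(p))) for p in points
    ) / len(points)


points = [[1.0, 0.0], [0.0, 2.0]]
centroid = cosine_centroid(points)
print(centroid, average_cosine(centroid, points))
```

Note that the lengths of the input points do not matter here: only their directions contribute, which is the main difference from the ordinary Euclidean mean.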

general

## Checking With Calculation

I recently convinced myself that I was right and a published paper was wrong. I did all the calculus by hand and decided they must have missed something. I resolved it by calculation. I played some very basic examples in my head and thought I was right, but I couldn't be sure. So I quickly implemented a concrete example case in Python using numpy. This helped me resolve pretty quickly that I was wrong; some searching produced a counterexample.

data

## Using Behaviour to Understand Items

When people access products online their behaviour gives lots of information about both the people and the products. This information deeply enriches understanding of how to better serve your customers, how your products are related to each other and can help answer deeper questions about them. However you need to find a way to unlock the information. Using behavioural information can greatly improve modelling on the tabular data in your database.

general

## AlphaFold: Predicting protein shape from its composition

The Critical Assessment of protein Structure Prediction (CASP) runs every two years to predict the shape of proteins, the building blocks of life, from their sequences of amino acids. We know the shape of a bunch (around 170,000) of proteins from techniques like X-ray crystallography and Magnetic Resonance Imaging, but it's a big experimental job to actually measure this. However we know the sequence of millions of proteins due to cheap DNA sequencing and the DNA to protein translation.

python

## Chaining with Pandas Pipe function

I often use method chaining in pandas, although certain problems like calculating the second most common value are hard. A really good solution to adding custom functionality in a chain is the Pandas pipe function. For example to raise a column to the 3rd power with numpy you could use np.power(df['x'], 3). But another way with pipe is: df['x'].pipe(np.power, 3). Note that you can pass any positional or keyword arguments and they'll get passed along.

python

## Type Checking Beautiful Soup

Static type checking in Python can quickly verify whether your code is open to certain bugs. But it only works if it knows the types of external libraries. I've already introduced how to add type stubs for libraries without type annotations. But what if we have a complex library like BeautifulSoup that uses a lot of recursion and magic methods, and operates on unknown data? With some small changes to your code you can make it typecheck with BeautifulSoup.

general

## Truly Independent Thinkers

I was reading Paul Graham's How to Think For Yourself where he talks about independent-mindedness. His examples are scientists, investors, startup founders and essayists as professions where you can't do well without thinking differently from your peers. While there's some useful commentary in the article about how to broaden your worldview, I don't think a really independent minded person would do well in those professions. To succeed in science you need to study something people are interested in.

data

## Structuring a Project Like a Kaggle Competition

Analytics projects are messy. It's rarely clear at the start how to frame the business problem, whether a given approach will actually work, and if you can get it adopted by your partners. However once you have a framing the modelling part can be iterated on quickly by structuring the project like a Kaggle Competition. The modelling part of analytics projects will go smoothly only if you have clear evaluation criteria.

python

## Typechecking with a Python Library That Has No Type Hints

Type hints in Python allow statically verifying the code is correct, with tools like mypy, efficiently eliminating a whole class of bugs. However sometimes you get the message found module but no type hints or library stubs, because that library doesn't have any type information. It's easy to work around this by adding type stubs. When you see this error it's worth first checking that there aren't any types already available.

programming

## Code Structure Reflecting Function

I've been trying to extract job ads from Common Crawl. However I've been stuck on how to structure the code. Thinking through the relationships really helped me do this. The architecture of the pipeline is a set of methods that fetch source data, extract the structured data and normalise it into a common form to be combined. I previously had these methods all written in one large file, adding each extractor to a dictionary, which was a headache to look at.

jupyter

## Setting the Icon in Jupyter Notebooks

I often have way too many Jupyter notebook tabs open and I have to distinguish them by the first couple of letters of the notebook name next to the orange Jupyter book icon. What if we could change the icons to visually distinguish different notebooks? I thought I found a really easy way to set the icon in Jupyter notebooks... but it works in Firefox and not Chrome. I'll go through the easy solution and the hard solution that works in more browsers.

python

## Retrying Python Requests

The computer networks that make up the internet are complex and handle an immense amount of traffic. So sometimes when you make a request it will fail intermittently, and you want to retry until it succeeds. This is easy in requests using urllib3 Retry. I was trying to download data from Common Crawl's S3 exports, but occasionally the process would fail due to a network or server error. My process would keep the successful downloads using an AtomicFileWriter, but I'd have to restart the process.
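A minimal sketch of wiring urllib3's `Retry` into a requests session is below. The helper name `make_retrying_session`, the retry count, the backoff factor and the list of status codes are my own illustrative assumptions rather than the settings used in the original post, and the third-party `requests` package needs to be installed.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=5, backoff_factor=1.0):
    """Build a requests session that retries failed requests.

    The status codes below are an assumption: common transient server errors.
    """
    retry = Retry(
        total=total,                      # maximum number of retries
        backoff_factor=backoff_factor,    # exponentially growing sleep between tries
        status_forcelist=[500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Use the retrying adapter for both URL schemes
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


session = make_retrying_session()
# session.get(url) would now be retried on connection errors and on the
# listed status codes before an exception is raised.
```

Mounting the adapter on the session means every request made through it shares the same retry policy, so the download loop itself does not need any retry logic.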

python

## Decorating Pandas Tables

When looking at Pandas dataframes in a Jupyter notebook it can be hard to find what you're looking for in a big mess of numbers. Something that can help is formatting the numbers, making them shorter and using graphics to highlight points of interest. Using Pandas style you can make the story of your dataframe stand out in a Jupyter notebook, and even export the styling to Excel. The Pandas style documentation gives pretty clear examples of how to use it.

jobs

## A First Cut of Job Extraction

I've finally built a first iteration of a job extraction pipeline in my job-advert-analysis repository. There's nothing in there that I haven't written about, but it's simply doing the work to bring it all together. I'm really happy to have a full pipeline that extracts lots of interesting features to analyse, and is easy to extend. I've already talked about how to extract jobs from Common Crawl and the architecture for extracting the data.

programming

## Which /bin/sh

I tried to run a shell script and got this error: set: Illegal option -o pipefail. I had a quick look and the first line was #!/bin/sh; the -o pipefail option isn't valid across POSIX shells so I would expect that to fail. More specifically on modern Ubuntu /bin/sh is dash which doesn't support these bash-like constructions. But /bin/sh is very different on different systems; on some it is bash, on others it's ash (from which dash is derived), and on others it's ksh or something else.

programming

## Operating a Tower of Hacks

Remember after you run the update process to run the fix script on the production database. But run it twice because it only fixes some of the rows the first time. Oh, and don't use the old importer tool in the import directory, use the one in the scripts directory now. You already used the old one? It's ok, just manually alter the production database with this gnarly query. Ah right, I see the filler table it uses is corrupted, let's just copy it from a backup.

general

## Endurance Counting

Counting is a strangely powerful tool for enduring through something. Standard advice when you're angry is to count to ten. When stretching, counting to a target number helps sustain the stretch longer. A good counting based technique for endurance is box breathing. It involves repeatedly inhaling to a count of 4, holding to a count of 4, exhaling to a count of 4 and holding to a count of 4. This is a technique used by Navy SEALs to induce calm and focus.

tools

## Moving Away From KeePass

A password manager is one of the best ways for the majority of people to keep their logins secure. I used KeePass and its derivatives for years, but the Kee Firefox Addon dropped support for KeePass and it's now less convenient to use. After looking at the alternatives I'm going to switch to an online alternative. One of the most frequent ways people get their accounts hacked is by password reuse. Their email and password are revealed in some online breach of a website, and then these credentials can be used on other websites.

analytics

## Finding Analytics in Melbourne

My first job in analytics was in large part luck. I had an academic background in Physics and Mathematics, some professional programming experience building applications and self-studied computer science. I searched for "python" jobs, since I liked the language, and applied for a job titled something like "Awk, Bash and Grep". I didn't get that job, but was forwarded on to the data engineering team building bespoke reports. That was at a medium size company called Hitwise that provided digital competitive insights.

books

## Devil Take The Hindmost: Book Summary

Edward Chancellor's Devil Take the Hindmost: A History of Financial Speculation is a history of several market bubbles and crashes. It covers bubbles such as the South Sea Bubble, the 1920s bubble in the US stock market preceding the Great Depression, the dotcom bubble of the 1990s and Japan in the 1980s. The main lessons I took were that if a market sounds too good to be true it probably is, that highly leveraged financial instruments tend to prolong and worsen bubbles, and that often the people who bear the cost of reckless speculation are different to the people who take and profit from it.

general

## Don't Not Avoid Being Indirect

When I say something that is hard to hear I say it in a complex way. As if saying it in a hard to understand way will soften the blow. But it just muddies the message and causes confusion. Instead of saying "I don't want the soup", I say something like "It's not that I don't like your soup; I just don't feel like it right now. I mean it's not my favourite soup and I wouldn't be unhappy having it.

insight

## Australian Deathographics

I've recently tried to estimate Australian Deaths using life expectancy. This failed badly and I think the reason is demographics; this article looks more into this. The Australian Bureau of Statistics has population by age, and the Australian Institute of Health and Welfare have Mortality Over Time and Regions (MORT) which summarises the current probability of death by age range. Here is a super summarised version of this data:

| Age | Population | Death Rate | Population Deaths | Fraction of Deaths |
| --- | --- | --- | --- | --- |
| 0-19 | 25% | 0% | 0% | 0% |
| 20-39 | 29% | 0% | 0% | 0% |
| 40-59 | 25% | 0% | 0% | 0% |
| 60-79 | 17% | 1.

insight

## Checking Australian Oil Imports

I've estimated Australian oil imports; here I check the data to see how reasonable my estimates are. The overall tree diagram for the estimate is below (each node is followed by the node it feeds into):

- Oil imports: 1.3 Million Barrels/Day
- Oil imports: 200 ML/Day → Oil imports (Barrels/Day)
- Size of Barrel: 160 L → Oil imports (Barrels/Day), labelled -1
- Oil consumed L/Day: 200 ML/Day → Oil imports (L/Day)
- Oil Imported / Consumed: 1 → Oil imports (L/Day)
- Oil Consumed by Cars: 100 ML/Day → Oil consumed L/Day
- Oil Consumed in Total / Oil Consumed by Cars: 2 → Oil consumed L/Day
- Number of Cars: 20 Million → Oil Consumed by Cars
- Oil Consumed by Car: 5 L/Day → Oil Consumed by Cars
- Number of People: 25 Million → Number of Cars
- Number of Cars per Person: 0.
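The arithmetic in the estimate tree can be replayed in a few lines. This sketch starts from the 20 million cars figure, because the cars per person value is cut off in the excerpt; the variable names are my own.

```python
# Inputs taken from the estimate tree
cars = 20e6                   # number of cars
litres_per_car_per_day = 5    # oil consumed by each car
total_to_car_ratio = 2        # total oil consumed / oil consumed by cars
import_ratio = 1              # oil imported / oil consumed
litres_per_barrel = 160       # size of a barrel

car_consumption = cars * litres_per_car_per_day            # litres/day used by cars
total_consumption = car_consumption * total_to_car_ratio   # litres/day overall
imports_litres = total_consumption * import_ratio          # litres/day imported
imports_barrels = imports_litres / litres_per_barrel       # barrels/day imported

print(f"{car_consumption / 1e6:.0f} ML/day consumed by cars")
print(f"{imports_litres / 1e6:.0f} ML/day imported")
print(f"{imports_barrels / 1e6:.2f} million barrels/day imported")
```

The last line gives 1.25 million barrels per day, which is consistent with the 1.3 million quoted at the top of the tree after rounding.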

general

## Redundancy on Phone Power Button

My 5 year old OnePlus One's power button has finally worn out, to the point where I can't press it. I panicked when the battery ran out - I was afraid I wouldn't be able to power it back on. However I found a video demonstrating how to turn it on with a power cable and the volume button. Pressing the volume down button when you attach the power cable to a computer puts it into recovery mode and you can boot it from there.

insight

## How many People in Australia Die?

How many people die in Australia each year? The life expectancy in Australia is about 80 years, and the population is 25 million. So each year the number of people that die would be about 25 million divided by 80, which is about 300,000. The actual number of people that died in 2018 is 160,000. This is about half my estimate; what am I doing wrong? One factor is that life expectancy is measured at birth; the longer people live the longer they will be expected to live.
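
The arithmetic behind the estimate and its miss is short enough to write out. A minimal sketch using only the figures quoted above (the variable names are mine):

```python
population = 25_000_000      # people in Australia
life_expectancy = 80         # years
actual_deaths_2018 = 160_000

# Crude steady-state estimate: everyone lives exactly the life expectancy.
estimated_deaths = population / life_expectancy
ratio = actual_deaths_2018 / estimated_deaths

print(f"estimate: {estimated_deaths:,.0f} deaths per year")
print(f"actual / estimate: {ratio:.2f}")
```

The estimate comes out at 312,500, so the actual figure is roughly half of it.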

insight

## Checking Australian Births Estimates

I estimated the number of Australian births as 250,000. The actual number of births, according to the Australian Institute of Family Studies, is around 310,000. Where did I go wrong? My estimate was 25 million times 0.8 children per person lifetime divided by a lifetime of 80 years. The actual total fertility rate is 1.74 per woman, giving a birth rate of around half this, 0.87 per person, which is significantly higher than I estimated.
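
To see how much of the gap the fertility rate explains, the same formula can be rerun with both rates. A small sketch; the function name `annual_births` is hypothetical and the inputs are the figures quoted above.

```python
def annual_births(population, children_per_person, lifetime_years):
    """Crude estimate: lifetime births spread evenly over a lifetime."""
    return population * children_per_person / lifetime_years

original = annual_births(25_000_000, 0.8, 80)    # my first guess
revised = annual_births(25_000_000, 0.87, 80)    # half the total fertility rate
actual = 310_000

print(f"original: {original:,.0f}")
print(f"revised:  {revised:,.0f}")
print(f"remaining gap: {actual - revised:,.0f}")
```

Swapping in 0.87 lifts the estimate to about 272,000, which closes only part of the gap to 310,000.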

insight

## Australian Births

How many babies are born in Australia? Australia has 25 million people. I would estimate the birth rate is 0.8 children per person; I think it's slightly less than one. These children are born across life, which is about 80 years. So a really crude estimate for annual births is 25 million people times (0.8 children per person lifetime) divided by 80 years per lifetime. This is 20 million divided by 80 which is 250 thousand.

sicp

## SICP Exercise 1.5

Exercise from SICP: Exercise 1.5. Ben Bitdiddle has invented a test to determine whether the interpreter he is faced with is using applicative-order evaluation or normal-order evaluation. He defines the following two procedures: `(define (p) (p))` and `(define (test x y) (if (= x 0) 0 y))`. Then he evaluates the expression `(test 0 (p))`. What behavior will Ben observe with an interpreter that uses applicative-order evaluation? What behavior will he observe with an interpreter that uses normal-order evaluation?
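
The two evaluation orders can be imitated in Python, which is itself applicative-order. This is a sketch, not a Scheme interpreter: normal order is simulated by passing zero-argument lambdas (thunks), and the names `check_eager` and `check_lazy` are my own. Where Scheme would loop forever, Python hits its recursion limit instead.

```python
def p():
    return p()  # never produces a value

def check_eager(x, y):
    # Applicative order: both arguments were evaluated before the call.
    return 0 if x == 0 else y

def check_lazy(x, y):
    # Simulated normal order: arguments are thunks, forced only when needed.
    return 0 if x() == 0 else y()

try:
    check_eager(0, p())          # p() is evaluated first and never finishes
    eager_outcome = "returned"
except RecursionError:
    eager_outcome = "never returns"

lazy_outcome = check_lazy(lambda: 0, lambda: p())  # p is never forced

print(eager_outcome, lazy_outcome)
```

Under applicative order the call never completes, while under normal order the unused argument is never evaluated and the result is 0.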

sicp

## SICP Exercise 1.4

Exercise from SICP: Exercise 1.4. Observe that our model of evaluation allows for combinations whose operators are compound expressions. Use this observation to describe the behavior of the following procedure: `(define (a-plus-abs-b a b) ((if (> b 0) + -) a b))` Solution: There are two possible branches. If b is positive then we get `((if (> b 0) + -) a b)`, then `((if #t + -) a b)`, then `(+ a b)`. Whereas if b is non-positive we get
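
The same trick of computing the operator before applying it carries over to Python, where functions are also ordinary values. A minimal sketch using the standard `operator` module:

```python
import operator

def a_plus_abs_b(a, b):
    # The conditional expression picks the function; the call then applies it.
    return (operator.add if b > 0 else operator.sub)(a, b)

print(a_plus_abs_b(1, 2))    # b positive: 1 + 2
print(a_plus_abs_b(1, -2))   # b non-positive: 1 - (-2)
```

Both calls give 3, since the procedure computes a plus the absolute value of b.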

sicp

## SICP Exercise 1.3

Exercise from SICP: Exercise 1.3. Define a function that takes three numbers as arguments and returns the sum of the squares of the two larger numbers. Solution: The first thing we need to do is to get the largest two numbers from 3 numbers. We can do this with a conditional statement. `(define (sum-square-largest-two a b c) (cond ((and (<= a b) (<= a c)) (sum-of-squares b c)) ((and (<= b a) (<= b c)) (sum-of-squares a c)) ((and (<= c a) (<= c b)) (sum-of-squares a b))))`
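
For comparison, here is a Python sketch that mirrors the `cond` above branch by branch: find which argument is the smallest and square-sum the other two. The Python names are my own translation.

```python
def sum_of_squares(x, y):
    return x * x + y * y

def sum_square_largest_two(a, b, c):
    if a <= b and a <= c:        # a is (one of) the smallest
        return sum_of_squares(b, c)
    elif b <= a and b <= c:      # b is (one of) the smallest
        return sum_of_squares(a, c)
    else:                        # otherwise c is the smallest
        return sum_of_squares(a, b)

print(sum_square_largest_two(1, 2, 3))
```

Using `<=` rather than `<` matters: with ties such as (3, 3, 3) a strict comparison would match no branch.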

sicp

## SICP Exercise 1.2

Exercise from SICP: Exercise 1.2. Translate the following expression into prefix form $\frac{5 + 4 + (2 - (3 - (6 + \frac{1}{5})))}{3(6-2)(2-7)}$ Solution One way to do this is to read it from the outside in and translate it into a tree (for example the first thing we extract is the division). graph BT; DIV[/] -- ANS TOP[.] -- ANS[.] BOTTOM[.] -- ANS TOPSUM[+] -- TOP S1[5] -- TOP S2[4] -- TOP S3[.
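
Prefix form is just nested function application, so the translation can be checked with ordinary function calls. This sketch uses the expression exactly as printed above; the helpers `add`, `sub`, `mul` and `div` are illustrative stand-ins for Scheme's `+`, `-`, `*` and `/`, and `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction
from functools import reduce
import operator

def add(*xs):
    return sum(xs, Fraction(0))

def sub(a, b):
    return a - b

def mul(*xs):
    return reduce(operator.mul, xs, Fraction(1))

def div(a, b):
    return Fraction(a) / Fraction(b)

# Read from the outside in: the division comes first, then each operand.
numerator = add(5, 4, sub(2, sub(3, add(6, div(1, 5)))))
denominator = mul(3, sub(6, 2), sub(2, 7))
result = div(numerator, denominator)

print(numerator, denominator, result)
```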

sicp

## SICP Exercise 1.1

Exercise from SICP: Exercise 1.1. Below is a sequence of expressions. What is the result printed by the interpreter in response to each expression? Assume that the sequence is to be evaluated in the order in which it is presented. 10 (+ 5 3 4) (- 9 1) (/ 6 2) (+ (* 2 4) (- 4 6)) (define a 3) (define b (+ a 1)) (+ a b (* a b)) (= a b) (if (and (> b a) (< b (* a b))) b a) (cond ((= a 4) 6) ((= b 4) (+ 6 7 a)) (else 25)) (+ 2 (if (> b a) b a)) (* (cond ((> a b) a) ((< a b) b) (else -1)) (+ a 1)) Solution We can step through these using the substitution model with environment.

insight

## Mixing Warm Water

I used to have a fancy kettle that came with settings for heating water to different temperatures between 80° C and 100° C. However it's really easy to get water at any temperature using an ordinary kettle by mixing with refrigerated water. When you mix together two volumes of water at different temperatures their volumes add and the resulting temperature is a volume weighted average of the temperatures. For example if you take 25mL of water at 10° C and 75mL of water at 40° C you will get 100mL of water at 32.
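
The volume weighted average is a one-liner to compute. A minimal sketch; the function name `mix` is hypothetical, and it ignores heat lost to the container and the air.

```python
def mix(portions):
    """portions: list of (volume_ml, temperature_c) pairs."""
    total_volume = sum(volume for volume, _ in portions)
    weighted = sum(volume * temp for volume, temp in portions)
    return total_volume, weighted / total_volume

# The example from the text: 25mL at 10° C mixed with 75mL at 40° C.
volume, temperature = mix([(25, 10), (75, 40)])
print(volume, temperature)
```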

insight

python

## Double emphasis error in html2text

I'm trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. I've already resolved an issue with multiple types of emphasis. However HTML in the wild has all sorts of weird edge cases that the library has trouble with. In this case I found a term that was emphasised twice: `<strong><strong>word</strong></strong>`. I'm pretty sure for a browser this is just the same as doing it once; `<strong>word</strong>`.

python

## An edge bug in html2text

I've been trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. But I quickly hit an edge case where it fails, because parsing HTML is surprisingly difficult. I was parsing some HTML that looked like this: `Some text.<br /><i><b>Title</b></i><br />...` When I ran html2text it produced an output like this:

maths

## Symmetry in probability

The simplest way to model probability of a system is through symmetry. For example the concept of a "fair" coin means there are two possible outcomes that are indistinguishable. Because each result is equally likely the outcome is 50/50 heads or tails. Similarly for a fair die there are 6 possible outcomes, that are all equally likely. This means they each have the probability 1/6. The idea of symmetry is behind random sampling.

maths

## Sunk Cost of Pure Mathematics

Today I went through the painful exercise of culling my notebooks. My honours notebooks, independent research and work from textbooks and courses. These are things I spent a large part of my early life and energy on. Even though I haven't looked at them for years they are very hard to let go. A large amount of the material is pure mathematics. Notes on differential geometry, topology, and measure theory. These are particularly vexing because I don't believe they hold much real value.

writing

## Writing Blog Posts with Jupyter and Hugo

It can be convenient to directly publish a mixture of prose, source code and graphs. It ensures the published code actually runs and makes it much easier to rerun at a later point. I’ve done this before in Hugo with R Blogdown, and now I’m experimenting with Jupyter notebooks. The best available option seems to be nb2hugo which converts the notebook to markdown, keeping the front matter and exporting the images.

general

## Searching within a Website

Some websites, like this one, have a lot of content but have no search function. Others have search but it performs poorly; for example Bunnings has great category pages but the search never hits them. Fortunately there's a simple way to search these sites with the `site:` search operator. If I want to search for articles about jobs just in this website I can type `site:skeptric.com job` into either Google or Bing.

nlp

## NLP Learning Resources in 2020

There are a lot of great freely available resources in NLP right now, and the field is moving quickly with the recent success of neural models. I wanted to mention a few that look interesting to me. Jurafsky and Martin's Speech and Language Processing: The third edition is a free ebook in progress that covers a lot of the basic ideas in NLP. It's got a great reputation in the NLP community and is nearly complete now.

communication

## Speaking Quota

I often find listening more productive than talking, but still find it easy to spend a lot of meetings talking. When I get curious I ask lots of questions in a meeting that can take it off on a tangent, especially switching from high level to detail. If you find yourself in a similar situation give yourself a small speaking quota. I got the idea from a former management consultant, who when he was a junior was told he was only allowed to say one thing in a meeting.

general

## Being Patient with People

I'm sitting in a meeting listening to an update. They've missed the point, and they're focussing on the wrong thing. I start to get frustrated; why are they so far off track? Why haven't they taken the time to understand the problem? This isn't a helpful reaction; getting short tempered won't help resolve the problem. I haven't taken the time to understand the speaker and their perspective. Why do they think this is the right thing to focus on?

nlp

## Don't Stop Pretraining

In the past two years the best performing NLP models have been based on transformer models trained on an enormous corpus of text. By understanding how language in general works they are much more effective at detecting sentiment, classifying documents, answering questions and translating documents. However in any particular case we are solving a particular task in a certain domain. Can we get a better performing model by further training the language model on the specific domain or task?

nlp

## Tangled up in BLEU

How can we evaluate how good a machine generated translation is? We could get bilingual readers to score the translation, and average their scores. However this is expensive and time consuming. If we need hours of human time to evaluate an experiment, evaluation becomes a bottleneck for experimentation. This motivates automatic metrics for evaluating machine translation. One of the oldest examples is the BiLingual Evaluation Understudy (BLEU).

blog

## Hugo Casper 2 to 3

I've been wanting to upgrade my version of Hugo, but the Casper 2 theme I was using didn't support it. A first step in this transition is to use Casper 3. It looks similar to my old theme, is easy to set up, but seems to be missing some features. I cloned the repository, and changed the theme in my `config.toml` to `theme = "hugo-casper3"`. The article images weren't showing because the Casper 3 theme uses `feature_image` instead of `image` and requires a leading slash in the path (which was optional in 2).

wsl

## Running an X server with WSL2

I've recently started working with WSL2 on my Windows machine, but have had trouble getting an X server to run. This is an issue for me because when running Emacs with Evil keybindings under Windows Terminal I often find there's a lag in registering the escape key, which leads to some confusing issues (but vanilla Vim is fine). Having an X server would also allow running any Linux graphical application under X.

data

## Metrics you can Drive

Tracking a metric can help to drive dramatic improvements. When your team is focused on a metric you can test what has impact and quickly optimise it. However for this to work it's important that the metric is something you can actually impact. When people start looking for a metric to track they want to look for things that have a direct impact on the business, such as revenue, share price or customer satisfaction.

git

## Customising Portable Dotfiles

I keep my personal configuration files in a public dotfiles repository. This means that whenever I'm on a new machine it's very easy to get comfortable in a new environment. However I find I often need machine specific configuration, so I provide ways to override them with local configuration. When I get to a new machine I'll pretty quickly want some of my usual configuration (although I don't need it). I can clone or download a zipfile of my dotfiles and then install it via some symlinks via a bootstrap bash script.

git

## Git Folder Identities

Sometimes you want a different git configuration in different contexts. For example you might want different author information, or to exclude files for only some kinds of projects, or to have a specific template for certain kinds of projects. The easiest way to do this consistently is with an `includeIf` statement. For example to have custom options for any git repository under a folder called apache add this to the bottom of your ~/.

python

## Raising Exceptions in Python Futures

Python's `concurrent.futures` is a handy way of dealing with asynchronous execution. However if you're not careful it will swallow your exceptions, leading to difficult to debug errors. While you can perform concurrent downloads with multiprocessing it means starting up multiple processes and sending data between them as pickles. One problem with this is that you can't pickle some kinds of objects and often have to refactor your code to use multiprocessing.
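
The swallowing behaviour is easy to reproduce: an exception raised inside a submitted task is stored on its future and only re-raised when `result()` is called. A minimal sketch with a hypothetical `risky` task:

```python
from concurrent.futures import ThreadPoolExecutor

def risky(x):
    if x == 2:
        raise ValueError(f"cannot process {x}")
    return x * 10

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(risky, x) for x in range(4)]
# Nothing has been raised yet, even though one task failed.

results, errors = [], []
for future in futures:
    try:
        results.append(future.result())  # re-raises the task's exception here
    except ValueError as err:
        errors.append(str(err))

print(results, errors)
```

If `result()` (or `exception()`) is never called on a future, its failure passes silently, so it is worth collecting every result even for tasks run only for their side effects.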

wsl

## Getting Started with WSL2

I've finally started trying out Windows Subsystem for Linux version 2. Compared with WSL1 it's much faster because it runs in a virtual machine rather than translating syscalls, but it is slower when working on Windows filesystems. The speed up is significant when launching processes and dealing with small files, and git and Python virtualenvs are an order of magnitude faster. I'm still working through some of the issues of transferring.

general

## Targeting my brand

My friend has four different magnets for plumbers on his fridge. Three of them are generic rectangular magnets that have generic information and contact details. One of them was in the shape of a dripping tap, mentioning they were experts in leaks and drips. If they had a leaking faucet it's pretty easy to guess which plumber they would call; the specialists in dripping taps. On the other hand if they had a clogged toilet it's down to chance which of the plumbers they would call, although they're less likely to call the dripping tap specialist they're also more likely to forget to look at the fridge and just search for a plumber online.

general

## Embrace, Extend and Extinguish

In the 90s Microsoft famously used a strategy of embracing other protocols, then adding extensions to their implementation until it was no longer compatible, and utilising their market leverage to extinguish competing implementations. While "EEE" is normally associated with Microsoft many of the software titans use it as an effective strategy to further their existing dominance into new markets. Embracing a technology with an existing market is an effective way to quickly gain adoption.

general

## Mace-Bearer

The University of Adelaide, being a sandstone Group of Eight University, has the archaic ceremony of a mace-bearer leading the procession carrying a heavy piece of expensive metal. When I graduated with my Bachelor of Science I was fortunate enough to be that mace-bearer. Unfortunately I wasn't really prepared for the formality. The ceremony was on a typical Adelaide summer's day, hot and dry. I was going out to lunch with my parents afterwards, so I wanted to make sure I was comfortable.

data

## Data Models

Information is useful in that it helps make better decisions. This is much easier if the data is represented in a way that closely matches the conceptual model of the business. Building a useful view of the data can dramatically decrease the time and cost of answering questions and even elevate the conversation to answering deeper questions about the business. A typical example of where analysis can help is trying to increase revenue of a digitally sold product.

sql

## Filling Gaps in SQL

It's common for there to be gaps or missing values in an SQL table. For example you may have daily traffic by source, but on some low volume days around Christmas there are no values in the low traffic sources. Missing values can really complicate some calculations like moving averages, and sometimes you need a way of filling them in. This is straightforward with a cross join. You need all the possible variables you're filling in, and the value to fill.
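
As a rough illustration of the cross join approach, here is a sketch run through Python's built-in `sqlite3`. The `traffic` table, its columns and the fill value of 0 are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE traffic (date TEXT, source TEXT, visits INTEGER);
INSERT INTO traffic VALUES
  ('2020-12-24', 'search', 10),
  ('2020-12-24', 'email', 2),
  ('2020-12-25', 'search', 4);   -- no email row on Christmas day
""")

query = """
SELECT d.date, s.source, COALESCE(t.visits, 0) AS visits
FROM (SELECT DISTINCT date FROM traffic) AS d
CROSS JOIN (SELECT DISTINCT source FROM traffic) AS s   -- every date/source pair
LEFT JOIN traffic AS t
  ON t.date = d.date AND t.source = s.source            -- attach real values
ORDER BY d.date, s.source
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The cross join builds the full grid of combinations, the left join brings in the values that exist, and `COALESCE` supplies the fill value for the gaps. If a date is missing for every source it will not appear in the distinct list, so a separate calendar table would be needed in that case.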

general

## Directions of Delegation

For any actionable item there are four ways to handle it: do it, defer it, delegate it or delete it. Delegation is an often overlooked powerful option to handle things. It's not just for high powered executives to delegate down to their personal assistants; even if you don't have any reports it's possible to delegate. You can delegate in three directions: down, sideways and up. Downwards delegation is the classic kind that comes to most people's minds.

data

## A Checklist for NLP models

When training machine learning models typically you get a training dataset for fitting the model and a test dataset for evaluating the model (on small datasets techniques like cross-validation are common). You typically assume the performance on your chosen metric on the test dataset is the best way of judging the model. However it's really easy for systematic biases or leakage to creep into the datasets, meaning that your evaluation will differ significantly to real world usage.

data

## Deep Neural Networks as a Building Block

Deep Neural Networks have transformed dealing with unstructured data like images and text, making totally new things possible. However they are difficult to train, require a large amount of relevant training data, are hard to interpret, hard to debug and hard to refine. I think for these reasons there's a lot of space to use neural networks as a building block for extracting structured data for less parameterised models. Josh Tenenbaum gave an excellent keynote at ACL 2020 titled Cognitive and computational building blocks for more human-like language in machines.

maths

## Mean Value Theorem

I remember three things from lectures in my first year of university. One is a chemistry professor acting out the three vibrational modes of water; his head being the Oxygen atom and his hands being the Hydrogen atoms. Another is Rod Crewther demonstrating torque by showing how difficult it is to open a heavy lecture door by pushing at the hinge. The third is how Nick Buchdahl illustrated the mean value theorem.

data

## Sequential Weak Labelling for NER

The traditional way to train an NER model on a new domain is to annotate a whole bunch of data. Techniques like active learning can speed this up, but neural models with random weights especially require a ton of data. A more modern approach is to take a large pretrained NER model and fine tune it on your dataset. This is the approach of AdaptaBERT (paper), using BERT. However this takes a large amount of GPU compute and finicky regularisation techniques to get right.

python

## Stanza for NLP

Working with unstructured text is much easier if we add structure to it. Stanza is a state of the art library for doing this in over 60 languages. Given some text it will tokenize, sentencize, tag parts of speech and morphological features, parse syntactic dependencies and in a few languages perform NER. It's easy to use and gets extremely good results on benchmarks for each of these tasks on a large number of languages.

python

## pyBART: Better Dependencies for Information Extraction

Dependency trees are a remarkably powerful tool for information extraction. Neural based taggers are very good and Universal Dependencies means the approach can be used for almost any language (although the rules are language specific). However syntax can get really strange requiring increasingly complex rules to extract information. The pyBART system solves this by rewriting the rules to be half a step closer to semantics than syntax. I've seen that dependency based rules are useful for extracting skills from noun phrases and adpositions.

data

## Using HTML in NLP

Many documents available on the web have meaningful markup. Headers, paragraph breaks, links, emphasis and lists all change the meaning of the text. A common way to deal with HTML documents in NLP is to strip away all the markup, e.g. with Beautiful Soup's .get_text. This is fine for a bag of words approach, but for more structured text extraction or language model this seems like throwing away a lot of information.

python

## Demjson for parsing tricky Javascript Objects

Modern Javascript web frameworks often embed the data used to render each webpage in the HTML. This means an easy way of extracting data is capturing the string representation of the object with a pushdown automaton and then parsing it. Python's inbuilt `json.loads` is effective but won't handle very dynamic Javascript, whereas demjson will. The problem shows up when using `json.loads` as the following obscure error: `json.decoder.JSONDecodeError: Expecting value: line N column M (char X)`. Looking near that character in my case I see that it is a JavaScript `undefined`, which is not valid in JSON.
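
The error is easy to reproduce with the standard library alone, and the exception's `pos` attribute points at the offending text. A minimal sketch with a made-up one-key object (demjson itself is not used here):

```python
import json

doc = '{"a": undefined}'   # valid as a JavaScript object, invalid as JSON

try:
    json.loads(doc)
    message, offending = None, None
except json.JSONDecodeError as err:
    message = err.msg
    # err.pos is the character offset where decoding failed.
    offending = doc[err.pos:err.pos + len("undefined")]

print(message, repr(offending))
```

Slicing the source string around `err.pos` is a quick way to see which JavaScript construct tripped up the parser.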

python

## Tips for Extracting Data with Beautiful Soup

Beautiful Soup can be a useful library for extracting information from HTML. Unfortunately there are a lot of little issues I hit working with it to extract data from a careers webpage using Common Crawl. The library is still useful enough to work with, but the issues make me want to look at alternatives like lxml (via html5-parser). The source data can be obtained at the end of the article. Use a good HTML parser: Python has an inbuilt html.

python

## Saving Requests and Responses in WARC

When fetching large amounts of data from the internet a best practice is caching all the data. While it might seem easy to extract just the information you need, it's easy to hit edge cases or changing structure, and you can never use the data you throw away. This is easy to do in the Web ARChive (WARC) format with warcio used by the Internet Archive and Common Crawl. from warcio.

python

## Only write file on success

When writing data pipelines it can be useful to cache intermediate results to recover more quickly from failures. However if a corrupt or incomplete file was written then you could end up caching that broken file. The solution is simple; only write the file on success. A strategy for this is to write to some temporary file, and then move the temporary file on completion. I've wrapped this in a Python context manager called AtomicFileWriter which can be used in a with statement in place of open:
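
The post's own `AtomicFileWriter` is not shown in this excerpt. A minimal sketch of the same write-then-move idea, under the hypothetical name `atomic_write`, might look like this:

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_write(path, mode="w"):
    """Write to a temporary file, moving it to `path` only on success."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final move stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, mode) as handle:
            yield handle
        os.replace(tmp_path, path)   # only reached if the body succeeded
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)      # discard the partial file
        raise

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "result.txt")
    with atomic_write(target) as f:
        f.write("complete output")
    with open(target) as f:
        written = f.read()
```

If the body of the `with` block raises, the target path is never created, so a later run cannot mistake a half-written file for a cached result.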

general

## Diverge then Converge

It's very useful to diverge on ideas before converging on a solution. Trying to do both at the same time tends to stifle creativity and lead to less innovative solutions. I find the creative process of brainstorming is more effective if I do it separately to refining ideas. Taking the time to brainstorm leads to better solutions, whether thinking about what to work on, planning out a presentation or designing a technical solution.

data

Downloading files can often be a bottleneck in a data pipeline because network I/O is slow. A really simple way to handle this is to run multiple downloads in parallel across threads. While it's possible to deal with the unused CPU cycles using asynchronous processing, in Python it's generally easier to throw more threads at it. Using multiprocessing can be very simple if you can make the processing occur in a pure function or object method, and both the variables and results are picklable.
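
As a rough sketch of the threaded approach, `multiprocessing.pool.ThreadPool` offers the same `map` interface as a process pool but runs the work in threads. The `fake_download` function and the example.com URLs are placeholders; a short sleep stands in for waiting on the network.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_download(url):
    time.sleep(0.05)          # stand-in for waiting on network I/O
    return url, len(url)      # stand-in for the downloaded payload

urls = [f"https://example.com/file{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPool(processes=4) as pool:
    results = pool.map(fake_download, urls)   # results keep the input order
elapsed = time.perf_counter() - start

print(f"{len(results)} downloads in {elapsed:.2f}s")
```

Because the threads spend their time waiting rather than computing, four of them should finish the eight simulated downloads in roughly a quarter of the sequential time.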

data

## Processing RDF nquads with grep

I am trying to extract Australian Job Postings from Web Data Commons which extracts structured data from Common Crawl. I previously came up with a SPARQL query to extract the Australian jobs from the domain, country and currency. Unfortunately it's quite slow, but we can speed it up dramatically by replacing it with a similar script in grep. With a short grep script we can get twenty thousand Australian Job Postings with metadata from 16 million lines of compressed nquad in 30 seconds on my laptop.

data

## Coarse Geocoding

Sometimes you have some description of a location and want to work out where it is. This is called geocoding; if you just want to know what state or country it's in it's called coarse geocoding. I found that while many structured JobPostings contain a country some have it as a description rather than a country code, and some put the location in other fields. We can often find the country using geocoding.

jobs

## Extracting Australian Job Postings with SPARQL

I am trying to extract Australian Job Postings from Web Data Commons which extracts structured data from Common Crawl. I have previously written scripts to read in the graphs, explore JobPosting schema and analyse the schema using SPARQL. Now we can use these to find some Australian Job Postings in the data. For this analysis I used 15,000 pages containing job postings with different domains from the 2019 Web Data Commons Extract.

python

## Analytics Web Data Commons with SPARQL

I am trying to understand how the JobPosting schema is used in Web Data Commons structured data extracts from Common Crawl. I wrote a lot of ad hoc Python to get usage statistics on JobPosting. However SPARQL is a tool that makes it much easier to answer these kinds of questions. After reading in the graphs individually they can be combined into a rdflib.Dataset so we can query them all together.

python

I've been using RDFLib to parse Job posts extracted from Common Crawl. RDF Literals: It automatically parses XML Schema Datatypes into Python datastructures, but doesn't handle the `<http://schema.org/Date>` datatype that commonly occurs in JSON-LD. It's easy to add with the `rdflib.term.bind` command, but this kind of global binding could lead to problems. When RDFLib parses a literal it will create a `rdflib.term.Literal` object and the value field will contain the Python type if it can be successfully converted, otherwise it will be None.

jobs

## Schemas for JobPostings in Practice

A job posting has a description, a company, sometimes a salary, ... and what else? Schema.org have a detailed JobPosting schema, but it's not immediately obvious what is important and how to use it. However the Web Data Commons have extracted JobPostings from hundreds of thousands of webpages from Common Crawl. By parsing the data we can see how these are actually used in practice which will help show what is actually useful in describing a job posting.

python

## Converting RDF to Dictionary

The Web Data Commons has a vast repository of structured RDF Data about local businesses, hostels, job postings, products and many other things from the internet. Unfortunately it's not in a format that's easy to do analysis on. We can stream the nquad format to get RDFLib Graphs, but we still need to convert the data into a form we can do analysis on. We'll do this by turning the relations into dictionaries of properties to the list of objects they contain.
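
The shape of that conversion can be sketched without RDFLib by treating each relation as a plain (subject, predicate, object) tuple. The triples and the function name `triples_to_dicts` below are made up for illustration; real data would come from the parsed graphs.

```python
from collections import defaultdict

# Hypothetical triples standing in for relations read from a graph.
triples = [
    ("_:job1", "http://schema.org/title", "Data Analyst"),
    ("_:job1", "http://schema.org/skills", "SQL"),
    ("_:job1", "http://schema.org/skills", "Python"),
    ("_:job2", "http://schema.org/title", "Plumber"),
]

def triples_to_dicts(triples):
    """Group triples into {subject: {property: [objects, ...]}}."""
    subjects = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        subjects[subject][predicate].append(obj)
    return {subject: dict(props) for subject, props in subjects.items()}

jobs = triples_to_dicts(triples)
print(jobs["_:job1"])
```

Every property maps to a list, even when it has a single value, because RDF allows any property to repeat.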

data

The Web Data Commons extracts structured RDF Data from about one monthly Common Crawl per year. These contain a vast amount of structured information about local businesses, hostels, job postings, products and many other things from the internet. Python's RDFLib can read the n-quad format the data is stored in, but by default requires reading all of the millions to billions of relations into memory. However it's possible to process this data in a streaming fashion allowing it to be processed much faster.

programming

## Scheduling Github Actions

I use Github actions to publish daily articles via Hugo. I had set it up to publish on push, but sometimes I future date articles to have a backlog. This means that they won't be published until my next commit or manual publish action. To fix this I've set up a scheduled action to run just after 8am in UTC+10 (close to my timezone in Melbourne, Australia) every day. By default Hugo will not publish articles with a future date, so it's easy to keep a backlog by setting the date in front matter to a future date.

programming

## Using Local Github Actions

I've been using Github Actions to publish this website for almost a month. The experience has been great; whenever I push a commit it gets consistently published without me thinking about it within minutes. However I have one concern; I'm passing my rsync credentials into an external action. I've specified a tag in my yaml uses: wei/rclone@v1, but it would be easy for the author to move this tag to another commit that sends my private credentials to their personal server.

sql

## Checking for Uniques in SQL

When checking my work in SQL one of the first things I do is confirm that a column I expect to be unique actually is. Many tables have a unique key at the level they are at; for session level data it's a session id, for user level data it's a user_id or for daily data it's a date. It's generally a useful thing to check because all it takes is one bad join to end up with a bunch of duplicate (or dropped) rows.
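
Two common ways to run this check are comparing the row count with the distinct count, and listing the keys that occur more than once. A sketch using Python's built-in `sqlite3`, with a made-up `sessions` table in which a bad join has duplicated one session:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (session_id TEXT, user_id TEXT);
INSERT INTO sessions VALUES ('s1', 'u1'), ('s2', 'u1'), ('s2', 'u2');
""")

# If the column is unique these two counts are equal.
total, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT session_id) FROM sessions"
).fetchone()

# List the offending keys to start debugging.
duplicates = conn.execute("""
    SELECT session_id, COUNT(*) AS n
    FROM sessions
    GROUP BY session_id
    HAVING COUNT(*) > 1
""").fetchall()

print(total, distinct, duplicates)
```

Note that `COUNT(DISTINCT ...)` ignores NULLs, so a key column containing NULLs needs a separate check.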

general

One of the most important abilities of an analyst is to be able to check your work. It's really easy to get incorrect data, have issues in data processing, or even misunderstand what the output means. But if your work is valuable enough to change a decision it's worth doing whatever you can to check it's right. When you get to the end of a long analysis it seems like a time to relax and be glad the hard work is over.

python

## Parsing Escaped Strings

Sometimes you may have to parse a string with backslash escapes; for example "this is a \"string\"". This is quite straightforward to parse with a state machine. The idea of a state machine is that the action we need to take will change depending on what we have already consumed. This can be used for proper regular expressions (without special things like lookahead), and the ANTLR4 parser generator can maintain a stack of "modes" that can be used similarly.
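
A minimal sketch of such a state machine for double-quoted strings follows. The state names and the function `parse_escaped` are my own, and for simplicity an escaped character is taken literally rather than interpreting sequences such as `\n`.

```python
def parse_escaped(text):
    """Return the contents of a double-quoted string with backslash escapes."""
    state = "start"
    out = []
    for ch in text:
        if state == "start":
            if ch != '"':
                raise ValueError("expected opening quote")
            state = "body"
        elif state == "body":
            if ch == "\\":
                state = "escape"      # next character is taken literally
            elif ch == '"':
                state = "done"        # unescaped quote closes the string
            else:
                out.append(ch)
        elif state == "escape":
            out.append(ch)
            state = "body"
        else:  # state == "done"
            raise ValueError("unexpected text after closing quote")
    if state != "done":
        raise ValueError("unterminated string")
    return "".join(out)

parsed = parse_escaped(r'"this is a \"string\""')
print(parsed)
```

The same quote character means "end of string" in the body state and "literal quote" in the escape state, which is exactly the context dependence the state variable captures.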

commoncrawl

## Extracting Job Ads from Common Crawl

I've been using data from the Adzuna Job Salary Predictions Kaggle Competition to extract skills, find near duplicate job ads and understand seniority of job titles. But the dataset has heavily processed ad text which makes it harder to do natural language processing on. Instead I'm going to find job ads in Common Crawl, a dataset containing over a billion webpages each month. The Common Crawl data is much better because it's longitudinal over several years, international, broad and continually being updated.

excel

## Excel Completion Count

I was recently running some simple, but tedious, annotation in Excel. While it's not a good tool for complex annotation, for a problem with a simple textual annotation where you can fit all the information needed to make a decision in a row it can be effective. However I needed a way to track progress across the team to make sure we finished on time, and see who needed help. We had a blank column that was being filled in as the annotation progressed, and each person was working on some set of rows.

commoncrawl

## Common Crawl Index Athena

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. There are petabytes of data archived so directly searching through them is very expensive and slow. To search for pages that have been archived within a domain (for example all pages from wikipedia.com) you can search the Capture Index. But this doesn't help if you want to search for paths archived across domains. For example you might want to find how many domains have been archived, or the distribution of languages of archived pages, or find pages offered in multiple languages to build a corpus of parallel texts for a machine translation model.

commoncrawl

## Extracting Text, Metadata and Data from Common Crawl

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. You can search the index to find where pages from a particular website are archived, but you still need a way to access the data. Common Crawl provides the data in 3 formats:

- If you just need the text of the internet use the WET files
- If you just need the response metadata, HTML head information or links in the webpage use the WAT files
- If you need the whole HTML (with all the metadata) then use the full WARC files

The index only contains locations for the WARC files; the WET and WAT files are just summarisations of it.

commoncrawl

## Searching 100 Billion Webpages With Capture Index

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. Every month they use Apache Nutch to follow links across the web and download over a billion unique items to Amazon S3, and have data back to 2008. This is like what Google and Bing do to build their search engines, the difference being that Common Crawl provides their data to the world for free.

jobs

## Understanding Job Ad Titles with Salary

Different industries have different ways of distinguishing seniority in a job title. Is a HR Officer more senior than a HR Administrator? Is a PHP web developer more skilled than a PHP developer? How different is a medical sales executive to general sales roles? Using the jobs from Adzuna Job Salary Predictions Kaggle Competition I've found common job titles and can use the advertised salary to help understand them. Note that since the data is from the UK from several years ago a lot of the details aren't really applicable, but the techniques are.

jobs

## Discovering Job Titles

A job ad title can contain a lot of things like location, skills or benefits. I want a list of just the job titles, without the rest of those things. This is a key piece of information extraction that can be used to better understand jobs, and built on by understanding how different job titles relate, for example with salary. To do this we first normalise the words in the ad title, doing things like removing plurals and expanding acronyms.

jobs

## Normalise Job Title Words

I'm trying to find job titles in job ads, but the same title can be written lots of different ways. An "RN" is the same as a "Registered Nurse", and broadly the same role as "Registered nurses". As a preprocessing step to job title discovery I need to normalise the text. The process I use is simple: rewrite terms containing of, e.g. "Director of Sales" to "Sales Director" Expand punctuation with whitespace; e.

programming

## Heuristics for Active Open Source Project

When evaluating whether to use an open source project I generally want to know how active the project is. A project doesn't need to be active to be useable; mature and stable projects don't need to change much to be reliable. But if a project has problems or missing essential features, or is in an evolving ecosystem (like any web project or kernel drivers), it's important to know how fast it changes.

nlp

## Making Words Singular

Trying to normalise text in job titles I need a way to convert plural words into their singular form. For example a job for "nurses" is about a "nurse", a job for "salespeople" is about a "salesperson", a job for "workmen" is about a "workman" and a job about "midwives" is about a "midwife". I developed an algorithm that works well enough for converting plural words to singular without changing singular words in the text like "sous chef", "business" or "gas".

nlp

## Rewriting A of B

When examining words in job titles I noticed that it was common to see titles written as "head of ..." or "director of ...". This is unusual because most role titles go from specific to general (e.g. finance director), so you look backwards from the role word. In the "A of B" format the role goes from general to specific and so you have to reverse the search order. One solution is to rewrite "director of finance" to "finance director".
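A rough sketch of that rewrite, assuming the whole title has the form "A of B" and everything after "of" is the speciality. The helper name `rewrite_a_of_b` is hypothetical, and real titles would need more care (locations, multiple "of"s and so on).

```python
import re

def rewrite_a_of_b(title):
    """Rewrite a title like "director of finance" to "finance director".

    Titles that are not of the form "A of B" are returned unchanged.
    """
    match = re.fullmatch(r"(\w+) of (.+)", title.strip(), flags=re.IGNORECASE)
    if match is None:
        return title
    role, speciality = match.groups()
    # Put the role word last so the title reads from specific to general
    return f"{speciality} {role}"
```

After the rewrite every title can be searched the same way, looking backwards from the role word.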

general

## Mail merge to PDF Files

A friend needed to generate a hundred contracts and their HR information system wasn't working properly. I helped them implement a workaround solution by using mail merge to generate a PDF for every contract, which saved them a lot of time filling in the details of each contract. I couldn't automatically generate the PDF despite some efforts, but using mail merge was much quicker and more reliable than filling in all the contract details manually into the template.

python

## Minibatching in Python

Sometimes you have a long sequence you want to break into smaller sized chunks. This is generally because you want to use some downstream process that can only handle so much data at a time. This is common in stochastic gradient descent in deep learning where you are constrained by the memory on the GPU. But this is also useful for API calls that can take a list, but can't handle all the data at once.
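One way to do this, sketched with `itertools.islice` so it works on any iterable and not just lists. The function name `minibatch` is my own choice here.

```python
from itertools import islice

def minibatch(sequence, batch_size):
    """Yield successive lists of at most batch_size items from sequence."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(sequence)
    while True:
        # islice consumes up to batch_size items from the shared iterator
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For example `list(minibatch(range(7), 3))` gives `[[0, 1, 2], [3, 4, 5], [6]]`; the last batch is simply shorter when the length isn't a multiple of the batch size.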

jobs

## Job Title Words

I found NER wasn't the right tool for extracting job titles, and a frequency based approach is going to work better. The first step for this is to identify words that signify a job title, like "manager", "nurse" or "accountant". I develop a whitelist of these terms and start moving towards a process for detecting role titles. I have developed a method for identifying duplicate job ads and used it to remove duplicates.

data

## Simple Metrics

I have a tendency to create really complex metrics. Sometimes when I'm analysing data I'll need to transform the data to understand it. I often calculate the ratio of common metrics to get a more stable rate. Or when building a machine learning model I'll find that log-loss or root mean square log error is the right metric. This can be appropriate for gaining insight or training a model, but it's not good for communication.

nlp

## Summary of Finding Near Duplicates in Job Ads

I've been trying to find near duplicate job ads in the Adzuna Job Salary Predictions Kaggle Competition. Job ads can be duplicated because a hirer posts the same ad multiple times to a job board, or to multiple job boards. Finding exact duplicates is easy by sorting the job ads or a hash of them. But the job board may mangle the text in some way, or add its own footer, or the hirer might change a word or two in different posts.

jobs

## Finding Duplicate Companies with Cliques

We've found pairs of near duplicate texts in 400,000 job ads from the Adzuna Job Salary Predictions Kaggle Competition. We tried to extract groups of similar ads by finding connected components in the graph of similar ads. Unfortunately with a low threshold of similarity we ended up with a chain of ads that were each similar, but the first and last ad were totally unrelated. One way to work around this is to find cliques, or a group of job ads where every job ad is similar to all of the others.

general

## Market for Highschool Maths Textbooks

My first professional job was for Haese Mathematics, which is a small family-owned South Australian business that writes and publishes mathematics textbooks. Working for a small company was a really interesting experience; I learned software development for their applications both for students and teachers, made animations, edited audio and did layout and graphic design of the books. Unfortunately I didn't make the effort to learn much about the business itself, which makes me wonder how big the market is for mathematics textbooks.

general

## Pain gain matrix for discussing approaches

Placing options on a scatterplot of costs versus benefits is a common practice for prioritising opportunities and solutions. The primary benefit of this approach is it can spark discussions. When people see the options on the canvas it can help uncover unseen issues and opportunities. Getting a group of people involved in putting it together can help get them on the same page. The primary risk of this approach is getting too precise about it.

excel

## Spreadsheets as a Rough Annotation Tool

I needed to design some heuristic thresholds for grouping together items. In my first attempt I iteratively tried to guess the thresholds by trying them on different examples. This was directionally useful but as I refined the thresholds I had to keep going back to check whether I had broken earlier examples. To improve this I used a spreadsheet as a rough annotation tool. There are various tools for data entry like org mode tables in Emacs, or you can use a spreadsheet interface in R with data.

data

## Bridging Bipartite Graph

When you have behavioural data between actors and events you naturally get a bipartite graph. For example you can have the actors as customers and events as products that are purchased, or the actors as users of a website and the events as videos that are viewed, or the actors as members of a forum and the events as posts they comment on. One of the ways to represent this is to relate actors by the number of events they both participate in.

data

## Clustering for Exploration

Suppose you're running a website with tens of thousands of different products, and no satisfactory way to group them up. Even a mediocre clustering can really help bootstrap your understanding. You can use the clusters to see new patterns in the data, and you can manually refine the clusters much more easily than you can make them. There are many techniques to cluster structured data or even detect them as communities in the graph of interactions with your users.

general

## Less Is Better

Today I was picking grapes from their vine for my partner's grandmother. They had been left too long and many were rotting or had bright blue spots where some form of fungus or algae was growing on them. I sorted the grapes into piles of rotten grapes and edible grapes. When I picked a big bunch of grapes with a couple of rotting overripe grapes I sorted it into the rotten pile, despite there being a dozen ripe looking grapes.

programming

## Using Github Actions with Hugo

I really like the idea of having a process triggered automatically when I push code. Github actions gives a way to do this with Github repositories, and this article was first published with a Github action. While convenient for simple things Github actions seem hard to customise, heavyweight to configure and give me security concerns. My workflow for publishing this website used to be commit and push the changes and run a deploy script.

general

## Project Estimation

Estimating projects is notoriously difficult, and the larger the project the harder to estimate. But even small pieces of work for a single person are easy to underestimate. When you make an estimate base it on actual elapsed times of similar projects, always try to overestimate the time, and reduce the scope before promising more than you can deliver. Everyone knows that construction jobs are typically going to take longer and cost more than quoted, from home renovations to major construction projects.

math

## Probability Jaccard

I don't like the Jaccard index for clustering because it doesn't work well on sets of different sizes. Instead I find the concepts from Association Rule Learning (a.k.a. market basket analysis) very useful. It turns out Jaccard Similarity can be written in terms of these concepts so they really are more general. The main metrics in association rule mining are the confidence, which for pairs is just the conditional probability $$P(B \vert A) = \frac{P(A, B)}{P(A)}$$ There is also the lift, which is how much more likely than random (from the marginals) the two events are to occur together $$\frac{P(A, B)}{P(A)P(B)}$$.
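To make the connection concrete, here is a short derivation under the assumption that $$P(A)$$ and $$P(B)$$ are the fractions of all items that fall in each set, and $$P(A, B)$$ the fraction in both. The symbol $$J$$ for the Jaccard index is my own notation. The size of the union follows from inclusion-exclusion:

$$J(A, B) = \frac{P(A, B)}{P(A) + P(B) - P(A, B)}$$

Dividing the numerator and denominator by $$P(A, B)$$ expresses it purely in terms of the two confidences:

$$\frac{1}{J(A, B)} = \frac{1}{P(B \vert A)} + \frac{1}{P(A \vert B)} - 1$$

So the Jaccard index is high only when both confidences are high, which is why a small set contained in a large one scores poorly.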

data

## Community detection in Graphs

People using a website or app will have different patterns of behaviours. It can be useful to cluster the customers or products to help understand the business and make better strategic decisions. One way to view this data is as an interaction graph between people and the product they interact with. Clustering a graph of interactions is called "community detection". Santo Fortunato's review article and user guide provides a really good introduction to community detection.

web

## Serving Static Assets with Python Simple Server

I was trying to load a local file in a HTML page and got a Cross-Origin Request Blocked error in my browser. The solution was to start a Python web server with `python3 -m http.server`. I had a JSON file I wanted to load into Javascript in a HTML page. Looking at StackOverflow I found fetch could do this: `fetch("test.json").then(response => response.json()).then(json => process(json))`, where `process` is some function that acts on the data; console.

communication

## Listening

When I'm in a comfortable environment I love to talk. This can be really useful for working through a problem by bouncing ideas off of other people, or for educating people and getting a point across. But in getting something done I find listening is much more powerful than talking. There's lots of reasons to spend more time listening than talking. When you get a greater diversity of ideas you generally get to a better solution, and often the quieter people in the room have a valuable perspective.

data

## Finding Common Substrings

I've found pairs of near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using Minhash. One thing that would be useful to know is what the common sections of the ads are. Typically if they have a high 3-Jaccard similarity it's because they have some text in common. The most asymptotically efficient way to find the longest common substring would be to build a suffix tree, but for experimentation the heuristics in Python's DiffLib work well enough.
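A minimal sketch of the DiffLib approach, using `SequenceMatcher.find_longest_match` from the standard library. The wrapper name `longest_common_substring` is just for illustration, and I turn off `autojunk` so that frequent characters in long texts aren't ignored.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Return the longest block of text that appears in both a and b."""
    # isjunk=None and autojunk=False: consider every character
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]
```

This works on any pair of sequences, so passing lists of words instead of strings gives the longest common run of words.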

data

## Simple Models

My first instinct when dealing with a new problem is to try to find a complex technique to solve it. However I've almost always found it more useful to start with a simple model before trying something more complex. You gain a lot from trying simple models and the cost is low. Even if they're not enough to solve the problem (which they can be) they will often give a lot of information about the problem which will set you up for later techniques.

software

## Power of Easy

Something being easy makes a huge difference in how often it is used. Even small frictions can add up and make a task less desirable. In the book Nudge, Thaler and Sunstein talk about how small changes to defaults impact major decisions like whether people donate their organs and how they save for retirement. Whenever you're designing something make it as easy as possible for people to do the desired thing; and make sure it's easy from their perspective, where they don't care about the product they're using but the task they are trying to achieve.

python

## Cartesian Product in R and Python

You've got a couple of groups and you want to get every possible combination of them. This is called the Cartesian Product of the groups. There are standard ways of doing this in R and Python. Python: List Comprehensions Concretely we've got (in Python notation) the vectors `x = [1, 2, 3]` and `y = [4, 5]` and we want to get all possible pairs: `[(1, 4), (2, 4), (3, 4), (1, 5), (2, 5), (3, 5)]`.
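A short sketch of the list comprehension for these two vectors. To reproduce the order of pairs listed above the loop over `y` has to be the outer one; `itertools.product` from the standard library gives the same pairs in a different order.

```python
from itertools import product

x = [1, 2, 3]
y = [4, 5]

# The outer loop is over y, so x varies fastest
pairs = [(a, b) for b in y for a in x]

# itertools.product varies the last argument fastest instead
product_pairs = list(product(x, y))
```

Here `pairs` is `[(1, 4), (2, 4), (3, 4), (1, 5), (2, 5), (3, 5)]` while `product_pairs` starts `(1, 4), (1, 5), (2, 4)`.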

math

## Beta Function

The Beta Function comes up in the likelihood of the binomial distribution. Understanding its properties is useful for understanding the binomial distribution. The beta function is given by $$B(a, b) = \int_0^1 p^{a-1}(1-p)^{b-1} \rm{d}p$$ for a and b positive. If you have $N$ flips of a coin of which $k$ turn heads the likelihood is proportional to $$p^{k}(1-p)^{N-k}$$ for the probability p between 0 and 1. So the beta function can be seen as the normaliser of the likelihood, with $$a = k + 1$$ and $$b = N - k + 1$$ (or inversely $$k = a - 1$$ and $$N = a + b - 2$$).
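As a quick numerical check, here is a sketch that compares a direct midpoint-rule integration of the definition with the standard identity $$B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$$ (the identity isn't derived above; I'm assuming it as a known result). The function names are my own.

```python
from math import gamma

def beta(a, b):
    """Beta function via the Gamma function identity."""
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_numeric(a, b, steps=100_000):
    """Approximate the defining integral with the midpoint rule."""
    width = 1 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * width  # midpoint of the i-th interval
        total += p ** (a - 1) * (1 - p) ** (b - 1)
    return total * width

# 7 heads out of 10 flips corresponds to a = k + 1 = 8 and b = N - k + 1 = 4
normaliser = beta(8, 4)
```

For this example the normaliser is $$7!\,3!/11! = 1/1320$$, and the two functions agree closely.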

data

## From Bernoulli to Binomial Distributions

Suppose that you flip a fair coin 10 times, how many heads will you get? You'd think it was close to 5, but it might be a bit higher or lower. If you got 7 heads would you reconsider your assumption the coin is fair? What if you got 70 heads out of 100 flips? This might seem a bit abstract, but the inverse problem is often very important. Given that 7 out of 10 people convert on a new call to action, can we say it's more successful than the existing one that converts at 50%?
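To put rough numbers on these questions, here is a small sketch that computes the chance of seeing at least a given number of heads from a fair coin, assuming independent flips. The function names are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

seven_of_ten = prob_at_least(7, 10, 0.5)      # 176/1024, about 0.17
seventy_of_hundred = prob_at_least(70, 100, 0.5)
```

A fair coin gives 7 or more heads in 10 flips about 17% of the time, so that alone is weak evidence; 70 or more out of 100 is far less likely.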

jobs

## Minhash Sets

We've found pairs of near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using Minhash. But many pairs will be part of the same group; in an extreme case there could be a group of 5 job ads with identical texts which produces 10 pairs. Both for interpretability and usability it makes sense to extract these groups from the pairs. Extracting the Groups Directly with Union Find Each band of the LSH consists of buckets of items that may be similar; you could view the buckets as a partition of the corpus of all documents.
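A minimal union-find sketch for turning pairs into groups. Note this version works on a list of similar pairs rather than directly on the LSH buckets, and the name `group_pairs` is hypothetical.

```python
def group_pairs(pairs):
    """Group items so that any two items linked by a chain of pairs share a group."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        root = item
        while parent[root] != root:
            root = parent[root]
        # Path compression: point everything on the path straight at the root
        while parent[item] != root:
            parent[item], item = root, parent[item]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())
```

With the 10 pairs from 5 identical ads this returns a single group of 5. The groups are the connected components of the pair graph, so chains of loosely similar ads will still be merged together.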

nlp

## Searching for Near Duplicates with Minhash

I'm trying to find near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. In the last article I built a collection of MinHashes of the 400,000 job ads in half an hour in a 200MB file. Now I need to efficiently search through these minhashes to find the near duplicates because brute force search through them would take a couple of days on my laptop. MinHash was designed to approach this problem as outlined in the original paper.

emacs

## Considering VS Code from Emacs

I've been using Emacs as my primary editor for around 5 years now (after 4 years of Vim). I'm very comfortable in it, having spent a long time configuring my init.el. But once in a while I'm slowed down by some strange issue, so I'm going to put aside my sunk configuration costs and have a look at using VS Code. On Emacs I recently read a LWN article on Making Emacs Popular Again (and the corresponding HN thread).

## Estimating Bias in a Coin with Bayes Rule

I wanted to work through an example of applying Bayes rule to update model parameters based on toy data. This example comes from Kruschke’s Doing Bayesian Data Analysis, Section 5.3. The model is that we have a coin and we’re trying to estimate the bias in the coin, that is the probability that it will come up heads when flipped. For simplicity we assume the bias, theta is a multiple of 0.

nlp

## Detecting Near Duplicates with Minhash

I'm trying to find near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. I've found that the Jaccard index on n-grams is effective for finding these. Unfortunately it would take about 8 days to calculate the Jaccard index on all pairs of the 400,000 ads, and take about 640GB of memory to store it. While this is tractable we can find almost all pairs with a significant overlap in half an hour in-memory using MinHash.

maths

## Lessons from a mathematician on building a community

Mathematicians and software developers have a lot in common. They both build structures of ideas, typically working in small groups or alone, but leveraging structures built by others. For software developers the ideas are concrete code implementations, and the building blocks are subroutines, and are published as "libraries" or "packages". For mathematicians the ideas are abstract, built on definitions and theorems and published in papers, conferences and informal conversations. To grow a substantial body of work in either mathematics or software requires a community to contribute to it.

data

## Clustering for Segmentation

Dealing with thousands of different items is difficult. When you've got a couple of dozen you can view them together, but as you get into the hundreds, thousands and beyond it becomes necessary to group items to make sense of them. For example if you've got a list of customers you might group them by state, or by annual spend. But sometimes it would be useful to split them into a few groups using some heuristic criteria; clustering is a powerful technique to do this.

data

## Representing Decision Trees on a grid

A decision tree is a series of conditional rules leading to an outcome. When stated as a chain of if-then-else rules it can be really hard to understand what is going on. If the number of dimensions and cutpoints is relatively small it can be useful to visualise on a grid to understand the tree. Decision trees are often represented as a hierarchy of splits. Here's an example of a classification tree on Titanic survivors.

writing

## Writing 50 Daily Articles

I've been writing an article a day for 50 days now. I started this to help build a portfolio, keep track of useful learnings and to become better at writing. This post reflects on the progress so far. Inspiration While there are many sources of inspiration for my writing, Sacha Chua's No Excuses Guide to Blogging is the biggest one. I bought the book around 2 years ago but I've found it useful and kept coming back to it.

data

## Four Competencies of an Effective Analyst

Analysts tend to be natural problem solvers, good at reasoning and adept with numbers. But to know how to frame the problem and what to look for they need to understand the context. To solve the problems they have to collect the right data and perform any necessary calculations. To have impact they need to be able to understand what's valuable, communicate their insights and influence decisions. These make up the four competencies of an effective analyst.

data

## 4am Rule for timeseries

When you've got a timeseries that doesn't have a timezone attached to it the natural question is "what timezone is this data from?" Sometimes it's UTC, sometimes it's the timezone of the server, otherwise it could be the timezone of one of the locations it's about (and it may or may not change with daylight savings). When it's people's web activity there's a simple heuristic to check this: the activity will be minimum between 3am and 5am.

data

A very useful open dataset the Australian Government provides is the Geocoded National Address File (G-NAF). This is a database mapping addresses to locations. This is really useful for applications that want to provide information or services based on someone's location. For instance you could build a custom store finder, get aggregate details of your customers, or locate business entities with an address, for example ATMs. There's another open and editable dataset of geographic entities, Open Street Map (and it has a pretty good open source Android app OsmAnd).

emacs

## Pipetable to CSV

Sometimes I get out pipe tables in Emacs that I want to convert into a CSV to put somewhere else. This is really easy with regular expressions. I often get data output from an SQL query like this: text | num | value --------------+------+------------- Some text | 0.3 | 0.2 Rah rah | 7 | 0.00123(2 rows) Running `sed 's/\(^ *\| *|\|(.*\) */,/g'` gives: ,text,num,value --------------+------+------------- ,Some text,0.3,0.2 ,Rah rah,7,0.00123, I can delete the divider and then use as a CSV.

sql

## Binning data in SQL

Generally when combining datasets you want to join them on some key. But sometimes you really want a range lookup like Excel's VLOOKUP. A common example is binning values; you want to group values into custom ranges. While you could do this with a giant CASE statement, it's much more flexible to specify in a separate table (for regular intervals you can do it with some integer division gymnastics). It is possible to implement VLOOKUP in SQL by using window functions to select the right rows.

maths

## A Mixture of Bernoullis is Bernoulli

Suppose you are analysing email click-through rates. People either follow the call to action or they don't, so it's a Bernoulli Distribution whose parameter is the actual probability a random person will click through from the email. But in actuality your email list will be made up of different groups; for example people who have just signed up to the list may be more likely to click through than people who have been on it for a long time.

maths

## Probability Squares

A geometric way to represent combining two independent discrete random variables is as a probability square. On each side of the square we have the distributions of the random variables, where the length of each segment is proportional to the probability. In the centre we have the function evaluated on the two edges and the probability is proportional to the area of the rectangle. For example suppose we had a random process that generated 1, 2 or 3 with equal probability (for example half the value of a die, rounded up).

data

## Representing Interaction Networks

Behavioural data can illuminate the structure of the underlying actors. For example looking at which products customers buy can help understand how both the products and customers interact. The same idea can apply to people who attend events, watch the same movie, or have authored a scientific paper together. There are a few ways to represent these kinds of interactions which gives a large toolbox of ways to approach the problem.

excel

## Excel Binning

Putting numeric data into bins is a useful technique for summarising, especially for continuous data. This is what underlies histograms, which are bar charts of frequency counts in each bin. There are two main ways of doing this in Excel: with groups and with vlookup (you can also do this in SQL). If you want equal length bins in a Pivot Table the easiest way is with groups. Right click on the column you want to bin and select Group

programming

## Powershell Debugging with Write-Warning

I had to debug some Powershell, without knowing anything about it. I found Write-Warning was the right tool for printline debugging. This was enough to resolve my issue. I first tried Write-Output but apparently it doesn't work inside a function which I found misleading for a while (at first I thought that it wasn't getting to the function). Write-Warning worked straight away and I could see in bright yellow what was going on.

data

## Analysis Needs to Change A Decision

Any analysis where the results won't change a decision is worthless. Before even thinking of getting any data it's worth being clear on how it impacts the decision. There's lots of reasons people want an analysis. Sometimes it's to confirm what they already believe (and they'll discount anything that tells them otherwise). Sometimes it's to prove to others something they believe; possibly to inform a decision someone else is making. But it's most valuable when it affects a decision they can make with an outcome they care about.

sql

## SQL Views for hiding business logic

The longer I work with a database the more I learn the dark corners of the dataset. Make sure you exclude the rows created by the test accounts listed in another table. Don't use the create_date field, use the real_create_date_v2 instead, unless it's not there, then just use create_date. Make sure you only get data from the latest snapshot for the key. Very quickly I end up with complex spaghetti SQL, which either contains monstrous subqueries or a chain of CREATE TEMPORARY TABLE.

nlp

## Near Duplicates with TF-IDF and Jaccard

I've looked at finding near duplicate job ads using the Jaccard index on n-grams. I wanted to see whether using the TF-IDF to weight the ads would result in a clearer separation. It works, but the results aren't much better, and there are some complications in using it in practice. When trying to find similar ads with the Jaccard index we looked at the proportion of n-grams they have in common relative to all the n-grams between them.

nlp

## Near Duplicates with Jaccard

Finding near-duplicate texts is a hard problem, but the Jaccard index for n-grams is an effective measure that's efficient on small sets. I've tried it on the Adzuna Job Salary Predictions Kaggle Competition with good success. This works pretty well at finding near-duplicates and even ads from the same company; although by itself it can't detect duplicates. I've looked before at using the edit distance which looks for the minimum number of changes to transform one text to another, but it's slow to calculate.
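A small sketch of the measure, assuming texts are already split into tokens (here by whitespace). The function names `ngrams` and `jaccard` are my own, and treating two empty sets as identical is a convention I've picked.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)

first = ngrams("the quick brown fox".split(), 2)
second = ngrams("the quick brown dog".split(), 2)
similarity = jaccard(first, second)
```

The two example texts share 2 of their 4 distinct bigrams, so `similarity` is 0.5.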

nlp

## Edit Distance

Edit distance, also known as Levenshtein Distance, is a useful way of measuring the similarity of two sequences. It counts the minimum number of substitutions, insertions and deletions you need to make to transform one sequence to another. I had a look at using this for trying to compare duplicate ads with reasonable results, but it's a little slow to run on many ads. I've previously looked at finding ads with exactly the same text in the Adzuna Job Salary Predictions Kaggle Competition, but there are a lot of ads that are slight variations.
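For reference, a sketch of the standard dynamic programming calculation, keeping only one previous row of the table at a time. It takes time proportional to the product of the two lengths, which is why it gets slow across many pairs of ads.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions to turn a into b."""
    # previous[j] is the distance between the first i-1 items of a and the first j of b
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]  # turning i items into nothing takes i deletions
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]
```

For example `edit_distance("kitten", "sitting")` is 3: two substitutions and one insertion.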

emacs

## Using Emacs under WSL

Getting Emacs to work nicely on a Windows system can be a challenge. You can install it natively (although getting all the dependencies is a challenge), but many packages require libraries or utilities that are hard to install or don't exist on Windows. The best solution I have found is using Emacs under the Windows Subsystem for Linux (WSL) with Xming. However if you run Emacs 26 or greater after starting Xming with XLaunch you're faced with a blank screen and can't see any writing on Emacs

data

## The Problem with Jaccard for Clustering

The Jaccard Index is a useful measure of similarity between two sets. It makes sense for any two sets, is efficient to compute at scale and its arithmetic complement is a metric. However for clustering it has one major disadvantage; small sets are never close to large sets. Suppose you have sets that you want to cluster together for analysis. For example each set could be a website and the elements are people who visit that website.

maths

## Jaccard Shingle Inequality

Two similar documents are likely to have many similar phrases relative to the number of words in the document. In particular if you're concerned with plagiarism and copyright, getting the same data through multiple sources, or finding versions of the same document this approach could be useful. MinHash can quickly find pairs of items with a high Jaccard index, which we can run on sequences of w tokens. A hard question is what's the right number for w?

python

## Finding Exact Duplicate Text

Finding exact duplicate texts is quite straightforward and fast in Python. This can be useful for removing duplicate entries in a dataset. I tried this on the Adzuna Job Salary Predictions Kaggle Competition job ad texts and found it worked well. Naively finding exact duplicates by comparing every pair would be O(N^2), but if we sort the input, which is O(N log(N)), then duplicate items are adjacent. This scales really well to big datasets, and then the duplicate entries can be handled efficiently with itertools groupby to do something like uniq.
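A minimal sketch of the sort-then-group idea. It sorts the positions of the texts rather than the texts themselves so each group records which entries are duplicated; the function name `duplicate_groups` is illustrative.

```python
from itertools import groupby

def duplicate_groups(texts):
    """Return lists of indices whose texts are exactly equal."""
    # Sorting puts equal texts next to each other: O(N log N)
    order = sorted(range(len(texts)), key=lambda i: texts[i])
    groups = []
    # groupby only merges adjacent equal keys, which is why the sort is needed
    for _, indices in groupby(order, key=lambda i: texts[i]):
        indices = list(indices)
        if len(indices) > 1:
            groups.append(indices)
    return groups
```

For very long texts the same approach works on a hash of each text instead of the text itself.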

python

## Showing Side-by-Side Diffs in Jupyter

When comparing two texts it's useful to have a side-by-side comparison highlighting the differences. This is straightforward using HTML in Jupyter Notebooks with Python, and the inbuilt DiffLib. I used this to display job ads duplicated between different sites. For a long document it's important to align the sentences (otherwise it's hard to compare the differences), and highlight the individual differences at a word level. Overall the problems are breaking up a text into sentences and words, aligning the sentences, finding word level differences and displaying them side-by-side.

nlp

## Creating a Diff Recipe in Prodigy

I created a simple custom recipe to show diffs between two texts in Prodigy. I intend to use this to annotate near-duplicates. The process was pretty easy, but I got tripped up a little. I've been extracting job titles and skills from the job ads in the Adzuna Job Salary Predictions Kaggle Competition. One thing I noticed is there are a lot of job ads that are almost exactly the same; sometimes between the train and test set which is a data leak.

data

## All of Statistics

For anyone who wants to learn statistics and has a maths or physics background I highly recommend Larry Wasserman's All of Statistics. It covers a wide range of statistics with enough mathematical detail to really understand what's going on, but not so much that the machinery is overwhelming. What I learned reading it really helped me understand statistics well enough to design bespoke statistical experiments and effectively use and implement machine learning models.

life

## Remote social catchups are less intimate

As an introvert I really like catching up with good friends in small groups. But a video/remote catchup is much less intimate than real life because only one person can talk at a time. When you get 4 or more people in a group setting, frequently the conversation splits into smaller subgroups. The subgroups let people intermingle and participate in topics they're more interested in while all being together. With a video call you can't easily do this splitting and only one person can talk at a time.

python

## Counting n-grams with Python and with Pandas

Sequences of words are useful for characterising text and for understanding text. If two texts have many similar sequences of 6 or 7 words it's very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like "New York" together rather than treating them as the separate words "New" and "York". To do this we need a way of extracting and counting sequences of words.
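One common way to do this in plain Python, sketched here with a made-up sentence and a hypothetical helper name, is to zip a token list against shifted copies of itself and count the resulting tuples.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over each run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped.
    return zip(*(tokens[i:] for i in range(n)))


tokens = "new york is big and new york is busy".split()
bigram_counts = Counter(ngrams(tokens, 2))
```

`bigram_counts.most_common()` then lists the most frequent phrases first.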

linux

## Waiting for System clock to synchronise

When trying to install packages with apt on a new Ubuntu AWS EC2 instance I had issues where the signature would fail to verify. The reason was the system clock was far in the past and so it looked like the signature was signed in the future. I created a workaround to wait for the system clock to synchronise that solved the problem and could be useful when starting a new machine with time sensitive issues.

nlp

## Not using NER for extracting Job Titles

I've been trying to use a Named Entity Recogniser (NER) to extract job titles from the titles of job ads to better understand a collection of job ads. While NER is great, it's not the right tool for this job, and I'm going to switch to a counting based approach. NER models try to extract things like the names of people, places or products. SpaCy's NER model, which I used, is optimised for these cases (looking at things like capitalisation of words).

nlp

## Rules, Pipelines and Models

Over the past decade deep neural networks have revolutionised dealing with unstructured data. Problems that were intractable, from identifying what objects are in a video, through generating realistic text, to translating speech between languages, are now solved in real-time production systems. You might think that today all problems on text, audio and images should be solved by training end-to-end neural networks. However rules and pipelines are still extremely valuable in building systems, and can leverage the information extracted from the black-box neural networks.

nlp

## Active NER with Prodigy Teach

Active learning reduces the number of annotations you have to make by selecting for annotation the items that will have the biggest impact on model retraining. Active learning for NER is built into Prodigy, but I failed to use it to improve my job title recogniser. Having built a reasonable NER model for recognising job titles I wanted to see if I could easily improve it with Prodigy's active learning.

python

## Python Inequality Chaining

In Python the comparison a <= b == c < d does the mathematically correct thing. This is a handy notational trick. This wasn't obvious to me because a lot of programming languages treat these associatively, so that a <= b < c may resolve to (a <= b) < c. This is very dangerous if booleans (True or False) are coerced to integers (1 or 0) because it may look like it works but give the wrong results.
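A small illustration with arbitrarily chosen values shows the difference between Python's chaining and the left-to-right grouping described above.

```python
a, b, c = 3, 2, 1

# Python chains comparisons: a <= b < c means (a <= b) and (b < c).
chained = a <= b < c

# Grouping left to right instead compares a boolean with an integer:
# (3 <= 2) is False, and False < 1 holds because False == 0.
grouped = (a <= b) < c
```

Here `chained` is `False`, as expected mathematically, while `grouped` is `True`.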

nlp

## Training a job title NER with Prodigy

In a couple of hours I trained a reasonable job title Named Entity Recogniser for job ad titles using Prodigy, with over 70% accuracy. While 70% doesn't sound great it's a bit ambiguous what a job title is, and getting exactly the bounds of the job title can be a hard problem. It's definitely good enough to be useful, and could be improved. After thinking through an annotation scheme for job titles I wanted to try annotating and training a model.

nlp

## Annotating Job Titles

When doing Named Entity Recognition it's important to think about how to set up the problem. There's a balance between what you're trying to achieve and what the algorithm can do easily. Coming up with an annotation scheme is hard, because as soon as you start annotating you notice lots of edge cases. This post will go through an example with extracting job titles from job ads. In our previous post we looked at what was in a job ad title and a way of extracting some common job titles from the ads.

nlp

## What's in a Job Ad Title?

The job title should succinctly summarise what the role is about, so it should tell you a lot about the role. However in practice job titles can range from very broad to very narrow, be obscure or acronym-laden and even hard to nail down. They're even hard to extract from a job ad's title, which is what I'll focus on in this series. In a previous series of posts I developed a method that could extract skills written in a very particular way.

linux

## Disk Usage in Linux with du

When your hard drive is filling up the du utility is a great way of seeing what's taking up all the space. It can recursively walk through directories to a maximum depth, and print the results in human readable sizes. I'll normally start by running df to see what space is used and available. It's worth looking at the Mounted On column if you don't administer the machine because sometimes there are special partitions for large files.

python

## Getting Started Debugging with pdb

When there's something unexpected happening in your Python code the first thing you want to do is to get more information about what's going wrong. While you can use print statements or logging it may take a lot of iterations of rerunning and editing your statements to capture the right information. You could use a REPL but sometimes it's challenging to capture all the state at the point of execution. The most powerful tool for this kind of problem is a debugger, and it's really easy to get started with Python's pdb.

SQL

## Calculating percentages in Presto

One trick I use all the time is calculating percentages in SQL by dividing by the count. Percentages quickly tell me how much coverage I've got when looking at the top few rows. However Presto uses integer division so doing the naive thing will always give you 0 or 1. There's a simple trick to work around this: replace count(*) with sum(1e0). Suppose for example you want to calculate the percentage of a column that is not null; you might try something like

SQL

## Moving Averages in SQL

Moving averages can help smooth out the noise to reveal the underlying signal in a dataset. As they lag behind the actual signal they trade off timeliness for increased precision in the underlying signal. You could use them for reporting metrics or for alerting in cases where it's more important to be sure there is a change than it is to catch any change early. It's typically better to have a 7 day moving average than weekly reporting for important metrics because you'll see changes earlier.

Presto

## Getting most recent value in Presto with max_by

Presto and the AWS managed alternative Amazon Athena have some powerful aggregation functions that can make writing SQL much easier. A common problem is getting the most recent status of a transaction log. The max_by function (and its partner min_by) makes this a breeze. Suppose you have a table tracking user login activity over time like this:

| country | user_id | time | status |
|---|---|---|---|
| AU | 1 | 2020-01-01 08:00 | logged-in |
| CN | 2 | 2020-01-01 09:00 | logged-in |
| AU | 1 | 2020-01-01 12:00 | logged-out |
| AU | 1 | 2020-01-01 13:00 | logged-in |
| CN | 2 | 2020-01-01 14:00 | logged-out |

You need to find out which users are currently logged in and out, which requires you to find their most recent status.
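To make the goal concrete, here is a rough plain-Python analogue of taking, for each user, the status with the greatest time. It is only an illustration of the semantics using the rows from the table above; it is not Presto syntax and says nothing about how Presto evaluates max_by.

```python
rows = [
    ("AU", 1, "2020-01-01 08:00", "logged-in"),
    ("CN", 2, "2020-01-01 09:00", "logged-in"),
    ("AU", 1, "2020-01-01 12:00", "logged-out"),
    ("AU", 1, "2020-01-01 13:00", "logged-in"),
    ("CN", 2, "2020-01-01 14:00", "logged-out"),
]

latest = {}
for country, user_id, time, status in rows:
    # Keep the (time, status) pair with the greatest time for each user. These
    # timestamps are zero-padded, so string order matches chronological order.
    if user_id not in latest or time > latest[user_id][0]:
        latest[user_id] = (time, status)

current_status = {user: status for user, (time, status) in latest.items()}
```

User 1 ends up logged in and user 2 logged out, which is the answer the aggregation should produce.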

android

## Syncing Calendars and Contacts to Android with DAVx5

I find it really handy to have my calendar and contacts from my email client on my mobile phone. DAVx5 is a fantastic free (GPLv3) app to do this on Android. This lets me organise my life across devices and helps me know when friends' and family's birthdays are. DAVx5 is simple to set up and has worked almost flawlessly for me for over 4 years. It supports two way synchronisation to CalDAV and CardDAV servers that many email providers support.

email

## Don't manage work email with Emacs

I do a lot of work in Emacs and at the command line, and I get quite a few emails so it would be great if I could handle my emails there too. Email in Emacs can be surprisingly featureful and handles HTML markup, images and can even send org markup with images and equations all from the comfort of an Emacs buffer. However it can be a whole heap of work, and as you get deeper into the features your mail client provides the amount of custom integration required grows very rapidly.

data

## Data Transformations in the Shell

There are many great tools for filtering, transforming and aggregating data like SQL, R dplyr and Python Pandas (not to mention Excel). But sometimes when I'm working on a remote server I want to quickly extract some information from a file without switching to one of these environments. The standard unix tools like uniq, sort, sed and awk can do blazing fast transformations on text files that don't fit in memory and are easy to chain together.

python

## Second most common value with Pandas

I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignment out of order. However some things are really difficult to do with method chaining in Pandas; in particular getting the second most common value of each group. This is much easier to do in R's dplyr with its consistent and flexible syntax than it is with Pandas. Problem For the table below find the total frequency and the second most common value of y by frequency for each x (in the case of ties any second most common value will suffice).

## Property Based Testing - A thousand test cases in a single line

Property based testing lets you specify rules that a function being tested will satisfy over a wide range of inputs. This specifies how to thoroughly test a function without coming up with a detailed set of test cases. For example instead of writing a specific test case like sort([1, 3, 2]) == [1, 2, 3], you could state that the input and output of sort should contain exactly the same elements for any valid input.
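The idea can be sketched by hand with the standard library alone. The checker below is a hypothetical helper, not the API of any testing library; dedicated libraries such as Hypothesis add smarter input generation and shrinking of failing cases.

```python
import random
from collections import Counter


def check_sort_properties(sort, trials=1000, seed=0):
    """Check two properties of a sort function on randomly generated integer lists."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        ys = sort(xs)
        # Property 1: the output contains exactly the same elements as the input.
        assert Counter(ys) == Counter(xs)
        # Property 2: the output is in non-decreasing order.
        assert all(a <= b for a, b in zip(ys, ys[1:]))
    return trials


checked = check_sort_properties(sorted)
```

Two short properties stand in for a thousand hand-written cases, and a function that merely returns its input unchanged fails the second property as soon as an unsorted list is generated.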

emacs

## Using emacs dumb-jump with evil

## Presto and Athena CLI in Emacs

I find having Emacs as a unified programming environment really useful. When writing an SQL pipeline I can iteratively develop my SQL in emacs, running it against the database. For a quick and dirty analysis I can copy the output into the .sql file and comment it out. Then I can copy the SQL into a programming language, parameterise it, and test it without touching the mouse. So when I started using Presto and AWS's managed alternative Athena, I needed to integrate it into emacs.

## Fastai Callbacks as Lisp Advice

Creating state of the art deep learning algorithms often requires changing the details of the training process. Whether it's scheduling hyperparameters, running on multiple GPUs or plotting the metrics, it requires changing something in the training loop. However constantly modifying the core training loop every time you want to add a feature, and adding a switch to enable it, quickly becomes unmaintainable. The solution fast.ai developed is to add points where custom code can be called that modifies the state of training, which they call callbacks.

## 94% confidence with 5 measurements

There are many things that are valuable to know in business but are hard to measure. For example the time from when a customer has a need to purchase, the number of related products customers use or the actual value your products are delivering. However you don't need a sample size of hundreds to get an estimate; in fact you can get a statistically significant result from measuring just 5 random customers.
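The 94% in the title is consistent with a standard calculation about the median (an assumption here, since the excerpt doesn't spell out which statistic is meant): for 5 independent random samples, the population median falls outside the range of the samples only if all 5 land above it or all 5 land below it.

```python
def prob_median_within_range(n):
    """Probability that the population median lies between the smallest and
    largest of n independent samples from a continuous distribution."""
    # Each sample is above the median with probability 1/2, so all n fall on
    # the same side with probability 2 * (1/2)**n.
    return 1 - 2 * 0.5 ** n


confidence = prob_median_within_range(5)
```

For n = 5 this gives 1 - 2/32 = 0.9375, which rounds to 94%.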

## How to Display All Columns in R Jupyter

I like to do one-off analyses in R because tidyverse makes it really easy and beautiful. I also like to do them in Jupyter Notebooks because they form a neat way to collate the results. While R Markdown is better for reproducible code, often I'm doing expensive things with databases that are changing, and so I tend to find the "write once" behaviour of Jupyter Notebooks fits this use case better (although R Markdown Notebooks are catching up).

## Exporting data to Python with Amazon Athena

One necessary hurdle in doing data analysis or machine learning is loading the data. In many businesses larger datasets live in databases, in an object store (like Amazon S3) or the Hadoop File System. For some use cases you can do the work where the data lives using SQL or Spark, but sometimes it's more convenient to load it into a language like Python (or R) with a wider range of tools.

nlp

## Extracting Skills from Job Ads: Part 3 Conjugations

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "experience in telesales" using spaCy's dependency parse, but it wouldn't extract many types of experience from a job ad. Here we will extend these rules to extract lists of skills (for example extracting "telesales" and "callcentre" from "experience in telesales or receptionist"), which will let us analyse which experiences are related.

nlp

Extracting Experience in a Field I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "subsea cable engineering experience". This worked well, but extracted a lot of qualifiers that aren't skills (like "previous experience in", or "any experience in"). Here we will write rules to extract experience from phrases like "experience in subsea cable engineering", with much better results.

nlp

## Extracting Skills from Job Ads: Part 1 - Noun Phrases

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. Using rules to extract noun phrases ending in experience (e.g. subsea cable engineering experience) we can extract many skills, but there are a lot of false positives (e.g. previous experience). You can see the Jupyter notebook for the full analysis. Extracting Noun Phrases: It's common for ads to write something like "have this kind of experience":

• Edward Ross

## Data Blockless: A better way to create data

## Constant Models

When predicting outcomes using machine learning it's always useful to have a baseline to compare results against. A simple baseline is the best constant model; that is a model that gives the same prediction for any input. This is a really simple check to perform against any dataset, and can be informative to check across validation splits. There are simple algorithms for finding the best constant model. For categorical predictions just evaluate every possible category to choose as the constant prediction.
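For the categorical case, a minimal sketch might look like the following. It assumes accuracy is the metric being optimised, in which case trying every category reduces to picking the most frequent one; the function name and labels are made up for illustration.

```python
from collections import Counter


def best_constant_category(labels):
    """Score every category as the constant prediction and keep the most accurate."""
    counts = Counter(labels)
    # Predicting category c everywhere is right exactly counts[c] times.
    best = max(counts, key=lambda c: counts[c])
    accuracy = counts[best] / len(labels)
    return best, accuracy


labels = ["no", "no", "yes", "no", "yes"]
baseline = best_constant_category(labels)
```

Any model scoring below this baseline accuracy on the same split is doing worse than ignoring the input entirely.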

## A programmer using Excel

Intro When I was 15 I did a week of work experience with my neighbour, who was an agricultural economist running his own one person business. I'm still not really sure what an agricultural economist does, but I went out with him to visit his clients to talk through their business, and saw how he analysed their data in his Excel spreadsheet. It was really closer to an application than a spreadsheet; the interface made it clear where the client was meant to enter their data, it showed some summary output and most of the intermediate calculations were hidden.

## Spectra of atoms

Why is a sodium lamp yellow? How can we determine the elemental composition of the sun? How does a helium-neon laser work? To some degree all of these questions require knowing the spectra of atoms, which can in theory be calculated by quantum mechanics. However the calculation of these spectra for arbitrary systems from first principles is prohibitively difficult and computationally intensive (which is why techniques such as Density Functional Theory are used).

## Regular expressions, automata and monoids

In formal language theory the task is to specify, over some given alphabet, a set of valid strings. This is useful in searching for structures in textual data across files (e.g. via grep), for specifying the syntactic structure of programming languages (e.g. in Bison or pandoc), and for generating output of a specified form (e.g. automatic computer science and mathematics paper generators).

An automaton is (roughly) a set of symbols, and a set of states, along with transitions for each state that take a symbol and return another state. They can be used to model (and verify) simple processes.

Automata can be brought into correspondence with formal languages in a very natural way; given an initial state s, and a sequence of symbols (a1, a2, …, an) the automaton has a naturally assigned state (… ((s a1) a2) … an) (where “(state symbol)” represents the state obtained from the transition on symbol using state). Then if we nominate an initial state, and a set of “accepting” valid states, we say a string is in the language of the automaton if and only if when applied to the initial state it ends in an accepting state.

This gives a very useful pairing in computer science; formal languages are useful tools, and automata (often) give an efficient way to implement them on a computer.
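The construction above, folding a sequence of symbols over the transitions from an initial state and checking for an accepting state, can be sketched in a few lines. The two-state automaton used here is a hypothetical example that accepts strings with an even number of "a"s.

```python
from functools import reduce


def accepts(transitions, initial, accepting, symbols):
    """Apply each symbol's transition in turn, starting from the initial state,
    and check whether the state reached is an accepting one."""
    final = reduce(lambda state, sym: transitions[state, sym], symbols, initial)
    return final in accepting


# Hypothetical automaton over {"a", "b"}: tracks the parity of the number of "a"s.
transitions = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}

ok = accepts(transitions, "even", {"even"}, "abba")
bad = accepts(transitions, "even", {"even"}, "ab")
```

The `reduce` call is exactly the nested expression (… ((s a1) a2) … an), so "abba" is accepted and "ab" is not.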

## DVI by example

The Device Independent File Format (DVI) is the output format of Knuth’s TeX82; modern TeX engines (pdfTeX, LuaTeX) output straight to Adobe’s Portable Document Format (PDF). However TeX82 and DVI still work as well today as they did when they were written; DVI files are easily converted to PostScript or PDF.

The defining reference for DVI files is David R. Fuchs’s article in TUGboat Vol 3 No 2.

To find out what information is contained in a particular DVI file use Knuth’s dvitype, which outputs the operations contained in the bytecode in human readable format.

This article goes into gory detail about the instructions contained in a very simple DVI file.

## Algorithms for finding the real roots of polynomials

Given a degree n polynomial over the real numbers we are guaranteed there are at most n real roots by the fundamental theorem of algebra; but how do we find them? Here we explore the Vincent-Collins-Akritas algorithm.

It uses Descartes’ rule of signs: given a polynomial $$p(x) = a_n x^n + \cdots + a_1 x + a_0$$ the number of real positive roots (counting multiplicities) is bounded above by the number of sign variations in the sequence $$(a_n, \ldots, a_1, a_0)$$.
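Counting sign variations is simple to do mechanically. The sketch below skips zero coefficients, which is the usual convention, and uses a small example polynomial chosen for illustration.

```python
def sign_variations(coefficients):
    """Count sign changes in a coefficient sequence, skipping zero coefficients."""
    signs = [c > 0 for c in coefficients if c != 0]
    return sum(s != t for s, t in zip(signs, signs[1:]))


# p(x) = x^3 - 3x^2 - x + 3 = (x - 1)(x - 3)(x + 1), coefficients a_n down to a_0.
bound = sign_variations([1, -3, -1, 3])
```

The signs run +, -, -, + giving 2 variations, and this polynomial does have 2 positive real roots (1 and 3), so here the bound is attained.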

## Geometry and topology of division rings

Following from my last post (and Veblen and Young’s Projective Geometry) consider a projective plane satisfying the axioms:

1. Given two distinct points there is a unique line that both points lie on
2. Each line has at least three points which lie on it
3. Given a triangle any line that intersects two sides of the triangle intersects the third.
4. All points are spanned by d+1 points and no fewer.

Then for d >= 3 such a geometry is equivalent to the projective space of lines over a division ring (or skew field).

Kolmogorov asked the question: what projective spaces can we do analysis on? In order to do things such as find tangent lines we are going to need some sort of topology.

maths

## Geometry of division rings

It is fairly easy to construct a geometry from algebra: given a division ring K we form an n-dimensional vector space, the points being the elements of the vector space and a line being a translation of all (left) multiples of a non-zero vector, i.e. of the form $$\{a\mathbf{v} + \mathbf{c}| a \in K\}$$ for some fixed vectors $$\mathbf{v} \neq 0$$ and $$\mathbf{c}$$.

Interestingly it’s just as possible to go the other way, if we’re careful about what we mean by a geometry. I will loosely follow Artin’s book Geometric Algebra. In particular we have the undefined terms of point, line and the undefined relation of lies on. Then, for a fixed positive integer d, the axioms are:

1. Given two distinct points there is a unique line that both points lie on
2. Each line has at least three points which lie on it
3. Given a line and a point not on that line there exists a unique line lying on the plane containing them that the point lies on and no point of the first line lies on.
4. All points are spanned by d+1 points and no fewer.

## Linear representation of additive groups and the Fourier Transform: Part 1

In this article I will show that the cyclic group of order n, that is the set $$\{0,1,2,\ldots,n-1\}$$ under addition modulo n motivates the discrete Fourier transform on a particular finite dimensional complex inner product space, and gives many of its properties. In a subsequent article I will extend this to the general Fourier transform and its relation to the group of integers and real numbers under addition.

## Do you really mean ℝⁿ?

In mathematics and physics it is common to talk about $$\mathbb{R}^n$$ when really we mean something else that can be represented by $$\mathbb{R}^n$$.

Consider mechanics or geometry: these are often represented as theories in $$\mathbb{R}^n$$, but really they don’t occur in a vector space at all! Look around you, a three-dimensional description of space probably seems reasonable, but where’s the origin? [Perhaps the centre of your eyes could be an origin, but someone else would disagree with you]. Classical mechanics, special relativity and geometry are much better described as an affine space – which is a vector space without an origin.

## Tensor notation

Language affects the way you think, often subconsciously. The easier and more natural something is to express in a language the more likely you are to express it. This is especially true of mathematical thought where the language is very precise.

I know three types of notation for tensors and each seems to be useful in different situations and gives you a different perspective on how tensors “work”. [Technical Note: I will assume all vector spaces are finite dimensional so $$V$$ is naturally isomorphic to $$V^{**}$$]

## LaTeXing Multiple Equations

In mathematics and the (hard) sciences it’s important to be able to write documents with lots of equations, lots of figures and lots of references efficiently. This can be done in, for example, Microsoft Word, but the mathematics and theoretical physics community heavily prefers $$\TeX$$ (and in particular $$\LaTeX$$), so the bottom line is if you want to get papers published you’re going to have to get good at it.

There are a lot of resources for learning $$\LaTeX$$ on the web, and a lot of people teach themselves from these (I know I did), but this can get you into some bad habits. For instance eqnarray gets the spacing around the equals signs all wrong. (I typeset my thesis using exclusively eqnarray and didn’t notice this until it was pointed out to me). So a lot of people advocate align from AMSTeX, but align has its limitations too; it only comes with one alignment tab &. If you want to make a comment at the end of multiple equations (like “for $$x \in X$$”) or you want to have two equations and the second one breaks over two lines you can’t line the equations up properly; but there is a solution – IEEEeqnarray (which is an external class, IEEEtrantools, available from the IEEE). Stefan Moser has written an excellent paper covering everything I’ve said and much more, showing good ways to typeset equations.

## Solving polynomials of degree 2,3 and 4

$\newcommand\nth{n^{\mathrm{th}}}$
