Learning about the world through critical thinking, data and programming.

Code Structure Reflecting Function
programming

Code Structure Reflecting Function

I've been trying to extract job ads from Common Crawl. However I've been stuck on how to structure the code. Thinking through the relationships really helped me do this. The architecture of the pipeline is a set of methods that fetch source data, extract the structured data and normalise it into a common form to be combined. I previously had these methods all written in one large file, adding each extractor to a dictionary, which was a headache to look at.
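That relationship can be sketched as one class per source with those three steps. A minimal sketch of this kind of structure (class and method names are illustrative, not from the actual repository):

```python
from abc import ABC, abstractmethod

class JobDatasource(ABC):
    """One class per source: fetch raw data, extract records, normalise them."""

    @abstractmethod
    def fetch(self):
        """Return raw documents (e.g. HTML pages) for this source."""

    @abstractmethod
    def extract(self, raw):
        """Return source-specific structured records from one raw document."""

    @abstractmethod
    def normalise(self, record):
        """Map a source-specific record into the common schema."""

    def run(self):
        # The shared pipeline: fetch -> extract -> normalise, combined at the end
        return [self.normalise(r) for raw in self.fetch() for r in self.extract(raw)]

class ExampleSource(JobDatasource):
    """A toy source standing in for a real job ad site."""
    def fetch(self):
        return ["<html>Data Analyst|Melbourne</html>"]
    def extract(self, raw):
        return [raw.removeprefix("<html>").removesuffix("</html>")]
    def normalise(self, record):
        title, location = record.split("|")
        return {"title": title, "location": location}
```

Each source lives in its own file, and combining them is just running every subclass, instead of maintaining one large dictionary of extractors.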

  • Edward Ross
Setting the Icon in Jupyter Notebooks
jupyter

Setting the Icon in Jupyter Notebooks

I often have way too many Jupyter notebook tabs open and I have to distinguish them by the first couple of letters of the notebook name in front of the orange Jupyter book icon. What if we could change the icons to visually distinguish different notebooks? I thought I found a really easy way to set the icon in Jupyter notebooks... but it works in Firefox and not Chrome. I'll go through the easy solution, and the harder solution that works in more browsers.
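The easy approach presumably amounts to injecting a favicon link tag into the notebook page. A sketch of what that snippet might look like (the URL is a placeholder; in a notebook you would render the string with IPython.display.HTML):

```python
def favicon_html(url: str) -> str:
    """Return an HTML snippet that swaps the page's favicon.
    As noted above, this kind of injection works in Firefox
    but not reliably in Chrome."""
    return f'<link rel="icon" type="image/png" href="{url}">'

# Placeholder URL for illustration only
snippet = favicon_html("https://example.com/notebook-icon.png")
```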

  • Edward Ross
Retrying Python Requests
python

Retrying Python Requests

The computer networks that make up the internet are complex and handle an immense amount of traffic. So sometimes when you make a request it will fail intermittently, and you want to retry until it succeeds. This is easy in requests using urllib3's Retry. I was trying to download data from Common Crawl's S3 exports, but occasionally the process would fail due to a network or server error. My process would keep the successful downloads using an AtomicFileWriter, but I'd have to restart the process.
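A sketch of wiring urllib3's Retry into a requests Session (the retry counts and status codes are my choices, not necessarily the post's):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def retrying_session(total=5, backoff_factor=1):
    """A requests Session that retries failed requests with increasing waits."""
    retry = Retry(
        total=total,
        backoff_factor=backoff_factor,          # exponentially increasing waits between tries
        status_forcelist=[500, 502, 503, 504],  # also retry on these server errors
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

session = retrying_session()
# session.get("https://...") now retries intermittent failures automatically
```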

  • Edward Ross
Decorating Pandas Tables
python

Decorating Pandas Tables

When looking at Pandas dataframes in a Jupyter notebook it can be hard to find what you're looking for in a big mess of numbers. Something that can help is formatting the numbers, making them shorter and using graphics to highlight points of interest. Using Pandas style you can make the story of your dataframe stand out in a Jupyter notebook, and even export the styling to Excel. The Pandas style documentation gives pretty clear examples of how to use it.

  • Edward Ross
A First Cut of Job Extraction
jobs

A First Cut of Job Extraction

I've finally built a first iteration of a job extraction pipeline in my job-advert-analysis repository. There's nothing in there that I haven't written about, but it's simply doing the work to bring it all together. I'm really happy to have a full pipeline that extracts lots of interesting features to analyse, and is easy to extend. I've already talked about how to extract jobs from Common Crawl and the architecture for extracting the data.

  • Edward Ross
Which /bin/sh
programming

Which /bin/sh

I tried to run a shell script and got this error: set: Illegal option -o pipefail. I had a quick look and the first line was #!/bin/sh; -o pipefail isn't valid across POSIX shells, so I would expect that to fail. More specifically, on modern Ubuntu /bin/sh is dash, which doesn't support these bash-like constructions. But /bin/sh is very different on different systems; on some it is bash, on others it's ash (from which dash is derived), and on others it's ksh or something else.

  • Edward Ross
Operating a Tower of Hacks
programming

Operating a Tower of Hacks

Remember after you run the update process to run the fix script on the production database. But run it twice because it only fixes some of the rows the first time. Oh, and don't use the old importer tool in the import directory, use the one in the scripts directory now. You already used the old one? It's ok, just manually alter the production database with this gnarly query. Ah right, I see the filler table it uses is corrupted, let's just copy it from a backup.

  • Edward Ross
Packaging your Expertise in a Tiny Product
business

Packaging your Expertise in a Tiny Product

I was listening to the $100 MBA Podcast about How to Easily Create a Small Information Product. I really like the idea of building a tiny product in under 12 hours of work to get experience making something and building success in small steps. Their basic premise is that it's really easy to create a simple product that lets you share your expertise to help people make a small transformation: create a one-page infographic or cheatsheet in PowerPoint and export it as a PDF; write a small ebook in a word processor and convert it to PDF (or an ebook with Pandoc); or create some short videos using any HD camera and a USB microphone. Then you can easily use Gumroad for fulfilment, even for a free product.

  • Edward Ross
Energy to Orbit vs Launch into Deep Space
insight

Energy to Orbit vs Launch into Deep Space

This is from Sanjoy Mahajan's The Art of Insight Problem 1.11 Estimate the energy in a 9-volt battery. Is it enough to launch the battery into orbit? I tried to answer this with the energy density required to launch into deep space. But this is different to going into orbit; how much energy is required to get into low Earth orbit? Low Earth Orbit A low orbit has to be above the height of the atmosphere (otherwise it will require propulsion to overcome atmospheric friction), and so is typically above 300 km.

  • Edward Ross
Energy Density to Launch into Space
insight

Energy Density to Launch into Space

This is from Sanjoy Mahajan's The Art of Insight Problem 1.11 Estimate the energy in a 9-volt battery. Is it enough to launch the battery into orbit? I have already (mis)estimated the energy of a battery, but looked it up as 500 mAh. Energy density required to launch into space To launch into space you have to supply enough energy to counteract the change in gravitational energy (at least; you'll need more for air resistance).
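The comparison can be run in a few lines. The numbers here are my own rough figures (battery mass of 45 g is an assumption), using g·R as the energy per kilogram to fully escape Earth's gravity well:

```python
g = 9.8          # m/s^2, gravity at Earth's surface
R_earth = 6.4e6  # m, radius of the Earth

# Energy per kg to climb out of Earth's gravity well: g * R_earth ~ 63 MJ/kg
escape_density = g * R_earth / 1e6   # MJ/kg

# A 9 V, 500 mAh battery stores 9 * 0.5 Wh = 4.5 Wh = 16,200 J
battery_joules = 9 * 0.5 * 3600
battery_density = battery_joules / 0.045 / 1e6   # MJ/kg, assuming a 45 g battery
```

The battery's energy density comes out around 0.36 MJ/kg, a couple of orders of magnitude short of what's needed.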

  • Edward Ross
Why is Vmemm Using All My Memory?
wsl

Why is Vmemm Using All My Memory?

My Windows laptop was slowing to a crawl; I was waiting seconds to switch windows and even typing took a couple of seconds to respond. I opened the task manager by hitting Ctrl-Shift-Esc and saw that Vmemm was using >95% of my memory. What the heck is Vmemm and how can I stop it using all my memory? Vmemm is the process associated with virtual machines on Windows. I'm using WSL2 and Docker (through WSL2), and so all their memory appears under Vmemm.
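One common mitigation (my usual fix, not necessarily the one the post lands on) is capping how much RAM WSL2 may take, in a .wslconfig file in your Windows user profile:

```ini
# %UserProfile%\.wslconfig -- limits WSL2's share of RAM
# (run "wsl --shutdown" or restart for changes to take effect)
[wsl2]
memory=4GB
swap=2GB
```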

  • Edward Ross
Run Webserver Without Root
programming

Run Webserver Without Root

You've written your web application or API and you now want to deploy it to a server. You don't want to run it as root, because if someone finds a vulnerability in the server then it will be trivial for them to take over the system. However only root has permission to bind to the privileged ports below 1024, which include 80 and 443. There are a few ways to do this, but only a couple that make sense for an interpreted language (like Python, as opposed to a compiled binary).
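One common approach (not necessarily the one the post recommends) is a reverse proxy: a root-started server like nginx binds the privileged port and forwards to the app running as an unprivileged user on a high port. A sketch, with illustrative names and paths:

```nginx
# /etc/nginx/sites-available/myapp (illustrative)
server {
    listen 80;
    server_name example.com;

    location / {
        # The Python app listens on an unprivileged port as a normal user
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```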

  • Edward Ross
Myth of the Hawthorne Effect
general

Myth of the Hawthorne Effect

The Hawthorne effect comes from a study measuring the effect of lighting changes on worker output in an electrical factory: any change increased output, even changing back to the original lighting conditions. I've heard this explained as the experiment causing the employees to be observed more closely, which led them to work harder, and used as a rationale for observing employees more. Except the Hawthorne effect is a myth. The economists Steven D.

  • Edward Ross
Running out of Resources on AWS Athena
athena

Running out of Resources on AWS Athena

AWS Athena is a managed version of Presto, a distributed database. It's very convenient to be able to run SQL queries on large datasets, such as Common Crawl's Index, without having to deal with managing the infrastructure of big data. However the downside of a managed service is when you hit its limits there's no way of increasing resources. Today I was running some queries for a regular reporting pipeline in Athena when I got a failure with the error Query exhausted resources at this scale factor.

  • Edward Ross
Building a Job Extraction Pipeline
jobs

Building a Job Extraction Pipeline

I've been trying to extract job ads from Common Crawl. However I was stuck for some time on how to actually write transforms for all the different data sources. I've finally come up with an architecture that works; download, extract and normalise. I need a way to extract the job ads from heterogeneous sources that allows me to extract different kinds of data, such as the title, location and salary. I got stuck in code for a long time trying to do all this together and getting a bit confused about how to make changes.

  • Edward Ross
Insights From Google Analytics for a Small Blog
analytics

Insights From Google Analytics for a Small Blog

I started regularly writing this website to get better at writing, to build a portfolio and share my learnings. Because of this I haven't been focussed on building an audience or looking at analytics. However now that I've been writing continuously for 6 months I thought I'd see if I'd learned anything interesting from looking at Google Analytics. I installed Google Analytics a couple of weeks ago on the website to see how people are actually viewing my site.

  • Edward Ross
Importance of Collecting Your Own Training Data
data

Importance of Collecting Your Own Training Data

A couple of years ago I built whatcar.xyz which predicts the make and model of Australian cars. It was built mainly with externally sourced data and so only works sometimes, under good conditions. To make it better I've started collecting my own training data. External data sources are extremely convenient for training a model as they can often be obtained much more cheaply than curating your own data. But the data will almost always be different to what you are actually performing inference on, and so you're relying on a certain amount of generalisation.

  • Edward Ross
Unhappy Path Programming
programming

Unhappy Path Programming

When programming it's easy to think about the happy path. The path along which you get well-formed valid data, all your requests return successfully and everything works on your target platform. When you're in this mindset it's easy to just check it works in one case and assume everything is alright. But the majority of real work in programming is the unhappy paths. While you always need to be thinking about how things could go wrong, it's much more important in web programming.

  • Edward Ross
Updating a Python Project: Whatcar
whatcar

Updating a Python Project: Whatcar

The hardest part of programming isn't learning the language itself, it's getting familiar with the gotchas of the ecosystem. I recently updated my whatcar car classifier in Python after leaving it for a year and hit a few roadblocks along the way. Because I'm familiar with Python I knew enough heuristics to work through them quickly, but it takes experience with running into problems to get there. I thought I had done a good job of making it reproducible by creating a Dockerfile for it.

  • Edward Ross
Building NLP Datasets from Scratch
nlp

Building NLP Datasets from Scratch

There's a common misconception that the best way to build up an NLP dataset is to first define a rigorous annotation schema and then crowdsource the annotations. The problem is that it's actually really hard to guess the right annotation schema up front, and this is often the hardest part on the modelling side (as opposed to the business side). This is explained wonderfully by spaCy's Matthew Honnibal at PyData 2018.

  • Edward Ross
Orderly Life for Original Work
general

Orderly Life for Original Work

Be settled in your life and as ordinary as the bourgeois, in order to be fierce and original in your works. Gustave Flaubert, To Gertrude Tennant (December 25, 1876) It's hard to find the energy and focus to be creative when your life is a mess. Before you can be productive you need to sleep well, eat well, exercise well and have good routines and social supports. See here for more on the origin of this quote.

  • Edward Ross
Experimental Generalisability
statistics

Experimental Generalisability

Experiments reveal the relationship between inputs and outcomes. With statistical methods you can often, with enough observations, tell whether there's a strong relationship or if it's just noise. However it's much harder to know how generally the relationship holds, yet that's essential for making decisions. Suppose you're testing two alternate designs for a website. One has a red and green button with a Santa hat and bauble, and the other has a blue button.

  • Edward Ross
Choosing a Static Site Generator
blog

Choosing a Static Site Generator

Static website generators fill a useful niche between handcoding all your HTML and running a server. However there's a plethora of site generators and it's hard to choose between them. Fortunately I've got a simple recommendation: if you're writing a blog use Jekyll (if you don't want to use something like Wordpress). Static website generators compile input assets into a set of static HTML, CSS and Javascript files that can be deployed almost anywhere.

  • Edward Ross
Social Flashcards
general

Social Flashcards

I'm terrible at remembering names. When someone introduces themselves I'm normally a bit anxious and in my own head and don't take in their name. It takes conscious effort to remember their name, let alone the names of their family or facts about them. However remembering things about people is really important for building relationships. If you take an interest in other people's lives they will be more receptive to you.

  • Edward Ross
Can I? Must I? Should I?
general

Can I? Must I? Should I?

Whenever someone gets an idea in their head they start filtering out evidence that contradicts that idea. This is called confirmation bias: people start looking for evidence that confirms their current idea and neglect evidence that challenges it. There's no way to completely beat a bias, but something that helps me is reframing the question. The first question that comes to mind is normally "Can". Can it be? This leads to looking for evidence that confirms the idea.

  • Edward Ross
Learning Hugo by Editing Themes
programming

Learning Hugo by Editing Themes

One of the hardest parts of learning something new is motivation. This is why one of the best ways to learn programming is editing code; it's goal driven so motivation is built in. I've successfully used this to start learning how to write Hugo themes. Now that I've got a reasonable collection of posts, over 250, I would like to understand what content people are actually accessing on this website to get an idea of what would be useful.

  • Edward Ross
Manually Triggering Github Actions
programming

Manually Triggering Github Actions

I have been publishing this website using Github Actions with Hugo on push and on a daily schedule. I recently received an error notification via email from Github, and wanted to check whether it was an intermittent error. Unfortunately I couldn't find any way to rerun it manually; I would have to push again or wait. Fortunately there's a way to enable manual reruns with workflow_dispatch. There's a Github blog post on enabling manual triggers with workflow_dispatch.
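Enabling it is a matter of adding workflow_dispatch to the workflow's triggers, which makes a "Run workflow" button appear in the Actions tab. An illustrative fragment (the file name and branch are assumptions):

```yaml
# .github/workflows/publish.yml (illustrative)
on:
  push:
    branches: [master]
  schedule:
    - cron: '0 0 * * *'   # daily
  workflow_dispatch:       # allows manual reruns from the Actions tab
```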

  • Edward Ross
R: Keeping Up With Python
r

R: Keeping Up With Python

About 5 years ago a colleague told me that the days were numbered for R and that Python had won. From his perspective he is probably right; in software engineering companies Python has gained increasing adoption in programmatic analytics. However R has its own set of unique strengths which make it more appealing to stats people, and it has kept up surprisingly well with Python. Python has a wider audience than R, and lives up to its reputation as "not the best language for anything but the second best language for everything".

  • Edward Ross
Population Density Australia
insight

Population Density Australia

How dense is the population in Australia? I've looked at the Gridded Population of the World and you can see that the population is concentrated around the few capital cities on the coast. It's hard to visually average something so lumpy, but it's easy to estimate it. I know it's about 10 hours driving from Melbourne to Sydney, and about the same again to Brisbane. Brisbane is about halfway between Melbourne in the south and Cairns in the far north.

  • Edward Ross
Gridded Population of the World
data

Gridded Population of the World

I've spent the last few hours looking at the Gridded Population of the World, which estimates population density consistently with national censuses and population registers. This would have been a massive job to compile and is really interesting to look at. You can immediately see a strip through the north of India, Pakistan and Bangladesh that is incredibly dense. The north-east of China and the island of Java in Indonesia are also very dense.

  • Edward Ross
Implicit Bias
general

Implicit Bias

I like to think of myself as an egalitarian, but I know I have implicit bias. I've done some tests on Project Implicit and have roughly the implicit biases you would expect for my demographic. This makes me feel a bit sad, but you can't really control your implicit biases, they're a function of your environment and perception growing up. The key question is given that we have implicit biases how do we act against them?

  • Edward Ross
Finding Files Installed in Ubuntu and Debian
programming

Finding Files Installed in Ubuntu and Debian

My bashrc file sources the git prompt helper to show the branch I'm on in the prompt. Unfortunately it's quite old and was pointing to the wrong file; how do I find where it is? dpkg -L git | grep prompt On Debian and its derivatives such as Ubuntu you can use apt to manage packages (e.g. apt upgrade, apt install). However apt is just a thin layer over dpkg that does useful things like resolving dependencies and downloading files.

  • Edward Ross
The Fifth Risk
books

The Fifth Risk

Michael Lewis' The Fifth Risk promotes parts of the US public service and some people who work in it. The public service is culturally opposed, if not legally prevented, from promoting itself which means a lot of the successes and heroes go unsung. Michael Lewis spells out what some of the largest, yet most obscure, parts of the US government accomplish and how they could be at risk through mismanagement by the Trump administration.

  • Edward Ross
Endurance Counting
general

Endurance Counting

Counting is a strangely powerful tool for enduring through something. Standard advice when you're angry is to count to ten. When stretching, counting to a target number helps sustain the stretch longer. A good counting-based technique for endurance is box breathing. It involves repeatedly inhaling to a count of 4, holding to a count of 4, exhaling to a count of 4 and holding to a count of 4. This is a technique used by Navy SEALs to induce calm and focus.

  • Edward Ross
Moving Away From Keepass
tools

Moving Away From Keepass

A password manager is one of the best ways for the majority of people to keep their logins secure. After using KeePass and its derivatives for years, the Kee Firefox Addon dropped support for KeePass and it's now less convenient to use. After looking at the alternatives I'm going to switch to an online alternative. One of the most frequent ways people get their accounts hacked is by password reuse. Their email and password are revealed in some online breach of a website, and then these credentials can be used on other websites.

  • Edward Ross
Finding Analytics in Melbourne
analytics

Finding Analytics in Melbourne

My first job in analytics was in large part luck. I had an academic background in Physics and Mathematics, some professional programming experience building applications and self-studied computer science. I searched for "python" jobs, since I liked the language, and applied for a job titled something like "Awk, Bash and Grep". I didn't get that job, but was forwarded on to the data engineering team building bespoke reports. That was at a medium-sized company called Hitwise that provided digital competitive insights.

  • Edward Ross
Devil Take The Hindmost: Book Summary
books

Devil Take The Hindmost: Book Summary

Edward Chancellor's Devil Take the Hindmost: A History of Financial Speculation is a history of several market bubbles and crashes. It covers bubbles such as the South Sea Bubble, the 1920s bubble in the US stock market preceding the Great Depression, the dotcom bubble of the 1990s and Japan in the 1980s. The main lessons I took were that if a market sounds too good to be true it probably is, that highly leveraged financial instruments tend to prolong and worsen bubbles, and that often the people who bear the cost of reckless speculation are different from the people who take and profit from it.

  • Edward Ross
Australian Deathographics
insight

Australian Deathographics

I've recently tried to estimate Australian Deaths using life expectancy. This failed badly and I think the reason is demographics; this article looks more into this. The Australian Bureau of Statistics has population by age, and the Australian Institute of Health and Welfare has Mortality Over Time and Regions (MORT) which summarises the current probability of death by age range. Here is a super summarised version of this data:

Age    | Population | Death Rate | Population Deaths | Fraction of Deaths
0-19   | 25%        | 0%         | 0%                | 0%
20-39  | 29%        | 0%         | 0%                | 0%
40-59  | 25%        | 0%         | 0%                | 0%
60-79  | 17%        | 1.

  • Edward Ross
Checking Australian Oil Imports
insight

Checking Australian Oil Imports

I've estimated Australian oil imports; here I check the data to see how reasonable my estimates are. The overall tree diagram for the estimate is below: [tree diagram: oil imports of 1.3 million barrels/day, from 200 ML/day imported at 160 L per barrel; consumption of 200 ML/day is oil consumed by cars (100 ML/day) times a total-to-car factor of 2; car consumption comes from 20 million cars, out of 25 million people, at 5 L/day each]
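The arithmetic of the estimate can be checked in a few lines, using the values from the tree:

```python
cars = 20e6        # number of cars (25 million people)
per_car = 5        # litres of oil consumed per car per day
car_fraction = 2   # total oil consumed / oil consumed by cars
import_ratio = 1   # oil imported / oil consumed
barrel = 160       # litres per barrel

consumption = cars * per_car * car_fraction              # 200 ML/day
imports_barrels = consumption * import_ratio / barrel    # ~1.25 million barrels/day
```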

  • Edward Ross
Australian Oil Imports
insight

Australian Oil Imports

This is a variation of Sanjoy Mahajan's The Art of Insight Section 1.4 (and problem 1.6) How much oil does Australia import (in Barrels per Day)? As in the text we approach this by estimating car consumption. [tree diagram: oil imports (barrels/day) from imports in L/day and the size of a barrel; imports in L/day from oil consumed per day and the imported-to-consumed ratio; consumption from oil consumed by cars times the total-to-car ratio; car consumption from the number of cars (people times cars per person) and oil consumed per car] To estimate imports we estimate demand, since that is estimable.

  • Edward Ross
Redundancy on Phone Power Button
general

Redundancy on Phone Power Button

My 5 year old OnePlus One's power button has finally worn out, to the point where I can't press it. I panicked when the battery ran out - I was afraid I wouldn't be able to power it back on. However I found a video demonstrating how to turn it on with a power cable and the volume button. Pressing the volume down button when you attach the power cable to a computer puts it into recovery mode and you can boot it from there.

  • Edward Ross
How many People in Australia Die?
insight

How many People in Australia Die?

How many people die in Australia each year? The life expectancy in Australia is about 80 years, and the population is 25 million. So each year the number of people that die would be about 25 million divided by 80, which is about 300,000. The actual number of people that died in 2018 is 160,000. This is about half my estimate; what am I doing wrong? One factor is life expectancy is at birth, the longer people live the longer they will be expected to live.
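The naive arithmetic, as a quick check:

```python
population = 25e6       # people in Australia
life_expectancy = 80    # years

naive_deaths = population / life_expectancy   # 312,500 -- "about 300,000"

actual_deaths_2018 = 160_000   # roughly half the naive estimate
```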

  • Edward Ross
Checking Australian Births Estimates
insight

Checking Australian Births Estimates

I estimated the number of Australian births as 250,000. The actual number of births, according to the Australian Institute of Family Studies, is around 310,000. Where did I go wrong? My estimate was 25 million times 0.8 children per person lifetime divided by a lifetime of 80 years. The actual total fertility rate is 1.74 per woman, giving a birth rate of around half this, 0.87 per person, which is significantly higher than I estimated.

  • Edward Ross
Australian Births
insight

Australian Births

How many babies are born in Australia? Australia has 25 million people. I would estimate the birth rate is 0.8 children per person; I think it's slightly less than one. These children are born across life, which is about 80 years. So a really crude estimate for annual births is 25 million people times (0.8 children per person lifetime) divided by 80 years per lifetime. This is 20 million divided by 80 which is 250 thousand.
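The estimate in code:

```python
population = 25e6          # people in Australia
children_per_person = 0.8  # per lifetime (my estimate, slightly below one)
lifetime = 80              # years

births_per_year = population * children_per_person / lifetime  # 250,000
```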

  • Edward Ross
SICP Exercise 1.5
sicp

SICP Exercise 1.5

Exercise from SICP: Exercise 1.5. Ben Bitdiddle has invented a test to determine whether the interpreter he is faced with is using applicative-order evaluation or normal-order evaluation. He defines the following two procedures. (define (p) (p)) (define (test x y) (if (= x 0) 0 y)) Then he evaluates the expression (test 0 (p)) What behavior will Ben observe with an interpreter that uses applicative-order evaluation? What behavior will he observe with an interpreter that uses normal-order evaluation?

  • Edward Ross
SICP Exercise 1.3
sicp

SICP Exercise 1.3

Exercise from SICP: Exercise 1.3. Define a function that takes three numbers as arguments and returns the sum of the squares of the two larger numbers. Solution The first thing we need to do is to work out the largest two of the three numbers. We can do this with a conditional statement. (define (sum-square-largest-two a b c) (cond ((and (<= a b) (<= a c)) (sum-of-squares b c)) ((and (<= b a) (<= b c)) (sum-of-squares a c)) ((and (<= c a) (<= c b)) (sum-of-squares a b))))

  • Edward Ross
SICP Exercise 1.1
sicp

SICP Exercise 1.1

Exercise from SICP: Exercise 1.1. Below is a sequence of expressions. What is the result printed by the interpreter in response to each expression? Assume that the sequence is to be evaluated in the order in which it is presented. 10 (+ 5 3 4) (- 9 1) (/ 6 2) (+ (* 2 4) (- 4 6)) (define a 3) (define b (+ a 1)) (+ a b (* a b)) (= a b) (if (and (> b a) (< b (* a b))) b a) (cond ((= a 4) 6) ((= b 4) (+ 6 7 a)) (else 25)) (+ 2 (if (> b a) b a)) (* (cond ((> a b) a) ((< a b) b) (else -1)) (+ a 1)) Solution We can step through these using the substitution model with environment.

  • Edward Ross
Tree Diagram Bills
insight

Tree Diagram Bills

This is from Sanjoy Mahajan's The Art of Insight Problem 1.5 Make a tree diagram for your estimate in Problem 1.3. Do it in three steps: (1) Draw the tree without any leaf estimates, (2) estimate the leaf values, and (3) propagate the leaf values upward to the root. This is referring to the suitcase of money. Step 1: Tree [tree diagram: value of suitcase from the number of bank notes times the value per note; the number of notes from suitcase volume divided by note volume; each volume from its width, height and depth or thickness] Step 2: Annotated Leaves [the same tree with leaf estimates filled in]

  • Edward Ross
Mixing Warm Water
insight

Mixing Warm Water

I used to have a fancy kettle that came with settings for heating water to different temperatures between 80° C and 100° C. However it's really easy to get water at any temperature using an ordinary kettle by mixing with refrigerated water. When you mix together two volumes of water at different temperatures their volumes add and the resulting temperature is a volume weighted average of the temperatures. For example if you take 25mL of water at 10° C and 75mL of water at 40° C you will get 100mL of water at 32.
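The volume-weighted average in code:

```python
def mix_temperature(vol1, temp1, vol2, temp2):
    """Temperature of the mixture of two volumes of water:
    a volume-weighted average of the two temperatures."""
    return (vol1 * temp1 + vol2 * temp2) / (vol1 + vol2)

# 25 mL at 10°C mixed with 75 mL at 40°C
t = mix_temperature(25, 10, 75, 40)  # 32.5°C
```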

  • Edward Ross
Gold or Bills
insight

Gold or Bills

This is from Sanjoy Mahajan's The Art of Insight Problem 1.4 As a bank robber sitting in the vault planning your getaway, do you fill your suitcase with gold bars or $100 bills? Assume first that how much you can carry is a fixed weight. Then redo your analysis assuming that how much you can carry is a fixed volume. As I estimated in suitcase of money the mass of a paper note is about 1 gram, and the volume is about 1 cm³.

  • Edward Ross
Estimating Weight with Body Mass Index
insight

Estimating Weight with Body Mass Index

When estimating things it's good to find approximate constants, typically ratios, that are easier to remember than things that vary. The Body Mass Index (BMI) is an example for measuring people. It's relatively easy to measure human height, as a human. For example I'm about 180cm tall; the top of my nose is about 170cm, the bottom of my chin is about 160cm and the bottom of my neck is about 150cm.
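Since BMI = weight / height², an easily measured height plus a typical BMI gives a weight estimate. A sketch (the BMI of 23 is my assumed constant):

```python
def weight_from_bmi(height_m, bmi):
    """BMI = weight / height^2, so weight = BMI * height^2 (weight in kg)."""
    return bmi * height_m ** 2

# e.g. someone 1.80 m tall at a typical BMI of ~23
w = weight_from_bmi(1.80, 23)   # about 75 kg
```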

  • Edward Ross
Diagrams in Hugo with Mermaid
hugo

Diagrams in Hugo with Mermaid

Being able to write simple diagrams with text is very convenient. We can do this in Hugo by rendering with mermaid.js. In particular I want to render some factor tree diagrams of the style of The Art of Insight. Like this one: The final result looks like: graph LR; A[sheets ream-1 500] --|-1| B[thickness 10-2cm ] C[thickness ream-1 5cm] -- B B -- D[volume 1cm3] E[height 6cm] -- D F[width 15cm] -- D Implementation I copied the Mermaid Hugo shortcode from the learn theme and put it in layouts/shortcodes/mermaid.

  • Edward Ross
How much Money is in a Suitcase?
insight

How much Money is in a Suitcase?

This is from Sanjoy Mahajan's The Art of Insight Problem 1.3 In the movies, and perhaps in reality, cocaine and elections are bought with a suitcase of $100 bills. Estimate the dollar value in such a suitcase. Size of $100 note Let's assume a banknote is about the same thickness as paper; Australian notes are probably a little bit thicker. A 500 page ream of paper is about 5cm tall, so each sheet is about 0.
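The estimate can be followed through in code. All the inputs below are my own rough assumptions (note size, suitcase dimensions), continuing the ream-of-paper reasoning above:

```python
# 500 sheets per ream at about 5 cm tall -> thickness per note
note_thickness_cm = 5 / 500                      # 0.01 cm

# A banknote roughly 15 cm x 6.5 cm (my assumption)
note_volume_cm3 = 15 * 6.5 * note_thickness_cm   # ~1 cm^3 per note

# A largish suitcase, roughly 60 x 45 x 18 cm (my assumption)
suitcase_volume_cm3 = 60 * 45 * 18

notes = suitcase_volume_cm3 / note_volume_cm3    # ~50,000 notes
value = notes * 100                              # $100 bills -> ~$5 million
```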

  • Edward Ross
Programming Languages to Learn in 2020
programming

Programming Languages to Learn in 2020

A language that doesn't affect the way you think about programming, is not worth knowing. Alan Perlis I spend a lot of time programming in Python and SQL, some time in Bash and R (or at least tidyverse), and a little in Java and Javascript/HTML/CSS. This set of tools is actually pretty versatile about getting things done, but is fairly narrow from a programming concept perspective. Once in a while I think it's useful to broaden the programming frame to understand different ways of doing things; even if you still stick to the same few languages.

  • Edward Ross
Some Ideas for Recurring Articles
planning

Some Ideas for Recurring Articles

Radio shows, comedy sketch shows and talk shows have the difficult task of filling air time with less structured content. A technique used in all of these mediums to help fill the gaps is a recurring segment. The Saturday Night Live Weekend Update is an example of this. Using a structured recurring segment with a familiar pattern and style gives a structured environment to be creative in. It's really hard to be creative in a completely unstructured and original way, like Monty Python was, since there are so many options.

  • Edward Ross
Diffing in SQL
sql

Diffing in SQL

One way of refactoring legacy code is to use diff tests; checking what changes when you change the code. While it can be easy to diff files, it's a little less obvious how to do this with SQL pipelines. Fortunately there are a few different techniques to do this. For exact matching you can use union all to find the number of rows that don't occur in both datasets. For approximate matching you can use a join to check whether the differences are within some bounds.
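A minimal sketch of the exact-match technique, using Python's built-in sqlite3 standing in for a real warehouse (the table names and data are made up): union all the old and new outputs and count the rows that don't appear in both. Note this simple form assumes neither table contains duplicate rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE old_result (id INT, value INT);
CREATE TABLE new_result (id INT, value INT);
INSERT INTO old_result VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO new_result VALUES (1, 10), (2, 25), (3, 30);
""")

# Rows appearing exactly twice occur in both outputs; anything else differs.
diffs = con.execute("""
SELECT id, value, COUNT(*) AS n
FROM (SELECT * FROM old_result UNION ALL SELECT * FROM new_result)
GROUP BY id, value
HAVING COUNT(*) <> 2
""").fetchall()
# The changed row shows up once from each side: (2, 20) and (2, 25)
```

An empty result means the refactor didn't change the output, which is exactly what a diff test wants to see.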

  • Edward Ross
Diff Tests
legacy code

Diff Tests

When making changes to code, tests are a great way to make sure you haven't inadvertently introduced regressions. This means that you can make changes much faster with more confidence, knowing that your tests will catch many careless mistakes. But what do you do when you're working with a legacy codebase that doesn't have any tests? One method is creating diff tests; testing how your changes impact the output. For batch model training or ETL pipelines there's typically a natural way to do this.

  • Edward Ross
Dataflow Chasing
data

Dataflow Chasing

When making changes to a new model training pipeline I find it really useful to understand the dataflow. Analytics workflows are done as a series of transformations, taking some inputs and producing some outputs (or in the case of mutation; an input is also an output). Seeing this dataflow helps give a big picture overview of what is happening and makes it easier to understand the impact of changes. Generally you can view the process as a directed and (hopefully) acyclic graph.

  • Edward Ross
Comment to Function
programming

Comment to Function

A lot of analytics code I've read is a very long procedural chain. These can be hard to follow because the only way to really know what's going on in any point is to insert a probe to inspect the inputs and outputs at that stage. Breaking these into functions is a really useful way of making the code easier to understand, change and find bugs in. In Martin Fowler's Refactoring he mentions that whenever there's a block of code that has (or requires) a comment to describe what it does, that's a good opportunity to package that code into a function.

  • Edward Ross
Tidy Time
general

Tidy Time

I love having a clean desk and empty inbox. But I hate spending the time cleaning my desk and processing emails. It feels like wasted time where I could do something better. However having "tidy time" to maintain things is important. A while ago I read David Allen's Getting Things Done. When I tried to implement it I got stuck on the notion of a weekly review. Setting aside some time every week to see how you're progressing on tasks and to process any new tasks.

  • Edward Ross
From Multiprocessing to Concurrent Futures in Python
python

From Multiprocessing to Concurrent Futures in Python

Waiting for independent I/O can be a performance bottleneck. This can be things like downloading files, making API calls or running SQL queries. I've already talked about how to speed this up with multiprocessing. However it's easy to move to the more recent concurrent.futures library which allows running on threads as well as processes, and allows handling more complicated asynchronous flows. From the previous post suppose we have this multiprocessing code:
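As a sketch of the shape of the change (the download function here is a stand-in for real I/O, not the post's actual code): multiprocessing.Pool.map becomes Executor.map, and you can choose threads or processes.

```python
from concurrent.futures import ThreadPoolExecutor

def download(url):
    # Stand-in for an I/O-bound task such as an HTTP request.
    return len(url)

urls = ["https://example.com/a", "https://example.com/bb"]

# multiprocessing.Pool(4).map(download, urls) becomes executor.map;
# swap in ProcessPoolExecutor to run on processes instead of threads.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(download, urls))
```

Note that executor.map returns a lazy iterator rather than a list, so it needs to be consumed before the executor shuts down.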

  • Edward Ross
Approximate Percentiles in Presto and Athena
presto

Approximate Percentiles in Presto and Athena

Calculating percentiles and quantiles is a common operation in analytics. While they can be done in vanilla SQL with window functions and row counting, it's a bit of work and can be slow and in the worst case can hit database memory or execution time limits. Presto (and Amazon's hosted version Athena) provide an approx_percentile function that can calculate percentiles approximately on massive datasets efficiently. When running this I found that it was non-deterministic.

  • Edward Ross
Git Stash Changesets
git

Git Stash Changesets

Pretty frequently I start writing some code, when I realise there's another change I need to make before I can continue. I like to make lots of small atomic changes to a code base because it lets me test more quickly and catch errors earlier. I used to do this by saving my changes in a temporary file, but this was clunky. A better way is with git stash. But git stash reverts all files; and very often I want to keep some, especially configuration parameters.

  • Edward Ross
Solving Solved Problems
general

Solving Solved Problems

A good technique for deeply understanding something is to try to solve it yourself first. Sometimes this can even lead to better methods or new discoveries. I heard an interesting technique from Jeremy Howard in one of the fast.ai courses about how to read a paper. First read the abstract and introduction. Then spend a couple of days trying to implement what you think they're talking about. Then go back and read the rest of the paper and see how it compares to what you did.

  • Edward Ross
Contact Tracing in Fighting Epidemics
data

Contact Tracing in Fighting Epidemics

The state government of Victoria, Australia has recently announced a plan on how to respond to the current Covid-19 pandemic. Based on epidemiological modelling they plan to reduce restrictions according to 14 day averages of new case numbers. If the 14 day average daily new cases are 30-50 in 3 weeks they will reduce restrictions; if they are below 5 a month after that they will reduce restrictions again.

  • Edward Ross
Modelling the Spread of Infectious Disease
maths

Modelling the Spread of Infectious Disease

Understanding the spread of infectious disease is very important for policies around public health. Whether it's the seasonal flu, HIV or a novel pandemic the health implications of infectious diseases can be huge. A change in decision can mean saving thousands of lives and relieving massive suffering and related economic productivity losses. The SIR model is a model that is simple, but captures the underlying dynamics of how quickly infectious diseases spread.
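The SIR dynamics can be sketched in a few lines of Python with a simple Euler step (the parameter values here are illustrative, not fitted to any real disease):

```python
def sir_step(s, i, r, beta=0.3, gamma=0.1, dt=1.0):
    """One Euler step of the SIR equations:
    dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I."""
    new_infections = beta * s * i * dt
    recoveries = gamma * i * dt
    return s - new_infections, i + new_infections - recoveries, r + recoveries

# Start with 1% of the population infected and run for 200 days.
s, i, r = 0.99, 0.01, 0.0
for _ in range(200):
    s, i, r = sir_step(s, i, r)
```

The three compartments always sum to the whole population, and with beta/gamma = 3 (a basic reproduction number of 3) most of the population ends up in the recovered compartment.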

  • Edward Ross
Time Budgeting
general

Time Budgeting

It's worthwhile spending some time thinking about how you spend your time. Time and energy are among your most valuable resources. A regular investment of time can build into substantial assets, but if you don't budget time it's easily misspent. I don't believe that you should allocate away all of your time, but setting some time constraints is important. If you don't put the big rocks of things that are important to you in the jar first, all the sand and water of mundane things will fill it up.

  • Edward Ross
Fixing suddenly unable to connect to X server in WSL2
wsl

Fixing suddenly unable to connect to X server in WSL2

Today when I tried to connect to VcXsrv after running it with XLaunch it didn't work. I'd had it working for months and so was surprised it suddenly stopped working. The reason was simple; the IP subnet of WSL2 had changed and so it was now being blocked by a firewall. Annoyingly there is very little feedback as to why it can't connect to an X server. I went back through my previous instructions of setting up an X server in WSL2, but noticed something.

  • Edward Ross
Exceed Expectations
general

Exceed Expectations

Today I saw a sign in someone's window: "Always Exceed Everyone's Expectations". My initial reaction was that was a quick way to burnout - trying to always exceed expectations sounds like running on a treadmill that gets faster and faster. But another way to look at it is to set lower expectations and only commit when you can confidently deliver. As another expression goes, "underpromise and overdeliver". Consistently delivering what you promise to customers is the way to build trust and loyalty.

  • Edward Ross
South Sea Bubble
finance

South Sea Bubble

I've been surprised to learn that financial bubbles and collapses are actually hundreds of years old. I learned this reading the book Devil Take the Hindmost: A History of Financial Speculation by Edward Chancellor. The chapters on the South Sea Bubble and the following craze over investing in South America sound thoroughly modern; except they happened 200-300 years ago. In fact Isaac Newton lost £20,000 by investing in the bubble. Chancellor describes futures, options, and margin loans - things I had wrongly assumed were more modern inventions.

  • Edward Ross
Embeddings for categories
data

Embeddings for categories

Categorical objects with a large number of categories are quite problematic for modelling. While many models can work with them it's really hard to learn parameters across many categories without doing a lot of work to get extra features. If you've got a related dataset containing these categories you may be able to meaningfully embed them in a low dimensional vector space which many models can handle. Categorical objects occur all the time in business settings; products, customers, groupings and of course words.

  • Edward Ross
From Descriptive to Predictive Analytics
data

From Descriptive to Predictive Analytics

The starting point for an analysis is often summary statistics, such as the mean or the median. For some of these you're going to want them more precise, more timely or cut by thinner segments. When the data gets too volatile to report on it's a good time to reframe the descriptive statistics as a predictive problem. Businesses often have a lot of reporting around important metrics cut by key segments.

  • Edward Ross
Teaching Programming by Editing Code
programming

Teaching Programming by Editing Code

I've had a few discussions with people, especially analysts, about how to learn programming. Generally I encourage them to find a project they want to accomplish and try to learn programming on the way. However I really struggle to find resources to recommend because they tend to spend a lot of time teaching programming concepts from scratch. I wonder if a better way to teach these things would be to start with code that's close to what they want to accomplish, and get them to edit it.

  • Edward Ross
Interpretable models with Cynthia Rudin
data

Interpretable models with Cynthia Rudin

A while ago I came across Cynthia Rudin through her work on the FICO Explainable Machine Learning Challenge. Her team got an honourable mention and she wrote an opinion piece with Joanna Radin on explainable models. I think the article was hyperbolic in claiming interpretable models always work as well as black box models. On the other hand I only came across her because of this article, so taking an extreme viewpoint in the media is a good way to get attention.

  • Edward Ross
Topic Modelling to Bootstrap a Classifier
data

Topic Modelling to Bootstrap a Classifier

Sometimes you want to classify documents, but you don't have an existing classification. Building a classification that is mutually exclusive and completely exhaustive is actually very hard. Topic modelling is a great way to quickly get started with a basic classification. Creating a classification may sound easy until you try to do it. Think about novels; is a Sherlock Holmes novel a mystery novel or a crime novel (or both)? Or do we go more granular and call it a detective novel, or even more specifically a whodunit?

  • Edward Ross
Rough Coarse Geocoding
data

Rough Coarse Geocoding

A coarse geocoder takes a human description of a large area like a city, area or country and returns the details of that location. I've been looking into the source of the excellent Placeholder (a component of the Pelias geocoder) to understand how this works. The overall approach is straightforward, but it takes a lot of work to get it to be reliable. A key component of a geocoder is a gazetteer that contains the names of locations.

  • Edward Ross
Python HTML Parser
python

Python HTML Parser

A lot of information is embedded in HTML pages, which contain both human text and markup. If you ever want to extract this information, don't use regex; use a parser. Python has an inbuilt html.parser library to do just that. The excellent html2text library uses it to parse HTML into markdown, which you can use for removing formatting. However for your own purposes you can use a similar approach to build a custom parser by subclassing HTMLParser.
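A minimal sketch of the subclassing approach: override the callbacks to collect text while skipping tags you don't care about (this simple extractor is illustrative, not html2text's implementation).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

parser = TextExtractor()
parser.feed("<p>Hello <b>world</b></p><script>var x = 1;</script>")
text = "".join(parser.parts)  # "Hello world"
```

The parser is event-driven: it calls your handlers as it streams through the document, so it copes with large or imperfect HTML without building a full tree.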

  • Edward Ross
Refining Location with Placeholder
data

Refining Location with Placeholder

Placeholder is a great library for Coarse Geocoding, and I'm using it for finding locations in Australia. In my application I want to get the location to a similar level of granularity; however the input may be for a higher level of granularity. Placeholder doesn't directly provide a method to do this, but you can use their SQLite database to do it. For example to find the largest locality for East Gippsland, with Who's On First id 102049039, you can use the SQL.

  • Edward Ross
Maybe Monad in Python
python

Maybe Monad in Python

A monad in languages like Haskell is a particular way to extend a function beyond its original domain. You can think of them as a generalised form of function composition; they are a way of taking one type of function and getting another function. A very useful case is the maybe monad used for dealing with missing data. Suppose you've got some useful function that parses a date: parse_date('2020-08-22') == datetime(2020,8,22).
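In Python the maybe pattern can be sketched with None standing in for Haskell's Nothing; a bind function propagates the missing value instead of crashing (the parse_date here is a hypothetical helper built on strptime):

```python
from datetime import datetime
from typing import Callable, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def bind(value: Optional[A], f: Callable[[A], Optional[B]]) -> Optional[B]:
    """Maybe-style bind: skip calling f when the value is missing."""
    return None if value is None else f(value)

def parse_date(s: str) -> Optional[datetime]:
    try:
        return datetime.strptime(s, "%Y-%m-%d")
    except ValueError:
        return None

year = bind(parse_date("2020-08-22"), lambda d: d.year)     # 2020
missing = bind(parse_date("not a date"), lambda d: d.year)  # None
```

Chaining several binds lets a whole pipeline short-circuit on the first missing value, rather than guarding every step with if-None checks.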

  • Edward Ross
Dip Statistic for Multimodality
maths

Dip Statistic for Multimodality

If you've got a distribution you may want a way to tell if it has multiple components. For example a sample of heights may have a couple of peaks for different gender, or other attributes. While you could determine this through explicitly modelling them as a mixture the results are sensitive to your choice of model. Another approach is statistical tests for multimodality. One common test is Silverman's Test which checks for the number of modes in a kernel density estimate; the trick is choosing the right width.

  • Edward Ross
Create User Sessions with SQL
sql

Create User Sessions with SQL

Sometimes you may want to experiment with sessions and need to hand-roll your own in SQL. There's a good Mode blog post on how to do this. If you're using Postgres or Greenplum you may be able to use Apache Madlib's Sessionize for the basic case. This blog post will give a very brief summary of how to do this with some examples in Presto/Athena. The idea of a session is to capture a continuous unit of user activity.
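The core logic is the same whatever the engine: a new session starts whenever the gap since the previous event exceeds a timeout. Here's that logic sketched in plain Python (in SQL the previous timestamp would come from a lag window function):

```python
from datetime import datetime, timedelta

def sessionize(timestamps, timeout=timedelta(minutes=30)):
    """Assign a session id to each timestamp; a gap over `timeout` starts a new session."""
    session_ids, current = [], 0
    previous = None
    for ts in sorted(timestamps):
        if previous is not None and ts - previous > timeout:
            current += 1
        session_ids.append(current)
        previous = ts
    return session_ids

hits = [
    datetime(2020, 1, 1, 9, 0),
    datetime(2020, 1, 1, 9, 10),   # 10 minutes later: same session
    datetime(2020, 1, 1, 11, 0),   # nearly 2 hours later: new session
]
sessionize(hits)  # [0, 0, 1]
```

The 30 minute timeout is just a common convention; picking it well is part of what the sessionisation experiments are about.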

  • Edward Ross
Python is not a Functional Programming Language
python

Python is not a Functional Programming Language

Python is a very versatile multiparadigm language with a great ecosystem of libraries. However it is not a functional programming language, as I know some people have described it. While you can write it in a functional style it goes against common practice, and has some practical issues. There is no fundamental definition of a functional programming language but two core concepts are immutable data and higher order functions.

  • Edward Ross
Differentiation is Linear Approximations
maths

Differentiation is Linear Approximations

Differentiation is the process of creating a local linear approximation of a function. This is useful because linear functions are very well understood and efficient to work with. One application of them is gradient descent, often used for fitting models in machine learning. In this context a function is something that maps between coordinate spaces. For example consider an image classifier that takes a 128x128 pixel image with three channels for colours (Red, Green, Blue) and returns a probability that the image contains a cat and the probability that the image contains a dog.
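The idea can be checked numerically in a couple of lines: the derivative gives the slope of the local linear approximation, and the approximation error shrinks quadratically as you zoom in (the cubic here is just an example function).

```python
def f(x):
    return x ** 3

def derivative(f, x, h=1e-6):
    # Symmetric difference quotient: slope of the local linear approximation.
    return (f(x + h) - f(x - h)) / (2 * h)

x, dx = 2.0, 0.01
linear_estimate = f(x) + derivative(f, x) * dx  # f(x) + f'(x) * dx
actual = f(x + dx)
error = abs(actual - linear_estimate)  # shrinks like dx**2 as dx -> 0
```

Gradient descent exploits exactly this: near the current point the function looks linear, so stepping against the gradient is guaranteed to decrease it for a small enough step.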

  • Edward Ross
Classifying Finite Groups
maths

Classifying Finite Groups

Groups can be thought of as a mathematical realisation of symmetry. For example the symmetric groups are all possible permutations of n elements, and the dihedral groups are the symmetries of a regular polygon. A question mathematicians ask is: what kinds of groups are there? One way to tackle this is to try to decompose them, for instance with a composition series of subgroups, each normal in the next. \[ 1 = H_0\triangleleft H_1\triangleleft \cdots \triangleleft H_n = G \]

  • Edward Ross
Complex Analysis
maths

Complex Analysis

Imaginary numbers sound like a very impractical thing; surely we should only be interested in real numbers. However imaginary numbers are very convenient for understanding phenomena with real numbers, and are useful models for periodic phenomena in fields like electrical engineering and quantum mechanics. The techniques are also often useful for evaluating integrals, solving two-dimensional electrostatics and decomposing periodic signals. Most of mathematical analysis, topology and measure theory is about inapplicable abstruse examples.

  • Edward Ross
Data Tests with SQL
data

Data Tests with SQL

A challenge of data analytics is that the data can change as well as the code. The systems producing and collecting data are often changed and can lead to missing or corrupt data. These can easily corrupt reports and machine learning systems. Worst of all the data may be lost permanently. So if you're going to use some data it's important to check the data regularly to catch the worst kind of mistakes as early as possible.

  • Edward Ross
Sessionisation Experiments
data

Sessionisation Experiments

You don't need a lot of data to prove a point. People often think statistics requires big expensive datasets that cost a lot to acquire. However in relatively unexplored spaces a small amount of data can have high yield in changing a decision. I've been working on some problems around web sessionisation. The underlying model is that when someone visits your website they may come at different times for different reasons.

  • Edward Ross
Test Driven Salary Extraction
python

Test Driven Salary Extraction

Even when there's a specific field for a price there's a surprising number of ways people write it. This is what the tool price-parser solves. Unfortunately it doesn't work too well on salaries, which tend to be ranges and much higher, but the approach works. Price parser has a very large set of tests covering different ways people write prices. The solution is a simple process involving a basic regular expression, but it solves all these different cases.
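A simplified sketch of the test-driven approach (this regex and helper are hypothetical, far cruder than price-parser, but show the shape: write the cases first, then a small regex-based parser to satisfy them):

```python
import re

# Matches "80,000", "95000" or "80k" style numbers, with an optional "k" suffix.
SALARY_RE = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)\s*(k)?", re.IGNORECASE)

def parse_salaries(text):
    """Return all salary-like numbers in `text`, converted to dollars."""
    values = []
    for number, k_suffix in SALARY_RE.findall(text):
        value = float(number.replace(",", ""))
        if k_suffix:
            value *= 1000
        values.append(value)
    return values

parse_salaries("$80,000 - $100,000 per annum")  # [80000.0, 100000.0]
parse_salaries("80k-100k")                      # [80000.0, 100000.0]
```

Each new format found in real job ads becomes a new test case, and the parser only grows as far as the tests demand.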

  • Edward Ross
Finding Australian Locations with Placeholder
python

Finding Australian Locations with Placeholder

People write locations in many different ways. This makes them really hard to analyse, so we need a way to normalise them. I've already discussed how Placeholder is useful for coarse geocoding. Now I'm trying to apply it to normalising locations from Australian Job Ads in Common Crawl. The best practices when using Placeholder are: Go from the most specific location information (e.g. street address) to the most general (e.

  • Edward Ross
Converting HTML to Text
python

Converting HTML to Text

I've been thinking about how to convert HTML to Text for NLP. We want to at least extract the text, but if we can preserve some of the formatting it can make it easier to extract information down the line. Unfortunately it's a little tricky to get the segmentation right. The standard answers on Stack Overflow are to use Beautiful Soup's getText method. Unfortunately this just replaces every tag with the separator argument, whether it is block level or inline.

  • Edward Ross
How to turn off LaTeX in Jupyter
jupyter

How to turn off LaTeX in Jupyter

When showing money in Jupyter notebooks the dollar signs can disappear and turn into LaTeX through Mathjax. This is annoying if you really want to print monetary amounts and not typeset mathematical equations. However this is easy to fix in Pandas dataframes, Markdown or HTML output. For Pandas dataframes this is especially annoying because it's much more likely you would want to show $ signs than display math. Thankfully it's easy to fix by setting the display option pd.

  • Edward Ross
Double emphasis error in html2text
python

Double emphasis error in html2text

I'm trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. I've already resolved an issue with multiple types of emphasis. However HTML in the wild has all sorts of weird edge cases that the library has trouble with. In this case I found a term that was emphasised twice: <strong><strong>word</strong></strong>. I'm pretty sure for a browser this is just the same as doing it once; <strong>word</strong>.

  • Edward Ross
An edge bug in html2text
python

An edge bug in html2text

I've been trying to find a way of converting HTML to something meaningful for NLP. The html2text library converts HTML to markdown, which strips away a lot of the meaningless markup. But I quickly hit an edge case where it fails, because parsing HTML is surprisingly difficult. I was parsing some HTML that looked like this: Some text.<br /><i><b>Title</b></i><br />... When I ran html2text it produced an output like this:

  • Edward Ross
Symmetry in probability
maths

Symmetry in probability

The simplest way to model probability of a system is through symmetry. For example the concept of a "fair" coin means there are two possible outcomes that are indistinguishable. Because each result is equally likely the outcome is 50/50 heads or tails. Similarly for a fair die there are 6 possible outcomes, that are all equally likely. This means they each have the probability 1/6. The idea of symmetry is behind random sampling.

  • Edward Ross
Sunk Cost of Pure Mathematics
maths

Sunk Cost of Pure Mathematics

Today I went through the painful exercise of culling my notebooks. My honours notebooks, independent research and work from textbooks and courses. These are things I spent a large part of my early life and energy on. Even though I haven't looked at them for years they are very hard to let go. A large amount of the material is pure mathematics. Notes on differential geometry, topology, and measure theory. These are particularly vexing because I don't believe they hold much real value.

  • Edward Ross
Writing Blog Posts with Jupyter and Hugo
writing

Writing Blog Posts with Jupyter and Hugo

It can be convenient to directly publish a mixture of prose, source code and graphs. It ensures the published code actually runs and makes it much easier to rerun at a later point. I’ve done this before in Hugo with R Blogdown, and now I’m experimenting with Jupyter notebooks. The best available option seems to be nb2hugo which converts the notebook to markdown, keeping the front matter and exporting the images.

  • Edward Ross
Searching within a Website
general

Searching within a Website

Some websites, like this one, have a lot of content but have no search function. Others have search but it performs poorly; for example Bunnings has great category pages but the search never hits them. Fortunately there's a simple way to search these sites with the site: search operator. If I want to search for articles about jobs just in this website I can type: site:skeptric.com job into either Google or Bing.

  • Edward Ross
NLP Learning Resources in 2020
nlp

NLP Learning Resources in 2020

There's a lot of great freely available resources in NLP right now; and the field is moving quickly with the recent success of neural models. I wanted to mention a few that look interesting to me. Jurafsky and Martin's Speech and Language Processing The third edition is a free ebook that is in progress that covers a lot of the basic ideas in NLP. It's got a great reputation in the NLP community and is nearly complete now.

  • Edward Ross
Speaking Quota
communication

Speaking Quota

I often find listening more productive than talking, but still find it easy to spend a lot of meetings talking. When I get curious I ask lots of questions in a meeting that can take it off on a tangent, especially switching from high level to detail. If you find yourself in a similar situation give yourself a small speaking quota. I got the idea from a former management consultant, who when he was a junior was told he was only allowed to say one thing in a meeting.

  • Edward Ross
Being Patient with People
general

Being Patient with People

I'm sitting in a meeting listening to an update. They've missed the point, and they're focussing on the wrong thing. I start to get frustrated; why are they so far off track? Why haven't they taken the time to understand the problem? This isn't a helpful reaction; getting short tempered won't help resolve the problem. I haven't taken the time to understand the speaker and their perspective. Why do they think this is the right thing to focus on?

  • Edward Ross
Don't Stop Pretraining
nlp

Don't Stop Pretraining

In the past two years the best performing NLP models have been based on transformer models trained on an enormous corpus of text. By understanding how language in general works they are much more effective at detecting sentiment, classifying documents, answering questions and translating documents. However in any particular case we are solving a particular task in a certain domain. Can we get a better performing model by further training the language model on the specific domain or task?

  • Edward Ross
Tangled up in BLEU
nlp

Tangled up in BLEU

How can we evaluate how good a machine generated translation is? We could get bilingual readers to score the translation, and average their scores. However this is expensive and time consuming: if we need hours of human time to evaluate an experiment, evaluation becomes a bottleneck for experimentation. This motivates automatic metrics for evaluating machine translation. One of the oldest examples is the BiLingual Evaluation Understudy (BLEU).

  • Edward Ross
Hugo Casper 2 to 3
blog

Hugo Casper 2 to 3

I've been wanting to upgrade my version of Hugo, but the Casper 2 theme I was using didn't support it. The first step in this transition is to use Casper 3. It looks similar to my old theme, is easy to set up, but seems to be missing some features. I cloned the repository, and changed the theme in my config.toml to theme = "hugo-casper3". The article images weren't showing because the Casper 3 theme uses feature_image instead of image and requires a leading slash in the path (which was optional in 2).

  • Edward Ross
Running an X server with WSL2
wsl

Running an X server with WSL2

I've recently started working with WSL2 on my Windows machine, but have had trouble getting an X server to run. This is an issue for me because running Emacs with Evil keybindings under Windows Terminal I often find there's a lag in registering pressing escape which leads to some confusing issues (but vanilla Vim is fine). But having an X server would also allow running any Linux graphical application under X.

  • Edward Ross
Customising Portable Dotfiles
git

Customising Portable Dotfiles

I keep my personal configuration files in a public dotfiles repository. This means that whenever I'm on a new machine it's very easy to get comfortable in a new environment. However I find I often need machine specific configuration, so I provide ways to override them with local configuration. When I get to a new machine I'll pretty quickly want some of my usual configuration (although I don't need it). I can clone or download a zipfile of my dotfiles and then install it via some symlinks via a bootstrap bash script.

  • Edward Ross
Git Folder Identities
git

Git Folder Identities

Sometimes you want a different git configuration in different contexts. For example you might want different author information, or to exclude files for only some kinds of projects, or to have a specific template for certain kinds of projects. The easiest way to do this consistently is with an includeIf statement. For example to have custom options for any git repository under a folder called apache add this to the bottom of your ~/.
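A sketch of what the includeIf stanza looks like (the folder and file names here are illustrative):

```ini
# In ~/.gitconfig: for any repository under ~/apache/, also read this file.
[includeIf "gitdir:~/apache/"]
    path = ~/.gitconfig-apache

# ~/.gitconfig-apache might then override author details, e.g.
# [user]
#     email = me@apache.example.org
```

The trailing slash on the gitdir pattern matters: it makes the rule apply to every repository under that directory rather than one exact path.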

  • Edward Ross
Raising Exceptions in Python Futures
python

Raising Exceptions in Python Futures

Python concurrent.futures are a handy way of dealing with asynchronous execution. However if you're not careful it will swallow your exceptions leading to difficult to debug errors. While you can perform concurrent downloads with multiprocessing it means starting up multiple processes and sending data between them as pickles. One problem with this is that you can't pickle some kinds of objects and often have to refactor your code to use multiprocessing.
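The swallowing behaviour is easy to demonstrate: an exception raised in a worker is stored on the Future and stays silent until you explicitly ask for it (the risky function here is just a toy example).

```python
from concurrent.futures import ThreadPoolExecutor

def risky(x):
    if x < 0:
        raise ValueError("negative input")
    return x * 2

with ThreadPoolExecutor() as executor:
    future = executor.submit(risky, -1)
    # Nothing is raised yet: the exception is stored on the future.
    err = future.exception()        # retrieve it without raising
    try:
        future.result()             # or re-raise it in the calling thread
    except ValueError as e:
        caught = str(e)
```

If you never call .result() or .exception() the failure simply disappears, which is why fire-and-forget submissions are so hard to debug.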

  • Edward Ross
Getting Started with WSL2
wsl

Getting Started with WSL2

I've finally started trying out Windows Subsystem for Linux version 2. When comparing with WSL1 it's much faster because it works on a Virtual Machine rather than translating syscalls, but is slower when working on Windows filesystems. The speed up is significant when launching processes and dealing with small files, and git and Python virtualenvs are an order of magnitude faster. I'm still working through some of the issues of transferring.

  • Edward Ross
Targeting my brand
general

Targeting my brand

My friend has four different magnets for plumbers on his fridge. Three of them are generic rectangular magnets that have generic information and contact details. One of them was in the shape of a dripping tap, mentioning they were experts in leaks and drips. If they had a leaking faucet it's pretty easy to guess which plumber they would call; the specialists in dripping taps. On the other hand if they had a clogged toilet it's down to chance which of the plumbers they would call, although they're less likely to call the dripping tap specialist they're also more likely to forget to look at the fridge and just search for a plumber online.

  • Edward Ross
Embrace, Extend and Extinguish
general

Embrace, Extend and Extinguish

In the 90s Microsoft famously used a strategy of embracing other protocols, then adding extensions to their implementation until it's no longer compatible and utilising their market leverage to extinguish competing implementations. While "EEE" is normally associated with Microsoft many of the software titans use it as an effective strategy to further their existing dominance into new markets. Embracing a technology with an existing market is an effective way to quickly gain adoption.

  • Edward Ross
Mace-Bearer
general

Mace-Bearer

The University of Adelaide, being a sandstone Group of Eight University, has the archaic ceremony of a mace-bearer leading the procession carrying a heavy piece of expensive metal. When I graduated with my Bachelor of Science I was fortunate enough to be that mace-bearer. Unfortunately I wasn't really prepared for the formality. The ceremony was on a typical Adelaide summer's day, hot and dry. I was going out to lunch with my parents afterwards, so I wanted to make sure I was comfortable.

  • Edward Ross
Data Models
data

Data Models

Information is useful in that it helps make better decisions. This is much easier if the data is represented in a way that closely matches the conceptual model of the business. Building a useful view of the data can dramatically decrease the time and cost of answering questions and even elevate the conversation to answering deeper questions about the business. A typical example of where analysis can help is trying to increase revenue of a digitally sold product.

  • Edward Ross
Filling Gaps in SQL
sql

Filling Gaps in SQL

It's common for there to be gaps or missing values in an SQL table. For example you may have daily traffic by source, but on some low volume days around Christmas there are no values in the low traffic sources. Missing values can really complicate some calculations like moving averages, and some times you need a way of filling them in. This is straightforward with a cross join. You need all the possible variables you're filling in, and the value to fill.
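A minimal sketch of the cross join technique, run through Python's sqlite3 so it's self-contained (the table, dates and sources are made up): build every day/source combination, then left join the real data and fill the holes with zero.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE traffic (day TEXT, source TEXT, visits INT);
INSERT INTO traffic VALUES
  ('2020-12-24', 'search', 100),
  ('2020-12-24', 'email', 5),
  ('2020-12-25', 'search', 40);  -- no 'email' row on the 25th
""")

filled = con.execute("""
SELECT days.day, sources.source, COALESCE(t.visits, 0) AS visits
FROM (SELECT DISTINCT day FROM traffic) AS days
CROSS JOIN (SELECT DISTINCT source FROM traffic) AS sources
LEFT JOIN traffic t ON t.day = days.day AND t.source = sources.source
ORDER BY days.day, sources.source
""").fetchall()
# Now every (day, source) pair exists, including ('2020-12-25', 'email', 0)
```

In a real warehouse you'd usually cross join against a proper date dimension rather than SELECT DISTINCT, so days with no traffic at all also get filled.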

  • Edward Ross
Directions of Delegation
general

Directions of Delegation

For any actionable item there are four ways to handle it: do it, defer it, delegate it or delete it. Delegation is an often overlooked powerful option to handle things. It's not just for high powered executives to delegate down to their personal assistants; even if you don't have any reports it's possible to delegate. You can delegate in three directions: down, sideways and up. Downwards delegation is the classic kind that comes to most people's minds.

  • Edward Ross
A Checklist for NLP models
data

A Checklist for NLP models

When training machine learning models typically you get a training dataset for fitting the model and a test dataset for evaluating the model (on small datasets techniques like cross-validation are common). You typically assume the performance on your chosen metric on the test dataset is the best way of judging the model. However it's really easy for systematic biases or leakage to creep into the datasets, meaning that your evaluation will differ significantly from real world usage.

  • Edward Ross
Deep Neural Networks as a Building Block
data

Deep Neural Networks as a Building Block

Deep Neural Networks have transformed dealing with unstructured data like images and text, making totally new things possible. However they are difficult to train, require a large amount of relevant training data, are hard to interpret, hard to debug and hard to refine. I think for these reasons there's a lot of space to use neural networks as a building block for extracting structured data for less parameterised models. Josh Tenenbaum gave an excellent keynote at ACL 2020 titled Cognitive and computational building blocks for more human-like language in machines.

  • Edward Ross
Sequential Weak Labelling for NER
data

Sequential Weak Labelling for NER

The traditional way to train an NER model on a new domain is to annotate a whole bunch of data. Techniques like active learning can speed this up, but especially neural models with random weights require a ton of data. A more modern approach is to take a large pretrained NER model and fine tune it on your dataset. This is the approach of AdaptaBERT (paper), using BERT. However this takes a large amount of GPU compute and finicky regularisation techniques to get right.

  • Edward Ross
pyBART: Better Dependencies for Information Extraction
python

pyBART: Better Dependencies for Information Extraction

Dependency trees are a remarkably powerful tool for information extraction. Neural based taggers are very good and Universal Dependencies means the approach can be used for almost any language (although the rules are language specific). However syntax can get really strange requiring increasingly complex rules to extract information. The pyBART system solves this by rewriting the rules to be half a step closer to semantics than syntax. I've seen that dependency based rules are useful for extracting skills from noun phrases and adpositions.

  • Edward Ross
Demjson for parsing tricky Javascript Objects
python

Demjson for parsing tricky Javascript Objects

Modern Javascript web frameworks often embed the data used to render each webpage in the HTML. This means an easy way of extracting data is capturing the string representation of the object with a pushdown automaton and then parsing it. Python's inbuilt json.loads is effective, but won't handle very dynamic Javascript; demjson will. The problem shows up when using json.loads as the following obscure error: json.decoder.JSONDecodeError: Expecting value: line N column M (char X). Looking near that character in my case I see that it is a JavaScript undefined, which is not valid in JSON.
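A minimal reproduction of the failure with the standard library (the embedded object here is a made-up example); demjson's decode function is the kind of tool that tolerates these JavaScript-only values:

```python
import json

# A JavaScript object literal containing `undefined`, which is valid
# JavaScript but not valid JSON.
raw = '{"salary": undefined, "title": "Data Analyst"}'

try:
    json.loads(raw)
except json.JSONDecodeError as e:
    # The error points at the character where `undefined` starts.
    print(e)
```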

  • Edward Ross
Tips for Extracting Data with Beautiful Soup
python

Tips for Extracting Data with Beautiful Soup

Beautiful Soup can be a useful library for extracting information from HTML. Unfortunately there are a lot of little issues I hit working with it to extract data from a careers webpage using Common Crawl. The library is still useful enough to work with; but the issues make me want to look at alternatives like lxml (via html5-parser). The source data can be obtained at the end of the article. Use a good HTML parser: Python has an inbuilt html.

  • Edward Ross
Only write file on success
python

Only write file on success

When writing data pipelines it can be useful to cache intermediate results to recover more quickly from failures. However if a corrupt or incomplete file was written then you could end up caching that broken file. The solution is simple; only write the file on success. A strategy for this is to write to some temporary file, and then move the temporary file on completion. I've wrapped this in a Python context manager called AtomicFileWriter which can be used in a with statement in place of open:
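A minimal sketch of the idea behind AtomicFileWriter (the actual implementation may differ in details): write to a temporary file in the target directory, and only move it into place if the with block completes without error.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def atomic_file_writer(path, mode="w"):
    # Create the temporary file next to the target so the final move
    # stays on the same filesystem (making os.replace atomic on POSIX).
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, mode) as f:
            yield f
        os.replace(tmp_path, path)  # only runs if the block succeeded
    except BaseException:
        os.remove(tmp_path)         # discard the partial file
        raise

with atomic_file_writer("output.txt") as f:
    f.write("all or nothing")
```

If an exception is raised inside the with block, the half-written temporary file is deleted and the destination is never touched, so the cache can only ever contain complete files.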

  • Edward Ross
Diverge then Converge
general

Diverge then Converge

It's very useful to diverge on ideas before converging on a solution. Trying to do both at the same time tends to stifle creativity and lead to less innovative solutions. I find the creative process of brainstorming is more effective if I do it separately to refining ideas. Taking the time to brainstorm leads to better solutions, whether thinking about what to work on, planning out a presentation or designing a technical solution.

  • Edward Ross
Accelerating downloads with Multiprocessing
data

Accelerating downloads with Multiprocessing

Downloading files can often be a bottleneck in a data pipeline because network I/O is slow. A really simple way to handle this is to run multiple downloads in parallel across threads. While it's possible to deal with the unused CPU cycles using asynchronous processing, in Python it's generally easier to throw more threads at it. Using multiprocessing can be very simple if you can make the processing occur in a pure function or object method, and both the arguments and results are picklable.
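A sketch of the pattern with multiprocessing's thread pool; download_url here is a hypothetical pure function standing in for real network I/O (e.g. wrapping urllib or requests):

```python
from multiprocessing.pool import ThreadPool

def download_url(url):
    # Placeholder for a real download; a pure function of its argument.
    return f"contents of {url}"

urls = [f"https://example.com/page/{i}" for i in range(10)]

# Threads suit I/O-bound work: the GIL is released while waiting on
# the network, so several downloads can be in flight at once.
with ThreadPool(8) as pool:
    results = pool.map(download_url, urls)
```

pool.map keeps the results in the same order as the input urls, which makes it a drop-in replacement for a serial loop.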

  • Edward Ross
Processing RDF nquads with grep
data

Processing RDF nquads with grep

I am trying to extract Australian Job Postings from Web Data Commons which extracts structured data from Common Crawl. I previously came up with a SPARQL query to extract the Australian jobs from the domain, country and currency. Unfortunately it's quite slow, but we can speed it up dramatically by replacing it with a similar script in grep. With a short grep script we can get twenty thousand Australian Job Postings with metadata from 16 million lines of compressed n-quads in 30 seconds on my laptop.
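The shape of the idea can be sketched like this (the real patterns in the article are more involved; the sample lines here are made up): stream the n-quad lines and keep those mentioning an Australian signal such as a .au domain or AUD currency.

```shell
# Three fake n-quad lines standing in for the decompressed stream;
# in practice the input would come from zcat on the compressed files.
printf '%s\n' \
  '<http://example.com/1> <p> "o" <http://seek.com.au/job/1> .' \
  '<http://example.com/2> <p> "o" <http://example.com/page> .' \
  '<http://example.com/3> <p> "AUD" <http://other.com/job> .' \
  | grep -E '\.au/|AUD' | wc -l
```

Because grep does a single streaming pass with no graph parsing, it is dramatically faster than loading the quads into an RDF store and querying with SPARQL.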

  • Edward Ross
Extracting Australian Job Postings with SPARQL
jobs

Extracting Australian Job Postings with SPARQL

I am trying to extract Australian Job Postings from Web Data Commons which extracts structured data from Common Crawl. I have previously written scripts to read in the graphs, explore the JobPosting schema and analyse the schema using SPARQL. Now we can use these to find some Australian Job Postings in the data. For this analysis I used 15,000 pages containing job postings with different domains from the 2019 Web Data Commons Extract.

  • Edward Ross
Adding Types to Rdflib
python

Adding Types to Rdflib

I've been using RDFLib to parse job posts extracted from Common Crawl. RDF Literals: RDFLib automatically parses XML Schema Datatypes into Python datastructures, but doesn't handle the <http://schema.org/Date> datatype that commonly occurs in JSON-LD. It's easy to add with the rdflib.term.bind command, but this kind of global binding could lead to problems. When RDFLib parses a literal it will create a rdflib.term.Literal object and the value field will contain the Python type if it can be successfully converted, otherwise it will be None.

  • Edward Ross
Schemas for JobPostings in Practice
jobs

Schemas for JobPostings in Practice

A job posting has a description, a company, sometimes a salary, ... and what else? Schema.org have a detailed JobPosting schema, but it's not immediately obvious what is important and how to use it. However the Web Data Commons have extracted JobPostings from hundreds of thousands of webpages from Common Crawl. By parsing the data we can see how these are actually used in practice which will help show what is actually useful in describing a job posting.

  • Edward Ross
Converting RDF to Dictionary
python

Converting RDF to Dictionary

The Web Data Commons has a vast repository of structured RDF Data about local businesses, hostels, job postings, products and many other things from the internet. Unfortunately it's not in a format that's easy to do analysis on. We can stream the nquad format to get RDFlib Graphs, but we still need to convert the data into a form we can do analysis on. We'll do this by turning the relations into dictionaries of properties to the list of objects they contain.
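The grouping step can be sketched in plain Python: collect a subject's triples into a dictionary mapping each property to the list of objects it points at. The triples here are simplified stand-ins for RDFLib terms.

```python
from collections import defaultdict

# (subject, predicate, object) triples for one hypothetical job posting.
triples = [
    ("job1", "title", "Data Analyst"),
    ("job1", "skill", "SQL"),
    ("job1", "skill", "Python"),
]

# Map each property to the list of objects it contains; a list is
# needed because RDF allows a property to occur multiple times.
properties = defaultdict(list)
for subject, predicate, obj in triples:
    properties[predicate].append(obj)

print(dict(properties))
# {'title': ['Data Analyst'], 'skill': ['SQL', 'Python']}
```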

  • Edward Ross
Streaming n-quads as RDF
data

Streaming n-quads as RDF

The Web Data Commons extracts structured RDF Data from about one monthly Common Crawl per year. These contain a vast amount of structured information about local businesses, hostels, job postings, products and many other things from the internet. Python's RDFLib can read the n-quad format the data is stored in, but by default requires reading all of the millions to billions of relations into memory. However it's possible to process this data in a streaming fashion allowing it to be processed much faster.

  • Edward Ross
Scheduling Github Actions
programming

Scheduling Github Actions

I use Github actions to publish daily articles via Hugo. I had set it up to publish on push, but sometimes I future date articles to have a backlog. This means that they won't be published until my next commit or manual publish action. To fix this I've set up a scheduled action to run just after 8am in UTC+10 (close to my timezone in Melbourne, Australia) every day. By default Hugo will not publish articles with a future date, so it's easy to keep a backlog by setting the date in front matter to a future date.
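The trigger can be sketched in the workflow file like this (the exact minute is an assumption; note that 8am in UTC+10 is 22:00 UTC the previous day, since GitHub's cron runs in UTC):

```yaml
on:
  schedule:
    # 22:05 UTC is 8:05am the next day in UTC+10 (Melbourne, outside DST)
    - cron: '5 22 * * *'
```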

  • Edward Ross
Using Local Github Actions
programming

Using Local Github Actions

I've been using Github Actions to publish this website for almost a month. The experience has been great; whenever I push a commit it gets consistently published without me thinking about it within minutes. However I have one concern; I'm passing my rsync credentials into an external action. I've specified a tag in my yaml uses: wei/rclone@v1, but it would be easy for the author to move this tag to another commit that sends my private credentials to their personal server.

  • Edward Ross
Checking your Work
general

Checking your Work

One of the most important abilities of an analyst is to be able to check your work. It's really easy to get incorrect data, have issues in data processing, or even misunderstand what the output means. But if your work is valuable enough to change a decision it's worth doing whatever you can to check it's right. When you get to the end of a long analysis it seems like a time to relax and be glad the hard work is over.

  • Edward Ross
Parsing Escaped Strings
python

Parsing Escaped Strings

Sometimes you may have to parse a string with backslash escapes; for example "this is a \"string\"". This is quite straightforward to parse with a state machine. The idea of a state machine is that the action we need to take will change depending on what we have already consumed. This can be used for proper regular expressions (without special things like lookahead), and the ANTLR4 parser generator can maintain a stack of "modes" that can be used similarly.
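A minimal sketch of such a state machine: a single boolean tracks whether the previous character was a backslash, and that state changes how the next character is handled.

```python
def parse_escaped_string(text):
    """Parse a double-quoted string with backslash escapes,
    returning (contents, index just past the closing quote)."""
    assert text[0] == '"'
    out = []
    escaped = False  # the current state of the machine
    for i, char in enumerate(text[1:], start=1):
        if escaped:
            out.append(char)        # take the escaped character literally
            escaped = False
        elif char == "\\":
            escaped = True          # switch state on a backslash
        elif char == '"':
            return "".join(out), i + 1  # closing quote: done
        else:
            out.append(char)
    raise ValueError("unterminated string")

parse_escaped_string('"this is a \\"string\\"" and more')
# → ('this is a "string"', 22)
```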

  • Edward Ross
Extracting Job Ads from Common Crawl
commoncrawl

Extracting Job Ads from Common Crawl

I've been using data from the Adzuna Job Salary Predictions Kaggle Competition to extract skills, find near duplicate job ads and understand seniority of job titles. But the dataset has heavily processed ad text which makes it harder to do natural language processing on. Instead I'm going to find job ads in Common Crawl, a dataset containing over a billion webpages each month. The Common Crawl data is much better because it's longitudinal over several years, international, broad and continually being updated.

  • Edward Ross
Excel Completion Count
excel

Excel Completion Count

I was recently running some simple, but tedious, annotation in Excel. While it's not a good tool for complex annotation, it can be effective for simple textual annotation where all the information needed to make a decision fits in a row. However I needed a way to track progress across the team to make sure we finished on time, and see who needed help. We had a blank column that was being filled in as the annotation progressed, and each person was working on some set of rows.

  • Edward Ross
Common Crawl Index Athena
commoncrawl

Common Crawl Index Athena

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. There are petabytes of data archived so directly searching through them is very expensive and slow. To search for pages that have been archived within a domain (for example all pages from wikipedia.com) you can search the Capture Index. But this doesn't help if you want to search for paths archived across domains. For example you might want to find how many domains have been archived, or the distribution of languages of archived pages, or find pages offered in multiple languages to build a corpus of parallel texts for a machine translation model.

  • Edward Ross
Extracting Text, Metadata and Data from Common Crawl
commoncrawl

Extracting Text, Metadata and Data from Common Crawl

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. You can search the index to find where pages from a particular website are archived, but you still need a way to access the data. Common Crawl provides the data in 3 formats:

  • If you just need the text of the internet, use the WET files
  • If you just need the response metadata, HTML head information or links in the webpage, use the WAT files
  • If you need the whole HTML (with all the metadata), use the full WARC files

The index only contains locations for the WARC files; the WET and WAT files are just summarisations of it.

  • Edward Ross
Searching 100 Billion Webpages With the Capture Index
commoncrawl

Searching 100 Billion Webpages With the Capture Index

Common Crawl builds an open dataset containing over 100 billion unique items downloaded from the internet. Every month they use Apache Nutch to follow links across the web and download over a billion unique items to Amazon S3, and have data back to 2008. This is like what Google and Bing do to build their search engines, the difference being that Common Crawl provides their data to the world for free.

  • Edward Ross
Understanding Job Ad Titles with Salary
jobs

Understanding Job Ad Titles with Salary

Different industries have different ways of distinguishing seniority in a job title. Is an HR Officer more senior than an HR Administrator? Is a PHP web developer more skilled than a PHP developer? How different is a medical sales executive to general sales roles? Using the jobs from Adzuna Job Salary Predictions Kaggle Competition I've found common job titles and can use the advertised salary to help understand them. Note that since the data is from the UK from several years ago a lot of the details aren't really applicable, but the techniques are.

  • Edward Ross
Discovering Job Titles
jobs

Discovering Job Titles

A job ad title can contain a lot of things like location, skills or benefits. I want a list of just the job titles, without the rest of those things. This is a key piece of information extraction that can be used to better understand jobs, and built on by understanding how different job titles relate, for example with salary. To do this we first normalise the words in the ad title, doing things like removing plurals and expanding acronyms.

  • Edward Ross
Heuristics for Active Open Source Project
programming

Heuristics for Active Open Source Project

When evaluating whether to use an open source project I generally want to know how active the project is. A project doesn't need to be active to be usable; mature and stable projects don't need to change much to be reliable. But if a project has problems or missing essential features, or is in an evolving ecosystem (like any web project or kernel drivers), it's important to know how fast it changes.

  • Edward Ross
Making Words Singular
nlp

Making Words Singular

Trying to normalise text in job titles I need a way to convert plural words into their singular form. For example a job for "nurses" is about a "nurse", a job for "salespeople" is about a "salesperson", a job for "workmen" is about a "workman" and a job about "midwives" is about a "midwife". I developed an algorithm that works well enough for converting plural words to singular without changing singular words in the text like "sous chef", "business" or "gas".
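A small suffix-rule sketch in the spirit of the approach (the real algorithm and its exception lists are more thorough; the exception dictionary here is illustrative):

```python
# Words the suffix rules would get wrong, handled explicitly.
EXCEPTIONS = {
    "salespeople": "salesperson",
    "business": "business",
    "gas": "gas",
    "sous": "sous",
}

def singularise(word):
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word.endswith("wives"):
        return word[:-3] + "fe"        # midwives -> midwife
    if word.endswith("men"):
        return word[:-2] + "an"        # workmen -> workman
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"         # secretaries -> secretary
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]               # nurses -> nurse
    return word
```

The ordering matters: exceptions first, then the most specific suffixes, with the bare strip-an-s rule as the fallback.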

  • Edward Ross
Rewriting A of B
nlp

Rewriting A of B

When examining words in job titles I noticed that it was common to see titles written as "head of ..." or "director of ...". This is unusual because most role titles go from specific to general (e.g. finance director), so you look backwards from the role word. In the "A of B" format the role goes from general to specific and so you have to reverse the search order. One solution is to rewrite "director of finance" to "finance director".
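The rewrite itself can be sketched with a regular expression; this simple version only handles the two-part "A of B" pattern:

```python
import re

def rewrite_of(title):
    # Swap "A of B" to "B A", leaving titles without " of " untouched.
    return re.sub(r"^(.*\S)\s+of\s+(\S.*)$", r"\2 \1", title)

rewrite_of("director of finance")   # → 'finance director'
rewrite_of("head of data science")  # → 'data science head'
```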

  • Edward Ross
Mail merge to PDF Files
general

Mail merge to PDF Files

A friend needed to generate a hundred contracts and their HR information system wasn't working properly. I helped them implement a workaround solution by using mail merge to generate a PDF for every contract, which saved them a lot of time filling in the details of each contract. I couldn't automatically generate the PDF despite some efforts, but using mail merge was much quicker and more reliable than filling in all the contract details manually into the template.

  • Edward Ross
Summary of Finding Near Duplicates in Job Ads
nlp

Summary of Finding Near Duplicates in Job Ads

I've been trying to find near duplicate job ads in the Adzuna Job Salary Predictions Kaggle Competition. Job ads can be duplicated because a hirer posts the same ad multiple times to a job board, or to multiple job boards. Finding exact duplicates is easy by sorting the job ads or a hash of them. But the job board may mangle the text in some way, or add its own footer, or the hirer might change a word or two in different posts.

  • Edward Ross
Finding Duplicate Companies with Cliques
jobs

Finding Duplicate Companies with Cliques

We've found pairs of near duplicate texts in 400,000 job ads from the Adzuna Job Salary Predictions Kaggle Competition. We then tried to extract groups of similar ads by finding connected components in the graph of similar ads. Unfortunately with a low threshold of similarity we ended up with a chain of ads that were each similar to the next, but the first and last ad were totally unrelated. One way to work around this is to find cliques: groups of job ads where every job ad is similar to all of the others.

  • Edward Ross
Market for Highschool Maths Textbooks
general

Market for Highschool Maths Textbooks

My first professional job was for Haese Mathematics, a small family-owned South Australian business that writes and publishes mathematics textbooks. Working for a small company was a really interesting experience: I learned software development for their applications for both students and teachers, made animations, edited audio and did layout and graphic design of the books. Unfortunately I didn't make the effort to learn much about the business itself, which makes me wonder how big the market is for mathematics textbooks.

  • Edward Ross
Pain gain matrix for discussing approaches
general

Pain gain matrix for discussing approaches

Placing options on a scatterplot of costs versus benefits is a common practice for prioritising opportunities and solutions. The primary benefit of this approach is it can spark discussions. When people see the options on the canvas it can help uncover unseen issues and opportunities. Getting a group of people involved in putting it together can help get them on the same page. The primary risk of this approach is getting too precise about it.

  • Edward Ross
Spreadsheets as a Rough Annotation Tool
excel

Spreadsheets as a Rough Annotation Tool

I needed to design some heuristic thresholds for grouping together items. In my first attempt I iteratively tried to guess the thresholds by trying them on different examples. This was directionally useful but as I refined the thresholds I had to keep going back to check whether I had broken earlier examples. To improve this I used a spreadsheet as a rough annotation tool. There are various tools for data entry like org mode tables in Emacs, or you can use a spreadsheet interface in R with data.

  • Edward Ross
Bridging Bipartite Graph
data

Bridging Bipartite Graph

When you have behavioural data between actors and events you naturally get a bipartite graph. For example you can have the actors as customers and events as products that are purchased, or the actors as users of a website and the events as videos that are viewed, or the actors as members of a forum and the events as posts they comment on. One of the ways to represent this is to relate actors by the number of events they both participate in.
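This bridging step can be sketched in a few lines: relate each pair of actors by counting the events they share. The purchase data here is made up for illustration.

```python
from itertools import combinations
from collections import Counter

# Bipartite data: each event (product) maps to the set of actors
# (customers) who participated in it.
purchases = {
    "apples": {"alice", "bob"},
    "bread":  {"alice", "bob", "carol"},
    "cheese": {"bob", "carol"},
}

# Project onto the actors: count shared events for every actor pair.
shared = Counter()
for event, actors in purchases.items():
    for a, b in combinations(sorted(actors), 2):
        shared[(a, b)] += 1

print(shared)
```

The resulting weighted graph between actors can then be fed to clustering or community detection.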

  • Edward Ross
Clustering for Exploration
data

Clustering for Exploration

Suppose you're running a website with tens of thousands of different products, and no satisfactory way to group them up. Even a mediocre clustering can really help bootstrap your understanding. You can use the clusters to see new patterns in the data, and you can manually refine the clusters much more easily than you can make them. There are many techniques to cluster structured data or even detect them as communities in the graph of interactions with your users.

  • Edward Ross
Less Is Better
general

Less Is Better

Today I was picking grapes from their vine for my partner's grandmother. They had been left too long and many were rotting or had bright blue spots where some form of fungus or algae was growing on them. I sorted the grapes into piles of rotten grapes and edible grapes. When I picked a big bunch of grapes with a couple of rotting overripe grapes I sorted it into the rotten pile, despite there being a dozen ripe looking grapes.

  • Edward Ross
Using Github Actions with Hugo
programming

Using Github Actions with Hugo

I really like the idea of having a process triggered automatically when I push code. Github actions gives a way to do this with Github repositories, and this article was first published with a Github action. While convenient for simple things Github actions seem hard to customise, heavyweight to configure and give me security concerns. My workflow for publishing this website used to be commit and push the changes and run a deploy script.

  • Edward Ross
Project Estimation
general

Project Estimation

Estimating projects is notoriously difficult, and the larger the project the harder to estimate. But even small pieces of work for a single person are easy to underestimate. When you make an estimate base it on actual elapsed times of similar projects, always try to overestimate the time, and reduce the scope before promising more than you can deliver. Everyone knows that construction jobs are typically going to take longer and cost more than quoted, from home renovations to major construction projects.

  • Edward Ross
Probability Jaccard
math

Probability Jaccard

I don't like the Jaccard index for clustering because it doesn't work well on sets of different sizes. Instead I find the concepts from Association Rule Learning (a.k.a. market basket analysis) very useful. It turns out Jaccard Similarity can be written in terms of these concepts, so they really are more general. The main metrics in association rule mining are the confidence, which for pairs is just the conditional probability \( P(B \vert A) = \frac{P(A, B)}{P(A)} \), and the lift, which is how much more likely than random (from the marginals) the two events are to occur together: \( \frac{P(A, B)}{P(A)P(B)} \).
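Concretely, the Jaccard similarity can be rewritten using only the two pairwise confidences, which shows it is determined by them:

```latex
J(A, B) = \frac{P(A, B)}{P(A) + P(B) - P(A, B)}
        = \frac{1}{\frac{1}{P(B \vert A)} + \frac{1}{P(A \vert B)} - 1}
```

The second form follows by dividing the numerator and denominator by \( P(A, B) \) and substituting \( \frac{P(A)}{P(A, B)} = \frac{1}{P(B \vert A)} \) and \( \frac{P(B)}{P(A, B)} = \frac{1}{P(A \vert B)} \).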

  • Edward Ross
Community detection in Graphs
data

Community detection in Graphs

People using a website or app will have different patterns of behaviours. It can be useful to cluster the customers or products to help understand the business and make better strategic decisions. One way to view this data is as an interaction graph between people and the product they interact with. Clustering a graph of interactions is called "community detection". Santo Fortunato's review article and user guide provides a really good introduction to community detection.

  • Edward Ross
Serving Static Assets with Python Simple Server
web

Serving Static Assets with Python Simple Server

I was trying to load a local file in an HTML page and got a Cross-Origin Request Blocked error in my browser. The solution was to start a Python web server with python3 -m http.server. I had a JSON file I wanted to load into Javascript in an HTML page. Looking at StackOverflow I found fetch could do this: fetch("test.json").then(response => response.json()).then(json => process(json)) where process is some function that acts on the data; console.

  • Edward Ross
Listening
communication

Listening

When I'm in a comfortable environment I love to talk. This can be really useful for working through a problem by bouncing ideas off of other people, or for educating people and getting a point across. But in getting something done I find listening is much more powerful than talking. There's lots of reasons to spend more time listening than talking. When you get a greater diversity of ideas you generally get to a better solution, and often the quieter people in the room have a valuable perspective.

  • Edward Ross
Finding Common Substrings
data

Finding Common Substrings

I've found pairs of near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using MinHash. One thing that would be useful to know is what the common sections of the ads are. Typically if they have a high 3-Jaccard similarity it's because they have some text in common. The most asymptotically efficient way to find the longest common substring would be to build a suffix tree, but for experimentation the heuristics in Python's difflib work well enough.
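A sketch of the difflib approach on two made-up near-duplicate ads, using SequenceMatcher's (heuristic) longest matching block:

```python
from difflib import SequenceMatcher

ad1 = "Great role. Apply now at Acme Recruiting, call 1234."
ad2 = "Exciting job! Apply now at Acme Recruiting, call 5678."

# autojunk=False disables the popularity heuristic that can skip
# common characters on longer texts.
matcher = SequenceMatcher(None, ad1, ad2, autojunk=False)
match = matcher.find_longest_match(0, len(ad1), 0, len(ad2))
print(ad1[match.a : match.a + match.size])
```

Here the shared section is the boilerplate footer, which is exactly the kind of common text you'd want to identify and strip before comparing ads.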

  • Edward Ross
Simple Models
data

Simple Models

My first instinct when dealing with a new problem is to try to find a complex technique to solve it. However I've almost always found it more useful to start with a simple model before trying something more complex. You gain a lot from trying simple models and the cost is low. Even if they're not enough to solve the problem (which they can be) they will often give a lot of information about the problem which will set you up for later techniques.

  • Edward Ross
Power of Easy
software

Power of Easy

Something being easy makes a huge difference in how often it is used. Even small frictions can add up and make a task less desirable. In the book Nudge, Thaler and Sunstein talk about how small changes to defaults impact major decisions like whether they donate their organs and how they save for retirement. Whenever you're designing something make it as easy as possible for people to do the desired thing; and make sure it's easy from their perspective - where they don't care about the product they're using but the task they are trying to achieve.

  • Edward Ross
Beta Function
math

Beta Function

The Beta Function comes up in the likelihood of the binomial distribution. Understanding its properties is useful for understanding the binomial distribution. The beta function is given by \( B(a, b) = \int_0^1 p^{a-1}(1-p)^{b-1} \rm{d}p \) for a and b positive. If you have \( N \) flips of a coin of which \( k \) turn up heads the likelihood is proportional to \( p^{k}(1-p)^{N-k} \) for the probability p between 0 and 1. So the beta function can be seen as the normaliser of the likelihood, with \( a = k + 1 \) and \( b = N - k + 1 \) (or inversely \( k = a - 1 \) and \( N = a + b - 2 \)).
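The beta function can also be written in terms of the gamma function, \( B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} \). Here's a quick numerical check of that identity against the defining integral, using only the standard library:

```python
from math import gamma

def beta(a, b):
    # B(a, b) via the gamma function identity.
    return gamma(a) * gamma(b) / gamma(a + b)

# Crude Riemann sum of the defining integral of p^(a-1) (1-p)^(b-1).
n = 100_000
a, b = 3, 5
integral = sum(
    (i / n) ** (a - 1) * (1 - i / n) ** (b - 1) for i in range(1, n)
) / n

print(beta(a, b))  # 1/105
print(integral)    # close to 1/105
```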

  • Edward Ross
From Bernoulli to Binomial Distributions
data

From Bernoulli to Binomial Distributions

Suppose that you flip a fair coin 10 times, how many heads will you get? You'd think it was close to 5, but it might be a bit higher or lower. If you only got 7 heads would you reconsider your assumption the coin is fair? What if you got 70 heads out of 100 flips? This might seem a bit abstract, but the inverse problem is often very important. Given that 7 out of 10 people convert on a new call to action, can we say it's more successful than the existing one that converts at 50%?

  • Edward Ross
Minhash Sets
jobs

Minhash Sets

We've found pairs of near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using MinHash. But many pairs will be part of the same group; in an extreme case there could be a group of 5 job ads with identical texts, which produces 10 pairs. Both for interpretability and usability it makes sense to extract these groups from the pairs. Extracting the Groups Directly with Union Find: Each band of the LSH consists of buckets of items that may be similar; you could view the buckets as a partition of the corpus of all documents.
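The bucket-merging idea can be sketched with a small union-find; the buckets here are made-up stand-ins for the LSH bands:

```python
from collections import defaultdict

parent = {}

def find(x):
    # Find the representative of x's group, with path halving.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Each bucket holds documents that may be near duplicates of each other.
buckets = [["ad1", "ad2"], ["ad2", "ad3"], ["ad4"]]
for bucket in buckets:
    for doc in bucket:
        union(bucket[0], doc)

# Collect the final groups by representative.
groups = defaultdict(set)
for doc in parent:
    groups[find(doc)].add(doc)
print(list(groups.values()))
```

Because ad2 appears in two buckets, the first two buckets merge into one group of three ads, while ad4 stays on its own.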

  • Edward Ross
Searching for Near Duplicates with Minhash
nlp

Searching for Near Duplicates with Minhash

I'm trying to find near duplicates texts in the Adzuna Job Salary Predictions Kaggle Competition. In the last article I built a collection of MinHashes of the 400,000 job ads in half an hour in a 200MB file. Now I need to efficiently search through these minhashes to find the near duplicates because brute force search through them would take a couple of days on my laptop. MinHash was designed to approach this problem as outlined in the original paper.

  • Edward Ross
Considering VS Code from Emacs
emacs

Considering VS Code from Emacs

I've been using Emacs as my primary editor for around 5 years now (after 4 years of Vim). I'm very comfortable in it, having spent a long time configuring my init.el. But once in a while I'm slowed down by some strange issue, so I'm going to put aside my sunk configuration costs and have a look at using VS Code. On Emacs I recently read a LWN article on Making Emacs Popular Again (and the corresponding HN thread).

  • Edward Ross
Detecting Near Duplicates with Minhash
nlp

Detecting Near Duplicates with Minhash

I'm trying to find near duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. I've found that the Jaccard index on n-grams is effective for finding these. Unfortunately it would take about 8 days to calculate the Jaccard index on all pairs of the 400,000 ads, and take about 640GB of memory to store it. While this is tractable, we can find almost all pairs with a significant overlap in half an hour in-memory using MinHash.
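The quantity MinHash approximates can be sketched exactly for a single pair; here's the Jaccard index on word 3-grams for two made-up ads:

```python
def ngrams(text, n=3):
    # The set of word n-grams (shingles) in the text.
    words = text.split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # Intersection over union of the two shingle sets.
    a, b = ngrams(a), ngrams(b)
    return len(a & b) / len(a | b)

ad1 = "we are hiring a data analyst to join our team"
ad2 = "we are hiring a data scientist to join our team"
print(jaccard(ad1, ad2))  # 5/11
```

Computing this for all pairs is quadratic in the number of ads, which is exactly the cost MinHash avoids.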

  • Edward Ross
Lessons from a mathematician on building a community
maths

Lessons from a mathematician on building a community

Mathematicians and software developers have a lot in common. They both build structures of ideas, typically working in small groups or alone, but leveraging structures built by others. For software developers the ideas are concrete code implementations, the building blocks are subroutines, and they are published as "libraries" or "packages". For mathematicians the ideas are abstract, built on definitions and theorems, and published in papers, conferences and informal conversations. To grow a substantial body of work in both mathematics or software requires a community to contribute to it.

  • Edward Ross
Clustering for Segmentation
data

Clustering for Segmentation

Dealing with thousands of different items is difficult. When you've got a couple of dozen you can view them together, but as you get into the hundreds, thousands and beyond it becomes necessary to group items to make sense of them. For example if you've got a list of customers you might group them by state, or by annual spend. But sometimes it would be useful to split them into a few groups using some heuristic criteria; clustering is a powerful technique to do this.

  • Edward Ross
Representing Decision Trees on a grid
data

Representing Decision Trees on a grid

A decision tree is a series of conditional rules leading to an outcome. When stated as a chain of if-then-else rules it can be really hard to understand what is going on. If the number of dimensions and cutpoints is relatively small it can be useful to visualise on a grid to understand the tree. Decision trees are often represented as a hierarchy of splits. Here's an example of a classification tree on Titanic survivors.

  • Edward Ross
Writing 50 Daily Articles
writing

Writing 50 Daily Articles

I've been writing an article a day for 50 days now. I started this to help build a portfolio, keep track of useful learnings and to become better at writing. This post reflects on the progress so far. Inspiration While there are many sources of inspiration for my writing, Sacha Chua's No Excuses Guide to Blogging is the biggest one. I bought the book around 2 years ago but I've found it useful and kept coming back to it.

  • Edward Ross
Four Competencies of an Effective Analyst
data

Four Competencies of an Effective Analyst

Analysts tend to be natural problem solvers, good at reasoning and adept with numbers. But to know how to frame the problem and what to look for they need to understand the context. To solve the problems they have to collect the right data and perform any necessary calculations. To have impact they need to be able to understand what's valuable, communicate their insights and influence decisions. These make up the four competencies of an effective analyst.

  • Edward Ross
4am Rule for timeseries
data

4am Rule for timeseries

When you've got a timeseries that doesn't have a timezone attached to it the natural question is "what timezone is this data from?" Sometimes it's UTC, sometimes it's the timezone of the server, otherwise it could be the timezone of one of the locations it's about (and it may or may not change with daylight savings). When it's people's web activity there's a simple heuristic to check this: the activity will be at a minimum between 3am and 5am.

  • Edward Ross
Locating Addresses with G-NAF
data

Locating Addresses with G-NAF

A very useful open dataset the Australian Government provides is the Geocoded National Address File (G-NAF). This is a database mapping addresses to locations. This is really useful for applications that want to provide information or services based on someone's location. For instance you could build a custom store finder, get aggregate details of your customers, or locate business entities with an address, for example ATMs. There's another open and editable dataset of geographic entities, Open Street Map (and it has a pretty good open source Android app OsmAnd).

  • Edward Ross
Pipetable to CSV
emacs

Pipetable to CSV

Sometimes I get output as pipe tables in Emacs that I want to convert into a CSV to put somewhere else. This is really easy with regular expressions. I often get data output from an SQL query like this:

 text         | num  | value
--------------+------+-------------
 Some text    | 0.3  | 0.2
 Rah rah      | 7    | 0.00123
(2 rows)

Running sed 's/\(^ *\| *|\|(.*\) */,/g' gives:

,text,num,value
--------------+------+-------------
,Some text,0.3,0.2
,Rah rah,7,0.00123,

I can delete the divider and then use it as a CSV.

  • Edward Ross
Binning data in SQL
sql

Binning data in SQL

Generally when combining datasets you want to join them on some key. But sometimes you really want a range lookup like Excel's VLOOKUP. A common example is binning values; you want to group values into custom ranges. While you could do this with a giant CASE statement, it's much more flexible to specify in a separate table (for regular intervals you can do it with some integer division gymnastics). It is possible to implement VLOOKUP in SQL by using window functions to select the right rows.
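The same range-lookup idea can be sketched outside SQL: keep a sorted list of lower bounds and find the last one at or below the value. A minimal Python illustration using bisect (the edges and labels here are made up for the example):

```python
from bisect import bisect_right

edges = [0, 10, 100, 1000]           # lower bound of each bin
labels = ["0-9", "10-99", "100-999"]

def range_lookup(value):
    """Like Excel's VLOOKUP range match: find the last edge <= value."""
    i = bisect_right(edges, value) - 1
    if i < 0:
        return None                  # below the smallest bin
    return labels[min(i, len(labels) - 1)]
```

The SQL version in the post achieves the same thing with a window function selecting the greatest lower bound from the lookup table.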

  • Edward Ross
A Mixture of Bernoullis is Bernoulli
maths

A Mixture of Bernoullis is Bernoulli

Suppose you are analysing email click-through rates. People either follow the call to action or they don't, so it's a Bernoulli Distribution, with probability equal to the actual probability a random person will click through the email. But in actuality your email list will be made up of different groups; for example people who have just signed up to the list may be more likely to click through than people who have been on it for a long time.

  • Edward Ross
Probability Squares
maths

Probability Squares

A geometric way to represent combining two independent discrete random variables is as a probability square. On each side of the square we have the distributions of the random variables, where the length of each segment is proportional to the probability. In the centre we have the function evaluated on the two edges and the probability is proportional to the area of the rectangle. For example suppose we had a random process that generated 1, 2 or 3 with equal probability (for example half the value of a die, rounded up).
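The area calculation the square represents is just the product rule for independent variables: each rectangle contributes probability p_a * p_b to the outcome f(a, b). A small sketch with exact fractions (the half-a-die distribution is the example from the text):

```python
from fractions import Fraction
from collections import defaultdict

def combine(dist_a, dist_b, f):
    """Distribution of f(A, B) for independent discrete A and B; each
    cell of the probability square adds area p_a * p_b to f(a, b)."""
    out = defaultdict(Fraction)
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[f(a, b)] += pa * pb
    return dict(out)

third = Fraction(1, 3)
die_half = {1: third, 2: third, 3: third}  # half a die value, rounded up
total = combine(die_half, die_half, lambda a, b: a + b)
```

For the sum of two such variables the square gives the familiar triangular distribution: 4 is three times as likely as 2.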

  • Edward Ross
Representing Interaction Networks
data

Representing Interaction Networks

Behavioural data can illuminate the structure of the underlying actors. For example looking at which products customers buy can help understand how both the products and customers interact. The same idea can apply to people who attend events, watch the same movie, or have authored a scientific paper together. There are a few ways to represent these kinds of interactions which gives a large toolbox of ways to approach the problem.

  • Edward Ross
Excel Binning
excel

Excel Binning

Putting numeric data into bins is a useful technique for summarising, especially for continuous data. This is what underlies histograms, which are bar charts of frequency counts in each bin. There are two main ways of doing this in Excel: with groups and with vlookup (you can also do this in SQL). If you want equal length bins in a Pivot Table the easiest way is with groups. Right click on the column you want to bin and select Group

  • Edward Ross
Powershell Debugging with Write-Warning
programming

Powershell Debugging with Write-Warning

I had to debug some Powershell, without knowing anything about it. I found Write-Warning was the right tool for printline debugging. This was enough to resolve my issue. I first tried Write-Output but apparently it doesn't work inside a function which I found misleading for a while (at first I thought that it wasn't getting to the function). Write-Warning worked straight away and I could see in bright yellow what was going on.

  • Edward Ross
Analysis Needs to Change A Decision
data

Analysis Needs to Change A Decision

Any analysis where the results won't change a decision is worthless. Before even thinking of getting any data it's worth being clear on how it impacts the decision. There's lots of reasons people want an analysis. Sometimes it's to confirm what they already believe (and they'll discount anything that tells them otherwise). Sometimes it's to prove to others something they believe; possibly to inform a decision someone else is making. But it's most valuable when it affects a decision they can make with an outcome they care about.

  • Edward Ross
SQL Views for hiding business logic
sql

SQL Views for hiding business logic

The longer I work with a database the more I learn the dark corners of the dataset. Make sure you exclude the rows created by the test accounts listed in another table. Don't use the create_date field, use the real_create_date_v2 instead, unless it's not there, then just use create_date. Make sure you only get data from the latest snapshot for the key. Very quickly I end up with complex spaghetti SQL, which either contains monstrous subqueries or a chain of CREATE TEMPORARY TABLE.

  • Edward Ross
Near Duplicates with TF-IDF and Jaccard
nlp

Near Duplicates with TF-IDF and Jaccard

I've looked at finding near duplicate job ads using the Jaccard index on n-grams. I wanted to see whether using the TF-IDF to weight the ads would result in a clearer separation. It works, but the results aren't much better, and there are some complications in using it in practice. When trying to find similar ads with the Jaccard index we looked at the proportion of n-grams they have in common relative to all the n-grams between them.

  • Edward Ross
Near Duplicates with Jaccard
nlp

Near Duplicates with Jaccard

Finding near-duplicate texts is a hard problem, but the Jaccard index for n-grams is an effective measure that's efficient on small sets. I've tried it on the Adzuna Job Salary Predictions Kaggle Competition with good success. This works pretty well at finding near-duplicates and even ads from the same company; although by itself it can't detect duplicates. I've looked before at using the edit distance, which looks for the minimum number of changes to transform one text to another, but it's slow to calculate.
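The measure itself is simple: the n-grams two texts share, divided by all the n-grams between them. A minimal sketch on whitespace tokens (the real post's tokenisation may differ):

```python
def ngrams(text, n=3):
    """Set of n-token shingles from a whitespace-tokenised text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard index: shared n-grams over all n-grams between the texts."""
    sa, sb = ngrams(a, n), ngrams(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

A pair of ads with only a word or two changed keeps most of its n-grams, so it scores close to 1, while unrelated ads score near 0.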

  • Edward Ross
Edit Distance
nlp

Edit Distance

Edit distance, also known as Levenshtein Distance, is a useful way of measuring the similarity of two sequences. It counts the minimum number of substitutions, insertions and deletions you need to make to transform one sequence into another. I had a look at using this for comparing duplicate ads with reasonable results, but it's a little slow to run on many ads. I've previously looked at finding ads with exactly the same text in the Adzuna Job Salary Predictions Kaggle Competition, but there are a lot of ads that are slight variations.
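The standard way to compute it is dynamic programming over prefixes, which is why it's slow on many pairs: each comparison is O(len(a) * len(b)). A compact sketch keeping only one row of the table:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

On a pair of long job ads this is fine, but across all pairs of 400,000 ads the quadratic pairwise cost is what makes it impractical.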

  • Edward Ross
Using Emacs under WSL
emacs

Using Emacs under WSL

Getting Emacs to work nicely on a Windows system can be a challenge. You can install it natively (although getting all the dependencies is a challenge), but many packages require libraries or utilities that are hard to install or don't exist on Windows. The best solution I have found is using Emacs under the Windows Subsystem for Linux (WSL) with Xming. However if you run Emacs 26 or greater after starting Xming with XLaunch you're faced with a blank screen and can't see any writing on Emacs

  • Edward Ross
The Problem with Jaccard for Clustering
data

The Problem with Jaccard for Clustering

The Jaccard Index is a useful measure of similarity between two sets. It makes sense for any two sets, is efficient to compute at scale and its arithmetic complement is a metric. However for clustering it has one major disadvantage; small sets are never close to large sets. Suppose you have sets that you want to cluster together for analysis. For example each set could be a website and the elements are people who visit that website.

  • Edward Ross
Jaccard Shingle Inequality
maths

Jaccard Shingle Inequality

Two similar documents are likely to have many similar phrases relative to the number of words in the document. In particular if you're concerned with plagiarism and copyright, getting the same data through multiple sources, or finding versions of the same document this approach could be useful. In particular MinHash can quickly find pairs of items with a high Jaccard index, which we can run on sequences of w tokens. A hard question is what's the right number for w?

  • Edward Ross
Finding Exact Duplicate Text
python

Finding Exact Duplicate Text

Finding exact duplicates texts is quite straightforward and fast in Python. This can be useful for removing duplicate entries in a dataset. I tried this on the Adzuna Job Salary Predictions Kaggle Competition job ad texts and found it worked well. Naively finding exact duplicates by comparing every pair would be O(N^2), but if we sort the input, which is O(N log(N)), then duplicate items are adjacent. This scales really well to big datasets, and then the duplicate entries can be handled efficiently with itertools groupby to do something like uniq.
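The sort-then-group approach described above can be sketched in a few lines with itertools.groupby (a toy illustration, not the post's exact code):

```python
from itertools import groupby

def duplicate_groups(texts):
    """Sort indices by text (O(N log N)) so identical texts are
    adjacent, then group the adjacent runs like uniq."""
    ordered = sorted(range(len(texts)), key=lambda i: texts[i])
    groups = []
    for _, grp in groupby(ordered, key=lambda i: texts[i]):
        ids = list(grp)
        if len(ids) > 1:          # only keep actual duplicates
            groups.append(ids)
    return groups
```

Note that groupby only groups adjacent equal items, which is exactly why the sort has to come first.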

  • Edward Ross
Showing Side-by-Side Diffs in Jupyter
python

Showing Side-by-Side Diffs in Jupyter

When comparing two texts it's useful to have a side-by-side comparison highlighting the differences. This is straightforward using HTML in Jupyter Notebooks with Python, and the inbuilt DiffLib. I used this to display job ads duplicated between different sites. For a long document it's important to align the sentences (otherwise it's hard to compare the differences), and highlight the individual differences at a word level. Overall the problems are breaking up a text into sentences and words, aligning the sentences, finding word level differences and displaying them side-by-side.

  • Edward Ross
Creating a Diff Recipe in Prodigy
nlp

Creating a Diff Recipe in Prodigy

I created a simple custom recipe to show diffs between two texts in Prodigy. I intend to use this to annotate near-duplicates. The process was pretty easy, but I got tripped up a little. I've been extracting job titles and skills from the job ads in the Adzuna Job Salary Predictions Kaggle Competition. One thing I noticed is there are a lot of job ads that are almost exactly the same; sometimes between the train and test set, which is a data leak.

  • Edward Ross
All of Statistics
data

All of Statistics

For anyone who wants to learn Statistics and has a maths or physics background I highly recommend Larry Wasserman's All of Statistics. It covers a wide range of statistics with enough mathematical detail to really understand what's going on, but not so much that the machinery is overwhelming. What I learned reading it really helped me understand statistics well enough to design bespoke statistical experiments and effectively use and implement machine learning models.

  • Edward Ross
Remote social catchups are less intimate
life

Remote social catchups are less intimate

As an introvert I really like catching up with good friends in small groups. But a video/remote catchup is much less intimate than real life because only one person can talk at a time. When you get 4 or more people in a group setting, frequently the conversation splits into smaller subgroups. The subgroups let people intermingle and participate in topics they're more interested in while all being together. With a video call you can't easily do this splitting and only one person can talk at a time.

  • Edward Ross
Counting n-grams with Python and with Pandas
python

Counting n-grams with Python and with Pandas

Sequences of words are useful for characterising text and for understanding text. If two texts have many similar sequences of 6 or 7 words it's very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like "New York" together rather than treating them as the separate words "New" and "York". To do this we need a way of extracting and counting sequences of words.
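Extracting and counting word sequences fits in a couple of lines of plain Python with zip and Counter (a minimal sketch of the idea; the post also covers a Pandas version):

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count sliding windows of n tokens by zipping shifted slices."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "new york is in new york state".split()
bigrams = count_ngrams(tokens, 2)
```

Each shifted slice starts one token later, so zipping them together yields every consecutive window of n tokens exactly once.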

  • Edward Ross
Waiting for System clock to synchronise
linux

Waiting for System clock to synchronise

When trying to install packages with apt on a new Ubuntu AWS EC2 instance I had issues where the signature would fail to verify. The reason was the system clock was far in the past and so it looked like the signature was signed in the future. I created a workaround to wait for the system clock to synchronise that solved the problem and could be useful when starting a new machine with time sensitive issues.

  • Edward Ross
Not using NER for extracting Job Titles
nlp

Not using NER for extracting Job Titles

I've been trying to use Named Entity Recogniser (NER) to extract job titles from the titles of job ads to better understand a collection of job ads. While NER is great, it's not the right tool for this job, and I'm going to switch to a counting based approach. NER models try to extract things like the names of people, places or products. SpaCy's NER model which I used is optimised to these cases (looking at things like capitalisation of words).

  • Edward Ross
Rules, Pipelines and Models
nlp

Rules, Pipelines and Models

Over the past decade deep neural networks have revolutionised dealing with unstructured data. Problems that were intractable, from identifying objects in a video through generating realistic text to translating speech between languages, are now used in real-time production systems. You might think that today all problems on text, audio and images should be solved by training end-to-end neural networks. However rules and pipelines are still extremely valuable in building systems, and can leverage the information extracted from the black-box neural networks.

  • Edward Ross
Training a job title NER with Prodigy
nlp

Training a job title NER with Prodigy

In a couple of hours I trained a reasonable job title Named Entity Recogniser for job ad titles using Prodigy, with over 70% accuracy. While 70% doesn't sound great, it's a bit ambiguous what a job title is, and getting exactly the bounds of the job title can be a hard problem. It's definitely good enough to be useful, and could be improved. After thinking through an annotation scheme for job titles I wanted to try annotating and training a model.

  • Edward Ross
Annotating Job Titles
nlp

Annotating Job Titles

When doing Named Entity Recognition it's important to think about how to set up the problem. There's a balance between what you're trying to achieve and what the algorithm can do easily. Coming up with an annotation scheme is hard, because as soon as you start annotating you notice lots of edge cases. This post will go through an example with extracting job titles from job ads. In our previous post we looked at what was in a job ad title and a way of extracting some common job titles from the ads.

  • Edward Ross
What's in a Job Ad Title?
nlp

What's in a Job Ad Title?

The job title should succinctly summarise what the role is about, so it should tell you a lot about the role. However in practice job titles can range from very broad to very narrow, be obscure or acronym-laden and even hard to nail down. They're even hard to extract from a job ad's title - which is what I'll focus on in this series. In a previous series of posts I developed a method that could extract skills written a very particular way.

  • Edward Ross
Disk Usage in Linux with du
linux

Disk Usage in Linux with du

When your harddrive is filling up the du utility is a great way of seeing what's taking up all the space. It can recursively walk through directories to a maximum depth, and print it in human readable sizes. I'll normally start by running df to see what space is used and available. It's worth looking at the Mounted On column if you don't administer the machine because sometimes there are special partitions for large files.

  • Edward Ross
Getting Started Debugging with pdb
python

Getting Started Debugging with pdb

When there's something unexpected happening in your Python code the first thing you want to do is to get more information about what's going wrong. While you can use print statements or logging it may take a lot of iterations of rerunning and editing your statements to capture the right information. You could use a REPL but sometimes it's challenging to capture all the state at the point of execution. The most powerful tool for this kind of problem is a debugger, and it's really easy to get started with Python's pdb.

  • Edward Ross
Calculating percentages in Presto
SQL

Calculating percentages in Presto

One trick I use all the time is calculating percentages in SQL by dividing with the count. Percentages quickly tell me how much coverage I've got when looking at the top few rows. However Presto uses integer division so doing the naive thing will always give you 0 or 1. There's a simple trick to work around this: replace count(*) with sum(1e0). Suppose for example you want to calculate the percentage of a column that is not null; you might try something like
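The trick can be demonstrated with SQLite, which shares Presto's integer-division behaviour for integer operands (this is an illustrative sketch; Presto itself would be queried the same way):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (2,), (None,)])

# Naive: integer / integer truncates to 0 (or 1)
naive, = con.execute("SELECT count(col) / count(*) FROM t").fetchone()

# The trick: sum(1e0) is a float, so the division is floating point
pct, = con.execute("SELECT count(col) / sum(1e0) FROM t").fetchone()
```

Here count(col) counts the non-null rows, so dividing by sum(1e0) gives the fraction of the column that is not null.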

  • Edward Ross
Moving Averages in SQL
SQL

Moving Averages in SQL

Moving averages can help smooth out the noise to reveal the underlying signal in a dataset. As they lag behind the actual signal they trade off timeliness for increased precision in the underlying signal. You could use them for reporting metrics or for alerting in cases where it's more important to be sure there is a change than it is to catch any change early. It's typically better to have a 7 day moving average than weekly reporting for important metrics because you'll see changes earlier.

  • Edward Ross
Getting most recent value in Presto with max_by
Presto

Getting most recent value in Presto with max_by

Presto and the AWS managed alternative Amazon Athena have some powerful aggregation functions that can make writing SQL much easier. A common problem is getting the most recent status of a transaction log. The max_by function (and its partner min_by) makes this a breeze. Suppose you have a table tracking user login activity over time like this:

country | user_id | time             | status
AU      | 1       | 2020-01-01 08:00 | logged-in
CN      | 2       | 2020-01-01 09:00 | logged-in
AU      | 1       | 2020-01-01 12:00 | logged-out
AU      | 1       | 2020-01-01 13:00 | logged-in
CN      | 2       | 2020-01-01 14:00 | logged-out

You need to find out which users are currently logged in and out, which requires you to find their most recent status.
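What max_by(status, time) computes per group can be sketched in plain Python: keep, for each user, the status from the row with the latest time (the data here is the example table from the post):

```python
rows = [
    ("AU", 1, "2020-01-01 08:00", "logged-in"),
    ("CN", 2, "2020-01-01 09:00", "logged-in"),
    ("AU", 1, "2020-01-01 12:00", "logged-out"),
    ("AU", 1, "2020-01-01 13:00", "logged-in"),
    ("CN", 2, "2020-01-01 14:00", "logged-out"),
]

# Equivalent of: SELECT user_id, max_by(status, time) GROUP BY user_id
latest = {}
for country, user, time, status in rows:
    if user not in latest or time > latest[user][0]:
        latest[user] = (time, status)
current_status = {user: status for user, (_, status) in latest.items()}
```

The ISO-style timestamps sort correctly as strings, which is what makes the simple comparison work.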

  • Edward Ross
Syncing Calendars and Contacts to Android with DAVx5
android

Syncing Calendars and Contacts to Android with DAVx5

I find it really handy to have my calendar and contacts from my email client on my mobile phone. DAVx5 is a fantastic free (GPLv3) app to do this on Android. This lets me organise my life across devices and helps me know when friends and family's birthdays are. DAVx5 is simple to set up and has worked almost flawlessly for me for over 4 years. It supports two way synchronisation to CalDAV and CardDAV servers that many email providers support.

  • Edward Ross
Don't manage work email with Emacs
email

Don't manage work email with Emacs

I do a lot of work in Emacs and at the command line, and I get quite a few emails so it would be great if I could handle my emails there too. Email in Emacs can be surprisingly featureful and handles HTML markup, images and can even send org markup with images and equations all from the comfort of an Emacs buffer. However it can be a whole heap of work, and as you get deeper into the features your mail client provides the amount of custom integration required grows very rapidly.

  • Edward Ross
Data Transformations in the Shell
data

Data Transformations in the Shell

There are many great tools for filtering, transforming and aggregating data like SQL, R dplyr and Python Pandas (not to mention Excel). But sometimes when I'm working on a remote server I want to quickly extract some information from a file without switching to one of these environments. The standard unix tools like uniq, sort, sed and awk can do blazing fast transformations on text files that don't fit in memory and are easy to chain together.

  • Edward Ross
Second most common value with Pandas
python

Second most common value with Pandas

I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignment out of order. However some things are really difficult to do with method chaining in Pandas; in particular getting the second most common value of each group. This is much easier to do in R's dplyr with its consistent and flexible syntax than it is with Pandas. Problem For the table below find the total frequency and the second most common value of y by frequency for each x (in the case of ties any second most common value will suffice).
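Stripped of the dataframe machinery, the per-group computation itself is small; a pure-Python sketch of "second most common value by frequency" with Counter (the Pandas version in the post is what makes this awkward, not the logic):

```python
from collections import Counter

def second_most_common(values):
    """Second most common value by frequency; ties broken arbitrarily,
    None if there are fewer than two distinct values."""
    common = Counter(values).most_common(2)
    return common[1][0] if len(common) > 1 else None
```

Applied per group of x this gives exactly the quantity the post computes for each group.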

  • Edward Ross
Property Based Testing - A thousand test cases in a single line

Property Based Testing - A thousand test cases in a single line

Property based testing lets you specify rules that a function being tested will satisfy over a wide range of inputs. This specifies how to thoroughly test a function without coming up with a detailed set of test cases. For example instead of writing a specific test case like sort([1, 3, 2]) == [1, 2, 3], you could state that the input and output of sort should contain exactly the same elements for any valid input.
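Libraries like Hypothesis automate the input generation and shrinking, but the core idea fits in a few lines of plain Python: state the properties, then throw many random inputs at them (a hand-rolled sketch, not a substitute for a real property-testing library):

```python
import random
from collections import Counter

def check_sort_properties(seq):
    """Properties of sorted(): output is ordered and is a permutation
    of the input (same elements, same multiplicities)."""
    out = sorted(seq)
    assert all(a <= b for a, b in zip(out, out[1:]))  # ordered
    assert Counter(out) == Counter(seq)               # same elements

random.seed(0)
for _ in range(1000):
    case = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
    check_sort_properties(case)
```

A thousand generated cases, including the empty list and lists with duplicates, from one statement of the properties.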

  • Edward Ross
Using emacs dumb-jump with evil
emacs

Using emacs dumb-jump with evil

Dumb-jump is a fantastic emacs package for code navigation. It jumps to the definition of a function/class/variable by searching for regular expressions that look like a definition using ag, ripgrep or git-grep/grep. Because it is so simple it works in over 40 languages (including oddities like SQL, LaTeX and Bash) and is easy to extend. While it is slower and less accurate than ctags, for medium sized projects it's fast enough and requiring no setup makes it much more useful in practice.

  • Edward Ross
Presto and Athena CLI in Emacs

Presto and Athena CLI in Emacs

I find having Emacs as a unified programming environment really useful. When writing an SQL pipeline I can iteratively develop my SQL in emacs, running it against the database. For a quick and dirty analysis I can copy the output into the .sql file and comment it out. Then I can copy the SQL into a programming language, parameterise it, and test it without touching the mouse. So when I started using Presto and AWS's managed alternative Athena, I needed to integrate it into emacs.

  • Edward Ross
Fastai Callbacks as Lisp Advice

Fastai Callbacks as Lisp Advice

Creating state of the art deep learning algorithms often requires changing the details of the training process. Whether it's scheduling hyperparameters, running on multiple GPUs or plotting the metrics, it requires changing something in the training loop. However constantly modifying the core training loop every time you want to add a feature, and adding a switch to enable it, quickly becomes unmaintainable. The solution fast.ai developed is to add points where custom code can be called that modifies the state of training, which they call callbacks.

  • Edward Ross
94% confidence with 5 measurements

94% confidence with 5 measurements

There are many things that are valuable to know in business but are hard to measure. For example the time from when a customer has a need to purchase, the number of related products customers use, or the actual value your products are delivering. However you don't need a sample size of hundreds to get an estimate; in fact you can get a statistically significant result from measuring just 5 random customers.
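The arithmetic behind the claim (often called the "rule of five") is short: each sample independently lands above the population median with probability 1/2, so the sample range misses the median only when all five land on the same side. A quick check with exact fractions:

```python
from fractions import Fraction

n = 5
# All n samples above the median, or all n below it
p_miss = 2 * Fraction(1, 2) ** n
# Probability the median lies between the sample min and max
p_contains = 1 - p_miss
```

With n = 5 this gives 15/16, i.e. 93.75% confidence that the median lies between the smallest and largest of the five measurements.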

  • Edward Ross
How to Display All Columns in R Jupyter

How to Display All Columns in R Jupyter

I like to do one-off analyses in R because tidyverse makes it really easy and beautiful. I also like to do them in Jupyter Notebooks because they form a neat way to collate the results. While R Markdown is better for reproducible code, often I'm doing expensive things with databases that are changing, and so I tend to find the "write once" behaviour of Jupyter Notebooks fit this use case better (although R Markdown Notebooks are catching up).

  • Edward Ross
Extracting Skills from Job Ads: Part 3 Conjugations
nlp

Extracting Skills from Job Ads: Part 3 Conjugations

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "experience in telesales" using spaCy's dependency parse, but it wouldn't extract many types of experience from a job ad. Here we will extend these rules to extract lists of skills (for example extracting "telesales" and "callcentre" from "experience in telesales or receptionist"), which will let us analyse which experiences are related.

  • Edward Ross
Extracting Skills from Job Ads: Part 2 - Adpositions
nlp

Extracting Skills from Job Ads: Part 2 - Adpositions

Extracting Experience in a Field I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "subsea cable engineering experience". This worked well, but extracted a lot of qualifiers that aren't skills (like "previous experience in", or "any experience in"). Here we will write rules to extract experience from phrases like "experience in subsea cable engineering", with much better results.

  • Edward Ross
Extracting Skills from Job Ads: Part 1 - Noun Phrases
nlp

Extracting Skills from Job Ads: Part 1 - Noun Phrases

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. Using rules to extract noun phrases ending in experience (e.g. subsea cable engineering experience) we can extract many skills, but there's a lot of false positives (e.g. previous experience) You can see the Jupyter notebook for the full analysis. Extracting Noun Phrases It's common for ads to write something like "have this kind of experience":

  • Edward Ross
Leading the Product 2019

Leading the Product 2019

I attended the excellent 2019 Leading the Product conference in Melbourne with around 500 other Product Managers and Enthusiasts. The conference had a broad range of great talks, a stimulating networking event where we connected by sharing our favourite books on product management, and overall an energetic atmosphere. I got something out of every talk, but here are the highlights from a data perspective. Find quick ways of testing difficult and uncertain hypotheses John Zeratsky talked about the design sprint for implementing a design solution in a week; from storyboarding an experience, to brainstorming solutions to prototyping and testing.

  • Edward Ross
Data Blockless: A better way to create data

Data Blockless: A better way to create data

Before you can do any machine learning you need to be able to read the data, create test and training splits and convert it into the right format. Fastai has a generic data block API for doing these tasks. However it's quite hard to extend to new data types. There's a few classes to implement; Items, ItemLists, LabelLists and the Preprocessors, which are obfuscated through a complex inheritance and dispatch hierarchy.

  • Edward Ross
Constant Models

Constant Models

When predicting outcomes using machine learning it's always useful to have a baseline to compare results against. A simple baseline is the best constant model; that is a model that gives the same prediction for any input. This is a really simple check to perform against any dataset, and can be informative to check across validation splits. There are simple algorithms for finding the best constant model. For categorical predictions just evaluate every possible category to choose as the constant prediction.
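Those two baselines are each a one-liner: the most common category maximises accuracy, and the mean minimises squared error. A minimal sketch (loss choices other than accuracy and MSE would give different constants):

```python
from collections import Counter

def best_constant_classifier(labels):
    """Best constant categorical prediction under accuracy:
    the most common label."""
    return Counter(labels).most_common(1)[0][0]

def best_constant_regressor(values):
    """Best constant prediction under mean squared error: the mean."""
    return sum(values) / len(values)
```

Comparing a trained model against these constants is a cheap sanity check; a model that can't beat them isn't learning anything from the features.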

  • Edward Ross
A programmer using Excel

A programmer using Excel

Intro When I was 15 I did a week of work experience with my neighbour, who was an agricultural economist running his own one person business. I'm still not really sure what an agricultural economist does, but I went out with him to visit his clients to talk through their business, and saw how he analysed their data in his Excel spreadsheet. It was really closer to an application than a spreadsheet; the interface made it clear where the client was meant to enter their data, it showed some summary output and most of the intermediate calculations were hidden.

  • Edward Ross
Spectra of atoms

Spectra of atoms

Why is a sodium lamp yellow? How can we determine the elemental composition of the sun? How does a helium-neon laser work? To some degree all of these questions require knowing the spectra of atoms, which can in theory be calculated by Quantum mechanics. However the calculation of these spectra for arbitrary systems from first principles is prohibitively difficult and computationally intensive (which is why techniques such as Density Functional Theory are used).

  • Edward Ross
Regular expressions, automata and monoids

Regular expressions, automata and monoids

In formal language theory the task is to specify, over some given alphabet, a set of valid strings. This is useful in searching for structure in textual data (e.g. via grep), for specifying the syntactic structure of programming languages (e.g. in Bison or pandoc), and for generating output of a specified form (e.g. automatic computer science and mathematics paper generators).

An automaton is (roughly) a set of symbols, and a set of states, along with transitions for each state that take a symbol and return another state. They can be used to model (and verify) simple processes.

Automata can be brought into correspondence with formal languages in a very natural way: given a state s, and a sequence of symbols (a1, a2, …, an) the automaton has a naturally assigned state (… ((s a1) a2) … an) (where “(state symbol)” represents the state obtained from the transition on symbol using state). Then if we nominate an initial state, and a set of “accepting” valid states, we say a string is in the language of the automaton if and only if, when applied to the initial state, it ends in an accepting state.

This gives a very useful pairing in computer science; formal languages are useful tools, and automata (often) give an efficient way to implement them on a computer.
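The fold described above can be sketched as a tiny Python interpreter for a deterministic automaton (the state names and transition table here are illustrative):

```python
def run_dfa(transitions, start, accepting, symbols):
    """Fold the symbol sequence through the transition function:
    state := transitions[(state, symbol)] for each symbol in turn,
    then accept iff we end in an accepting state."""
    state = start
    for symbol in symbols:
        state = transitions[(state, symbol)]
    return state in accepting

# Example language: binary strings containing an even number of 1s
transitions = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
accepts = run_dfa(transitions, "even", {"even"}, "1011")
# "1011" has three 1s, so it is rejected
```

This is exactly the efficient implementation the pairing promises: each input symbol costs one dictionary lookup.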

  • Edward Ross
DVI by example

DVI by example

The Device Independent File Format (DVI) is the output format of Knuth’s TeX82; modern TeX engines (pdfTeX, luaTeX) output straight to Adobe’s Portable document format (PDF). However TeX82 and DVI still work as well today as they did when they were written; DVI files are easily cast to postscript or PDF.

The defining reference for DVI files is David R. Fuchs's article in TUGboat Vol. 3 No. 2.

To find out what information is contained in a particular DVI file use Knuth’s dvitype, which outputs the operations contained in the bytecode in human readable format.

This article goes into gory detail about the instructions contained in a very simple DVI file.
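As a taste of the format, here is a minimal sketch in Python of decoding a DVI preamble, with the field layout as I understand it from the DVI specification (opcode 247 for `pre`, a format id, the unit-scaling numerator and denominator, a magnification, and a comment); the helper name and sample bytes are illustrative:

```python
import struct

def parse_dvi_preamble(data: bytes):
    """Decode a DVI preamble: byte 247 ('pre'), format id, then three
    big-endian 32-bit integers (num, den, mag), a comment length k,
    and k bytes of comment."""
    opcode, version = data[0], data[1]
    assert opcode == 247, "not a DVI preamble"
    num, den, mag = struct.unpack(">iii", data[2:14])
    k = data[14]
    comment = data[15:15 + k].decode("ascii")
    return {"version": version, "num": num, "den": den,
            "mag": mag, "comment": comment}

# A hand-built preamble using TeX's standard unit scaling values
pre = (bytes([247, 2])
       + struct.pack(">iii", 25400000, 473628672, 1000)
       + bytes([5]) + b"hello")
info = parse_dvi_preamble(pre)
```

For real files dvitype remains the authoritative decoder; this is only enough to read the header.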

  • Edward Ross
Algorithms for finding the real roots of polynomials

Algorithms for finding the real roots of polynomials

Given a degree n polynomial over the real numbers we are guaranteed there are at most n real roots by the fundamental theorem of algebra; but how do we find them? Here we explore the Vincent-Collins-Akritas algorithm.

It uses Descartes’ rule of signs: given a polynomial \(p(x) = a_n x^n + \cdots + a_1 x + a_0\) the number of real positive roots (counting multiplicities) is bounded above by the number of sign variations in the sequence \((a_n, \ldots, a_1, a_0)\) .
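Counting sign variations is easy to implement; a minimal Python sketch (the helper name is illustrative):

```python
def sign_variations(coeffs):
    """Count sign changes in a coefficient sequence, skipping zeros.
    By Descartes' rule of signs this bounds the number of positive
    real roots, counting multiplicities."""
    signs = [c > 0 for c in coeffs if c != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# p(x) = x^3 - 3x + 2 = (x - 1)^2 (x + 2): coefficients (1, 0, -3, 2)
# have two sign variations, matching the double root at x = 1
variations = sign_variations([1, 0, -3, 2])
```

The Vincent-Collins-Akritas algorithm repeatedly transforms the polynomial and applies this count to isolate intervals containing exactly one root.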

  • Edward Ross
Geometry and topology of division rings

Geometry and topology of division rings

Following from my last post (and Veblen and Young’s Projective Geometry) consider a projective plane satisfying the axioms:

  1. Given two distinct points there is a unique line that both points lie on
  2. Each line has at least three points which lie on it
  3. Given a triangle any line that intersects two sides of the triangle intersects the third.
  4. All points are spanned by d+1 points and no fewer.

Then for d ≥ 3 this is equivalent to the projective space of lines over a division ring (or skew field).

Kolmogorov asked which projective spaces we can do analysis on. In order to do things such as find tangent lines we are going to need some sort of topology.

  • Edward Ross
Geometry of division rings
maths

Geometry of division rings

It is fairly easy to construct a geometry from algebra: given a division ring K we form an n-dimensional vector space over it, the points being the vectors and a line being a translation of all (left) multiples of a non-zero vector, i.e. a set of the form \(\{a\mathbf{v} + \mathbf{c}\,|\, a \in K\}\) for some fixed vectors \(\mathbf{v} \neq 0\) and \(\mathbf{c}\).
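A minimal Python sketch of this construction over the finite field Z/pZ (a commutative special case of a division ring; the helper name and example vectors are illustrative):

```python
def line(v, c, p):
    """The set of points a*v + c for a in Z/pZ, computed componentwise
    mod p -- the 'translated multiples of a non-zero vector' above."""
    assert any(x % p for x in v), "direction vector must be non-zero"
    return {tuple((a * vi + ci) % p for vi, ci in zip(v, c))
            for a in range(p)}

# A line in the plane over GF(5): multiples of (1, 2) shifted by (0, 1)
pts = line((1, 2), (0, 1), 5)
# Exactly p = 5 points, one for each scalar a
```

Over GF(5) every line has exactly five points, so the axioms below (each line has at least three points) are comfortably satisfied.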

Interestingly it’s just as possible to go the other way, if we’re careful about what we mean by a geometry. I will loosely follow Artin’s book Geometric Algebra. In particular we have the undefined terms of point, line and the undefined relation of lies on. Then, for a fixed positive integer d, the axioms are:

  1. Given two distinct points there is a unique line that both points lie on
  2. Each line has at least three points which lie on it
  3. Given a line and a point not on that line there exists a unique line lying on the plane containing them that the point lies on and no point of the first line lies on.
  4. All points are spanned by d+1 points and no fewer.

  • Edward Ross
Linear representation of additive groups and the Fourier Transform: Part 1

Linear representation of additive groups and the Fourier Transform: Part 1

In this article I will show that the cyclic group of order n, that is, the set \(\{0,1,2,\ldots,n-1\}\) under addition modulo n, motivates the discrete Fourier transform on a particular finite dimensional complex inner product space, and gives many of its properties. In a subsequent article I will extend this to the general Fourier transform and its relation to the group of integers and real numbers under addition.
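As a preview, a minimal Python sketch of the transform built directly from the characters of the cyclic group, \(j \mapsto e^{-2\pi i jk/n}\) (pure Python; the helper names are illustrative):

```python
import cmath

def dft(x):
    """Discrete Fourier transform built from the characters of Z/nZ:
    the k-th character sends j to exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]

def idft(X):
    """Inverse transform: the conjugate characters, scaled by 1/n,
    using orthogonality of the characters."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * j * k / n)
                for k in range(n)) / n
            for j in range(n)]

x = [1, 2, 3, 4]
roundtrip = idft(dft(x))  # recovers x up to floating point error
```

The inversion formula working at all is exactly the orthogonality of the characters, which is one of the properties the group structure gives for free.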

  • Edward Ross
From polynomials to transcendental numbers

From polynomials to transcendental numbers

In a previous post I discussed finding the zeros of low degree polynomials; I want to extend that discussion to algorithmically finding the zeros of polynomials, more on solving the quintic and a brief discussion of transcendental numbers.

  • Edward Ross
Symmetry, Lie Algebras and Differential Equations Part 1

Symmetry, Lie Algebras and Differential Equations Part 1

There is a deep relationship between being able to solve a differential equation and its symmetries. Much of the theory of second order linear differential equations is really the theory of infinite dimensional linear algebra. In particular Sturm-Liouville theory is the diagonalization of an infinite dimensional Hermitian operator. However there are deeper relationships, as Miller points out in “Lie theory and special functions”; properties of special functions such as Rodrigues’ formulae are related to the Lie algebra and symmetries of the system. Even better, in some cases the solutions can be found almost entirely algebraically. Some examples from physics come from the Simple Harmonic Oscillator, the theory of Angular Momentum and the Kepler Problem (using the Laplace Runge Lenz vector). The rest of this article will be devoted to exploring a special case of these relations: the Quantum Simple Harmonic Oscillator.
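As a reminder of where the algebraic solution is headed, the standard ladder-operator treatment of the quantum simple harmonic oscillator runs as follows:

```latex
% Annihilation operator and the canonical commutation relation
a = \sqrt{\frac{m\omega}{2\hbar}}\left(\hat{x} + \frac{i\hat{p}}{m\omega}\right),
\qquad [a, a^\dagger] = 1
% The Hamiltonian is diagonal in the number operator a^\dagger a,
% so the spectrum follows purely from the commutation relations:
H = \hbar\omega\left(a^\dagger a + \tfrac{1}{2}\right),
\qquad H\,|n\rangle = \hbar\omega\left(n + \tfrac{1}{2}\right)|n\rangle
```

No differential equation is ever solved directly: the eigenvalues come from the Lie algebra generated by \(a\), \(a^\dagger\) and the identity.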

  • Edward Ross
Do you really mean ℝⁿ?

Do you really mean ℝⁿ?

In mathematics and physics it is common to talk about \(\mathbb{R}^n\) when really we mean something else that can be represented by \(\mathbb{R}^n\).

Consider mechanics or geometry: these are often represented as theories in \(\mathbb{R}^n\) , but really they don’t occur in a vector space at all! Look around you: a three-dimensional description of space probably seems reasonable, but where’s the origin? [Perhaps the centre of your eyes could be an origin, but someone else would disagree with you.] Classical mechanics, special relativity and geometry are much better described as an affine space – which is a vector space without an origin.

  • Edward Ross
Tensor notation

Tensor notation

Language affects the way you think, often subconsciously. The easier and more natural something is to express in a language the more likely you are to express it. This is especially true of mathematical thought where the language is very precise.

I know three types of notations for tensors and each seem to be useful in different situations and gives you a different perspective on how tensors “work”. [Technical Note: I will assume all vector spaces are finite dimensional so \(V\) is naturally isomorphic to \(V^{**}\)]

  • Edward Ross
LaTeXing Multiple Equations

LaTeXing Multiple Equations

In mathematics and the (hard) sciences it’s important to be able to write documents with lots of equations, lots of figures and lots of references efficiently. This can be done in, for example, Microsoft Word, but the mathematics and theoretical physics community heavily prefer \(\TeX\) (and in particular \(\LaTeX\) ), so the bottom line is if you want to get papers published you’re going to have to get good at it.

There are a lot of resources for learning \(\LaTeX\) on the web, and a lot of people teach themselves from this (I know I did), but this can get you into some bad habits. For instance eqnarray gets the spacing around the equals signs all wrong. (I typeset my thesis using exclusively eqnarray and didn’t notice this until it was pointed out to me). So a lot of people advocate align from AMSTeX, but align has its limitations too; it only comes with one alignment tab &. If you want to make a comment at the end of multiple equations (like “for \(x \in X\)“) or you want to have two equations and the second one breaks over two lines you can’t line the equations up properly; but there is a solution – IEEEeqnarray (which comes in an external package, IEEEtrantools, available from the IEEE). Stefan Moser has written an excellent paper covering everything I’ve said and much more, showing good ways to typeset equations.
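A minimal sketch of the IEEEeqnarray syntax, assuming the IEEEtrantools package is installed (the equation itself is illustrative): the column specification gives separate right-aligned, centred and left-aligned math columns, and unlike align you can add further columns as needed.

```latex
\usepackage[retainorgcmds]{IEEEtrantools}
% ...
\begin{IEEEeqnarray}{rCl}
  f(x) & = & (x + 1)^2 \nonumber \\
       & = & x^2 + 2x + 1
\end{IEEEeqnarray}
```

The spacing around the \(=\) signs comes out like align's (correct), while the column model stays as flexible as a table.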

  • Edward Ross
Solving polynomials of degree 2,3 and 4

Solving polynomials of degree 2,3 and 4

\[\newcommand\nth{n^{\mathrm{th}}}\]

It is well known in mathematics that it is possible to find the roots of a general quadratic, cubic or quartic in terms of radicals (linear combinations and products of \(\nth\) roots). Another way of saying this is that the equation \(a x^4+b x^3+c x^2 + d x + e = 0\) can be solved for any complex constants \(a\),\(b\),\(c\),\(d\), and \(e\) if one can solve the equation \(x^n-t=0\) for \(n \in \{2,3\}\) (\(1\) being trivial) (\(t\) may be an algebraic combination of solutions of \(x^n-s\) for a variety of \(s\) which are algebraic combinations of \(a\),\(b\),\(c\),\(d\) and \(e\)). This is not true for the quintic.
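For the quadratic the claim is completely concrete: a single square root suffices. A minimal Python sketch (the helper name is illustrative; cmath handles complex coefficients and negative discriminants):

```python
import cmath

def solve_quadratic(a, b, c):
    """Roots of a x^2 + b x + c = 0 in radicals: the only n-th root
    needed is one square root, of the discriminant b^2 - 4ac."""
    d = cmath.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

# x^2 - 3x + 2 = (x - 1)(x - 2), so the roots are 2 and 1
r1, r2 = solve_quadratic(1, -3, 2)
```

The cubic and quartic formulas have the same shape, with the radicand itself built from further square and cube roots of algebraic combinations of the coefficients.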

  • Edward Ross
The point of computer algebra systems

The point of computer algebra systems

I wanted to do the contour integral of \(\frac{1}{z-a}\) around the unit circle on a computer for kicks. So I parameterised it with \(e^{it}\) and entered it into Maxima and Matlab

\[\int_0^{2\pi} \frac{i e^{it}}{e^{it}-a} dt\]

for various values of \(a\) . By Cauchy’s integral formula we know it should be \(2 \pi i\) if \(|a|<1\) and 0 if \(|a|>1\) . Interestingly both programs gave the right answer for \(a=0\) (I suppose the calculation is easy there) but gave totally wrong answers for \(a \neq 0\) (Matlab gave \(0\) for \(|a|<1\) and \(-2 \pi i\) if \(|a|>1\) and Maxima gave \(0\) everywhere).

I have no idea why this happens! In Matlab if you expand the complex exponential it gives the right answer. In Maxima it then has trouble computing the integral – but if you perform operations to make the denominator real and perform some simplifications it gives the right answer.
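A purely numerical check of Cauchy's formula is straightforward, and is a good way to see which symbolic answer to trust (a minimal Python sketch; the step count is arbitrary):

```python
import cmath
import math

def contour_integral(a, n=4000):
    """Approximate the integral of i e^{it} / (e^{it} - a) over
    [0, 2*pi] with the composite trapezoidal rule; the integrand is
    periodic, so this converges extremely fast."""
    h = 2 * math.pi / n
    total = 0
    for k in range(n):
        t = k * h
        z = cmath.exp(1j * t)
        total += 1j * z / (z - a)
    return total * h

inside = contour_integral(0.5)   # |a| < 1: should be close to 2*pi*i
outside = contour_integral(2.0)  # |a| > 1: should be close to 0
```

Running this agrees with Cauchy's formula for both cases, which is how you catch the symbolic engines giving \(0\) or \(-2\pi i\) where they shouldn't.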

Now people sometimes ask Why learn arithmetic when we have calculators? I think this shows exactly why: you need to know when your calculator is giving you garbage.

  • Edward Ross
Local Lie Groups and Hilbert's Fifth Problem

Local Lie Groups and Hilbert's Fifth Problem

Lie Groups are mathematically “very nice” structures – they are analytic manifolds (real or complex) with a group structure such that multiplication and inversion are analytic. They are deeply related to infinitesimal symmetries; a group acting on a space can generally be thought of as a group of symmetries (or automorphisms) of some structure [e.g. rotations (\(SO(n)\)) preserve lengths, Möbius transformations map circles and lines in the complex plane into circles and lines], and the analyticity gives a “local” feel to it – if we know the symmetry locally we can extend it analytically (by the exponential map).

Hilbert’s Fifth Problem essentially asked how restrictive just looking at analytic actions is – what if we looked at continuous actions, how many more groups would we get?

Theorem [Gleason, Montgomery, Zippin] For a locally compact group \(G\) the following are equivalent:

  1. \(G\) is locally Euclidean.
  2. \(G\) has no small subgroups; i.e. there exists a neighbourhood of the identity that contains no non-trivial subgroups of \(G\).
  3. \(G\) is a Lie Group.

  • Edward Ross