## Installing Tidyverse in WSL without Timedatectl Status 1 Issue

When I tried to install tidyverse in WSL2 I ran into issues with timedatectl and xml2. The simple solution is: # Assuming Debian derivatives sudo apt-get install libxml2-dev # Modify TZ to whatever your timezone is TZ="Australia/Sydney" R -e 'install.packages("tidyverse")' What happens When I try to install tidyverse I get this error: > install.packages('tidyverse') ERROR: configuration failed for package ‘xml2’ System has not been booted with systemd as init system (PID 1).

## Automated Refactoring in Python

I am a very recent convert to automatic refactoring tools. I thought they were something for languages like Java that have a lot of boilerplate, and overkill for something like Python. I still liked the concept of refactoring, but I just moved the code around with Vim keymotions or sed. But then I came up against a giant Data Science codebase that was a wall of instructions like this: import pandas as pd import datetime df = pd.

## Writing Pandas Dataframes to S3

To write a Pandas (or Dask) dataframe to Amazon S3 or Google Cloud Storage, all you need to do is pass an S3 or GCS path to a serialisation function, e.g. # df is a pandas dataframe df.to_csv(f's3://{bucket}/{key}') Under the hood Pandas uses fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3 and gcsfs for Google Cloud Storage (and other backends such as (S)FTP, SSH or HDFS).

## Code Optimisation as a Trade Off

I've been writing some ARM Assembly as part of a Raspberry Pi Operating System Tutorial, and writing in Assembly really forces me to think about performance in terms of registers and instructions. When I'm writing Python, trying to write concise code leads to breaking a problem into small functions or methods (and using idioms like list comprehensions). The same drive for concise code in Assembly leads me to reduce the number of instructions and registers used, but even though it feels like it's making things more efficient it may have negligible actual impact.

## Hardware Is Hard

I've been revisiting baremetal Raspberry Pi programming (with Alex Chadwick's Baking Pi tutorial, although there are plenty of others). It really highlights how much I take for granted. I spend a lot of time in languages like Python and R processing data, without much understanding of the interpreters written in C, let alone thinking about the compilers that turn those interpreters and libraries into assembly, or how that assembly executes.

## Fast Pandas DataFrame to Dictionary

Tabular data in Pandas is very flexible, but sometimes you just want a key value store for fast lookups. Because Python is slow, but Pandas and Numpy often have fast C implementations under the hood, the way you do something can have a large impact on its speed. The fastest way I've found to convert a dataframe to a dictionary from the columns keys to the column value is: df.set_index(keys)[value].to_dict() The rest of this article will discuss how I used this to speed up a function by a factor of 20.
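
As a minimal sketch of the idiom above, assuming pandas is installed: the dataframe and the column names `sku` and `price` are made up for illustration and do not come from the original article.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"sku": ["a1", "b2", "c3"], "price": [9.5, 12.0, 3.25]})

# Index by the key column, select the value column, and convert the
# resulting Series into a plain dict mapping key -> value.
price_by_sku = df.set_index("sku")["price"].to_dict()

# Lookups are now ordinary dictionary accesses.
print(price_by_sku["b2"])
```

Once the dictionary is built, each lookup avoids going back through the dataframe machinery, which is where the speed-up for repeated lookups would come from.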

## Chompjs for parsing tricky Javascript Objects

Modern Javascript web frameworks often embed the data used to render each webpage in the HTML. This means an easy way of extracting data is capturing the string representation of the object with a pushdown automaton and then parsing it. Sometimes Python's json.loads won't cut it for dynamic JSON; one option is demjson but another much faster option is chompjs. Chompjs converts a javascript string into something that json.loads can parse. It's a little less strict than demjson; for example {"key": undefined} will be converted by chompjs.

## Aggregating Quantiles with Pandas

One of my favourite tools in Pandas is agg for aggregation (it's a worse version of dplyr's summarise). Unfortunately it can be difficult to work with for custom aggregates, like nth largest value. If your aggregate is parameterised, like quantile, you potentially have to define a function for every parameter you use. A neat trick is to use a class to capture the parameters, making it much easier to try out variations.
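
One way the class trick could look is sketched below, assuming pandas is installed. The class name `Quantile`, the example data and the choice to set `__name__` (so each output column gets a distinct label) are illustrative assumptions, not necessarily how the original article does it.

```python
import pandas as pd


class Quantile:
    """Callable that captures the quantile parameter q for use with agg."""

    def __init__(self, q):
        self.q = q
        # agg labels each output column with the callable's __name__,
        # so give every instance a distinct, readable one.
        self.__name__ = f"quantile_{q:g}"

    def __call__(self, series):
        return series.quantile(self.q)


# Hypothetical example data.
df = pd.DataFrame({"group": ["a", "a", "a", "b", "b"],
                   "value": [1.0, 2.0, 3.0, 10.0, 20.0]})

# Trying another quantile only means adding another instance to the list.
summary = df.groupby("group")["value"].agg([Quantile(0.5), Quantile(0.9)])
print(summary)
```

The design choice here is that the parameter lives on the instance, so there is no need to write a separate named function for each quantile.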

## A Command Line Interface for HTML With parsel-cli

There are many great command line tools for searching and manipulating text (like grep), columnar data (like awk) and JSON data (like jq). With HTML there's parsel-cli, built on top of the wonderful parsel Python library. Parsel is a fantastic library that gives a simple and powerful interface for extracting data from HTML documents using CSS selectors, XPath and regular expressions. Parsel-cli is a very small utility that lets you use parsel from the command line (and can be installed with pip install parsel-cli).

## Pasting text from long ago in Emacs and Vim

I use Vim keybindings in Emacs through Evil Mode and Evil Collection. Often I'll copy something, make some edits, and then want to paste the text. The problem is that the edits changed what was in the register and I want to recover that text. In Vim (and Evil mode) I can type :reg to see what's in the registers, find the value I want to paste, commit the register to memory (for example 8), exit and then paste that register ("8p from normal mode, C-r 8 from insert mode).

## Persistent Dictionaries in Python

Dictionaries in Python (in other languages called maps or hashmaps) are a useful and flexible data structure that can be used to solve lots of problems. Part of their charm is the affordances in the language for them; setting and accessing with square brackets [], deleting with del. But sometimes you want a dictionary that persists across sessions, or can handle more data than you can fit into memory - and there's a solution: persistent dictionaries.
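
As a rough illustration of the idea, here is a sketch using the standard library's shelve module. This excerpt does not say which library the article goes on to use, so treat shelve as an assumption; the temporary path and the keys are made up.

```python
import os
import shelve
import tempfile

# Keep the shelf in a throwaway directory so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "cache")

# First "session": the usual dictionary syntax works, including del.
with shelve.open(path) as db:
    db["answer"] = 42
    db["langs"] = ["python", "r"]
    del db["langs"]

# Second "session": reopen the shelf and the data is still there.
with shelve.open(path) as db:
    restored = dict(db)

print(restored)
```

Values are pickled to disk, so only the entries you touch need to be loaded into memory.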

## Not Using Scrapy for Web Scraping

Scrapy is a fast high-level web crawling and web scraping framework. But as much as I want to like it I find it very constraining and there's a lot of added complexity and magic. If you don't fit the typical use case it feels like a lot more work and learning doing things with scrapy than without. I really like Zyte (formerly ScrapingHub) the team behind Scrapy. They really know what they're talking about with great blogs about QA of Data Crawls, guide to browser tools, how bots are tracked, and Scrapy's documentation has a very useful page on selecting dynamically-loaded content.

## Select, Fetch, Extract, Watch: Web Scraping Architecture

The internet is full of useful information just waiting to be collected, processed and analysed. While getting started scraping data from the web is straightforward, it's easy to tangle the whole process together in a way that makes it fragile to failure, or hard to change with requirements. And the internet is inconsistent and changing; in anything but the smallest scraping projects you're going to run into failures. I find it useful to conceptually break web scraping down into four steps: selecting the data to retrieve, fetching the data from the internet, extracting structured data from the raw responses, and watching the process to make sure it's functioning correctly.

## Taking Screenshots in Firefox

I find taking screenshots in Linux a bit painful. My current way is to use GIMP to create an image from a screenshot, but it's a bit slow to start up and interrupts my flow. I've had trouble installing Shutter which I haven't worked through yet. However I've just found out that Firefox has a way to take screenshots. All you need to do is press Control-Shift-S and then it brings up a selector where you can pick an element, or a region (like an improved version of the Windows Snipping Tool).

Sometimes you have a HTML webpage or email that you want to extract all the links from. There's lots of ways to do this, but there's a simple solution in Python with BeautifulSoup: from bs4 import BeautifulSoup def extract_links(html): soup = BeautifulSoup(html, 'html.parser') return [a.get('href') for a in soup.find_all('a') if a.get('href')] Some other methods would be to use regular expressions (which would be faster than parsing, but a little harder to get right), directly going through a parse tree or using lxml.
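
For comparison, the same extraction can be sketched with only the standard library's html.parser, with no BeautifulSoup dependency. The class and function names below are hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has a non-empty one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links_stdlib(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p><a href="https://example.com">one</a> <a name="x">no link</a> <a href="/two">two</a></p>'
print(extract_links_stdlib(sample))
```

This trades BeautifulSoup's convenience for zero dependencies; for messy real-world HTML the BeautifulSoup version above is likely to be more forgiving.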

## Reading Email in Python with imap-tools

You can use Python to read, process and manage your emails. While most email providers provide autoreplies and filter rules, you can do so much more with Python. You could download all your PDF bills from your electricity provider, you could parse structured data from emails (using e.g. BeautifulSoup), sort or filter by sentiment, or even do your own personal analytics like Stephen Wolfram. The easiest tool I've found for reading emails in Python is imap_tools.

## Energy to Orbit vs Launch into Deep Space

This is from Sanjoy Mahajan's The Art of Insight Problem 1.11 Estimate the energy in a 9-volt battery. Is it enough to launch the battery into orbit? I tried to answer this with the energy density required to launch into deep space. But this is different to going into orbit; how much energy is required to get into low Earth orbit? Low Earth Orbit A low orbit has to be above the height of the atmosphere (otherwise it will require propulsion to overcome atmospheric friction), and so is typically above 300 km.
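
A back-of-the-envelope sketch of the energy per kilogram needed for a circular orbit at 300 km follows. It is not necessarily the route the article takes: it assumes standard values for surface gravity and Earth's radius, and it ignores air resistance and the Earth's rotation.

```python
import math

G_SURFACE = 9.81        # m/s^2, gravitational acceleration at the surface
EARTH_RADIUS = 6.371e6  # m
ALTITUDE = 300e3        # m, the "typically above 300 km" figure from the text

r = EARTH_RADIUS + ALTITUDE

# Circular orbit: gravity supplies the centripetal acceleration,
# so v^2 / r = g * R^2 / r^2.
speed = math.sqrt(G_SURFACE * EARTH_RADIUS**2 / r)

kinetic = 0.5 * speed**2                             # J per kg to reach orbital speed
potential = G_SURFACE * EARTH_RADIUS * ALTITUDE / r  # J per kg to climb to altitude
total = kinetic + potential

print(f"speed ~ {speed:.0f} m/s, energy ~ {total:.2e} J/kg")
```

Under these assumptions most of the energy goes into reaching orbital speed rather than gaining height.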

## Energy Density to Launch into Space

This is from Sanjoy Mahajan's The Art of Insight Problem 1.11 Estimate the energy in a 9-volt battery. Is it enough to launch the battery into orbit? I have already (mis)estimated the energy of a battery, but looked it up as 500 mAh. Energy density required to launch into space To launch into space you have to exchange energy to counteract the change in gravitational energy (at least, you'll need more for air resistance).

## How Much Energy is there in a 9V Battery

This is from Sanjoy Mahajan's The Art of Insight Problem 1.11 Estimate the energy in a 9-volt battery. Is it enough to launch the battery into orbit? We're just going to estimate the first part. Battery Energy A volt is energy per unit charge $$V = \frac{E}{q}$$. To get towards an energy we need an amount of charge; the current in Ampere is the charge per unit time $$I = \frac{q}{t}$$.
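
Combining the two definitions above gives the energy in terms of voltage, current and time. This is a sketch of the step the excerpt appears to be heading towards, using only the symbols already defined: rearranging the second relation gives $$q = I t$$, and substituting into the first gives

$$E = V q = V I t$$

So an estimate of the battery's energy needs its voltage and how much current it can supply for how long.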

## Success in Small Steps

A lot of times I've failed by biting off more than I can chew. I get in over my head and lose motivation. A lot of times when I've succeeded it's been by starting small and slowly building up a roll of successes. When I was in highschool I tried to build a simulation of the solar system for a project. I wasn't satisfied with building ellipses, I wanted to take into account all the N-body interactions.

## Why is Vmmem Using All My Memory?

My Windows laptop was slowing to a crawl; I was waiting seconds to switch windows and even typing took a couple of seconds to respond. I opened the task manager by hitting Ctrl-Shift-Esc and saw that Vmmem was using >95% of my memory. What the heck is Vmmem and how can I stop it using all my memory? Vmmem is the process associated with virtual machines on Windows. I'm using WSL2 and Docker (through WSL2), and so all their memory appears on Vmmem.

## Run Webserver Without Root

You've written your web application or API and you now want to deploy it to a server. You don't want to run it as root, because if someone finds a vulnerability in the server then it will be trivial for them to take over the system. However only root has permission to run applications on ports 80 and 443. There are a few ways to do this, but only a couple that make sense for an interpreted language (like Python, as opposed to a compiled binary).

## Myth of the Hawthorne Effect

The Hawthorne effect comes from a study measuring the effect of lighting changes on worker output in an electrical factory, where any change increased output, even a change back to the original lighting conditions. I've heard this explained as running the experiment caused the employees to be observed more closely which led them to work harder, and used as a rationale for observing employees more. Except the Hawthorne effect is a myth. The economists Steven D.

## Running out of Resources on AWS Athena

AWS Athena is a managed version of Presto, a distributed database. It's very convenient to be able to run SQL queries on large datasets, such as Common Crawl's Index, without having to deal with managing the infrastructure of big data. However the downside of a managed service is that when you hit its limits there's no way of increasing resources. Today I was running some queries for a regular reporting pipeline in Athena when I got a failure with the error Query exhausted resources at this scale factor.

## Building a Job Extraction Pipeline

I've been trying to extract job ads from Common Crawl. However I was stuck for some time on how to actually write transforms for all the different data sources. I've finally come up with an architecture that works: download, extract and normalise. I need a way to extract the job ads from heterogeneous sources that allows me to extract different kinds of data, such as the title, location and salary. I got stuck in code for a long time trying to do all this together and getting a bit confused about how to make changes.

## Insights From Google Analytics for a Small Blog

I started regularly writing this website to get better at writing, to build a portfolio and share my learnings. Because of this I haven't been focussed on building an audience or looking at analytics. However now that I've been writing continuously for 6 months I thought I'd see if I could learn anything interesting from looking at Google Analytics. I installed Google Analytics a couple of weeks ago on the website to see how people are actually viewing my site.

## Importance of Collecting Your Own Training Data

A couple of years ago I built whatcar.xyz which predicts the make and model of Australian cars. It was built mainly with externally sourced data and so only works sometimes, under good conditions. To make it better I've started collecting my own training data. External data sources are extremely convenient for training a model as they can often be obtained much more cheaply than curating your own data. But the data will almost always be different to what you are actually performing inference on, and so you're relying on a certain amount of generalisation.

## Unhappy Path Programming

When programming it's easy to think about the happy path. The path along which you get well-formed valid data, all your requests return successfully and everything works on your target platform. When you're in this mindset it's easy to just check it works in one case and assume everything is alright. But the majority of real work in programming is the unhappy paths. While you always need to be thinking about how things could go wrong, it's much more important in web programming.

## Updating a Python Project: Whatcar

The hardest part of programming isn't learning the language itself, it's getting familiar with the gotchas of the ecosystem. I recently updated my whatcar car classifier in Python after leaving it for a year and hit a few roadblocks along the way. Because I'm familiar with Python I knew enough heuristics to work through them quickly, but it takes experience with running into problems to get there. I thought I had done a good job of making it reproducible by creating a Dockerfile for it.

## Activating Mobile Phone Camera from HTML

Building a web application is great because, if it is well built, it can be accessed across many operating systems. But sometimes you want to access particular aspects of the device; for example take a picture from a mobile camera. It turns out this is easy to do on many systems in HTML. A year ago I built whatcar.xyz which classifies a photo of an Australian car with its make and model.

## Building NLP Datasets from Scratch

There's a common misconception that the best way to build up an NLP dataset is to first define a rigorous annotation schema and then crowdsource the annotations. The problem is that it's actually really hard to guess the right annotation schema up front, and this is often the hardest part on the modelling side (as opposed to the business side). This is explained wonderfully by spaCy's Matthew Honnibal at PyData 2018.

## Orderly Life for Original Work

Be settled in your life and as ordinary as the bourgeois, in order to be fierce and original in your works. Gustave Flaubert, To Gertrude Tennant (December 25, 1876) It's hard to find the energy and focus to be creative when your life is a mess. Before you can be productive you need to sleep well, eat well, exercise well and have good routines and social supports. See here for more on the origin of this quote.

## Experimental Generalisability

Experiments reveal the relationship between inputs and outcomes. With statistical methods you can often, with enough observations, tell whether there's a strong relationship or if it's just noise. However it's much harder to know how generally the relationship holds, but it's essential for making decisions. Suppose you're testing two alternate designs for a website. One has a red and green button with a santa hat and bauble, and the other has a blue button.

## Choosing a Static Site Generator

Static website generators fill a useful niche between handcoding all your HTML and running a server. However there's a plethora of site generators and it's hard to choose between them. I've got a simple recommendation: if you're writing a blog use Jekyll (if you don't want to use something like Wordpress). Static website generators compile input assets into a set of static HTML, CSS and Javascript files that can be deployed almost anywhere.

## Social Flashcards

I'm terrible at remembering names. When someone introduces themself I'm normally a bit anxious and in my own head and don't take in their name. It takes conscious effort to remember their name, let alone the names of their family or facts about them. However remembering things about people is really important for building relationships. If you take an interest in other people's lives they will be more receptive to you.

## Can I? Must I? Should I?

Whenever someone gets an idea in their head they start filtering out evidence that contradicts that idea. This idea is called confirmation bias, people start looking for evidence that confirms their current idea and neglecting evidence that challenges it. There's no way to completely beat a bias, but something that helps me is reframing the question. The first question that comes is normally "Can". Can it be? This leads to looking for evidence that confirms the idea.

## Learning Hugo by Editing Themes

One of the hardest parts of learning something new is motivation. This is why one of the best ways to learn programming is editing code; it's goal driven so motivation is built in. I've successfully used this to start learning how to write Hugo themes. Now that I've got a reasonable collection of posts, over 250, I would like to understand what content people are actually accessing on this website to get an idea of what would be useful.

## Manually Triggering Github Actions

I have been publishing this website using Github Actions with Hugo on push and on a daily schedule. I recently received an error notification via email from Github, and wanted to check whether it was an intermittent error. Unfortunately I couldn't find any way to rerun it manually; I would have to push again or wait. Fortunately there's a way to enable manual reruns with workflow_dispatch. There's a Github blog post on enabling manual triggers with workflow_dispatch.

## R: Keeping Up With Python

About 5 years ago a colleague told me that the days were numbered for R and Python had won. From his perspective he is probably right; in software engineering companies Python has got increasing adoption in programmatic analytics. However R has its own set of unique strengths which make it more appealing for the stats people and has kept up surprisingly well with Python. Python has a wider audience than R, and keeps to its reputation as "not the best language for anything but the second best language for everything".

## Population Density Australia

How dense is the population in Australia? I've looked at the Gridded Population of the World and you can see that the population is concentrated around the few capital cities on the coast. It's hard to visually average something so lumpy, but it's easy to estimate it. I know it's about 10 hours driving from Melbourne to Sydney, and about the same again to Brisbane. Brisbane is about halfway between Melbourne in the south and Cairns in the far north.

## Gridded Population of the World

I've spent the last few hours looking at the Gridded Population of the World, which consistently estimates population density in line with national censuses and population registers. This would have been a massive job to compile and is really interesting to look at. You can immediately see a strip through the north of India, Pakistan and Bangladesh that is incredibly dense. The north-east of China and the island of Java in Indonesia are also very dense.

## Implicit Bias

I like to think of myself as an egalitarian, but I know I have implicit bias. I've done some tests on Project Implicit and have roughly the implicit biases you would expect for my demographic. This makes me feel a bit sad, but you can't really control your implicit biases; they're a function of your environment and perception growing up. The key question is: given that we have implicit biases, how do we act against them?

## Finding Files Installed in Ubuntu and Debian

My bashrc file sources the git prompt helper to show the branch I'm on in the prompt. Unfortunately it's quite old and was pointing to the wrong file; how do I find where it is? dpkg -L git | grep prompt On Debian and its derivatives such as Ubuntu you can use apt to manage packages (e.g. apt upgrade, apt install). However apt is just a thin layer over dpkg that does useful things like resolving dependencies and downloading files.

## The Fifth Risk

Michael Lewis' The Fifth Risk promotes parts of the US public service and some people who work in it. The public service is culturally opposed to, if not legally prevented from, promoting itself, which means a lot of the successes and heroes go unsung. Michael Lewis spells out what some of the largest, yet most obscure, parts of the US government accomplish and how they could be at risk through mismanagement by the Trump administration.

## Estimating Weight with Body Mass Index

When estimating things it's good to find approximate constants, typically ratios, that are easier to remember than things that vary. The Body Mass Index (BMI) is an example for measuring people. It's relatively easy to measure human height, as a human. For example I'm about 180cm tall; the top of my nose is about 170cm, the bottom of my chin is about 160cm and the bottom of my neck is about 150cm.

## Diagrams in Hugo with Mermaid

Being able to write simple diagrams with text is very convenient. We can do this in Hugo by rendering with mermaid.js. In particular I want to render some factor tree diagrams of the style of The Art of Insight. Like this one: The final result looks like: graph LR; A[sheets ream⁻¹ 500] -->|-1| B[thickness 10⁻² cm]; C[thickness ream⁻¹ 5cm] --> B; B --> D[volume 1 cm³]; E[height 6cm] --> D; F[width 15cm] --> D Implementation I copied the Mermaid Hugo shortcode from the learn theme and put it in layouts/shortcodes/mermaid.

## How much Money is in a Suitcase?

This is from Sanjoy Mahajan's The Art of Insight Problem 1.3 In the movies, and perhaps in reality, cocaine and elections are bought with a suitcase of $100 bills. Estimate the dollar value in such a suitcase. Size of $100 note Let's assume a banknote is about the same thickness as paper; Australian notes are probably a little bit thicker. A 500 page ream of paper is about 5cm tall, so each sheet is about 0.

## Programming Languages to Learn in 2020

A language that doesn't affect the way you think about programming, is not worth knowing. Alan Perlis I spend a lot of time programming in Python and SQL, some time in Bash and R (or at least tidyverse), and a little in Java and Javascript/HTML/CSS. This set of tools is actually pretty versatile about getting things done, but is fairly narrow from a programming concept perspective. Once in a while I think it's useful to broaden the programming frame to understand different ways of doing things; even if you still stick to the same few languages.

## How Much Does a Box of Books Weigh?

This is from Sanjoy Mahajan's The Art of Insight Problem 1.1 How heavy is a small moving-box filled with books? Guesstimating weight I've moved small boxes of books a few times; they're light enough for me to carry. A box is much heavier than a couple of 2kg bags of onions, but probably more similar to a 20kg bag of pool salt. I'd guess it's in the range 10-20kg, so I'd guess around 15kg.

## Some Ideas for Recurring Articles

Radio shows, comedy sketch shows and talk shows have the difficult task of filling air time with less structured content. A technique used in all of these mediums to help fill the gaps is a recurring segment. The Saturday Night Live Weekend Update is an example of this. Using a structured recurring segment with a familiar pattern and style gives a structured environment to be creative in. It's really hard to be creative in a completely unstructured and original way, like Monty Python was, since there are so many options.

## Diffing in SQL

One way of refactoring legacy code is to use diff tests; checking what changes when you change the code. While it can be easy to diff files, it's a little less obvious how to do this with SQL pipelines. Fortunately there are a few different techniques to do this. For exact matching you can use union all to find the number of rows that don't occur in both datasets. For approximate matching you can use a join to check whether the differences are within some bounds.
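
The exact-matching approach can be sketched with an in-memory SQLite database, as below. The table names, columns and data are hypothetical, and the query is one possible reading of the union all technique rather than the article's own SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_output (id INTEGER, total REAL);
    CREATE TABLE new_output (id INTEGER, total REAL);
    INSERT INTO old_output VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO new_output VALUES (1, 10.0), (2, 25.0), (3, 30.0);
""")

# Stack both tables with a flag per source, then keep only the rows
# whose counts differ between the two sources.
diff_query = """
    SELECT id, total, SUM(in_old) AS n_old, SUM(in_new) AS n_new
    FROM (
        SELECT id, total, 1 AS in_old, 0 AS in_new FROM old_output
        UNION ALL
        SELECT id, total, 0 AS in_old, 1 AS in_new FROM new_output
    )
    GROUP BY id, total
    HAVING SUM(in_old) <> SUM(in_new)
    ORDER BY id, total
"""
diff_rows = conn.execute(diff_query).fetchall()
print(diff_rows)
```

An empty result means the two outputs match exactly, including duplicate counts; any returned row shows which side it came from.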

## Diff Tests

When making changes to code, tests are a great way to make sure you haven't inadvertently introduced regressions. This means that you can make changes much faster with more confidence, knowing that your tests will catch many careless mistakes. But what do you do when you're working with a legacy codebase that doesn't have any tests? One method is creating diff tests; testing how your changes impact the output. For batch model training or ETL pipelines there's typically a natural way to do this.

## Dataflow Chasing

When making changes to a new model training pipeline I find it really useful to understand the dataflow. Analytics workflows are done as a series of transformations, taking some inputs and producing some outputs (or in the case of mutation; an input is also an output). Seeing this dataflow helps give a big picture overview of what is happening and makes it easier to understand the impact of changes. Generally you can view the process as a directed and (hopefully) acyclic graph.

## Comment to Function

A lot of analytics code I've read is a very long procedural chain. These can be hard to follow because the only way to really know what's going on in any point is to insert a probe to inspect the inputs and outputs at that stage. Breaking these into functions is a really useful way of making the code easier to understand, change and find bugs in. In Martin Fowler's Refactoring he mentions that whenever there's a block of code that has (or requires) a comment to describe what it does, that's a good opportunity to package that code into a function.
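
A small sketch of what that refactoring might look like is below; the data, the comment and the function name are all made up for illustration.

```python
# Before: a commented block buried inside a long procedural script.
#   # remove rows with missing or negative prices
#   rows = [r for r in rows if r["price"] is not None and r["price"] >= 0]


# After: the comment becomes the function name and docstring.
def drop_invalid_prices(rows):
    """Remove rows with missing or negative prices."""
    return [r for r in rows if r["price"] is not None and r["price"] >= 0]


rows = [{"price": 5.0}, {"price": None}, {"price": -1.0}]
clean_rows = drop_invalid_prices(rows)
print(clean_rows)
```

The function can now be tested on its own inputs and outputs, without inserting a probe into the middle of the chain.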

## Tidy Time

I love having a clean desk and empty inbox. But I hate spending the time cleaning my desk and processing emails. It feels like wasted time where I could do something better. However having "tidy time" to maintain things is important. A while ago I read David Allen's Getting Things Done. When I tried to implement it I got stuck on the notion of a weekly review. Setting aside some time every week to see how you're progressing on tasks and to process any new tasks.

## From Multiprocessing to Concurrent Futures in Python

Waiting for independent I/O can be a performance bottleneck. This can be things like downloading files, making API calls or running SQL queries. I've already talked about how to speed this up with multiprocessing. However it's easy to move to the more recent concurrent.futures library which allows running on threads as well as processes, and allows handling more complicated asynchronous flows. From the previous post suppose we have this multiprocessing code:
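
The original listing is not reproduced in this excerpt. As a hypothetical stand-in, the sketch below shows the general shape of running independent I/O-bound calls with concurrent.futures; the `fetch` function and its simulated delay are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(item):
    """Stand-in for an I/O-bound call such as a download or SQL query."""
    time.sleep(0.1)  # simulate waiting on the network
    return item * 2


items = list(range(8))

# Threads suit I/O-bound work; ProcessPoolExecutor offers the same
# interface when process-based parallelism is needed instead.
with ThreadPoolExecutor(max_workers=8) as executor:
    # map returns results in the same order as the inputs.
    results = list(executor.map(fetch, items))

print(results)
```

Because the calls mostly wait, the eight of them overlap rather than running one after another.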

## Approximate Percentiles in Presto and Athena

Calculating percentiles and quantiles is a common operation in analytics. While they can be done in vanilla SQL with window functions and row counting, it's a bit of work and can be slow and in the worst case can hit database memory or execution time limits. Presto (and Amazon's hosted version Athena) provide an approx_percentile function that can calculate percentiles approximately on massive datasets efficiently. When running this I found that it was non-deterministic.

You've done an analysis and generated an output file in a Jupyter notebook. How do you get it down to your computer? For a local server you could find it in your filesystem, or for a remote server copy it with something like scp. But there are easier ways. You can download individual files from the file navigator (which you can get to by clicking on the Jupyter icon in the top left corner).

## Git Stash Changesets

Pretty frequently I start writing some code, when I realise there's another change I need to make before I can continue. I like to make lots of small atomic changes to a code base because it lets me test more quickly and catch errors earlier. I used to do this by saving my changes in a temporary file, but this was clunky. A better way is with git stash. But git stash reverts all files; and very often I want to keep some, especially configuration parameters.

## Solving Solved Problems

A good technique for deeply understanding something is to try to solve it yourself first. Sometimes this can even lead to better methods or new discoveries. I heard an interesting technique from Jeremy Howard in one of the fast.ai courses about how to read a paper. First read the abstract and introduction. Then spend a couple of days trying to implement what you think they're talking about. Then go back and read the rest of the paper and see how it compares to what you did.

## Contact Tracing in Fighting Epidemics

The state government of Victoria, Australia has recently announced a plan on how to respond to the current Covid-19 pandemic. Based on epidemiological modelling they have set to reduce restrictions based on 14 day averages of new case numbers. If the 14 day average daily new cases are 30-50 in 3 weeks they will reduce restrictions; if they are below 5 a month after that they will reduce restrictions again.

## Modelling the Spread of Infectious Disease

Understanding the spread of infectious disease is very important for policies around public health. Whether it's the seasonal flu, HIV or a novel pandemic the health implications of infectious diseases can be huge. A change in decision can mean saving thousands of lives and relieving massive suffering and related economic productivity losses. The SIR model is a model that is simple, but captures the underlying dynamics of how quickly infectious diseases spread.
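
As a rough sketch of the model, the standard SIR equations can be stepped forward with a simple Euler scheme. The parameter values below are illustrative only and not fitted to any real disease, and the function name is hypothetical.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with a simple forward Euler step.

    beta: transmission rate per day, gamma: recovery rate per day.
    s0, i0, r0: initial susceptible, infected and recovered fractions.
    """
    s, i, r = s0, i0, r0
    history = [(0.0, s, i, r)]
    steps = round(days / dt)
    for step in range(1, steps + 1):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history


# Illustrative parameters (basic reproduction number beta / gamma = 2.5).
history = simulate_sir(beta=0.5, gamma=0.2, s0=0.999, i0=0.001, r0=0.0, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"peak infected fraction {peak_infected:.2f} on day {peak_time:.0f}")
```

Changing beta (for example, to represent fewer contacts) shifts both the height and the timing of the peak, which is why the model is useful for comparing policies.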

## Time Budgeting

It's worthwhile spending some time thinking about how you spend your time. Time and energy are among your most valuable resources. A regular investment of time can build into substantial assets, but if you don't budget time it's easily misspent. I don't believe that you should allocate away all of your time, but setting some time constraints is important. If you don't put the big rocks of things that are important to you in the jar first, all the sand and water of mundane things will fill it up.

## Fixing suddenly unable to connect to X server in WSL2

Today when I tried to connect to VcXsrv after running it with XLaunch it didn't work. I'd had it working for months and so was surprised it suddenly stopped working. The reason was simple; the IP subnet WSL2 had changed and so it was now being blocked by a firewall. Annoyingly there is very little feedback as to why it can't connect to an XServer. I went back through my previous instructions of setting up an X server in WSL2, but noticed something.

## Exceed Expectations

Today I saw a picture in someone's window: "Always Exceed Everyone's Expectations". My initial reaction was that this was a quick way to burn out - trying to always exceed expectations sounds like running on a treadmill that gets faster and faster. But another way to look at it is to set lower expectations and only commit when you can confidently deliver. Another expression for this is "underpromise and overdeliver". Consistently delivering what you promise to customers is the way to build trust and loyalty.

## South Sea Bubble

I've been surprised to learn that financial bubbles and collapses are actually hundreds of years old. I learned this reading the book Devil Takes the Hindmost: A History of Financial Speculation by Edward Chancellor. The chapters on the South Sea Bubble and the following craze over investing in South America sound thoroughly modern; except they happened 200-300 years ago. In fact Isaac Newton lost £20,000 by investing in the bubble. Chancellor describes futures, options, and margin loans - things I had wrongly assumed were more modern inventions.

## Embeddings for categories

Categorical objects with a large number of categories are quite problematic for modelling. While many models can work with them it's really hard to learn parameters across many categories without doing a lot of work to get extra features. If you've got a related dataset containing these categories you may be able to meaningfully embed them in a low dimensional vector space which many models can handle. Categorical objects occur all the time in business settings; products, customers, groupings and of course words.

## From Descriptive to Predictive Analytics

The starting point for an analysis is often summary statistics, such as the mean or the median. For some of these you're going to want it more precisely, more timely or cut by thinner segments. When the data gets too volatile to report on it's a good time to reframe the descriptive statistics as a predictive problem. Businesses often have a lot of reporting around important metrics cut by key segments.

## Teaching Programming by Editing Code

I've had a few discussions with people, especially analysts, about how to learn programming. Generally I encourage them to find a project they want to accomplish and try to learn programming on the way. However I really struggle to find resources to recommend because they tend to spend a lot of time teaching programming concepts from scratch. I wonder if a better way to teach these things would be to start with code that's close to what they want to accomplish, and get them to edit it.

## Interpretable models with Cynthia Rudin

A while ago I came across Cynthia Rudin through her work on the FICO Explainable Machine Learning Challenge. Her team got an honourable mention and she wrote an opinion with Joanna Radin on explainable models. I think the article was hyperbolic in claiming interpretable models always work as well as black box models. On the other hand I only came across her because of this article, so taking an extreme viewpoint in the media is a good way to get attention.

## Topic Modelling to Bootstrap a Classifier

Sometimes you want to classify documents, but you don't have an existing classification. Building a classification that is mutually exclusive and completely exhaustive is actually very hard. Topic modelling is a great way to quickly get started with a basic classification. Creating a classification may sound easy until you try to do it. Think about novels; is a Sherlock Holmes novel a mystery novel or a crime novel (or both)? Or do we go more granular and call it a detective novel, or even more specifically a whodunit?

## Rough Coarse Geocoding

A coarse geocoder takes a human description of a large area like a city, area or country and returns the details of that location. I've been looking into the source of the excellent Placeholder (a component of the Pelias geocoder) to understand how this works. The overall approach is straightforward, but it takes a lot of work to get it to be reliable. A key component of a geocoder is a gazetteer that contains the names of locations.

## Legality of Publishing Web Crawls

As a data analyst I rely on open code and open data to inform decisions. There's a lot of data available on the web which would be great to transform and make openly available to the community. However it's not my data to give, and I'm concerned whether it would violate copyright. An interesting aspect is there are companies that scrape data from all over the web to use for analysis.

## Python HTML Parser

A lot of information is embedded in HTML pages, which contain both human text and markup. If you ever want to extract this information, don't use regex; use a parser. Python has an inbuilt library, html.parser, to do just that. The excellent html2text library uses it to parse HTML into markdown, which you can use for removing formatting. However for your own purposes you can use a similar approach to build a custom parser by subclassing HTMLParser.
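
To make the subclassing approach concrete, here is a minimal sketch that collects links and their text. The class name `LinkExtractor` and the sample HTML are made up for illustration; only `HTMLParser` and its `handle_*` hooks come from the standard library.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


parser = LinkExtractor()
parser.feed('<p>See the <a href="https://example.com">example <b>site</b></a>.</p>')
parser.close()
links = parser.links
```

The parser calls the hooks in document order, so nested inline tags such as `<b>` inside the link still contribute their text through `handle_data`.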

## Refining Location with Placeholder

Placeholder is a great library for Coarse Geocoding, and I'm using it for finding locations in Australia. In my application I want to get the location to a similar level of granularity; however the input may be for a higher level of granularity. Placeholder doesn't directly provide a method to do this, but you can use their SQLite database to do it. For example to find the largest locality for East Gippsland, with Who's On First id 102049039, you can use the SQL.

A monad in languages like Haskell is used as a particular way to extend the domain of a function beyond where it was originally defined. You can think of them as a generalised form of function composition; they are a way of taking one type of function and getting another function. A very useful case is the maybe monad used for dealing with missing data. Suppose you've got some useful function that parses a date: parse_date('2020-08-22') == datetime(2020,8,22).
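
A minimal sketch of the idea in Python, using `None` for missing data. The helper name `maybe` is hypothetical, and the date format inside `parse_date` is an assumption based on the example above. It only shows the lifting step (extending a function to accept missing values), not a full monad.

```python
from datetime import datetime
from typing import Callable, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def maybe(f: Callable[[A], B]) -> Callable[[Optional[A]], Optional[B]]:
    """Lift f so that a missing input (None) passes straight through."""
    def lifted(x: Optional[A]) -> Optional[B]:
        return None if x is None else f(x)
    return lifted


def parse_date(text: str) -> datetime:
    # Assumed format, matching the '2020-08-22' example
    return datetime.strptime(text, "%Y-%m-%d")


# The lifted function has a larger domain: strings or None
maybe_parse_date = maybe(parse_date)
```

Calling `maybe_parse_date(None)` returns `None` instead of raising, so a chain of lifted functions can be applied to data with gaps without checking for missing values at every step.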

## Dip Statistic for Multimodality

If you've got a distribution you may want a way to tell if it has multiple components. For example a sample of heights may have a couple of peaks for different genders, or other attributes. While you could determine this through explicitly modelling them as a mixture the results are sensitive to your choice of model. Another approach is statistical tests for multimodality. One common test is Silverman's Test which checks for the number of modes in a kernel density estimate; the trick is choosing the right width.

## Priorities mean saying No

There are always more things you can be doing than the time you have. If you try to do everything you will end up accomplishing nothing. It can be hard to say "no" to someone, but it's important in order to focus on your priorities. I find it useful to have priorities and goals to set a direction. Often the actual choice of goals doesn't matter as much as that I make them, and regularly review them.

## Create User Sessions with SQL

Sometimes you may want to experiment with sessions and need to hand-roll your own in SQL. There's a good Mode blog post on how to do this. If you're using Postgres or Greenplum you may be able to use Apache Madlib's Sessionize for the basic case. This blog post will give a very brief summary of how to do this with some examples in Presto/Athena. The idea of a session is to capture a continuous unit of user activity.

## Removing Timezone in Athena

When creating a table in Athena I got the error: Invalid column type for column: Unsupported Hive type: timestamp with time zone. Unfortunately it can't support timestamps with timezone. In my case all the data was in UTC so I just needed to remove the timezone to create the table. The easiest way to do that was to cast it to a timestamp (without a timezone). cast(event_time as timestamp)

## Python is not a Functional Programming Language

Python is a very versatile multiparadigm language with a great ecosystem of libraries. However it is not a functional programming language, as I know some people have described it. While you can write it in a functional style it goes against common practice, and has some practical issues. There is no fundamental definition of a functional programming language but two core concepts are that data are immutable and the existence of higher order functions.

## Differentiation is Linear Approximations

Differentiation is the process of creating a local linear approximation of a function. This is useful because linear functions are very well understood and efficient to work with. One application of them is gradient descent, often used for fitting models in machine learning. In this context a function is something that maps between coordinate spaces. For example consider an image classifier that takes a 128x128 pixel image with three channels for colours (Red, Green, Blue) and returns a probability that the image contains a cat and the probability that the image contains a dog.

## Classifying Finite Groups

Groups can be thought of as a mathematical realisation of symmetry. For example the symmetric groups are all possible permutations of n elements. Or the dihedral groups are the symmetries of a regular polygon. A question mathematicians ask is what kinds of groups are there? One way to tackle this is to try to decompose them. One way of doing this is a decomposition series of normal subgroups. $1 = H_0\triangleleft H_1\triangleleft \cdots \triangleleft H_n = G$

## Complex Analysis

Imaginary numbers sound like a very impractical thing; surely we should only be interested in real numbers. However imaginary numbers are very convenient for understanding phenomena with real numbers, and are useful models for periodic phases like in electrical engineering and quantum mechanics. The techniques are also often useful for evaluating integrals, solving two-dimensional electrostatics and decomposing periodic signals. Most of mathematical analysis, topology and measure theory is about inapplicable abstruse examples.

## Data Tests with SQL

A challenge of data analytics is that the data can change as well as the code. The systems producing and collecting data are often changed and can lead to missing or corrupt data. These can easily corrupt reports and machine learning systems. Worst of all the data may be lost permanently. So if you're going to use some data it's important to check the data regularly to catch the worst kind of mistakes as early as possible.

## Sessionisation Experiments

You don't need a lot of data to prove a point. People often think statistics requires big expensive datasets that cost a lot to acquire. However in relatively unexplored spaces a small amount of data can have high yield in changing a decision. I've been working on some problems around web sessionisation. The underlying model is that when someone visits your website they may come at different times for different reasons.

## Test Driven Salary Extraction

Even when there's a specific field for a price there's a surprising number of ways people write it. This is what the tool price-parser solves. Unfortunately it doesn't work too well on salaries, which tend to be ranges and much higher, but the approach works. Price parser has a very large set of tests covering different ways people write prices. The solution is a simple process involving a basic regular expression, but it solves all these different cases.

## Finding Australian Locations with Placeholder

People write locations in many different ways. This makes them really hard to analyse, so we need a way to normalise them. I've already discussed how Placeholder is useful for coarse geocoding. Now I'm trying to apply it to normalising locations from Australian Job Ads in Common Crawl. The best practices when using Placeholder are: Go from the most specific location information (e.g. street address) to the most general (e.

## Converting HTML to Text

I've been thinking about how to convert HTML to Text for NLP. We want to at least extract the text, but if we can preserve some of the formatting it can make it easier to extract information down the line. Unfortunately it's a little tricky to get the segmentation right. The standard answers on Stack Overflow are to use Beautiful Soup's getText method. Unfortunately this just turns every tag into the argument, whether it is block level or inline.

## From Bernoulli to Binomial Distributions

Suppose that you flip a fair coin 10 times, how many heads will you get? You'd think it was close to 5, but it might be a bit higher or lower. If you got 7 heads would you reconsider your assumption the coin is fair? What if you got 70 heads out of 100 flips? This might seem a bit abstract, but the inverse problem is often very important. Given that 7 out of 10 people convert on a new call to action, can we say it's more successful than the existing one that converts at 50%?
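
The questions above can be checked directly from the binomial distribution. This is a small sketch using only the standard library; the helper names `binomial_pmf` and `prob_at_least` are made up for the example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))


# How surprising are these results if the coin really is fair?
p_7_of_10 = prob_at_least(7, 10, 0.5)
p_70_of_100 = prob_at_least(70, 100, 0.5)
```

With a fair coin, 7 or more heads in 10 flips happens with probability 176/1024, or about 17% of the time, so it is weak evidence of bias. The same 70% rate over 100 flips has a probability well under one in a thousand.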

## Minhash Sets

We've found pairs of near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition using Minhash. But many pairs will be part of the same group; in an extreme case there could be a group of 5 job ads with identical texts which produces 10 pairs. Both for interpretability and usability it makes sense to extract these groups from the pairs. Extracting the Groups Directly with Union Find Each band of the LSH consists of buckets of items that may be similar; you could view the buckets as a partition of the corpus of all documents.
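
As a sketch of how groups can be recovered from pairs, here is a small union-find over an arbitrary list of pairs. The function name `group_pairs` is hypothetical, and it works on plain pairs rather than on LSH buckets.

```python
from itertools import combinations


def group_pairs(pairs):
    """Merge pairs of similar items into groups using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        # Walk up to the root of x's tree
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point everything on the path at the root
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())


# Five identical ads produce 10 pairs, which collapse back to one group
five_identical = list(combinations(["ad1", "ad2", "ad3", "ad4", "ad5"], 2))
groups = group_pairs(five_identical)
```

Note that this treats similarity as transitive: if A is paired with B and B with C, all three land in the same group even if A and C were never paired directly.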

## Searching for Near Duplicates with Minhash

I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. In the last article I built a collection of MinHashes of the 400,000 job ads in half an hour in a 200MB file. Now I need to efficiently search through these minhashes to find the near duplicates because brute force search through them would take a couple of days on my laptop. MinHash was designed to approach this problem as outlined in the original paper.

## Considering VS Code from Emacs

I've been using Emacs as my primary editor for around 5 years now (after 4 years of Vim). I'm very comfortable in it, having spent a long time configuring my init.el. But once in a while I'm slowed down by some strange issue, so I'm going to put aside my sunk configuration costs and have a look at using VS Code. On Emacs I recently read a LWN article on Making Emacs Popular Again (and the corresponding HN thread).

## Estimating Bias in a Coin with Bayes Rule

I wanted to work through an example of applying Bayes rule to update model parameters based on toy data. This example comes from Kruschke’s Doing Bayesian Data Analysis, Section 5.3. The model is that we have a coin and we’re trying to estimate the bias in the coin, that is the probability that it will come up heads when flipped. For simplicity we assume the bias, theta is a multiple of 0.

• Edward Ross
## Detecting Near Duplicates with Minhash

I'm trying to find near-duplicate texts in the Adzuna Job Salary Predictions Kaggle Competition. I've found that the Jaccard index on n-grams is effective for finding these. Unfortunately it would take about 8 days to calculate the Jaccard index on all pairs of the 400,000 ads, and take about 640GB of memory to store it. While this is tractable we can find almost all pairs with a significant overlap in half an hour in-memory using MinHash.

## Lessons from a mathematician on building a community

Mathematicians and software developers have a lot in common. They both build structures of ideas, typically working in small groups or alone, but leveraging structures built by others. For software developers the ideas are concrete code implementations, and the building blocks are subroutines, and are published as "libraries" or "packages". For mathematicians the ideas are abstract, built on definitions and theorems and published in papers, conferences and informal conversations. To grow a substantial body of work in either mathematics or software requires a community to contribute to it.

## Clustering for Segmentation

Dealing with thousands of different items is difficult. When you've got a couple of dozen you can view them together, but as you get into the hundreds, thousands and beyond it becomes necessary to group items to make sense of them. For example if you've got a list of customers you might group them by state, or by annual spend. But sometimes it would be useful to split them into a few groups using some heuristic criteria; clustering is a powerful technique to do this.

## Representing Decision Trees on a grid

A decision tree is a series of conditional rules leading to an outcome. When stated as a chain of if-then-else rules it can be really hard to understand what is going on. If the number of dimensions and cutpoints is relatively small it can be useful to visualise on a grid to understand the tree. Decision trees are often represented as a hierarchy of splits. Here's an example of a classification tree on Titanic survivors.

## Writing 50 Daily Articles

I've been writing an article a day for 50 days now. I started this to help build a portfolio, keep track of useful learnings and to become better at writing. This post reflects on the progress so far. Inspiration While there are many sources of inspiration for my writing, Sacha Chua's No Excuses Guide to Blogging is the biggest one. I bought the book around 2 years ago but I've found it useful and kept coming back to it.

## Four Competencies of an Effective Analyst

Analysts tend to be natural problem solvers, good at reasoning and adept with numbers. But to know how to frame the problem and what to look for they need to understand the context. To solve the problems they have to collect the right data and perform any necessary calculations. To have impact they need to be able to understand what's valuable, communicate their insights and influence decisions. These make up the four competencies of an effective analyst.

## 4am Rule for timeseries

When you've got a timeseries that doesn't have a timezone attached to it the natural question is "what timezone is this data from?" Sometimes it's UTC, sometimes it's the timezone of the server, otherwise it could be the timezone of one of the locations it's about (and it may or may not change with daylight savings). When it's people's web activity there's a simple heuristic to check this: the activity will be minimum between 3am and 5am.

A very useful open dataset the Australian Government provides is the Geocoded National Address File (G-NAF). This is a database mapping addresses to locations. This is really useful for applications that want to provide information or services based on someone's location. For instance you could build a custom store finder, get aggregate details of your customers, or locate business entities with an address, for example ATMs. There's another open and editable dataset of geographic entities, Open Street Map (and it has a pretty good open source Android app OsmAnd).

## Pipetable to CSV

Sometimes I get out pipe tables in Emacs that I want to convert into a CSV to put somewhere else. This is really easy with regular expressions. I often get data output from an SQL query like this: text | num | value --------------+------+------------- Some text | 0.3 | 0.2 Rah rah | 7 | 0.00123 (2 rows) Running sed 's/\(^ *\| *|\|(.*\) */,/g' gives: ,text,num,value --------------+------+------------- ,Some text,0.3,0.2 ,Rah rah,7,0.00123, I can delete the divider and then use as a CSV.

## Binning data in SQL

Generally when combining datasets you want to join them on some key. But sometimes you really want a range lookup like Excel's VLOOKUP. A common example is binning values; you want to group values into custom ranges. While you could do this with a giant CASE statement, it's much more flexible to specify in a separate table (for regular intervals you can do it with some integer division gymnastics). It is possible to implement VLOOKUP in SQL by using window functions to select the right rows.

## A Mixture of Bernoullis is Bernoulli

Suppose you are analysing email conversion through rates. People either follow the call to action or they don't, so it's a Bernoulli Distribution with probability the actual probability a random person will click through the email. But in actuality your email list will be made up of different groups; for example people who have just signed up to the list may be more likely to click through than people who have been on it for a long time.
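
A small sketch of why the mixture is still Bernoulli: picking a random person from the list and asking whether they click is a single yes/no outcome whose probability is the weighted average of the group rates. The segment sizes and click rates below are invented for illustration.

```python
import random


def mixture_click_probability(segments):
    """segments is a list of (weight, click probability) pairs."""
    total_weight = sum(weight for weight, _ in segments)
    return sum(weight * p for weight, p in segments) / total_weight


def simulate_click(rng, segments):
    """Pick a segment by weight, then draw a click from that segment."""
    weights = [weight for weight, _ in segments]
    _, p = rng.choices(segments, weights=weights)[0]
    return rng.random() < p


# Hypothetical list: 20% new subscribers clicking at 10%, 80% older at 2%
segments = [(0.2, 0.10), (0.8, 0.02)]
overall = mixture_click_probability(segments)

rng = random.Random(0)
draws = 100_000
simulated = sum(simulate_click(rng, segments) for _ in range(draws)) / draws
```

The simulated rate should sit close to `overall`; from the clicks alone you cannot tell the mixture apart from a single group with that rate.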

## Probability Squares

A geometric way to represent combining two independent discrete random variables is as a probability square. On each side of the square we have the distributions of the random variables, where the length of each segment is proportional to the probability. In the centre we have the function evaluated on the two edges and the probability is proportional to the area of the rectangle. For example suppose we had a random process that generated 1, 2 or 3 with equal probability (for example half the value of a die, rounded up).

## Representing Interaction Networks

Behavioural data can illuminate the structure of the underlying actors. For example looking at which products customers buy can help understand how both the products and customers interact. The same idea can apply to people who attend events, watch the same movie, or have authored a scientific paper together. There are a few ways to represent these kinds of interactions which gives a large toolbox of ways to approach the problem.

## Excel Binning

Putting numeric data into bins is a useful technique for summarising, especially for continuous data. This is what underlies histograms, which are bar charts of frequency counts in each bin. There are two main ways of doing this in Excel: with groups and with vlookup (you can also do this in SQL). If you want equal length bins in a Pivot Table the easiest way is with groups. Right click on the column you want to bin and select Group

## Powershell Debugging with Write-Warning

I had to debug some Powershell, without knowing anything about it. I found Write-Warning was the right tool for printline debugging. This was enough to resolve my issue. I first tried Write-Output but apparently it doesn't work inside a function which I found misleading for a while (at first I thought that it wasn't getting to the function). Write-Warning worked straight away and I could see in bright yellow what was going on.

## Analysis Needs to Change A Decision

Any analysis where the results won't change a decision is worthless. Before even thinking of getting any data it's worth being clear on how it impacts the decision. There are lots of reasons people want an analysis. Sometimes it's to confirm what they already believe (and they'll discount anything that tells them otherwise). Sometimes it's to prove to others something they believe; possibly to inform a decision someone else is making. But it's most valuable when it affects a decision they can make with an outcome they care about.

## SQL Views for hiding business logic

The longer I work with a database the more I learn the dark corners of the dataset. Make sure you exclude the rows created by the test accounts listed in another table. Don't use the create_date field, use the real_create_date_v2 instead, unless it's not there, then just use create_date. Make sure you only get data from the latest snapshot for the key. Very quickly I end up with complex spaghetti SQL, which either contains monstrous subqueries or a chain of CREATE TEMPORARY TABLE.

## Near Duplicates with TF-IDF and Jaccard

I've looked at finding near duplicate job ads using the Jaccard index on n-grams. I wanted to see whether using the TF-IDF to weight the ads would result in a clearer separation. It works, but the results aren't much better, and there are some complications in using it in practice. When trying to find similar ads with the Jaccard index we looked at the proportion of n-grams they have in common relative to all the n-grams between them.

## Near Duplicates with Jaccard

Finding near-duplicate texts is a hard problem, but the Jaccard index for n-grams is an effective measure that's efficient on small sets. I've tried it on the Adzuna Job Salary Predictions Kaggle Competition with good success. This works pretty well at finding near-duplicates and even ads from the same company; although by itself it can't detect duplicates. I've looked before at using the edit distance which looks for the minimum number of changes to transform one text to another, but it's slow to calculate.
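
For reference, a minimal sketch of the Jaccard index on word n-grams. Whitespace tokenisation, the choice of n = 3 and the two sample ads are assumptions for the example.

```python
def ngrams(tokens, n):
    """Set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


ad_1 = "we are hiring a senior data analyst in sydney".split()
ad_2 = "we are hiring a senior data analyst in melbourne".split()

# The ads share 6 of their 8 distinct trigrams
similarity = jaccard(ngrams(ad_1, 3), ngrams(ad_2, 3))
```

Changing a single word only disturbs the n-grams that overlap it, so lightly edited copies keep a high score.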

## Edit Distance

Edit distance, also known as Levenshtein Distance, is a useful way of measuring the similarity of two sequences. It counts the minimum number of substitutions, insertions and deletions you need to make to transform one sequence to another. I had a look at using this for trying to compare duplicate ads with reasonable results, but it's a little slow to run on many ads. I've previously looked at finding ads with exactly the same text in the Adzuna Job Salary Predictions Kaggle Competition, but there are a lot of ads that are slight variations.
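
The definition translates into a short dynamic program. This is a plain Python sketch of the textbook algorithm, not an optimised implementation; it takes time proportional to the product of the two lengths, which is why it gets slow on many long ads.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to turn sequence a into sequence b."""
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,              # delete x from a
                current[j - 1] + 1,           # insert y into a
                previous[j - 1] + (x != y),   # substitute x with y (free if equal)
            ))
        previous = current
    return previous[-1]


distance = edit_distance("kitten", "sitting")
```

Because it only indexes and compares items, the same function works on lists of words as well as strings of characters.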

## Using Emacs under WSL

Getting Emacs to work nicely on a Windows system can be a challenge. You can install it natively (although getting all the dependencies is a challenge), but many packages require libraries or utilities that are hard to install or don't exist on Windows. The best solution I have found is using Emacs under the Windows Subsystem for Linux (WSL) with Xming. However if you run Emacs 26 or greater after starting Xming with XLaunch you're faced with a blank screen and can't see any writing on Emacs

## The Problem with Jaccard for Clustering

The Jaccard Index is a useful measure of similarity between two sets. It makes sense for any two sets, is efficient to compute at scale and its arithmetic complement is a metric. However for clustering it has one major disadvantage; small sets are never close to large sets. Suppose you have sets that you want to cluster together for analysis. For example each set could be a website and the elements are people who visit that website.

## Jaccard Shingle Inequality

Two similar documents are likely to have many similar phrases relative to the number of words in the document. In particular if you're concerned with plagiarism and copyright, getting the same data through multiple sources, or finding versions of the same document this approach could be useful. In particular MinHash can quickly find pairs of items with a high Jaccard index, which we can run on sequences of w tokens. A hard question is what's the right number for w?

## Finding Exact Duplicate Text

Finding exact duplicate texts is quite straightforward and fast in Python. This can be useful for removing duplicate entries in a dataset. I tried this on the Adzuna Job Salary Predictions Kaggle Competition job ad texts and found it worked well. Naively finding exact duplicates by comparing every pair would be O(N^2), but if we sort the input, which is O(N log(N)), then duplicate items are adjacent. This scales really well to big datasets, and then the duplicate entries can be handled efficiently with itertools groupby to do something like uniq.
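
A minimal sketch of the sort-then-group idea. The function name `duplicate_groups` and the toy input are made up; the real ads would be much longer strings.

```python
from itertools import groupby


def duplicate_groups(texts):
    """Return (text, [indices]) for every text that occurs more than once."""
    # Sorting puts identical texts next to each other; keep the original index
    indexed = sorted(enumerate(texts), key=lambda pair: pair[1])
    groups = []
    for text, run in groupby(indexed, key=lambda pair: pair[1]):
        indices = [index for index, _ in run]
        if len(indices) > 1:
            groups.append((text, indices))
    return groups


texts = ["b", "a", "b", "c", "a", "b"]
duplicates = duplicate_groups(texts)
```

`groupby` only merges adjacent equal items, which is why the sort has to come first.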

## Showing Side-by-Side Diffs in Jupyter

When comparing two texts it's useful to have a side-by-side comparison highlighting the differences. This is straightforward using HTML in Jupyter Notebooks with Python, and the inbuilt DiffLib. I used this to display job ads duplicated between different sites. For a long document it's important to align the sentences (otherwise it's hard to compare the differences), and highlight the individual differences at a word level. Overall the problems are breaking up a text into sentences and words, aligning the sentences, finding word level differences and displaying them side-by-side.
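
As a starting point, the standard library's `difflib.HtmlDiff` already produces a side-by-side HTML table. This sketch compares line by line and leaves out the sentence alignment described above; the wrapper name `side_by_side_html` and the sample texts are hypothetical.

```python
import difflib


def side_by_side_html(left, right, wrapcolumn=60):
    """Build an HTML table showing two texts side by side with differences marked."""
    differ = difflib.HtmlDiff(wrapcolumn=wrapcolumn)
    return differ.make_table(
        left.splitlines(),
        right.splitlines(),
        fromdesc="Left",
        todesc="Right",
    )


left = "Senior Data Analyst\nBased in Sydney\nApply now"
right = "Senior Data Engineer\nBased in Sydney\nApply today"
html = side_by_side_html(left, right)
```

In a Jupyter notebook the table can be rendered with `IPython.display.HTML(html)`. `make_table` returns only the table markup, so the colours need the stylesheet that `make_file` would normally include.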

## Creating a Diff Recipe in Prodigy

I created a simple custom recipe to show diffs between two texts in Prodigy. I intend to use this to annotate near-duplicates. The process was pretty easy, but I got tripped up a little. I've been extracting job titles and skills from the job ads in the Adzuna Job Salary Predictions Kaggle Competition. One thing I noticed is there are a lot of job ads that are almost exactly the same; sometimes between the train and test set which is a data leak.

## All of Statistics

For anyone who wants to learn Statistics and has a maths or physics background I highly recommend Larry Wasserman's All of Statistics. It covers a wide range of statistics with enough mathematical detail to really understand what's going on, but not so much that the machinery is overwhelming. What I learned reading it really helped me understand statistics well enough to design bespoke statistical experiments and effectively use and implement machine learning models.

## Remote social catchups are less intimate

As an introvert I really like catching up with good friends in small groups. But a video/remote catchup is much less intimate than real life because only one person can talk at a time. When you get 4 or more people in a group setting, frequently the conversation splits into smaller subgroups. The subgroups let people intermingle and participate in topics they're more interested in while all being together. With a video call you can't easily do this splitting and only one person can talk at a time.

## Counting n-grams with Python and with Pandas

Sequences of words are useful for characterising text and for understanding text. If two texts have many similar sequences of 6 or 7 words it's very likely they have a similar origin. When splitting apart text it can be useful to keep common phrases like "New York" together rather than treating them as the separate words "New" and "York". To do this we need a way of extracting and counting sequences of words.
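
A minimal sketch of counting n-grams with the standard library. Lowercasing and splitting on whitespace are simplifying assumptions, and the function name `count_ngrams` is made up for the example.

```python
from collections import Counter


def count_ngrams(texts, n):
    """Count every run of n consecutive words across a collection of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip over n staggered copies of the tokens to get sliding windows
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


texts = ["I love New York", "New York is big", "new york new york"]
bigrams = count_ngrams(texts, 2)
```

`bigrams.most_common()` then lists the most frequent phrases first, which is a quick way to spot candidates like "New York" that are worth keeping together.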

## Waiting for System clock to synchronise

When trying to install packages with apt on a new Ubuntu AWS EC2 instance I had issues where the signature would fail to verify. The reason was the system clock was far in the past and so it looked like the signature was signed in the future. I created a workaround to wait for the system clock to synchronise that solved the problem and could be useful when starting a new machine with time sensitive issues.

## Not using NER for extracting Job Titles

I've been trying to use Named Entity Recogniser (NER) to extract job titles from the titles of job ads to better understand a collection of job ads. While NER is great, it's not the right tool for this job, and I'm going to switch to a counting based approach. NER models try to extract things like the names of people, places or products. SpaCy's NER model which I used is optimised to these cases (looking at things like capitalisation of words).

## Rules, Pipelines and Models

Over the past decade deep neural networks have revolutionised dealing with unstructured data. Problems that were intractable, from identifying what objects are in a video through generating realistic text to translating speech between languages, now have solutions used in real-time production systems. You might think that today all problems on text, audio and images should be solved by training end-to-end neural networks. However rules and pipelines are still extremely valuable in building systems, and can leverage the information extracted from the black-box neural networks.

## Active NER with Prodigy Teach

Active learning reduces the number of annotations you have to make by selecting for annotation the items that will have the biggest impact on model retraining. Active learning for NER is built into Prodigy, but I failed to use it to improve my job title recogniser. Having built a reasonable NER model for recognising job titles I wanted to see if I could easily improve it with Prodigy's active learning.

## Python Inequality Chaining

In Python the comparison a <= b == c < d does the mathematically correct thing. This is a handy notational trick. This wasn't obvious to me because a lot of programming languages treat these associatively, so that a <= b < c may resolve to (a <= b) < c. This is very dangerous if booleans (True or False) are coerced to integers (1 or 0) because it may look like it works but give the wrong results.
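A short illustration of the difference, using made-up values: Python evaluates the chained form as a conjunction of pairwise comparisons, while the explicitly grouped form compares a boolean against a number.

```python
a, b, c = 1, 3, 2

# Chained: equivalent to (a <= b) and (b < c)
chained = a <= b < c

# Grouped the way some other languages parse it:
# (a <= b) is True, which is coerced to 1, and 1 < 2 holds
grouped = (a <= b) < c
```

Here `chained` is `False` because 3 is not less than 2, while `grouped` is `True`, which is exactly the kind of plausible-looking wrong answer described above.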

## Training a job title NER with Prodigy

In a couple of hours I trained a reasonable job title Named Entity Recogniser for job ad titles using Prodigy, with over 70% accuracy. While 70% doesn't sound great, it's a bit ambiguous what a job title is, and getting exactly the bounds of the job title can be a hard problem. It's definitely good enough to be useful, and could be improved. After thinking through an annotation scheme for job titles I wanted to try annotating and training a model.

## Annotating Job Titles

When doing Named Entity Recognition it's important to think about how to set up the problem. There's a balance between what you're trying to achieve and what the algorithm can do easily. Coming up with an annotation scheme is hard, because as soon as you start annotating you notice lots of edge cases. This post will go through an example with extracting job titles from job ads. In our previous post we looked at what was in a job ad title and a way of extracting some common job titles from the ads.

## What's in a Job Ad Title?

The job title should succinctly summarise what the role is about, so it should tell you a lot about the role. However in practice job titles can range from very broad to very narrow, be obscure or acronym-laden and even hard to nail down. They're even hard to extract from a job ad's title - which is what I'll focus on in this series. In a previous series of posts I developed a method that could extract skills written in a very particular way.

## Disk Usage in Linux with du

When your hard drive is filling up, the du utility is a great way of seeing what's taking up all the space. It can recursively walk through directories to a maximum depth, and print the results in human readable sizes. I'll normally start by running df to see what space is used and available. It's worth looking at the Mounted On column if you don't administer the machine because sometimes there are special partitions for large files.

## Getting Started Debugging with pdb

When there's something unexpected happening in your Python code the first thing you want to do is to get more information about what's going wrong. While you can use print statements or logging it may take a lot of iterations of rerunning and editing your statements to capture the right information. You could use a REPL but sometimes it's challenging to capture all the state at the point of execution. The most powerful tool for this kind of problem is a debugger, and it's really easy to get started with Python's pdb.

## Calculating percentages in Presto

One trick I use all the time is calculating percentages in SQL by dividing with the count. Percentages quickly tell me how much coverage I've got when looking at the top few rows. However Presto uses integer division so doing the naive thing will always give you 0 or 1. There's a simple trick to work around this: replace count(*) with sum(1e0). Suppose for example you want to calculate the percentage of a column that is not null; you might try something like

## Moving Averages in SQL

Moving averages can help smooth out the noise to reveal the underlying signal in a dataset. As they lag behind the actual signal they trade off timeliness for increased precision in the underlying signal. You could use them for reporting metrics or for alerting in cases where it's more important to be sure there is a change than it is to catch any change early. It's typically better to have a 7 day moving average than weekly reporting for important metrics because you'll see changes earlier.
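The post itself works in SQL, where this would typically be a window function. As a language-neutral sketch of the idea, here is a trailing moving average in plain Python; the `moving_average` helper and the sample values are illustrative assumptions.

```python
def moving_average(values, window):
    """Trailing moving average: one output per position with a full window."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


daily_metric = [1, 2, 3, 4, 5]  # made-up daily values
smoothed = moving_average(daily_metric, window=3)
```

Each smoothed value only uses the current and previous observations, which is why the average lags behind the raw signal.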

## Getting most recent value in Presto with max_by

Presto and the AWS managed alternative Amazon Athena have some powerful aggregation functions that can make writing SQL much easier. A common problem is getting the most recent status of a transaction log. The max_by function (and its partner min_by) makes this a breeze. Suppose you have a table tracking user login activity over time like this:

| country | user_id | time | status |
|---|---|---|---|
| AU | 1 | 2020-01-01 08:00 | logged-in |
| CN | 2 | 2020-01-01 09:00 | logged-in |
| AU | 1 | 2020-01-01 12:00 | logged-out |
| AU | 1 | 2020-01-01 13:00 | logged-in |
| CN | 2 | 2020-01-01 14:00 | logged-out |

You need to find out which users are currently logged in and out, which requires you to find their most recent status.
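To make the semantics concrete, here is a plain Python sketch of what an aggregation like `max_by(status, time)` grouped by `user_id` computes on the example rows: for each user, the status on the row with the latest time. The variable names are illustrative, and this is an emulation rather than how Presto implements it.

```python
rows = [
    ("AU", 1, "2020-01-01 08:00", "logged-in"),
    ("CN", 2, "2020-01-01 09:00", "logged-in"),
    ("AU", 1, "2020-01-01 12:00", "logged-out"),
    ("AU", 1, "2020-01-01 13:00", "logged-in"),
    ("CN", 2, "2020-01-01 14:00", "logged-out"),
]

latest = {}  # user_id -> (time, status) of the most recent row seen so far
for country, user_id, time, status in rows:
    # Timestamps in this fixed format sort correctly as strings
    if user_id not in latest or time > latest[user_id][0]:
        latest[user_id] = (time, status)

current_status = {user_id: status for user_id, (_, status) in latest.items()}
```

On this data user 1 ends up logged in and user 2 logged out.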

## Syncing Calendars and Contacts to Android with DAVx5

I find it really handy to have my calendar and contacts from my email client on my mobile phone. DAVx5 is a fantastic free (GPLv3) app to do this on Android. This lets me organise my life across devices and helps me know when the birthdays of friends and family are. DAVx5 is simple to set up and has worked almost flawlessly for me for over 4 years. It supports two way synchronisation to CalDAV and CardDAV servers that many email providers support.

## Don't manage work email with Emacs

I do a lot of work in Emacs and at the command line, and I get quite a few emails so it would be great if I could handle my emails there too. Email in Emacs can be surprisingly featureful and handles HTML markup, images and can even send org markup with images and equations all from the comfort of an Emacs buffer. However it can be a whole heap of work, and as you get deeper into the features your mail client provides the amount of custom integration required grows very rapidly.

## Data Transformations in the Shell

There are many great tools for filtering, transforming and aggregating data like SQL, R dplyr and Python Pandas (not to mention Excel). But sometimes when I'm working on a remote server I want to quickly extract some information from a file without switching to one of these environments. The standard unix tools like uniq, sort, sed and awk can do blazing fast transformations on text files that don't fit in memory and are easy to chain together.

## Second most common value with Pandas

I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignments out of order. However some things are really difficult to do with method chaining in Pandas; in particular getting the second most common value of each group. This is much easier to do in R's dplyr with its consistent and flexible syntax than it is with Pandas. Problem: For the table below find the total frequency and the second most common value of y by frequency for each x (in the case of ties any second most common value will suffice).

## Property Based Testing - A thousand test cases in a single line

Property based testing lets you specify rules that a function being tested will satisfy over a wide range of inputs. This specifies how to thoroughly test a function without coming up with a detailed set of test cases. For example instead of writing a specific test case like sort([1, 3, 2]) == [1, 2, 3], you could state that the input and output of sort should contain exactly the same elements for any valid input.
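The idea can be sketched with nothing but the standard library: generate many random inputs and assert the property on each. This hand-rolled loop is an illustrative assumption, not the API of any property based testing library; real libraries also shrink failing inputs to minimal examples.

```python
import random
from collections import Counter


def check_sort_properties(sort, trials=1000, seed=0):
    """Check two properties of a sort function on random integer lists."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        ys = sort(xs)
        # Property 1: output contains exactly the same elements as the input
        assert Counter(ys) == Counter(xs)
        # Property 2: output is in non-decreasing order
        assert all(a <= b for a, b in zip(ys, ys[1:]))
    return True


check_sort_properties(sorted)
```

A buggy implementation, such as one that silently drops duplicates, would trip the first assertion without anyone having to think of that specific test case.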

• Edward Ross
## Using emacs dumb-jump with evil

Dumb-jump is a fantastic emacs package for code navigation. It jumps to the definition of a function/class/variable by searching for regular expressions that look like a definition using ag, ripgrep or git-grep/grep. Because it is so simple it works in over 40 languages (including oddities like SQL, LaTeX and Bash) and is easy to extend. While it is slower and less accurate than ctags, for medium sized projects it's fast enough and requiring no setup makes it much more useful in practice.

## Presto and Athena CLI in Emacs

I find having Emacs as a unified programming environment really useful. When writing an SQL pipeline I can iteratively develop my SQL in emacs, running it against the database. For a quick and dirty analysis I can copy the output into the .sql file and comment it out. Then I can copy the SQL into a programming language, parameterise it, and test it without touching the mouse. So when I started using Presto and AWS's managed alternative Athena, I needed to integrate it into emacs.

## Fastai Callbacks as Lisp Advice

Creating state of the art deep learning algorithms often requires changing the details of the training process. Whether it's scheduling hyperparameters, running on multiple GPUs or plotting the metrics, it requires changing something in the training loop. However constantly modifying the core training loop every time you want to add a feature, and adding a switch to enable it, quickly becomes unmaintainable. The solution fast.ai developed is to add points where custom code can be called that modifies the state of training, which they call callbacks.
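The pattern can be sketched in a few lines of plain Python. Everything here (the `Callback` base class, the hook names, the `state` dictionary and the stand-in "loss") is hypothetical and much simpler than fastai's actual callback API; it only shows how hooks keep the core loop unchanged while features are added.

```python
class Callback:
    """Base class: hooks do nothing unless overridden."""

    def on_epoch_begin(self, state):
        pass

    def on_batch_end(self, state):
        pass


class LossRecorder(Callback):
    """Records the loss after every batch."""

    def __init__(self):
        self.losses = []

    def on_batch_end(self, state):
        self.losses.append(state["loss"])


class StopAfter(Callback):
    """Asks the loop to stop after n iterations by modifying the state."""

    def __init__(self, n):
        self.n = n

    def on_batch_end(self, state):
        if state["iteration"] >= self.n:
            state["stop"] = True


def train(batches, epochs, callbacks):
    state = {"iteration": 0, "loss": None, "stop": False}
    for epoch in range(epochs):
        state["epoch"] = epoch
        for cb in callbacks:
            cb.on_epoch_begin(state)
        for batch in batches:
            # Stand-in for a real forward/backward pass
            state["loss"] = sum(batch) / len(batch)
            state["iteration"] += 1
            for cb in callbacks:
                cb.on_batch_end(state)
            if state["stop"]:
                return state
    return state


recorder = LossRecorder()
final = train([[1.0, 3.0], [2.0, 4.0]], epochs=5, callbacks=[recorder, StopAfter(3)])
```

Adding early stopping or metric recording required no change to `train` itself, only a new callback in the list.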

• Edward Ross

There are many things that are valuable to know in business but are hard to measure. For example the time from when a customer has a need to purchase, the number of related products customers use or the actual value your products are delivering. However you don't need a sample size of hundreds to get an estimate; in fact you can get a statistically significant result from measuring just 5 random customers.

• Edward Ross

I like to do one-off analyses in R because tidyverse makes it really easy and beautiful. I also like to do them in Jupyter Notebooks because they form a neat way to collate the results. While R Markdown is better for reproducible code, often I'm doing expensive things with databases that are changing, and so I tend to find the "write once" behaviour of Jupyter Notebooks fit this use case better (although R Markdown Notebooks are catching up).

## Exporting data to Python with Amazon Athena

One necessary hurdle in doing data analysis or machine learning is loading the data. In many businesses larger datasets live in databases, in an object store (like Amazon S3) or the Hadoop File System. For some use cases you can do the work where the data lives using SQL or Spark, but sometimes it's more convenient to load it into a language like Python (or R) with a wider range of tools.

## Extracting Skills from Job Ads: Part 3 Conjugations

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "experience in telesales" using spaCy's dependency parse, but it wouldn't extract many types of experience from a job ad. Here we will extend these rules to extract lists of skills (for example extracting "telesales" and "callcentre" from "experience in telesales or receptionist"), which will let us analyse which experiences are related.

Extracting Experience in a Field I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. In the previous post I extracted skills written in phrases like "subsea cable engineering experience". This worked well, but extracted a lot of qualifiers that aren't skills (like "previous experience in", or "any experience in"). Here we will write rules to extract experience from phrases like "experience in subsea cable engineering", with much better results.

## Extracting Skills from Job Ads: Part 1 - Noun Phrases

I'm trying to extract skills from job ads, using job ads in the Adzuna Job Salary Predictions Kaggle Competition. Using rules to extract noun phrases ending in experience (e.g. subsea cable engineering experience) we can extract many skills, but there are a lot of false positives (e.g. previous experience). You can see the Jupyter notebook for the full analysis. Extracting Noun Phrases It's common for ads to write something like "have this kind of experience":

I attended the excellent 2019 Leading the Product conference in Melbourne with around 500 other Product Managers and Enthusiasts. The conference had a broad range of great talks, a stimulating networking event where we connected by sharing our favourite books on product management, and overall an energetic atmosphere. I got something out of every talk, but here are the highlights from a data perspective. Find quick ways of testing difficult and uncertain hypotheses John Zeratsky talked about the design sprint for implementing a design solution in a week; from storyboarding an experience, to brainstorming solutions to prototyping and testing.

## Data Blockless: A better way to create data

Before you can do any machine learning you need to be able to read the data, create test and training splits and convert it into the right format. Fastai has a generic data block API for doing these tasks. However it's quite hard to extend to new data types. There are a few classes to implement: Items, ItemLists, LabelLists and the Preprocessors, which are obfuscated through a complex inheritance and dispatch hierarchy.

## Constant Models

When predicting outcomes using machine learning it's always useful to have a baseline to compare results against. A simple baseline is the best constant model; that is a model that gives the same prediction for any input. This is a really simple check to perform against any dataset, and can be informative to check across validation splits. There are simple algorithms for finding the best constant model. For categorical predictions just evaluate every possible category to choose as the constant prediction.
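For the categorical case, the brute-force search described above fits in a few lines. This sketch assumes accuracy is the metric being maximised (the excerpt does not say which metric is used), and the `best_constant_category` helper and sample labels are hypothetical.

```python
def best_constant_category(labels):
    """Try each category as the constant prediction; keep the most accurate."""

    def accuracy(category):
        return sum(1 for y in labels if y == category) / len(labels)

    # sorted() makes tie-breaking deterministic
    best = max(sorted(set(labels)), key=accuracy)
    return best, accuracy(best)


labels = ["a", "b", "a", "c", "a", "b"]
baseline = best_constant_category(labels)
```

Under accuracy the winner is simply the most common category, so any real model should beat a baseline of 0.5 on these labels.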

## A programmer using Excel

Intro When I was 15 I did a week of work experience with my neighbour, who was an agricultural economist running his own one person business. I'm still not really sure what an agricultural economist does, but I went out with him to visit his clients to talk through their business, and saw how he analysed their data in his Excel spreadsheet. It was really closer to an application than a spreadsheet; the interface made it clear where the client was meant to enter their data, it showed some summary output and most of the intermediate calculations were hidden.

## Spectra of atoms

Why is a sodium lamp yellow? How can we determine the elemental composition of the sun? How does a helium-neon laser work? To some degree all of these questions require knowing the spectra of atoms, which can in theory be calculated by quantum mechanics. However the calculation of these spectra for arbitrary systems from first principles is prohibitively difficult and computationally intensive (which is why techniques such as Density Functional Theory are used).

## Regular expressions, automata and monoids

In formal language theory the task is to specify, over some given alphabet, a set of valid strings. This is useful in searching for structures in textual data across files (e.g. via grep), for specifying the syntactic structure of programming languages (e.g. in Bison or pandoc), and for generating output of a specified form (e.g. automatic computer science and mathematics paper generators).

An automaton is (roughly) a set of symbols, and a set of states, along with transitions for each state that take a symbol and return another state. They can be used to model (and verify) simple processes.

Automata can be brought into correspondence with formal languages in a very natural way; given an initial state s, and a sequence of symbols (a1, a2, …, an) the automaton has a naturally assigned state (… ((s a1) a2) … an) (where “(state symbol)” represents the state obtained from the transition on symbol using state). Then if we nominate an initial state, and a set of “accepting” valid states, we say a string is in the language of the automaton if and only if when applied to the initial state it ends in an accepting state.
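The nested application (… ((s a1) a2) … an) is a left fold over the symbols, which can be sketched directly in Python. The `accepts` helper and the example automaton (accepting strings over {a, b} with an even number of a's) are illustrative assumptions.

```python
from functools import reduce


def accepts(transitions, initial, accepting, symbols):
    """Fold the transition function over the symbols, then test the final state."""
    final = reduce(
        lambda state, symbol: transitions[(state, symbol)],
        symbols,
        initial,
    )
    return final in accepting


# Example automaton: tracks whether an even or odd number of 'a's has been seen
transitions = {
    ("even", "a"): "odd",
    ("even", "b"): "even",
    ("odd", "a"): "even",
    ("odd", "b"): "odd",
}

in_language = accepts(transitions, "even", {"even"}, "abba")
```

Each symbol costs one dictionary lookup, which is the sense in which automata give an efficient implementation of the language.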

This gives a very useful pairing in computer science; formal languages are useful tools, and automata (often) give an efficient way to implement them on a computer.

## DVI by example

The Device Independent File Format (DVI) is the output format of Knuth’s TeX82; modern TeX engines (pdfTeX, luaTeX) output straight to Adobe’s Portable document format (PDF). However TeX82 and DVI still work as well today as they did when they were written; DVI files are easily cast to postscript or PDF.

The defining reference for DVI files is David R. Fuchs’s article in TUGboat Vol 3 No 2.

To find out what information is contained in a particular DVI file use Knuth’s dvitype, which outputs the operations contained in the bytecode in human readable format.

This article goes into gory detail on the instructions contained in a very simple DVI file.

## Algorithms for finding the real roots of polynomials

Given a degree n polynomial over the real numbers we are guaranteed there are at most n real roots by the fundamental theorem of algebra; but how do we find them? Here we explore the Vincent-Collins-Akritas algorithm.

It uses Descartes’ rule of signs: given a polynomial $$p(x) = a_n x^n + \cdots + a_1 x + a_0$$ the number of real positive roots (counting multiplicities) is bounded above by the number of sign variations in the sequence $$(a_n, \ldots, a_1, a_0)$$ .
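Counting sign variations is easy to sketch. The snippet below implements only the rule of signs bound, not the full Vincent-Collins-Akritas algorithm; the `sign_variations` helper is a hypothetical name, and zero coefficients are skipped, as is standard for the rule.

```python
def sign_variations(coefficients):
    """Count sign changes in (a_n, ..., a_1, a_0), ignoring zero coefficients."""
    signs = [c > 0 for c in coefficients if c != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


# p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3)
upper_bound = sign_variations([1, -6, 11, -6])
```

For this polynomial the bound is 3, which is attained since the roots 1, 2 and 3 are all positive; for x^2 + 1 the count is 0, correctly ruling out positive real roots.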

## Geometry and topology of division rings

Following from my last post (and Veblen and Young’s Projective Geometry) consider a projective plane satisfying the axioms:

1. Given two distinct points there is a unique line that both points lie on
2. Each line has at least three points which lie on it
3. Given a triangle any line that intersects two sides of the triangle intersects the third.
4. All points are spanned by d+1 points and no fewer.

Then for d >= 3 such a space is equivalent to the projective space of lines over a division ring (or skew field).

Kolmogorov asked the question: on which projective spaces can we do analysis? In order to do things such as find tangent lines we are going to need some sort of topology.


## Geometry of division rings

It is fairly easy to construct a geometry from algebra: given a division ring K we form an n-dimensional vector space over K, the points being the elements of the vector space and a line being a translation of all (left) multiples of a non-zero vector, i.e. of the form $$\{a\mathbf{v} + \mathbf{c} \mid a \in K\}$$ for some fixed vectors $$\mathbf{v} \neq 0$$ and $$\mathbf{c}$$.

Interestingly it’s just as possible to go the other way, if we’re careful about what we mean by a geometry. I will loosely follow Artin’s book Geometric Algebra. In particular we have the undefined terms of point, line and the undefined relation of lies on. Then, for a fixed positive integer d, the axioms are:

1. Given two distinct points there is a unique line that both points lie on
2. Each line has at least three points which lie on it
3. Given a line and a point not on that line there exists a unique line, lying in the plane containing them, that the point lies on and that no point of the first line lies on.
4. All points are spanned by d+1 points and no fewer.

## Linear representation of additive groups and the Fourier Transform: Part 1

In this article I will show that the cyclic group of order n, that is the set $$\{0,1,2,\ldots,n-1\}$$ under addition modulo n motivates the discrete Fourier transform on a particular finite dimensional complex inner product space, and gives many of its properties. In a subsequent article I will extend this to the general Fourier transform and its relation to the group of integers and real numbers under addition.
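One standard way to see the connection numerically is sketched below; the sign and normalisation conventions are assumptions and may differ from those in the article. The maps j ↦ exp(2πi jk/n) turn addition modulo n into multiplication, and taking inner products against them gives the discrete Fourier transform. The `character` and `dft` helpers are hypothetical names.

```python
import cmath


def character(n, k):
    """The map j -> exp(2*pi*i*j*k/n), tabulated on {0, ..., n-1}."""
    return [cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)]


def dft(xs):
    """Unnormalised DFT: inner product of xs with each character."""
    n = len(xs)
    return [
        sum(x * c.conjugate() for x, c in zip(xs, character(n, k)))
        for k in range(n)
    ]


n, k = 6, 2
chi = character(n, k)

# Homomorphism property: chi(a + b mod n) == chi(a) * chi(b), up to rounding
homomorphism_error = max(
    abs(chi[(a + b) % n] - chi[a] * chi[b])
    for a in range(n)
    for b in range(n)
)

constant_spectrum = dft([1, 1, 1, 1])
```

The error is at floating point rounding level, and a constant signal has all its weight in the k = 0 component, reflecting the orthogonality of distinct characters.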

## Do you really mean ℝⁿ?

In mathematics and physics it is common to talk about $$\mathbb{R}^n$$ when really we mean something else that can be represented by $$\mathbb{R}^n$$.

Consider mechanics or geometry: these are often represented as theories in $$\mathbb{R}^n$$, but really they don’t occur in a vector space at all! Look around you, a three-dimensional description of space probably seems reasonable, but where’s the origin? [Perhaps the centre of your eyes could be an origin, but someone else would disagree with you]. Classical mechanics, special relativity and geometry are much better described as an affine space – which is a vector space without an origin.

## Tensor notation

Language affects the way you think, often subconsciously. The easier and more natural something is to express in a language the more likely you are to express it. This is especially true of mathematical thought where the language is very precise.

I know three types of notation for tensors and each seems to be useful in different situations and gives you a different perspective on how tensors “work”. [Technical Note: I will assume all vector spaces are finite dimensional so $$V$$ is naturally isomorphic to $$V^{**}$$]

## LaTeXing Multiple Equations

In mathematics and the (hard) sciences it’s important to be able to write documents with lots of equations, lots of figures and lots of references efficiently. This can be done in, for example, Microsoft Word, but the mathematics and theoretical physics community heavily prefers $$\TeX$$ (and in particular $$\LaTeX$$), so the bottom line is if you want to get papers published you’re going to have to get good at it.

There are a lot of resources for learning $$\LaTeX$$ on the web, and a lot of people teach themselves from these (I know I did), but this can get you into some bad habits. For instance eqnarray gets the spacing around the equals signs all wrong. (I typeset my thesis using exclusively eqnarray and didn’t notice this until it was pointed out to me.) So a lot of people advocate align from AMS-LaTeX, but align has its limitations too; it only comes with one alignment tab &. If you want to make a comment at the end of multiple equations (like “for $$x \in X$$”) or you want to have two equations and the second one breaks over two lines you can’t line the equations up properly; but there is a solution – IEEEeqnarray (which comes from an external package, IEEEtrantools, available from the IEEE). Stefan Moser has written an excellent paper covering everything I’ve said and much more, showing good ways to typeset equations.

## Solving polynomials of degree 2,3 and 4

$\newcommand\nth{n^{\mathrm{th}}}$

It is well known in mathematics that it is possible to find the roots of a general quadratic, cubic or quartic in terms of radicals (linear combinations and products of $$\nth$$ roots). Another way of saying this is that the equation $$a x^4+b x^3+c x^2 + d x + e = 0$$ can be solved for any complex constants $$a$$,$$b$$,$$c$$,$$d$$, and $$e$$ if one can solve the equation $$x^n-t=0$$ for $$n \in \{2,3\}$$ ($$1$$ being trivial) ($$t$$ may be an algebraic combination of solutions of $$x^n-s$$ for a variety of $$s$$ which are algebraic combinations of $$a$$,$$b$$,$$c$$,$$d$$ and $$e$$). This is not true for the quintic.

