On Applying Optimisation
maths

I recently watched the ACEMS public lecture Optimal Decision Making: A tribute to female ingenuity, in which Alison Harcourt (née Doig) talked about her contribution to developing Branch and Bound (William Cook has a paper on this history, and the landmark paper was An Automatic Method of Solving Discrete Programming Problems, Land and Doig, 1960). They discussed how impactful the work was on industry, particularly in logistics, where you often have to optimise some goal with constrained (and often indivisible) resources.

Estimating Group Means with Empirical Bayes
maths

When calculating the averages of lots of different groups it often makes sense not to treat the groups as independent, but to pool information across them, especially for groups with little data. One way to do this is to build a model on covariates of the groups, or on categorical embeddings, so that other observations inform each estimate. Surprisingly, Stein's Paradox says that if we're trying to minimise the total squared error, have one observation per group for at least three groups, and the observations are all normally distributed with the same standard deviation, then we're always better off shrinking the means towards a common point than treating them separately (even if they're completely unrelated quantities!).
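The shrinkage in Stein's paradox can be sketched with the (positive-part) James–Stein estimator, here shrinking towards zero. The group means, weights and seed below are made up for illustration:

```python
import random

random.seed(0)

def james_stein(xs):
    # Positive-part James-Stein: shrink single unit-variance
    # observations toward zero; requires at least 3 groups.
    k = len(xs)
    s = sum(x * x for x in xs)
    factor = max(0.0, 1 - (k - 2) / s)
    return [factor * x for x in xs]

true_means = [0.5, -1.0, 2.0, 0.3, -0.7, 1.5]
trials = 2000
sse_raw = sse_js = 0.0
for _ in range(trials):
    obs = [m + random.gauss(0, 1) for m in true_means]
    shrunk = james_stein(obs)
    sse_raw += sum((o - m) ** 2 for o, m in zip(obs, true_means))
    sse_js += sum((e - m) ** 2 for e, m in zip(shrunk, true_means))

avg_raw, avg_js = sse_raw / trials, sse_js / trials
```

Averaged over many simulated draws the shrunken estimates have lower total squared error than the raw observations, even though the true means are unrelated.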

Calculus of the Inverse Logit Function
maths

I was recently doing some logistic regression, and calculated the derivative of the inverse logit function (sometimes known as expit) to understand how a coefficient's impact changes depending on the predicted probability. It turns out it has some mathematically interesting properties that I thought would be fun to explore. The inverse logit function is \( {\rm invlogit}(x) = \frac{\exp(x)}{1+\exp(x)} \). A bit of calculus shows that \[ \frac{\rm d}{{\rm d} x} {\rm invlogit}(x) = \frac{e^{x}}{\left(1+e^{x}\right)^2} = {\rm invlogit}(x) (1 - {\rm invlogit}(x)) \]
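The identity is easy to sanity-check against a finite difference (the test points below are arbitrary):

```python
import math

def invlogit(x):
    # Numerically stable inverse logit (expit).
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    e = math.exp(x)
    return e / (1 + e)

def invlogit_deriv(x):
    # The closed form: invlogit(x) * (1 - invlogit(x)).
    p = invlogit(x)
    return p * (1 - p)

# Compare against a central finite difference at a few points.
h = 1e-6
for x in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    numeric = (invlogit(x + h) - invlogit(x - h)) / (2 * h)
    assert abs(numeric - invlogit_deriv(x)) < 1e-6
```

The derivative peaks at 1/4 when the predicted probability is 0.5 and vanishes in the tails, which is exactly why a coefficient's impact depends on the predicted probability.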

Interpretable Parameterisations
data

Interpretable models are incredibly useful for getting users to trust your model (and understand when not to trust it). In projects helping subject matter experts make decisions, an interpretable model lets them see what goes into it and what adjustments need to be made. However a large part of what makes a model interpretable is not just its form, but how it's presented. Reparameterising it in terms that make sense from a decision maker's point of view can make all the difference in gaining insight, trust and adoption.

Constrained Gradient Descent
maths

Gradient descent is an effective algorithm for finding local minima of differentiable functions, and the global minimum of convex functions. It's very useful in machine learning for fitting a model from a family of models by finding the parameters that minimise a loss function. However sometimes you have extra constraints; I recently worked on a problem where the maximum value of the fitted function had to occur at a certain point.
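The post's actual constraint is more specific, but the general recipe can be sketched as projected gradient descent: take an ordinary gradient step, then project back onto the feasible set. The objective and linear constraint below are toy examples, not the post's problem:

```python
def grad_f(x, y):
    # Gradient of the toy objective f(x, y) = (x - 2)^2 + (y - 2)^2.
    return 2 * (x - 2), 2 * (y - 2)

def project(x, y):
    # Euclidean projection onto the constraint set {(x, y) : x + y = 1}.
    shift = (x + y - 1) / 2
    return x - shift, y - shift

x, y = 3.0, -1.0   # start at a feasible point
lr = 0.1
for _ in range(200):
    gx, gy = grad_f(x, y)
    x, y = project(x - lr * gx, y - lr * gy)
```

For this toy problem the iterates converge to (0.5, 0.5), the closest feasible point to the unconstrained minimiser (2, 2).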

Probability Distributions Between the Mean and the Median
maths

The normal distribution is used throughout statistics: because of the Central Limit Theorem it occurs in many applications, but it's also computationally convenient. The expected value of the normal distribution is the mean, which has many nice arithmetic properties, but the drawback of being sensitive to outliers. When discussing constant models I noted that the minimiser of the Lᵖ error is a generalisation of the mean; for \( p = 2 \) it's the mean, for \( p = 1 \) it's the median, and for \( p = \infty \) it's the midrange (halfway between the maximum and minimum points).
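This can be checked numerically by minimising the Lᵖ error of a constant model over a fine grid (the data here is arbitrary):

```python
data = [1.0, 2.0, 2.5, 4.0, 10.0]

def lp_minimiser(data, p, grid_steps=20000):
    # Brute-force search for the constant c minimising sum |x - c|^p.
    lo, hi = min(data), max(data)
    best_c, best_loss = lo, float("inf")
    for i in range(grid_steps + 1):
        c = lo + (hi - lo) * i / grid_steps
        loss = sum(abs(x - c) ** p for x in data)
        if loss < best_loss:
            best_c, best_loss = c, loss
    return best_c

p2_min = lp_minimiser(data, 2)      # close to the mean, 3.9
p1_min = lp_minimiser(data, 1)      # close to the median, 2.5
p100_min = lp_minimiser(data, 100)  # approaches the midrange, 5.5
```

With p = 2 the minimiser matches the mean, with p = 1 the median, and as p grows it approaches the midrange.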

Integrating Powers of Exponentials
maths

When working with distributions that are powers of exponentials, of which the normal and exponential distributions are special cases, I had to calculate the integrals of these exponentials. It's possible to transform them into expressions involving the Gamma function. Specifically I found that for all positive p and all m > -1: \[ \int_{0}^{\infty} x^m e^{-x^p}\, {\rm d}x = \frac{\Gamma\left(\frac{m+1}{p}\right)}{p} \] This is useful for calculating moments of powers of exponentials, namely for positive p and k:
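The identity is easy to sanity-check numerically; here with m = 2 and p = 3, where the right-hand side is Γ(1)/3 = 1/3 (the integration range and step count are ad hoc):

```python
import math

def integrand(x, m, p):
    return x ** m * math.exp(-(x ** p))

def numeric_integral(m, p, upper=20.0, steps=200000):
    # Midpoint rule; the integrand decays very fast, so [0, 20] suffices.
    h = upper / steps
    return sum(integrand((i + 0.5) * h, m, p) for i in range(steps)) * h

m, p = 2, 3
closed_form = math.gamma((m + 1) / p) / p   # Gamma(1)/3 = 1/3
approx = numeric_integral(m, p)
```

The closed form follows from the substitution \( u = x^p \), which turns the integral into the definition of the Gamma function.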

Cosine Similarity is Euclidean Distance
maths

In mathematics it's surprising how often something that's obvious (or trivial) to one person can be revolutionary (or weeks of work) to another. I was looking at the annoy (Approximate Nearest Neighbours, Oh Yeah) library and saw this comment: "Cosine distance is equivalent to Euclidean distance of normalized vectors". I hadn't realised it at all, but once the claim was made I could immediately verify it. Given two vectors u and v their distance is given by the length of the vector between them: \( d = \| u - v \| = \sqrt{(u - v) \cdot (u - v)} \).
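The claim is a one-liner to verify numerically (the vectors here are arbitrary):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def normalise(u):
    n = norm(u)
    return [a / n for a in u]

u, v = [3.0, 1.0, 2.0], [-1.0, 4.0, 0.5]

cosine_distance = 1 - dot(u, v) / (norm(u) * norm(v))

un, vn = normalise(u), normalise(v)
# Squared Euclidean distance of the normalised vectors.
euclid_sq = sum((a - b) ** 2 for a, b in zip(un, vn))
```

The algebra behind it: for unit vectors, \( \| \hat u - \hat v \|^2 = \hat u \cdot \hat u - 2 \hat u \cdot \hat v + \hat v \cdot \hat v = 2 - 2\cos\theta \), i.e. twice the cosine distance.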

Centroid of Points on the Surface of a Sphere
maths

I’ve written a derivation of how to find the centroid of a polygon on a sphere. This post shows it explicitly in numerical computations, and also looks at the solution in Spherical Averages and Applications to Spherical Splines and Interpolation, by Buss and Fillmore, ACM Transactions on Graphics 20, 95–126 (2001). Explicitly coding the mathematics is a great exercise; having to represent everything concretely unearthed gaps in my understanding, and turned up errors both in drafts of my derivation and in the paper.

Centroid Spherical Polygon
maths

You're organising a conference of operations research analysts from all over the world, but their time is very valuable and they will only agree to meet if you minimise the average distance they need to travel (even if the conference has to be held on a boat in the middle of the ocean). Where do you put it? Let's model the world as a unit sphere in three-dimensional space, with the N people at Cartesian coordinates \( \{ p_i\}_{i=1}^{N}\).
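One way to attack this numerically (a sketch, not necessarily the post's method) is gradient descent on the mean great-circle distance, renormalising back onto the sphere after each step; the example points are made up:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalise(u):
    n = math.sqrt(dot(u, u))
    return [a / n for a in u]

def mean_distance(x, points):
    # Mean great-circle distance from unit vector x to the points.
    return sum(math.acos(max(-1.0, min(1.0, dot(p, x))))
               for p in points) / len(points)

def spherical_centroid(points, lr=0.05, iters=500):
    # Start from the normalised Euclidean mean, then descend the
    # gradient of the mean geodesic distance, renormalising each step.
    x = normalise([sum(c) for c in zip(*points)])
    n = len(points)
    for _ in range(iters):
        grad = [0.0, 0.0, 0.0]
        for p in points:
            c = max(-0.999999, min(0.999999, dot(p, x)))
            # d/dc arccos(c) = -1 / sqrt(1 - c^2)
            w = -1.0 / (n * math.sqrt(1 - c * c))
            for i in range(3):
                grad[i] += w * p[i]
        x = normalise([x[i] - lr * grad[i] for i in range(3)])
    return x

points = [normalise([0.2, 0.1, 1.0]),
          normalise([-0.1, 0.3, 1.0]),
          normalise([0.0, -0.2, 1.0])]
centre = spherical_centroid(points)
```

For points clustered near the north pole the result stays near the pole, as you'd expect.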

Centroid for Cosine Similarity
data

Cosine similarity is often used as a similarity measure in machine learning. Suppose you have a group of points (like a cluster) and you want to represent the group by a single point: the centroid. Then you can talk about how well formed the group is by the average distance of points from the centroid, or compare it to other centroids. Surprisingly it's not much more complex than finding the geometric centre in Euclidean space, if you pick the right coordinate system.
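Under cosine similarity the right representative is simple: the unit vector maximising the average cosine similarity to a set of unit vectors is just their normalised sum. A sketch with made-up 2D points:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalise(u):
    n = math.sqrt(dot(u, u))
    return [a / n for a in u]

def avg_cos_sim(x, pts):
    # Average cosine similarity of unit vector x to unit vectors pts.
    return sum(dot(x, p) for p in pts) / len(pts)

points = [[1.0, 0.2], [0.8, 0.4], [1.2, -0.1]]
unit_points = [normalise(p) for p in points]

# Centroid under cosine similarity: the normalised sum.
centroid = normalise([sum(c) for c in zip(*unit_points)])
```

Since the average cosine similarity of a unit vector x is just \( x \cdot \bar p \) for the mean \( \bar p \) of the normalised points, it is maximised when x points along that mean.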

Checking With Calculation
general

I recently convinced myself that I was right and a published paper was wrong. I did all the calculus by hand and decided they must have missed something. I resolved it by calculation. I played through some very basic examples in my head and thought I was right, but I couldn't be sure. So I quickly implemented a concrete example case in Python using numpy. This helped me see pretty quickly that I was wrong; some searching produced a counterexample.

Solving Solved Problems
general

A good technique for deeply understanding something is to try to solve it yourself first. Sometimes this can even lead to better methods or new discoveries. I heard an interesting technique from Jeremy Howard in one of the fast.ai courses about how to read a paper. First read the abstract and introduction. Then spend a couple of days trying to implement what you think they're talking about. Then go back and read the rest of the paper and see how it compares to what you did.

Modelling the Spread of Infectious Disease
maths

Understanding the spread of infectious disease is very important for public health policy. Whether it's the seasonal flu, HIV or a novel pandemic, the health implications of infectious diseases can be huge. A change in policy can mean saving thousands of lives, relieving massive suffering and avoiding the associated losses in economic productivity. The SIR model is simple, but captures the underlying dynamics of how quickly infectious diseases spread.
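The SIR dynamics can be sketched with a forward-Euler integration; the rates below are illustrative, chosen so that R₀ = β/γ = 2.5:

```python
def sir(beta, gamma, s0, i0, days, dt=0.1):
    # Forward-Euler integration of the SIR equations, with S, I, R
    # as fractions of the population:
    #   dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I
    s, i, r = s0, i0, 1 - s0 - i0
    history = [(s, i, r)]
    for _ in range(int(days / dt)):
        new_inf = beta * s * i * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        history.append((s, i, r))
    return history

# R0 = 2.5: the infection grows, peaks, then burns out as the
# susceptible pool is depleted.
traj = sir(beta=0.5, gamma=0.2, s0=0.99, i0=0.01, days=200)
peak_i = max(i for _, i, _ in traj)
final_s = traj[-1][0]
```

Note that the epidemic ends with some susceptibles never infected: the outbreak stops when S drops below γ/β, not when S reaches zero.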

Dip Statistic for Multimodality
maths

If you've got a distribution you may want a way to tell whether it has multiple components. For example a sample of heights may have a couple of peaks for different genders, or other attributes. While you could determine this by explicitly modelling it as a mixture, the results are sensitive to your choice of model. Another approach is statistical tests for multimodality. One common test is Silverman's Test, which checks the number of modes in a kernel density estimate; the trick is choosing the right bandwidth.
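Silverman's idea can be sketched by counting the modes of a Gaussian kernel density estimate as the bandwidth varies; the bimodal sample below is simulated:

```python
import math
import random

random.seed(1)
# A bimodal sample: two well-separated normal components.
sample = ([random.gauss(0, 1) for _ in range(200)] +
          [random.gauss(6, 1) for _ in range(200)])

def kde(x, data, h):
    # Gaussian kernel density estimate with bandwidth h.
    z = math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / h) ** 2)
               for d in data) / (len(data) * h * z)

def count_modes(data, h, grid=400):
    # Count strict local maxima of the KDE on an evenly spaced grid.
    lo, hi = min(data) - 1, max(data) + 1
    ys = [kde(lo + (hi - lo) * i / grid, data, h) for i in range(grid + 1)]
    return sum(1 for i in range(1, grid)
               if ys[i] > ys[i - 1] and ys[i] > ys[i + 1])
```

With this sample a moderate bandwidth (around 1) shows two modes, while a very large one (say 5) smooths everything into a single mode; Silverman's test formalises how large the critical bandwidth has to be.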

Differentiation is Linear Approximations
maths

Differentiation is the process of creating a local linear approximation of a function. This is useful because linear functions are very well understood and efficient to work with. One application of them is gradient descent, often used for fitting models in machine learning. In this context a function is something that maps between coordinate spaces. For example consider an image classifier that takes a 128x128 pixel image with three channels for colours (Red, Green, Blue) and returns a probability that the image contains a cat and the probability that the image contains a dog.
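For a map between coordinate spaces the derivative is the Jacobian matrix, which gives the best linear approximation near a point. A small numerical check on a made-up map from R² to R² (far smaller than an image classifier, but the same idea):

```python
import math

def f(x, y):
    # A smooth map from R^2 to R^2.
    return (x * y, math.sin(x) + y ** 2)

def jacobian(x, y):
    # Hand-computed Jacobian (matrix of partial derivatives) of f.
    return [[y, x],
            [math.cos(x), 2 * y]]

x0, y0 = 0.7, -0.3
hx, hy = 1e-4, -2e-4   # a small displacement

fx, fy = f(x0, y0)
J = jacobian(x0, y0)
# Linear approximation: f(x0 + h) ~ f(x0) + J h
approx = (fx + J[0][0] * hx + J[0][1] * hy,
          fy + J[1][0] * hx + J[1][1] * hy)
exact = f(x0 + hx, y0 + hy)
```

The approximation error shrinks quadratically with the size of the displacement, which is exactly what "local linear approximation" means.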

Classifying Finite Groups
maths

Groups can be thought of as a mathematical realisation of symmetry. For example the symmetric groups are all possible permutations of n elements, and the dihedral groups are the symmetries of a regular polygon. A question mathematicians ask is: what kinds of groups are there? One way to tackle this is to try to decompose them, for instance via a composition series of normal subgroups. \[ 1 = H_0\triangleleft H_1\triangleleft \cdots \triangleleft H_n = G \]

Complex Analysis
maths

Imaginary numbers sound like a very impractical thing; surely we should only be interested in real numbers. However imaginary numbers are very convenient for understanding phenomena involving real numbers, and are useful models for periodic phenomena, as in electrical engineering and quantum mechanics. The techniques are also often useful for evaluating integrals, solving two-dimensional electrostatics problems and decomposing periodic signals. Much of mathematical analysis, topology and measure theory, by contrast, is about inapplicable, abstruse examples.

Symmetry in probability
maths

The simplest way to model the probability of a system is through symmetry. For example the concept of a "fair" coin means there are two possible outcomes that are indistinguishable. Because each result is equally likely, the outcome is 50/50 heads or tails. Similarly for a fair die there are 6 possible outcomes that are all equally likely, so each has probability 1/6. The idea of symmetry is behind random sampling.

Sunk Cost of Pure Mathematics
maths

Today I went through the painful exercise of culling my notebooks: my honours notebooks, independent research and work from textbooks and courses. These are things I spent a large part of my early life and energy on. Even though I haven't looked at them for years they are very hard to let go of. A large amount of the material is pure mathematics: notes on differential geometry, topology, and measure theory. These are particularly vexing because I don't believe they hold much real value.

Lessons from a mathematician on building a community
maths

Mathematicians and software developers have a lot in common. They both build structures of ideas, typically working in small groups or alone, but leveraging structures built by others. For software developers the ideas are concrete code implementations, the building blocks are subroutines, and they are published as "libraries" or "packages". For mathematicians the ideas are abstract, built on definitions and theorems, and published in papers, conferences and informal conversations. To grow a substantial body of work in either mathematics or software requires a community to contribute to it.

A Mixture of Bernoullis is Bernoulli
maths

Suppose you are analysing email click-through rates. People either follow the call to action or they don't, so it's a Bernoulli distribution whose parameter is the probability that a random person will click through the email. But in actuality your email list will be made up of different groups; for example people who have just signed up to the list may be more likely to click through than people who have been on it for a long time.
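The punchline is that the mixture is still Bernoulli, with the weighted average of the group rates as its parameter. A quick simulation (the group weights and rates below are made up):

```python
import random

random.seed(2)

# Two subscriber groups with different click-through rates.
weights = [0.3, 0.7]   # proportion of the list in each group
rates = [0.2, 0.05]    # per-group click probability

# The mixture is itself Bernoulli with the weighted-average rate.
mixture_rate = sum(w * p for w, p in zip(weights, rates))

# Simulate: pick a group, then a click, and compare frequencies.
n = 100000
clicks = 0
for _ in range(n):
    group = 0 if random.random() < weights[0] else 1
    clicks += random.random() < rates[group]
observed = clicks / n
```

The simulated click frequency matches the weighted-average rate, as a single Bernoulli variable would.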

Probability Squares
maths

A geometric way to represent combining two independent discrete random variables is as a probability square. On each side of the square we have the distributions of the random variables, where the length of each segment is proportional to the probability. In the centre we have the function evaluated on the two edges and the probability is proportional to the area of the rectangle. For example suppose we had a random process that generated 1, 2 or 3 with equal probability (for example half the value of a die, rounded up).
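The square picture translates directly into code: the probability of each combined outcome is the sum of the areas of the rectangles that produce it. A sketch using the three-outcome variable above and a second, made-up two-outcome variable:

```python
from collections import defaultdict
from fractions import Fraction

# Each side of the square: outcomes with their probabilities.
a = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
b = {1: Fraction(1, 2), 2: Fraction(1, 2)}

def combine(a, b, fn):
    # The probability of each pair is the area of its rectangle,
    # px * py; sum the areas that give the same fn(x, y).
    out = defaultdict(Fraction)
    for x, px in a.items():
        for y, py in b.items():
            out[fn(x, y)] += px * py
    return dict(out)

sums = combine(a, b, lambda x, y: x + y)
```

For the sum this gives probabilities 1/6, 1/3, 1/3, 1/6 for the outcomes 2 to 5, and the rectangle areas always total 1.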

The Problem with Jaccard for Clustering
data

The Jaccard Index is a useful measure of similarity between two sets. It makes sense for any two sets, is efficient to compute at scale, and its complement (the Jaccard distance) is a metric. However for clustering it has one major disadvantage: small sets are never close to large sets. Suppose you have sets that you want to cluster together for analysis. For example each set could be a website and the elements are the people who visit that website.
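The disadvantage is easy to see in a toy example (the sets here are made up):

```python
def jaccard(a, b):
    # |A intersect B| / |A union B|
    return len(a & b) / len(a | b)

small = {1, 2, 3}
large = set(range(1000))

# Even though `small` is entirely contained in `large`, the Jaccard
# index is tiny: a small set can never be close to a large one.
contained = jaccard(small, large)   # 3/1000
identical = jaccard(small, small)   # 1.0
```

So a niche website whose visitors all also visit a huge portal still gets a near-zero similarity to it, which can be the wrong behaviour for clustering.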

Jaccard Shingle Inequality
maths

Two similar documents are likely to share many phrases relative to the number of words in the documents. This is useful if you're concerned with plagiarism and copyright, are getting the same data through multiple sources, or are finding versions of the same document. MinHash can quickly find pairs of items with a high Jaccard index, and we can run it on shingles of w consecutive tokens. A hard question is: what's the right value of w?

Geometry of division rings
maths

It is fairly easy to construct a geometry from algebra: given a division ring K we form an n-dimensional vector space, the points being the elements of the vector space and a line being a translation of all (left) multiples of a non-zero vector, i.e. a set of the form \(\{a\mathbf{v} + \mathbf{c} \mid a \in K\}\) for some fixed vectors \(\mathbf{v} \neq 0\) and \(\mathbf{c}\).

Interestingly it’s just as possible to go the other way, if we’re careful about what we mean by a geometry. I will loosely follow Artin’s book Geometric Algebra. In particular we have the undefined terms of point and line, and the undefined relation lies on. Then, for a fixed positive integer d, the axioms are:

  1. Given two distinct points there is a unique line that both points lie on.
  2. Each line has at least three points which lie on it.
  3. Given a line and a point not on that line, there is a unique line in the plane containing them that passes through the point and shares no point with the given line.
  4. All points are spanned by d+1 points and no fewer.