In Bayesian statistics you have to choose a prior distribution for the parameters, which is combined with the data to give a posterior distribution. Choosing a tight prior, one that assumes the parameters live in a small region of parameter space, reduces the impact of the data on the posterior estimates. This is just like regularisation in machine learning, where adding a penalty to the loss function prevents over-fitting. This is more than just an analogy, and this article will explore the correspondence in a couple of cases in regression and classification.

A typical machine learning approach to regression is to minimise the root mean squared error. A probabilistic perspective on this is to consider the regression $$y = f_\theta(X) + \epsilon$$, where y is the outcome, X are the predictors, $$f_\theta$$ is a function parameterised by $$\theta$$, and $$\epsilon$$ is the error term. If we assume the errors are normally distributed, $$\epsilon \sim N(0, \sigma^2)$$, this is equivalent to saying that $$y \sim N(f_\theta(X), \sigma^2)$$. We then need to pick the parameters $$\theta$$ that are most likely given the data.
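
As a quick sanity check, here's a minimal sketch (with a constant model $$f_\theta(X) = \theta$$ and made-up data) that maximising the Gaussian likelihood and minimising the mean squared error pick out the same $$\theta$$, namely the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Constant model f_theta(X) = theta with made-up data y = theta_true + noise.
rng = np.random.default_rng(42)
y = 1.5 + rng.normal(0, 0.5, size=100)
sigma = 0.5

def neg_log_likelihood(theta):
    # Gaussian negative log likelihood, up to terms constant in theta.
    return np.sum((y - theta) ** 2) / (2 * sigma ** 2)

def mse(theta):
    return np.mean((y - theta) ** 2)

mle = minimize_scalar(neg_log_likelihood).x
least_squares = minimize_scalar(mse).x
print(mle, least_squares, y.mean())  # all three agree
```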

The Bayesian perspective on this is that if we have a prior on the parameters $$p(\theta)$$ and data $$X_i, y_i$$, then the posterior is $$p(\theta \vert \{X_i, y_i\}_i) = \frac{p(\{X_i, y_i\}_i \vert \theta) p(\theta)}{p(\{X_i,y_i\}_i)}$$. In Bayesian statistics we estimate the whole distribution, but we can focus on the maximum a posteriori (MAP) estimator, the value of $$\theta$$ that maximises the posterior probability. Since the logarithm is a monotonic function, the maximum posterior occurs at the same point as the maximum log posterior. Taking the logarithm and plugging in the normal distribution for $$p(X_i, y_i \vert \theta)$$ gives $$l(\theta, \sigma) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (f_{\theta}(X_i) - y_i)^2 + \log(p(\theta)) - N \log(\sigma) + c$$ for some constant c. In the case of a flat prior, $$p(\theta) \propto 1$$, the MAP estimator is equivalent to minimising the (root) mean squared error. In general, though, the prior acts as a regulariser; for example, a normally distributed prior on the parameters reduces to Tikhonov Regularisation. We could pick other prior distributions to recover an $$L^p$$ regularisation, and in particular a Laplace distribution recovers the LASSO.
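
To make the Tikhonov case concrete, here's a sketch assuming a linear model and known $$\sigma$$ and $$\tau$$ (both made up here): the MAP estimate under a $$N(0, \tau^2 I)$$ prior matches ridge regression with penalty $$\lambda = \sigma^2 / \tau^2$$.

```python
import numpy as np
from scipy.optimize import minimize

# Linear model y = X @ theta + noise with prior theta ~ N(0, tau^2 I).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
sigma, tau = 1.0, 0.7
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, sigma, size=200)

# Ridge closed form: (X'X + lambda I)^-1 X'y with lambda = sigma^2 / tau^2.
lam = sigma ** 2 / tau ** 2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MAP: minimise the negative log posterior
# ||X theta - y||^2 / (2 sigma^2) + ||theta||^2 / (2 tau^2).
def neg_log_posterior(theta):
    return (np.sum((X @ theta - y) ** 2) / (2 * sigma ** 2)
            + np.sum(theta ** 2) / (2 * tau ** 2))

map_estimate = minimize(neg_log_posterior, np.zeros(3)).x
print(ridge, map_estimate)  # the two coincide (up to optimiser tolerance)
```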

There's more here: in Bayesian statistics people tend to use a Horseshoe Prior rather than a Laplace distribution, and Michael Betancourt has an article on my reading list, Bayes Sparse Regression, that goes through the trade-offs between different regularising priors.

# Binary classification

Similar ideas apply in binary classification, where the metric is typically binary cross entropy. From a probabilistic perspective we can assume the data comes from a Bernoulli distribution, $$y \sim \mathrm{Bernoulli}(f_\theta(X))$$. Here $$p(X_i, y_i \vert \theta) = f_\theta(X_i)^{y_i} (1 - f_\theta(X_i))^{1-y_i}$$ (keeping in mind that $$y_i$$ can only take the values 0 or 1). Then, as in the normal regression case, we can find the MAP estimator by maximising the log posterior $$l(\theta) = \sum_{y_i = 1} \log(f_\theta (X_i)) + \sum_{y_i=0} \log(1 - f_\theta(X_i)) + \log(p(\theta)) + c$$. With a flat prior, maximising this is equivalent to minimising the binary cross entropy.
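
As a quick numeric check (with made-up probabilities $$p_i$$ standing in for $$f_\theta(X_i)$$), the binary cross entropy is just the negative Bernoulli log likelihood averaged over the data:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.8, 0.1])  # made-up predicted probabilities

# Binary cross entropy as usually written.
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
# Bernoulli log likelihood: log(p_i) where y_i = 1, log(1 - p_i) where y_i = 0.
log_lik = np.sum(np.where(y == 1, np.log(p), np.log(1 - p)))
print(bce, -log_lik / len(y))  # identical
```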

Consider in particular the constant model $$f_\theta(X_i) = \theta$$, for which this reduces to $$l(\theta) = s \log(\theta) + (N-s) \log(1-\theta) + \log(p(\theta))$$, where s is the number of successes and N is the total number of trials. A bit of calculus and algebra, spelled out below, shows that with a flat prior this is maximised at $$\hat{\theta} = \frac{s}{N}$$.
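
Differentiating the flat-prior log likelihood and setting the derivative to zero gives

$$\frac{dl}{d\theta} = \frac{s}{\theta} - \frac{N - s}{1 - \theta} = 0 \implies s(1 - \theta) = (N - s)\,\theta \implies \hat{\theta} = \frac{s}{N}$$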

One problem with this is that the standard error of the estimate is $$\sqrt{\frac{\theta(1-\theta)}{N}}$$, so if we have 0 or N successes the estimated standard error is 0, which in most cases isn't right: the true probability is rarely exactly 0 or 1. A method for handling this, which I learned from the book Regression and Other Stories, is to set a prior of $${\rm Beta}(3,3)$$, which is equivalent to adding 4 extra trials with 2 successes. Then the maximum a posteriori estimate of the parameter is $$\hat{\theta} = \frac{s+2}{N+4}$$ and the estimated standard error is always non-zero.
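
Here's a minimal sketch of that adjustment; plugging the MAP estimate into the standard error formula with $$N + 4$$ trials is my convention for illustration, not necessarily the book's:

```python
import numpy as np

def estimate(s, N):
    """Flat-prior vs Beta(3,3)-prior estimates of a success probability."""
    theta_mle = s / N              # flat prior
    theta_map = (s + 2) / (N + 4)  # Beta(3,3): 4 extra trials, 2 successes
    se = lambda t, n: np.sqrt(t * (1 - t) / n)
    return theta_mle, se(theta_mle, N), theta_map, se(theta_map, N + 4)

print(estimate(0, 10))  # flat prior: 0.0 with se 0.0; MAP: 1/7 with a sane se
print(estimate(7, 10))  # with moderate data the two estimates are close
```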

In the log likelihood this prior adds a penalty of $$\log(\theta^2 (1-\theta)^2) + c'$$, for some constant $$c'$$. Writing $$\psi = \theta - \frac{1}{2}$$ and rearranging gives the penalty, up to a constant, as $$2 \log(\frac{1}{4} - \psi^2)$$. For small $$\psi$$ a Taylor expansion gives, again up to a constant, $$-8 \psi^2 = -8 (\theta - \frac{1}{2})^2$$. So near $$\theta = \frac{1}{2}$$ this penalty looks like an $$l^2$$ penalty (I suspect for the same reason a Binomial converges to a Gaussian for large samples and moderate probabilities).
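
A quick numeric check of that approximation, shifting both penalties to zero at $$\theta = \frac{1}{2}$$ so the constants drop out:

```python
import numpy as np

# Exact log Beta(3,3) penalty vs its quadratic approximation near theta = 1/2.
theta = np.array([0.5, 0.45, 0.4, 0.3, 0.1])
exact = 2 * np.log(theta * (1 - theta)) - 2 * np.log(0.25)  # shifted to 0 at 1/2
approx = -8 * (theta - 0.5) ** 2
for t, e, a in zip(theta, exact, approx):
    print(f"theta={t:.2f}  exact={e:+.4f}  quadratic={a:+.4f}")
# Close near 1/2; the exact penalty dives to -inf as theta -> 0 or 1.
```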

What's interesting here is that the Beta prior gives a more reasonable and understandable regularisation than $$l^2$$ regularisation, especially for probabilities close to 0 or 1. I would never have thought of a log Beta penalty, but thinking of it as a prior it makes really good sense. On the other hand, switching to a maximum a posteriori estimate, and thinking of the prior as a penalty, makes things much quicker to calculate than estimating the whole posterior. There's also a Wikipedia article on the Bayesian interpretation of Kernel Regularisation; it's useful being able to switch between the two viewpoints.