# Stan Linear Priors

This is the second in a series of articles showing the basics of building models in Stan and accessing them in R. In the previous article I showed how to specify a simple linear model with flat priors in Stan, and fit it in R with a formula syntax. In this article we extend this to specify priors, defaulting to general weakly informative priors but allowing the use of specific priors.

# Priors in linear regression

In our previous model \(y \sim N(\alpha + \beta x, \sigma)\), we didn’t specify any priors for our parameters, the intercept \(\alpha\), the coefficients \(\beta\) and the residual standard deviation \(\sigma\). We can extend our Stan model to take data specifying these priors, and to declare the priors themselves in the model.

Following `rstanarm::stan_glm`, it would be nice if the model used a reasonable default prior whenever you don’t specify one. Following the discussion in *Regression and Other Stories*, Section 9.5, we can use the same weak priors they use, which keep inferences stable but don’t have much impact on the estimates.

The default prior for the coefficients is \(\beta \sim N(0, 2.5 s_y/s_x)\). Centring on 0 makes sense when we don’t know which direction the coefficients should lie in. The ratio of standard deviations is important for the prior to be invariant under rescaling transformations. If we were to rescale \(y' = k y\) and \(x' = A x\), then the coefficients would scale as \(\beta' = kA^{-1} \beta\), so our prior should rescale in an analogous way. The factor 2.5, quoting from *Regression and Other Stories*, “is somewhat arbitrary, chosen to provide some stability in estimation while having little influence on the coefficient estimate when data are even moderately informative”. Perhaps the worst part of this assumption is that as you add more coefficients (and especially interactions) the prior stays the same and the coefficients are all treated as independent. Perhaps a better approach would be a joint distribution where some of the coefficients were more spread than others, since in many cases as you get more predictors a few of them may have a significant association but most will not.
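To make the rescaling argument concrete, here is a small sketch in base R. The data and the factors `k` and `A` are made up for illustration, and `default_scale` and `slope` are hypothetical helpers, not part of `rstanarm`.

```r
# Hypothetical helper: default prior scale for a single coefficient
default_scale <- function(x, y) 2.5 * sd(y) / sd(x)

# Hypothetical helper: least-squares slope of y on x
slope <- function(x, y) unname(coef(lm(y ~ x))[2])

# Made-up illustrative data
x <- c(1, 2, 4, 7, 11)
y <- c(3, 5, 4, 9, 12)

k <- 100   # rescale y, e.g. a change of units
A <- 0.5   # rescale x

# Both the prior scale and the fitted slope change by the same factor k / A
scale_ratio <- default_scale(A * x, k * y) / default_scale(x, y)
slope_ratio <- slope(A * x, k * y) / slope(x, y)
c(scale_ratio = scale_ratio, slope_ratio = slope_ratio, expected = k / A)
```

Because the prior scale moves in step with the coefficient, the prior says the same thing about the data whatever units we happen to measure it in.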

The default weak prior for the intercept \(\alpha\) is given indirectly, by assigning a prior to the expected value of y at the mean value of x: it is normally distributed with mean the mean value of y, and standard deviation 2.5 times the standard deviation of y; that is \(E(y | x=\bar{x}) \sim N(\bar{y}, 2.5 s_y)\). Essentially we’re saying that at the centre of x, the data should be near the centre of y, and the error scales with the standard deviation of y (using a similar rescaling argument as above), again picking 2.5 as a somewhat arbitrary factor. This is better than putting a prior directly on the intercept, because it’s invariant under a translation of \(x\) or \(y\); we’re always evaluating near the centre of the data (where we’re likely to have the most information). The expected value of y in our model is precisely \(\alpha + \beta x\), so we can rearrange this into \(\alpha \sim N(\bar{y} - \beta \bar{x}, 2.5 s_y)\).
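Spelling out the rearrangement, and checking the translation claim (here \(c\) is an arbitrary shift introduced only for this check):

\[\begin{align} E(y \mid x = \bar{x}) = \alpha + \beta \bar{x} &\sim N(\bar{y}, 2.5 s_y) \\ \Rightarrow \quad \alpha \mid \beta &\sim N(\bar{y} - \beta \bar{x}, 2.5 s_y) \end{align}\]

If we shift \(x' = x + c\), the same line is written with intercept \(\alpha' = \alpha - \beta c\), and the mean becomes \(\bar{x}' = \bar{x} + c\). The prior centre \(\bar{y} - \beta \bar{x}' = (\bar{y} - \beta \bar{x}) - \beta c\) therefore shifts by exactly the same amount as the intercept, so the prior describes the same set of lines before and after the translation.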

Finally, the assumed prior for the residual standard deviation is \(\sigma \sim {\rm Exponential}(1/s_y)\). This means in particular that the expected value is \(s_y\), which is reasonable from scaling assumptions, and that the value is non-negative. I’m not sure how reasonable the assumption in the distribution itself is, but I’ll take it as a given.
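As a quick check of what this prior implies, the sketch below uses base R’s exponential distribution functions. The value of `s_y` is made up for illustration.

```r
s_y <- 4.4          # made-up standard deviation of y, for illustration only
rate <- 1 / s_y

# Mean of Exponential(rate), computed by numerical integration
prior_mean <- integrate(function(s) s * dexp(s, rate), 0, Inf)$value

# 5%, 50% and 95% quantiles of the prior on sigma, as multiples of s_y
prior_quantiles <- qexp(c(0.05, 0.5, 0.95), rate) / s_y

prior_mean
prior_quantiles
```

The mean comes out as `s_y`, the median is about 0.69 times `s_y`, and 90% of the prior mass lies between roughly 0.05 and 3 times `s_y`, so the prior is fairly permissive about the residual scale.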

We further want to be able to extend from these default priors to enable passing informative priors. We can directly extend the prior for the coefficients to take a centre vector and standard deviation vector (or more generally a covariance matrix) that can be passed in place of the default priors. For the intercept \(\alpha\) we could similarly specify a centre point and standard deviation, but to conform with the weak prior form we could pretend the data is centred, \(\bar{x} = 0\), so the \(\beta\) coefficient has no influence on the prior. Finally for the standard deviation we could pass a different parameter for the exponential distribution than the inverse standard deviation of y.

# Writing a Stan Model

With this plan we extend our Stan data to include the centre and scale of each prior:

\[\begin{align} \beta &\sim N(\mu_\beta, s_\beta) \\ \alpha &\sim N(\mu_\alpha - \beta \bar{x}, s_\alpha)\\ \sigma &\sim {\rm Exponential}(1/{\mu_\sigma}) \end{align}\]

Following `rstanarm`, I refer to the centre as the `location` and the standard deviation as the `scale`, and I call the parameter in the exponential distribution the `rate`. Note that the priors are specified as part of the model.

```
// Linear model - linear.stan
data {
  int<lower=0> N; // Number of data points
  int<lower=0> K; // Number of predictors
  matrix[N, K] X; // Predictor matrix
  real y[N]; // Observations
  // NEW: Data specifying priors
  vector[K] prior_location; // Coefficient Normal Prior - centre
  vector[K] prior_scale; // Coefficient Normal Prior - standard deviation
  real prior_intercept_location; // Intercept Normal Prior - centre
  matrix[1, K] prior_intercept_predictor; // Intercept Normal Prior - offset centre by -beta * prior_intercept_predictor
  real prior_intercept_scale; // Intercept Normal Prior - standard deviation
  real prior_aux_rate; // Exponential prior on sigma
}
parameters {
  real alpha; // intercept
  vector[K] beta; // coefficients for predictors
  real<lower=0> sigma; // error scale
}
model {
  // NEW: Prior distributions
  beta ~ normal(prior_location, prior_scale);
  alpha ~ normal(prior_intercept_location - prior_intercept_predictor * beta, prior_intercept_scale);
  sigma ~ exponential(prior_aux_rate);
  // Target density
  y ~ normal(alpha + X * beta, sigma);
}
```

# Running the model from R

As before we can wrap this in a function, adding extra parameters for the priors. Note I set the defaults to `FALSE`; it would have made more sense to use `NULL`, but I had an idea that I could copy `rstanarm`’s approach of using `NULL` for a flat prior before realising I’d need to do a lot of work to add that kind of flexibility.

One thing that caught me out is that a vector of length 1 will be treated as a scalar, not a vector, by Stan (because it’s hard to distinguish these in R), so we need to wrap prior vectors passed to RStan in `array`. From the RStan vignette: “If we want to prevent RStan from treating the input data for y as a scalar when N is 1, we need to explicitly make it an array”.

```
# remove_intercept_from_model() and get_linear_names() are helper functions
# that are not shown in this article.
fit_stan_linear <- function(formula, data,
                            ...,
                            prior_location=FALSE,
                            prior_scale=FALSE,
                            prior_intercept_location=FALSE,
                            prior_intercept_scale=FALSE,
                            prior_aux_rate=FALSE) {
  y <- model.response(model.frame(formula, data))
  X <- remove_intercept_from_model(model.matrix(formula, data))

  K <- ncol(X)
  N <- nrow(X)

  # Weakly informative defaults for any prior that was not passed in
  if (isFALSE(prior_location)) {
    prior_location <- rep(0, K)
  }
  if (isFALSE(prior_scale)) {
    prior_scale <- 2.5 * sd(y) / apply(X, 2, sd)
  }
  if (isFALSE(prior_intercept_scale)) {
    prior_intercept_scale <- 2.5 * sd(y)
  }
  if (isFALSE(prior_aux_rate)) {
    prior_aux_rate <- 1/sd(y)
  }
  if (isFALSE(prior_intercept_location)) {
    # Default: prior on the expected value of y at the mean of the predictors
    prior_intercept_location <- mean(y)
    prior_intercept_predictor <- matrix(apply(X, 2, mean), ncol=K)
  } else {
    # When a specific location is set, remove the effect of predictor offset
    # by setting it to 0
    prior_intercept_predictor <- matrix(rep(0,K), ncol=K)
  }

  fit <- rstan::stan(
    file = "linear.stan",
    data = list(
      N = N,
      K = K,
      X = X,
      y = y,
      prior_intercept_predictor = prior_intercept_predictor,
      prior_intercept_location = prior_intercept_location,
      # Need array when there is just 1 predictor
      prior_scale = array(prior_scale, dim=K),
      prior_location = array(prior_location, dim=K),
      # Names must match the data block in linear.stan
      prior_intercept_scale = prior_intercept_scale,
      prior_aux_rate = prior_aux_rate
    ),
    ...
  )

  names(fit) <- get_linear_names(names(fit), colnames(X))
  structure(list(fit=fit, formula=formula, data=data), class=c("my_linstan"))
}
```
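The `array` wrapping used above for the prior vectors can be seen in isolation with base R. This is only a sketch of what changes on the R side; the values are made up.

```r
K <- 1
prior_scale <- 0.2

# A bare length-1 numeric has no dim attribute, so it looks like a scalar
is.null(dim(prior_scale))

# Wrapping it gives a one-dimensional array of length K
wrapped <- array(prior_scale, dim = K)
dim(wrapped)

# array() recycles its input, so a single value also fills a longer vector
recycled <- array(0.2, dim = 3)
recycled
```

The recycling is convenient here: passing a single `prior_scale` still produces a vector of the right length when there are several predictors.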

# Testing using priors

As a test of this functionality let’s compare `rstanarm::stan_glm` with our function on the SexRatio data from Section 9.5 of *Regression and Other Stories* (inspired by a study of the effect of beauty on the sex ratio of children, where the data are weak and the prior is tight).

We have a small data set of 5 points, representing the percentage of girl babies \(y\), as a function of standardised beauty \(x\).

```
x <- seq(-2,2,1)
y <- c(50, 44, 50, 47, 56)
sexratio <- data.frame(x, y)
```
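For this data set, the default priors described earlier work out as follows. This is a sketch that applies the formulas directly to the five points; the variable names mirror the function arguments, and the approximate values in the comments are rounded.

```r
x <- seq(-2, 2, 1)
y <- c(50, 44, 50, 47, 56)

prior_scale <- 2.5 * sd(y) / sd(x)     # sd of the Normal prior on beta, about 7.0
prior_intercept_location <- mean(y)    # centre of the prior on E(y | x = mean(x)), 49.4
prior_intercept_scale <- 2.5 * sd(y)   # sd of that prior, about 11.1
prior_aux_rate <- 1 / sd(y)            # rate of the Exponential prior on sigma, about 0.22
```

A prior standard deviation of about 7 percentage points per unit of beauty is very wide compared with any plausible effect, which is why the default fit stays close to least squares.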

The weakly informative priors give similar results to the maximum likelihood estimator found by `lm`.

```
library(rstanarm)  # provides stan_glm

fit_sexratio_lm <- lm(y~x, data=sexratio)
fit_sexratio_default <- stan_glm(y ~ x, data=sexratio)
fit_sexratio_default_stan <- fit_stan_linear(y ~ x, data=sexratio)
```

The coefficients from all three fits are very close to the output below (the residual standard deviation estimated by `lm` is a little lower at 4.3).

```
            Median MAD_SD
(Intercept) 49.3    1.9
x            1.4    1.4

Auxiliary parameter(s):
      Median MAD_SD
sigma 4.6    1.7
```

However, we can add an informative prior (which acts as regularisation on the coefficients): the rate of girl births is around 48.5% to 49%, and based on prior studies we wouldn’t expect beauty to have more than a 0.8 percentage point impact on the rate of girl births.

```
fit_sexratio_post <- stan_glm(y ~ x, data=sexratio, prior=normal(0,0.2), prior_intercept=normal(48.8, 0.5))
fit_sexratio_post_stan <- fit_stan_linear(y ~ x, data=sexratio,
                                          prior_location=0,
                                          prior_scale=0.2,
                                          prior_intercept_location=48.8,
                                          prior_intercept_scale=0.5)
```

These give identical coefficient estimates to one decimal place.

```
            Median MAD_SD
(Intercept) 48.8    0.5
x            0.0    0.2

Auxiliary parameter(s):
      Median MAD_SD
sigma 4.3    1.3
```
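These numbers can be roughly sanity-checked without Stan by combining each informative prior with the least-squares estimate as a precision-weighted average. This is only an approximation: it treats the `lm` standard errors as known and relies on `x` being centred so the intercept and slope can be handled separately. `combine` is a hypothetical helper written for this check.

```r
x <- seq(-2, 2, 1)
y <- c(50, 44, 50, 47, 56)
est <- summary(lm(y ~ x))$coefficients   # columns include "Estimate" and "Std. Error"

# Precision-weighted average of a Normal prior and a Normal estimate
combine <- function(prior_mean, prior_sd, est_mean, est_se) {
  w_prior <- 1 / prior_sd^2
  w_data <- 1 / est_se^2
  c(mean = (w_prior * prior_mean + w_data * est_mean) / (w_prior + w_data),
    sd = sqrt(1 / (w_prior + w_data)))
}

intercept <- combine(48.8, 0.5, est["(Intercept)", "Estimate"], est["(Intercept)", "Std. Error"])
slope <- combine(0, 0.2, est["x", "Estimate"], est["x", "Std. Error"])
round(rbind(intercept, slope), 1)
```

To one decimal place this gives 48.8 and 0.5 for the intercept and 0.0 and 0.2 for the slope, in line with the fitted output: with only five noisy points, the informative priors dominate the posterior.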

Now that we know how to fit a simple linear model in Stan and add priors, it would be nice if we could make predictions and take posterior draws from it. That’s covered in the next article, on making Bayesian predictions with Stan and R.