I was working through problem 15.5 of Regression and Other Stories, which asks to fit a probit regression to a previous example with a logistic regression. I used a model I had built on the National Election Survey dataset (on rstanarm 2.21.1):

fit_nes_probit <-
rstanarm::stan_glm(rvote ~ income_int_std + gender + race +
region + religion + education_cts +
advanced_degree + party + ideology3 +
gender : party,
data=nes92)

When I got this error about the chains not converging:

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: Rejecting initial value:
Chain 1:   Log probability evaluates to log(0), i.e. negative infinity.
Chain 1:   Stan can't start sampling from this initial value.
...
Chain 1:
Chain 1: Initialization between (-2, 2) failed after 100 attempts.
Chain 1:  Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler\$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done
Error in check_stanfit(stanfit) :
Invalid stanfit object produced please report bug

Even worse it crashed my Jupyter IRkernel, and the only way I could get it running again was to restart the whole process! I continued to debug it in an R console which wouldn't crash after a fail.

This surprised me because I found it worked fine with link="logit", that is logistic regression. The solution was to set a lower scale value in the prior (the default is 4):

fit_nes_probit <-
rstanarm::stan_glm(rvote ~ income_int_std + gender + race +
region + religion + education_cts +
advanced_degree + party + ideology3 +
gender : party,
prior=rstanarm::normal(scale=0.5, autoscale=TRUE),
data=nes92)

Why this works

To work out what was going on I started trying to simulate similar data that led to the same error. After some experimentation I found a simple reproducable failing example with a few noise predictors:

N <- 1000
fake_data <- data.frame(a=rbinom(N,1,0.5),
b=rbinom(N,1,0.5),
c=rbinom(N,1,0.5),
d=rbinom(N,1,0.5),
e=rbinom(N,1,0.5),
y=rbinom(N,1,0.5))

fit <- rstanarm::stan_glm(y ~ a + b + c + d + e
data=fake_data)

This seems to have a better than 25% chance of failing because one of the chains fails most times I run it. If I remove just one of the predictors it succeeds.

I got suspicious about the priors, in Regression and Other Stories they state (at least for logistic regression) the priors on the coefficients are 2.5/sd(x). Here the standard deviation of all our predictors is just 0.5 (since they're binomial with probaibility 0.5). I tried to check this on a model that fit:

fit <- rstanarm::stan_glm(y ~ a + b + c + d,
data=fake_data)
prior_summary(fit)

Which resulted in:

Priors for model 'fit'
------
Intercept (after predictors centered)
~ normal(location = 0, scale = 4)

Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [4,4,4,...])
~ normal(location = [0,0,0,...], scale = [7.98,8.00,7.97,...])
------

So it looks like the default scale was 4, and they've been increased by the inverse standard deviation, that is a factor of 2.

The next thing I tried was a flat prior on the coefficients, which converged (to my surprise):

fit_flat <- rstanarm::stan_glm(y ~ a + b + c + d + e,
prior=NULL,
data=fake_data)

The next idea I had was to make the priors slightly tighter, to try to make the coefficients closer to 0. Some experimentation found it would start to converse consistently around a scale of 1:

fit_weak <- rstanarm::stan_glm(y ~ a + b + c + d + e,
prior=rstanarm::normal(scale=1, autoscale=TRUE),
data=fake_data)

I found for the NES model I was fitting I had to further reduce the scale to 0.5 for it to converge, or alternatively set a flat prior.

I don’t understand how Stan works, but I’m guessing as we add more predictors there’s a greater chance of getting one that is far from it’s actual value (in this case 0). I suspect because the probabilities in the normal distribution go to zero much faster than in the logistic distribution, this leads to numerical errors in link="probit" that don’t occur in link="logit". I've asked in the Stan Discourse to see if anyone can give more insight.

Getting the data

For reproducability, here's how the nes92 dataset above was generated, with a bit of feature engineering.

library(dplyr)

nes92 <-
nes %>%
filter(year == 1992) %>%
# Only people who actually voted republican or democrat
filter(!is.na(presvote_2party)) %>%
transmute(
age= age,
gender=gender,
race=race,
region=region,
income=income,
religion=religion,
education=educ3,
party = partyid3_b,
rvote = presvote_2party == '2. republican',
ideology = ideo7, # ideo is just a summary
) %>%
select(-father_party, -mother_party) %>%
filter_all(~!is.na(.))  %>%
mutate(income_int = as.integer(income) - 1,
income_int_std = (income_int - mean(income_int))/(2*sd(income_int)),
ideology3 = if_else(ideology_int < 0, "liberal", if_else(ideology_int > 0, "conservative", "moderate")))