I was working through problem 15.5 of Regression and Other Stories, which asks to fit a probit regression to a previous example with a logistic regression.
I used a model I had built on the National Election Survey dataset (on
fit_nes_probit <- rstanarm::stan_glm(rvote ~ income_int_std + gender + race + region + religion + education_cts + advanced_degree + party + ideology3 + gender : party, family=binomial(link="probit"), data=nes92)
When I got this error about the chains not converging:
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1). Chain 1: Rejecting initial value: Chain 1: Log probability evaluates to log(0), i.e. negative infinity. Chain 1: Stan can't start sampling from this initial value. ... Chain 1: Chain 1: Initialization between (-2, 2) failed after 100 attempts. Chain 1: Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.  "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed." error occurred during calling the sampler; sampling not done Error in check_stanfit(stanfit) : Invalid stanfit object produced please report bug
Even worse it crashed my Jupyter IRkernel, and the only way I could get it running again was to restart the whole process! I continued to debug it in an R console which wouldn't crash after a fail.
This surprised me because I found it worked fine with
link="logit", that is logistic regression.
The solution was to set a lower scale value in the prior (the default is 4):
fit_nes_probit <- rstanarm::stan_glm(rvote ~ income_int_std + gender + race + region + religion + education_cts + advanced_degree + party + ideology3 + gender : party, prior=rstanarm::normal(scale=0.5, autoscale=TRUE), family=binomial(link="probit"), data=nes92)
Why this works
To work out what was going on I started trying to simulate similar data that led to the same error. After some experimentation I found a simple reproducable failing example with a few noise predictors:
N <- 1000 fake_data <- data.frame(a=rbinom(N,1,0.5), b=rbinom(N,1,0.5), c=rbinom(N,1,0.5), d=rbinom(N,1,0.5), e=rbinom(N,1,0.5), y=rbinom(N,1,0.5)) fit <- rstanarm::stan_glm(y ~ a + b + c + d + e family=binomial(link="probit"), data=fake_data)
This seems to have a better than 25% chance of failing because one of the chains fails most times I run it. If I remove just one of the predictors it succeeds.
I got suspicious about the priors, in Regression and Other Stories they state (at least for logistic regression) the priors on the coefficients are
Here the standard deviation of all our predictors is just 0.5 (since they're binomial with probaibility 0.5).
I tried to check this on a model that fit:
fit <- rstanarm::stan_glm(y ~ a + b + c + d, family=binomial(link="probit"), data=fake_data) prior_summary(fit)
Which resulted in:
Priors for model 'fit' ------ Intercept (after predictors centered) ~ normal(location = 0, scale = 4) Coefficients Specified prior: ~ normal(location = [0,0,0,...], scale = [4,4,4,...]) Adjusted prior: ~ normal(location = [0,0,0,...], scale = [7.98,8.00,7.97,...]) ------
So it looks like the default scale was 4, and they've been increased by the inverse standard deviation, that is a factor of 2.
The next thing I tried was a flat prior on the coefficients, which converged (to my surprise):
fit_flat <- rstanarm::stan_glm(y ~ a + b + c + d + e, family=binomial(link="probit"), prior=NULL, data=fake_data)
The next idea I had was to make the priors slightly tighter, to try to make the coefficients closer to 0. Some experimentation found it would start to converse consistently around a scale of 1:
fit_weak <- rstanarm::stan_glm(y ~ a + b + c + d + e, family=binomial(link="probit"), prior=rstanarm::normal(scale=1, autoscale=TRUE), data=fake_data)
I found for the NES model I was fitting I had to further reduce the scale to 0.5 for it to converge, or alternatively set a flat prior.
I don’t understand how Stan works, but I’m guessing as we add more predictors there’s a greater chance of getting one that is far from it’s actual value (in this case 0).
I suspect because the probabilities in the normal distribution go to zero much faster than in the logistic distribution, this leads to numerical errors in
link="probit" that don’t occur in
I've asked in the Stan Discourse to see if anyone can give more insight.
Getting the data
For reproducability, here's how the
nes92 dataset above was generated, with a bit of feature engineering.
library(dplyr) nes <- foreign::read.dta('https://raw.githubusercontent.com/avehtari/ROS-Examples/master/NES/data/nes5200_processed_voters_realideo.dta') nes92 <- nes %>% filter(year == 1992) %>% # Only people who actually voted republican or democrat filter(!is.na(presvote_2party)) %>% transmute( age= age, gender=gender, race=race, region=region, income=income, religion=religion, education=educ3, party = partyid3_b, rvote = presvote_2party == '2. republican', ideology = ideo7, # ideo is just a summary ) %>% select(-father_party, -mother_party) %>% filter_all(~!is.na(.)) %>% mutate(income_int = as.integer(income) - 1, income_int_std = (income_int - mean(income_int))/(2*sd(income_int)), advanced_degree = education == '7. advanced degrees incl. llb', education_cts = (as.integer(education) - 5)/6, ideology_int = (as.numeric(ideology) - 5)/6, ideology3 = if_else(ideology_int < 0, "liberal", if_else(ideology_int > 0, "conservative", "moderate")))