I was working through problem 15.5 of *Regression and Other Stories*, which asks to fit a probit regression to a previous example with a logistic regression.
I used a model I had built on the National Election Survey dataset (on `rstanarm`

2.21.1):

```
fit_nes_probit <-
rstanarm::stan_glm(rvote ~ income_int_std + gender + race +
region + religion + education_cts +
advanced_degree + party + ideology3 +
gender : party,
family=binomial(link="probit"),
data=nes92)
```

When I got this error about the chains not converging:

```
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: Rejecting initial value:
Chain 1: Log probability evaluates to log(0), i.e. negative infinity.
Chain 1: Stan can't start sampling from this initial value.
...
Chain 1:
Chain 1: Initialization between (-2, 2) failed after 100 attempts.
Chain 1: Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
[1] "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed."
error occurred during calling the sampler; sampling not done
Error in check_stanfit(stanfit) :
Invalid stanfit object produced please report bug
```

Even worse it crashed my Jupyter IRkernel, and the only way I could get it running again was to restart the whole process! I continued to debug it in an R console which wouldn't crash after a fail.

This surprised me because I found it worked fine with `link="logit"`

, that is logistic regression.
The solution was to set a lower scale value in the prior (the default is 4):

```
fit_nes_probit <-
rstanarm::stan_glm(rvote ~ income_int_std + gender + race +
region + religion + education_cts +
advanced_degree + party + ideology3 +
gender : party,
prior=rstanarm::normal(scale=0.5, autoscale=TRUE),
family=binomial(link="probit"),
data=nes92)
```

# Why this works

To work out what was going on I started trying to simulate similar data that led to the same error. After some experimentation I found a simple reproducable failing example with a few noise predictors:

```
N <- 1000
fake_data <- data.frame(a=rbinom(N,1,0.5),
b=rbinom(N,1,0.5),
c=rbinom(N,1,0.5),
d=rbinom(N,1,0.5),
e=rbinom(N,1,0.5),
y=rbinom(N,1,0.5))
fit <- rstanarm::stan_glm(y ~ a + b + c + d + e
family=binomial(link="probit"),
data=fake_data)
```

This seems to have a better than 25% chance of failing because one of the chains fails most times I run it. If I remove just one of the predictors it succeeds.

I got suspicious about the priors, in *Regression and Other Stories* they state (at least for logistic regression) the priors on the coefficients are `2.5/sd(x)`

.
Here the standard deviation of all our predictors is just 0.5 (since they're binomial with probaibility 0.5).
I tried to check this on a model that fit:

```
fit <- rstanarm::stan_glm(y ~ a + b + c + d,
family=binomial(link="probit"),
data=fake_data)
prior_summary(fit)
```

Which resulted in:

```
Priors for model 'fit'
------
Intercept (after predictors centered)
~ normal(location = 0, scale = 4)
Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [4,4,4,...])
Adjusted prior:
~ normal(location = [0,0,0,...], scale = [7.98,8.00,7.97,...])
------
```

So it looks like the default scale was 4, and they've been increased by the inverse standard deviation, that is a factor of 2.

The next thing I tried was a flat prior on the coefficients, which converged (to my surprise):

```
fit_flat <- rstanarm::stan_glm(y ~ a + b + c + d + e,
family=binomial(link="probit"),
prior=NULL,
data=fake_data)
```

The next idea I had was to make the priors slightly tighter, to try to make the coefficients closer to 0. Some experimentation found it would start to converse consistently around a scale of 1:

```
fit_weak <- rstanarm::stan_glm(y ~ a + b + c + d + e,
family=binomial(link="probit"),
prior=rstanarm::normal(scale=1, autoscale=TRUE),
data=fake_data)
```

I found for the NES model I was fitting I had to further reduce the scale to 0.5 for it to converge, or alternatively set a flat prior.

I don’t understand how Stan works, but I’m guessing as we add more predictors there’s a greater chance of getting one that is far from it’s actual value (in this case 0).
I suspect because the probabilities in the normal distribution go to zero much faster than in the logistic distribution, this leads to numerical errors in `link="probit"`

that don’t occur in `link="logit"`

.
I've asked in the Stan Discourse to see if anyone can give more insight.

# Getting the data

For reproducability, here's how the `nes92`

dataset above was generated, with a bit of feature engineering.

```
library(dplyr)
nes <- foreign::read.dta('https://raw.githubusercontent.com/avehtari/ROS-Examples/master/NES/data/nes5200_processed_voters_realideo.dta')
nes92 <-
nes %>%
filter(year == 1992) %>%
# Only people who actually voted republican or democrat
filter(!is.na(presvote_2party)) %>%
transmute(
age= age,
gender=gender,
race=race,
region=region,
income=income,
religion=religion,
education=educ3,
party = partyid3_b,
rvote = presvote_2party == '2. republican',
ideology = ideo7, # ideo is just a summary
) %>%
select(-father_party, -mother_party) %>%
filter_all(~!is.na(.)) %>%
mutate(income_int = as.integer(income) - 1,
income_int_std = (income_int - mean(income_int))/(2*sd(income_int)),
advanced_degree = education == '7. advanced degrees incl. llb',
education_cts = (as.integer(education) - 5)/6,
ideology_int = (as.numeric(ideology) - 5)/6,
ideology3 = if_else(ideology_int < 0, "liberal", if_else(ideology_int > 0, "conservative", "moderate")))
```