Making Bayesian Predictions with Stan and R
stan

This is the third in a series of articles showing the basics of building models in Stan and accessing them in R. Now that we can specify a linear model, fit it with formula syntax, and specify priors for the model, it would be useful to make predictions with it. In principle, making predictions from our linear model \( y \sim N(\alpha + \beta x, \sigma)\) is easy: to make point predictions we take central estimates of the coefficients, \(\hat{\alpha}\) and \(\hat{\beta}\), and estimate \( y \approx \hat{\alpha} + \hat{\beta} x\).
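As a rough illustration of the point-prediction step, here is a minimal sketch in plain Python. The posterior draws are made-up numbers (not output from any real Stan fit), the median is assumed as the central estimate, and `predict_point` is a hypothetical helper name.

```python
from statistics import median

# Hypothetical posterior draws for the intercept and slope (illustrative only)
alpha_draws = [0.9, 1.0, 1.1, 1.2, 0.8]
beta_draws = [1.9, 2.0, 2.1, 2.2, 1.8]

# Central estimates: here the posterior median of each coefficient
alpha_hat = median(alpha_draws)
beta_hat = median(beta_draws)


def predict_point(x):
    """Point prediction y ~ alpha_hat + beta_hat * x."""
    return alpha_hat + beta_hat * x


predictions = [predict_point(x) for x in [0.0, 1.0, 2.5]]
print(predictions)
```

This only gives a single number per input; it says nothing about the uncertainty in the coefficients or in \(\sigma\).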

Getting Started with RStan
r

I wanted to fit a Bayesian Tobit model, but I couldn't find one (probably because I didn't know how to look). So I decided to build one in Stan, which I had never used before. This article is the first in a series showing how I got there: this one builds a linear model in Stan and makes it usable from R with formula syntax, and the following articles add priors to the model, make model predictions from R, and then handle censored values with Tobit regression.

Fixing sampler errors in probit regression with rstanarm
stan

I was working through problem 15.5 of Regression and Other Stories, which asks you to fit a probit regression to an earlier example that used logistic regression. I used a model I had built on the National Election Survey dataset (on rstanarm 2.21.1):

    fit_nes_probit <- rstanarm::stan_glm(rvote ~ income_int_std + gender + race + region + religion + education_cts + advanced_degree + party + ideology3 + gender : party, family=binomial(link="probit"), data=nes92)

Then I got this error about the chains not converging:

Binning Binary Predictions
data

To understand how a binary prediction depends on a continuous input, I find it very useful to bin the input into quantiles and plot the average probability in each bin. For example, here's a plot using the iris dataset showing how the probability that a flower is a "virginica" and not a "versicolor" changes with the sepal width of the flower. This kind of plot can show nonlinearities and indicate how we should include this variable in a logistic regression.
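The binning idea can be sketched in plain Python as follows. This is only an illustration under assumptions: the data is a toy example rather than iris, the bins are equal-count groups of the sorted input, and `binned_rate` is a hypothetical helper name.

```python
def binned_rate(xs, ys, n_bins):
    """Split the points into equal-count bins by x and return
    (mean x, mean of the binary outcome y) for each bin."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    summary = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than points
        mean_x = sum(x for x, _ in chunk) / len(chunk)
        rate = sum(y for _, y in chunk) / len(chunk)
        summary.append((mean_x, rate))
    return summary


# Toy data: the outcome becomes more likely as x grows
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
bins = binned_rate(xs, ys, n_bins=4)
print(bins)
```

Plotting the second element of each pair against the first gives the kind of binned probability curve described above.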

Plotting Bayesian Parameter Distributions with R Tidyverse
R

I’m currently reading Regression and Other Stories, which contains lovely plots of coefficients and their distributions. What really impressed me is how easily I could reproduce them with a few concepts from the tidyverse. Suppose we’ve got an rstanarm model like this:

    model <- rstanarm::stan_glm(Petal.Width ~ Sepal.Length + Sepal.Width + Species, data=iris, refresh=0)

We can access all the coefficients from the posterior draws using as.matrix. With a few standard transformations we can plot the distribution of each of the coefficients.

Installing Tidyverse in WSL without Timedatectl Status 1 Issue
R

When I tried to install tidyverse in WSL2 I ran into issues with timedatectl and xml2. The simple solution is:

    # Assuming Debian derivatives
    sudo apt-get install libxml2-dev
    # Modify TZ to whatever your timezone is
    TZ="Australia/Sydney" R -e 'install.packages("tidyverse")'

What happens

When I try to install tidyverse I get this error:

    > install.packages('tidyverse')
    ERROR: configuration failed for package ‘xml2’
    System has not been booted with systemd as init system (PID 1).

Jupyter Notebook Preamble
jupyter

Whenever I use Jupyter Notebooks for analysis I tend to set a bunch of options at the top of every file to make them more pleasant to use. Here they are for Python and for R with IRKernel.

Python

    # Automatically reload code from dependencies when running cells
    # This is indispensable when importing code you are actively modifying.
    %load_ext autoreload
    %autoreload 2
    # I almost always use pandas and numpy
    import pandas as pd
    import numpy as np
    # Set the maximum rows to display in a dataframe
    pd.

Composing Functions
programming

R core looks like it's getting a new pipe operator |> for composing functions. It's just like the existing magrittr pipe %>%, but has been implemented as a syntax transformation so that it is more computationally efficient and produces better stack traces. The pipe means instead of writing f(g(h(x))) you can write x |> h |> g |> f, which can be really handy when changing dataframes. Python's Pandas library doesn't have this kind of convenience and it opens up a class of error that won't happen in that R code.
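For comparison, left-to-right composition can be approximated in plain Python with a small helper. This is a sketch only: `pipe` is a hypothetical function written here for illustration, not part of Python or Pandas.

```python
from functools import reduce


def pipe(value, *funcs):
    """Apply funcs to value from left to right, so that
    pipe(x, h, g, f) computes f(g(h(x)))."""
    return reduce(lambda acc, f: f(acc), funcs, value)


# Strip whitespace, parse as an integer, then add one
result = pipe(" 3 ", str.strip, int, lambda n: n + 1)
print(result)
```

Unlike a built-in operator, this is an ordinary function call, so it gets none of the syntax-level benefits described for `|>`.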

Setting the Icon in Jupyter Notebooks
jupyter

I often have way too many Jupyter notebook tabs open, and I have to distinguish them by the first couple of letters of the notebook name next to the orange Jupyter book icon. What if we could change the icons to visually distinguish different notebooks? I thought I had found a really easy way to set the icon in Jupyter notebooks... but it works in Firefox and not in Chrome. I'll go through the easy solution and the hard solution, which works in more browsers.

R: Keeping Up With Python
r

About 5 years ago a colleague told me that the days were numbered for R and that Python had won. From his perspective he is probably right; in software engineering companies Python has seen increasing adoption for programmatic analytics. However, R has its own set of unique strengths that make it more appealing to statisticians, and it has kept up surprisingly well with Python. Python has a wider audience than R, and lives up to its reputation as "not the best language for anything but the second best language for everything".

Second most common value with Pandas
python

I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignments out of order. However, some things are really difficult to do with method chaining in Pandas; in particular, getting the second most common value of each group. This is much easier to do in R's dplyr, with its consistent and flexible syntax, than it is with Pandas.

Problem

For the table below find the total frequency and the second most common value of y by frequency for each x (in the case of ties any second most common value will suffice).
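One possible way to phrase this as a single Pandas chain is sketched below. It is not necessarily the approach the article arrives at: the data frame is a made-up stand-in for the table (one row per observation is assumed), and `second_most_common` is a hypothetical helper.

```python
import pandas as pd

# Made-up stand-in data: one row per observation of (x, y)
df = pd.DataFrame({
    "x": ["a", "a", "a", "a", "b", "b", "b"],
    "y": [1, 1, 1, 2, 3, 3, 4],
})


def second_most_common(values):
    """Return the second most frequent value, or None if there is only one."""
    counts = values.value_counts()  # sorted by frequency, descending
    return counts.index[1] if len(counts) > 1 else None


result = (
    df.groupby("x")["y"]
      .agg(n="size", second=second_most_common)
      .reset_index()
)
print(result)
```

The chain stays intact, but only by pushing the awkward part into a custom aggregation function.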