This is a relatively brief addendum to last week’s post, where I described how the rvar
datatype implemented in the R
package posterior
makes it quite easy to perform posterior probability checks to assess goodness of fit. In the initial post, I generated data from a linear model and estimated parameters for a linear regression model, and, unsurprisingly, the model fit the data quite well. When I introduced a quadratic term into the data generating process and fit the same linear model (without a quadratic term), equally unsurprising, the model wasn’t a great fit.
Immediately after putting the post up, I decided to make sure the correct model with the quadratic term would not result in extreme p-value (i.e. would fall between 0.02 and 0.98). And, again not surprisingly, the model was a good fit. I’m sharing all this here, because I got some advice on how to work with the rvar
data a little more efficiently, and wanted to make sure those who are interested could see that. And while I was at it, I decided to investigate the distribution of Bayesian p-values under the condition that the model and data generating process are the same (i.e. the model is correct).
Just as a reminder, here is the data generation process:
\[y \sim N(\mu = 2 + 6*x - 0.3x^2, \ \sigma^2 = 4)\]
Here are the necessary libraries:
library(simstudy)
library(data.table)
library(cmdstanr)
library(posterior)
library(bayesplot)
library(ggplot2)
And here is the data generation:
b_quad <- -0.3
ddef <- defData(varname = "x", formula = "0;10", dist = "uniform")
ddef <- defData(ddef, "y", "2 + 6*x + ..b_quad*(x^2)", 4)
set.seed(72612)
dd <- genData(100, ddef)
The Stan
model is slightly modified to include the additional term; \(\gamma\) is the quadratic parameter:
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
transformed data{
vector[N] x2;
for (i in 1:N) {
x2[i] = x[i]*x[i];
};
}
parameters {
real alpha;
real beta;
real gamma;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta*x + gamma*x2, sigma);
}
mod <- cmdstan_model("code/quadratic_regression.stan")
fit <- mod$sample(
data = list(N = nrow(dd), x = dd$x, y = dd$y),
seed = 72651,
refresh = 0,
chains = 4L,
parallel_chains = 4L,
iter_warmup = 500,
iter_sampling = 2500
)
## Running MCMC with 4 parallel chains...
##
## Chain 3 finished in 0.5 seconds.
## Chain 1 finished in 0.5 seconds.
## Chain 2 finished in 0.5 seconds.
## Chain 4 finished in 0.5 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.5 seconds.
## Total execution time: 0.6 seconds.
As before, I am plotting the observed (actual data) along with the 80% intervals of predicted values at each level of \(x\). The observed data appear to be randomly scattered within the intervals with no apparent pattern:
post_rvars <- as_draws_rvars(fit$draws())
mu <- with(post_rvars, alpha + beta * as_rvar(dd$x) + gamma * as_rvar(dd$x^2))
pred <- rvar_rng(rnorm, nrow(dd), mu, post_rvars$sigma)
df.80 <- data.table(x = dd$x, y=dd$y, t(quantile(pred, c(0.10, 0.90))))
df.80[, extreme := !(y >= V1 & y <= V2)]
The code to estimate the p-value is slightly modified from last time. The important difference is that the lists of rvars
(bin_prop_y and bin_prop_pred) are converted directly into vectors of rvars
using the do.call
function:
df <- data.frame(x = dd$x, y = dd$y, mu, pred)
df$grp <- cut(df$x, breaks = seq(0, 10, by = 2),include.lowest = TRUE, labels=FALSE)
bin_prop_y <- lapply(split(df, df$grp), function(x) rvar_mean(x$y < x$mu))
rv_y <- do.call(c, bin_prop_y)
T_y <- rvar_var(rv_y)
bin_prop_pred <- lapply(split(df, df$grp), function(x) rvar_mean(x$pred < x$mu))
rv_pred <- do.call(c, bin_prop_pred)
T_pred <- rvar_var(rv_pred)
mean(T_pred > T_y)
## [1] 0.583
In this one case, the p-value is 0.58, suggesting the model is a good fit. But, could this have been a fluke? Looking below at the density plot of p-values based on 10,000 simulated data sets suggests not; indeed, \(P(0.02 < \text{p-value} < 0.98) = 99.8\%.\) (If you are interested in the code that estimated the density of p-values, I can post it as well.)