# Prediction intervals for Generalized Additive Models (GAMs)

Update on Aug 9, 2022: In the code chunk below, `sd = summary(fit_gam)$scale` was changed to `sd = sqrt(summary(fit_gam)$scale)`:

```r
y_sim <- matrix(rnorm(n = prod(dim(exp_val_sim)),
                      mean = exp_val_sim,
                      sd = sqrt(summary(fit_gam)$scale)),
                nrow = nrow(exp_val_sim),
                ncol = ncol(exp_val_sim))
```

Thanks to David Kaplan (IRD, France).

Finding prediction intervals (for future observations) is different from finding confidence intervals (for unknown population parameters). Here, I demonstrate one approach to finding prediction intervals.

First we load the library and simulate some data:

```r
library(mgcv)
set.seed(1)
dat <- gamSim(eg = 1, n = 400, dist = "normal", scale = 2)
```

```
## Gu & Wahba 4 term additive model
```

The simulated data in `dat` contain the “truth” in the f variables:

```r
str(dat)
```

```
## 'data.frame': 400 obs. of  10 variables:
##  $ y : num  3.3407 -0.0758 10.6832 8.7291 14.9911 ...
##  $ x0: num  0.266 0.372 0.573 0.908 0.202 ...
##  $ x1: num  0.659 0.185 0.954 0.898 0.944 ...
##  $ x2: num  0.8587 0.0344 0.971 0.7451 0.2733 ...
##  $ x3: num  0.367 0.741 0.934 0.673 0.701 ...
##  $ f : num  5.51 3.58 8.69 8.75 16.19 ...
##  $ f0: num  1.481 1.841 1.948 0.569 1.184 ...
##  $ f1: num  3.74 1.45 6.74 6.02 6.6 ...
##  $ f2: num  2.98e-01 2.88e-01 8.61e-05 2.16 8.40 ...
##  $ f3: num  0 0 0 0 0 0 0 0 0 0 ...
```

```r
fit_lm <- lm(f ~ f0 + f1 + f2 + f3, data = dat)
plot(dat$f, predict(fit_lm))
abline(0, 1)
```

The data for demonstrating GAMs are in the y and x variables:

```r
fit_gam <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
```

The idea is to simulate parameter vectors and, for each simulated parameter vector, draw observations. Repeating this many times yields a sample from which a prediction interval can be computed.

We first extract the parameter estimates and their covariance matrix:

```r
beta <- coef(fit_gam)
V <- vcov(fit_gam)
```

We then simulate a number of parameter vectors. This is done by assuming approximate normality and using the Cholesky trick: $\hat{\beta} + L \nu \quad \dot\sim \quad \text{MultivariateNormal}(\hat{\beta}, V) ,$ where $$V = L L^\top$$ is the Cholesky factorisation of the covariance matrix $$V$$ and $$\nu$$ is a vector of independent $$N(0, 1)$$ draws.

```r
num_beta_vecs <- 10000
Cv <- chol(V)
set.seed(1)
nus <- rnorm(num_beta_vecs * length(beta))
beta_sims <- beta + t(Cv) %*% matrix(nus, nrow = length(beta), ncol = num_beta_vecs)
dim(beta_sims)
```

```
## [1]    37 10000
```
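As a side note, the same kind of multivariate normal draws can be obtained with `MASS::mvrnorm()`, which handles the factorisation internally. A minimal sketch with stand-in values for beta and V (not the fit above):

```r
# Sketch: multivariate normal draws via MASS::mvrnorm() instead of the
# manual Cholesky trick. beta and V here are stand-in values, not the GAM fit.
library(MASS)
set.seed(1)
beta <- c(1, 2)                           # stand-in coefficient vector
V <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)  # stand-in covariance matrix
# mvrnorm() returns an n x p matrix; transpose to get p x n like beta_sims above
beta_sims_alt <- t(mvrnorm(n = 10000, mu = beta, Sigma = V))
dim(beta_sims_alt)       # 2 x 10000
rowMeans(beta_sims_alt)  # approximately beta
```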

This should give us something similar to the confidence intervals of the model fit:

```r
d_beta <- cbind(summary(fit_gam)$se, apply(beta_sims, 1, sd))
head(d_beta)
```

```
##                  [,1]      [,2]
## (Intercept) 0.1036817 0.1039225
## s(x0).1     0.3977579 0.3988932
## s(x0).2     0.7249847 0.7212706
## s(x0).3     0.1967577 0.1982036
## s(x0).4     0.4260063 0.4265382
## s(x0).5     0.1450504 0.1444811
```

```r
plot(d_beta[, 1], d_beta[, 2], xlab = "Calculated SE", ylab = "Simulated SE")
abline(0, 1)
```

We can now proceed to simulating observations. We do that by randomly sampling from the observations with replacement:

```r
n_obs <- 100
sim_idx <- sample.int(nrow(dat), size = n_obs, replace = TRUE)
sim_dat <- dat[sim_idx, c("x0", "x1", "x2", "x3")]
dim(sim_dat)
```

```
## [1] 100   4
```

```r
# sim_dat <- dat[, c("x0", "x1", "x2", "x3")]
# dim(sim_dat)
```

Instead of doing such a sampling, it is also possible to use e.g. a grid of values:

```r
if (FALSE) {
  # Intentionally not run
  xs_range <- apply(dat[, c("x0", "x1", "x2", "x3")], 2, range)
  xs_seq <- apply(xs_range, 2, function(x) seq(from = x[1], to = x[2], length.out = 10))
  xs_seq_lst <- as.list(as.data.frame(xs_seq))
  sim_dat <- do.call(expand.grid, list(xs_seq_lst, KEEP.OUT.ATTRS = FALSE))
  dim(sim_dat)
}
```

Now we can calculate the linear predictors:

```r
covar_sim <- predict(fit_gam, newdata = sim_dat, type = "lpmatrix")
linpred_sim <- covar_sim %*% beta_sims
```

Next, the inverse link function must be applied to obtain the expected values. If e.g. `family = Gamma(link = log)` was used in `gam()`, then `exp()` must be used. We used the Gaussian family with identity link, so the inverse link is the identity:

```r
invlink <- function(x) x
exp_val_sim <- invlink(linpred_sim)
```

Now the family itself must be used. If Gamma was used, then `rgamma()` (properly parametrised) must be used. We just used Gaussian, so:

```r
y_sim <- matrix(rnorm(n = prod(dim(exp_val_sim)),
                      mean = exp_val_sim,
                      sd = sqrt(summary(fit_gam)$scale)),
                nrow = nrow(exp_val_sim),
                ncol = ncol(exp_val_sim))
dim(y_sim)
```

```
## [1]   100 10000
```
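For concreteness, here is a sketch of how the last two steps would change for `family = Gamma(link = log)`. The linear predictors and dispersion below are stand-in values, not the fit above, and `rgamma()` is parametrised so that E[y] = mu and Var[y] = phi * mu^2, the Gamma GLM variance function:

```r
# Sketch of the Gamma(link = log) case with stand-in values (not the fit above).
set.seed(1)
linpred_sim <- matrix(rnorm(6), nrow = 2, ncol = 3)  # stand-in linear predictors
phi <- 0.5                                           # stand-in dispersion estimate
invlink <- exp                                       # inverse of the log link
mu_sim <- invlink(linpred_sim)                       # expected values
shape <- 1 / phi                                     # so that Var[y] = phi * mu^2
y_sim_gamma <- matrix(rgamma(n = prod(dim(mu_sim)),
                             shape = shape,
                             rate = shape / mu_sim), # E[y] = shape / rate = mu
                      nrow = nrow(mu_sim),
                      ncol = ncol(mu_sim))
dim(y_sim_gamma)  # 2 x 3
```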

So now entry $$(i, j)$$ of y_sim contains the simulated response for observation $$i$$ under parameter vector $$j$$ (column $$j$$ of beta_sims).

We can then find a 95% prediction interval:

```r
pred_int_sim <- apply(y_sim, 1, quantile, prob = c(0.025, 0.975))
dim(pred_int_sim)
```

```
## [1]   2 100
```
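As a sanity check, the whole pipeline can be rerun in condensed form to estimate the empirical coverage of the interval on the training data. This is a self-contained sketch under the same Gaussian model as above (with fewer parameter draws to keep it quick), so the result should land near the nominal 95%:

```r
# Condensed rerun of the pipeline above, used to check empirical coverage
# of the simulated 95% prediction interval on the training data.
library(mgcv)
set.seed(1)
dat <- gamSim(eg = 1, n = 400, dist = "normal", scale = 2, verbose = FALSE)
fit_gam <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
beta <- coef(fit_gam)
V <- vcov(fit_gam)
n_sims <- 1000
beta_sims <- beta + t(chol(V)) %*%
  matrix(rnorm(n_sims * length(beta)), nrow = length(beta))
# lpmatrix for all 400 training rows, then simulate responses
linpred_sim <- predict(fit_gam, type = "lpmatrix") %*% beta_sims
y_sim <- matrix(rnorm(n = length(linpred_sim),
                      mean = linpred_sim,
                      sd = sqrt(summary(fit_gam)$scale)),
                nrow = nrow(linpred_sim))
pred_int <- apply(y_sim, 1, quantile, prob = c(0.025, 0.975))
coverage <- mean(dat$y >= pred_int[1, ] & dat$y <= pred_int[2, ])
coverage  # roughly 0.95
```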

And e.g. plot it against x0:

```r
plot(y ~ x0, data = dat)
sim_dat_x0ord <- order(sim_dat$x0)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[1L, sim_dat_x0ord], col = 2, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[2L, sim_dat_x0ord], col = 2, lwd = 2)
```

Confidence intervals can be added:

```r
plot(y ~ x0, data = dat)
sim_dat_x0ord <- order(sim_dat$x0)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[1L, sim_dat_x0ord], col = 2, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[2L, sim_dat_x0ord], col = 2, lwd = 2)

sim_dat_pred <- predict(fit_gam, newdata = sim_dat, se = TRUE)
lines(sim_dat$x0[sim_dat_x0ord], sim_dat_pred$fit[sim_dat_x0ord], col = 1, lwd = 2)

upr_ci <- invlink(sim_dat_pred$fit + 2 * sim_dat_pred$se.fit)
lwr_ci <- invlink(sim_dat_pred$fit - 2 * sim_dat_pred$se.fit)
lines(sim_dat$x0[sim_dat_x0ord], upr_ci[sim_dat_x0ord], col = 3, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], lwr_ci[sim_dat_x0ord], col = 3, lwd = 2)
```

##### Mikkel Meyer Andersen
###### Assoc. Professor of Applied Statistics
