Prediction intervals for Generalized Additive Models (GAMs)

Finding prediction intervals (for future observations) is something different than finding confidence intervals (for unknown population parameters).

Here, I demonstrate one approach to doing so.

First we load the library and simulate some data:

library(mgcv)
set.seed(1)
dat <- gamSim(eg = 1, n = 400, dist = "normal", scale = 2)
## Gu & Wahba 4 term additive model

The simulated in dat contains the “truth” in the f variables:

str(dat)
## 'data.frame':    400 obs. of  10 variables:
##  $ y : num  3.3407 -0.0758 10.6832 8.7291 14.9911 ...
##  $ x0: num  0.266 0.372 0.573 0.908 0.202 ...
##  $ x1: num  0.659 0.185 0.954 0.898 0.944 ...
##  $ x2: num  0.8587 0.0344 0.971 0.7451 0.2733 ...
##  $ x3: num  0.367 0.741 0.934 0.673 0.701 ...
##  $ f : num  5.51 3.58 8.69 8.75 16.19 ...
##  $ f0: num  1.481 1.841 1.948 0.569 1.184 ...
##  $ f1: num  3.74 1.45 6.74 6.02 6.6 ...
##  $ f2: num  2.98e-01 2.88e-01 8.61e-05 2.16 8.40 ...
##  $ f3: num  0 0 0 0 0 0 0 0 0 0 ...
fit_lm <- lm(f ~ f0 + f1 + f2 + f3, data = dat)
plot(dat$f, predict(fit_lm))
abline(0, 1)

And what can be used to demonstrate GAMs are available in the y and x variables:

fit_gam <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

The idea is to simulate parameter vectors, and using such a simulated parameter vector, we can draw observations. Doing this many times can be used to generate a prediction interval.

We first extract the parameter estimates and their covariance matrix:

beta <- coef(fit_gam)
V <- vcov(fit_gam)

We then to simulate a number of parameter vectors. This is done by assuming approximate normality and using the Cholesky trick. The trick is that \[ \hat{\beta} + L \nu \quad \dot \sim \quad \text{MultivariateNormal}(\beta, V) , \] where \(V = L L^\top\) is the Cholesky factorisation of the covariance matrix \(V\) and \(\nu \sim N(0, 1)\).

num_beta_vecs <- 10000
Cv <- chol(V)
set.seed(1)
nus <- rnorm(num_beta_vecs * length(beta))
beta_sims <- beta + t(Cv) %*% matrix(nus, nrow = length(beta), ncol = num_beta_vecs)
dim(beta_sims)
## [1]    37 10000

This should give us something similar to the confidence intervals of the model fit:

d_beta <- cbind(summary(fit_gam)$se, apply(beta_sims, 1, sd))
head(d_beta)
##                  [,1]      [,2]
## (Intercept) 0.1036817 0.1039225
## s(x0).1     0.3977579 0.3988932
## s(x0).2     0.7249847 0.7212706
## s(x0).3     0.1967577 0.1982036
## s(x0).4     0.4260063 0.4265382
## s(x0).5     0.1450504 0.1444811
plot(d_beta[, 1], d_beta[, 2], 
     xlab = "Calculated SE", 
     ylab = "Simulated SE")
abline(0, 1)

We can now proceed to simulating observations. We do that by randomly sampling from the observations with replacement:

n_obs <- 100
sim_idx <- sample.int(nrow(dat), size = n_obs, replace = TRUE)
sim_dat <- dat[sim_idx, c("x0", "x1", "x2", "x3")]
dim(sim_dat)
## [1] 100   4
# sim_dat <- dat[, c("x0", "x1", "x2", "x3")]
# dim(sim_dat)

Instead of doing such a sampling, it is also possible to use e.g. a grid of values:

if (FALSE) { # Intentionally not run
  xs_range <- apply(dat[, c("x0", "x1", "x2", "x3")], 2, range)
  xs_seq <- apply(xs_range, 2, function(x) seq(from = x[1], to = x[2], length.out = 10))
  xs_seq_lst <- as.list(as.data.frame(xs_seq))
  sim_dat <- do.call(expand.grid, list(xs_seq_lst, KEEP.OUT.ATTRS = FALSE))
  dim(sim_dat)
}

Now we can calculate the linear predictors:

covar_sim <- predict(fit_gam, newdata = sim_dat, type = "lpmatrix")
linpred_sim <- covar_sim %*% beta_sims

Now, the inverse link function must be applied to obtain the expected values. So if e.g. family = Gamma(link = log) was used in gam(), then exp() must be used. We used the Gaussian family with identity link so the inverse is the identity:

invlink <- function(x) x
exp_val_sim <- invlink(linpred_sim)

Now, the family itself must be used. If Gamma was used, then rgamma() (properly parametised) must be used. We just used Gaussian, so:

y_sim <- matrix(rnorm(n = prod(dim(exp_val_sim)), 
                      mean = exp_val_sim, 
                      sd = summary(fit_gam)$scale), 
                nrow = nrow(exp_val_sim), 
                ncol = ncol(exp_val_sim))
dim(y_sim)
## [1]   100 10000

So now entry \((i, j)\) of y_sim contains the \(i\)’th simulated observation’s simulated response under parameter vector \(j\) (column \(j\) of beta_sims).

We can then find a 95% prediction interval:

pred_int_sim <- apply(y_sim, 1, quantile, prob = c(.025, 0.975))
dim(pred_int_sim)
## [1]   2 100

And e.g. plot it against x0:

plot(y ~ x0, data = dat)

sim_dat_x0ord <- order(sim_dat$x0)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[1L, sim_dat_x0ord], col = 2, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[2L, sim_dat_x0ord], col = 2, lwd = 2)

Confidence intervals can be added:

plot(y ~ x0, data = dat)

sim_dat_x0ord <- order(sim_dat$x0)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[1L, sim_dat_x0ord], col = 2, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], pred_int_sim[2L, sim_dat_x0ord], col = 2, lwd = 2)

sim_dat_pred <- predict(fit_gam, newdata = sim_dat, se = TRUE)
lines(sim_dat$x0[sim_dat_x0ord], sim_dat_pred$fit[sim_dat_x0ord], col = 1, lwd = 2)

upr_ci <- invlink(sim_dat_pred$fit + 2*sim_dat_pred$se.fit)
lwr_ci <- invlink(sim_dat_pred$fit - 2*sim_dat_pred$se.fit)
lines(sim_dat$x0[sim_dat_x0ord], upr_ci[sim_dat_x0ord], col = 3, lwd = 2)
lines(sim_dat$x0[sim_dat_x0ord], lwr_ci[sim_dat_x0ord], col = 3, lwd = 2)

Avatar
Mikkel Meyer Andersen
Assoc. Professor of Applied Statistics

My research interests include applied statistics and computational statistics.

Related