Dummy variables in R

Last updated on May 26, 2020

Dummy variables are important but also cause much frustration in intro-stat courses. Below I will demonstrate the concept via a linear regression model.

The basic idea is that a factor $f$ with $k$ levels can be replaced by $k-1$ dummy variables that act as switches selecting the different levels. When all switches are turned off, the reference level is chosen. Mathematically, let $f$ be the factor with levels $l_0, l_1, \ldots, l_{k-1}$, i.e. $f \in \left \{ l_0, l_1, \ldots, l_{k-1} \right \}$. By convention, let $l_0$ be the reference level chosen by the user. Now introduce the $k-1$ dummy variables $z_1, z_2, \ldots, z_{k-1}$ defined by \[ z_i = \begin{cases} 1 & \text{if $f = l_i$} \\ 0 & \text{otherwise} \end{cases} \] for $i \in \{1, 2, \ldots, k-1 \}$. Note that at most one of the $z_i$ equals $1$ for any given observation, and that all of them are $0$ exactly when $f = l_0$.

Assume that $f$ has $k = 3$ levels and that we are interested in the ANOVA model given by the R formula y ~ f (e.g. lm(y ~ f)). Then R automatically translates this into the model \[ y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \varepsilon \] with dummy variables $z_1$ and $z_2$ as defined above. This can be illustrated in R as follows:

f <- factor(c("l0", "l1", "l2"))
as.data.frame(model.matrix(~ f))
##   (Intercept) fl1 fl2
## 1           1   0   0
## 2           1   1   0
## 3           1   0   1

So the first row is $f = l_0$, the second $f = l_1$, and the third $f = l_2$.
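The switch definition can also be checked by hand. The following is a minimal sketch that builds the indicator columns directly from the definition of $z_i$ and compares them with the columns produced by model.matrix(). The names z1, z2 and X are chosen here purely for illustration.

```r
f <- factor(c("l0", "l1", "l2"))

# Build the dummy variables directly from the definition:
# z_i is 1 when f equals level l_i and 0 otherwise.
z1 <- as.numeric(f == "l1")
z2 <- as.numeric(f == "l2")

# Design matrix produced by R (default coding)
X <- model.matrix(~ f)

# The hand-made columns agree with R's columns fl1 and fl2
all(X[, "fl1"] == z1)
## [1] TRUE
all(X[, "fl2"] == z2)
## [1] TRUE
```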

Here we see that the intercept column is the constant (“silent”) $1$ multiplying $\beta_0$, so $\beta_0$ is always included. The parameter $\beta_0$ is the mean of $y$ for $f = l_0$. Notice the column name fl1; its coefficient $\beta_1$ is the difference in the mean of $y$ between $f = l_1$ and $f = l_0$. This can be seen by inspecting row two above: for $f = l_1$ the mean is $\beta_0 + \beta_1$. The convention in R is to concatenate the factor (variable) name, here f, with the level, here l1.
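This interpretation of the parameters can be checked numerically. Below is a small sketch on simulated data; the data, the seed, and the names y, fit and group_means are made up for illustration and are not part of the example above. It compares the coefficients from lm() with the group means of $y$.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated data: 10 observations per level (purely illustrative)
f <- factor(rep(c("l0", "l1", "l2"), each = 10))
y <- rnorm(30, mean = c(5, 7, 4)[as.integer(f)])

fit <- lm(y ~ f)
b <- coef(fit)

# Mean of y within each level of f
group_means <- sapply(split(y, f), mean)

# Intercept: mean of y in the reference level l0
all.equal(unname(b["(Intercept)"]), unname(group_means["l0"]))

# fl1 and fl2: differences in mean relative to l0
all.equal(unname(b["fl1"]), unname(group_means["l1"] - group_means["l0"]))
all.equal(unname(b["fl2"]), unname(group_means["l2"] - group_means["l0"]))
```

Each comparison should return TRUE (up to numerical tolerance), since a model with a single factor reproduces the group means exactly.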

As seen, R silently took the first level as the reference level. This is the convention: the first level of the factor is the reference level. The order of the levels, and thereby the reference level, can be set when the factor is created:

f <- factor(c("l0", "l1", "l2"), levels = c("l1", "l0", "l2"))
as.data.frame(model.matrix(~ f))
##   (Intercept) fl0 fl2
## 1           1   1   0
## 2           1   0   0
## 3           1   0   1

To change the reference level of an existing factor, the relevel() function is useful:

f <- factor(c("l0", "l1", "l2"))
f <- relevel(f, ref = "l2") # ref: the reference level
as.data.frame(model.matrix(~ f))
##   (Intercept) fl0 fl1
## 1           1   1   0
## 2           1   0   1
## 3           1   0   0
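What relevel() changes is only the order of the levels, not the data. The following sketch makes this visible with levels(); the name f2 is introduced here for illustration.

```r
f <- factor(c("l0", "l1", "l2"))
levels(f)
## [1] "l0" "l1" "l2"

# Move l2 to the front so that it becomes the reference level
f2 <- relevel(f, ref = "l2")
levels(f2)
## [1] "l2" "l0" "l1"

# The observations themselves are unchanged
as.character(f2)
## [1] "l0" "l1" "l2"
```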

Contrasts

The above is one particular way of creating so-called contrasts; in R this coding is known as treatment contrasts. There are many other ways to do it. See for example https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
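As a brief illustration of another coding, the sketch below compares the treatment coding used above with sum-to-zero contrasts via the contrasts.arg argument of model.matrix(). It assumes R's default contrast options are in effect; the name X_sum is chosen here for illustration.

```r
f <- factor(c("l0", "l1", "l2"))

# Treatment contrasts: the dummy variables used above
contr.treatment(levels(f))
##    l1 l2
## l0  0  0
## l1  1  0
## l2  0  1

# Sum-to-zero contrasts as an alternative coding
X_sum <- model.matrix(~ f, contrasts.arg = list(f = "contr.sum"))
as.data.frame(X_sum)
##   (Intercept) f1 f2
## 1           1  1  0
## 2           1  0  1
## 3           1 -1 -1
```

With sum-to-zero contrasts the columns are no longer simple switches, so the parameters have a different interpretation than in the treatment coding.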


Mikkel Meyer Andersen

Assoc. Professor of Applied Statistics

My research interests include applied statistics and computational statistics.
