Dummy variables are important but also cause much frustration in *intro-stat* courses.
Below I will demonstrate the concept via a linear regression model.

The basic idea is that a factor \(f\) with \(k\) levels can be replaced by \(k-1\) dummy variables that act as switches to select different levels.
When all switches are turned off, the reference level is chosen.
Mathematically, let \(f\) be the factor with levels \(l_0, l_1, \ldots, l_{k-1}\), i.e. \(f \in \left \{ l_0, l_1, \ldots, l_{k-1} \right \}\).
By convention, let \(l_0\) be the *reference level* chosen by the user.
Now introduce the \(k-1\) dummy variables \(z_1, z_2, \ldots, z_{k-1}\) defined by
\[
z_i =
\begin{cases}
1 & \text{if $f = l_i$} \\
0 & \text{otherwise}
\end{cases}
\]
for \(i \in \{1, 2, \ldots, k-1\}\).

Assume that \(f\) has \(k = 3\) levels and that we are interested in the ANOVA model given by the `R` formula `y ~ f` (e.g. `lm(y ~ f)`).
Then `R` automatically translates this into the model
\[
y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \varepsilon
\]
with dummy variables \(z_1\) and \(z_2\) as defined above.
This can be illustrated in `R` as follows:

```
f <- factor(c("l0", "l1", "l2"))
as.data.frame(model.matrix(~ f))
##   (Intercept) fl1 fl2
## 1           1   0   0
## 2           1   1   0
## 3           1   0   1
```

So the first row is \(f = l_0\), the second \(f = l_1\), and the third \(f = l_2\).
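To connect the design matrix back to the definition of the \(z_i\), here is a minimal sketch that builds the dummy variables by hand and compares them with the output of `model.matrix()`. The data and the names `non_ref`, `z`, `X_manual` and `X_r` are made up for illustration.

```r
# Hypothetical example data: a factor with k = 3 levels, each observed twice
f <- factor(c("l0", "l1", "l2", "l1", "l0", "l2"))

# z_i = 1 if f equals level l_i and 0 otherwise, for every non-reference level
non_ref <- levels(f)[-1]
z <- sapply(non_ref, function(l) as.numeric(f == l))
colnames(z) <- paste0("f", non_ref)  # mimic R's naming: factor name + level

# Prepend the constant column and compare with R's own design matrix
X_manual <- cbind("(Intercept)" = 1, z)
X_r <- model.matrix(~ f)
all(X_manual == X_r)
```

The comparison should return `TRUE`: the columns that `R` creates are exactly the \(k-1\) switches defined earlier, plus the constant column.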

Here we see that the intercept column is the constant (“silent”) \(1\) multiplying \(\beta_0\), so \(\beta_0\) is always included.
The parameter \(\beta_0\) is the mean of the \(y\)’s for \(f = l_0\), because both dummy variables are zero in that case.
Notice the column name `fl1`; its coefficient \(\beta_1\) is the difference in the mean of \(y\) between \(f = l_1\) and \(f = l_0\).
This can be seen by inspecting row two above: for \(f = l_1\) the mean of \(y\) is \(\beta_0 + \beta_1\), while for \(f = l_0\) it is \(\beta_0\).
The convention in `R` is to concatenate the factor (variable) name, here `f`, with the level, here `l1`.
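The interpretation of the coefficients can be checked numerically. The following is a small sketch on made-up data (the values in `y` are purely illustrative), comparing the fitted coefficients with the group means.

```r
# Hypothetical data: two observations per level
f <- factor(rep(c("l0", "l1", "l2"), each = 2))
y <- c(1, 3, 5, 7, 10, 12)

fit <- lm(y ~ f)
coef(fit)

# Group means computed directly (2, 6 and 11 for these data)
m <- tapply(y, f, mean)
m["l0"]            # compare with the intercept, beta_0
m["l1"] - m["l0"]  # compare with the coefficient named fl1, beta_1
m["l2"] - m["l0"]  # compare with the coefficient named fl2, beta_2
```

The intercept reproduces the mean of the reference level, and each remaining coefficient reproduces a difference from that mean.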

As seen, `R` silently took the first level as the reference level.
This is the convention: the first level of the factor is the reference level, so changing the order of the levels changes the reference:

```
f <- factor(c("l0", "l1", "l2"), levels = c("l1", "l0", "l2"))
as.data.frame(model.matrix(~ f))
##   (Intercept) fl0 fl2
## 1           1   1   0
## 2           1   0   0
## 3           1   0   1
```

Sometimes the `relevel()` function is useful:

```
f <- factor(c("l0", "l1", "l2"))
f <- relevel(f, ref = "l2") # ref: the reference level
as.data.frame(model.matrix(~ f))
##   (Intercept) fl0 fl1
## 1           1   1   0
## 2           1   0   1
## 3           1   0   0
```
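Changing the reference level changes what the coefficients mean, but not the model itself. A minimal sketch on made-up data (the values in `y` and the names `f2`, `fit_l0` and `fit_l2` are illustrative only):

```r
# Hypothetical data: two observations per level
f <- factor(rep(c("l0", "l1", "l2"), each = 2))
y <- c(1, 3, 5, 7, 10, 12)

f2 <- relevel(f, ref = "l2")  # same data, but l2 is now the reference level

fit_l0 <- lm(y ~ f)   # coefficients are differences from the l0 mean
fit_l2 <- lm(y ~ f2)  # coefficients are differences from the l2 mean

coef(fit_l0)
coef(fit_l2)

# The fitted values (the group means) are the same in both parameterisations
all.equal(fitted(fit_l0), fitted(fit_l2))
```

Only the parameterisation differs: the intercept moves to the mean of the new reference level and the remaining coefficients become differences from it.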

## Contrasts

The above is one particular way of creating so-called *contrasts*, namely treatment contrasts, which `R` uses by default for unordered factors.
There are many other ways to do it.
See for example https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
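As a brief sketch of how another coding looks, the built-in `contr.sum()` (sum-to-zero coding) can be compared with the default `contr.treatment()`. The name `X_sum` is made up for illustration.

```r
# Coding matrices for a factor with 3 levels
contr.treatment(3)  # dummy coding, the default for unordered factors
contr.sum(3)        # sum-to-zero coding

# Request sum-to-zero coding for f when building the design matrix
f <- factor(c("l0", "l1", "l2"))
X_sum <- model.matrix(~ f, contrasts.arg = list(f = "contr.sum"))
as.data.frame(X_sum)
```

With sum-to-zero coding, the row for the last level contains \(-1\) in both non-intercept columns, so the coefficients are no longer differences from a reference level.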