Statistics

Dummy variables in R

Dummy variables are important but also cause much frustration in intro-stat courses. Below I will demonstrate the concept via a linear regression model. The basic idea is that a factor \(f\) with \(k\) levels can be replaced by \(k-1\) dummy variables that act as switches to select different levels. When all switches are turned off, the reference level is chosen. Mathematically, let \(f\) be the factor with levels \(l_0, l_1, \ldots, l_{k-1}\), i.
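The switch idea can be made concrete with a few lines of code. The post itself works in R; the following is only a minimal Python sketch, and the function name `dummy_code` and the level labels are hypothetical names chosen for illustration. A factor with \(k = 3\) levels becomes \(k - 1 = 2\) columns of 0/1 values, and the reference level is the row where every switch is off.

```python
def dummy_code(values, levels):
    """Encode a factor with k levels as k-1 dummy (0/1) variables.

    levels[0] is treated as the reference level and is encoded as all zeros.
    """
    others = levels[1:]  # one dummy variable per non-reference level
    rows = []
    for v in values:
        if v not in levels:
            raise ValueError(f"unknown level: {v!r}")
        rows.append([1 if v == level else 0 for level in others])
    return rows


levels = ["l0", "l1", "l2"]      # k = 3 levels, "l0" is the reference
f = ["l0", "l1", "l2", "l1"]     # observed values of the factor
dummies = dummy_code(f, levels)

for value, row in zip(f, dummies):
    print(value, row)            # "l0" gives [0, 0]: all switches off
```

In a regression, the coefficient of each dummy variable is then the difference between that level and the reference level.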

Variance in reproductive success (VRS) in forensic genetics lineage markers

Back in 2017, David Balding and I published the paper “How convincing is a matching Y-chromosome profile?”. One of the key parameters of the simulation model was the variance in reproductive success (VRS). Here I will discuss and demonstrate this parameter. First note for intuition that in a Wright–Fisher model, all individuals have the same probability of becoming a father, or of having reproductive success. So the VRS here is 0.
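One way to build intuition is to look at the variance of the relative probabilities of becoming a father. The sketch below is an illustration under stated assumptions, not the paper's simulation code: the function `fathering_weights` is hypothetical, and the Gamma parameterisation (mean 1, variance `1/shape`) is an assumed way to introduce unequal reproductive success.

```python
import random
import statistics


def fathering_weights(n, shape=None, rng=random):
    """Relative probabilities of becoming a father for n males.

    shape=None is the Wright-Fisher case: all weights are equal.
    Otherwise weights are Gamma(shape, scale=1/shape) draws, which have
    mean 1 and variance 1/shape (an assumption made for illustration).
    """
    if shape is None:
        return [1.0] * n
    return [rng.gammavariate(shape, 1.0 / shape) for _ in range(n)]


rng = random.Random(1)
wf = fathering_weights(10_000)
unequal = fathering_weights(10_000, shape=5, rng=rng)

print(statistics.pvariance(wf))       # 0: everyone is equally likely
print(statistics.pvariance(unequal))  # close to 1/5 for shape = 5
```

Fathers could then be drawn for the next generation with, e.g., `random.choices(range(n), weights=unequal, k=n)`.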

Wolfe conditions for deciding step length in inexact line search: An example with the Rosenbrock function

In inexact line search (a numerical optimisation technique), the step length (or learning rate) must be chosen, and the Wolfe conditions are central to that choice. Here I will give an example showing why they are useful. More on this topic can be read elsewhere, e.g. in the book by Nocedal and Wright (2006), “Numerical Optimization”, Springer. The Wolfe conditions consist of the sufficient decrease condition (SDC) and the curvature condition (CC):
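In the notation of Nocedal and Wright, for a step length \(\alpha\) along a descent direction \(p\) and constants \(0 < c_1 < c_2 < 1\), the SDC is \(f(x + \alpha p) \le f(x) + c_1 \alpha \nabla f(x)^\top p\) and the CC is \(\nabla f(x + \alpha p)^\top p \ge c_2 \nabla f(x)^\top p\). The following Python sketch checks both on the Rosenbrock function. The starting point \((-1.2, 1)\), the steepest-descent direction and the values \(c_1 = 10^{-4}\), \(c_2 = 0.9\) are assumptions chosen for illustration and need not match the post.

```python
def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2


def rosenbrock_grad(x, y):
    return (-2 * (1 - x) - 400 * x * (y - x ** 2), 200 * (y - x ** 2))


def wolfe(point, direction, alpha, c1=1e-4, c2=0.9):
    """Return (SDC holds, CC holds) for step length alpha."""
    x, y = point
    px, py = direction
    gx, gy = rosenbrock_grad(x, y)
    slope0 = gx * px + gy * py                 # directional derivative at x
    nx, ny = x + alpha * px, y + alpha * py    # candidate point
    ngx, ngy = rosenbrock_grad(nx, ny)
    slope_alpha = ngx * px + ngy * py          # directional derivative at x + alpha p
    sdc = rosenbrock(nx, ny) <= rosenbrock(x, y) + c1 * alpha * slope0
    cc = slope_alpha >= c2 * slope0
    return sdc, cc


start = (-1.2, 1.0)
gx, gy = rosenbrock_grad(*start)
p = (-gx, -gy)                                 # steepest-descent direction

for alpha in (1e-6, 7e-4, 1.0):
    print(alpha, wolfe(start, p, alpha))
```

The tiny step passes the SDC but fails the CC (it makes almost no progress), the large step fails the SDC (the function value increases), and the intermediate step satisfies both.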

Approximating small probabilities using importance sampling

Update Oct 14, 2019: Michael Höhle caught a mistake and notified me on Twitter. Thanks! The problem is that I used \(\text{Unif}(-10, 10)\) as the importance distribution, which does not have infinite support as the target does. The importance distribution must cover the support of the target, see e.g. Art B. Owen (2013), “Monte Carlo theory, methods and examples”. I have now updated the post to use a normal distribution instead. Box plots are often used. They are not always the best visualisation (e.
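To illustrate importance sampling with a normal importance distribution, here is a minimal Python sketch. The target (a standard normal), the event \(X > 4\) and the choice of \(N(4, 1)\) as importance distribution are assumptions made for this example and are not taken from the post.

```python
import random
from statistics import NormalDist


def tail_prob_is(threshold, n, rng):
    """Estimate P(X > threshold) for X ~ N(0, 1) by importance sampling.

    Draws come from N(threshold, 1), which has the same (infinite) support
    as the target, and are reweighted by target density / proposal density.
    """
    target = NormalDist(0, 1)
    proposal = NormalDist(threshold, 1)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1)
        if x > threshold:
            total += target.pdf(x) / proposal.pdf(x)
    return total / n


rng = random.Random(2019)
estimate = tail_prob_is(4.0, 100_000, rng)
exact = 1 - NormalDist().cdf(4.0)

print(estimate, exact)
```

Naive Monte Carlo with the same number of draws from \(N(0, 1)\) would see only a handful of values above 4, so its estimate would be far noisier.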

Correlation is not transitive, in general at least: A simulation approach

Let \(\rho_{XY}\) be the correlation between the stochastic variables \(X\) and \(Y\) and similarly for \(\rho_{XZ}\) and \(\rho_{YZ}\). If we know two of these, can we say anything about the third? In a recent blog post I dealt with the problem mathematically, using the concept of a partial correlation coefficient. Here I will take a simulation approach. First, z is simulated. Then x and y are simulated based on z in a regression context with a slope between \(-1\) and \(1\).
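A simulation of this kind can be sketched as follows. This is a Python illustration under assumptions, not the post's code: standard normal z, independent standard normal noise for x and y, and the helper names `pearson` and `simulate` are all hypothetical.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def simulate(slope_x, slope_y, n, rng):
    """Simulate z, then x and y as linear in z plus independent noise."""
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [slope_x * zi + rng.gauss(0, 1) for zi in z]
    y = [slope_y * zi + rng.gauss(0, 1) for zi in z]
    return pearson(x, y), pearson(x, z), pearson(y, z)


rng = random.Random(42)
r_xy, r_xz, r_yz = simulate(0.9, 0.9, 20_000, rng)
print(r_xy, r_xz, r_yz)
```

Because the noise terms here are independent, the theoretical \(\rho_{XY}\) equals \(\rho_{XZ}\rho_{YZ}\) in this particular setup (for slopes of 0.9, that is \(0.81/1.81 \approx 0.45\)), so x and y are noticeably less correlated with each other than each is with z.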

Correlation is not transitive, in general at least

Update Aug 10, 2019: I wrote a new blog post on the same topic, this time using a simulation approach. Update Aug 27, 2019: Minor change in how equations are solved (from version 0.9.0.9122). Let \(\rho_{XY}\) be the correlation between the stochastic variables \(X\) and \(Y\) and similarly for \(\rho_{XZ}\) and \(\rho_{YZ}\). If we know two of these, can we say anything about the third? Yes, sometimes, but not always.
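The "sometimes" can be made precise with a standard result: the \(3 \times 3\) correlation matrix must be positive semi-definite (equivalently, the partial correlation must lie in \([-1, 1]\)), which gives \(\rho_{XZ}\rho_{YZ} - \sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)} \le \rho_{XY} \le \rho_{XZ}\rho_{YZ} + \sqrt{(1-\rho_{XZ}^2)(1-\rho_{YZ}^2)}\). The Python sketch below evaluates this interval; the function name is hypothetical and the post may derive or present the result differently.

```python
import math


def third_correlation_bounds(rho_xz, rho_yz):
    """Range of rho_XY compatible with the given rho_XZ and rho_YZ."""
    centre = rho_xz * rho_yz
    half_width = math.sqrt((1 - rho_xz ** 2) * (1 - rho_yz ** 2))
    return centre - half_width, centre + half_width


for rho in (0.5, 0.9):
    print(rho, third_correlation_bounds(rho, rho))
```

With \(\rho_{XZ} = \rho_{YZ} = 0.5\) the interval contains zero and negative values, so nothing can be said about the sign of \(\rho_{XY}\). With \(\rho_{XZ} = \rho_{YZ} = 0.9\) the interval lies entirely above zero, so \(\rho_{XY}\) must be positive.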