Jonas Moss’ blog - Correlation with two lines!

What is this? A rant about an unimportant problem. Read at your own risk!

Everyone knows the correlation coefficient. It measures the degree of linear dependence between two random variables $X$ and $Y$ . It finds the best-fitting line $Y = a + b X$ and measures how far the observed values of $Y$ are from the predicted values of $y$ given $X$ .

Generate and plot some data from a line.

set.seed(313)
n = 100
x = seq(0, 1, length.out = n)
y = 2 * x + 1 + rnorm(n) * 0.5
plot(x, y, main = paste0("Sample correlation: ", round(cor(x,y), 3)))
abline(lm(y ~ x))

Plot of data simulated from $y = 2 x + 1$ .

That should be familiar. But what if there are TWO parallel lines in the data?!

Generate ad plot some data frome two lines!

set.seed(313)
par(mfrow=c(1,2))
n = 50
x1 = seq(0, 1, length.out = n)
y1 = 2 * x1 + 1 + rnorm(n) * 0.5
x2 = seq(0, 1, length.out = n)
y2 = 2 * x2 + 2 + rnorm(n) * 0.5
plot(x1, y1, main = paste0("Sample correlation: ???"), col = "blue", ylim = c(0, 6), xlab = "x", ylab = "y")
points(x2, y2, col = "red")
abline(a = 1, b = 2, col = "blue")
abline(a = 2, b = 2, col = "red")
x = c(x1, x2)
y = c(y1, y2)
z = c(rep(0, 50), rep(1, 50))
plot(x, y, main = paste0("Sample correlation: ", round(cor(x,y), 3)), xlab = "x", ylab = "y", ylim = c(0, 6))
abline(a = 3/2, b = 2)
abline(a = 1, b = 2, lty = 2)
abline(a = 2, b = 2, lty = 2)

Plot of data simulated from $y = 2 x + 1$ and $y = 2 x + 2$ !

The data has been simulated from the model $\begin{array}{rcl} y ∣ x, z = 0 & \sim & 2 x + 1 + \frac{1}{2} ϵ, \\ y ∣ x, z = 1 & \sim & 2 x + 2 + \frac{1}{2} ϵ . \end{array}$ where $z$ is a group indicator and $ϵ$ is standard normal. Moreover, the $x$ s are $50$ observations uniformly spaced on $[0, 1]$ . What is the marginal model, with $z$ integrated out? It’s $2 x + 3 / 2$ , displayed on the right. The plot on the right certainly doesn’t look like it contains two lines.

If we know the joint distribution of $(X, Y, Z)$ , generalizing the correlation is easy! The ordinary correlation coefficient can be written as $Cor (X, Y) = sign β \sqrt{R^{2}}$ ,where $R^{2} = 1 - \frac{min_{a, b} E [(Y - a - b X)^{2}]}{min_{μ} E [(Y - μ)^{2}]},$ and $β$ is the minimizing slope of the numerator. Now just add the covariate $1 [Z = 1]$ to the linear regression model in the denominator, and the resulting correlation becomes $Cor (X, Y; Z) = sign β \sqrt{R_{⋆}^{2}},$ where $R_{⋆}^{2} = 1 - \frac{min_{a, b} E [(Y - a_{0} - a_{1} 1 [Z = 1] - b X)^{2}]}{min_{μ} E [(Y - μ)^{2}]},$ and $β$ is the minimizing slope of the numerator (the sign of the slope is unambiguous since we’ve assumed the lines are parallel).

Calculating this correlation coefficient is easy as pie.

mod = lm(y ~ x + z)
unname(sqrt(summary(mod)$r.sq) * sign(coef(mod)[3]))

[1] 0.8495882

And its much larger than the “one-line correlation”, going from $0.62$ to $0.85$ . But we can’t calculate it without knowing $Z$ !

Defining a correlation for two parallel lines

So, we know the joint distribution of $(x, y)$ , and nothing more. How can think about the “two-lines correlation”? I can only think of one reasonable way to solve the problem. We need to find lower and upper bounds for $Cor (X, Y; Z)$ . Because we might not know $Z$ , but maybe we can find bounds for the correlation $Cor (X, Y; Z)$ that we could have calculate had we known $Z$ .

Finding bounds

I can’t think of an efficient way to calculate upper bounds, but the lower bound is simple enough (in the case when the correlation is positive): It equals $Cor (X, Y)$ . When dealing with estimation problems, a reasonable estimator of the upper bound would be to make every possible split of the data into into two categories (according to z)and fit the regression model y ~ x + z. But that would be $2^{n}$ combinations, which is obviously intractable for large $n$ . It might be possible to find exact algorithms that aren’t exponential in time, indeed, I would expect it, but I can’t justify spending more time on this problem.

However, consider the following data

n <- 15
x <- mtcars$drat[1:n]
y <- mtcars$wt[1:n]

Since $2^{15} = 32768$ we don’t need to fit that many linear regressions to estimate the upper correlation bound in this case.

Code estimating a bunch of regression models.

n <- 15
x <- mtcars$drat[1:n]
y <- mtcars$wt[1:n]
n <- length(x)

mat <- cbind(1, x, 0)
minimum = Inf

for(k in seq(0:n)) {
  icomb <- arrangements::icombinations(n, k)
  choose(n, k)
  for(i in seq(choose(n, k))) {
    indices <- icomb$getnext()
    mat[, 3] <- 0
    mat[indices, 3] <- 1
    mod <- lm.fit(mat, y)
    current <- sum(mod$residuals^2)
    if (current < minimum) {
      minimum = current
      z = indices
      coef <- mod$coefficients
    }
  }  
}

cor2 <- sqrt(1 - minimum/sum((y - mean(y))^2)) * sign(coef[2])
cor1 <- cor(x, y)

From the code above we find that lower and upper bounds for the two lines correlation:

c(lower = cor2, upper = cor1)

   lower.x      upper 
-0.8834406 -0.6358763

Let’s also have a look at the classification of points that maximizes this correlation.

Code for plot.

col <- rep("blue", n)
col[z] <- "red"
plot(x, y, col = col, xlab = "Rear axle ratio", ylab = "Weight",
     pch = 20, main = "Two lines or one line?")
abline(a = coef[1], b = coef[2], col = "blue")
abline(a = coef[1] + coef[3], b = coef[2], col = "red")
abline(lm(y ~ x))

The two lines correlation for this optimal classification equals the lower bound above.