Correlation with two lines!

Author

Jonas Moss

Published

Jan 25, 2023

What is this? A rant about an unimportant problem. Read at your own risk!

Everyone knows the correlation coefficient. It measures the degree of linear dependence between two random variables X and Y. It finds the best-fitting line Y=a+bX and measures how far the observed values of Y are from the predicted values of y given X.

Generate and plot some data from a line.
set.seed(313)
n = 100
x = seq(0, 1, length.out = n)
y = 2 * x + 1 + rnorm(n) * 0.5
plot(x, y, main = paste0("Sample correlation: ", round(cor(x,y), 3)))
abline(lm(y ~ x))

Plot of data simulated from y=2x+1.

That should be familiar. But what if there are TWO parallel lines in the data?!

Generate ad plot some data frome two lines!
set.seed(313)
par(mfrow=c(1,2))
n = 50
x1 = seq(0, 1, length.out = n)
y1 = 2 * x1 + 1 + rnorm(n) * 0.5
x2 = seq(0, 1, length.out = n)
y2 = 2 * x2 + 2 + rnorm(n) * 0.5
plot(x1, y1, main = paste0("Sample correlation: ???"), col = "blue", ylim = c(0, 6), xlab = "x", ylab = "y")
points(x2, y2, col = "red")
abline(a = 1, b = 2, col = "blue")
abline(a = 2, b = 2, col = "red")
x = c(x1, x2)
y = c(y1, y2)
z = c(rep(0, 50), rep(1, 50))
plot(x, y, main = paste0("Sample correlation: ", round(cor(x,y), 3)), xlab = "x", ylab = "y", ylim = c(0, 6))
abline(a = 3/2, b = 2)
abline(a = 1, b = 2, lty = 2)
abline(a = 2, b = 2, lty = 2)

Plot of data simulated from y=2x+1 and y=2x+2!

The data has been simulated from the model yx,z=02x+1+12ϵ,yx,z=12x+2+12ϵ. where z is a group indicator and ϵ is standard normal. Moreover, the xs are 50 observations uniformly spaced on [0,1]. What is the marginal model, with z integrated out? It’s 2x+3/2, displayed on the right. The plot on the right certainly doesn’t look like it contains two lines.

If we know the joint distribution of (X,Y,Z) , generalizing the correlation is easy! The ordinary correlation coefficient can be written as Cor(X,Y)=signβR2,where R2=1mina,bE[(YabX)2]minμE[(Yμ)2], and β is the minimizing slope of the numerator. Now just add the covariate 1[Z=1] to the linear regression model in the denominator, and the resulting correlation becomes Cor(X,Y;Z)=signβR2, where R2=1mina,bE[(Ya0a11[Z=1]bX)2]minμE[(Yμ)2], and β is the minimizing slope of the numerator (the sign of the slope is unambiguous since we’ve assumed the lines are parallel).

Calculating this correlation coefficient is easy as pie.

mod = lm(y ~ x + z)
unname(sqrt(summary(mod)$r.sq) * sign(coef(mod)[3]))
[1] 0.8495882

And its much larger than the “one-line correlation”, going from 0.62 to 0.85. But we can’t calculate it without knowing Z!

Defining a correlation for two parallel lines

So, we know the joint distribution of (x,y), and nothing more. How can think about the “two-lines correlation”? I can only think of one reasonable way to solve the problem. We need to find lower and upper bounds for Cor(X,Y;Z). Because we might not know Z, but maybe we can find bounds for the correlation Cor(X,Y;Z) that we could have calculate had we known Z.

Finding bounds

I can’t think of an efficient way to calculate upper bounds, but the lower bound is simple enough (in the case when the correlation is positive): It equals Cor(X,Y). When dealing with estimation problems, a reasonable estimator of the upper bound would be to make every possible split of the data into into two categories (according to z)and fit the regression model y ~ x + z. But that would be 2n combinations, which is obviously intractable for large n. It might be possible to find exact algorithms that aren’t exponential in time, indeed, I would expect it, but I can’t justify spending more time on this problem.

However, consider the following data

n <- 15
x <- mtcars$drat[1:n]
y <- mtcars$wt[1:n]

Since 215=32768 we don’t need to fit that many linear regressions to estimate the upper correlation bound in this case.

Code estimating a bunch of regression models.
n <- 15
x <- mtcars$drat[1:n]
y <- mtcars$wt[1:n]
n <- length(x)

mat <- cbind(1, x, 0)
minimum = Inf

for(k in seq(0:n)) {
  icomb <- arrangements::icombinations(n, k)
  choose(n, k)
  for(i in seq(choose(n, k))) {
    indices <- icomb$getnext()
    mat[, 3] <- 0
    mat[indices, 3] <- 1
    mod <- lm.fit(mat, y)
    current <- sum(mod$residuals^2)
    if (current < minimum) {
      minimum = current
      z = indices
      coef <- mod$coefficients
    }
  }  
}

cor2 <- sqrt(1 - minimum/sum((y - mean(y))^2)) * sign(coef[2])
cor1 <- cor(x, y)

From the code above we find that lower and upper bounds for the two lines correlation:

c(lower = cor2, upper = cor1)
   lower.x      upper 
-0.8834406 -0.6358763 

Let’s also have a look at the classification of points that maximizes this correlation.

Code for plot.
col <- rep("blue", n)
col[z] <- "red"
plot(x, y, col = col, xlab = "Rear axle ratio", ylab = "Weight",
     pch = 20, main = "Two lines or one line?")
abline(a = coef[1], b = coef[2], col = "blue")
abline(a = coef[1] + coef[3], b = coef[2], col = "red")
abline(lm(y ~ x))

Plot of correlation with two lines.

The two lines correlation for this optimal classification equals the lower bound above.