An errors-in-variables model

Author: Jonas Moss
Published: Mar 2, 2023

Problem and solution

Suppose we wish to estimate the regression coefficient for

$$Y_0 = \alpha_0 + \beta_0 X_0 + \sigma_0 \epsilon_0, \quad \text{where } E[\epsilon_0 \mid X_0] = 0.$$

However, we do not observe $Y_0$ and $X_0$. Instead, we observe $Y_1 = Y_0 + S_Y \epsilon_Y$, where $\epsilon_Y$ satisfies $E[\epsilon_Y \mid Y_0, S_Y] = 0$ and $\operatorname{Var}(\epsilon_Y \mid S_Y) = 1$ for some random variable $S_Y$, and $X_1 = X_0 + S_X \epsilon_X$, where $\epsilon_X$ satisfies $E[\epsilon_X \mid X_0, S_X] = 0$ and $\operatorname{Var}(\epsilon_X \mid S_X) = 1$. The scales $S_Y$ and $S_X$ are observed, and the measurement errors are uncorrelated with each other and with $(X_0, Y_0)$.

As is well known, the regression coefficient is $\beta_0 = \operatorname{Cov}(X_0, Y_0)/\operatorname{Var} X_0$. We cannot compute these moments directly, but $\operatorname{Cov}(Y_1, X_1)$ equals $\operatorname{Cov}(Y_0, X_0)$, since the measurement errors are uncorrelated with everything else. For $\operatorname{Var} X_1$, we can employ the law of total variance, $$\operatorname{Var} X_1 = E[\operatorname{Var}(X_1 \mid S_X)] + \operatorname{Var}(E[X_1 \mid S_X]).$$ The term $\operatorname{Var}(E[X_1 \mid S_X])$ vanishes, as $E[X_1 \mid S_X] = E[X_0]$ is constant when $X_0$ is independent of $S_X$. Moreover, $E[\operatorname{Var}(X_1 \mid S_X)] = \operatorname{Var} X_0 + E[S_X^2]$, hence $\operatorname{Var} X_0 = \operatorname{Var} X_1 - E[S_X^2]$.

Define the regression model $$Y_1 = \alpha_1 + \beta_1 X_1 + \epsilon_1,$$ where $\alpha_1$ and $\beta_1$ are the population least-squares coefficients of $Y_1$ on $X_1$ and the residual $\epsilon_1$ combines $\sigma_0 \epsilon_0$, $S_Y \epsilon_Y$, and the measurement error in $X_1$.

It follows that $$\beta_0 = \frac{\operatorname{Cov}(Y_1, X_1)}{\operatorname{Var} X_1 - E[S_X^2]} = \beta_1 \frac{\operatorname{Var} X_1}{\operatorname{Var} X_1 - E[S_X^2]}.$$ Moreover, since the measurement errors have mean zero, $E Y_0 = E Y_1$ and $E X_0 = E X_1$, so $$\alpha_0 = E Y_0 - \beta_0 E X_0 = E Y_1 - \beta_1 \frac{\operatorname{Var} X_1}{\operatorname{Var} X_1 - E[S_X^2]} E X_1.$$ If $E X_1$ has been normalized to $0$, then $\alpha_0 = E Y_1 = \alpha_1$.

Notice that $Y_1$ has known error scales. This makes it, perhaps, possible to estimate $\beta_1$ with additional precision, using something similar to weighted least squares with weights proportional to $1/(S_Y^2 + \sigma^2)$, where $\sigma^2$ is the residual variance. However, as $\sigma^2$ is unknown, the resulting regression would not truly be weighted least squares. (A sketch of this idea follows at the end of the next section.)

Verification

Let’s simulate a bunch of values from the model.

n = 1000000
s_x = sqrt(3) * rexp(n)  # known error scales for X; E[S_X^2] = 6
s_y = 3 * rexp(n)        # known error scales for Y
s_0 = 1                  # regression error scale sigma_0

x_0 = rnorm(n, 1, 2)                    # true covariate; Var X_0 = 4
x_1 = x_0 + s_x * rnorm(n)              # observed covariate
y_0 = 0.8 + 0.5 * x_0 + s_0 * rnorm(n)  # true response; alpha_0 = 0.8, beta_0 = 0.5
y_1 = y_0 + s_y * rnorm(n)              # observed response

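As a quick sanity check (my addition, not part of the original analysis), the law of total variance above predicts $\operatorname{Var} X_1 = \operatorname{Var} X_0 + E[S_X^2] = 4 + 6 = 10$:

# Both sides should agree up to simulation noise:
c(var(x_1), var(x_0) + mean(s_x^2))
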
The calculated coefficients are

# 2 * var(s_x) estimates E[S_X^2]: for a scaled exponential, E[S^2] = 2 Var(S).
# A distribution-free alternative is mean(s_x^2).
beta0_hat = cov(y_1, x_1) / (var(x_1) - 2 * var(s_x))
alpha0_hat = mean(y_1) - beta0_hat * mean(x_1)
c(alpha0_hat, beta0_hat)
[1] 0.7964205 0.4951465

But the naive regression $Y_1 \sim \alpha_1 + \beta_1 X_1$ yields

lm(y_1 ~ x_1)

Call:
lm(formula = y_1 ~ x_1)

Coefficients:
(Intercept)          x_1  
     1.0927       0.1988  

On the other hand, the correct (but unobserved) regression yields

lm(y_0 ~ x_0)

Call:
lm(formula = y_0 ~ x_0)

Coefficients:
(Intercept)          x_0  
     0.7987       0.4999  
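
Finally, as promised, a minimal sketch of the approximately weighted regression from the first section. This is my own illustration, not part of the original analysis: sigma2_hat is a heuristic plug-in for the unknown residual variance, and the last line rescales the weighted $\hat\beta_1$ to target $\beta_0$ as in the equations above.

# Two-step, approximately weighted fit of y_1 on x_1.
fit_ols <- lm(y_1 ~ x_1)
# Heuristic plug-in: total residual variance minus the average
# known measurement variance E[S_Y^2].
sigma2_hat <- max(mean(resid(fit_ols)^2) - mean(s_y^2), 0)
w <- 1 / (s_y^2 + sigma2_hat)   # approximate inverse-variance weights
fit_wls <- lm(y_1 ~ x_1, weights = w)
# Rescale the weighted beta_1 estimate to recover beta_0:
coef(fit_wls)[2] * var(x_1) / (var(x_1) - 2 * var(s_x))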

Inference and literature

To do inference with this method, use the delta method and large-sample theory (together with the studentized bootstrap), or perhaps the bias-corrected and accelerated bootstrap (BCa). The delta method should be fairly easy to derive using the formulation of the “covariance of the covariance” found in, e.g., Magnus and Neudecker’s Matrix Differential Calculus.
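
To sketch the delta-method calculation (my derivation, under the assumptions above): $\hat\beta_0$ is a smooth function of three sample moments,

$$\hat\beta_0 = g(\hat\theta), \qquad \hat\theta = \left(\widehat{\operatorname{Cov}}(Y_1, X_1),\; \widehat{\operatorname{Var}}\, X_1,\; \overline{S_X^2}\right), \qquad g(\theta) = \frac{\theta_1}{\theta_2 - \theta_3},$$

so that

$$\sqrt{n}\,(\hat\beta_0 - \beta_0) \to N\!\left(0,\; \nabla g^\top \Sigma\, \nabla g\right), \qquad \nabla g = \frac{1}{\theta_2 - \theta_3}\begin{pmatrix} 1 \\ -\beta_0 \\ \beta_0 \end{pmatrix},$$

where $\Sigma$ is the asymptotic covariance matrix of $\hat\theta$; estimating $\Sigma$ is where the covariance-of-the-covariance formulas come in.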

There is a sizable literature on errors-in-variables models, and inference for this simple model has probably been worked out, but a very rudimentary search yielded nothing for me. I think it’s uncommon to know the variances of the X errors. Moreover, the problem can probably be cast in the language of structural equation models. But I’m unsure if software (such as lavaan) will help, because you don’t know the item variances in a typical application of structural equation models.

A final option is to assume bivariate normality and use maximum likelihood. This is also likely possible using an R package, but I’m not sure the estimates would be consistent if normality fails. You would probably have to use a sandwich matrix for correct standard errors.
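
A minimal sketch of that likelihood approach, assuming $X_0 \sim N(\mu, \tau^2)$, which together with normal errors makes $(X_1, Y_1)$ bivariate normal given $(S_X, S_Y)$; the parametrization and the function name nll are my own:

# Negative log-likelihood of (x_1, y_1) given the known error scales
# (s_x, s_y). Parameters: alpha, beta, mu, and log-variances for
# tau^2 = Var(X_0) and sigma^2 = sigma_0^2.
nll <- function(p) {
  alpha <- p[1]; beta <- p[2]; mu <- p[3]
  tau2 <- exp(p[4]); sigma2 <- exp(p[5])
  v11 <- tau2 + s_x^2                    # Var(X_1 | S_X)
  v22 <- beta^2 * tau2 + sigma2 + s_y^2  # Var(Y_1 | S_Y)
  v12 <- beta * tau2                     # Cov(X_1, Y_1)
  det <- v11 * v22 - v12^2
  dx  <- x_1 - mu
  dy  <- y_1 - (alpha + beta * mu)
  # Bivariate normal log-density, constant term dropped:
  sum(0.5 * (log(det) + (v22 * dx^2 - 2 * v12 * dx * dy + v11 * dy^2) / det))
}
fit <- optim(c(0, 0.1, 0, 0, 0), nll, method = "BFGS")
fit$par[1:2]  # estimates of alpha_0 and beta_0

With the simulated data above, this should land close to $(0.8, 0.5)$, since the simulation happens to satisfy the normality assumptions.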

To make things easy on yourself, if you’re faced with a problem of this kind, I would suggest just going with the BCa together with the equations above. The equations are trivial to compute, and the BCa is fairly simple as well; it can be calculated using packages such as bootstrap. Do something else only if the reviewers demand it.
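
For instance, a BCa interval for $\beta_0$ using the boot package (the bootstrap package mentioned above would work similarly); the statistic function is my own wrapper around the equations above, using the distribution-free mean(s_x^2) rather than 2 * var(s_x):

library(boot)

# Closed-form estimate of beta_0 on a resampled data set.
beta0_stat <- function(d, idx) {
  d <- d[idx, ]
  cov(d$y_1, d$x_1) / (var(d$x_1) - mean(d$s_x^2))
}

dat <- data.frame(y_1, x_1, s_x)
res <- boot(dat, beta0_stat, R = 1999)
boot.ci(res, type = "bca")

Be warned that resampling a million rows a couple of thousand times takes a while; for data of this size, it may be worth trying the interval on a subsample first.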