Omitted variable bias

We assume that the risk factor of interest is denoted by \(x_{1}\) and the outcome of interest by \(y\). Both are collected from a sample of research subjects. Typically we are interested in whether or not measurements of the putative risk factor \(x_{1}\) are associated with measurements of \(y\). If we see an association, it could suggest that \(x_{1}\) is a risk factor of concern.

The formula:

\[\begin{equation} \tag{1} y =\beta_{1}x_{1} + \epsilon \end{equation}\] specifies a simple linear relationship between \(x_{1}\) and \(y\), with an error term \(\epsilon\) that is typically assumed to be normally distributed, and a slope (regression coefficient) \(\beta_{1}\).

Confounding occurs when there is an association between an unmeasured variable, say \(x_{2}\), and both \(x_{1}\) and \(y\). When confounding is present, the apparent relationship between \(x_{1}\) and \(y\) can be either amplified (appear larger than it really is) or attenuated (appear smaller than it really is).

The problem of confounding is well known in science, particularly in fields where observational study designs (which are affected by confounding) are common. Researchers attempt to control for confounding by including known confounders in their models. For example, if \(x_{2}\) is a confounder of the relationship between \(x_{1}\) and \(y\), then

\[\begin{equation} \tag{2} y =\beta_{1}x_{1}+ \beta_{2}x_{2} + \epsilon \end{equation}\]

is used in place of \((1)\) with the expectation that the confounding effect is accounted for, so that \(\beta_{1}\) estimates the effect of \(x_{1}\) on \(y\) with \(x_{2}\) held constant.

Here is a simple simulation that illustrates this process. First, we generate predictions of \(y\) using

\[\begin{equation} \tag{3} y =\beta_{1}x_{1}+ \beta_{2}x_{2} + N(0,1) \end{equation}\]

where

\[\begin{equation} \tag{4} x_{1} =x_{2} + N(0,1) \end{equation}\]

and N(0,1) denotes normally distributed random error with a mean of 0 and a standard deviation of 1. Note that here all associations are positive and the true model parameters for \(x_{1}\) and \(x_{2}\) (\(\beta_{1}\) and \(\beta_{2}\), respectively) equal 1. The estimate for the variable of interest \((x_{1})\) is biased when the estimator of \(\beta_{1}\) differs from 1.

#set.seed(8675309)
x2 <- rnorm(10000,0,1)             #confounder
x1 <- x2 + rnorm(10000,0,1)        #risk factor of interest, associated with x2
y <- (x1 + x2) + rnorm(10000,0,1)  #outcome, with beta1 = beta2 = 1

Next we estimate \(\beta_{1}\) with the linear regression function lm(), using the omitted variable model \((1)\). We see that the point estimate for \(\beta_{1}\) (around 1.5) is greater than 1: it has been amplified by the omitted variable bias.

#model with omitted variable
out1 <- lm(y~x1)
summary(out1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2781 -0.8440 -0.0104  0.8261  4.0807 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.009893   0.012380   0.799    0.424    
## x1          1.499261   0.008754 171.267   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.238 on 9998 degrees of freedom
## Multiple R-squared:  0.7458, Adjusted R-squared:  0.7458 
## F-statistic: 2.933e+04 on 1 and 9998 DF,  p-value: < 2.2e-16
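
The size of this bias is predictable. Under the classic omitted variable bias result, the large-sample slope from the short regression is \(\beta_{1} + \beta_{2}\mathrm{Cov}(x_{1},x_{2})/\mathrm{Var}(x_{1})\), which for the values simulated here is \(1 + 1 \times 1/2 = 1.5\). A quick check against the simulated data (a small sketch added for illustration):

#expected slope under the omitted variable model
1 + 1*cov(x1,x2)/var(x1)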

Here is the complete model \((2)\). The point estimates are unbiased estimators of the population parameters \(\beta_{1}\) and \(\beta_{2}\), and both are close to 1.

#model without omitted variable
out2 <- lm(y~x1 + x2)
summary(out2)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6254 -0.6761 -0.0040  0.6758  3.4894 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.007359   0.009975  -0.738    0.461    
## x1           0.986964   0.009911  99.580   <2e-16 ***
## x2           1.027686   0.013971  73.556   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9972 on 9997 degrees of freedom
## Multiple R-squared:  0.8351, Adjusted R-squared:  0.835 
## F-statistic: 2.531e+04 on 2 and 9997 DF,  p-value: < 2.2e-16

This form of omitted variable bias is well known. It is difficult to fully address in observational research, where there almost always remains the potential for unmeasured confounders.

Residual confounding

A slightly less discussed problem is residual confounding in measured variables. This confounding exists even when confounding variables are included in a model. Consider the complete model \((2)\) where \(x_{2}^*\) is substituted for \(x_{2}\) in the estimation of the \(\beta_{1}\) and \(\beta_{2}\) parameters, and

\[\begin{equation} \tag{5} x_{2}^*=x_{2} + N(0,0.5). \end{equation}\] Equation \((5)\) adds random error to the measurement of the confounding variable. Typically this error will be smaller than the variation in the original variable (a standard deviation of 0.5 here, compared with 1 for \(x_{2}\) above), but it may not be random in practice. Systematic error (such as instrument bias) in the measurement of this variable results in error that is less predictable.

#confounder measured with random error (sd = 0.5)
x2_re <- x2 + rnorm(10000,0,0.5)

The true model remains unchanged, but fitting it to the sample using \(x_{2}^*\) changes the estimate of \(\beta_{1}\).

#model with the mismeasured confounder
out3 <- lm(y~x1 + x2_re)
summary(out3)
## 
## Call:
## lm(formula = y ~ x1 + x2_re)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6911 -0.7488 -0.0054  0.7293  3.8334 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0004469  0.0108973   0.041    0.967    
## x1          1.1642036  0.0098969 117.633   <2e-16 ***
## x2_re       0.6700231  0.0124219  53.939   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09 on 9997 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.8031 
## F-statistic: 2.039e+04 on 2 and 9997 DF,  p-value: < 2.2e-16

As we see here, unbiased (random) error in the measurement of \(x_{2}\) results in a biased estimate of \(\beta_{1}\), albeit a smaller bias than when \(x_{2}\) was omitted entirely.
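
These point estimates can also be anticipated from the population moments of the simulation. With \(\mathrm{Var}(x_{1})=2\), \(\mathrm{Var}(x_{2}^*)=1.25\), \(\mathrm{Cov}(x_{1},x_{2}^*)=1\), \(\mathrm{Cov}(y,x_{1})=3\) and \(\mathrm{Cov}(y,x_{2}^*)=2\), the large-sample coefficients solve a two-equation system (a small sketch added for illustration):

#large-sample coefficients implied by the population moments
V <- matrix(c(2, 1, 1, 1.25), 2, 2)  #variance-covariance of (x1, x2*)
b <- c(3, 2)                         #covariance of y with (x1, x2*)
solve(V, b)                          #approximately 1.17 and 0.67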

Note that if \(x_{1}\) were measured with the same magnitude of random error (substituting \(x_{1}^*\) for \(x_{1}\) in \((2)\)), the bias would be towards the null (0).

#risk factor of interest measured with random error (sd = 0.5)
x1_re <- x1 + rnorm(10000,0,0.5)
#model with the mismeasured risk factor
out4 <- lm(y~x1_re + x2)
summary(out4)
## 
## Call:
## lm(formula = y ~ x1_re + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2907 -0.7224 -0.0066  0.7400  4.3729 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.010824   0.010867  -0.996    0.319    
## x1_re        0.795419   0.009659  82.352   <2e-16 ***
## x2           1.223961   0.014398  85.010   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.086 on 9997 degrees of freedom
## Multiple R-squared:  0.8043, Adjusted R-squared:  0.8042 
## F-statistic: 2.054e+04 on 2 and 9997 DF,  p-value: < 2.2e-16
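
Again, the estimates match those implied by the population moments, where now \(\mathrm{Var}(x_{1}^*)=2.25\) and \(x_{2}\) is measured without error (a small sketch added for illustration):

#large-sample coefficients implied by the population moments
V <- matrix(c(2.25, 1, 1, 1), 2, 2)  #variance-covariance of (x1*, x2)
b <- c(3, 2)                         #covariance of y with (x1*, x2)
solve(V, b)                          #approximately 0.8 and 1.2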

What we see overall is that random error in the measurement of the independent variables can produce biased estimates. In this particular example, imprecision in the measurement of the variable of interest, \(x_{1}\), biases \(\beta_{1}\) towards 0, while imprecision in the measurement of the confounder can bias \(\beta_{1}\) away from 0.


This is well known. What remains generally unknown, however, is how large these biases are in practice and whether or not changes in the precision of variable measurements have had any systematic effect on findings over time. It is conceivable that environmental health scientists see great value in measuring exposures more accurately over time, and are likely to get funding to do so. In air pollution research, for example, exposure estimates were once very crude, based on annual measurements over large geographic areas or on statistical models built from small samples of data. The work of Prindle (1959) is a good reminder of how much measurements of exposure have changed; in that study, the author examined respiratory cancer and respiratory mortality across urban and rural areas, using differences in population density as a very indirect proxy for exposure to air pollution.


Improvements in the measurement of exposure are important for improving our understanding of the relationship between exposure and the health outcome of concern. For example, in 2002, Pope et al. used small area estimates based on a mixture of air quality surveillance systems to estimate the association between fine particulate air pollution and both cancer and cardiopulmonary mortality. In more recent research, air pollution exposure estimates are based on mobile monitoring systems (see the review by Gozzi et al., 2016, for discussion) and are much more precise than the proxies of the past, such as distance from roadways and population density.


It is natural to ask whether there has been an equivalent improvement in the measurement of confounders. Residual confounding through the imprecise measurement of confounders is easily overlooked once a confounder has been adjusted for. It is clear, however, that if no improvement is made in the measurement of confounders over time, there could be a systematic net change in the apparent magnitude of the relationship between exposure to a hazard and a health outcome. If there is a systematic improvement in the measurement of exposures to hazards without a systematic improvement in the measurement of confounders, this should result in a systematic shift in the apparent impact of environmental hazards over time. This could have a large impact on policy, if regulations are to be informed by evidence.
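
One way to make this point concrete is a small extension of the simulation above (a sketch added here, not part of the original analysis): keep the confounder measured with the same error as \(x_{2}^*\), progressively reduce the measurement error in the risk factor of interest, and track the estimate of \(\beta_{1}\).

#sketch: improving measurement of the risk factor while the confounder
#measurement error stays fixed at sd = 0.5
sds <- c(1, 0.75, 0.5, 0.25, 0)
sapply(sds, function(s) {
  x1_obs <- x1 + rnorm(10000, 0, s)
  coef(lm(y ~ x1_obs + x2_re))[["x1_obs"]]
})

In this configuration the estimate of \(\beta_{1}\) rises steadily as the measurement of the risk factor improves, even though the true model never changes.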