The Endogeneity Problem
A good starting point for understanding the endogeneity problem is the concept of spurious correlations.
Spurious Correlations
Consider the well-known example: monthly ice cream sales and monthly murder rates are highly correlated. A naïve look at the data would suggest a reliable statistical relationship. But the relationship is spurious: it is driven by a confounder. Higher summer temperatures simultaneously raise both ice cream sales and murder rates. There is nothing coincidental about the pattern; an omitted common cause lies behind the observed association.
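The confounding logic can be illustrated with a small simulation. This is a minimal sketch on made-up data, not real sales or crime figures: the variable names, coefficients, and noise levels are all assumptions chosen for illustration. Temperature drives both series, the two series never influence each other, and the correlation largely disappears once temperature is partialled out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_months = 240  # hypothetical sample: 20 years of monthly observations

# Confounder: monthly temperature (made-up distribution)
temperature = rng.normal(loc=15.0, scale=8.0, size=n_months)

# Both series depend on temperature, but neither depends on the other
ice_cream_sales = 100.0 + 5.0 * temperature + rng.normal(0.0, 10.0, n_months)
murder_rate = 2.0 + 0.1 * temperature + rng.normal(0.0, 0.5, n_months)

# Raw correlation between the two series
raw_corr = np.corrcoef(ice_cream_sales, murder_rate)[0, 1]


def residualize(y, x):
    """Return the part of y that is not linearly explained by x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (intercept + slope * x)


# Correlation after removing the linear effect of temperature from both series
partial_corr = np.corrcoef(
    residualize(ice_cream_sales, temperature),
    residualize(murder_rate, temperature),
)[0, 1]

print(f"raw correlation:     {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```

With these assumed parameters the raw correlation is strong, while the partial correlation is close to zero.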
How Endogeneity Differs
Spurious correlations and endogeneity both make a relationship in the data look like something it is not. The key difference lies in directionality:
- A spurious correlation concerns two variables that appear associated with each other, with no claim about which one causes the other.
- Endogeneity arises when one variable is thought to cause the other, and the estimated causal effect is biased because of an omitted factor.
In practice, the line between the two is often blurry. Even when researchers frame their findings as mere “associations,” there is typically an implicit understanding of directionality — a predictor and an outcome have been defined, a directional hypothesis has been stated, and regression analysis has been used to estimate the effect. With that in mind, the logic of spurious correlations provides an intuitive foundation for understanding endogeneity.
What Is Endogeneity?
In its simplest form, regression analysis aims to assess the impact of a predictor on an outcome. The regression equation has three components:
- An intercept — where the regression line meets the y-axis when the predictor equals zero
- An estimated coefficient — the slope describing the effect of the predictor on the outcome, used for hypothesis testing
- An error term — everything else that influences the outcome but is not included in the model
The endogeneity problem arises when the predictor is correlated with one or more unobserved factors captured in the error term. When this happens, the estimated coefficient no longer reflects the true causal effect. It is biased — potentially overstating, understating, masking, or even reversing the real relationship.
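In symbols, the simple regression and the condition it relies on can be written as follows. The notation is an assumption introduced here for illustration (y for the outcome, x for the predictor, ε for the error term), and the last line is the standard large-sample result for the ordinary least squares slope in a one-predictor model.

```latex
% Simple regression: intercept, slope, and error term
\[
  y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]

% Exogeneity: the predictor is uncorrelated with the error term
\[
  \operatorname{Cov}(x_i, \varepsilon_i) = 0
\]

% If exogeneity fails, the OLS slope converges to the true slope plus a bias term
\[
  \operatorname{plim}\, \hat{\beta}_1
    = \beta_1 + \frac{\operatorname{Cov}(x_i, \varepsilon_i)}{\operatorname{Var}(x_i)}
\]
```

The sign and size of the covariance term determine whether the estimate overstates, understates, masks, or reverses the true effect.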
Why It Cannot Be Tested Directly
The error term is, by definition, unobservable. This means there is no direct way to test whether it is correlated with the predictor. The threat of endogeneity can therefore never be empirically ruled out — it must be addressed through a combination of research design, theoretical reasoning, and statistical tools.
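A note on terminology: the residual is the estimated value of the error term (observed outcome minus predicted outcome). The error term (or disturbance term) is its theoretical, pre-estimation counterpart. Residuals cannot stand in for the error term in a test of endogeneity, because ordinary least squares constructs them to be uncorrelated with the predictor.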
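The distinction between residual and error term matters in practice. The following minimal sketch assumes ordinary least squares and simulated data in which the true error term is known, something never available in real research. All names and coefficients are hypothetical. It compares the predictor's correlation with the true error term against its correlation with the fitted residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# An unobserved factor that influences both the predictor and the outcome
omitted = rng.normal(size=n)
x = 0.8 * omitted + rng.normal(size=n)

# The true error term contains the omitted factor (known only in a simulation)
error = 1.5 * omitted + rng.normal(size=n)
y = 1.0 + 2.0 * x + error  # assumed true slope: 2.0

# Ordinary least squares with an intercept
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta_hat

corr_x_error = np.corrcoef(x, error)[0, 1]        # the endogeneity we cannot observe
corr_x_residual = np.corrcoef(x, residual)[0, 1]  # what the fitted model shows

print(f"corr(x, true error): {corr_x_error:.3f}")
print(f"corr(x, residual):   {corr_x_residual:.3e}")
print(f"estimated slope:     {beta_hat[1]:.3f} (assumed true slope: 2.0)")
```

The residuals look clean even though the slope estimate is biased, which is why inspecting them says nothing about endogeneity.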
The Ice Cream Seller Example
Consider a smart ice cream seller who knows that good weather brings more people to the beach — and that those people are willing to pay more. So the seller raises prices on sunny days and lowers them on rainy days.
A researcher analyzing daily data on ice cream prices and sales would find that higher prices are associated with more sales. But this estimated price effect is biased. Weather affects both the price (the seller’s decision) and sales (consumer demand), but weather is not in the model. It is captured in the error term, creating a correlation between the predictor (price) and the error term.
Two Components of the Observed Effect
The observed price effect is a mix of two things:
- The exogenous component — the true causal effect of price on sales (which is negative: higher prices reduce demand)
- The endogenous component — the variation in price driven by the omitted weather variable (which creates a positive association: sunny days bring both higher prices and more customers)
Depending on which component dominates, the estimated effect could be negative, positive, or even zero. In any case, it does not reflect the true price effect that would emerge from a controlled experiment where prices are randomly assigned.
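The two components can be made concrete with a simulation of the ice cream seller. This is a sketch under stated assumptions: the true price effect of -10, the weather effect of 40, and the pricing rule are invented numbers, not estimates from any data. Because the data-generating process is known here, the endogenous component can be computed directly and compared with the naive regression slope.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 2_000

# Weather: 1.0 on sunny days, 0.0 on rainy days (assumed 50/50 split)
sunny = rng.binomial(1, 0.5, size=n_days).astype(float)

# The seller charges more on sunny days (hypothetical pricing rule)
price = 2.0 + 1.0 * sunny + rng.normal(0.0, 0.3, n_days)

# Assumed true model: price lowers sales, good weather raises them
TRUE_PRICE_EFFECT = -10.0
WEATHER_EFFECT = 40.0
sales = (
    100.0
    + TRUE_PRICE_EFFECT * price
    + WEATHER_EFFECT * sunny
    + rng.normal(0.0, 5.0, n_days)
)

# Naive regression of sales on price, with weather left in the error term
X_naive = np.column_stack([np.ones(n_days), price])
naive_coefs, *_ = np.linalg.lstsq(X_naive, sales, rcond=None)
naive_price_effect = naive_coefs[1]

# Endogenous component: the weather effect passed through its covariance with price
endogenous_component = (
    WEATHER_EFFECT * np.cov(price, sunny)[0, 1] / np.var(price, ddof=1)
)

print(f"exogenous (true) effect: {TRUE_PRICE_EFFECT:.2f}")
print(f"endogenous component:    {endogenous_component:.2f}")
print(f"naive estimate:          {naive_price_effect:.2f}")
```

Under these assumptions the endogenous component outweighs the true effect, so the naive estimate comes out positive. Shrinking the assumed weather effect would move the naive estimate toward zero or below it.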
Why the Solution Is Not Always Obvious
In this stylized example, the fix is straightforward: collect data on weather and include it as a control variable. But in realistic settings, addressing endogeneity is far more difficult:
- Researchers cannot readily observe all relevant omitted variables
- The sources of endogeneity extend beyond simple omitted variables to include simultaneity, measurement error, treatment selection, and sample selection
The first step in addressing the endogeneity problem is therefore to identify which sources are most likely to bias the estimates in a given study.
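For the stylized ice cream case, where the omitted variable is known and measurable, the control-variable fix can be sketched as follows. The data are simulated with the same kind of invented parameters as before (an assumed true price effect of -10). The sketch covers only the omitted-variable source, not simultaneity, measurement error, or selection.

```python
import numpy as np

rng = np.random.default_rng(7)
n_days = 2_000

# Simulated seller data (all parameters are assumptions for illustration)
sunny = rng.binomial(1, 0.5, size=n_days).astype(float)
price = 2.0 + 1.0 * sunny + rng.normal(0.0, 0.3, n_days)
TRUE_PRICE_EFFECT = -10.0
sales = (
    100.0
    + TRUE_PRICE_EFFECT * price
    + 40.0 * sunny
    + rng.normal(0.0, 5.0, n_days)
)


def ols(columns, y):
    """Fit OLS with an intercept; return coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Without the control: weather stays in the error term
naive_price_effect = ols([price], sales)[1]

# With the control: weather is moved out of the error term and into the model
controlled_price_effect = ols([price, sunny], sales)[1]

print(f"naive estimate:      {naive_price_effect:.2f}")
print(f"controlled estimate: {controlled_price_effect:.2f}")
print(f"assumed true effect: {TRUE_PRICE_EFFECT:.2f}")
```

The controlled estimate lands near the assumed true effect only because the simulation contains exactly one omitted variable and it is measured without error.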