The Sources of Endogeneity

The first step in addressing the endogeneity problem is to identify which sources are most likely to bias the estimates. We distinguish five canonical sources. All five share a common structure: an unobserved factor ends up in the error term of the regression model and is correlated with the predictor, biasing the estimated effect.
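This common structure can be written compactly for a simple regression with one predictor. The notation is an assumption introduced here for illustration: y is the outcome, x the predictor, β the true effect, and ε the error term.

```latex
y_i = \alpha + \beta x_i + \varepsilon_i,
\qquad
\operatorname{Cov}(x_i, \varepsilon_i) \neq 0
\quad\Longrightarrow\quad
\operatorname{plim}\, \hat{\beta}_{\mathrm{OLS}}
  = \beta + \frac{\operatorname{Cov}(x_i, \varepsilon_i)}{\operatorname{Var}(x_i)}
```

The second term is the bias. Its sign follows the sign of the covariance between the predictor and whatever has been left in the error term.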

To make these sources concrete, we use a running example throughout: a study examining whether exposure to an influencer on social media causes consumers to purchase the featured product.

1. Omitted Variable

An omitted variable affects both the predictor and the outcome but is not included in the model.

In our example, consumers who follow an influencer may already have a pre-existing interest in the product. This interest drives both the likelihood of seeing the influencer’s content (the predictor) and the likelihood of purchasing the product (the outcome). If not controlled for, the effect of influencer exposure on purchasing is confounded — the estimate captures the influence of pre-existing interest alongside any true influencer effect, biasing the result upward.
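The confounding described above can be illustrated with a small simulation. This is a minimal sketch under assumed values, not an analysis of real data: the coefficients (0.5, 0.8, 1.0), the variable names, and the continuous "purchase" score are all hypothetical choices made for illustration.

```python
import numpy as np

def ols_slopes(y, *regressors):
    """Return OLS slope coefficients (intercept dropped) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(0)
n = 100_000
true_effect = 0.5  # assumed true effect of exposure on the purchase score

# Hypothetical data-generating process: interest drives both variables.
interest = rng.normal(size=n)                    # unobserved pre-existing interest
exposure = 0.8 * interest + rng.normal(size=n)   # interest raises exposure
purchase = true_effect * exposure + 1.0 * interest + rng.normal(size=n)

naive = ols_slopes(purchase, exposure)[0]                 # interest left in the error term
controlled = ols_slopes(purchase, exposure, interest)[0]  # interest included as a control

print(f"naive: {naive:.3f}  controlled: {controlled:.3f}  true: {true_effect}")
```

Under these assumptions the naive slope should land near 0.99 (0.5 + 0.8/1.64), roughly twice the true effect, while the controlled slope should land near 0.5.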

A specific case of omitted variables arises from the structure of the data itself. Observations may be nested over time (multiple observations per user) or within groups (multiple consumers following the same influencer). Higher-level characteristics — such as seasonality or influencer-specific traits — can affect both the predictor and the outcome. Failing to account for this structure leaves these factors in the error term, biasing the estimate upward or downward.
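The nested case can be sketched in the same way. The example below assumes a hypothetical influencer-level trait that raises both exposure and purchases, and compares a pooled regression with one that removes group means first (a within-group, or fixed-effects, transformation). The transformation is one common way to account for the structure, chosen here for illustration; the group sizes and coefficients are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, per_group = 500, 200   # hypothetical: 500 influencers, 200 followers each
true_effect = 0.5                # assumed true effect of exposure

group = np.repeat(np.arange(n_groups), per_group)
trait = rng.normal(size=n_groups)[group]   # unobserved influencer-level trait
exposure = trait + rng.normal(size=group.size)
purchase = true_effect * exposure + trait + rng.normal(size=group.size)

def slope(y, x):
    """Simple-regression slope of y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

def demean_by_group(values, groups, k):
    """Subtract each group's mean from its members."""
    means = np.bincount(groups, weights=values, minlength=k) / np.bincount(groups, minlength=k)
    return values - means[groups]

pooled = slope(purchase, exposure)   # trait stays in the error term
within = slope(demean_by_group(purchase, group, n_groups),
               demean_by_group(exposure, group, n_groups))   # trait is differenced out

print(f"pooled: {pooled:.3f}  within: {within:.3f}  true: {true_effect}")
```

With these assumed values the pooled slope should be close to 1.0 and the within-group slope close to 0.5. Flipping the sign of the trait's effect on purchases would push the pooled estimate below the true effect instead, which is the "upward or downward" point made above.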

2. Treatment Selection

Treatment selection is a special case of omitted variables that applies when the predictor is binary — a “treatment” that units either receive or do not.

In our example, an algorithm may decide which consumers see the influencer’s content based on past browsing behavior. Consumers with stronger product interest are more likely to be exposed (scoring “1” on the treatment). But the same browsing behavior that triggers exposure also predicts purchasing. The effect of influencer exposure is therefore confounded with the non-random selection into treatment, biasing the estimate upward.

The key distinction from a general omitted variable: here, the endogeneity operates specifically through the selection mechanism that assigns the binary treatment.
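A short simulation makes the selection mechanism visible. It is a sketch under stated assumptions: the logistic assignment rule, the coefficient values, and the continuous purchase score are hypothetical, and the randomized benchmark is included only to show what the estimate would look like without selection.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
true_effect = 0.3   # assumed effect of exposure on the purchase score

browsing = rng.normal(size=n)   # past browsing behavior (proxy for product interest)
noise = rng.normal(size=n)

def mean_difference(exposed):
    """Difference in mean outcome between exposed and unexposed consumers."""
    purchase = true_effect * exposed + 1.0 * browsing + noise
    return purchase[exposed].mean() - purchase[~exposed].mean()

# Algorithmic assignment: heavier browsers are more likely to be exposed.
p_exposed = 1.0 / (1.0 + np.exp(-2.0 * browsing))
exposed_by_algorithm = rng.random(n) < p_exposed

# Benchmark: exposure assigned by a coin flip, unrelated to browsing.
exposed_at_random = rng.random(n) < 0.5

biased = mean_difference(exposed_by_algorithm)
benchmark = mean_difference(exposed_at_random)

print(f"algorithmic: {biased:.3f}  randomized: {benchmark:.3f}  true: {true_effect}")
```

Under algorithmic assignment the exposed group starts out with higher browsing levels, so the raw difference in means should sit well above the assumed effect. Under random assignment it should sit close to it.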

3. Simultaneity

Simultaneity arises when the predictor and the outcome mutually cause each other.

In our example, a consumer who has already decided to buy a product may actively seek out confirming information: browsing social media, encountering influencer content, and thus becoming “exposed” to the influencer after the purchase decision has been made. The outcome (the purchase decision) then precedes the predictor (exposure), even though the data suggest the reverse. If exposure in turn reinforces purchasing, causation runs in both directions.

This reciprocal causation means the estimated effect reflects a feedback loop rather than a one-directional causal relationship, biasing the estimate upward.
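The feedback loop can be sketched as a two-equation system in which exposure affects purchasing and purchasing affects exposure. The coefficients below (0.5 and 0.4) and the variable names are assumptions chosen for illustration. The data are generated from the solved (reduced-form) system so that both equations hold at once.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
beta = 0.5    # assumed effect of exposure on purchase
gamma = 0.4   # assumed feedback effect of purchase on exposure

u = rng.normal(size=n)   # shock to purchase
v = rng.normal(size=n)   # shock to exposure

# Structural system:
#   purchase = beta * exposure + u
#   exposure = gamma * purchase + v
# Solving the two equations jointly gives the reduced form below.
denom = 1.0 - beta * gamma
exposure = (gamma * u + v) / denom
purchase = (beta * v + u) / denom

def slope(y, x):
    """Simple-regression slope of y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

naive = slope(purchase, exposure)
# Large-sample value of the naive slope with unit-variance shocks.
expected = (gamma + beta) / (gamma**2 + 1.0)

print(f"naive: {naive:.3f}  large-sample value: {expected:.3f}  true: {beta}")
```

Because the purchase shock u feeds back into exposure, exposure is correlated with the error term of the purchase equation. With these assumed values the naive slope should be near 0.78 rather than 0.5.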

4. Measurement Error

Measurement error arises from inaccuracies in how the predictor or outcome is measured. It causes endogeneity in two distinct ways, depending on whether the error is unsystematic or systematic.

Unsystematic Measurement Error

Random noise in the predictor biases the estimated effect toward zero (attenuation bias). The noisier the measure, the more the true effect is understated. This can be mitigated by incorporating the reliability of the predictor into the model.

Random noise in the outcome, by contrast, does not bias the coefficient: it is simply absorbed into the error term, although it makes the estimate less precise.
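Both statements about unsystematic error can be checked with a simulation. The sketch below assumes a true effect of 0.5 and measurement noise with the same variance as the true predictor, so the reliability (true variance divided by measured variance) is 0.5 by construction. In practice the reliability would have to be estimated, and dividing by it is only one simple way of building it into the analysis.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
true_effect = 0.5   # assumed true effect of exposure

def slope(y, x):
    """Simple-regression slope of y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

exposure = rng.normal(size=n)                            # true exposure, variance 1
purchase = true_effect * exposure + rng.normal(size=n)   # true purchase score

noise_sd = 1.0   # assumed standard deviation of the random measurement noise
exposure_measured = exposure + rng.normal(scale=noise_sd, size=n)
purchase_measured = purchase + rng.normal(scale=noise_sd, size=n)

# Reliability = var(true) / var(measured); known here only because we simulated it.
reliability = 1.0 / (1.0 + noise_sd**2)

noisy_predictor = slope(purchase, exposure_measured)   # attenuated toward zero
noisy_outcome = slope(purchase_measured, exposure)     # unbiased, just noisier
corrected = noisy_predictor / reliability              # simple disattenuation

print(f"noisy predictor: {noisy_predictor:.3f}  noisy outcome: {noisy_outcome:.3f}  "
      f"corrected: {corrected:.3f}")
```

With a reliability of 0.5, the slope on the noisy predictor should be about half the true effect (near 0.25), while the slope with a noisy outcome and the corrected slope should both be near 0.5.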

Systematic Measurement Error

When measurement errors in the predictor and outcome are correlated, the bias can go in either direction. For example, if both influencer exposure and purchases are self-reported, respondents who overstate one may also overstate the other. In that case the errors are positively correlated, and this common method bias inflates the observed relationship because the shared error enters both the measured predictor and the error term.
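The role of the correlation between the two errors can be sketched as follows. The simulation assumes a hypothetical respondent-level reporting tendency that is added to self-reported exposure and either added to or subtracted from self-reported purchases. All values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
true_effect = 0.5   # assumed true effect of exposure

def slope(y, x):
    """Simple-regression slope of y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

exposure = rng.normal(size=n)
purchase = true_effect * exposure + rng.normal(size=n)

# Hypothetical respondent-specific tendency to over- or understate.
tendency = rng.normal(size=n)
reported_exposure = exposure + tendency

# Errors move together: people who overstate exposure also overstate purchases.
same_direction = slope(purchase + tendency, reported_exposure)

# Errors move in opposite directions: overstating one goes with understating the other.
opposite_direction = slope(purchase - tendency, reported_exposure)

print(f"same direction: {same_direction:.3f}  opposite: {opposite_direction:.3f}  "
      f"true: {true_effect}")
```

Under these assumptions, positively correlated errors should push the slope up to about 0.75, and negatively correlated errors should pull it down to about -0.25, so the sign of the error correlation determines the direction of the bias.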

5. Sample Selection

Sample selection arises when observations enter the sample non-randomly in a way that is related to the outcome.

In our example, suppose the data come from a company that only tracks existing customers or consumers who opted into tracking. Everyone in the sample already has some baseline interest in the product. Studying the effect of influencer exposure on purchases in this non-representative sample produces misleading results: high-propensity buyers are overrepresented, so the estimate is biased upward to the extent that these consumers respond more strongly to exposure than the broader population does.

Sample selection is a problem to the extent that the researcher wants to generalize beyond the observed sample. If the research question is explicitly limited to the population represented in the data, the concern is less pressing.
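The generalization problem can be illustrated with a simulation. This sketch rests on one loudly stated assumption that is not in the text above: consumers with higher baseline interest respond more strongly to exposure. Exposure is drawn independently of interest so that sample selection is the only source of distortion; the tracking threshold and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 400_000

def slope(y, x):
    """Simple-regression slope of y on x (with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

interest = rng.normal(size=n)   # baseline product interest
exposure = rng.normal(size=n)   # independent of interest by construction

# ASSUMPTION: the effect of exposure grows with baseline interest.
individual_effect = 0.3 + 0.2 * interest   # population average effect is 0.3
purchase = individual_effect * exposure + interest + rng.normal(size=n)

# Hypothetical tracking rule: only consumers with high interest enter the data.
tracked = interest > 0.5

population_slope = slope(purchase, exposure)
tracked_slope = slope(purchase[tracked], exposure[tracked])

print(f"population: {population_slope:.3f}  tracked sample: {tracked_slope:.3f}")
```

The tracked-sample slope (roughly 0.53 under these assumptions) is a fair description of the tracked consumers but overstates the population average of 0.3, which is why the concern depends on the population the researcher wants to generalize to.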

Identifying the Relevant Sources

Not every study faces all five sources equally. The diagnostic question for each is straightforward:

  • Omitted variable: Is there any variable that affects both the predictor and the outcome?
  • Treatment selection: Is there any variable that determines who receives the treatment and also affects the outcome?
  • Simultaneity: Does the outcome also affect the predictor?
  • Measurement error: Is the predictor measured with random noise, or are the measurement errors in the predictor and the outcome correlated?
  • Sample selection: Are observations non-randomly selected into the sample in a way that is related to the outcome?

If the answer to any of these is yes, consider the likely direction of the bias — will the confound inflate or deflate the estimated effect? This assessment is theoretical rather than statistical, and it guides the choice of appropriate remedies.