Two-Stage Least Squares
Two-stage least squares (2SLS) is the most widely used instrumental variable estimator. It addresses endogeneity from omitted variables, simultaneity, and measurement error by replacing the endogenous predictor with its exogenously predicted component. It is best suited for linear models with a continuous endogenous predictor.
How It Works
Step 1: First-Stage Regression
Regress the endogenous predictor on the instrumental variable(s) and all exogenous controls and fixed effects from the outcome equation. Save the predicted values — these represent the variation in the predictor that is driven solely by the instrument and the exogenous controls, stripped of the endogenous component.
Step 2: Second-Stage Regression
Estimate the outcome equation, but replace the original endogenous predictor with the predicted values from the first stage. The instrument is excluded from this stage.
Because the predicted values contain only exogenous variation, the second-stage coefficient estimates the causal effect free of endogeneity bias — provided the instrument is both strong and valid.
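The two steps can be sketched by hand in Python with NumPy. This is a minimal illustration on simulated data, not a replacement for a dedicated routine: the helper names (ols, two_stage_least_squares), the variable names, and the data-generating process are all assumptions made for the example. The true effect of x is set to 2.0, and an unobserved confounder u makes x endogenous.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(y, x_endog, Z, W):
    """2SLS point estimates, ordered as [endogenous predictor, controls...].

    Z holds the instrument(s); W holds the exogenous controls, including
    the constant. Both are 2-D arrays with one row per observation.
    """
    # Stage 1: endogenous predictor on instruments plus all controls.
    first_stage_X = np.column_stack([Z, W])
    x_hat = first_stage_X @ ols(first_stage_X, x_endog)
    # Stage 2: outcome on predicted values plus controls; instruments excluded.
    return ols(np.column_stack([x_hat, W]), y)


# Hypothetical simulated data: u is an unobserved confounder.
rng = np.random.default_rng(0)
n = 20_000
u = rng.normal(size=n)
z = rng.normal(size=n)  # instrument
w = rng.normal(size=n)  # exogenous control
x = 0.8 * z + 0.5 * w + u + rng.normal(size=n)
y = 2.0 * x + 1.0 * w - 1.5 * u + rng.normal(size=n)  # true effect of x: 2.0

W = np.column_stack([np.ones(n), w])
beta_ols = ols(np.column_stack([x, W]), y)
beta_2sls = two_stage_least_squares(y, x, z.reshape(-1, 1), W)

print("OLS estimate of the effect of x: ", round(beta_ols[0], 3))
print("2SLS estimate of the effect of x:", round(beta_2sls[0], 3))
```

In this setup the OLS estimate is pulled away from 2.0 by the confounder, while the 2SLS estimate should land close to it. The sketch returns point estimates only; the standard errors from a hand-run second stage should not be used for inference.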
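Unlike the control function approach, 2SLS is implemented in standard software as a single estimation procedure: both stages are handled in one call, so the reported standard errors already account for the use of first-stage predictions. Running the two regressions by hand reproduces the coefficients, but the standard errors printed by the second regression are not valid.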
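To illustrate what the standard error correction amounts to, the sketch below compares two sets of standard errors for the same 2SLS coefficients. The naive version uses the residuals from the second-stage regression (based on the predicted values); the corrected version recomputes the residuals with the original endogenous predictor. It assumes homoskedastic errors and simulated data, and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical simulated data with an unobserved confounder u.
rng = np.random.default_rng(1)
n = 5_000
u = rng.normal(size=n)
z = rng.normal(size=n)  # instrument
x = 0.5 * z + u + rng.normal(size=n)
y = 1.0 * x + 2.0 * u + rng.normal(size=n)

const = np.ones(n)
Z = np.column_stack([z, const])  # instrument plus constant
X = np.column_stack([x, const])  # endogenous predictor plus constant

# Stage 1: project the regressors on the instruments (the constant maps to itself).
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
# Stage 2: regress the outcome on the projected regressors.
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]

dof = n - X.shape[1]
XtX_inv = np.linalg.inv(X_hat.T @ X_hat)

# Naive: residuals use the predicted values, as a hand-run second stage would.
resid_naive = y - X_hat @ beta
se_naive = np.sqrt(np.diag(XtX_inv) * (resid_naive @ resid_naive) / dof)

# Corrected: residuals use the original predictor with the 2SLS coefficients.
resid_correct = y - X @ beta
se_correct = np.sqrt(np.diag(XtX_inv) * (resid_correct @ resid_correct) / dof)

print("naive SE for x:    ", se_naive[0])
print("corrected SE for x:", se_correct[0])
```

The two sets of standard errors differ even though the coefficients are the same. Which one is larger depends on the data, so the naive values cannot be relied on in either direction.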
Requirements
- At least one strong instrumental variable (it significantly predicts the endogenous predictor after accounting for the controls)
- At least one valid instrumental variable (it affects the outcome only through the endogenous predictor)
- The first stage must include the same controls and fixed effects as the outcome equation
- Works best with a continuous endogenous predictor
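Instrument strength can be checked with the first-stage F-statistic for the excluded instruments, which compares the first stage with and without the instruments. The sketch below is a minimal version under homoskedasticity; the function name first_stage_f and the simulated data are assumptions for illustration. A commonly cited rule of thumb treats an F-statistic below about 10 as a sign of a weak instrument. Validity, by contrast, cannot be established by a first-stage test and has to be argued on substantive grounds.

```python
import numpy as np


def first_stage_f(x_endog, Z, W):
    """F-statistic for the joint significance of the instruments Z in the
    first stage, given exogenous controls W (which include the constant)."""
    n = len(x_endog)

    def rss(X):
        resid = x_endog - X @ np.linalg.lstsq(X, x_endog, rcond=None)[0]
        return resid @ resid

    full = np.column_stack([Z, W])
    q = Z.shape[1]               # number of excluded instruments
    rss_restricted = rss(W)      # first stage without the instruments
    rss_unrestricted = rss(full)
    return ((rss_restricted - rss_unrestricted) / q) / (
        rss_unrestricted / (n - full.shape[1])
    )


# Hypothetical data: the same instrument is strong for one predictor, weak for another.
rng = np.random.default_rng(2)
n = 2_000
z = rng.normal(size=n)
w = rng.normal(size=n)
W = np.column_stack([np.ones(n), w])
Z = z.reshape(-1, 1)

x_strong = 0.8 * z + 0.5 * w + rng.normal(size=n)
x_weak = 0.01 * z + 0.5 * w + rng.normal(size=n)

f_strong = first_stage_f(x_strong, Z, W)
f_weak = first_stage_f(x_weak, Z, W)
print("first-stage F, strong instrument:", f_strong)
print("first-stage F, weak instrument:  ", f_weak)
```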
Limitations
2SLS becomes less practical in certain settings:
- Binary or discrete predictors — the linear first stage may produce predicted values outside the logical range
- Interaction terms — when the endogenous predictor appears in interactions, instrumenting becomes cumbersome
- Nonlinear models — logit, probit, Poisson, and similar models do not lend themselves naturally to the 2SLS framework
In these cases, the control function approach is often a better fit. In linear models with a continuous endogenous predictor, both approaches produce identical coefficient estimates. They diverge when interactions or nonlinearities are present: 2SLS is more robust to misspecification, while the control function is more efficient when correctly specified.
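The equivalence in the linear case can be checked numerically. The sketch below, on simulated data with hypothetical variable names, estimates the same model twice: once by 2SLS (outcome on the first-stage predicted values) and once by the control function approach (outcome on the original predictor plus the first-stage residual).

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Hypothetical simulated data with an unobserved confounder u.
rng = np.random.default_rng(3)
n = 1_000
u = rng.normal(size=n)
z = rng.normal(size=n)  # instrument
w = rng.normal(size=n)  # exogenous control
x = 0.7 * z + 0.3 * w + u + rng.normal(size=n)
y = 1.5 * x - 0.5 * w + u + rng.normal(size=n)

W = np.column_stack([np.ones(n), w])

# Shared first stage: predicted values and residuals.
first_stage_X = np.column_stack([z, W])
x_hat = first_stage_X @ ols(first_stage_X, x)
v_hat = x - x_hat

# 2SLS: replace x with its predicted values.
beta_2sls = ols(np.column_stack([x_hat, W]), y)

# Control function: keep x and add the first-stage residual as a regressor.
beta_cf = ols(np.column_stack([x, W, v_hat]), y)

print("2SLS coefficient on x:            ", beta_2sls[0])
print("control function coefficient on x:", beta_cf[0])
```

The coefficients on x and on the controls agree up to floating-point error, because the first-stage residual is orthogonal to both the predicted values and the controls. The additional coefficient on v_hat in the control function regression has no counterpart in 2SLS.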
Implementing the Two-Stage Least Squares Approach
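The ivregress command in Stata (https://www.stata.com/manuals/rivregress.pdf) implements two-stage least-squares regression. The general form is ivregress 2sls depvar [varlist1] (varlist2 = varlist_iv), where varlist1 holds the exogenous controls, varlist2 the endogenous predictors, and varlist_iv the instruments. Adding the first option also reports the first-stage estimates.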
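The ivreg package in R (https://cran.r-project.org/web/packages/ivreg/index.html) implements two-stage least-squares regression through its ivreg() function. The model is given as a two-part formula, for example y ~ x + w | z + w, with the regressors before the bar and the instruments together with the exogenous controls after it. To obtain the first-stage estimates, a separate regression model needs to be run.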
References
- Papies, Dominik, Peter Ebbes, and Harald van Heerde (2017), “Addressing Endogeneity in Marketing Models,” Advanced Methods for Modeling Markets, Cham: Springer, 581–627.
- Wooldridge, Jeffrey M. (2010), Econometric Analysis of Cross Section and Panel Data, Cambridge, MA: MIT Press.