Sensitivity Analysis

After addressing plausible confounding through rich controls and fixed effects, accounting for latent heterogeneity, and mitigating simultaneity and measurement error, researchers must assess whether meaningful endogeneity concerns remain. Because no statistical test can directly detect endogeneity, this assessment relies on sensitivity analysis.

The Core Idea

A common endogeneity critique is that an estimated effect may be driven by an unobserved confound. Sensitivity analysis as developed by Frank (2000) reframes this critique as a quantitative question: how strong would an unobserved confound need to be to change the conclusion?

Rather than debating whether bias exists, the approach asks how much bias would be required to push an estimate below a specified threshold, whether that threshold reflects statistical significance or managerial relevance.

An Example

Consider a retailer that deploys personalized coupons through its app. Estimated spending increases following coupon exposure, but a critic argues that higher-propensity customers were more likely to receive the coupon, rendering the estimated lift non-causal. The sensitivity analysis does not try to settle this debate directly. Instead, it asks: how strong would that unobserved propensity need to be to make the estimated coupon effect disappear?

Impact Threshold for a Confounding Variable (ITCV)

The ITCV quantifies the minimum strength an unobserved confound would need, expressed in partial-correlation terms conditional on observed controls, to overturn a focal inference.

The key insight is that an unobserved confound threatens inference only if it is related to both the predictor and the outcome. The ITCV captures this as the product of these two partial correlations. If no observed control variable in the model exerts an influence of comparable magnitude, an unobserved confound of the required strength is less plausible.
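To make the logic concrete, here is a minimal stdlib-Python sketch of the ITCV computation from a regression's t statistic, following Frank (2000). It is an illustration, not the konfound implementation: it assumes the confound is uncorrelated with the observed controls and uses a large-sample critical value (1.96) rather than an exact t quantile. The numbers in the example are invented.

```python
import math

def itcv(t_obs, df, t_crit=1.96):
    """Impact threshold for a confounding variable (Frank 2000).

    Converts the observed t statistic into a partial correlation,
    compares it to the just-significant correlation for the same
    degrees of freedom, and returns the minimum impact (the product
    of the confound's correlations with predictor and outcome)
    needed to overturn the inference.
    """
    r_obs = t_obs / math.sqrt(t_obs ** 2 + df)    # observed partial correlation
    r_thr = t_crit / math.sqrt(t_crit ** 2 + df)  # just-significant correlation
    return (r_obs - r_thr) / (1 - abs(r_thr))

# Illustrative inputs: t = 4.0 with 494 residual degrees of freedom.
impact = itcv(4.0, 494)          # roughly 0.098
# If the confound's two correlations were equal, each would need
# magnitude of about sqrt(impact) to overturn the inference.
component = math.sqrt(impact)    # roughly 0.31
```

The `component` value is what gets benchmarked against the partial correlations of the observed controls: if none of them comes close to this magnitude, an unobserved confound of the required strength is harder to credit.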

Beyond Inflated Effects

The ITCV is most directly informative when confounding inflates an estimated effect, raising Type I error concerns. But the same logic applies to suppression: situations where confounding masks a true effect. By redefining the inference threshold relative to a substantively meaningful benchmark, researchers can assess how much bias would be required to sustain an incorrect conclusion that an effect is negligible or nonexistent.

Supported model types:

  • Ordinary least squares
  • Poisson regression
  • Tobit regression
  • Weighted regression
  • Fixed effects and random effects models

Robustness of Inference to Replacement (RIR)

The RIR builds on the same logic but expresses robustness differently: it reports the percentage of observations that would need to be entirely driven by an unobserved confounder for the inference to fail.

A high RIR means that a large share of the data would have to be contaminated by confounding before the result breaks down — a strong signal of robustness. Because this metric captures bias regardless of its source, it speaks to endogeneity from omitted variables, simultaneity, measurement error, and selection alike.
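In the same spirit, the RIR can be sketched from nothing more than the estimate and its standard error, following the bias-to-invalidate logic of Frank et al. (2013). Again this is an illustration under a stated assumption, a conventional large-sample significance cutoff of 1.96, and the input numbers are invented.

```python
def rir(estimate, std_err, t_crit=1.96):
    """Robustness of inference to replacement (Frank et al. 2013).

    Returns the share of the estimate that would have to be pure
    bias for it to fall below the inference threshold, i.e. the
    fraction of cases that would need to be replaced for the
    inference to fail.
    """
    threshold = t_crit * std_err
    return 1 - threshold / abs(estimate)

# Illustrative inputs: estimate 0.8 with standard error 0.2.
share = rir(0.8, 0.2)   # about 0.51, i.e. 51% of cases
```

A share of 0.51 says that roughly half the data would need to be driven entirely by bias before the inference breaks, which under this metric is a fairly robust result.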

Supported model types:

All models supported by ITCV, plus:

  • Logit regression
  • Probit regression

Implementation

Both metrics are implemented through the konfound command, which is run immediately after estimating the regression.
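The workflow is simple: fit the regression, then pass the focal coefficient, its standard error, the sample size, and the covariate count to the sensitivity routine. The following stdlib-only Python sketch mimics that pipeline on simulated data with a single predictor; it stands in for, and is not, the konfound command itself, and it reuses the large-sample 1.96 cutoff.

```python
import math
import random

# Simulated data: true slope 0.5, standard normal noise.
random.seed(1)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

# Step 1: estimate the regression (OLS slope with intercept).
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
resid = [(yi - my) - beta * (xi - mx) for xi, yi in zip(x, y)]
se = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)

# Step 2: feed the summary statistics to the sensitivity step,
# here the RIR share implied by the estimate and its threshold.
t_crit = 1.96
rir_share = 1 - (t_crit * se) / abs(beta)
```

Nothing beyond the regression's own summary statistics is needed, which is why these diagnostics can be run as a routine post-estimation step.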

Interpreting the Results

There is no universal cutoff for either metric. Both should be interpreted in context:

  • Benchmark against observed controls. Compare the ITCV threshold to the actual partial correlations of control variables already in the model. If no observed control comes close, an unobserved confound of the required strength is less plausible.
  • Benchmark against the literature. Compare the implied confound strength to effect sizes reported in related studies or meta-analyses.
  • Consider the RIR percentage. A result requiring, say, 40% of observations to be entirely driven by bias is more robust than one requiring only 5%.

Sensitivity analysis complements careful research design — it does not substitute for it. These diagnostics are most informative when applied to the strongest available specification, after incorporating all plausible controls, fixed effects, and design features.

An Alternative: Coefficient Stability and Unobservable Selection

A complementary sensitivity approach, developed by Oster (2019), formalizes a heuristic that researchers have long relied on: if a coefficient barely moves when controls are added, the result is often viewed as unlikely to be driven by unobserved confounding. Oster shows that this intuition is informative only when coefficient movements are evaluated jointly with movements in the R-squared.
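The commonly cited approximation to Oster's bias-adjusted estimator combines the coefficient movement and the R-squared movement between the uncontrolled and controlled regressions. The sketch below implements that approximation only (the exact estimator in Oster 2019 is more involved), and the example numbers are invented; r_max, the assumed R-squared if unobservables were also included, must be supplied by the researcher.

```python
def oster_beta_star(beta_short, r2_short, beta_long, r2_long,
                    r_max, delta=1.0):
    """Approximate bias-adjusted coefficient (Oster 2019).

    beta_short / r2_short: estimate and R^2 without controls.
    beta_long  / r2_long : estimate and R^2 with observed controls.
    r_max: assumed R^2 if unobservables were also controlled for.
    delta: relative importance of selection on unobservables vs.
           observables (delta = 1 means equal selection).
    """
    adj = (delta * (beta_short - beta_long)
           * (r_max - r2_long) / (r2_long - r2_short))
    return beta_long - adj

# Illustrative inputs: the coefficient falls 0.50 -> 0.40 while
# R^2 rises 0.10 -> 0.30; with r_max = 0.50 and equal selection,
# the adjusted estimate is about 0.30.
b_star = oster_beta_star(0.50, 0.10, 0.40, 0.30, r_max=0.50)
```

Note how the same coefficient drop implies a larger adjustment when the controls add little R-squared: a stable coefficient is reassuring only if the controls also absorbed substantial variance.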

Implementation

In Stata, Oster's estimator is available through the psacalc command, which likewise runs after the regression estimation.

What Sensitivity Analysis Does Not Do

These tools speak to the stability of an inference around a threshold, not to the bias in the causal effect itself. Substantial attenuation or inflation of the true effect may occur even when statistical significance is unchanged.

Sensitivity analysis neither reveals the source of bias nor resolves endogeneity. The appropriate interpretation is not that a model is unbiased, but that overturning the focal inference would require confounding of a specified magnitude. When that magnitude seems implausible given the institutional context and observed covariates, the inference stands on firmer ground.

When sensitivity diagnostics suggest that a plausible confound could overturn the result, researchers should escalate to instrumental variable or instrument-free approaches to attempt to correct the bias.

References

    • Altonji, Joseph G., Todd E. Elder, and Christopher R. Taber (2005), “Selection on Observed and Unobserved Variables: Assessing the Effectiveness of Catholic Schools,” Journal of Political Economy, 113 (1), 151-184.
    • Frank, Kenneth A. (2000), “Impact of a confounding variable on a regression coefficient,” Sociological Methods & Research, 29 (2), 147-194.
    • Frank, Kenneth A., Spiro J. Maroulis, Min Q. Duong, and Benjamin M. Kelcey (2013), “What would it take to change an inference? Using Rubin’s causal model to interpret the robustness of causal inferences,” Educational Evaluation and Policy Analysis, 35 (4), 437-460.
    • Frank, Kenneth A., Qinyun Lin, Spiro Maroulis, Anna S. Mueller, Ran Xu, Joshua M. Rosenberg, Christopher S. Hayter, Ramy A. Mahmoud, Marynia Kolak, Thomas Dietz, and Lixin Zhang (2021), “Hypothetical Case Replacement Can Be Used to Quantify the Robustness of Trial Results,” Journal of Clinical Epidemiology, 134, 150-159.
    • Narvaiz, Sarah, Qinyun Lin, Joshua M. Rosenberg, Kenneth A. Frank, Spiro J. Maroulis, Wei Wang, and Ran Xu (2024), “konfound: An R Sensitivity Analysis Package to Quantify the Robustness of Causal Inferences,” Journal of Open Source Software, 9 (95), 5779.
    • Oster, Emily (2019), “Unobservable Selection and Coefficient Stability: Theory and Evidence,” Journal of Business & Economic Statistics, 37 (2), 187-204.
    • Xu, Ran, Kenneth A. Frank, Spiro J. Maroulis, and Joshua M. Rosenberg (2019), “konfound: Command to Quantify Robustness of Causal Inferences,” The Stata Journal, 19 (3), 523-550.