{"id":63,"date":"2024-06-10T08:33:35","date_gmt":"2024-06-10T08:33:35","guid":{"rendered":"https:\/\/www.endogeneity.net\/?page_id=63"},"modified":"2026-04-04T14:33:38","modified_gmt":"2026-04-04T14:33:38","slug":"the-remedies","status":"publish","type":"page","link":"https:\/\/www.endogeneity.net\/?page_id=63","title":{"rendered":"Preventing Endogeneity"},"content":{"rendered":"<div class=\"fusion-fullwidth fullwidth-box fusion-builder-row-1 fusion-flex-container has-pattern-background has-mask-background nonhundred-percent-fullwidth non-hundred-percent-height-scrolling\" style=\"--link_hover_color: var(--awb-color5);--link_color: var(--awb-color5);--awb-background-blend-mode:multiply;--awb-border-color:var(--awb-color1);--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-padding-top:50.156000000000006px;--awb-padding-bottom:0px;--awb-padding-top-small:70px;--awb-padding-right-small:40px;--awb-padding-bottom-small:0px;--awb-padding-left-small:40px;--awb-margin-bottom-medium:0px;--awb-margin-bottom-small:60px;--awb-background-color:#ffffff;--awb-flex-wrap:wrap;\" ><div class=\"fusion-builder-row fusion-row fusion-flex-align-items-center fusion-flex-content-wrap\" style=\"max-width:1248px;margin-left: calc(-4% \/ 2 );margin-right: calc(-4% \/ 2 );\"><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-0 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-padding-bottom-medium:0px;--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:85px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-margin-bottom-small:44px;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper 
fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-1 fusion-text-no-margin\" style=\"--awb-content-alignment:center;--awb-text-color:var(--awb-color1);--awb-margin-right:15%;--awb-margin-bottom:0px;--awb-margin-left:15%;\"><p style=\"text-align: left;\"><span style=\"color: #000000;\"><strong>Preventing Endogeneity<\/strong><\/span><\/p>\n<p style=\"text-align: left; color: #000000;\">Preventing endogeneity is far preferable to curing it. Prevention strategies address the problem at the source \u2014 before estimation \u2014 while cures rely on strong, often untestable assumptions that can introduce new validity threats.<\/p>\n<p style=\"text-align: left; color: #000000;\">The following approaches each target specific sources of endogeneity. No single approach addresses all sources, so researchers should implement every prevention strategy that applies to their setting before considering corrective methods.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">1. Control Variables<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Addresses: Omitted variables<\/p>\n<p style=\"text-align: left; color: #000000;\">Untestable assumption: The relevant sources of endogeneity can be measured with control variables.<\/p>\n<p style=\"text-align: left; color: #000000;\">The control-variable approach, often described as a rich data approach, prevents endogeneity by anticipating likely omitted confounders and including them as controls. 
Conceptually, it treats endogeneity as a partial-correlation problem: if the focal predictor is partly driven by observable confounders, conditioning on these variables removes the endogenous component.<\/p>\n<p style=\"text-align: left; color: #000000;\">This means researchers should:<\/p>\n<ul style=\"text-align: left;\">\n<li><span style=\"color: #000000;\">Theorize which confounders are plausibly omitted<\/span><\/li>\n<li><span style=\"color: #000000;\">Prioritize collecting data on those variables<\/span><\/li>\n<li><span style=\"color: #000000;\">Include them in the estimation<\/span><\/li>\n<\/ul>\n<p style=\"text-align: left; color: #000000;\">In marketing contexts, this approach is both feasible and effective when detailed behavioral and transactional data are available.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">A Word of Caution<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Not all controls are helpful. Controls should be selected based on a causal model of the data-generating process. Only variables that plausibly cause both the predictor and the outcome address omitted-variable bias. Including mediators (variables on the causal path between predictor and outcome) or colliders (variables caused by both predictor and outcome) can introduce bias rather than remove it.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">2. Matching<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Addresses: Treatment selection<\/p>\n<p style=\"text-align: left; color: #000000;\">Untestable assumption: The relevant sources of endogeneity can be removed through matching on observed variables.<\/p>\n<p style=\"text-align: left; color: #000000;\">When rich covariate data are available, matching approaches address treatment selection by comparing units with similar observed characteristics but different treatment statuses. 
Recent implementations scale to high-dimensional settings and accommodate continuous treatments \u2014 for example, through tree-based methods for propensity estimation or outcome modeling.<\/p>\n<p style=\"text-align: left; color: #000000;\">However, matching is limited to balancing observed covariates. It cannot address selection driven by unobservables.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">3. Dummy Controls<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Addresses: Omitted variables<\/p>\n<p style=\"text-align: left; color: #000000;\">Untestable assumption: The relevant sources of endogeneity can be captured by dummy controls.<\/p>\n<p style=\"text-align: left; color: #000000;\">Dummy variables absorb unobserved factors that are constant within clusters or time periods \u2014 such as customer-specific traits, day-of-the-week effects, or regional differences. However, they do not address omitted variables that vary within clusters or periods.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">4. 
Panel Fixed Effects<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Addresses: Omitted variables<\/p>\n<p style=\"text-align: left; color: #000000;\">Untestable assumption: The relevant sources of endogeneity can be captured by panel fixed effects.<\/p>\n<p style=\"text-align: left; color: #000000;\">Panel fixed effects exploit repeated observations of the same units over time:<\/p>\n<ul style=\"text-align: left;\">\n<li><span style=\"color: #000000;\">Unit fixed effects absorb stable differences across units<\/span><span style=\"color: #0c0c0c;\"><span style=\"color: #0c0c0c;\"><span style=\"color: #000000;\"><br \/>\n<\/span><\/span><\/span><\/li>\n<li><span style=\"color: #0c0c0c;\"><span style=\"color: #0c0c0c;\"><span style=\"color: #000000;\">Time fixed effects absorb shocks common to all units in a given period<\/span><\/span><\/span><\/li>\n<li><span style=\"color: #0c0c0c;\"><span style=\"color: #0c0c0c;\"><span style=\"color: #000000;\">Two-way fixed effects address both sources simultaneously<\/span><\/span><\/span><\/li>\n<\/ul>\n<p style=\"text-align: left; color: #000000;\">However, two-way fixed effects do not capture unobserved shocks that vary by both unit and time. Those require additional time-varying controls.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">A Key Trade-off<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Unit fixed effects discard all stable between-unit differences, and time fixed effects discard all variation common to every unit in a given period, even when such variation is theoretically meaningful. Effects are then identified only from the variation that remains within units and within periods. Researchers should weigh identification gains against the loss of substantively important variation.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">5. 
Dynamic Panel Analysis<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Addresses: Omitted variables<\/p>\n<p style=\"text-align: left; color: #000000;\">Untestable assumption: The relevant sources of endogeneity can be removed by using deep lags of the dependent variable as instrumental variables.<\/p>\n<p style=\"text-align: left; color: #000000;\">Panel data also enable first-differencing and dynamic panel estimators, which estimate effects from within-unit changes rather than levels. These approaches use lagged values of the dependent variable as instruments to address inertia and persistence in outcomes.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">6. Latent Controls<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Addresses: Omitted variables<\/p>\n<p style=\"text-align: left; color: #000000;\">Untestable assumption: The relevant sources of endogeneity can be captured by latent control variables.<\/p>\n<p style=\"text-align: left; color: #000000;\">Latent-variable approaches proxy unobserved heterogeneity by inferring systematic differences across units from observed data patterns \u2014 rather than measuring them directly. These methods allocate units to latent segments or dimensions, capturing persistent heterogeneity that would otherwise end up in the error term.<\/p>\n<p style=\"text-align: left; color: #000000;\">The validity of these approaches rests on a strong assumption: that the relevant unobserved heterogeneity is sufficiently represented by the inferred latent structure. Conditional on these latent factors, the remaining correlation between the predictor and unobserved determinants of the outcome is assumed to be zero \u2014 an assumption that cannot be directly tested.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">7. 
Temporal Separation<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Addresses: Simultaneity<\/p>\n<p style=\"text-align: left; color: #000000;\">Untestable assumption: The relevant sources of endogeneity can be removed by lagging the predictor in fine-grained data.<\/p>\n<p style=\"text-align: left; color: #000000;\">A pragmatic approach to mitigating simultaneity is temporal separation: measuring the focal predictor before the outcome. Lagging strengthens the case for temporal precedence, but it does not guarantee exogeneity.<\/p>\n<p style=\"text-align: left; color: #000000;\">Lagged predictors remain endogenous when:<\/p>\n<ul style=\"text-align: left;\">\n<li><span style=\"color: #000000;\">Earlier outcomes influence earlier predictor decisions (feedback loops)<\/span><\/li>\n<li><span style=\"color: #000000;\">Forward-looking expectations drive current predictor values<\/span><\/li>\n<\/ul>\n<p style=\"text-align: left; color: #000000;\">When possible, researchers can also argue that predictors are predetermined based on institutional context \u2014 for example, when promotional calendars are set months in advance, current promotions are not based on current-period demand shocks.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">8. Measurement Error Correction<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">Addresses: Measurement error<\/p>\n<p style=\"text-align: left; color: #000000;\">Untestable assumption: The relevant sources of endogeneity can be removed by accounting for measurement error.<\/p>\n<p style=\"text-align: left; color: #000000;\">Systematic measurement error can attenuate or inflate the observed effect. This concern is well recognized in survey research but equally relevant when constructs are extracted from unstructured data (text, images, audio) using machine learning. 
ML-generated variables should be treated as error-prone measures.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">What to Report<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">At a minimum, researchers should report out-of-sample predictive performance using validation data and conduct sensitivity analyses over plausible error rates, recognizing that classification error depends on the labeling threshold.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">Correction Methods<\/strong><\/p>\n<p style=\"text-align: left; color: #000000;\">When the measurement-error process can be approximated:<\/p>\n<ul style=\"text-align: left;\">\n<li><span style=\"color: #000000;\">SIMEX corrects continuous variables with classical additive error<\/span><\/li>\n<li><span style=\"color: #000000;\">MC-SIMEX corrects discrete variables characterized by a known misclassification matrix (sensitivity and specificity)<\/span><\/li>\n<\/ul>\n<p style=\"text-align: left; color: #000000;\">Both methods propagate first-stage prediction error into second-stage estimates, providing bias-corrected coefficients.<\/p>\n<p style=\"text-align: left;\"><strong style=\"color: #000000;\">References<\/strong><\/p>\n<ul>\n<li style=\"text-align: left;\"><span style=\"color: #000000;\">Qiao, Mengke and Ke-Wei Huang (2025), &#8220;Correcting Measurement Error in Regression Models with Variables Constructed from Aggregated Output of Data Mining Models,&#8221; MIS Quarterly, 49 (1), 29\u201360.<\/span><\/li>\n<li style=\"text-align: left;\"><span style=\"color: #000000;\">Yang, Mochen, Gediminas Adomavicius, Gordon Burtch, and Yuqing Ren (2018), \u201cMind the Gap: Accounting for Measurement Error and Misclassification in Variables Generated via Data Mining,\u201d Information Systems Research, 29 (1), 
4\u201324.<\/span><\/li>\n<\/ul>\n<\/div><\/div><\/div><\/div><\/div>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"100-width.php","meta":{"footnotes":""},"class_list":["post-63","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.endogeneity.net\/index.php?rest_route=\/wp\/v2\/pages\/63","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.endogeneity.net\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.endogeneity.net\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.endogeneity.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.endogeneity.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=63"}],"version-history":[{"count":17,"href":"https:\/\/www.endogeneity.net\/index.php?rest_route=\/wp\/v2\/pages\/63\/revisions"}],"predecessor-version":[{"id":495,"href":"https:\/\/www.endogeneity.net\/index.php?rest_route=\/wp\/v2\/pages\/63\/revisions\/495"}],"wp:attachment":[{"href":"https:\/\/www.endogeneity.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=63"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}