The Sources

Omitted Variable
The first source of endogeneity is an omitted variable that affects both the predictor and the outcome. As described in the introduction, this happens if the analysis does not control for consumers’ interest in the product, which may affect both the likelihood of following the influencer (predictor) and the purchase of the featured product (outcome). Thus, the effect of influencer exposure on purchasing might be confounded with pre-existing interest in the product. Because the omitted variable is related to the predictor but is not modeled, its effect is absorbed into the error term of the regression model, so the error term correlates with the predictor. In this example, where interest raises both exposure and purchasing, the result is an upward bias of the observed effect.
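The mechanism can be illustrated with a small simulation. The sketch below is not taken from any study: the data-generating process, the coefficients (0.8 and 1.0), the true effect of 0.5, and the function names are all assumptions chosen for illustration. It compares a naive regression of purchase on exposure with one that first partials interest out of exposure.

```python
import random


def ols_slope(x, y):
    """Slope of a simple regression of y on x (intercept included)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residualize(v, z):
    """Part of v that is not linearly explained by z."""
    n = len(v)
    mean_v, mean_z = sum(v) / n, sum(z) / n
    b = ols_slope(z, v)
    return [(a - mean_v) - b * (c - mean_z) for a, c in zip(v, z)]


def simulate_omitted_variable(n=20000, true_effect=0.5, seed=1):
    rng = random.Random(seed)
    # Assumed process: interest drives both exposure and purchase.
    interest = [rng.gauss(0, 1) for _ in range(n)]
    exposure = [0.8 * z + rng.gauss(0, 1) for z in interest]
    purchase = [true_effect * x + 1.0 * z + rng.gauss(0, 1)
                for x, z in zip(exposure, interest)]

    # Naive model: interest is omitted and ends up in the error term.
    naive = ols_slope(exposure, purchase)
    # Controlled model: partial interest out of exposure; the slope on the
    # residualized predictor equals the multiple-regression coefficient.
    controlled = ols_slope(residualize(exposure, interest), purchase)
    return naive, controlled


naive, controlled = simulate_omitted_variable()
print(f"naive: {naive:.2f}, controlled: {controlled:.2f}, true: 0.50")
```

Under these assumptions the naive slope lies well above the true effect, while the controlled slope lies close to it. The direction of the bias follows from the assumed positive links between interest and both variables.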

A specific case of an omitted variable is an omitted fixed effect. For example, data on exposure and purchases may involve multiple observations per user over time or multiple consumers per influencer. In either case, observations are nested, either within users over time or within another higher-level entity. Thus, it is possible that higher-level characteristics, such as seasonality or influencer characteristics, affect both the predictor and the outcome. Not accounting for the data structure over time (i.e., multiple observations per user) or within groups (i.e., multiple consumers who follow the same influencer) introduces an omitted variable into the error term of the regression model, leading to an upward or downward bias of the observed effect.
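A minimal sketch of this case, under assumed values: an unobserved influencer trait shifts both exposure and purchases, and the sign of its effect on purchases determines the direction of the bias. Demeaning both variables within each influencer (the within transformation) removes the trait. The group sizes, coefficients, and names below are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope of a simple regression of y on x (intercept included)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def demean_within(values, groups):
    """Subtract each group's mean from its members (within transformation)."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def simulate_fixed_effects(trait_effect, n_groups=200, per_group=50,
                           true_effect=0.5, seed=2):
    rng = random.Random(seed)
    groups, exposure, purchase = [], [], []
    for g in range(n_groups):
        trait = rng.gauss(0, 1)  # unobserved influencer characteristic
        for _ in range(per_group):
            x = trait + rng.gauss(0, 1)
            y = true_effect * x + trait_effect * trait + rng.gauss(0, 1)
            groups.append(g)
            exposure.append(x)
            purchase.append(y)

    pooled = ols_slope(exposure, purchase)  # ignores the nesting
    within = ols_slope(demean_within(exposure, groups),
                       demean_within(purchase, groups))
    return pooled, within


for effect in (1.0, -1.0):
    pooled, within = simulate_fixed_effects(effect)
    print(f"trait effect {effect:+.0f}: pooled {pooled:.2f}, within {within:.2f}")
```

With a positive trait effect the pooled slope overstates the assumed true effect of 0.5; with a negative one it understates it. The within estimate stays near 0.5 in both cases.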

Treatment Choice
The second source of endogeneity arises when assignment to the treatment, that is, the value of the predictor, is not random. Without random assignment, treatment (the “0/1 score” of a binary predictor) is potentially endogenous, which makes treatment choice a specific case of an omitted variable. For example, an algorithm may infer a consumer’s interest in the featured product from past browsing behavior and, given the inferred interest, make the consumer more likely to be exposed to the influencer (i.e., scoring “1” on the predictor that captures exposure). The browsing behavior that affects the algorithm’s decision to expose a consumer to a certain influencer also affects purchasing. The effect of influencer exposure is then confounded with the non-random exposure, which was assigned in anticipation of a higher outcome. This systematic treatment choice is included in the error term, leading to an upward bias of the observed effect.
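The contrast between algorithmic and random assignment can be sketched as follows. The targeting rule, the coefficients, and the true effect of 0.5 are assumptions made for illustration and do not describe any actual platform algorithm.

```python
import random


def mean(values):
    return sum(values) / len(values)


def difference_in_means(treated, outcome):
    """Mean outcome of exposed consumers minus mean outcome of unexposed ones."""
    on = [y for t, y in zip(treated, outcome) if t == 1]
    off = [y for t, y in zip(treated, outcome) if t == 0]
    return mean(on) - mean(off)


def simulate_treatment_choice(targeted, n=20000, true_effect=0.5, seed=3):
    rng = random.Random(seed)
    treated, outcome = [], []
    for _ in range(n):
        browsing = rng.gauss(0, 1)  # drives purchasing; unobserved by the analyst
        if targeted:
            # Hypothetical algorithm: expose consumers with high inferred interest.
            t = 1 if browsing + rng.gauss(0, 1) > 0 else 0
        else:
            # Benchmark: exposure assigned by a fair coin flip.
            t = 1 if rng.random() < 0.5 else 0
        y = true_effect * t + browsing + rng.gauss(0, 1)
        treated.append(t)
        outcome.append(y)
    return difference_in_means(treated, outcome)


print(f"targeted exposure: {simulate_treatment_choice(True):.2f}")
print(f"random exposure:   {simulate_treatment_choice(False):.2f}")
```

Under targeting, the exposed group already has higher browsing-based interest, so the raw difference in purchases exceeds the assumed true effect. Under random assignment the difference stays close to it.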

Simultaneity
The third source of endogeneity occurs when the predictor and the outcome concurrently affect each other. In the example, purchasing a featured product may also affect exposure to the influencer on social media, even if the predictor appears to precede the outcome in time. In anticipation of the product purchase, a consumer may look for cues to confirm a decision that has already been made, browse social media for such cues, and be more likely to be exposed to the influencer who promotes the product. Thus, the decision to purchase a product (outcome) temporally precedes the exposure to the influencer promoting it (predictor). This reciprocal causation introduces simultaneity, which is captured in the error term and leads to an upward bias of the observed effect.
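A short derivation may clarify why reciprocal causation ends up in the error term. The notation is an assumption introduced here: $y$ denotes purchasing, $x$ exposure, $\beta$ the effect of interest, $\gamma$ the reverse effect of purchasing on exposure, and $u$ and $v$ uncorrelated errors with variances $\sigma_u^2$ and $\sigma_v^2$, with $\beta\gamma \neq 1$.

```latex
\begin{align}
y &= \beta x + u, \qquad x = \gamma y + v, \qquad \operatorname{Cov}(u, v) = 0, \\
x &= \frac{\gamma u + v}{1 - \beta\gamma}
  \;\Rightarrow\;
  \operatorname{Cov}(x, u) = \frac{\gamma\,\sigma_u^2}{1 - \beta\gamma}, \\
\operatorname{plim}\,\hat{\beta}_{\mathrm{OLS}}
  &= \beta + \frac{\operatorname{Cov}(x, u)}{\operatorname{Var}(x)}
   = \beta + \frac{\gamma\,(1 - \beta\gamma)\,\sigma_u^2}{\gamma^2 \sigma_u^2 + \sigma_v^2}.
\end{align}
```

Because $x$ depends on $u$ through $y$, the predictor correlates with the error term whenever $\gamma \neq 0$. In this sketch the bias is upward when $\gamma > 0$ and $\beta\gamma < 1$, which corresponds to the case where purchase intentions increase exposure.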

Measurement Error
The fourth source of endogeneity is related to inaccuracies in measuring the predictor or the outcome, which lead to endogeneity for two reasons. First, unsystematic measurement error in the predictor leads to attenuation bias, i.e., a downward bias of the observed effect. This source of endogeneity can be eliminated by incorporating the reliability of the predictor into the model. Unsystematic measurement error in the outcome, by contrast, does not bias the estimated effect because it is absorbed in the error term. Second, systematic measurement errors in the predictor and the outcome may each correlate with the (measurement of the) other variable. For example, if exposure to the influencer on social media and purchases were both self-reported measures, common method bias could bias the observed effect upward because the systematic measurement error is included in the error term of the regression model.
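The unsystematic case can be sketched numerically. The example below assumes a true effect of 0.5, a standardized true predictor, and a known measurement-error variance, so the reliability (true-score variance divided by observed variance) is also treated as known. In practice it would have to be estimated, and all names here are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope of a simple regression of y on x (intercept included)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_measurement_error(n=20000, true_effect=0.5, error_sd=1.0, seed=4):
    rng = random.Random(seed)
    true_exposure = [rng.gauss(0, 1) for _ in range(n)]
    purchase = [true_effect * x + rng.gauss(0, 1) for x in true_exposure]
    # Add unsystematic (purely random) error to each variable in turn.
    noisy_exposure = [x + rng.gauss(0, error_sd) for x in true_exposure]
    noisy_purchase = [y + rng.gauss(0, error_sd) for y in purchase]

    # Reliability of the noisy predictor; assumed known in this sketch.
    reliability = 1.0 / (1.0 + error_sd ** 2)

    attenuated = ols_slope(noisy_exposure, purchase)
    corrected = attenuated / reliability  # disattenuated slope
    outcome_noise_only = ols_slope(true_exposure, noisy_purchase)
    return attenuated, corrected, outcome_noise_only


attenuated, corrected, outcome_noise_only = simulate_measurement_error()
print(f"noisy predictor: {attenuated:.2f}")
print(f"reliability-corrected: {corrected:.2f}")
print(f"noisy outcome only: {outcome_noise_only:.2f}")
```

With a reliability of 0.5, the slope on the noisy predictor shrinks to roughly half the true effect, and dividing by the reliability restores it. Noise in the outcome alone leaves the slope approximately unchanged and only makes it less precise.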

Sample Selection
The fifth source of endogeneity arises when selection into the sample is not random but related to the outcome. As a result, the obtained effect is not representative of the population. For example, if data to explore the relationship between influencer exposure and purchases are obtained from a company, the outcome (i.e., purchase data) may only be available for existing customers or for consumers who agreed to be tracked by the company. Studying the effect of exposure on purchases in this sample will produce misleading results because all consumers in the sample are more likely to have an interest in the product. This systematic sample selection is included in the error term, leading to an upward bias of the observed effect.
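The role of the error term can be made explicit with a brief formal sketch. The notation is assumed for illustration: $s_i = 1$ indicates that consumer $i$ is in the sample (for instance, an existing or tracked customer), $x_i$ is exposure, $y_i$ is purchasing, and $u_i$ is the error term.

```latex
\begin{align}
y_i &= \alpha + \beta x_i + u_i, \qquad \text{observed only if } s_i = 1, \\
\operatorname{E}[\,y_i \mid x_i,\, s_i = 1\,]
  &= \alpha + \beta x_i + \operatorname{E}[\,u_i \mid x_i,\, s_i = 1\,].
\end{align}
```

If selection is unrelated to $u_i$ given $x_i$, the last term is constant and a regression on the selected sample still recovers $\beta$. If selection depends on the outcome, and thus on $u_i$, the last term generally varies with $x_i$ and the estimated effect is biased.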

Navigating Endogeneity in Marketing Research