Skip to content

Introduction to Difference-in-Differences

Info

Author: Void, posted on 2021-06-24, reading time: about 5 minutes, WeChat public account article link:

1 Introduction

As the name suggests, Difference-in-Differences (DID) involves taking the difference twice. So, what is the relationship between DID and Little Doraemon? Moreover, why take difference twice? Don't worry, let us explain it slowly.

The DID model is a common econometric model that examines the impact of an experiment or event and is a bit similar to an AB test. Unlike linear regression, which simply characterizes the correlation relationship, DID is a compact and practical model for causal inference. To understand the origin of this model, we need to start with the assumptions of linear regression.

2 Assumptions of linear regression

We all know linear regression, but we may still not know much about it. It is this "simple" formula:

\[ Y = \beta X + \varepsilon \]

While we enjoy using linear regression, we often ignore the four assumptions of the model:

  • Linearity

  • Strict exogeneity

    \[E(\varepsilon_{t}|X)=E(\varepsilon_{t}|X_{1},X_{2}\cdots X_{n})=0\]
  • No perfect multicollinearity

  • Homoscedasticity

    \[E(\varepsilon_{t}^{2}|X)=\sigma^{2}\]
    \[E(\varepsilon_{t}\varepsilon_{s}|X)=0\]

In layman's terms, Y and X must satisfy a linear relationship (no kidding...). The residual (the difference between the actual and estimated values) is not related to X and has the properties of homoscedasticity and no autocorrelation. There cannot be an X that is the father (represented linearly) of other Xs. Okay, we smart people are ready to ignore these assumptions. But wait, if the assumptions are not met, the estimate could be inaccurate.

Among them, strict exogeneity is a rather haughty and easily unsatisfied condition. In this case, we often refer to this model as having an endogeneity problem. Let's use primary school math to review what is strict exogeneity.

3 Strict exogeneity

\[E(\varepsilon_{t}|X)=E(\varepsilon_{t}|X_{1},X_{2}\cdots X_{n})=0\]
\[t=1,2\cdots n\]

According to the law of iterated expectation, \(E(Y|X)=E[E(Y|X,Z)|X]\), we have

\[E(\varepsilon_{t}|X_{t})=E[E(\varepsilon_{t}|X)|X_{t}]=0\]
\[E(\varepsilon_{t})=E[E(\varepsilon_{t}|X)]=0\]

Hence,

\[E(X_{s}\varepsilon_{t})=E[E(X_{s}\varepsilon_{t}|X)]=0\]

Therefore, \(cov(X_{s},\varepsilon_{t})=0\), which assumes that the disturbance term \(\varepsilon_{t}\) and explanatory variables are not linearly correlated.

4 Common forms of endogeneity problems

Okay, we have finally (not really) understood the academic term "strict exogeneity." So, in actual data, how does endogeneity appear?

  • Omitted variable bias (there are other Xs that can effectively estimate Y)

  • X and Y are mutually causal (such as X being education level and Y being income, education level can affect income, while income can also affect education level, such as getting an MBA)

5 Solutions

  • Instrumental variable\ Find an external variable that is related to the endogenous explanatory variable but uncorrelated with the random disturbance term. Regress together with other existing exogenous variables to obtain an estimate of the endogenous variable, use it as the IV and include it in the original regression equation.
    Example: Y is the probability of a civil war outbreak, X is economic growth, and IV is rainfall. The probability of a civil war outbreak and economic growth are mutually causal (there is an endogeneity problem). Rainfall is related to economic growth (in agricultural countries) and can only unidirectionally affect the probability of a civil war outbreak by influencing economic growth.

  • Difference-in-Differences method (DID)\ If there is an external shock that affects some samples but not others and we want to know the net effect of this shock, DID is used to study it. Because the shock is generally exogenous relative to the research sample, there is no reverse causality problem.

Finally, our superstar DID model appears.

6 The Difference-in-Differences (DID) model

The DID model is relatively simple and essentially a linear regression.

\[Y_{it}=\beta_{0}+\beta_{1}D+\beta_{2}T+\beta_{3}(D\times T)+\varepsilon_{it}\]

D is the group dummy variable. When studying the impact of an event or policy, if an individual i is affected by the shock, i belongs to the experimental group, and \(D=1\). Otherwise, i belongs to the control group, and \(D=0\). T is the dummy variable for time (the event or policy has a starting point). \(T=0\) before the shock and \(T=1\) after the shock. \(D\times T\) is the interaction term between the group dummy variable and the time dummy variable (multiplied together), and its coefficient \(\beta_{3}\) reflects the net effect of the shock.

Wow, it looks so simple. In this model, we can add other control variables. At the same time, the DID model has its own assumptions: the experimental and control groups have parallel trends before the shock, which is similar to an AB test.

In summary, the DID model can help you evaluate the impact of an event or policy scientifically. You no longer have to worry about your boss asking you, "So, what is the impact of this?"

7 Takeaways

  • Pay attention to the model assumptions.
  • Machine learning based on causal inference (instant sublimation)

Viewed times

Comments