
Causal thinking

In the linear regression example we saw some pretty odd behaviour. A model fit of genotype on expression gave no strong evidence for association, yet when we included a covariate this changed and there appeared to be an effect.

To figure out why this might be we need to think causally.

Three worlds

To start, let's look at the 'three worlds' of scientific inference:

[Figure: the three worlds of scientific inference]

The first world is the natural world - that's what we want to know about. This world contains all the wonders of life, but also all the annoying and fiddly bits.

The second world is the theoretical world - that's where we build scientific theories - simplified models of how the world works. This world is much cleaner and simpler than the real world. Arguably this is where advances in scientific understanding really happen.

The third world is the one we've been working in - the data analysis world. Here we take data and apply statistical models; the aim is to help us distinguish between our scientific theories.

Linking these worlds are experiments - either purely observational experiments, which measure variables in real-world populations, or controlled experiments, which are typically much more tightly controlled. Experiments generate data relevant to scientific models, which can then be tested using data analysis methods.

The point of this diagram is that, while there are some settings where purely knowing correlations between variables is useful, what we want to know most is how the world works. Statistical models typically don't tell us that without the scientific theories that motivate them. In this tutorial we'll see how that plays out in a purely simulated example.

Causal graphs

Let's look at a simple way to represent a functional / mechanistic model: causal graphs.

A causal graph is made up of nodes which stand for quantities of interest and arrows between nodes. An arrow between two nodes means that one causes or causally affects the other.

For example

$$X \rightarrow Y$$

means "X causally affects Y": changing XX would affect YY. (The graph alone does not say anything about how XX affects YY, or how strong the effect is - just that there is an effect.)

The other important point is that variables and arrows that don't appear in the graph are assumed not to be there. Thus, in the above graph, changing $Y$ would definitely not affect $X$ (as there's no arrow or path from $Y$ to $X$). Similarly, in this diagram:

$$X \rightarrow Y \rightarrow Z$$

we can be sure that all of the causal effect of $X$ on $Z$ is through $Y$, and not via some other path.

An exercise: three variables

To get a sense of this, let's consider the simplest nontrivial case: three variables. Specifically, let's imagine that we are interested in estimating the (causal) effect of some exposure variable $A$ (for example, a genotype or a new drug) on an outcome $C$ (which could be a disease trait or a biomarker, for example). But there is a third variable $B$ that might be causally related to $A$ and/or $C$. Should we condition on $B$ or not?

To make this concrete, let's consider a setup where

  • $A$, $B$, and $C$ are each gaussian variables with variance 1.
  • all causal effects are linear and equal to 0.5.

In other words, $A \rightarrow C$ means $C = 0.5 A + \epsilon$, where $\epsilon$ is a gaussian variable with just the right variance to make $C$ have variance 1.
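For example, here is a minimal R sketch of simulating $A \rightarrow C$ under these assumptions (the residual variance $1 - 0.5^2 = 0.75$ is what keeps $C$ at variance 1):

# A minimal sketch: simulate A -> C with effect 0.5 and unit variances
n = 100000
A = rnorm( n )
C = 0.5 * A + rnorm( n, sd = sqrt( 1 - 0.5^2 ))   # residual variance = 0.75
var( C )       # should be close to 1
cov( A, C )    # should be close to 0.5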

Question

Consider the nine causal diagrams below:

[Figure: nine causal diagrams relating A, B and C]

Suppose all variables are gaussian with variance 1, all effects are linear and all effect sizes are 0.5.

For each diagram, first try to decide by looking at the diagram whether you think $B$ should be included or not. (Or maybe it doesn't matter?)

Second, try downloading the matching dataset from this folder and use lm() in R to check your answer is right. More precisely, what is the impact of including $B$ as a covariate on a) the regression estimate $\hat{\beta}$ and b) its standard error?
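If it helps, here is a sketch of the kind of comparison to run - the filename is just a placeholder, so substitute whichever diagram's dataset you downloaded:

# A sketch, assuming a downloaded file named "diagram_1.csv" with columns A, B and C
data = read.csv( "diagram_1.csv" )
summary( lm( C ~ A, data = data ))$coeff       # unadjusted estimate
summary( lm( C ~ A + B, data = data ))$coeff   # estimate adjusted for B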

Question

Can you figure out which of the diagrams are observationally equivalent? (That is, which produce identical data distributions?)

Note. In these examples, all variables are gaussian with mean zero and variance 1, and all effects are linear. Thus the joint data distribution is fully determined by the covariances between the three variables.

See if you can complete these before going further.

Worked example

As an example, let's analyse diagram #8, which, stretched out, looks like:

$$A \stackrel{0.5}{\rightarrow} C \stackrel{0.5}{\rightarrow} B$$

Let's simulate from this model now. I'll simulate a very large sample size so that we can get accurate estimates:

A = rnorm( 100000 )
C = 0.5 * A + rnorm( 100000, sd = sqrt(0.75) )
B = 0.5 * C + rnorm( 100000, sd = sqrt(0.75) )

When I fit the example dataset I get this:

> l = lm( C ~ A ); summary(l)$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001401156 0.002739580 -0.5114491 0.6090377
A 0.497460492 0.002744025 181.2886108 0.0000000

> l = lm( C ~ A + B ); summary(l)$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001249272 0.002451502 -0.5095947 0.6103366
A 0.399875908 0.002532207 157.9159381 0.0000000
B 0.398648014 0.002527165 157.7451237 0.0000000

So fitting the model with $B$ in gives us estimates of around 0.4 - for both $A$ and $B$ - yet the causal effects were 0.5 and 0! What's going on?

Well, this model corresponds to the following formula:

$$A \sim N(0,1)$$
$$C = 0.5\times A + \epsilon_C$$
$$B = 0.5\times C + \epsilon_B = 0.25 \times A + 0.5\times \epsilon_C + \epsilon_B$$

(where, as before, $\epsilon_C$ is chosen so that $C$ has variance 1, and similarly for $\epsilon_B$.)

The unadjusted linear regression estimate is 0.5 (it is equal to the covariance between $A$ and $C$). This is indeed the causal effect.

The adjusted estimate is different - but it still gets the contribution of $A$ right. Since $B = 0.25 \times A + ...$, the fitted contribution of $A$ to $C$ is:

$$0.4 + 0.4\times 0.25 = 0.5$$

In other words, the regression has made a perfectly sensible estimate - it just doesn't understand that it's the effect of $A$ on $C$ that we are interested in.

One way to understand this is to say that some of the association with $A$ is 'explained by' $B$ - and in this case the estimate for $B$ is positive because $B$ also 'explains' some of the noise in $C$.
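If you want to see where the 0.4 comes from without simulating, the adjusted coefficients can be computed directly from the model covariances. A sketch, using the covariances implied by diagram 8 ($\text{cov}(A,B) = 0.25$ and $\text{cov}(A,C) = \text{cov}(B,C) = 0.5$):

# A sketch: theoretical adjusted coefficients from the diagram-8 covariances
Sigma = matrix( c( 1, 0.25, 0.25, 1 ), nrow = 2 )   # covariance of the predictors (A, B)
s = c( 0.5, 0.5 )                                    # covariance of (A, B) with the outcome C
solve( Sigma, s )                                    # gives 0.4 for both A and B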

Note. Diagram 2 is sort of similar in that $B$ is correlated with $A$. In this case, however, the behaviour is different:

A = rnorm( 10000 )
B = 0.5 * A + rnorm( 10000, sd = sqrt( 0.75 ))
C = 0.5 * A + rnorm( 10000, sd = sqrt( 0.75 ))
> l = lm( C ~ A ); summary(l)$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.003590284 0.008678783 0.4136851 0.6791136
A 0.506865138 0.008710480 58.1902630 0.0000000

> l = lm( C ~ A + B ); summary(l)$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.003520708 0.008677774 0.4057155 0.68496031
A 0.497305419 0.010094251 49.2662029 0.00000000
B 0.018778370 0.010023912 1.8733575 0.06104819

In this case including $B$ does not alter the point estimate of the effect of $A$ on $C$. One way to understand this is that $B$ contains no component of $C$ other than $A$, so all of the association of $B$ with $C$ is 'explained' by $A$. This leads to the coefficient of $B$ being close to zero. However, including $B$ has impacted the precision of our estimate: the standard error for $\beta$ has increased from 0.0087 to 0.010. This might not look like much, but look at its effect on the z-score (i.e. how many standard errors away from zero the estimate is):

> 0.506865138/0.008710480
[1] 58.19026
> 0.497305419/0.010094251
[1] 49.2662

Including $B$ has made the estimate less certain: the z-score has dropped by around ten standard errors.

In this example with 10,000 samples, both estimates are in any case many standard errors away from zero, but clearly including $B$ is adding to the uncertainty of our estimate.

Hint

The effect sizes we are looking at here are very strong! (Each effect explains $\frac{1}{4}$ of the variance of the target variable.) Try reducing the sample size to 100 to see how this affects the resulting P-value.
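For example, a quick sketch of re-running the diagram 2 simulation above at the smaller sample size:

# A sketch: repeat the diagram 2 simulation with only 100 samples
n = 100
A = rnorm( n )
B = 0.5 * A + rnorm( n, sd = sqrt( 0.75 ))
C = 0.5 * A + rnorm( n, sd = sqrt( 0.75 ))
summary( lm( C ~ A ))$coeff
summary( lm( C ~ A + B ))$coeff   # compare the standard errors and P-values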

Observational equivalence

One of the important things to realise is that more than one causal model can produce the same observable data. This puts limits on the extent to which the data can help us distinguish scientific theories.

An example of this above is the pair of diagrams 6 and 7. We can check this by following through the definitions in each diagram to compute the covariances between the variables. In both diagrams we find:

$$\text{cov}(A,B) = 0.5, \quad \text{cov}(B,C) = 0.75, \quad\text{and}\quad \text{cov}(A,C) = 0.75$$

A full computation of $\text{cov}(B,C)$ in diagram 6 looks as follows:

$$\text{cov}(B,C) = \text{cov}(0.5 A + \epsilon_B,\; 0.5 A + 0.5 B + \epsilon_C)$$
$$= \text{cov}(0.5 A + \epsilon_B,\; 0.5 A + 0.25 A + 0.5 \epsilon_B + \epsilon_C)$$
$$= \text{cov}(0.5 A, 0.75 A) + \text{cov}(\epsilon_B, 0.5\epsilon_B)$$
$$= \frac{3}{8} + \frac{1}{2}\text{var}(\epsilon_B) = \frac{3}{8} + \frac{3}{8} = \frac{3}{4}$$

(using $\text{var}(\epsilon_B) = 1 - 0.5^2 = \frac{3}{4}$.)
Note

This calculation uses the key properties of covariance, namely:

  • $\text{cov}(\cdot,\cdot)$ is bilinear, so linear changes to either of the variables can be split out, e.g.:

    • $\text{cov}(\alpha X, Y) = \alpha \cdot \text{cov}(X, Y)$ if $\alpha$ is a number,
    • $\text{cov}(X+Z, Y) = \text{cov}(X, Y) + \text{cov}(Z, Y)$.
    • And similarly for changes to the 2nd variable.
  • The covariance between any two independent quantities is zero.

  • And by definition $\text{var}(X) = \text{cov}(X,X)$.

If the above calculation seems too hard, don't worry! In general in these diagrams the covariance can be read off by simply following all paths between two variables (regardless of arrow direction) and multiplying coefficients. So in this example, there are two undirected paths that connect $B$ to $C$ - one has coefficient $0.5\times 0.5 = 0.25$ (through $A$) and the other has coefficient $0.5$, so the total covariance is $0.75$.
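You can also check these covariances by simulation. Here is a sketch, taking diagram 6 as written above ($A \rightarrow B$, $A \rightarrow C$, $B \rightarrow C$) and assuming diagram 7 is the version in which $B$ is instead a common cause of $A$ and $C$ (as discussed in the confounding section below); the residual standard deviations are chosen to keep every variable at variance 1:

n = 100000

# Diagram 6: A -> B, A -> C, B -> C, all effects 0.5
A = rnorm( n )
B = 0.5 * A + rnorm( n, sd = sqrt( 0.75 ))
C = 0.5 * A + 0.5 * B + rnorm( n, sd = 0.5 )
cov( cbind( A, B, C ))

# Diagram 7 (assumed structure): B -> A, B -> C, A -> C, all effects 0.5
B = rnorm( n )
A = 0.5 * B + rnorm( n, sd = sqrt( 0.75 ))
C = 0.5 * A + 0.5 * B + rnorm( n, sd = 0.5 )
cov( cbind( A, B, C ))   # near-identical covariance matrix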

However there is one exception to this rule about following all paths. A path that is "blocked" by a collider - i.e. a variable where two arrows collide:

$$X \rightarrow \stackrel{\text{blocking variable}}{Y} \leftarrow Z$$

doesn't contribute to the covariance.

Note

Why is this? Hint: follow through the contributions of variables to the collider - you should be able to see that, unlike other nodes that join two variables, the collider does not generate a component shared by the incoming variables.
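A quick sketch to convince yourself of this: simulate a pure collider and check that its two parents stay uncorrelated:

# A sketch: X -> Y <- Z, effects 0.5; the path X - Y - Z is blocked by the collider Y
n = 100000
X = rnorm( n )
Z = rnorm( n )
Y = 0.5 * X + 0.5 * Z + rnorm( n, sd = sqrt( 0.5 ))
cov( X, Z )   # close to zero: the collider contributes nothing to cov(X, Z)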

Question

Which variables are colliders in diagrams 6 and 7?

Confounding

The fact that diagrams 6 and 7 are observationally equivalent is particularly important for the following reason.

In diagram 6, variable $B$ is part of a causal path from $A$ to $C$. The full causal effect of $A$ on $C$ is 0.75, and this is what we get from an unadjusted estimate of $A$ on $C$. Thus in diagram 6 we should not control for $B$. (Although if we were specifically interested in just the direct effect of $A$, we could control for $B$.)

However in diagram 7, variable $B$ is not part of a causal path from $A$ to $C$. The full causal effect of $A$ on $C$ is 0.5. If we do not include $B$ in our estimate, we get an inflated estimate of the effect, equal to 0.75. In this case $B$ is known as a confounding variable or a confounder - more generally, we say the estimate of $A$ on $C$ is confounded by the path through $B$. To accurately estimate causal effects it is mandatory to control for the confounding variable.

Warning

To get an accurate estimate, you must control for confounders.

However - the surprising thing here is that, as we saw above, diagrams 6 and 7 are observationally equivalent. Therefore, we can't tell from the data whether $B$ is a confounder or not!

Conclusion

The only way to tell whether $B$ should be included or not in diagrams 6-7 is through our scientific knowledge of what $A$, $B$ and $C$ represent - that is, of what the true causal diagram is.

Diagram 9: colliders

Finally let's look at diagram 9. In diagram 9, variable $B$ is a collider (two arrows meet at it). Let's simulate that now:

A = rnorm( 10000, sd = 1 )
C = 0.5 * A + rnorm( 10000, sd = sqrt(0.75) )
B = 0.5 * A + 0.5 * C + rnorm( 10000, sd = sqrt(0.5) )

> l = lm( C ~ A ); summary(l)$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0008303819 0.008687452 -0.09558405 0.9238528
A 0.4895884550 0.008781247 55.75386231 0.0000000

> l = lm( C ~ A + B ); summary(l)$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.00203657 0.007387861 -0.2756643 7.828116e-01
A 0.06833455 0.010105579 6.7620615 1.436458e-11
B 0.55480287 0.008967174 61.8704222 0.000000e+00

In this diagram conditioning on $B$ gets the answer completely wrong.

In some ways this is like diagram 8, which we studied above. However, this collider example is even more insidious because it can generate completely spurious effects. (The problem in diagram 8 only occurs if $A$ actually contributes to $C$ - that is, if there is a nonzero causal effect in the first place.) To see this, we can try again with zero effect:

A = rnorm( 10000, sd = 1 )
C = rnorm( 10000, sd = 1 )
B = 0.5 * A + 0.5 * C + rnorm( 10000, sd = sqrt(0.5) )
> l = lm( C ~ A ); summary(l)$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001944327 0.010010210 -0.1942344 0.8459963
A -0.017850158 0.009992815 -1.7862993 0.0740811

> l = lm( C ~ A + B ); summary(l)$coeff
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.003406093 0.008153387 -0.4177518 6.761375e-01
A -0.351401096 0.009390186 -37.4221675 5.732243e-287
B 0.669783192 0.009403352 71.2281297 0.000000e+00

In both cases, conditioning on $B$ gets the estimate of $\beta$ completely wrong.

Warning

To get an accurate estimate, you must not control for colliders.

To understand this behaviour, let's keep this last dataset (with zero causal effect of $A$ on $C$) and plot the variables:

library( tibble )    # assuming tibble and ggplot2 are not already loaded
library( ggplot2 )
plot_data = tibble( A = A, B = B, C = C )
print(
ggplot( data = plot_data, aes( x = A, y = C ))
+ geom_point()
+ geom_smooth( method = "lm" )
)

[Figure: scatter plot of C against A with a fitted regression line]

This looks as expected. But now let's look at the same thing within strata of fixed $B$:

plot_data$B_quantile = cut( plot_data$B, quantile( B, seq( from = 0, to = 1, by = 0.05 )))
print(
ggplot( data = plot_data, aes( x = A, y = C ))
+ geom_point()
+ geom_smooth( method = "lm" )
+ facet_wrap( ~ B_quantile )
)

[Figure: scatter plots of C against A within quantile strata of B]

For fixed values of $B$, the association between $A$ and $C$ looks drastically different - it is now clearly negative.

You can see why this is: since $B = 0.5 A + 0.5 C + \epsilon_B$, for a fixed value of $B$ there must be a negative relationship between $A$ and $C$:

$$C = 2B - A - 2\epsilon_B$$

Controlling for $B$ in the regression is like looking at the effect of $A$ on $C$ within strata like this - it 'soaks up' the global effect of $B$.
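To check this numerically, here is a small sketch fitting the regression within a single narrow band of $B$, reusing plot_data from above (the 0.1 cutoff is an arbitrary choice):

# A sketch: the A-C association within a narrow stratum of B
stratum = plot_data[ abs( plot_data$B ) < 0.1, ]
summary( lm( C ~ A, data = stratum ))$coeff   # the estimate is now clearly negative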

Question

In the actual diagram 9, this effect is tempered by the fact that $A$ does have an effect on $C$. How do the plots look if you simulate from diagram 9?