Causality 101: The Book of Why – Part 1 (Work in Progress)

How to read this post?

"Quoted excerpts from the book"

My notes on the quoted part....

Book of Why - Introduction: Mind Over Data

This book tries to draw attention to "Causal Inference" as a new science, and its key message is that "You are smarter than your data". Its goal is to convince the reader that you cannot build intelligence without first building an inherent sense of cause and effect into an agent.

This idea of cause and effect has been studied by two schools of thought: the correlation school (Hume) and the causation school (Aristotelian). Judea Pearl's work in causality can be considered to be inspired by the Aristotelian view, whereas Hume, in his book A Treatise of Human Nature, argues that we never directly perceive a causal relation, only one event regularly following another.

However, tinkering has long been a huge part of human behavior, and can probably be considered the key to human evolution: what would happen if I did x? Questions like this often lead the path to creativity and novelty. Thus, this relationship between cause and effect has been of long-standing interest to humans.

No other species grasps this (cause and effect), certainly not to the extent that we do. From this discovery came organized societies, then towns, cities and eventually the science and technology-based civilization. All because we asked a simple question: Why?

This might very well be true of conscious societies, but it might be slightly presumptuous to deduce that humans make decisions via our rational or conscious minds. A lot of work has been done on this topic by behavioral scientists like Amos Tversky, Daniel Kahneman and Dan Ariely. Take, for example, a life-and-death situation where you are being chased by a tiger: people have tried to fend off the tiger with a stick, climb a tree or jump into water, even though we know that tigers belong to the cat family.

Our brains store an incredible amount of causal knowledge which, supplemented by data, we could harness to answer some of the most pressing questions of our time. More ambitiously, once we really understand the logic behind causal thinking, we could emulate it on modern computers and create an "artificial scientist"

This might be a hasty generalization, since prediction and reasoning do not operate as a single phenomenon with a shared toolkit. While the former can be heavily influenced by neural circuitry, the latter is heavily influenced by conscious intervention.

  • Among the questions that interest human societies, causal inference is a science/tool to answer questions of the following type:
  1. How effective is a given treatment in preventing a disease?
  2. Did the new tax law cause our sales to go up, or was it our advertising campaign?
  3. What is the health-care cost attributable to obesity?
  4. I'm about to quit my job. Should I?

and so on... In short, data-backed decision making...

  • One of the key challenges for the deep learning community is decision-making in sparse reward loops. This was discussed eloquently by Lex Fridman with Pieter Abbeel recently: while our neural networks are excellent at giving out a number in the end (say, accuracy), they lack a greater awareness of the context of what that number means. Because of this potential limitation of neural network systems, many people are highly sceptical of them, and it has become a clickbait goldmine for ethics committees and journalists around the world to constantly point fingers at.

But the most serious impediment, in my opinion, has been the fundamental gap between the vocabulary in which we cast causal questions and the traditional vocabulary in which we communicate scientific theories.

Very well said.

For the sake of example, let us imagine a scientist trying to study the relationship between atmospheric pressure P and the barometer reading B. One way to express it could be:

B = kP

where k is a constant. For given values of P and k, this allows us to estimate B. However, it is only half the information, since it lacks the greater context of the non-statistical or qualitative relationship between B and P. Consider, for example, the question:
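As a quick numerical sketch of why the equation alone is silent about direction (the value of k and the data below are made up for illustration): the same observations fit B = kP and P = (1/k)B equally well, so nothing in the fitted numbers tells us which variable is the cause.

```python
import random

# Hypothetical data: pressure P (in hPa) drives barometer reading B = k*P,
# with a little measurement noise. k = 0.75 is an arbitrary assumption.
random.seed(0)
k_true = 0.75
P = [950 + 100 * random.random() for _ in range(200)]
B = [k_true * p + random.gauss(0, 0.5) for p in P]

# Least-squares estimate of k in B = k*P (no intercept): k = sum(P*B)/sum(P*P)
k_hat = sum(p * b for p, b in zip(P, B)) / sum(p * p for p in P)

# The same algebra "works" in reverse: fitting P = c*B gives c close to 1/k,
# so the equation by itself cannot reveal the causal direction.
c_hat = sum(p * b for p, b in zip(P, B)) / sum(b * b for b in B)

print(round(k_hat, 3))      # close to 0.75
print(round(1 / c_hat, 3))  # also close to 0.75
```

Both fits recover the same constant; only outside knowledge (pressure moves the needle, not vice versa) breaks the symmetry.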

Is it the pressure that causes the barometer to change or the other way around? Does the change in barometer cause changes in pressure?

Many similar questions fall in the same category. For example, does the sun cause the rooster to crow, or does the rooster's crow cause the sun to rise?

While humans are intuitively capable of answering such questions effortlessly, purely statistically driven machines cannot. But then why is this stuff not talked about as much?

Despite heroic efforts by the geneticist Sewall Wright (1889-1988), causal vocabulary was virtually prohibited for more than half a century.

I think it was only as prohibited as any entrepreneur's startup is from being funded. Wright was neither boycotted nor executed, like Socrates.

I think Judea has done more justice to the work than Wright could. But leaving Wright's people skills aside, this intuitive mechanism, that correlation is not causation, is the key ingredient of intelligence. It does call for designing our machines with a causal mechanism, but before we can, we will have to understand causality in further detail.

I hope with this book to convince you that data are profoundly dumb.

Indeed. Steve Jobs once described computers as incredibly fast but incredibly dumb. What an excellent analogy. If so, does data have any significance at all? Yes, it does, very much like computers. Just because computers are extremely dumb doesn't mean we should have thrown them away; we instead adopted them further, incorporating them into every aspect of our lives. And the same, I believe, will be the role that data serves.

To explain further,

Only after gamblers invented intricate games of chance, sometimes carefully designed to trick us into making bad choices, did mathematicians like Blaise Pascal (1654), Pierre de Fermat (1654), and Christiaan Huygens (1657) find it necessary to develop what we call probability theory. Likewise, only when insurance companies demanded accurate estimates of life annuities did mathematicians like Edmond Halley (1693) and Abraham de Moivre (1725) begin looking at mortality tables to calculate life expectancies. Similarly, astronomers' demands for accurate predictions of celestial motion led Jacob Bernoulli, Pierre-Simon Laplace, and Carl Friedrich Gauss to develop a theory of errors to help us extract signals from noise.

Needless to say, necessity is the mother of invention, but this might have been stretched a little too far in a spur of passion, or because of a systematic cognitive bias, confirmation bias, as Kahneman would put it. The discoveries above might have simply been inspired by a curiosity to understand or replicate the regularities and irregularities in data.
  • The determinate relationships in data have most often led humans down the rabbit hole of trying to map the indeterminate relationships in the data, which often calls for multivariate analysis in highly complex, non-deterministic environments. For example, the success of Einstein, Mozart and the like has caused humans to study the habits and routines of geniuses. According to Hume and his theory of causality, the effect follows the cause.
  • And this is exactly where population-based methods and reinforcement learning can bring a massive advantage to the causal inference engine.

If you are still with me, here's a funny over-the-top joke about causal inquiry:

Even two decades ago, asking a statistician "Was it the aspirin that stopped my headache?" would be like asking if he believed in voodoo.

Ha, very funny! But come on. Given that we already had Google and Amazon two decades ago, in 1998, it couldn't have been all that bad.
  • While there are many causal models available, the pros and cons of which we shall discuss in further posts, one of the most efficient and highly interpretable is what we call a graphical model, where each node is a variable and each arrow represents a direction of influence, from cause to effect.
  • This also calls for some digging into the relationship between RNNs, time series or any kind of sequence modelling, and causal inference.
  • The notation used to express causal relations is P(L|do(D)), which means the effect of a drug (D) on lifespan (L). This is often confused with P(L|D).
  • While P(L|D) is the probability of L given that we passively observe D, P(L|do(D)) is the measure of the relationship between L and D in isolation from all other variable factors, when we actively set D.
  • This technique of disassociating the variables of interest (say, D and L) from other variables (say, Z) that would otherwise affect them both is called an RCT, or Randomized Controlled Trial.
  • This "do" operator is what separates, as far as reasonably possible, coincidence from relation between two variables.
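To make the observe-versus-intervene distinction concrete, here is a minimal Monte Carlo sketch (all the probabilities are invented for illustration): a confounder Z influences both who takes drug D and the outcome L, so the passively observed P(L|D=1) differs from the interventional P(L|do(D=1)) that an RCT would measure.

```python
import random

random.seed(1)
N = 100_000

def draw_z():
    # confounder, e.g. age group, 50/50 in the population
    return 1 if random.random() < 0.5 else 0

def draw_d(z):
    # older patients (Z=1) are far more likely to take the drug
    return 1 if random.random() < (0.8 if z else 0.2) else 0

def draw_l(d, z):
    # outcome: the drug helps (+0.2), age hurts (-0.3); base rate 0.5
    return 1 if random.random() < 0.5 + 0.2 * d - 0.3 * z else 0

# Observational regime: nature assigns D based on Z
obs = []
for _ in range(N):
    z = draw_z()
    d = draw_d(z)
    obs.append((d, draw_l(d, z)))
treated = [l for d, l in obs if d == 1]
p_l_given_d1 = sum(treated) / len(treated)

# Interventional regime: do(D=1) severs the Z -> D arrow (what an RCT emulates)
p_l_do_d1 = sum(draw_l(1, draw_z()) for _ in range(N)) / N

print(round(p_l_given_d1, 2))  # ~0.46, dragged down by the confounder
print(round(p_l_do_d1, 2))     # ~0.55, the drug's true causal effect
```

The treated group is dominated by older patients, so naively conditioning on D understates the drug's benefit; intervening removes that contamination.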

But why do we need to calculate the "do" operator? It's because our data is incredibly dumb. Some practical analogies to make the point-

  1. Patients would avoid going to the doctor to reduce the probability of being seriously ill.
  2. Los Angeles would dismiss all fire-fighters to reduce the incidence of fires in L.A.
  3. New York would fire all its police to cut down its crime rates

As mentioned earlier, while current deep learning techniques are incredibly powerful tools for prediction, one of the areas where they lack is decision-making. For example, consider retrospective questions like:

Should I quit my job?

From Downton Abbey-

Would Sybil be alive had Lord Grantham listened to Doctor Clark?

A retrospective question of this sort is incredibly hard for even humans to answer. Maybe, maybe not? Who knows? You can try, if that is what makes you happy.

In a moment like this, we tend to make decisions using what we call "our mental model" and "our assumptions"; thus, for a machine to be able to answer such questions, it is important that these become two key parts of the thinking machine.

Two people who share the same causal model will also share all counterfactual judgements.

I don't know. I am not convinced yet. I think we need a proper proof to posit this, so it needs some digging.

I obsess over whether we can express a certain claim in a given language and whether one claim follows from others.

There is no absolute answer to this. It would call into question the mental makeup of a person, which includes how they conceptualize the world around them, i.e. their mental model, along with their systematic cognitive biases, their assumptions and the available data. A person biased towards perception or openness might see the irregularities in data far more than one biased towards a judging or closed personality, for the latter might be more prone to confirmation bias, among others, thus overlooking many apparent facts.

A causal reasoning module will give machines the ability to reflect on their mistakes, to pinpoint weaknesses in their software, to function as moral entities, and to converse naturally with humans about their own choices and intentions.

This problem is far deeper than what causal reasoning alone is capable of solving:

1. Human societies are not completely moral entities, and thus a moral entity in an immoral world will have a very hard time
2. I think it might be worth digging into the behaviour of self-organizing societies. Deep learning methods and population-based methods become excellent tools here for their strong bias towards strategy and imitation learning, excelling in greedy scenarios. A recent example of such potential is the AlphaStar algorithm.

Regardless, causal inference models can be a great tool to extend the potential of current model-blind methods, given the high adaptability and generalizability they offer. For example,

  • By observing the outcome L of many patients given drug D, one can predict the probability that a patient with characteristics Z will survive L years. Now imagine she were transferred to a different hospital, in a different part of town, where the population characteristics (diet, hygiene, work habits) are different.
  • For a deep learning method, even if these new characteristics merely modify the numerical relationships among the variables recorded, one would have to re-train and learn a new prediction function all over again.
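A hedged sketch of that hospital scenario (all distributions invented for illustration): both hospitals share the same causal mechanism P(L|D,Z) but differ in their patient mix P(Z=1). The single number learned at hospital A stops working at hospital B, while the Z-conditional estimates from A transport to B using only B's patient mix, with no retraining on B's outcomes.

```python
import random

random.seed(2)

def outcome(d, z):
    # the causal mechanism P(L|D,Z), assumed stable across hospitals
    return 1 if random.random() < 0.5 + 0.2 * d - 0.3 * z else 0

def hospital(n, p_z):
    # hospitals differ only in their patient mix P(Z=1)
    rows = []
    for _ in range(n):
        z = 1 if random.random() < p_z else 0
        d = 1 if random.random() < 0.5 else 0  # say, drug given at random here
        rows.append((z, d, outcome(d, z)))
    return rows

a = hospital(50_000, p_z=0.1)  # hospital A: mostly Z=0 patients
b = hospital(50_000, p_z=0.9)  # hospital B: mostly Z=1 patients

# Model-blind predictor learned at A: a single number, P(L=1 | D=1)
rate_a = sum(l for z, d, l in a if d == 1) / sum(1 for z, d, l in a if d == 1)
rate_b = sum(l for z, d, l in b if d == 1) / sum(1 for z, d, l in b if d == 1)

# A causal model instead estimates P(L | D=1, Z) at A and reweights by B's mix
p1 = sum(l for z, d, l in a if d == 1 and z == 1) / sum(1 for z, d, l in a if d == 1 and z == 1)
p0 = sum(l for z, d, l in a if d == 1 and z == 0) / sum(1 for z, d, l in a if d == 1 and z == 0)
pz1_b = sum(z for z, d, l in b) / len(b)
transported = pz1_b * p1 + (1 - pz1_b) * p0

print(round(rate_a, 2))       # ~0.67: the number learned at hospital A
print(round(rate_b, 2))       # ~0.43: what actually holds at hospital B
print(round(transported, 2))  # ~0.43: recovered from A's model plus B's mix
```

The stable pieces are the mechanisms, not the marginal numbers; that is what makes the causal model adaptable where the model-blind one must be retrained.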

So, let us take a quick look at the causal inference engine,

I will write this in pseudo-code form in my next edit (in case, like me, things make more sense to you that way), but here is a quick description of what's happening here:

  • Knowledge (1) = Unconscious Biases, not directly accessible to the machine/psyche
  • Assumptions (2) = Accessible World Model*
  • Causal Model (3) = Various options available - we'll use graphical models (why? discussed earlier)
  • Testable Implications (4) = Like the test code we write before writing any software to check that our model works; think of the scaffolding that setUp() and tearDown() methods provide in Python's unittest. We will discuss this in further detail in later blog posts.
  • Queries (5) = questions formulated in causal vocabulary for eg. what is P(L|do(D))?
  • Estimand (6) = the probability formula that answers the quantitative part of the query. More on this later..
  • Data (7) = Data is only collected after we test the model and derive the estimand. RCT, remember?
  • Statistical Estimate (8) = Parametric and Non-Parametric Maximum Likelihood Methods, Propensity Scores used for smoothing the sparse data.
  • Estimate (9) = Answer like "Drug D increases the lifespan L of diabetic patients Z by 30 percent, plus or minus 20 percent"
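The nine steps above can be compressed into a rough pseudo-code sketch; every name here is a hypothetical placeholder, not a real library API:

```
ENGINE(assumptions, query, data):
    model    <- build_graphical_model(assumptions)     # (2) -> (3)
    checks   <- testable_implications(model)           # (4)
    if data fails any check:
        return "revise the assumptions and rebuild the model"
    estimand <- identify(model, query)                 # (5) -> (6)
    if no estimand exists:
        return "query cannot be answered from this kind of data"
    estimate <- fit(estimand, data)                    # (7) -> (8)
    return estimate with its uncertainty               # (9)
```

Note the order: the data (7) enters only after the model has survived its tests and yielded an estimand; the engine never starts from raw data alone.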

We will go more in detail of this engine as well as simplify several steps, as we go into further chapters.

Verdict: Do we have a computational model of intelligence?

Answer: No, we're not there yet, but we have managed to achieve something really grand: progress in decision-making via causal inference, and computationally faster prediction methods via deep learning.

  • The causal inference model, while extremely capable, is extremely demanding: do an RCT on this and that, and your world model has to be fully correct for it to give you the right answer.
  • Deep learning also suffers from a data-hogging attitude. Hence there is now an interesting shift towards finding new AutoML techniques.
  • That said, both of them operate in stark contrast to human cognition, probably shaped by years of evolution: we can make decisions with minimal data and minimal assumptions.

So, that is all for my review of the Intro chapter of The Book of Why. In my upcoming blog posts, we will dig further into causality and the next chapters of The Book of Why, in the hope of further understanding causality and what it is capable of...
