The impossibility of fairness in prediction models

1. Introduction

In recent years, fairness (or the lack thereof) in machine learning models has become a far more prominent issue. A major catalyst for this shift was the discussion surrounding the ProPublica analysis [1, 2], and the subsequent counter-arguments [3], of Equivant's (formerly Northpointe's) COMPAS pre-trial risk assessment tool. The analysis showed that African-American defendants were more likely to be incorrectly labelled as higher risk of recidivism, while white defendants were more likely to be incorrectly labelled as lower risk. However, the model was well calibrated across the different populations (see e.g. [4, 5, 6] for various discussions).

The existence of these two seemingly contradictory viewpoints gave rise to important theoretical questions.

  1. Which notion of fairness is correct?

  2. If one notion is fair, why is the other not?

  3. Can the predictive model be made to satisfy both notions?

Answering these questions has led to the so-called impossibility theorem [3, 4], which states that, in general, it is impossible for a predictive model to simultaneously satisfy multiple established definitions of fairness.

Below I attempt to use Adam Pearce's nice interactive example [7] (which I recommend viewing prior to reading), regarding testing for medical conditions within different age groups, to give a (hopefully) simple mathematical illustration of some of the key points. I will do this following the causal data-generation viewpoint put forth by Chiappa and Isaac [5].

2. An example model

The presented model consists of three variables: a patient's age A (either child c or adult a), the presence of a condition C (either 0 or 1), and a subsequent test result T (also 0/1 for negative/positive). In our model adults are more likely to have the condition than children (arthritis or heart disease, for example), and the probability of a test result depends only on the absence or presence of the condition.

Using those constraints, the joint probability distribution of A, C and T can be written as

P(A, C, T) := P(T|A, C) P(C|A) P(A) = P(T|C) P(C|A) P(A)    (1)

and represented diagrammatically by the graph in Figure 1. Together, the graph and probability distribution form a Bayesian Network (BN).

Figure 1: Causal representation of the model (the directed graph A → C → T).

Following the example in [7] we take the proportion of children c and adults a to be the same (i.e. P(A = a) = P(A = c) = 1/2). The probability of the condition for the different age groups, P(C|A), we write as

          A = c      A = a
  C = 0   1 - α      1 - β
  C = 1   α          β
Table 1: Conditional probability for P(C|A)

with α < β as the prevalence is higher in adults.

For the probability of the test result we have

          C = 0      C = 1
  T = 0   q          1 - p
  T = 1   1 - q      p
Table 2: Conditional probability for P(T|C)

Here p and q are the true positive rate (TPR) and true negative rate (TNR) respectively. Often there is a single tunable parameter (as in [7]) for which a result above/below a cut-off is deemed positive/negative (e.g. the concentration of virus in a COVID blood sample). In that case the TNR q would decrease as the TPR p is improved; however, we leave that trade-off aside for now.
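To make the setup concrete, here is a minimal Python sketch of the model, using illustrative values for α, β, p and q (any values with α < β would do; they are not taken from the text). It encodes Tables 1 and 2 together with P(A) = 1/2 and builds the joint distribution of Equation (1).

```python
# Minimal sketch of the model in Section 2. The parameter values are
# illustrative assumptions, not taken from the text.
ALPHA, BETA = 0.1, 0.2   # P(C=1|A=c) and P(C=1|A=a), with alpha < beta
P, Q = 0.9, 0.9          # p = P(T=1|C=1) (TPR), q = P(T=0|C=0) (TNR)

P_A = {"c": 0.5, "a": 0.5}                       # P(A): equal proportions
P_C_GIVEN_A = {"c": {0: 1 - ALPHA, 1: ALPHA},    # Table 1: P(C|A)
               "a": {0: 1 - BETA, 1: BETA}}
P_T_GIVEN_C = {0: {0: Q, 1: 1 - Q},              # Table 2: P(T|C)
               1: {0: 1 - P, 1: P}}

def joint(a, c, t):
    """P(A=a, C=c, T=t) = P(T|C) P(C|A) P(A), as in Equation (1)."""
    return P_T_GIVEN_C[c][t] * P_C_GIVEN_A[a][c] * P_A[a]

# Sanity check: the joint distribution sums to one.
print(sum(joint(a, c, t) for a in ("c", "a") for c in (0, 1) for t in (0, 1)))
```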

3. Measuring fairness

Answering any of our questions first requires a formalism: specifically, a definition of fairness that can be applied to the model. The introduction outlined two competing viewpoints, which give rise to two corresponding definitions of fairness (there are also others), outlined below.

Interestingly, in the current context both definitions might be considered reasonable, but we will see that, unfortunately, our model cannot satisfy both simultaneously. This dichotomy is the root of the impossibility theorem.

3.1 Fairness in accuracy

The first notion of fairness we might think of is that the test result T depends only on whether a person has the condition C, and is therefore the same whether the person is an adult or a child, i.e.

P(T | A = c, C) = P(T | A = a, C).

Formally we say the test results are conditionally independent of age given the condition, or mathematically

P(T | A, C) = P(T | C).    (2)

This form of fairness is sometimes called accuracy fairness, or equal false positive and true positive rates, and in our example we see immediately from Equation (1) that it is satisfied. This is the form that, according to the ProPublica analysis, was violated by the COMPAS tool.
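A quick numeric sketch, again with illustrative parameter values, recovers P(T = 1 | A, C) from the joint distribution and confirms it is the same for children and adults:

```python
# Check that accuracy fairness (2) holds in the base model: P(T=1|A,C)
# computed from the joint distribution does not depend on age.
# The parameter values are illustrative assumptions, not taken from the text.
ALPHA, BETA = 0.1, 0.2   # base rates P(C=1|A=c), P(C=1|A=a)
P, Q = 0.9, 0.9          # TPR p and TNR q, shared by both age groups

def joint(a, c, t):
    """P(A=a, C=c, T=t) = P(T|C) P(C|A) P(A), with P(A) = 1/2."""
    base_rate = ALPHA if a == "c" else BETA
    p_c_given_a = base_rate if c == 1 else 1 - base_rate
    if c == 1:
        p_t_given_c = P if t == 1 else 1 - P
    else:
        p_t_given_c = 1 - Q if t == 1 else Q
    return p_t_given_c * p_c_given_a * 0.5

def p_positive(a, c):
    """P(T=1 | A=a, C=c), obtained by conditioning the joint distribution."""
    return joint(a, c, 1) / (joint(a, c, 0) + joint(a, c, 1))

print(p_positive("c", 1), p_positive("a", 1))  # both 0.9 (= p)
print(p_positive("c", 0), p_positive("a", 0))  # both 0.1 (= 1 - q)
```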

3.2 Fairness in calibration

A second notion might also be considered, which inverts our first. Suppose we receive a particular test result T; we then want to know how likely it is that we actually have the condition C, which is arguably a more useful piece of information than our previous measure. It might then be considered unfair if an adult and a child receive the same test result but their chances of actually having the condition are different. To be fair, therefore, we can argue this measure should be the same for both adults and children, i.e.

P(C | A = c, T) = P(C | A = a, T).    (3)

Again, we can write this in the alternative mathematical form

P(C | A, T) = P(C | T),    (4)

which conveys the conditional independence of the condition C from age A given the test result T. This alternative form is sometimes referred to as calibration fairness.

Unfortunately, our example model does not satisfy this alternative definition of fairness. To see this we note that, by the definition of conditional probability (Bayes' theorem),

P(C | A, T) = P(A, C, T) / P(A, T) = P(T|C) P(C|A) / P(T|A),

where

P(T | A) = Σ_C P(T|C) P(C|A).

Using our values from Table 1 gives, for children,

P(C = 1 | A = c, T) = P(T|C=1) α / [ P(T|C=1) α + P(T|C=0) (1 - α) ]

and adults

P(C = 1 | A = a, T) = P(T|C=1) β / [ P(T|C=1) β + P(T|C=0) (1 - β) ].

To satisfy calibration fairness we make them equal, as in Equation (3), which leads, after some algebra, to

[ P(T|C=0) / P(T|C=1) ] (1 - α)/α = [ P(T|C=0) / P(T|C=1) ] (1 - β)/β.

However, this holds in general only if α = β, i.e. if the proportion of adults and children having the condition (the base rate) is the same (equivalently, P(C|A) = P(C)).
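A short numeric sketch makes the failure explicit: with illustrative values for α, β, p and q, the calibration quantity P(C = 1 | A, T = 1) differs between children and adults.

```python
# Calibration comparison for the model of Section 2, with illustrative values.
ALPHA, BETA = 0.1, 0.2   # base rates P(C=1|A=c), P(C=1|A=a)
P, Q = 0.9, 0.9          # TPR p and TNR q, shared by both age groups

def prob_condition_given_positive(base_rate):
    """P(C=1 | A, T=1) for a group with base rate P(C=1|A) = base_rate."""
    return P * base_rate / (P * base_rate + (1 - Q) * (1 - base_rate))

print(prob_condition_given_positive(ALPHA))  # children: 0.5
print(prob_condition_given_positive(BETA))   # adults:  ~0.69
```

The same kind of mismatch appears for P(C = 0 | A, T = 0).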

4. Can we fix it?

Suppose we believe that calibration fairness (4) is more important than accuracy fairness. Obviously, we cannot alter the base rates of the condition in adults and children; however, we do have some ability to alter the test results. Suppose we adjust the p and q in Table 2 separately for children and adults (having separate cut-offs for the amount of virus that signifies a positive test result, for example). This change leads to a new joint probability distribution

Q(A, C, T) = Q(T|A, C) P(C|A) P(A),    (5)

where the conditional probability Q ( T | A , C ) would be of the form

          C = 0      C = 1
  T = 0   q_A        1 - p_A
  T = 1   1 - q_A    p_A
Table 3: Conditional probability for Q(T|A, C)

for A = c, a. Mathematically, this difference means that

Q(T | A = c, C) ≠ Q(T | A = a, C)

and we immediately see that, unfortunately, our model no longer satisfies the fairness-in-accuracy definition (2).

The new probability distribution (5) is represented diagrammatically in Figure 2. The dotted lines convey the changes brought about by the modifications in Q(T|A, C).

Figure 2: New joint probability Q(A, C, T) (the graph of Figure 1 with an additional edge from A to T).

Our goal now is to choose q_A and p_A in such a way that our calibration measure is fair. Using our conditional probability relations and the fact that P(A) = 1/2, the calibration measure can be written as

Q(C | A, T) = Q(A, C, T) / Q(A, T) = Q(T|A, C) P(C|A) / [ Σ_C Q(T|A, C) P(C|A) ].

Now, for simplicity we denote the elements α and β in Table 1 as P(C=1|A) = r_A. Then, using Table 3,

Q(C = 1 | A, T = 1) = p_A r_A / [ p_A r_A + (1 - q_A)(1 - r_A) ]    (6)

and similarly

Q(C = 0 | A, T = 0) = q_A (1 - r_A) / [ q_A (1 - r_A) + (1 - p_A) r_A ].    (7)

To satisfy calibration fairness, the quantities (6) and (7) must each be the same for children and adults. Imposing that restriction results (see Appendix A for details) in the following dependencies

p_a = (1 - q_a) ψ p_c / (1 - q_c)    (8)

and

q_a = (q_c / ψ) [ 1 - (q_c + ψ p_c) ] / [ 1 - (q_c + p_c) ]    (9)

where ψ = ϕ_c / ϕ_a and

ϕ_A = r_A / (1 - r_A).
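As a rough sketch, equations (8) and (9) can be implemented directly. The helper below (an illustrative function, not part of the original example) returns the adult parameters implied by a choice of child parameters and base rates, and flags choices for which no valid solution exists (see the constraint discussed next).

```python
# Solve equations (8) and (9) for the adult test parameters given the child
# parameters p_c, q_c and the base rates alpha = P(C=1|A=c), beta = P(C=1|A=a).
def adult_parameters(p_c, q_c, alpha, beta):
    phi_c = alpha / (1 - alpha)
    phi_a = beta / (1 - beta)
    psi = phi_c / phi_a
    q_a = (q_c / psi) * (1 - (q_c + psi * p_c)) / (1 - (q_c + p_c))  # Eq. (9)
    p_a = (1 - q_a) * psi * p_c / (1 - q_c)                          # Eq. (8)
    if not (0 <= p_a <= 1 and 0 <= q_a <= 1):
        raise ValueError("no valid adult parameters for this choice of p_c, q_c")
    return p_a, q_a

print(adult_parameters(0.9, 0.7, 0.1, 0.2))  # roughly (0.983, 0.263)
```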

This puts constraints on the parameters. For instance, if ψ < 1 then we cannot have 1 - p_c < q_c < 1 - ψ p_c, otherwise q_a would be negative. However, if the constraints are satisfied, we can rewrite our joint probability distribution as

Q(A, C, T) = Q(C|A, T) Q(T|A) P(A) = Q(C|T) Q(T|A) P(A),    (10)

which is represented diagrammatically in Figure 3 .

Figure 3: Directed graph associated to (10): A → T → C.

One set of parameters that does satisfy calibration fairness when α = 0.1 and β = 0.2 is given by

p_c = 0.9,  q_c = 0.7,

which corresponds to

p_a ≈ 0.983,  q_a ≈ 0.263.

Thus, by maintaining a high TPR and TNR for children, we are left with a situation where, for adults, although the TPR is high, the TNR is much lower, meaning we are likely to see many false positives for adults.
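A quick check, re-evaluating equations (6) to (9) with the numbers above (a minimal verification sketch, not part of the original example), confirms that both calibration quantities now agree across the two age groups.

```python
# Verify calibration fairness for p_c = 0.9, q_c = 0.7, alpha = 0.1, beta = 0.2.
p_c, q_c, alpha, beta = 0.9, 0.7, 0.1, 0.2
psi = (alpha / (1 - alpha)) / (beta / (1 - beta))                # psi = 4/9
q_a = (q_c / psi) * (1 - (q_c + psi * p_c)) / (1 - (q_c + p_c))  # Eq. (9): ~0.263
p_a = (1 - q_a) * psi * p_c / (1 - q_c)                          # Eq. (8): ~0.983

def ppv(p, q, r):
    """Q(C=1 | A, T=1), Equation (6), for a group with base rate r."""
    return p * r / (p * r + (1 - q) * (1 - r))

def npv(p, q, r):
    """Q(C=0 | A, T=0), Equation (7), for a group with base rate r."""
    return q * (1 - r) / (q * (1 - r) + (1 - p) * r)

print(ppv(p_c, q_c, alpha), ppv(p_a, q_a, beta))  # both 0.25
print(npv(p_c, q_c, alpha), npv(p_a, q_a, beta))  # both ~0.984
```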

5. Conclusions

By following the simple example from [7] we see that different base rates in the population (i.e. of condition prevalence) result in a situation where two seemingly sensible notions of fairness (accuracy and calibration) cannot be satisfied simultaneously.

Importantly, this example highlights just how important human judgement is in the interpretation of prediction models. It would be great if there were a universal definition of fairness against which each model could be benchmarked; however, this example shows that the notion of fairness is a distinctly personal concept that changes with context and lives outside the scope of the model.

There are many more aspects to the difficulties in making prediction models fair; for instance, Kleinberg et al. [4] offer other definitions and commentary (see also the seminar on YouTube [8]). Chiappa and Isaac also offer a far more expanded view on how causality techniques can tell us which fairness definitions might be valid and which might not [5] (see also their DeepMind blog post [9]). For more general reading and wider context, the popular science books Weapons of Math Destruction by Cathy O'Neil [10] and The Alignment Problem: Machine Learning and Human Values by Brian Christian [11] are also worth a read.

A. Calculations

Let us derive the dependencies (8) and (9). Taking the first expression (6), setting it equal for adults and children and rearranging gives

p_c r_c / [ (1 - q_c)(1 - r_c) ] = p_a r_a / [ (1 - q_a)(1 - r_a) ].

Similarly, doing so for the second expression (7) gives

(1 - p_c) r_c / [ q_c (1 - r_c) ] = (1 - p_a) r_a / [ q_a (1 - r_a) ].

If we set

ϕ_A = r_A / (1 - r_A)

then the first equality reduces to

p_c ϕ_c / (1 - q_c) = p_a ϕ_a / (1 - q_a)

and the second

(1 - p_c) ϕ_c / q_c = (1 - p_a) ϕ_a / q_a.

Writing ψ = ϕ_c / ϕ_a for simplicity, the first equality becomes

p_a = (1 - q_a) ψ p_c / (1 - q_c),

which gives us (8 ). The second equality is reduced to

1 - p_a = ψ (1 - p_c) q_a / q_c.

Adding the two together gives

1 = ψ [ p_c (1 - q_a) / (1 - q_c) + (1 - p_c) q_a / q_c ]
  = ψ / [ q_c (1 - q_c) ] · [ q_c p_c (1 - q_a) + (1 - q_c)(1 - p_c) q_a ]
  = ψ / [ q_c (1 - q_c) ] · [ q_a (1 - p_c - q_c) + p_c q_c ]

and therefore

q_a = [ q_c (1 - q_c) / ψ - p_c q_c ] / (1 - p_c - q_c) = (q_c / ψ) (1 - q_c - ψ p_c) / (1 - p_c - q_c),

which is the second dependency (9).
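The algebra can also be checked symbolically; the following sketch (assuming sympy is available) substitutes (9) and (8) into the remaining equality and simplifies the difference to zero.

```python
# Symbolic check of the appendix algebra: with q_a from (9) and p_a from (8),
# the remaining equality 1 - p_a = psi (1 - p_c) q_a / q_c holds identically.
import sympy as sp

p_c, q_c, psi = sp.symbols("p_c q_c psi", positive=True)

q_a = (q_c / psi) * (1 - (q_c + psi * p_c)) / (1 - (q_c + p_c))  # Eq. (9)
p_a = (1 - q_a) * psi * p_c / (1 - q_c)                          # Eq. (8)

print(sp.simplify((1 - p_a) - psi * (1 - p_c) * q_a / q_c))  # expect 0
```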

References

[1]   Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine Bias: There's software used across the country to predict future criminals. And it's biased against blacks. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, 2016.

[2]   Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. How we analyzed the COMPAS recidivism algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm, 2016.

[3]   William Dieterich, Christina Mendoza, and Tim Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc, 7(4), 2016.

[4]   Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[5]   Silvia Chiappa and William S Isaac. A causal Bayesian networks viewpoint on fairness. In IFIP International Summer School on Privacy and Identity Management, pages 3–20. Springer, 2018.

[6]   Farhan Rahman. COMPAS Case Study: Fairness of a Machine Learning Model. https://towardsdatascience.com/compas-case-study-fairness-of-a-machine-learning-model-f0f804108751, 2020.

[7]   Adam Pearce. Measuring fairness. https://pair.withgoogle.com/explorables/measuring-fairness/, 2020.

[8]   Jon Kleinberg. Inherent trade-offs in algorithmic fairness - Microsoft Research seminar. https://www.youtube.com/watch?v=p5yY2MyTJXA, 2018.

[9]   Silvia Chiappa and William S. Isaac. DeepMind Blog Post - Causal Bayesian Networks: A flexible tool to enable fairer machine learning. https://deepmind.com/blog/article/Causal_Bayesian_Networks, 2019.

[10]   Cathy O’Neil. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books, 2016.

[11]   Brian Christian. The alignment problem: Machine learning and human values. WW Norton & Company, 2020.