The impossibility of fairness in prediction models

1. Introduction

In recent years, fairness (or the lack thereof) in machine learning models has become a far more prominent issue. A major catalyst for this shift was the discussion surrounding the ProPublica analysis [1, 2], and the subsequent counter-arguments [3], of Equivant's (formerly Northpointe's) COMPAS pre-trial risk assessment tool. The analysis showed that African-American defendants were more likely to be incorrectly labelled as higher risk of recidivism, while white defendants were more likely to be incorrectly labelled as lower risk. However, the model was well calibrated across the different populations (see e.g. [4, 5, 6] for various discussions).

The existence of these two seemingly contradictory viewpoints gave rise to important theoretical questions.

  1. Which notion of fairness is correct?

  2. If one notion is fair, why is the other not?

  3. Can the predictive model be made to satisfy both notions?

Answering these questions has led to the so-called impossibility theorem [3, 4], which states that, in general, it is impossible for a predictive model to simultaneously satisfy multiple established definitions of fairness.

Below I attempt to use Adam Pearce's nice interactive example [7] (which I recommend viewing prior to reading), regarding testing for medical conditions within different age groups, to give a (hopefully) simple mathematical illustration of some of the key points. I will do this following the causal data-generation viewpoint put forth by Chiappa and Isaac [5].

2. An example model

The presented model consists of three variables: a patient's age A (either child c or adult a), the presence of a condition C (either 0 or 1), and a subsequent test result T (also 0/1 for negative/positive). In our model adults are more likely to have the condition than children (arthritis or heart disease, for example), and the probability of a test result depends only on the absence or presence of the condition.

Using those constraints, the joint probability distribution of A, C and T can be written as

P(A, C, T) := P(T|A, C) P(C|A) P(A) = P(T|C) P(C|A) P(A)    (1)

and represented diagrammatically by the graph in Figure 1. Together, the graph and probability distribution form a Bayesian Network (BN).

Figure 1: Causal representation of the model (the directed graph A → C → T).

Following the example in [7] we take the proportion of children c and adults a to be the same (i.e. P(A = a) = P(A = c) = 1/2). The probability of the condition for the different age groups, P(C|A), we write as

          A = c      A = a
  C = 0   1 - α      1 - β
  C = 1   α          β
Table 1: Conditional probability for P(C|A)

with α < β as the prevalence is higher in adults.

For the probability of the test result we have

          C = 0      C = 1
  T = 0   q          1 - p
  T = 1   1 - q      p
Table 2: Conditional probability for P(T|C)

Here p and q are the true positive rate (TPR) and true negative rate (TNR) respectively. Often there is a single tunable parameter (as in [7]) for which a result above/below a cut-off is deemed positive/negative (e.g. the concentration of virus in a COVID blood sample). In that case the TNR q would decrease as the TPR p is improved; however, we leave that trade-off aside for now.
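To make the setup concrete, here is a minimal Python sketch of the model, using illustrative values for α, β, p and q (any values with α < β would do; they are not taken from the text). It encodes Tables 1 and 2 together with P(A) = 1/2 and builds the joint distribution of Equation (1).

```python
# Minimal sketch of the model in Section 2. The parameter values are
# illustrative assumptions, not taken from the text.
ALPHA, BETA = 0.1, 0.2   # P(C=1|A=c) and P(C=1|A=a), with alpha < beta
P, Q = 0.9, 0.9          # p = P(T=1|C=1) (TPR), q = P(T=0|C=0) (TNR)

P_A = {"c": 0.5, "a": 0.5}                       # P(A): equal proportions
P_C_GIVEN_A = {"c": {0: 1 - ALPHA, 1: ALPHA},    # Table 1: P(C|A)
               "a": {0: 1 - BETA, 1: BETA}}
P_T_GIVEN_C = {0: {0: Q, 1: 1 - Q},              # Table 2: P(T|C)
               1: {0: 1 - P, 1: P}}

def joint(a, c, t):
    """P(A=a, C=c, T=t) = P(T|C) P(C|A) P(A), as in Equation (1)."""
    return P_T_GIVEN_C[c][t] * P_C_GIVEN_A[a][c] * P_A[a]

# Sanity check: the joint distribution sums to one.
print(sum(joint(a, c, t) for a in ("c", "a") for c in (0, 1) for t in (0, 1)))
```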

3. Measuring fairness

Answering any of our questions first requires a formalism: specifically, a definition of fairness that can be applied to the model. The introduction outlined two competing viewpoints, which give rise to two corresponding definitions of fairness (there are also others), outlined below.

Interestingly, in the current context both definitions might be considered reasonable, but we will see that, unfortunately, our model cannot satisfy both simultaneously. This dichotomy is the root of the impossibility theorem.

3.1 Fairness in accuracy

The first notion of fairness we might think of is that the test result T depends only on whether a person has the condition C, and is therefore the same whether the person is an adult or a child, i.e.

P(T | A = c, C) = P(T | A = a, C).

Formally we say the test results are conditionally independent of age given the condition, or mathematically

P(T | A, C) = P(T | C).    (2)

This form of fairness is sometimes called accuracy fairness, or equal false positive and true positive rates, and in our example we see immediately from Equation (1) that it is satisfied. This is the form that, according to the ProPublica analysis, was violated by the COMPAS tool.
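A quick numeric sketch, again with illustrative parameter values, recovers P(T = 1 | A, C) from the joint distribution and confirms it is the same for children and adults:

```python
# Check that accuracy fairness (2) holds in the base model: P(T=1|A,C)
# computed from the joint distribution does not depend on age.
# The parameter values are illustrative assumptions, not taken from the text.
ALPHA, BETA = 0.1, 0.2   # base rates P(C=1|A=c), P(C=1|A=a)
P, Q = 0.9, 0.9          # TPR p and TNR q, shared by both age groups

def joint(a, c, t):
    """P(A=a, C=c, T=t) = P(T|C) P(C|A) P(A), with P(A) = 1/2."""
    base_rate = ALPHA if a == "c" else BETA
    p_c_given_a = base_rate if c == 1 else 1 - base_rate
    if c == 1:
        p_t_given_c = P if t == 1 else 1 - P
    else:
        p_t_given_c = 1 - Q if t == 1 else Q
    return p_t_given_c * p_c_given_a * 0.5

def p_positive(a, c):
    """P(T=1 | A=a, C=c), obtained by conditioning the joint distribution."""
    return joint(a, c, 1) / (joint(a, c, 0) + joint(a, c, 1))

print(p_positive("c", 1), p_positive("a", 1))  # both 0.9 (= p)
print(p_positive("c", 0), p_positive("a", 0))  # both 0.1 (= 1 - q)
```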

3.2 Fairness in calibration

A second notion might also be considered, which inverts our first. Suppose we receive a particular test result T; we then want to know how likely it is that we actually have the condition C, which is arguably a more useful piece of information than our previous measure. It might then be considered unfair if an adult and a child receive the same test result but their chances of actually having the condition are different. To be fair, therefore, we can argue this measure should be the same for both adults and children, i.e.

P(C | A = c, T) = P(C | A = a, T).    (3)

Again, we can write this in the alternative mathematical form

P(C | A, T) = P(C | T),    (4)

which conveys the conditional independence of the condition C from age A given the test result T. This alternative form is sometimes referred to as calibration fairness.

Unfortunately, our example model does not satisfy this alternative definition of fairness. To see this we note that, by the definition of conditional probability (Bayes' theorem),

P(C | A, T) = P(A, C, T) / P(A, T) = P(T|C) P(C|A) / P(T|A),

where

P(T | A) = Σ_C P(T|C) P(C|A).

Using our values from Table 1 gives, for children,

P(C = 1 | A = c, T) = P(T|C=1) α / [ P(T|C=1) α + P(T|C=0) (1 - α) ]

and adults

P(C = 1 | A = a, T) = P(T|C=1) β / [ P(T|C=1) β + P(T|C=0) (1 - β) ].

To satisfy calibration fairness we make them equal, as in Equation (3), which leads, after some algebra, to

[ P(T|C=0) / P(T|C=1) ] (1 - α)/α = [ P(T|C=0) / P(T|C=1) ] (1 - β)/β.

However, this holds in general only if α = β, i.e. if the proportion of adults and children having the condition (the base rate) is the same (equivalently, P(C|A) = P(C)).
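A short numeric sketch makes the failure explicit: with illustrative values for α, β, p and q, the calibration quantity P(C = 1 | A, T = 1) differs between children and adults.

```python
# Calibration comparison for the model of Section 2, with illustrative values.
ALPHA, BETA = 0.1, 0.2   # base rates P(C=1|A=c), P(C=1|A=a)
P, Q = 0.9, 0.9          # TPR p and TNR q, shared by both age groups

def prob_condition_given_positive(base_rate):
    """P(C=1 | A, T=1) for a group with base rate P(C=1|A) = base_rate."""
    return P * base_rate / (P * base_rate + (1 - Q) * (1 - base_rate))

print(prob_condition_given_positive(ALPHA))  # children: 0.5
print(prob_condition_given_positive(BETA))   # adults:  ~0.69
```

The same kind of mismatch appears for P(C = 0 | A, T = 0).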

4. Can we fix it?

Suppose we believe that calibration fairness (4) is more important than accuracy fairness. Obviously, we cannot alter the base rates of the condition in adults and children; however, we do have some ability to alter the test results. Suppose we adjust the p and q in Table 2 separately for children and adults (having separate cut-offs for the amount of virus that signifies a positive test result, for example). This change leads to a new joint probability distribution

Q(A, C, T) = Q(T|A, C) P(C|A) P(A),    (5)

where the conditional probability Q ( T | A , C ) would be of the form

          C = 0      C = 1
  T = 0   q_A        1 - p_A
  T = 1   1 - q_A    p_A
Table 3: Conditional probability for Q(T|A, C)

for A = c, a. Mathematically, this difference means that

Q(T | A = c, C) ≠ Q(T | A = a, C)

and we immediately see that, unfortunately, our model no longer satisfies the fairness-in-accuracy definition (2).

The new probability distribution (5) is represented diagrammatically in Figure 2. The dotted lines convey the changes brought about by the modifications in Q(T|A, C).

Figure 2: New joint probability Q(A, C, T) (the graph of Figure 1 with an additional edge from A to T).

Our goal now is to choose q_A and p_A in such a way that our calibration measure is fair. Using our conditional probability relations and the fact that P(A) = 1/2, the calibration measure can be written as

Q(C | A, T) = Q(A, C, T) / Q(A, T) = Q(T|A, C) P(C|A) / [ Σ_C Q(T|A, C) P(C|A) ].

Now, for simplicity we denote the elements α and β in Table 1 as P(C=1|A) = r_A. Then, using Table 3,

Q(C = 1 | A, T = 1) = p_A r_A / [ p_A r_A + (1 - q_A)(1 - r_A) ]    (6)

and similarly

Q(C = 0 | A, T = 0) = q_A (1 - r_A) / [ q_A (1 - r_A) + (1 - p_A) r_A ].    (7)

To satisfy calibration fairness, the quantities (6) and (7) must each be the same for children and adults. Imposing that restriction results (see Appendix A for details) in the following dependencies

p_a = (1 - q_a) ψ p_c / (1 - q_c)    (8)

and

q_a = (q_c / ψ) [ 1 - (q_c + ψ p_c) ] / [ 1 - (q_c + p_c) ]    (9)

where ψ = ϕ_c / ϕ_a and

ϕ_A = r_A / (1 - r_A).
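As a rough sketch, equations (8) and (9) can be implemented directly. The helper below (an illustrative function, not part of the original example) returns the adult parameters implied by a choice of child parameters and base rates, and flags choices for which no valid solution exists (see the constraint discussed next).

```python
# Solve equations (8) and (9) for the adult test parameters given the child
# parameters p_c, q_c and the base rates alpha = P(C=1|A=c), beta = P(C=1|A=a).
def adult_parameters(p_c, q_c, alpha, beta):
    phi_c = alpha / (1 - alpha)
    phi_a = beta / (1 - beta)
    psi = phi_c / phi_a
    q_a = (q_c / psi) * (1 - (q_c + psi * p_c)) / (1 - (q_c + p_c))  # Eq. (9)
    p_a = (1 - q_a) * psi * p_c / (1 - q_c)                          # Eq. (8)
    if not (0 <= p_a <= 1 and 0 <= q_a <= 1):
        raise ValueError("no valid adult parameters for this choice of p_c, q_c")
    return p_a, q_a

print(adult_parameters(0.9, 0.7, 0.1, 0.2))  # roughly (0.983, 0.263)
```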

This puts constraints on the parameters. For instance, if ψ < 1 then we cannot have 1 - p_c < q_c < 1 - ψ p_c, otherwise q_a would be negative. However, if the constraints are satisfied, we can rewrite our joint probability distribution as

Q(A, C, T) = Q(C|A, T) Q(T|A) P(A) = Q(C|T) Q(T|A) P(A),    (10)

which is represented diagrammatically in Figure 3 .

Figure 3: Directed graph associated to (10): A → T → C.

One set of parameters that does satisfy calibration fairness when α = 0.1 and β = 0.2 is given by

p_c = 0.9,  q_c = 0.7,

which corresponds to

p_a ≈ 0.983,  q_a ≈ 0.263.

Thus, by maintaining a high TPR and TNR for children, we are left with a situation where, for adults, although the TPR is high, the TNR is much lower, meaning we are likely to see many false positives for adults.
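A quick check, re-evaluating equations (6) to (9) with the numbers above (a minimal verification sketch, not part of the original example), confirms that both calibration quantities now agree across the two age groups.

```python
# Verify calibration fairness for p_c = 0.9, q_c = 0.7, alpha = 0.1, beta = 0.2.
p_c, q_c, alpha, beta = 0.9, 0.7, 0.1, 0.2
psi = (alpha / (1 - alpha)) / (beta / (1 - beta))                # psi = 4/9
q_a = (q_c / psi) * (1 - (q_c + psi * p_c)) / (1 - (q_c + p_c))  # Eq. (9): ~0.263
p_a = (1 - q_a) * psi * p_c / (1 - q_c)                          # Eq. (8): ~0.983

def ppv(p, q, r):
    """Q(C=1 | A, T=1), Equation (6), for a group with base rate r."""
    return p * r / (p * r + (1 - q) * (1 - r))

def npv(p, q, r):
    """Q(C=0 | A, T=0), Equation (7), for a group with base rate r."""
    return q * (1 - r) / (q * (1 - r) + (1 - p) * r)

print(ppv(p_c, q_c, alpha), ppv(p_a, q_a, beta))  # both 0.25
print(npv(p_c, q_c, alpha), npv(p_a, q_a, beta))  # both ~0.984
```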

5. Conclusions

By following the simple example from [7] we see that different base rates in the population (i.e. of condition prevalence) result in a situation where two seemingly sensible notions of fairness (accuracy and calibration) cannot be satisfied simultaneously.

Importantly, this example highlights just how important human judgement is in the interpretation of prediction models. It would be great if there were a universal definition of fairness against which each model could be benchmarked; however, this example shows that the notion of fairness is a distinctly personal concept that changes with context and lives outside the scope of the model.

There are many more aspects to the difficulties in making prediction models fair; for instance, Kleinberg et al. [4] offer other definitions and commentary (see also the seminar on YouTube [8]). Chiappa and Isaac also offer a far more expanded view on how causality techniques can tell us which fairness definitions might be valid and which might not [5] (see also their DeepMind blog post [9]). For more general reading and wider context, the popular science books Weapons of Math Destruction by Cathy O'Neil [10] and The Alignment Problem: Machine Learning and Human Values by Brian Christian [11] are also worth a read.

A. Calculations

Let us derive the dependencies (8) and (9). Taking the first expression (6), setting it equal for adults and children and rearranging gives

p_c r_c / [ (1 - q_c)(1 - r_c) ] = p_a r_a / [ (1 - q_a)(1 - r_a) ].

Similarly, doing so for the second expression (7) gives

(1 - p_c) r_c / [ q_c (1 - r_c) ] = (1 - p_a) r_a / [ q_a (1 - r_a) ].

If we set

ϕ_A = r_A / (1 - r_A)

then the first equality reduces to

p_c ϕ_c / (1 - q_c) = p_a ϕ_a / (1 - q_a)

and the second

(1 - p_c) ϕ_c / q_c = (1 - p_a) ϕ_a / q_a.

Writing ψ = ϕ_c / ϕ_a for simplicity, the first equality becomes

p_a = (1 - q_a) ψ p_c / (1 - q_c),

which gives us (8 ). The second equality is reduced to

1 - p_a = ψ (1 - p_c) q_a / q_c.

Adding the two together gives

1 = ψ [ p_c (1 - q_a) / (1 - q_c) + (1 - p_c) q_a / q_c ]
  = ψ / [ q_c (1 - q_c) ] · [ q_c p_c (1 - q_a) + (1 - q_c)(1 - p_c) q_a ]
  = ψ / [ q_c (1 - q_c) ] · [ q_a (1 - p_c - q_c) + p_c q_c ]

and therefore

q_a = [ q_c (1 - q_c) / ψ - p_c q_c ] / (1 - p_c - q_c) = (q_c / ψ) (1 - q_c - ψ p_c) / (1 - p_c - q_c),

which is the second dependency (9).
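The algebra can also be checked symbolically; the following sketch (assuming sympy is available) substitutes (9) and (8) into the remaining equality and simplifies the difference to zero.

```python
# Symbolic check of the appendix algebra: with q_a from (9) and p_a from (8),
# the remaining equality 1 - p_a = psi (1 - p_c) q_a / q_c holds identically.
import sympy as sp

p_c, q_c, psi = sp.symbols("p_c q_c psi", positive=True)

q_a = (q_c / psi) * (1 - (q_c + psi * p_c)) / (1 - (q_c + p_c))  # Eq. (9)
p_a = (1 - q_a) * psi * p_c / (1 - q_c)                          # Eq. (8)

print(sp.simplify((1 - p_a) - psi * (1 - p_c) * q_a / q_c))  # expect 0
```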

References

[1]   Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine Bias: There's software used across the country to predict future criminals. And it's biased against blacks. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, 2016.

[2]   Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. How we analyzed the COMPAS recidivism algorithm. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm, 2016.

[3]   William Dieterich, Christina Mendoza, and Tim Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc, 7(4), 2016.

[4]   Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

[5]   Silvia Chiappa and William S Isaac. A causal Bayesian networks viewpoint on fairness. In IFIP International Summer School on Privacy and Identity Management, pages 3–20. Springer, 2018.

[6]   Farhan Rahman. COMPAS Case Study: Fairness of a Machine Learning Model. https://towardsdatascience.com/compas-case-study-fairness-of-a-machine-learning-model-f0f804108751, 2020.

[7]   Adam Pearce. Measuring fairness. https://pair.withgoogle.com/explorables/measuring-fairness/, 2020.

[8]   Jon Kleinberg. Inherent trade-offs in algorithmic fairness - Microsoft Research seminar. https://www.youtube.com/watch?v=p5yY2MyTJXA, 2018.

[9]   Silvia Chiappa and William S. Isaac. DeepMind Blog Post - Causal Bayesian Networks: A flexible tool to enable fairer machine learning. https://deepmind.com/blog/article/Causal_Bayesian_Networks, 2019.

[10]   Cathy O’Neil. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books, 2016.

[11]   Brian Christian. The alignment problem: Machine learning and human values. WW Norton & Company, 2020.