How Simpson’s Paradox Can Mislead Statistics



And why ML interpretations are not always reliable

Photo by Jason Leung on Unsplash

If there is one thing I really hate hearing, it is the classic appeal to authority: “Statistics show that *insert a fact here*”.

With the democratization of statistical tools and machine learning, it is easier than ever to crunch numbers into insights. Still, everyone should be careful not to fall into some very counterintuitive traps that can bias studies done a bit too fast, or without the input of a subject-matter expert.

In this article, we will take a deep dive into one classic statistical trap, Simpson’s paradox, which can lead to misinterpreting a feature’s effect in a model because of hidden correlated factors.

I designed this post for people with basic knowledge of statistics and machine learning, and particularly for data scientists and analysts who are not aware of the paradox. Nevertheless, the cases exposed in part II might interest anyone willing to discover concrete situations in which data can be made to say one thing and its opposite.

What is this paradox about?

You are running an analysis on a dataset. You have a bunch of features X1, …, Xn that you are using to predict a target variable “y”.

Simpson’s paradox can appear when two conditions are met:

  • You are missing at least one variable (let’s call it Xs) that explains part of the variance of your target “y”.
  • There is a direct causal link between this hidden variable Xs and one of your features Xb.

In that case, omitting Xs might directly lead to a misinterpretation of the impact of Xb on the target variable y.

Illustration of the paradox

The case of the linear regression

To get an intuition behind the paradox, let’s study the simple case of a linear regression with a target “y” connected to two variables “X1” and “X2”, plus an independent and identically distributed error term ε:

y = a·X1 + b·X2 + ε

In normal conditions, when the variables X1 and X2 are not correlated, omitting X2 and regressing y on X1 would simply give:

y = a·X1 + ε1

with:

ε1 = b·X2 + ε
Let’s verify this with a bunch of simulated data, taking a = 1 and b = 1:

import numpy as np
from sklearn.linear_model import LinearRegression

a = 1
b = 1

#Let's draw some random, uncorrelated points
X1 = np.random.normal(0, 1, 5000)
X2 = np.random.normal(0, 1, 5000)
eps = np.random.normal(0, 1, 5000)

#Create our target value
y = a*X1 + b*X2 + eps

#Get the coefficients of the linear regression of y on X1
lr = LinearRegression().fit(X1.reshape(-1, 1), y)
theta_1, theta_0 = lr.coef_[0], lr.intercept_

We confirm that when there is no correlation between X2 and X1 (see the quick check below):

  • The slope of the linear regression of y on X1 is equal to “a” (2nd figure).
  • The error of the regression, ε1, is equal to the term b·X2 + ε (3rd figure).
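Since the original figures are not reproduced here, a minimal numerical check, reusing the variables from the snippet above, confirms both points:

#Sanity check: the fitted slope should be close to a = 1
print(f"Estimated slope: {theta_1:.3f} (expected ~ a = {a})")

#The regression residuals should essentially be b*X2 + eps
residuals = y - lr.predict(X1.reshape(-1, 1))
print(f"Correlation(residuals, b*X2 + eps): {np.corrcoef(residuals, b*X2 + eps)[0, 1]:.3f}")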

When X1 and X2 are not independent

We are now going to assume that there is a direct correlation between X1 and X2:

X2 = c·X1 + ε2

In this case, we cannot simply re-inject X2 into the error term, because the error would no longer be independent of X1:

y = a·X1 + (b·c·X1 + b·ε2 + ε)

Due to the term in X1, this error is not independent anymore. To retrieve an independent error, we need to rearrange the equation around X1:

y = (a + b·c)·X1 + (b·ε2 + ε)

And this time, unlike before, the coefficient is no longer equal to “a”, but includes a term accounting for the correlation between X1 and X2: (a + bc).

And this is where Simpson’s paradox lies: depending on the values of b (the influence of X2 on y) and c (the influence of X2 on X1), the apparent relation between y and X1 can be much stronger, much weaker or even… opposite in sign!

Note also that the paradox vanishes if b = 0 or c = 0, confirming the two conditions exposed earlier in this simplified case.

To visualize this effect, let’s modify our simulation:

a = 1
b = -1
c = 2
X1 = np.random.normal(0, 1, 5000)

#Define X2 as correlated to X1
eps_2 = np.random.normal(0, 1, 5000)
X2 = c*X1 + eps_2

#Define the target as before
eps = np.random.normal(0, 1, 5000)
y = a*X1 + b*X2 + eps

lr = LinearRegression().fit(X1.reshape(-1, 1), y)
theta_1, theta_0 = lr.coef_[0], lr.intercept_

With b = -1 and c = 2, the expected slope is a + bc = 1 − 2 = −1: we should find a negative coefficient when regressing y on X1 instead of a positive one…

… And this is exactly what we find (2nd figure). Note that we also retrieve the regression error ε + b·ε2 (3rd figure), as expected from our little calculation.
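As before, since the figures are not reproduced here, a quick check on the variables from the snippet above shows the fitted slope matching a + bc rather than a:

print(f"Estimated slope: {theta_1:.3f} (expected ~ a + b*c = {a + b*c})")

#The residuals now match eps + b*eps_2
residuals = y - lr.predict(X1.reshape(-1, 1))
print(f"Correlation(residuals, eps + b*eps_2): {np.corrcoef(residuals, eps + b*eps_2)[0, 1]:.3f}")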

And that is it: we have exposed Simpson’s paradox for our linear system. By omitting the impact of the variable X2, correlated both to X1 and to y, we exposed ourselves to a wrong interpretation of the coefficient of the regression of y on X1.

Note that, according to our simple example, the paradox can occur for a wide range of values of b and c. For example, b = 2 and c = -1 would also create a Simpson’s paradox, in which X1 and X2 are negatively correlated while X2 has a strong influence on y.

This is of course a very simple example, but it is useful to develop the intuition around this particular phenomenon.

Some famous examples of Simpson’s paradox

This looks like a fun theoretical problem, but the paradox shows up in many real situations. Let’s dig into two real-life case studies to see where statistics could actually fool someone not fully aware of those subtleties, and see how they connect to what we highlighted in the previous section.

I took the examples from the excellent post (in French) by David Louapre, which inspired me to go further into the analysis of the paradox.

Smoking and life expectancy

In the first part of “Ignoring a covariate: An example of Simpson’s paradox” [1], Appleton, French and Vanderpump study the survival rate (y) of smoking and non-smoking women (X1).

To visualize the paradox easily in such a case, we are going to simulate a similar dataset:

import pandas as pd

#Probability of NOT smoking, per age group
proba_not_smoke = {
    "18-25": 0.52,
    "26-35": 0.55,
    "36-44": 0.51,
    "45-55": 0.46,
    "56-67": 0.53,
    "68-75": 0.8,
    "76+": 0.85
}

#Probability of dying naturally within 20 years, per age group
proba_die_naturally = {
    "18-25": 0.01,
    "26-35": 0.01,
    "36-44": 0.05,
    "45-55": 0.13,
    "56-67": 0.3,
    "68-75": 0.6,
    "76+": 0.95
}

#Additional probability of dying from smoking, per age group
proba_die_smoking = {
    "18-25": 0.001,
    "26-35": 0.004,
    "36-44": 0.008,
    "45-55": 0.01,
    "56-67": 0.01,
    "68-75": 0.01,
    "76+": 0.01
}

def get_group(age):
    '''Simple function to transform an age into a category'''
    if age <= 25:
        return "18-25"
    if age <= 35:
        return "26-35"
    if age <= 44:
        return "36-44"
    if age <= 55:
        return "45-55"
    if age <= 67:
        return "56-67"
    if age <= 75:
        return "68-75"
    return "76+"


POPULATION = 5000

#Create a random population
population_ages = np.random.randint(18, 90, POPULATION)
dataset = []
for age in population_ages:
    group = get_group(age)

    #draw the probas for this age group
    p_not_smoke = proba_not_smoke[group]
    p_die_naturally = proba_die_naturally[group]
    p_die_smoking = proba_die_smoking[group]

    #calculate the condition of the person based on the probas drawn
    smoker = np.random.random() > p_not_smoke
    died_naturally = np.random.random() < p_die_naturally
    died_smoking = np.random.random() < p_die_smoking  #fixed: use the smoking proba, not the natural one
    died = died_naturally | (died_smoking & smoker)
    dataset.append({"smoker": smoker * 1,
                    "age_group": group,
                    "lived_after_20_years": 1 - died})

df = pd.DataFrame(dataset)
Samples of the dataset (synthetic)

The figure below shows the life expectancy in our simulated dataset:

Life expectancy when splitting the dataset by our feature of interest (smoking), on the simulated dataset

As we can see, a preliminary analysis of the global population seems to show that smokers tend to have a higher life expectancy.

This very strange result comes from the fact that a confounding variable, the age of the people studied (our variable X2, following the example in part I), is omitted from the chart.

Let’s now see how the results look when we add the age variable to the analysis.

Life expectancy when splitting the dataset by our feature of interest (smoking) and by age, on the simulated dataset

This time we can see that the 20-year survival rate is better for the non-smokers in every age group.
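For readers following along without the charts, the same two views can be obtained with a quick aggregation on the simulated dataframe (column names as defined in the snippet above):

#Global view: survival rate by smoking status (the misleading one)
print(df.groupby("smoker")["lived_after_20_years"].mean())

#Stratified view: survival rate by age group and smoking status
print(df.groupby(["age_group", "smoker"])["lived_after_20_years"].mean().unstack())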

When we look at the data from a different angle, we can spot the correlation between the age (X2) and the Smoker/Non-Smoker category (X1).

Repartition of smokers in the population by age group, in our simulated dataset
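The same correlation can be read directly from the simulated data:

#Share of smokers per age group: the older groups smoke much less
print(df.groupby("age_group")["smoker"].mean())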

And we fall back on the set of conditions exposed in the first part:

  • Our age variable “X2” is strongly negatively correlated with our target (y), the 20-year survival (giving b << 0), as elders are less likely to live 20 more years than younger people.
  • The age (X2) is also negatively correlated with being a smoker (X1) (c < 0).
  • As a result, if we omit the age X2, we get the impression that smoking improves life expectancy (a + bc > 0), while it actually reduces it (a < 0).

Gender and school admissions

Another very famous example of Simpson’s paradox was highlighted in “Sex bias in graduate admissions: Data from Berkeley” [2].

The original dataset exposes the admissions to Berkeley in 1973 by department and gender; we are again going to prepare a similar synthetic variation for illustration.

#Probability of admission per major
admissions_proba = {
    "Major_1": 0.74,
    "Major_2": 0.64,
    "Major_3": 0.35,
    "Major_4": 0.34,
    "Major_5": 0.24,
    "Major_6": 0.06
}

#Probability that an applicant to a given major is a woman
gender_proba = {
    "Major_1": 0.08,
    "Major_2": 0.04,
    "Major_3": 0.65,
    "Major_4": 0.47,
    "Major_5": 0.67,
    "Major_6": 0.47,
}

def get_major(p):
    """Select a major randomly based on some fixed probabilities"""
    if p < 0.25:
        return "Major_1"
    if p < 0.37:
        return "Major_2"
    if p < 0.54:
        return "Major_3"
    if p < 0.70:
        return "Major_4"
    if p < 0.82:
        return "Major_5"
    return "Major_6"


POPULATION = 12000
dataset = []
for i in range(POPULATION):
    p_major = np.random.random()
    major = get_major(p_major)
    gender = ['M', 'F'][np.random.random() < gender_proba[major]]
    admission = ['Accepted', 'Rejected'][np.random.random() > admissions_proba[major]]
    dataset.append({"Major": major, "Gender": gender, "Admission": admission})

df = pd.DataFrame(dataset)
Synthetic data to illustrate the sex bias in graduate admissions paradox

When we simply look at the statistics by gender (X1), we discover a statistically significant bias in the admission rate (y) in favor of male applicants.

Simple split of admission rate, by gender, on synthetic data

On the other hand, adding the Major (X2) to the analysis reverses the conclusion on the gender bias in most departments, revealing the paradox.

Adding the major exposes the paradox, example on synthetic data
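Again, as the charts are not reproduced here, both views can be computed directly from the synthetic dataframe built above:

#Global view: admission rate by gender (appears biased toward men)
print(df.groupby("Gender")["Admission"].apply(lambda s: (s == "Accepted").mean()))

#Stratified view: admission rate by major and gender
print(df.pivot_table(index="Major", columns="Gender", values="Admission",
                     aggfunc=lambda s: (s == "Accepted").mean()))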

As in the previous example, the paradox can be explained by a correlation of the selected Major (X2) with both our target y (admission, through the major’s selectivity) and our original feature X1 (the gender).

The graph below shows that, in the dataset, females tend to apply more to the very selective majors while males apply more to the less selective ones.

The hidden correlations between the Major (X2) and our target and original feature
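This applicant mix can also be checked numerically, by comparing the female share of applicants with the admission rate of each major:

#Female share of applicants per major, alongside the major's admission rate
mix = df.groupby("Major").agg(
    female_share=("Gender", lambda s: (s == "F").mean()),
    admission_rate=("Admission", lambda s: (s == "Accepted").mean()),
)
print(mix.sort_values("admission_rate"))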

The paradox with advanced ML tools

Until now we have discussed Simpson’s paradox on relatively simple datasets. But it is also possible to build much more complicated examples where even advanced ML and feature-explanation algorithms can be misleading if all the relevant data is not included.

In this part, I simulate a “house price dataset” made of 5 features:

  • Asset surface
  • Garage surface
  • Garden surface
  • House condition
  • Area Fanciness (optional parameter)

And we are going to predict the asset price based on those 5 features.

As a sneaky person, I designed the Area Fanciness feature a bit differently:

I built this feature so that the fanciness of the area has a non-negligible impact on our target (SalePrice), but also on the Garage Surface, which will play the role of the feature victim of the paradox (the “X1” from the previous examples): we can, for example, imagine that in fancy areas there is less room for big garages, explaining the negative correlation.

This is completely made-up of course, so don’t take it too seriously, but it will allow me to make my point.

N = 3000
n_clusters = 10
inc = N // n_clusters
n_samples = inc * n_clusters

#Create our main features
asset_surface = np.random.normal(90, 25, n_samples)
garden_surface = np.random.normal(500, 100, n_samples)
house_condition = np.random.randint(0, 5, n_samples) / 10 + 1

#Build the two last features so they have a negative correlation
garage_surface = np.array([])
fanciness_area = np.array([])
for i in range(n_clusters):
    gs = np.random.normal(i, 1, inc)
    fa = np.random.normal(-i, 1, inc)
    garage_surface = np.hstack([garage_surface, gs])
    fanciness_area = np.hstack([fanciness_area, fa])

#Just make it a bit more non-linear...
fanciness_area = np.exp((fanciness_area + 12) / 10)

#Gather the features in a dataframe (one named column per feature)
df = pd.DataFrame({
    "asset_surface": asset_surface,
    "garden_surface": garden_surface,
    "house_condition": house_condition,
    "garage_surface": garage_surface,
    "fanciness_area": fanciness_area,
})

#We build our target so that the formula is non-linear and stays in the range of a house price...
sale_price = (asset_surface*(fanciness_area**0.5)*2 + garden_surface/5
              + garage_surface/2)*house_condition*100 + 50000 \
             + np.random.normal(0, 20000, n_samples)

I will skip the feature engineering and cleaning and jump directly to the conclusion. I used a fine-tuned xgboost model to predict y using all the features of the dataset (with and without our “Area Fanciness” feature) and checked the results in terms of feature interpretation using the shap library, the same way I would for a classical ML project (a sketch of the setup is given after the note below).

Note: For those who don’t know the library, shap is an amazing tool to perform feature explanations, particularly efficient for tree-based algorithms.
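The exact training code is not shown in the original post; the following is a minimal sketch of how such a comparison could be run. The hyperparameters are illustrative assumptions, and the column names are the ones used when building the dataframe above:

import xgboost as xgb
import shap

features_full = ["asset_surface", "garden_surface", "house_condition",
                 "garage_surface", "fanciness_area"]
features_partial = [f for f in features_full if f != "fanciness_area"]

for cols in (features_full, features_partial):
    X = df[cols]
    model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(X, sale_price)

    #TreeExplainer computes SHAP values efficiently for tree ensembles
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    #Summary plot: global importance and per-sample impact of each feature
    shap.summary_plot(shap_values, X)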

Some keys to interpret the graphs, for those not familiar with shap:

  • The features are sorted top to bottom by importance: a feature placed at the top of the list has a big impact on the target, while features at the bottom have a much lower impact.
  • Each point represents a single sample. Red indicates that the feature has a high value for the given sample, while blue indicates a low value.
  • The further a point is located toward the left, the more negative the feature’s impact was for that sample; the further to the right, the more positive the impact.
Shap values give a different interpretation on the Garage Surface (X1) depending on the presence of the Area Fanciness feature (X2)

In our scenario, we can clearly observe the paradox: if we forget to include the “Area Fanciness” feature, our interpretation of the results will be biased even though the model scores are not that different. It has two big consequences:

  • Overestimating the impact of the Garage Surface: it ranks much higher in the list without Area Fanciness.
  • Changing our interpretation of the Garage Surface feature: if we don’t include Area Fanciness, the shap values tell us that a high garage surface tends to have a negative effect on the selling price, while once Area Fanciness is included the explanation becomes that a high garage surface boosts the selling price.

In this case it is easy to see that something is wrong, since a big garage should theoretically increase the value of the asset.

Nevertheless, in other situations, with a large number of features that are harder to interpret, you might fall into the paradox without even realizing it and provide wrong interpretations of your results.

Conclusion

In this article I tried to give you the intuition behind the famous Simpson’s paradox and to highlight some of the dangers that can hide behind a statistical analysis.

We saw in particular that having the best ML models and state-of-the-art explanation algorithms is not sufficient to avoid it if you don’t have a comprehensive view of the data you are studying.

I personally don’t have a general solution to avoid this kind of paradox, but as data scientists and data analysts, we need to remember that “there is no free lunch”, request the help of field experts whenever possible, and always question our results.

This post is particularly important to me, as I became eager to learn more about statistics when I first learned about this paradox years ago, which led me to where I am today.

I hope you enjoyed reading it as much as I enjoyed writing it.

[1] Appleton, David R., Joyce M. French, and Mark P. J. Vanderpump. “Ignoring a covariate: An example of Simpson’s paradox.” The American Statistician 50.4 (1996): 340–341.
[2] Bickel, Peter J., Eugene A. Hammel, and J. William O’Connell. “Sex bias in graduate admissions: Data from Berkeley.” Science 187.4175 (1975): 398–404.

How Simpson’s Paradox Can Mislead Statistics was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
