# HumanCompatible · Bias Detection
A toolbox for measuring bias in data & models
**Maximum Subgroup Discrepancy (MSD)** -- bias metric with linear sample complexity
_...with a MILP formulation that also tells you which subgroup is most affected._
\[[Paper Link](https://dl.acm.org/doi/10.1145/3711896.3736857)\]
**ℓ∞** -- fast pass/fail bias test for a chosen (sub)group
_...it compares a (sub)group vs the general dataset trend against a tolerance ∆._
\[[Paper Link](https://arxiv.org/abs/2502.02623)\]
---
## Quick install & demo
```bash
python -m pip install humancompatible-detect
```
```python
from humancompatible.detect import detect_and_score
rule, msd_val = detect_and_score(
    csv_path="./data/01_data.csv",
    target_col="Target",
    protected_list=["Race", "Age"],
    method="MSD",
)
```
The function returns:
- **`msd_val`** -- the maximum gap (in percentage-points) between any subgroup and its complement.
- **`rule`** -- the raw subgroup encoding as a list of `(feature_index, Bin)` pairs.
To get a human-readable description, join the conditions of the rule:
```python
pretty = " AND ".join(str(cond) for _, cond in rule)
print(f"MSD = {msd_val:.3f}")
print("Subgroup:", pretty)
```
## Contents
```{toctree}
:maxdepth: 1
self
api/humancompatible.detect
api/humancompatible.detect.methods.msd
api/humancompatible.detect.methods.l_inf
Tutorial
Examples
```
## Featured examples
If you want to jump straight into notebooks:
- **Simple example notebook**:
- **Realistic example on Folktables**:
- **Exploring the API**:
- and more in the [examples folder](https://github.com/humancompatible/detect/tree/main/examples)
---
## MSD as a distance?
Bias detection can be understood as measuring a distance between two distributions (for example, positive vs. negative samples, or a training dataset vs. general-population data).
However, most distances have sample complexity that is exponential in the dimension, whereas MSD needs a number of samples that is only linear in the dimension to achieve the same error.
| Metric | Needs to look at | Sample cost | Notes |
| --- | --- | --- | --- |
| Wasserstein, TV, MMD, ... | full d-dimensional joint | Ω(2^d) | exponential sample cost, no group explanation |
| MSD (ours) | only protected attrs | O(d) | returns exact subgroup & gap |
MSD maximises the absolute difference in probability over all protected-attribute combinations (subgroups), yet it remains solvable in practice: an exact mixed-integer optimization searches the doubly-exponential space efficiently.
## Subsampled ℓ∞ norm
A different approach uses subsampled distances on measure spaces. In this setting, you first choose a group to be tested for bias. Its data is then transformed into a multidimensional histogram, which is compared bin by bin to a reference histogram obtained from the whole dataset under study. The tolerance ∆ for this comparison is specified in advance. Subsampling is essential here, since the number of comparisons can be exceedingly high. Crucially, the following guarantee on the subsample size holds:
$$
s = O\left(\frac{n \log n}{\varepsilon}\,\log\frac{n \log n}{\varepsilon} + \frac{1}{\varepsilon}\log\frac{1}{\delta}\right)
$$
where s is the number of samples taken, n is the number of subgroups considered, ε is the fraction of comparisons that exceed the threshold ∆, and δ is the probability of missing a biased subgroup.
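The two ingredients above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's implementation: `subsample_size` simply evaluates the expression inside the O(·) with a hypothetical constant `c` (the bound does not specify it), and `linf_gap` compares all bins without any subsampling. The function names, the `features` parameter, and the toy rows are invented for this example.

```python
import math
from collections import Counter


def subsample_size(n, eps, delta, c=1.0):
    """Evaluate the expression inside the O(.) bound.

    `c` is a placeholder for the unspecified constant, so the result
    only shows how the bound scales, not an actual required size.
    """
    if n < 2 or not 0 < eps < 1 or not 0 < delta < 1:
        raise ValueError("expected n >= 2 and eps, delta in (0, 1)")
    a = n * math.log(n) / eps
    return math.ceil(c * (a * math.log(a) + math.log(1.0 / delta) / eps))


def histogram(rows, features):
    """Normalised multidimensional histogram: bin -> relative frequency."""
    counts = Counter(tuple(r[f] for f in features) for r in rows)
    return {b: k / len(rows) for b, k in counts.items()}


def linf_gap(rows, in_group, features):
    """Largest bin-wise difference between the group and the whole dataset."""
    reference = histogram(rows, features)
    group = histogram([r for r in rows if in_group(r)], features)
    return max(abs(group.get(b, 0.0) - p) for b, p in reference.items())


# Toy data (made up for this example).
rows = [
    {"Sex": "F", "Target": 1},
    {"Sex": "F", "Target": 0},
    {"Sex": "F", "Target": 0},
    {"Sex": "F", "Target": 0},
    {"Sex": "M", "Target": 1},
    {"Sex": "M", "Target": 1},
    {"Sex": "M", "Target": 1},
    {"Sex": "M", "Target": 0},
]

tolerance = 0.1  # plays the role of the threshold ∆
worst = linf_gap(rows, lambda r: r["Sex"] == "F", features=["Target"])
print(worst, worst <= tolerance)  # 0.25 False

# How the bound scales when eps is halved (constant c assumed to be 1).
print(subsample_size(10, 0.2, 0.05), subsample_size(10, 0.1, 0.05))
```

The test fails for the chosen group because one bin deviates from the reference by 0.25, which is above the tolerance of 0.1.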
---
## Citation
If you use MSD, please cite:
```bibtex
@inproceedings{MSD,
author = {N\v{e}me\v{c}ek, Ji\v{r}\'{\i} and Kozdoba, Mark and Kryvoviaz, Illia and Pevn\'{y}, Tom\'{a}\v{s} and Mare\v{c}ek, Jakub},
title = {Bias Detection via Maximum Subgroup Discrepancy},
year = {2025},
isbn = {9798400714542},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3711896.3736857},
doi = {10.1145/3711896.3736857},
booktitle = {Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2},
pages = {2174--2185},
numpages = {12},
location = {Toronto ON, Canada},
series = {KDD '25}
}
```
If you used the ℓ∞ method, please cite:
```bibtex
@misc{matilla2025samplecomplexitybiasdetection,
title={Sample Complexity of Bias Detection with Subsampled Point-to-Subspace Distances},
author={M. Matilla, Germán and Mareček, Jakub},
year={2025},
eprint={2502.02623},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.02623v1},
}
```
Looking for the installation matrix, solver details or developer setup?
Head to the [**README -> Installation**](https://github.com/humancompatible/detect?tab=readme-ov-file#installation-details) section.