HumanCompatible · Bias Detection
A toolbox for measuring bias in data & models
Maximum Subgroup Discrepancy (MSD) – bias metric with linear sample complexity …with a MILP formulation that also tells you which subgroup is most affected. [Paper Link]
ℓ∞ – fast pass/fail bias test for a chosen (sub)group …it compares a (sub)group vs the general dataset trend against a tolerance ∆. [Paper Link]
Quick install & demo
python -m pip install humancompatible-detect
from humancompatible.detect import detect_and_score
rule, msd_val = detect_and_score(
    csv_path="./data/01_data.csv",
    target_col="Target",
    protected_list=["Race", "Age"],
    method="MSD",
)
The function returns:
msd_val – the maximum gap (in percentage points) between any subgroup and its complement
rule – the raw subgroup encoding as a list of (feature_index, Bin) pairs.
To get a human-readable description, do the following:
pretty = " AND ".join(str(cond) for _, cond in rule)
print(f"MSD = {msd_val:.3f}")
print("Subgroup:", pretty)
Featured examples
If you want to jump straight into notebooks:
Simple example notebook: https://github.com/humancompatible/detect/blob/main/examples/01_basic_usage.ipynb
Realistic example on Folktables: https://github.com/humancompatible/detect/blob/main/examples/02_folktables_within-state.ipynb
Exploring the API: https://github.com/humancompatible/detect/blob/main/examples/04_exploring_functionality.ipynb
and more in the examples folder
MSD as a distance?
Bias detection can be understood as measuring a distance between two distributions (for example, positive vs. negative samples, or a training dataset vs. general population data).
However, most distances have exponential sample complexity, whereas MSD requires a linear number of samples (w.r.t. the dimension) to achieve the same error.
| Classical metric | Needs to look at | Sample cost | Drawback |
|---|---|---|---|
| Wasserstein, TV, MMD, ... | full d-dimensional joint | Ω(2^d) | exponential sample cost, no group explanation |
| MSD (ours) | only protected attrs | O(d) | returns exact subgroup & gap |
MSD maximises the absolute difference in probability over all protected-attribute combinations (subgroups), yet is solvable in practice through an exact Mixed-Integer optimization that scans the doubly-exponential space effectively.
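To make the quantity being maximised concrete, here is a minimal brute-force sketch. It is not the library's MILP solver. It assumes that the two distributions are the positive and negative samples, that a subgroup is a conjunction of exact attribute values, and that the gap of a subgroup is the absolute difference between its share among positives and its share among negatives. The function name `brute_force_msd` and the toy data are hypothetical.

```python
from itertools import combinations, product


def share(sample, rule):
    """Fraction of rows in `sample` that satisfy every (attribute, value) pair in `rule`."""
    hits = sum(all(row[attr] == val for attr, val in rule) for row in sample)
    return hits / len(sample)


def brute_force_msd(rows, labels, protected):
    """Enumerate every conjunction over the protected attributes (illustration only).

    Assumes both classes are non-empty. The cost grows exponentially with the
    number of protected attributes, which is why an exact solver is used in practice.
    """
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    values = {a: sorted({r[a] for r in rows}) for a in protected}

    best_rule, best_gap = (), 0.0
    for k in range(1, len(protected) + 1):
        for attrs in combinations(protected, k):
            for combo in product(*(values[a] for a in attrs)):
                rule = tuple(zip(attrs, combo))
                gap = abs(share(pos, rule) - share(neg, rule))
                if gap > best_gap:
                    best_rule, best_gap = rule, gap
    return best_rule, best_gap


# Hypothetical toy data: four positive and four negative samples.
rows = [
    {"Race": "A", "Age": "young"}, {"Race": "A", "Age": "young"},
    {"Race": "A", "Age": "young"}, {"Race": "B", "Age": "old"},
    {"Race": "B", "Age": "old"}, {"Race": "B", "Age": "old"},
    {"Race": "B", "Age": "young"}, {"Race": "A", "Age": "old"},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

rule, gap = brute_force_msd(rows, labels, ["Race", "Age"])
print(" AND ".join(f"{a} = {v}" for a, v in rule), f"(gap = {gap:.2f})")
# prints: Race = A AND Age = young (gap = 0.75)
```

On this toy data every single-attribute subgroup has a gap of 0.5, while the intersection "Race = A AND Age = young" reaches 0.75. Checking attributes one at a time would miss this, which is why the search has to cover combinations.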
Subsampled ℓ∞ norm
A different approach uses subsampled distances on measure spaces. After choosing a group to test for bias, the data is transformed into a multidimensional histogram, which is compared bin by bin against a reference histogram built from the whole dataset under study. The tolerance ∆ for this comparison is specified in advance. Subsampling is essential here, since the number of comparisons can be extremely large. Crucially, the following guarantee for the subsample holds:
s = O((n log n / ε) · log(n log n / ε) + log(1/δ) / ε)
where s is the number of samples taken, n is the number of subgroups considered, ε is the fraction of comparisons that exceed the threshold ∆, and δ is the probability of missing a biased subgroup.
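The bin-by-bin comparison and the bound on s can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation. The histogram is built over exact feature values (here only the target column), the distance is the largest per-bin difference in relative frequency, and the constant hidden by the O(·) is set to 1 purely to show how s scales. The names `linf_bias_test` and `subsample_size_order` are hypothetical.

```python
import math
from collections import Counter


def histogram(rows, features):
    """Normalised multidimensional histogram: share of rows falling in each cell."""
    counts = Counter(tuple(r[f] for f in features) for r in rows)
    return {cell: c / len(rows) for cell, c in counts.items()}


def linf_bias_test(rows, in_group, features, delta):
    """Compare the group's histogram with the whole-dataset histogram, bin by bin.

    Returns (passed, distance), where distance is the largest per-bin gap and
    passed is True when that gap does not exceed the tolerance delta.
    """
    group_rows = [r for r in rows if in_group(r)]
    if not group_rows:
        raise ValueError("the chosen group has no rows")
    reference = histogram(rows, features)
    group = histogram(group_rows, features)
    # The group is a subset of the data, so every group cell also appears in reference.
    distance = max(abs(group.get(cell, 0.0) - p) for cell, p in reference.items())
    return distance <= delta, distance


def subsample_size_order(n, eps, delta, c=1.0):
    """Order of magnitude of s from the bound above; c stands in for the unknown constant."""
    v = n * math.log(n)
    return math.ceil(c * ((v / eps) * math.log(v / eps) + math.log(1 / delta) / eps))


# Hypothetical toy data: group A has a 75% positive rate, the whole dataset 50%.
rows = [
    {"Race": "A", "Target": 1}, {"Race": "A", "Target": 1},
    {"Race": "A", "Target": 1}, {"Race": "A", "Target": 0},
    {"Race": "B", "Target": 0}, {"Race": "B", "Target": 0},
    {"Race": "B", "Target": 0}, {"Race": "B", "Target": 1},
]

passed, distance = linf_bias_test(rows, lambda r: r["Race"] == "A", ["Target"], delta=0.1)
print(passed, distance)  # prints: False 0.25
```

With ∆ = 0.1 the group fails the test, while a looser tolerance such as ∆ = 0.3 would pass it. In `subsample_size_order`, a smaller ε or a smaller δ both increase the required number of samples, which matches the shape of the bound.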
Citation
If you use MSD, please cite:
@inproceedings{MSD,
author = {N\v{e}me\v{c}ek, Ji\v{r}\'{\i} and Kozdoba, Mark and Kryvoviaz, Illia and Pevn\'{y}, Tom\'{a}\v{s} and Mare\v{c}ek, Jakub},
title = {Bias Detection via Maximum Subgroup Discrepancy},
year = {2025},
isbn = {9798400714542},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3711896.3736857},
doi = {10.1145/3711896.3736857},
booktitle = {Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2},
pages = {2174--2185},
numpages = {12},
location = {Toronto ON, Canada},
series = {KDD '25}
}
If you used the ℓ∞ method, please cite:
@misc{matilla2025samplecomplexitybiasdetection,
title={Sample Complexity of Bias Detection with Subsampled Point-to-Subspace Distances},
author={M. Matilla, Germán and Mareček, Jakub},
year={2025},
eprint={2502.02623},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.02623v1},
}
Looking for the installation matrix, solver details or developer setup? Head to the README -> Installation section.