# HumanCompatible · Bias Detection

A toolbox for measuring bias in data & models

**Maximum Subgroup Discrepancy (MSD)** -- bias metric with linear sample complexity
_...with a MILP formulation that also tells you which subgroup is most affected._
\[[Paper Link](https://dl.acm.org/doi/10.1145/3711896.3736857)\]

**ℓ∞** -- fast pass/fail bias test for a chosen (sub)group
_...it compares a (sub)group vs the general dataset trend against a tolerance ∆._
\[[Paper Link](https://arxiv.org/abs/2502.02623)\]
---

## Quick install & demo

```bash
python -m pip install humancompatible-detect
```

```python
from humancompatible.detect import detect_and_score

rule, msd_val = detect_and_score(
    csv_path="./data/01_data.csv",
    target_col="Target",
    protected_list=["Race", "Age"],
    method="MSD",
)
```

The function returns

- **`msd_val`** -- the maximum gap (in percentage-points) between any subgroup and its complement
- **`rule`** -- the raw subgroup encoding as a list of `(feature_index, Bin)` pairs.

To get a human-readable description, do the following:

```python
# Join the conditions of the most affected subgroup into one readable string
pretty = " AND ".join(str(cond) for _, cond in rule)
print(f"MSD = {msd_val:.3f}")
print("Subgroup:", pretty)
```

## Contents

```{toctree}
:maxdepth: 1

self
api/humancompatible.detect
api/humancompatible.detect.methods.msd
api/humancompatible.detect.methods.l_inf
Tutorial Examples
```

## Featured examples

If you want to jump straight into notebooks:

- **Simple example notebook**:
- **Realistic example on Folktables**:
- **Exploring the API**:
- and more in the [examples folder](https://github.com/humancompatible/detect/tree/main/examples)

---

## MSD as a distance?

Bias detection can be understood as measuring a distance between two distributions (for example, positive vs. negative samples, or a training dataset vs. general population data). However, most distances have exponential sample complexity, whereas MSD requires only a linear number of samples (w.r.t. the dimension) to achieve the same error.

| Classical metric | Needs to look at | Sample cost | Drawback |
| --- | --- | --- | --- |
| Wasserstein, TV, MMD, ... | full $d$-dimensional joint | $\Omega(2^d)$ | exponential sample cost, no group explanation |
| MSD (ours) | only protected attrs | $O(d)$ | returns exact subgroup & gap |

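
To make the last row of the table concrete, here is a minimal brute-force sketch of the quantity MSD reports: the largest absolute difference in subgroup probability between two samples, taken over all conjunctions of protected-attribute values. This is not the library's MILP formulation; it enumerates every subgroup, so it is only practical for a handful of attributes. The function `brute_force_msd`, the helper `make_rows`, and the toy counts are hypothetical and are not part of `humancompatible.detect`.

```python
from itertools import combinations, product


def brute_force_msd(rows_p, rows_q, protected):
    """Return (gap, rule) maximising |P(S) - Q(S)| over conjunctive subgroups S."""
    best_gap, best_rule = 0.0, {}
    for k in range(1, len(protected) + 1):
        for attrs in combinations(protected, k):
            # Candidate values: everything observed for each chosen attribute
            values = [sorted({r[a] for r in rows_p + rows_q}) for a in attrs]
            for combo in product(*values):
                rule = dict(zip(attrs, combo))

                def share(rows):
                    # Fraction of rows that satisfy every condition in the rule
                    hits = sum(all(r[a] == v for a, v in rule.items()) for r in rows)
                    return hits / len(rows)

                gap = abs(share(rows_p) - share(rows_q))
                if gap > best_gap:
                    best_gap, best_rule = gap, rule
    return best_gap, best_rule


def make_rows(counts):
    """Expand {(race, age): n} into a list of n row dicts per cell (toy data)."""
    return [{"Race": race, "Age": age} for (race, age), n in counts.items() for _ in range(n)]


# Toy samples, e.g. positive vs. negative outcomes (8 rows each)
positives = make_rows({("A", "young"): 3, ("A", "old"): 1, ("B", "young"): 1, ("B", "old"): 3})
negatives = make_rows({("A", "old"): 3, ("B", "young"): 3, ("B", "old"): 2})

gap, rule = brute_force_msd(positives, negatives, ["Race", "Age"])
print(f"MSD = {gap:.3f}")  # MSD = 0.375
print("Subgroup:", " AND ".join(f"{a} = {v}" for a, v in rule.items()))  # Subgroup: Race = A AND Age = young
```

In this toy data no single attribute shows a gap above 0.125, while the intersection `Race = A AND Age = young` reaches 0.375, which is why the search runs over attribute combinations rather than one attribute at a time.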
MSD maximises the absolute difference in probability over all protected-attribute combinations (subgroups), yet is solvable in practice through an exact mixed-integer optimization that scans the doubly-exponential space effectively.

## Subsampled ℓ∞ norm

A different approach is based on subsampled distances on measure spaces. In this setting, after choosing a group to be tested for bias, the data is transformed into a multidimensional histogram that is compared bin by bin to a reference histogram obtained from the whole dataset under study. For this comparison, a threshold ∆ is specified in advance. Subsampling is essential here, since the number of comparisons can be exceedingly high. Crucially, the following guarantee for the subsample holds:

$$
s = O\!\left(\frac{n \log n}{\varepsilon}\,\log\frac{n \log n}{\varepsilon} + \frac{\log(1/\delta)}{\varepsilon}\right)
$$

where $s$ is the number of samples taken, $n$ is the number of subgroups considered, $\varepsilon$ is the fraction of comparisons over the threshold ∆, and $\delta$ is the probability of missing a biased subgroup.

---

## Citation

If you use MSD, please cite:

```bibtex
@inproceedings{MSD,
  author = {N\v{e}me\v{c}ek, Ji\v{r}\'{\i} and Kozdoba, Mark and Kryvoviaz, Illia and Pevn\'{y}, Tom\'{a}\v{s} and Mare\v{c}ek, Jakub},
  title = {Bias Detection via Maximum Subgroup Discrepancy},
  year = {2025},
  isbn = {9798400714542},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3711896.3736857},
  doi = {10.1145/3711896.3736857},
  booktitle = {Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2},
  pages = {2174--2185},
  numpages = {12},
  location = {Toronto ON, Canada},
  series = {KDD '25}
}
```

If you use the ℓ∞ method, please cite:

```bibtex
@misc{matilla2025samplecomplexitybiasdetection,
  title={Sample Complexity of Bias Detection with Subsampled Point-to-Subspace Distances},
  author={M. Matilla, Germán and Mareček, Jakub},
  year={2025},
  eprint={2502.02623},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.02623v1},
}
```

Looking for the installation matrix, solver details or developer setup? Head to the [**README -> Installation**](https://github.com/humancompatible/detect?tab=readme-ov-file#installation-details) section.