Helper Utilities

humancompatible.detect.helpers.prepare.prepare_dataset(input_data: DataFrame, target_data: DataFrame, n_max: int, protected_attrs: List[str], continuous_feats: List[str], feature_processing: Dict[str, Callable[[Any], int]], verbose: int = 1) → Tuple[Binarizer, DataFrame, Series]

Prepares a dataset by cleaning, preprocessing, sampling, and structuring it for fairness analysis.

This function performs several steps to get the data ready for further processing, especially focusing on handling missing values, applying feature transformations, managing feature types (continuous vs. categorical), sampling, and identifying protected attributes.

Parameters:

input_data (pd.DataFrame) – The input features DataFrame.
target_data (pd.DataFrame) – Single-column target vector; same row count as input_data.
n_max (int) – The maximum number of samples to retain. If the dataset size exceeds this, it will be randomly downsampled.
protected_attrs (List[str]) – A list of column names that are considered protected attributes for fairness analysis.
continuous_feats (List[str]) – A list of column names identified as continuous features.
feature_processing (Dict[str, Callable[[Any], int]]) – Mapping from column name to a callable that converts each raw value to an integer.
verbose (int, default 1) – Verbosity level. 0 = silent, 1 = logger output only, 2 = all detailed logs (including solver output).

Returns:

A tuple containing:

binarizer_protected (Binarizer): The protected-attributes binarizer.
input_data[protected_cols] (pd.DataFrame): The part of the data with protected attributes.
target_data (pd.Series): The corresponding target values.

Return type:

Tuple[Binarizer, pd.DataFrame, pd.Series]

Notes

Rows with any NaN values in input_data will be removed.
Features with only one unique value after NaN removal will be dropped.
The target_data is assumed to contain only one column and will be converted to a pandas Series for the output.
Requires DataHandler and Binarizer classes to be defined elsewhere for dhandler_protected and binarizer_protected to work correctly.
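
The cleaning steps listed in these notes can be illustrated with a small standalone sketch. It is not the library's implementation: rows are plain dictionaries rather than a DataFrame, and the function name clean_and_sample and its seed argument are assumptions made for the example. Target alignment, feature_processing, and binarisation are left out.

```python
import math
import random

def clean_and_sample(rows, n_max, seed=0):
    """Illustrative stand-in for the cleaning steps; not the library's code."""
    # 1. Remove rows that contain any missing value (None or float NaN).
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    kept = [r for r in rows if not any(is_missing(v) for v in r.values())]

    # 2. Drop features with only one unique value after the NaN removal.
    if kept:
        constant = [c for c in kept[0] if len({r[c] for r in kept}) <= 1]
        kept = [{c: v for c, v in r.items() if c not in constant} for r in kept]

    # 3. Randomly downsample when more than n_max rows remain.
    if len(kept) > n_max:
        kept = random.Random(seed).sample(kept, n_max)
    return kept

rows = [
    {"age": 31.0, "sex": 1, "country": "CZ"},
    {"age": float("nan"), "sex": 2, "country": "CZ"},  # removed: NaN
    {"age": 45.0, "sex": 2, "country": "CZ"},
    {"age": 52.0, "sex": 1, "country": "CZ"},
]
cleaned = clean_and_sample(rows, n_max=2)
print(len(cleaned), sorted(cleaned[0]))  # 2 ['age', 'sex']
```

The "country" column disappears because it holds a single value once the NaN row is gone, and the three surviving rows are reduced to n_max = 2.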

One-shot helper: find the most biased subgroup and return its score. Works with three input modes:

DataFrame mode: pass X, y
CSV mode: pass csv_path, target_col
Two-sample mode: pass X1, X2

It first calls most_biased_subgroup() (or its counterpart for the selected mode) to obtain the rule, then scores that rule with evaluate_biased_subgroup() (or its counterpart for the selected mode).

Parameters:

X (pd.DataFrame | None) – Feature matrix.
y (pd.DataFrame | None) – Single-column target aligned with X.
csv_path (Path | str | None) – Path to a CSV file.
target_col (str | None) – Name of the target column in the CSV.
X1 (pd.DataFrame | None) – First dataset (Two-sample mode).
X2 (pd.DataFrame | None) – Second dataset (Two-sample mode).
protected_list (list[str] | None, default None) – Names of columns regarded as protected attributes. When None, every column in X is treated as protected.
continuous_list (list[str] | None, default None) – Columns that should be treated as continuous when building bins.
fp_map (dict[str, Callable[[Any], int]] | None, default None) – Optional per-feature recoding map to apply before binarisation.
seed (int | None, default None) – Seed for the random generator controlling subsampling and solver randomness.
n_samples (int, default 1_000_000) – Upper bound on the number of rows kept after random subsampling.
method (str, default "MSD") – Subgroup-search routine to invoke. “MSD” and “l_inf” are supported at present.
verbose (int, default 1) – Verbosity level. 0 = silent, 1 = logger output only, 2 = all detailed logs (including solver output).
method_kwargs (dict[str, Any] | None, default None) – Extra keyword arguments forwarded to the chosen method (for MSD these include time_limit, n_min, solver, etc.).

Returns:

A pair containing (rule, value):

rule - list of (feature_index, Bin) pairs; an empty list when method == “l_inf”.
value - the MSD score (float) when method == “MSD”, or the l_inf result (bool) when method == “l_inf”.

Return type:

tuple[list[tuple[int, Bin]], float | bool]

Raises:

ValueError – If modes are mixed, required arguments for a mode are missing, or method/method_kwargs are invalid.
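
One way to picture the mode check behind this ValueError is the sketch below. The function resolve_mode and the mode labels are hypothetical and only mirror the three documented argument pairs; the library's actual validation may differ.

```python
def resolve_mode(X=None, y=None, csv_path=None, target_col=None, X1=None, X2=None):
    """Hypothetical helper: return which input mode the arguments select."""
    groups = {
        "dataframe": (X, y),
        "csv": (csv_path, target_col),
        "two_sample": (X1, X2),
    }
    # A mode is "touched" if any of its arguments was supplied.
    # Comparing with `is not None` avoids truth-testing a DataFrame.
    touched = [name for name, args in groups.items()
               if any(a is not None for a in args)]
    if len(touched) != 1:
        raise ValueError(f"expected exactly one input mode, got {touched}")
    mode = touched[0]
    # Every argument of the selected mode must be present.
    if any(a is None for a in groups[mode]):
        raise ValueError(f"missing required argument for {mode} mode")
    return mode

print(resolve_mode(csv_path="data.csv", target_col="label"))  # csv
```

Passing arguments from two modes at once, or only one argument of a pair, raises ValueError in this sketch, matching the conditions listed above.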

humancompatible.detect.helpers.utils.evaluate_subgroup_discrepancy(subgroup: ndarray[bool], y: ndarray[bool], verbose: int = 1) → float

Absolute subgroup discrepancy abs(delta) between positive and negative outcomes.

Simply returns the magnitude of signed_subgroup_discrepancy(subgroup, y).

Parameters:

subgroup (np.ndarray[bool]) – Boolean mask indicating subgroup membership; shape must equal that of y.
y (np.ndarray[bool]) – Boolean outcome labels (True = positive).
verbose (int, default 1) – Verbosity level. 0 = silent, 1 = logger output only, 2 = all detailed logs (including solver output).

Returns:

abs(delta) - the absolute difference in subgroup prevalence between positives and negatives, expressed as a fraction rather than a percentage.

Return type:

float

Raises:

AssertionError – If subgroup and y have different shapes.
ValueError – If y contains only positives or only negatives.
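
The relationship between the absolute and signed values, and the two error conditions, can be reproduced with plain Python lists. This is a sketch under the documented definition only; signed_discrepancy and absolute_discrepancy are stand-in names, not the library functions, and the verbose argument is omitted.

```python
def signed_discrepancy(subgroup, y):
    """Stand-in for signed_subgroup_discrepancy on plain lists of bools."""
    assert len(subgroup) == len(y), "subgroup and y must have the same shape"
    in_pos = [s for s, label in zip(subgroup, y) if label]
    in_neg = [s for s, label in zip(subgroup, y) if not label]
    # Either mean is undefined when one outcome class is empty.
    if not in_pos or not in_neg:
        raise ValueError("y must contain both positives and negatives")
    return sum(in_pos) / len(in_pos) - sum(in_neg) / len(in_neg)

def absolute_discrepancy(subgroup, y):
    """Stand-in for evaluate_subgroup_discrepancy: magnitude of the signed value."""
    return abs(signed_discrepancy(subgroup, y))

subgroup = [False, False, True, False]
y = [True, True, False, False]
print(signed_discrepancy(subgroup, y))    # -0.5
print(absolute_discrepancy(subgroup, y))  # 0.5
```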

humancompatible.detect.helpers.utils.report_subgroup_bias(label: str, msd: float, rule: list[tuple[int, Any]], feature_names: dict[str, str], value_map: dict[str, dict[Any, str]]) → None

Print a short report of the MSD value and its human-readable rule.

Parameters:

label – a name for this sample (e.g. “State FL” or “FL vs NH”).
msd – the numeric MSD value.
rule – the list of (col_idx, binop) pairs that define the subgroup.
feature_names – mapping from column-code -> human feature name (e.g. from feature_folktables()).
value_map – mapping from column-code -> {value_code -> human label} (e.g. from feature_folktables()).
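
To make the shape of feature_names and value_map concrete, the sketch below decodes coded conditions into readable text. The describe function, the sample codes, and the use of (column code, value code) pairs are assumptions for illustration; the real rule holds (col_idx, binop) pairs, and how those are resolved to column codes is not shown here.

```python
def describe(column_code, value_code, feature_names, value_map):
    """Hypothetical lookup: turn a coded (column, value) pair into readable text."""
    # Fall back to the raw codes when a mapping has no entry.
    name = feature_names.get(column_code, column_code)
    label = value_map.get(column_code, {}).get(value_code, str(value_code))
    return f"{name} = {label}"

# Illustrative mappings in the documented shapes.
feature_names = {"SEX": "Sex", "MAR": "Marital status"}
value_map = {
    "SEX": {1: "Male", 2: "Female"},
    "MAR": {1: "Married", 5: "Never married"},
}

conditions = [("SEX", 2), ("MAR", 5)]
report = " AND ".join(describe(c, v, feature_names, value_map) for c, v in conditions)
print(report)  # Sex = Female AND Marital status = Never married
```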

humancompatible.detect.helpers.utils.signed_subgroup_discrepancy(subgroup: ndarray[bool], y: ndarray[bool], verbose: int = 1) → float

Signed difference in subgroup representation between positive and negative outcomes.

This metric returns:

    delta = mean(subgroup | y = 1) - mean(subgroup | y = 0)

A positive delta means the subgroup is over-represented among positives; a negative delta means it is under-represented.
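
Written out per sample, the same definition reads as follows. The symbols s_i (1 if sample i belongs to the subgroup, 0 otherwise), n_+ and n_- are introduced here for illustration and do not appear in the library.

```latex
\Delta
  = \frac{1}{n_{+}} \sum_{i:\, y_i = 1} s_i
  - \frac{1}{n_{-}} \sum_{i:\, y_i = 0} s_i,
\qquad
n_{+} = \lvert\{\, i : y_i = 1 \,\}\rvert,\quad
n_{-} = \lvert\{\, i : y_i = 0 \,\}\rvert .
```

Each term is a proportion in [0, 1], so delta lies in [-1, 1]. If n_+ or n_- is zero, the corresponding term is undefined, which is consistent with the ValueError documented for a y that contains a single class.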

Parameters:

subgroup (np.ndarray[bool]) – Boolean mask of subgroup membership; shape must match y.
y (np.ndarray[bool]) – Boolean outcome labels (True = positive, False = negative).
verbose (int, default 1) – Verbosity level. 0 = silent, 1 = logger output only, 2 = all detailed logs (including solver output).

Returns:

Signed difference proportion_in_positives - proportion_in_negatives.

Return type:

float

Raises:

AssertionError – If subgroup and y have different shapes.
ValueError – If y contains only positives or only negatives.

Examples

Equal representation => Δ = 0

>>> import numpy as np
>>> subgroup = np.array([True, False, True, False])
>>> y = np.array([True, False, False, True])
>>> signed_subgroup_discrepancy(subgroup, y)
0.0

Over-representation => positive Δ

>>> subgroup = np.array([True, True, False, False, True])
>>> y = np.array([True, True,  True,  False, False])
>>> round(signed_subgroup_discrepancy(subgroup, y), 3)
0.167  # subgroup is ~16.7 pp more common among positives

Under-representation => negative Δ

>>> subgroup = np.array([False, False, True, False])
>>> y = np.array([True, True, False, False])
>>> round(signed_subgroup_discrepancy(subgroup, y), 2)
-0.5  # subgroup is 50 pp less common among positives

humancompatible.detect.helpers.utils.signed_subgroup_prevalence_diff(subgroup_a: ndarray[bool], subgroup_b: ndarray[bool]) → float

Signed difference in subgroup prevalence between two datasets.

Computes:

    delta = mean(subgroup_b) - mean(subgroup_a)

A positive delta means the subgroup is more common in dataset B than in dataset A; a negative delta means the opposite.

Parameters:

subgroup_a (np.ndarray[bool]) – Boolean mask for dataset A.
subgroup_b (np.ndarray[bool]) – Boolean mask for dataset B. The two arrays need not have the same length, but each must be one-dimensional and boolean.

Returns:

Signed prevalence difference delta.

Return type:

float
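
A minimal sketch of the two-dataset formula, using plain lists of booleans in place of NumPy arrays. The name prevalence_diff is a stand-in chosen for the example, and the input checks of the real function are not reproduced.

```python
def prevalence_diff(subgroup_a, subgroup_b):
    """Stand-in for signed_subgroup_prevalence_diff on plain lists of bools."""
    # Each mean is taken over its own dataset, so the lengths may differ.
    return sum(subgroup_b) / len(subgroup_b) - sum(subgroup_a) / len(subgroup_a)

subgroup_a = [True, False, False, False]                           # 1 of 4 rows
subgroup_b = [True, True, False, True, False, True, True, False]   # 5 of 8 rows
print(prevalence_diff(subgroup_a, subgroup_b))  # 0.375
```

The positive result says the subgroup is more common in dataset B (62.5%) than in dataset A (25%); swapping the arguments flips the sign.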