MSD (Maximum Subgroup Discrepancy)

Link to the Paper

humancompatible.detect.methods.msd.msd.evaluate_MSD(X: DataFrame, y: Series | ndarray, rule: List[Tuple[int, Bin]], signed: bool = False, verbose: int = 1, **kwargs) → float

Compute the MSD value (delta or abs(delta)) for an already computed rule.

Parameters:

X (pd.DataFrame) – DataFrame with the protected columns referenced by rule.
y (pd.Series | np.ndarray) – Binary target vector aligned with X.
rule (list[tuple[int, Bin]]) – Conjunctive rule describing the subgroup. Each element is a pair (feature_index, Bin). Produced by most_biased_subgroup.
signed (bool, default False) – If True, return the signed subgroup discrepancy; otherwise, return the absolute value.
verbose (int, default 1) – Verbosity level. 0 = silent, 1 = logger output only, 2 = all detailed logs (including solver output).

Returns:

The subgroup discrepancy value.

Return type:

float

humancompatible.detect.methods.msd.msd.get_conjuncts_MSD(X_bin: ndarray[bool], y_bin: ndarray[bool], time_limit: int = 600, n_min: int = 0, solver: str = 'appsi_highs', check_optimality: bool = True, verbose: int = 1, **kwargs) → List[int]

Run the One-Rule MILP and return the indices of literals that form the Maximum-Subgroup-Discrepancy (MSD) rule.

Parameters:

X_bin (np.ndarray[bool]) – Binary feature matrix of shape (n_samples, n_features).
y_bin (np.ndarray[bool]) – Binary target vector (length n_samples).
time_limit (int, default 600) – Wall-clock limit for the solver, in seconds.
n_min (int, default 0) – Minimum support the subgroup must have.
solver (str, default "appsi_highs") – Name of the MIP solver recognised by Pyomo (e.g. "gurobi", "cplex", "glpk", "xpress", "appsi_highs").
check_optimality (bool, default True) – If True, return the solution only when the solver proves it optimal, and raise a ValueError otherwise. If False, return the best-known solution.
verbose (int, default 1) – Verbosity level. 0 = silent, 1 = logger output only, 2 = all detailed logs (including solver output).

Returns:

A list of feature-column indices whose conjunction defines the subgroup with maximal discrepancy.

Return type:

List[int]

Raises:

ValueError – Propagated from OneRule.find_rule when the solver stops with an unexpected termination condition.
ValueError – If check_optimality is True and no optimal solution is found. Tip: Set check_optimality to False to return the best-known solution, or try to increase the time limit.
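The way check_optimality interacts with the pair returned by OneRule.find_rule (a list of indices or None, plus an optimality flag) can be sketched as below. This is only an illustration: `resolve_result` is a hypothetical helper, and the library's actual control flow and error messages may differ.

```python
from typing import List, Optional


def resolve_result(conjuncts: Optional[List[int]], is_optimal: bool,
                   check_optimality: bool = True) -> List[int]:
    """Hypothetical helper mirroring the documented check_optimality behaviour."""
    if check_optimality and not is_optimal:
        raise ValueError(
            "No optimal solution found; set check_optimality=False "
            "or increase time_limit."
        )
    if conjuncts is None:
        # ASSUMPTION: with no feasible solution there is nothing to return,
        # so this sketch raises; the library may handle this case differently.
        raise ValueError("Solver found no feasible solution.")
    return conjuncts


# A non-optimal but feasible solution is accepted only when the check is off.
print(resolve_result([0, 2], is_optimal=False, check_optimality=False))
```

With check_optimality left at True, the same call would raise a ValueError instead of returning the best-known solution.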

humancompatible.detect.methods.msd.metrics_msd.subgroup_gap(rule: Sequence[Tuple[int, Any]], X: DataFrame, y: Series | ndarray, *, signed: bool = True) → float

Compute the subgroup discrepancy delta or abs(delta) for a given rule.

Parameters:

rule – Rule returned by detect_bias: a list of (col_idx, Bin) pairs.
X – DataFrame containing the protected columns referenced in rule.
y – Binary outcome vector aligned with X (1 = positive outcome).
signed – If True, return the signed delta; otherwise, return abs(delta).

Returns:

Subgroup discrepancy (signed or absolute).

Return type:

float

Raises:

KeyError – If X is missing a column required by the rule.
ValueError – If y contains only positives or only negatives.
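As a rough illustration of the quantity involved, the sketch below computes a signed gap from a boolean subgroup mask and a binary outcome vector. It assumes delta is the subgroup's share among positive rows minus its share among negative rows. That reading is consistent with the ValueError above but is not confirmed on this page, and `signed_gap` is a hypothetical helper rather than the library function.

```python
def signed_gap(mask, y):
    """Share of subgroup members among positives minus share among negatives.

    ASSUMPTION: this definition of delta is illustrative; consult the paper
    for the exact definition used by the library.
    """
    pos = [m for m, label in zip(mask, y) if label]
    neg = [m for m, label in zip(mask, y) if not label]
    if not pos or not neg:
        raise ValueError("y must contain both positive and negative outcomes")
    return sum(pos) / len(pos) - sum(neg) / len(neg)


mask = [True, True, False, False, True, False]  # subgroup membership per row
y = [1, 1, 1, 0, 0, 0]                          # binary outcomes
delta = signed_gap(mask, y)
print(delta, abs(delta))  # signed and absolute variants
```

Here two of the three positive rows and one of the three negative rows fall in the subgroup, so the signed gap is 2/3 - 1/3.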

humancompatible.detect.methods.msd.mapping_msd.subgroup_map_from_conjuncts_binarized(conjuncts: List[int], X: ndarray[bool]) → ndarray[bool]

Generates a boolean subgroup mapping based on the conjunction (AND) of specified features.

This function creates a boolean array where each element is True only if the corresponding row in X has True values across all columns specified in conjuncts. Essentially, it identifies individuals who meet all criteria defined by the conjuncts.

Parameters:

conjuncts (List[int]) – A list of integer indices (column indices) from the input array X. Each index represents a feature that must be True for an individual to be included in the subgroup.
X (np.ndarray[np.bool_]) – A 2D NumPy array of boolean values, where rows represent individuals and columns represent features.

Returns:

A 1D boolean NumPy array (mapping) whose length equals the number of rows in X. An element mapping[i] is True if X[i, conj] is True for all conj in conjuncts, and False otherwise.

Return type:

np.ndarray[np.bool_]

Raises:

IndexError – If any index in conjuncts is out of bounds for the columns of X.

Examples

>>> import numpy as np
>>> X_data = np.array([
...     [True,  True,  False, True],   # Row 0
...     [True,  False, True,  True],   # Row 1
...     [False, True,  True,  False],  # Row 2
...     [True,  True,  True,  True]    # Row 3
... ])
>>>
>>> # Subgroup where feature at index 0 AND feature at index 1 are True
>>> conjuncts_1 = [0, 1]
>>> subgroup_map_from_conjuncts_binarized(conjuncts_1, X_data)
array([ True, False, False,  True])
>>> # Explanation: Only Row 0 and Row 3 have both X[:,0] and X[:,1] as True.

>>> # Subgroup where feature at index 2 is True
>>> conjuncts_2 = [2]
>>> subgroup_map_from_conjuncts_binarized(conjuncts_2, X_data)
array([False,  True,  True,  True])

>>> # Subgroup where feature at index 0 AND feature at index 2 are True
>>> conjuncts_3 = [0, 2]
>>> subgroup_map_from_conjuncts_binarized(conjuncts_3, X_data)
array([False,  True, False,  True])

>>> # Test with an empty list of conjuncts (should return all True)
>>> subgroup_map_from_conjuncts_binarized([], X_data)
array([ True,  True,  True,  True])

>>> # Test with an invalid conjunct index (will raise IndexError)
>>> try:
...     subgroup_map_from_conjuncts_binarized([0, 99], X_data)
... except IndexError as e:
...     print(e)
index 99 is out of bounds for axis 1 with size 4

humancompatible.detect.methods.msd.mapping_msd.subgroup_map_from_conjuncts_dataframe(rule: List[Tuple[int, Any]], X: DataFrame) → ndarray[bool]

Build a boolean mask for an MSD rule over a pandas DataFrame.

Each (index, Bin) in rule comes from detect_bias or detect_bias_two_samples. We ignore the positional index and use the Bin’s .feature.name, so this is robust to column re-ordering.

Parameters:

rule (List[Tuple[int, Any]]) – The rule identifying the subgroup, as returned by detect_bias(…).
X (pd.DataFrame) – The original (protected-only) DataFrame passed to detect_bias. Must contain all columns named in the rule’s Bins.

Returns:

A 1-D boolean array where True marks rows belonging to the subgroup.

Return type:

np.ndarray[np.bool_]

Raises:

KeyError – If X is missing a column required by the rule.
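The name-based lookup described above can be pictured with a dependency-free sketch. Everything here is a stand-in: `ToyFeature`, `ToyBin`, its `matches` method, and `mask_by_name` are hypothetical, and a plain dict of columns plays the role of the DataFrame. Only the `.feature.name` attribute is taken from the description above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyFeature:
    name: str


@dataclass
class ToyBin:
    """Hypothetical stand-in for a Bin: a feature plus one accepted value."""
    feature: ToyFeature
    value: Any

    def matches(self, x: Any) -> bool:  # invented for this sketch
        return x == self.value


def mask_by_name(rule, columns):
    """AND together per-bin tests, looking columns up by name, not position."""
    n_rows = len(next(iter(columns.values())))
    mask = [True] * n_rows
    for _idx, b in rule:  # the positional index is deliberately ignored
        col = columns[b.feature.name]  # raises KeyError if the column is missing
        mask = [m and b.matches(v) for m, v in zip(mask, col)]
    return mask


rule = [(0, ToyBin(ToyFeature("region"), "north")),
        (1, ToyBin(ToyFeature("age_group"), "60+"))]
columns = {"region": ["north", "south", "north"],
           "age_group": ["60+", "60+", "18-30"]}
reordered = {"age_group": columns["age_group"], "region": columns["region"]}

print(mask_by_name(rule, columns))
print(mask_by_name(rule, reordered))  # same mask despite the new column order
```

Because the columns are resolved by name, swapping their order leaves the mask unchanged, while a missing column surfaces as a KeyError.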

class humancompatible.detect.methods.msd.one_rule.OneRule

Bases: object

Implementation of a MIO formulation for finding an optimal conjunction.

This class implements a Mixed-Integer Optimization (MIO) formulation to discover an optimal conjunction (a logical AND of features) that maximizes the absolute difference in target outcomes between the subgroup defined by this conjunction and its complement. The formulation is inspired by the 1Rule method from the paper “Exact Rule Learning via Boolean Compressed Sensing” by Malioutov and Varshney (http://proceedings.mlr.press/v28/malioutov13.pdf).
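For intuition about the objective, a tiny instance can be solved by exhaustive search instead of MIO. The sketch below is a brute-force stand-in, not the class's formulation. It assumes the objective is the absolute gap between the subgroup's share among positive rows and its share among negative rows, and `gap` and `best_conjunction` are hypothetical names.

```python
from itertools import combinations


def gap(cols, X, y):
    """Return (absolute gap, support) for the conjunction of columns `cols`.

    ASSUMPTION: the gap is |share among positives - share among negatives|.
    Requires y to contain at least one positive and one negative label.
    """
    mask = [all(row[c] for c in cols) for row in X]
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    in_pos = sum(m for m, label in zip(mask, y) if label)
    in_neg = sum(m for m, label in zip(mask, y) if not label)
    return abs(in_pos / n_pos - in_neg / n_neg), sum(mask)


def best_conjunction(X, y, n_min=0):
    """Enumerate every non-empty conjunction; exponential, for toy data only."""
    n_features = len(X[0])
    best, best_gap = None, -1.0
    for k in range(1, n_features + 1):
        for cols in combinations(range(n_features), k):
            value, support = gap(cols, X, y)
            if support >= n_min and value > best_gap:
                best, best_gap = list(cols), value
    return best, best_gap


X = [
    [True, True, False],
    [True, False, True],
    [True, True, True],
    [False, True, False],
    [False, False, True],
    [False, True, True],
]
y = [True, True, True, False, False, False]
print(best_conjunction(X, y))           # unconstrained search
print(best_conjunction(X, y, n_min=4))  # require at least 4 rows of support
```

Exhaustive enumeration grows exponentially with the number of features, which is the reason a MIO formulation with a solver and a time limit is used in practice.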

__init__() → None: Initializes the OneRule solver.

find_rule(X: ndarray[bool], y: ndarray[bool], n_min: int = 0, time_limit: int = 300, solver_name: str = 'appsi_highs', verbose: int = 1) → Tuple[List[int] | None, bool]

Finds a single conjunction (rule) that maximizes the absolute difference in target outcomes between the subgroup it defines and its complement.

This method prepares the data (by creating unique rows and assigning weights), builds the MIO model using _make_abs_model, and then solves it using whichever solver you specify in solver_name.

Parameters:

X (np.ndarray[bool]) – Input data matrix of boolean features, shape (n_instances, n_features).
y (np.ndarray[bool]) – Target labels (binary), shape (n_instances,).
n_min (int, default 0) – Minimum subgroup support (number of rows) required for a valid subgroup.
time_limit (int, default 300) – Time budget for the solver (in seconds). Note that only some solvers support this option.
solver_name (str, default "appsi_highs") – Solver used for the MIO formulation. One of:
- "appsi_highs"
- "gurobi"
- "cplex"
- "glpk"
- "xpress"
- any other solver supported by Pyomo (see the Pyomo documentation)
Note that only the five solvers listed by name support the graceful time_limit.
verbose (int, default 1) – Verbosity level. 0 = silent, 1 = logger output only, 2 = all detailed logs (including solver output).

Returns:

A tuple of two elements. The first is a list of integer indices of the features (literals) that form the optimal conjunction; these indices correspond to the columns of the input X that define the subgroup. If the solver fails to find any feasible solution within the time budget, the first element is None instead. The second is a boolean flag that is True if the returned solution is globally optimal.

Return type:

Tuple[List[int] | None, bool]

Raises:

AssertionError – If y’s shape is not (X.shape[0],) or if X or y are not of boolean dtype.
ValueError – If the solver terminates with a condition other than timeout, optimality, or infeasibility.
Exception – Any exception raised by Pyomo or the solver during model creation or solving.

Notes

The input X and y are first processed to get unique rows and assign weights based on their original counts and class proportions. This helps in handling duplicate rows efficiently.
Requires a compatible MIP solver (e.g. Gurobi or HiGHS) to be installed and configured for Pyomo.
The rule returned contains indices of the original features (columns of X) that define the conjunction.
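The preprocessing mentioned in the first note can be pictured with a small sketch: identical (row, label) pairs are collapsed and each unique pair carries a weight. The weighting shown here (multiplicity divided by the size of the row's class) is an assumption made for illustration, since the exact weights are not specified on this page, and `unique_rows_with_weights` is a hypothetical name.

```python
from collections import Counter


def unique_rows_with_weights(X, y):
    """Collapse duplicate (row, label) pairs into weighted unique rows."""
    counts = Counter((tuple(row), bool(label)) for row, label in zip(X, y))
    class_size = Counter(bool(label) for label in y)
    # ASSUMPTION: weight = multiplicity / number of rows with the same label.
    return {
        key: count / class_size[key[1]]
        for key, count in counts.items()
    }


X = [[True, False], [True, False], [False, True], [True, False]]
y = [True, True, False, False]
for (row, label), weight in unique_rows_with_weights(X, y).items():
    print(row, label, weight)
```

Four input rows reduce to three weighted entries here, so a model built on the unique rows has fewer per-row variables than one built on the raw data.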