humancompatible.detect.data_handler.DataHandler module

class humancompatible.detect.data_handler.DataHandler.DataHandler(features: list[Feature], target: Feature | None = None, causal_inc: list[tuple[Feature, Feature]] | None = None, greater_than: list[tuple[Feature, Feature]] | None = None)

Bases: object

Performs all data processing from a pandas DataFrame or numpy array to a normalized and encoded input. The expected use is to initialize the handler with training data and then use it to encode all data. Supports mixed encoding, where only some values of a feature are categorical. Normalizes continuous data to the [0, 1] range. Produces either one-hot encoded data or direct data in which categorical values are mapped to negative integers.
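The direct (non-one-hot) encoding of a mixed feature can be pictured with a small standalone sketch. It does not call the library: the helper name encode_mixed, the assignment of codes -1, -2, ... by list position, and the example column are all assumptions made for illustration, and the actual DataHandler may assign codes differently.

```python
def encode_mixed(values, categorical_values, lower, upper):
    """Encode one mixed feature without one-hot expansion (illustrative only).

    Values listed in categorical_values map to -1, -2, ... by list position
    (an assumed convention); every other value is min-max scaled into [0, 1].
    """
    codes = {v: -(i + 1) for i, v in enumerate(categorical_values)}
    span = upper - lower
    encoded = []
    for v in values:
        if v in codes:
            encoded.append(float(codes[v]))      # categorical marker
        else:
            encoded.append((v - lower) / span)   # continuous value
    return encoded


# Hypothetical "income" column in which "unknown" is a categorical marker.
income = [20000, "unknown", 60000, 100000]
encoded = encode_mixed(income, ["unknown"], lower=20000, upper=100000)
print(encoded)  # [0.0, -1.0, 0.5, 1.0]
```

Because normalized continuous values stay in [0, 1], negative codes cannot collide with them, which is presumably why negative integers are used for the categorical part.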

allowed_changes(pre_vals, post_vals)

property causal_inc: list[tuple[Feature, Feature]]

decode(X: ndarray[float64], denormalize: bool = True, encoded_one_hot: bool = True, as_dataframe: bool = True) → ndarray[float64]

Decode input features.

Parameters:

X (array-like)

Input data matrix. Shape: (num_samples, num_enc_features), where num_enc_features can be larger than num_features because of one-hot encoding.

denormalize (bool, optional)

Whether to invert the normalization of the features (default is True).

encoded_one_hot (bool, optional)

Whether the input matrix is one-hot encoded (default is True).

as_dataframe (bool, optional)

Whether to return a pandas DataFrame instead of a numpy array (default is True, which returns a DataFrame).

Returns:

decoded_X (numpy array, or pandas DataFrame when as_dataframe is True): Decoded features in the original format. Shape: (num_samples, num_features)
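What decoding has to undo can be sketched in plain Python: each one-hot block collapses back to a single category, and each normalized value is scaled back to its original range. The layout description, the helper decode_row, and the example bounds are hypothetical and only illustrate the shape change from num_enc_features to num_features; they are not the library's internals.

```python
def decode_row(row, layout):
    """Decode one encoded row (illustrative only).

    layout holds one entry per original feature, in column order:
    ("num", lower, upper) for a continuous feature, or
    ("cat", [categories]) for a one-hot encoded categorical feature.
    """
    decoded, pos = [], 0
    for spec in layout:
        if spec[0] == "num":
            _, lower, upper = spec
            decoded.append(lower + row[pos] * (upper - lower))  # denormalize
            pos += 1
        else:
            categories = spec[1]
            block = row[pos:pos + len(categories)]
            decoded.append(categories[block.index(max(block))])  # argmax of the block
            pos += len(categories)
    return decoded


# Two original features expand to four encoded columns.
layout = [("num", 18.0, 78.0), ("cat", ["red", "green", "blue"])]
encoded_row = [0.5, 0.0, 1.0, 0.0]
decoded_row = decode_row(encoded_row, layout)
print(decoded_row)  # [48.0, 'green']
```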

decode_y(y: ndarray[float64], denormalize: bool = True, as_series: bool = True) → ndarray[float64]

Decode target feature.

Parameters:

y (array-like)

Target feature data. Shape: (num_samples,) in the general case, or (num_samples, num_categorical_values) in the case of one-hot encoding.

denormalize (bool, optional)

Whether to invert the normalization of the feature (default is True).

as_series (bool, optional)

Whether to return a pandas Series instead of a numpy array (default is True, which returns a Series).

Returns:

decoded_y (numpy array, or pandas Series when as_series is True): Decoded target feature data. Shape: (num_samples,)

encode(X: ndarray | DataFrame, normalize: bool = True, one_hot: bool = True) → ndarray[float64]

Encode input features.

Parameters:

X (array-like): Input features (data matrix or DataFrame). Shape: (num_samples, num_features)
normalize (bool, optional): Whether to normalize the features (default is True).
one_hot (bool, optional): Whether to perform one-hot encoding for categorical values (default is True).

Returns:

encoded_X (numpy array): Encoded input features. Shape: (num_samples, one_hot_features) when one-hot encoding is performed, (num_samples, num_features) otherwise.
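The difference between the two output widths can be worked out by hand. The sketch below assumes that a continuous feature occupies one column and that a categorical feature with k values occupies k columns under one-hot encoding; mixed features are ignored. The helper sketch_width is hypothetical and is not the library's encoding_width method.

```python
def sketch_width(categories_per_feature, one_hot):
    """Number of encoded columns (illustrative only).

    categories_per_feature lists, for each feature, its number of
    categorical values, with 0 meaning the feature is continuous.
    """
    if not one_hot:
        return len(categories_per_feature)  # one column per feature
    # A continuous feature keeps one column; a categorical one gets k columns.
    return sum(max(k, 1) for k in categories_per_feature)


# Four features: continuous, 3 categories, continuous, 2 categories.
feature_categories = [0, 3, 0, 2]
width_one_hot = sketch_width(feature_categories, one_hot=True)
width_direct = sketch_width(feature_categories, one_hot=False)
print(width_one_hot, width_direct)  # 7 4
```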

encode_all(X_all: ndarray, normalize: bool, one_hot: bool)

encode_y(y: ndarray | Series, normalize: bool = True, one_hot: bool = True) → ndarray[float64]

Encode target feature.

Parameters:

y (array-like): Target feature (array or Series of labels or regression targets). Shape: (num_samples,)
normalize (bool, optional): Whether to normalize the feature (default is True).
one_hot (bool, optional): Whether to perform one-hot encoding for categorical values (default is True).

Returns:

encoded_y (numpy array): Encoded target feature. Shape: (num_samples, num_values) for one-hot encoding, or (num_samples,) otherwise.
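The (num_samples, num_values) shape for a one-hot encoded target can be seen in a minimal standalone sketch. The helper one_hot_labels and the class ordering are assumptions for illustration; how the library orders target values is not stated here.

```python
def one_hot_labels(y, classes):
    """One-hot encode a list of labels (illustrative only)."""
    index = {c: i for i, c in enumerate(classes)}
    return [
        [1.0 if index[label] == j else 0.0 for j in range(len(classes))]
        for label in y
    ]


y = ["no", "yes", "yes"]
encoded_y = one_hot_labels(y, classes=["no", "yes"])
print(encoded_y)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

Three samples and two target values give a 3 x 2 result, matching the documented shape.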

encoding_width(one_hot: bool) → int

property feature_names: list[str]: List of feature names

property features: list[Feature]: List of input features

classmethod from_data(X: ndarray | DataFrame, y: ndarray | Series | None = None, categ_map: dict[int | str, list[int | str]] = {}, ordered: list[int | str] = [], bounds_map: dict[int | str, tuple[int, int]] = {}, discrete: list[int | str] = [], immutable: list[int | str] = [], monotonicity: dict[int | str, Monotonicity] = {}, causal_inc: list[tuple[int | str, int | str]] = [], greater_than: list[tuple[int | str, int | str]] = [], regression: bool = False, feature_names: list[str] | None = None, target_name: str | None = None) → DataHandler

Construct a DataHandler instance.

Parameters:

X (array-like, 2-dimensional)
Input features. Shape: (num_samples, num_features)

y (array-like, 1-dimensional)
Target feature (e.g., labels or regression targets). Shape: (num_samples,)

categ_map (dictionary)
Dictionary with indices (or column names for a DataFrame) of categorical features as keys and a list of unique categorical values as values.

If the list is empty, each unique value of the feature is considered categorical. If the list is non-empty but does not cover all values, the feature is considered mixed.

regression (bool)
True if the task is regression; False if y is categorical and the task is classification.

feature_names (list of strings, optional)
List of feature names. If None, the names are recovered from the column names when X is a DataFrame.

target_name (string, optional)
Name of the target feature. If None, the name is recovered from y when y is a pandas Series.
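The categ_map rule described above (an empty list means fully categorical, a partial list means mixed) can be restated as a small decision function. This is only a reading of the documented rule in plain Python; the helper feature_kind and the example columns are hypothetical and do not reflect how the library implements the check.

```python
def feature_kind(column_values, listed_values):
    """Classify a feature that appears as a key in categ_map (illustrative only)."""
    observed = set(column_values)
    # Empty list: every unique value is treated as a category.
    # Full coverage: all observed values are listed categories.
    if not listed_values or observed <= set(listed_values):
        return "categorical"
    # Some observed values are not listed, so they stay non-categorical.
    return "mixed"


kind_empty = feature_kind(["a", "b", "a"], [])
kind_partial = feature_kind([1.5, "missing", 2.0], ["missing"])
kind_full = feature_kind(["a", "b"], ["a", "b"])
print(kind_empty, kind_partial, kind_full)  # categorical mixed categorical
```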

property greater_than: list[tuple[Feature, Feature]]

property n_features: int: Number of features in the input space

property target_feature: Feature: Target feature