humancompatible.detect.data_handler.DataHandler module
- class humancompatible.detect.data_handler.DataHandler.DataHandler(features: list[Feature], target: Feature | None = None, causal_inc: list[tuple[Feature, Feature]] | None = None, greater_than: list[tuple[Feature, Feature]] | None = None)
Bases: object
Performs all data processing from a pandas DataFrame or numpy array to a normalized and encoded input. The expected use is to initialize the handler with training data and then use it to encode all data. Supports mixed encoding, where only some values of a feature are categorical. Normalizes continuous data to the [0, 1] range. Produces either one-hot encoded data or direct data in which categorical values are mapped to negative integers.
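The two output forms described above can be illustrated with a minimal pure-Python sketch for a single mixed feature. This is not the library's implementation: the helper names `min_max` and `encode_mixed` are hypothetical, and both the column layout of the one-hot form and the numbering of the negative integers are assumptions made for illustration.

```python
def min_max(value, low, high):
    """Scale a continuous value from [low, high] to the [0, 1] range."""
    return (value - low) / (high - low)


def encode_mixed(value, low, high, categories, one_hot=True):
    """Encode one value of a mixed feature (hypothetical helper).

    `categories` lists the values treated as categorical; every other
    value is treated as continuous within [low, high].
    """
    is_categorical = value in categories
    if one_hot:
        # Assumed layout: one numeric slot, then one indicator per category.
        numeric = 0.0 if is_categorical else min_max(value, low, high)
        indicators = [1.0 if value == c else 0.0 for c in categories]
        return [numeric] + indicators
    if is_categorical:
        # Direct form: categories become -1, -2, ... (assumed numbering),
        # which cannot collide with normalized values in [0, 1].
        return [float(-(categories.index(value) + 1))]
    return [min_max(value, low, high)]


# An "age" feature bounded by [20, 60] with one categorical value.
print(encode_mixed(30, 20, 60, ["unknown"]))                        # [0.25, 0.0]
print(encode_mixed("unknown", 20, 60, ["unknown"]))                 # [0.0, 1.0]
print(encode_mixed("unknown", 20, 60, ["unknown"], one_hot=False))  # [-1.0]
```

Negative codes are a natural fit for the direct form because normalized continuous values never leave [0, 1], so the two kinds of value stay distinguishable in a single column.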
- decode(X: ndarray[float64], denormalize: bool = True, encoded_one_hot: bool = True, as_dataframe: bool = True) -> ndarray[float64]
Decode input features.
Parameters:
- X : array-like
Input data matrix. Shape: (num_samples, num_enc_features), where num_enc_features can be higher than num_features because of one-hot encoding.
- denormalize : bool, optional
Whether to invert the normalization of the features (default is True).
- encoded_one_hot : bool, optional
Whether the input matrix is one-hot encoded (default is True).
- as_dataframe : bool, optional
Whether to return a pandas DataFrame or numpy array (default is True, which returns a DataFrame).
Returns:
- decoded_X : numpy array or pandas DataFrame
Decoded features in the original format. Shape: (num_samples, num_features)
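To make the inverse mapping concrete, here is a small sketch of decoding one encoded row back to its original values. It assumes min-max normalization and one indicator column per category; the `spec` format and the function names are hypothetical and are not part of the DataHandler API.

```python
def denormalize_value(x_norm, low, high):
    """Invert min-max scaling: map [0, 1] back to [low, high]."""
    return x_norm * (high - low) + low


def decode_row(row, spec, denormalize=True):
    """Decode one encoded row (hypothetical helper).

    `spec` has one entry per original feature:
      ("num", low, high)  occupies one encoded column
      ("cat", values)     occupies len(values) one-hot columns
    """
    decoded, pos = [], 0
    for feature in spec:
        if feature[0] == "num":
            _, low, high = feature
            value = row[pos]
            if denormalize:
                value = denormalize_value(value, low, high)
            decoded.append(value)
            pos += 1
        else:
            _, values = feature
            block = row[pos:pos + len(values)]
            # Pick the category whose indicator is largest.
            decoded.append(values[block.index(max(block))])
            pos += len(values)
    return decoded


spec = [("num", 20, 60), ("cat", ["red", "green", "blue"])]
print(decode_row([0.25, 0.0, 1.0, 0.0], spec))  # [30.0, 'green']
```

Here four encoded columns collapse back to two original features, which matches the documented shapes (num_samples, num_enc_features) and (num_samples, num_features).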
- decode_y(y: ndarray[float64], denormalize: bool = True, as_series: bool = True) -> ndarray[float64]
Decode target feature.
Parameters:
- y : array-like
Target feature data. Shape: (num_samples,) in the general case, or (num_samples, num_categorical_values) in case of one-hot encoding.
- denormalize : bool, optional
Whether to invert the normalization of the feature (default is True).
- as_series : bool, optional
Whether to return a pandas Series or numpy array (default is True, which returns a Series).
Returns:
- decoded_y : numpy array or pandas Series
Decoded target feature data. Shape: (num_samples,)
- encode(X: ndarray | DataFrame, normalize: bool = True, one_hot: bool = True) -> ndarray[float64]
Encode input features.
Parameters:
- X : array-like
Input features (data matrix or DataFrame). Shape: (num_samples, num_features)
- normalize : bool, optional
Whether to normalize the features (default is True).
- one_hot : bool, optional
Whether to perform one-hot encoding for categorical values (default is True).
Returns:
- encoded_X : numpy array
Encoded input features. Shape: (num_samples, one_hot_features) when one-hot encoding is performed, (num_samples, num_features) otherwise
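The difference between the two output shapes can be worked out by counting columns. The sketch below assumes that each categorical feature expands to one column per category and that each continuous feature keeps a single column; the actual column count the library produces may differ (for instance for mixed or binary features), and `encoded_width` is a hypothetical helper.

```python
def encoded_width(categories_per_feature, one_hot=True):
    """Number of columns in the encoded matrix (hypothetical helper).

    `categories_per_feature` holds, for each original feature, the number
    of categorical values it has (0 for a purely continuous feature).
    """
    if not one_hot:
        # Direct encoding keeps one column per original feature.
        return len(categories_per_feature)
    # One-hot: k columns for k categories, 1 column for a continuous feature.
    return sum(k if k > 0 else 1 for k in categories_per_feature)


# Four features: continuous, 3 categories, continuous, 2 categories.
print(encoded_width([0, 3, 0, 2]))                 # 7
print(encoded_width([0, 3, 0, 2], one_hot=False))  # 4
```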
- encode_y(y: ndarray | Series, normalize: bool = True, one_hot: bool = True) -> ndarray[float64]
Encode target feature.
Parameters:
- y : array-like
Target feature (array or Series of labels or regression targets). Shape: (num_samples,)
- normalize : bool, optional
Whether to normalize the feature (default is True).
- one_hot : bool, optional
Whether to perform one-hot encoding for categorical values (default is True).
Returns:
- encoded_y : numpy array
Encoded target feature. Shape: (num_samples, num_values) for one-hot encoding or (num_samples,) otherwise
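As a rough illustration of the two target shapes, the following sketch encodes a list of class labels either as one-hot rows or as a flat list of codes. Using the position of a label in `values` as its code is an assumption made for this example; the codes the library assigns in the non-one-hot case are not documented here, and `encode_labels` is a hypothetical helper.

```python
def encode_labels(y, values, one_hot=True):
    """Encode class labels (hypothetical helper).

    Returns a list of one-hot rows, shape (num_samples, num_values),
    or a flat list of codes, shape (num_samples,).
    """
    index = {value: i for i, value in enumerate(values)}
    if one_hot:
        return [
            [1.0 if index[label] == i else 0.0 for i in range(len(values))]
            for label in y
        ]
    # Assumed coding: position of the label in `values`.
    return [float(index[label]) for label in y]


labels = ["yes", "no", "yes"]
print(encode_labels(labels, ["no", "yes"]))
# [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(encode_labels(labels, ["no", "yes"], one_hot=False))
# [1.0, 0.0, 1.0]
```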
- property feature_names: list[str]
List of feature names
- classmethod from_data(X: ndarray | DataFrame, y: ndarray | Series | None = None, categ_map: dict[int | str, list[int | str]] = {}, ordered: list[int | str] = [], bounds_map: dict[int | str, tuple[int, int]] = {}, discrete: list[int | str] = [], immutable: list[int | str] = [], monotonicity: dict[int | str, Monotonicity] = {}, causal_inc: list[tuple[int | str, int | str]] = [], greater_than: list[tuple[int | str, int | str]] = [], regression: bool = False, feature_names: list[str] | None = None, target_name: str | None = None) -> DataHandler
Construct a DataHandler instance.
Parameters:
- X : array-like (2-dimensional)
Input features. Shape: (num_samples, num_features)
- y : array-like (1-dimensional)
Target feature (e.g., labels or regression targets). Shape: (num_samples,)
- categ_map : dictionary
Dictionary with indices (or column names for a DataFrame) of categorical features as keys and a list of unique categorical values as values.
If the list is empty, each unique value of the feature is considered categorical. If the list is non-empty but does not cover all values, the feature is considered mixed.
- regression : bool
True if the task is regression, False if y is categorical and the task is classification.
- feature_names : optional list of strings
List of feature names. If None, it is recovered from the column names if X is a DataFrame.
- target_name : optional string
Name of the target feature. If None, it is recovered from y if y is a pandas Series.
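The rule for interpreting a `categ_map` entry can be sketched as a small decision function. This only mirrors the description above and is not taken from the library's source; `classify_feature` is a hypothetical helper, and treating a feature that is absent from the map as continuous is an assumption.

```python
def classify_feature(column, listed_values=None):
    """Classify a feature from its categ_map entry (hypothetical helper).

    `column` holds the observed values; `listed_values` is the feature's
    list in categ_map, or None if the feature is not in the map.
    Returns (kind, categorical_values).
    """
    if listed_values is None:
        # Assumption: features not listed in categ_map are continuous.
        return "continuous", []
    observed = list(dict.fromkeys(column))  # unique values, order kept
    if not listed_values:
        # Empty list: every unique value is treated as categorical.
        return "categorical", observed
    if all(value in listed_values for value in observed):
        return "categorical", list(listed_values)
    # Non-empty list that does not cover all values: mixed feature.
    return "mixed", list(listed_values)


print(classify_feature(["a", "b", "a"], []))               # ('categorical', ['a', 'b'])
print(classify_feature([30, "unknown", 45], ["unknown"]))  # ('mixed', ['unknown'])
print(classify_feature([1.5, 2.0, 3.5]))                   # ('continuous', [])
```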
- property n_features: int
Number of features in the input space