pcntoolkit.dataio.norm_data

This module provides functionality for normalizing data and for converting different types of data into a NormData object.

The NormData object is an xarray.Dataset that contains the data, covariates, batch effects, and response variables, and it is used by all the models in the toolkit.

Classes

NormData

A class for handling normative modeling data, extending xarray.Dataset.

Module Contents

class NormData(name: str, data_vars: xarray.core.types.DataVars, coords: Mapping[Any, Any], attrs: Mapping[Any, Any] | None = None)

Bases: xarray.Dataset

A class for handling normative modeling data, extending xarray.Dataset.

This class provides functionality for loading data for normative modeling. It supports several input formats, including pandas DataFrames, numpy arrays, xarray datasets, and netCDF files.

Parameters:

name (str) – The name of the dataset
data_vars (DataVars) – Data variables for the dataset
coords (Mapping[Any, Any]) – Coordinates for the dataset
attrs (Mapping[Any, Any] | None, optional) – Additional attributes for the dataset, by default None

X

Covariate data

Type: xr.DataArray

y

Response variable data

Type: xr.DataArray

batch_effects

Batch effect data

Type: xr.DataArray

Z

Z-score data

Type: xr.DataArray

centiles

Centile data

Type: xr.DataArray

Examples

>>> data = NormData.from_dataframe("my_data", df, covariates, batch_effects, response_vars)
>>> train_data, test_data = data.train_test_split([0.8, 0.2])

batch_effects_split(batch_effects: Dict[str, List[str]], names: Tuple[str, str] | None) → Tuple[NormData, NormData]

Split the data into two datasets, one with the specified batch effects and one without.

This is useful when you want to split a dataset into two smaller ones.

Parameters:

batch_effects (Dict[str, List[str]]) – A dictionary mapping batch effect dimensions to lists of values to split on.
names (Optional[Tuple[str, str]]) – The names for the two splits.

Returns:

A tuple containing the two split NormData instances.

Return type:

Tuple[NormData, NormData]

check_compatibility(other: NormData) → bool

Check if the data is compatible with another dataset.

Parameters: other (NormData) – Another NormData instance to compare with.
Returns: True if compatible, False otherwise.
Return type: bool

chunk(n_chunks: int) → Generator[NormData]

Split the data into n_chunks chunks, each with a roughly equal number of response variables.

Parameters: n_chunks (int) – The number of chunks to split the data into.
Returns: A generator of NormData instances.
Return type: Generator[NormData]
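
As a rough illustration of what "roughly equal" can mean here, the sketch below splits a list of response variable names into n_chunks groups whose sizes differ by at most one. It is plain Python and not the toolkit's implementation; the helper name chunk_names and the variable names are hypothetical.

```python
from typing import Iterator, List


def chunk_names(names: List[str], n_chunks: int) -> Iterator[List[str]]:
    """Yield n_chunks consecutive groups whose sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(names), n_chunks)
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if i < extra else 0)
        yield names[start:start + size]
        start += size


# Hypothetical response variable names.
response_vars = ["thickness_1", "thickness_2", "thickness_3", "volume_1", "volume_2"]
chunks = list(chunk_names(response_vars, 2))
```

Because the helper is a generator, each group is produced only when requested, which mirrors the Generator return type documented above.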

concatenate_string_arrays(*arrays: Any) → numpy.ndarray

Concatenate arrays of strings.

Parameters: arrays (List[np.ndarray]) – A list of numpy arrays containing strings.
Returns: A single concatenated numpy array of strings.
Return type: np.ndarray

create_statistics_group() → None

Initializes a DataArray for statistics with NaN values.

This method creates a DataArray with dimensions ‘response_vars’ and ‘statistics’, where ‘response_vars’ corresponds to the response variables in the dataset, and ‘statistics’ includes statistics such as Rho, RMSE, SMSE, EXPV, MLL, and ShapiroW. The DataArray is filled with NaN values initially.
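
The layout described above can be pictured as a table with one row per response variable and one column per statistic, with every cell starting as NaN. The sketch below mimics that layout with a nested dict instead of an xarray.DataArray; the response variable names and the value 0.42 are purely illustrative.

```python
import math

# Statistic names listed in the description above.
STATISTICS = ["Rho", "RMSE", "SMSE", "EXPV", "MLL", "ShapiroW"]

# Hypothetical response variable names.
response_vars = ["thickness_1", "thickness_2"]

# One NaN placeholder per (response variable, statistic) pair.
statistics_table = {
    rv: {stat: math.nan for stat in STATISTICS} for rv in response_vars
}

# A placeholder is overwritten once the statistic has been computed.
statistics_table["thickness_1"]["RMSE"] = 0.42  # illustrative value only
```

Starting from NaN rather than zero makes it easy to tell statistics that were never computed apart from statistics that happen to equal zero.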

classmethod from_bids(bids_folder, config_params) → NormData

Abstractmethod:

Load a normative dataset from a BIDS dataset.

Parameters:

bids_folder (str) – Path to the BIDS folder.
config_params (dict) – Configuration parameters for loading the dataset.

Returns:

An instance of NormData.

Return type:

NormData

classmethod from_dataframe(name: str, dataframe: pandas.DataFrame, covariates: List[str] | None = None, batch_effects: List[str] | None = None, response_vars: List[str | LiteralString] | None = None, subject_ids: str | None = None, remove_Nan: bool = False, remove_outliers: bool = False, z_threshold: float = 3.0, attrs: Mapping[str, Any] | None = None) → NormData

Load a normative dataset from a pandas DataFrame.

Parameters:

name (str) – The name you want to give to the dataset. Will be used to name saved results.
dataframe (pd.DataFrame) – The pandas DataFrame to load.
covariates (List[str]) – The list of column names to be used as covariates in the dataset.
batch_effects (List[str]) – The list of column names to be used as batch effects in the dataset.
response_vars (List[str]) – The list of column names to be used as response variables in the dataset.
subject_ids (str) – The name of the column containing the subject IDs.
attrs (Mapping[str, Any] | None, optional) – Additional attributes for the dataset, by default None.
remove_Nan (bool) – Whether to remove NaN values from the dataframe before creating the NormData object, by default False.

Returns:

An instance of NormData.

Return type:

NormData

classmethod from_fsl(fsl_folder, config_params) → NormData

Abstractmethod:

Load a normative dataset from an FSL folder.

Parameters:

fsl_folder (str) – Path to the FSL folder.
config_params (dict) – Configuration parameters for loading the dataset.

Returns:

An instance of NormData.

Return type:

NormData

classmethod from_ndarrays(name: str, X: numpy.ndarray, Y: numpy.ndarray, batch_effects: numpy.ndarray | None = None, subject_ids: numpy.ndarray | None = None, attrs: Mapping[str, Any] | None = None, remove_outliers: bool = False, z_threshold: float = 3.0, remove_Nan: bool = False) → NormData

Create a NormData object from numpy arrays via DataFrame conversion.

classmethod from_netcdf(name: str, netcdf_path: str) → NormData

Load a normative dataset from a netcdf file.

Parameters:

name (str) – The name of the dataset.
netcdf_path (str) – The path to the netcdf file.

Returns:

An instance of NormData.

Return type:

NormData

classmethod from_paths(name: str, covariates_path: str, responses_path: str, batch_effects_path: str, **kwargs) → NormData

Load a normative dataset from separate file paths for the covariates, responses, and batch effects.

classmethod from_xarray(name: str, xarray_dataset: xarray.Dataset) → NormData

Load a normative dataset from an xarray dataset.

Parameters:

name (str) – The name of the dataset.
xarray_dataset (xr.Dataset) – The xarray dataset to load.

Returns:

An instance of NormData.

Return type:

NormData

get_single_batch_effect() → Dict[str, List[str]]

Get a single batch effect for each dimension.

Returns: A dictionary mapping each batch effect dimension to a list containing a single value.
Return type: Dict[str, List[str]]

get_statistics_df() → pandas.DataFrame

Get the statistics as a pandas DataFrame.

has_registered_metadata() → bool

Check if the batch effect and covariate metadata have been registered and are non-empty.

Returns: True if all required metadata attributes exist and are not empty, False otherwise.
Return type: bool

kfold_split(k: int) → Generator[Tuple[numpy.typing.ArrayLike[int], numpy.typing.ArrayLike[int]], Any, Any]

Perform k-fold splitting of the data.

Parameters: k (int) – The number of folds.
Returns: A generator yielding training and testing indices for each fold.
Return type: Generator[Tuple[ArrayLike[int], ArrayLike[int]], Any, Any]
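
To make the shape of the yielded values concrete, here is a minimal sketch of a k-fold index generator in plain Python. It is not the toolkit's implementation: the helper name kfold_indices is hypothetical, and the folds are contiguous and unshuffled, which may differ from what kfold_split actually does.

```python
from typing import Iterator, List, Tuple


def kfold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds hold one additional test sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(kfold_indices(n_samples=10, k=3))
```

Each sample index appears in exactly one test fold, and within every fold the training and testing indices together cover the whole dataset.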

load_centiles(save_dir) → None

load_logp(save_dir) → None

load_results(save_dir: str) → None

Load the results (z-scores, centiles, logp, statistics) back into the data.

Parameters: save_dir (str) – The directory where the results are saved, e.g. {save_dir}/Z_fit_test.csv.

load_statistics(save_dir) → None

load_zscores(save_dir) → None

make_compatible(other: NormData)

Ensure the datasets are compatible by merging their batch effect maps.

merge(other: NormData, name: str | None = None) → NormData

Merge two NormData objects.

Drops all columns that are not present in both datasets.
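
The column-dropping rule can be sketched with two small column-oriented tables. This is a plain Python illustration under the assumption that merging concatenates observations; the helper merge_tables and the column names are hypothetical, and the sketch says nothing about how NormData handles coordinates or attributes.

```python
from typing import Dict, List

Table = Dict[str, List[object]]


def merge_tables(left: Table, right: Table) -> Table:
    """Concatenate two tables, keeping only the columns present in both."""
    # Iterating over `left` preserves its column order.
    shared = [col for col in left if col in right]
    return {col: left[col] + right[col] for col in shared}


site_a = {"age": [23.0, 31.0], "sex": ["F", "M"], "thickness": [2.5, 2.4]}
site_b = {"age": [45.0], "thickness": [2.2], "volume": [1400.0]}

# "sex" and "volume" each appear in only one table, so both are dropped.
merged = merge_tables(site_a, site_b)
```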

register_batch_effects() → None

Create a mapping of batch effects to unique values.
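
A mapping of this kind could look like the sketch below, which collects the unique values and their counts for each batch effect dimension. The names unique_batch_effects and batch_effect_counts echo entries in __slots__, but the structures shown here are an assumption, not the toolkit's actual representation.

```python
from collections import Counter

# Hypothetical batch effect columns, one list of labels per dimension.
batch_effects = {
    "sex": ["F", "M", "F", "F"],
    "site": ["site_a", "site_b", "site_a", "site_c"],
}

# Unique values per dimension, sorted for a stable order.
unique_batch_effects = {
    dim: sorted(set(values)) for dim, values in batch_effects.items()
}

# Number of observations per value, per dimension.
batch_effect_counts = {
    dim: dict(Counter(values)) for dim, values in batch_effects.items()
}
```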

classmethod remove_nan(dataframe: pandas.DataFrame) → pandas.DataFrame

Remove NaN values from the dataframe.

classmethod remove_outliers(dataframe: pandas.DataFrame, continuous_vars: List[str], z_threshold: float = 3.0) → pandas.DataFrame

Remove outliers from the dataframe.
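
One common way to apply a z-score threshold is to drop every row in which any continuous variable lies more than z_threshold standard deviations from its column mean. The sketch below shows that idea in plain Python. It is an assumption about the general technique rather than the toolkit's code: the helper drop_outlier_rows is hypothetical, and it uses the population standard deviation, which the documentation does not specify.

```python
from statistics import mean, pstdev
from typing import Dict, List


def drop_outlier_rows(
    table: Dict[str, list], continuous_vars: List[str], z_threshold: float = 3.0
) -> Dict[str, list]:
    """Drop rows where any continuous variable has |z| above the threshold."""
    n_rows = len(next(iter(table.values())))
    keep = [True] * n_rows
    for col in continuous_vars:
        mu = mean(table[col])
        sigma = pstdev(table[col])
        if sigma == 0:
            continue  # a constant column cannot contain outliers
        for i, value in enumerate(table[col]):
            if abs((value - mu) / sigma) > z_threshold:
                keep[i] = False
    # Apply the same row mask to every column, continuous or not.
    return {col: [v for v, k in zip(vals, keep) if k] for col, vals in table.items()}


table = {
    "thickness": [2.0, 2.0, 2.0, 2.0, 2.0, 20.0],
    "site": ["a", "a", "b", "b", "c", "c"],
}
# A lower threshold than the default is used because the example is tiny.
cleaned = drop_outlier_rows(table, ["thickness"], z_threshold=2.0)
```

With only six rows, a single extreme value cannot reach a z-score of 3, which is why the example lowers the threshold to 2.0.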

save_centiles(save_dir: str) → None

save_logp(save_dir: str) → None

save_results(save_dir: str) → None

Save the results (z-scores, centiles, logp, statistics) to disk.

Parameters: save_dir (str) – The directory where the results are saved, e.g. {save_dir}/Z_fit_test.csv.

save_statistics(save_dir: str) → None

save_zscores(save_dir: str) → None

scale_backward(inscalers: Dict[str, Any], outscalers: Dict[str, Any]) → None

Scale the data backward using provided scalers.

Parameters:

inscalers (Dict[str, Any]) – Scalers for the covariate data.
outscalers (Dict[str, Any]) – Scalers for the response variable data.

scale_forward(inscalers: Dict[str, Any], outscalers: Dict[str, Any]) → None

Scale the data forward in-place using provided scalers.

Parameters:

inscalers (Dict[str, Any]) – Scalers for the covariate data.
outscalers (Dict[str, Any]) – Scalers for the response variable data.
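
Forward and backward scaling are inverse operations, which the following sketch demonstrates with a simple standardizing scaler. The Standardizer class is hypothetical and stands in for whatever scaler objects the toolkit stores in these dictionaries; keying the dictionary by covariate name is also an assumption.

```python
from statistics import mean, pstdev
from typing import Dict, List


class Standardizer:
    """Hypothetical scaler: subtract the mean, divide by the standard deviation."""

    def __init__(self, values: List[float]) -> None:
        self.mean = mean(values)
        self.std = pstdev(values) or 1.0  # fall back to 1.0 for constant data

    def transform(self, values: List[float]) -> List[float]:
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values: List[float]) -> List[float]:
        return [v * self.std + self.mean for v in values]


covariates: Dict[str, List[float]] = {"age": [20.0, 30.0, 40.0]}

# One scaler per covariate, fitted on that covariate's values.
inscalers = {name: Standardizer(values) for name, values in covariates.items()}

# Forward scaling, then backward scaling to recover the original values.
scaled = {name: inscalers[name].transform(vals) for name, vals in covariates.items()}
restored = {name: inscalers[name].inverse_transform(vals) for name, vals in scaled.items()}
```

The same scalers must be used in both directions; scaling backward with scalers fitted on different data would not recover the original values.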

select_batch_effects(name: str, batch_effects: Dict[str, List[str]], invert: bool = False) → NormData

Select observations matching (or not matching) batch effects.

Parameters:

name (str) – Name to assign to the returned NormData instance.
batch_effects (Dict[str, List[str]]) – A dictionary mapping batch effect dimensions to lists of values to select batch effects from.
invert (bool, optional) – If True, return observations that do not match any of the specified batch effect values. Default is False.

Returns:

A NormData instance containing observations matching (or not matching) the specified batch effects.

Return type:

NormData
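
The selection logic, including the effect of invert, can be sketched over a list of plain records. This is one reading of the description above, not the toolkit's code: the helper select_rows is hypothetical, and it treats an observation as matching when its value for at least one listed dimension appears in that dimension's list.

```python
from typing import Dict, List


def select_rows(
    rows: List[Dict[str, str]],
    batch_effects: Dict[str, List[str]],
    invert: bool = False,
) -> List[Dict[str, str]]:
    """Keep rows that match (or, with invert=True, do not match) the batch effects."""
    selected = []
    for row in rows:
        # Assumption: matching any listed value in any dimension counts as a match.
        hit = any(row[dim] in values for dim, values in batch_effects.items())
        if hit != invert:
            selected.append(row)
    return selected


rows = [
    {"subject": "s1", "site": "site_a", "sex": "F"},
    {"subject": "s2", "site": "site_b", "sex": "M"},
    {"subject": "s3", "site": "site_c", "sex": "F"},
]
wanted = {"site": ["site_a", "site_b"]}
kept = select_rows(rows, wanted)
rest = select_rows(rows, wanted, invert=True)
```

Calling the helper once with invert=False and once with invert=True partitions the records into two disjoint groups, which is the kind of pair that batch_effects_split is documented to return.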

to_dataframe(dim_order: Sequence[Hashable] | None = None) → pandas.DataFrame

Convert the NormData instance to a pandas DataFrame.

Parameters: dim_order (Sequence[Hashable] | None, optional) – The order of dimensions for the DataFrame, by default None.
Returns: A DataFrame representation of the NormData instance.
Return type: pd.DataFrame

to_netcdf(netcdf_path: str) → None

Save the NormData object to a netcdf file.

Parameters: netcdf_path (str) – The path to the netcdf file.
Return type: None

train_test_split(splits: Tuple[float, ...] | List[float] | float = 0.8, split_names: Tuple[str, ...] | None = None, random_state: int = 42) → Tuple[NormData, ...]

Split the data into training and testing datasets.

Parameters:

splits (Tuple[float, ...] | List[float] | float) – A tuple or list of proportions, one per split, e.g. (train_size, test_size), or a single float giving the proportion of data for the training set.
split_names (Tuple[str, ...] | None, optional) – Names for the splits, by default None.
random_state (int, optional) – Random state for splits, by default 42.

Returns:

A tuple containing the training and testing NormData instances.

Return type:

Tuple[NormData, ...]
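
The two accepted forms of splits and the role of random_state can be illustrated with a small index-splitting helper. This is a sketch in plain Python under stated assumptions: the helper split_indices is hypothetical, it shuffles uniformly at random, and it requires the proportions to sum to 1, none of which is confirmed by the documentation above.

```python
import random
from typing import List, Sequence, Union


def split_indices(
    n_samples: int,
    splits: Union[float, Sequence[float]] = 0.8,
    random_state: int = 42,
) -> List[List[int]]:
    """Shuffle sample indices and cut them into consecutive parts."""
    # A single float is interpreted as the training proportion.
    if isinstance(splits, float):
        splits = [splits, 1.0 - splits]
    if abs(sum(splits) - 1.0) > 1e-9:
        raise ValueError("split proportions must sum to 1")
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle reproducible.
    random.Random(random_state).shuffle(indices)
    parts, start = [], 0
    for i, fraction in enumerate(splits):
        # The last part takes the remainder so no index is lost to rounding.
        last = i == len(splits) - 1
        stop = n_samples if last else start + round(fraction * n_samples)
        parts.append(indices[start:stop])
        start = stop
    return parts


train_idx, test_idx = split_indices(10, splits=0.8)
three_way = split_indices(10, splits=[0.6, 0.2, 0.2])
```

Fixing random_state means repeated calls produce the same partition, which keeps experiments reproducible.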

__slots__ = ('unique_batch_effects', 'batch_effect_counts', 'batch_effect_covariate_ranges',...

property name: str

Get the name of the dataset.

Returns: The name of the dataset.
Return type: str

property response_var_list: xarray.DataArray