pcntoolkit.dataio.norm_data

This module provides functionality for normalizing and converting different types of data into a NormData object.

The NormData object is an xarray.Dataset that contains the data, covariates, batch effects, and response variables, and it is used by all the models in the toolkit.

Classes

NormData

A class for handling normative modeling data, extending xarray.Dataset.

Module Contents

class NormData(name: str, data_vars: xarray.core.types.DataVars, coords: Mapping[Any, Any], attrs: Mapping[Any, Any] | None = None)

Bases: xarray.Dataset

A class for handling normative modeling data, extending xarray.Dataset.

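This class provides functionality for loading data for normative modeling. Instances can be created from pandas DataFrames, NumPy arrays, xarray datasets, netCDF files, and file paths through the from_* class methods.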

Parameters:
  • name (str) – The name of the dataset

  • data_vars (DataVars) – Data variables for the dataset

  • coords (Mapping[Any, Any]) – Coordinates for the dataset

  • attrs (Mapping[Any, Any] | None, optional) – Additional attributes for the dataset, by default None

X

Covariate data

Type:

xr.DataArray

y

Response variable data

Type:

xr.DataArray

batch_effects

Batch effect data

Type:

xr.DataArray

Z

Z-score data

Type:

xr.DataArray

centiles

Centile data

Type:

xr.DataArray

Examples

>>> data = NormData.from_dataframe("my_data", df, covariates, batch_effects, response_vars)
>>> train_data, test_data = data.train_test_split([0.8, 0.2])

batch_effects_split(batch_effects: Dict[str, List[str]], names: Tuple[str, str] | None) → Tuple[NormData, NormData]

Split the data into two datasets, one with the specified batch effects and one without.

This is useful when you want to split a dataset into two smaller ones.

Parameters:
  • batch_effects (Dict[str, List[str]]) – A dictionary mapping batch effect dimensions to lists of values to split on.

  • names (Optional[Tuple[str, str]]) – The names for the two splits.

Returns:

A tuple containing the two split NormData instances.

Return type:

Tuple[NormData, NormData]
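The partitioning idea can be sketched in plain Python. This is only an illustration of the concept, not the toolkit's implementation: the helper name split_rows, the row layout, and the rule that a row must match every listed dimension are all assumptions.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_rows(
    rows: List[Row], batch_effects: Dict[str, List[str]]
) -> Tuple[List[Row], List[Row]]:
    """Partition rows into (matching, non-matching) by batch effect values."""
    matching: List[Row] = []
    remainder: List[Row] = []
    for row in rows:
        # Assumption: a row matches when its value is listed for every
        # batch effect dimension named in the dictionary.
        hit = all(row[dim] in values for dim, values in batch_effects.items())
        (matching if hit else remainder).append(row)
    return matching, remainder


# Hypothetical observations with a single batch effect dimension, "site".
rows = [
    {"subject": "s1", "site": "A"},
    {"subject": "s2", "site": "B"},
    {"subject": "s3", "site": "C"},
    {"subject": "s4", "site": "A"},
]
site_a, other_sites = split_rows(rows, {"site": ["A"]})
```

Every observation ends up in exactly one of the two parts, which mirrors the "one with, one without" description above.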

check_compatibility(other: NormData) → bool

Check if the data is compatible with another dataset.

Parameters:

other (NormData) – Another NormData instance to compare with.

Returns:

True if compatible, False otherwise

Return type:

bool

chunk(n_chunks: int) → Generator[NormData]

Split the data into n_chunks chunks, each with a roughly equal number of response variables.

Parameters:

n_chunks (int) – The number of chunks to split the data into.

Returns:

A generator of NormData instances.

Return type:

Generator[NormData]
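A minimal sketch of how a list of response variable names could be divided into roughly equal chunks. It operates on plain lists rather than NormData objects; the function chunk_names and the variable names are hypothetical, and the toolkit may order or size its chunks differently.

```python
from typing import Iterator, List


def chunk_names(names: List[str], n_chunks: int) -> Iterator[List[str]]:
    """Yield n_chunks consecutive slices whose sizes differ by at most one."""
    base, extra = divmod(len(names), n_chunks)
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item.
        size = base + (1 if i < extra else 0)
        yield names[start:start + size]
        start += size


# Hypothetical response variable names.
response_vars = ["thick_a", "thick_b", "thick_c", "vol_a", "vol_b"]
chunks = list(chunk_names(response_vars, 2))
```

Because the function is a generator, each chunk is produced only when requested, which is convenient when every chunk is fitted separately.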

concatenate_string_arrays(*arrays: Any) → numpy.ndarray

Concatenate arrays of strings.

Parameters:

arrays (np.ndarray) – One or more numpy arrays containing strings, passed as positional arguments.

Returns:

A single concatenated numpy array of strings.

Return type:

np.ndarray

create_statistics_group() → None

Initialize a DataArray for statistics, filled with NaN values.

This method creates a DataArray with dimensions ‘response_vars’ and ‘statistics’, where ‘response_vars’ corresponds to the response variables in the dataset, and ‘statistics’ includes statistics such as Rho, RMSE, SMSE, EXPV, MLL, and ShapiroW. The DataArray is filled with NaN values initially.
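The layout described above can be pictured as a small table indexed by response variable and statistic. The sketch below uses nested dictionaries instead of an xarray DataArray, takes the statistic names from the description (the actual list may be longer), and uses a hypothetical helper name.

```python
import math
from typing import Dict, List

# Statistic names taken from the description above; possibly not exhaustive.
STATISTICS = ["Rho", "RMSE", "SMSE", "EXPV", "MLL", "ShapiroW"]


def empty_statistics(response_vars: List[str]) -> Dict[str, Dict[str, float]]:
    """Build a response_vars x statistics table filled with NaN."""
    return {rv: {stat: math.nan for stat in STATISTICS} for rv in response_vars}


stats = empty_statistics(["thickness", "volume"])
# Entries are overwritten once a statistic has been computed.
stats["thickness"]["RMSE"] = 0.42
```

Starting from NaN makes it easy to tell which statistics have not been computed yet.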

abstract classmethod from_bids(bids_folder, config_params) → NormData

Load a normative dataset from a BIDS dataset.

Parameters:
  • bids_folder (str) – Path to the BIDS folder.

  • config_params (dict) – Configuration parameters for loading the dataset.

Returns:

An instance of NormData.

Return type:

NormData

classmethod from_dataframe(name: str, dataframe: pandas.DataFrame, covariates: List[str] | None = None, batch_effects: List[str] | None = None, response_vars: List[str | LiteralString] | None = None, subject_ids: str | None = None, remove_Nan: bool = False, remove_outliers: bool = False, z_threshold: float = 3.0, attrs: Mapping[str, Any] | None = None) → NormData

Load a normative dataset from a pandas DataFrame.

Parameters:
  • name (str) – The name you want to give to the dataset. Will be used to name saved results.

  • dataframe (pd.DataFrame) – The pandas DataFrame to load.

  • covariates (List[str] | None, optional) – The column names to use as covariates, by default None.

  • batch_effects (List[str] | None, optional) – The column names to use as batch effects, by default None.

  • response_vars (List[str] | None, optional) – The column names to use as response variables, by default None.

  • subject_ids (str | None, optional) – The name of the column containing the subject IDs, by default None.

  • remove_Nan (bool, optional) – Whether to remove NaN values from the dataframe before creating the NormData object, by default False.

  • remove_outliers (bool, optional) – Whether to remove outliers from the dataframe before creating the NormData object, by default False.

  • z_threshold (float, optional) – The z-score threshold used when removing outliers, by default 3.0.

  • attrs (Mapping[str, Any] | None, optional) – Additional attributes for the dataset, by default None.

Returns:

An instance of NormData.

Return type:

NormData

abstract classmethod from_fsl(fsl_folder, config_params) → NormData

Load a normative dataset from an FSL folder.

Parameters:
  • fsl_folder (str) – Path to the FSL folder.

  • config_params (dict) – Configuration parameters for loading the dataset.

Returns:

An instance of NormData.

Return type:

NormData

classmethod from_ndarrays(name: str, X: numpy.ndarray, Y: numpy.ndarray, batch_effects: numpy.ndarray | None = None, subject_ids: numpy.ndarray | None = None, attrs: Mapping[str, Any] | None = None, remove_outliers: bool = False, z_threshold: float = 3.0, remove_Nan: bool = False) → NormData

Create a NormData object from numpy arrays via DataFrame conversion.

classmethod from_netcdf(name: str, netcdf_path: str) → NormData

Load a normative dataset from a netcdf file.

Parameters:
  • name (str) – The name of the dataset.

  • netcdf_path (str) – The path to the netcdf file.

Returns:

An instance of NormData.

Return type:

NormData

classmethod from_paths(name: str, covariates_path: str, responses_path: str, batch_effects_path: str, **kwargs) → NormData

Load a normative dataset from the paths to the covariate, response, and batch effect files.

classmethod from_xarray(name: str, xarray_dataset: xarray.Dataset) → NormData

Load a normative dataset from an xarray dataset.

Parameters:
  • name (str) – The name of the dataset.

  • xarray_dataset (xr.Dataset) – The xarray dataset to load.

Returns:

An instance of NormData.

Return type:

NormData

get_single_batch_effect() → Dict[str, List[str]]

Get a single batch effect for each dimension.

Returns:

A dictionary mapping each batch effect dimension to a list containing a single value.

Return type:

Dict[str, List[str]]

get_statistics_df() → pandas.DataFrame

Get the statistics as a pandas DataFrame.

has_registered_metadata() → bool

Check if the batch effect and covariate metadata have been registered and are non-empty.

Returns:

True if all required metadata attributes exist and are not empty, False otherwise.

Return type:

bool

kfold_split(k: int) → Generator[Tuple[numpy.typing.ArrayLike[int], numpy.typing.ArrayLike[int]], Any, Any]

Perform k-fold splitting of the data.

Parameters:

k (int) – The number of folds.

Returns:

A generator yielding training and testing indices for each fold.

Return type:

Generator[Tuple[ArrayLike[int], ArrayLike[int]], Any, Any]
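A minimal sketch of k-fold index generation, to show the shape of what such a generator yields. It is not the toolkit's implementation: the function name, the shuffling with a fixed seed, and the round-robin fold assignment are assumptions, and the real method may, for example, stratify by batch effects.

```python
import random
from typing import Iterator, List, Tuple


def kfold_indices(
    n_samples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(kfold_indices(n_samples=10, k=5))
```

Each observation appears in exactly one test fold, and the training indices of a fold are all remaining observations.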

load_centiles(save_dir) → None

load_logp(save_dir) → None

load_results(save_dir: str) → None

Load the results (z-scores, centiles, logp, statistics) back into the data.

Parameters:

save_dir (str) – The directory where the results are saved. Results are read from files such as {save_dir}/Z_fit_test.csv.

load_statistics(save_dir) → None

load_zscores(save_dir) → None

make_compatible(other: NormData)

Ensure the datasets are compatible by merging their batch effect maps.

merge(other: NormData, name: str | None = None) → NormData

Merge two NormData objects.

Drops all columns that are not present in both datasets.
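The "keep only shared columns" rule can be sketched with column-oriented dictionaries. This is a simplified illustration, not the toolkit's code: merge_tables and the column names are hypothetical, and real NormData objects also carry coordinates and attributes.

```python
from typing import Dict, List

Table = Dict[str, List[float]]


def merge_tables(left: Table, right: Table) -> Table:
    """Stack two tables, keeping only the columns present in both."""
    shared = [col for col in left if col in right]
    return {col: left[col] + right[col] for col in shared}


# Hypothetical datasets: only "age" and "thickness" appear in both.
site_a = {"age": [25.0, 31.0], "thickness": [2.4, 2.2], "volume": [1.1, 1.3]}
site_b = {"age": [40.0], "thickness": [2.0], "area": [0.7]}
merged = merge_tables(site_a, site_b)
```

The columns "volume" and "area" are dropped because each exists in only one of the inputs.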

register_batch_effects() → None

Create a mapping of batch effects to unique values.
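A minimal sketch of such a mapping, built from plain dictionaries. The helper name, the row layout, and the sorted ordering of the values are assumptions made for illustration.

```python
from typing import Dict, List


def unique_batch_effects(
    rows: List[Dict[str, str]], dims: List[str]
) -> Dict[str, List[str]]:
    """Map each batch effect dimension to its sorted unique values."""
    return {dim: sorted({row[dim] for row in rows}) for dim in dims}


# Hypothetical observations with two batch effect dimensions.
rows = [
    {"site": "B", "sex": "F"},
    {"site": "A", "sex": "M"},
    {"site": "B", "sex": "M"},
]
mapping = unique_batch_effects(rows, ["site", "sex"])

# Taking one value per dimension gives a dictionary with the same shape as the
# return value of get_single_batch_effect (which value the toolkit picks is
# not specified here; the first one is an arbitrary choice).
single = {dim: values[:1] for dim, values in mapping.items()}
```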

classmethod remove_nan(dataframe: pandas.DataFrame) → pandas.DataFrame

Remove NaN values from the dataframe.
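One common reading of "remove NaN values" is to drop every row that contains at least one NaN. The sketch below assumes that reading and uses plain dictionaries instead of a DataFrame; drop_nan_rows is a hypothetical name.

```python
import math
from typing import Dict, List


def drop_nan_rows(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Keep only the rows in which no value is NaN."""
    return [row for row in rows if not any(math.isnan(v) for v in row.values())]


rows = [
    {"age": 25.0, "thickness": 2.4},
    {"age": math.nan, "thickness": 2.1},
    {"age": 31.0, "thickness": math.nan},
    {"age": 40.0, "thickness": 2.0},
]
clean = drop_nan_rows(rows)
```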

classmethod remove_outliers(dataframe: pandas.DataFrame, continuous_vars: List[str], z_threshold: float = 3.0) → pandas.DataFrame

Remove outliers from the dataframe.
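Given the z_threshold parameter, a plausible reading is that rows are dropped when a continuous variable lies more than z_threshold standard deviations from its mean. The sketch below illustrates that rule under stated assumptions: it uses the population standard deviation, treats each variable independently, and the helper name and data are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def drop_outliers(
    rows: List[Dict[str, float]],
    continuous_vars: List[str],
    z_threshold: float = 3.0,
) -> List[Dict[str, float]]:
    """Drop rows whose |z-score| exceeds z_threshold in any listed variable."""
    keep = [True] * len(rows)
    for var in continuous_vars:
        values = [row[var] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # A constant column has no outliers.
        for i, value in enumerate(values):
            if abs((value - mu) / sigma) > z_threshold:
                keep[i] = False
    return [row for row, flag in zip(rows, keep) if flag]


# 19 ordinary measurements plus one implausible value.
rows = [{"thickness": 2.0 + 0.01 * i} for i in range(19)] + [{"thickness": 9.0}]
cleaned = drop_outliers(rows, ["thickness"])
```

Note that with very few rows a single extreme value cannot exceed a threshold of 3, because it inflates the standard deviation it is measured against.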

save_centiles(save_dir: str) → None

save_logp(save_dir: str) → None

save_results(save_dir: str) → None

Save the results (z-scores, centiles, logp, statistics) to disk.

Parameters:

save_dir (str) – The directory where the results are saved. Results are written to files such as {save_dir}/Z_fit_test.csv.

save_statistics(save_dir: str) → None

save_zscores(save_dir: str) → None

scale_backward(inscalers: Dict[str, Any], outscalers: Dict[str, Any]) → None

Scale the data backward using provided scalers.

Parameters:
  • inscalers (Dict[str, Any]) – Scalers for the covariate data.

  • outscalers (Dict[str, Any]) – Scalers for the response variable data.

scale_forward(inscalers: Dict[str, Any], outscalers: Dict[str, Any]) → None

Scale the data forward in-place using provided scalers.

Parameters:
  • inscalers (Dict[str, Any]) – Scalers for the covariate data.

  • outscalers (Dict[str, Any]) – Scalers for the response variable data.
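The relationship between forward and backward scaling can be illustrated with a simple standardizing scaler. The class below is hypothetical and is not one of the toolkit's scalers; it only shows that scaling backward undoes scaling forward, and that scalers can be kept in a dictionary keyed by variable name as the Dict[str, Any] parameters suggest.

```python
from statistics import mean, pstdev
from typing import Dict, List


class StandardScaler1D:
    """Hypothetical scaler: forward z = (x - mean) / std, backward x = z * std + mean."""

    def __init__(self, values: List[float]) -> None:
        self.mean = mean(values)
        self.std = pstdev(values) or 1.0  # Avoid division by zero.

    def transform(self, values: List[float]) -> List[float]:
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values: List[float]) -> List[float]:
        return [v * self.std + self.mean for v in values]


ages = [20.0, 30.0, 40.0, 50.0]
# One scaler per covariate, keyed by covariate name.
inscalers: Dict[str, StandardScaler1D] = {"age": StandardScaler1D(ages)}

scaled = inscalers["age"].transform(ages)
restored = inscalers["age"].inverse_transform(scaled)
```

The same scaler objects must be used in both directions, otherwise the backward step will not recover the original units.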

select_batch_effects(name: str, batch_effects: Dict[str, List[str]], invert: bool = False) → NormData

Select observations matching (or not matching) batch effects.

Parameters:
  • name (str) – Name to assign to the returned NormData instance.

  • batch_effects (Dict[str, List[str]]) – A dictionary mapping batch effect dimensions to the lists of values to select.

  • invert (bool, optional) – If True, return observations that do not match any of the specified batch effect values. Default is False.

Returns:

A NormData instance containing observations matching (or not matching) the specified batch effects.

Return type:

NormData
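The effect of the invert flag can be sketched on plain dictionaries. This is an illustration only: select_rows is a hypothetical helper, and how values from several dimensions are combined in the toolkit is not specified here, so the example sticks to a single dimension.

```python
from typing import Dict, List

Row = Dict[str, str]


def select_rows(
    rows: List[Row], batch_effects: Dict[str, List[str]], invert: bool = False
) -> List[Row]:
    """Return rows that match the batch effects, or those that do not if invert is True."""
    selected = []
    for row in rows:
        # Assumption: with several dimensions, a row must match all of them.
        hit = all(row[dim] in values for dim, values in batch_effects.items())
        if hit != invert:
            selected.append(row)
    return selected


rows = [
    {"subject": "s1", "site": "A"},
    {"subject": "s2", "site": "B"},
    {"subject": "s3", "site": "C"},
]
kept = select_rows(rows, {"site": ["A", "B"]})
excluded = select_rows(rows, {"site": ["A", "B"]}, invert=True)
```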

to_dataframe(dim_order: Sequence[Hashable] | None = None) → pandas.DataFrame

Convert the NormData instance to a pandas DataFrame.

Parameters:

dim_order (Sequence[Hashable] | None, optional) – The order of dimensions for the DataFrame, by default None.

Returns:

A DataFrame representation of the NormData instance.

Return type:

pd.DataFrame

to_netcdf(netcdf_path: str) → None

Save the NormData object to a netcdf file.

Parameters:

netcdf_path (str) – The path to the netcdf file.

Return type:

None

train_test_split(splits: Tuple[float, ...] | List[float] | float = 0.8, split_names: Tuple[str, ...] | None = None, random_state: int = 42) → Tuple[NormData, ...]

Split the data into training and testing datasets.

Parameters:
  • splits (Tuple[float, ...] | List[float] | float) – Either a tuple or list such as (train_size, test_size) giving the proportion of the data for each split, or a single float giving the proportion of the data for the training set. By default 0.8.

  • split_names (Tuple[str, ...] | None, optional) – Names for the splits, by default None.

  • random_state (int, optional) – Random state for the splits, by default 42.

Returns:

A tuple containing the training and testing NormData instances.

Return type:

Tuple[NormData, ...]
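A minimal sketch of a seeded proportional split, shown on row indices rather than NormData objects. It is not the toolkit's implementation: split_indices is a hypothetical helper, it accepts only a sequence of proportions, and the real method may additionally stratify by batch effects.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(
    n_samples: int, splits: Sequence[float], random_state: int = 42
) -> Tuple[List[int], ...]:
    """Shuffle the indices 0..n_samples-1 and cut them into proportional parts."""
    if abs(sum(splits) - 1.0) > 1e-9:
        raise ValueError("splits must sum to 1")
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    parts = []
    start = 0
    cumulative = 0.0
    for fraction in splits:
        # Cut at the rounded cumulative proportion so the parts cover all rows.
        cumulative += fraction
        stop = round(cumulative * n_samples)
        parts.append(indices[start:stop])
        start = stop
    return tuple(parts)


train_idx, test_idx = split_indices(100, (0.8, 0.2))
```

Fixing random_state makes the split reproducible, so repeated calls with the same arguments return the same partition.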

__slots__ = ('unique_batch_effects', 'batch_effect_counts', 'batch_effect_covariate_ranges',...
property name: str

Get the name of the dataset.

Returns:

The name of the dataset.

Return type:

str

property response_var_list: xarray.DataArray