pcntoolkit.dataio.norm_data#
This module provides functionalities for normalizing and converting different types of data into a NormData object.
The NormData object is an xarray.Dataset that contains the data, covariates, batch effects, and response variables, and it is used by all the models in the toolkit.
Classes#
NormData: A class for handling normative modeling data, extending xarray.Dataset.
Module Contents#
- class NormData(name: str, data_vars: xarray.core.types.DataVars, coords: Mapping[Any, Any], attrs: Mapping[Any, Any] | None = None)#
Bases: xarray.Dataset
A class for handling normative modeling data, extending xarray.Dataset.
This class provides functionality for loading data for normative modeling. It supports various data formats.
- Parameters:
name (str) – The name of the dataset.
data_vars (DataVars) – Data variables for the dataset.
coords (Mapping[Any, Any]) – Coordinates for the dataset.
attrs (Mapping[Any, Any] | None, optional) – Additional attributes for the dataset, by default None.
- X#
Covariate data
- Type:
xr.DataArray
- y#
Response variable data
- Type:
xr.DataArray
- batch_effects#
Batch effect data
- Type:
xr.DataArray
- Z#
Z-score data
- Type:
xr.DataArray
- centiles#
Centile data
- Type:
xr.DataArray
Examples
>>> data = NormData.from_dataframe("my_data", df, covariates, batch_effects, response_vars)
>>> train_data, test_data = data.train_test_split([0.8, 0.2])
Initialize a NormData object.
- Parameters:
name (str) – The name of the dataset.
data_vars (DataVars) – Data variables for the dataset.
coords (Mapping[Any, Any]) – Coordinates for the dataset.
attrs (Mapping[Any, Any] | None, optional) – Additional attributes for the dataset, by default None.
- batch_effects_split(batch_effects: Dict[str, List[str]], names: Tuple[str, str] | None) Tuple[NormData, NormData]#
Split the data into two datasets, one with the specified batch effects and one without.
This is useful when you want to split a dataset into two smaller ones.
- Parameters:
batch_effects (Dict[str, List[str]]) – A dictionary mapping batch effect dimensions to lists of values to split on.
names (Optional[Tuple[str, str]]) – The names for the two splits.
- Returns:
A tuple containing the two split NormData instances.
- Return type:
Tuple[NormData,NormData]
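The idea behind the split can be illustrated with a minimal sketch on plain dictionaries. It is not the pcntoolkit implementation: the helper `split_by_batch_effects` and the sample rows are hypothetical, and the rule that an observation matches only when every listed dimension has one of the listed values is an assumption.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_by_batch_effects(
    rows: List[Row], batch_effects: Dict[str, List[str]]
) -> Tuple[List[Row], List[Row]]:
    """Hypothetical helper: partition rows into (matching, not matching)."""
    matching: List[Row] = []
    rest: List[Row] = []
    for row in rows:
        # Assumption: a row matches when every listed dimension
        # takes one of the listed values.
        is_match = all(row[dim] in values for dim, values in batch_effects.items())
        (matching if is_match else rest).append(row)
    return matching, rest


rows = [
    {"site": "A", "sex": "F"},
    {"site": "B", "sex": "M"},
    {"site": "C", "sex": "F"},
]
with_a, without_a = split_by_batch_effects(rows, {"site": ["A"]})
```

Every observation lands in exactly one of the two parts, so the two results together always cover the original data.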
- chunk(n_chunks: int) Generator[NormData]#
Split the data into n_chunks with a roughly equal number of response variables.
- Parameters:
n_chunks (int) – The number of chunks to split the data into.
- Returns:
A generator of NormData instances.
- Return type:
Generator[NormData]
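As a rough illustration of "roughly equal", the sketch below chunks a list of response variable names so that chunk sizes differ by at most one. The helper `chunk_names` and the contiguous chunking order are assumptions, not the library's code.

```python
from typing import Iterator, List


def chunk_names(names: List[str], n_chunks: int) -> Iterator[List[str]]:
    """Hypothetical helper: yield n_chunks contiguous, near-equal slices."""
    base, extra = divmod(len(names), n_chunks)
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if i < extra else 0)
        yield names[start:start + size]
        start += size


chunks = list(chunk_names(["a", "b", "c", "d", "e"], 2))
```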
- concatenate_string_arrays(*arrays: Any) numpy.ndarray#
Concatenate arrays of strings.
- Parameters:
arrays (List[np.ndarray]) – A list of numpy arrays containing strings.
- Returns:
A single concatenated numpy array of strings.
- Return type:
np.ndarray
- create_statistics_group() None#
Initializes a DataArray for statistics with NaN values.
This method creates a DataArray with dimensions ‘response_vars’ and ‘statistics’, where ‘response_vars’ corresponds to the response variables in the dataset, and ‘statistics’ includes statistics such as Rho, RMSE, SMSE, EXPV, MLL, and ShapiroW. The DataArray is filled with NaN values initially.
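The layout described above can be pictured as a table of response variables by statistics, filled with NaN. The sketch uses a nested dictionary rather than an xarray DataArray, and the helper `empty_statistics` and the response variable names are hypothetical.

```python
import math
from typing import Dict, List

# Statistic names taken from the description above.
STATISTICS = ["Rho", "RMSE", "SMSE", "EXPV", "MLL", "ShapiroW"]


def empty_statistics(response_vars: List[str]) -> Dict[str, Dict[str, float]]:
    """Hypothetical helper: NaN-filled table, one row per response variable."""
    return {rv: {stat: math.nan for stat in STATISTICS} for rv in response_vars}


stats = empty_statistics(["thickness", "volume"])
```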
- classmethod from_bids(bids_folder, config_params) NormData#
- Abstractmethod:
Load a normative dataset from a BIDS dataset.
- classmethod from_dataframe(name: str, dataframe: pandas.DataFrame, covariates: List[str] | None = None, batch_effects: List[str] | None = None, response_vars: List[str | LiteralString] | None = None, subject_ids: str | None = None, remove_Nan: bool = False, remove_outliers: bool = False, z_threshold: float = 3.0, attrs: Mapping[str, Any] | None = None) NormData#
Load a normative dataset from a pandas DataFrame.
- Parameters:
name (str) – The name you want to give to the dataset. Will be used to name saved results.
dataframe (pd.DataFrame) – The pandas DataFrame to load.
covariates (List[str]) – The list of column names to be used as covariates in the dataset.
batch_effects (List[str]) – The list of column names to be used as batch effects in the dataset.
response_vars (List[str]) – The list of column names to be used as response variables in the dataset.
subject_ids (str) – The name of the column containing the subject IDs.
attrs (Mapping[str, Any] | None, optional) – Additional attributes for the dataset, by default None.
remove_Nan (bool) – Whether to remove NaN values from the dataframe before creating the class object, by default False.
- Returns:
An instance of NormData.
- Return type:
NormData
- classmethod from_fsl(fsl_folder, config_params) NormData#
- Abstractmethod:
Load a normative dataset from an FSL file.
- classmethod from_ndarrays(name: str, X: numpy.ndarray, Y: numpy.ndarray, batch_effects: numpy.ndarray | None = None, subject_ids: numpy.ndarray | None = None, attrs: Mapping[str, Any] | None = None, remove_outliers: bool = False, z_threshold: float = 3.0, remove_Nan: bool = False) NormData#
Create a NormData object from numpy arrays via DataFrame conversion.
- classmethod from_netcdf(name: str, netcdf_path: str) NormData#
Load a normative dataset from a netcdf file.
- classmethod from_paths(name: str, covariates_path: str, responses_path: str, batch_effects_path: str, **kwargs) NormData#
Load a normative dataset from a dictionary of paths.
- classmethod from_xarray(name: str, xarray_dataset: xarray.Dataset) NormData#
Load a normative dataset from an xarray dataset.
- get_single_batch_effect() Dict[str, List[str]]#
Get a single batch effect for each dimension.
- Returns:
A dictionary mapping each batch effect dimension to a list containing a single value.
- Return type:
Dict[str,List[str]]
- get_statistics_df() pandas.DataFrame#
Get the statistics as a pandas DataFrame.
- has_registered_metadata() bool#
Check if the batch effect and covariate metadata have been registered and are non-empty.
- Returns:
True if all required metadata attributes exist and are not empty, False otherwise.
- Return type:
bool
- kfold_split(k: int) Generator[Tuple[numpy.typing.ArrayLike[int], numpy.typing.ArrayLike[int]], Any, Any]#
Perform k-fold splitting of the data.
- Parameters:
k (int) – The number of folds.
- Returns:
A generator yielding training and testing indices for each fold.
- Return type:
Generator[Tuple[ArrayLike[int],ArrayLike[int]],Any,Any]
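To show the shape of what such a generator yields, here is a minimal sketch of k-fold index generation. It assumes contiguous, unshuffled folds. The library may shuffle or stratify, and `kfold_indices` is a hypothetical name.

```python
from typing import Iterator, List, Tuple


def kfold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical helper: yield (train_indices, test_indices) for each fold."""
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(kfold_indices(6, 3))
```

Each index appears in exactly one test fold, and in the training set of every other fold.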
- load_results(save_dir: str) None#
Loads the results (z-scores, centiles, logp, statistics) back into the data.
- Parameters:
save_dir (str) – Directory where the results are saved, for example {save_dir}/Z_fit_test.csv.
- merge(other: NormData, name: str | None = None) NormData#
Merge two NormData objects.
Drops all columns that are not present in both datasets.
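The column-dropping rule can be sketched on plain column dictionaries: only columns present in both inputs survive, and their values are concatenated. The helper `merge_columns` and the sample columns are hypothetical, and the real method operates on NormData objects rather than dictionaries.

```python
from typing import Dict, List


def merge_columns(
    left: Dict[str, List[float]], right: Dict[str, List[float]]
) -> Dict[str, List[float]]:
    """Hypothetical helper: keep shared columns and concatenate their values."""
    shared = [col for col in left if col in right]
    return {col: left[col] + right[col] for col in shared}


first = {"age": [20.0, 30.0], "y": [1.0, 2.0], "height": [1.7, 1.8]}
second = {"age": [40.0], "y": [3.0]}
merged = merge_columns(first, second)
```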
- classmethod remove_nan(dataframe: pandas.DataFrame) pandas.DataFrame#
Remove NaN values from the dataframe.
- classmethod remove_outliers(dataframe: pandas.DataFrame, continuous_vars: List[str], z_threshold: float = 3.0) pandas.DataFrame#
Remove outliers from the dataframe.
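A common way to apply a `z_threshold` is to drop any row whose z-score exceeds the threshold in absolute value for at least one continuous variable. The sketch below follows that rule under stated assumptions: the population standard deviation is used, rows are plain dictionaries, and `drop_outliers` is a hypothetical name. The library's exact rule is not given here.

```python
import statistics
from typing import Dict, List


def drop_outliers(
    rows: List[Dict[str, float]], continuous_vars: List[str], z_threshold: float = 3.0
) -> List[Dict[str, float]]:
    """Hypothetical helper: drop rows with |z| > z_threshold in any listed variable."""
    keep = [True] * len(rows)
    for var in continuous_vars:
        values = [row[var] for row in rows]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # assumption: population standard deviation
        if std == 0:
            continue  # a constant column has no outliers
        for i, value in enumerate(values):
            if abs((value - mean) / std) > z_threshold:
                keep[i] = False
    return [row for row, ok in zip(rows, keep) if ok]


rows = [{"age": 10.0} for _ in range(9)] + [{"age": 100.0}]
cleaned = drop_outliers(rows, ["age"], z_threshold=2.0)
```

The example uses a threshold of 2.0 because, with only ten samples, a single extreme value cannot reach a population z-score above 3.0.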
- save_results(save_dir: str) None#
Saves the results (z-scores, centiles, logp, statistics) to disk.
- Parameters:
save_dir (str) – Directory where the results are saved, for example {save_dir}/Z_fit_test.csv.
- scale_backward(inscalers: Dict[str, Any], outscalers: Dict[str, Any]) None#
Scale the data backward using provided scalers.
- Parameters:
inscalers (Dict[str, Any]) – Scalers for the covariate data.
outscalers (Dict[str, Any]) – Scalers for the response variable data.
- scale_forward(inscalers: Dict[str, Any], outscalers: Dict[str, Any]) None#
Scale the data forward in-place using provided scalers.
- Parameters:
inscalers (Dict[str, Any]) – Scalers for the covariate data.
outscalers (Dict[str, Any]) – Scalers for the response variable data.
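The relationship between forward and backward scaling can be sketched as a round trip with one scaler per variable name. Everything in the block is an assumption made for illustration: `ZScaler` is a hypothetical standardizing scaler, and the helper functions return new dictionaries, whereas the methods documented here return None and modify the data.

```python
import math
import statistics
from typing import Dict, List


class ZScaler:
    """Hypothetical standardizing scaler; not part of pcntoolkit."""

    def __init__(self, values: List[float]) -> None:
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0  # avoid division by zero

    def transform(self, values: List[float]) -> List[float]:
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values: List[float]) -> List[float]:
        return [v * self.std + self.mean for v in values]


def scale_forward(data: Dict[str, List[float]], scalers: Dict[str, ZScaler]):
    # Apply each variable's own scaler, looked up by name.
    return {name: scalers[name].transform(vals) for name, vals in data.items()}


def scale_backward(data: Dict[str, List[float]], scalers: Dict[str, ZScaler]):
    # Undo the forward scaling with the same scalers.
    return {name: scalers[name].inverse_transform(vals) for name, vals in data.items()}


covariates = {"age": [20.0, 30.0, 40.0]}
inscalers = {name: ZScaler(vals) for name, vals in covariates.items()}
scaled = scale_forward(covariates, inscalers)
restored = scale_backward(scaled, inscalers)
```

Scaling backward with the same scalers recovers the original values up to floating-point error.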
- select_batch_effects(name: str, batch_effects: Dict[str, List[str]], invert: bool = False) NormData#
Select observations matching (or not matching) batch effects.
- Parameters:
name (str) – Name to assign to the returned NormData instance.
batch_effects (Dict[str, List[str]]) – A dictionary mapping batch effect dimensions to lists of values to select.
invert (bool, optional) – If True, return observations that do not match any of the specified batch effect values. Default is False.
- Returns:
A NormData instance containing observations matching (or not matching) the specified batch effects.
- Return type:
NormData
- to_dataframe(dim_order: Sequence[Hashable] | None = None) pandas.DataFrame#
Convert the NormData instance to a pandas DataFrame.
- Parameters:
dim_order (Sequence[Hashable] | None, optional) – The order of dimensions for the DataFrame, by default None.
- Returns:
A DataFrame representation of the NormData instance.
- Return type:
pd.DataFrame
- train_test_split(splits: Tuple[float, Ellipsis] | List[float] | float = 0.8, split_names: Tuple[str, Ellipsis] | None = None, random_state: int = 42) Tuple[NormData, Ellipsis]#
Split the data into training and testing datasets.
- Parameters:
splits (Tuple[float, ...] | List[float] | float) – A tuple or list such as (train_size, test_size) giving the proportion of data for each split, or a single float giving the proportion of data for the training set.
split_names (Tuple[str, ...] | None, optional) – Names for the splits, by default None.
random_state (int, optional) – Random state for splits, by default 42.
- Returns:
A tuple containing the training and testing NormData instances.
- Return type:
Tuple[NormData, ...]
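How `splits` and `random_state` interact can be sketched on bare indices. This is a plain random split under stated assumptions: `split_indices` is a hypothetical helper, and any stratification the library may perform (for example by batch effect) is not modeled.

```python
import random
from typing import List, Sequence, Tuple, Union


def split_indices(
    n_samples: int,
    splits: Union[Sequence[float], float] = 0.8,
    random_state: int = 42,
) -> Tuple[List[int], ...]:
    """Hypothetical helper: shuffle indices reproducibly and cut them by proportion."""
    if isinstance(splits, float):
        # A single float is the training proportion; the remainder is the test set.
        splits = (splits, 1.0 - splits)
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)

    parts: List[List[int]] = []
    start, cumulative = 0, 0.0
    for i, fraction in enumerate(splits):
        cumulative += fraction
        # The last part takes whatever remains, so no sample is lost to rounding.
        end = n_samples if i == len(splits) - 1 else round(cumulative * n_samples)
        parts.append(indices[start:end])
        start = end
    return tuple(parts)


train_idx, test_idx = split_indices(10, (0.8, 0.2))
```

Calling the helper again with the same `random_state` reproduces the same partition.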
- __slots__ = ('unique_batch_effects', 'batch_effect_counts', 'batch_effect_covariate_ranges',...#
- property response_var_list: xarray.DataArray#