The NormData class ================== A key component of the PCNtoolkit is the NormData object. It is a container for the data that will be used to fit the normative model. The NormData object keeps track of the all the dimensions of your data, the features and response variables, batch effects, preprocessing steps, and more. .. code:: ipython3 import copy import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from pcntoolkit import NormData Creating a NormData object -------------------------- There are currently two easy ways to create a NormData object. 1. Load from a pandas dataframe 2. Load from numpy arrays Here are examples of both. .. code:: ipython3 # Creating a NormData object from a pandas dataframe # Download an example dataset: data = pd.read_csv( "https://raw.githubusercontent.com/predictive-clinical-neuroscience/PCNtoolkit-demo/refs/heads/main/data/fcon1000.csv" ) # specify the column names to use covariates = ["age"] batch_effects = ["sex", "site"] response_vars = ["WM-hypointensities", "Left-Lateral-Ventricle", "Brain-Stem"] # create a NormData object norm_data = NormData.from_dataframe( name="fcon1000", dataframe=data, covariates=covariates, batch_effects=batch_effects, response_vars=response_vars, remove_outliers=True, z_threshold=10, ) norm_data.coords .. parsed-literal:: Process: 75157 - 2025-11-20 13:13:40 - Removed 2 outliers for WM-hypointensities Process: 75157 - 2025-11-20 13:13:40 - Removed 2 outliers Process: 75157 - 2025-11-20 13:13:40 - Dataset "fcon1000" created. - 1076 observations - 1076 unique subjects - 1 covariates - 3 response variables - 2 batch effects: sex (2) site (23) .. parsed-literal:: Coordinates: * observations (observations) int64 9kB 0 1 2 3 ... 1072 1073 1074 1075 * response_vars (response_vars)
X Y batch_effects subject_ids
age Brain-Stem Left-Lateral-Ventricle WM-hypointensities sex site subject_ids
0 25.63 20663.2 4049.4 1686.7 1 AnnArbor_a 0
1 18.34 19954.0 9312.6 1371.1 1 AnnArbor_a 1
2 29.20 21645.2 8972.6 1414.8 1 AnnArbor_a 2
3 31.39 20790.6 6798.6 1830.6 1 AnnArbor_a 3
4 13.58 17692.6 6112.5 1642.4 1 AnnArbor_a 4
Inspecting the NormData ----------------------- So let’s go over the attributes of the NormData object. Because it is a subclass of xarray.Dataset, it has all the attributes of a xarray.Dataset, but it has some additional attributes that are specific to normative modelling. The data variables ~~~~~~~~~~~~~~~~~~ The data variables of the NormData object are: - ``X``: The covariates - ``Y``: The response variables - ``batch_effects``: The batch effects - ``subjects``: The subject ids And all these data variables are xarray.DataArrays, with corresponding dimensions, stored in the ``data_vars`` attribute of the NormData object. .. code:: ipython3 norm_data.data_vars .. parsed-literal:: Data variables: subject_ids (observations) int64 9kB 0 1 2 3 4 ... 1072 1073 1074 1075 Y (observations, response_vars) float64 26kB 1.687e+03 ... 1... X (observations, covariates) float64 9kB 25.63 18.34 ... 23.0 batch_effects (observations, batch_effect_dims)
<xarray.NormData> Size: 181kB
    Dimensions:            (observations: 1076, covariates: 1, batch_effect_dims: 2)
    Coordinates:
      * observations       (observations) int64 9kB 0 1 2 3 ... 1072 1073 1074 1075
        response_vars      <U22 88B 'WM-hypointensities'
      * covariates         (covariates) <U3 12B 'age'
      * batch_effect_dims  (batch_effect_dims) <U4 32B 'sex' 'site'
    Data variables:
        subject_ids        (observations) int64 9kB 0 1 2 3 ... 1072 1073 1074 1075
        Y                  (observations) float64 9kB 1.687e+03 1.371e+03 ... 509.1
        X                  (observations, covariates) float64 9kB 25.63 ... 23.0
        batch_effects      (observations, batch_effect_dims) <U17 146kB '1' ... '...
    Attributes:
        real_ids:                       False
        is_scaled:                      False
        name:                           fcon1000
        unique_batch_effects:           {np.str_('sex'): [np.str_('0'), np.str_('...
        batch_effect_counts:            defaultdict(<function NormData.register_b...
        covariate_ranges:               {np.str_('age'): {'mean': np.float64(28.1...
        batch_effect_covariate_ranges:  {np.str_('sex'): {np.str_('0'): {np.str_(...
Selecting a number of observations ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ But we can also filter out a slice of the data. For example, to select the first 10 observations, you can use the ``observations`` coordinate: .. code:: python norm_data.sel(observations=slice(0, 9)) This will return a new NormData object with only the first 10 observations. .. code:: ipython3 norm_data.sel(observations=slice(0, 9)) .. raw:: html
<xarray.NormData> Size: 2kB
    Dimensions:            (observations: 10, response_vars: 3, covariates: 1,
                            batch_effect_dims: 2)
    Coordinates:
      * observations       (observations) int64 80B 0 1 2 3 4 5 6 7 8 9
      * response_vars      (response_vars) <U22 264B 'WM-hypointensities' ... 'Br...
      * covariates         (covariates) <U3 12B 'age'
      * batch_effect_dims  (batch_effect_dims) <U4 32B 'sex' 'site'
    Data variables:
        subject_ids        (observations) int64 80B 0 1 2 3 4 5 6 7 8 9
        Y                  (observations, response_vars) float64 240B 1.687e+03 ....
        X                  (observations, covariates) float64 80B 25.63 ... 19.88
        batch_effects      (observations, batch_effect_dims) <U17 1kB '1' ... 'An...
    Attributes:
        real_ids:                       False
        is_scaled:                      False
        name:                           fcon1000
        unique_batch_effects:           {np.str_('sex'): [np.str_('0'), np.str_('...
        batch_effect_counts:            defaultdict(<function NormData.register_b...
        covariate_ranges:               {np.str_('age'): {'mean': np.float64(28.1...
        batch_effect_covariate_ranges:  {np.str_('sex'): {np.str_('0'): {np.str_(...
NormData with predictions ------------------------- After fitting a model and predicting on NormData, the NormData object will have new attributes holding the predictions. Specifically, the NormData object will be extended with new data variables: - ``Z``: The predicted Z scores for each response variable - ``centiles``: The predicted centiles - ``logp``: The predicted log-p-values for each response variable - ``Yhat``: The predicted mean of the response variable - ``Y_harmonized``: The harmonized response variables - ``statistics``: An array of statistics for each response variable And the following new coordinates: - ``centile``: The specific centile values - ``statistic``: The name of the computed statistics .. code:: ipython3 from pcntoolkit import BLR, NormativeModel # We create a very simple BLR model because it is fast to fit model = NormativeModel(BLR()) model.fit(norm_data) # Fitting on the data also makes predictions for that data .. parsed-literal:: Process: 75157 - 2025-11-20 13:18:58 - Fitting models on 3 response variables. Process: 75157 - 2025-11-20 13:18:58 - Fitting model for WM-hypointensities. Process: 75157 - 2025-11-20 13:18:58 - Fitting model for Left-Lateral-Ventricle. Process: 75157 - 2025-11-20 13:18:58 - Fitting model for Brain-Stem. Process: 75157 - 2025-11-20 13:18:58 - Making predictions on 3 response variables. Process: 75157 - 2025-11-20 13:18:58 - Computing z-scores for 3 response variables. Process: 75157 - 2025-11-20 13:18:58 - Computing z-scores for WM-hypointensities. Process: 75157 - 2025-11-20 13:18:58 - Computing z-scores for Left-Lateral-Ventricle. Process: 75157 - 2025-11-20 13:18:58 - Computing z-scores for Brain-Stem. Process: 75157 - 2025-11-20 13:18:58 - Computing centiles for 3 response variables. Process: 75157 - 2025-11-20 13:18:58 - Computing centiles for WM-hypointensities. Process: 75157 - 2025-11-20 13:18:58 - Computing centiles for Left-Lateral-Ventricle. Process: 75157 - 2025-11-20 13:18:58 - Computing centiles for Brain-Stem. Process: 75157 - 2025-11-20 13:18:58 - Computing log-probabilities for 3 response variables. Process: 75157 - 2025-11-20 13:18:58 - Computing log-probabilities for WM-hypointensities. Process: 75157 - 2025-11-20 13:18:58 - Computing log-probabilities for Left-Lateral-Ventricle. Process: 75157 - 2025-11-20 13:18:58 - Computing log-probabilities for Brain-Stem. Process: 75157 - 2025-11-20 13:18:58 - Computing yhat for 3 response variables. Process: 75157 - 2025-11-20 13:18:58 - Computing yhat for WM-hypointensities. Process: 75157 - 2025-11-20 13:18:58 - Computing yhat for Left-Lateral-Ventricle. Process: 75157 - 2025-11-20 13:18:58 - Computing yhat for Brain-Stem. .. parsed-literal:: /opt/anaconda3/envs/ptk/lib/python3.12/site-packages/pcntoolkit/dataio/norm_data.py:1094: FutureWarning: The previous implementation of stack is deprecated and will be removed in a future version of pandas. See the What's New notes for pandas 2.1.0 for details. Specify future_stack=True to adopt the new implementation and silence this warning. subject_ids = subject_ids.stack(level="centile") .. parsed-literal:: Process: 75157 - 2025-11-20 13:18:58 - Dataset "centile" created. - 150 observations - 150 unique subjects - 1 covariates - 3 response variables - 2 batch effects: sex (1) site (1) Process: 75157 - 2025-11-20 13:18:58 - Computing centiles for 3 response variables. Process: 75157 - 2025-11-20 13:18:58 - Computing centiles for WM-hypointensities. Process: 75157 - 2025-11-20 13:18:58 - Computing centiles for Left-Lateral-Ventricle. Process: 75157 - 2025-11-20 13:18:58 - Computing centiles for Brain-Stem. Process: 75157 - 2025-11-20 13:18:58 - Harmonizing data on 3 response variables. Process: 75157 - 2025-11-20 13:18:58 - Harmonizing data for WM-hypointensities. Process: 75157 - 2025-11-20 13:18:58 - Harmonizing data for Left-Lateral-Ventricle. Process: 75157 - 2025-11-20 13:18:58 - Harmonizing data for Brain-Stem. Process: 75157 - 2025-11-20 13:18:59 - Saving model to: /Users/stijndeboer/.pcntoolkit/saves. .. code:: ipython3 norm_data.data_vars .. parsed-literal:: Data variables: subject_ids (observations) int64 9kB 0 1 2 3 4 ... 1072 1073 1074 1075 Y (observations, response_vars) float64 26kB 1.687e+03 ... 1... X (observations, covariates) float64 9kB 25.63 18.34 ... 23.0 batch_effects (observations, batch_effect_dims)
<xarray.NormData> Size: 5kB
    Dimensions:            (observations: 10, response_vars: 3, covariates: 1,
                            batch_effect_dims: 2, centile: 5, statistic: 11)
    Coordinates:
      * observations       (observations) int64 80B 0 1 2 3 4 5 6 7 8 9
      * response_vars      (response_vars) <U22 264B 'WM-hypointensities' ... 'Br...
      * covariates         (covariates) <U3 12B 'age'
      * batch_effect_dims  (batch_effect_dims) <U4 32B 'sex' 'site'
      * centile            (centile) float64 40B 0.05 0.25 0.5 0.75 0.95
      * statistic          (statistic) <U8 352B 'EXPV' 'MACE' ... 'SMSE' 'ShapiroW'
    Data variables:
        subject_ids        (observations) int64 80B 0 1 2 3 4 5 6 7 8 9
        Y                  (observations, response_vars) float64 240B 1.687e+03 ....
        X                  (observations, covariates) float64 80B 25.63 ... 19.88
        batch_effects      (observations, batch_effect_dims) <U17 1kB '1' ... 'An...
        Z                  (observations, response_vars) float64 240B 0.7423 ... ...
        centiles           (centile, observations, response_vars) float64 1kB 152...
        logp               (observations, response_vars) float64 240B -1.158 ... ...
        Yhat               (observations, response_vars) float64 240B 1.21e+03 .....
        statistics         (response_vars, statistic) float64 264B 0.1172 ... 0.9965
        Y_harmonized       (observations, response_vars) float64 240B 1.687e+03 ....
    Attributes:
        real_ids:                       False
        is_scaled:                      False
        name:                           fcon1000
        unique_batch_effects:           {np.str_('sex'): [np.str_('0'), np.str_('...
        batch_effect_counts:            defaultdict(<function NormData.register_b...
        covariate_ranges:               {np.str_('age'): {'mean': np.float64(28.1...
        batch_effect_covariate_ranges:  {np.str_('sex'): {np.str_('0'): {np.str_(...
Or, if we want to select only the WM-hypointensities, we can use: .. code:: python norm_data.sel(response_vars="WM-hypointensities") This will return a new NormData object with only the WM-hypointensities. .. code:: ipython3 norm_data.sel(response_vars="WM-hypointensities").statistics .. raw:: html
<xarray.DataArray 'statistics' (statistic: 11)> Size: 88B
    array([ 1.17178773e-01,  4.50929368e-02,  3.98458621e-01, -6.56498543e+00,
            1.35593870e+00,  1.17178773e-01,  6.26199688e+02,  4.38152477e-03,
            8.85849246e-01,  8.82821227e-01,  8.47148187e-01])
    Coordinates:
        response_vars  <U22 88B 'WM-hypointensities'
      * statistic      (statistic) <U8 352B 'EXPV' 'MACE' ... 'SMSE' 'ShapiroW'
Now we can use the to_dataframe method to cast that selection back to a pandas dataframe. .. code:: ipython3 new_df = norm_data.sel(response_vars="WM-hypointensities").to_dataframe() new_df.head() .. raw:: html
X Y Y_harmonized Z batch_effects subject_ids centiles
age WM-hypointensities WM-hypointensities WM-hypointensities sex site subject_ids (WM-hypointensities, 0.05) (WM-hypointensities, 0.25) (WM-hypointensities, 0.5) (WM-hypointensities, 0.75) (WM-hypointensities, 0.95)
0 25.63 1686.7 1686.7 0.742311 1 AnnArbor_a 0 152.348837 776.050569 1209.579100 1643.107630 2266.809363
1 18.34 1371.1 1371.1 0.445220 1 AnnArbor_a 1 27.387682 651.236372 1084.867051 1518.497730 2142.346419
2 29.20 1414.8 1414.8 0.224270 1 AnnArbor_a 2 213.436546 837.129588 1270.652078 1704.174569 2327.867611
3 31.39 1830.6 1830.6 0.812878 1 AnnArbor_a 3 250.875616 874.583918 1308.117015 1741.650112 2365.358414
4 13.58 1642.4 1642.4 0.993572 1 AnnArbor_a 4 -54.364209 569.674000 1003.436412 1437.198824 2061.237034
This should give you a pretty good overview of how to work with NormData. Most of the functionality is built on top of xarray, so if you want to learn more about xarray, you can check out the `xarray documentation `__. However, the Xarray.DataSet class does not officially support being extended, so the API does not work completely as expected. If you have any suggestions for improvements, please let us know! Pre-processing and split datasets --------------------------------- Sometimes we have a dataset that is pre-split into train and test, and we want to use that exact data split to fit the model. We can then load the data into two NormData objects, but we have to make sure that the two datasets are compatible. This will ensure that the fitted model is applicable to both of them. .. code:: ipython3 # Download an example dataset: data = pd.read_csv( "https://raw.githubusercontent.com/predictive-clinical-neuroscience/PCNtoolkit-demo/refs/heads/main/data/fcon1000.csv" ) # Create an arbitrary split as a placeholder for a predefined split () train, test = train_test_split(data, test_size=100) .. code:: ipython3 # specify the column names to use covariates = ["age"] batch_effects = ["sex", "site"] response_vars = ["WM-hypointensities", "Left-Lateral-Ventricle", "Brain-Stem"] # create NormData objects norm_train = NormData.from_dataframe( name="train", dataframe=train, covariates=covariates, batch_effects=batch_effects, response_vars=response_vars ) norm_test = NormData.from_dataframe( name="test", dataframe=test, covariates=covariates, batch_effects=batch_effects, response_vars=response_vars ) .. parsed-literal:: Process: 75157 - 2025-11-20 13:19:02 - Dataset "train" created. - 978 observations - 978 unique subjects - 1 covariates - 3 response variables - 2 batch effects: sex (2) site (23) Process: 75157 - 2025-11-20 13:19:02 - Dataset "test" created. - 100 observations - 100 unique subjects - 1 covariates - 3 response variables - 2 batch effects: sex (2) site (18) .. code:: ipython3 # Should print false, because the train and test split do not contain the same sites print(norm_train.check_compatibility(norm_test)) .. parsed-literal:: True .. code:: ipython3 norm_train.check_compatibility(norm_test) .. parsed-literal:: True .. code:: ipython3 norm_train.make_compatible(norm_test) .. code:: ipython3 norm_train.check_compatibility(norm_test) .. parsed-literal:: True