Getting started with normative modelling#

Welcome to this tutorial notebook that will show you the very basics of normative modeling. It’s like the “Hello World” of normative modeling.

Let’s jump right in.

Imports#

import warnings
import pandas as pd
import matplotlib.pyplot as plt
from pcntoolkit import (
    BLR,
    NormativeModel,
    NormData,
    load_fcon1000,
    plot_centiles,
    plot_qq,
)
import pcntoolkit.util.output
import seaborn as sns

sns.set_style("darkgrid")
warnings.simplefilter(action="ignore", category=FutureWarning)
pd.options.mode.chained_assignment = None  # default='warn'
pcntoolkit.util.output.Output.set_show_messages(False)

Load data#

First we download a small example dataset from github.

# Download an example dataset
norm_data: NormData = load_fcon1000()
# Select only these three features to model for this example
norm_data = norm_data.sel({"response_vars": ["WM-hypointensities", "Left-Lateral-Ventricle", "Brain-Stem"]})
# Train-test split
train, test = norm_data.train_test_split()

# Inspect the data
df = train.to_dataframe()
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

sns.countplot(data=df, y=("batch_effects", "site"), hue=("batch_effects", "sex"), ax=ax[0], orient="h")
ax[0].legend(title="Sex")
ax[0].set_title("Count of sites")
ax[0].set_xlabel("Site")
ax[0].set_ylabel("Count")

scatter_feature = "Left-Lateral-Ventricle"

sns.scatterplot(
    data=df,
    x=("X", "age"),
    y=("Y", scatter_feature),
    hue=("batch_effects", "site"),
    style=("batch_effects", "sex"),
    ax=ax[1],
)
ax[1].legend([], [])
ax[1].set_title(f"Scatter plot of age vs {scatter_feature}")
ax[1].set_xlabel("Age")
ax[1].set_ylabel(scatter_feature)

plt.show()

Creating a Normative model#

save_dir = "/Users/stijndeboer/Projects/PCN/PCNtoolkit/examples/saves"
model = NormativeModel(BLR(), inscaler="standardize", outscaler="standardize")

model.has_batch_effect

False

Fit the model#

With all that configured, we can fit the model.

The fit_predict function will fit the model, evaluate it, save the results and plots, and return the test data with all the predictions added.

After that, it will compute Z-scores and centiles for the test set.

All results can be found in the save directory.

model.fit_predict(train, test)

<xarray.NormData> Size: 87kB
Dimensions:            (observations: 216, response_vars: 3, covariates: 1,
                        batch_effect_dims: 2, centile: 5, statistic: 11)
Coordinates:
  * observations       (observations) int64 2kB 756 769 692 616 ... 751 470 1043
  * response_vars      (response_vars) <U22 264B 'WM-hypointensities' ... 'Br...
  * covariates         (covariates) <U3 12B 'age'
  * batch_effect_dims  (batch_effect_dims) <U4 32B 'sex' 'site'
  * centile            (centile) float64 40B 0.05 0.25 0.5 0.75 0.95
  * statistic          (statistic) <U8 352B 'EXPV' 'MACE' ... 'SMSE' 'ShapiroW'
Data variables:
    subject_ids        (observations) object 2kB 'Munchen_sub96752' ... 'Quee...
    Y                  (observations, response_vars) float64 5kB 2.721e+03 .....
    X                  (observations, covariates) float64 2kB 63.0 ... 23.0
    batch_effects      (observations, batch_effect_dims) <U17 29kB 'F' ... 'Q...
    Z                  (observations, response_vars) float64 5kB 0.8681 ... -...
    centiles           (centile, observations, response_vars) float64 26kB 75...
    logp               (observations, response_vars) float64 5kB -1.254 ... -...
    Yhat               (observations, response_vars) float64 5kB 2.041e+03 .....
    statistics         (response_vars, statistic) float64 264B 0.1501 ... 0.9891
    Y_harmonized       (observations, response_vars) float64 5kB 2.721e+03 .....
Attributes:
    real_ids:                       True
    is_scaled:                      False
    name:                           fcon1000_test
    unique_batch_effects:           {np.str_('sex'): [np.str_('F'), np.str_('...
    batch_effect_counts:            defaultdict(<function NormData.register_b...
    covariate_ranges:               {np.str_('age'): {'mean': np.float64(28.2...
    batch_effect_covariate_ranges:  {np.str_('sex'): {np.str_('F'): {np.str_(...

Plotting the centiles#

With the fitted model, and some data, we can plot some centiles. There are a lot of different configurations possible, but here is a simple example.

plot_centiles(model, scatter_data=train)

We see that the model fits the data reasonably well. We can do better, but that is a topic for another tutorial.

Showing the evaluation metrics#

We also computed evaluation metrics for the model. Those are saved in the save_dir/results/statistics.csv file, but are also added to the NormData object as a new data variable.

# We can use the `get_statistics_df` method to get a nicely formatted dataframe with the evaluation metrics.
display(train.get_statistics_df())
display(test.get_statistics_df())

statistic	EXPV	MACE	MAPE	MSLL	NLL	R2	RMSE	Rho	Rho_p	SMSE	ShapiroW
response_vars
Brain-Stem	0.000004	0.006311	0.096330	-7.800671	1.418910	0.000004	2442.168242	-0.050426	1.390658e-01	0.999996	0.996466
Left-Lateral-Ventricle	0.162276	0.053828	0.401734	-8.440148	1.330158	0.162276	3877.069187	0.269669	7.905570e-16	0.837724	0.877568
WM-hypointensities	0.132905	0.068724	0.410158	-6.778665	1.345555	0.132905	760.501165	0.019769	5.621687e-01	0.867095	0.722254

statistic	EXPV	MACE	MAPE	MSLL	NLL	R2	RMSE	Rho	Rho_p	SMSE	ShapiroW
response_vars
Brain-Stem	-0.000274	0.014259	0.103240	-7.790895	1.525974	-0.000309	2692.120004	-0.105739	0.121292	1.000309	0.989058
Left-Lateral-Ventricle	0.175440	0.046296	0.430686	-8.443777	1.404512	0.171542	4168.272079	0.220610	0.001099	0.828458	0.898344
WM-hypointensities	0.150136	0.060741	0.444148	-6.693763	1.130787	0.143994	559.964146	-0.027495	0.687809	0.856006	0.972405

QQ plots#

We also have a nice function to make QQ plots.

plot_qq(test, plot_id_line=True)

And those are the basics of Normative Modelling with the PCNtoolkit. We will go over some more advanced models in the next tutorials, but this should give you a good first impression.