Getting started with normative modelling#
Welcome to this tutorial notebook that will show you the very basics of normative modeling. It’s like the “Hello World” of normative modeling.
Let’s jump right in.
Imports#
import warnings
import pandas as pd
import matplotlib.pyplot as plt
from pcntoolkit import (
BLR,
NormativeModel,
NormData,
load_fcon1000,
plot_centiles,
plot_qq,
)
import pcntoolkit.util.output
import seaborn as sns
sns.set_style("darkgrid")
warnings.simplefilter(action="ignore", category=FutureWarning)
pd.options.mode.chained_assignment = None # default='warn'
pcntoolkit.util.output.Output.set_show_messages(False)
Load data#
First we download a small example dataset from github.
# Download an example dataset
norm_data: NormData = load_fcon1000()
# Select only these three features to model for this example
norm_data = norm_data.sel({"response_vars": ["WM-hypointensities", "Left-Lateral-Ventricle", "Brain-Stem"]})
# Train-test split
train, test = norm_data.train_test_split()
# Inspect the data
df = train.to_dataframe()
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.countplot(data=df, y=("batch_effects", "site"), hue=("batch_effects", "sex"), ax=ax[0], orient="h")
ax[0].legend(title="Sex")
ax[0].set_title("Count of sites")
ax[0].set_xlabel("Site")
ax[0].set_ylabel("Count")
scatter_feature = "Left-Lateral-Ventricle"
sns.scatterplot(
data=df,
x=("X", "age"),
y=("Y", scatter_feature),
hue=("batch_effects", "site"),
style=("batch_effects", "sex"),
ax=ax[1],
)
ax[1].legend([], [])
ax[1].set_title(f"Scatter plot of age vs {scatter_feature}")
ax[1].set_xlabel("Age")
ax[1].set_ylabel(scatter_feature)
plt.show()
Creating a Normative model#
save_dir = "/Users/stijndeboer/Projects/PCN/PCNtoolkit/examples/saves"
model = NormativeModel(BLR(), inscaler="standardize", outscaler="standardize")
model.has_batch_effect
False
Fit the model#
With all that configured, we can fit the model.
The fit_predict function will fit the model, evaluate it, save the
results and plots, and return the test data with all the predictions
added.
After that, it will compute Z-scores and centiles for the test set.
All results can be found in the save directory.
model.fit_predict(train, test)
<xarray.NormData> Size: 87kB
Dimensions: (observations: 216, response_vars: 3, covariates: 1,
batch_effect_dims: 2, centile: 5, statistic: 11)
Coordinates:
* observations (observations) int64 2kB 756 769 692 616 ... 751 470 1043
* response_vars (response_vars) <U22 264B 'WM-hypointensities' ... 'Br...
* covariates (covariates) <U3 12B 'age'
* batch_effect_dims (batch_effect_dims) <U4 32B 'sex' 'site'
* centile (centile) float64 40B 0.05 0.25 0.5 0.75 0.95
* statistic (statistic) <U8 352B 'EXPV' 'MACE' ... 'SMSE' 'ShapiroW'
Data variables:
subject_ids (observations) object 2kB 'Munchen_sub96752' ... 'Quee...
Y (observations, response_vars) float64 5kB 2.721e+03 .....
X (observations, covariates) float64 2kB 63.0 ... 23.0
batch_effects (observations, batch_effect_dims) <U17 29kB 'F' ... 'Q...
Z (observations, response_vars) float64 5kB 0.8681 ... -...
centiles (centile, observations, response_vars) float64 26kB 75...
logp (observations, response_vars) float64 5kB -1.254 ... -...
Yhat (observations, response_vars) float64 5kB 2.041e+03 .....
statistics (response_vars, statistic) float64 264B 0.1501 ... 0.9891
Y_harmonized (observations, response_vars) float64 5kB 2.721e+03 .....
Attributes:
real_ids: True
is_scaled: False
name: fcon1000_test
unique_batch_effects: {np.str_('sex'): [np.str_('F'), np.str_('...
batch_effect_counts: defaultdict(<function NormData.register_b...
covariate_ranges: {np.str_('age'): {'mean': np.float64(28.2...
batch_effect_covariate_ranges: {np.str_('sex'): {np.str_('F'): {np.str_(...Plotting the centiles#
With the fitted model, and some data, we can plot some centiles. There are a lot of different configurations possible, but here is a simple example.
plot_centiles(model, scatter_data=train)
We see that the model fits the data reasonably well. We can do better, but that is a topic for another tutorial.
Showing the evaluation metrics#
We also computed evaluation metrics for the model. Those are saved in
the save_dir/results/statistics.csv file, but are also added to the
NormData object as a new data variable.
# We can use the `get_statistics_df` method to get a nicely formatted dataframe with the evaluation metrics.
display(train.get_statistics_df())
display(test.get_statistics_df())
| statistic | EXPV | MACE | MAPE | MSLL | NLL | R2 | RMSE | Rho | Rho_p | SMSE | ShapiroW |
|---|---|---|---|---|---|---|---|---|---|---|---|
| response_vars | |||||||||||
| Brain-Stem | 0.000004 | 0.006311 | 0.096330 | -7.800671 | 1.418910 | 0.000004 | 2442.168242 | -0.050426 | 1.390658e-01 | 0.999996 | 0.996466 |
| Left-Lateral-Ventricle | 0.162276 | 0.053828 | 0.401734 | -8.440148 | 1.330158 | 0.162276 | 3877.069187 | 0.269669 | 7.905570e-16 | 0.837724 | 0.877568 |
| WM-hypointensities | 0.132905 | 0.068724 | 0.410158 | -6.778665 | 1.345555 | 0.132905 | 760.501165 | 0.019769 | 5.621687e-01 | 0.867095 | 0.722254 |
| statistic | EXPV | MACE | MAPE | MSLL | NLL | R2 | RMSE | Rho | Rho_p | SMSE | ShapiroW |
|---|---|---|---|---|---|---|---|---|---|---|---|
| response_vars | |||||||||||
| Brain-Stem | -0.000274 | 0.014259 | 0.103240 | -7.790895 | 1.525974 | -0.000309 | 2692.120004 | -0.105739 | 0.121292 | 1.000309 | 0.989058 |
| Left-Lateral-Ventricle | 0.175440 | 0.046296 | 0.430686 | -8.443777 | 1.404512 | 0.171542 | 4168.272079 | 0.220610 | 0.001099 | 0.828458 | 0.898344 |
| WM-hypointensities | 0.150136 | 0.060741 | 0.444148 | -6.693763 | 1.130787 | 0.143994 | 559.964146 | -0.027495 | 0.687809 | 0.856006 | 0.972405 |
QQ plots#
We also have a nice function to make QQ plots.
plot_qq(test, plot_id_line=True)
And those are the basics of Normative Modelling with the PCNtoolkit. We will go over some more advanced models in the next tutorials, but this should give you a good first impression.