Evaluation metrics#

After fitting a normative model and running predictions, PCNtoolkit computes a set of evaluation metrics. This notebook explains what each metric measures, how it is computed, and how to interpret the values.


Two families of metrics#

| Family | Uses | Metrics |
|---|---|---|
| Point prediction | Only Yhat (median prediction) | MAPE, RMSE, SMSE, R², EXPV, Rho |
| Probabilistic | Full predicted distribution (logp, centiles, Z-scores) | MACE, MSLL, MLL, ShapiroW, Skewness, Kurtosis |

Point prediction metrics only look at whether the model’s best guess (the median prediction) is close to the true value, like checking if a weather forecast said “18°C” when it was actually 20°C.

Probabilistic metrics check whether the model’s uncertainty estimates are accurate. For example, did it say “I’m 90% sure the value falls between 15°C and 25°C”, and was it actually right 90% of the time? This matters for normative models because their main outputs (centiles and Z-scores) are derived from the estimated uncertainty.

Setup#

You can access the evaluation metrics with the code below:

# Run predictions on the test data with a fitted normative model (`model`)
model.predict(test_data)

# Access the statistics DataArray
test_data['statistics']  # shape: (n_response_vars, n_metrics)

# As a readable DataFrame
test_data.get_statistics_df()

Point prediction metrics#

These metrics all compare Y (true values) against Yhat (the model’s median prediction). They tell you how well the model tracks the central tendency of the data, but they say nothing about the quality of the uncertainty estimates.


R² — Coefficient of determination#

\[R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\]

R² answers the question: how much better is my model than simply always predicting the mean?

Unlike EXPV, R² is penalized by systematic mean shifts.

  • ≤1 — higher is better

  • 0 = no better than mean

  • negative = worse than mean
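The three reference cases above can be reproduced with a minimal NumPy sketch of the formula. The helper name `r2` and the toy data are illustrative assumptions, not PCNtoolkit's own implementation.

```python
import numpy as np

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    ss_res = np.sum((y - yhat) ** 2)      # squared error of the model
    ss_tot = np.sum((y - y.mean()) ** 2)  # squared error of the mean predictor
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r2(y, y)                       # exact predictions
r2_mean = r2(y, np.full_like(y, y.mean()))  # always predict the mean
r2_shifted = r2(y, y + 2.0)                 # right shape, constant offset of +2
print(r2_perfect, r2_mean, r2_shifted)
```

The shifted case comes out negative even though the predictions follow every change in y, which is the mean-shift penalty mentioned above.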


EXPV — Explained variance#

\[\text{EXPV} = 1 - \frac{\text{Var}(y - \hat{y} - \overline{(y - \hat{y})})}{\text{Var}(y)}\]

Similar to R², but it measures how much of the variance in the true values is explained by the model, after removing any systematic mean offset from the residuals.

  • Range: ≤ 1 (negative when the residual variance exceeds Var(y)); higher is better

  • A score of 1 means the model perfectly explains the variance in the data

  • A score of 0 means the model explains no more variance than simply predicting the mean
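To see what "after removing any systematic mean offset" means in practice, here is a small NumPy sketch of the formula. The helper name `expv` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def expv(y, yhat):
    """Explained variance: 1 - Var(residuals) / Var(y)."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    # np.var centres the residuals itself, so a constant offset drops out
    return 1.0 - np.var(y - yhat) / np.var(y)

y = np.array([1.0, 2.0, 3.0, 4.0])
expv_shifted = expv(y, y + 2.0)                       # constant offset only
expv_noisy = expv(y, np.array([1.5, 1.5, 3.5, 3.5]))  # zero-mean residuals
print(expv_shifted, expv_noisy)
```

A prediction that is off by a constant still reaches EXPV = 1, because only the spread of the residuals enters the formula.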


RMSE — Root mean squared error#

\[\text{RMSE} = \sqrt{\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2}\]

The average magnitude of prediction error, in the same units as the response variable. Larger errors are penalized more than small ones due to the squaring.

  • Range: 0 to ∞ — lower is better


SMSE — Standardized mean squared error#

\[\text{SMSE} = \frac{\text{MSE}}{\text{Var}(y)} = \frac{\frac{1}{n}\sum_i(y_i - \hat{y}_i)^2}{\text{Var}(y)}\]

SMSE normalizes the MSE by the variance of the target, making it scale-independent and comparable across different response variables.

  • SMSE = 1 means your model does no better than always predicting the mean

  • SMSE < 1 means improvement over the mean predictor

  • SMSE is directly related to R²: SMSE ≈ 1 − R² (the two agree exactly when Var(y) is computed with the 1/n normaliser on the same data)
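The difference between RMSE and SMSE is easiest to see by changing the units of the response. The sketch below uses plain NumPy on made-up numbers; the variable names are illustrative and the variance uses NumPy's default 1/n normaliser, which is an assumption about the convention.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.5, 1.5, 3.5, 3.5])

mse = np.mean((y - yhat) ** 2)
rmse = np.sqrt(mse)     # same units as y
smse = mse / np.var(y)  # unitless; np.var uses the 1/n normaliser by default
r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Express the same data in units 1000 times smaller
scale = 1000.0
mse_scaled = np.mean((scale * y - scale * yhat) ** 2)
rmse_scaled = np.sqrt(mse_scaled)
smse_scaled = mse_scaled / np.var(scale * y)

print(rmse, rmse_scaled)  # RMSE scales with the units
print(smse, smse_scaled)  # SMSE does not
print(1.0 - r2)           # matches SMSE under the 1/n convention
```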


MAPE — Mean absolute percentage error#

\[\text{MAPE} = \frac{1}{n} \sum_i \frac{|y_i - \hat{y}_i|}{|y_i|}\]

The average absolute percentage error between predictions and true values. Scale-independent, making it interpretable without knowing the units of the response variable.

  • Range: 0 to ∞

  • lower is better

⚠️ Note: MAPE is undefined when any true value \(y_i = 0\), and becomes unstable when values are close to zero. Use with caution for response variables that can be near zero.
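The instability near zero is easy to reproduce. In this NumPy sketch (the helper name `mape` and the numbers are illustrative assumptions), a single true value close to zero dominates the average even though its absolute error is small.

```python
import numpy as np

def mape(y, yhat):
    """Mean absolute percentage error, expressed as a fraction."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    return np.mean(np.abs(y - yhat) / np.abs(y))

# Every prediction is off by 10% of the true value
mape_far = mape([100.0, 200.0, 50.0], [110.0, 180.0, 55.0])
# Same as above, except the first true value is 0.01 and the error is only 0.5
mape_near = mape([0.01, 200.0, 50.0], [0.51, 180.0, 55.0])
print(mape_far, mape_near)
```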


Rho — Spearman rank correlation#

\[\rho = \text{Pearson}(\text{rank}(y),\ \text{rank}(\hat{y}))\]

The rank-based correlation between true values and predictions. It checks if the model correctly orders subjects (e.g., if subject A truly has more WM-hypointensities than subject B, does the model also predict a higher value for A than for B?). Unlike Pearson correlation, Spearman’s ρ is robust to outliers and does not assume a linear relationship.

  • Range: −1 to 1

  • higher is better

  • Rho_p is the associated p-value testing whether ρ is significantly different from zero
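The following sketch uses `scipy.stats.spearmanr` to show the rank-based definition at work; the toy data are an assumption chosen so that the relationship is monotonic but far from linear. It is meant as an illustration of the formula, not as a description of how PCNtoolkit computes Rho internally.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
yhat = np.exp(y)  # preserves the ordering of y, but is strongly non-linear

rho, rho_p = spearmanr(y, yhat)                              # rank correlation and its p-value
rho_manual = np.corrcoef(rankdata(y), rankdata(yhat))[0, 1]  # Pearson on the ranks
pearson = np.corrcoef(y, yhat)[0, 1]                         # Pearson on the raw values

print(rho, rho_manual, pearson)
```

Because the ordering of subjects is preserved exactly, ρ equals 1, while the Pearson correlation on the raw values is lower.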

Probabilistic metrics#


MLL — Mean log loss#

\[\text{MLL} = -\frac{1}{n} \sum_i \log p(y_i \mid \mathcal{D}, x_*)\]

Note: In earlier PCNtoolkit releases, this metric was called NLL (Negative Log Likelihood). It is now named MLL to match the literature and avoid confusion with the different NLL used internally for BLR hyperparameter estimation.

where:

  • \(y_i\): the \(i\)-th value of the test or training response variable. The test set is the usual choice, because it shows how well a normative model fitted on training data generalises to unseen data.

  • \(\mathcal{D}\): the training dataset used to fit the model

  • \(x_*\): the test covariate

  • \(p(y_i \mid \mathcal{D}, x_*)\): the probability density the model assigns to the true value given the test input

Measures how “surprised” the model is by the data y, on average.

  • Range: −∞ to ∞ (for a continuous response the log loss is negative wherever the predicted density exceeds 1)

  • lower is better

⚠️ Important: MLL is an absolute quantity that is scale-dependent (it depends on the units and variance of the response variable). This makes it difficult to interpret in isolation. To compare models meaningfully, use MSLL instead, which normalizes MLL against a baseline.

This metric is adopted from Section 2.5 of the book Gaussian Processes for Machine Learning by C. E. Rasmussen & C. K. I. Williams.
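The scale dependence can be made concrete with a small sketch. It assumes Gaussian predictive densities with a per-subject mean `mu` and standard deviation `sigma`; this is an assumption for illustration, and a model with a different likelihood would supply its own log densities. The helper name `mll_gaussian` is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def mll_gaussian(y, mu, sigma):
    """Mean log loss under Gaussian predictive densities N(mu_i, sigma_i^2)."""
    return -np.mean(norm.logpdf(y, loc=mu, scale=sigma))

y = np.array([1.0, 2.0, 3.0, 4.0])
mu = np.array([1.2, 1.9, 3.3, 3.8])
sigma = np.full_like(y, 0.5)

mll = mll_gaussian(y, mu, sigma)
# Identical data and predictions, expressed in units 1000 times smaller
mll_rescaled = mll_gaussian(1000 * y, 1000 * mu, 1000 * sigma)

print(mll, mll_rescaled, mll_rescaled - mll)
```

The predictions are equally good in both cases, yet MLL differs by log(1000), which is why raw MLL values should not be compared across response variables with different units or variances.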


MSLL — Mean standardized log loss#

\[\text{MSLL} = \underbrace{-\frac{1}{n}\sum_i \log p(y_i \mid \mathcal{D}, x_*)}_{\text{MLL}_{\text{model}}} - \underbrace{\left(-\frac{1}{n}\sum_i \log \mathcal{N}\!\left(y_i \mid \bar{y},\, \hat{\sigma}^2\right)\right)}_{\text{MLL}_{\text{Gaussian baseline}}}\]

where the Gaussian baseline fits a single normal distribution to the training responses:

  • \(\bar{y} = \frac{1}{n}\sum_i y_i\): training sample mean

  • \(\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2\): training sample variance

MSLL is a relative metric. It compares the model’s mean log loss against a Gaussian baseline. The “standardized” in the name refers to this subtraction.

| Value | Meaning |
|---|---|
| MSLL < 0 | Model beats the Gaussian baseline |
| MSLL = 0 | Model is equivalent to the Gaussian baseline |
| MSLL > 0 | Model is worse than the Gaussian baseline |

This metric is adopted from Section 2.5 of the book Gaussian Processes for Machine Learning by C. E. Rasmussen & C. K. I. Williams.
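A minimal sketch of the subtraction, under stated assumptions: the model's per-subject log densities are passed in as an array `logp_model`, the baseline is a single Gaussian fitted to the training responses with the 1/n variance, and the synthetic data and the helper name `msll` are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def msll(y_test, logp_model, y_train):
    """Model mean log loss minus the mean log loss of a Gaussian baseline."""
    mll_model = -np.mean(logp_model)
    # Baseline: one normal distribution fitted to the training responses
    baseline_logp = norm.logpdf(y_test, loc=y_train.mean(), scale=y_train.std())
    return mll_model - (-np.mean(baseline_logp))

rng = np.random.default_rng(0)
x_train, x_test = rng.uniform(0, 10, 500), rng.uniform(0, 10, 500)
y_train = 2.0 * x_train + rng.normal(0, 0.5, 500)
y_test = 2.0 * x_test + rng.normal(0, 0.5, 500)

# Hypothetical model that knows the true trend and noise level
logp_good = norm.logpdf(y_test, loc=2.0 * x_test, scale=0.5)
# A "model" that is identical to the baseline
logp_same = norm.logpdf(y_test, loc=y_train.mean(), scale=y_train.std())

msll_good = msll(y_test, logp_good, y_train)
msll_same = msll(y_test, logp_same, y_train)
print(msll_good, msll_same)
```

The model that uses the covariate scores well below zero, while the model that merely reproduces the baseline scores zero.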


MACE — Mean absolute centile error#

\[\text{MACE} = \frac{1}{b} \sum_{k=1}^{b} \left( \frac{1}{m} \sum_{j=1}^{m} \left| q_j - \frac{\sum_{i=1}^{n} \mathbf{1}\{\hat{q}_{ij} \geq y_i\}}{n} \right| \right)\]

where:

  • \(b\) is the number of unique combinations of batch effects

  • \(m\) is the number of centiles used for calibration

  • \(q_j\) is the \(j\)-th target centile level (e.g. 0.05, 0.25, 0.50, 0.75, 0.95)

  • \(\hat{q}_{ij}\) is the predicted \(j\)-th centile value for the \(i\)-th subject

  • \(y_i\) is the true value for the \(i\)-th subject

  • \(n\) is the number of subjects in the batch group

  • \(\mathbf{1}\{\hat{q}_{ij} \geq y_i\}\) is an indicator function that outputs 1 when \(y_i\) lies at or below its predicted \(j\)-th centile value and 0 when it lies above. So \(\frac{\sum_i \mathbf{1}\{\hat{q}_{ij} \geq y_i\}}{n}\) is the empirical fraction of subjects below the \(j\)-th centile curve

The maths above might seem complicated. Put simply, MACE checks, for each predicted centile level (e.g. the 10th, 25th, 50th, 75th and 95th centile curves), what fraction of subjects actually falls below it in the data. A perfectly calibrated model has exactly 10% of subjects below its 10th centile, 25% below its 25th centile, and so on. MACE averages the absolute deviation from this perfect calibration across all centile levels.

Important: MACE is averaged across unique combinations of batch effects (e.g., site and sex combinations), and each combination contributes equally. Small groups therefore have the same influence as large groups, so they may add a disproportionate amount of noise to MACE.

  • MACE values close to 0 indicate the predicted centile curves closely match the distribution of the data.

Connection to the QQ plot: The QQ plot is the “uncompressed” version of MACE. Each point on the QQ plot corresponds to the centile error at a specific quantile level. Systematic deviations from the diagonal (e.g. an S-curve or U-curve) indicate where along the distribution calibration breaks down, which is information that MACE collapses into a single number.

This metric is adopted from equation 4 of the following paper: Zamanzadeh, M., Verduyn, Y., de Boer, A. et al. Normative modeling of MEG brain oscillations across the human lifespan. Commun Biol (2026). https://doi.org/10.1038/s42003-026-09825-2
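The formula can be sketched in a few lines of NumPy. This is an illustration under assumptions, not PCNtoolkit's implementation: the helper name `mace`, the layout of `centiles` as an (n_subjects, n_levels) array, the use of a single label per subject to encode the batch-effect combination, and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy.stats import norm

def mace(y, centiles, levels, batch):
    """Mean absolute centile error, averaged with equal weight per batch group."""
    y = np.asarray(y, dtype=float)
    centiles = np.asarray(centiles, dtype=float)  # shape (n_subjects, n_levels)
    levels = np.asarray(levels, dtype=float)      # target centile levels q_j
    batch = np.asarray(batch)                     # one group label per subject
    per_group = []
    for group in np.unique(batch):
        in_group = batch == group
        # Empirical fraction of this group's subjects at or below each centile curve
        empirical = np.mean(centiles[in_group] >= y[in_group, None], axis=0)
        per_group.append(np.mean(np.abs(levels - empirical)))
    return float(np.mean(per_group))

rng = np.random.default_rng(0)
n = 2000
y = rng.standard_normal(n)
batch = np.repeat(["site_A", "site_B"], n // 2)
levels = np.array([0.05, 0.25, 0.50, 0.75, 0.95])

calibrated = np.tile(norm.ppf(levels), (n, 1))             # correct centile values
too_narrow = np.tile(norm.ppf(levels, scale=0.5), (n, 1))  # predicted spread half the true one

mace_calibrated = mace(y, calibrated, levels, batch)
mace_narrow = mace(y, too_narrow, levels, batch)
print(mace_calibrated, mace_narrow)
```

The calibrated centiles give a MACE close to zero (limited only by sampling noise), while the overconfident centiles give a clearly larger value because too many subjects fall outside the outer curves.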


ShapiroW — Shapiro–Wilk W statistic on Z-scores#

\[W = \frac{\left(\sum_i a_i Z_{(i)}\right)^2}{\sum_i (Z_i - \bar{Z})^2}\]

where \(Z_{(i)}\) are the Z-scores sorted from smallest to largest and \(a_i\) are fixed weights that reflect how much each sorted Z-score should contribute, based on what a perfect normal distribution would look like.

The Z-score for each subject is:

\[Z_i = \frac{y_i - \hat{\mu}_i}{\hat{\sigma}_i}\]

where \(\hat{\mu}_i\) is the model’s predicted mean and \(\hat{\sigma}_i\) is its predicted uncertainty for subject \(i\).

For a normative model, if the model is perfectly calibrated, the Z-scores should follow a standard normal distribution, regardless of whether the original data was Gaussian or not. A W close to 1 means the model successfully normalized the non-Gaussian original data into approximately standard-normal Z-scores.

  • Range: 0 to 1 — closer to 1 is better

| W value | Interpretation |
|---|---|
| W ≈ 1.0 | Z-scores are well-normalized; model can be well-calibrated |
| W ≈ 0.95 | Mild departure from normality, likely slight miscalibration in the tails |
| W ≪ 1.0 | Z-scores are substantially non-normal; model may be missing distributional structure |

You can read more about the Shapiro–Wilk test on its Wikipedia page.
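As a quick illustration, `scipy.stats.shapiro` returns the W statistic together with a p-value. The sketch below applies it to simulated Z-scores; the samples are assumptions chosen to contrast a Gaussian shape with a skewed one.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
z_normal = rng.standard_normal(500)
z_skewed = rng.exponential(size=500) - 1.0  # mean 0, but strongly right-skewed

w_normal, p_normal = shapiro(z_normal)
w_skewed, p_skewed = shapiro(z_skewed)

# W is unchanged when the sample is shifted and rescaled
w_shifted, _ = shapiro(5.0 + 3.0 * z_normal)

print(w_normal, w_skewed, w_shifted)
```

Because W does not change under shifting and rescaling, a high W indicates a Gaussian shape but does not by itself confirm that the Z-scores have mean 0 and standard deviation 1, so it is best read alongside the other calibration metrics.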


Skewness — Skewness of Z-scores#

\[\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_i \left(\frac{Z_i - \bar{Z}}{s}\right)^3\]

where \(s\) is the sample standard deviation, \(n\) is the number of observations.

Skewness measures how asymmetric the Z-score distribution is, i.e. whether one tail is longer than the other. A normative model can be well-calibrated if the Z-scores follow a standard normal distribution, which is symmetric and therefore has skewness = 0.

  • Range: \((-\infty, +\infty)\) - closer to 0 is better

| Skewness value | Interpretation |
|---|---|
| ≈ 0 | Z-scores are symmetric; model can be well-calibrated |
| > 0 | The right tail of the Z-score distribution is longer than the left; the model tends to predict values that are too low |
| < 0 | The left tail of the Z-score distribution is longer than the right; the model tends to predict values that are too high |

⚠️ Because of the denominator in the formula, \(n\) must be at least 3. Otherwise, a NaN value is returned.

A nice visual representation can be found here.
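The formula above is the bias-corrected sample skewness, which is also what `scipy.stats.skew` computes with `bias=False`. The sketch below implements the formula directly and compares the two; the helper name `skewness` and the simulated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import skew

def skewness(z):
    """Bias-corrected sample skewness, written as in the formula above."""
    z = np.asarray(z, dtype=float)
    n = z.size
    s = z.std(ddof=1)  # sample standard deviation
    return n / ((n - 1) * (n - 2)) * np.sum(((z - z.mean()) / s) ** 3)

rng = np.random.default_rng(0)
z_right = rng.exponential(size=1000)  # long right tail

skew_manual = skewness(z_right)
skew_scipy = skew(z_right, bias=False)
skew_mirrored = skewness(-z_right)    # mirroring the data flips the sign

print(skew_manual, skew_scipy, skew_mirrored)
```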


Kurtosis — Excess kurtosis of Z-scores#

\[\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_i \left(\frac{Z_i - \bar{Z}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}\]

where \(s\) is the sample standard deviation, \(n\) is the number of observations. The \(-\,\frac{3(n-1)^2}{(n-2)(n-3)}\) term centres the statistic so that a normal distribution yields exactly 0 (Fisher’s definition of excess kurtosis).

Excess kurtosis measures how fat the tails of the Z-score distribution are relative to a standard normal distribution. A normative model can be well-calibrated if the Z-scores follow a standard normal distribution which is expected to have excess kurtosis = 0.

  • Range: \([-2, +\infty)\) - closer to 0 is better

| Kurtosis value | Interpretation |
|---|---|
| ≈ 0 | Tails match a normal distribution; model can be well-calibrated |
| > 0 | Fatter tails; more outliers than a normal distribution |
| < 0 | Lighter tails; fewer outliers than a normal distribution |

⚠️ Because of the denominator in the formula, \(n\) must be at least 4. Otherwise, a NaN value is returned.

A nice visual representation can be found here, and in this Wikipedia figure you can see seven distributions, each with a different kurtosis value.
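The formula above corresponds to the bias-corrected excess kurtosis, which `scipy.stats.kurtosis` returns with `fisher=True, bias=False`. The sketch below writes the formula out and applies it to simulated heavy-tailed and light-tailed samples; the helper name `excess_kurtosis` and the choice of distributions are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kurtosis

def excess_kurtosis(z):
    """Bias-corrected excess kurtosis, written as in the formula above."""
    z = np.asarray(z, dtype=float)
    n = z.size
    s = z.std(ddof=1)  # sample standard deviation
    fourth = np.sum(((z - z.mean()) / s) ** 4)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

rng = np.random.default_rng(0)
z_heavy = rng.standard_t(df=5, size=5000)  # heavier tails than a normal
z_light = rng.uniform(-1, 1, size=5000)    # lighter tails than a normal

kurt_heavy = excess_kurtosis(z_heavy)
kurt_light = excess_kurtosis(z_light)
kurt_scipy = kurtosis(z_heavy, fisher=True, bias=False)

print(kurt_heavy, kurt_light, kurt_scipy)
```

The heavy-tailed sample gives a positive value and the uniform sample a negative one, matching the interpretation table above.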

Summary table#

| Metric | Family | Input | Better when | Range |
|---|---|---|---|---|
| R² | Point | Y, Yhat | Higher | ≤ 1 |
| EXPV | Point | Y, Yhat | Higher | ≤ 1 |
| Rho | Point | Y, Yhat | Higher | −1 to 1 |
| RMSE | Point | Y, Yhat | Lower | ≥ 0 |
| SMSE | Point | Y, Yhat | Lower | ≥ 0 |
| MAPE | Point | Y, Yhat | Lower | ≥ 0 |
| MLL | Probabilistic | logp | Lower | unbounded |
| MSLL | Probabilistic | logp, baseline_logp | Lower (negative) | unbounded |
| MACE | Probabilistic | centiles, Y | Lower | 0–1 |
| ShapiroW | Probabilistic | Z-scores | Higher | 0–1 |
| Skewness | Probabilistic | Z-scores | Closer to 0 | unbounded |
| Kurtosis | Probabilistic | Z-scores | Closer to 0 | −2 to ∞ |