`Predictive Clinical Neuroscience Toolkit `__ ====================================================================================== Hierarchical Bayesian Regression Normative Modelling and Transfer onto unseen site. =================================================================================== This notebook will go through basic data preparation (training and testing set, `see Saige’s tutorial `__ on Normative Modelling for more detail), the actual training of the models, and will finally describe how to transfer the trained models onto unseen sites. Created by `Saige Rutherford `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ adapted/edited by Andre Marquand and Pierre Berthet ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Step 0: Install necessary libraries & grab data files ----------------------------------------------------- .. code:: ipython3 !pip install pcntoolkit For this tutorial we will use data from the `Functional Connectom Project FCON1000 `__ to create a multi-site dataset. The dataset contains some cortical measures (eg thickness), processed by Freesurfer 6.0, and some covariates (eg age, site, gender). First we import the required package, and create a working directory. .. code:: ipython3 import os import pandas as pd import pcntoolkit as ptk import numpy as np import pickle from matplotlib import pyplot as plt .. code:: ipython3 processing_dir = "HBR_demo" # replace with desired working directory if not os.path.isdir(processing_dir): os.makedirs(processing_dir) os.chdir(processing_dir) processing_dir = os.getcwd() Overview ^^^^^^^^ Here we get the FCON dataset, remove the ICBM site for later transfer, assign some site id to the different scanner sites and print an overview of the left hemisphere mean raw cortical thickness as a function of age, color coded by the various sites: .. code:: ipython3 fcon = pd.read_csv( "https://raw.githubusercontent.com/predictive-clinical-neuroscience/PCNtoolkit-demo/main/data/fcon1000.csv" ) # extract the ICBM site for transfer icbm = fcon.loc[fcon["site"] == "ICBM"] icbm["sitenum"] = 0 # remove from the training set (also Pittsburgh because it only has 3 samples) fcon = fcon.loc[fcon["site"] != "ICBM"] fcon = fcon.loc[fcon["site"] != "Pittsburgh"] sites = fcon["site"].unique() fcon["sitenum"] = 0 f, ax = plt.subplots(figsize=(12, 12)) for i, s in enumerate(sites): idx = fcon["site"] == s fcon["sitenum"].loc[idx] = i print("site", s, sum(idx)) ax.scatter(fcon["age"].loc[idx], fcon["lh_MeanThickness_thickness"].loc[idx]) ax.legend(sites) ax.set_ylabel("LH mean cortical thickness [mm]") ax.set_xlabel("age") Step 1: Prepare training and testing sets ----------------------------------------- Then we randomly split half of the samples (participants) to be either in the training or in the testing samples. We do this for the remaing FCON dataset and for the ICBM data. The transfer function will also require a training and a test sample. The numbers of samples per sites used for training and for testing are then displayed. .. code:: ipython3 tr = np.random.uniform(size=fcon.shape[0]) > 0.5 te = ~tr fcon_tr = fcon.loc[tr] fcon_te = fcon.loc[te] tr = np.random.uniform(size=icbm.shape[0]) > 0.5 te = ~tr icbm_tr = icbm.loc[tr] icbm_te = icbm.loc[te] print("sample size check") for i, s in enumerate(sites): idx = fcon_tr["site"] == s idxte = fcon_te["site"] == s print(i, s, sum(idx), sum(idxte)) fcon_tr.to_csv(processing_dir + "/fcon1000_tr.csv") fcon_te.to_csv(processing_dir + "/fcon1000_te.csv") icbm_tr.to_csv(processing_dir + "/fcon1000_icbm_tr.csv") icbm_te.to_csv(processing_dir + "/fcon1000_icbm_te.csv") Otherwise you can just load these pre defined subsets: .. code:: ipython3 # Optional # fcon_tr = pd.read_csv('https://raw.githubusercontent.com/predictive-clinical-neuroscience/PCNtoolkit-demo/main/data/fcon1000_tr.csv') # fcon_te = pd.read_csv('https://raw.githubusercontent.com/predictive-clinical-neuroscience/PCNtoolkit-demo/main/data/fcon1000_te.csv') # icbm_tr = pd.read_csv('https://raw.githubusercontent.com/predictive-clinical-neuroscience/PCNtoolkit-demo/main/data/fcon1000_icbm_tr.csv') # icbm_te = pd.read_csv('https://raw.githubusercontent.com/predictive-clinical-neuroscience/PCNtoolkit-demo/main/data/fcon1000_icbm_te.csv') Step 2: Configure HBR inputs: covariates, measures and batch effects -------------------------------------------------------------------- We will here only use the mean cortical thickness for the Right and Left hemisphere: two idps. .. code:: ipython3 idps = ["rh_MeanThickness_thickness", "lh_MeanThickness_thickness"] As input to the model, we need covariates (used to describe predictable source of variability (fixed effects), here ‘age’), measures (here cortical thickness on two idps), and batch effects (random source of variability, here ‘scanner site’ and ‘sex’). ``X`` corresponds to the covariate(s) ``Y`` to the measure(s) ``batch_effects`` to the random effects We need these values both for the training (``_train``) and for the testing set (``_test``). .. code:: ipython3 X_train = (fcon_tr["age"] / 100).to_numpy(dtype=float) Y_train = fcon_tr[idps].to_numpy(dtype=float) # configure batch effects for site and sex # batch_effects_train = fcon_tr[['sitenum','sex']].to_numpy(dtype=int) # or only site batch_effects_train = fcon_tr[["sitenum"]].to_numpy(dtype=int) with open("X_train.pkl", "wb") as file: pickle.dump(pd.DataFrame(X_train), file) with open("Y_train.pkl", "wb") as file: pickle.dump(pd.DataFrame(Y_train), file) with open("trbefile.pkl", "wb") as file: pickle.dump(pd.DataFrame(batch_effects_train), file) X_test = (fcon_te["age"] / 100).to_numpy(dtype=float) Y_test = fcon_te[idps].to_numpy(dtype=float) # batch_effects_test = fcon_te[['sitenum','sex']].to_numpy(dtype=int) batch_effects_test = fcon_te[["sitenum"]].to_numpy(dtype=int) with open("X_test.pkl", "wb") as file: pickle.dump(pd.DataFrame(X_test), file) with open("Y_test.pkl", "wb") as file: pickle.dump(pd.DataFrame(Y_test), file) with open("tsbefile.pkl", "wb") as file: pickle.dump(pd.DataFrame(batch_effects_test), file) # a simple function to quickly load pickle files def ldpkl(filename: str): with open(filename, "rb") as f: return pickle.load(f) .. code:: ipython3 batch_effects_test Step 3: Files and Folders grooming ---------------------------------- .. code:: ipython3 respfile = os.path.join( processing_dir, "Y_train.pkl" ) # measurements (eg cortical thickness) of the training samples (columns: the various features/ROIs, rows: observations or subjects) covfile = os.path.join( processing_dir, "X_train.pkl" ) # covariates (eg age) the training samples (columns: covariates, rows: observations or subjects) testrespfile_path = os.path.join( processing_dir, "Y_test.pkl" ) # measurements for the testing samples testcovfile_path = os.path.join( processing_dir, "X_test.pkl" ) # covariate file for the testing samples trbefile = os.path.join( processing_dir, "trbefile.pkl" ) # training batch effects file (eg scanner_id, gender) (columns: the various batch effects, rows: observations or subjects) tsbefile = os.path.join(processing_dir, "tsbefile.pkl") # testing batch effects file output_path = os.path.join( processing_dir, "Models/" ) # output path, where the models will be written log_dir = os.path.join(processing_dir, "log/") # if not os.path.isdir(output_path): os.mkdir(output_path) if not os.path.isdir(log_dir): os.mkdir(log_dir) outputsuffix = "_estimate" # a string to name the output files, of use only to you, so adapt it for your needs. Step 4: Estimating the models ----------------------------- Now we have everything ready to estimate the normative models. The ``estimate`` function only needs the training and testing sets, each divided in three datasets: covariates, measures and batch effects. We obviously specify ``alg=hbr`` to use the hierarchical bayesian regression method, well suited for the multi sites datasets. The remaining arguments are basic data management: where the models, logs, and output files will be written and how they will be named. .. code:: ipython3 ptk.normative.estimate( covfile=covfile, respfile=respfile, tsbefile=tsbefile, trbefile=trbefile, inscaler="standardize", outscaler="standardize", linear_mu="True", random_intercept_mu="True", centered_intercept_mu="True", alg="hbr", log_path=log_dir, binary=True, output_path=output_path, testcov=testcovfile_path, testresp=testrespfile_path, outputsuffix=outputsuffix, savemodel=True, nuts_sampler="nutpie", ) Here some analyses can be done, there are also some error metrics that could be of interest. This is covered in step 6 and in `Saige’s tutorial `__ on Normative Modelling. Step 5: Transfering the models to unseen sites ---------------------------------------------- Similarly to what was done before for the FCON data, we also need to prepare the ICBM specific data, in order to run the transfer function: training and testing set of covariates, measures and batch effects: .. code:: ipython3 X_adapt = (icbm_tr["age"] / 100).to_numpy(dtype=float) Y_adapt = icbm_tr[idps].to_numpy(dtype=float) # batch_effects_adapt = icbm_tr[['sitenum','sex']].to_numpy(dtype=int) batch_effects_adapt = icbm_tr[["sitenum"]].to_numpy(dtype=int) with open("X_adaptation.pkl", "wb") as file: pickle.dump(pd.DataFrame(X_adapt), file) with open("Y_adaptation.pkl", "wb") as file: pickle.dump(pd.DataFrame(Y_adapt), file) with open("adbefile.pkl", "wb") as file: pickle.dump(pd.DataFrame(batch_effects_adapt), file) # Test data (new dataset) X_test_txfr = (icbm_te["age"] / 100).to_numpy(dtype=float) Y_test_txfr = icbm_te[idps].to_numpy(dtype=float) # batch_effects_test_txfr = icbm_te[['sitenum','sex']].to_numpy(dtype=int) batch_effects_test_txfr = icbm_te[["sitenum"]].to_numpy(dtype=int) with open("X_test_txfr.pkl", "wb") as file: pickle.dump(pd.DataFrame(X_test_txfr), file) with open("Y_test_txfr.pkl", "wb") as file: pickle.dump(pd.DataFrame(Y_test_txfr), file) with open("txbefile.pkl", "wb") as file: pickle.dump(pd.DataFrame(batch_effects_test_txfr), file) .. code:: ipython3 respfile = os.path.join(processing_dir, "Y_adaptation.pkl") covfile = os.path.join(processing_dir, "X_adaptation.pkl") testrespfile_path = os.path.join(processing_dir, "Y_test_txfr.pkl") testcovfile_path = os.path.join(processing_dir, "X_test_txfr.pkl") trbefile = os.path.join(processing_dir, "adbefile.pkl") tsbefile = os.path.join(processing_dir, "txbefile.pkl") log_dir = os.path.join(processing_dir, "log_transfer/") output_path = os.path.join(processing_dir, "Transfer/") model_path = os.path.join( processing_dir, "Models/" ) # path to the previously trained models outputsuffix = ( "_transfer" # suffix added to the output files from the transfer function ) Here, the difference is that the transfer function needs a model path, which points to the models we just trained, and new site data (training and testing). That is basically the only difference. .. code:: ipython3 yhat, s2, z_scores = ptk.normative.transfer( covfile=covfile, respfile=respfile, tsbefile=tsbefile, trbefile=trbefile, inscaler="standardize", outscaler="standardize", linear_mu="True", random_intercept_mu="True", centered_intercept_mu="True", model_path=model_path, alg="hbr", log_path=log_dir, binary=True, output_path=output_path, testcov=testcovfile_path, testresp=testrespfile_path, outputsuffix=outputsuffix, savemodel=True, nuts_sampler="nutpie", ) .. code:: ipython3 output_path .. code:: ipython3 EV = pd.read_pickle("EXPV_estimate.pkl") print(EV) And that is it, you now have models that benefited from prior knowledge about different scanner sites to learn on unseen sites. Step 6: Interpreting model performance -------------------------------------- Output evaluation metrics definitions: \* yhat - predictive mean \* ys2 - predictive variance \* nm - normative model \* Z - deviance scores \* Rho - Pearson correlation between true and predicted responses \* pRho - parametric p-value for this correlation \* RMSE - root mean squared error between true/predicted responses \* SMSE - standardised mean squared error \* EV - explained variance \* MSLL - mean standardized log loss \* See page 23 in http://www.gaussianprocess.org/gpml/chapters/RW2.pdf