Using lifespan models to make predictions on new data
-----------------------------------------------------

This notebook shows how to apply the coefficients from pre-estimated
normative models to new data. This can be done in two different ways:
(i) using a new set of data derived from the same sites used to estimate
the model and (ii) on a completely different set of sites. In the latter
case, we also need to estimate the site effect, which requires some
calibration/adaptation data. As an illustrative example, we use a
dataset derived from several `OpenNeuro
datasets <https://openneuro.org/>`__ and adapt the learned model to make
predictions on these data.

First, if necessary, we install PCNtoolkit (note: this tutorial requires
at least version 0.27)

.. code:: ipython3

    !pip install pcntoolkit
    
.. code:: ipython3

    ! git clone https://github.com/predictive-clinical-neuroscience/braincharts.git

.. code:: ipython3

    # we need to be in the scripts folder when we import the libraries in the code block below,
    # because there is a function called nm_utils that is in the scripts folder that we need to import
    import os
    wdir = 'braincharts'
    
    os.chdir(wdir) #this path is setup for running on Google Colab. Change it to match your local path if running locally
    root_dir = os.getcwd()

Now we import the required libraries

.. code:: ipython3

    import numpy as np
    import pandas as pd
    import pickle
    from matplotlib import pyplot as plt
    import seaborn as sns
    
    from pcntoolkit.normative import estimate, predict, evaluate
    from pcntoolkit.util.utils import compute_MSLL, create_design_matrix
    os.chdir(os.path.join(root_dir, 'scripts'))
    from nm_utils import remove_bad_subjects, load_2d
    os.chdir(root_dir)

We need to unzip the models.

.. code:: ipython3

    os.chdir(os.path.join(root_dir, 'models'))

.. code:: ipython3

    # we will use the biggest sample as our training set (approx. N=57000 subjects from 82 sites)
    # for more info on the other pretrained models available in this repository,
    # please refer to the accompanying paper https://elifesciences.org/articles/72904
    ! unzip lifespan_57K_82sites.zip

Next, we configure some basic variables, like where we want the analysis
to be done and which model we want to use.

**Note:** We maintain a list of site ids for each dataset, which
describe the site names in the training and test data (``site_ids_tr``
and ``site_ids_te``), plus also the adaptation data . The training site
ids are provided as a text file in the distribution and the test ids are
extracted automatically from the pandas dataframe (see below). If you
use additional data from the sites (e.g. later waves from ABCD), it may
be necessary to adjust the site names to match the names in the training
set. See the accompanying paper for more details

.. code:: ipython3

    # which model do we wish to use?
    model_name = 'lifespan_57K_82sites'
    site_names = 'site_ids_ct_82sites.txt'
    
    
    # where the data files live
    data_dir = os.path.join(root_dir,'docs')
    
    # where the models live
    out_dir = os.path.join(root_dir, 'models', model_name)
    
    # load a set of site ids from this model. This must match the training data
    with open(os.path.join(root_dir,'docs', site_names)) as f:
        site_ids_tr = f.read().splitlines()

Load test data
~~~~~~~~~~~~~~

**Note:** For the purposes of this tutorial, we make predictions for a
multi-site transfer dataset, derived from
`OpenNeuro <https://openneuro.org/>`__.

.. code:: ipython3

    test_data = os.path.join(data_dir, 'OpenNeuroTransfer_ct_te.csv')
    
    df_te = pd.read_csv(test_data)
    
    # extract a list of unique site ids from the test set
    site_ids_te =  sorted(set(df_te['site'].to_list()))

(Optional) Load adaptation data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the data you wish to make predictions for is not derived from the
same scanning sites as those in the trainig set, it is necessary to
learn the site effect so that we can account for it in the predictions.
In order to do this in an unbiased way, we use a separate dataset, which
we refer to as ‘adaptation’ data. This must contain data for all the
same sites as in the test dataset and we assume these are coded in the
same way, based on a the ‘sitenum’ column in the dataframe.

.. code:: ipython3

    adaptation_data = os.path.join(data_dir, 'OpenNeuroTransfer_ct_ad.csv')
    
    df_ad = pd.read_csv(adaptation_data)
    
    # extract a list of unique site ids from the test set
    site_ids_ad =  sorted(set(df_ad['site'].to_list()))
    
    if not all(elem in site_ids_ad for elem in site_ids_te):
        print('Warning: some of the testing sites are not in the adaptation data')

Configure which models to fit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, we configure which imaging derived phenotypes (IDPs) we would like
to process. This is just a list of column names in the dataframe we have
loaded above.

We could load the whole set (i.e. all phenotypes for which we have
models for …

.. code:: ipython3

    # load the list of idps for left and right hemispheres, plus subcortical regions
    with open(os.path.join(data_dir, 'phenotypes_ct_lh.txt')) as f:
        idp_ids_lh = f.read().splitlines()
    with open(os.path.join(data_dir, 'phenotypes_ct_rh.txt')) as f:
        idp_ids_rh = f.read().splitlines()
    with open(os.path.join(data_dir, 'phenotypes_sc.txt')) as f:
        idp_ids_sc = f.read().splitlines()
    
    # we choose here to process all idps
    idp_ids = idp_ids_lh + idp_ids_rh #+ idp_ids_sc

… or alternatively, we could just specify a list

.. code:: ipython3

    idp_ids = [ 'Left-Thalamus-Proper', 'Left-Lateral-Ventricle', 'rh_MeanThickness_thickness']

Configure covariates
~~~~~~~~~~~~~~~~~~~~

Now, we configure some parameters to fit the model. First, we choose
which columns of the pandas dataframe contain the covariates (age and
sex). The site parameters are configured automatically later on by the
``configure_design_matrix()`` function, when we loop through the IDPs in
the list

The supplied coefficients are derived from a ‘warped’ Bayesian linear
regression model, which uses a nonlinear warping function to model
non-Gaussianity (``sinarcsinh``) plus a non-linear basis expansion (a
cubic b-spline basis set with 5 knot points, which is the default value
in the PCNtoolkit package). Since we are sticking with the default
value, we do not need to specify any parameters for this, but we do need
to specify the limits. We choose to pad the input by a few years either
side of the input range. We will also set a couple of options that
control the estimation of the model

For further details about the likelihood warping approach, see the
accompanying paper and `Fraza et al
2021 <https://www.biorxiv.org/content/10.1101/2021.04.05.438429v1>`__.

.. code:: ipython3

    # which data columns do we wish to use as covariates?
    cols_cov = ['age','sex']
    
    # limits for cubic B-spline basis
    xmin = -5
    xmax = 110
    
    # Absolute Z treshold above which a sample is considered to be an outlier (without fitting any model)
    outlier_thresh = 7

Make predictions
~~~~~~~~~~~~~~~~

This will make predictions for each IDP separately. This is done by
extracting a column from the dataframe (i.e. specifying the IDP as the
response variable) and saving it as a numpy array. Then, we configure
the covariates, which is a numpy data array having the number of rows
equal to the number of datapoints in the test set. The columns are
specified as follows:

- A global intercept (column of ones)
- The covariate columns (here age and sex, coded as 0=female/1=male)
- Dummy coded columns for the sites in the training set (one column per
  site)
- Columns for the basis expansion (seven columns for the default
  parameterisation)

Once these are saved as numpy arrays in ascii format (as here) or
(alternatively) in pickle format, these are passed as inputs to the
``predict()`` method in the PCNtoolkit normative modelling framework.
These are written in the same format to the location specified by
``idp_dir``. At the end of this step, we have a set of predictions and
Z-statistics for the test dataset that we can take forward to further
analysis.

Note that when we need to make predictions on new data, the procedure is
more involved, since we need to prepare, process and store covariates,
response variables and site ids for the adaptation data.

.. code:: ipython3

    for idp_num, idp in enumerate(idp_ids):
        print('Running IDP', idp_num, idp, ':')
        idp_dir = os.path.join(out_dir, idp)
        os.chdir(idp_dir)
    
        # extract and save the response variables for the test set
        y_te = df_te[idp].to_numpy()
    
        # save the variables
        resp_file_te = os.path.join(idp_dir, 'resp_te.txt')
        np.savetxt(resp_file_te, y_te)
    
        # configure and save the design matrix
        cov_file_te = os.path.join(idp_dir, 'cov_bspline_te.txt')
        X_te = create_design_matrix(df_te[cols_cov],
                                    site_ids = df_te['site'],
                                    all_sites = site_ids_tr,
                                    basis = 'bspline',
                                    xmin = xmin,
                                    xmax = xmax)
        np.savetxt(cov_file_te, X_te)
    
        # check whether all sites in the test set are represented in the training set
        if all(elem in site_ids_tr for elem in site_ids_te):
            print('All sites are present in the training data')
    
            # just make predictions
            yhat_te, s2_te, Z = predict(cov_file_te,
                                        alg='blr',
                                        respfile=resp_file_te,
                                        model_path=os.path.join(idp_dir,'Models'))
        else:
            print('Some sites missing from the training data. Adapting model')
    
            # save the covariates for the adaptation data
            X_ad = create_design_matrix(df_ad[cols_cov],
                                        site_ids = df_ad['site'],
                                        all_sites = site_ids_tr,
                                        basis = 'bspline',
                                        xmin = xmin,
                                        xmax = xmax)
            cov_file_ad = os.path.join(idp_dir, 'cov_bspline_ad.txt')
            np.savetxt(cov_file_ad, X_ad)
    
            # save the responses for the adaptation data
            resp_file_ad = os.path.join(idp_dir, 'resp_ad.txt')
            y_ad = df_ad[idp].to_numpy()
            np.savetxt(resp_file_ad, y_ad)
    
            # save the site ids for the adaptation data
            sitenum_file_ad = os.path.join(idp_dir, 'sitenum_ad.txt')
            site_num_ad = df_ad['sitenum'].to_numpy(dtype=int)
            np.savetxt(sitenum_file_ad, site_num_ad)
    
            # save the site ids for the test data
            sitenum_file_te = os.path.join(idp_dir, 'sitenum_te.txt')
            site_num_te = df_te['sitenum'].to_numpy(dtype=int)
            np.savetxt(sitenum_file_te, site_num_te)
    
            yhat_te, s2_te, Z = predict(cov_file_te,
                                        alg = 'blr',
                                        respfile = resp_file_te,
                                        model_path = os.path.join(idp_dir,'Models'),
                                        adaptrespfile = resp_file_ad,
                                        adaptcovfile = cov_file_ad,
                                        adaptvargroupfile = sitenum_file_ad,
                                        testvargroupfile = sitenum_file_te)


.. parsed-literal::

    Running IDP 0 Left-Thalamus-Proper :
    Some sites missing from the training data. Adapting model
    Loading data ...
    Prediction by model  1 of 1
    Evaluating the model ...
    Evaluations Writing outputs ...
    Writing outputs ...
    Running IDP 1 Left-Lateral-Ventricle :
    Some sites missing from the training data. Adapting model
    Loading data ...
    Prediction by model  1 of 1
    Evaluating the model ...
    Evaluations Writing outputs ...
    Writing outputs ...
    Running IDP 2 rh_MeanThickness_thickness :
    Some sites missing from the training data. Adapting model
    Loading data ...
    Prediction by model  1 of 1
    Evaluating the model ...
    Evaluations Writing outputs ...
    Writing outputs ...


Preparing dummy data for plotting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, we plot the centiles of variation estimated by the normative model.

We do this by making use of a set of dummy covariates that span the
whole range of the input space (for age) for a fixed value of the other
covariates (e.g. sex) so that we can make predictions for these dummy
data points, then plot them. We configure these dummy predictions using
the same procedure as we used for the real data. We can use the same
dummy data for all the IDPs we wish to plot

.. code:: ipython3

    # which sex do we want to plot?
    sex = 1 # 1 = male 0 = female
    if sex == 1:
        clr = 'blue';
    else:
        clr = 'red'
    
    # create dummy data for visualisation
    print('configuring dummy data ...')
    xx = np.arange(xmin, xmax, 0.5)
    X0_dummy = np.zeros((len(xx), 2))
    X0_dummy[:,0] = xx
    X0_dummy[:,1] = sex
    
    # create the design matrix
    X_dummy = create_design_matrix(X0_dummy, xmin=xmin, xmax=xmax, site_ids=None, all_sites=site_ids_tr)
    
    # save the dummy covariates
    cov_file_dummy = os.path.join(out_dir,'cov_bspline_dummy_mean.txt')
    np.savetxt(cov_file_dummy, X_dummy)


.. parsed-literal::

    configuring dummy data ...


Plotting the normative models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we loop through the IDPs, plotting each one separately. The outputs
of this step are a set of quantitative regression metrics for each IDP
and a set of centile curves which we plot the test data against.

This part of the code is relatively complex because we need to keep
track of many quantities for the plotting. We also need to remember
whether the data need to be warped or not. By default in PCNtoolkit,
predictions in the form of ``yhat, s2`` are always in the warped
(Gaussian) space. If we want predictions in the input (non-Gaussian)
space, then we need to warp them with the inverse of the estimated
warping function. This can be done using the function
``nm.blr.warp.warp_predictions()``.

**Note:** it is necessary to update the intercept for each of the sites.
For purposes of visualisation, here we do this by adjusting the median
of the data to match the dummy predictions, but note that all the
quantitative metrics are estimated using the predictions that are
adjusted properly using a learned offset (or adjusted using a hold-out
adaptation set, as above). Note also that for the calibration data we
require at least two data points of the same sex in each site to be able
to estimate the variance. Of course, in a real example, you would want
many more than just two since we need to get a reliable estimate of the
variance for each site.

.. code:: ipython3

    sns.set(style='whitegrid')
    
    for idp_num, idp in enumerate(idp_ids):
        print('Running IDP', idp_num, idp, ':')
        idp_dir = os.path.join(out_dir, idp)
        os.chdir(idp_dir)
    
        # load the true data points
        yhat_te = load_2d(os.path.join(idp_dir, 'yhat_predict.txt'))
        s2_te = load_2d(os.path.join(idp_dir, 'ys2_predict.txt'))
        y_te = load_2d(os.path.join(idp_dir, 'resp_te.txt'))
    
        # set up the covariates for the dummy data
        print('Making predictions with dummy covariates (for visualisation)')
        yhat, s2 = predict(cov_file_dummy,
                           alg = 'blr',
                           respfile = None,
                           model_path = os.path.join(idp_dir,'Models'),
                           outputsuffix = '_dummy')
    
        # load the normative model
        with open(os.path.join(idp_dir,'Models', 'NM_0_0_estimate.pkl'), 'rb') as handle:
            nm = pickle.load(handle)
    
        # get the warp and warp parameters
        W = nm.blr.warp
        warp_param = nm.blr.hyp[1:nm.blr.warp.get_n_params()+1]
    
        # first, we warp predictions for the true data and compute evaluation metrics
        med_te = W.warp_predictions(np.squeeze(yhat_te), np.squeeze(s2_te), warp_param)[0]
        med_te = med_te[:, np.newaxis]
        print('metrics:', evaluate(y_te, med_te))
    
        # then, we warp dummy predictions to create the plots
        med, pr_int = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2), warp_param)
    
        # extract the different variance components to visualise
        beta, junk1, junk2 = nm.blr._parse_hyps(nm.blr.hyp, X_dummy)
        s2n = 1/beta # variation (aleatoric uncertainty)
        s2s = s2-s2n # modelling uncertainty (epistemic uncertainty)
    
        # plot the data points
        y_te_rescaled_all = np.zeros_like(y_te)
        for sid, site in enumerate(site_ids_te):
            # plot the true test data points
            if all(elem in site_ids_tr for elem in site_ids_te):
                # all data in the test set are present in the training set
    
                # first, we select the data points belonging to this particular site
                idx = np.where(np.bitwise_and(X_te[:,2] == sex, X_te[:,sid+len(cols_cov)+1] !=0))[0]
                if len(idx) == 0:
                    print('No data for site', sid, site, 'skipping...')
                    continue
    
                # then directly adjust the data
                idx_dummy = np.bitwise_and(X_dummy[:,1] > X_te[idx,1].min(), X_dummy[:,1] < X_te[idx,1].max())
                y_te_rescaled = y_te[idx] - np.median(y_te[idx]) + np.median(med[idx_dummy])
            else:
                # we need to adjust the data based on the adaptation dataset
    
                # first, select the data point belonging to this particular site
                idx = np.where(np.bitwise_and(X_te[:,2] == sex, (df_te['site'] == site).to_numpy()))[0]
    
                # load the adaptation data
                y_ad = load_2d(os.path.join(idp_dir, 'resp_ad.txt'))
                X_ad = load_2d(os.path.join(idp_dir, 'cov_bspline_ad.txt'))
                idx_a = np.where(np.bitwise_and(X_ad[:,2] == sex, (df_ad['site'] == site).to_numpy()))[0]
                if len(idx) < 2 or len(idx_a) < 2:
                    print('Insufficent data for site', sid, site, 'skipping...')
                    continue
    
                # adjust and rescale the data
                y_te_rescaled, s2_rescaled = nm.blr.predict_and_adjust(nm.blr.hyp,
                                                                       X_ad[idx_a,:],
                                                                       np.squeeze(y_ad[idx_a]),
                                                                       Xs=None,
                                                                       ys=np.squeeze(y_te[idx]))
            # plot the (adjusted) data points
            plt.scatter(X_te[idx,1], y_te_rescaled, s=4, color=clr, alpha = 0.1)
    
        # plot the median of the dummy data
        plt.plot(xx, med, clr)
    
        # fill the gaps in between the centiles
        junk, pr_int25 = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2), warp_param, percentiles=[0.25,0.75])
        junk, pr_int95 = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2), warp_param, percentiles=[0.05,0.95])
        junk, pr_int99 = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2), warp_param, percentiles=[0.01,0.99])
        plt.fill_between(xx, pr_int25[:,0], pr_int25[:,1], alpha = 0.1,color=clr)
        plt.fill_between(xx, pr_int95[:,0], pr_int95[:,1], alpha = 0.1,color=clr)
        plt.fill_between(xx, pr_int99[:,0], pr_int99[:,1], alpha = 0.1,color=clr)
    
        # make the width of each centile proportional to the epistemic uncertainty
        junk, pr_int25l = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2-0.5*s2s), warp_param, percentiles=[0.25,0.75])
        junk, pr_int95l = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2-0.5*s2s), warp_param, percentiles=[0.05,0.95])
        junk, pr_int99l = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2-0.5*s2s), warp_param, percentiles=[0.01,0.99])
        junk, pr_int25u = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2+0.5*s2s), warp_param, percentiles=[0.25,0.75])
        junk, pr_int95u = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2+0.5*s2s), warp_param, percentiles=[0.05,0.95])
        junk, pr_int99u = W.warp_predictions(np.squeeze(yhat), np.squeeze(s2+0.5*s2s), warp_param, percentiles=[0.01,0.99])
        plt.fill_between(xx, pr_int25l[:,0], pr_int25u[:,0], alpha = 0.3,color=clr)
        plt.fill_between(xx, pr_int95l[:,0], pr_int95u[:,0], alpha = 0.3,color=clr)
        plt.fill_between(xx, pr_int99l[:,0], pr_int99u[:,0], alpha = 0.3,color=clr)
        plt.fill_between(xx, pr_int25l[:,1], pr_int25u[:,1], alpha = 0.3,color=clr)
        plt.fill_between(xx, pr_int95l[:,1], pr_int95u[:,1], alpha = 0.3,color=clr)
        plt.fill_between(xx, pr_int99l[:,1], pr_int99u[:,1], alpha = 0.3,color=clr)
    
        # plot actual centile lines
        plt.plot(xx, pr_int25[:,0],color=clr, linewidth=0.5)
        plt.plot(xx, pr_int25[:,1],color=clr, linewidth=0.5)
        plt.plot(xx, pr_int95[:,0],color=clr, linewidth=0.5)
        plt.plot(xx, pr_int95[:,1],color=clr, linewidth=0.5)
        plt.plot(xx, pr_int99[:,0],color=clr, linewidth=0.5)
        plt.plot(xx, pr_int99[:,1],color=clr, linewidth=0.5)
    
        plt.xlabel('Age')
        plt.ylabel(idp)
        plt.title(idp)
        plt.xlim((0,90))
        plt.savefig(os.path.join(idp_dir, 'centiles_' + str(sex)),  bbox_inches='tight')
        plt.show()
    
    os.chdir(out_dir)


.. parsed-literal::

    Running IDP 0 Left-Thalamus-Proper :
    Making predictions with dummy covariates (for visualisation)
    Loading data ...
    Prediction by model  1 of 1
    Writing outputs ...
    metrics: {'RMSE': array([0.55690777]), 'Rho': array([0.]), 'pRho': array([1.]), 'SMSE': array([0.]), 'EXPV': array([0.])}


.. image:: apply_normative_models_files/apply_normative_models_ct_27_1.png


.. parsed-literal::

    Running IDP 1 Left-Lateral-Ventricle :
    Making predictions with dummy covariates (for visualisation)
    Loading data ...
    Prediction by model  1 of 1
    Writing outputs ...
    metrics: {'RMSE': array([4205.49266088]), 'Rho': array([0.45898577]), 'pRho': array([5.62632393e-25]), 'SMSE': array([0.81397727]), 'EXPV': array([0.19814613])}


.. image:: apply_normative_models_files/apply_normative_models_ct_27_3.png


.. parsed-literal::

    Running IDP 2 rh_MeanThickness_thickness :
    Making predictions with dummy covariates (for visualisation)
    Loading data ...
    Prediction by model  1 of 1
    Writing outputs ...
    metrics: {'RMSE': array([0.08652435]), 'Rho': array([0.77666469]), 'pRho': array([2.97430261e-103]), 'SMSE': array([0.40227749]), 'EXPV': array([0.59789079])}


.. image:: apply_normative_models_files/apply_normative_models_ct_27_5.png


.. code:: ipython3

    # explore an example output folder of a single model (one ROI)
    # think about what each of these output files represents.
    # Hint: look at the variable names and comments in the code block above
    ! ls rh_MeanThickness_thickness/


.. parsed-literal::

    centiles_1.png	    Models	      Rho_predict.txt	SMSE_predict.txt  ys2_predict.txt
    cov_bspline_ad.txt  pRho_predict.txt  RMSE_predict.txt	yhat_dummy.pkl	  Z_predict.txt
    cov_bspline_te.txt  resp_ad.txt       sitenum_ad.txt	yhat_predict.txt
    EXPV_predict.txt    resp_te.txt       sitenum_te.txt	ys2_dummy.pkl


.. code:: ipython3

    # check that the number of deviation scores matches the number of subjects in the test set
    # there should be one deviation score per subject (one line per subject), so we can
    # verify by counting the line numbers in the Z_predict.txt file
    ! cat rh_MeanThickness_thickness/Z_predict.txt | wc


.. parsed-literal::

        436     436   11115


The deviation scores are output as a text file in separate folders. We
want to summarize the deviation scores across all models estimates so we
can organize them into a single file, and merge the deviation scores
into the original data file.

.. code:: ipython3

    ! mkdir deviation_scores

.. code:: ipython3

    ! for i in *; do if [[ -e ${i}/Z_predict.txt ]]; then cp ${i}/Z_predict.txt deviation_scores/${i}_Z_predict.txt; fi; done

.. code:: ipython3

    z_dir = os.path.join(root_dir, 'models', model_name, 'deviation_scores')
    
    filelist = [name for name in os.listdir(z_dir)]

.. code:: ipython3

    os.chdir(z_dir)
    Z_df = pd.concat([pd.read_csv(item, names=[item[:-4]]) for item in filelist], axis=1)

.. code:: ipython3

    df_te.reset_index(inplace=True)

.. code:: ipython3

    Z_df['sub_id'] = df_te['sub_id']

.. code:: ipython3

    df_te_Z = pd.merge(df_te, Z_df, on='sub_id', how='inner')

.. code:: ipython3

    df_te_Z.to_csv('OpenNeuroTransfer_deviation_scores.csv', index=False)