DeepTICA: Alanine dipeptide from multithermal simulation

Reference paper: Bonati, Piccini and Parrinello, PNAS (2020) [arXiv].

Prerequisite: DeepTICA tutorial.

Open in Colab

Setup

[1]:
# Colab setup
import os

if os.getenv("COLAB_RELEASE_TAG"):
    import subprocess
    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)
    cmd = subprocess.run('bash colab_setup.sh EXAMPLE', shell=True, stdout=subprocess.PIPE)
    print('Done!')

# IMPORT PACKAGES
import torch
import lightning
import numpy as np
import matplotlib.pyplot as plt

# Set seed for reproducibility
torch.manual_seed(42)
[1]:
<torch._C.Generator at 0x7fe41d2d90f0>

Alanine dipeptide: OPES Multithermal -> DeepTICA

In this example we illustrate a case of applying DeepTDA to a biased simulation, taken from the DeepTICA manuscript. In particular, we will use the alanine dipeptide example in which we collect the simulations from the OPES multithermal simulation (which thus do not require to define a trial CV).

Load data

First we load the COLVAR file(s) containing the weights of the OPES simulation and the input features, in this case they are the 45 interatomic distances between heavy atoms.

We discard the initial part of the simulation to put ourselves in the quasi-static regime, which is very quickly reached by OPES (it can be monitored by looking at the DELTAFS file for the multicanonical simulation).

The data is downloaded from the Github repository associated with the manuscript.

[2]:
import pandas as pd
from mlcolvar.utils.io import load_dataframe
from mlcolvar.utils.timelagged import create_timelagged_dataset
from mlcolvar.data import DictModule

# load colvar files containing weights of the OPES simulations as well as the driver with the input features
opes = load_dataframe('https://raw.githubusercontent.com/luigibonati/deep-learning-slow-modes/main/data/ala2-multi/1_opes_multi/COLVAR',
                      start=15000,stop=50000)
feat = load_dataframe('https://raw.githubusercontent.com/luigibonati/deep-learning-slow-modes/main/data/ala2-multi/2_training_cvs/COLVAR_DRIVER',
                          start=15000,stop=50000)

# concatenate them
df = pd.concat([opes, feat[[i for i in feat.columns if (i not in opes.columns)]]], axis=1)

df
[2]:
time phi psi ene opes.bias walker d1 d2 d3 d4 ... d36 d37 d38 d39 d40 d41 d42 d43 d44 d45
0 15000.0 -1.24937 0.540396 -53.9859 1.725970 0 0.1522 0.233868 0.244634 0.382245 ... 0.258908 0.277790 0.379264 0.498986 0.1229 0.1335 0.246437 0.225420 0.283532 0.1449
1 15001.0 -1.50515 1.438650 43.3469 -30.144900 0 0.1522 0.235165 0.241863 0.375188 ... 0.255153 0.295903 0.367831 0.502540 0.1229 0.1335 0.252509 0.223947 0.291499 0.1449
2 15002.0 -1.29419 1.730900 -18.3581 -0.562392 0 0.1522 0.244679 0.231301 0.370506 ... 0.252451 0.285218 0.363133 0.486784 0.1229 0.1335 0.241549 0.226609 0.276819 0.1449
3 15003.0 -2.53166 2.657820 -23.8477 0.715870 0 0.1522 0.234523 0.246725 0.382103 ... 0.250699 0.317424 0.312897 0.441096 0.1229 0.1335 0.246988 0.230302 0.292824 0.1449
4 15004.0 -2.15710 2.395360 11.5125 -14.231600 0 0.1522 0.234793 0.251158 0.383453 ... 0.262807 0.326766 0.339187 0.479339 0.1229 0.1335 0.251390 0.224343 0.292820 0.1449
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34995 49995.0 -1.19288 0.532025 -28.4388 1.276180 0 0.1522 0.232503 0.245793 0.382845 ... 0.258933 0.285152 0.382687 0.493388 0.1229 0.1335 0.242527 0.216974 0.264171 0.1449
34996 49996.0 -1.31044 -0.023952 -33.1943 1.544550 0 0.1522 0.237438 0.245095 0.377144 ... 0.246469 0.276428 0.354929 0.473255 0.1229 0.1335 0.249179 0.221805 0.281781 0.1449
34997 49997.0 -1.66419 2.736990 -32.5060 1.518380 0 0.1522 0.240949 0.244720 0.385074 ... 0.253170 0.327638 0.315540 0.453209 0.1229 0.1335 0.240503 0.223382 0.272337 0.1449
34998 49998.0 -1.71317 2.573260 19.0074 -17.968700 0 0.1522 0.240229 0.247424 0.383137 ... 0.245395 0.310855 0.318514 0.438564 0.1229 0.1335 0.243461 0.226024 0.279091 0.1449
34999 49999.0 -1.26937 2.127590 7.2590 -12.103300 0 0.1522 0.241292 0.235914 0.373783 ... 0.254010 0.289084 0.355414 0.456283 0.1229 0.1335 0.247204 0.220562 0.278850 0.1449

35000 rows × 51 columns

[3]:
# Select input features
X = df.filter(regex='d').values
n_input = X.shape[1]

print(X.shape)
(35000, 45)

Create time-lagged dataset

Compute weights for time rescaling

Here we extract the time \(t\), the energy \(E\) (needed for the multicanonical reweight [1]) and the bias \(V\) from the COLVAR file. We then calculate the weights as:

NB: note that if simulation temperature \(\beta_0\) is equal to the reweighting one \(\beta\) the multicanonical reweight coincides with the standard umbrella sampling-like case.

Once we have computed the weights, we rescale the time at step \(k\) by using the instantaneus acceleration:

\[dt'_k = w_k\ dt\]

and then compute the cumulative rescaled time:

\[t'_k = \sum_{i=0} ^k dt'_i\]

[1] Invernizzi, Piaggi, and Parrinello. “Unified approach to enhanced sampling.” Physical Review X 10.4 (2020): 041034.


Parameter

Type

Description

multicanonical

bool

flag to determine if using a standard reweight (false) or a multicanonical one (true)

temp

float

reweighting temperature

temp0

float

simulation temperature (only needed if multicanonical == True)

[4]:
#------------- PARAMETERS -------------
multicanonical    = True
temp              = 300.
temp0             = 300.
#--------------------------------------

# Calculate inverse temperature
kb=0.008314
beta=1./(kb*temp)
beta0=1./(kb*temp0)

# Extract cvs from df

t = df['time'].values # save time
ene = df['ene'].values.astype(np.float64) # store energy as long double
bias = df.filter(regex='.bias').values.sum(axis=1) # Load *.bias columns and sum them

# Compute log-weights for time reweighting
logweights = beta*bias

if multicanonical:
    ene -= np.mean(ene) #first shift energy by its mean value
    logweights += (beta0-beta)*ene

In order to train the Deep-TICA CVs we will need to compute the time-lagged covariance matrices in the rescaled time \(t'\). The standard way is to look for configurations which are distant a lag-time \(\tau\) in the time series. However, in the rescaled time the time-series is exponentially unevenly spaced. Hence, a naive search will lead to severe numerical issue. To address this, we use the algorithm proposed in [2]. In a nutshell, this method assume that the observable \(O(t'_k)\) have the same value from scaled time \(t'_k\) to \(t'_{k+1}\). This leads to weighting each pair of configurations based both on the rescaled time around \(t'_k\) and \(t'_k+\tau\) (see supp. information of [2] for details). All of this is done under the hood in the function find_time_lagged_configurations. To generate the training set we use the function create_time_lagged_dataset which searches for the pairs of configurations and the corresponding weight.

[2] Yang and Parrinello. “Refining collective coordinates and improving free energy representation in variational enhanced sampling.” Journal of chemical theory and computation 14.6 (2018): 2889-2894.


Parameter

Type

Description

lag_time

float

lag_time for the calculation of the covariance matrices [in rescaled time]

n_train

int

number of training configurations

n_valid

int

number of validation configurations

[5]:
#------------- PARAMETERS -------------
lag_time = 1.
#--------------------------------------

# create dataset
dataset = create_timelagged_dataset(X,t,lag_time=lag_time,logweights=logweights,progress_bar=True)

# create datamodule (split train valid)
datamodule = DictModule(dataset,lengths=[0.8,0.2],random_split=False,shuffle=False)

datamodule
100%|██████████| 34997/34997 [00:03<00:00, 11180.12it/s]
/home/lbonati@iit.local/work/code/mlcolvar120/mlcolvar/utils/timelagged.py:186: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:275.)
  x_t = torch.stack(x_t) if type(x) == torch.Tensor else torch.Tensor(x_t)
/home/lbonati@iit.local/work/code/mlcolvar120/mlcolvar/data/datamodule.py:133: UserWarning: A torch.generator was provided but it is not used with random_split=False
  warnings.warn(
[5]:
DictModule(dataset -> DictDataset( "data": [66163, 45], "data_lag": [66163, 45], "weights": [66163], "weights_lag": [66163] ),
                     train_loader -> DictLoader(length=0.8, batch_size=0, shuffle=False),
                     valid_loader -> DictLoader(length=0.2, batch_size=0, shuffle=False))

Define model

[6]:
from mlcolvar.cvs import DeepTICA

n_components = 2
nn_layers = [n_input, 30, 30, n_components]
options= {'nn': {'activation': 'tanh'}}

model = DeepTICA(nn_layers, options=options)
model
[6]:
DeepTICA(
  (loss_fn): ReduceEigenvaluesLoss()
  (norm_in): Normalization(in_features=45, out_features=45, mode=mean_std)
  (nn): FeedForward(
    (nn): Sequential(
      (0): Linear(in_features=45, out_features=30, bias=True)
      (1): Tanh()
      (2): Linear(in_features=30, out_features=30, bias=True)
      (3): Tanh()
      (4): Linear(in_features=30, out_features=2, bias=True)
    )
  )
  (tica): TICA(in_features=2, out_features=2)
)

Define Trainer & Fit

[7]:
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
from mlcolvar.utils.trainer import MetricsCallback

# define callbacks
metrics = MetricsCallback()
early_stopping = EarlyStopping(monitor="valid_loss", min_delta=1e-3, patience=10)

# define trainer
trainer = lightning.Trainer(callbacks=[metrics, early_stopping],
                     max_epochs=None, logger=None, enable_checkpointing=False)

# fit
trainer.fit( model, datamodule )
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/lbonati@iit.local/software/mambaforge/envs/mlcolvar111/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/home/lbonati@iit.local/software/mambaforge/envs/mlcolvar111/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type                  | Params | In sizes | Out sizes
-------------------------------------------------------------------------
0 | loss_fn | ReduceEigenvaluesLoss | 0      | ?        | ?
1 | norm_in | Normalization         | 0      | [1, 45]  | [1, 45]
2 | nn      | FeedForward           | 2.4 K  | [1, 45]  | [1, 2]
3 | tica    | TICA                  | 0      | [1, 2]   | [1, 2]
-------------------------------------------------------------------------
2.4 K     Trainable params
0         Non-trainable params
2.4 K     Total params
0.009     Total estimated model params size (MB)
[Warning] Normalization: the following features have a range of values < 1e-6: tensor([[ 0],
        [ 9],
        [10],
        [24],
        [30],
        [31],
        [39],
        [40],
        [44]])

/home/lbonati@iit.local/software/mambaforge/envs/mlcolvar111/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py:298: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 398: 100%|██████████| 1/1 [00:00<00:00, 30.05it/s, v_num=5]

We can monitor how the individual eigenvalues are optimized during training.

[8]:
from mlcolvar.utils.plot import plot_metrics

ax = plot_metrics(metrics.metrics,
                  keys=[x for x in  metrics.metrics.keys() if 'valid_eigval' in x],#['train_loss_epoch','valid_loss'],
                  #linestyles=['-.','-'], colors=['fessa1','fessa5'],
                  yscale='linear')
../../_images/notebooks_examples_ex_DeepTICA_20_0.png

Normalize output

For convenience, we standardize the CVs to be between -1 and 1.

[9]:
from mlcolvar.core.transform import Normalization
from mlcolvar.core.transform.utils import Statistics

#X = dataset[:]['data']
with torch.no_grad():
    model.postprocessing = None # reset
    s = model(torch.Tensor(X))

norm =  Normalization(n_components, mode='min_max', stats = Statistics(s) )
model.postprocessing = norm

Compute FES

We can plot the free energy profile along the selected CVs. First we compute it in the 2D space, from which we learn that the first TICA refers to the transition between two long-lived metastable states, while the second the faster transitions within the left basin.

[14]:
from mlcolvar.utils.fes import compute_fes

fig,ax = plt.subplots(1,1,figsize=(6,5),dpi=100)

# compute cvs
with torch.no_grad():
    s = model(torch.Tensor(X)).numpy()

w = np.exp(logweights)
fes,grid,bounds,error = compute_fes(s,
                                    weights=w,
                                    temp=300,
                                    blocks=1,
                                    bandwidth=0.01, scale_by='range',
                                    plot=True, plot_max_fes=20, ax = ax, eps=1e-10)

ax.set_xlabel('Deep-TICA 1')
ax.set_xlabel('Deep-TICA 2')
[14]:
Text(0.5, 0, 'Deep-TICA 2')
../../_images/notebooks_examples_ex_DeepTICA_26_1.png

We can also plot the free energy profiles along each component. From the 2D FES above we understand that for the second CV we need to compute the FES only for the points in which the first CV is < 0, in order to describe the barrier between the faster states.

[ ]:
from mlcolvar.utils.fes import compute_fes

fig,axs = plt.subplots(1,n_components,figsize=(6*n_components,5),dpi=100)

for i in range(n_components):
    w = np.exp(logweights)

    # restrict the second CV to the points in which the first is < 0
    mask = np.ones(len(s), dtype=bool) if i==0 else s[:, 0] < 0

    fes,grid,bounds,error = compute_fes(s[mask,i],
                                        weights=w[mask],
                                        temp=300,
                                        blocks=2,
                                        bandwidth=0.02,
                                        scale_by='range',
                                        plot=True,
                                        plot_max_fes=50,
                                        backend='sklearn', # the default=KDEpy is faster but may be unstable for latest python/numpy builds
                                        ax = axs[i])
    axs[i].set_xlabel('Deep-TICA '+str(i+1))
../../_images/notebooks_examples_ex_DeepTICA_28_0.png

Note that the huge uncertainty on the free energy barrier is due to the lack of points between the states in the multithermal simulation.

Plot CVs isolines in \(\phi-\psi\) space

To understand to what states the CVs refer to, we can visualize them in the Ramachandran plot (\(\phi\) and \(\psi\)).

[17]:
# Hexbin plot in physical space
fig,axs = plt.subplots(1,n_components,figsize=(6*n_components,5),dpi=100)

x = df['phi'].values
y = df['psi'].values

# compute cvs
with torch.no_grad():
    s = model(torch.Tensor(X)).numpy()

for i,ax in enumerate(axs):
    pp = ax.hexbin(x,y,C=s[:,i],gridsize=150,cmap='fessa')
    cbar = plt.colorbar(pp,ax=ax)
    ax.set_title('Deep-TICA '+str(i+1))
    ax.set_xlabel(r'$\phi$ [rad]')
    ax.set_ylabel(r'$\psi$ [rad]')
    cbar.set_label('Deep-TICA '+str(i+1))

plt.tight_layout()
plt.show()
../../_images/notebooks_examples_ex_DeepTICA_32_0.png

You can try out also the other examples reported in the paper, such as applying DeepTICA to a simulation biasing a bad CV (the \(\psi\) dihedral angle).