Customize training¶
Outline¶
The CVs implemented in mlcolvar.cvs are subclasses of lightning.LightningModule, which can be thought of as tasks rather than plain models: they also incorporate the optimizer and the loss function used in the training step. In this tutorial you will learn how to customize the different aspects of the training behaviour:
optimizer
loss function
trainer
Optimizer¶
The optimizer used is returned by the function configure_optimizers which is called by the lightning trainer. The default optimizer is Adam. To change it, or to customize the optimizer’s arguments, you can interact with the CV’s members optimizer_name and optimizer_kwargs.
For instance, this could be used to add an L2 regularization through the weight_decay argument.
[1]:
# Colab setup
import os
if os.getenv("COLAB_RELEASE_TAG"):
    import subprocess
    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)
    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)
    print('Done!')
[2]:
from mlcolvar.cvs import RegressionCV
# define example CV
cv = RegressionCV(layers=[10,5,5,1], options={})
# choose optimizer
cv.optimizer_name = 'Adam'
# choose arguments
cv.optimizer_kwargs = {'weight_decay' : 1e-4 }
print(f'Optimizer: {cv.optimizer_name}')
print(f'Arguments: {cv.optimizer_kwargs}')
Optimizer: Adam
Arguments: {'weight_decay': 0.0001}
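To make the role of these two members more concrete, here is a minimal sketch in plain PyTorch. It assumes that the name is looked up in torch.optim and that the keyword arguments are forwarded to the optimizer constructor; the helper build_optimizer is hypothetical and is not mlcolvar's actual configure_optimizers.

```python
import torch

def build_optimizer(params, optimizer_name="Adam", optimizer_kwargs=None):
    """Hypothetical helper: look up an optimizer class by name and instantiate it."""
    optimizer_cls = getattr(torch.optim, optimizer_name)
    return optimizer_cls(params, **(optimizer_kwargs or {}))

model = torch.nn.Linear(10, 1)
optimizer = build_optimizer(model.parameters(), "Adam", {"weight_decay": 1e-4})
print(type(optimizer).__name__, optimizer.defaults["weight_decay"])

# An unknown keyword is rejected by the optimizer constructor
try:
    build_optimizer(model.parameters(), "Adam", {"not_an_option": 1})
except TypeError as err:
    print(f"TypeError: {err}")
```

Under this assumption, a misspelled keyword would only surface as an error when the optimizer is actually built, so it is worth double-checking the argument names.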
Options for the default Adam optimizer can also be passed through the options parameter of the CV model, under the key optimizer of the dictionary. The provided options will be registered in optimizer_kwargs.
For example, we can set the lr and the weight_decay:
[5]:
# define optimizer options
options = {'optimizer' : {'lr' : 2e-3, 'weight_decay' : 1e-4} }
# define example CV
cv = RegressionCV(layers=[10,5,5,1], options=options)
print(f'optimizer_kwargs: {cv.optimizer_kwargs}')
optimizer_kwargs: {'lr': 0.002, 'weight_decay': 0.0001}
We can also associate a learning rate scheduler with the optimizer, which modifies the learning rate as the optimization proceeds in order to facilitate the training, for example by reducing the learning rate as a function of the epochs.
To do this we can easily use the schedulers implemented in torch.optim.lr_scheduler.
This can also be passed using the options parameter of the CV model using the keyword lr_scheduler in the dictionary. The scheduler object needs to be included under the key scheduler, and the parameters of the chosen scheduler should be passed under the corresponding names.
When using a LR scheduler, the learning rate is saved into the metrics logged by MetricsCallback (as lr).
[6]:
import torch
# choose the scheduler
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR # requires gamma as parameter
# define scheduler options
options = {'lr_scheduler' : { 'scheduler' : lr_scheduler, 'gamma' : 0.9999} }
# define example CV
cv = RegressionCV(layers=[10,5,5,1], options=options)
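As an illustration of what this scheduler does, the sketch below uses plain PyTorch (no mlcolvar) to show that ExponentialLR multiplies the learning rate by gamma at every scheduler step. The initial learning rate of 1e-3 and the number of epochs are arbitrary choices for the example.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9999)

n_epochs = 1000
for epoch in range(n_epochs):
    optimizer.step()   # placeholder for the parameter update (no gradients in this sketch)
    scheduler.step()   # lr <- lr * gamma

final_lr = scheduler.get_last_lr()[0]
expected_lr = 1e-3 * 0.9999**n_epochs
print(f"lr after {n_epochs} scheduler steps: {final_lr:.3e}")
print(f"closed form lr0 * gamma^n:      {expected_lr:.3e}")
```

With gamma this close to 1 the decay is very gradual: it takes about a thousand scheduler steps to reduce the learning rate by roughly 10%.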
For schedulers that require a monitored metric (e.g., ReduceLROnPlateau), you can also pass an lr_scheduler_config dictionary, for example: {'monitor': 'valid_loss'}. This dictionary is merged into the Lightning scheduler config alongside the scheduler object.
[ ]:
# Example: scheduler that requires a monitored metric (ReduceLROnPlateau)
from torch.optim.lr_scheduler import ReduceLROnPlateau
options = {
    'lr_scheduler': {
        'scheduler': ReduceLROnPlateau,
        'mode': 'min',
        'factor': 0.5,
        'patience': 5,
    },
    'lr_scheduler_config': {
        'monitor': 'valid_loss',
    },
}
# define example CV
cv = RegressionCV(layers=[10,5,5,1], options=options)
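To see why such a scheduler needs a monitored quantity, here is a small sketch in plain PyTorch where the metric is passed by hand to scheduler.step. The validation losses are made-up numbers chosen only to produce a plateau.

```python
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=5)

# made-up validation losses: improving at first, then stuck on a plateau
valid_losses = [1.0, 0.8, 0.7] + [0.7] * 6

lrs = []
for valid_loss in valid_losses:
    scheduler.step(valid_loss)   # the monitored metric must be passed explicitly
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

The learning rate is halved only once the number of epochs without improvement exceeds patience, which here happens at the last entry of the list.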
Loss function¶
The operations performed at each optimization step are encoded in the training_step method of each CV. They typically involve:
a forward pass of the model
the calculation of the loss function
a backward pass
The general workflow cannot be changed as it is specific to each CV, unless you subclass a given CV and overload the training_step method. However, there are some details that can be changed.
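The three steps listed above can be sketched as a hand-written loop in plain PyTorch. This is only an illustration on synthetic data and not the training_step of any mlcolvar CV; with Lightning's automatic optimization, training_step typically just returns the loss and the trainer takes care of the backward pass and of the optimizer update.

```python
import torch

torch.manual_seed(0)

# synthetic regression data (arbitrary choice for the example)
X = torch.randn(64, 3)
y = X @ torch.tensor([[1.0], [-2.0], [0.5]])

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.functional.mse_loss

losses = []
for step in range(100):
    optimizer.zero_grad()
    prediction = model(X)            # forward pass of the model
    loss = loss_fn(prediction, y)    # calculation of the loss function
    loss.backward()                  # backward pass
    optimizer.step()                 # parameter update
    losses.append(loss.item())

print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.2e}")
```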
For example, one might want to change the loss function in a RegressionCV (or in an AutoEncoderCV) from Mean Square Error (MSE) to Mean Absolute Error (MAE). To do so, one needs to define a function with the same signature as the one used in the CV and then assign it to the loss_fn member:
[30]:
from torch import Tensor
# print default loss
print(f'default: {cv.loss_fn}' )
# define new function
def mae_loss(input : Tensor, target: Tensor):
    # mean absolute error between prediction and target
    return (input - target).abs().mean()
# assign it
cv.loss_fn = mae_loss
print(f'(a) new: {cv.loss_fn}' )
# this could also be accomplished with a lambda function
cv.loss_fn = lambda x,y : (x-y).abs().mean()
print(f'(b) new: {cv.loss_fn}' )
default: <function <lambda> at 0x7f89c069c670>
(a) new: <function mae_loss at 0x7f89f2c07280>
(b) new: <function <lambda> at 0x7f89c069c670>
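As a quick sanity check of the difference between the two losses, the following sketch evaluates both on a small hand-made example. The tensors are arbitrary, and the two functions are written out explicitly rather than taken from mlcolvar.

```python
import torch
from torch import Tensor

def mae_loss(input: Tensor, target: Tensor) -> Tensor:
    """Mean absolute error."""
    return (input - target).abs().mean()

def mse_loss(input: Tensor, target: Tensor) -> Tensor:
    """Mean square error."""
    return (input - target).pow(2).mean()

prediction = torch.tensor([1.0, 2.0, 4.0])
target = torch.tensor([1.0, 0.0, 0.0])

mae = mae_loss(prediction, target)   # absolute errors 0, 2, 4 -> mean 2
mse = mse_loss(prediction, target)   # squared errors 0, 4, 16 -> mean 20/3
print(f"MAE: {mae.item():.4f}  MSE: {mse.item():.4f}")
```

The MSE weights the largest error much more heavily than the MAE, which is the usual reason to prefer the latter when the data contain outliers.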
Another case is when the loss function itself has options that can be customized. For instance, in the case of DeepLDA/DeepTICA CVs the loss function is ReduceEigenvaluesLoss, which takes as input the eigenvalues of the underlying statistical problem and returns a scalar (e.g. the sum of the squared eigenvalues). To see which variables can be set, you should look at the documentation of the loss function used.
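The idea of reducing a set of eigenvalues to a scalar can be sketched as follows. The helper reduce_eigenvalues is hypothetical and only mimics two possible reduction modes; in particular, the minus sign (so that minimizing the loss maximizes the eigenvalues) is an assumption of this sketch and should be checked against the ReduceEigenvaluesLoss documentation.

```python
import torch

def reduce_eigenvalues(evals: torch.Tensor, mode: str = "sum2") -> torch.Tensor:
    """Hypothetical helper: reduce eigenvalues to a scalar loss.

    The sign is inverted (assumption) so that minimizing the loss
    maximizes the eigenvalues.
    """
    if mode == "sum":
        return -evals.sum()
    if mode == "sum2":
        return -evals.pow(2).sum()
    raise ValueError(f"unknown mode: {mode}")

evals = torch.tensor([0.9, 0.5])
loss_sum = reduce_eigenvalues(evals, mode="sum")     # -(0.9 + 0.5)
loss_sum2 = reduce_eigenvalues(evals, mode="sum2")   # -(0.81 + 0.25)
print(f"sum:  {loss_sum.item():.2f}")
print(f"sum2: {loss_sum2.item():.2f}")
```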
For example, to change the reduction mode to the sum of the eigenvalues instead of the sum of the squared ones, one can update the loss function accordingly:
[33]:
from mlcolvar.cvs import DeepTICA
# define CV
cv = DeepTICA(layers=[10, 5, 5, 2], options={})
# print default loss mode
print(f'default mode: {cv.loss_fn.mode}')
# change the mode
cv.loss_fn.mode = 'sum'
# print new loss mode
print(f'>> new mode: {cv.loss_fn.mode}')
default mode: sum2
>> new mode: sum
Trainer¶
Since we are using the PyTorch Lightning framework, we can exploit all the benefits of this library. For instance, we can run the optimization of the model on GPUs, if available, without changing the rest of our code.
[39]:
import lightning
# choose accelerators
trainer = lightning.Trainer(accelerator='cpu') #options are: "cpu", "gpu", "tpu", "ipu", "hpu", "mps", "auto"
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
An important class of functions that can be used to customize the behaviour during the training are callbacks.
Quoting the lightning documentation: Callbacks allow you to add arbitrary self-contained programs to your training. At specific points during the flow of execution (hooks), the Callback interface allows you to design programs that encapsulate a full set of functionality. It de-couples functionality that does not need to be in the lightning module and can be shared across projects.
For instance, they can be used to perform early stopping as well as to save model checkpoints or to save metrics. Here we will just give some examples of these functionalities, while we refer the reader to lightning documentation for a more detailed overview.
Early stopping¶
Early stopping interrupts the training when a given metric (typically the validation loss) no longer decreases (or increases, if the metric is to be maximized), which is a symptom of overfitting.
[40]:
from lightning.pytorch.callbacks.early_stopping import EarlyStopping
early_stopping = EarlyStopping(monitor="valid_loss", # quantity to monitor
                               mode='min',           # whether this should be minimized or maximized during training
                               min_delta=0,          # minimum change of the quantity that counts as an improvement
                               patience=10,          # how many epochs without improvement to wait before stopping the training
                               verbose=False
                               )
trainer = lightning.Trainer(callbacks=[early_stopping])
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
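The roles of patience and min_delta can be illustrated with a few lines of plain Python. The function below is a simplified sketch of the stopping rule for mode='min' and is not Lightning's implementation; the validation losses are made-up numbers.

```python
def early_stopping_epoch(losses, patience=10, min_delta=0.0):
    """Return the epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            # improvement: update the best value and reset the counter
            best = loss
            wait = 0
        else:
            # no improvement: stop once the counter reaches the patience
            wait += 1
            if wait >= patience:
                return epoch
    return None

valid_losses = [1.0, 0.9, 0.8, 0.85, 0.86, 0.81, 0.9]
stop_epoch = early_stopping_epoch(valid_losses, patience=3)
print(f"training would stop at epoch {stop_epoch}")
```

Here the best value (0.8) is reached at epoch 2, and the three following epochs do not improve on it, so the training would stop at epoch 5.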
Model checkpointing¶
It is often useful to save the checkpoint of the model which performs best according to some metric, for instance in combination with early stopping.
After training finishes, you can use best_model_path to retrieve the path to the best checkpoint file and best_model_score to retrieve its score.
[42]:
from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint
# see documentation for additional customization, e.g. location and file names
checkpoint = ModelCheckpoint(save_top_k=1,         # number of models to save (save_top_k=1 means only the best one is stored)
                             monitor="valid_loss"  # quantity to monitor
                             )
# assign callback to trainer
trainer = lightning.Trainer(callbacks=[checkpoint])
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
After the training is over, remember also to export the TorchScript model, which is needed by PLUMED. The following code first loads the best checkpoint and then compiles it.
best_model = RegressionCV.load_from_checkpoint(checkpoint.best_model_path)
best_model.to_torchscript(file_path=checkpoint.best_model_path.replace(".ckpt", ".ptc"), method='trace')
Loggers¶
Lightning supports numerous ways of logging metrics, from saving CSV files to TensorBoard to Weights & Biases and more (see their website for the full list).
For instance, to save the metrics in a .csv file you can use the CSVLogger:
[ ]:
from lightning.pytorch.loggers import CSVLogger
logger = CSVLogger(save_dir="experiments", # directory where to save the file
                   name='myCV',            # name of the experiment
                   version=None            # version number (if None it will be automatically assigned)
                   )
# assign logger to trainer
trainer = lightning.Trainer(logger=logger)
Similarly, the TensorBoardLogger can be used to save the metrics in the TensorBoard format (requires tensorboard to be installed).
Adding new callbacks: save metrics into a dictionary¶
Callbacks can also be easily implemented in order to perform custom tasks.
For instance, in mlcolvar.utils.trainer we implemented a simple MetricsCallback object which saves the logged metrics into a dictionary. This makes it easy to display the results in the tutorials without having to save the metrics with the loggers and load them back afterwards.
[43]:
from mlcolvar.utils.trainer import MetricsCallback
log = MetricsCallback()
# assign callback to trainer
trainer = lightning.Trainer(callbacks=[log])
# After the training is over the metrics can be accessed through the dictionary log.metrics
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Disable validation loop¶
In order to disable the validation loop you need to:
tell the DictModule not to split the dataset, with lengths=[1.0]
pass the two options below to the lightning.Trainer:
[ ]:
# from mlcolvar.data import DictModule
#datamodule = DictModule(dataset,lengths=[1.0])
trainer = lightning.Trainer(limit_val_batches=0, num_sanity_val_steps=0)