Implementing a new CV from scratch

In this notebook, we move top-down through the structure of the CV classes in mlcolvar. We give an overview of how CV classes should be implemented from scratch, alongside some coding conventions adopted in the library, which may be useful for external contributors.

As an example, we will implement (and comment on) the AutoEncoderCV step by step.

Define the class object

In mlcolvar, CVs class objects inherit from two parent classes:

  • BaseCV class, which contains some common and default helper functions

  • lightning.LightningModule class, which automatically gives access to the Lightning package utilities

In the class declaration preamble, we set the names of the BLOCKS that will constitute the main body of the CV itself.

The blocks are meant to correspond to classes and functions defined in mlcolvar.core. However, the names we give in BLOCKS are arbitrary: in principle, a model could contain several blocks of the same type, and we would then need to distinguish between them.
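To make the role of BLOCKS more concrete, the following is a minimal pure-Python sketch of the general pattern: a parent class reads the class-level list of names and pre-registers each one as an attribute set to None, so that a subclass can later fill in or skip a block. ToyBaseCV and ToyAutoEncoder are hypothetical stand-ins written for this illustration; they are not mlcolvar's BaseCV, whose internals may differ.

```python
class ToyBaseCV:
    """Hypothetical stand-in for a base CV class (not mlcolvar's BaseCV)."""

    BLOCKS = []  # subclasses list their block names here

    def __init__(self):
        # Pre-register every declared block as None, so that
        # `self.<block> is None` is always a safe check later on.
        for name in self.BLOCKS:
            setattr(self, name, None)


class ToyAutoEncoder(ToyBaseCV):
    # The names are arbitrary labels chosen by the developer.
    BLOCKS = ["norm_in", "encoder", "decoder"]

    def __init__(self, use_norm=True):
        super().__init__()
        if use_norm:
            self.norm_in = "normalization block"
        self.encoder = "encoder block"
        self.decoder = "decoder block"


model = ToyAutoEncoder(use_norm=False)
active = [name for name in model.BLOCKS if getattr(model, name) is not None]
print(active)  # ['encoder', 'decoder']
```

Because the names are only labels, two blocks of the same type could be listed as, say, 'encoder_a' and 'encoder_b' without any change to the parent class.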

[ ]:
# Colab setup
import os

if os.getenv("COLAB_RELEASE_TAG"):
    import subprocess
    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)
    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)
    print(cmd.stdout.decode('utf-8'))
[2]:
import torch
import lightning

from mlcolvar.cvs import BaseCV

class AutoEncoderCV(BaseCV, lightning.LightningModule):
    BLOCKS = ['norm_in','encoder','decoder']

To keep the code in the library as clear as possible, we should also add a short docstring to our CV class briefly explaining how it works!

To save some space, however, we will skip it in the following cells.

[3]:
class AutoEncoderCV(BaseCV, lightning.LightningModule):
    """AutoEncoding Collective Variable. It is composed by a first neural network (encoder) which projects
    the input data into a latent space (the CVs). Then a second network (decoder) takes
    the CVs and tries to reconstruct the input data based on them. It is an unsupervised learning approach,
    typically used when no labels are available.
    Furthermore, it can also be used to learn a representation which can be used not to reconstruct the data but
    to predict, e.g. future configurations.

    For training it requires a DictDataset with the key 'data' and optionally 'weights'. If a 'target'
    key is present this will be used as reference for the output of the decoder, otherwise this will be compared
    with the input 'data'.
    """

    BLOCKS = ['norm_in','encoder','decoder']

The CV class __init__ method

The __init__ method is the signature of the CV model, as it initializes everything the CV model needs to run, including blocks, variables and loss functions.

Declaration of the __init__ method

All the CVs in mlcolvar have some common elements:

  • in/out features: All the CV classes in mlcolvar must define in_features and out_features, which are the number of inputs and outputs respectively. They must be passed to the BaseCV parent class with the command super().__init__(in_features, out_features).

  • options: The options dict provides the interface to modify the defaults of the CV’s elements, e.g. the parameters of the blocks or of the optimizer (see later)

  • **kwargs: CVs in mlcolvar also accept keyword arguments to be passed to their inner functions

Each CV class depends on different parameters. In our example, the characteristic parameters of the AutoEncoderCV are just encoder_layers (compulsory) and decoder_layers (optional).

To stay as user-friendly as possible, in mlcolvar, we always try to give meaningful and intelligible names to the parameters. Besides that, it is also a good practice to provide a complete docstring for the __init__ method, explaining more in detail what each parameter is actually doing in the model.
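As a rough illustration of these common elements, the sketch below shows how a constructor can derive the number of inputs and outputs from the encoder layer sizes and forward any extra keyword arguments to its parent. ToyBase, ToyAutoEncoderCV and the keyword some_flag are hypothetical names used only here; the sketch assumes nothing about what mlcolvar's BaseCV does with the forwarded arguments.

```python
class ToyBase:
    """Hypothetical parent class (illustration only, not mlcolvar's BaseCV)."""

    def __init__(self, in_features, out_features, **kwargs):
        self.in_features = in_features
        self.out_features = out_features
        # Extra keyword arguments are simply stored here; a real base class
        # would decide what to do with them.
        self.extra = kwargs


class ToyAutoEncoderCV(ToyBase):
    def __init__(self, encoder_layers, decoder_layers=None, options=None, **kwargs):
        # The first and last encoder sizes fix the number of inputs and of CVs.
        super().__init__(in_features=encoder_layers[0],
                         out_features=encoder_layers[-1],
                         **kwargs)
        self.encoder_layers = list(encoder_layers)
        self.decoder_layers = decoder_layers
        self.options = options


cv = ToyAutoEncoderCV([8, 6, 4, 2], some_flag=True)  # `some_flag` is a made-up keyword
print(cv.in_features, cv.out_features)  # 8 2
print(cv.extra)                         # {'some_flag': True}
```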

[4]:
class AutoEncoderCV(BaseCV, lightning.LightningModule):
    BLOCKS = ['norm_in','encoder','decoder']

    def __init__(self,
# ================================================ LOOK HERE 0.0 ================================================
                encoder_layers : list,
                decoder_layers : list = None,
                options : dict = None,
                **kwargs):
        """
        Train a CV defined as the output layer of the encoder of an autoencoder model (latent space).
        The decoder part is used only during the training for the reconstruction loss.

        Parameters
        ----------
        encoder_layers : list
            Number of neurons per layer of the encoder
        decoder_layers : list, optional
            Number of neurons per layer of the decoder, by default None
            If not set, it automatically takes the reversed architecture of the encoder
        options : dict[str,Any], optional
            Options for the building blocks of the model, by default None.
            Available blocks: ['norm_in', 'encoder','decoder'].
            Set 'block_name' = None or False to turn off that block
        """
        super().__init__(in_features=encoder_layers[0], out_features=encoder_layers[-1], **kwargs)

# ================================================ LOOK HERE 0.0 ================================================

Parse options and parameters

The different options in the options dictionary are parsed using the BaseCV.parse_options function. This command is required as it also initializes defaults whenever specific options entries are not specified and checks that the given options make sense with the CV at hand.

Options must be a dictionary of dictionaries mapping the name of a block (or the optimizer) to a dictionary of keyword arguments to pass to the __init__ function of that block (or of the optimizer), i.e. name_of_block -> block_init_kwargs, e.g. options = {'encoder': {'activation': 'relu'}, 'optimizer': {'lr': 1e-3}}.

Here we also initialize what is needed from the input parameters. In our case, for example, we specify that, whenever decoder_layers is not specified, it should be the reversed encoder_layers.
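The kind of work done at this stage can be sketched in plain Python as follows. The helper parse_options_sketch is hypothetical and only mimics the behaviour described above (filling in defaults and rejecting keys that make no sense for the CV); the actual BaseCV.parse_options may be implemented differently.

```python
def parse_options_sketch(options, blocks):
    """Hypothetical helper: fill defaults and validate keys (illustration only)."""
    options = {} if options is None else dict(options)
    allowed = set(blocks) | {"optimizer"}
    unknown = set(options) - allowed
    if unknown:
        raise KeyError(f"Unknown option keys: {sorted(unknown)}")
    # Every allowed name gets an entry, so a later `options[name]` never fails.
    for name in allowed:
        options.setdefault(name, {})
    return options


blocks = ["norm_in", "encoder", "decoder"]
user_options = {"encoder": {"activation": "relu"}, "optimizer": {"lr": 1e-3}}
parsed = parse_options_sketch(user_options, blocks)

# If the decoder is not given, reverse the encoder architecture.
encoder_layers = [8, 6, 4, 2]
decoder_layers = None
if decoder_layers is None:
    decoder_layers = encoder_layers[::-1]

print(parsed["decoder"])  # {}
print(decoder_layers)     # [2, 4, 6, 8]
```

Note that an explicit None passed by the user (e.g. 'norm_in': None) is left untouched by setdefault in this sketch, which keeps it available as a "block disabled" signal.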

[5]:
class AutoEncoderCV(BaseCV, lightning.LightningModule):
    BLOCKS = ['norm_in','encoder','decoder']

    def __init__(self,
                encoder_layers : list,
                decoder_layers : list = None,
                options : dict = None,
                **kwargs):
        super().__init__(in_features=encoder_layers[0], out_features=encoder_layers[-1], **kwargs)

# ================================================ LOOK HERE 0.0 ================================================

        # ======= OPTIONS =======
        # parse and sanitize
        options = self.parse_options(options)

        # if decoder is not given reverse the encoder
        if decoder_layers is None:
            decoder_layers = encoder_layers[::-1]

# ================================================ LOOK HERE 0.0 ================================================


Define the loss_fn in the model

In the mlcolvar CVs, loss functions are defined as attributes of the CV class. In our case we will use the MSELoss defined in mlcolvar.core.loss.

[6]:
from mlcolvar.core.loss import MSELoss

class AutoEncoderCV(BaseCV, lightning.LightningModule):
    BLOCKS = ['norm_in','encoder','decoder']

    def __init__(self,
                encoder_layers : list,
                decoder_layers : list = None,
                options : dict = None,
                **kwargs):
        super().__init__(in_features=encoder_layers[0], out_features=encoder_layers[-1], **kwargs)

        # ======= OPTIONS =======
        # parse and sanitize
        options = self.parse_options(options)

        # if decoder is not given reverse the encoder
        if decoder_layers is None:
            decoder_layers = encoder_layers[::-1]

# ================================================ LOOK HERE 0.0 ================================================

        # =======   LOSS  =======
        # Reconstruction (MSE) loss
        self.loss_fn = MSELoss()

# ================================================ LOOK HERE 0.0 ================================================

Initialize the Blocks in the CV model

In general the blocks are meant to be initialized relying on the functions and classes implemented in mlcolvar.core.

Recall that the list of names of the blocks we want to include in our CV is defined in the class constant BLOCKS.

In our example we will implement norm_in = Normalization() to normalize the input, encoder = FeedForward() as the NN for the encoder part of the architecture, and decoder = FeedForward() as the NN for the decoder part.

Modifying the blocks' defaults

We pass **options as kwargs to the blocks functions in order to be able to use the options dictionary to modify the defaults when initializing the CV model in our code. For example in the case of the encoder block we can modify the activation function of the layers to the shifted_softplus using

options={'encoder':{'activation':'shifted_softplus'}}

We may also want to have the possibility to deactivate blocks sometimes like we do here for the norm_in block, which can be skipped using

options={'norm_in': None} or options={'norm_in': False}
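The two mechanisms just described (overriding defaults through keyword unpacking and switching a block off with None or False) can be sketched without any library code. The factory make_block and the default values below are hypothetical and serve only to show the logic; real blocks would be objects such as FeedForward or Normalization rather than dictionaries.

```python
def make_block(name, default_kwargs, options):
    """Hypothetical factory: return a dict describing the block, or None if disabled."""
    user = options.get(name, {})
    if user is None or user is False:
        return None  # block deactivated by the user
    # User-supplied entries override the defaults, mimicking `Block(..., **options[name])`.
    return {"name": name, **default_kwargs, **user}


options = {"norm_in": None, "encoder": {"activation": "shifted_softplus"}}
defaults = {"activation": "relu"}  # toy default, chosen for this example

norm_in = make_block("norm_in", {}, options)
encoder = make_block("encoder", defaults, options)
decoder = make_block("decoder", defaults, options)

print(norm_in)                # None
print(encoder["activation"])  # shifted_softplus
print(decoder["activation"])  # relu
```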

[7]:
from mlcolvar.core.nn import FeedForward
from mlcolvar.core.transform import Normalization

class AutoEncoderCV(BaseCV, lightning.LightningModule):
    BLOCKS = ['norm_in','encoder','decoder']

    def __init__(self,
                encoder_layers : list,
                decoder_layers : list = None,
                options : dict = None,
                **kwargs):
        super().__init__(in_features=encoder_layers[0], out_features=encoder_layers[-1], **kwargs)

        # ======= OPTIONS =======
        # parse and sanitize
        options = self.parse_options(options)

        # if decoder is not given reverse the encoder
        if decoder_layers is None:
            decoder_layers = encoder_layers[::-1]

        # =======   LOSS  =======
        # Reconstruction (MSE) loss
        self.loss_fn = MSELoss()

# ================================================ LOOK HERE 0.0 ================================================

        # ======= BLOCKS =======

        # initialize norm_in
        o = 'norm_in'
        if ( options[o] is not False ) and (options[o] is not None): # this allows to deactivate it
            self.norm_in = Normalization(self.in_features,**options[o])

        # initialize encoder
        o = 'encoder'
        self.encoder = FeedForward(encoder_layers, **options[o])

        # initialize decoder
        o = 'decoder'
        self.decoder = FeedForward(decoder_layers, **options[o])

# ================================================ LOOK HERE 0.0 ================================================

Defining the forward and forward_cv function

By default, the BaseCV class has two methods that apply the CV model:

  • forward_cv sequentially executes the blocks, skipping pre- and post-processing.

  • forward, which is used when calling model(input) and for deploying the model, also applies pre- and post-processing operations, if present.

By default, all the defined blocks are executed in sequence to obtain the CV; however, sometimes this is not what we want. In an autoencoder, for example, the decoder block must be skipped, as the CV space corresponds to the latent representation of the autoencoder.

To implement this we must:

  • override the forward_cv method of the BaseCV parent class in our CV model

  • implement a function (encode_decode), to be used during training, that executes both the encoder and the decoder and reverts the normalization applied to the inputs
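The relation between the three call paths can be illustrated with a scalar toy model in which every block is a simple invertible function. ToyCV and its numbers are hypothetical and only show the call structure; in this sketch encode_decode builds on forward_cv, so that pre- and post-processing do not enter the reconstruction.

```python
class ToyCV:
    """Hypothetical scalar 'autoencoder' used only to show the call structure."""

    def __init__(self, preprocessing=None, postprocessing=None):
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing
        self.mean, self.scale = 10.0, 4.0  # toy normalization statistics

    # --- blocks (all invertible toy functions) ---
    def norm_in(self, x):
        return (x - self.mean) / self.scale

    def norm_in_inverse(self, x):
        return x * self.scale + self.mean

    def encoder(self, x):
        return 2.0 * x

    def decoder(self, z):
        return z / 2.0

    # --- CV evaluation: only the blocks that lead to the latent space ---
    def forward_cv(self, x):
        return self.encoder(self.norm_in(x))

    # --- deployment path: pre-processing -> forward_cv -> post-processing ---
    def forward(self, x):
        if self.preprocessing is not None:
            x = self.preprocessing(x)
        x = self.forward_cv(x)
        if self.postprocessing is not None:
            x = self.postprocessing(x)
        return x

    # --- training path: encoder + decoder, then undo the normalization ---
    def encode_decode(self, x):
        z = self.forward_cv(x)
        return self.norm_in_inverse(self.decoder(z))


cv = ToyCV(postprocessing=lambda z: -z)
x = 14.0
print(cv.forward_cv(x))     # 2.0
print(cv.forward(x))        # -2.0
print(cv.encode_decode(x))  # 14.0
```

Since the toy decoder exactly inverts the toy encoder, encode_decode returns the input unchanged, while forward differs from forward_cv only through the post-processing step.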

[8]:
def forward_cv(self, x: torch.Tensor) -> (torch.Tensor):
    if self.norm_in is not None:
        x = self.norm_in(x)
    x = self.encoder(x)
    return x

def encode_decode(self, x: torch.Tensor) -> (torch.Tensor):
    x = self.forward_cv(x)  # latent CV, without pre/post processing
    x = self.decoder(x)
    if self.norm_in is not None:
        x = self.norm_in.inverse(x)
    return x

Define the training_step

All the CV classes in mlcolvar must override the lightning.LightningModule.training_step function.

  • First, within this function we select the data we need from the batch. This is done using the keyword indexing of the mlcolvar.data.DictDataset, which makes the code easy to read.

  • Then we apply the model and compute the loss function according to the results.

  • Finally, and optionally, we log the quantities we are interested in monitoring using the lightning framework.

The BaseCV parent class also has validation_step and test_step functions, which by default are equal to training_step.
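The data-selection logic of such a step can be tried out on plain dictionaries and lists, as in the sketch below. Both weighted_mse and toy_training_step are hypothetical helpers written for this illustration; in particular, the way the weights enter the loss is an assumption and may not match the convention used by MSELoss in mlcolvar.

```python
def weighted_mse(pred, ref, weights=None):
    """Toy mean squared error over a batch of scalars; weights are optional."""
    if weights is None:
        weights = [1.0] * len(pred)
    return sum(w * (p - r) ** 2 for p, r, w in zip(pred, ref, weights)) / len(pred)


def toy_training_step(batch, model):
    # Select the data by key; 'weights' and 'target' are optional entries.
    x = batch["data"]
    loss_kwargs = {}
    if "weights" in batch:
        loss_kwargs["weights"] = batch["weights"]
    # Apply the model, then compare with 'target' if present, else with the input.
    x_hat = [model(v) for v in x]
    x_ref = batch["target"] if "target" in batch else x
    return weighted_mse(x_hat, x_ref, **loss_kwargs)


def shift_by_one(v):
    """Toy 'model' that reconstructs every input with an offset of 1."""
    return v + 1.0


loss_plain = toy_training_step({"data": [0.0, 1.0]}, shift_by_one)
loss_target = toy_training_step({"data": [0.0, 1.0], "target": [1.0, 2.0]}, shift_by_one)
loss_weighted = toy_training_step({"data": [0.0, 1.0], "weights": [2.0, 0.0]}, shift_by_one)
print(loss_plain, loss_target, loss_weighted)  # 1.0 0.0 1.0
```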

[9]:
def training_step(self, train_batch, batch_idx):
    # =================get data===================
    x = train_batch['data']
    loss_kwargs = {}
    if 'weights' in train_batch:
        loss_kwargs['weights'] = train_batch['weights']

    # =================forward====================
    x_hat = self.encode_decode(x)

    # ===================loss=====================
    # Reference output (compare with a 'target' key, if any, otherwise with input 'data')
    if 'target' in train_batch:
        x_ref = train_batch['target']
    else:
        x_ref = x
    loss = self.loss_fn(x_hat, x_ref, **loss_kwargs)

    # ====================log=====================
    name = 'train' if self.training else 'valid'
    self.log(f'{name}_loss', loss, on_epoch=True)
    return loss

Wrap up: the complete example CV class

[10]:
class AutoEncoderCV(BaseCV, lightning.LightningModule):
    """AutoEncoding Collective Variable. It is composed by a first neural network (encoder) which projects
    the input data into a latent space (the CVs). Then a second network (decoder) takes
    the CVs and tries to reconstruct the input data based on them. It is an unsupervised learning approach,
    typically used when no labels are available.
    Furthermore, it can also be used to learn a representation which can be used not to reconstruct the data but
    to predict, e.g. future configurations.

    For training it requires a DictDataset with the key 'data' and optionally 'weights'. If a 'target'
    key is present this will be used as reference for the output of the decoder, otherwise this will be compared
    with the input 'data'.
    """

    BLOCKS = ['norm_in','encoder','decoder']

    def __init__(self,
                encoder_layers : list,
                decoder_layers : list = None,
                options : dict = None,
                **kwargs):
        """
        Train a CV defined as the output layer of the encoder of an autoencoder model (latent space).
        The decoder part is used only during the training for the reconstruction loss.

        Parameters
        ----------
        encoder_layers : list
            Number of neurons per layer of the encoder
        decoder_layers : list, optional
            Number of neurons per layer of the decoder, by default None
            If not set, it automatically takes the reversed architecture of the encoder
        options : dict[str,Any], optional
            Options for the building blocks of the model, by default None.
            Available blocks: ['norm_in', 'encoder','decoder'].
            Set 'block_name' = None or False to turn off that block
        """
        super().__init__(in_features=encoder_layers[0], out_features=encoder_layers[-1], **kwargs)

        # ======= OPTIONS =======
        # parse and sanitize
        options = self.parse_options(options)

        # if decoder is not given reverse the encoder
        if decoder_layers is None:
            decoder_layers = encoder_layers[::-1]

        # =======   LOSS  =======
        # Reconstruction (MSE) loss
        self.loss_fn = MSELoss()

        # ======= BLOCKS =======

        # initialize norm_in
        o = 'norm_in'
        if ( options[o] is not False ) and (options[o] is not None): # this allows to deactivate it
            self.norm_in = Normalization(self.in_features,**options[o])

        # initialize encoder
        o = 'encoder'
        self.encoder = FeedForward(encoder_layers, **options[o])

        # initialize decoder
        o = 'decoder'
        self.decoder = FeedForward(decoder_layers, **options[o])

    def forward_cv(self, x: torch.Tensor) -> (torch.Tensor):
        if self.norm_in is not None:
            x = self.norm_in(x)
        x = self.encoder(x)
        return x

    def encode_decode(self, x: torch.Tensor) -> (torch.Tensor):
        x = self.forward_cv(x)  # latent CV, without pre/post processing
        x = self.decoder(x)
        if self.norm_in is not None:
            x = self.norm_in.inverse(x)
        return x

    def training_step(self, train_batch, batch_idx):
        # =================get data===================
        x = train_batch['data']
        loss_kwargs = {}
        if 'weights' in train_batch:
            loss_kwargs['weights'] = train_batch['weights']

        # =================forward====================
        x_hat = self.encode_decode(x)

        # ===================loss=====================
        # Reference output (compare with a 'target' key, if any, otherwise with input 'data')
        if 'target' in train_batch:
            x_ref = train_batch['target']
        else:
            x_ref = x
        loss = self.loss_fn(x_hat, x_ref, **loss_kwargs)

        # ====================log=====================
        name = 'train' if self.training else 'valid'
        self.log(f'{name}_loss', loss, on_epoch=True)
        return loss

Write test functions

In order to ensure the smooth functioning of the mlcolvar library, all the main functions must be accompanied by proper test functions, which should be added in the tests folder. In their final form, these are mainly meant to ensure that the code does not crash under the different possible settings, and they should be kept as generic and concise as possible.

[16]:
def test_autoencodercv():
    from mlcolvar.data import DictDataset, DictModule

    in_features, out_features = 8,2
    layers = [in_features, 6, 4, out_features]

    # initialize via dictionary
    options = { 'norm_in'  : None,
             'encoder' : { 'activation' : 'relu' },
             'optimizer' : {'lr' : 1e-3}
           }
    model = AutoEncoderCV( encoder_layers=layers, options=options )

    # train on synthetic dataset
    X = torch.randn(100,in_features)
    dataset = DictDataset({'data': X})
    datamodule = DictModule(dataset)
    trainer = lightning.Trainer(max_epochs=1, log_every_n_steps=2,logger=None, enable_checkpointing=False, enable_model_summary=False)
    trainer.fit( model, datamodule )
    model.eval()
    X_hat = model(X)

if __name__ == "__main__":
    test_autoencodercv()
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Epoch 0: 100%|██████████| 1/1 [00:00<00:00, 71.27it/s, v_num=32]
`Trainer.fit` stopped: `max_epochs=1` reached.
Epoch 0: 100%|██████████| 1/1 [00:00<00:00, 64.56it/s, v_num=32]