{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Implementing a new CV from scratch\n", " \n", "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/luigibonati/mlcolvar/blob/main/docs/notebooks/tutorials/adv_newcv_scratch.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will move top-bottom through the structure of the CV classes in `mlcolvar`.\n", "We will give an overview of how CVs classes should be implemented from scratch alongside some coding-conventions we adopted in the library which may be useful for possible external contibutors. \n", "\n", "As an example we will implement (and comment) step by step the `AutoEncoderCV`. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Define the class object\n", "In `mlcolvar`, CVs class objects inherit from two parent classes:\n", "- `BaseCV` class, which contains some common and default helper functions\n", "- `lightning.LightniningModule` class, which automatically gives access to the Lightining package utilities \n", "\n", "In the class declaration preamble, we set the names of the `BLOCKS` that will consitute the main body of the CV itself. \n", "\n", "The blocks are meant to correspond to classes and functions defined in `mlcolvar.core` . However, the names we give in `BLOCKS` are arbitrary, considered that, in principle, we could have more blocks of the same types in our model and we would then need to distinguish between them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Colab setup\n", "import os\n", "\n", "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", " import subprocess\n", " subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)\n", " cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)\n", " print(cmd.stdout.decode('utf-8'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/etrizio@iit.local/Bin/miniconda3/envs/mlcvs_test/lib/python3.10/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "import torch\n", "import lightning\n", "\n", "from mlcolvar.cvs import BaseCV\n", "\n", "class AutoEncoderCV(BaseCV):\n", " DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To keep the code in the library as clear as possible, we should also add short docstring to our CV class briefly explaining how it works!\n", "\n", "Anyways to save some space we will skip this in the following cells" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class AutoEncoderCV(BaseCV):\n", " \"\"\"AutoEncoding Collective Variable. It is composed by a first neural network (encoder) which projects \n", " the input data into a latent space (the CVs). Then a second network (decoder) takes \n", " the CVs and tries to reconstruct the input data based on them. It is an unsupervised learning approach, \n", " typically used when no labels are available.\n", " Furthermore, it can also be used lo learn a representation which can be used not to reconstruct the data but \n", " to predict, e.g. future configurations. \n", "\n", " For training it requires a DictDataset with the key 'data' and optionally 'weights'. If a 'target' \n", " key is present this will be used as reference for the output of the decoder, otherway this will be compared\n", " with the input 'data'.\n", " \"\"\"\n", " \n", " DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## The CV class `__init__` method\n", "The `__init__` method is the signature of the CV model as it initializes all that is necessary for the CV model to run, including blocks, variables, loss functions.." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Declaration of the `__init__` method\n", "\n", "All the CV's in `mlcolvar` have some common elements:\n", "\n", "- **in/out features**: All the CVs classes in `mlcolvar` should have defined the number of `in_features` and `out_features`, which are the number of inputs and outputs respectively. They must be passed to the `BaseCV` parent class with the command `super().__init__(in_features, out_features)`.\n", "- **options**: The `options` dict provide the interface to modify the defaults of the CV's elements, i.e. parameters of blocks, optimizer.. (see later)\n", "- ****kwargs**: CVs in `mlcolvar` also accept key-word arguments to be passed to their inner functions \n", "\n", "Each CV class will depend on different parameters, in our example the characteristic parameters for the `AutoEncoderCV` are just the `encoder_layer` (compulsory) and the `decoder_layer` (optional). \n", "\n", "To stay as user-friendly as possible, in `mlcolvar`, we always try to give meaningful and intelligible names to the parameters. Besides that, it is also a good practice to provide a complete docstring for the `__init__` method, explaining more in detail what each parameter is actually doing in the model.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class AutoEncoderCV(BaseCV):\n", " DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] \n", "\n", " def __init__(self,\n", "# ================================================ LOOK HERE 0.0 ================================================ \n", " encoder_layers : list, \n", " decoder_layers : list = None, \n", " options : dict = None, \n", " **kwargs):\n", " \"\"\"\n", " Train a CV defined as the output layer of the encoder of an autoencoder model (latent space). \n", " The decoder part is used only during the training for the reconstruction loss.\n", "\n", " Parameters\n", " ----------\n", " encoder_layers : list\n", " Number of neurons per layer of the encoder\n", " decoder_layers : list, optional\n", " Number of neurons per layer of the decoder, by default None\n", " If not set it takes automaically the reversed architecture of the encoder\n", " options : dict[str,Any], optional\n", " Options for the building blocks of the model, by default None.\n", " Available blocks: ['norm_in', 'encoder','decoder'].\n", " Set 'block_name' = None or False to turn off that block\n", " \"\"\"\n", " super().__init__(model=encoder_layers, **kwargs)\n", " \n", "# ================================================ LOOK HERE 0.0 ================================================ \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Parse options and parameters\n", "The different options in the `options` dictionary are parsed using the `BaseCV.parse_options` function. This command is required as it also initializes defaults whenever specific options entries are not specified and checks that the given options make sense with the CV at hand.\n", "\n", "Options must be a dictionary of dictionaries mapping the name of a block (or the optimizer) to a dictionary of keyword arguments to pass to the block (or the optimizer) `__init__` function, i.e. name_of_block -> block_init_kwargs (e.g. options = {'encoder': {'activation': 'relu'}, 'optimizer' : { 'lr' = 1e-3} }\n", "\n", "Here we also initialize what is needed from the input parameters. In our case for example we specify that, whenever `decoder_layer` is not specified, it should be the reversed `encoder_layer`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class AutoEncoderCV(BaseCV):\n", " DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] \n", " \n", " def __init__(self,\n", " encoder_layers : list, \n", " decoder_layers : list = None, \n", " options : dict = None, \n", " **kwargs):\n", " super().__init__(model=encoder_layers, **kwargs)\n", "\n", "# ================================================ LOOK HERE 0.0 ================================================ \n", " \n", " # ======= OPTIONS ======= \n", " # parse and sanitize\n", " options = self.parse_options(options)\n", "\n", " # if decoder is not given reverse the encoder\n", " if decoder_layers is None:\n", " decoder_layers = encoder_layers[::-1]\n", " \n", "# ================================================ LOOK HERE 0.0 ================================================ \n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Define the `loss_fn` in the model\n", "In the `mlcolvar` CVs the loss function are defined as attributes of the CV class. In our case we will use the `MSELoss` defined in `mlcolvar.core.loss`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlcolvar.core.loss import MSELoss\n", "\n", "class AutoEncoderCV(BaseCV):\n", " DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] \n", " \n", " def __init__(self,\n", " encoder_layers : list, \n", " decoder_layers : list = None, \n", " options : dict = None, \n", " **kwargs):\n", " super().__init__(model=encoder_layers, **kwargs)\n", "\n", " # ======= OPTIONS ======= \n", " # parse and sanitize\n", " options = self.parse_options(options)\n", "\n", " # if decoder is not given reverse the encoder\n", " if decoder_layers is None:\n", " decoder_layers = encoder_layers[::-1]\n", " \n", "# ================================================ LOOK HERE 0.0 ================================================ \n", "\n", " # ======= LOSS =======\n", " # Reconstruction (MSE) loss\n", " self.loss_fn = MSELoss()\n", "\n", "# ================================================ LOOK HERE 0.0 ================================================ \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize the Blocks in the CV model\n", "In general the blocks are meant to be initialized relying on the functions and classes implemented in `mlcolvar.core`.\n", "\n", "We remind that the list of the names for the blocks we want to include in our CV is defined in the class' constant `BLOCKS`.\n", "\n", "In our example we will implement a `norm_in` = `Normalization()` normalize the input, the `encoder` = `FeedForward()` NN for the encoder part of the architecture and the `decoder` = `FeedForward()` NN. \n", "\n", "#### Modyifing the blocks default\n", "\n", "We pass `**options` as kwargs to the blocks functions in order to be able to use the `options` dictionary to modify the defaults when initializing the CV model in our code.\n", "For example in the case of the `encoder` block we can modify the activation function of the layers to the `shifted_softplus` using \n", "\n", "`options={'encoder':{'activation':'shifted_softplus'}}`\n", "\n", "We may also want to have the possibility to deactivate blocks sometimes like we do here for the `norm_in` block, which can be skipped using \n", "\n", "`options={'norm_in': None}` or `options={'norm_in': False}`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlcolvar.core.nn import FeedForward\n", "from mlcolvar.core.transform import Normalization\n", "\n", "class AutoEncoderCV(BaseCV):\n", " DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] \n", " \n", " def __init__(self,\n", " encoder_layers : list, \n", " decoder_layers : list = None, \n", " options : dict = None, \n", " **kwargs):\n", " super().__init__(model=encoder_layers, **kwargs)\n", "\n", " # ======= OPTIONS ======= \n", " # parse and sanitize\n", " options = self.parse_options(options)\n", "\n", " # if decoder is not given reverse the encoder\n", " if decoder_layers is None:\n", " decoder_layers = encoder_layers[::-1]\n", " \n", " # ======= LOSS =======\n", " # Reconstruction (MSE) loss\n", " self.loss_fn = MSELoss()\n", "\n", "# ================================================ LOOK HERE 0.0 ================================================ \n", "\n", " # ======= BLOCKS =======\n", "\n", " # initialize norm_in\n", " o = 'norm_in'\n", " if ( options[o] is not False ) and (options[o] is not None): # this allows to deactivate it\n", " self.norm_in = Normalization(self.in_features,**options[o]) \n", "\n", " # initialize encoder\n", " o = 'encoder'\n", " self.encoder = FeedForward(encoder_layers, **options[o])\n", "\n", " # initialize decoder\n", " o = 'decoder'\n", " self.decoder = FeedForward(decoder_layers, **options[o])\n", "\n", "# ================================================ LOOK HERE 0.0 ================================================ \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Defining the `forward` and `forward_cv` function\n", "By default in the `BaseCV` class has two methods that apply the CV model:\n", "- `forward_cv` sequentially executes the blocks, skipping pre and post processing.\n", "- `forward`, which is used when calling `model(input)` and for deploying the model, also applies pre and post processing operations, if present.\n", "\n", "By default **all** the defined blocks are meant to be executed to lead to the CV, however, sometimes this may not be the case. \n", "In the case of an autoencoder, for example, this would skip the `decoder` block as the CVs space correspond to the latent representation of the autoencoder. \n", "\n", "To implement this we must:\n", "- **overload** `forward_cv` method of the `BaseCV` mother class in our CV model\n", "- **implement** a function that executes both the encoder, the decoder part and revert the normalization applied on the inputs to be used during the training (`encode_decode`)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def forward_cv(self, x: torch.Tensor) -> (torch.Tensor):\n", " if self.norm_in is not None:\n", " x = self.norm_in(x)\n", " x = self.encoder(x)\n", " return x\n", "\n", "def encode_decode(self, x: torch.Tensor) -> (torch.Tensor):\n", " x = self.forward(x)\n", " x = self.decoder(x)\n", " if self.norm_in is not None:\n", " x = self.norm_in.inverse(x)\n", " return x" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Define the `training_step`\n", "All the CV classes in `mlcolvar` must overload `BaseCV.training_step` (inherited from `lightning.LightningModule`).\n", "\n", "- As first thing, within this function we need to select the data we need look for in the dataset. This is done using the keyword-indexing of the `mlcolvar.data.DictDataset` and allowing for a easy-to-read code.\n", "- Then we apply the model and compute the loss function according to the results.\n", "- Finally, and optionally, we log the quantities we are interested in monitoring using the lightning framework.\n", "\n", "The `BaseCV` mother class also have a `validation_step` and a `test_step` functions which are by default equal to the `training_step` one." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def training_step(self, train_batch, batch_idx):\n", " # =================get data===================\n", " x = train_batch['data']\n", " loss_kwargs = {}\n", " if 'weights' in train_batch:\n", " loss_kwargs['weights'] = train_batch['weights']\n", "\n", " # =================forward====================\n", " x_hat = self.encode_decode(x)\n", "\n", " # ===================loss=====================\n", " # Reference output (compare with a 'target' key, if any, otherwise with input 'data')\n", " if 'target' in train_batch:\n", " x_ref = train_batch['target']\n", " else:\n", " x_ref = x \n", " loss = self.loss_fn(x_hat, x_ref, **loss_kwargs)\n", " \n", " # ====================log===================== \n", " name = 'train' if self.training else 'valid' \n", " self.log(f'{name}_loss', loss, on_epoch=True)\n", " return loss" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Wrap up: the complete example CV class " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class AutoEncoderCV(BaseCV):\n", " \"\"\"AutoEncoding Collective Variable. It is composed by a first neural network (encoder) which projects \n", " the input data into a latent space (the CVs). Then a second network (decoder) takes \n", " the CVs and tries to reconstruct the input data based on them. It is an unsupervised learning approach, \n", " typically used when no labels are available.\n", " Furthermore, it can also be used lo learn a representation which can be used not to reconstruct the data but \n", " to predict, e.g. future configurations. \n", "\n", " For training it requires a DictDataset with the key 'data' and optionally 'weights'. If a 'target' \n", " key is present this will be used as reference for the output of the decoder, otherway this will be compared\n", " with the input 'data'.\n", " \"\"\"\n", " \n", " DEFAULT_BLOCKS = ['norm_in','encoder','decoder'] \n", " \n", " def __init__(self,\n", " encoder_layers : list, \n", " decoder_layers : list = None, \n", " options : dict = None, \n", " **kwargs):\n", " \"\"\"\n", " Train a CV defined as the output layer of the encoder of an autoencoder model (latent space). \n", " The decoder part is used only during the training for the reconstruction loss.\n", "\n", " Parameters\n", " ----------\n", " encoder_layers : list\n", " Number of neurons per layer of the encoder\n", " decoder_layers : list, optional\n", " Number of neurons per layer of the decoder, by default None\n", " If not set it takes automaically the reversed architecture of the encoder\n", " options : dict[str,Any], optional\n", " Options for the building blocks of the model, by default None.\n", " Available blocks: ['norm_in', 'encoder','decoder'].\n", " Set 'block_name' = None or False to turn off that block\n", " \"\"\"\n", " super().__init__(model=encoder_layers, **kwargs)\n", "\n", " # ======= OPTIONS ======= \n", " # parse and sanitize\n", " options = self.parse_options(options)\n", "\n", " # if decoder is not given reverse the encoder\n", " if decoder_layers is None:\n", " decoder_layers = encoder_layers[::-1]\n", " \n", " # ======= LOSS =======\n", " # Reconstruction (MSE) loss\n", " self.loss_fn = MSELoss()\n", "\n", " # ======= BLOCKS =======\n", "\n", " # initialize norm_in\n", " o = 'norm_in'\n", " if ( options[o] is not False ) and (options[o] is not None): # this allows to deactivate it\n", " self.norm_in = Normalization(self.in_features,**options[o]) \n", "\n", " # initialize encoder\n", " o = 'encoder'\n", " self.encoder = FeedForward(encoder_layers, **options[o])\n", "\n", " # initialize decoder\n", " o = 'decoder'\n", " self.decoder = FeedForward(decoder_layers, **options[o])\n", "\n", " def forward_cv(self, x: torch.Tensor) -> (torch.Tensor):\n", " if self.norm_in is not None:\n", " x = self.norm_in(x)\n", " x = self.encoder(x)\n", " return x\n", "\n", " def encode_decode(self, x: torch.Tensor) -> (torch.Tensor):\n", " x = self.forward(x)\n", " x = self.decoder(x)\n", " if self.norm_in is not None:\n", " x = self.norm_in.inverse(x)\n", " return x\n", " \n", " def training_step(self, train_batch, batch_idx):\n", " # =================get data===================\n", " x = train_batch['data']\n", " loss_kwargs = {}\n", " if 'weights' in train_batch:\n", " loss_kwargs['weights'] = train_batch['weights']\n", "\n", " # =================forward====================\n", " x_hat = self.encode_decode(x)\n", "\n", " # ===================loss=====================\n", " # Reference output (compare with a 'target' key, if any, otherwise with input 'data')\n", " if 'target' in train_batch:\n", " x_ref = train_batch['target']\n", " else:\n", " x_ref = x \n", " loss = self.loss_fn(x_hat, x_ref, **loss_kwargs)\n", " \n", " # ====================log===================== \n", " name = 'train' if self.training else 'valid' \n", " self.log(f'{name}_loss', loss, on_epoch=True)\n", " return loss" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Write test functions\n", "In order to ensure smooth functioning of the `mlcolvar` library , all the main functions have to be accompanied by proper testing functions which should be added in the tests folder. \n", "In their final form, these are mainly meant to ensure that the code is not crashing in the possible different settings and should be kept as generic and synthetic as possible." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "GPU available: True (cuda), used: True\n", "TPU available: False, using: 0 TPU cores\n", "IPU available: False, using: 0 IPUs\n", "HPU available: False, using: 0 HPUs\n", "LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " " ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/etrizio@iit.local/Bin/miniconda3/envs/mlcvs_test/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py:280: PossibleUserWarning: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=2). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.\n", " rank_zero_warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 0: 100%|██████████| 1/1 [00:00<00:00, 71.27it/s, v_num=32] " ] }, { "name": "stderr", "output_type": "stream", "text": [ "`Trainer.fit` stopped: `max_epochs=1` reached.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Epoch 0: 100%|██████████| 1/1 [00:00<00:00, 64.56it/s, v_num=32]\n" ] } ], "source": [ "def test_autoencodercv():\n", " from mlcolvar.data import DictDataset, DictModule\n", " import numpy as np\n", "\n", " in_features, out_features = 8,2\n", " layers = [in_features, 6, 4, out_features]\n", "\n", " # initialize via dictionary\n", " options = { 'norm_in' : None,\n", " 'encoder' : { 'activation' : 'relu' },\n", " 'optimizer' : {'lr' : 1e-3}\n", " } \n", " model = AutoEncoderCV( encoder_layers=layers, options=options )\n", "\n", " # train on synthetic dataset\n", " X = torch.randn(100,in_features) \n", " dataset = DictDataset({'data': X})\n", " datamodule = DictModule(dataset)\n", " trainer = lightning.Trainer(max_epochs=1, log_every_n_steps=2,logger=None, enable_checkpointing=False, enable_model_summary=False)\n", " trainer.fit( model, datamodule )\n", " model.eval()\n", " X_hat = model(X)\n", "\n", "if __name__ == \"__main__\":\n", " test_autoencodercv() " ] } ], "metadata": { "kernelspec": { "display_name": "graph_mlcolvar_test", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }