{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Pre/post processing\n", "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/luigibonati/mlcolvar/blob/main/docs/notebooks/tutorials/adv_preprocessing.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial shows how to add pre- or postprocessing modules to CVs. The idea is that into these modules go any operations that should not be performed in training, but only in inference. This provides additional flexibility that can come in handy, for example, in the following cases:\n", "- apply preprocessing to the data to avoid having to do it at each step, and at the same time save it in the model so that it is performed in the prediction phase, e.g., in PLUMED\n", "- apply postprocessing after the training is finished, for example, to normalize the output CV" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Colab setup\n", "import os\n", "\n", "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", " import subprocess\n", " subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)\n", " cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)\n", " print(cmd.stdout.decode('utf-8'))\n", " \n", "import torch\n", "import mlcolvar\n", "import numpy as np" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## BaseCV class" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Note that the `BaseCV` class implements the forward method in the following way:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def forward(self, x : torch.Tensor) -> torch.Tensor:\n", " 
\"\"\"\n", " Evaluation of the CV\n", " - Apply preprocessing if any\n", " - Execute the forward_cv method\n", " - Apply postprocessing if any\n", " \"\"\"\n", " \n", " if self.preprocessing is not None:\n", " x = self.preprocessing(x)\n", "\n", " x = self.forward_cv(x)\n", "\n", " if self.postprocessing is not None:\n", " x = self.postprocessing(x)\n", "\n", " return x" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As explained in the tutorial on implementing CVs from scratch, \n", "- the `forward` method is supposed to be called during inference\n", "- the `forward_cv` method is called from `training_step`, and is the one which is re-implemented by the various subclasses" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Pre-processing" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Assume we have a dataset on which we want to apply a preprocessing operation. In general we can define this operation as:\n", "\n", "- (a) a module implemented in the library (such as `mlcolvar.core.transform` or `mlcolvar.core.stats` objects)\n", "- (b) a generic class that inherits from the `torch.nn.Module` class (including `torch.nn.Sequential` to concatenate more transformations) \n", "- (c) a generic function that takes as input a `torch.Tensor` and returns another `torch.Tensor`. \n", "\n", "If the dimensionality of the inputs remains unchanged following the transformation, all three cases work without any other changes. Otherwise, there must be an `in_features` member that specifies the initial input size which is used to correctly concatenate the model. \n", "This is already present in all objects in (a), it must be added for those in (b), while it cannot be used in the case of python functions (c). 
" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Once we have defined the preprocessing, we need to:\n", "- apply it to the data before creating the Dataset/Datamodule\n", "- save into the model. This can be done either by passing it to the `preprocessing` keyword in the costructor or saving it into the `preprocessing` member after initialization. \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Using a mlcolvar object as preprocessing" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this example we show how to use a `mlcolvar` module, and in particular Principal Component Analysis (PCA) to reduce the dimensionality of the inputs. We first define the preprocessing and compute the 2 principal components out of a 10-d dataset." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from mlcolvar.core.stats import PCA\n", "\n", "# create synthetic dataset\n", "n_input = 10\n", "X = torch.rand(100,n_input)\n", "y = X.square().sum(1)\n", "\n", "# compute PCA\n", "n_pca = 2\n", "\n", "pca = PCA(in_features=n_input, out_features=n_pca)\n", "_ = pca.compute(X)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Then we can apply it to the dataset to get the pre-processed data and create the datamodule" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DictDataset( \"data\": [100, 2], \"target\": [100] )" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlcolvar.data import DictDataset\n", "\n", "X_pre = pca(X)\n", "\n", "DictDataset(dict(data=X_pre,target=y))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "And save it into the model, here a `RegressionCV`. 
Note that the input of the CV needs to be equal to 2 now, since we are going to apply it to the pre-processed dataset" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/lbonati@iit.local/software/anaconda3/envs/pytorch2.0/lib/python3.10/site-packages/lightning/pytorch/utilities/parsing.py:197: UserWarning: Attribute 'preprocessing' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['preprocessing'])`.\n", " rank_zero_warn(\n" ] }, { "data": { "text/plain": [ "RegressionCV(\n", " (preprocessing): PCA(in_features=10, out_features=2)\n", " (loss_fn): MSELoss()\n", " (norm_in): Normalization(in_features=2, out_features=2, mode=mean_std)\n", " (nn): FeedForward(\n", " (nn): Sequential(\n", " (0): Linear(in_features=2, out_features=10, bias=True)\n", " (1): ReLU(inplace=True)\n", " (2): Linear(in_features=10, out_features=10, bias=True)\n", " (3): ReLU(inplace=True)\n", " (4): Linear(in_features=10, out_features=1, bias=True)\n", " )\n", " )\n", ")" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlcolvar.cvs import RegressionCV\n", "\n", "model = RegressionCV(model=[2,10,10,1], \n", " preprocessing = pca ) \n", "\n", "# the preprocessing can also be saved later, like in:\n", "# model.preprocessing = pca\n", "\n", "model" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For inference, we can either call `forward` on the raw original data (which is what is exported to TorchScript) or call `forward_cv` on the pre-processed data (which is what is executed during training). The two calls return the same result." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_pred = model.forward(X) # equivalent to model(X)\n", "y_pred_pre = model.forward_cv(X_pre)\n", "\n", "torch.allclose(y_pred,y_pred_pre)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Post-processing" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, one might want to apply some post-processing operations, typically after the training is completed. Here we use this feature to rescale the CV output such that it lies in the range between -1 and 1. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from mlcolvar.cvs import AutoEncoderCV\n", "\n", "model = AutoEncoderCV(encoder_layers=[10,5,1])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Compute the statistics of the CV output (mean, standard deviation, min and max), from which the `Normalization` class obtains the values to subtract and divide by." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mean': tensor([-0.2367]),\n", " 'std': tensor([0.0248]),\n", " 'min': tensor([-0.3403]),\n", " 'max': tensor([-0.1554])}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlcolvar.core.transform import Statistics\n", "\n", "with torch.no_grad():\n", " y_pred = model(X)\n", " \n", "stats = Statistics(y_pred).to_dict()\n", "stats" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Define a `Normalization` object based on these values and `mode=min_max`. Note that, in order to standardize the outputs such that the mean is 0 and the standard deviation is 1, you should use `mode=mean_std` instead." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from mlcolvar.core.transform import Normalization\n", "\n", "norm = Normalization(in_features=1,\n", " stats=stats, mode='min_max')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can save it as postprocessing in the CV object, and check that it is applied when calling the `forward` method." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mean': tensor([0.1210]),\n", " 'std': tensor([0.2687]),\n", " 'min': tensor([-1.]),\n", " 'max': tensor([1.])}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.postprocessing = norm\n", "\n", "with torch.no_grad():\n", " y_pred_post = model(X)\n", " \n", "stats = Statistics(y_pred_post).to_dict()\n", "stats" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "That's it! Now the outputs of the CV will be rescaled such that the min and max over the training set are equal to -1 and 1.\n", "\n", "Note: if you would like to reset the pre-/post-processing modules, you can just set them to `None`." ] } ], "metadata": { "kernelspec": { "display_name": "pytorch", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "1cbeac1d7079eaeba64f3210ccac5ee24400128e300a45ae35eee837885b08b3" } } }, "nbformat": 4, "nbformat_minor": 2 }