Pre/post processing¶
This tutorial shows how to add pre- or postprocessing modules to CVs. These modules hold any operations that should be performed only at inference time and not during training. This additional flexibility comes in handy, for example, in the following cases:
apply preprocessing to the data to avoid having to do it at each step, and at the same time save it in the model so that it is performed in the prediction phase, e.g., in PLUMED
apply postprocessing after the training is finished, for example, to normalize the output CV
Setup¶
[7]:
# Colab setup
import os
if os.getenv("COLAB_RELEASE_TAG"):
    import subprocess
    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)
    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)
    print(cmd.stdout.decode('utf-8'))
import torch
import mlcolvar
import numpy as np
BaseCV class¶
Note that the BaseCV class implements the forward method in the following way:
[ ]:
def forward(self, x : torch.Tensor) -> torch.Tensor:
    """
    Evaluation of the CV
    - Apply preprocessing if any
    - Execute the forward_cv method
    - Apply postprocessing if any
    """
    if self.preprocessing is not None:
        x = self.preprocessing(x)
    x = self.forward_cv(x)
    if self.postprocessing is not None:
        x = self.postprocessing(x)
    return x
As explained in the tutorial on implementing CVs from scratch:
the forward method is supposed to be called during inference
the forward_cv method is called from training_step, and is the one which is re-implemented by the various subclasses
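The split between forward and forward_cv can be illustrated with a small stand-alone module. This is only a sketch that mimics the logic shown above in plain PyTorch: the class ToyCV, the function center, and the layer sizes are hypothetical, and the class does not inherit from the library's BaseCV.

```python
import torch


class ToyCV(torch.nn.Module):
    """Hypothetical stand-in for a CV, mimicking the forward/forward_cv split."""

    def __init__(self, n_in: int, preprocessing=None, postprocessing=None):
        super().__init__()
        self.preprocessing = preprocessing
        self.postprocessing = postprocessing
        self.nn = torch.nn.Linear(n_in, 1)

    def forward_cv(self, x: torch.Tensor) -> torch.Tensor:
        # Core of the CV: this is what a training step would call
        return self.nn(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inference path: pre-processing -> CV -> post-processing
        if self.preprocessing is not None:
            x = self.preprocessing(x)
        x = self.forward_cv(x)
        if self.postprocessing is not None:
            x = self.postprocessing(x)
        return x


def center(x: torch.Tensor) -> torch.Tensor:
    # Example pre-processing: remove the mean of each sample (row)
    return x - x.mean(dim=1, keepdim=True)


torch.manual_seed(0)
x_raw = torch.rand(5, 3)
toy = ToyCV(n_in=3, preprocessing=center)

with torch.no_grad():
    out_inference = toy(x_raw)                    # forward on the raw data
    out_training = toy.forward_cv(center(x_raw))  # forward_cv on pre-processed data
```

Both calls give the same values, because forward simply wraps forward_cv with the optional pre- and post-processing steps.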
Pre-processing¶
Assume we have a dataset on which we want to apply a preprocessing operation. In general we can define this operation as:
(a) a module implemented in the library (such as mlcolvar.core.transform or mlcolvar.core.stats objects)
(b) a generic class that inherits from the torch.nn.Module class (including torch.nn.Sequential to concatenate more transformations)
(c) a generic function that takes as input a torch.Tensor and returns another torch.Tensor.
If the transformation leaves the dimensionality of the inputs unchanged, all three cases work without any other changes. Otherwise, the preprocessing object must have an in_features member that specifies the size of the raw input, which is needed to correctly chain it with the model. This member is already present in all objects in (a) and must be added to those in (b), while it cannot be provided for plain Python functions (c).
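As a sketch of cases (b) and (c), the snippet below defines a hypothetical torch.nn.Module that changes the input dimensionality and therefore carries an in_features member, next to a plain function that keeps the dimensionality unchanged. The names SelectFeatures and take_sine are made up for illustration, and storing out_features is an extra assumption; only in_features is stated above as required.

```python
import torch


class SelectFeatures(torch.nn.Module):
    """Hypothetical pre-processing (case b): keep only a subset of the input columns."""

    def __init__(self, in_features: int, indices: list):
        super().__init__()
        # Size of the raw input, needed because the output size differs from it
        self.in_features = in_features
        # Stored for convenience only (an assumption, not a stated requirement)
        self.out_features = len(indices)
        self.register_buffer("indices", torch.tensor(indices, dtype=torch.long))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x[:, self.indices]


def take_sine(x: torch.Tensor) -> torch.Tensor:
    """Plain-function pre-processing (case c): same number of features in and out."""
    return torch.sin(x)


X_raw = torch.rand(100, 10)
select = SelectFeatures(in_features=10, indices=[0, 3, 7])
X_sel = select(X_raw)     # shape (100, 3): dimensionality changed
X_sin = take_sine(X_raw)  # shape (100, 10): dimensionality unchanged
```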
Once we have defined the preprocessing, we need to:
apply it to the data before creating the Dataset/Datamodule
save it into the model. This can be done either by passing it to the preprocessing keyword in the constructor or by saving it into the preprocessing member after initialization.
Using a mlcolvar object as preprocessing¶
In this example we use a mlcolvar module, namely Principal Component Analysis (PCA), to reduce the dimensionality of the inputs. We first define the preprocessing and compute the first 2 principal components of a 10-dimensional dataset.
[8]:
from mlcolvar.core.stats import PCA
# create synthetic dataset
n_input = 10
X = torch.rand(100,n_input)
y = X.square().sum(1)
# compute PCA
n_pca = 2
pca = PCA(in_features=n_input, out_features=n_pca)
_ = pca.compute(X)
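For intuition, the kind of projection a PCA pre-processing performs can be sketched in plain PyTorch. This is not the library's implementation, only the textbook procedure (centering followed by projection onto the leading eigenvectors of the covariance matrix); all variable names are illustrative.

```python
import torch

torch.manual_seed(0)
X_demo = torch.rand(100, 10)

# Center the data and build the (10 x 10) covariance matrix
mean = X_demo.mean(dim=0, keepdim=True)
X_centered = X_demo - mean
cov = X_centered.T @ X_centered / (X_demo.shape[0] - 1)

# Eigen-decomposition of the symmetric covariance matrix;
# torch.linalg.eigh returns the eigenvalues in ascending order
eigvals, eigvecs = torch.linalg.eigh(cov)

# Keep the two directions with the largest variance, largest first
components = eigvecs[:, -2:].flip(dims=[1])  # shape (10, 2)
X_proj = X_centered @ components             # shape (100, 2)
```

The variance of each projected column equals the corresponding eigenvalue, so the first column carries the largest share of the variance.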
Then we can apply it to the data to get the pre-processed inputs and create the dataset:
[9]:
from mlcolvar.data import DictDataset
X_pre = pca(X)
DictDataset(dict(data=X_pre,target=y))
[9]:
DictDataset( "data": [100, 2], "target": [100] )
Next, we save it into the model, here a RegressionCV. Note that the input size of the CV must now be equal to 2, since the network is applied to the pre-processed data.
[12]:
from mlcolvar.cvs import RegressionCV
model = RegressionCV(model=[2,10,10,1],
preprocessing = pca )
# the preprocessing can also be saved later, like in:
# model.preprocessing = pca
model
[12]:
RegressionCV(
  (preprocessing): PCA(in_features=10, out_features=2)
  (loss_fn): MSELoss()
  (norm_in): Normalization(in_features=2, out_features=2, mode=mean_std)
  (nn): FeedForward(
    (nn): Sequential(
      (0): Linear(in_features=2, out_features=10, bias=True)
      (1): ReLU(inplace=True)
      (2): Linear(in_features=10, out_features=10, bias=True)
      (3): ReLU(inplace=True)
      (4): Linear(in_features=10, out_features=1, bias=True)
    )
  )
)
For inference, we can either call forward on the raw data (which is what is exported to TorchScript) or forward_cv on the pre-processed data (which is what is executed during training). The two give the same result:
[11]:
y_pred = model.forward(X) #equivalent to model(X)
y_pred_pre = model.forward_cv(X_pre)
torch.allclose(y_pred,y_pred_pre)
[11]:
True
Post-processing¶
Similarly, one might want to apply some post-processing operations, typically after the training is completed. Here we use this feature to rescale the CV output so that it lies in the range between -1 and 1.
[15]:
from mlcolvar.cvs import AutoEncoderCV
model = AutoEncoderCV(encoder_layers=[10,5,1])
Compute the statistics of the CV output (mean, standard deviation, minimum, and maximum) with the Statistics class; the Normalization class uses them to shift and rescale the output.
[23]:
from mlcolvar.core.transform import Statistics
with torch.no_grad():
    y_pred = model(X)
stats = Statistics(y_pred).to_dict()
stats
[23]:
{'mean': tensor([-0.2367]),
'std': tensor([0.0248]),
'min': tensor([-0.3403]),
'max': tensor([-0.1554])}
Define a Normalization object based on these values with mode=min_max. Note that to standardize the outputs so that the mean is 0 and the standard deviation is 1, you should use mode=mean_std instead.
[24]:
from mlcolvar.core.transform import Normalization
norm = Normalization(in_features=1,
stats=stats, mode='min_max')
Finally, we can save it as postprocessing in the CV object, and test whether it is working when calling the forward method.
[27]:
model.postprocessing = norm
with torch.no_grad():
    y_pred_post = model(X)
stats = Statistics(y_pred_post).to_dict()
stats
[27]:
{'mean': tensor([0.1210]),
'std': tensor([0.2687]),
'min': tensor([-1.]),
'max': tensor([1.])}
That’s it! Now the outputs of the CV will be rescaled such that the min and max over the training set are equal to -1 and 1.
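The rescaling can be checked by hand. The only increasing linear map that sends the minimum to -1 and the maximum to 1 subtracts the midpoint of the range and divides by its half-width. The sketch below applies it to the statistics printed before the post-processing was added. It is written in plain Python for illustration and is assumed, not verified, to match how Normalization implements mode=min_max internally.

```python
# Statistics of the raw CV output, copied from the values printed above
raw_min, raw_max, raw_mean = -0.3403, -0.1554, -0.2367

# Midpoint and half-width of the observed range
center = (raw_max + raw_min) / 2.0
half_range = (raw_max - raw_min) / 2.0


def rescale(value: float) -> float:
    """Map [raw_min, raw_max] linearly onto [-1, 1]."""
    return (value - center) / half_range


print(rescale(raw_min))   # -1 up to floating-point rounding
print(rescale(raw_max))   # 1 up to floating-point rounding
print(rescale(raw_mean))  # about 0.12, consistent with the rescaled mean reported above
```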
Note: if you would like to reset the pre-/post-processing modules, you can just set them to None.