Creating datasets¶

Outline¶

In this tutorial you will learn how to organize the data used in the training process, and in particular the difference between:

datasets
dataloaders
datamodules

We will also look at some helper functions for creating:

datasets from COLVAR files
time-lagged datasets

In a nutshell:

datasets are objects which store the input data as well as additional quantities like labels or weights that are going to be used in the training.
dataloaders wrap an iterable around datasets to allow for easy access to data (as well as collating inputs into batches).
datamodules encapsulate all the steps needed to process data, e.g. splitting the datasets and creating the dataloaders.

Datasets¶

We subclassed torch.utils.data.Dataset into a DictDataset which stores the information inside a dictionary and returns a dictionary with the batched data when sliced.

The keys depend on the kind of learning task:

Unsupervised: "data" (optionally "weights")
Supervised
- Regression: "data", "target" (optionally "weights")
- Classification: "data", "labels"
Time-lagged: "data", "data_lag" (optionally "weights", "weights_lag")

The values can be either torch.Tensors, or np.arrays / lists, which are converted by passing them to the torch.Tensor() function.

[ ]:

# Colab setup
import os

if os.getenv("COLAB_RELEASE_TAG"):
    import subprocess
    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)
    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)
    print('Done!')

[6]:

import torch
from mlcolvar.data import DictDataset

# the constructor takes a dictionary as input.
n_samples, n_features = 100, 2
dataset = DictDataset({'data': torch.rand((n_samples, n_features)),
                       'target': torch.rand((n_samples,))
                       })

dataset

[6]:

DictDataset( "data": [100, 2], "target": [100] )

If the dataset is accessed with a string, it returns the corresponding value of the underlying dictionary. If it is accessed with an integer or a slice, it returns a dictionary in which every value has been indexed in the same way:

[7]:

# access with a key
print('dataset["data"] -->', dataset["data"].shape )
# access the 0-th element
print('\ndataset[0] =', dataset[0] )
# slice the dataset
print('\ndataset[0:3] =', dataset[0:3] )

dataset["data"] --> torch.Size([100, 2])

dataset[0] = {'data': tensor([0.0238, 0.6240]), 'target': tensor(0.3217)}

dataset[0:3] = {'data': tensor([[0.0238, 0.6240],
        [0.6782, 0.4476],
        [0.8055, 0.8887]]), 'target': tensor([0.3217, 0.6375, 0.5045])}
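To make the two access modes concrete, here is a minimal plain-Python sketch of the dispatch logic. TinyDictDataset is a hypothetical stand-in written for this illustration (it uses lists instead of tensors) and is not the actual implementation of DictDataset.

```python
class TinyDictDataset:
    """Plain-Python stand-in for a dictionary-backed dataset (illustration only)."""

    def __init__(self, dictionary):
        # every value must describe the same number of samples
        lengths = {len(value) for value in dictionary.values()}
        if len(lengths) != 1:
            raise ValueError("all values must have the same number of samples")
        self._dictionary = dict(dictionary)
        self._length = lengths.pop()

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # string -> return the stored value for that key
        if isinstance(index, str):
            return self._dictionary[index]
        # int or slice -> apply the same index to every value
        return {key: value[index] for key, value in self._dictionary.items()}


# toy values chosen only for the example
toy = TinyDictDataset({"data": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
                       "target": [10.0, 20.0, 30.0]})
by_key = toy["data"]   # the full list stored under "data"
first = toy[0]         # one sample, as a dictionary
head = toy[0:2]        # the first two samples, as a dictionary of lists
```

The key point is that integer and slice access keep the dictionary structure, so downstream code can always look up "data", "target", etc. by name.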

You can also add further keys to the dataset, e.g. if you want to give different weights to the data:

[8]:

dataset['weights'] = torch.rand(100)

dataset

[8]:

DictDataset( "data": [100, 2], "target": [100], "weights": [100] )

Dataloaders¶

The dataloaders wrap an iterable around the dataset so that the data can be easily collated into batches and used for training/validation. We subclassed torch.utils.data.DataLoader into a DictLoader, which takes a DictDataset as input. You can see further details in its documentation.

Typically the dataset is split across training and validation sets and then the dataloaders are created.

[9]:

from mlcolvar.data import DictLoader

# create train/valid dataloader
train_loader = DictLoader(dataset[:80],batch_size=40)
valid_loader = DictLoader(dataset[80:],batch_size=20)

train_loader

[9]:

DictLoader(length=80, batch_size=40, shuffle=True)
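As a rough picture of what "collating into batches" means for dictionary data, the following plain-Python sketch yields one dictionary per batch. The function iter_batches and its arguments are hypothetical names introduced for this illustration; it is not how DictLoader is implemented.

```python
import random


def iter_batches(dictionary, batch_size, shuffle=False, seed=None):
    """Yield dictionaries holding `batch_size` samples each (the last may be smaller)."""
    n_samples = len(next(iter(dictionary.values())))
    order = list(range(n_samples))
    if shuffle:
        # shuffle the sample order once per pass over the data
        random.Random(seed).shuffle(order)
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        # the same indices are used for every key, so samples stay aligned
        yield {key: [values[i] for i in idx] for key, values in dictionary.items()}


# toy data: 5 samples with 2 features each
toy = {"data": [[float(i), float(i) + 0.5] for i in range(5)],
       "target": [float(i) for i in range(5)]}

batches = list(iter_batches(toy, batch_size=2))
shuffled = list(iter_batches(toy, batch_size=2, shuffle=True, seed=0))
```

Each batch keeps the same keys as the dataset, which is why a training step can read the inputs and the targets of a batch by name.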

Datamodule¶

The lightning.LightningDataModule object can be used to simplify and organize the data-processing tasks described above. Here we subclassed it into a DictModule, which takes care of 1) shuffling, 2) splitting the dataset and 3) creating the dataloaders. Note that this is supposed to be used together with a lightning.Trainer.

[10]:

from mlcolvar.data import DictModule

# (1) lengths by fraction
datamodule = DictModule(dataset, lengths = [0.8,0.2], batch_size = 10 )
print('#1 --> ', datamodule )

# (2) lengths as number of elements
datamodule = DictModule(dataset, lengths = [75,20,5],
                        batch_size = [25,10,5],             # different batch sizes for each dataloader
                        shuffle = [True, False, False] )    # specify per-dataloader options

print('\n#2 --> ', datamodule )

#1 -->  DictModule(dataset -> DictDataset( "data": [100, 2], "target": [100], "weights": [100] ),
                     train_loader -> DictLoader(length=0.8, batch_size=10, shuffle=True),
                     valid_loader -> DictLoader(length=0.2, batch_size=10, shuffle=True))

#2 -->  DictModule(dataset -> DictDataset( "data": [100, 2], "target": [100], "weights": [100] ),
                     train_loader -> DictLoader(length=75, batch_size=25, shuffle=True),
                     valid_loader -> DictLoader(length=20, batch_size=10, shuffle=False),
                        test_loader =DictLoader(length=5, batch_size=5, shuffle=False))
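The two ways of specifying lengths (fractions or element counts) can be related with a small sketch. The helper resolve_lengths below is hypothetical; it follows one common convention for fractions (floor each share, then hand out the leftover samples one at a time), which is an assumption and not necessarily how DictModule resolves them internally.

```python
import math


def resolve_lengths(n_samples, lengths):
    """Turn fractions or element counts into a list of subset sizes."""
    # integer entries are interpreted as explicit element counts
    if all(isinstance(x, int) for x in lengths):
        if sum(lengths) > n_samples:
            raise ValueError("requested more elements than available")
        return list(lengths)
    # otherwise the entries are fractions that must sum to one
    if not math.isclose(sum(lengths), 1.0):
        raise ValueError("fractions must sum to 1")
    counts = [math.floor(n_samples * frac) for frac in lengths]
    # distribute the samples lost by flooring, one per subset in turn
    remainder = n_samples - sum(counts)
    for i in range(remainder):
        counts[i % len(counts)] += 1
    return counts


by_fraction = resolve_lengths(100, [0.8, 0.2])
by_count = resolve_lengths(100, [75, 20, 5])
uneven = resolve_lengths(10, [0.5, 0.25, 0.25])
```

With two lengths the subsets play the role of training and validation sets, while a third length adds a test set, as in the second example above.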

I/O helper functions¶

Creating datasets from file¶

It is of course possible to load the data from files (e.g. with the load_dataframe function) and then create a dataset. For convenience, we created a function create_dataset_from_files that can be used to create the dataset directly from files. This covers the following settings:

unsupervised learning: one or more files are merged together in an unlabeled dataset

[11]:

from mlcolvar.utils.io import create_dataset_from_files

filenames = [ "data/muller-brown/unbiased/high-temp/COLVAR" ]

# load data into dataset
dataset, df = create_dataset_from_files(filenames,
                                        create_labels=False,
                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using .filter method of Pandas dataframes
                                        return_dataframe=True) # return also the dataframe of the loaded files (not only the input data)

Class 0 dataframe shape:  (5001, 11)

 - Loaded dataframe (5001, 11): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2']
 - Descriptors (5001, 2): ['p.x', 'p.y']
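Since the regex in filter_args is forwarded to the .filter method of Pandas dataframes, which keeps the columns whose name contains a match of the pattern, it is worth checking which columns a pattern selects. The sketch below mimics that selection with the standard re module on the column names printed above; filter_columns is a hypothetical helper, and the extra names 'tmp_x' and 'pbx' are made up to show the effect of the unescaped dot.

```python
import re

# column names as printed in the output above
columns = ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias',
           'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2']


def filter_columns(names, regex):
    """Keep the names in which the pattern matches anywhere (re.search semantics)."""
    pattern = re.compile(regex)
    return [name for name in names if pattern.search(name)]


# pattern used in the tutorial: here it selects exactly the two descriptors
selected = filter_columns(columns, 'p.x|p.y')

# an unescaped dot matches any character, so unrelated names could slip in
loose = filter_columns(['p.x', 'tmp_x', 'pbx'], 'p.x|p.y')

# escaping the dot and anchoring the pattern makes the selection strict
strict = filter_columns(['p.x', 'tmp_x', 'pbx'], r'^p\.[xy]$')
```

For the COLVAR file used here both patterns give the same descriptors, but the stricter form is safer when other column names resemble the ones you want.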

[34]:

df.head(5)

[34]:

	time	p.x	p.y	ene	pot.bias	pot.ene_bias
0	0.0	0.500000	0.000000	6.580981	6.580981	6.580981
1	1.0	0.285803	0.351447	11.506740	11.506740	11.506740
2	2.0	-0.004293	0.590710	11.821637	11.821637	11.821637
3	3.0	-0.530208	0.714688	16.812886	16.812886	16.812886
4	4.0	-1.015236	0.978306	8.821514	8.821514	8.821514

classification: in this case each file contains samples of a different class

[38]:

from mlcolvar.utils.io import create_dataset_from_files

filenames = [ f"data/muller-brown/unbiased/state-{i}/COLVAR" for i in range(2) ]

# load data into dataset
dataset, df = create_dataset_from_files(filenames,
                                        create_labels=True,
                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using .filter method of Pandas dataframes
                                        return_dataframe=True) # return also the dataframe of the loaded files (not only the input data)

Class 0 dataframe shape:  (2001, 12)
Class 1 dataframe shape:  (2001, 12)

 - Loaded dataframe (4002, 12): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2', 'labels']
 - Descriptors (4002, 2): ['p.x', 'p.y']
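The output above shows that the loaded dataframe gains a 'labels' column when create_labels=True. Conceptually, each file is tagged with its position in the list of filenames before the rows are merged. A minimal plain-Python sketch of that idea follows; merge_with_labels is a hypothetical helper and the coordinates are toy values, not data from the COLVAR files.

```python
def merge_with_labels(tables):
    """Concatenate tables (lists of row dictionaries), tagging rows with the table index."""
    merged = []
    for label, rows in enumerate(tables):
        for row in rows:
            # copy the row and add the class label of the file it came from
            merged.append({**row, "labels": label})
    return merged


# toy rows standing in for the contents of two files, one per state
state_0 = [{"p.x": -0.5, "p.y": 1.4}, {"p.x": -0.6, "p.y": 1.5}]
state_1 = [{"p.x": 0.6, "p.y": 0.0}]

merged = merge_with_labels([state_0, state_1])
labels = [row["labels"] for row in merged]
```

The order of the filenames therefore determines which class index each state receives.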

Create time-lagged datasets¶

In time-lagged tasks, one does not deal with single configurations but rather with pairs of configurations \(\{x(t),x(t+\tau)\}\) which are separated in time by a lag time \(\tau\). The mlcolvar.utils.timelagged module contains some helper functions, in particular the function create_timelagged_dataset.

Notes:

If logweights are given (e.g. beta*bias) the search for time-lagged configurations will be performed in rescaled time [McCarthy and Parrinello, JCP 2017].
The resulting dataset will contain the keys ‘data’, ‘data_lag’ as well as ‘weights’ and ‘weights_lag’, where the weights are all equal to one in the unbiased case.
The actual search for time-lagged configurations is performed by the function find_time_lagged_configurations, which is not meant to be called directly.

[47]:

from mlcolvar.utils.timelagged import create_timelagged_dataset

X = torch.rand((100,20))
t = torch.arange(100)

# returns configurations at time t as well as time t+tau
dataset = create_timelagged_dataset(X, t,
                                    lag_time=10,
                                    logweights=None )

dataset

/Users/luigi/work/mlcolvar/mlcolvar/utils/timelagged.py:129: UserWarning: Monitoring the progress for the search of time-lagged configurations with a progress_bar requires `tqdm`.
  warnings.warn('Monitoring the progress for the search of time-lagged configurations with a progress_bar requires `tqdm`.')

[47]:

DictDataset( "data": [88, 20], "data_lag": [88, 20], "weights": [88], "weights_lag": [88] )
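To illustrate the idea of searching for pairs in rescaled time, here is a simplified plain-Python sketch. The function find_lagged_pairs is hypothetical, and it assumes that each time step is stretched by exp(logweight) of the frame at its end. It only returns index pairs and does not compute weights, so it is not the library's algorithm and the number of pairs need not match the output shown above.

```python
import bisect
import math


def find_lagged_pairs(times, lag_time, logweights=None):
    """Return index pairs (i, j) whose rescaled-time separation is at least lag_time."""
    n = len(times)
    if logweights is None:
        # unbiased case: rescaled time coincides with the original time
        logweights = [0.0] * n
    # cumulative rescaled time, each step stretched by exp(logweight)
    rescaled = [0.0]
    for k in range(1, n):
        dt = times[k] - times[k - 1]
        rescaled.append(rescaled[-1] + dt * math.exp(logweights[k]))
    pairs = []
    for i in range(n):
        # first later frame that is at least lag_time away in rescaled time
        j = bisect.bisect_left(rescaled, rescaled[i] + lag_time, lo=i + 1)
        if j < n:
            pairs.append((i, j))
    return pairs


times = list(range(100))
unbiased = find_lagged_pairs(times, lag_time=10)
# toy logweights: every step counts as exp(1) time units
biased = find_lagged_pairs(times, lag_time=10, logweights=[1.0] * 100)
```

In the biased toy case each step is worth about 2.7 time units, so a lag of 10 is already reached after 4 frames instead of 10.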