Creating datasets

Outline

In this tutorial you will learn how to organize the data used in the training process, and in particular the difference between:

  • datasets

  • dataloaders

  • datamodules

We will also look at some helper functions for creating:

  • datasets from COLVAR files

  • time-lagged datasets

In a nutshell:

  • datasets are objects which store the input data as well as additional quantities like labels or weights that are going to be used in the training.

  • dataloaders wrap an iterable around datasets to allow for easy access to data (as well as collating inputs into batches).

  • datamodules encapsulate all the steps needed to process data, e.g. split the datasets and create dataloaders

Datasets

We subclassed torch.utils.data.Dataset into a DictDataset which stores the information inside a dictionary and returns a dictionary with the batched data when sliced.

The keys depend on the kind of learning task:

  • Unsupervised: "data" (, "weights")

  • Supervised

    • Regression: "data", "target" (, "weights")

    • Classification: "data", "labels"

  • Time-lagged: "data", "data_lag" (, "weights", "weights_lag")

The values can be either torch.Tensors or NumPy arrays / lists, which are converted by passing them to the torch.Tensor() function.

[ ]:
# Colab setup
import os

if os.getenv("COLAB_RELEASE_TAG"):
    import subprocess
    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)
    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)
    print('Done!')
[6]:
import torch
from mlcolvar.data import DictDataset

# the constructor takes a dictionary as input.
n_samples, n_features = 100, 2
dataset = DictDataset({'data': torch.rand((n_samples,n_features)),
                             'target': torch.rand((n_samples,))
                             })

dataset
[6]:
DictDataset( "data": [100, 2], "target": [100] )

If the dataset is accessed with a string, it returns the corresponding value of the underlying dictionary. If it is accessed with an integer or a slice, it returns a dictionary in which every entry has been indexed accordingly:

[7]:
# access with a key
print('dataset["data"] -->', dataset["data"].shape )
# access the 0-th element
print('\ndataset[0] =', dataset[0] )
# slice the dataset
print('\ndataset[0:3] =', dataset[0:3] )
dataset["data"] --> torch.Size([100, 2])

dataset[0] = {'data': tensor([0.0238, 0.6240]), 'target': tensor(0.3217)}

dataset[0:3] = {'data': tensor([[0.0238, 0.6240],
        [0.6782, 0.4476],
        [0.8055, 0.8887]]), 'target': tensor([0.3217, 0.6375, 0.5045])}
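The two access modes can be pictured with a minimal plain-Python sketch. The TinyDictDataset class below is a hypothetical, list-based stand-in written only for illustration; it is not the DictDataset implementation, which stores tensors.

```python
class TinyDictDataset:
    """Hypothetical list-based stand-in for a dictionary-backed dataset."""

    def __init__(self, dictionary):
        # every entry must describe the same number of samples
        lengths = {len(values) for values in dictionary.values()}
        if len(lengths) != 1:
            raise ValueError("all entries must have the same number of samples")
        self._dict = {key: list(values) for key, values in dictionary.items()}
        self._n = lengths.pop()

    def __len__(self):
        return self._n

    def __getitem__(self, index):
        # a string selects one entry of the underlying dictionary
        if isinstance(index, str):
            return self._dict[index]
        # an integer or a slice is applied to every entry
        return {key: values[index] for key, values in self._dict.items()}


toy = TinyDictDataset({"data": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
                       "target": [10.0, 20.0, 30.0]})

by_key = toy["data"]   # the full "data" entry
first = toy[0]         # one sample, as a dictionary
head = toy[0:2]        # the first two samples, as a dictionary
```

Dispatching on the type of the index keeps a single entry point for both "give me this field" and "give me these samples", which is the behaviour the cell above demonstrates.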

You can also add further keys to the dataset, e.g. if you want to give different weights to the data:

[8]:
dataset['weights'] = torch.rand(100)

dataset
[8]:
DictDataset( "data": [100, 2], "target": [100], "weights": [100] )

Dataloaders

The dataloaders wrap an iterable around the dataset so that the data can be easily collated into batches and used for training/validation. We subclassed torch.utils.data.DataLoader into a DictLoader which takes a DictDataset as input. You can see further details in its documentation.

Typically the dataset is split across training and validation sets and then the dataloaders are created.

[9]:
from mlcolvar.data import DictLoader

# create train/valid dataloader
train_loader = DictLoader(dataset[:80],batch_size=40)
valid_loader = DictLoader(dataset[80:],batch_size=20)

train_loader
[9]:
DictLoader(length=80, batch_size=40, shuffle=True)
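What "collating into batches" means can be sketched without any library. The iterate_batches generator below is a hypothetical helper (not part of mlcolvar) that works on a plain dictionary of equal-length lists; the seed argument is an assumption added to make the shuffling reproducible.

```python
import random


def iterate_batches(dataset, batch_size, shuffle=True, seed=0):
    """Yield dictionaries of batched lists from a dict of equal-length lists."""
    n_samples = len(next(iter(dataset.values())))
    order = list(range(n_samples))
    if shuffle:
        # shuffle the sample indices once, then cut them into batches
        random.Random(seed).shuffle(order)
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        # the same indices are used for every key, so entries stay aligned
        yield {key: [values[i] for i in idx] for key, values in dataset.items()}


toy = {"data": [[float(i), float(-i)] for i in range(80)],
       "target": [float(i) for i in range(80)]}

batches = list(iterate_batches(toy, batch_size=40))
batch_sizes = [len(batch["data"]) for batch in batches]
```

With 80 samples and a batch size of 40, one pass over the generator yields two batches, and each batch is again a dictionary with the same keys as the dataset.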

Datamodule

The lightning.LightningDataModule object can be used to simplify and organize the data-processing tasks described above. Here we subclassed it into a DictModule which takes care of 1) shuffling, 2) splitting the dataset, and 3) creating the dataloaders. Note that it is meant to be used together with a lightning.Trainer.

[10]:
from mlcolvar.data import DictModule

# (1) lengths by fraction
datamodule = DictModule(dataset, lengths = [0.8,0.2], batch_size = 10 )
print('#1 --> ', datamodule )

# (2) lengths as number of elements
datamodule = DictModule(dataset, lengths = [75,20,5],
                                    batch_size = [25,10,5],             # different batch sizes for each dataloader
                                    shuffle = [True, False, False] )    # specify per-dataloader options

print('\n#2 --> ', datamodule )
#1 -->  DictModule(dataset -> DictDataset( "data": [100, 2], "target": [100], "weights": [100] ),
                     train_loader -> DictLoader(length=0.8, batch_size=10, shuffle=True),
                     valid_loader -> DictLoader(length=0.2, batch_size=10, shuffle=True))

#2 -->  DictModule(dataset -> DictDataset( "data": [100, 2], "target": [100], "weights": [100] ),
                     train_loader -> DictLoader(length=75, batch_size=25, shuffle=True),
                     valid_loader -> DictLoader(length=20, batch_size=10, shuffle=False),
                        test_loader =DictLoader(length=5, batch_size=5, shuffle=False))
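The two ways of specifying lengths (fractions or sample counts) can be reduced to a list of subset sizes. How DictModule performs this conversion internally is not shown in this tutorial, so the resolve_lengths helper below is a hypothetical sketch of one common convention: take the floor of each fraction times the dataset size and hand out the leftover samples one at a time, starting from the first subset.

```python
import math


def resolve_lengths(lengths, n_samples):
    """Turn fractions (floats summing to 1) or counts (ints) into subset sizes."""
    if all(isinstance(length, int) for length in lengths):
        sizes = list(lengths)
    else:
        if not math.isclose(sum(lengths), 1.0):
            raise ValueError("fractions must sum to 1")
        sizes = [math.floor(fraction * n_samples) for fraction in lengths]
        # distribute the samples lost to rounding, one per subset in turn
        for i in range(n_samples - sum(sizes)):
            sizes[i % len(sizes)] += 1
    if sum(sizes) > n_samples:
        raise ValueError("requested more samples than available")
    return sizes


by_fraction = resolve_lengths([0.8, 0.2], n_samples=100)
by_count = resolve_lengths([75, 20, 5], n_samples=100)
with_remainder = resolve_lengths([0.5, 0.25, 0.25], n_samples=10)
```

In the last call the floors are 5, 2 and 2, which leaves one sample over; under this convention it goes to the first subset.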

I/O helper functions

Creating datasets from file

It is of course possible to load the data from files (e.g. with the load_dataframe function) and then create a dataset. For convenience, we provide a function create_dataset_from_files that creates the dataset directly from files. This covers the following settings:

  1. unsupervised learning: one or more files are merged together in an unlabeled dataset

[11]:
from mlcolvar.io import create_dataset_from_files

filenames = [ "data/muller-brown/unbiased/high-temp/COLVAR" ]

# load data into dataset
dataset, df = create_dataset_from_files(filenames,
                                        create_labels=False,
                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using .filter method of Pandas dataframes
                                        return_dataframe=True) # return also the dataframe of the loaded files (not only the input data)
Class 0 dataframe shape:  (5001, 11)

 - Loaded dataframe (5001, 11): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2']
 - Descriptors (5001, 2): ['p.x', 'p.y']
[34]:
df.head(5)
[34]:
   time        p.x       p.y  p.z        ene   pot.bias  pot.ene_bias  lwall.bias  lwall.force2  uwall.bias  uwall.force2
0   0.0   0.500000  0.000000  0.0   6.580981   6.580981      6.580981         0.0           0.0         0.0           0.0
1   1.0   0.285803  0.351447  0.0  11.506740  11.506740     11.506740         0.0           0.0         0.0           0.0
2   2.0  -0.004293  0.590710  0.0  11.821637  11.821637     11.821637         0.0           0.0         0.0           0.0
3   3.0  -0.530208  0.714688  0.0  16.812886  16.812886     16.812886         0.0           0.0         0.0           0.0
4   4.0  -1.015236  0.978306  0.0   8.821514   8.821514      8.821514         0.0           0.0         0.0           0.0

  2. classification: in this case each file contains samples of a different class

[38]:
from mlcolvar.io import create_dataset_from_files

filenames = [ f"data/muller-brown/unbiased/state-{i}/COLVAR" for i in range(2) ]

# load data into dataset
dataset, df = create_dataset_from_files(filenames,
                                        create_labels=True,
                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using .filter method of Pandas dataframes
                                        return_dataframe=True) # return also the dataframe of the loaded files (not only the input data)
Class 0 dataframe shape:  (2001, 12)
Class 1 dataframe shape:  (2001, 12)

 - Loaded dataframe (4002, 12): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2', 'labels']
 - Descriptors (4002, 2): ['p.x', 'p.y']
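To make the classification setting concrete, the steps (read each file, keep the columns matching a regular expression, label the samples by file index) can be sketched with the standard library only. Everything below is illustrative: read_colvar and build_labeled_dataset are hypothetical helpers, not the mlcolvar functions, and the two in-memory "files" contain made-up numbers. The sketch assumes the PLUMED convention that column names are listed on a header line starting with "#! FIELDS".

```python
import io
import re


def read_colvar(handle):
    """Parse a PLUMED-style COLVAR text stream into (field names, rows)."""
    fields, rows = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#!"):
            tokens = line.split()
            # the "#! FIELDS ..." header lists the column names
            if len(tokens) > 1 and tokens[1] == "FIELDS":
                fields = tokens[2:]
            continue
        rows.append([float(value) for value in line.split()])
    if fields is None:
        raise ValueError("no '#! FIELDS' header found")
    return fields, rows


def build_labeled_dataset(handles, regex):
    """Merge several streams, labelling each sample with its stream index."""
    data, labels = [], []
    for label, handle in enumerate(handles):
        fields, rows = read_colvar(handle)
        # keep only the columns whose name matches the regular expression
        keep = [i for i, name in enumerate(fields) if re.search(regex, name)]
        for row in rows:
            data.append([row[i] for i in keep])
            labels.append(label)
    return {"data": data, "labels": labels}


# toy streams with made-up values, standing in for two COLVAR files
state_0 = io.StringIO("#! FIELDS time p.x p.y p.z\n0.0 -0.5 1.5 0.0\n1.0 -0.6 1.4 0.0\n")
state_1 = io.StringIO("#! FIELDS time p.x p.y p.z\n0.0 0.6 0.0 0.0\n")

toy = build_labeled_dataset([state_0, state_1], regex=r"p\.x|p\.y")
```

Note that in a regular expression an unescaped "." matches any character, so the pattern 'p.x|p.y' used in the cells above would also select a column named, say, "pax". Escaping the dot, as in the sketch, restricts the match to the literal names.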

Create time-lagged datasets

In time-lagged tasks, one deals not with single configurations but with pairs of configurations \(\{x(t),x(t+\tau)\}\) separated in time by a lag time \(\tau\). The mlcolvar.utils.timelagged module contains some helper functions for this, in particular the function create_timelagged_dataset.

Notes:

  • If logweights are given (e.g. beta*bias) the search for time-lagged configurations will be performed in rescaled time [McCarthy and Parrinello, JCP 2017].

  • The resulting dataset will contain the keys ‘data’, ‘data_lag’ as well as ‘weights’ and ‘weights_lag’, where the weights are all equal to one in the unbiased case.

  • The actual search for time-lagged configurations is performed by the function find_time_lagged_configurations, which is not meant to be called directly.
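The idea of rescaled time in the first note can be illustrated with a few lines of plain Python: each time step is stretched by the exponential of the corresponding logweight (e.g. beta*bias), so frames sampled under a large bias count for more time. The rescaled_time function below is a hypothetical sketch, not the routine used by mlcolvar; in particular, weighting each step by the logweight of the frame that ends it is an assumption made here for simplicity.

```python
import math


def rescaled_time(times, logweights):
    """Accumulate dt' = exp(logweight) * dt along a trajectory."""
    t_prime = [0.0]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # assumption: the step ending at frame i is weighted by logweights[i]
        t_prime.append(t_prime[-1] + math.exp(logweights[i]) * dt)
    return t_prime


times = [0.0, 1.0, 2.0, 3.0]

# unbiased case: zero logweights leave the time axis unchanged
unbiased = rescaled_time(times, [0.0, 0.0, 0.0, 0.0])

# biased case: frames with logweight ln(2) count for twice the time
biased = rescaled_time(times, [0.0, math.log(2), math.log(2), 0.0])
```

A fixed lag time in rescaled time therefore corresponds to a varying number of frames in simulation time, which is why the pairs have to be searched for rather than obtained by a constant index shift.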

[47]:
from mlcolvar.utils.timelagged import create_timelagged_dataset

X = torch.rand((100,20))
t = torch.arange(100)

# returns configurations at time t as well as time t+tau
dataset = create_timelagged_dataset(X, t,
                                    lag_time=10,
                                    logweights=None )

dataset
/Users/luigi/work/mlcolvar/mlcolvar/utils/timelagged.py:129: UserWarning: Monitoring the progress for the search of time-lagged configurations with a progress_bar requires `tqdm`.
  warnings.warn('Monitoring the progress for the search of time-lagged configurations with a progress_bar requires `tqdm`.')
[47]:
DictDataset( "data": [88, 20], "data_lag": [88, 20], "weights": [88], "weights_lag": [88] )
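For an unbiased trajectory sampled at a constant stride, the pairing itself can be pictured as a simple index shift. The naive_timelagged_pairs function below is a hypothetical plain-Python sketch, not the algorithm behind create_timelagged_dataset; it assumes the lag is given as a whole number of frames and assigns unit weights to every pair.

```python
def naive_timelagged_pairs(frames, lag_steps):
    """Pair frame i with frame i + lag_steps (uniform time spacing assumed)."""
    if not 0 < lag_steps < len(frames):
        raise ValueError("lag_steps must be between 1 and len(frames) - 1")
    n_pairs = len(frames) - lag_steps
    return {
        "data": frames[:-lag_steps],      # x(t)
        "data_lag": frames[lag_steps:],   # x(t + tau)
        "weights": [1.0] * n_pairs,       # all ones in the unbiased case
        "weights_lag": [1.0] * n_pairs,
    }


# 100 one-dimensional toy frames, labelled by their index
frames = [[float(i)] for i in range(100)]
pairs = naive_timelagged_pairs(frames, lag_steps=10)
n_pairs = len(pairs["data"])
```

This naive shift produces one pair per frame that still has a partner lag_steps later, i.e. len(frames) - lag_steps pairs. That count need not coincide with the number of samples returned by create_timelagged_dataset (88 in the output above), whose search is organized differently.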