{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Creating datasets\n", "\n", "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/luigibonati/mlcolvar/blob/main/docs/notebooks/tutorials/intro_2_data.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Outline" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial you will learn how to organize the data used in the training process, and in particular the difference between:\n", "\n", "- datasets\n", "- dataloaders\n", "- datamodules\n", "\n", "We will also look at some helper functions for creating:\n", "\n", "- datasets from COLVAR files\n", "- time-lagged datasets" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In a nutshell:\n", "- [datasets](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) are objects that store the input data as well as additional quantities, such as labels or weights, that are used in the training.\n", "- [dataloaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) wrap an iterable around datasets to allow for easy access to the data (as well as collating inputs into batches).\n", "- [datamodules](https://pytorch-lightning.readthedocs.io/en/1.8.1/data/datamodule.html) encapsulate all the steps needed to process the data, e.g. splitting the datasets and creating the dataloaders." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Datasets" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We subclassed `torch.utils.data.Dataset` into a `DictDataset`, which stores the information inside a dictionary and returns a dictionary with the batched data when sliced. 
\n", "\n", "The **keys** depend on the kind of learning task:\n", "- Unsupervised: \"data\" (optionally \"weights\")\n", "- Supervised\n", "    - Regression: \"data\", \"target\" (optionally \"weights\")\n", "    - Classification: \"data\", \"labels\"\n", "- Time-lagged: \"data\", \"data_lag\" (optionally \"weights\", \"weights_lag\")\n", "\n", "The **values** can be either `torch.Tensor` objects or NumPy arrays / lists, which will be passed to `torch.Tensor()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Colab setup\n", "import os\n", "\n", "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", "    import subprocess\n", "    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)\n", "    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)\n", "    print('Done!')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DictDataset( \"data\": [100, 2], \"target\": [100] )" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "from mlcolvar.data import DictDataset\n", "\n", "# the constructor takes a dictionary as input\n", "n_samples, n_features = 100, 2\n", "dataset = DictDataset({'data': torch.rand((n_samples, n_features)),\n", "                       'target': torch.rand((n_samples,))\n", "                       })\n", "\n", "dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Indexing the dataset with a string key returns the corresponding value of the underlying dictionary,\n", "while indexing it with an integer or a slice returns a dictionary containing the selected samples:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dataset[\"data\"] --> torch.Size([100, 2])\n", "\n", "dataset[0] = {'data': tensor([0.0238, 0.6240]), 'target': tensor(0.3217)}\n", "\n", "dataset[0:3] = {'data': tensor([[0.0238, 0.6240],\n", " 
[0.6782, 0.4476],\n", " [0.8055, 0.8887]]), 'target': tensor([0.3217, 0.6375, 0.5045])}\n" ] } ], "source": [ "# access with a key\n", "print('dataset[\"data\"] -->', dataset[\"data\"].shape)\n", "# access the 0-th element\n", "print('\\ndataset[0] =', dataset[0])\n", "# slice the dataset\n", "print('\\ndataset[0:3] =', dataset[0:3])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can also add further keys to the dataset, e.g. if you want to assign different weights to the samples:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DictDataset( \"data\": [100, 2], \"target\": [100], \"weights\": [100] )" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset['weights'] = torch.rand(100)\n", "\n", "dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Dataloaders" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Dataloaders wrap an iterable around the dataset so that the samples can be easily collated into batches and used for training/validation. We subclassed `torch.utils.data.DataLoader` into a `DictLoader`, which takes a `DictDataset` as input. You can see further details in its documentation.\n", "\n", "Typically the dataset is split into training and validation sets and then the dataloaders are created." 
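, "\n", "\n", "What a dictionary-based loader does can be sketched in plain PyTorch: the same batch of (optionally shuffled) indices is used to slice every key, so that inputs, targets and weights stay aligned. The `iter_batches` helper below is a hypothetical name introduced only for illustration; it is not part of `mlcolvar` and does not claim to reproduce the actual `DictLoader` implementation.\n", "\n", "```python\n", "import torch\n", "\n", "# toy dictionary of tensors: every key holds one entry per sample\n", "toy_data = {'data': torch.rand((10, 2)), 'target': torch.rand((10,))}\n", "\n", "def iter_batches(tensors, batch_size, shuffle=True):\n", "    # hypothetical helper for illustration only, not an mlcolvar function\n", "    n_samples = len(tensors['data'])\n", "    indices = torch.randperm(n_samples) if shuffle else torch.arange(n_samples)\n", "    for start in range(0, n_samples, batch_size):\n", "        idx = indices[start:start + batch_size]\n", "        # the same indices slice every key, so the samples stay aligned\n", "        yield {key: value[idx] for key, value in tensors.items()}\n", "\n", "for batch in iter_batches(toy_data, batch_size=4):\n", "    print({key: tuple(value.shape) for key, value in batch.items()})\n", "```\n", "\n", "The last batch is smaller than the others whenever the number of samples is not a multiple of the batch size (here 10 samples in batches of 4)."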
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DictLoader(length=80, batch_size=40, shuffle=True)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlcolvar.data import DictLoader\n", "\n", "# create train/valid dataloaders\n", "train_loader = DictLoader(dataset[:80], batch_size=40)\n", "valid_loader = DictLoader(dataset[80:], batch_size=20)\n", "\n", "train_loader" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Datamodule" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The `lightning.LightningDataModule` object can be used to simplify and organize the data-processing tasks described above. Here we subclassed it into a `DictModule`, which takes care of 1) shuffling the data, 2) splitting the dataset and 3) creating the dataloaders. Note that it is meant to be used together with a `lightning.Trainer`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#1 --> DictModule(dataset -> DictDataset( \"data\": [100, 2], \"target\": [100], \"weights\": [100] ),\n", "\t\t train_loader -> DictLoader(length=0.8, batch_size=10, shuffle=True),\n", "\t\t valid_loader -> DictLoader(length=0.2, batch_size=10, shuffle=True))\n", "\n", "#2 --> DictModule(dataset -> DictDataset( \"data\": [100, 2], \"target\": [100], \"weights\": [100] ),\n", "\t\t train_loader -> DictLoader(length=75, batch_size=25, shuffle=True),\n", "\t\t valid_loader -> DictLoader(length=20, batch_size=10, shuffle=False),\n", "\t\t\ttest_loader =DictLoader(length=5, batch_size=5, shuffle=False))\n" ] } ], "source": [ "from mlcolvar.data import DictModule\n", "\n", "# (1) lengths as fractions of the dataset\n", "datamodule = DictModule(dataset, lengths=[0.8, 0.2], batch_size=10)\n", "print('#1 --> ', datamodule)\n", "\n", "# (2) lengths as numbers of samples\n", 
"datamodule = DictModule(dataset, lengths=[75, 20, 5],\n", "                        batch_size=[25, 10, 5],         # different batch sizes for each dataloader\n", "                        shuffle=[True, False, False])   # per-dataloader options\n", "\n", "print('\\n#2 --> ', datamodule)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### I/O helper functions" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating datasets from file" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "It is of course possible to load the data from files (e.g. with the `load_dataframe` function) and then create a dataset. For convenience, we provide a function `create_dataset_from_files` that creates the dataset directly from files. It covers the following settings:" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "1. **unsupervised learning**: one or more files are merged together into an unlabeled dataset" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Class 0 dataframe shape: (5001, 11)\n", "\n", " - Loaded dataframe (5001, 11): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2']\n", " - Descriptors (5001, 2): ['p.x', 'p.y']\n" ] } ], "source": [ "from mlcolvar.utils.io import create_dataset_from_files\n", "\n", "filenames = [ \"data/muller-brown/unbiased/high-temp/COLVAR\" ]\n", "\n", "# load data into dataset\n", "dataset, df = create_dataset_from_files(filenames,\n", "                                        create_labels=False,\n", "                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using the .filter method of pandas dataframes\n", "                                        return_dataframe=True)             # also return the dataframe of the loaded files (not only the input data)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>time</th><th>p.x</th><th>p.y</th><th>p.z</th><th>ene</th><th>pot.bias</th><th>pot.ene_bias</th><th>lwall.bias</th><th>lwall.force2</th><th>uwall.bias</th><th>uwall.force2</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>0.0</td><td>0.500000</td><td>0.000000</td><td>0.0</td><td>6.580981</td><td>6.580981</td><td>6.580981</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>1</th><td>1.0</td><td>0.285803</td><td>0.351447</td><td>0.0</td><td>11.506740</td><td>11.506740</td><td>11.506740</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>2</th><td>2.0</td><td>-0.004293</td><td>0.590710</td><td>0.0</td><td>11.821637</td><td>11.821637</td><td>11.821637</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>3</th><td>3.0</td><td>-0.530208</td><td>0.714688</td><td>0.0</td><td>16.812886</td><td>16.812886</td><td>16.812886</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"    <tr><th>4</th><td>4.0</td><td>-1.015236</td><td>0.978306</td><td>0.0</td><td>8.821514</td><td>8.821514</td><td>8.821514</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " time p.x p.y ... lwall.force2 uwall.bias uwall.force2\n", "0 0.0 0.500000 0.000000 ... 0.0 0.0 0.0\n", "1 1.0 0.285803 0.351447 ... 0.0 0.0 0.0\n", "2 2.0 -0.004293 0.590710 ... 0.0 0.0 0.0\n", "3 3.0 -0.530208 0.714688 ... 0.0 0.0 0.0\n", "4 4.0 -1.015236 0.978306 ... 0.0 0.0 0.0\n", "\n", "[5 rows x 11 columns]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(5)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "2. **classification**: in this case each file contains samples of a different class" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Class 0 dataframe shape: (2001, 12)\n", "Class 1 dataframe shape: (2001, 12)\n", "\n", " - Loaded dataframe (4002, 12): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2', 'labels']\n", " - Descriptors (4002, 2): ['p.x', 'p.y']\n" ] } ], "source": [ "from mlcolvar.utils.io import create_dataset_from_files\n", "\n", "filenames = [ f\"data/muller-brown/unbiased/state-{i}/COLVAR\" for i in range(2) ]\n", "\n", "# load data into dataset\n", "dataset, df = create_dataset_from_files(filenames,\n", "                                        create_labels=True,\n", "                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using the .filter method of pandas dataframes\n", "                                        return_dataframe=True)             # also return the dataframe of the loaded files (not only the input data)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating time-lagged datasets" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For time-lagged tasks, one deals not with single configurations but with pairs of configurations $\\{x(t),x(t+\\tau)\\}$ separated in time by a lag time $\\tau$. 
The `mlcolvar.utils.timelagged` module contains some helper functions for this purpose, in particular `create_timelagged_dataset`.\n", "\n", "Notes:\n", "- If logweights are given (e.g. beta*bias), the search for time-lagged configurations is performed in rescaled time [McCarty and Parrinello, JCP 2017].\n", "- The resulting dataset contains the keys 'data' and 'data_lag' as well as 'weights' and 'weights_lag', where the weights are all equal to one in the unbiased case.\n", "- The actual search for time-lagged configurations is performed by the function `find_time_lagged_configurations`, which is not meant to be called directly." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/luigi/work/mlcolvar/mlcolvar/utils/timelagged.py:129: UserWarning: Monitoring the progress for the search of time-lagged configurations with a progress_bar requires `tqdm`.\n", " warnings.warn('Monitoring the progress for the search of time-lagged configurations with a progress_bar requires `tqdm`.')\n" ] }, { "data": { "text/plain": [ "DictDataset( \"data\": [88, 20], \"data_lag\": [88, 20], \"weights\": [88], \"weights_lag\": [88] )" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlcolvar.utils.timelagged import create_timelagged_dataset\n", "\n", "X = torch.rand((100, 20))\n", "t = torch.arange(100)\n", "\n", "# returns configurations at time t as well as at time t+tau\n", "dataset = create_timelagged_dataset(X, t,\n", "                                    lag_time=10,\n", "                                    logweights=None)\n", "\n", "dataset" ] } ], "metadata": { "kernelspec": { "display_name": "pytorch", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "orig_nbformat": 4, 
"vscode": { "interpreter": { "hash": "1cbeac1d7079eaeba64f3210ccac5ee24400128e300a45ae35eee837885b08b3" } } }, "nbformat": 4, "nbformat_minor": 2 }