{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Creating datasets\n", "\n", "[Open in Colab](https://colab.research.google.com/github/luigibonati/mlcolvar/blob/main/docs/notebooks/tutorials/intro_2_data.ipynb)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Outline" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial you will learn how to organize data for training, and in particular the difference between:\n", "\n", "- datasets\n", "- dataloaders\n", "- datamodules\n", "\n", "We will also look at helper functions for creating:\n", "\n", "- datasets from COLVAR files\n", "- time-lagged datasets" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In a nutshell:\n", "- [datasets](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) are objects that store the input data together with any additional quantities, such as labels or weights, used in training.\n", "- [dataloaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) wrap an iterable around a dataset to allow easy access to the data and to collate inputs into batches.\n", "- [datamodules](https://pytorch-lightning.readthedocs.io/en/1.8.1/data/datamodule.html) encapsulate all the steps needed to process data, e.g. splitting the dataset and creating the dataloaders." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Datasets" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We subclassed `torch.utils.data.Dataset` into a `DictDataset`, which stores the data in a dictionary and returns a dictionary with the batched data when sliced. 
\n", "\n", "The **keys** depend on the kind of learning task:\n", "- Unsupervised: \"data\" (optional: \"weights\")\n", "- Supervised\n", "    - Regression: \"data\", \"target\" (optional: \"weights\")\n", "    - Classification: \"data\", \"labels\"\n", "- Time-lagged: \"data\", \"data_lag\" (optional: \"weights\", \"weights_lag\")\n", "\n", "The **values** can be either `torch.Tensor` objects, or NumPy arrays / lists that will be passed to the `torch.Tensor()` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Colab setup\n", "import os\n", "\n", "if os.getenv(\"COLAB_RELEASE_TAG\"):\n", "    import subprocess\n", "    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)\n", "    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)\n", "    print('Done!')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DictDataset( \"data\": [100, 2], \"target\": [100] )" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "from mlcolvar.data import DictDataset\n", "\n", "# the constructor takes a dictionary as input\n", "n_samples, n_features = 100, 2\n", "dataset = DictDataset({'data': torch.rand((n_samples, n_features)),\n", "                       'target': torch.rand((n_samples,))\n", "                       })\n", "\n", "dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "If the dataset is indexed with a string key, it returns the corresponding value of the underlying dictionary;\n", "if it is indexed with an integer or a slice, it returns a dictionary with the selected entries:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dataset[\"data\"] --> torch.Size([100, 2])\n", "\n", "dataset[0] = {'data': tensor([0.0238, 0.6240]), 'target': tensor(0.3217)}\n", "\n", "dataset[0:3] = {'data': tensor([[0.0238, 0.6240],\n", " 
[0.6782, 0.4476],\n", " [0.8055, 0.8887]]), 'target': tensor([0.3217, 0.6375, 0.5045])}\n" ] } ], "source": [ "# access with a key\n", "print('dataset[\"data\"] -->', dataset[\"data\"].shape)\n", "# access the 0-th element\n", "print('\\ndataset[0] =', dataset[0])\n", "# slice the dataset\n", "print('\\ndataset[0:3] =', dataset[0:3])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can also add new keys to the dataset, e.g. to assign different weights to the samples:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DictDataset( \"data\": [100, 2], \"target\": [100], \"weights\": [100] )" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset['weights'] = torch.rand(100)\n", "\n", "dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Dataloaders" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Dataloaders wrap an iterable around the dataset so that the data can be easily collated into batches and used for training/validation. We subclassed `torch.utils.data.DataLoader` into a `DictLoader`, which takes a `DictDataset` as input. See its documentation for further details.\n", "\n", "Typically, the dataset is first split into training and validation sets, and then a dataloader is created for each."
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DictLoader(length=80, batch_size=40, shuffle=True)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from mlcolvar.data import DictLoader\n", "\n", "# create train/valid dataloaders\n", "train_loader = DictLoader(dataset[:80], batch_size=40)\n", "valid_loader = DictLoader(dataset[80:], batch_size=20)\n", "\n", "train_loader" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Datamodule" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The `lightning.LightningDataModule` object can be used to simplify and organize the data-processing tasks described above. Here we subclassed it into a `DictModule`, which takes care of 1) shuffling the data, 2) splitting the dataset, and 3) creating the dataloaders. Note that it is meant to be used together with a `lightning.Trainer`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#1 --> DictModule(dataset -> DictDataset( \"data\": [100, 2], \"target\": [100], \"weights\": [100] ),\n", "\t\t train_loader -> DictLoader(length=0.8, batch_size=10, shuffle=True),\n", "\t\t valid_loader -> DictLoader(length=0.2, batch_size=10, shuffle=True))\n", "\n", "#2 --> DictModule(dataset -> DictDataset( \"data\": [100, 2], \"target\": [100], \"weights\": [100] ),\n", "\t\t train_loader -> DictLoader(length=75, batch_size=25, shuffle=True),\n", "\t\t valid_loader -> DictLoader(length=20, batch_size=10, shuffle=False),\n", "\t\t\ttest_loader =DictLoader(length=5, batch_size=5, shuffle=False))\n" ] } ], "source": [ "from mlcolvar.data import DictModule\n", "\n", "# (1) lengths as fractions of the dataset\n", "datamodule = DictModule(dataset, lengths = [0.8,0.2], batch_size = 10 )\n", "print('#1 --> ', datamodule ) \n", "\n", "# (2) lengths as numbers of elements\n", 
"datamodule = DictModule(dataset, lengths = [75,20,5],\n", "                        batch_size = [25,10,5], # different batch sizes for each dataloader\n", "                        shuffle = [True, False, False] ) # specify per-dataloader options\n", "\n", "print('\\n#2 --> ', datamodule ) " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### I/O helper functions" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating datasets from file" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "It is of course possible to load the data from files (e.g. with the `load_dataframe` function) and then create a dataset. For convenience, we provide a function `create_dataset_from_files` that creates the dataset directly from files. It covers the following settings:" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "1) **unsupervised learning**: one or more files are merged together into an unlabeled dataset" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Class 0 dataframe shape: (5001, 11)\n", "\n", " - Loaded dataframe (5001, 11): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2']\n", " - Descriptors (5001, 2): ['p.x', 'p.y']\n" ] } ], "source": [ "from mlcolvar.io import create_dataset_from_files\n", "\n", "filenames = [ \"data/muller-brown/unbiased/high-temp/COLVAR\" ]\n", "\n", "# load data into dataset\n", "dataset, df = create_dataset_from_files(filenames,\n", "                                        create_labels=False,\n", "                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using the .filter method of pandas dataframes\n", "                                        return_dataframe=True) # also return the dataframe of the loaded files (not only the input data)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | time | p.x | p.y | p.z | ene | pot.bias | pot.ene_bias | lwall.bias | lwall.force2 | uwall.bias | uwall.force2 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 0.0 | 0.500000 | 0.000000 | 0.0 | 6.580981 | 6.580981 | 6.580981 | 0.0 | 0.0 | 0.0 | 0.0 |\n",
"| 1 | 1.0 | 0.285803 | 0.351447 | 0.0 | 11.506740 | 11.506740 | 11.506740 | 0.0 | 0.0 | 0.0 | 0.0 |\n",
"| 2 | 2.0 | -0.004293 | 0.590710 | 0.0 | 11.821637 | 11.821637 | 11.821637 | 0.0 | 0.0 | 0.0 | 0.0 |\n",
"| 3 | 3.0 | -0.530208 | 0.714688 | 0.0 | 16.812886 | 16.812886 | 16.812886 | 0.0 | 0.0 | 0.0 | 0.0 |\n",
"| 4 | 4.0 | -1.015236 | 0.978306 | 0.0 | 8.821514 | 8.821514 | 8.821514 | 0.0 | 0.0 | 0.0 | 0.0 |\n", "