{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creating datasets\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/luigibonati/mlcolvar/blob/main/docs/notebooks/tutorials/intro_2_data.ipynb)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Outline"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial you will learn how to organize the data used in the training process, and in particular the difference between:\n",
    "\n",
    "- datasets\n",
    "- dataloaders\n",
    "- datamodules\n",
    "\n",
    "We will also look at some helper functions for creating:\n",
    "\n",
    "- datasets from COLVAR files\n",
    "- time-lagged datasets"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a nutshell:\n",
    "- [datasets](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) are objects which store the input data as well as additional quantities like labels or weights that are going to be used in the training. \n",
    "- [dataloaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) wrap an iterable around datasets to allow for easy access to data (as well as collating inputs into batches). \n",
    "- [datamodules](https://pytorch-lightning.readthedocs.io/en/1.8.1/data/datamodule.html) encapsulate all the steps needed to process data, e.g. splitting the datasets and creating the dataloaders."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Datasets"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`DictDataset` is a subclass of `torch.utils.data.Dataset` which stores the information inside a dictionary and returns a dictionary with the batched data when sliced.\n",
    "\n",
    "The **keys** depend on the kind of learning task (optional keys are given in parentheses):\n",
    "- Unsupervised: \"data\" (\"weights\")\n",
    "- Supervised\n",
    "    - Regression: \"data\", \"target\" (\"weights\")\n",
    "    - Classification: \"data\", \"labels\"\n",
    "- Time-lagged: \"data\", \"data_lag\" (\"weights\", \"weights_lag\")\n",
    "\n",
    "The **values** can be either `torch.Tensor` objects or NumPy arrays / lists, which are converted by passing them to the `torch.Tensor()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Colab setup\n",
    "import os\n",
    "\n",
    "if os.getenv(\"COLAB_RELEASE_TAG\"):\n",
    "    import subprocess\n",
    "    subprocess.run('wget https://raw.githubusercontent.com/luigibonati/mlcolvar/main/colab_setup.sh', shell=True)\n",
    "    cmd = subprocess.run('bash colab_setup.sh TUTORIAL', shell=True, stdout=subprocess.PIPE)\n",
    "    print('Done!')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DictDataset( \"data\": [100, 2], \"target\": [100] )"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "from mlcolvar.data import DictDataset\n",
    "\n",
    "# the constructor takes a dictionary as input.\n",
    "n_samples, n_features = 100, 2\n",
    "dataset = DictDataset({'data': torch.rand((n_samples,n_features)),\n",
    "                             'target': torch.rand((n_samples,))\n",
    "                             })\n",
    "\n",
    "dataset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the dataset is indexed with a string key, it returns the corresponding value of the underlying dictionary.\n",
    "If instead it is indexed with an integer or a slice, it returns a dictionary containing the selected entries for every key:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dataset[\"data\"] --> torch.Size([100, 2])\n",
      "\n",
      "dataset[0] = {'data': tensor([0.0238, 0.6240]), 'target': tensor(0.3217)}\n",
      "\n",
      "dataset[0:3] = {'data': tensor([[0.0238, 0.6240],\n",
      "        [0.6782, 0.4476],\n",
      "        [0.8055, 0.8887]]), 'target': tensor([0.3217, 0.6375, 0.5045])}\n"
     ]
    }
   ],
   "source": [
    "# access with a key \n",
    "print('dataset[\"data\"] -->', dataset[\"data\"].shape )\n",
    "# access the 0-th element\n",
    "print('\\ndataset[0] =', dataset[0] )\n",
    "# slice the dataset\n",
    "print('\\ndataset[0:3] =', dataset[0:3] )"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also add new keys to an existing dataset, e.g. to assign a different weight to each sample:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DictDataset( \"data\": [100, 2], \"target\": [100], \"weights\": [100] )"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset['weights'] = torch.rand(100)\n",
    "\n",
    "dataset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataloaders"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dataloaders wrap an iterable around the dataset so that samples can be easily collated into batches and used for training/validation. The counterpart of `torch.utils.data.DataLoader` provided here is `DictLoader`, which takes a `DictDataset` as input. You can find further details in its documentation.\n",
    "\n",
    "Typically, the dataset is first split into training and validation sets, and then a dataloader is created for each of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DictLoader(length=80, batch_size=40, shuffle=True)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from mlcolvar.data import DictLoader\n",
    "\n",
    "# create train/valid dataloader\n",
    "train_loader = DictLoader(dataset[:80],batch_size=40)\n",
    "valid_loader = DictLoader(dataset[80:],batch_size=20)\n",
    "\n",
    "train_loader"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Datamodules"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `lightning.LightningDataModule` object can be used to simplify and organize the data-processing tasks described above. Here we subclassed it into a `DictModule`, which takes care of 1) shuffling the data, 2) splitting the dataset and 3) creating the dataloaders. Note that it is meant to be used together with a `lightning.Trainer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#1 -->  DictModule(dataset -> DictDataset( \"data\": [100, 2], \"target\": [100], \"weights\": [100] ),\n",
      "\t\t     train_loader -> DictLoader(length=0.8, batch_size=10, shuffle=True),\n",
      "\t\t     valid_loader -> DictLoader(length=0.2, batch_size=10, shuffle=True))\n",
      "\n",
      "#2 -->  DictModule(dataset -> DictDataset( \"data\": [100, 2], \"target\": [100], \"weights\": [100] ),\n",
      "\t\t     train_loader -> DictLoader(length=75, batch_size=25, shuffle=True),\n",
      "\t\t     valid_loader -> DictLoader(length=20, batch_size=10, shuffle=False),\n",
      "\t\t\ttest_loader =DictLoader(length=5, batch_size=5, shuffle=False))\n"
     ]
    }
   ],
   "source": [
    "from mlcolvar.data import DictModule\n",
    "\n",
    "# (1) lengths as fractions of the dataset\n",
    "datamodule = DictModule(dataset, lengths = [0.8,0.2], batch_size = 10 )\n",
    "print('#1 --> ', datamodule ) \n",
    "\n",
    "# (2) lengths as number of elements\n",
    "datamodule = DictModule(dataset, lengths = [75,20,5], \n",
    "                                    batch_size = [25,10,5],             # different batch sizes for each dataloader\n",
    "                                    shuffle = [True, False, False] )    # specify per-dataloader options\n",
    "\n",
    "print('\\n#2 --> ', datamodule ) "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### I/O helper functions"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Creating datasets from file"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is of course possible to load the data from files (e.g. with the `load_dataframe` function) and then create a dataset from them. For convenience, we provide the function `create_dataset_from_files`, which creates the dataset directly from files. It covers the following settings:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. **unsupervised learning**: one or more files are merged into a single unlabeled dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Class 0 dataframe shape:  (5001, 11)\n",
      "\n",
      " - Loaded dataframe (5001, 11): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2']\n",
      " - Descriptors (5001, 2): ['p.x', 'p.y']\n"
     ]
    }
   ],
   "source": [
    "from mlcolvar.utils.io import create_dataset_from_files\n",
    "\n",
    "filenames = [ \"data/muller-brown/unbiased/high-temp/COLVAR\" ]\n",
    "\n",
    "# load data into dataset\n",
    "dataset, df = create_dataset_from_files(filenames, \n",
    "                                        create_labels=False,\n",
    "                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using .filter method of Pandas dataframes\n",
    "                                        return_dataframe=True) # return also the dataframe of the loaded files (not only the input data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>time</th>\n",
       "      <th>p.x</th>\n",
       "      <th>p.y</th>\n",
       "      <th>p.z</th>\n",
       "      <th>ene</th>\n",
       "      <th>pot.bias</th>\n",
       "      <th>pot.ene_bias</th>\n",
       "      <th>lwall.bias</th>\n",
       "      <th>lwall.force2</th>\n",
       "      <th>uwall.bias</th>\n",
       "      <th>uwall.force2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.580981</td>\n",
       "      <td>6.580981</td>\n",
       "      <td>6.580981</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.285803</td>\n",
       "      <td>0.351447</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.506740</td>\n",
       "      <td>11.506740</td>\n",
       "      <td>11.506740</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.0</td>\n",
       "      <td>-0.004293</td>\n",
       "      <td>0.590710</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.821637</td>\n",
       "      <td>11.821637</td>\n",
       "      <td>11.821637</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.0</td>\n",
       "      <td>-0.530208</td>\n",
       "      <td>0.714688</td>\n",
       "      <td>0.0</td>\n",
       "      <td>16.812886</td>\n",
       "      <td>16.812886</td>\n",
       "      <td>16.812886</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.0</td>\n",
       "      <td>-1.015236</td>\n",
       "      <td>0.978306</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.821514</td>\n",
       "      <td>8.821514</td>\n",
       "      <td>8.821514</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   time       p.x       p.y  ...  lwall.force2  uwall.bias  uwall.force2\n",
       "0   0.0  0.500000  0.000000  ...           0.0         0.0           0.0\n",
       "1   1.0  0.285803  0.351447  ...           0.0         0.0           0.0\n",
       "2   2.0 -0.004293  0.590710  ...           0.0         0.0           0.0\n",
       "3   3.0 -0.530208  0.714688  ...           0.0         0.0           0.0\n",
       "4   4.0 -1.015236  0.978306  ...           0.0         0.0           0.0\n",
       "\n",
       "[5 rows x 11 columns]"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(5)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. **classification**: each file contains the samples of a different class, and labels are created accordingly (`create_labels=True`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Class 0 dataframe shape:  (2001, 12)\n",
      "Class 1 dataframe shape:  (2001, 12)\n",
      "\n",
      " - Loaded dataframe (4002, 12): ['time', 'p.x', 'p.y', 'p.z', 'ene', 'pot.bias', 'pot.ene_bias', 'lwall.bias', 'lwall.force2', 'uwall.bias', 'uwall.force2', 'labels']\n",
      " - Descriptors (4002, 2): ['p.x', 'p.y']\n"
     ]
    }
   ],
   "source": [
    "from mlcolvar.utils.io import create_dataset_from_files\n",
    "\n",
    "filenames = [ f\"data/muller-brown/unbiased/state-{i}/COLVAR\" for i in range(2) ]\n",
    "\n",
    "# load data into dataset\n",
    "dataset, df = create_dataset_from_files(filenames, \n",
    "                                        create_labels=True,\n",
    "                                        filter_args=dict(regex='p.x|p.y'), # select input descriptors using .filter method of Pandas dataframes\n",
    "                                        return_dataframe=True) # return also the dataframe of the loaded files (not only the input data)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Creating time-lagged datasets"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In time-lagged tasks one deals not with single configurations but with pairs of configurations $\\{x(t),x(t+\\tau)\\}$ that are separated by a lag time $\\tau$. The `mlcolvar.utils.timelagged` module contains some helper functions for this purpose, in particular `create_timelagged_dataset`.\n",
    "\n",
    "Notes:\n",
    "- If logweights are given (e.g. beta*bias), the search for time-lagged configurations is performed in rescaled time [McCarty and Parrinello, JCP 2017].\n",
    "- The resulting dataset contains the keys 'data' and 'data_lag' as well as 'weights' and 'weights_lag'; in the unbiased case all weights are equal to one.\n",
    "- The actual search for time-lagged configurations is performed by the function `find_time_lagged_configurations`, which is not meant to be called directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/luigi/work/mlcolvar/mlcolvar/utils/timelagged.py:129: UserWarning: Monitoring the progress for the search of time-lagged configurations with a progress_bar requires `tqdm`.\n",
      "  warnings.warn('Monitoring the progress for the search of time-lagged configurations with a progress_bar requires `tqdm`.')\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "DictDataset( \"data\": [88, 20], \"data_lag\": [88, 20], \"weights\": [88], \"weights_lag\": [88] )"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from mlcolvar.utils.timelagged import create_timelagged_dataset\n",
    "\n",
    "X = torch.rand((100,20)) \n",
    "t = torch.arange(100)\n",
    "\n",
    "# returns configurations at time t as well as time t+tau\n",
    "dataset = create_timelagged_dataset(X, t, \n",
    "                                    lag_time=10, \n",
    "                                    logweights=None )\n",
    "\n",
    "dataset"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pytorch",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "1cbeac1d7079eaeba64f3210ccac5ee24400128e300a45ae35eee837885b08b3"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}