mlcolvar.data.DictLoader

class mlcolvar.data.DictLoader(dataset: dict | DictDataset | Subset | Sequence, batch_size: int | Sequence[int] = 0, shuffle: bool = True)[source]

Bases: object

PyTorch DataLoader for DictDataset.

It is much faster than combining TensorDataset with DataLoader, because DataLoader fetches samples one index at a time and then concatenates them with cat, which is slow.

The class can also merge multiple DictDataset objects that have different keys (see example below). The datasets can have different numbers of samples. In this case, the batch sizes must be specified so that the number of batches per epoch is the same for all datasets.
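The batch-count constraint can be checked by hand before building a loader. The following is a minimal sketch, assuming the number of batches per epoch is the sample count divided by the batch size, rounded up, and that a batch size of 0 means one batch holding every sample. `n_batches` and `compatible` are hypothetical helpers written for illustration; they are not part of mlcolvar.

```python
import math


def n_batches(n_samples: int, batch_size: int) -> int:
    """Batches per epoch, assuming ceiling division and 0 == single batch."""
    if batch_size == 0:
        return 1
    return math.ceil(n_samples / batch_size)


def compatible(dataset_lens, batch_sizes) -> bool:
    """True if every dataset yields the same number of batches per epoch."""
    counts = {n_batches(n, b) for n, b in zip(dataset_lens, batch_sizes)}
    return len(counts) == 1


print(compatible([10, 20], [1, 2]))  # 10 batches for each dataset
print(compatible([10, 20], [2, 2]))  # 5 batches versus 10 batches
```

Under these assumptions, datasets of 10 and 20 samples pair with batch sizes 1 and 2, which matches the multi-dataset example below.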

Notes

Adapted from https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6.

Examples

>>> x = torch.arange(1,11)

A DictLoader can be initialized from a dict, a DictDataset, or a Subset wrapping a DictDataset.

>>> # Initialize from a dictionary.
>>> d = {'data': x.unsqueeze(1), 'labels': x**2}
>>> dataloader = DictLoader(d, batch_size=1, shuffle=False)
>>> dataloader.dataset_len  # number of samples
10
>>> # Print first batch.
>>> for batch in dataloader:
...     print(batch)
...     break
{'data': tensor([[1]]), 'labels': tensor([1])}
>>> # Initialize from a DictDataset.
>>> dict_dataset = DictDataset(d)
>>> dataloader = DictLoader(dict_dataset, batch_size=2, shuffle=False)
>>> len(dataloader)  # Number of batches
5
>>> # Initialize from a PyTorch Subset object.
>>> train, _ = torch.utils.data.random_split(dict_dataset, [0.5, 0.5])
>>> dataloader = DictLoader(train, batch_size=1, shuffle=False)

It is also possible to iterate over multiple dictionary datasets that have different keys, for example for multi-task learning:

>>> dataloader = DictLoader(
...     dataset=[dict_dataset, {'some_unlabeled_data': torch.arange(20)+11}],
...     batch_size=[1, 2], shuffle=False,
... )
>>> dataloader.dataset_len  # This is the number of samples in the datasets.
[10, 20]
>>> # Print first batch.
>>> from pprint import pprint
>>> for batch in dataloader:
...     pprint(batch)
...     break
{'dataset0': {'data': tensor([[1]]), 'labels': tensor([1])},
 'dataset1': {'some_unlabeled_data': tensor([11, 12])}}

__init__(dataset: dict | DictDataset | Subset | Sequence, batch_size: int | Sequence[int] = 0, shuffle: bool = True)[source]

Initialize a DictLoader.

Parameters:
  • dataset (dict or DictDataset or Subset of DictDataset or list-like) – The dataset or a list of datasets. If a list, the datasets can have different keys and different numbers of samples, but they must all yield the same number of batches per epoch (see batch_size).

  • batch_size (int or list-like of int, optional) – Batch size, by default 0 (a single batch). If multiple datasets are passed, this can be a list specifying the batch size for each dataset. If it is an int, the same batch size is used for all datasets. It must be set so that the number of batches per epoch is the same for all datasets.

  • shuffle (bool, optional) – If True, shuffle the data in-place whenever an iterator is created out of this object, by default True.
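To make the roles of batch_size and shuffle concrete, here is a minimal pure-Python sketch of slice-based batching over a dictionary. `TinyDictLoader` is a hypothetical stand-in, not the mlcolvar implementation: it works on lists instead of tensors, and it permutes an index list rather than shuffling the stored data. It assumes that a batch size of 0 means a single batch and that the last batch may be smaller than the others.

```python
import random


class TinyDictLoader:
    """Hypothetical stand-in illustrating slice-based batching of a dict."""

    def __init__(self, data: dict, batch_size: int = 0, shuffle: bool = True):
        self.data = data
        self.n = len(next(iter(data.values())))
        # Assumption: batch_size == 0 means one batch holding every sample.
        self.batch_size = batch_size or self.n
        self.shuffle = shuffle

    def __iter__(self):
        # A fresh sample order is drawn each time an iterator is created.
        order = list(range(self.n))
        if self.shuffle:
            random.shuffle(order)
        for start in range(0, self.n, self.batch_size):
            idx = order[start:start + self.batch_size]
            # Every key is indexed with the same positions, so rows stay aligned.
            yield {k: [v[i] for i in idx] for k, v in self.data.items()}


x = list(range(1, 11))
loader = TinyDictLoader(
    {"data": x, "labels": [v**2 for v in x]}, batch_size=4, shuffle=False
)
batches = list(loader)
print(batches[0])
```

Because the order is redrawn inside `__iter__`, each pass over the loader sees a different permutation when shuffle is enabled.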

Methods

__init__(dataset[, batch_size, shuffle])

Initialize a DictLoader.

get_stats([dataset_idx])

Compute statistics ('mean','std','min','max') of the dataloader.

set_dataset_and_batch_size(dataset, batch_size)

Set a compatible pair of datasets and batch sizes.

property batch_size

Batch size or, in case of multiple datasets, a list of batch sizes.

Type:

int or List[int]

property dataset

The dictionary dataset(s).

Type:

DictDataset or list[DictDataset]

property dataset_len

Number of samples in the dataset(s).

Type:

int or list[int]

get_stats(dataset_idx: int | None = None)[source]

Compute statistics ('mean','std','min','max') of the dataloader.

Parameters:

dataset_idx (int, optional) – If given and the loader has multiple datasets, only the statistics of the dataset_idx-th dataset will be returned.

Returns:

stats – A dictionary mapping the datasets’ keys (e.g., 'data', 'weights') to their statistics. If the loader has multiple datasets, stats[i] is the dictionary for the i-th dataset.

Return type:

Dict[Dict] or List[Dict[Dict]]
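The nested return structure can be pictured with a small pure-Python sketch. `dict_stats` is a hypothetical helper, not the library routine: it handles only scalar samples stored in lists, and it uses the sample standard deviation, which is an assumption since the estimator is not stated here.

```python
import statistics


def dict_stats(data: dict) -> dict:
    """Map each key to a dict with its 'mean', 'std', 'min', and 'max'."""
    return {
        key: {
            "mean": statistics.fmean(values),
            "std": statistics.stdev(values),  # assumption: sample estimator
            "min": min(values),
            "max": max(values),
        }
        for key, values in data.items()
    }


stats = dict_stats({"data": [1.0, 2.0, 3.0], "weights": [0.5, 0.5, 2.0]})
print(stats["data"]["mean"], stats["weights"]["max"])
```

With several datasets, the same structure would be repeated once per dataset and collected in a list, so that stats[i] is the dictionary for the i-th dataset.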

property keys

The keys of all the datasets in this loader.

Type:

tuple[str] or tuple[tuple[str]]

set_dataset_and_batch_size(dataset: None | dict | DictDataset | Subset | Sequence, batch_size: None | int | Sequence[int])[source]

Set a compatible pair of datasets and batch sizes.

With multiple datasets, dataset and batch_size must be compatible, meaning that each dataset has the same number of batches per epoch. It might therefore not be possible to set the two attributes one at a time without leaving the object in an inconsistent state. This setter updates both together and can be used safely.

Parameters:
  • dataset (None or dict or DictDataset or Subset of DictDataset or list-like) – The dataset or a list of datasets. If a list, the datasets can have different keys and different numbers of samples, but they must all yield the same number of batches per epoch. If None, only batch_size is set.

  • batch_size (None or int or list-like of int) – Batch size (0 means a single batch). If multiple datasets are passed, this can be a list specifying the batch size for each dataset. If it is an int, the same batch size is used for all datasets. It must be set so that the number of batches per epoch is the same for all datasets. If None, only dataset is set.
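The reason for a joint setter can be illustrated with a validate-then-commit pattern. The sketch below is hypothetical and is not the mlcolvar code: `PairedSettings` stores only dataset lengths and nonzero batch sizes, assumes the number of batches per epoch is the sample count divided by the batch size, rounded up, and raises a ValueError on mismatch (the exception the real setter raises is not documented here).

```python
import math


class PairedSettings:
    """Hypothetical holder for dataset lengths and (nonzero) batch sizes."""

    def __init__(self, dataset_lens, batch_sizes):
        self._lens, self._sizes = None, None
        self.set_both(dataset_lens, batch_sizes)

    def set_both(self, dataset_lens=None, batch_sizes=None):
        # None keeps the current value, so either side can be updated alone.
        lens = self._lens if dataset_lens is None else list(dataset_lens)
        sizes = self._sizes if batch_sizes is None else list(batch_sizes)
        # Assumption: batches per epoch = ceil(samples / batch size).
        counts = {math.ceil(n / b) for n, b in zip(lens, sizes)}
        if len(counts) > 1:
            raise ValueError(f"unequal batches per epoch: {sorted(counts)}")
        # Commit only after validation, so a failure leaves the state intact.
        self._lens, self._sizes = lens, sizes


s = PairedSettings([10, 20], [1, 2])
s.set_both([10, 30], [1, 3])  # both change together: 10 batches each
try:
    s.set_both(dataset_lens=[10, 40])  # sizes stay [1, 3]: 10 vs 14 batches
except ValueError as err:
    print(err)
```

Changing the lengths to [10, 30] alone would be rejected against the old batch sizes [1, 2], which is why the pair is validated as a unit before either value is stored.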

Attributes

batch_size

Batch size or, in case of multiple datasets, a list of batch sizes.

dataset

The dictionary dataset(s).

dataset_len

Number of samples in the dataset(s).

has_multiple_datasets

keys

The keys of all the datasets in this loader.