mlcolvar.data.DictLoader
- class mlcolvar.data.DictLoader(dataset: dict | DictDataset | Subset | Sequence, batch_size: int | Sequence[int] = 0, shuffle: bool = True)
Bases: object
PyTorch DataLoader for DictDataset.
It is much faster than TensorDataset + DataLoader because DataLoader grabs individual indices of the dataset and calls cat (slow).
The class can also merge multiple DictDataset objects that have different keys (see example below). Different datasets can have different numbers of samples. In this case, it is necessary to specify the batch sizes so that the number of batches per epoch is the same for all datasets.
Notes
Adapted from https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6.
Examples
>>> x = torch.arange(1,11)
A DictLoader can be initialized from a dict, a DictDataset, or a Subset wrapping a DictDataset.
>>> # Initialize from a dictionary.
>>> d = {'data': x.unsqueeze(1), 'labels': x**2}
>>> dataloader = DictLoader(d, batch_size=1, shuffle=False)
>>> dataloader.dataset_len  # number of samples
10
>>> # Print first batch.
>>> for batch in dataloader:
...     print(batch)
...     break
{'data': tensor([[1]]), 'labels': tensor([1])}
>>> # Initialize from a DictDataset.
>>> dict_dataset = DictDataset(d)
>>> dataloader = DictLoader(dict_dataset, batch_size=2, shuffle=False)
>>> len(dataloader)  # Number of batches
5
>>> # Initialize from a PyTorch Subset object.
>>> train, _ = torch.utils.data.random_split(dict_dataset, [0.5, 0.5])
>>> dataloader = DictLoader(train, batch_size=1, shuffle=False)
It is also possible to iterate over multiple dictionary datasets having different keys for multi-task learning.
>>> dataloader = DictLoader(
...     dataset=[dict_dataset, {'some_unlabeled_data': torch.arange(20)+11}],
...     batch_size=[1, 2], shuffle=False,
... )
>>> dataloader.dataset_len  # This is the number of samples in the datasets.
[10, 20]
>>> # Print first batch.
>>> from pprint import pprint
>>> for batch in dataloader:
...     pprint(batch)
...     break
{'dataset0': {'data': tensor([[1]]), 'labels': tensor([1])},
 'dataset1': {'some_unlabeled_data': tensor([11, 12])}}
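The nested batch printed in the multi-dataset example can be illustrated without mlcolvar or PyTorch. The following is a minimal sketch that uses plain Python lists in place of tensors. `iter_multi_batches` is a hypothetical helper, not part of the library. It only mimics the `'dataset0'`, `'dataset1'` keying seen in the printed output, and it assumes no shuffling and that a trailing partial batch is kept.

```python
def iter_multi_batches(datasets, batch_sizes):
    """Yield one nested dict per step, keyed 'dataset0', 'dataset1', ...

    Hypothetical helper for illustration only: plain lists stand in for
    tensors and the data is never shuffled.
    """
    # Number of samples in each dataset (all values of a dataset share a length).
    lengths = [len(next(iter(d.values()))) for d in datasets]
    # Ceiling division: assumes a trailing partial batch still counts as a batch.
    n_batches = {-(-n // b) for n, b in zip(lengths, batch_sizes)}
    if len(n_batches) != 1:
        raise ValueError("batch sizes give different numbers of batches per epoch")
    for step in range(n_batches.pop()):
        batch = {}
        for i, (d, b) in enumerate(zip(datasets, batch_sizes)):
            start = step * b
            # Slice a contiguous range from every key of the i-th dataset.
            batch[f"dataset{i}"] = {k: v[start:start + b] for k, v in d.items()}
        yield batch


x = list(range(1, 11))
labeled = {"data": [[v] for v in x], "labels": [v**2 for v in x]}
unlabeled = {"some_unlabeled_data": list(range(11, 31))}

batches = list(iter_multi_batches([labeled, unlabeled], batch_sizes=[1, 2]))
first = batches[0]
```

Here `first` has the same shape as the printed batch: one entry per dataset, each holding that dataset's own keys.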
- __init__(dataset: dict | DictDataset | Subset | Sequence, batch_size: int | Sequence[int] = 0, shuffle: bool = True)
Initialize a DictLoader.
- Parameters:
dataset (dict or DictDataset or Subset of DictDataset or list-like) – The dataset or a list of datasets. If a list, the datasets can have different keys and different numbers of samples, as long as the batch sizes give every dataset the same number of batches per epoch.
batch_size (int or list-like of int, optional) – Batch size, by default 0 (a single batch). If multiple datasets are passed, this can be a list specifying the batch size for each dataset. Otherwise, if an int, the same batch size is used for all datasets. It must be set so that the total number of batches per epoch is the same for all datasets.
shuffle (bool, optional) – If True, shuffle the data in-place whenever an iterator is created out of this object, by default True.
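The constraint that every dataset must produce the same number of batches per epoch can be checked before building a loader. The sketch below is not the library's own validation: `batches_per_epoch` and `check_batch_sizes` are hypothetical helpers. It assumes that a trailing partial batch counts as a batch (ceiling division) and uses the documented convention that a batch size of 0 means a single batch.

```python
import math


def batches_per_epoch(n_samples, batch_size):
    """Number of batches for one dataset (hypothetical helper).

    Assumes a batch size of 0 means a single batch and that a trailing
    partial batch is kept.
    """
    if batch_size == 0:
        return 1
    return math.ceil(n_samples / batch_size)


def check_batch_sizes(dataset_lens, batch_sizes):
    """Return (compatible, counts) for a list of dataset lengths."""
    if isinstance(batch_sizes, int):
        # A single int is applied to every dataset.
        batch_sizes = [batch_sizes] * len(dataset_lens)
    counts = [batches_per_epoch(n, b) for n, b in zip(dataset_lens, batch_sizes)]
    return len(set(counts)) == 1, counts


# Lengths and batch sizes taken from the multi-dataset example.
ok, counts = check_batch_sizes([10, 20], [1, 2])
# The same batch size for both datasets gives mismatched batch counts.
bad, bad_counts = check_batch_sizes([10, 20], 2)
```

With lengths 10 and 20, batch sizes `[1, 2]` give 10 batches each, whereas a shared batch size of 2 gives 5 and 10 batches and is therefore incompatible under these assumptions.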
Methods
__init__(dataset[, batch_size, shuffle]) – Initialize a DictLoader.
get_stats([dataset_idx]) – Compute statistics ('mean', 'std', 'min', 'max') of the dataloader.
set_dataset_and_batch_size(dataset, batch_size) – Set a compatible pair of datasets and batch sizes.
- property batch_size
Batch size or, in case of multiple datasets, a list of batch sizes.
- Type:
int or List[int]
- property dataset
The dictionary dataset(s).
- Type:
DictDataset or list[DictDataset]
- property dataset_len
Number of samples in the dataset or, in case of multiple datasets, a list with the number of samples in each dataset.
- Type:
int or list[int]
- get_stats(dataset_idx: int | None = None)
Compute statistics ('mean', 'std', 'min', 'max') of the dataloader.
- Parameters:
dataset_idx (int, optional) – If given and the loader has multiple datasets, only the statistics of the dataset_idx-th dataset will be returned.
- Returns:
stats – A dictionary mapping the datasets' keys (e.g., 'data', 'weights') to their statistics. If the loader has multiple datasets, stats[i] is the dictionary for the i-th dataset.
- Return type:
Dict[Dict] or List[Dict[Dict]]
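The nested return layout described above can be pictured with a small pure-Python sketch. It does not call `get_stats` and is not the library's implementation: `sketch_stats` is a hypothetical helper, plain lists of floats stand in for tensors, and the use of the sample standard deviation is an assumption.

```python
import statistics


def sketch_stats(dataset):
    """Map each key of a dict dataset to its own statistics dict.

    Hypothetical helper: `dataset` maps a key to a list of floats.
    """
    return {
        key: {
            "mean": statistics.fmean(values),
            # Sample standard deviation; which estimator the library uses
            # is an assumption here.
            "std": statistics.stdev(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in dataset.items()
    }


# Single dataset: a dict of dicts, indexed as stats[key][statistic].
single = sketch_stats({"data": [1.0, 2.0, 3.0, 4.0]})

# Multiple datasets: a list of such dicts, indexed as stats[i][key][statistic].
multi = [sketch_stats(d) for d in ({"data": [1.0, 3.0]}, {"weights": [0.5, 1.5]})]
```

In this layout, `single["data"]["mean"]` is a scalar, while `multi[1]["weights"]` is the statistics dictionary for the `'weights'` key of the second dataset.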
- property keys
The keys of all the datasets in this loader.
- Type:
tuple[str] or tuple[tuple[str]]
- set_dataset_and_batch_size(dataset: None | dict | DictDataset | Subset | Sequence, batch_size: None | int | Sequence[int])
Set a compatible pair of datasets and batch sizes.
With multiple datasets, dataset and batch_size must be compatible, meaning that each dataset has the same number of batches per epoch. Setting the two attributes one at a time might therefore leave the object in an inconsistent state. This setter updates them together and can be used safely.
- Parameters:
dataset (None or dict or DictDataset or Subset of DictDataset or list-like) – The dataset or a list of datasets. If a list, the datasets can have different keys and different numbers of samples, as long as the batch sizes give every dataset the same number of batches per epoch. If None, only batch_size is set.
batch_size (None or int or list-like of int) – Batch size, where 0 means a single batch. If multiple datasets are passed, this can be a list specifying the batch size for each dataset. Otherwise, if an int, the same batch size is used for all datasets. It must be set so that the total number of batches per epoch is the same for all datasets. If None, only dataset is set.
Attributes
batch_size – Batch size or, in case of multiple datasets, a list of batch sizes.
dataset – The dictionary dataset(s).
dataset_len – Number of samples in the dataset(s).
has_multiple_datasets
keys – The keys of all the datasets in this loader.