Dataloaders for PyTorch
Overview
Efficient data loading is crucial for high-performance machine learning. PyTorch enhances data loading speed through parallelized data loading, batch retrieval of indices, and streaming to progressively download datasets.
PyTorch offers a powerful data loading utility class (torch.utils.data.DataLoader). The key argument for this DataLoader is the “Dataset”, which specifies the source of data. PyTorch supports two primary types of Datasets:
Map-style datasets (Dataset) are maps from indices/keys to data samples. So, if dataset[idx] is accessed, it reads the idx-th sample, for example from a directory on disk.
Iterable-style datasets (IterableDataset) represent an iterable over data samples. This is well suited to cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data. So, if iter(dataset) is called, it returns a stream of data from a database, a remote server, or even logs generated in real time.
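The difference between the two dataset types can be sketched with toy in-memory data. The class names SquaresMapDataset and CountingStreamDataset are hypothetical and exist only for this illustration; a real dataset would read from disk or from a stream at the commented points.

```python
import torch
from torch.utils.data import DataLoader, Dataset, IterableDataset


class SquaresMapDataset(Dataset):
    """Map-style: supports len() and random access by index."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # A real dataset would read the idx-th sample from disk here.
        return torch.tensor(idx * idx)


class CountingStreamDataset(IterableDataset):
    """Iterable-style: yields samples in order, with no random access."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # A real dataset would stream from a database, server, or log here.
        for i in range(self.n):
            yield torch.tensor(i)


# The same DataLoader class batches both kinds of dataset.
map_loader = DataLoader(SquaresMapDataset(8), batch_size=4, shuffle=False)
stream_loader = DataLoader(CountingStreamDataset(8), batch_size=4)

map_batches = [batch.tolist() for batch in map_loader]
stream_batches = [batch.tolist() for batch in stream_loader]
```

Note that shuffling is only meaningful for the map-style dataset, because the DataLoader needs indices to permute; an iterable-style dataset has to handle any reordering inside its own __iter__.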
In the Cerebras Model Zoo, dataloaders extend these base types to implement additional functionalities. For instance, the BertCSVDynamicMaskDataProcessor extends IterableDataset and BertClassifierDataProcessor extends Dataset.
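As a rough illustration of what extending a base type to add functionality can look like, the sketch below subclasses IterableDataset and applies token masking on the fly. It is not the Model Zoo implementation: the class name DynamicMaskDataset, the MASK_ID value, and the feature names input_ids and labels are all assumptions made for this example.

```python
import random

import torch
from torch.utils.data import IterableDataset

MASK_ID = 103  # hypothetical id of the mask token


class DynamicMaskDataset(IterableDataset):
    """Hypothetical iterable dataset that masks tokens as it streams them."""

    def __init__(self, sequences, mask_prob, seed):
        self.sequences = sequences
        self.mask_prob = mask_prob
        self.seed = seed

    def __iter__(self):
        # A fresh seeded generator makes every pass reproducible.
        rng = random.Random(self.seed)
        for ids in self.sequences:
            # Replace each token with MASK_ID with probability mask_prob.
            masked = [MASK_ID if rng.random() < self.mask_prob else t for t in ids]
            yield {
                "input_ids": torch.tensor(masked),
                "labels": torch.tensor(ids),
            }


dataset = DynamicMaskDataset(
    [[11, 12, 13, 14], [21, 22, 23, 24]], mask_prob=0.5, seed=0
)
samples = list(dataset)
```

Because the masking happens inside __iter__, the stored data stays unmasked and the masking policy can change without regenerating any files.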
Properties of PyTorch Dataloader
For comprehensive details on the properties of PyTorch dataloaders, refer to this page.
Cerebras Model Zoo Dataloaders
The Cerebras Model Zoo includes several example dataloaders that extend the PyTorch dataset types and add functionalities like input encoding and tokenization. Notable examples are:
BertCSVDataProcessor - Reads CSV files containing the input text tokens and the MLM and NSP features
GptHDF5MapDataProcessor - An HDF5 map-style dataset processor that reads from HDF5 format for GPT pre-training
T5DynamicDataProcessor - Reads text files containing the input text tokens and adds extra ids for the language modelling task on the fly
Creating a custom dataloader with PyTorch
To create your own dataloader, keep these tips in mind:
Ensure coherence between the dataloader output and the neural network model input: If you are using a model from the Cerebras Model Zoo, refer to the README file of the model to understand the required data format. For example, if using GPT-2, ensure your input function produces the features dictionary.
Utilize Cerebras-supported file types: Create your dataset by extending one of the native dataset types. The Cerebras ecosystem supports files of types HDF5, CSV, and TXT. Other file types are not tested and may not be supported.
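The two tips above can be sketched together as a small map-style dataset that reads a CSV file and returns a features dictionary. This is a minimal sketch under stated assumptions: the class name CsvTokenDataset, the token_ids column, and the feature names input_ids and attention_mask are hypothetical, so check the README of the model you are targeting for the features it actually expects.

```python
import csv
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Dataset


class CsvTokenDataset(Dataset):
    """Hypothetical map-style dataset: one CSV row per sample."""

    def __init__(self, csv_path, max_len):
        with open(csv_path, newline="") as f:
            self.rows = list(csv.DictReader(f))
        self.max_len = max_len

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        # The assumed "token_ids" column holds space-separated integer ids.
        ids = [int(t) for t in self.rows[idx]["token_ids"].split()]
        ids = ids[: self.max_len]  # truncate long rows
        pad = self.max_len - len(ids)  # zero-pad short rows
        return {
            "input_ids": torch.tensor(ids + [0] * pad, dtype=torch.int32),
            "attention_mask": torch.tensor(
                [1] * len(ids) + [0] * pad, dtype=torch.int32
            ),
        }


# Write a tiny CSV file so the sketch runs end to end.
csv_path = os.path.join(tempfile.mkdtemp(), "samples.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["token_ids"])
    writer.writerows([["5 6 7"], ["8 9"], ["1 2 3 4 5"]])

loader = DataLoader(CsvTokenDataset(csv_path, max_len=4), batch_size=3)
batch = next(iter(loader))  # dict of tensors, each of shape (3, 4)
```

Padding every sample to the same length lets the default collate function stack the per-sample dictionaries into one batched dictionary with fixed tensor shapes.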
Conclusion
Effective use of PyTorch dataloaders can dramatically improve the efficiency of your data loading processes. By leveraging the capabilities provided by PyTorch and the Cerebras Model Zoo, you can customize your data handling to meet the specific needs of your machine learning models. This ensures a streamlined, efficient workflow, enabling you to focus on model development and performance.
What’s next?
To learn more about creating a custom dataloader, refer to our step-by-step tutorial on Creating custom dataloaders.