cerebras.modelzoo.data_preparation.raw_dataset_processor.RawDatasetProcessor.RawDatasetProcessor
- class cerebras.modelzoo.data_preparation.raw_dataset_processor.RawDatasetProcessor.RawDatasetProcessor(*args, **kwargs)
Bases: torch.utils.data.IterableDataset
Methods
Collates a list of dictionaries into a batch.
Classmethod to create the dataloader object.
Returns the next item in the iteration.
- get_next_item()
Returns the next item in the iteration.
This function iterates over the data stream from the reader, tokenizes the data, and yields dictionaries that map feature names to NumPy arrays.
- Returns
An iterator yielding dictionaries with string keys and NumPy array values.
- Return type
Iterator[Dict[str, np.ndarray]]
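The read, tokenize, and yield flow described for get_next_item() can be illustrated with a small stand-alone generator. This is only a sketch of the pattern, not the library's implementation: the function name `toy_get_next_item`, the whitespace "tokenizer", the `max_len` parameter, and the feature names `input_ids` and `attention_mask` are all hypothetical and chosen for illustration.

```python
from typing import Dict, Iterable, Iterator

import numpy as np


def toy_get_next_item(
    texts: Iterable[str], max_len: int = 4
) -> Iterator[Dict[str, np.ndarray]]:
    """Hypothetical stand-in that mimics the documented generator behavior."""
    vocab: Dict[str, int] = {}  # toy vocabulary built on the fly (assumption)
    for text in texts:
        # Toy whitespace tokenizer: each new word gets the next free id.
        # Ids start at 1 so that 0 can serve as the padding id.
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
        ids = ids[:max_len]  # truncate to the fixed length
        pad = max_len - len(ids)
        mask = [1] * len(ids) + [0] * pad
        ids = ids + [0] * pad  # right-pad with zeros
        # Yield one sample: string keys mapped to NumPy arrays.
        yield {
            "input_ids": np.array(ids, dtype=np.int32),
            "attention_mask": np.array(mask, dtype=np.int32),
        }


items = list(toy_get_next_item(["a b c", "b c d e f"]))
```

Each yielded sample is a dictionary with string keys and NumPy array values, which matches the documented return type. The real method draws its data from the configured reader and tokenizer rather than from an in-memory list.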
- collate_fn(batch)
Collates a list of dictionaries into a batch.
- Parameters
batch (List[Dict[str, np.ndarray]]) – A list of dictionaries, where each dictionary contains string keys and NumPy array values.
- Returns
The collated batch.
- Return type
Any
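Because the return type is documented only as Any, the exact structure of the collated batch is not specified here. One common way to collate a list of per-sample dictionaries is to stack each feature along a new leading batch axis. The sketch below assumes that approach; the function name `toy_collate` and the feature names are hypothetical, and the actual collate_fn may behave differently.

```python
from typing import Dict, List

import numpy as np


def toy_collate(batch: List[Dict[str, np.ndarray]]) -> Dict[str, np.ndarray]:
    """Hypothetical collate: stack every feature along a new batch axis."""
    # Assumes all samples share the same keys and per-feature shapes.
    keys = batch[0].keys()
    return {key: np.stack([sample[key] for sample in batch]) for key in keys}


batch = [
    {"input_ids": np.array([1, 2, 3, 0]), "attention_mask": np.array([1, 1, 1, 0])},
    {"input_ids": np.array([2, 3, 4, 5]), "attention_mask": np.array([1, 1, 1, 1])},
]
collated = toy_collate(batch)
print(collated["input_ids"].shape)  # (2, 4): two samples of length 4
```

Stacking requires every sample to have the same shape for a given feature, so this sketch relies on samples having been padded or truncated to a fixed length beforehand.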