cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor.BertCSVDataProcessor#
- class cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor.BertCSVDataProcessor(*args, **kwargs)[source]#
Bases: torch.utils.data.IterableDataset
Reads CSV files containing the input text tokens and MLM features.
- Parameters
params (dict): Dict containing input parameters for creating the dataset. Expects the following fields (an example params dict follows this list):
- “data_dir” (string): Path to the data files to use.
- “batch_size” (int): Batch size.
- “shuffle” (bool): Flag to enable data shuffling.
- “shuffle_seed” (int): Shuffle seed.
- “shuffle_buffer” (int): Shuffle buffer size.
- “dynamic_mlm_scale” (bool): Flag to dynamically scale the MLM loss.
- “num_workers” (int): Number of subprocesses to use for data loading.
- “drop_last” (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch will be dropped.
- “prefetch_factor” (int): Number of samples loaded in advance by each worker.
- “persistent_workers” (bool): If True, the data loader will not shut down the worker processes after a dataset has been consumed once.
- “mixed_precision” (bool): Casts the input mask to fp16 if set to True. Otherwise, the generated mask is float32.
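A minimal sketch of a params dict covering the fields above; the data path and all values are illustrative assumptions, not verified defaults:

```python
# Illustrative only: the path and values below are assumptions,
# not recommended or default settings.
params = {
    "data_dir": "./bert_csv_data/train",  # hypothetical path to pre-tokenized CSV files
    "batch_size": 256,
    "shuffle": True,
    "shuffle_seed": 1,
    "shuffle_buffer": 16384,
    "dynamic_mlm_scale": True,
    "num_workers": 4,
    "drop_last": True,
    "prefetch_factor": 10,
    "persistent_workers": True,
    "mixed_precision": True,
}
```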
Methods
- create_dataloader: Classmethod to create the dataloader object (usage sketch below).
- get_single_item: Iterates over the data to construct input features.
- load_buffer: Generator to read the data in chunks of size data_buffer.
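The snippet below is a rough usage sketch. It assumes the processor is constructed from a params dict like the one above and that create_dataloader() takes no further arguments; neither assumption is verified against the source.

```python
from cerebras.modelzoo.data.nlp.bert.BertCSVDataProcessor import BertCSVDataProcessor

# Assumption: the processor accepts the params dict described above.
processor = BertCSVDataProcessor(params)

# Assumption: create_dataloader() wraps this IterableDataset in a
# torch.utils.data.DataLoader using the batching/worker settings from params.
dataloader = processor.create_dataloader()

for batch in dataloader:
    # Each batch is expected to hold the MLM/NSP features documented under
    # get_single_item(), collated along a leading batch dimension.
    break
```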
- load_buffer()[source]#
Generator to read the data in chunks of size data_buffer (a generic buffered-reader pattern is sketched below).
- Returns
Yields the data stored in the data_buffer.
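The sketch below only illustrates the buffered-generator pattern described here, using the standard csv module; it is not the actual load_buffer implementation, and the file list, row format, and buffer size are assumptions.

```python
import csv

def load_buffer_sketch(csv_files, buffer_size=1024):
    """Generic buffered reader: accumulate rows into data_buffer, then yield them."""
    data_buffer = []
    for path in csv_files:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                data_buffer.append(row)
                if len(data_buffer) >= buffer_size:
                    # Hand back a full chunk before reading further.
                    yield from data_buffer
                    data_buffer = []
    # Flush any remaining rows after the last file.
    yield from data_buffer
```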
- get_single_item()[source]#
Iterates over the data to construct input features.
- Returns
A tuple with training features (a shape-check sketch follows this list):
- np.array[int32] input_ids: Numpy array with input token indices. Shape: (max_sequence_length).
- np.array[int32] labels: Numpy array with labels. Shape: (max_sequence_length).
- np.array[int32] attention_mask: Numpy array with the attention mask. Shape: (max_sequence_length).
- np.array[int32] token_type_ids: Numpy array with segment indices. Shape: (max_sequence_length).
- np.array[int32] next_sentence_label: Numpy array with labels for the NSP task. Shape: (1).
- np.array[int32] masked_lm_mask: Numpy array with a mask of predicted tokens. Shape: (max_predictions). 0 indicates a non-masked token, and 1 indicates a masked token.
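As a rough consistency check of the documented shapes, a consumer might verify one yielded sample as below. The tuple ordering follows the list above, and the max_sequence_length / max_predictions values are assumed placeholders.

```python
import numpy as np

MAX_SEQUENCE_LENGTH = 128  # assumed value, fixed at data-preparation time
MAX_PREDICTIONS = 20       # assumed value

def check_sample(sample):
    # Assumption: the tuple ordering matches the documented feature list.
    input_ids, labels, attention_mask, token_type_ids, nsp_label, mlm_mask = sample
    assert input_ids.shape == (MAX_SEQUENCE_LENGTH,)
    assert labels.shape == (MAX_SEQUENCE_LENGTH,)
    assert attention_mask.shape == (MAX_SEQUENCE_LENGTH,)
    assert token_type_ids.shape == (MAX_SEQUENCE_LENGTH,)
    assert nsp_label.shape == (1,)
    assert mlm_mask.shape == (MAX_PREDICTIONS,)
    # masked_lm_mask is binary: 1 marks a masked (predicted) position.
    assert set(np.unique(mlm_mask)).issubset({0, 1})
```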