cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor.BertCSVDynamicMaskDataProcessor
- class cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor.BertCSVDynamicMaskDataProcessor(*args, **kwargs)[source]
Bases: torch.utils.data.IterableDataset
Reads CSV files containing the input text tokens and adds MLM features on the fly.

Parameters:
params (dict) – Input parameters for creating the dataset. Expects the following fields (an example configuration follows this list):
- "data_dir" (string): Path to the data files to use.
- "batch_size" (int): Batch size.
- "shuffle" (bool): Flag to enable data shuffling.
- "shuffle_seed" (int): Shuffle seed.
- "shuffle_buffer" (int): Shuffle buffer size.
- "mask_whole_word" (bool): Flag to mask whole words rather than individual subword tokens.
- "do_lower" (bool): Flag to lowercase the text.
- "dynamic_mlm_scale" (bool): Flag to dynamically scale the loss.
- "num_workers" (int): Number of subprocesses to use for data loading.
- "drop_last" (bool): If True and the dataset size is not divisible by the batch size, the last incomplete batch is dropped.
- "prefetch_factor" (int): Number of batches loaded in advance by each worker.
- "persistent_workers" (bool): If True, the data loader does not shut down the worker processes after a dataset has been consumed once.
- "oov_token" (string): Out-of-vocabulary token.
- "mask_token" (string): Mask token.
- "document_separator_token" (string): Document separator token.
- "exclude_from_masking" (list(string)): Tokens that should be excluded from being masked.
- "max_sequence_length" (int): Maximum length of the sequence to generate.
- "max_predictions_per_seq" (int): Maximum number of masked tokens per sequence.
- "masked_lm_prob" (float): Ratio of masked tokens to the sequence length.
- "gather_mlm_labels" (bool): Flag to gather MLM labels.
- "mixed_precision" (bool): Casts the input mask to fp16 if set to True; otherwise the generated mask is float32.
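For illustration, a params dict covering the fields above might look like the following minimal sketch; all values are placeholders chosen for this example, not defaults from the library.

```python
# Hypothetical example configuration; field names follow the list above,
# values are illustrative placeholders, not library defaults.
params = {
    "data_dir": "/path/to/csv/files",
    "batch_size": 256,
    "shuffle": True,
    "shuffle_seed": 1337,
    "shuffle_buffer": 2560,
    "mask_whole_word": False,
    "do_lower": True,
    "dynamic_mlm_scale": False,
    "num_workers": 4,
    "drop_last": True,
    "prefetch_factor": 10,
    "persistent_workers": True,
    "oov_token": "[UNK]",
    "mask_token": "[MASK]",
    "document_separator_token": "[SEP]",
    "exclude_from_masking": ["[CLS]", "[SEP]", "[PAD]", "[UNK]"],
    "max_sequence_length": 128,
    "max_predictions_per_seq": 20,
    "masked_lm_prob": 0.15,
    "gather_mlm_labels": True,
    "mixed_precision": True,
}
```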
Methods

- create_dataloader – Classmethod to create the dataloader object.
- get_single_item – Iterates over the data to construct input features.
- load_buffer – Generator to read the data in chunks of the size of data_buffer.
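As a hedged usage sketch: assuming the processor is constructed from the params dict above and that the dataloader is built via create_dataloader on the instance (the exact construction and signature are not confirmed by this page), iteration might look like:

```python
from cerebras.modelzoo.data.nlp.bert.BertCSVDynamicMaskDataProcessor import (
    BertCSVDynamicMaskDataProcessor,
)

# Hypothetical usage; constructor arguments and the create_dataloader
# invocation are assumptions based on the method summary above.
processor = BertCSVDynamicMaskDataProcessor(params)
dataloader = processor.create_dataloader()

for batch in dataloader:
    # Batch keys are assumed to match the feature names documented below.
    input_ids = batch["input_ids"]  # expected: (batch_size, max_sequence_length)
    break
```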
- load_buffer()[source]
Generator to read the data in chunks of the size of data_buffer.
- Returns
Yields the data stored in the data_buffer.
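The chunked-buffer behavior described here can be illustrated with a generic generator; this is a minimal sketch of the pattern, not the library's implementation (the function name and buffer policy are invented for illustration):

```python
def buffered_reader_sketch(rows, buffer_size):
    """Illustrative only: accumulate rows into a buffer of buffer_size,
    then yield the buffered chunk, mimicking a data_buffer-style reader."""
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == buffer_size:
            yield buffer          # hand off one full chunk
            buffer = []
    if buffer:                    # flush the final partial chunk
        yield buffer
```

A shuffle-buffer variant of this pattern would additionally draw samples from random positions in the buffer before yielding, which is how shuffle_buffer-style shuffling is typically implemented.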
- get_single_item()[source]
Iterates over the data to construct input features.
- Returns
A tuple with training features:
- np.array[int32] input_ids: Numpy array with input token indices. Shape: (max_sequence_length).
- np.array[int32] labels: Numpy array with labels. Shape: (max_sequence_length).
- np.array[int32] attention_mask: Numpy array with the attention mask. Shape: (max_sequence_length).
- np.array[int32] token_type_ids: Numpy array with segment indices. Shape: (max_sequence_length).
- np.array[int32] next_sentence_label: Numpy array with labels for the NSP task. Shape: (1).
- np.array[int32] masked_lm_mask: Numpy array with a mask of predicted tokens. Shape: (max_predictions_per_seq). 0 indicates a non-masked token, and 1 indicates a masked token.
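To make the feature layout concrete, below is a simplified numpy sketch of dynamic masking. It is a generic illustration of the technique, not the processor's code: the function name is invented, the -100 ignore-label convention is an assumption (a common PyTorch MLM convention), and special-token exclusion and the usual 80/10/10 replacement scheme are omitted for brevity.

```python
import numpy as np

def dynamic_mask_sketch(token_ids, mask_id, masked_lm_prob,
                        max_predictions_per_seq, rng):
    """Illustrative dynamic MLM masking: choose random positions, replace
    them with the mask token, and build labels plus masked_lm_mask."""
    token_ids = np.asarray(token_ids, dtype=np.int32)
    seq_len = len(token_ids)
    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(seq_len * masked_lm_prob)))
    positions = rng.choice(seq_len, size=num_to_mask, replace=False)

    input_ids = token_ids.copy()
    labels = np.full(seq_len, -100, dtype=np.int32)   # -100 = ignored (assumption)
    labels[positions] = token_ids[positions]          # true ids at masked spots
    input_ids[positions] = mask_id

    # 1 marks a position holding a prediction, 0 marks prediction padding.
    masked_lm_mask = np.zeros(max_predictions_per_seq, dtype=np.int32)
    masked_lm_mask[:num_to_mask] = 1
    return input_ids, labels, masked_lm_mask

rng = np.random.default_rng(0)
ids, lbls, mlm_mask = dynamic_mask_sketch(
    list(range(101, 121)), mask_id=103,
    masked_lm_prob=0.15, max_predictions_per_seq=20, rng=rng)
```

Because the masking is applied on the fly, each pass over the same CSV rows can produce a different mask pattern.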