cerebras.modelzoo.data_preparation.nlp.bert.sentence_pair_processor.data_generator#
- cerebras.modelzoo.data_preparation.nlp.bert.sentence_pair_processor.data_generator(metadata_files, vocab_file, do_lower, split_num, max_seq_length, short_seq_prob, mask_whole_word, max_predictions_per_seq, masked_lm_prob, dupe_factor, output_type_shapes, min_short_seq_length=None, multiple_docs_in_single_file=False, multiple_docs_separator='\n', single_sentence_per_line=False, inverted_mask=False, seed=None, spacy_model='en_core_web_sm', input_files_prefix='', sop_labels=False)[source]#
Generator function used to create the input dataset for MLM + NSP pre-training. It proceeds as follows:
1. Generate raw examples by concatenating two parts ‘tokens-a’ and ‘tokens-b’ as follows: [CLS] <tokens-a> [SEP] <tokens-b> [SEP], where:
tokens-a: a list of tokens taken from the current document, of random length (less than msl).
tokens-b: a list of tokens chosen based on the randomly set “next_sentence_labels”, of length msl - len(<tokens-a>) - 3 (to account for one [CLS] and two [SEP] tokens).
- If “next_sentence_labels” is 1 (set to 1 with 0.5 probability),
tokens-b is a list of tokens from sentences chosen randomly from a different document.
- Else,
tokens-b is a list of tokens taken from the same document, continuing tokens-a.
The number of raw tokens also depends on “short_seq_prob”. 2. Mask the raw examples based on “max_predictions_per_seq” and “masked_lm_prob”. 3. Pad the masked example to “max_seq_length” if it is shorter than msl.
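The assembly and masking steps above can be sketched as follows. This is a minimal, hypothetical illustration; the helper names and simplified logic are assumptions, and the actual generator additionally handles document sampling, short sequences, whole-word masking, and duplication:

```python
import random

def build_pair_sketch(tokens_a, tokens_b, max_seq_length):
    """Assemble [CLS] <tokens-a> [SEP] <tokens-b> [SEP] (step 1)."""
    # 3 accounts for one [CLS] and two [SEP] special tokens
    max_b = max_seq_length - len(tokens_a) - 3
    tokens_b = tokens_b[:max_b]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for tokens-a (plus [CLS] and first [SEP]), 1 for tokens-b
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

def mask_tokens_sketch(tokens, masked_lm_prob, max_predictions_per_seq, rng):
    """Statically mask a proportion of non-special tokens (step 2)."""
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    # Mask masked_lm_prob of the candidates, capped at max_predictions_per_seq
    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(round(len(candidates) * masked_lm_prob))))
    positions = sorted(rng.sample(candidates, num_to_mask))
    labels = [tokens[i] for i in positions]
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions, labels
```

With dupe_factor > 1, the real pipeline repeats the masking step with different random choices, producing several statically masked copies of each example.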
- Parameters
metadata_files (str or list[str]) – A string or a list of strings, each pointing to a metadata file. A metadata file contains file paths for cleaned plain-text documents, one file path per line.
vocab_file (str) – Vocabulary file used to build the tokenizer.
do_lower (bool) – Whether to convert words to lowercase.
split_num (int) – Number of input files to read at a given time for processing.
max_seq_length (int) – Maximum length of the sequence to generate
short_seq_prob (float) – Probability of generating a short sequence. Defaults to 0. Sometimes we want to use shorter sequences to minimize the mismatch between pre-training and fine-tuning.
mask_whole_word (bool) – If True, all subtokens corresponding to a word will be masked.
max_predictions_per_seq (int) – Maximum number of masked tokens in a sequence.
masked_lm_prob (float) – Proportion of tokens to be masked
dupe_factor (int) – Number of times to duplicate the dataset with different static masks
min_short_seq_length (int) – When short_seq_prob > 0, the minimum number of tokens each example should have, i.e., the number of tokens (excluding padding) will be in the range [min_short_seq_length, max_seq_length].
output_type_shapes (dict) – Dictionary indicating the shapes of different outputs
multiple_docs_in_single_file (bool) – True when a single text file contains multiple documents separated by multiple_docs_separator.
multiple_docs_separator (str) – String which separates multiple documents in a single text file.
single_sentence_per_line (bool) – True when the document is already split into sentences, with one sentence per line, so no further sentence segmentation of the document is required.
inverted_mask (bool) – If set to False, has 0’s on padded positions and 1’s elsewhere. Otherwise, “inverts” the mask, so that 1’s are on padded positions and 0’s elsewhere.
seed (int) – Random seed.
spacy_model (str) – spaCy model to load, i.e. a shortcut link, package name or path. Used to segment text into sentences.
input_files_prefix (str) – Prefix to be added to paths of the input files.
sop_labels (bool) – If True, negative examples of the dataset will be two consecutive sentences in reversed order. Otherwise, uses regular (NSP) labels, where negative examples are drawn from different documents.
- Returns
Yields training examples of the form (features, label), where label is the next-sentence-prediction label.
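The padding step and the inverted_mask option described above can be sketched with a small hypothetical helper (the helper name and the pad_id default are assumptions, not part of the library's API):

```python
def pad_sketch(input_ids, max_seq_length, inverted_mask=False, pad_id=0):
    """Pad a masked example up to max_seq_length and build the input mask.

    With inverted_mask=False the mask has 0's on padded positions and 1's
    elsewhere; inverted_mask=True flips this, matching the parameter
    description above.
    """
    num_pad = max_seq_length - len(input_ids)
    # 1's over real tokens, 0's over padding (before any inversion)
    mask = [1] * len(input_ids) + [0] * num_pad
    if inverted_mask:
        mask = [1 - m for m in mask]
    return input_ids + [pad_id] * num_pad, mask
```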