cerebras.modelzoo.data_preparation.nlp.bert.sentence_pair_processor.data_generator#
- cerebras.modelzoo.data_preparation.nlp.bert.sentence_pair_processor.data_generator(metadata_files, vocab_file, do_lower, split_num, max_seq_length, short_seq_prob, mask_whole_word, max_predictions_per_seq, masked_lm_prob, dupe_factor, output_type_shapes, min_short_seq_length=None, multiple_docs_in_single_file=False, multiple_docs_separator='\n', single_sentence_per_line=False, inverted_mask=False, seed=None, spacy_model='en_core_web_sm', input_files_prefix='', sop_labels=False)[source]#
Generator function used to create the input dataset for MLM + NSP pre-training. It proceeds as follows:
1. Generate raw examples by concatenating two parts ‘tokens-a’ and ‘tokens-b’ as follows: [CLS] <tokens-a> [SEP] <tokens-b> [SEP], where:
tokens-a: a list of tokens taken from the current document, of random length (less than msl).
tokens-b: a list of tokens chosen based on the randomly set “next_sentence_labels”, of length msl - len(<tokens-a>) - 3 (to account for one [CLS] and two [SEP] tokens).
- If “next_sentence_labels” is 1 (set to 1 with 0.5 probability),
tokens-b is a list of tokens from sentences chosen randomly from a different document.
- Else,
tokens-b is a list of tokens taken from the same document, continuing tokens-a.
The number of raw tokens also depends on “short_seq_prob”. 2. Mask the raw examples based on “max_predictions_per_seq” and “masked_lm_prob”. 3. Pad the masked example to “max_seq_length” if it is shorter than msl.
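The assembly and masking steps above can be sketched as follows. This is a minimal, hypothetical illustration; the helper names and simplified logic are assumptions, and the actual generator additionally handles document sampling, short sequences, whole-word masking, and duplication:

```python
import random

def build_pair_sketch(tokens_a, tokens_b, max_seq_length):
    """Assemble [CLS] <tokens-a> [SEP] <tokens-b> [SEP] (step 1)."""
    # 3 accounts for one [CLS] and two [SEP] special tokens
    max_b = max_seq_length - len(tokens_a) - 3
    tokens_b = tokens_b[:max_b]
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for tokens-a (plus [CLS] and first [SEP]), 1 for tokens-b
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

def mask_tokens_sketch(tokens, masked_lm_prob, max_predictions_per_seq, rng):
    """Statically mask a proportion of non-special tokens (step 2)."""
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    # Mask masked_lm_prob of the candidates, capped at max_predictions_per_seq
    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(round(len(candidates) * masked_lm_prob))))
    positions = sorted(rng.sample(candidates, num_to_mask))
    labels = [tokens[i] for i in positions]
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions, labels
```

With dupe_factor > 1, the real pipeline repeats the masking step with different random choices, producing several statically masked copies of each example.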
- Parameters
metadata_files (str or list[str]) – A string or a list of strings, each pointing to a metadata file. A metadata file contains file paths for cleaned plain-text documents, one file path per line.
vocab_file (str) – Vocabulary file used to build the tokenizer.
do_lower (bool) – Whether to convert words to lowercase.
split_num (int) – Number of input files to read at a given time for processing.
max_seq_length (int) – Maximum length of the sequence to generate
short_seq_prob (float) – Probability of generating a short sequence. Defaults to 0. Sometimes we want to use shorter sequences to minimize the mismatch between pre-training and fine-tuning.
mask_whole_word (bool) – If True, all subtokens corresponding to a word will be masked.
max_predictions_per_seq (int) – Maximum number of masked tokens in a sequence.
masked_lm_prob (float) – Proportion of tokens to be masked
dupe_factor (int) – Number of times to duplicate the dataset with different static masks
min_short_seq_length (int) – When short_seq_prob > 0, the minimum number of tokens each example should have, i.e., the number of tokens (excluding padding) will be in the range [min_short_seq_length, max_seq_length].
output_type_shapes (dict) – Dictionary indicating the shapes of different outputs
multiple_docs_in_single_file (bool) – True when a single text file contains multiple documents separated by multiple_docs_separator.
multiple_docs_separator (str) – String which separates multiple documents in a single text file.
single_sentence_per_line (bool) – True when the document is already split into sentences, with one sentence per line, so no further sentence segmentation of the document is required.
inverted_mask (bool) – If set to False, has 0’s on padded positions and 1’s elsewhere. Otherwise, “inverts” the mask, so that 1’s are on padded positions and 0’s elsewhere.
seed (int) – Random seed.
spacy_model (str) – spaCy model to load, i.e. a shortcut link, package name or path. Used to segment text into sentences.
input_files_prefix (str) – Prefix to be added to paths of the input files.
sop_labels (bool) – If True, negative examples of the dataset will be two consecutive sentences in reversed order. Otherwise, uses regular (NSP) labels, where negative examples are drawn from different documents.
- Returns
Yields training examples of the form (features, label), where label is the next-sentence-prediction label.
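The padding step and the inverted_mask option described above can be sketched with a small hypothetical helper (the helper name and the pad_id default are assumptions, not part of the library's API):

```python
def pad_sketch(input_ids, max_seq_length, inverted_mask=False, pad_id=0):
    """Pad a masked example up to max_seq_length and build the input mask.

    With inverted_mask=False the mask has 0's on padded positions and 1's
    elsewhere; inverted_mask=True flips this, matching the parameter
    description above.
    """
    num_pad = max_seq_length - len(input_ids)
    # 1's over real tokens, 0's over padding (before any inversion)
    mask = [1] * len(input_ids) + [0] * num_pad
    if inverted_mask:
        mask = [1 - m for m in mask]
    return input_ids + [pad_id] * num_pad, mask
```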