cerebras.modelzoo.data_preparation.nlp.bert.sentence_pair_processor.data_generator#

cerebras.modelzoo.data_preparation.nlp.bert.sentence_pair_processor.data_generator(metadata_files, vocab_file, do_lower, split_num, max_seq_length, short_seq_prob, mask_whole_word, max_predictions_per_seq, masked_lm_prob, dupe_factor, output_type_shapes, min_short_seq_length=None, multiple_docs_in_single_file=False, multiple_docs_separator='\n', single_sentence_per_line=False, inverted_mask=False, seed=None, spacy_model='en_core_web_sm', input_files_prefix='', sop_labels=False)[source]#

Generator function used to create the input dataset for MLM + NSP pre-training.

1. Generate raw examples by concatenating two parts, tokens-a and tokens-b, as follows: [CLS] <tokens-a> [SEP] <tokens-b> [SEP], where:

   • tokens-a: a list of tokens taken from the current document, of random length (less than the maximum sequence length, msl).

   • tokens-b: a list of tokens chosen based on the randomly set "next_sentence_labels", of length msl - len(<tokens-a>) - 3 (to account for the [CLS] token and the two [SEP] tokens).

   If "next_sentence_labels" is 1 (it is set to 1 with 0.5 probability), tokens-b is a list of tokens from sentences chosen randomly from a different document; otherwise, tokens-b is a list of tokens taken from the same document and is a continuation of tokens-a.

   The number of raw tokens also depends on "short_seq_prob".

2. Mask the raw examples based on "max_predictions_per_seq" and "masked_lm_prob".

3. Pad each masked example to "max_seq_length" if it is shorter than msl.
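The three steps above can be sketched in Python. This is a simplified illustration, not the actual implementation: `build_example` and its token values are hypothetical, and the real pipeline applies BERT's 80/10/10 mask/random/keep policy rather than always substituting [MASK].

```python
import random

def build_example(tokens_a, tokens_b, max_seq_length,
                  masked_lm_prob, max_predictions_per_seq, seed=0):
    # Step 1: concatenate as [CLS] <tokens-a> [SEP] <tokens-b> [SEP].
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 over [CLS] tokens-a [SEP], 1 over tokens-b [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # Step 2: pick positions to mask (never the special tokens),
    # capped by max_predictions_per_seq.
    rng = random.Random(seed)
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    num_to_mask = min(max_predictions_per_seq,
                      max(1, int(round(len(candidates) * masked_lm_prob))))
    masked_positions = sorted(rng.sample(candidates, num_to_mask))
    masked_labels = [tokens[i] for i in masked_positions]
    for i in masked_positions:
        tokens[i] = "[MASK]"  # simplified; real pipeline is 80/10/10

    # Step 3: pad to max_seq_length; input_mask is 1 on real tokens, 0 on pads.
    input_mask = [1] * len(tokens)
    while len(tokens) < max_seq_length:
        tokens.append("[PAD]")
        segment_ids.append(0)
        input_mask.append(0)
    return tokens, segment_ids, input_mask, masked_positions, masked_labels
```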

Parameters
  • metadata_files (str or list[str]) – A path, or list of paths, to metadata files. A metadata file contains file paths for flat, cleaned text documents, one file path per line.

  • vocab_file (str) – Vocabulary file used to build the tokenizer.

  • do_lower (bool) – Whether to convert words to lowercase.

  • split_num (int) – Number of input files to read at a given time for processing.

  • max_seq_length (int) – Maximum length of the sequence to generate

  • short_seq_prob (float) – Probability of producing a short sequence. Defaults to 0. Shorter sequences are sometimes used to minimize the mismatch between pre-training and fine-tuning.

  • mask_whole_word (bool) – If True, all subtokens corresponding to a word will be masked.

  • max_predictions_per_seq (int) – Maximum number of masked tokens in a sequence.

  • masked_lm_prob (float) – Proportion of tokens to be masked

  • dupe_factor (int) – Number of times to duplicate the dataset with different static masks

  • min_short_seq_length (int) – When short_seq_prob > 0, the minimum number of tokens each example should have, i.e., the number of tokens (excluding padding) is in the range [min_short_seq_length, max_seq_length].

  • output_type_shapes (dict) – Dictionary indicating the shapes of the different outputs.

  • multiple_docs_in_single_file (bool) – True when a single text file contains multiple documents separated by multiple_docs_separator.

  • multiple_docs_separator (str) – String which separates multiple documents in a single text file.

  • single_sentence_per_line (bool) – True when the document is already split into sentences, with one sentence per line, so no further sentence segmentation is required.

  • inverted_mask (bool) – If False, the mask has 0’s on padded positions and 1’s elsewhere. Otherwise, the mask is “inverted”: 1’s on padded positions and 0’s elsewhere.

  • seed (int) – Random seed.

  • spacy_model (str) – spaCy model to load, i.e., a shortcut link, package name, or path. Used to segment text into sentences.

  • input_files_prefix (str) – Prefix to be added to the paths of the input files.

  • sop_labels (bool) – If True, negative examples of the dataset are two consecutive sentences in reversed order (sentence-order prediction). Otherwise, uses regular NSP labels, where negative examples come from different documents.
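The difference between the two label schemes can be sketched with a hypothetical helper (not part of the API; the sentence variables are illustrative):

```python
def make_sentence_pair(tokens_a, continuation, random_other_doc,
                       is_negative, sop_labels):
    """Choose tokens-b for one raw example (hypothetical illustration)."""
    if not is_negative:
        # Positive example: tokens-b is the true continuation of tokens-a.
        return tokens_a, continuation
    if sop_labels:
        # SOP negative: the same two consecutive sentences, in reversed order.
        return continuation, tokens_a
    # NSP negative: tokens-b is drawn from a different document.
    return tokens_a, random_other_doc
```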

Returns

Yields training examples (feature, label), where label is the next-sentence-prediction label.
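The yielded structure can be illustrated with a toy stand-in generator. This is not the real API: the feature-dictionary keys below mirror typical BERT pre-training features and are assumptions, not the function's documented output.

```python
def toy_data_generator(num_examples=2, max_seq_length=8):
    # Stand-in that mimics the (feature, label) yield structure only.
    for i in range(num_examples):
        features = {
            "input_ids": [0] * max_seq_length,    # token ids (assumed key)
            "input_mask": [1] * max_seq_length,   # 1 on real tokens (assumed key)
            "segment_ids": [0] * max_seq_length,  # sentence A/B ids (assumed key)
        }
        next_sentence_label = i % 2  # 0 = true continuation, 1 = negative example
        yield features, next_sentence_label

examples = list(toy_data_generator())
```

A consumer would iterate the generator the same way, unpacking each yielded pair into the feature dictionary and its next-sentence-prediction label.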