cerebras.modelzoo.data_preparation.nlp.bert.mlm_only_processor.data_generator
- cerebras.modelzoo.data_preparation.nlp.bert.mlm_only_processor.data_generator(metadata_files, vocab_file, do_lower, disable_masking, mask_whole_word, max_seq_length, max_predictions_per_seq, masked_lm_prob, dupe_factor, output_type_shapes, multiple_docs_in_single_file=False, multiple_docs_separator='\n', single_sentence_per_line=False, buffer_size=1000000.0, min_short_seq_length=None, overlap_size=None, short_seq_prob=0, spacy_model='en_core_web_sm', inverted_mask=False, allow_cross_document_examples=True, document_separator_token='[SEP]', seed=None, input_files_prefix='')
Generator function used to create the input dataset for MLM-only pre-training. Generation proceeds in three steps:
1. Generate raw examples of tokens using a sliding-window approach, based on overlap_size, max_seq_length, allow_cross_document_examples and document_separator_token. The exact steps are detailed in the _create_examples_from_document function (a minimal sketch follows this list).
2. Mask the raw examples based on max_predictions_per_seq.
3. Pad each masked example to max_seq_length if it is shorter than max_seq_length.
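The sliding-window step can be pictured with the short, self-contained Python sketch below. This is not the library's implementation (that lives in _create_examples_from_document); the helper name, the reservation of two positions for [CLS]/[SEP], and the example token stream are illustrative assumptions, while the default overlap of max_seq_length/4 follows the description of overlap_size further down.

    # Illustrative sketch only -- not the modelzoo implementation.
    # Splits one long token stream into overlapping raw examples.
    def sliding_window_examples(tokens, max_seq_length, overlap_size=None):
        max_tokens = max_seq_length - 2          # reserve room for [CLS] and [SEP] (assumption)
        if overlap_size is None:
            overlap_size = max_seq_length // 4   # default described for overlap_size
        stride = max_tokens - overlap_size
        examples = []
        start = 0
        while start < len(tokens):
            chunk = tokens[start : start + max_tokens]
            examples.append(["[CLS]"] + chunk + ["[SEP]"])
            if start + max_tokens >= len(tokens):
                break
            start += stride
        return examples

    tokens = [f"tok{i}" for i in range(300)]
    for example in sliding_window_examples(tokens, max_seq_length=128, overlap_size=32):
        print(len(example), example[:2], example[-2:])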
- Parameters
metadata_files (str or list[str]) – A path, or a list of paths, to metadata files. Each metadata file lists the paths of cleaned plain-text documents, one file path per line.
vocab_file (str) – Vocabulary file used to build the tokenizer.
do_lower (bool) – Whether to convert words to lowercase.
disable_masking (bool) – Whether masking should be disabled.
mask_whole_word (bool) – If True, all subtokens corresponding to a word will be masked (see the masking sketch after this parameter list).
max_seq_length (int) – Maximum length of the sequence to generate
max_predictions_per_seq (int) – Maximum number of masked tokens in a sequence.
masked_lm_prob (float) – Proportion of tokens to be masked
dupe_factor (int) – Number of times to duplicate the dataset with different static masks
output_type_shapes (dict) – Dictionary indicating the shapes of different outputs
multiple_docs_in_single_file (bool) – Set to True when a single text file contains multiple documents separated by multiple_docs_separator.
multiple_docs_separator (str) – String which separates multiple documents in a single text file.
single_sentence_per_line (bool) – Set to True when the document is already split into sentences, with one sentence per line, so no further sentence segmentation is required.
buffer_size (int) – Number of tokens to be processed at a time
min_short_seq_length (int) – When short_seq_prob > 0, the minimum number of tokens each example should have, i.e., the number of tokens (excluding padding) falls in the range [min_short_seq_length, max_seq_length].
overlap_size (int) – Number of tokens that overlap with the previous example when processing the buffer with a sliding-window approach. If None, defaults to max_seq_length/4.
short_seq_prob (float) – Probability of generating a short sequence. Defaults to 0. Shorter sequences are sometimes used to minimize the mismatch between pre-training and fine-tuning.
spacy_model (str) – spaCy model to load, i.e., a shortcut link, package name, or path. Used to segment text into sentences.
inverted_mask (bool) – If set to False, has 0’s on padded positions and 1’s elsewhere. Otherwise, “inverts” the mask, so that 1’s are on padded positions and 0’s elsewhere.
allow_cross_document_examples (bool) – If True, the sequences can contain tokens from the next document.
document_separator_token (str) – String that separates the tokens of one document from the next when sequences span documents.
seed (int) – Random seed.
input_files_prefix (str) – Prefix to be added to the paths of the input files.
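The interaction of masked_lm_prob, max_predictions_per_seq and mask_whole_word can be illustrated with the hedged sketch below. It is not the processor's implementation: the function name is hypothetical, it assumes WordPiece continuation pieces begin with "##", and it omits the standard BERT 80/10/10 replacement of masked positions for brevity.

    import random

    def create_masked_predictions(tokens, masked_lm_prob,
                                  max_predictions_per_seq, mask_whole_word, rng):
        # Group token positions into candidate units: whole words when
        # mask_whole_word is True, otherwise individual subtokens.
        candidates = []
        for i, tok in enumerate(tokens):
            if tok in ("[CLS]", "[SEP]"):
                continue
            if mask_whole_word and candidates and tok.startswith("##"):
                candidates[-1].append(i)     # continuation piece joins its word
            else:
                candidates.append([i])

        # Number of positions to mask: a fraction of the sequence,
        # capped by max_predictions_per_seq.
        num_to_mask = min(max_predictions_per_seq,
                          max(1, int(round(len(tokens) * masked_lm_prob))))
        rng.shuffle(candidates)

        output, positions, labels = list(tokens), [], []
        for unit in candidates:
            if len(positions) + len(unit) > num_to_mask:
                continue
            for i in unit:
                positions.append(i)
                labels.append(tokens[i])
                output[i] = "[MASK]"
        return output, positions, labels

    rng = random.Random(0)
    toks = ["[CLS]", "the", "embed", "##ding", "layer", "[SEP]"]
    print(create_masked_predictions(toks, 0.15, 2, True, rng))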
- Returns
Yields training examples of the form (feature, []).
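A hedged usage sketch is shown below. The import path and keyword arguments follow the documented signature, but the file paths, the layout of output_type_shapes, and the structure of each yielded feature are placeholder assumptions; consult the processor configuration for the actual values.

    from cerebras.modelzoo.data_preparation.nlp.bert.mlm_only_processor import (
        data_generator,
    )

    # Placeholder shapes -- the real keys/shapes come from the processor config.
    output_type_shapes = {
        "input_ids": [128],
        "attention_mask": [128],
        "labels": [128],
    }

    examples = data_generator(
        metadata_files="metadata/train_files.txt",   # placeholder path
        vocab_file="vocab/uncased_vocab.txt",        # placeholder path
        do_lower=True,
        disable_masking=False,
        mask_whole_word=False,
        max_seq_length=128,
        max_predictions_per_seq=20,
        masked_lm_prob=0.15,
        dupe_factor=10,
        output_type_shapes=output_type_shapes,
        seed=0,
    )

    for feature, _ in examples:
        # Each yield is a (feature, []) tuple, as documented in Returns.
        print(type(feature))
        break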