cerebras.modelzoo.data.nlp.bert.bert_utils.create_masked_lm_predictions
- cerebras.modelzoo.data.nlp.bert.bert_utils.create_masked_lm_predictions(tokens, max_sequence_length, mask_token_id, max_predictions_per_seq, input_pad_id, attn_mask_pad_id, labels_pad_id, tokenize, vocab_size, masked_lm_prob, rng, exclude_from_masking, mask_whole_word, replacement_pool=None)
Creates the predictions for the masked LM objective.
- Parameters
tokens (list) – Tokens to process.
max_sequence_length (int) – Maximum sequence length.
mask_token_id (int) – Id of the masked token.
max_predictions_per_seq (int) – Maximum number of masked LM predictions per sequence.
input_pad_id (int) – Input sequence padding id.
attn_mask_pad_id (int) – Attention mask padding id.
labels_pad_id (int) – Labels padding id.
tokenize (callable) – Method to tokenize the input sequence.
vocab_size (int) – Size of the vocabulary.
masked_lm_prob (float) – Masked LM probability.
rng (random.Random) – Object with shuffle function.
exclude_from_masking (list) – List of tokens to exclude from masking.
mask_whole_word (bool) – Whether to mask whole words.
replacement_pool (list) – List of ids to draw from when replacing tokens with random words from the vocabulary. Defaults to None, in which case any token from the vocabulary may be used.
- Returns
A tuple containing:
- np.array[int.32] input_ids: Numpy array with input token indices. Shape: (max_sequence_length).
- np.array[int.32] labels: Numpy array with labels. Shape: (max_sequence_length).
- np.array[int.32] attention_mask: Numpy array with the attention mask. Shape: (max_sequence_length).
- np.array[int.32] masked_lm_mask: Numpy array with a mask of predicted tokens. Shape: (max_predictions_per_seq). 0 indicates a non-masked token and 1 indicates a masked token.
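A minimal usage sketch is given below. The whitespace tokenizer and the special-token values (MASK_ID, PAD_ID, VOCAB_SIZE) are illustrative placeholders, not values prescribed by the ModelZoo; in a real pipeline they come from the tokenizer and vocabulary in use. The return values are unpacked in the order documented above.

```python
import random

from cerebras.modelzoo.data.nlp.bert.bert_utils import create_masked_lm_predictions

# Illustrative placeholder ids; real values depend on the vocabulary used.
MASK_ID, PAD_ID, VOCAB_SIZE = 103, 0, 30522

def tokenize(text):
    # Toy whitespace tokenizer used only for this sketch.
    return text.lower().split()

tokens = ["[CLS]", "the", "quick", "brown", "fox", "[SEP]"]

input_ids, labels, attention_mask, masked_lm_mask = create_masked_lm_predictions(
    tokens=tokens,
    max_sequence_length=128,
    mask_token_id=MASK_ID,
    max_predictions_per_seq=20,
    input_pad_id=PAD_ID,
    attn_mask_pad_id=0,
    labels_pad_id=0,
    tokenize=tokenize,
    vocab_size=VOCAB_SIZE,
    masked_lm_prob=0.15,
    rng=random.Random(42),
    exclude_from_masking=["[CLS]", "[SEP]"],
    mask_whole_word=False,
)
```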