cerebras.modelzoo.data_preparation.nlp.t5.utils.create_transformer_input_features#
- cerebras.modelzoo.data_preparation.nlp.t5.utils.create_transformer_input_features(src_tokens, tgt_tokens, src_max_sequence_length, tgt_max_sequence_length, input_pad_id, attn_mask_pad_id, labels_pad_id, tokenize, sos_token='<s>', eos_token='</s>')[source]#
Creates features for Transformer model input.
- Parameters
src_tokens (list) – Input tokens to process.
tgt_tokens (list) – Target tokens to process.
src_max_sequence_length (int) – Maximum sequence length of the encoder input.
tgt_max_sequence_length (int) – Maximum sequence length of the decoder input.
input_pad_id (int) – Input sequence padding id.
attn_mask_pad_id (int) – Attention mask padding id.
labels_pad_id (int) – Labels padding id.
tokenize (callable) – Method to tokenize the input sequence.
sos_token (str) – String representing the start-of-sequence (SOS) token in the vocabulary.
eos_token (str) – String representing the end-of-sequence (EOS) token in the vocabulary.
- Returns
A dict that includes:
- np.array[int32] input_ids: Numpy array with encoder input token indices. Shape: (src_max_sequence_length).
- np.array[int32] decoder_input_ids: Numpy array with decoder input token indices. Shape: (tgt_max_sequence_length).
- np.array[int32] attention_mask: Numpy array with the attention mask for the encoder. Shape: (src_max_sequence_length).
- np.array[int32] decoder_attention_mask: Numpy array with the attention mask for the decoder. Shape: (tgt_max_sequence_length). In the attention masks, 1 indicates a non-masked token and 0 indicates a masked token.
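A minimal usage sketch is shown below. The toy vocabulary, the pad ids, and the tokenize callable (assumed here to map token strings to vocabulary indices) are illustrative assumptions, not library defaults:

```python
from cerebras.modelzoo.data_preparation.nlp.t5.utils import (
    create_transformer_input_features,
)

# Hypothetical toy vocabulary; real pipelines would use a trained tokenizer.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4, "bonjour": 5}

def tokenize(tokens):
    # Assumed contract: convert a list of token strings to vocabulary indices.
    return [vocab[t] for t in tokens]

features = create_transformer_input_features(
    src_tokens=["hello", "world"],
    tgt_tokens=["bonjour"],
    src_max_sequence_length=8,
    tgt_max_sequence_length=8,
    input_pad_id=0,
    attn_mask_pad_id=0,
    labels_pad_id=0,  # illustrative choice; set to match your loss setup
    tokenize=tokenize,
    sos_token="<s>",
    eos_token="</s>",
)

# Inspect the padded feature arrays, e.g. input_ids, decoder_input_ids,
# attention_mask, decoder_attention_mask.
for name, arr in features.items():
    print(name, arr.shape, arr.dtype)
```

Each returned array is padded to its respective maximum sequence length, so every example in a dataset yields fixed-shape features suitable for batching.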