cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.fim#
- cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.fim(sample_array, sample_idx, tokenizer, fim_rate, spm_rate, suffix_tok_id, prefix_tok_id, middle_tok_id, fim_pad_tok_id, eos_tok_id, opt_bos_tok_id)[source]#
Takes in an array of input_ids, mask, and labels and, with probability fim_rate, performs the FIM (fill-in-the-middle) operation, re-arranging the context into PSM or SPM format.
- Parameters
sample_array (np.array) – Stack of input_ids, mask, and labels after tokenization. Labels are off-by-one of input_ids, as in standard auto-regressive training.
sample_idx (int) – Index of the sample in the dataset, used for logging.
tokenizer (Tokenizer) – Tokenizer object
fim_rate (float) – Determines what percentage of contexts are FIM’ed
spm_rate (float) – Determines what percentage of FIM’ed contexts are in SPM format; the remaining 1 - spm_rate are in PSM format.
suffix_tok_id (int) – Id for special token denoting suffix section in a FIM’ed context
prefix_tok_id (int) – Id for special token denoting prefix section in a FIM’ed context
middle_tok_id (int) – Id for special token denoting middle section in a FIM’ed context
fim_pad_tok_id (int) – Id for padding
eos_tok_id (int) – Id for the end-of-sequence token
opt_bos_tok_id (list) – List containing the BOS token id if the model’s tokenizer adds a BOS token by default; otherwise an empty list, which is a no-op in the concatenation.
- Returns
Stack of input_ids, mask, and labels after FIM transformation. Mask and labels have been adjusted to still filter padding tokens and represent the following token, respectively.
- Return type
fim_outputs (np.array)
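The PSM/SPM re-arrangement can be sketched as below. This is an illustrative assumption, not the modelzoo implementation: `fim_sketch` is a hypothetical helper that operates on a plain token list and omits the mask/label stack, padding, and EOS/BOS handling that the real function performs, and sentinel-ordering conventions for SPM vary between implementations.

```python
import numpy as np

def fim_sketch(tokens, rng, fim_rate, spm_rate,
               prefix_tok_id, middle_tok_id, suffix_tok_id):
    """Hypothetical sketch of the FIM re-arrangement (not the modelzoo code)."""
    if rng.random() >= fim_rate:
        return list(tokens)  # leave this context un-FIM'ed

    # Pick two boundaries to split the context into prefix/middle/suffix.
    lo, hi = sorted(int(x) for x in rng.integers(0, len(tokens) + 1, size=2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]

    if rng.random() < spm_rate:
        # SPM: suffix section first, then prefix, then the middle to infill.
        return ([suffix_tok_id] + suffix
                + [prefix_tok_id] + prefix
                + [middle_tok_id] + middle)
    # PSM: prefix, then suffix, then the middle the model must infill.
    return ([prefix_tok_id] + prefix
            + [suffix_tok_id] + suffix
            + [middle_tok_id] + middle)
```

In either format the original tokens are preserved, with three sentinel ids inserted to mark the sections; the model learns to generate the middle span conditioned on prefix and suffix.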