cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.format_fim
- cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.format_fim(segment_fim_format_pairs, max_seq_len, suffix_tok_id, prefix_tok_id, middle_tok_id, eos_tok_id, opt_bos_tok_id)
Takes a list of prefix/middle/suffix token lists, each paired with its FIM (or AR) format. Applies the transformation that the format calls for, adding the special tokens and reordering the sections, then concatenates everything together.
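The transformation described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper name `sketch_format`, the format strings "PSM" and "AR", the PSM layout (prefix, then suffix, then middle, each preceded by its special token), and all token ids are assumptions chosen for illustration.

```python
import numpy as np

def sketch_format(segments, fim_format, prefix_id, suffix_id, middle_id, eos_id, opt_bos=()):
    """Hypothetical helper: arrange prefix/middle/suffix according to a format string."""
    prefix, middle, suffix = segments
    if fim_format == "PSM":
        # Assumed layout: sentinel + prefix, sentinel + suffix, sentinel + middle, eos.
        parts = [opt_bos, [prefix_id], prefix, [suffix_id], suffix,
                 [middle_id], middle, [eos_id]]
    else:
        # "AR": plain left-to-right order with no sentinel tokens, only a trailing eos.
        parts = [opt_bos, prefix, middle, suffix, [eos_id]]
    # Cast every part to int64 so that empty parts do not change the result dtype.
    return np.concatenate([np.asarray(p, dtype=np.int64) for p in parts])

# Made-up token ids: 100/101/102 stand in for the prefix/suffix/middle tokens, 0 for eos.
segments = ([1, 2], [3], [4, 5])
psm = sketch_format(segments, "PSM", prefix_id=100, suffix_id=101, middle_id=102, eos_id=0)
ar = sketch_format(segments, "AR", prefix_id=100, suffix_id=101, middle_id=102, eos_id=0)
```

In this sketch the middle section moves to the end of the PSM sequence, while the AR sequence keeps the original order of the document.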
- Parameters
segment_fim_format_pairs (List[Tuple[List[List[int]], str]]) – List of tuples used to store the prefix/middle/suffix token-id lists and the corresponding FIM formats to be used downstream in the FIM formatting.
max_seq_len (int) – Max sequence length that each sequence is expected to match
suffix_tok_id (int) – Id for suffix token
prefix_tok_id (int) – Id for prefix token
middle_tok_id (int) – Id for middle token
eos_tok_id (int) – Id for eos (end-of-sequence) token
opt_bos_tok_id (list) – Optionally a list containing the bos token id; otherwise an empty list, which is a no-op in the concatenation. The bos token exists only if the model’s tokenizer adds it by default. The argument has to be a list in both cases so that the NumPy concatenation works.
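The behaviour of an empty list in a NumPy concatenation can be checked with a short sketch. The token ids below are made up, and the snippet only demonstrates `np.concatenate` itself, not how `format_fim` calls it. One detail to watch: an empty Python list is converted to an empty float64 array, so the result may need a cast back to an integer dtype.

```python
import numpy as np

body = np.array([5, 6, 7], dtype=np.int64)  # made-up token ids

# A one-element list prepends the bos id (here, the hypothetical id 1).
with_bos = np.concatenate([[1], body])

# An empty list adds no elements, but it is treated as an empty float64
# array, which promotes the result to float64.
without_bos = np.concatenate([[], body])

# Casting restores an integer dtype; the values themselves are unchanged.
without_bos_int = without_bos.astype(np.int64)
```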
- Returns
- sample (np.array): Array of token ids in the FIMed order, along with special tokens
- mask (np.array): Array of 1’s and 0’s corresponding to true tokens and padding, respectively
- label (np.array): Token i of label corresponds to token i+1 in the sample array. Same elements, except that label ends in the eos (end-of-sequence) token
- Return type
sample, mask, label (each np.array)
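The relationship between the three returned arrays can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper name `sketch_shift_and_pad`, the padding id `pad_id=0`, and the truncation rule are all hypothetical, and the token ids are made up.

```python
import numpy as np

def sketch_shift_and_pad(tokens, max_seq_len, pad_id=0):
    """Hypothetical helper: derive sample, mask and label from one token sequence."""
    # Keep at most max_seq_len + 1 tokens so sample and label each fit max_seq_len.
    tokens = np.asarray(tokens, dtype=np.int64)[: max_seq_len + 1]
    # label is sample shifted left by one: label[i] == tokens[i + 1].
    sample, label = tokens[:-1], tokens[1:]
    n_real = len(sample)
    n_pad = max_seq_len - n_real
    # 1 marks a true token, 0 marks padding.
    mask = np.concatenate([np.ones(n_real, dtype=np.int64),
                           np.zeros(n_pad, dtype=np.int64)])
    sample = np.pad(sample, (0, n_pad), constant_values=pad_id)
    label = np.pad(label, (0, n_pad), constant_values=pad_id)
    return sample, mask, label

# Made-up FIM-ordered sequence whose last token (9) plays the role of eos.
tokens = [100, 1, 2, 101, 4, 5, 102, 3, 9]
sample, mask, label = sketch_shift_and_pad(tokens, max_seq_len=10)
```

Here the last true token of `label` is the eos id, and every position where `mask` is 0 holds only padding in both `sample` and `label`.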