cerebras.modelzoo.data_preparation.data_preprocessing.utils.truncate_or_pad_helper#
- cerebras.modelzoo.data_preparation.data_preprocessing.utils.truncate_or_pad_helper(segments_fim_format_pairs, diff, fim_pad_tok_id, sample_idx)[source]#
Since we perform FIM at the character level, we may split the text in the middle of a word. This can lead to non-standard token sequences, and after re-tokenizing we might need to truncate or pad to restore the original context length. This function ensures that the outputs are returned to their original length.
- Parameters
segments_fim_format_pairs (List[Tuple[List[List[int]], str]]) – List of tuples storing the prefix/middle/suffix token-id lists and the corresponding FIM formats, to be used downstream in the FIM formatting.
diff (int) – The number of tokens to add or remove. Positive means truncate, negative means pad.
fim_pad_tok_id (int) – ID of the padding token.
- Returns
(List[Tuple[List[List[int]], str]]): The elements of the tuples will now be lists that are truncated or padded such that the concatenation of all these tokens, along with the special tokens, equals the original sequence length.
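The sketch below illustrates the idea described above: when `diff` is positive, trim that many tokens from the end of the segments; when it is negative, append `fim_pad_tok_id` tokens to restore the original length. Function name, the exact trimming/padding policy, and the handling of `sample_idx` (omitted here) are assumptions for illustration only, not the library's implementation.

```python
from typing import List, Tuple


def truncate_or_pad_sketch(
    segments_fim_format_pairs: List[Tuple[List[List[int]], str]],
    diff: int,
    fim_pad_tok_id: int,
) -> List[Tuple[List[List[int]], str]]:
    """Illustrative sketch: trim `diff` tokens if diff > 0, pad with `-diff`
    copies of `fim_pad_tok_id` if diff < 0, leave unchanged if diff == 0."""
    out = []
    remaining = max(diff, 0)
    for segments, fim_format in segments_fim_format_pairs:
        # Copy so the caller's lists are not mutated (assumption for the sketch).
        segments = [list(seg) for seg in segments]
        if remaining > 0:
            # Truncate: drop tokens from the end of the last non-empty segments.
            for seg in reversed(segments):
                take = min(remaining, len(seg))
                if take:
                    del seg[len(seg) - take:]
                    remaining -= take
                if remaining == 0:
                    break
        out.append((segments, fim_format))
    if diff < 0 and out:
        # Pad: append pad tokens to the final segment of the final pair.
        out[-1][0][-1].extend([fim_pad_tok_id] * (-diff))
    return out
```

For example, with a single pair `([[1, 2], [3, 4], [5]], "PSM")` and `diff=-2`, the sketch returns `([[1, 2], [3, 4], [5, pad, pad]], "PSM")`, bringing the total token count back up to the original context length.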