cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.tokenize_stop_words

cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.tokenize_stop_words(stop_words, tokenizer)

Helper to construct a list of stop token sequences from the given list of stop words using the specified tokenizer.

For stop words that tokenize to a single token, we iterate over the tokenizer’s vocabulary and add every token id that detokenizes back to the stop word. This handles the case where different token ids map to the same stop word, since RT uses stop tokens, not words, to stop inference.

For stop words that tokenize to a multi-token sequence, we add the token sequence directly.

Parameters
  • stop_words (List[str]) – The list of stop words to convert into stop token sequences.

  • tokenizer (PreTrainedTokenizerBase) – A tokenizer from the Hugging Face transformers library.

Returns

Sorted (by first token id) list of stop token sequences.

Return type

List[List[int]]
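
Below is a minimal sketch of the logic described above, assuming a standard Hugging Face tokenizer that provides encode(), decode(), and get_vocab(). The name tokenize_stop_words_sketch is illustrative; this is not the actual modelzoo implementation.

    from typing import List

    from transformers import PreTrainedTokenizerBase


    def tokenize_stop_words_sketch(
        stop_words: List[str], tokenizer: PreTrainedTokenizerBase
    ) -> List[List[int]]:
        # Illustrative sketch of the behavior documented above.
        stop_token_sequences: List[List[int]] = []
        for word in stop_words:
            token_ids = tokenizer.encode(word, add_special_tokens=False)
            if len(token_ids) == 1:
                # Single-token stop word: scan the vocab and collect every
                # token id that detokenizes back to this word, since several
                # ids can map to the same surface string.
                for token_id in tokenizer.get_vocab().values():
                    if tokenizer.decode([token_id]) == word:
                        stop_token_sequences.append([token_id])
            else:
                # Multi-token stop word: add the full sequence directly.
                stop_token_sequences.append(token_ids)
        # Sort by the first token id of each sequence.
        return sorted(stop_token_sequences, key=lambda seq: seq[0])

Note that the vocab scan can yield several single-token sequences for one stop word whenever the vocabulary contains multiple ids that decode to the same string, which is exactly the case the helper guards against.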