cerebras.modelzoo.data_preparation.data_preprocessing.dpo_token_generator.DPOTokenGenerator#
- class cerebras.modelzoo.data_preparation.data_preprocessing.dpo_token_generator.DPOTokenGenerator(params, tokenizer, eos_id, pad_id)[source]#
Bases:
object
Initialize the DPOTokenGenerator class with dataset parameters, tokenizer, and token IDs.
- Parameters
params (Dict[str, Any]) – A dictionary containing parameters for dataset processing and configurations. It should include ‘dataset’ and ‘processing’ keys among others.
tokenizer – An instance of a tokenizer, typically from the Hugging Face transformers library.
eos_id (int) – The token ID used to signify the end of a sequence.
pad_id (int) – The token ID used for padding sequences to a uniform length.
The constructor initializes the DPOTokenGenerator with various settings for text processing, including flags for text normalization, detokenization options, data types for input IDs and masks, special-token configuration, and sequence-length constraints.
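A minimal construction sketch, assuming a Hugging Face tokenizer. The contents of the ‘dataset’ and ‘processing’ sub-dictionaries shown here are illustrative placeholders, not the full configuration schema:

```python
from transformers import AutoTokenizer

from cerebras.modelzoo.data_preparation.data_preprocessing.dpo_token_generator import (
    DPOTokenGenerator,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical parameter values; real runs take these from the
# data-preprocessing config, which may require additional keys.
params = {
    "dataset": {},
    "processing": {
        "max_seq_length": 2048,
    },
}

token_generator = DPOTokenGenerator(
    params,
    tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # GPT-2 lacks a dedicated pad token
)
```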
Methods
build_tokenized_answer – Tokenizes the prompt and its response using a strategy for tokenizers where encoding a concatenated string does not simply equal the concatenation of the encoded strings.
encode – Tokenize and encode the doc for DPO.
get_token_id – Get the token ID for the given token.
tokenize_text – Tokenizes text with the tokenizer, supporting both callable tokenizers and those requiring an encode method.
- tokenize_text(text)[source]#
Tokenizes text with the tokenizer, supporting both callable tokenizers and those requiring an encode method.
- Parameters
text (str) – Text to tokenize.
- Returns
Dictionary with ‘input_ids’, ‘attention_mask’, and ‘labels’.
- Return type
Dict[str, List[int]]
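A brief usage sketch, continuing from the construction example above; per the return contract, the three lists are token-aligned:

```python
out = token_generator.tokenize_text("The quick brown fox jumps over the lazy dog.")

# One entry per token in each list.
print(out["input_ids"])
print(out["attention_mask"])
print(out["labels"])
```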
- build_tokenized_answer(prompt, prompt_response)[source]#
Tokenizes the prompt and its response using a strategy that handles tokenizers where encoding a concatenated string does not simply equal the concatenation of the encoded strings, i.e. where enc(a + b) != enc(a) + enc(b). For tokenizers such as Llama’s, it splits the full encoding so that enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):] holds.
- Parameters
prompt (str) – The prompt text to be encoded.
prompt_response (str) – The concatenated prompt and response text to be encoded.
- Returns
A dictionary containing tokenized IDs and attention masks for both the prompt and the combined prompt and response.
- Return type
Dict[str, List[int]]
- Raises
ValueError – If the lengths of generated token IDs do not match expectations.
- Reference:
Discussion on tokenization strategy: https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257
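A sketch of the subtlety this method handles, written against a generic Hugging Face tokenizer. The model choice and the boundary-adjustment step mirror the strategy from the linked discussion and are illustrative, not a verbatim copy of this implementation:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice

prompt = "Question: What is DPO?\nAnswer:"
response = " Direct Preference Optimization."

enc_prompt = tok(prompt)["input_ids"]
enc_full = tok(prompt + response)["input_ids"]

# Naive assumption: the response starts after len(enc_prompt) tokens.
start = len(enc_prompt)

# For some tokenizers (e.g. SentencePiece-based ones like Llama's), the
# prompt's encoding is not an exact prefix of the full encoding because
# the last prompt token can merge with the first response token. In that
# case, shift the boundary back by one token.
if enc_full[:start] != enc_prompt:
    start -= 1

# Split so that enc(a + b) == enc(a) + enc(a + b)[len(enc(a)):] holds.
prompt_ids, response_ids = enc_full[:start], enc_full[start:]
assert prompt_ids + response_ids == enc_full
```

With the generator constructed as above, the equivalent split would be obtained via token_generator.build_tokenized_answer(prompt, prompt + response), assuming prompt_response is the concatenated string.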