cerebras.modelzoo.data_preparation.data_preprocessing.dpo_token_generator.DPOTokenGenerator#
- class cerebras.modelzoo.data_preparation.data_preprocessing.dpo_token_generator.DPOTokenGenerator(params, tokenizer, eos_id, pad_id)[source]#
Bases:
object
Initialize the DPOTokenGenerator class with dataset parameters, tokenizer, and token IDs.
- Parameters
params (Dict[str, Any]) – A dictionary containing parameters for dataset processing and configurations. It should include ‘dataset’ and ‘processing’ keys among others.
tokenizer – An instance of a tokenizer, typically from the Hugging Face transformers library.
eos_id (int) – The token ID used to signify the end of a sequence.
pad_id (int) – The token ID used for padding sequences to a uniform length.
The constructor initializes the DPOTokenGenerator with various settings for text processing, including flags for text normalization, detokenization options, data types for input IDs and masks, special-token configuration, and sequence-length constraints.
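A minimal construction sketch, assuming a Hugging Face tokenizer. The contents of the ‘dataset’ and ‘processing’ sub-dictionaries shown here are illustrative placeholders, not the full configuration schema:

```python
from transformers import AutoTokenizer

from cerebras.modelzoo.data_preparation.data_preprocessing.dpo_token_generator import (
    DPOTokenGenerator,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical parameter values; real runs take these from the
# data-preprocessing config, which may require additional keys.
params = {
    "dataset": {},
    "processing": {
        "max_seq_length": 2048,
    },
}

token_generator = DPOTokenGenerator(
    params,
    tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # GPT-2 lacks a dedicated pad token
)
```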
Methods
build_tokenized_answer – Tokenizes the prompt and its response using a strategy for tokenizers where encoding a concatenated string does not simply equal the concatenation of the encoded strings.
encode – Tokenize and encode the doc for DPO.
get_token_id – Get the token ID for the given token.
tokenize_text – Tokenizes text with the tokenizer, supporting both callable tokenizers and those requiring an encode method.
- tokenize_text(text)[source]#
Tokenizes text with the tokenizer, supporting both callable tokenizers and those requiring an encode method.
- Parameters
text (str) – Text to tokenize.
- Returns
Dictionary with ‘input_ids’, ‘attention_mask’, and ‘labels’.
- Return type
Dict[str, List[int]]
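A brief usage sketch, continuing from the construction example above; per the return contract, the three lists are token-aligned:

```python
out = token_generator.tokenize_text("The quick brown fox jumps over the lazy dog.")

# One entry per token in each list.
print(out["input_ids"])
print(out["attention_mask"])
print(out["labels"])
```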
- build_tokenized_answer(prompt, prompt_response)[source]#
Tokenizes the prompt and its response using a strategy that handles tokenizers where encoding a concatenated string does not simply equal the concatenation of the encoded strings, i.e. where enc(a + b) != enc(a) + enc(b). For tokenizers such as Llama’s, it splits the full encoding so that enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):] holds.
- Parameters
prompt (str) – The prompt text to be encoded.
prompt_response (str) – The concatenated prompt and response text to be encoded.
- Returns
A dictionary containing tokenized IDs and attention masks for both the prompt and the combined prompt and response.
- Return type
Dict[str, List[int]]
- Raises
ValueError – If the lengths of generated token IDs do not match expectations.
- Reference:
Discussion on tokenization strategy: https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257
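A sketch of the subtlety this method handles, written against a generic Hugging Face tokenizer. The model choice and the boundary-adjustment step mirror the strategy from the linked discussion and are illustrative, not a verbatim copy of this implementation:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice

prompt = "Question: What is DPO?\nAnswer:"
response = " Direct Preference Optimization."

enc_prompt = tok(prompt)["input_ids"]
enc_full = tok(prompt + response)["input_ids"]

# Naive assumption: the response starts after len(enc_prompt) tokens.
start = len(enc_prompt)

# For some tokenizers (e.g. SentencePiece-based ones like Llama's), the
# prompt's encoding is not an exact prefix of the full encoding because
# the last prompt token can merge with the first response token. In that
# case, shift the boundary back by one token.
if enc_full[:start] != enc_prompt:
    start -= 1

# Split so that enc(a + b) == enc(a) + enc(a + b)[len(enc(a)):] holds.
prompt_ids, response_ids = enc_full[:start], enc_full[start:]
assert prompt_ids + response_ids == enc_full
```

With the generator constructed as above, the equivalent split would be obtained via token_generator.build_tokenized_answer(prompt, prompt + response), assuming prompt_response is the concatenated string.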