cerebras.modelzoo.data_preparation.data_preprocessing.dpo_token_generator.DPOTokenGenerator#

class cerebras.modelzoo.data_preparation.data_preprocessing.dpo_token_generator.DPOTokenGenerator(params, tokenizer, eos_id, pad_id)[source]#

Bases: object

Initialize the DPOTokenGenerator class with dataset parameters, tokenizer, and token IDs.

Parameters
  • params (Dict[str, Any]) – A dictionary containing parameters for dataset processing and configurations. It should include ‘dataset’ and ‘processing’ keys among others.

  • tokenizer – An instance of a tokenizer, typically from the Hugging Face transformers library.

  • eos_id (int) – The token ID used to signify the end of a sequence.

  • pad_id (int) – The token ID used for padding sequences to a uniform length.

The function initializes the DPOTokenGenerator with various settings for text processing, including flags for text normalization, detokenization options, data types for input IDs and masks, special token configurations, and sequence length constraints.
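
A minimal construction sketch; the specific keys shown under "dataset" and "processing" (such as max_seq_length) are illustrative assumptions, not a complete schema:

   from transformers import AutoTokenizer

   from cerebras.modelzoo.data_preparation.data_preprocessing.dpo_token_generator import (
       DPOTokenGenerator,
   )

   tokenizer = AutoTokenizer.from_pretrained("gpt2")

   params = {
       "dataset": {},  # dataset-specific options; assumed empty here
       "processing": {"max_seq_length": 2048},  # assumed key; check the config schema
   }

   token_generator = DPOTokenGenerator(
       params,
       tokenizer,
       eos_id=tokenizer.eos_token_id,
       pad_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
   )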

Methods

build_tokenized_answer

Tokenizes the prompt and its response using a specific strategy to handle tokenizers where encoding a concatenated string does not simply equal the concatenation of encoded strings.

encode

Tokenize and encode the doc for DPO.

get_token_id

Get the token ID for the given token.

tokenize_text

Tokenizes text with the tokenizer, supporting both callable tokenizers and those requiring an encode method.

tokenize_text(text)[source]#

Tokenizes text with the tokenizer, supporting both callable tokenizers and those requiring an encode method.

Parameters

text (str) – Text to tokenize.

Returns

Dictionary with ‘input_ids’, ‘attention_mask’, and ‘labels’.

Return type

Dict[str, List[int]]
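
A usage sketch, assuming a token_generator constructed as in the example above:

   encoded = token_generator.tokenize_text("The quick brown fox jumps over the lazy dog.")
   print(encoded["input_ids"])       # token IDs produced by the tokenizer
   print(encoded["attention_mask"])  # 1 for every real (non-padding) token
   print(encoded["labels"])          # labels aligned with input_ids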

build_tokenized_answer(prompt, prompt_response)[source]#

Tokenizes the prompt and its response using a strategy that handles tokenizers where encoding a concatenated string does not equal the concatenation of the individually encoded strings. In particular, for tokenizers such as Llama’s, it ensures that enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):] holds.

Parameters
  • prompt (str) – The prompt text to be encoded.

  • prompt_response (str) – The prompt response text to be encoded.

Returns

A dictionary containing tokenized IDs and attention masks for both the prompt and the combined prompt and response.

Return type

Dict[str, List[int]]

Raises

ValueError – If the lengths of generated token IDs do not match expectations.

Reference:

Discussion on tokenization strategy: https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257
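
The following standalone sketch illustrates the strategy described above; it is not the class’s internal code. The prompt length is re-measured against the encoding of the concatenated string, so the prompt/response boundary stays consistent even for tokenizers that merge tokens at the boundary:

   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("gpt2")

   prompt = "Question: What is DPO?\nAnswer:"
   prompt_response = prompt + " Direct Preference Optimization."

   prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
   full_ids = tokenizer(prompt_response, add_special_tokens=False)["input_ids"]

   # Start the response where the prompt tokens end; back off by one token if
   # the boundary token was merged differently in the concatenated encoding.
   start = len(prompt_ids)
   if full_ids[:start] != prompt_ids:
       start -= 1

   prompt_part, response_part = full_ids[:start], full_ids[start:]
   assert prompt_part + response_part == full_ids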

encode(semantic_data_array)[source]#

Tokenize and encode the doc for DPO.

Parameters

semantic_data_array – The input record containing the prompt and completion data to encode.

Returns

Tuple of encoded features for DPO and dataset stats

Return type

Tuple[List[np.ndarray], Dict]
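
A hypothetical call sketch; the exact structure of semantic_data_array (field names such as "prompt", "chosen", and "rejected") is an assumption here and should be checked against the DPO data preprocessing documentation:

   sample = [
       {"type": "prompt", "content": [{"text": "What is DPO?"}]},
       {"type": "chosen", "content": [{"text": "A preference-tuning method."}]},
       {"type": "rejected", "content": [{"text": "I am not sure."}]},
   ]

   features, stats = token_generator.encode(sample)
   print(stats)  # per-document statistics, e.g. raw/processed token counts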

get_token_id(token)[source]#

Get the token ID for the given token.

Parameters

token (str) – Token for which the ID is needed.

Returns

Token ID.

Return type

int
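
A one-line lookup sketch, assuming the GPT-2 tokenizer from the construction example:

   eos_id = token_generator.get_token_id("<|endoftext|>")  # GPT-2's end-of-text token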