cerebras.modelzoo.data_preparation.data_preprocessing.custom_tokenizer_example.CustomLlama3Tokenizer.CustomLlama3Tokenizer#
- class cerebras.modelzoo.data_preparation.data_preprocessing.custom_tokenizer_example.CustomLlama3Tokenizer.CustomLlama3Tokenizer(pretrained_model_name_or_path, eos_token_id=None, pad_token_id=None, **kwargs)[source]#
Bases:
object
Custom implementation of the Llama3 tokenizer that overrides compute_offsets of the Hugging Face tokenizer, whose implementation is buggy (see https://github.com/huggingface/tokenizers/issues/1553).
- Parameters
pretrained_model_name_or_path (str) – The pretrained model name or path.
eos_token_id (Union[int, None], optional) – The ID of the end-of-sequence token. Defaults to None.
pad_token_id (Union[int, None], optional) – The ID of the padding token. Defaults to None.
**kwargs (Any) – Additional keyword arguments to be passed to AutoTokenizer.
- tokenizer#
The AutoTokenizer instance for the given pretrained model.
- Type
AutoTokenizer
- eos_token_id#
The ID of the end-of-sequence token.
- Type
int
- pad_token_id#
The ID of the padding token.
- Type
int
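Example usage (a minimal sketch; the checkpoint name and token ID below are illustrative assumptions, not values fixed by this class):

    from cerebras.modelzoo.data_preparation.data_preprocessing.custom_tokenizer_example.CustomLlama3Tokenizer import CustomLlama3Tokenizer

    # The checkpoint name is illustrative; any Llama3 checkpoint name or
    # local path accepted by AutoTokenizer should work here.
    tokenizer = CustomLlama3Tokenizer(
        "meta-llama/Meta-Llama-3-8B",
        pad_token_id=128001,  # illustrative override; None keeps the loaded tokenizer's default
    )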
Methods

compute_offsets — Compute offsets for the given encoded input.
__call__ — Encode the given text into tokens and optionally return the offsets mapping.
- compute_offsets(encoded, return_offsets_mapping=False)[source]#
Compute offsets for the given encoded input.
- Parameters
encoded (Dict[str, Any]) – The encoded input containing ‘input_ids’ and ‘offset_mapping’.
return_offsets_mapping (bool, optional) – Whether to return the offsets mapping. Defaults to False.
- Returns
A list of tuples representing the start and end offsets for each token.
- Return type
List[Tuple[int, int]]
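A sketch of calling compute_offsets directly, assuming the tokenizer instance from the construction example above; the input must have been encoded with return_offsets_mapping=True so that ‘offset_mapping’ is present in the encoded dict:

    text = "Hello, world!"
    encoded = tokenizer(text, return_offsets_mapping=True)
    offsets = tokenizer.compute_offsets(encoded, return_offsets_mapping=True)

    # Each entry is a (start, end) character span into the original text.
    for token_id, (start, end) in zip(encoded["input_ids"], offsets):
        print(token_id, repr(text[start:end]))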
- __call__(text, **kwargs)[source]#
Encode the given text into tokens and optionally return the offsets mapping.
- Parameters
text (str) – The input text to tokenize.
**kwargs (Any) – Additional keyword arguments for tokenization.
- Returns
The encoded result containing ‘input_ids’, ‘attention_mask’, and optionally ‘offset_mapping’.
- Return type
Dict[str, Any]
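A sketch of typical __call__ usage; keyword arguments are assumed to be forwarded to the underlying Hugging Face tokenizer, so standard options such as return_offsets_mapping apply:

    encoded = tokenizer("Hello, world!", return_offsets_mapping=True)
    print(encoded["input_ids"])       # token IDs
    print(encoded["attention_mask"])  # 1 for each non-padding token
    print(encoded["offset_mapping"])  # corrected (start, end) spans per token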