cerebras.modelzoo.data_preparation.nlp.tokenizers.HFTokenizer.HFTokenizer

class cerebras.modelzoo.data_preparation.nlp.tokenizers.HFTokenizer.HFTokenizer(vocab_file, special_tokens=None)

Bases: object

Designed to integrate the Hugging Face (HF) Tokenizer library.

Parameters:

vocab_file (str): A vocabulary file to create the tokenizer from.

special_tokens: A list or a string representing the special tokens that are to be added to the tokenizer.

Methods

add_special_tokens

add_token

decode

encode

get_token

get_token_from_tokenizer_config

This API is designed to extract token information from the tokenizer config JSON file.

get_token_id

set_eos_pad_tokens

Attributes

eos

pad

get_token_from_tokenizer_config(json_data, token)

This API is designed to extract token information from the tokenizer config JSON file. The token data is assumed to be in one of two formats: a string or a dictionary.
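The two token formats can be illustrated with a minimal sketch. This is not the library's implementation: extract_token is a hypothetical standalone helper, the sample config is made up for illustration, and the "content" key is an assumption based on how Hugging Face tokenizer config files commonly serialize added tokens.

```python
import json
from typing import Optional


def extract_token(json_data: dict, token: str) -> Optional[str]:
    """Hypothetical helper: return the token text for a given config key."""
    value = json_data.get(token)
    if isinstance(value, str):
        # Format 1: the token is stored directly as a string.
        return value
    if isinstance(value, dict):
        # Format 2: the token is stored as a dictionary.
        # Assumption: the token text lives under the "content" key.
        return value.get("content")
    # The key is missing or holds an unexpected type.
    return None


# Illustrative config data covering both formats.
config = json.loads('''
{
  "eos_token": "</s>",
  "pad_token": {"__type": "AddedToken", "content": "<pad>", "lstrip": false}
}
''')

eos = extract_token(config, "eos_token")
pad = extract_token(config, "pad_token")
unk = extract_token(config, "unk_token")
```

Returning None for a missing key is a design choice of this sketch; the actual method may handle that case differently.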