cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.FullTokenizer#
- class cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.FullTokenizer(vocab_file, do_lower_case=True)[source]#
Bases:
object
Class for full tokenization of a piece of text. Calls BaseTokenizer and the WordPiece tokenizer to perform basic grammar operations and wordpiece splits. :param str vocab_file: File containing the vocabulary, one token per line :param bool do_lower_case: Whether to lower-case the text during processing
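The two-stage process described above (a basic grammar pass, then greedy longest-match WordPiece splitting) can be sketched in a simplified, self-contained form. The function names, the `##` continuation prefix convention, and the toy vocabulary below are illustrative assumptions, not the modelzoo implementation:

```python
def basic_tokenize(text, do_lower_case=True):
    """Basic pass: optional lower-casing, then whitespace split."""
    if do_lower_case:
        text = text.lower()
    return text.split()

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry a '##' prefix
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: emit the unknown token
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example
toy_vocab = {"the", "un", "##aff", "##able"}
tokens = [p for w in basic_tokenize("The unaffable")
          for p in wordpiece_tokenize(w, toy_vocab)]
print(tokens)  # ['the', 'un', '##aff', '##able']
```

The real tokenizer additionally handles punctuation and accent stripping in its basic pass; this sketch keeps only the structure of the two stages.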
Methods

- convert_ids_to_tokens: Converts a list of ids to a list of tokens. All inputs are shifted by 1 because the id-to-token dictionary built by the Keras Tokenizer starts at index 1 instead of 0.
- convert_tokens_to_ids: Converts a list of tokens to a list of ids. All outputs are shifted by 1 because the dictionary built by the Keras Tokenizer starts at index 1 instead of 0.
- Returns a list of the words in the vocab.
- Performs basic tokenization followed by wordpiece tokenization on a piece of text.
- convert_tokens_to_ids(text)[source]#
Converts a list of tokens to a list of ids. All outputs are shifted by 1 because the dictionary built by the Keras Tokenizer starts at index 1 instead of 0.
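The off-by-one shift can be illustrated with a minimal sketch: a Keras-style `word_index` dictionary numbers tokens from 1, so subtracting 1 yields 0-based ids. The vocabulary and helper name below are invented for the example:

```python
# Keras-style word index: ids start at 1, not 0 (toy vocabulary)
keras_word_index = {"hello": 1, "world": 2, "[UNK]": 3}

def tokens_to_ids(tokens):
    # Subtract 1 so the returned ids are 0-based
    return [keras_word_index[t] - 1 for t in tokens]

print(tokens_to_ids(["hello", "world"]))  # [0, 1]
```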
- convert_ids_to_tokens(text)[source]#
Converts a list of ids to a list of tokens. All inputs are shifted by 1 because the id-to-token dictionary built by the Keras Tokenizer starts at index 1 instead of 0.
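The inverse direction adds 1 back before looking up the 1-based Keras-style index. Again, the vocabulary and helper name are hypothetical illustrations:

```python
# Keras-style word index (ids start at 1) and its inverse (toy vocabulary)
keras_word_index = {"hello": 1, "world": 2, "[UNK]": 3}
index_to_word = {i: w for w, i in keras_word_index.items()}

def ids_to_tokens(ids):
    # Add 1 to each 0-based id to match the 1-based Keras dictionary
    return [index_to_word[i + 1] for i in ids]

print(ids_to_tokens([0, 1]))  # ['hello', 'world']
```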