cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.BaseTokenizer
- class cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.BaseTokenizer(vocab_file, do_lower_case=True)
Bases: object
Base class for tokenizing a piece of text. It handles text-normalization operations such as stripping accents, checking for Chinese characters in the text, and splitting on punctuation and control characters. It also creates the tokenizer mappings for converting tokens to IDs and IDs to tokens, and stores the vocabulary for the dataset.
Parameters:
- vocab_file (str): File containing the vocabulary, with one token per line.
- do_lower_case (bool): Specifies whether to convert text to lower case for data processing. Defaults to True.
Methods
Tokenizes a piece of text.
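The operations described for this class (accent stripping, splitting on punctuation, and token-to-ID lookup) can be illustrated with a small standalone sketch. This is not the BaseTokenizer implementation and does not call the Cerebras API. The helper names (`strip_accents`, `split_on_punctuation`, `basic_tokenize`, `build_vocab`) and the `[UNK]` fallback token are hypothetical, and stripping accents only when lower-casing is an assumption borrowed from BERT-style tokenizers.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def split_on_punctuation(word):
    """Split a word so that each punctuation character is its own token."""
    pieces, current = [], ""
    for ch in word:
        if unicodedata.category(ch).startswith("P"):
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def basic_tokenize(text, do_lower_case=True):
    """Whitespace-split, optionally lower-case and strip accents, then split on punctuation."""
    tokens = []
    for word in text.split():
        if do_lower_case:
            # Assumption: accents are removed only in the lower-casing path.
            word = strip_accents(word.lower())
        tokens.extend(split_on_punctuation(word))
    return tokens


def build_vocab(lines):
    """Build token->id and id->token maps from one-token-per-line input."""
    token_to_id = {line.strip(): idx for idx, line in enumerate(lines)}
    id_to_token = {idx: tok for tok, idx in token_to_id.items()}
    return token_to_id, id_to_token


# Hypothetical vocabulary; a real vocab_file would be read line by line.
token_to_id, id_to_token = build_vocab(["[UNK]", "hello", "world", ",", "!"])
tokens = basic_tokenize("Héllo, wörld!")
ids = [token_to_id.get(tok, token_to_id["[UNK]"]) for tok in tokens]
print(tokens)  # ['hello', ',', 'world', '!']
print(ids)     # [1, 3, 2, 4]
```

Keeping the two mappings as plain dictionaries makes the round trip from tokens to IDs and back explicit; the actual class may store its vocabulary differently.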