cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization
Tokenization classes and functions
Classes
| Class | Description |
| --- | --- |
| `BaseTokenizer` | Class for base tokenization of a piece of text. Handles basic text operations such as stripping accents, detecting Chinese characters, and splitting on punctuation and control characters. |
| `FullTokenizer` | Class for full tokenization of a piece of text. Calls `BaseTokenizer` and `WordPieceTokenizer` to perform the basic text operations and WordPiece splits (see the usage sketch after this table). Parameters: `vocab_file` (str), file containing the vocabulary, one token per line; `do_lower` (bool), whether to convert text to lower case during processing. |
| `WordPieceTokenizer` | Class for tokenization of a piece of text into its word pieces (see the algorithm sketch after this table). Parameters: `vocab_file` (str), file containing the vocabulary, one token per line; `unknown_token` (str), token used for words not in the vocabulary; `max_input_chars_per_word` (int), maximum length of a word eligible for splitting; `do_lower` (bool), whether to convert text to lower case during processing. |
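The classes above form a BERT-style tokenization stack. Below is a minimal usage sketch for `FullTokenizer`; the `tokenize` and `convert_tokens_to_ids` method names and the `vocab.txt` path are assumptions based on the conventional BERT tokenizer interface, not documented in this summary.

```python
from cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization import (
    FullTokenizer,
)

# "vocab.txt" is a placeholder path: a plain-text vocabulary file with one
# token per line, as the constructor expects.
tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower=True)

# Assumed BERT-style interface: split raw text into word pieces, then map
# each piece to its index in the vocabulary.
tokens = tokenizer.tokenize("Tokenizers split text into word pieces.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```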
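For intuition about how `unknown_token` and `max_input_chars_per_word` come into play, here is a self-contained sketch of the standard greedy longest-match-first WordPiece split. It illustrates the general technique only and is not the module's actual implementation.

```python
def wordpiece_split(word, vocab, unknown_token="[UNK]",
                    max_input_chars_per_word=200):
    """Greedy longest-match-first WordPiece split of a single word."""
    # Words longer than the cap are mapped straight to the unknown token.
    if len(word) > max_input_chars_per_word:
        return [unknown_token]
    pieces = []
    start = 0
    while start < len(word):
        # Find the longest vocabulary entry matching at `start`;
        # non-initial pieces carry the "##" continuation prefix.
        end = len(word)
        cur_piece = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            # No piece matched: the whole word maps to the unknown token.
            return [unknown_token]
        pieces.append(cur_piece)
        start = end
    return pieces

vocab = {"token", "##izer", "##s"}
print(wordpiece_split("tokenizers", vocab))  # ['token', '##izer', '##s']
```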