cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.WordPieceTokenizer

class cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.WordPieceTokenizer(vocab_file, unknown_token='[UNK]', max_input_chars_per_word=200, do_lower_case=True)

Bases: cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.BaseTokenizer

Class for tokenizing a piece of text into its word pieces.

Parameters

vocab_file (str): File containing the vocabulary, with one token per line.

unknown_token (str): Token used for words that are not in the vocabulary.

max_input_chars_per_word (int): Maximum length of a word for splitting.

do_lower_case (bool): Whether to convert text to lower case for data processing.
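To illustrate the vocabulary file format described above (one token per line), the following sketch writes a tiny vocabulary file and reads it back. The helper name `load_vocab` and the sample tokens are hypothetical and not part of the Cerebras API; the way the class itself parses the file may differ.

```python
import os
import tempfile


def load_vocab(vocab_file):
    """Read a vocabulary file with one token per line (hypothetical helper)."""
    with open(vocab_file, encoding="utf-8") as f:
        # Strip the trailing newline and skip blank lines.
        return [line.rstrip("\n") for line in f if line.strip()]


# Write a small sample vocabulary, one token per line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("[UNK]\nun\n##aff\n##able\n")
    vocab_path = tmp.name

vocab = load_vocab(vocab_path)
os.remove(vocab_path)
```

A file laid out this way is the kind of input the `vocab_file` argument expects.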

Methods

tokenize

Tokenize a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. For example: input = "unaffable", output = ["un", "##aff", "##able"].

tokenize(text)

Tokenize a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. For example:

input = "unaffable"
output = ["un", "##aff", "##able"]

Does not convert to ids.
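The greedy longest-match-first idea can be sketched in a few lines of standalone Python. This is not the Cerebras implementation: the function name `wordpiece_tokenize` is hypothetical, and two behaviors are assumptions borrowed from common BERT-style WordPiece tokenizers, namely that a word longer than `max_input_chars_per_word` maps to the unknown token, and that a word with any unmatched segment is replaced entirely by the unknown token.

```python
def wordpiece_tokenize(word, vocab, unknown_token="[UNK]",
                       max_input_chars_per_word=200):
    """Split one word into word pieces by greedy longest-match-first.

    Hypothetical sketch; not the library's own code.
    """
    # Assumption: over-long words are mapped to the unknown token.
    if len(word) > max_input_chars_per_word:
        return [unknown_token]

    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Assumption: if nothing matches here, the whole word is unknown.
            return [unknown_token]
        pieces.append(match)
        start = end
    return pieces


vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

As with the method documented here, the sketch returns string pieces and leaves the mapping to ids as a separate step.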