cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.WordPieceTokenizer#
- class cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.WordPieceTokenizer(vocab_file, unknown_token='[UNK]', max_input_chars_per_word=200, do_lower_case=True)[source]#
Bases:
cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization.BaseTokenizer
Class for tokenization of a piece of text into its word pieces.
Parameters:
- vocab_file (str) – File containing the vocabulary, one token per line
- unknown_token (str) – Token used for words not in the vocabulary
- max_input_chars_per_word (int) – Maximum length of a word eligible for splitting
- do_lower_case (bool) – Whether to convert text to lower case during processing
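A minimal construction sketch, assuming a local vocabulary file (the "vocab.txt" path below is hypothetical; any file with one token per line works):

from cerebras.modelzoo.data_preparation.nlp.tokenizers.Tokenization import (
    WordPieceTokenizer,
)

# Hypothetical vocabulary path; defaults shown explicitly for clarity.
tokenizer = WordPieceTokenizer(
    "vocab.txt",
    unknown_token="[UNK]",
    max_input_chars_per_word=200,
    do_lower_case=True,
)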
Methods
tokenize(text)[source]#
Tokenize a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization with the given vocabulary.
For example: input = "unaffable", output = ["un", "##aff", "##able"].
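A minimal sketch of the greedy longest-match-first strategy described above, assuming a plain Python set as the vocabulary (the function and variable names are illustrative, not the library's own):

def wordpiece_tokenize(word, vocab, unknown_token="[UNK]", max_chars=200):
    """Greedily split a single word into the longest matching word pieces."""
    if len(word) > max_chars:
        return [unknown_token]
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the '##' prefix
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            return [unknown_token]  # no piece matched: the whole word is unknown
        pieces.append(cur_piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']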