cerebras.modelzoo.data_preparation.nlp.tokenizers.BPETokenizer
Byte pair encoding/decoding utilities
Modified from the GPT-2 codebase: https://github.com/openai/gpt-2
Functions
Returns a list of UTF-8 bytes and a corresponding list of Unicode strings.
Returns the set of symbol pairs in a word.
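
As a rough illustration of the two helpers summarized above, the sketch below follows the corresponding functions in the GPT-2 encoder. The names `bytes_to_unicode` and `get_pairs` are assumed from that codebase, since this page does not list them. The returned mapping is shown as a dict, as in GPT-2. The modelzoo versions may differ in detail.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable unicode character.

    Assumption: mirrors the GPT-2 helper; the modelzoo version may differ.
    """
    # Bytes that already correspond to printable characters map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (whitespace, control codes) are shifted past code point 255.
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


def get_pairs(word):
    """Return the set of adjacent symbol pairs in a word.

    `word` is a sequence of symbols, for example a tuple of strings.
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


byte_encoder = bytes_to_unicode()
low_pairs = get_pairs(("l", "o", "w"))
```

Here `byte_encoder` covers all 256 byte values, so any UTF-8 input can be represented without an unknown token, and `low_pairs` holds the candidate merges `("l", "o")` and `("o", "w")` that a BPE merge loop would rank.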
Classes