cerebras.modelzoo.data_preparation.data_preprocessing.utils.split_text_and_tokenize

cerebras.modelzoo.data_preparation.data_preprocessing.utils.split_text_and_tokenize(text, tokenizer, max_tok_len=2000, remove_bos_in_chunks=True)

Splits the text into smaller sequences of length max_tok_len and tokenizes each of them separately. This avoids performance issues with tokenizers such as LlamaTokenizer, which are slow on long sequences.

Parameters

text (str) – text to be tokenized
tokenizer (Tokenizer) – tokenizer to be used
max_tok_len (int, optional) – max length of each sequence. Defaults to 2000.
remove_bos_in_chunks (bool, optional) – whether to remove the BOS (beginning-of-sequence) token id from the chunks. Defaults to True.

Returns

list of token ids for the text

Return type

tok_ids (list)
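
Example

The split-then-tokenize idea described above can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: ToyTokenizer and split_and_tokenize_sketch are hypothetical names introduced only for illustration, the sketch assumes chunking by character count, and it assumes the tokenizer prepends exactly one BOS id on every encode() call. The real function may choose chunk boundaries and handle the first chunk's BOS differently.

```python
class ToyTokenizer:
    """Hypothetical stand-in tokenizer: prepends a BOS id, then emits one id per character."""

    bos_token_id = 1

    def encode(self, text):
        return [self.bos_token_id] + [ord(ch) for ch in text]


def split_and_tokenize_sketch(text, tokenizer, max_len=2000, remove_bos_in_chunks=True):
    """Illustrative only: split text into character chunks, encode each, concatenate the ids."""
    tok_ids = []
    for start in range(0, len(text), max_len):
        chunk_ids = tokenizer.encode(text[start:start + max_len])
        # Assumption: every encode() call starts its output with one BOS id.
        if remove_bos_in_chunks and chunk_ids and chunk_ids[0] == tokenizer.bos_token_id:
            chunk_ids = chunk_ids[1:]
        tok_ids.extend(chunk_ids)
    return tok_ids


text = "hello world"
tok = ToyTokenizer()

# With BOS removal, the chunked result matches encoding the characters directly.
ids_no_bos = split_and_tokenize_sketch(text, tok, max_len=4)

# Without BOS removal, each of the three chunks contributes its own BOS id.
ids_with_bos = split_and_tokenize_sketch(text, tok, max_len=4, remove_bos_in_chunks=False)

print(len(ids_no_bos), len(ids_with_bos))
```

In this toy setting, leaving remove_bos_in_chunks disabled scatters a BOS id at every chunk boundary, which is why removing it is the sensible default when the chunks are concatenated back into one sequence.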
