cerebras.modelzoo.data_preparation.data_preprocessing.utils.split_text_and_tokenize#
- cerebras.modelzoo.data_preparation.data_preprocessing.utils.split_text_and_tokenize(text, tokenizer, max_tok_len=2000, remove_bos_in_chunks=True)[source]#
Splits the text into smaller sequences of roughly max_tok_len tokens each and tokenizes each sequence separately. This avoids performance issues with tokenizers such as LlamaTokenizer, which are slow on very long sequences.
- Parameters
text (str) – Text to be tokenized.
tokenizer (Tokenizer) – Tokenizer to be used.
max_tok_len (int, optional) – Maximum length of each sequence, in tokens. Defaults to 2000.
remove_bos_in_chunks (bool, optional) – Whether to drop the BOS (beginning-of-sequence) token id from the tokenized chunks, so the concatenated output is not interrupted by repeated BOS tokens. Defaults to True.
- Returns
Flattened list of token ids for the entire text.
- Return type
tok_ids (list)
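The behavior can be illustrated with a minimal sketch (not the library source): it assumes a whitespace-based split into chunks of up to max_tok_len words, a hypothetical ToyTokenizer standing in for a real tokenizer, and one plausible BOS policy (keep BOS only on the first chunk). The library's actual chunking strategy and BOS handling may differ.

```python
class ToyTokenizer:
    """Hypothetical tokenizer: one id per word, with a BOS id prepended."""

    bos_token_id = 1

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        ids = [self.bos_token_id]
        for word in text.split():
            # Assign a fresh id to each previously unseen word (ids start at 2).
            ids.append(self.vocab.setdefault(word, len(self.vocab) + 2))
        return ids


def split_text_and_tokenize(text, tokenizer, max_tok_len=2000,
                            remove_bos_in_chunks=True):
    """Sketch: split on whitespace, tokenize each chunk, concatenate ids."""
    words = text.split()
    tok_ids = []
    for start in range(0, len(words), max_tok_len):
        chunk = " ".join(words[start:start + max_tok_len])
        chunk_ids = tokenizer.encode(chunk)
        # Assumption: strip the BOS id from every chunk after the first, so
        # the concatenated ids match a single-pass tokenization of the text.
        if remove_bos_in_chunks and start > 0 \
                and chunk_ids[:1] == [tokenizer.bos_token_id]:
            chunk_ids = chunk_ids[1:]
        tok_ids.extend(chunk_ids)
    return tok_ids
```

Under these assumptions, tokenizing "a b c d e f" with max_tok_len=2 produces the same ids as a single call to tokenizer.encode on the whole text, while remove_bos_in_chunks=False would leave a BOS id at the start of every chunk.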