cerebras.modelzoo.data_preparation.nlp.t5.utils.split_sequences

cerebras.modelzoo.data_preparation.nlp.t5.utils.split_sequences(tokens, length)

Split a long sequence into shorter sequences of the specified length.

Parameters:
    tokens (list): A list of token indices.
    length (int): The maximum allowed length of a sample.

Returns: A list of sequences that together contain the same tokens as the input, split into separate samples so that no sample is longer than length.
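
The behavior described above can be illustrated with a minimal sketch. This is not the library's implementation: the name split_sequences_sketch is hypothetical, and the sketch assumes that tokens is a flat list of token indices, that chunks are taken in order, and that the final chunk may be shorter than length. How the real function handles a trailing partial chunk is not stated in this page.

```python
from typing import List


def split_sequences_sketch(tokens: List[int], length: int) -> List[List[int]]:
    """Hypothetical stand-in for split_sequences, for illustration only.

    Splits a flat list of token indices into consecutive chunks of at most
    `length` tokens. The last chunk may be shorter (an assumption).
    """
    if length <= 0:
        raise ValueError("length must be a positive integer")
    # Step through the input in strides of `length` and slice out each chunk.
    return [tokens[i:i + length] for i in range(0, len(tokens), length)]


# Example: ten token indices split into samples of at most four tokens.
chunks = split_sequences_sketch(list(range(10)), 4)
```

In this sketch, concatenating the chunks reproduces the original token list, which matches the documented guarantee that the output contains the same content as the input with no sample longer than length.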
