cerebras.modelzoo.data_preparation.nlp.t5.utils.select_random_chunk
- cerebras.modelzoo.data_preparation.nlp.t5.utils.select_random_chunk(tokens, max_length=65536, rng=None)
Select a random chunk of a sample. This is used to prevent bias towards very long passages in the corpus.
- Parameters
tokens (list) – A list of token indices.
max_length (int) – The maximum allowed length of a sample before splitting.
rng (np.random.Generator) – The NumPy random generator used as the source of randomness for this function.
- Returns
A list that is a random chunk of tokens if len(tokens) > max_length, or tokens itself otherwise.
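The behavior described above can be illustrated with a minimal sketch. This is not the Cerebras implementation: the name select_random_chunk_sketch is hypothetical, and both the segment-aligned selection strategy (splitting the sample into consecutive windows of max_length tokens and picking one uniformly) and the fallback to a fresh generator when rng is None are assumptions made for illustration only.

```python
import math

import numpy as np


def select_random_chunk_sketch(tokens, max_length=65536, rng=None):
    """Hypothetical illustration; not the library's actual code."""
    # Assumption: fall back to a fresh generator when none is supplied.
    if rng is None:
        rng = np.random.default_rng()

    n_tokens = len(tokens)
    # Samples that already fit are returned unchanged.
    if n_tokens <= max_length:
        return tokens

    # Assumption: split the sample into consecutive windows of
    # max_length tokens and pick one of them uniformly at random.
    num_segments = math.ceil(n_tokens / max_length)
    start = max_length * int(rng.integers(0, num_segments))
    end = min(start + max_length, n_tokens)
    return tokens[start:end]


rng = np.random.default_rng(0)
short_sample = list(range(5))
long_sample = list(range(10))

unchanged = select_random_chunk_sketch(short_sample, max_length=8, rng=rng)
chunk = select_random_chunk_sketch(long_sample, max_length=4, rng=rng)
```

Under these assumptions, a long passage contributes at most max_length tokens per sample, which is how chunking limits the weight of very long passages. Passing a seeded np.random.Generator makes the selection reproducible.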