cerebras.modelzoo.data_preparation.nlp.t5.utils.concatenate_documents
- cerebras.modelzoo.data_preparation.nlp.t5.utils.concatenate_documents(dataset, num_to_concatenate=128, pad_id=0)
Concatenate unrelated documents together to reduce the need for padding.
- Parameters
dataset (iterable) – The input dataset.
num_to_concatenate (int) – How many documents to concatenate together.
pad_id (int) – The vocab id reserved for padding values. Must not occur anywhere in the dataset.
- Yields
New samples made by concatenating samples from dataset.
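
The general idea can be illustrated with a minimal sketch. This is not the Model Zoo implementation: the function name `concatenate_documents_sketch` is hypothetical, samples are assumed to be flat, unpadded lists of token ids (so `pad_id` is not needed here), and emitting a shorter final group when the dataset runs out is also an assumption.

```python
from itertools import islice


def concatenate_documents_sketch(dataset, num_to_concatenate=128):
    """Hypothetical illustration: join consecutive samples into longer ones.

    Assumes each sample is a flat list of token ids with no padding.
    """
    iterator = iter(dataset)
    while True:
        # Take the next group of up to `num_to_concatenate` samples.
        group = list(islice(iterator, num_to_concatenate))
        if not group:
            return
        # Join the group's token ids into a single longer sample.
        merged = []
        for sample in group:
            merged.extend(sample)
        yield merged


docs = [[5, 6], [7], [8, 9, 10]]
result = list(concatenate_documents_sketch(docs, num_to_concatenate=2))
print(result)
```

With `num_to_concatenate=2`, the first two documents are merged into one sample and the leftover document forms its own sample. Longer merged samples mean fewer short sequences that would otherwise need padding up to the model's sequence length.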