cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessor#

class cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor.T5DynamicDataProcessor(*args, **kwargs)[source]#

Bases: torch.utils.data.IterableDataset

Reads text files containing the input text tokens and adds extra ids for the language modeling task on the fly.

Parameters

  • src_vocab_file (str): Path to the file containing the vocabulary tokens, one token per line.

  • src_data_dir (str): Path to the directory containing the output of preprocess.sh, with all the files of tokenized data.

  • batch_size (int): Number of sequences per batch. Note that it differs between systems.

  • shuffle (bool, optional): If True, the data will be shuffled before being passed into the model. Recommended for training. Can be set to False for debugging.

  • shuffle_seed (int, optional): Sets the random seed for the order of data shuffling, allowing for reproducibility while still shuffling the data.

  • shuffle_buffer (int): Size of the buffer used to store data before shuffling.

  • extra_ids (int, optional): Number of sentinel tokens for the T5 objective.

  • src_max_sequence_length (int, optional): Largest possible sequence length for the input. Longer sequences are truncated; all other sequences are padded to this length.

  • tgt_max_sequence_length (int, optional): Largest possible sequence length for the labels. Longer sequences are truncated; all other sequences are padded to this length.

  • num_workers (int, optional): Number of processes that move data to the accelerator system, so that the system doesn’t process data faster than it receives it.

  • drop_last (bool, optional): If the last batch is not the full size, i.e. the dataset could not be divided evenly by the batch size, do not use the last batch.

  • prefetch_factor (int, optional): Number of batches loaded in advance by each worker.

  • persistent_workers (bool, optional): If set, workers will not be shut down after going through the dataset once.

  • do_lower (bool, optional): If set, will lowercase all tokens in the vocabulary. T5’s vocabulary is cased, so this is not recommended.

  • buckets (list, optional): A list of boundaries for sequence lengths to bucket together in order to speed up VTS/VSL.

  • dynamic_loss_weight (bool, optional): If set, will divide the loss for a token by the length of the sequence that the token comes from.

  • pack_sequences (bool, optional): If set, will concatenate sequences so that computation is performed on real data rather than padding.

  • num_documents_to_concatenate (int, optional): Specifies how many documents to pack together.

  • oov_token (str, optional): Token for out-of-vocabulary words/sub-words.

  • sos_token (str, optional): Token for start-of-sequence.

  • eos_token (str, optional): Token for end-of-sequence.

  • pad_token (str, optional): Token for padding.

  • labels_pad_id (int, optional): Can set a specific padding id for labels.

  • input_pad_id (int, optional): Can set a specific padding id for inputs.

  • mixed_precision (bool, optional): If set, will use float16 rather than float32 when possible.
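As a minimal sketch of constructing the processor with the parameters above (not the exact modelzoo invocation; all paths and values below are hypothetical placeholders):

    from cerebras.modelzoo.data.nlp.t5.T5DynamicDataProcessor import (
        T5DynamicDataProcessor,
    )

    # Hypothetical configuration; keyword names follow the parameter list
    # above, but check them against your modelzoo version.
    processor = T5DynamicDataProcessor(
        src_vocab_file="/path/to/vocab.txt",        # one token per line
        src_data_dir="/path/to/preprocessed_data",  # output of preprocess.sh
        batch_size=256,
        shuffle=True,
        shuffle_seed=1337,
        extra_ids=100,                # sentinel tokens for the T5 objective
        src_max_sequence_length=512,
        tgt_max_sequence_length=114,
        num_workers=8,
        drop_last=True,
    )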

Methods

  • create_dataloader: Classmethod to create the dataloader object.

  • element_length_fn: Takes a single sample and returns the sequence length of that sample to be used for VTS bucketing.

  • get_meta_data: Read data from meta files.

  • get_single_item: Iterates over the data to construct input features.

  • load_buffer: Generator to read samples of data.

get_meta_data(data_dir)[source]#

Read data from meta files.

Parameters

  • data_dir (str): Path to the input directory.

Returns

Processed meta data.
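A minimal usage sketch, assuming a processor constructed as above and a directory containing the meta files written by preprocess.sh (the path is a placeholder):

    # Hypothetical call; the structure of the returned meta data depends
    # on the files produced by preprocess.sh.
    meta_data = processor.get_meta_data("/path/to/preprocessed_data")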

load_buffer()[source]#

Generator to read samples of data.

Returns

Yields data samples, one at a time.
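Because this is a generator, samples can be pulled lazily; a minimal sketch, assuming a constructed processor:

    # Hypothetical usage: draw one raw sample at a time from the generator.
    buffer = processor.load_buffer()
    first_sample = next(buffer)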

get_single_item()[source]#

Iterates over the data to construct input features.

Returns

A dict with training features:

  • input_ids (np.array[np.int32]): Numpy array with encoder input token indices. Shape: (src_max_sequence_length).

  • decoder_input_ids (np.array[np.int32]): Numpy array with decoder input token indices. Shape: (tgt_max_sequence_length).

  • attention_mask (np.array[np.int32]): Numpy array with the attention mask for the encoder. Shape: (src_max_sequence_length).

  • decoder_attention_mask (np.array[np.int32]): Numpy array with the attention mask for the decoder. Shape: (tgt_max_sequence_length).

  • labels (np.array[np.int32]): Numpy array with labels for teacher-forcing mode. Shape: (tgt_max_sequence_length).
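A sketch of consuming one feature dict, assuming the method yields dicts of the shape described above (hypothetical usage):

    # Each feature is a fixed-length numpy array of token ids or mask bits.
    features = next(iter(processor.get_single_item()))
    print(features["input_ids"].shape)          # (src_max_sequence_length,)
    print(features["decoder_input_ids"].shape)  # (tgt_max_sequence_length,)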

element_length_fn(features)[source]#

Takes a single sample and returns the sequence length of that sample to be used for VTS bucketing.
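Conceptually, such a function maps a sample to the length used for bucketing. A minimal sketch of the idea (not the actual implementation), assuming the encoder attention mask marks real tokens with 1:

    import numpy as np

    def element_length_sketch(features):
        # Count non-padding encoder tokens via the attention mask, so
        # samples of similar real length land in the same VTS bucket.
        return int(np.sum(features["attention_mask"]))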

create_dataloader()[source]#

Classmethod to create the dataloader object.
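A minimal usage sketch, assuming the returned object behaves like a torch.utils.data.DataLoader built from this IterableDataset (the invocation style below is hypothetical):

    # Hypothetical usage: build the dataloader and iterate training batches.
    dataloader = processor.create_dataloader()
    for batch in dataloader:
        encoder_ids = batch["input_ids"]  # (batch_size, src_max_sequence_length)
        break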