cerebras.modelzoo.data_preparation.nlp.t5.utils
Functions
Concatenate unrelated documents together to reduce the need for padding.
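A minimal sketch of the packing idea: greedily concatenate tokenized documents until the next one would overflow the budget, then emit the packed sequence. The name pack_examples and the separator-token handling are illustrative assumptions, not this module's actual API.

```python
def pack_examples(docs, max_len, sep_token):
    """Greedily pack token lists into sequences of at most max_len
    tokens, joining documents with sep_token. A document longer than
    max_len is emitted on its own, unsplit (hypothetical behavior)."""
    buf = []
    for doc in docs:
        # Would appending (separator + doc) overflow the budget?
        if buf and len(buf) + 1 + len(doc) > max_len:
            yield buf
            buf = []
        if buf:
            buf = buf + [sep_token]
        buf = buf + list(doc)
    if buf:
        yield buf
```

Packing this way trades a small amount of cross-document attention leakage for far fewer padding tokens per batch.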
Formats a raw sequence into a corrupted sequence and corresponding denoising targets.

:param list tokens: A list of uncorrupted token indices.
:param int vocab_size: The size of the vocabulary.
:param int sos_token: The index of the SOS token in the vocabulary.
:param int eos_token: The index of the EOS token in the vocabulary.
:param np.random.Generator rng: The numpy random generator used as the source of randomness for this function.
:returns: A tuple (feature_dict, label) of denoising source and target numpy arrays.
Creates features for Transformer model input.
Map a function over an iterator and flatten the result.
T5 span corruption takes a sequence raw_sequence and corrupts spans of it to generate the sequences masked_input and target. This function computes the maximum possible length of raw_sequence such that masked_input is no longer than max_sequence_length, along with the maximum length of target for raw sequences of that length.

:param int max_sequence_length: The maximum length of the encoder inputs after masking.
:param float corruption_prob: The fraction of tokens corrupted for the denoising objective.
:param int mean_span_len: The average length of a corrupted span.
:returns: An integer such that a sequence clipped to this length before masking has length at most max_sequence_length after masking, and an integer giving the maximum possible length of a decoder sequence.
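The length accounting can be sketched as follows. The names masked_lengths and max_raw_length are illustrative, and the per-span bookkeeping (each noise span collapses to one sentinel in the source, the target holds the noise tokens plus one sentinel per span, and each side carries one EOS token) is an assumption modeled on the standard T5 recipe, not necessarily this module's exact arithmetic.

```python
def masked_lengths(raw_len, corruption_prob, mean_span_len):
    """Source/target lengths after span corruption of a raw_len-token
    sequence, under the assumed T5-style accounting described above."""
    num_noise = max(1, round(raw_len * corruption_prob))
    num_spans = max(1, round(num_noise / mean_span_len))
    src_len = raw_len - num_noise + num_spans + 1  # +1 for EOS
    tgt_len = num_noise + num_spans + 1            # +1 for EOS
    return src_len, tgt_len

def max_raw_length(max_sequence_length, corruption_prob, mean_span_len):
    """Largest raw length whose corrupted source still fits within
    max_sequence_length, plus the corresponding target length."""
    raw_len = 1
    while masked_lengths(raw_len + 1, corruption_prob,
                         mean_span_len)[0] <= max_sequence_length:
        raw_len += 1
    return raw_len, masked_lengths(raw_len, corruption_prob, mean_span_len)[1]
```

Clipping raw sequences to this length up front lets the pipeline guarantee that no post-masking encoder input is ever truncated.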
Replace each run of consecutive noise tokens with a different sentinel. The idea is to align the dropped spans in the inputs with the markers in the targets. We want to generate training examples like "We hold <X> to be <Y> that" -> "<X> these truths <Y> self evident <Z>". Sentinels are assigned in decreasing order within the sequence, starting at vocab_size - 1; that is, the last tokens in the vocabulary are appropriated for additional use as sentinels.

:param list tokens: A list of uncorrupted token indices.
:param np.array noise_mask: A 1d boolean array marking the positions to corrupt.
:param int vocab_size: The size of the vocabulary.
:return: An np.array of the same type and shape as tokens, with each noise span collapsed to a sentinel.
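A numpy sketch of this substitution, assuming sentinel ids occupy the top of the vocabulary as described above (noise_to_sentinels is a hypothetical name):

```python
import numpy as np

def noise_to_sentinels(tokens, noise_mask, vocab_size):
    """Collapse each run of noise tokens to a single sentinel id,
    assigned in decreasing order from vocab_size - 1."""
    tokens = np.asarray(tokens)
    noise_mask = np.asarray(noise_mask, dtype=bool)
    prev = np.roll(noise_mask, 1)
    prev[0] = False
    first_in_span = noise_mask & ~prev                    # start of each noise run
    sentinel_ids = vocab_size - np.cumsum(first_in_span)  # vocab_size-1, -2, ...
    out = np.where(first_in_span, sentinel_ids, tokens)
    # keep non-noise tokens plus the one sentinel per span
    return out[~noise_mask | first_in_span]
```

In the T5 recipe, the decoder targets are produced by applying the same substitution with the inverted mask, so the sentinels in source and target line up span for span.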
Provides padding for T5 input features.
Postprocessing of the CSV file.
Compute a noise mask consisting of random spans of noise tokens. The number of noise tokens and the numbers of noise and non-noise spans are determined deterministically as follows:

num_noise_tokens = round(length * noise_density)
num_nonnoise_spans = num_noise_spans = round(num_noise_tokens / mean_noise_span_length)

Spans alternate between non-noise and noise, beginning with non-noise. Subject to the above restrictions, all masks are equally likely.

:param int length: The length of the incoming token sequence.
:param float noise_density: The approximate density of the output mask.
:param float mean_noise_span_length: The mean length of a noise span, used in the noise mask calculation.
:param np.random.Generator rng: The numpy random generator used as the source of randomness for this function.
:return: A boolean np.array with shape [length].
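One way to realize "all masks equally likely" is to partition the noise and non-noise budgets into equal numbers of positive-length segments uniformly at random, then interleave them starting with non-noise. The sketch below assumes length is large enough that both partitions are achievable (moderate noise_density); degenerate edge cases are not handled.

```python
import numpy as np

def random_spans_noise_mask(length, noise_density, mean_noise_span_length, rng):
    """Boolean mask of random noise spans, alternating non-noise/noise
    and beginning with non-noise."""
    num_noise = int(round(length * noise_density))
    num_noise = min(max(num_noise, 1), length - 1)
    num_spans = max(1, int(round(num_noise / mean_noise_span_length)))
    num_nonnoise = length - num_noise

    def segment(total, n):
        # Split `total` into `n` positive parts, uniformly at random,
        # by placing n-1 distinct cut points among total-1 gaps.
        if n == 1:
            return np.array([total])
        cuts = np.sort(rng.choice(total - 1, n - 1, replace=False)) + 1
        return np.diff(np.concatenate(([0], cuts, [total])))

    noise_lens = segment(num_noise, num_spans)
    nonnoise_lens = segment(num_nonnoise, num_spans)
    mask = np.zeros(length, dtype=bool)
    pos = 0
    for gap, span in zip(nonnoise_lens, noise_lens):
        pos += gap                      # non-noise segment first
        mask[pos:pos + span] = True
        pos += span
    return mask
```

Because the segment lengths are exact partitions of the two budgets, the mask always contains exactly num_noise_tokens True entries.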
Select a random chunk of a sample.
Perform a buffered shuffle on an iterator.
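Buffered shuffling gives an approximate shuffle of a stream in O(buffer_size) memory: each incoming item is swapped with a random element of a fixed-size buffer, and the displaced element is emitted. This is a generic sketch of the technique, not this module's exact implementation.

```python
import numpy as np

def buffered_shuffle(iterable, buffer_size, rng):
    """Approximately shuffle an iterator using a bounded buffer;
    larger buffer_size gives a more thorough shuffle."""
    buf = []
    for item in iterable:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            j = int(rng.integers(buffer_size))
            buf[j], item = item, buf[j]   # swap in the new item
            yield item                    # emit the displaced one
    rng.shuffle(buf)                      # flush the remaining buffer
    yield from buf
```

Every input item is emitted exactly once, so only the order (not the multiset) of the stream changes.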
Split a long sequence into shorter sequences of the specified length. |
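The splitting step amounts to simple fixed-stride chunking; the name split_sequences is illustrative:

```python
def split_sequences(tokens, max_len):
    """Split a long token sequence into consecutive chunks of at most
    max_len tokens; the final chunk may be shorter."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]
```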