cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.InferenceDataProcessor

class cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.InferenceDataProcessor(params, samples_file_list, dataset_size)

Bases: object

Methods

`create_dataloader` – Classmethod to create the dataloader object.
`from_request_type`
`gen_data_samples` – Preprocess raw text requests fetched from the EEH script into data samples consumable by the GPT2 model, and dump them to a NumPy file.

static gen_data_samples(requests, batch_size, max_sequence_length, tokenizer, eos_token_id, samples_saver, request_type, inf_start_token=None, max_gen_tokens=None)

Preprocess raw text requests fetched from the EEH script into data samples consumable by the GPT2 model, and dump them to a NumPy file.

Parameters

requests (List) – List of EEH’s Instance dataclass objects holding raw text data
batch_size (int) – The batch size
max_sequence_length (int) – The maximum length of each sample
tokenizer (transformers.PreTrainedTokenizerBase) – The tokenizer used to tokenize raw text data
eos_token_id (int) – int representing the end-of-sentence token
samples_saver (cerebras.modelzoo.common.utils.input.utils.SamplesSaver) – SamplesSaver object to manage the saving of data samples to file.
request_type (cerebras.modelzoo.data.nlp.gpt.InferenceDataProcessor.RequestType) – The type of request for which the data sample is to be created
inf_start_token (Optional[int]) – (generative tasks only) int representing the start token for generative inference
max_gen_tokens (Optional[int]) – (generative tasks only) The maximum number of tokens to generate

Returns

(List[str], int, tuple) A tuple of: the list of file paths where the samples are dumped; an int representing the size of the dataset (total number of samples); and the request metadata needed for EEH postprocessing.

Return type

Tuple[List[str], int, List[Tuple[int, int]]]
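
The preprocessing described above can be illustrated with a minimal, standard-library-only sketch. It is not the Model Zoo implementation: `toy_tokenize` and `pad_and_batch` are hypothetical helpers, the whitespace tokenizer stands in for a real `transformers` tokenizer, and padding short samples with `eos_token_id` and truncating long ones are assumptions made for illustration. Nothing is written to disk here, whereas the real method saves samples through a `SamplesSaver`.

```python
from typing import Dict, List, Tuple


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one ID per whitespace-separated word."""
    # IDs start at 1 so that 0 stays free for the EOS/padding token in this sketch.
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def pad_and_batch(
    texts: List[str],
    batch_size: int,
    max_sequence_length: int,
    eos_token_id: int,
) -> Tuple[List[List[List[int]]], int, List[int]]:
    """Tokenize, truncate/pad to a fixed length, and group samples into batches.

    Returns (batches, dataset_size, unpadded_lengths). The padding and
    truncation policy is an assumption, not taken from the library.
    """
    vocab: Dict[str, int] = {}
    samples: List[List[int]] = []
    lengths: List[int] = []
    for text in texts:
        ids = toy_tokenize(text, vocab)[:max_sequence_length]  # truncate long inputs
        lengths.append(len(ids))
        ids = ids + [eos_token_id] * (max_sequence_length - len(ids))  # pad short inputs
        samples.append(ids)
    # Split the flat sample list into consecutive batches; the last may be smaller.
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    return batches, len(samples), lengths


batches, dataset_size, lengths = pad_and_batch(
    ["hello world", "a b c d e", "hi"],
    batch_size=2,
    max_sequence_length=4,
    eos_token_id=0,
)
```

Every sample in `batches` has exactly `max_sequence_length` token IDs, and `dataset_size` counts samples rather than batches, which mirrors the "total number of samples" wording in the Returns description.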

create_dataloader()

Classmethod to create the dataloader object.
