cerebras.modelzoo.data_preparation.data_preprocessing.pretraining_token_generator
PretrainingTokenGenerator Module
This module provides the PretrainingTokenGenerator class, which processes text data and creates features suitable for language modeling tasks.
- Usage:

    token_generator = PretrainingTokenGenerator(dataset_params, max_sequence_length, tokenizer)
    tokenized_features = token_generator.encode("Sample text for processing.")
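For context, here is a slightly fuller sketch of the documented usage. It assumes a Hugging Face tokenizer and illustrative `dataset_params` contents; only the constructor arguments and the `encode` call shown above come from this module's documentation, so treat the rest as assumptions rather than the actual configuration schema.

```python
# Hedged sketch: expands the usage example above. The dataset_params keys
# below are illustrative assumptions, not the module's documented schema.
from transformers import AutoTokenizer

from cerebras.modelzoo.data_preparation.data_preprocessing.pretraining_token_generator import (
    PretrainingTokenGenerator,
)

max_sequence_length = 2048
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any compatible tokenizer

dataset_params = {
    # Hypothetical preprocessing options; consult the data preprocessing
    # configuration reference for the actual schema.
    "processing": {"max_seq_length": max_sequence_length},
}

token_generator = PretrainingTokenGenerator(
    dataset_params, max_sequence_length, tokenizer
)
tokenized_features = token_generator.encode("Sample text for processing.")
```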
Functions
- Given a list of token_ids, generate the input sequence and labels (see the sketch below).
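The description above corresponds to the standard autoregressive labeling scheme: labels are the input tokens shifted left by one position, with padded positions masked out. The helper below is a generic, hedged sketch of that technique, not this module's actual implementation; the name `create_autoregressive_features`, the `pad_id` default, and the returned dictionary keys are illustrative assumptions.

```python
from typing import Dict, List

import numpy as np


def create_autoregressive_features(
    token_ids: List[int],
    max_sequence_length: int,
    pad_id: int = 0,
) -> Dict[str, np.ndarray]:
    """Illustrative sketch: build next-token-prediction features.

    Given a list of token ids, the input sequence is token_ids[:-1] and the
    labels are token_ids[1:]; both are padded to max_sequence_length, and
    the attention mask zeroes out padded positions.
    """
    # Truncate so that both inputs and labels fit in max_sequence_length.
    token_ids = token_ids[: max_sequence_length + 1]

    input_ids = token_ids[:-1]
    labels = token_ids[1:]

    num_pad = max_sequence_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * num_pad

    input_ids = input_ids + [pad_id] * num_pad
    labels = labels + [pad_id] * num_pad

    return {
        "input_ids": np.array(input_ids, dtype=np.int32),
        "attention_mask": np.array(attention_mask, dtype=np.int32),
        "labels": np.array(labels, dtype=np.int32),
    }


# Example: a 6-token sequence packed into length-8 features.
features = create_autoregressive_features(
    [50256, 15496, 11, 995, 0, 50256], max_sequence_length=8
)
```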
Classes
- PretrainingTokenGenerator: Initialize the PretrainingTokenGenerator class.