cerebras.modelzoo.data_preparation.data_preprocessing.vsl_pretraining_token_generator
This module provides the VSLPretrainingTokenGenerator class, which extends PretrainingTokenGenerator for advanced processing of tokenized text data tailored to variable sequence length language modeling (VSLLM). It includes methods for processing chunks of tokenized text, optimizing the representation of tokenized data by merging shorter sequences within a specified maximum sequence length, and tokenizing text for autoregressive language modeling.
Classes
VSLPretrainingTokenGenerator: Processes tokenized text data, specifically for VSLLM.
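The merging step described above can be illustrated with a short sketch. The following is a simplified, hypothetical greedy packer, not the ModelZoo implementation: the function name merge_sequences and its signature are assumptions for illustration. Real VSL processing also keeps per-document boundaries (e.g., for attention masking) so packed sequences do not attend to each other; that bookkeeping is omitted here.

```python
from typing import List


def merge_sequences(
    sequences: List[List[int]], max_seq_length: int
) -> List[List[int]]:
    """Greedily pack shorter token sequences into bins of at most
    max_seq_length tokens, reducing padding waste.

    Hypothetical sketch only; not the VSLPretrainingTokenGenerator API.
    """
    bins: List[List[int]] = []
    # Visit longest sequences first so large items claim bins early.
    for seq in sorted(sequences, key=len, reverse=True):
        for bin_ in bins:
            # Place the sequence in the first bin with enough room.
            if len(bin_) + len(seq) <= max_seq_length:
                bin_.extend(seq)
                break
        else:
            # No existing bin fits; open a new one.
            bins.append(list(seq))
    return bins


if __name__ == "__main__":
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
    print(merge_sequences(docs, max_seq_length=6))
    # -> [[6, 7, 8, 9, 4, 5], [1, 2, 3, 10]]
```

With four short documents and max_seq_length=6, the sketch yields two full or nearly full bins instead of four heavily padded sequences, which is the space saving VSL packing is after.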