Token Generators#
Overview#
Setting up token generators is a crucial step in the preprocessing pipeline for machine learning tasks on Cerebras Systems. Token generators convert raw data into tokenized formats suitable for machine learning models, ensuring efficient and effective data processing. This guide covers the configuration of pre-built and custom token generators, along with examples and use cases.
Pre-Built Token Generators#
Cerebras Model Zoo provides a comprehensive suite of pre-built token generators tailored to support various stages and tasks in the development of LLMs. Which token generator is initialized depends on the `mode` parameter specified in the config file (refer to Modes and Dataset Parameters).
Supported token generators#
For Pretraining mode#
- `PretrainingTokenGenerator`: General-purpose pretraining on large text corpora. When `training_objective` is set to `mlm`, it performs MLM task processing.
- `MultiModalPretrainingTokenGenerator`: For multimodal data, integrating text and images. Initialized when `is_multimodal` is set to `True` in the config file.
- `FIMTokenGenerator`: Designed for fill-in-the-middle (FIM) tasks. Initialized when `training_objective` is set to `fim` in the config file.
- `VSLPretrainingTokenGenerator`: For variable sequence length (VSL) pretraining, which packs multiple sequences into each sample. Initialized when `use_vsl` is set to `True` in the config file (see the configuration sketch below).
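For illustration, the sketch below shows the kind of keys that steer this selection. It is a hedged sketch: the section names and nesting are assumptions made for readability, so refer to Modes and Dataset Parameters for the authoritative layout.

```yaml
# Hedged sketch: keys that steer pretraining token generator selection.
# Section placement is illustrative; see Modes and Dataset Parameters.
setup:
  mode: "pretraining"          # selects PretrainingTokenGenerator by default

dataset:
  training_objective: "fim"    # "fim" -> FIMTokenGenerator, "mlm" -> MLM processing
  use_vsl: False               # True -> VSLPretrainingTokenGenerator
  is_multimodal: False         # True -> MultiModalPretrainingTokenGenerator
```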
For Fine-tuning mode#
- `FinetuningTokenGenerator`: General-purpose fine-tuning.
- `MultiModalFinetuningTokenGenerator`: Fine-tuning with multimodal data. Initialized when `is_multimodal` is set to `True` in the config file.
- `VSLFinetuningTokenGenerator`: Fine-tuning with variable sequence length (VSL) packing. Initialized when `use_vsl` is set to `True` in the config file.
Other Supported Token Generators#
- `DPOTokenGenerator`: Focused on direct preference optimization (DPO) during token generation. Initialized when `mode` is set to `dpo`.
- `NLGTokenGenerator`: Optimized for natural language generation tasks. Initialized when `mode` is set to `nlg` (see the sketch below).
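For example, selecting one of these generators is a matter of setting the mode. The snippet below is a minimal sketch mirroring the custom-mode example later in this guide:

```yaml
# Hedged sketch: choosing DPO (or NLG) preprocessing via the mode parameter.
setup:
  mode: "dpo"   # or "nlg" for NLGTokenGenerator
```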
Flags Supported by Pre-built Token Generators#
This section lists all the flags that are supported by the various token generators, along with their default values.
Common Parameters#
This section lists the parameters that are common across all token generators; a configuration sketch follows the table.
| Flag | Default Value | Description |
|---|---|---|
| `use_ftfy` | `False` | Fix text with ftfy. |
| `ftfy_normalizer` | `NFC` | Choose which kind of Unicode normalization is applied. Usually we apply NFC normalization, so that letters followed by combining characters become single combined characters. `None` applies no normalization while fixing text. |
| `wikitext_detokenize` | `False` | Use the wikitext detokenizer to fix text. |
| `min_sequence_len` | `10` | Minimum sequence length in tokens; samples shorter than this are skipped. |
| `input_ids_dtype` | `int32` | dtype of processed `input_ids`. |
| `input_mask_dtype` | `int32` | dtype of processed input loss masks. |
| `max_seq_length` | `2048` | Maximum sequence length. |
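A hedged sketch of these common flags in a config, with the defaults from the table above (the `processing` section name and placement are assumptions; consult Modes and Dataset Parameters):

```yaml
# Hedged sketch: common text-cleaning and dtype flags with their defaults.
processing:
  use_ftfy: False
  ftfy_normalizer: "NFC"
  wikitext_detokenize: False
  min_sequence_len: 10
  max_seq_length: 2048
  input_ids_dtype: "int32"
  input_mask_dtype: "int32"
```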
Pretraining Parameters#
This section lists parameters that can be used with `PretrainingTokenGenerator`; an example follows the table.
| Flag | Default Value | Description |
|---|---|---|
| `pack_sequences` | `True` | Concatenate documents shorter than the maximum sequence length with other documents, instead of filling the remainder with padding tokens. |
| `inverted_mask` | `False` | If `False`, 0 represents masked positions; if `True`, 1 represents masked positions. |
| `seed` | `0` | Random seed used for generating short sequences. |
| `short_seq_prob` | `0.0` | Probability of creating sequences that are shorter than the maximum sequence length. |
| `split_text_to_tokenize` | `False` | Whether to split the text into smaller chunks before tokenization. Helpful for very long documents with tokenizers such as the Llama tokenizer, whose runtime grows quadratically with text length. |
| `chunk_len_to_split` | `2000` | Length of the text chunks to split the text into before tokenization, for slower tokenizers. Used together with `split_text_to_tokenize`; ignored if that flag is not set. |
| `remove_bos_in_chunks` | `False` | Whether to remove the BOS token from the beginning of the chunks. Set this to `True` when using `split_text_to_tokenize` and `chunk_len_to_split` to avoid having multiple BOS tokens in the middle of the text. Not applicable to all tokenizers. |
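For instance, a corpus of very long articles processed with a slow tokenizer might enable chunked tokenization roughly as follows (a sketch only; the section name is an assumption):

```yaml
# Hedged sketch: packing plus chunked tokenization for very long documents.
dataset:
  pack_sequences: True
  short_seq_prob: 0.0
  seed: 0
  split_text_to_tokenize: True   # split long documents before tokenizing
  chunk_len_to_split: 2000       # only honored when split_text_to_tokenize is True
  remove_bos_in_chunks: True     # avoid stray BOS tokens mid-document
```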
Finetuning Parameters#
This section lists parameters that can be used with `FinetuningTokenGenerator`; a sketch follows the table.
| Flag | Default Value | Description |
|---|---|---|
| `chat_template` | `None` | Custom chat template that overrides the tokenizer's built-in chat template. Useful when preprocessing chat-format data. |
| `semantic_drop_mask` | `{}` | Dictionary indicating which semantic regions to drop from the input data before tokenization. |
| `semantic_loss_weight` | `{}` | Dictionary indicating the loss mask of the different semantic regions after tokenization. |
| `end_of_turn_token` | `None` | Required when the tokenizer's chat template inserts an end-of-turn token after each turn that is different from the EOS token. |
| `sep_token` | tokenizer's | Separator token used to separate prompt and completion for instruction-style datasets. |
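As an example, a chat-style fine-tuning dataset might set the following. This is a sketch: the template string and end-of-turn token are placeholders, and the section placement is assumed.

```yaml
# Hedged sketch: chat-format fine-tuning flags; values are placeholders.
dataset:
  chat_template: "{% for message in messages %}...{% endfor %}"  # placeholder Jinja template
  end_of_turn_token: "<|eot|>"   # hypothetical token; use your tokenizer's actual token
  sep_token: null                # falls back to the tokenizer's separator
```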
FIM Token Generator Parameters#
This section lists parameters that can be used with `FIMTokenGenerator`; an example follows the table.
Note
`FIMTokenGenerator` also uses the config parameters of `PretrainingTokenGenerator`, in addition to the ones specified below.
| Flag | Default Value | Description |
|---|---|---|
| `fim_rate` | `0.90` | Float specifying the fraction of samples to which the FIM transformation is applied, instead of leaving them auto-regressive. |
| `spm_rate` | `0.50` | Float specifying the fraction of FIM transformations converted to prefix-suffix-middle (PSM) vs. suffix-prefix-middle (SPM) format. |
| `fim_suffix_tok` | `None` | Special token denoting the suffix section in a FIM-transformed context. |
| `fim_prefix_tok` | `None` | Special token denoting the prefix section in a FIM-transformed context. |
| `fim_middle_tok` | `None` | Special token denoting the middle section in a FIM-transformed context. |
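A sketch of a FIM configuration follows. The special-token strings are placeholders, not prescribed values; use the tokens your tokenizer actually defines.

```yaml
# Hedged sketch: fill-in-the-middle flags; token strings are placeholders.
dataset:
  training_objective: "fim"
  fim_rate: 0.90
  spm_rate: 0.50
  fim_prefix_tok: "<fim_prefix>"
  fim_middle_tok: "<fim_middle>"
  fim_suffix_tok: "<fim_suffix>"
```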
MLM Token Generator Parameters#
This section lists parameters that can be used with `PretrainingTokenGenerator` when `training_objective` is set to `mlm`; an example follows the table.
Note
In this case, it also uses all the config parameters of `PretrainingTokenGenerator`, in addition to the ones specified below.
| Flag | Default Value | Description |
|---|---|---|
| `mlm_fraction` | `0.15` | Fraction of tokens to be masked in MLM tasks. |
| `mlm_with_gather` | `False` | MLM processing mode. When set to `True`, the length of the returned labels equals `mlm_fraction * msl`; otherwise it equals `msl`. |
| `ignore_index` | `-100` | Required when `mlm_with_gather` is set to `False`. Positions in the labels holding the `ignore_index` value are not used for loss calculation. |
| `excluded_tokens` | `['<cls>', '<pad>', '<eos>', '<unk>', '<null_1>', '<mask>']` | Tokens to be excluded when masking. Provided only through the YAML config. |
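A sketch of an MLM preprocessing configuration, using the defaults listed above (section placement is assumed):

```yaml
# Hedged sketch: MLM flags with the defaults listed above.
dataset:
  training_objective: "mlm"
  mlm_fraction: 0.15
  mlm_with_gather: False
  ignore_index: -100
  excluded_tokens: ["<cls>", "<pad>", "<eos>", "<unk>", "<null_1>", "<mask>"]
```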
VSL Finetuning Token Generator Parameters#
This section lists parameters that can be used with `VSLFinetuningTokenGenerator`.
Note
`VSLFinetuningTokenGenerator` also uses the config parameters of `FinetuningTokenGenerator`, in addition to the ones specified below.
| Flag | Default Value | Description |
|---|---|---|
| `use_vsl` | `True` | Generate examples with multiple sequences packed together. |
| `position_ids_dtype` | `int32` | dtype of token position ids. |
Note
Increasing the read chunk size increases the packing factor of VSL, so you need to weigh higher packing against longer processing time, depending on the dataset's packing factor.
VSL Pretraining Token Generator Parameters#
This section lists parameters that can be used with `VSLPretrainingTokenGenerator`; a sketch follows the table. `use_vsl` needs to be set to `True` in the `train_input` or `eval_input` section of the model config.
Note
`VSLPretrainingTokenGenerator` also uses the config parameters of `PretrainingTokenGenerator`, in addition to the ones specified below.
| Flag | Default Value | Description |
|---|---|---|
| `use_vsl` | `True` | Generate examples with multiple sequences packed together. |
| `fold_long_doc` | `True` | Fold documents longer than `max_seq_length` into multiple sequences, instead of dropping them. |
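A minimal sketch of the corresponding preprocessing flags (section placement assumed); remember that `use_vsl: True` must also be present in the model config's `train_input`/`eval_input` section, as noted above:

```yaml
# Hedged sketch: VSL pretraining flags in the preprocessing config.
dataset:
  use_vsl: True
  fold_long_doc: True
```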
DPO Token Generator Parameters#
This section lists parameters that can be used with `DPOTokenGenerator`.
| Flag | Default Value | Description |
|---|---|---|
| `max_prompt_length` | `512` | If the sequence exceeds the |
| `response_delimiter` |  | This is used to set the separator between |
Multimodal Pretraining Token Generator Parameters#
This section lists parameters that can be used with `MultiModalPretrainingTokenGenerator`; an example follows the table.
| Flag | Default Value | Description |
|---|---|---|
| `image_dir` | `None` | Absolute path of the image directory. Used along with the relative path under the `image_key` field in `read_hook_kwargs` to check that images exist; examples with no image are discarded. |
| `max_num_img` | `1` | Maximum number of images allowed in one preprocessed sequence. Sequences with more than `max_num_img` images are discarded. |
| `num_patches` | `None` | Number of patches used to represent an image. Determined by the patch size (in pixels) of the image encoder and the pixel count of the input images. |
| `semantic_attention_mask` | `{}` | Dictionary indicating the attention mask of the different semantic regions. |
| `semantic_drop_mask` | `{}` | Dictionary indicating which semantic regions to drop from the input data before tokenization. |
| `semantic_loss_weight` | `{}` | Dictionary indicating the loss mask of the different semantic regions after tokenization. |
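For example (a sketch only; the path is a hypothetical placeholder and `num_patches` depends on your image encoder):

```yaml
# Hedged sketch: multimodal pretraining flags; the path is a placeholder.
dataset:
  is_multimodal: True
  image_dir: "/abs/path/to/images"   # hypothetical path
  max_num_img: 1
  num_patches: 576                   # e.g., a 336x336 image with 14x14-pixel patches
```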
Multimodal Finetuning Token Generator Parameters#
This section lists parameters that can be used with `MultiModalFinetuningTokenGenerator`.
Note
`MultiModalFinetuningTokenGenerator` also uses the config parameters of `FinetuningTokenGenerator`, in addition to the ones specified below.
| Flag | Default Value | Description |
|---|---|---|
| `image_dir` | `None` | Absolute path of the image directory. Used along with the relative path under the `image_key` field in `read_hook_kwargs` to check that images exist; examples with no image are discarded. |
| `max_num_img` | `1` | Maximum number of images allowed in one preprocessed sequence. Sequences with more than `max_num_img` images are discarded. |
| `num_patches` | `None` | Number of patches used to represent an image. Determined by the patch size (in pixels) of the image encoder and the pixel count of the input images. |
| `semantic_attention_mask` | `{}` | Dictionary indicating the attention mask of the different semantic regions. |
Custom Token Generators#
In addition to pre-built token generators, the Model Zoo allows users to implement custom token generators. This enables arbitrary transformations of the input data before tokenization.
Procedure#
To use custom token generators, ensure the configuration file is properly set up. Follow these steps:
1. Ensure that the `mode` param is set to `custom`, in order to be able to specify your own token generator.
2. Specify the path to the custom token generator class in the `token_generator` param, within the `setup` section of the config file. This would look like:
mode: "custom"
token_generator: "<path/to/custom-generator-class>"
Note
The `token_generator` path must specify the class name separated from the module name by a colon (`:`) for the custom token generator to be instantiated correctly.
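For instance, if a custom generator class `MyTokenGenerator` lived in a module `my_generators/custom.py` on the Python path, the entry might look as follows (the module and class names here are hypothetical):

```yaml
mode: "custom"
token_generator: "my_generators.custom:MyTokenGenerator"
```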
Class Implementation Guidelines#
The custom token generator must adhere to the following guidelines:
1. The constructor’s signature must be as follows:
```python
def __init__(
    self, params: Dict[str, Any], tokenizer: Any, eos_id: int, pad_id: int
):
    """
    Args:
        params (Dict[str, Any]): Parameters for the dataset and processing.
        tokenizer (Any): Tokenizer to use for tokenization.
        eos_id (int): End-of-sequence token ID.
        pad_id (int): Padding token ID.
    """
```
2. The custom token generator must implement an `encode` method, which tokenizes and encodes the data according to the user's definition. For more examples of what the `encode` method looks like, refer to the code of the pre-built token generators in Model Zoo.
3. The signature of the `encode` method is given below; it takes in a `semantic_data_array`:
```python
def encode(
    self, semantic_data_array: List[Dict[str, Any]]
) -> Tuple[Dict[str, Any], Dict[str, int]]:
```
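Putting the pieces together, a minimal custom generator could be sketched as follows. This is an illustrative skeleton only: the class name, the keys read from `params` and from the semantic data array, and the tokenization logic are assumptions rather than a prescribed implementation; only the constructor and `encode` signatures follow the guidelines above.

```python
from typing import Any, Dict, List, Tuple


class MyTokenGenerator:
    """Illustrative skeleton of a custom token generator (hypothetical)."""

    def __init__(
        self, params: Dict[str, Any], tokenizer: Any, eos_id: int, pad_id: int
    ):
        # "params" carries the dataset/processing settings from the config file;
        # the flat "max_seq_length" key here is assumed for illustration.
        self.max_seq_length = params.get("max_seq_length", 2048)
        self.tokenizer = tokenizer
        self.eos_id = eos_id
        self.pad_id = pad_id

    def encode(
        self, semantic_data_array: List[Dict[str, Any]]
    ) -> Tuple[Dict[str, Any], Dict[str, int]]:
        # Assumes each semantic region carries a plain "content" string; adapt
        # this to the actual structure of your semantic data.
        text = " ".join(region.get("content", "") for region in semantic_data_array)

        # Tokenize (assumes a Hugging Face-style tokenizer.encode), append EOS,
        # then truncate and pad to the maximum sequence length.
        token_ids = self.tokenizer.encode(text) + [self.eos_id]
        token_ids = token_ids[: self.max_seq_length]
        num_pad = self.max_seq_length - len(token_ids)
        input_ids = token_ids + [self.pad_id] * num_pad

        features = {"input_ids": input_ids}
        stats = {"num_tokens": len(token_ids), "num_pad_tokens": num_pad}
        return features, stats
```

With this module on the Python path, the `token_generator` entry in the config would point at it using the `module:ClassName` form shown earlier.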
Conclusion#
Configuring token generators is an important step in the preprocessing pipeline for machine learning tasks on Cerebras Systems. By leveraging the comprehensive suite of pre-built token generators provided by Cerebras Model Zoo, you can efficiently handle various stages and tasks in the development of large language models. Additionally, the flexibility to implement custom token generators allows for tailored transformations of input data, meeting specific project requirements.
The introduction of on-the-fly data processing further enhances the preprocessing workflow by reducing storage needs and increasing adaptability during training and evaluation. The examples provided for pretraining and fine-tuning configurations illustrate how to set up these processes seamlessly.
Finally, the TokenFlow utility offers an invaluable tool for visualizing and debugging preprocessed data, ensuring data integrity and facilitating error detection. By following the guidelines and leveraging the tools outlined in this guide, you can optimize your preprocessing pipeline, leading to more efficient training and improved performance of your machine learning models on Cerebras Systems.