On-the-Fly Data Processing#
A new preprocessing section in the train_input and eval_input sections of the YAML configuration file enables on-the-fly data preprocessing during training and/or evaluation. This reduces turnaround time when experimenting on relatively small datasets, and it reduces storage requirements because no preprocessed copy of the dataset has to be written to disk. The preprocessing parameters are the same as those defined for offline data preprocessing, and the same data processing algorithms and techniques are applied. For multibox runs, sharding is based on the number of input files inside the input directory: the number of input files must be greater than or equal to the number of systems multiplied by the number of workers per system. For example, a run on 2 systems with 4 workers per system needs at least 8 input files; a sketch of this check follows below. Examples for pretraining and fine-tuning configurations are given later in this section.
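The following is a minimal sketch of that file-count check. The helper and its parameters (num_systems, workers_per_system) are illustrative and not part of the Model Zoo API:

import os

def check_shard_count(input_dir: str, num_systems: int, workers_per_system: int) -> None:
    # Count regular files in the input directory; on-the-fly sharding
    # is based on this file count (see above).
    files = [
        name for name in os.listdir(input_dir)
        if os.path.isfile(os.path.join(input_dir, name))
    ]
    required = num_systems * workers_per_system
    if len(files) < required:
        raise ValueError(
            f"Found {len(files)} input files, but {num_systems} systems x "
            f"{workers_per_system} workers per system needs at least {required}."
        )

# Example: 2 systems with 4 workers each require at least 8 input files.
check_shard_count("/path/to/text_data", num_systems=2, workers_per_system=4)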
To enable this, specify the data_processor as RawDatasetProcessor in the train_input section; the processor is imported from cerebras.modelzoo.data_preparation.raw_dataset_processor.RawDatasetProcessor.
Supported Modes#
Pretraining
Fine-tuning
Note: Shuffling is not yet supported.
Pretraining Example#
The following configuration is for on-the-fly data preprocessing for pretraining:
train_input:
  preprocessing:
    data_processor: "RawDatasetProcessor"
    processing:
      custom_tokenizer: gpt2tokenizer
      tokenizer_params:
        encoder_file: /path/to/gpt2-encoder.json
        vocab_file: /path/to/gpt2-vocab.bpe
        sep_token: None
      batch_size: 4
      max_seq_length: 256
      seed: 0
      read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:text_read_hook"
      read_hook_kwargs:
        data_keys:
          text_key: "text"
    dataset:
      data_keys:
        jsonl_key: "text"
    setup:
      data:
        source: /path/to/text_data
        type: local
      mode: pretraining
      processes: 1
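With text_read_hook and jsonl_key set to "text", each line of the .jsonl input files is expected to carry its raw text under the "text" key. A minimal sketch of writing one such record (the file name and contents are placeholders):

import json

# Placeholder JSONL record; text_key/jsonl_key above both point at "text".
record = {"text": "On-the-fly preprocessing tokenizes raw text during training."}
with open("/path/to/text_data/shard_000.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")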
Fine-Tuning Example#
The following configuration is for on-the-fly data preprocessing for fine-tuning:
train_input:
  preprocessing:
    data_processor: "RawDatasetProcessor"
    processing:
      custom_tokenizer: gpt2tokenizer
      tokenizer_params:
        encoder_file: /path/to/gpt2-encoder.json
        vocab_file: /path/to/gpt2-vocab.bpe
        sep_token: None
      batch_size: 4
      max_seq_length: 2048
      seed: 0
      read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:prompt_completion_text_read_hook"
      read_hook_kwargs:
        data_keys:
          prompt_key: "prompt"
          completion_key: "completion"
    dataset:
      prompt_key: "prompt"
      completion_key: "completion"
    setup:
      data:
        source: /path/to/sum_data
        type: local
      mode: finetuning
      processes: 1
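Correspondingly, prompt_completion_text_read_hook reads prompt/completion pairs from the keys configured above. A minimal sketch of one such record (the file name and contents are placeholders):

import json

# Placeholder prompt/completion record matching the data_keys above.
record = {
    "prompt": "Summarize: Wafer-scale systems integrate an entire wafer as one chip.",
    "completion": "Wafer-scale systems use a whole wafer as a single chip.",
}
with open("/path/to/sum_data/shard_000.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")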
Configuration#
The preprocessing block consists of three sub-sections, each documented separately:
processing: see Setting Up the Environment
dataset: see Modes and Dataset Parameters
setup: see Handling Input Data
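As a quick pre-flight check, the presence of these three sub-sections can be verified before launching a run. A minimal sketch, assuming PyYAML is installed and an illustrative params.yaml file name:

import yaml

with open("params.yaml") as f:  # file name is illustrative
    params = yaml.safe_load(f)

preprocessing = params["train_input"]["preprocessing"]
for section in ("processing", "dataset", "setup"):
    assert section in preprocessing, f"missing preprocessing.{section}"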