Quickstart Guide For Data Preprocessing#
Overview#
This guide provides a comprehensive overview of data preprocessing for the Cerebras platform. It covers the steps to set up the environment, preprocess text-only and multimodal datasets for pre-training and fine-tuning, and visualize the preprocessed data. By following this guide, you can ensure your data is prepared efficiently and effectively for machine learning tasks.
Initial setup#
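Before you begin, make sure the Cerebras Model Zoo is installed in your Python environment so that the preprocessing scripts and the read hooks referenced below are importable.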
Preprocess a text-only dataset for pre-training#
Set up the raw data#
Set up the directory containing the raw data needed for preprocessing. If you are using a HuggingFace dataset, refer to <insert-reference-to-HuggingFace-section>
Input file formats#
The preprocessing pipeline supports input data in various formats, namely .jsonl, .json.gz, .jsonl.zst, .jsonl.zst.tar, .parquet, .txt, and .fasta. For more details on how to structure these files, refer to the section on read hooks.
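For example, assuming the default text_read_hook configuration shown later (with text_key set to "text"), a minimal .jsonl input file contains one JSON object per line:
{"text": "First document goes here."}
{"text": "Second document goes here."}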
Note
To process files optimally, it is recommended that all files other than .txt contain enough text in a single file; the recommended size for each file is on the order of GBs. If you are processing smaller files in .txt format, please input a metadata file containing a list of paths to these files to better leverage multiprocessing, as shown below.
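A metadata file here is simply a plain text file listing one input file path per line (the paths below are hypothetical):
/data/raw/part-0001.txt
/data/raw/part-0002.txt
/data/raw/part-0003.txt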
Prepare data config file#
To set up a config file, you will have to specify three sections, namely setup, processing, and dataset.
Example of data setup section:
setup:
    data:
        source: "<path/to/dir>"
        type: "local"
        split: "test"
        cache_dir: "path/to/cache_dir"
        # ...other parameters accepted by the HuggingFace load_dataset API...
    mode: "pretraining"
    output_dir: "./output/dir/here/"
    processes: 1
Note
Set type to huggingface for HuggingFace data, and set source to the name of the HuggingFace dataset, as sketched below.
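For instance, a hypothetical setup section for the public HuggingFace dataset stas/openwebtext-10k (used here only as an example name) might look like:
setup:
    data:
        type: "huggingface"
        source: "stas/openwebtext-10k"
        split: "train"
    mode: "pretraining"
    output_dir: "./output/dir/here/"
    processes: 1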
Example of processing section:
processing:
    huggingface_tokenizer: "bert-base-uncased"
    tokenizer_params:
        param1: value1
        param2: value2
    read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks.text_read_hook"
    read_hook_kwargs:
        data_keys:
            text_key: "text"
    resume_from_checkpoint: False
    max_seq_length: 2048
    read_chunk_size: 1024
    write_chunk_size: 1024
    shuffle: False
    shuffle_seed: 0
Example of dataset section:
dataset:
    ftfy_normalizer: NFC
    use_ftfy: False
    use_vsl: False
    wikitext_detokenize: False
    pack_sequences: True
Note
Set training_objective to mlm for MLM tasks, and to fim for FIM tasks, as sketched below.
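As a minimal sketch, assuming training_objective sits alongside the other dataset options shown above (consult the Data preprocessing guide for the authoritative placement):
dataset:
    training_objective: "mlm"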
For a comprehensive list of configurations used in Model Zoo, refer to the Data preprocessing guide.
Procedure#
Applying the read hook for pre-training#
This hook extracts and processes plain text data. It requires a key to extract the text from the input data.
read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks.text_read_hook"
read_hook_kwargs:
    data_keys:
        text_key: "text"
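The hook converts each input record into a list of semantic regions of the following form: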
[
    {
        "content": [
            {"text": "Extracted text data"}
        ]
    }
]
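To make the transformation concrete, here is a minimal sketch of what such a read hook conceptually does. The signature and body are illustrative assumptions, not the Model Zoo implementation, which lives at cerebras.modelzoo.data_preparation.data_preprocessing.hooks.text_read_hook:
# Illustrative sketch of a text read hook; names and signature are assumptions.
def text_read_hook(example, data_keys=None, **read_hook_kwargs):
    data_keys = data_keys or {}
    # Pull the raw text out of the input record using the configured key.
    text = example.get(data_keys.get("text_key", "text"), "")
    # Emit a single semantic region in the format shown above.
    return [{"content": [{"text": text}]}]

# Produces: [{'content': [{'text': 'Extracted text data'}]}]
print(text_read_hook({"text": "Extracted text data"}, data_keys={"text_key": "text"}))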
Preprocess a multimodal local dataset for fine-tuning#
Set up the raw data#
Set up the directory containing the raw data needed for preprocessing, along with the image directory it references.
Example of data setup section:
setup:
    data:
        source: "/path/to/local/dataset"
        type: "local"
    mode: "finetuning"
    output_dir: "./output/dir/here/"
    processes: 1
Example of processing section:
Note
The tokenizer has a custom implementation because of a bug in HuggingFace's Llama3 tokenizer's offset_mapping calculation (see: https://github.com/huggingface/tokenizers/issues/1553). Once that is fixed, this tokenizer will no longer be needed. Also, the current preprocessing pipeline relies on the tokenizer's offset_mapping to process semantic regions, so please use this implementation for Llama3.
processing:
    custom_tokenizer: "cerebras.modelzoo.data_preparation.data_preprocessing.custom_tokenizer_example.CustomLlama3Tokenizer:CustomLlama3Tokenizer"
    tokenizer_params:
        pretrained_model_name_or_path: "meta-llama/Meta-Llama-3-8B-Instruct"
        token: "<insert_auth_token>"
    read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks.finetuning_llava_hook_prompt_completion"
    read_hook_kwargs:
        data_keys:
            multi_turn_key: "conversation"
            image_key: "<name-of-image-column-in-raw-dataset>"
        image_token: "<image>"
        phase: 1
    resume_from_checkpoint: False
    max_seq_length: 2048
    read_chunk_size: 1024
    write_chunk_size: 1024
    shuffle: False
    shuffle_seed: 0
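For context on the offset_mapping mentioned in the note above: fast HuggingFace tokenizers can return per-token character spans, which the pipeline uses to align tokens with semantic regions. A quick, self-contained illustration (bert-base-uncased is used here only as an example model):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("Hello world", return_offsets_mapping=True)
# Each tuple is the (start, end) character span of a token in the input text;
# special tokens such as [CLS] and [SEP] map to (0, 0).
print(enc["offset_mapping"])  # e.g. [(0, 0), (0, 5), (6, 11), (0, 0)]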
Example of dataset section:
dataset:
    is_multimodal: True
    ftfy_normalizer: NFC
    use_ftfy: False
    use_vsl: False
    wikitext_detokenize: False
    image_dir: "<path/to/image/dir>"
Note
use_vsl needs to be set to True in the train_input or eval_input section of the model config. use_vsl is not supported for multimodal tasks.
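For reference, a minimal sketch of the corresponding flag in the model config (section name taken from the note above):
train_input:
    use_vsl: True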
Fine-tuning LLaVA hook prompt completion#
This hook transforms conversation data for fine-tuning LLaVA, alternating between prompt and completion roles. It requires keys for the conversation data and image paths. The hook implementation can be found in the Model Zoo module referenced by read_hook below.
read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks.finetuning_llava_hook_prompt_completion"
read_hook_kwargs:
    data_keys:
        multi_turn_key: "conversation"
        image_key: "image_path"
    image_token: "<image>"
    phase: 1
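The hook converts each conversation into a list of semantic regions of the following form: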
[
    {
        "type": "system",
        "content": [
            {"text": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."}
        ]
    },
    {
        "type": "prompt",
        "content": [
            {"image": "path/to/image.jpg"},
            {"text": "User's text before and after image"}
        ],
        "semantic_drop_mask": [False, True]
    },
    {
        "type": "completion",
        "content": [{"text": "Assistant's response"}],
        "semantic_drop_mask": [False]
    }
]
For a detailed explanation of the configuration parameters, refer to the Data preprocessing guide. You can also refer to the config templates provided in Model Zoo.
Running the preprocessing pipeline#
Once the configuration file is prepared, run the following command:
python preprocess_data.py --config /path/to/configuration/file
Visualization#
This tool visualizes preprocessed data in an organized fashion, making it easy to debug and catch errors in the output data.
python launch_tokenflow.py --output_dir <directory/of/file(s)>
Arguments#
output_dir: Contains the file(s) that are to be viewed in the GUI. [Required]
data_params: Location of the data_params.json file for the preprocessed dataset. [Optional]
port: Specifies a different port for the Flask server. [Optional, default=5000]
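For example:
python launch_tokenflow.py --output_dir ./output/dir/here/ --port 8000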
Output#
There are four sections in the visualization output. input_strings and label_strings are tokens decoded from input_ids and labels, respectively. Tokens in the string sections are highlighted in green when the loss weight is greater than zero for that token, and highlighted in red when their attention mask is set to zero. For multimodal datasets, hovering over the image pad tokens also displays the corresponding image in a popup window.
Conclusion#
By following this guide, you can efficiently set up and run data preprocessing pipelines on the Cerebras platform. This process ensures that your datasets are clean and optimized for further machine learning tasks, improving both performance and accuracy.