Fine-tune your first model#
Overview#
Follow this guide to fine-tune your first model on a Cerebras system.
This tutorial teaches you about Cerebras essentials like data preprocessing and training scripts, config files, and checkpoint conversion tools. To understand these concepts, you’ll fine-tune Meta’s Llama 3 8B on a small dataset consisting of documents and their summaries.
In this quickstart guide, you will:
Set up your environment
Preprocess a small dataset
Port a trained model from Hugging Face
Fine-tune and evaluate a model
Test your model on downstream tasks
Port your model to Hugging Face
Note
In this tutorial, you will train your model for a short while on a small dataset. A high quality model requires a longer training run, as well as a much larger dataset.
Prerequisites#
To begin this guide, you must have:
Cerebras system access. If you don’t have access, contact Cerebras Support.
Completed setup and installation.
Step 1: Setup#
Set environment variables#
Start by saving common paths in environment variables for easy access, including:
The parent directory above Model Zoo
The location of data preprocessing scripts
The location of training scripts (in this case, Llama 3)
The location of scripts for converting checkpoints to and from Hugging Face
export MODELZOO_PARENT=$(pwd)
export MODELZOO_DATA=$MODELZOO_PARENT/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing
export MODELZOO_MODEL=$MODELZOO_PARENT/modelzoo/src/cerebras/modelzoo/models/nlp/llama
export MODELZOO_TOOLS=$MODELZOO_PARENT/modelzoo/src/cerebras/modelzoo/tools
export MODELZOO_COMMON=$MODELZOO_PARENT/modelzoo/src/cerebras/modelzoo/common
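If you want to confirm that these variables point at real directories, here is a minimal optional sketch in Python (it assumes you run it from the same shell session that exported them):

import os
from pathlib import Path

# Hypothetical sanity check: print each exported variable and whether the
# directory it points to exists on disk.
for var in ("MODELZOO_PARENT", "MODELZOO_DATA", "MODELZOO_MODEL", "MODELZOO_TOOLS", "MODELZOO_COMMON"):
    path = os.environ.get(var)
    status = "ok" if path and Path(path).is_dir() else "missing"
    print(f"{var:16} {status:8} {path}")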
Create model directory#
Create a dedicated folder for assets (data/model configs) and generated files (processed data files, checkpoints, logs, etc.):
mkdir finetuning_tutorial
Copy configs#
Then, copy sample configs into your folder. You will use these to control Model Zoo scripts for efficient data preprocessing, training, and evaluation of large models.
cp modelzoo/src/cerebras/modelzoo/tutorials/finetuning/* finetuning_tutorial
Inspect data config (optional)#
Take a look at one of your data configs:
cat finetuning_tutorial/train_data_config.yaml
Here is what you should see in your terminal:
############################################
## Fine-Tuning Tutorial Train Data Config ##
############################################
setup:
  data:
    type: "huggingface"
    source: "autoevaluate/xsum-sample"
    split: "train"
  mode: "finetuning"
  output_dir: "finetuning_tutorial/train_data"
  processes: 1
processing:
  custom_tokenizer: cerebras.modelzoo.data_preparation.data_preprocessing.custom_tokenizer_example.CustomLlama3Tokenizer:CustomLlama3Tokenizer
  tokenizer_params:
    pretrained_model_name_or_path: "baseten/Meta-Llama-3-tokenizer"
  write_in_batch: True
  read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:prompt_completion_text_read_hook"
  read_hook_kwargs:
    data_keys:
      prompt_key: "document"
      completion_key: "summary"
dataset:
  use_ftfy: True
This config file will process the "train" split of the autoevaluate/xsum-sample dataset from Hugging Face using the baseten/Meta-Llama-3-tokenizer.
An example of “train” looks as follows:
{
  "document": "In Wales, councils are responsible for funding..",
  "summary": "As Chancellor George Osborne announced...",
  "id": "35821725"
}
If you are interested, you can read more about the various parameters and pre-built utilities for preprocessing common data formats. You can also follow end-to-end tutorials for various use cases.
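If you would like to peek at the raw dataset before preprocessing it, here is a minimal sketch using the Hugging Face datasets library (assumed to be installed in your environment):

from datasets import load_dataset

# Load the same split the data config points at and inspect one record.
ds = load_dataset("autoevaluate/xsum-sample", split="train")
print(ds)                        # row count and column names
print(ds[0]["document"][:200])   # start of the first document
print(ds[0]["summary"])          # its reference summary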
Inspect model config (optional)#
Take a look at your model config:
cat finetuning_tutorial/model_config.yaml
Here is what you should see in your terminal:
#######################################
## Fine-Tuning Tutorial Model Config ##
#######################################
trainer:
  init:
    model_dir: finetuning_tutorial/model
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 1
    callbacks:
      - ComputeNorm: {}
    checkpoint:
      steps: 18
    logging:
      log_steps: 1
    loop:
      eval_steps: 5
      max_steps: 18
    model:
      attention_dropout_rate: 0.0
      attention_module: multiquery_attention
      attention_type: scaled_dot_product
      dropout_rate: 0.0
      embedding_dropout_rate: 0.0
      embedding_layer_norm: false
      extra_attention_params:
        num_kv_groups: 8
      filter_size: 14336
      fp16_type: cbfloat16
      hidden_size: 4096
      initializer_range: 0.02
      layer_norm_epsilon: 1.0e-05
      loss_scaling: num_tokens
      loss_weight: 1.0
      max_position_embeddings: 8192
      mixed_precision: true
      nonlinearity: swiglu
      norm_type: rmsnorm
      num_heads: 32
      num_hidden_layers: 32
      pos_scaling_factor: 1.0
      position_embedding_type: rotary
      rope_theta: 500000.0
      rotary_dim: 128
      share_embedding_weights: false
      use_bias_in_output: false
      use_ffn_bias: false
      use_ffn_bias_in_attention: false
      use_projection_bias_in_attention: false
      vocab_size: 128256
    optimizer:
      AdamW:
        betas:
          - 0.9
          - 0.95
        correct_bias: true
        weight_decay: 0.01
    precision:
      enabled: true
      fp16_type: cbfloat16
      log_loss_scale: true
      loss_scaling_factor: dynamic
      max_gradient_norm: 1.0
    schedulers:
      - CosineDecayLR:
          end_learning_rate: 1.0e-05
          initial_learning_rate: 5.0e-05
          total_iters: 18
    seed: 1
  fit:
    train_dataloader:
      batch_size: 8
      data_dir: train_data
      data_processor: GptHDF5MapDataProcessor
      num_workers: 8
      persistent_workers: true
      prefetch_factor: 10
      shuffle: true
      shuffle_seed: 1337
    val_dataloader: &id001
      batch_size: 1
      data_dir: valid_data
      data_processor: GptHDF5MapDataProcessor
      num_workers: 8
      shuffle: false
    ckpt_path: finetuning_tutorial/from_hf/pytorch_model_to_cs-2.3.mdl
  validate:
    val_dataloader: *id001
  validate_all:
    val_dataloaders: *id001
These parameters specify the full architecture of the Llama 3 8B model and configure a Trainer object that controls training, validation, and logging.
If you are interested, learn more about model configs here, or dive into how to set up flexible training and evaluation. You can also follow end-to-end tutorials for various use cases.
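If you would like to sanity-check a few of these settings programmatically before launching a run, here is a minimal sketch using PyYAML (assumed to be installed in your environment):

import yaml

# Load the trainer config and print the settings that control this short run.
with open("finetuning_tutorial/model_config.yaml") as f:
    cfg = yaml.safe_load(f)

init = cfg["trainer"]["init"]
print("max_steps:       ", init["loop"]["max_steps"])
print("checkpoint steps:", init["checkpoint"]["steps"])
print("hidden_size:     ", init["model"]["hidden_size"])
print("train batch size:", cfg["trainer"]["fit"]["train_dataloader"]["batch_size"])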
Inspect evaluation config (optional)#
Take a look at your evaluation config:
cat finetuning_tutorial/eeh_config.yaml
Here is what you should see in your terminal:
#############################################################
## Fine-Tuning Tutorial Eleuther Evaluation Harness Config ##
#############################################################
trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 1
    model:
      model_name: llama
      attention_dropout_rate: 0.0
      attention_module: multiquery_attention
      attention_type: scaled_dot_product
      dropout_rate: 0.0
      embedding_dropout_rate: 0.0
      embedding_layer_norm: false
      extra_attention_params:
        num_kv_groups: 8
      filter_size: 14336
      fp16_type: cbfloat16
      hidden_size: 4096
      initializer_range: 0.02
      layer_norm_epsilon: 1.0e-05
      loss_scaling: num_tokens
      loss_weight: 1.0
      max_position_embeddings: 8192
      mixed_precision: true
      nonlinearity: swiglu
      norm_type: rmsnorm
      num_heads: 32
      num_hidden_layers: 32
      pos_scaling_factor: 1.0
      position_embedding_type: rotary
      rope_theta: 500000.0
      rotary_dim: 128
      share_embedding_weights: false
      use_bias_in_output: false
      use_ffn_bias: false
      use_ffn_bias_in_attention: false
      use_projection_bias_in_attention: false
      vocab_size: 128256
    callbacks:
      - EleutherEvalHarness:
          eeh_args:
            tasks: winogrande
            num_fewshot: 0
          keep_data_dir: false
          batch_size: 4
          shuffle: false
          max_sequence_length: 8192
          num_workers: 1
          data_dir: finetuning_tutorial/eeh
          eos_id: 128001
          pretrained_model_name_or_path: baseten/Meta-Llama-3-tokenizer
          flags:
            csx.performance.micro_batch_size: null
This file lets you evaluate your model via the multiple-choice (non-generative) eval harness task winogrande on a single CSX system.
If you are interested, you can learn more about validating models using the Eleuther or BigCode Evaluation Harness in our documentation.
Step 2: Preprocess data#
Preprocess training and validation data#
Use your data configs to preprocess your “train” and “validation” datasets:
python $MODELZOO_DATA/preprocess_data.py \
--config finetuning_tutorial/train_data_config.yaml
python $MODELZOO_DATA/preprocess_data.py \
--config finetuning_tutorial/valid_data_config.yaml
You should then see your preprocessed data in finetuning_tutorial/train_data/ and finetuning_tutorial/valid_data/ (see the output_dir parameter in your data configs).
Inspect preprocessed data (optional)#
Once you’ve preprocessed your data, you can visualize the outcome:
python $MODELZOO_DATA/tokenflow/launch_tokenflow.py \
--output_dir finetuning_tutorial/train_data
In your terminal, you will see a URL like http://172.31.48.239:5000. Copy and paste this into your browser to launch TokenFlow, a tool for interactively visualizing whether loss and attention masks were applied correctly.
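If you prefer a programmatic check over the TokenFlow UI, here is a minimal sketch using h5py; the *.h5 file pattern is an assumption, so adjust the paths to match what preprocess_data.py actually wrote to your output_dir:

import glob
import h5py

# Walk every object in each preprocessed HDF5 file; datasets have a shape, groups do not.
for path in sorted(glob.glob("finetuning_tutorial/train_data/*.h5")):
    with h5py.File(path, "r") as f:
        f.visititems(lambda name, obj: print(path, name, getattr(obj, "shape", "group")))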
Step 3: Port model from Hugging Face#
Download checkpoint and configs#
Create a dedicated folder for the checkpoint and configuration files you’ll be downloading from Hugging Face.
mkdir finetuning_tutorial/from_hf
You can either fine-tune a model from a local pre-trained checkpoint or (as in this tutorial) from Hugging Face.
First, download the checkpoint and configuration files from Hugging Face using the commands below. For the purposes of this tutorial, we’ll be using McGill NLP’s Llama-3-8B-Web, a fine-tuned Meta-Llama-3-8B-Instruct model.
wget -P finetuning_tutorial/from_hf https://huggingface.co/McGill-NLP/Llama-3-8B-Web/resolve/main/pytorch_model.bin
wget -P finetuning_tutorial/from_hf https://huggingface.co/McGill-NLP/Llama-3-8B-Web/resolve/main/config.json
This will save two files in the finetuning_tutorial/from_hf directory:
config.json: The model’s configuration file.
pytorch_model.bin: The model’s weights.
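As an optional sanity check, you can compare a few fields of the downloaded config.json against the model section of your Model Zoo config. Here is a minimal sketch (it assumes PyYAML is installed and that the Hugging Face config uses the standard Llama field names):

import json
import yaml

# Read both configs.
with open("finetuning_tutorial/from_hf/config.json") as f:
    hf_cfg = json.load(f)
with open("finetuning_tutorial/model_config.yaml") as f:
    cs_model = yaml.safe_load(f)["trainer"]["init"]["model"]

# Fields that should agree between the Hugging Face and Model Zoo configs.
for key in ("hidden_size", "num_hidden_layers", "vocab_size"):
    print(key, hf_cfg.get(key), "(HF) vs", cs_model.get(key), "(Model Zoo)")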
Convert checkpoint and configs#
You can now convert the files to a format compatible with Model Zoo:
python $MODELZOO_TOOLS/convert_checkpoint.py \
convert \
--model llama \
--src-fmt hf \
--tgt-fmt cs-current \
--config finetuning_tutorial/from_hf/config.json \
finetuning_tutorial/from_hf/pytorch_model.bin
Your finetuning_tutorial/from_hf folder should now contain:
pytorch_model_to_cs-2.3.mdl: The converted model checkpoint.
config_to_cs-2.3.yaml: The converted configuration file.
As a final step, you would usually point ckpt_path in finetuning_tutorial/model_config.yaml to the location of this converted checkpoint. You do not need to do this in this quickstart, since it has already been done for you.
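For reference, here is a minimal sketch of how you might set ckpt_path programmatically in a typical workflow (again, not needed in this quickstart; note that re-dumping the YAML may reorder or reformat it slightly):

import yaml

config_path = "finetuning_tutorial/model_config.yaml"
with open(config_path) as f:
    cfg = yaml.safe_load(f)

# Point the trainer at the checkpoint converted from Hugging Face.
cfg["trainer"]["fit"]["ckpt_path"] = (
    "finetuning_tutorial/from_hf/pytorch_model_to_cs-2.3.mdl"
)

with open(config_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)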
Step 4: Train and evaluate model#
Modify configs#
Set train_dataloader.data_dir and val_dataloader.data_dir in your model config to the absolute paths of your preprocessed data:
sed -i "s|data_dir: train_data|data_dir: ${MODELZOO_PARENT}/finetuning_tutorial/train_data|" \
finetuning_tutorial/model_config.yaml
sed -i "s|data_dir: valid_data|data_dir: ${MODELZOO_PARENT}/finetuning_tutorial/valid_data|" \
finetuning_tutorial/model_config.yaml
Submit training job#
Train your model by passing your updated model config, the locations of important directories, and Python package paths to the run script. Click here for more information.
python $MODELZOO_MODEL/run.py CSX \
--mode train_and_eval \
--params finetuning_tutorial/model_config.yaml \
--mount_dirs $MODELZOO_PARENT $MODELZOO_PARENT/modelzoo \
--python_paths $MODELZOO_PARENT/modelzoo/src
You should then see something like this in your terminal:
Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
INFO: Finished sending initial weights
INFO: | Train Device=CSX, Step=50, Loss=8.31250, Rate=69.37 samples/sec, GlobalRate=69.37 samples/sec
INFO: | Train Device=CSX, Step=100, Loss=7.25000, Rate=68.41 samples/sec, GlobalRate=68.56 samples/sec
...
Once training is complete, you will find several artifacts in the finetuning_tutorial/model folder (see the model_dir parameter in your model config). These include:
Checkpoints
TensorBoard event files
Run logs
A copy of the model config
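To see these artifacts without opening TensorBoard, here is a minimal sketch; the glob patterns are assumptions, and exact file names can vary between releases:

from pathlib import Path

model_dir = Path("finetuning_tutorial/model")
# Checkpoints, TensorBoard event files, copied configs, and run logs.
for pattern in ("checkpoint_*.mdl", "events.out.tfevents.*", "*.yaml", "*.log"):
    for p in sorted(model_dir.rglob(pattern)):
        print(p)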
Inspect training logs (optional)#
Monitor your training during the run or visualize TensorBoard event files afterwards:
tensorboard --bind_all --logdir="finetuning_tutorial/model"
Run evaluation tasks#
After training, you can test your model on downstream tasks:
python $MODELZOO_COMMON/run_eleuther_eval_harness.py CSX \
--params finetuning_tutorial/eeh_config.yaml \
--checkpoint_path finetuning_tutorial/model/checkpoint_0.mdl \
--mount_dirs $MODELZOO_PARENT/finetuning_tutorial/eeh \
--python_paths $MODELZOO_PARENT/modelzoo/src
Your output logs should look something like:
...
2024-06-24 11:06:49,596 INFO: | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=632, Rate=34.38 samples/sec, GlobalRate=34.34 samples/sec
2024-06-24 11:06:49,625 INFO: | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=633, Rate=34.39 samples/sec, GlobalRate=34.35 samples/sec
2024-06-24 11:07:04,333 INFO: <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7ff89ae3ceb0> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-24 11:07:04,696 INFO:
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande| 1|none | 0|acc |0.7284|± |0.0141|
Step 5: Port model to Hugging Face#
Convert checkpoint and configs#
Once you have trained (and evaluated) your model, you can port it to Hugging Face to generate outputs:
python $MODELZOO_TOOLS/convert_checkpoint.py \
convert \
--model llama \
--src-fmt cs-auto \
--tgt-fmt hf \
--config finetuning_tutorial/model_config.yaml \
--output-dir finetuning_tutorial/to_hf \
finetuning_tutorial/model/checkpoint_0.mdl
This will create both Hugging Face config files and a converted checkpoint under finetuning_tutorial/to_hf
.
Validate checkpoint and configs (optional)#
You can now generate outputs using Hugging Face:
pip install 'transformers[torch]'
python
Python 3.8.16 (default, Mar 18 2024, 18:27:40)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("baseten/Meta-Llama-3-tokenizer")
>>> config = AutoConfig.from_pretrained("finetuning_tutorial/to_hf/model_config_to_hf.json")
>>> model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="finetuning_tutorial/to_hf/checkpoint_0_to_hf.bin", config = config)
>>> text = "Generative AI is "
>>> pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> generated_text = pipe(text, max_length=50, do_sample=False, no_repeat_ngram_size=2, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.eos_token_id)[0]
>>> print(generated_text['generated_text'])
>>> exit()
Note
As a reminder, in this quickstart, you did not train your model for very long. A high quality model requires a longer training run, as well as a much larger dataset.
Conclusion#
Congratulations! In this tutorial, you followed an end-to-end workflow to fine-tune a model on a Cerebras system and learned about its essential tools and scripts.
As part of this, you learned how to:
Set up your environment
Preprocess a small dataset
Port a trained model from Hugging Face
Fine-tune and evaluate a model
Test your model on downstream tasks
Port your model to Hugging Face
What’s next?#
Learn more about data preprocessing.
Learn more about the Cerebras Model Zoo and the different models we support.