Trainer Overview#
The Cerebras Model Zoo Trainer lets you efficiently train models of any size without having to manually distribute the model through techniques such as model parallelism or tensor parallelism.
The Trainer is an auxiliary feature of the Model Zoo: it is not required in order to use the reference models that come pre-packaged with the Model Zoo. However, we recommend using the Trainer, as it is highly optimized for training and validating models on the Cerebras Wafer-Scale Cluster.
On this page, you will learn about the Trainer class. By the end, you should have a cursory understanding of how to use it.
Prerequisites#
Please ensure that you have installed the Cerebras Model Zoo package by going through the installation guide.
Optionally, you can also read through the basic Cerebras PyTorch guide to first gain an understanding of the API that underpins the Trainer class.
Basic Usage#
The Trainer
class can be imported and used as follows:
import torch
import cerebras.pytorch as cstorch
from cerebras.modelzoo import Trainer

# Any torch.nn.Module
model: torch.nn.Module = torch.nn.Linear(10, 10)

# Any Cerebras compliant optimizer
optimizer: cstorch.optim.Optimizer = cstorch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9
)

trainer = Trainer(
    device="CSX",             # The device to run on
    model_dir="./model_dir",  # The directory at which to store artifacts
    model=model,
    optimizer=optimizer,
)

# Train the model over a single epoch of the train dataloader
# and then run validation over a single epoch of the val dataloader
trainer.fit(train_dataloader, val_dataloader)
As can be seen in the above example, at a minimum the Trainer class takes in the following:

device: The device to run training/validation on.
model_dir: The directory in which to store model-related artifacts (e.g. model checkpoints).
model: The torch.nn.Module instance that we are training/validating.
optimizer: Optionally, a cerebras.pytorch.optim.Optimizer instance can be passed in to optimize the model weights during the training phase.

At a minimum, the call to fit takes in the following:

train_dataloader: The cerebras.pytorch.utils.data.DataLoader instance to use during training (a sketch of constructing one is shown below).
val_dataloader: Optionally, a cerebras.pytorch.utils.data.DataLoader instance can be passed in to run validation during and/or at the end of training.
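As referenced in the list above, here is a minimal sketch of how the two dataloaders might be constructed. It assumes that cerebras.pytorch.utils.data.DataLoader wraps a callable returning a standard torch.utils.data.DataLoader (as described in the Cerebras PyTorch guide), and it uses hypothetical random data shaped for the Linear(10, 10) model from the example:

import torch
import cerebras.pytorch as cstorch

def make_dataloader(batch_size):
    # Hypothetical toy data shaped to match the Linear(10, 10) model above.
    features = torch.randn(1000, 10)
    labels = torch.randn(1000, 10)
    dataset = torch.utils.data.TensorDataset(features, labels)
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# The Cerebras DataLoader takes the callable; the remaining arguments are
# forwarded to the callable when the dataloader is constructed.
train_dataloader = cstorch.utils.data.DataLoader(make_dataloader, batch_size=64)
val_dataloader = cstorch.utils.data.DataLoader(make_dataloader, batch_size=64)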
The default behaviour of this minimally configured run is to train the model over a single epoch of the train dataloader and then run validation over a single epoch of the val dataloader.
There you have it! With this small sample of code, you can begin training your very first model using the Cerebras Model Zoo Trainer!
You can pause here to try it out for yourself, or continue reading to learn how to configure the Trainer more finely to fit your needs.
Configuring the Training Loop#
As mentioned above, if both a train_dataloader and a val_dataloader are provided to the fit call, the default behaviour is to run a single epoch of training followed by a single epoch of validation.
This behaviour can be configured by passing a TrainingLoop instance to the Trainer as follows:
from cerebras.modelzoo.trainer.callbacks import TrainingLoop

trainer = Trainer(
    ...,
    loop=TrainingLoop(
        num_steps=1000,
        eval_steps=100,
        eval_frequency=100,
    ),
)
trainer.fit(train_dataloader, val_dataloader)
In the above example:

num_steps represents the total number of batches to train for. If num_steps exceeds the number of available batches in the train dataloader, the dataloader is automatically repeated so that training can run for num_steps.
eval_steps represents the number of steps to run each time validation is run. Similar to training, if eval_steps exceeds the number of available batches in the val dataloader, the dataloader is automatically repeated. However, validation is typically never run for more than a single epoch, so it is advised to set eval_steps to be no greater than the length of the validation dataloader. Otherwise, the validation metrics may be incorrect.
eval_frequency represents how often validation is run during training. In the above example, validation is run every 100 steps of training; that is, throughout the 1000 steps of training, validation is run 10 times (see the short sketch below). As long as eval_frequency is greater than zero, validation is always run at the end of training, regardless of its exact value.
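As a quick illustration of how num_steps and eval_frequency interact, the following is a small standalone sketch (plain Python, not part of the Trainer API) that computes the steps at which validation runs for the configuration above:

num_steps = 1000
eval_frequency = 100

# Validation runs after every eval_frequency training steps,
# i.e. at steps 100, 200, ..., 1000 for this configuration.
val_steps = [step for step in range(1, num_steps + 1) if step % eval_frequency == 0]

print(len(val_steps))  # 10 validation runs over the course of training
print(val_steps[-1])   # 1000, i.e. validation also runs at the end of training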
Checkpointing#
The Trainer can be further configured to save checkpoints at regular intervals by passing in a Checkpoint instance as follows:
from cerebras.modelzoo.trainer.callbacks import Checkpoint

trainer = Trainer(
    ...,
    model_dir="./model_dir",
    checkpoint=Checkpoint(steps=100),
)
trainer.fit(train_dataloader, val_dataloader)
In the above example, a checkpoint is saved every 100 steps of training. A checkpoint is also saved at the end of training, regardless of whether num_steps is a multiple of the checkpoint steps.
The checkpoints are saved in the model_dir directory that was passed to the Trainer:
model_dir/
├── checkpoint_100.mdl
├── checkpoint_200.mdl
├── checkpoint_300.mdl
└── ...
Note
This checkpoint is meant for resuming training from the same point in the future. As such, it will contain the model weights, optimizer state, and any other state that is necessary to resume training. Please see Selective Checkpoint State Saving for examples of how to configure what state is saved into the checkpoint.
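If you want to peek at what a saved checkpoint contains, the following is a minimal sketch, assuming the .mdl checkpoint files can be loaded with cerebras.pytorch's load utility; the exact keys depend on your model and configuration:

import cerebras.pytorch as cstorch

# Load a previously saved checkpoint and list the state it contains.
# The contents described in the comment are illustrative, not exact.
state_dict = cstorch.load("./model_dir/checkpoint_100.mdl")
print(list(state_dict.keys()))  # e.g. model weights, optimizer state, global step, ...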
A saved checkpoint can be loaded again in the future by specifying the ckpt_path argument in the call to fit. For example:
trainer = Trainer(...)

trainer.fit(
    train_dataloader,
    val_dataloader,
    ckpt_path="/path/to/checkpoint",
)
The above code will load the checkpoint at that path before starting training.
Note
If a ckpt_path is not provided, but a checkpoint is found inside the model_dir, then the Trainer will automatically load the latest checkpoint found in the model_dir.
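To illustrate what "latest checkpoint" means given the naming convention shown earlier, here is a hypothetical helper (not part of the Trainer API) that picks the checkpoint with the highest step number from a model_dir:

import glob
import os
import re

def find_latest_checkpoint(model_dir):
    # Look for files named checkpoint_<step>.mdl and return the one with
    # the highest step number, or None if the directory has no checkpoints.
    paths = glob.glob(os.path.join(model_dir, "checkpoint_*.mdl"))
    if not paths:
        return None
    return max(paths, key=lambda p: int(re.search(r"checkpoint_(\d+)\.mdl", p).group(1)))

print(find_latest_checkpoint("./model_dir"))  # e.g. ./model_dir/checkpoint_300.mdl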
To learn more about how to configure checkpointing behaviour using the Trainer, see Checkpointing.
Conclusion#
That concludes this overview of the basic functionality that the Trainer offers. By this point, you should have a cursory understanding of how to construct and configure a Trainer and perform some training with it.
What’s next?#
To learn about how to specify a schedule for learning rates, please see Optimizer and Scheduler.
To learn about how you can configure a Trainer instance using a YAML configuration file, you can check out: Trainer YAML Overview
To learn more about how you can use the Trainer in some core workflows, you can check out:
To learn more about how you can extend the capabilities of the Trainer class, you can check out:
To learn more about what the Trainer class outputs during the run, you can check out: