Launch your job#
Running jobs on the Cerebras Wafer-Scale cluster is straightforward and similar to running them on a single device. Here’s a comprehensive guide to get you started.
Prerequisite#
Make sure you have set up your installation.
Activate Cerebras virtual environment#
Before starting any jobs on the Cerebras Wafer-Scale Cluster, make sure to activate your virtual environment.
On the user node, activate the environment by issuing the following command:
source venv_cerebras_pt/bin/activate
You should now be in the (venv_cerebras_pt) environment.
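As a quick sanity check, the active environment can also be confirmed from Python. The sketch below is illustrative only: the helper name `active_venv_name` is hypothetical, and it relies on the standard behavior that `sys.prefix` differs from `sys.base_prefix` inside a virtual environment.

```python
import os
import sys


def active_venv_name():
    """Return the active virtual environment's directory name, or None."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base interpreter.
    if sys.prefix == sys.base_prefix:
        return None
    return os.path.basename(os.path.normpath(sys.prefix))


if __name__ == "__main__":
    name = active_venv_name()
    if name == "venv_cerebras_pt":
        print("venv_cerebras_pt is active")
    else:
        print(f"Warning: expected venv_cerebras_pt, found {name!r}")
```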
Prepare your datasets#
Each model in the Cerebras Model Zoo comes with scripts to help you prepare your datasets. For general guidance, check the Data processing and dataloaders section. Additionally, you can find dataset examples in the README file of each model.
For instance, the FC-MNIST model includes a prepare_data.py script that will download sample data.
For language models, leverage the data processing in the Cerebras Model Zoo. The Training and Fine-Tuning a Large Language Model (LLM) tutorial provides an example.
Note
After preparing your data, update the data path in the configuration file to the absolute path where your data is stored. Inside the configs/ folder, you will find YAML files corresponding to different model sizes. Locate the YAML file for your desired model size and modify the data path accordingly to ensure the training or evaluation process can find your data:
train_input:
    data_dir: "/absolute/path/to/training/dataset"
    ...
eval_input:
    data_dir: "/absolute/path/to/evaluation/dataset/"
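Before launching, it can help to confirm that the configured data paths really are absolute. The following is a minimal sketch, not part of the Model Zoo: the helper `relative_data_dirs` is hypothetical, and it assumes the params file has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`). It uses POSIX path rules because the user node is assumed to be a Linux host.

```python
import posixpath


def relative_data_dirs(config):
    """Return the input sections whose data_dir is missing or not absolute."""
    problems = []
    for section in ("train_input", "eval_input"):
        if section not in config:
            continue  # a section that is not configured is not checked
        data_dir = config[section].get("data_dir")
        if data_dir is None or not posixpath.isabs(data_dir):
            problems.append(section)
    return problems


# A dict shaped like the YAML fragment above. The relative eval path is
# deliberately wrong to show what the check reports.
config = {
    "train_input": {"data_dir": "/absolute/path/to/training/dataset"},
    "eval_input": {"data_dir": "relative/path/to/evaluation/dataset"},
}
print(relative_data_dirs(config))
```

Any section listed in the result should be updated in the YAML file before the job is launched.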
Launch your job#
Each model in the Cerebras Model Zoo contains a script called run.py. This script is designed to handle the compilation, training, and evaluation of your models on the Cerebras Wafer-Scale cluster.
To launch your job, you’ll need to specify the following flags:
Flag | Mandatory | Description |
---|---|---|
CSX | Yes | Specifies that the target device for execution is a Cerebras Cluster. |
--params | Yes | Path to a YAML file containing model/run configuration options. |
--mode | Yes | Whether to run train, eval, train_and_eval, or eval_all. |
--mount_dirs | Yes | List of paths to be mounted to the Appliance containers. It should include parent paths for Cerebras Model Zoo and other locations needed by the dataloader, including datasets and code. (Default: Pulled from path defined by env variable |
--python_paths | Yes | List of paths to be exported to |
--compile_only | No | Compile the model including matching to Cerebras kernels and mapping to hardware. It does not execute on system. Upon success, compile artifacts are stored inside the Cerebras cluster, under the directory specified in |
--validate_only | No | Validate model can be matched to Cerebras kernels. This is a lightweight compilation. It does not map to the hardware nor execute on system. Mutually exclusive with compile_only. (Default: |
--model_dir | No | Path to store model checkpoints, TensorBoard events files, etc. (Default: |
--compile_dir | No | Path to store the compile artifacts inside Cerebras cluster. (Default: |
--num_csx | No | Number of CS-X systems to use in training. (Default: |
For a more comprehensive list, issue the following command:
python run.py -h
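When jobs are launched from scripts, the flags above can be assembled programmatically. The sketch below is a hypothetical helper (`build_run_command` is not part of the Model Zoo); it only builds an argument list and does not run anything. It assumes that list-valued flags such as --mount_dirs and --python_paths accept space-separated values, and the paths in the usage example are placeholders.

```python
import shlex

VALID_MODES = ("train", "eval", "eval_all", "train_and_eval")


def build_run_command(params, mode, mount_dirs, python_paths, num_csx=1,
                      model_dir=None, compile_dir=None,
                      validate_only=False, compile_only=False):
    """Assemble a run.py argument list from the flags in the table above."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    if validate_only and compile_only:
        raise ValueError(
            "--validate_only and --compile_only are mutually exclusive")
    cmd = ["python", "run.py", "CSX",
           "--params", params,
           "--mode", mode,
           f"--num_csx={num_csx}"]
    # Assumption: list-valued flags take space-separated values.
    cmd += ["--mount_dirs", *mount_dirs]
    cmd += ["--python_paths", *python_paths]
    if model_dir:
        cmd += ["--model_dir", model_dir]
    if compile_dir:
        cmd += ["--compile_dir", compile_dir]
    if validate_only:
        cmd.append("--validate_only")
    if compile_only:
        cmd.append("--compile_only")
    return cmd


cmd = build_run_command(
    params="params.yaml",
    mode="train",
    mount_dirs=["/path/to/parent/modelzoo", "/path/to/data"],  # placeholders
    python_paths=["/path/to/parent/modelzoo/src"],             # placeholder
    validate_only=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` from the directory that contains run.py.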
Validate your job (optional)#
If you want to verify that your model implementation is compatible with the Cerebras software platform, you can use the --validate_only flag. This flag enables you to quickly check compatibility without the need to execute a full model run. It’s especially useful when you’re developing or adjusting your models and want to ensure they will work with the platform.
For instance, you might run a command like this:
python run.py \
    CSX \
    --params params.yaml \
    --num_csx=1 \
    --mode {train,eval,eval_all,train_and_eval} \
    --mount_dirs {paths to modelzoo and to data} \
    --python_paths {paths to modelzoo and other python code if used} \
    --validate_only
Compile your job (optional)#
To generate the executable files for your model on the Cerebras cluster, you can use the --compile_only flag. This step takes longer than validation (typically 15 minutes to an hour) because it prepares the model’s computation graph for optimal execution.
An example command might look like this:
python run.py \
    CSX \
    --params params.yaml \
    --num_csx=1 \
    --model_dir model_dir \
    --mode {train,eval,eval_all,train_and_eval} \
    --mount_dirs {paths to modelzoo and to data} \
    --python_paths {paths to modelzoo and other python code if used} \
    --compile_only
Note
You can speed up your training or evaluation runs by reusing pre-compiled artifacts obtained through the --validate_only and --compile_only flags. To achieve this, ensure that you use the same --compile_dir path during both the compilation and execution phases.
Keep in mind that training and evaluation modes require distinct fabric programming on the CS-X system, resulting in different compiled artifacts depending on the mode. For instance, when running:
--mode train --compile_only
--mode eval --compile_only
The artifacts will differ. Make sure you specify the appropriate --compile_dir based on whether you’re training or evaluating.
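One way to keep the artifacts for each mode apart is to derive the --compile_dir value from the mode. This is only a sketch: the helper `compile_dir_for` and the directory name `my_compile_artifacts` are hypothetical, and the per-mode subdirectory layout is a naming convention assumed here, not something the platform requires.

```python
import posixpath

VALID_MODES = ("train", "eval", "eval_all", "train_and_eval")


def compile_dir_for(base_dir, mode):
    """Return a mode-specific compile directory under base_dir."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    return posixpath.join(base_dir, mode)


# Pass the same value to --compile_dir for the --compile_only run and for
# the later execution run in the same mode.
print(compile_dir_for("my_compile_artifacts", "train"))
print(compile_dir_for("my_compile_artifacts", "eval"))
```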
Execute your job#
To execute your job on the Cerebras Wafer-Scale cluster, follow these steps:
1. Specify the Target Device: Use “CSX” as the first positional argument to target the Cerebras cluster.
./run.py CSX [other flags]
2. Provide Cluster Information: Supply information about the Cerebras cluster where the job will be executed using the flags --python_paths and --mount_dirs:
- --python_paths: Specify the Python paths needed to execute your job correctly. This should include all necessary scripts and packages.
- --mount_dirs: Indicate which directories should be mounted to access required files, datasets, or model weights.
Note
You can specify the python_paths and mount_dirs arguments in either of the following places:
- run.py script: Provide them as command-line arguments while executing the script.
- Runconfig section of params.yaml: Define these parameters within the YAML configuration file.
When running a model from the Cerebras Model Zoo, ensure that the paths specified include the parent directory where the Model Zoo is located. For instance, if your directory structure is /path/to/parent/modelzoo, the arguments should be /path/to/parent/modelzoo/src.
3. When executing your job, you need to specify two key pieces of information:
- Execution Mode: Choose one of the following modes based on your requirements:
    - train: For training the model.
    - eval: For evaluating the model on a specific dataset.
    - eval_all: For evaluating across multiple datasets.
    - train_and_eval: For both training and evaluating.
- Configuration File Path: Provide the path to the relevant configuration file that contains the necessary settings.
Ensure that these options are included in your command or script for proper execution.
python run.py \
    CSX \
    --params params.yaml \
    --num_csx=1 \
    --model_dir model_dir \
    --mode {train,eval,eval_all,train_and_eval} \
    --mount_dirs {paths to modelzoo and to data} \
    --python_paths {paths to modelzoo and other python code if used}
Here is an example of a typical output log for a training job:
Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
INFO: Finished sending initial weights
INFO: | Train Device=CSX, Step=50, Loss=8.31250, Rate=69.37 samples/sec, GlobalRate=69.37 samples/sec
INFO: | Train Device=CSX, Step=100, Loss=7.25000, Rate=68.41 samples/sec, GlobalRate=68.56 samples/sec
INFO: | Train Device=CSX, Step=150, Loss=6.53125, Rate=68.31 samples/sec, GlobalRate=68.46 samples/sec
INFO: | Train Device=CSX, Step=200, Loss=6.53125, Rate=68.54 samples/sec, GlobalRate=68.51 samples/sec
INFO: | Train Device=CSX, Step=250, Loss=6.12500, Rate=68.84 samples/sec, GlobalRate=68.62 samples/sec
INFO: | Train Device=CSX, Step=300, Loss=5.53125, Rate=68.74 samples/sec, GlobalRate=68.63 samples/sec
INFO: | Train Device=CSX, Step=350, Loss=4.81250, Rate=68.01 samples/sec, GlobalRate=68.47 samples/sec
INFO: | Train Device=CSX, Step=400, Loss=5.37500, Rate=68.44 samples/sec, GlobalRate=68.50 samples/sec
INFO: | Train Device=CSX, Step=450, Loss=6.43750, Rate=68.43 samples/sec, GlobalRate=68.49 samples/sec
INFO: | Train Device=CSX, Step=500, Loss=5.09375, Rate=66.71 samples/sec, GlobalRate=68.19 samples/sec
INFO: Training completed successfully!
INFO: Processed 60500 sample(s) in 887.2672743797302 seconds.
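The per-step lines in a log like the one above are regular enough to parse for plotting or monitoring. The sketch below is an illustration based solely on the sample log shown here; the pattern and the helper `parse_step_lines` are hypothetical and may need adjusting if the log format differs in your release. It also cross-checks the final summary line by dividing the samples processed by the elapsed time.

```python
import re

# Pattern for the per-step lines in the sample log above.
STEP_LINE = re.compile(
    r"Step=(?P<step>\d+), Loss=(?P<loss>[\d.]+), "
    r"Rate=(?P<rate>[\d.]+) samples/sec, "
    r"GlobalRate=(?P<global_rate>[\d.]+) samples/sec"
)


def parse_step_lines(log_text):
    """Extract (step, loss, rate, global_rate) tuples from run output."""
    rows = []
    for match in STEP_LINE.finditer(log_text):
        rows.append((
            int(match["step"]),
            float(match["loss"]),
            float(match["rate"]),
            float(match["global_rate"]),
        ))
    return rows


# Two lines copied from the sample log above.
sample = (
    "INFO: | Train Device=CSX, Step=50, Loss=8.31250, "
    "Rate=69.37 samples/sec, GlobalRate=69.37 samples/sec\n"
    "INFO: | Train Device=CSX, Step=500, Loss=5.09375, "
    "Rate=66.71 samples/sec, GlobalRate=68.19 samples/sec\n"
)
rows = parse_step_lines(sample)
print(rows[-1])

# Cross-check the summary line: samples processed / elapsed seconds.
overall_rate = 60500 / 887.2672743797302
print(round(overall_rate, 2))
```

The computed overall rate agrees with the GlobalRate reported at the last logged step.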
Note
Cerebras only supports using a single CS-X when running in eval mode.
To scale to multiple CS-X systems, simply add the --num_csx flag specifying the number of CS-X systems. The global batch size divided by the number of CS-X systems will be the effective batch size per device.
Once you have submitted your job to execute in the Cerebras Wafer-Scale cluster, you can track the progress or kill your job using the csctl tool. You can also monitor the performance using a Grafana dashboard.
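The batch-size arithmetic for multi-system runs can be written out as a small helper. This is a sketch: the function name is hypothetical, the global batch size of 1024 is an example value, and rejecting uneven splits is an assumption made here for clarity rather than documented platform behavior.

```python
def per_system_batch_size(global_batch_size, num_csx):
    """Split the global batch size evenly across CS-X systems."""
    if num_csx < 1:
        raise ValueError("num_csx must be at least 1")
    per_system, remainder = divmod(global_batch_size, num_csx)
    if remainder:
        # Assumption for this sketch: require an even split.
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible "
            f"by num_csx={num_csx}")
    return per_system


# Example values only: a global batch size of 1024 spread over 4 systems.
print(per_system_batch_size(1024, 4))
```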
Explore output files and artifacts#
The model directory, specified by the --model_dir flag, contains all results and artifacts from the latest run. These include:
- Checkpoints: Checkpoints are saved in the <model_dir> directory.
- TensorBoard event files: TensorBoard event files are stored in the <model_dir> directory. Event files can be visualized using TensorBoard. Here’s an example of how to launch TensorBoard:
    $ tensorboard --logdir <model_dir> --bind_all
    TensorBoard 2.2.2 at http://<url-to-user-node>:6006/ (Press CTRL+C to quit)
- YAML files: YAML files containing configuration parameters used in the run are stored in the <model_dir>/train or <model_dir>/eval directory, depending on the execution mode.
- Run logs: Stdout from the run is located under <model_dir>/cerebras_logs/latest/run.log. If there are multiple runs, look under the corresponding <model_dir>/cerebras_logs/<train|eval>/<timestamp>/run.log.
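The log locations described above can be collected with a short script. The helper `find_run_logs` below is hypothetical and assumes only the directory layout stated in this section; it returns the path of the latest log and any timestamped logs it finds, without reading them.

```python
from pathlib import Path


def find_run_logs(model_dir):
    """Return the latest run.log path and all timestamped run.log paths."""
    logs_root = Path(model_dir) / "cerebras_logs"
    latest = logs_root / "latest" / "run.log"
    per_run = []
    # Timestamped logs: cerebras_logs/<train|eval>/<timestamp>/run.log
    for mode in ("train", "eval"):
        mode_dir = logs_root / mode
        if mode_dir.is_dir():
            per_run.extend(sorted(mode_dir.glob("*/run.log")))
    return latest, per_run


latest, per_run = find_run_logs("model_dir")
print(latest)
print(len(per_run), "timestamped run log(s) found")
```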
Cancel your job#
If you wish to cancel your job for any reason, issue the following command:
csctl cancel job <jobid>
What’s next?#
Try out our LLM workflow by following the step-by-step instructional tutorial on Training and fine-tuning a Large Language Model (LLM).