The csrun_wse Script#
This section describes how to use the csrun_wse script for training, evaluation, and prediction.
Note
This applies only to pipelined models, for both the Slurm/Singularity workflow and the Kubernetes workflow.
Note
Slurm wrapper scripts (csrun_wse and csrun_cpu) may be customized for your particular environment by your sysadmins and may look different from what is shown below. Check whether your sysadmin's local documentation is available and whether there are any special instructions for your CS-2.
> csrun_wse --help
Usage: csrun_wse [--help] [--mount-dirs] [--total-nodes] [--tasks-per-node] [--cpus-per-task] [--single-task-nodes] [--use-sbatch] command_for_cs_execution
...
...
...
Description#
Runs the given <command_for_cs_execution> command on the CS system.
The following applies:
The specific type of execution task (training, prediction, or evaluation) is specified in the <command_for_cs_execution> command.
The input pipeline is run on a Cerebras server cluster with multiple workers, coordinated by Slurm.
Unless the optional arguments for Slurm configuration are specified, this script uses the following default values:
Default values
total-nodes: $DEF_NODES
tasks-per-node: $DEF_TASKS
cpus-per-task: $DEF_CPUS
Tip
We recommend that you first configure the csrun_cpu script by setting the Slurm variables before running this csrun_wse script. See The csrun_cpu Script.
Arguments#
command_for_cs_execution: A Python command to initiate a task (train, eval, predict) that will execute on the CS system.
--mount-dirs: (Optional) String of comma-separated paths to mount, in addition to the standard paths listed in csrun_cpu. Default is an empty string, i.e., only paths listed in csrun_cpu are mounted.
--total-nodes: (Optional) Number of nodes to execute with. Default is as listed above.
--tasks-per-node: (Optional) Number of tasks per node to execute with. Default is as listed above.
--cpus-per-task: (Optional) Number of CPUs per task to execute with. Default is as listed above. Applies only to the Slurm workflow.
--single-task-nodes: (Optional) Number of nodes, among the total nodes, that will only run a single task. Default is 0, indicating that all nodes will have multiple tasks running on them.
--use-sbatch: (Optional) Adding this flag submits a batch script to Slurm to execute <command_for_cs_execution>. sbatch exits immediately after submitting the script. The script stays in the Slurm queue of pending jobs until resources are allocated. Applies only to the Slurm workflow.
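If you launch csrun_wse from a Python driver, the arguments above can be assembled programmatically. The following is a minimal sketch, not part of the Cerebras tooling: the helper name build_csrun_wse_command and its parameter names are hypothetical, and it only emits the flags listed in the usage text above. Options left unset are omitted, so the script defaults apply.

```python
import shlex
from typing import List, Optional, Sequence


def build_csrun_wse_command(
    command: Sequence[str],
    mount_dirs: Sequence[str] = (),
    total_nodes: Optional[int] = None,
    tasks_per_node: Optional[int] = None,
    cpus_per_task: Optional[int] = None,
    single_task_nodes: Optional[int] = None,
    use_sbatch: bool = False,
) -> List[str]:
    """Return an argv list for csrun_wse (hypothetical helper).

    Options that are None are left out so csrun_wse falls back to its defaults.
    """
    argv = ["csrun_wse"]
    if mount_dirs:
        # --mount-dirs takes one comma-separated string of paths.
        argv.append("--mount-dirs=" + ",".join(mount_dirs))
    numeric_options = [
        ("--total-nodes", total_nodes),
        ("--tasks-per-node", tasks_per_node),
        ("--cpus-per-task", cpus_per_task),
        ("--single-task-nodes", single_task_nodes),
    ]
    for flag, value in numeric_options:
        if value is not None:
            argv.append(f"{flag}={value}")
    if use_sbatch:
        argv.append("--use-sbatch")
    # The command to run on the CS system always comes last.
    argv.extend(command)
    return argv


example = build_csrun_wse_command(
    ["python", "run.py", "--mode=train", "--cs_ip=0.0.0.0"],
    total_nodes=3,
    tasks_per_node=5,
    cpus_per_task=16,
)
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(example))
```

The returned list can be passed to subprocess.run directly, which avoids shell quoting problems with paths in --mount-dirs.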
Examples#
csrun_wse --total-nodes=3 \
--tasks-per-node=5 \
--cpus-per-task=16 \
python run.py --mode=train \
--cs_ip=0.0.0.0
The above csrun_wse command executes the Python command:
python run.py --mode=train --cs_ip=0.0.0.0
which initiates model training on the CS system at the given cs_ip address. As specified in the command line options for Slurm, 3 nodes with 5 workers each, and 16 CPUs assigned per worker, are used for this training task.
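To see what such a request amounts to in total, the arithmetic can be sketched as below. The function name slurm_footprint is hypothetical, and the sketch assumes every node runs the full number of tasks (that is, --single-task-nodes is left at 0).

```python
from typing import Tuple


def slurm_footprint(total_nodes: int, tasks_per_node: int, cpus_per_task: int) -> Tuple[int, int]:
    """Return (total tasks, total CPUs) for a request.

    Assumes all nodes run tasks_per_node tasks (no single-task nodes).
    """
    total_tasks = total_nodes * tasks_per_node
    total_cpus = total_tasks * cpus_per_task
    return total_tasks, total_cpus


# The example above: 3 nodes, 5 workers per node, 16 CPUs per worker.
tasks, cpus = slurm_footprint(3, 5, 16)
print(tasks, cpus)  # 15 240
```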
csrun_wse --mount-dirs="/data/ml,/lab/ml" \
--use-sbatch \
python run.py --mode=eval \
--eval_steps=1000 \
--cs_ip=0.0.0.0
The above csrun_wse command mounts “/data/ml” and “/lab/ml” in addition to the default mount directories, and launches a batch script with the Python command:
python run.py --mode=eval --eval_steps=1000 --cs_ip=0.0.0.0
which initiates model evaluation on the CS system at the given cs_ip address. The default Slurm settings are used.
Checkpoints and logs#
The checkpoints, logs, and event files are stored in model_dir. This is the same model_dir that is passed to the tf.estimator. If the directory already exists, the weights are loaded from the checkpoint file in it.
When training with the CerebrasEstimator, by default a checkpoint is always taken at the beginning and at the end of training.
Tip
If you wish to take checkpoints more frequently, use save_checkpoints_steps in the CSRunConfig. Refer to the Setting the runtime configuration section.
Loss: The loss is stored for TensorBoard based on save_summary_steps. You can set the default value in the CSRunConfig.
Similarly, TensorFlow logging is output based on log_step_count_steps. You can set the default value in the CSRunConfig.
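To check which checkpoints a run has written so far, you can inspect model_dir directly. The sketch below is an illustration only: the function name checkpoint_steps is hypothetical, and it assumes the checkpoints follow the default tf.estimator naming scheme (model.ckpt-<global_step>.index). If your setup names checkpoints differently, adjust the pattern.

```python
import re
from pathlib import Path
from typing import List

# Assumption: default tf.estimator checkpoint naming, e.g. "model.ckpt-1000.index".
_CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def checkpoint_steps(model_dir: str) -> List[int]:
    """Return the sorted global steps for which a checkpoint index file exists."""
    directory = Path(model_dir)
    if not directory.is_dir():
        # No model_dir yet means no checkpoints to resume from.
        return []
    steps = []
    for entry in directory.iterdir():
        match = _CKPT_INDEX.match(entry.name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)
```

With the default behavior described above (a checkpoint at the beginning and at the end of training), a finished run would be expected to list two steps. Setting save_checkpoints_steps would add intermediate ones.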
Debug in single-task mode#
When you run a command such as:
> csrun_wse --nodes=2 --tasks-per-node=6 --cpus-per-task=12 \
python run.py --mode train --params configs/your-params-file.yaml \
--model_dir your-model-dir --cs_ip 10.255.253.0
it runs a total of 12 tasks on 2 nodes, with 6 tasks per node. However, the chief task and the worker tasks are not synchronized, which makes debugging difficult.
For example, a worker task might be streaming data into the CS system while the chief task is receiving data from the CS system. Debugging in this scenario requires stopping and starting both the worker tasks and the chief task at the same time, so that you can examine the CS system output alongside its corresponding input. Because these tasks run asynchronously, starting and stopping them in step is difficult.
The single-task mode helps debugging by running both the chief and the workers on a single node as a single task. This means that you only need to start and stop this single task to examine the data going both into and out of the CS system. The single-task mode can be used with training, evaluation, or prediction jobs.
Important
The single-task mode is intended for debugging purposes only.
Using single-task mode#
To set the single-task mode, set the following two command line arguments to 1, as in: --nodes=1 and --tasks-per-node=1. See below.
> csrun_wse --nodes=1 --tasks-per-node=1 --cpus-per-task=12 \
python run.py --mode train --params configs/your-params-file.yaml \
--model_dir your-model-dir --cs_ip 10.255.253.0
With the above settings, the training job is run on the CS system in a single-task debug mode.
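If you keep your Slurm settings in a Python dictionary, switching a normal run to single-task mode can be sketched as follows. The function names and the dictionary keys (nodes, tasks_per_node, cpus_per_task) are hypothetical and simply mirror the command line arguments used above.

```python
from typing import Dict


def to_single_task(settings: Dict[str, int]) -> Dict[str, int]:
    """Return a copy of the settings with one node and one task per node."""
    debug_settings = dict(settings)  # leave the caller's settings untouched
    debug_settings["nodes"] = 1
    debug_settings["tasks_per_node"] = 1
    return debug_settings


def is_single_task(settings: Dict[str, int]) -> bool:
    """True when both arguments that define single-task mode are set to 1."""
    return settings.get("nodes") == 1 and settings.get("tasks_per_node") == 1


normal_run = {"nodes": 2, "tasks_per_node": 6, "cpus_per_task": 12}
debug_run = to_single_task(normal_run)
print(debug_run)  # {'nodes': 1, 'tasks_per_node': 1, 'cpus_per_task': 12}
```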
Evaluate#
The following is an example command showing how to execute run.py to submit an evaluation job to a CS system cluster. This example passes the Slurm variables as command line argument values.
> csrun_wse --nodes=2 --tasks-per-node=6 --cpus-per-task=12 \
python run.py --mode eval --params configs/your-params-file.yaml \
--model_dir your-model-dir --cs_ip 10.255.253.0
Two evaluation modes are supported:
eval
eval_all
Predict#
The following is an example command showing how to execute run.py to perform prediction with your neural network, using a CS system cluster. This example uses the Slurm variables from the csrun_cpu script, provided by your system administrator. See The csrun_cpu Script.
csrun_wse python run.py --mode predict --params configs/your-params-file.yaml --cs_ip 10.255.253.0
Prediction results#
The results of the inference run are saved as follows:
If the CS system is used, the results are saved in a file named predictions_cs_{est_config.task_id}.npz in your model_dir directory.
If the CS system is not used and a CPU or GPU is used instead, the inference is run using the TensorFlow Estimator and the prediction results are stored in a file named predictions_tf_{est_config.task_id}.npz in your model_dir directory.
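The saved results can be read back with NumPy. The following is a minimal sketch under stated assumptions: the helper names prediction_file and load_predictions are hypothetical, the task ID must be supplied by you (it corresponds to est_config.task_id), and the names of the arrays inside the file depend on your model, so the loader simply returns whatever the archive contains.

```python
from pathlib import Path
from typing import Union


def prediction_file(model_dir: str, task_id: Union[int, str], used_cs_system: bool = True) -> Path:
    """Build the expected results path from the naming scheme described above."""
    backend = "cs" if used_cs_system else "tf"
    return Path(model_dir) / f"predictions_{backend}_{task_id}.npz"


def load_predictions(path: Path) -> dict:
    """Load every array stored in the .npz file into a dict keyed by array name."""
    import numpy as np  # imported here so the path helper works without NumPy

    with np.load(path) as archive:
        return {name: archive[name] for name in archive.files}
```

For example, load_predictions(prediction_file("your-model-dir", 0)) would read predictions_cs_0.npz from your-model-dir, assuming a task ID of 0.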
Sbatch mode#
The default behavior of csrun_cpu uses srun. With srun, Slurm allocates resources and csrun_cpu exits once the Slurm job is finished. With the --use-sbatch flag, csrun_cpu instead submits to Slurm a batch script that executes the command <command_to_execute> using sbatch. sbatch exits immediately after submitting the script. The script stays in the Slurm queue of pending jobs until resources are allocated.
The command used is stored in the file CS_<date>.log, and the standard output and standard error are stored in CS_<date>_<slurm_job_id>.out.
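Because sbatch returns before the job runs, you typically look up the output file afterwards. The sketch below maps Slurm job IDs to their output files using only the CS_<date>_<slurm_job_id>.out naming scheme described above. The function name sbatch_outputs is hypothetical, and the directory in which the files are written is an assumption you need to supply (the exact format of <date> is not assumed).

```python
from pathlib import Path
from typing import Dict


def sbatch_outputs(log_dir: str) -> Dict[str, Path]:
    """Map each Slurm job ID to its CS_<date>_<slurm_job_id>.out file in log_dir."""
    outputs = {}
    for path in Path(log_dir).glob("CS_*_*.out"):
        # The job ID is whatever follows the last underscore in the file stem,
        # which works regardless of how <date> is formatted.
        job_id = path.stem.rsplit("_", 1)[1]
        outputs[job_id] = path
    return outputs
```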
To properly schedule training jobs on the CS system using csrun_wse, define the environment variables GRES_NODE or GRES_RESOURCE inside csrun_cpu. GRES_RESOURCE corresponds to the generic resource identifying the CS system in the Slurm configuration. GRES_NODE corresponds to the dedicated CPU node that manages the CS system.