Limitations of the CerebrasEstimator#
Though CerebrasEstimator inherits from the TensorFlow Estimator, it does not yet support the full breadth of features provided by the TensorFlow Estimator. The differences and limitations are described below.
Important
All the feature limitations listed below, such as lack of support for user hooks, apply only when training on the CS system. The CerebrasEstimator supports all TensorFlow Estimator features when training on GPU or CPU.
Model function limitations#
- Hooks
The CerebrasEstimator currently does not support user-defined hooks, which provide a way to ‘hook into’ certain points of the CerebrasEstimator execution. If such user-defined hooks are present, then CerebrasEstimator will fail with an error about accessing a tensor that is not initialized.
- Eval_metrics
The CerebrasEstimator does not currently support the TensorFlow Metrics API when running on the CS system. This means that during a training run on the CS system, you cannot run eval_metrics operations such as accuracy.
If you would like to use eval_metrics for debugging, the CerebrasEstimator supports the usage of eval_metrics on CPU or GPU.
If the parameter eval_metric_ops is set in the EstimatorSpec returned by the model function, then running the Estimator on the CS system will produce the following error:
[Unsupported TensorFlow] Detected unsupported eval_metric_ops in CS system training run.
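One way to keep a single model function usable on both targets is to attach eval_metric_ops only when the run is not on the CS system. The sketch below builds the keyword arguments that would be passed to EstimatorSpec. The helper name and the on_cs_system flag are hypothetical, and how your script determines the target is not covered here.

```python
def estimator_spec_kwargs(mode, loss, train_op, eval_metric_ops, on_cs_system):
    """Build EstimatorSpec keyword arguments, omitting metrics on the CS system.

    `on_cs_system` is a hypothetical flag supplied by your own script.
    """
    kwargs = {"mode": mode, "loss": loss, "train_op": train_op}
    # Only attach metrics for CPU/GPU runs, where eval_metrics is supported.
    if not on_cs_system and eval_metric_ops:
        kwargs["eval_metric_ops"] = eval_metric_ops
    return kwargs


# Placeholder values stand in for real tensors and ops.
metrics = {"accuracy": "accuracy_op"}
cs_kwargs = estimator_spec_kwargs("train", "loss", "train_op", metrics, True)
gpu_kwargs = estimator_spec_kwargs("train", "loss", "train_op", metrics, False)
print("eval_metric_ops" in cs_kwargs)   # False
print("eval_metric_ops" in gpu_kwargs)  # True
```

In a real model function, the resulting dictionary would be unpacked into the EstimatorSpec constructor.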
Input function differences#
Dataset repeating#
Instead of requiring you to specify the number of epochs you would like to train for, the Estimator requires that you:
- Explicitly set the number of steps you want to train for in the Estimator train function, and
- Use the default parameter of the repeat function (count=None) provided by the Dataset API to ensure that the input_fn keeps providing samples to the CS system until the number of training steps set in the train function is complete.
See the code example below:
dataset = dataset.shuffle(1000).repeat().batch(batch_size, drop_remainder=True)
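If you are used to thinking in epochs, the step count can be derived from the dataset size and batch size. The following is a minimal sketch with a hypothetical helper name and illustrative numbers. Because repeat() comes before batch() in the pipeline above, batches can span epoch boundaries, so the result is a floor estimate rather than an exact epoch count.

```python
def epochs_to_steps(num_samples, batch_size, num_epochs):
    """Estimate the number of training steps equivalent to `num_epochs`."""
    if batch_size <= 0 or num_samples < batch_size:
        raise ValueError("need a positive batch size and at least one full batch")
    # Whole batches that fit in one pass over the data (floor division).
    steps_per_epoch = num_samples // batch_size
    return steps_per_epoch * num_epochs


steps = epochs_to_steps(num_samples=50_000, batch_size=128, num_epochs=3)
print(steps)  # 1170
```

The resulting value is what you would pass as the number of steps to the train function.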
Multiple input workers#
To utilize the full computational capabilities of the CS system, multiple input workers send the training data to the CS system simultaneously.
Note
This means that each worker node must shuffle its data differently.
In the simplest setup, the same input data is replicated across every input worker. Because the dataset is large, the CerebrasEstimator can approximate distributed training with dataset shards by ensuring that each worker shuffles its data differently.
In other words, make sure that you are not providing a deterministic random seed.
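The effect of a fixed seed can be illustrated without TensorFlow. The sketch below uses Python's random module to stand in for two input workers that hold the same replicated data. The worker_order helper is hypothetical. In tf.data, the analogous choice is to leave the seed argument of shuffle unset.

```python
import random


def worker_order(samples, seed=None):
    """Return the order in which one input worker would emit `samples`."""
    rng = random.Random(seed)  # seed=None seeds from system entropy
    order = list(samples)
    rng.shuffle(order)
    return order


samples = range(1000)

# Fixed seed: both workers emit the identical sequence (what to avoid).
fixed_a = worker_order(samples, seed=42)
fixed_b = worker_order(samples, seed=42)
print(fixed_a == fixed_b)  # True

# No seed: each worker shuffles the same data into its own order.
free_a = worker_order(samples)
free_b = worker_order(samples)
print(sorted(free_a) == sorted(free_b))  # True: same data, different order
```

With a fixed seed, every worker would feed the CS system the same samples at the same time, which defeats the approximation of sharded training.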
Return Dataset#
The CerebrasEstimator requires that your input function returns a Dataset (tf.data.Dataset). Each element of the Dataset must be structured to consist of features and labels. See <link to features and labels discussion>
Note
Features must be a tensor. Labels can be a tensor or None.
If the input function does not return a Dataset, then CerebrasEstimator will fail with the following error:
[Unsupported TensorFlow] Input function must return a tf.data.Dataset.
Single dictionary input#
The input function in CerebrasEstimator takes only a single dictionary parameter, params, as input. This dictionary is passed in through the Estimator constructor.
Input function limitations#
- Drop remainder to enforce fixed batch size
The CerebrasEstimator requires that your input function outputs batches of a fixed batch_size across all steps. To enforce this, you must set the drop_remainder parameter provided by the Dataset API to True when batching the Dataset. See TensorFlow documentation for batch.
dataset = dataset.shuffle(1000).repeat().batch(batch_size, drop_remainder=True)
If you do not provide a fixed batch_size, the CerebrasEstimator will fail with the following error:
[Unsupported TensorFlow] Inconsistent batch sizes detected. To ensure a fixed batch size across all steps, set `drop_remainder=True` when batching your Dataset in the input function.
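To see why drop_remainder matters, consider a finite dataset whose size is not a multiple of the batch size. The sketch below is plain Python with a hypothetical helper that mimics the batch sizes produced in each case. It is an illustration of the arithmetic, not the Dataset API itself.

```python
def batch_sizes(num_samples, batch_size, drop_remainder):
    """List the size of each batch produced from `num_samples` elements."""
    full, leftover = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    # Without drop_remainder, the leftover elements form a smaller final batch.
    if leftover and not drop_remainder:
        sizes.append(leftover)
    return sizes


print(batch_sizes(10, 4, drop_remainder=False))  # [4, 4, 2]
print(batch_sizes(10, 4, drop_remainder=True))   # [4, 4]
```

In tf.data, setting drop_remainder=True also lets the batch dimension be known statically, which is what a fixed batch size across all steps requires.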
Config differences#
- Lower bound on save_checkpoint_steps
Because the CS system trains faster than alternative systems, saving checkpoints too frequently can have a significant overall performance impact.
- TF Env Config
This environment variable (see section 1 under Configuration) must be specified when training on the CS system. A default is already provided in our example scripts. Ensure that it is called during training.
- Parameters not supported
save_checkpoint_secs
train_distribute
device_fn
protocol
eval_distribute
experimental_distribute
experimental_max_worker_delay_secs
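Before launching a run on the CS system, it can help to check a configuration for the unsupported parameters listed above. The following is a minimal sketch. The helper name is hypothetical, the parameter names are copied from the list above, and treating a value of None as "not set" is an assumption.

```python
# Names taken from the "Parameters not supported" list above.
UNSUPPORTED_ON_CS = frozenset({
    "save_checkpoint_secs",
    "train_distribute",
    "device_fn",
    "protocol",
    "eval_distribute",
    "experimental_distribute",
    "experimental_max_worker_delay_secs",
})


def find_unsupported(config_kwargs):
    """Return the unsupported parameter names that are set in `config_kwargs`."""
    return sorted(
        name
        for name, value in config_kwargs.items()
        # Assumption: a value of None means the parameter was left unset.
        if name.lower() in UNSUPPORTED_ON_CS and value is not None
    )


config_kwargs = {"save_checkpoint_secs": 600, "device_fn": None, "model_dir": "model_dir"}
print(find_unsupported(config_kwargs))  # ['save_checkpoint_secs']
```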
Compilation differences#
Like most high performance compute devices, the CS system requires application compilation before execution. In a typical training run, this is handled automatically by CerebrasEstimator.
However, because this process can take many minutes, and longer as the complexity of your model increases, Cerebras provides a standalone CerebrasEstimator.compile() function. This function allows you to quickly validate your model code and perform full batched precompiles without connecting to the CS system. Note, however, that a model compiled on one CS system cannot be run on another CS system.