Optimizer and Scheduler#

On this page, you will learn how to configure the Trainer with an Optimizer and one or more Scheduler classes. By the end, you should have a cursory understanding of how to use the Optimizer and Scheduler classes in conjunction with the Trainer class.

Prerequisites#

You must have installed the Cerebras Model Zoo (see the installation instructions if you haven’t).
You must be familiar with the Trainer.
Please ensure you have an understanding of the CSTorch optimizer and scheduler classes by reading the cerebras.pytorch.optim docs.

Basic Usage#

An Optimizer implements an optimization algorithm to control how model parameters are updated. Various hyperparameters such as lr, momentum, and weight_decay can be passed to the Optimizer to give further control. A Scheduler is used in conjunction with an Optimizer to adjust the value of these hyperparameters over the course of a run. Currently, schedulers for lr and weight_decay are supported.

The Trainer takes an optimizer argument. The optimizer updates the model weights during training and is required for any run that trains the model. The optimizer argument can be passed as an Optimizer. For details on all available optimizers, see the CSTorch optimizer class.

The Trainer also accepts a schedulers argument. Schedulers adjust hyperparameters during training, typically by decaying them according to some algorithm. The CSTorch API supports schedulers that adjust either the learning rate or the weight decay. For a full list of available schedulers, see the CSTorch scheduler class.

In the example below, you create an SGD optimizer with a single SequentialLR Scheduler that applies a LinearLR Scheduler for the first 500 steps, then a CosineDecayLR Scheduler for the next 500 steps. The YAML configuration is shown first, followed by the equivalent Python.

trainer:
  init:
    optimizer:
      # Corresponds to cstorch.optim.SGD
      SGD:
        lr: 0.01
        momentum: 0.9
    schedulers:
      - SequentialLR:
          schedulers:
            - LinearLR:
                initial_learning_rate: 0.01
                end_learning_rate: 0.001
                total_iters: 500
            - CosineDecayLR:
                initial_learning_rate: 0.001
                end_learning_rate: 0.0001
                total_iters: 500
    ...
  ...

import cerebras.pytorch as cstorch
from cerebras.modelzoo import Trainer

trainer = Trainer(
    ...,
    optimizer=lambda model: cstorch.optim.SGD(
        model.parameters(),
        lr=0.01,
        momentum=0.9,
    ),
    schedulers=[
        lambda optimizer: cstorch.optim.lr_scheduler.SequentialLR(
            optimizer,
            schedulers=[
                cstorch.optim.lr_scheduler.LinearLR(
                    optimizer,
                    initial_learning_rate=0.01,
                    end_learning_rate=0.001,
                    total_iters=500,
                ),
                cstorch.optim.lr_scheduler.CosineDecayLR(
                    optimizer,
                    initial_learning_rate=0.001,
                    end_learning_rate=0.0001,
                    total_iters=500,
                ),
            ]
        ),
        ...
    ],
    ...,
)
...
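To make the shape of this schedule concrete, the sketch below computes the learning rate at a few steps in plain Python. It assumes the textbook linear-interpolation and half-cosine decay formulas; the exact formulas used by LinearLR and CosineDecayLR are defined in the cerebras.pytorch.optim docs and may differ in detail. The helper names (linear_lr, cosine_decay_lr, sequential_lr) are hypothetical and are not part of CSTorch.

```python
import math


def linear_lr(step, initial, end, total_iters):
    """Linearly interpolate from `initial` to `end` over `total_iters` steps."""
    frac = min(step, total_iters) / total_iters
    return initial + (end - initial) * frac


def cosine_decay_lr(step, initial, end, total_iters):
    """Half-cosine decay from `initial` to `end` (assumed textbook formula)."""
    frac = min(step, total_iters) / total_iters
    return end + (initial - end) * 0.5 * (1.0 + math.cos(math.pi * frac))


def sequential_lr(step):
    """Linear phase for steps 0-499, cosine phase from step 500 onward."""
    if step < 500:
        return linear_lr(step, 0.01, 0.001, 500)
    # The second phase counts its steps from its own start.
    return cosine_decay_lr(step - 500, 0.001, 0.0001, 500)


for step in (0, 250, 500, 750, 1000):
    print(step, f"{sequential_lr(step):.6f}")
```

Under these assumptions the rate starts at 0.01, reaches 0.001 at the hand-off between the two phases, and ends at 0.0001, which is why the end value of the first scheduler matches the initial value of the second.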

Note

In Python, optimizer is passed as a callable, assumed to be a function that takes a torch.nn.Module and returns an Optimizer. It can also be passed as an Optimizer, provided the model is already defined.

Similarly, schedulers is passed as a list of callables, where each element is assumed to be a function that takes an Optimizer and returns a Scheduler. An element can also be passed as a Scheduler, provided the Optimizer is already defined.

Using callables lets you pass in objects without having to construct their inputs beforehand: the model does not need to exist when you describe the optimizer, and the optimizer does not need to exist when you describe the schedulers.
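The deferred-construction pattern described in this note can be illustrated with a small, self-contained sketch. Everything here (ToyModel, ToyOptimizer, ToyScheduler, ToyTrainer, resolve) is a hypothetical stand-in written for illustration; it is not how the Trainer is implemented, only one way such a pattern could work.

```python
class ToyModel:
    """Hypothetical stand-in for a torch.nn.Module."""

    def parameters(self):
        return [1.0, 2.0]


class ToyOptimizer:
    """Hypothetical stand-in for an Optimizer; needs the model's parameters."""

    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr


class ToyScheduler:
    """Hypothetical stand-in for a Scheduler; needs an existing optimizer."""

    def __init__(self, optimizer, total_iters):
        self.optimizer = optimizer
        self.total_iters = total_iters


def resolve(obj_or_factory, dependency):
    """Call a factory with its dependency; pass pre-built objects through."""
    if callable(obj_or_factory):
        return obj_or_factory(dependency)
    return obj_or_factory


class ToyTrainer:
    def __init__(self, model, optimizer, schedulers):
        self.model = model
        # The optimizer is only built once the model exists.
        self.optimizer = resolve(optimizer, self.model)
        # Each scheduler is only built once the optimizer exists.
        self.schedulers = [resolve(s, self.optimizer) for s in schedulers]


trainer = ToyTrainer(
    model=ToyModel(),
    optimizer=lambda model: ToyOptimizer(model.parameters(), lr=0.01),
    schedulers=[lambda optimizer: ToyScheduler(optimizer, total_iters=500)],
)
print(trainer.schedulers[0].optimizer is trainer.optimizer)
```

The caller never has to build the model, optimizer, and schedulers in the right order by hand; the object that owns all three decides when each factory runs.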

Using Tags to Selectively Update#

You can use ModelZoo to filter which parameters a scheduler will update. This is done on the optimizer side by tagging param_groups based on glob-like patterns, and on the scheduler side by specifying which tagged groups to update.

Generating tags in the Optimizer#

The optimizer contains an attribute, param_groups, which is a list of dictionaries containing all parameters. For more information, see the PyTorch documentation.

ModelZoo can tag optimizer param_groups based on glob-like pattern matching of parameter names. These tagged param_groups can then be used to selectively adjust specific parameters.

Parameters can be partitioned and tagged in the YAML configuration or, in Python, with partition_params_group_with_tags. For example:

trainer:
  init:
    optimizer:
      Adam:
        lr: 0.005
        params:
          - params: "*bias"
            tag: "bias_params"

import cerebras.pytorch as cstorch
from cerebras.modelzoo import Trainer
from cerebras.modelzoo.common.optim_utils import partition_params_group_with_tags

filter_params = {
    "params": [{"params": "*bias", "tag": "bias_params"}],
}
trainer = Trainer(
    ...,
    optimizer=lambda model: cstorch.optim.Adam(
        partition_params_group_with_tags(model.parameters(), filter_params),
        lr=0.005,
    ),
    ...,
)
...

This specification would group all parameters whose names end in "bias" into one group with the tag "bias_params". All remaining parameters would be in another group with no tags.

For cases where multiple filters are specified and target overlapping subsets, param_groups will be partitioned into all unique combinations of tags.

For example, if you had parameters named:

fc1.weight
fc1.bias
fc2.weight
fc2.bias

Given these filters:

params:
  - params: "*bias"
    tag: "bias_params"
  - params: "fc1*"
    tag: "fc1_params"

You will end up with param_groups partitioned like this:

[
    {"tags": {"bias_params"}, "params": ("fc2.bias", ...)},
    {"tags": {"fc1_params"}, "params": ("fc1.weight", ...)},
    {"tags": {"bias_params", "fc1_params"}, "params": ("fc1.bias", ...)},
    {"params": ("fc2.weight", ...)},
]
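The partitioning rule can be reproduced with a short, self-contained sketch. It assumes that Python's fnmatch module is a reasonable approximation of the glob-like matching; the real matching rules and the real implementation live in ModelZoo and may differ. The function partition_by_tags is a hypothetical illustration, not the ModelZoo helper, and it works on parameter names only.

```python
from fnmatch import fnmatchcase


def partition_by_tags(param_names, filters):
    """Group parameter names by the exact set of tags whose pattern they match."""
    groups = {}
    for name in param_names:
        # Collect every tag whose glob pattern matches this parameter name.
        tags = frozenset(
            f["tag"] for f in filters if fnmatchcase(name, f["params"])
        )
        groups.setdefault(tags, []).append(name)

    result = []
    for tags, names in groups.items():
        group = {"params": names}
        if tags:  # untagged parameters form a group with no "tags" key
            group["tags"] = set(tags)
        result.append(group)
    return result


names = ["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"]
filters = [
    {"params": "*bias", "tag": "bias_params"},
    {"params": "fc1*", "tag": "fc1_params"},
]
groups = partition_by_tags(names, filters)
for group in groups:
    print(sorted(group.get("tags", [])), group["params"])
```

Keying the groups by the full set of matched tags is what produces one group per unique tag combination, so fc1.bias lands in its own group carrying both tags rather than being duplicated across two groups.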

Currently, the main application for parameter tagging is for selectively applying schedulers to specific parameters.

Note

By default, ModelZoo may perform other partitioning operations on param_groups. This may change the length of param_groups, but the placement of "tags" is still preserved correctly. See configure_param_groups for more details.

Specifying tags in the Scheduler#

Using the param_group_tags argument, individual schedulers can be configured to only target specific optimizer param_groups. For example:

trainer:
  init:
    schedulers:
      - LinearLR:
          initial_learning_rate: 0.01
          end_learning_rate: 0.001
          total_iters: 100
          param_group_tags: "tag1"
    ...
  ...

import cerebras.pytorch as cstorch
from cerebras.modelzoo import Trainer
trainer = Trainer(
    ...,
    schedulers=[
        lambda optimizer: cstorch.optim.lr_scheduler.LinearLR(
            optimizer,
            initial_learning_rate=0.01,
            end_learning_rate=0.001,
            total_iters=100,
            param_group_tags="tag1",
        ),
        ...
    ],
    ...,
)
...

In the example above, the learning rate scheduler specified will only update optimizer param_groups that have the "tag1" tag.
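A minimal sketch of this filtering is shown below, using plain dictionaries in place of real optimizer param_groups. It makes two assumptions that are not stated in this page: a scheduler with no param_group_tags updates every group, and a group is targeted when it carries at least one of the requested tags. The functions linear_lr and step_scheduler are hypothetical and are not part of CSTorch.

```python
def linear_lr(step, initial, end, total_iters):
    """Linearly interpolate from `initial` to `end` over `total_iters` steps."""
    frac = min(step, total_iters) / total_iters
    return initial + (end - initial) * frac


def step_scheduler(param_groups, step, param_group_tags=None):
    """Write the scheduled learning rate into the targeted groups only."""
    if isinstance(param_group_tags, str):
        wanted = {param_group_tags}
    else:
        wanted = set(param_group_tags or ())

    new_lr = linear_lr(step, initial=0.01, end=0.001, total_iters=100)
    for group in param_groups:
        # Assumption: no requested tags means "update every group".
        if not wanted or wanted & group.get("tags", set()):
            group["lr"] = new_lr


# Stand-ins for optimizer.param_groups: one tagged group, one untagged group.
param_groups = [
    {"params": ["fc1.bias", "fc2.bias"], "lr": 0.01, "tags": {"tag1"}},
    {"params": ["fc1.weight", "fc2.weight"], "lr": 0.01},
]

step_scheduler(param_groups, step=100, param_group_tags="tag1")
for group in param_groups:
    print(group["params"], group["lr"])
```

After the call, only the group tagged "tag1" has moved to the scheduled value; the untagged group keeps the learning rate it was created with.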

These tags can be added to param_groups manually, but the most common use case is in conjunction with optimizer tagging, as in the following example:

trainer:
  init:
    optimizer:
      Adam:
        lr: 0.005
        params:
          - params: "*bias"
            tag: "bias_params"
    schedulers:
      - CosineDecayWD:
          initial_weight_decay: 0.01
          end_weight_decay: 0.001
          total_iters: 100
          param_group_tags: "bias_params"

import cerebras.pytorch as cstorch
from cerebras.modelzoo import Trainer
from cerebras.modelzoo.common.optim_utils import partition_params_group_with_tags

filter_params = {
    "params": [{"params": "*bias", "tag": "bias_params"}],
}
trainer = Trainer(
    ...,
    optimizer=lambda model: cstorch.optim.Adam(
        partition_params_group_with_tags(model.parameters(), filter_params),
        lr=0.005,
    ),
    schedulers=[
        lambda optimizer: cstorch.optim.weight_decay_scheduler.CosineDecayWD(
            optimizer,
            initial_weight_decay=0.01,
            end_weight_decay=0.001,
            total_iters=100,
            param_group_tags="bias_params",
        ),
    ],
    ...,
)
...

In the example above, the CosineDecayWD scheduler would only adjust the weight decay of parameters whose names end in "bias".

Conclusion#

That concludes this overview of using the Optimizer and the Scheduler in conjunction with the Trainer. By this point, you should have a cursory understanding of how to construct and configure an Optimizer and a Scheduler inside a Trainer instance.

What’s next?#

To learn more about how to configure checkpointing behaviour using the Trainer, see Model Zoo Trainer - Checkpoint.
