cerebras.modelzoo.layers#
- class cerebras.modelzoo.layers.AlibiPositionEmbeddingLayer(*args, **kwargs)[source]#
Bases:
torch.nn.Module
Alibi Position Embedding Layer; supports the symmetric case with bidirectional attention.
Implements the alibi bias as in the paper: https://arxiv.org/abs/2108.12409
- Parameters
num_heads (int) – number of attention heads.
slopes (Tensor) – slope values to use for alibi heads. Shape: [num_heads, 1]. Defaults to None.
alibi_trainable_slopes (bool) – whether the alibi slopes are trainable parameters.
slopes_initializer (str) – Initializer for the alibi slopes if they are trainable. Defaults to xavier_uniform.
- Returns
Relative position bias, to be used in attention masking
- Return type
position_bias (Tensor)
- forward(seq_length, key_length, past_kv=None, constant_pos_mask=None, batch_size=None)[source]#
Return the position bias based on the alibi slopes.
- Parameters
seq_length (int) – the length of query tokens.
key_length (int) – the length of key tokens.
- Returns
Position bias tensor with shape [num_heads, query_length, key_length]
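Example (a minimal usage sketch; assumes the documented constructor and forward arguments above are accepted as keywords):
>>> import torch
>>> from cerebras.modelzoo.layers import AlibiPositionEmbeddingLayer
>>> alibi = AlibiPositionEmbeddingLayer(num_heads=8)
>>> # Relative position bias for a 128-token self-attention block;
>>> # shape is [num_heads, query_length, key_length] per the docs above.
>>> bias = alibi(seq_length=128, key_length=128)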
- class cerebras.modelzoo.layers.MultiheadAttention(*args, **kwargs)[source]#
Bases:
torch.nn.Module
Multi-head attention layer. Adapted from: https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention.
- Parameters
embed_dim (int) – Number of input units in each projection output
num_heads (int) – Number of attention heads.
inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.
dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature), otherwise the format will be (seq, batch, feature). Default: True (batch, seq, feature).
add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.
add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False
kdim (int) – Number of input units in the key projection
vdim (int) – Number of input units in the value projection
use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
use_ffn_bias (bool) – Whether to use bias in the output projection.
attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.
attention_q_initializer – Query projection kernel initializer. If not specified, the query will be initialized via attention_initializer.
output_layer_initializer (str or initializer) – If not None, use this initializer for the output transform layer. Defaults to None.
bias_initializer (str) – Bias initializer. Defaults to zeros.
attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (=hidden/d_head) instead of sqrt(d).
attention_logits_alpha (float) – Scales the QK^T dot product. Used to stabilize logits in muP training.
softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.
attention_kernel (str | None) – Kernel to use. Uses default if None. Accepted values: None – default implementation; fast_attention – experimental optimized implementation.
device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
- forward(q, k, v, attn_mask=None, key_padding_mask=None, need_weights=False, average_attn_weights=True, past_kv=None, cache_present_kv=False, past_kv_self_attn=True, position_bias=None, rotary_position_embedding_helper=None, layer_idx=None, **extra_args)[source]#
Applies the attention mechanism to queries q, keys k, and values v.
- Parameters
q (Tensor) – Queries, shape [batch_size, seq_length, embed_dim].
k (Tensor) – Keys, shape [batch_size, seq_length, embed_dim].
v (Tensor) – Values, shape [batch_size, seq_length, embed_dim].
attn_mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch, query_length, seq_length].
key_padding_mask (Tensor) – If specified, a mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e. treat as “padding”). Defaults to None.
need_weights (bool) – If specified, returns attn_output_weights in addition to attn_outputs. Default: False.
average_attn_weights (bool) – If True, indicates that the returned attn_weights should be averaged across heads. Otherwise, attn_weights are provided separately per head. Note that this flag only has an effect when need_weights=True. Default: True (i.e. average weights across heads).
past_kv (tuple(tensor, tensor)) – Past keys and values. Tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. The 0th and 1st tensors contain the past keys and values, respectively. Defaults to None.
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
past_kv_self_attn (bool) – Specifies whether the past keys and values should be used for self-attention (True) or cross-attention (False). Ignored if past_kv is not provided. Default: True.
position_bias (Tensor) – Tensor containing position bias to apply in attention with shape [num_heads, query_length, key_length].
rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
- Returns
Attention output tensor with shape [batch_size, seq_length, embed_dim].
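Example (a minimal self-attention sketch; assumes the documented constructor arguments above are accepted as keywords and that batch_first defaults to True as stated):
>>> import torch
>>> from cerebras.modelzoo.layers import MultiheadAttention
>>> attn = MultiheadAttention(embed_dim=64, num_heads=4)
>>> x = torch.rand(2, 16, 64)     # (batch, seq, feature) since batch_first=True
>>> # Self-attention: the same tensor is used for q, k and v.
>>> # Output shape: [batch_size, seq_length, embed_dim] == [2, 16, 64]
>>> out = attn(q=x, k=x, v=x)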
- class cerebras.modelzoo.layers.BatchChannelNorm2D(*args, **kwargs)[source]#
Bases:
torch.nn.Module
Implements Batch Channel Normalization as proposed in Micro-Batch Training with Batch-Channel Normalization and Weight Standardization (https://arxiv.org/abs/1903.10520).
- Parameters
num_groups (int) – number of groups to separate the channels into.
num_channels (int) – number of channels. C from an expected input of size (N, C, H, W).
eps (float) – a value added to the denominator for numerical stability. Default: 1e-5.
momentum (float) – The update rate used for the running_mean and running_var computation. Default: 0.1.
device (torch.device) – Device to place the learnable parameters.
dtype (torch.dtype) – Data type of learnable parameters.
- Shape:
Input: (N, C, H, W)
Output: (N, C, H, W) (same shape as input)
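Example (a minimal usage sketch; assumes the documented arguments above are accepted as constructor keywords):
>>> import torch
>>> from cerebras.modelzoo.layers import BatchChannelNorm2D
>>> bcn = BatchChannelNorm2D(num_groups=4, num_channels=16)
>>> x = torch.rand(8, 16, 32, 32)   # (N, C, H, W)
>>> y = bcn(x)                      # output has the same shape as the input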
- class cerebras.modelzoo.layers.EmbeddingLayer(*args, **kwargs)[source]#
Bases:
torch.nn.Module
Creates token and, optionally, position and segment embeddings.
- Parameters
vocab_size (int) – Size of input vocabulary.
embedding_size (int) – Dimension of the embedding space.
pad_token_id (Optional[int]) – If specified, the entries at padding_idx do not contribute to the gradient; therefore, the embedding vector at padding_idx is not updated during training.
segment_embedding_size (int) – Dimension of the embedding space for segment embeddings. Useful when factorized embeddings are used for tokens and so the size of the embedding space for segments differs from that for tokens. Defaults to the same value as embedding_size.
embeddings_initializer (Optional[str,Callable]) – Token embeddings initializer. Defaults to ‘uniform’.
max_position_embeddings (int) – Maximum sequence length that the model is trained with.
position_embedding_type (str) – ‘learned’, ‘fixed’ or ‘rotary’. Defaults to ‘learned’. For ‘rotary’ embeddings, no embeddings are created in this layer; they are computed on the key and query tensors by RotaryPositionEmbeddingHelper.
position_embedding_offset (int) – Offset for position embeddings. Defaults to 0.
min_timescale (Optional[int]) – The scale of the shortest sinusoid. Defaults to 1.0. (Only needs to be specified when position_embedding_type is ‘fixed’.)
max_timescale (Optional[int]) – The scale of the longest sinusoid. Defaults to 1.0e4. (Only needs to be specified when position_embedding_type is ‘fixed’.)
position_embeddings_initializer (Optional[str,Callable]) – Position embeddings initializer. Defaults to “uniform”.
pos_scaling_factor (Optional[int]) – Scales the position embeddings by pos_scaling_factor. Defaults to 1.
pos_scaling_type (Optional[str]) – The position scaling type. Possible values are ‘YaRN’ and “linear”. Defaults to “linear”.
pos_scaling_extra_args (Optional[str]) – A dict containing args for YaRN (and future) position scaling methods.
num_segments (Optional[int]) – Number of segments for the segment embedding layer. Defaults to None, in which case the segment embedding layer is not created.
segment_embeddings_initializer (Optional[str,Callable]) – Segment embeddings initializer. Defaults to “uniform”.
device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
- forward(input_ids, position_ids=None, segment_ids=None, past_length=0)[source]#
- Convert input_ids to token embeddings according to the embedding type.
Word embeddings (required), segment embeddings (optional) and position embeddings (optional).
- Parameters
input_ids (Tensor) – Input token IDs with shape [batch_size, seq_length].
position_ids (Tensor) – Position IDs with shape [batch_size, seq_length].
segment_ids (Tensor) – Input segment IDs with shape [batch_size, seq_length].
- Returns
Token embedding output with shape [batch_size, seq_length, embedding_size].
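Example (a minimal usage sketch with the default ‘learned’ position embeddings; assumes the documented arguments above are accepted as constructor keywords):
>>> import torch
>>> from cerebras.modelzoo.layers import EmbeddingLayer
>>> emb = EmbeddingLayer(vocab_size=1000, embedding_size=64, max_position_embeddings=128)
>>> input_ids = torch.randint(0, 1000, (2, 16))   # [batch_size, seq_length]
>>> # Token + position embeddings; output shape [batch_size, seq_length, embedding_size]
>>> hidden = emb(input_ids)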
- class cerebras.modelzoo.layers.FeedForwardNetwork(*args, **kwargs)[source]#
Bases:
torch.nn.Module
A feed forward network consisting of a stack of fully connected layers, arranged as a [LinearLayer -> Activation -> Dropout] block repeated len(layers_units) times (see the sketch below).
- Parameters
config (FeedForwardNetworkConfig) – Feed forward network config.
Initialize the FFN object instance.
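The fields of FeedForwardNetworkConfig are not documented here, so the snippet below is only an illustrative plain-PyTorch sketch of the block structure described above ([LinearLayer -> Activation -> Dropout] repeated once per entry of a hypothetical layers_units list); it is not the modelzoo API:
>>> import torch.nn as nn
>>> def ffn_sketch(input_dim, layers_units, activation=nn.GELU, dropout=0.1):
...     blocks, in_dim = [], input_dim
...     for units in layers_units:   # one [Linear -> Activation -> Dropout] block per entry
...         blocks += [nn.Linear(in_dim, units), activation(), nn.Dropout(dropout)]
...         in_dim = units
...     return nn.Sequential(*blocks)
>>> ffn = ffn_sketch(64, [256, 64])  # two stacked blocks: 64 -> 256 -> 64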
- class cerebras.modelzoo.layers.GPTJDecoderLayer(*args, **kwargs)[source]#
Bases:
cerebras.modelzoo.layers.TransformerDecoderLayer.TransformerDecoderLayer
GPTJDecoderLayer inherits from TransformerDecoderLayer and has two modifications:
It uses the parallel decoder architecture instead of the sequential one (see the sketch below).
It supports both GPT-J and GPT-NeoX, the latter of which uses untied layer norm.
Reference: https://www.cerebras.net/blog/how-to-harness-the-predictive-power-of-gpt-j
- Parameters
d_model (int) – the number of expected features in the input (required).
nhead (int) – the number of heads in the multihead-attention models (required).
use_untied_layer_norm (bool) – whether to use untied layer norm. Should be False for GPT-J and True for NeoX.
kwargs – the rest of the arguments are the same as for TransformerDecoderLayer.
- forward(tgt, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, attention_mask=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, self_attn_position_bias=None, cross_attn_position_bias=None, layer_idx=None, expert_hash_idx=None)[source]#
GPTJ layer with rotary position embeddings and parallel decoder architecture
- Parameters
tgt (torch.Tensor) – the sequence to the decoder layer (required).
memory (Optional[torch.Tensor]) – the sequence from the last layer of the encoder (optional).
tgt_mask (Optional[torch.Tensor]) – the mask for the tgt sequence (optional).
memory_mask (Optional[torch.Tensor]) – the mask for the memory sequence (optional).
tgt_key_padding_mask (Optional[torch.Tensor]) – the mask for the tgt keys per batch (optional).
memory_key_padding_mask (Optional[torch.Tensor]) – the mask for the memory keys per batch (optional).
rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
past_kv (Optional[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]) – Past keys and values for the self-attention and (if applicable) cross-attention modules. Key/value tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. (optional).
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. (optional).
self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing position bias to apply in self-attention; can be obtained from relative or alibi position embeddings.
expert_hash_idx (Optional[torch.Tensor]) – tensor containing mixture-of-experts expert selection indices for each token in the batch. Only used when MoE with hash-based routing is enabled (optional).
- Shape:
Output tensor with the same shape as tgt.
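An illustrative sketch of the parallel-residual pattern mentioned above, using stand-in modules (nn.Identity for attention, nn.Linear for the FFN) rather than the modelzoo implementation; in the sequential architecture the two residual updates are applied one after the other, whereas here both branches are computed from the same normalized input and summed:
>>> import torch, torch.nn as nn
>>> d_model = 64
>>> norm = nn.LayerNorm(d_model)                              # tied layer norm (GPT-J style)
>>> attn, ffn = nn.Identity(), nn.Linear(d_model, d_model)    # stand-ins for the real sublayers
>>> x = torch.rand(2, 16, d_model)
>>> h = norm(x)
>>> out = x + attn(h) + ffn(h)    # attention and FFN branches applied in parallel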
- class cerebras.modelzoo.layers.GroupInstanceNorm(*args, **kwargs)[source]#
Bases:
torch.nn.Module
Uses torch.nn.GroupNorm to emulate InstanceNorm by setting number of groups equal to the number of channels.
- Parameters
num_channels (int) – number of channels. C from an expected input of size (N, C, H, W).
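A minimal usage sketch; per the description above, the result should match a torch.nn.GroupNorm configured with one group per channel (assuming matching eps/affine defaults):
>>> import torch
>>> from cerebras.modelzoo.layers import GroupInstanceNorm
>>> x = torch.rand(8, 16, 32, 32)                 # (N, C, H, W)
>>> y = GroupInstanceNorm(num_channels=16)(x)
>>> # Equivalent plain-PyTorch construction described above:
>>> y_ref = torch.nn.GroupNorm(num_groups=16, num_channels=16)(x)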
- class cerebras.modelzoo.layers.MultiQueryAttention(*args, **kwargs)[source]#
Bases:
cerebras.modelzoo.layers.AttentionLayer.MultiheadAttention
Implements the Multi-Query Attention Layer from Fast Transformer Decoding: One Write-Head is All You Need (https://arxiv.org/abs/1911.02150).
- Parameters
embed_dim (int) – Number of input units in each projection output
num_heads (int) – Number of attention heads.
inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.
dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature), otherwise the format will be (seq, batch, feature). Default: True (batch, seq, feature).
add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.
add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False
kdim (int) – Number of output units in key projection
vdim (int) – Number of output units in the value projection
use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
use_ffn_bias (bool) – Whether to use bias in the output projection.
attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.
attention_q_initializer – Query projection kernel initializer. If not specified, the query will be initialized via attention_initializer.
output_layer_initializer (str or initializer) – If not None, use this initializer for the output transform layer. Defaults to None.
bias_initializer (str) – Bias initializer. Defaults to zeros.
attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.
attention_kernel (str | None) – Kernel to use. Uses default if None. Accepted values: None – default implementation; fast_attention – experimental optimized implementation.
device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
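Since MultiQueryAttention shares the MultiheadAttention interface documented above, usage looks the same; the difference is internal (key/value projections are shared across the query heads). A minimal sketch, under the same keyword-argument assumptions as the MultiheadAttention example:
>>> import torch
>>> from cerebras.modelzoo.layers import MultiQueryAttention
>>> attn = MultiQueryAttention(embed_dim=64, num_heads=4)
>>> x = torch.rand(2, 16, 64)     # (batch, seq, feature)
>>> out = attn(q=x, k=x, v=x)     # output shape matches MultiheadAttention: [2, 16, 64]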
- class cerebras.modelzoo.layers.RelativePositionEmbeddingLayer(*args, **kwargs)[source]#
Bases:
torch.nn.Module
Relative Position Embedding Layer
- Parameters
num_heads (int) – number of attention heads.
relative_attention_bias (Tensor) – Tensor with relative attention weights. Shape: [num_relative_attention_buckets, num_heads]. Defaults to None.
num_relative_attention_buckets (int) – Number of buckets used to calculate relative position bias. Default: 32
max_relative_positions (int) – The maximum relative distance used when calculating relative position buckets. See relative_position_bucket docs for more details. Default: 128
bidirectional_relative_attention (bool) – Whether attention is bidirectional.
allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DEBERTA). Default: False.
relative_attn_bias_initializer (str) – Relative attention bias initializer. Defaults to xavier_uniform.
- Returns
Relative position bias, to be used in attention masking
- Return type
position_bias (Tensor)
- forward(seq_length, key_length, past_kv=None)[source]#
Return the position bias.
- Parameters
seq_length (int) – the length of query tokens.
key_length (int) – the length of key tokens.
- Returns
Position bias tensor with shape [num_heads, query_length, key_length]
- static relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128, allow_negative_buckets=False)[source]#
Translate relative position to a bucket number for relative attention. The relative position is defined as memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to position. If bidirectional_relative_attention = False, then positive relative positions are invalid. We use smaller buckets for small absolute relative positions and larger buckets for larger absolute relative positions. All relative positions >= max_distance map to the same bucket. All relative positions <= -max_distance map to the same bucket. This should allow for more graceful generalization to longer sequences than the model has been trained on.
- Parameters
relative_position (Tensor) – Tensor with relative positions.
bidirectional (bool) – Whether attention is bidirectional.
num_buckets (int) – Number of buckets for relative positions.
max_distance (int) – Used in order to calculate relative position buckets.
allow_negative_buckets (bool) – If enabled, position buckets will be both positive and negative (as required by certain models like DEBERTA). Default: False.
- Returns
A Tensor with the same shape as relative_position, containing int32 values in the range [0, num_relative_attention_buckets).
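Example (a minimal usage sketch; assumes the documented arguments above are accepted as keywords):
>>> import torch
>>> from cerebras.modelzoo.layers import RelativePositionEmbeddingLayer
>>> rel = RelativePositionEmbeddingLayer(num_heads=8, bidirectional_relative_attention=True)
>>> bias = rel(seq_length=128, key_length=128)    # [num_heads, query_length, key_length]
>>> # The static bucketing helper can also be called directly:
>>> rel_pos = torch.arange(8)[None, :] - torch.arange(8)[:, None]   # memory_position - query_position
>>> buckets = RelativePositionEmbeddingLayer.relative_position_bucket(rel_pos, bidirectional=True)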
- class cerebras.modelzoo.layers.Transformer(*args, **kwargs)[source]#
Bases:
torch.nn.Module
A transformer model. The user is able to modify the attributes as needed. The architecture is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users can build the BERT (https://arxiv.org/abs/1810.04805) model with corresponding parameters.
- Parameters
d_model (int) – the number of expected features in the encoder/decoder inputs (default=512).
nhead (int) – the number of heads in the multihead attention models (default=8).
num_encoder_layers (int) – the number of sub-encoder-layers in the encoder (default=6).
num_decoder_layers (int) – the number of sub-decoder-layers in the decoder (default=6).
dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
dropout (float) – the dropout value (default=0.1).
activation (Union[str, Callable[[torch.Tensor], torch.Tensor]]) – the activation function of encoder/decoder intermediate layer, can be a string (“relu” or “gelu”) or a unary callable. Default: gelu
custom_encoder (Optional[Any]) – custom encoder (default=None).
custom_decoder (Optional[Any]) – custom decoder (default=None).
layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
norm_first (bool) – if True, encoder and decoder layers will perform LayerNorms before other attention and feedforward operations, otherwise after. Default: False (after).
attention_type – Should be in [“scaled_dot_product”, “dot_product”].
use_projection_bias_in_attention – Add bias to Q,K,V projections in the Attention layer. Defaults to False.
use_ffn_bias_in_attention – Add bias in the concluding FFN in the Attention layer. Defaults to False.
use_ffn_bias – Add bias in all dense layers of the decoder’s ffn sublayer.
attention_initializer – Attention layer initializer. Defaults to “xavier_uniform”.
ffn_initializer – FFN layer initializer. Defaults to “xavier_uniform”.
device (optional) – Device to create the model parameters on, can be a cuda device or CS device.
- Examples::
>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand((10, 32, 512))
>>> tgt = torch.rand((20, 32, 512))
>>> out = transformer_model(src, tgt)
Note: A full example to apply nn.Transformer module for the word language model is available in https://github.com/pytorch/examples/tree/master/word_language_model
- forward(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]#
Take in and process masked source/target sequences.
- Parameters
src (torch.Tensor) – the sequence to the encoder (required).
tgt (torch.Tensor) – the sequence to the decoder (required).
src_mask (Optional[torch.Tensor]) – the additive mask for the src sequence (optional).
tgt_mask (Optional[torch.Tensor]) – the additive mask for the tgt sequence (optional).
memory_mask (Optional[torch.Tensor]) – the additive mask for the encoder output (optional).
src_key_padding_mask (Optional[torch.Tensor]) – the ByteTensor mask for src keys per batch (optional).
tgt_key_padding_mask (Optional[torch.Tensor]) – the ByteTensor mask for tgt keys per batch (optional).
memory_key_padding_mask (Optional[torch.Tensor]) – the ByteTensor mask for memory keys per batch (optional).
- Shape:
src: \((S, E)\) for unbatched input, \((S, N, E)\) if batch_first=False or (N, S, E) if batch_first=True.
tgt: \((T, E)\) for unbatched input, \((T, N, E)\) if batch_first=False or (N, T, E) if batch_first=True.
src_mask: \((S, S)\) or \((N\cdot\text{num\_heads}, S, S)\).
tgt_mask: \((T, T)\) or \((N\cdot\text{num\_heads}, T, T)\).
memory_mask: \((T, S)\).
src_key_padding_mask: \((S)\) for unbatched input otherwise \((N, S)\).
tgt_key_padding_mask: \((T)\) for unbatched input otherwise \((N, T)\).
memory_key_padding_mask: \((S)\) for unbatched input otherwise \((N, S)\).
Note: [src/tgt/memory]_mask ensures that position i is allowed to attend the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight. [src/tgt/memory]_key_padding_mask provides specified elements in the key to be ignored by the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, the positions with the value of True will be ignored while the positions with the value of False will be unchanged.
output: \((T, E)\) for unbatched input, \((T, N, E)\) if batch_first=False or (N, T, E) if batch_first=True.
Note: Due to the multi-head attention architecture in the transformer model, the output sequence length of a transformer is the same as the input sequence (i.e. target) length of the decoder.
where S is the source sequence length, T is the target sequence length, N is the batch size, and E is the feature dimension.
Examples
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
- class cerebras.modelzoo.layers.TransformerDecoder(*args, **kwargs)[source]#
Bases:
torch.nn.Module
TransformerDecoder is a stack of N decoder layers
- Parameters
decoder_layer – an instance of the TransformerDecoderLayer() class (required).
num_layers – the number of sub-decoder-layers in the decoder (required).
norm – the layer normalization component (optional).
- Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = transformer_decoder(tgt, memory)
- forward(tgt, memory=None, tgt_mask=None, sparse_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, self_attn_position_bias=None, cross_attn_position_bias=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, extract_layer_idx=None, expert_hash_idx=None, **extra_args)[source]#
Pass the inputs (and mask) through the decoder layers in turn.
- Parameters
tgt (torch.Tensor) – the sequence to the decoder (required).
memory (Optional[torch.Tensor]) – the sequence from the last layer of the encoder (optional).
tgt_mask (Optional[torch.Tensor]) – the mask for the tgt sequence (optional).
memory_mask (Optional[torch.Tensor]) – the mask for the memory sequence (optional).
tgt_key_padding_mask (Optional[torch.Tensor]) – the mask for the tgt keys per batch (optional).
memory_key_padding_mask (Optional[torch.Tensor]) – the mask for the memory keys per batch (optional).
self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing position bias to apply in self-attention, can be obtained from relative or alibi position embeddings.
cross_attn_position_bias (Optional[torch.Tensor]) – similar to self_attn_position_bias, this is the tensor containing position bias to apply in cross-attention.
rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
past_kv (Optional[List[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]]) – Past keys and values for each of the decoder layers (optional).
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. (optional).
extract_layer_idx (Optional[int]) – Inclusive, zero-indexed layer index in the range [0, self.num_layers). Applies decoder layers up to (and including) extract_layer_idx instead of all decoder layers. For example, extract_layer_idx=3 runs the forward pass from decoder_block_0 through decoder_block_3 and returns the outputs of decoder_block_3. If extract_layer_idx is None and norm is not None, the returned output is decoder_block_{self.num_layers-1} -> norm -> output.
expert_hash_idx (Optional[torch.Tensor]) – Optional tensor for mixture-of-experts models with hash-based routing. Tensor contains the expert ID for each token in the batch based on a hashing calculation.
- Shape:
see the docs in Transformer class.
- class cerebras.modelzoo.layers.TransformerDecoderLayer(*args, **kwargs)[source]#
Bases:
torch.nn.Module
TransformerDecoderLayer is made up of self-attn, multihead-attn and feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
- Parameters
d_model (int) – the number of expected features in the input (required).
nhead (int) – the number of heads in the multihead-attention models (required).
dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
dropout (float) – the dropout value (default=0.1).
activation (Union[str, Callable[[torch.Tensor], torch.Tensor]]) – the activation function of the intermediate layer, can be a string (“relu” or “gelu”) or a unary callable. Default: gelu
layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
norm_layer (Type[torch.nn.Module]) – the normalization class that will be used before/after FF layers (default=nn.LayerNorm)
norm_first (bool) – if True, layer norm is done prior to self attention, multihead attention and feedforward operations, respectively. Otherwise it’s done after. Default: False (after).
attention_dropout_rate (Optional[float]) – Attention dropout rate. If None, defaults to dropout.
attention_softmax_fp32 (Optional[bool]) – Use FP32 softmax in attention block.
use_projection_bias_in_attention – Add bias to Q,K,V projections in the Attention layer. Defaults to False.
attention_type – Should be in [“scaled_dot_product”, “dot_product”]
scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (=hidden/d_head) instead of sqrt(d).
attention_logit_alpha (float) – Scales the QK^T dot product. Used to stabilize logits in muP training.
attention_inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to d_model
add_cross_attention (bool) – If True, adds a cross-attention layer between encoder/decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to False).
use_ffn_bias_in_attention – Add bias in the concluding FFN in the Attention layer. Defaults to False.
use_ffn_bias – Add bias in all dense layers of the decoder’s ffn sublayer
attention_initializer – Attention layer initializer. Defaults to “xavier_uniform”.
attention_q_initializer – Query projection kernel initializer. If not specified, the query will be initialized via attention_initializer.
attention_output_layer_initializer – Attention output layer projection initializer. If not specified, the output will be initialized via attention_initializer.
ffn_initializer – FFN layer initializer. Defaults to “xavier_uniform”.
ffn_output_layer_initializer – If not None, initialize the last FFN layer with this initializer. Defaults to None.
use_ff_layer1_dropout (bool) – If True, dropout will be enabled after the first feed forward layer. Default: True.
use_ff_layer2_dropout (bool) – If True, dropout will be enabled after the second feed forward layer. Default: True.
ffn_dropout_rate (Optional[float]) – Controls dropout rate of FF’s first layer. If None, defaults to dropout.
moe_params – A dict of MoE params including num_experts, top_k and load_balancing_loss_coef
Examples
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
>>> memory = torch.rand(32, 10, 512)
>>> tgt = torch.rand(32, 20, 512)
>>> out = decoder_layer(tgt, memory)
- forward(tgt, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, self_attn_position_bias=None, cross_attn_position_bias=None, layer_idx=None, expert_hash_idx=None, **extra_args)[source]#
Pass the inputs (and mask) through the decoder layer.
- Parameters
tgt (torch.Tensor) – the sequence to the decoder layer (required).
memory (Optional[torch.Tensor]) – the sequence from the last layer of the encoder (optional).
tgt_mask (Optional[torch.Tensor]) – the mask for the tgt sequence (optional).
memory_mask (Optional[torch.Tensor]) – the mask for the memory sequence (optional).
tgt_key_padding_mask (Optional[torch.Tensor]) – the mask for the tgt keys per batch (optional).
memory_key_padding_mask (Optional[torch.Tensor]) – the mask for the memory keys per batch (optional).
rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
past_kv (Optional[Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]]]) – Past keys and values for the self-attention and (if applicable) cross-attention modules. Key/value tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. (optional).
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. (optional).
self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing position bias to apply in self-attention, can be obtained from relative or alibi position embeddings.
expert_hash_idx (Optional[torch.Tensor]) – tensor containing mixture-of-experts expert selection indices for each token in the batch. Only used with MoE with hash-based routing enabled (optional).
- Shape:
see the docs in Transformer class.
- class cerebras.modelzoo.layers.TransformerEncoder(*args, **kwargs)[source]#
Bases:
torch.nn.Module
TransformerEncoder is a stack of N encoder layers
- Parameters
encoder_layer – an instance of the TransformerEncoderLayer() class (required).
num_layers – the number of sub-encoder-layers in the encoder (required).
norm – the layer normalization component (optional).
enable_nested_tensor – if True, input will automatically convert to nested tensor (and convert back on output). This will improve the overall performance of TransformerEncoder when padding rate is high. Default: False (disabled).
- Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
>>> src = torch.rand(10, 32, 512)
>>> out = transformer_encoder(src)
- forward(src, mask=None, src_key_padding_mask=None, rotary_position_embedding_helper=None, self_attn_position_bias=None, extract_layer_idx=None, **extra_args)[source]#
Pass the input through the encoder layers in turn.
- Parameters
src (torch.Tensor) – the sequence to the encoder (required).
mask (Optional[torch.Tensor]) – the mask for the src sequence (optional).
src_key_padding_mask (Optional[torch.Tensor]) – the mask for the src keys per batch (optional).
rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing position bias to apply in self-attention, can be obtained from relative or alibi position embeddings.
extract_layer_idx (Optional[int]) – Inclusive, zero-indexed layer index in the range [0, self.num_layers). Applies encoder layers up to (and including) extract_layer_idx instead of all encoder layers. For example, extract_layer_idx=3 runs the forward pass from encoder_block_0 through encoder_block_3 and returns the outputs of encoder_block_3. If extract_layer_idx is None and norm is not None, the returned output is encoder_block_{self.num_layers-1} -> norm -> output.
- Shape:
see the docs in Transformer class.
- class cerebras.modelzoo.layers.TransformerEncoderLayer(*args, **kwargs)[source]#
Bases:
torch.nn.Module
TransformerEncoderLayer is made up of self-attn and feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
- Parameters
d_model (int) – the number of expected features in the input (required).
nhead (int) – the number of heads in the multihead attention models (required).
dim_feedforward (int) – the dimension of the feedforward network model (default=2048).
dropout (float) – the dropout value (default=0.1).
activation (Union[str, Callable[[torch.Tensor], torch.Tensor]]) – the activation function of the intermediate layer, can be a string (“relu” or “gelu”) or a unary callable. Default: gelu
layer_norm_eps (float) – the eps value in layer normalization components (default=1e-5).
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
norm_layer (Type[torch.nn.Module]) – the normalization class that will be used before/after FF layers (default=nn.LayerNorm)
norm_first (bool) – if True, layer norm is done prior to attention and feedforward operations, respectively. Otherwise it’s done after. Default: False (after).
attention_dropout_rate (Optional[float]) – Attention dropout rate. If None, defaults to dropout.
use_projection_bias_in_attention – Add bias to Q,K,V projections in the Attention layer. Defaults to False.
attention_type – Should be in [“scaled_dot_product”, “dot_product”]
scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (=hidden/d_head) instead of sqrt(d).
attention_softmax_fp32 (Optional[bool]) – Use FP32 softmax in attention block.
attention_inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to d_model
add_cross_attention – If True, adds a cross-attention layer between encoder/decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to False).
use_ffn_bias_in_attention – Add bias in the concluding FFN in the Attention layer. Defaults to False.
use_ffn_bias – Add bias in all dense layers of the decoder’s ffn sublayer
attention_initializer – Attention layer initializer. Defaults to “xavier_uniform”.
attention_q_initializer – Query projection kernel initializer. If not specified, the query will be initialized via attention_initializer.
attention_output_layer_initializer – Attention output layer projection initializer. If not specified, the output will be initialized via attention_initializer.
ffn_initializer – FFN layer initializer. Defaults to “xavier_uniform”.
ffn_output_layer_initializer – If not None, initialize the last FFN layer with this initializer. Defaults to None.
use_ff_layer1_dropout (bool) – If True, dropout will be enabled after the first feed forward layer. Default: True.
use_ff_layer2_dropout (bool) – If True, dropout will be enabled after the second feed forward layer. Default: True.
ffn_dropout_rate (Optional[float]) – Controls dropout rate of FF’s first layer. If None, defaults to dropout.
layerscale_value (Optional[float]) – initial value to use for LayerScale in vision transformers. Defaults to None.
stochastic_depth_drop_prob (Optional[float]) – drop probability for stochastic depth per sample (when applied in the main path of residual blocks).
stochastic_depth_mode (Optional[str]) – should be in [“batch”, “row”].
Example
When batch_first is True:
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
>>> src = torch.rand(32, 10, 512)
>>> out = encoder_layer(src)
- forward(src, src_mask=None, src_key_padding_mask=None, rotary_position_embedding_helper=None, self_attn_position_bias=None, **extra_args)[source]#
Pass the input through the encoder layer.
- Parameters
src (torch.Tensor) – the sequence to the encoder layer (required).
src_mask (Optional[torch.Tensor]) – the mask for the src sequence (optional).
src_key_padding_mask (Optional[torch.Tensor]) – the mask for the src keys per batch (optional).
rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
self_attn_position_bias (Optional[torch.Tensor]) – the tensor containing position bias to apply in self-attention, can be obtained from relative or alibi position embeddings.
- Shape:
see the docs in Transformer class.