cerebras.modelzoo.data_preparation.data_preprocessing.fim_token_generator.FIMTokenGenerator#
- class cerebras.modelzoo.data_preparation.data_preprocessing.fim_token_generator.FIMTokenGenerator(params, tokenizer, eos_id, pad_id)[source]#
-
Initialize the FIMTokenGenerator class.
- Parameters
params (Dict[str, Any]) – Params from config file.
tokenizer – Tokenizer instance.
eos_id (int) – End of sequence token ID.
pad_id (int) – Padding token ID.
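For orientation, a minimal construction sketch follows. The params keys shown and the tokenizer choice are assumptions for illustration, not values prescribed by this reference; in practice params comes from the preprocessing config file.

from transformers import AutoTokenizer

from cerebras.modelzoo.data_preparation.data_preprocessing.fim_token_generator import (
    FIMTokenGenerator,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical tokenizer choice
params = {
    # Hypothetical structure; the real keys come from the preprocessing config.
    "dataset": {},
    "processing": {"max_seq_length": 2048},
}
generator = FIMTokenGenerator(
    params=params,
    tokenizer=tokenizer,
    eos_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,  # hypothetical: reuse EOS as the pad token
)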
Methods
clean_text: Clean the provided text.
encode: Tokenize and encode the data for auto-regressive language modeling.
encode_leftover_prefix: Processes the leftover prefix, a list of ndarray token arrays, into chunks based on the maximum sequence length.
get_allowable_token_ids: Generate a list of token IDs that can be masked.
get_data_stats: Get data statistics from the sample.
mask_single_sequence: Masks tokens in a single sequence according to the MLM strategy.
parse_semantic_data_array: Parse semantic data dictionary.
process_chunks: Processes chunks of tokenized text and returns processed features along with the total padding added.
process_chunks_mlm: Processes chunks of tokenized text for MLM and returns processed features along with the total padding added.
tokenize_data: Tokenize the text and create features for auto-regressive language modeling.
- encode(semantic_data_array)[source]#
Tokenize and encode the data for auto-regressive language modeling.
- Parameters
semantic_data_array (Union[Dict[str, Any], List[Dict[str, Any]]]) – Data to encode.
- Returns
Tuple of encoded features for auto-regressive language modeling and dataset stats.
- Return type
Tuple[Dict[str, Any], Dict[str, int]]
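A hedged call sketch, continuing from the construction example above; the "type"/"content" schema of the semantic data entries is an assumption, not confirmed by this reference.

# Hypothetical semantic data entry; the exact schema is an assumption.
semantic_data_array = [
    {"type": "text", "content": "def add(a, b):\n    return a + b\n"},
]
features, stats = generator.encode(semantic_data_array)
# features: encoded features for auto-regressive LM; stats: dataset statistics.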
- clean_text(data)#
Clean the provided text.
- Parameters
data (str) – Text to clean.
- Returns
Cleaned text.
- Return type
str
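A one-line usage sketch, assuming the generator constructed above:

cleaned = generator.clean_text("  some raw\ttext  ")  # returns the cleaned str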
- encode_leftover_prefix(prefix)#
Processes the leftover prefix, a list of ndarray token arrays, into chunks based on the maximum sequence length.
The last chunk is handled separately if it is shorter than the maximum sequence length; if it has fewer than two tokens, it is discarded.
- Parameters
prefix (List[np.ndarray]) – The prefix list of token arrays to process.
- Returns
A tuple containing the processed token chunks as a list of ndarrays and the dataset stats.
- Return type
Tuple[List[np.ndarray], Dict[str, int]]
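A hedged sketch with a hypothetical leftover prefix, assuming the generator constructed above:

import numpy as np

# Hypothetical token arrays carried over from a previous document.
leftover = [np.array([11, 12, 13, 14], dtype=np.int64)]
chunks, stats = generator.encode_leftover_prefix(leftover)
# A trailing chunk with fewer than two tokens would be discarded, per the rule above.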
- get_allowable_token_ids()#
Generate a list of token IDs that can be masked.
- get_data_stats(sample, lvt=None)#
Get data statistics from the sample.
- Parameters
sample (np.ndarray) – Tokenized sample.
- Returns
Data statistics.
- Return type
Dict[str, int]
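A hedged sketch; the sample layout and the meaning of the returned keys are assumptions here.

import numpy as np

sample = np.array([[11, 12, 13, 0, 0]])  # hypothetical tokenized sample with padding
stats = generator.get_data_stats(sample)  # returns per-sample counts as Dict[str, int]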
- mask_single_sequence(input_ids)#
Masks tokens in a single sequence according to the MLM strategy. When self.mlm_with_gather is False, the returned labels satisfy len(labels) == len(input_ids); when self.mlm_with_gather is True, len(labels) == self.max_predictions.
- Parameters
input_ids (List[int]) – Original sequence of token IDs.
- Returns
input_ids: Modified sequence with masked tokens.
masked_lm_positions: Positions of the masked tokens, empty if not self.mlm_with_gather.
masked_lm_mask: Binary indicators (1s) for positions that were masked, empty if not self.mlm_with_gather.
labels: Original token IDs of the masked tokens for label purposes.
- Return type
Tuple[List[int], List[int], List[int], List[int]]
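A hedged sketch of the four-tuple return, assuming the generator constructed above with an MLM configuration; the token IDs are hypothetical.

token_ids = [11, 12, 13, 14, 15]  # hypothetical token IDs
input_ids, masked_lm_positions, masked_lm_mask, labels = (
    generator.mask_single_sequence(token_ids)
)
# With self.mlm_with_gather False, masked_lm_positions and masked_lm_mask come
# back empty and len(labels) == len(input_ids), per the contract above.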
- parse_semantic_data_array(semantic_data_array)#
Parse semantic data dictionary.
- Parameters
semantic_data_array (Union[Dict[str, Any], List[Dict[str, Any]]]) – Data entry to parse.
- Returns
Parsed text and raw data statistics.
- Return type
Tuple[str, Dict[str, int]]
- process_chunks(tokenized_text_chunks)#
Processes chunks of tokenized text and returns processed features along with the total padding added.
- Parameters
tokenized_text_chunks (List[List[int]]) – A list of tokenized text chunks, where each chunk is represented as a list of integers.
- Returns
A tuple containing a list of processed results and dataset stats.
- Return type
Tuple[List[np.ndarray], Dict[str, int]]
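A hedged sketch with two hypothetical chunks of unequal length, assuming the generator constructed above:

chunks = [[11, 12, 13, 14], [15, 16]]  # hypothetical tokenized text chunks
results, stats = generator.process_chunks(chunks)
# results: list of processed feature arrays; stats includes the total padding added.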
- process_chunks_mlm(tokenized_text_chunks)#
Processes chunks of tokenized text and returns processed features along with the total padding added.
- Parameters
tokenized_text_chunks (List[List[int]]) – A list of tokenized text chunks, where each chunk is represented as a list of integers.
- Returns
A tuple containing a list of processed results and dataset stats.
- Return type
Tuple[List[Any], Dict]
- tokenize_data(semantic_data_array)#
Tokenize the text and create features for auto-regressive language modeling.
- Parameters
semantic_data_array (Union[Dict[str, Any], List[Dict[str, Any]]]) – Data to tokenize.
- Returns
Tuple of encoded features for auto-regressive language modeling and dataset stats.
- Return type
Tuple[List[np.ndarray], Dict[str, int]]
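A hedged end-to-end sketch, assuming the generator constructed above and the same hypothetical semantic data schema as in the encode example:

sample = [{"type": "text", "content": "print('hello world')\n"}]
token_arrays, stats = generator.tokenize_data(sample)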