cerebras.modelzoo.data.common.input_utils.get_data_for_task
- cerebras.modelzoo.data.common.input_utils.get_data_for_task(task_id, meta_data_values_cum_sum, num_examples_per_task, meta_data_values, meta_data_filenames)
Distributes files across tasks for a given number of examples, such that each distributed task has access to exactly the same number of examples.
- Parameters
task_id (int) – Integer id for a task.
meta_data_values_cum_sum (int) – Cumulative sum of the file sizes, in lines, from the meta data file.
num_examples_per_task (int) – Number of examples specified per Slurm task. Equal to batch_size * num_batch_per_task.
meta_data_values (list[int]) – List of the file sizes, in lines, from the meta data file.
meta_data_filenames (list[str]) – List with file names in the meta data file.
- Returns
list of tuples of length 3. Each tuple contains:
- index 0: filepath.
- index 1: number of examples to be considered for this task_id.
- index 2: start index in the file from where these examples should be considered.
The list represents the files that should be considered for this task_id.
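The following is a minimal sketch of how such a split could be computed. It is not the Model Zoo implementation. The function name `sketch_get_data_for_task` is hypothetical, and it assumes that task `task_id` owns the contiguous global example range starting at `task_id * num_examples_per_task`. To avoid guessing the exact layout of `meta_data_values_cum_sum`, the sketch recomputes the cumulative sum from `meta_data_values`.

```python
from bisect import bisect_right
from itertools import accumulate


def sketch_get_data_for_task(
    task_id, num_examples_per_task, meta_data_values, meta_data_filenames
):
    """Hypothetical illustration only; not the library's code.

    Returns a list of (filepath, num_examples, start_index) tuples.
    """
    # cum_sum[i] = total number of examples in files 0..i (inclusive).
    cum_sum = list(accumulate(meta_data_values))

    # Assumption: each task owns a contiguous block of global example indices.
    start = task_id * num_examples_per_task
    end = start + num_examples_per_task
    if not cum_sum or end > cum_sum[-1]:
        raise ValueError("Not enough examples for this task_id")

    result = []
    # First file whose cumulative count exceeds the global start index.
    file_idx = bisect_right(cum_sum, start)
    position = start
    while position < end:
        file_size = meta_data_values[file_idx]
        file_begin = cum_sum[file_idx] - file_size  # global index of the file's first line
        offset = position - file_begin  # start index inside this file
        take = min(file_size - offset, end - position)
        if take > 0:  # skip empty files
            result.append((meta_data_filenames[file_idx], take, offset))
        position += take
        file_idx += 1
    return result


if __name__ == "__main__":
    sizes = [5, 3, 4]
    names = ["a.txt", "b.txt", "c.txt"]
    for task in range(3):
        print(task, sketch_get_data_for_task(task, 4, sizes, names))
```

With three files of 5, 3, and 4 lines and 4 examples per task, task 1 in this sketch takes the last line of the first file and all of the second, so every task still receives exactly 4 examples.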