cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaC4Dataset

class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaC4Dataset(input_dir)

Bases: cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset

Methods

already_shuffled

Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.

dir_path

documents

A generator producing all documents in the dataset.

name

num_docs

num_duplicate_docs

num_short_docs

short_documents_path

Path to the file with short documents.

size

size_duplicate_docs

size_short_docs

stem_dir_path

already_shuffled()

Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
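The override pattern described above can be sketched as follows. This is a minimal illustration, not the library's code: the `Dataset` class here is a hypothetical stand-in for the real base class, the default return value of `False` is an assumption inferred from the description, and `PreShuffledDataset` and `maybe_shuffle` are invented names.

```python
import random


class Dataset:
    """Hypothetical stand-in for the real base class; only the hook is modeled."""

    def already_shuffled(self):
        # Assumed default: the source is not shuffled yet.
        return False


class PreShuffledDataset(Dataset):
    """Hypothetical subclass whose source data is already shuffled."""

    def already_shuffled(self):
        # Override the hook so a later shuffle step can be skipped.
        return True


def maybe_shuffle(dataset, docs, seed=0):
    """Hypothetical caller: shuffle only when the dataset asks for it."""
    docs = list(docs)
    if not dataset.already_shuffled():
        random.Random(seed).shuffle(docs)
    return docs


kept_in_order = maybe_shuffle(PreShuffledDataset(), ["a", "b", "c"])
possibly_reordered = maybe_shuffle(Dataset(), ["a", "b", "c"])
```

With this shape, the pipeline that consumes the dataset never needs to know which concrete class it holds; it only checks the hook.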

documents(process_id, n_process, dup_sh, short_sh)

A generator producing all documents in the dataset.
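The parameters of `documents` are not described on this page. The sketch below shows one plausible reading, under stated assumptions: `process_id` and `n_process` split the document stream across workers by index, and `dup_sh` and `short_sh` identify duplicate and short documents to skip. The function `shard_documents`, its `dup_ids` and `short_ids` arguments (modeled as plain sets of indices), and the round-robin split are all hypothetical and may differ from the actual implementation.

```python
from typing import Iterable, Iterator, Set


def shard_documents(
    all_docs: Iterable[str],
    process_id: int,
    n_process: int,
    dup_ids: Set[int],
    short_ids: Set[int],
) -> Iterator[str]:
    """Yield the documents assigned to one worker, skipping filtered ones.

    Hypothetical helper: assumes round-robin sharding by document index and
    assumes the filter sets hold indices of documents to drop.
    """
    for index, doc in enumerate(all_docs):
        # Each worker handles every n_process-th document.
        if index % n_process != process_id:
            continue
        # Drop documents flagged as duplicates or as too short.
        if index in dup_ids or index in short_ids:
            continue
        yield doc


docs = [f"doc-{i}" for i in range(8)]
worker0 = list(shard_documents(docs, 0, 2, dup_ids={2}, short_ids={4}))
worker1 = list(shard_documents(docs, 1, 2, dup_ids={2}, short_ids={4}))
```

Because the function is a generator, each worker streams its share of the documents without loading the whole dataset into memory.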

short_documents_path()

Path to the file with short documents.