cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication

class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication(datasets, duplicates, short_docs)

Bases: cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset

Methods

already_shuffled

Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.

dir_path

Path to the directory

documents

name

num_docs

Return an estimate of the number of documents in the dataset.

sample_documents

short_documents_path

Path to the file with short documents

size

num_docs()

Return an estimate of the number of documents in the dataset. Implementations may use a faster, less accurate estimate.
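As an illustration of what a faster, less accurate estimate could look like, the sketch below counts lines in a random sample of files and scales the mean up to the full file list. The helper estimate_num_docs and its parameters are hypothetical and not part of the Cerebras Model Zoo. It also assumes one document per line (for example JSONL shards), which this page does not confirm.

```python
import random
from pathlib import Path


def estimate_num_docs(files, sample_size=2, seed=0):
    """Estimate the total document count across `files`.

    Hypothetical helper (not a Model Zoo API). Assumes one document per
    line. Counts lines in a random sample of files and extrapolates the
    mean per-file count to every file.
    """
    files = [Path(f) for f in files]
    if not files:
        return 0
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    sample = rng.sample(files, min(sample_size, len(files)))
    counted = 0
    for path in sample:
        with path.open("r", encoding="utf-8") as fh:
            counted += sum(1 for _ in fh)
    # Mean documents per sampled file, scaled to the full file list.
    return round(counted / len(sample) * len(files))
```

The estimate is exact only when every file holds the same number of documents; raising sample_size trades speed for accuracy.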

already_shuffled()

Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
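The override pattern described here can be sketched as follows. The Dataset class below is a minimal stand-in written for this example, not the real base class, and the default of False is an assumption inferred from the description. PreShuffledDataset and needs_shuffle are likewise hypothetical names.

```python
class Dataset:
    """Stand-in for the real base class; models only the shuffle hook."""

    def already_shuffled(self):
        # Assumed default: the source still needs shuffling.
        return False


class PreShuffledDataset(Dataset):
    """Hypothetical dataset whose source files are already shuffled."""

    def already_shuffled(self):
        # Signal to downstream code that a second shuffle is unnecessary.
        return True


def needs_shuffle(dataset):
    """Hypothetical caller-side check that consults the hook."""
    return not dataset.already_shuffled()
```

A pipeline built this way would shuffle a plain Dataset but skip the step for PreShuffledDataset.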

dir_path()

Path to the directory

short_documents_path()

Path to the file with short documents