cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication
- class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaReplication(datasets, duplicates, short_docs)
Bases: cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.Dataset
Methods
- already_shuffled: Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
- dir_path: Path to the directory
- documents
- name
- num_docs: Return an estimate of the number of documents in the dataset.
- sample_documents
- short_documents_path: Path to the file with short documents
- size
- num_docs()
Return an estimate of the number of documents in the dataset. Implementations may use a faster, less accurate estimate.
- already_shuffled()
Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
- dir_path()
Path to the directory
- short_documents_path()
Path to the file with short documents
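
The override pattern described for already_shuffled() and num_docs() can be illustrated with a minimal, self-contained sketch. It does not import or reproduce the Cerebras classes: DatasetSketch, PreShuffledSketch, and maybe_shuffle are hypothetical stand-ins invented for illustration, and the default return values shown are assumptions rather than documented behavior.

```python
import random
from abc import ABC, abstractmethod


class DatasetSketch(ABC):
    """Hypothetical stand-in for the documented Dataset base class."""

    @abstractmethod
    def documents(self):
        """Yield the documents of the dataset."""

    def num_docs(self):
        # Assumed default: an exact count obtained by iterating everything.
        # Subclasses may return a faster, less accurate estimate instead.
        return sum(1 for _ in self.documents())

    def already_shuffled(self):
        # Assumed default: the source is still in its original order.
        return False


class PreShuffledSketch(DatasetSketch):
    """Hypothetical dataset whose source was shuffled upstream."""

    def __init__(self, docs):
        self._docs = list(docs)

    def documents(self):
        yield from self._docs

    def num_docs(self):
        # Cheap count that avoids iterating over the documents.
        return len(self._docs)

    def already_shuffled(self):
        # Tell the caller not to shuffle this source a second time.
        return True


def maybe_shuffle(dataset, seed=0):
    """Return the documents as a list, shuffling only when needed."""
    docs = list(dataset.documents())
    if not dataset.already_shuffled():
        random.Random(seed).shuffle(docs)
    return docs


dataset = PreShuffledSketch(["doc a", "doc b", "doc c"])
ordered = maybe_shuffle(dataset)  # order is preserved for this dataset
```

In this sketch the caller, not the dataset, decides whether to shuffle, and the dataset only reports its state. Whether the real preprocessing pipeline consults already_shuffled() in the same way is an assumption.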