cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaGithubDataset

class cerebras.modelzoo.data_preparation.nlp.slimpajama.preprocessing.datasets.RedPajamaGithubDataset(input_dir)

Methods

`already_shuffled`	Datasets where the source is already shuffled should override this to return True so that it isn't shuffled again.
`dir_path`
`documents`	A generator producing all documents in the dataset.
`name`
`num_docs`
`num_duplicate_docs`
`num_short_docs`
`short_documents_path`	Path to the file with short documents.
`size`
`size_duplicate_docs`
`size_short_docs`
`stem_dir_path`

already_shuffled(): Datasets where the source is already shuffled should override this to return True so that it isn’t shuffled again.
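
The override pattern described for `already_shuffled` can be illustrated with a minimal sketch. Everything here is hypothetical: `BaseDataset`, `PreShuffledDataset`, and `maybe_shuffle` are stand-ins written for this example and are not part of the ModelZoo API, and the assumption that the base implementation returns False is inferred only from the description above.

```python
import random


class BaseDataset:
    """Hypothetical stand-in for a dataset base class (not the ModelZoo code)."""

    def already_shuffled(self):
        # Assumed default: the source is not shuffled yet.
        return False


class PreShuffledDataset(BaseDataset):
    """Hypothetical dataset whose source files are already in random order."""

    def already_shuffled(self):
        # Override so that callers skip the extra shuffle.
        return True


def maybe_shuffle(dataset, docs, seed=0):
    """Return docs unchanged if the dataset reports it is already shuffled."""
    if dataset.already_shuffled():
        return docs
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

With this sketch, `maybe_shuffle(PreShuffledDataset(), docs)` hands back `docs` untouched, while a `BaseDataset` instance gets a shuffled copy.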

documents(process_id, n_process, dup_sh, short_sh): A generator producing all documents in the dataset.
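
This page does not document what the four parameters mean. The sketch below shows one plausible reading and is not the ModelZoo implementation. It assumes that `process_id` and `n_process` split the work across workers in round-robin fashion, and that `dup_sh` and `short_sh` are collections of document identifiers to skip (duplicates and short documents). The extra `records` argument and the `(doc_id, text)` record shape are hypothetical and exist only to keep the example self-contained.

```python
def documents(records, process_id, n_process, dup_sh, short_sh):
    """Hypothetical sketch of a sharded, filtering document generator.

    records: iterable of (doc_id, text) pairs (assumed shape).
    dup_sh / short_sh: assumed to be collections of doc ids to skip.
    """
    for index, (doc_id, text) in enumerate(records):
        # Each worker handles only the records assigned to it.
        if index % n_process != process_id:
            continue
        # Skip documents flagged as duplicates or as too short.
        if doc_id in dup_sh or doc_id in short_sh:
            continue
        yield text


records = [("a", "alpha"), ("b", "beta"), ("c", "gamma"), ("d", "delta")]
worker0 = list(documents(records, 0, 2, dup_sh={"c"}, short_sh=set()))
worker1 = list(documents(records, 1, 2, dup_sh={"c"}, short_sh={"d"}))
```

Because the function is a generator, documents are produced lazily, so each worker can stream its share without loading the full dataset into memory.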
