cerebras.modelzoo.data_preparation.raw_dataset_processor.utils.Reader
- class cerebras.modelzoo.data_preparation.raw_dataset_processor.utils.Reader(file_list, keys, format_hook_fn)
Bases: object
Initialize the Reader instance.
- Parameters
file_list (List[str]) – List of file paths to be read.
keys (Optional[Dict]) – Dictionary containing the type of each key and its name.
Methods
Handle JSONL data and yield processed entries.
Read and process FASTA file without using Biopython.
Read and process gzipped JSON file.
Read and process JSONL file.
Read and process TAR archive containing ZST compressed JSONL files.
Read and process ZST compressed JSONL file.
Read and process Parquet file.
Read and process text file.
Stream and process data from multiple file formats.
- handle_jsonl(jsonl_reader, get_meta, autojoin_paragraphs, para_joiner)
Handle JSONL data and yield processed entries.
- Parameters
jsonl_reader (Any) – The JSONL reader object.
get_meta (bool) – Whether to extract metadata.
autojoin_paragraphs (bool) – Whether to join paragraphs automatically.
para_joiner (str) – String used to join paragraphs.
- Returns
Yields processed data entries.
- Return type
Iterator[Dict[str, Any]]
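The three flags above can be illustrated with a minimal sketch. This is not the library's implementation: the function name `handle_jsonl_sketch`, the `text_key` parameter, and the `"meta"` field name are assumptions made for illustration, and the real method may treat records differently.

```python
from typing import Any, Dict, Iterable, Iterator


def handle_jsonl_sketch(
    records: Iterable[Dict[str, Any]],
    get_meta: bool = False,
    autojoin_paragraphs: bool = True,
    para_joiner: str = "\n\n",
    text_key: str = "text",  # hypothetical field name, not taken from the docs
) -> Iterator[Dict[str, Any]]:
    """Yield one processed entry per decoded JSONL record."""
    for record in records:
        text = record.get(text_key)
        # Assumption: a list-valued text field holds paragraphs to be joined.
        if autojoin_paragraphs and isinstance(text, list):
            text = para_joiner.join(text)
        entry: Dict[str, Any] = {text_key: text}
        # Assumption: metadata lives under a "meta" field of the record.
        if get_meta:
            entry["meta"] = record.get("meta", {})
        yield entry
```

Because the function is a generator, entries are produced lazily and a large input never has to be held in memory at once.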
- read_txt(file)
Read and process text file.
- Parameters
file (str) – Path to the .txt file.
- Returns
Yields processed data lines.
- Return type
Iterator[Any]
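As a rough illustration of line-by-line streaming from a text file, the following sketch yields lines lazily. The name `read_txt_sketch` is hypothetical, and skipping blank lines is an assumption; the documentation does not say how the actual method treats them.

```python
from typing import Iterator


def read_txt_sketch(path: str) -> Iterator[str]:
    """Yield non-empty lines from a UTF-8 text file, one at a time."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Assumption: blank lines carry no data and are skipped.
            if line:
                yield line
```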
- read_jsongz(file)
Read and process gzipped JSON file.
- Parameters
file (str) – Path to the .json.gz file.
- Returns
Yields processed data entries.
- Return type
Iterator[Any]
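Reading a gzipped JSON file can be sketched with the standard library alone. The name `read_jsongz_sketch` is hypothetical, and the handling of a top-level list versus a single object is an assumption about the file layout, not documented behavior.

```python
import gzip
import json
from typing import Any, Iterator


def read_jsongz_sketch(path: str) -> Iterator[Any]:
    """Decompress a .json.gz file and yield its entries."""
    # "rt" opens the gzip stream in text mode so json.load can read it directly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: a top-level list holds one entry per element;
    # any other document is yielded as a single entry.
    if isinstance(data, list):
        yield from data
    else:
        yield data
```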
- read_jsonl(file, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n')
Read and process JSONL file.
- Parameters
file (str) – Path to the .jsonl file.
get_meta (bool) – Whether to extract metadata. Defaults to False.
autojoin_paragraphs (bool) – Whether to join paragraphs automatically. Defaults to True.
para_joiner (str) – String used to join paragraphs. Defaults to '\n\n'.
- Returns
Yields processed data entries.
- Return type
Iterator[Any]
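A JSONL file stores one JSON document per line, so it can be streamed without loading the whole file. The sketch below shows only that parsing step; the name `read_jsonl_sketch` is hypothetical, and the paragraph-joining and metadata options of the real method are left out.

```python
import json
from typing import Any, Dict, Iterator


def read_jsonl_sketch(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON object per non-blank line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Assumption: blank lines are tolerated and skipped.
            if not line:
                continue
            yield json.loads(line)
```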
- read_jsonl_zst(file, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n')
Read and process ZST compressed JSONL file.
- Parameters
file (str) – Path to the .jsonl.zst file.
get_meta (bool) – Whether to extract metadata. Defaults to False.
autojoin_paragraphs (bool) – Whether to join paragraphs automatically. Defaults to True.
para_joiner (str) – String used to join paragraphs. Defaults to '\n\n'.
- Returns
Yields processed data entries.
- Return type
Iterator[Any]
- read_jsonl_tar(file, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n')
Read and process TAR archive containing ZST compressed JSONL files.
- Parameters
file (str) – Path to the .jsonl.zst.tar file.
get_meta (bool) – Whether to extract metadata. Defaults to False.
autojoin_paragraphs (bool) – Whether to join paragraphs automatically. Defaults to True.
para_joiner (str) – String used to join paragraphs. Defaults to '\n\n'.
- Returns
Yields processed data entries.
- Return type
Iterator[Any]
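Walking a TAR archive member by member can be sketched with the standard `tarfile` module. The name `iter_tar_jsonl_sketch` and the `decompress` callback are hypothetical: the callback stands in for a Zstandard decompressor, which needs a third-party library, and the identity default only handles uncompressed JSONL members.

```python
import json
import tarfile
from typing import Any, Callable, Dict, Iterator


def iter_tar_jsonl_sketch(
    path: str,
    # Hypothetical hook: pass a function that turns compressed bytes into
    # raw JSONL bytes. The default leaves the bytes unchanged.
    decompress: Callable[[bytes], bytes] = lambda raw: raw,
) -> Iterator[Dict[str, Any]]:
    """Yield decoded JSON objects from every regular file in a TAR archive."""
    with tarfile.open(path, "r") as archive:
        for member in archive:
            # Skip directories, links, and other non-file members.
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            payload = decompress(handle.read())
            for line in payload.decode("utf-8").splitlines():
                if line.strip():
                    yield json.loads(line)
```

This sketch reads each member fully into memory before decoding, which keeps the code short; a streaming decompressor would be preferable for large members.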
- read_parquet(file)
Read and process Parquet file.
- Parameters
file (str) – Path to the .parquet file.
- Returns
Yields processed data rows.
- Return type
Iterator[Any]