manifest
Manifest file system for efficient metadata management and file grouping in ETL pipelines.
Provides the ManifestFile class for creating, storing, and retrieving file metadata
collections, enabling optimized batch processing and intelligent file partitioning.
- class s3manifesto.manifest.ManifestFile(uri: str, uri_summary: str, data_file_list: List[DataFile] = <factory>, size: int | None = None, n_record: int | None = None, fingerprint: str | None = None, details: Dict[str, Any] = <factory>)
Core manifest file system consisting of two linked files for efficient metadata management.
Manifest File Structure:
A complete manifest consists of two files that work together:
Manifest Summary File (JSON): Contains aggregate metadata and references. Example:

    {
        "n_files": 50,
        "total_size": 600_000_000, # 600 MB
        "total_records": 100_000,
        "uri": "s3://bucket/prefix/manifest.parquet",
        "fingerprint": "2d0175ad9416dc5fd7138546471738ca"
    }

Manifest Data File (Parquet): Contains detailed per-file metadata. Example:

    +-------------------------------+--------------+----------+----------------------------------+
    | uri                           | size (Bytes) | n_record | Etag                             |
    +-------------------------------+--------------+----------+----------------------------------+
    | s3://bucket/prefix/file1.json | 1_000_000    | 1000     | 8a53247196e46b53699d065ba3cc8e0d |
    +-------------------------------+--------------+----------+----------------------------------+
    | s3://bucket/prefix/file2.json | 2_000_000    | 2000     | b3f20f3c7a8877c24504634edd067fcf |
    +-------------------------------+--------------+----------+----------------------------------+
    | s3://bucket/prefix/file3.json | 3_000_000    | 3000     | dd9b315f1d7ec573cb7305e6e238731f |
    +-------------------------------+--------------+----------+----------------------------------+
    | ...                           | ...          | ...      | ...                              |
    +-------------------------------+--------------+----------+----------------------------------+

Write Process:
When creating a manifest, write the Manifest Summary File first, then the Manifest Data File to S3, ensuring atomicity and consistency.
Read Process:
When reading a manifest, read the Manifest Summary File first to get aggregate statistics and the URI reference, then read the Manifest Data File for detailed metadata.
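The write and read order described above can be sketched with an in-memory dictionary standing in for S3. This is only an illustration, not the library's implementation: `FAKE_S3`, `write_manifest`, and `read_manifest` are hypothetical names, and the per-file metadata is stored as JSON here purely to keep the sketch dependency-free, whereas the real Manifest Data File is Parquet.

```python
import json

# Hypothetical stand-in for an S3 bucket: maps URI -> stored bytes.
FAKE_S3 = {}


def write_manifest(uri, uri_summary, files):
    """Write the summary first, then the data file (order taken from the text above)."""
    summary = {
        "n_files": len(files),
        "total_size": sum(f["size"] for f in files),
        "total_records": sum(f["n_record"] for f in files),
        "uri": uri,  # pointer from the summary file to the data file
    }
    FAKE_S3[uri_summary] = json.dumps(summary).encode("utf-8")
    FAKE_S3[uri] = json.dumps(files).encode("utf-8")


def read_manifest(uri_summary):
    """Read the summary first, then follow its "uri" field to the data file."""
    summary = json.loads(FAKE_S3[uri_summary])
    files = json.loads(FAKE_S3[summary["uri"]])
    return summary, files


files = [
    {"uri": "s3://bucket/prefix/file1.json", "size": 1_000_000, "n_record": 1000},
    {"uri": "s3://bucket/prefix/file2.json", "size": 2_000_000, "n_record": 2000},
]
write_manifest(
    uri="s3://bucket/prefix/manifest.parquet",
    uri_summary="s3://bucket/prefix/manifest.json",
    files=files,
)
summary, loaded = read_manifest("s3://bucket/prefix/manifest.json")
```

A reader that only needs aggregate statistics can stop after loading the summary and skip the second read entirely.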
Simple Usage Examples:
Creating and writing a manifest:

    import boto3
    from s3manifesto.manifest import ManifestFile
    from s3manifesto.model import DataFile

    s3_client = boto3.client("s3")

    # Per-file metadata for the data files covered by this manifest
    data_files = [
        DataFile(uri="s3://bucket/file1.json", size=1000000, n_record=1000, etag="abc123"),
        DataFile(uri="s3://bucket/file2.json", size=2000000, n_record=2000, etag="def456"),
        DataFile(uri="s3://bucket/file3.json", size=3000000, n_record=3000, etag="ghi789"),
    ]
    manifest = ManifestFile.new(
        uri="s3://bucket/manifest-data.parquet",
        uri_summary="s3://bucket/manifest-summary.json",
        data_file_list=data_files,
    )
    manifest.write(s3_client)

Reading a manifest:

    manifest = ManifestFile.read(
        uri_summary="s3://bucket/manifest-summary.json",
        s3_client=s3_client,
    )
    print(f"Total files: {len(manifest.data_file_list)}")
    print(f"Total size: {manifest.size} bytes")

File Partitioning:
Manifest files are essentially collections of file metadata that can be partitioned for parallel processing. Use partition_files_by_size() and partition_files_by_n_record() to split files into balanced groups.

You can use ManifestFile in two ways:
- As a file splitter calculator (in-memory partitioning without S3 storage)
- As a persistent manifest file storage (with S3 read/write operations)

See the Quick Start Guide for complete examples.
- Parameters:
uri – URI of the Manifest Data File (Parquet format)
uri_summary – URI of the Manifest Summary File (JSON format)
data_file_list – List of DataFile objects with metadata
size – Total aggregate size in bytes of all files
n_record – Total aggregate record count across all files
fingerprint – Unique hash for detecting data changes and cache invalidation
details – Additional workflow-specific metadata
- calculate()
Calculate total size, n_record, and fingerprint of the data files in a single pass.
We use pre-calculated values stored as instance attributes rather than lazy-loaded cached properties for performance optimization. Since calculating size, n_record, and fingerprint all require iterating through the data_file_list, using separate cached properties would result in multiple for-loops (one per property access). This single calculate() method performs all computations in one pass, significantly improving efficiency for large file collections.
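The single-pass idea above can be illustrated with a minimal sketch. It is not the library's code: `FileMeta` is a hypothetical stand-in for `DataFile`, `calculate_once` is a hypothetical helper, and the fingerprint recipe (an MD5 digest over the ETags in URI order) is an assumption chosen only to show that all three values can come out of one loop.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FileMeta:
    """Hypothetical stand-in for DataFile, with the same four fields."""
    uri: str
    size: int
    n_record: int
    etag: str


def calculate_once(files: List[FileMeta]) -> Tuple[int, int, str]:
    """Return (total size, total records, fingerprint) from a single loop."""
    total_size = 0
    total_records = 0
    digest = hashlib.md5()
    # Sorting by URI makes the fingerprint independent of list order
    # (an assumption of this sketch, not a documented library behavior).
    for f in sorted(files, key=lambda item: item.uri):
        total_size += f.size
        total_records += f.n_record
        digest.update(f.etag.encode("utf-8"))
    return total_size, total_records, digest.hexdigest()


files = [
    FileMeta("s3://bucket/file1.json", 1_000_000, 1000, "abc123"),
    FileMeta("s3://bucket/file2.json", 2_000_000, 2000, "def456"),
    FileMeta("s3://bucket/file3.json", 3_000_000, 3000, "ghi789"),
]
size, n_record, fingerprint = calculate_once(files)
```

With three separate cached properties, the same list would be traversed three times; here every aggregate is updated in the same iteration.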
- classmethod new(uri: str, uri_summary: str, data_file_list: List[DataFile], size: int | None = None, n_record: int | None = None, fingerprint: str | None = None, details: Dict[str, Any] | None = None, calculate: bool = True) -> Self
Create a new manifest file object. To load manifest file data from S3, use the read() method.
- Parameters:
uri – URI of the manifest data file.
uri_summary – URI of the manifest summary file.
data_file_list – List of data files.
size – Total size of the data files.
n_record – Total number of records in the data files.
calculate – If True, calculate the size and n_record using the data_file_list.
- write(s3_client: S3Client)
Write the manifest file to S3.
- Parameters:
s3_client – boto3.client("s3") object.
- classmethod read(uri_summary: str, s3_client: S3Client) -> Self
Read the manifest file from S3.
- Parameters:
uri_summary – URI of the manifest summary file (not the manifest data file).
s3_client – boto3.client("s3") object.
- partition_files_by_size(target_size: int = 100000000) -> List[DataFileGroup]
Organize data files into balanced task groups, ensuring each group’s total file size approximates a specified target, optimizing workload distribution.
- Parameters:
target_size – Target size for each task group in bytes.
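One simple way to form groups whose total size approximates a target is a sequential greedy pass. The sketch below is an assumption made for illustration and may differ from the strategy `partition_files_by_size()` actually uses; `group_by_target_size` is a hypothetical helper that works on plain `(uri, size)` tuples.

```python
from typing import List, Tuple


def group_by_target_size(
    files: List[Tuple[str, int]],
    target_size: int,
) -> List[List[Tuple[str, int]]]:
    """Greedily pack (uri, size) pairs into groups of roughly target_size bytes."""
    groups: List[List[Tuple[str, int]]] = []
    current: List[Tuple[str, int]] = []
    current_size = 0
    for uri, size in files:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append((uri, size))
        current_size += size
    if current:
        groups.append(current)
    return groups


files = [("a", 40), ("b", 50), ("c", 30), ("d", 120), ("e", 10)]
groups = group_by_target_size(files, target_size=100)
group_uris = [[uri for uri, _ in group] for group in groups]
```

In this sketch a file larger than the target (such as `"d"`) ends up in a group of its own rather than being split or dropped.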
- partition_files_by_n_record(target_n_record: int = 10000000) -> List[DataFileGroup]
Organize data files into balanced task groups, ensuring each group’s total number of records approximates a specified target, optimizing workload distribution.
- Parameters:
target_n_record – Target number of records for each task group.
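When the goal is a particular number of parallel workers rather than a particular group size, the target can be derived from the aggregate record count. The arithmetic below is a rough sketch: `target_for_workers` is a hypothetical helper, and the number of groups actually produced depends on how individual files pack into groups.

```python
import math


def target_for_workers(total_records: int, n_workers: int) -> int:
    """Pick a per-group record target so the groups roughly match the worker count."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Never return 0, so the target stays usable even for an empty manifest.
    return max(1, math.ceil(total_records / n_workers))


total_records = 100_000  # e.g. the "total_records" value from a summary file
target = target_for_workers(total_records, n_workers=8)
expected_groups = math.ceil(total_records / target)
```

The resulting `target` could then be passed as `target_n_record`; treat `expected_groups` as an estimate only.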