manifest

In ETL (Extract, Transform, Load) pipelines, it’s a common practice to group numerous files into appropriately sized batches, each forming a distinct task. This approach optimizes processing efficiency and resource utilization.

However, this method requires an effective mechanism for storing and retrieving metadata. Ideally, the metadata for an entire task can be read in a single operation, without opening each file individually, which reduces I/O operations and improves overall performance.

This module implements an abstraction layer to achieve this functionality. It provides a streamlined interface for grouping files, managing their associated metadata, and enabling efficient batch processing in ETL workflows.

class s3manifesto.manifest.ManifestFile(uri: str, uri_summary: str, data_file_list: typing.List[s3manifesto.typehint.T_DATA_FILE] = <factory>, size: typing.Optional[int] = None, n_record: typing.Optional[int] = None, fingerprint: typing.Optional[str] = None, details: typing.Dict[str, typing.Any] = <factory>)

Manifest file refers to two files:

  • Manifest file: A parquet file that contains the metadata of the data files. Each row in the parquet file is a

Parameters:
  • uri – URI of the manifest file.

  • uri_summary – URI of the manifest summary file.

  • data_file_list – List of data files.

  • size – Total size of the data files.

  • n_record – Total number of records in the data files.

  • fingerprint – A unique fingerprint for the manifest file. It is calculated based on the URI and ETag of the data files.

  • details – Additional details about the manifest file.
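The fingerprint description above can be illustrated with a minimal sketch. It only shows the general idea of deriving one value from the URI and ETag of every data file. The helper name `sketch_fingerprint`, the use of SHA-256, the sorting by URI, and the `"uri"` / `"etag"` dictionary keys are all assumptions made for illustration; the library's actual algorithm may differ.

```python
import hashlib
from typing import Dict, List


def sketch_fingerprint(data_file_list: List[Dict[str, object]]) -> str:
    """Hypothetical helper: hash each file's URI and ETag in a stable order."""
    digest = hashlib.sha256()  # assumption: the real hash function is not documented here
    # Sort by URI so the result does not depend on the order of the list.
    for data_file in sorted(data_file_list, key=lambda f: str(f["uri"])):
        digest.update(str(data_file["uri"]).encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
        digest.update(str(data_file["etag"]).encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# Illustrative data files (assumed keys: "uri" and "etag").
files = [
    {"uri": "s3://bucket/data/1.json", "etag": "aaa"},
    {"uri": "s3://bucket/data/2.json", "etag": "bbb"},
]
fp = sketch_fingerprint(files)
```

Under these assumptions, the same set of files yields the same fingerprint regardless of order, and a changed ETag (that is, changed file content) yields a different one.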

calculate()

Calculate total size and n_record of the data files.
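As a rough illustration of what this calculation amounts to, the sketch below sums the per-file values over a list of data files. The function name `sketch_calculate` and the `"size"` / `"n_record"` dictionary keys are assumptions for illustration, not the library's confirmed data model.

```python
from typing import Dict, List, Tuple


def sketch_calculate(data_file_list: List[Dict[str, int]]) -> Tuple[int, int]:
    """Hypothetical helper: return (total size in bytes, total record count)."""
    total_size = sum(data_file["size"] for data_file in data_file_list)
    total_n_record = sum(data_file["n_record"] for data_file in data_file_list)
    return total_size, total_n_record


# Illustrative data files (assumed keys: "size" and "n_record").
files = [
    {"size": 1_000_000, "n_record": 1_000},
    {"size": 2_000_000, "n_record": 2_000},
]
total_size, total_n_record = sketch_calculate(files)
```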

classmethod new(uri: str, uri_summary: str, data_file_list: List[T_DATA_FILE], size: Optional[int] = None, n_record: Optional[int] = None, fingerprint: Optional[str] = None, details: Optional[Dict[str, Any]] = None, calculate: bool = True)

Create a new manifest file object. To load manifest file data from S3, use the read() method.

Parameters:
  • uri – URI of the manifest data file.

  • uri_summary – URI of the manifest summary file.

  • data_file_list – List of data files.

  • size – Total size of the data files.

  • n_record – Total number of records in the data files.

  • calculate – If True, calculate the size and n_record using the data_file_list.

write(s3_client: S3Client)

Write the manifest file to S3.

Parameters:
  • s3_client – boto3.client("s3") object.

classmethod read(uri_summary: str, s3_client: S3Client)

Read the manifest file from S3.

Parameters:
  • uri_summary – URI of the manifest summary file (not the manifest data file).

  • s3_client – boto3.client("s3") object.

group_files_into_tasks_by_size(target_size: int = 100000000) -> List[Tuple[List[T_DATA_FILE], int]]

Organize data files into balanced task groups, ensuring each group’s total file size approximates a specified target, optimizing workload distribution.

Parameters:
  • target_size – Target size for each task group in bytes.
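To make the grouping idea concrete, here is a minimal sketch of one simple strategy: walk the files in order and close the current group whenever the next file would push it past the target. This greedy approach is an assumption chosen for illustration; the library may use a different and better-balanced algorithm. The function name `sketch_group_by_size` and the `"size"` key are hypothetical. The return shape mirrors the documented one: a list of (files, total) tuples.

```python
from typing import Dict, List, Tuple

DataFile = Dict[str, int]  # assumption: each data file exposes a "size" in bytes


def sketch_group_by_size(
    data_file_list: List[DataFile],
    target_size: int = 100_000_000,
) -> List[Tuple[List[DataFile], int]]:
    """Hypothetical greedy grouping: returns (files, total size) per task group."""
    groups: List[Tuple[List[DataFile], int]] = []
    current: List[DataFile] = []
    current_size = 0
    for data_file in data_file_list:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + data_file["size"] > target_size:
            groups.append((current, current_size))
            current, current_size = [], 0
        current.append(data_file)
        current_size += data_file["size"]
    # Flush the last, possibly partial, group.
    if current:
        groups.append((current, current_size))
    return groups


files = [{"size": s} for s in (60_000_000, 30_000_000, 50_000_000, 50_000_000, 120_000_000)]
groups = sketch_group_by_size(files, target_size=100_000_000)
totals = [total for _, total in groups]
```

In this sketch a single file larger than the target ends up in a group of its own, so a group's total can exceed the target. The same loop works for record counts by reading a record-count field instead of the size.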

group_files_into_tasks_by_n_record(target_n_record: int = 10000000) -> List[Tuple[List[T_DATA_FILE], int]]

Organize data files into balanced task groups, ensuring each group’s total number of records approximates a specified target, optimizing workload distribution.

Parameters:
  • target_n_record – Target number of records for each task group.