model

Data model classes.

class s3manifesto.model.Base

Base class providing common functionality for all data model classes.

Supports serialization and deserialization for distributed processing, where task definitions must be passed between workers and coordinators.

to_dict() → dict[str, Any]

Convert the dataclass instance to a dictionary.

Returns:

A dictionary representation of the dataclass instance.
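
The round trip described above can be sketched as follows. This is a minimal illustration, not the library's source: the `Base` and `FileSpec` classes here are stand-ins that mirror the documented signatures, and using `dataclasses.asdict` inside `to_dict()` is an assumption about how such a method could work.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class Base:
    """Stand-in for the documented base class (not the library's source)."""

    def to_dict(self) -> dict[str, Any]:
        # asdict() recursively converts dataclass fields into plain dicts.
        return asdict(self)


@dataclass
class FileSpec(Base):
    """Stand-in mirroring the documented FileSpec(uri, value) signature."""

    uri: str
    value: int


# The bucket and key below are made up for illustration.
spec = FileSpec(uri="s3://example-bucket/data/part-0001.parquet", value=1024)
payload = spec.to_dict()        # plain dict, easy to JSON-encode or pickle
restored = FileSpec(**payload)  # rebuild the instance on the receiving side
```

A plain dictionary is a convenient wire format here because it carries no class references, so a coordinator and a worker only need to agree on field names.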

class s3manifesto.model.FileSpec(uri: str, value: int)

Lightweight file specification containing URI and a numeric value for grouping.

Essential for divide-and-conquer algorithms that need to partition files by size or record count without loading full metadata, enabling efficient task distribution.

Parameters:
  • uri – Unique identifier for the file location

  • value – Numeric value used for grouping (size in bytes or record count)

class s3manifesto.model.GroupSpec(file_specs: List[FileSpec], value: int)

Represents a balanced group of files with their collective value for optimal task sizing.

Critical for divide-and-conquer processing where work must be distributed evenly across parallel workers, ensuring consistent resource utilization and predictable execution times.

Parameters:
  • file_specs – List of FileSpec objects grouped together

  • value – Total combined value of all files in this group
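
To make the relationship between FileSpec and GroupSpec concrete, here is a small sketch of one way files could be packed into groups. The `group_by_target` helper and its `target` parameter are hypothetical, and the simple in-order greedy strategy is an assumption chosen for brevity; the grouping algorithm the library actually uses is not described on this page. The two dataclasses are stand-ins mirroring the documented signatures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileSpec:
    uri: str
    value: int


@dataclass
class GroupSpec:
    file_specs: List[FileSpec]
    value: int


def group_by_target(file_specs: List[FileSpec], target: int) -> List[GroupSpec]:
    """Hypothetical helper: pack files in order, closing the current group
    when adding the next file would push its total past ``target``."""
    groups: List[GroupSpec] = []
    current: List[FileSpec] = []
    total = 0
    for spec in file_specs:
        if current and total + spec.value > target:
            groups.append(GroupSpec(file_specs=current, value=total))
            current, total = [], 0
        current.append(spec)
        total += spec.value
    if current:
        # Flush the last, possibly undersized, group.
        groups.append(GroupSpec(file_specs=current, value=total))
    return groups


values = [40, 70, 30, 60, 100]
specs = [FileSpec(f"s3://example-bucket/part-{i}", v) for i, v in enumerate(values)]
groups = group_by_target(specs, target=100)
```

Each resulting GroupSpec carries its own total in `value`, so a scheduler can compare group sizes without re-reading the individual files.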

class s3manifesto.model.DataFile(uri: str, etag: str | None = None, size: int | None = None, n_record: int | None = None)

Complete metadata specification for a data file including integrity and size information.

Enables divide-and-conquer workflows to make informed decisions about task partitioning while providing data integrity verification through ETags for reliable distributed processing.

Parameters:
  • uri – Unique S3 URI or file path identifier

  • etag – AWS S3 ETag for data integrity verification

  • size – File size in bytes for resource planning

  • n_record – Number of records for workload estimation

classmethod dump_many_to_dataframe(data_files: Iterable[Self]) → DataFrame

Convert an iterable of DataFile objects to a Polars DataFrame.

Parameters:

data_files – An iterable of DataFile objects.

Returns:

A Polars DataFrame containing the data from the DataFile objects.

classmethod load_many_from_dataframe(df: DataFrame) → List[Self]

Convert a Polars DataFrame to a list of DataFile objects.

Parameters:

df – A Polars DataFrame containing the data.

Returns:

A list of DataFile objects created from the DataFrame.
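
The dump and load pair forms a round trip between objects and tabular rows. The sketch below shows that idea with plain Python row dictionaries so it runs without Polars installed. The `DataFile` class is a stand-in mirroring the documented signature, and `dump_many_to_rows` and `load_many_from_rows` are hypothetical helpers, not library functions. Assuming one DataFrame column per field, a row list of this shape is what `pl.DataFrame(rows)` accepts and what `df.to_dicts()` returns.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class DataFile:
    """Stand-in mirroring the documented DataFile signature."""

    uri: str
    etag: Optional[str] = None
    size: Optional[int] = None
    n_record: Optional[int] = None


def dump_many_to_rows(data_files: Iterable[DataFile]) -> List[Dict[str, Any]]:
    # One dict per file, with one key per dataclass field.
    return [asdict(data_file) for data_file in data_files]


def load_many_from_rows(rows: Iterable[Dict[str, Any]]) -> List[DataFile]:
    # Keep only known fields so extra columns do not break construction.
    names = {f.name for f in fields(DataFile)}
    return [DataFile(**{k: v for k, v in row.items() if k in names}) for row in rows]


# The URIs and values below are made up for illustration.
files = [
    DataFile(uri="s3://example-bucket/a.parquet", etag="abc", size=100, n_record=10),
    DataFile(uri="s3://example-bucket/b.parquet", size=250),
]
rows = dump_many_to_rows(files)
round_tripped = load_many_from_rows(rows)
```

Optional fields that were never set come back as `None`, which corresponds to a null in a DataFrame column.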

class s3manifesto.model.DataFileGroup(data_files: List[DataFile], attr_name: str, value: int)

A collection of DataFile objects grouped together for optimal parallel processing.

Facilitates divide-and-conquer strategies by providing ready-to-execute task units where each group represents a balanced workload for distributed worker nodes.

Parameters:
  • data_files – List of DataFile objects that should be processed together

  • value – Total aggregated value (size or record count) for the entire group
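
As an illustration of how a group's `value` relates to its files, the sketch below sums one attribute across a list of files. It rests on two assumptions that this page does not confirm: that `attr_name` names the DataFile attribute being aggregated (for example `"size"` or `"n_record"`), and that a missing value counts as zero. The `make_group` helper is hypothetical, and the dataclasses are stand-ins mirroring the documented signatures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFile:
    uri: str
    etag: Optional[str] = None
    size: Optional[int] = None
    n_record: Optional[int] = None


@dataclass
class DataFileGroup:
    data_files: List[DataFile]
    attr_name: str
    value: int


def make_group(data_files: List[DataFile], attr_name: str) -> DataFileGroup:
    """Hypothetical helper: aggregate one attribute across the files."""
    # Assumption: files with a missing (None) value contribute zero.
    total = sum(getattr(f, attr_name) or 0 for f in data_files)
    return DataFileGroup(data_files=list(data_files), attr_name=attr_name, value=total)


files = [
    DataFile("s3://example-bucket/a.parquet", size=100, n_record=10),
    DataFile("s3://example-bucket/b.parquet", size=250),
]
by_size = make_group(files, "size")
by_records = make_group(files, "n_record")
```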

class s3manifesto.model.ManifestSummary(manifest: str, size: int | None = None, n_record: int | None = None, fingerprint: str | None = None, details: dict[str, Any] = <factory>)

Compact summary metadata for a manifest file providing quick access to aggregate statistics.

Enables divide-and-conquer coordinators to make informed decisions about task distribution without loading the full manifest data, optimizing planning overhead in large-scale processing.

Parameters:
  • manifest – URI reference to the associated manifest data file

  • size – Total aggregate size in bytes of all files in the manifest

  • n_record – Total aggregate record count across all files in the manifest

  • fingerprint – Unique hash for detecting data changes and cache invalidation

  • details – Additional metadata for workflow-specific information
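
The sketch below shows one way such a summary could be derived from a list of files. It is an assumption-laden illustration: the `summarize` helper is hypothetical, hashing sorted URI and ETag pairs with SHA-256 is just one plausible fingerprint scheme (how the library computes its fingerprint is not stated here), and the `n_file` key in `details` is invented for the example. The dataclasses are stand-ins mirroring the documented signatures.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DataFile:
    uri: str
    etag: Optional[str] = None
    size: Optional[int] = None
    n_record: Optional[int] = None


@dataclass
class ManifestSummary:
    manifest: str
    size: Optional[int] = None
    n_record: Optional[int] = None
    fingerprint: Optional[str] = None
    details: Dict[str, Any] = field(default_factory=dict)


def summarize(manifest_uri: str, data_files: List[DataFile]) -> ManifestSummary:
    """Hypothetical helper: aggregate totals and hash the file identities."""
    digest = hashlib.sha256()
    # Sort by URI so the fingerprint does not depend on listing order.
    for f in sorted(data_files, key=lambda f: f.uri):
        digest.update(f"{f.uri}\t{f.etag}\n".encode("utf-8"))
    return ManifestSummary(
        manifest=manifest_uri,
        size=sum(f.size or 0 for f in data_files),
        n_record=sum(f.n_record or 0 for f in data_files),
        fingerprint=digest.hexdigest(),
        details={"n_file": len(data_files)},  # illustrative key only
    )


files = [
    DataFile("s3://example-bucket/a.parquet", etag="abc", size=100, n_record=10),
    DataFile("s3://example-bucket/b.parquet", etag="def", size=250, n_record=25),
]
summary = summarize("s3://example-bucket/manifest.parquet", files)
```

Because the fingerprint in this sketch covers every URI and ETag, replacing or modifying any file changes it, which is what makes it usable as a cache key.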