model

Data model classes.

class s3manifesto.model.Base

Base class providing common functionality for all data model classes.

Enables efficient serialization and deserialization for distributed processing where task definitions need to be passed between workers and coordinators.

to_dict() → dict[str, Any]

Convert the dataclass instance to a dictionary.

Returns: A dictionary representation of the dataclass instance.
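As a rough illustration of how a dataclass-based `to_dict()` can support passing task definitions between processes, the sketch below uses stand-in classes built on `dataclasses.asdict`. The stand-in `Base` and `FileSpec` only mirror the documented signatures; how s3manifesto actually implements `to_dict()` is an assumption here.

```python
import dataclasses
from typing import Any


@dataclasses.dataclass
class Base:
    """Stand-in for s3manifesto.model.Base (assumed implementation)."""

    def to_dict(self) -> dict[str, Any]:
        # dataclasses.asdict recurses into nested dataclasses, lists, and dicts.
        return dataclasses.asdict(self)


@dataclasses.dataclass
class FileSpec(Base):
    """Stand-in mirroring the documented FileSpec(uri, value) signature."""

    uri: str
    value: int


spec = FileSpec(uri="s3://bucket/data/part-0.parquet", value=1024)
payload = spec.to_dict()        # plain dict, safe to send as JSON
restored = FileSpec(**payload)  # rebuild the object on the receiving side
```

Because the payload is a plain dictionary of built-in types, it can be serialized with `json.dumps` or any similar encoder before being handed to a worker.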

class s3manifesto.model.FileSpec(uri: str, value: int)

Lightweight file specification containing URI and a numeric value for grouping.

Essential for divide-and-conquer algorithms that need to partition files by size or record count without loading full metadata, enabling efficient task distribution.

Parameters:

uri – Unique identifier for the file location
value – Numeric value used for grouping (size in bytes or record count)

class s3manifesto.model.GroupSpec(file_specs: List[FileSpec], value: int)

Represents a balanced group of files with their collective value for optimal task sizing.

Critical for divide-and-conquer processing where work must be distributed evenly across parallel workers, ensuring consistent resource utilization and predictable execution times.

Parameters:

file_specs – List of FileSpec objects grouped together
value – Total combined value of all files in this group
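To make the relationship between FileSpec and GroupSpec concrete, here is a minimal sketch of one way files could be packed into groups near a target value. The `group_by_target` helper and the stand-in dataclasses are hypothetical and written only for illustration; this is a simple sorted next-fit pass, not necessarily the grouping algorithm s3manifesto uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileSpec:
    """Stand-in mirroring the documented FileSpec(uri, value) signature."""
    uri: str
    value: int


@dataclass
class GroupSpec:
    """Stand-in mirroring the documented GroupSpec(file_specs, value) signature."""
    file_specs: List[FileSpec]
    value: int


def group_by_target(file_specs: List[FileSpec], target: int) -> List[GroupSpec]:
    """Hypothetical helper: pack files into groups of at most `target` value."""
    groups: List[GroupSpec] = []
    current: List[FileSpec] = []
    total = 0
    # Walk files from largest to smallest, closing the current group whenever
    # adding the next file would exceed the target. A single file larger than
    # the target ends up in a group of its own.
    for spec in sorted(file_specs, key=lambda s: s.value, reverse=True):
        if current and total + spec.value > target:
            groups.append(GroupSpec(file_specs=current, value=total))
            current, total = [], 0
        current.append(spec)
        total += spec.value
    if current:
        groups.append(GroupSpec(file_specs=current, value=total))
    return groups


specs = [
    FileSpec("s3://bucket/a.json", 60),
    FileSpec("s3://bucket/b.json", 50),
    FileSpec("s3://bucket/c.json", 40),
    FileSpec("s3://bucket/d.json", 30),
    FileSpec("s3://bucket/e.json", 20),
]
groups = group_by_target(specs, target=100)
```

Each resulting GroupSpec carries the sum of its members in `value`, so a coordinator can compare group sizes without inspecting the individual files.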

class s3manifesto.model.DataFile(uri: str, etag: str | None = None, size: int | None = None, n_record: int | None = None)

Complete metadata specification for a data file including integrity and size information.

Enables divide-and-conquer workflows to make informed decisions about task partitioning while providing data integrity verification through ETags for reliable distributed processing.

Parameters:

uri – Unique S3 URI or file path identifier
etag – AWS S3 ETag for data integrity verification
size – File size in bytes for resource planning
n_record – Number of records for workload estimation

classmethod dump_many_to_dataframe(data_files: Iterable[Self]) → DataFrame

Convert an iterable of DataFile objects to a Polars DataFrame.

Parameters: data_files – An iterable of DataFile objects.
Returns: A Polars DataFrame containing the data from the DataFile objects.

classmethod load_many_from_dataframe(df: DataFrame) → List[Self]

Convert a Polars DataFrame to a list of DataFile objects.

Parameters: df – A Polars DataFrame containing the data.
Returns: A list of DataFile objects created from the DataFrame.

class s3manifesto.model.DataFileGroup(data_files: List[DataFile], attr_name: str, value: int)

A collection of DataFile objects grouped together for optimal parallel processing.

Facilitates divide-and-conquer strategies by providing ready-to-execute task units where each group represents a balanced workload for distributed worker nodes.

Parameters:

data_files – List of DataFile objects that should be processed together
attr_name – Name of the DataFile attribute used for grouping (presumably size or n_record)
value – Total aggregated value (size or record count) for the entire group
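The following sketch shows how the three fields of a DataFileGroup could fit together. It assumes that `attr_name` names the DataFile attribute (`size` or `n_record`) whose values are summed into `value`; that reading is inferred from the signature and is not confirmed by the documentation. The `make_group` helper and the stand-in dataclasses are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFile:
    """Stand-in mirroring the documented DataFile signature."""
    uri: str
    etag: Optional[str] = None
    size: Optional[int] = None
    n_record: Optional[int] = None


@dataclass
class DataFileGroup:
    """Stand-in mirroring the documented DataFileGroup signature."""
    data_files: List[DataFile]
    attr_name: str
    value: int


def make_group(data_files: List[DataFile], attr_name: str) -> DataFileGroup:
    """Hypothetical helper: total the chosen attribute across the files."""
    total = 0
    for data_file in data_files:
        amount = getattr(data_file, attr_name)
        if amount is None:
            # Optional metadata must be present to group on it.
            raise ValueError(f"{data_file.uri} has no {attr_name}")
        total += amount
    return DataFileGroup(data_files=data_files, attr_name=attr_name, value=total)


files = [
    DataFile(uri="s3://bucket/part-0.parquet", size=1_000_000, n_record=500),
    DataFile(uri="s3://bucket/part-1.parquet", size=3_000_000, n_record=1_500),
]
by_size = make_group(files, attr_name="size")
by_records = make_group(files, attr_name="n_record")
```

Under this assumption, the same set of files yields a different `value` depending on whether the workload is measured in bytes or in records.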

class s3manifesto.model.ManifestSummary(manifest: str, size: int | None = None, n_record: int | None = None, fingerprint: str | None = None, details: dict[str, Any] = <factory>)

Compact summary metadata for a manifest file providing quick access to aggregate statistics.

Enables divide-and-conquer coordinators to make informed decisions about task distribution without loading the full manifest data, optimizing planning overhead in large-scale processing.

Parameters:

manifest – URI reference to the associated manifest data file
size – Total aggregate size in bytes of all files in the manifest
n_record – Total aggregate record count across all files in the manifest
fingerprint – Unique hash for detecting data changes and cache invalidation
details – Additional metadata for workflow-specific information
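As an illustration of how such a summary could be derived from a list of data files, the sketch below sums the sizes and record counts and hashes the URI and ETag pairs into a fingerprint. The `summarize` helper, the SHA-256 scheme, and the stand-in dataclasses are all assumptions made for this example; the documentation does not specify how s3manifesto computes the fingerprint.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class DataFile:
    """Stand-in mirroring the documented DataFile signature."""
    uri: str
    etag: Optional[str] = None
    size: Optional[int] = None
    n_record: Optional[int] = None


@dataclass
class ManifestSummary:
    """Stand-in mirroring the documented ManifestSummary signature."""
    manifest: str
    size: Optional[int] = None
    n_record: Optional[int] = None
    fingerprint: Optional[str] = None
    details: dict[str, Any] = field(default_factory=dict)


def summarize(manifest_uri: str, data_files: List[DataFile]) -> ManifestSummary:
    """Hypothetical helper: aggregate totals and hash the (uri, etag) pairs."""
    digest = hashlib.sha256()
    # Sort by URI so the fingerprint does not depend on input order.
    for data_file in sorted(data_files, key=lambda f: f.uri):
        digest.update(f"{data_file.uri}\t{data_file.etag}\n".encode("utf-8"))
    return ManifestSummary(
        manifest=manifest_uri,
        # Missing sizes or record counts are treated as zero in this sketch.
        size=sum(f.size or 0 for f in data_files),
        n_record=sum(f.n_record or 0 for f in data_files),
        fingerprint=digest.hexdigest(),
    )


files = [
    DataFile("s3://bucket/part-0.parquet", etag="aaa", size=100, n_record=10),
    DataFile("s3://bucket/part-1.parquet", etag="bbb", size=300, n_record=30),
]
summary = summarize("s3://bucket/manifest.parquet", files)

# Changing one ETag changes the fingerprint, which signals a stale cache.
updated = [files[0], DataFile(files[1].uri, etag="ccc", size=300, n_record=30)]
changed = summarize("s3://bucket/manifest.parquet", updated)
```

A coordinator could compare `summary.fingerprint` against a cached value and skip re-planning when the two match.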