Release and Version History

x.y.z (Backlog)

Features and Improvements

Minor Improvements

Bugfixes

Miscellaneous

1.0.0 (2025-06-23)

🎉 Major Release - Complete Architectural Rewrite

This release represents a fundamental transformation of s3manifesto from a basic file grouping utility into a comprehensive enterprise-grade manifest system for big data ETL orchestration.

💥 Breaking Changes

  • Complete API redesign with immutable dataclasses (frozen=True, slots=True)

  • Method signatures changed: partition_files_by_size() and partition_files_by_n_record() replace the previous group_files_into_tasks_by_size() and group_files_into_tasks_by_n_record() methods

  • All data structures now use modern type hints with typing_extensions

  • Drop support for Python 3.8 and earlier; minimum requirement is now Python 3.9
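As an illustration of what the immutable dataclasses listed above mean for calling code, here is a minimal sketch. The class name ExampleDataFile and its fields are hypothetical and are not the library's actual classes. One fact worth knowing: the slots argument of dataclasses.dataclass exists only on Python 3.10 and later, so the sketch enables it conditionally; whether s3manifesto does the same is an assumption.

```python
import dataclasses
import sys

# dataclass(slots=True) is available only on Python 3.10+, so enable it conditionally.
_SLOTS = {"slots": True} if sys.version_info >= (3, 10) else {}


@dataclasses.dataclass(frozen=True, **_SLOTS)
class ExampleDataFile:  # hypothetical name, not the library's class
    uri: str
    size: int


f = ExampleDataFile(uri="s3://bucket/a.parquet", size=1024)

try:
    f.size = 2048  # assigning to a field of a frozen instance raises
except dataclasses.FrozenInstanceError:
    mutated = False
else:
    mutated = True

# The supported way to "change" a frozen instance is to build a new one.
g = dataclasses.replace(f, size=2048)
```

Code written against 0.x that mutated objects in place would need to switch to a copy-and-replace pattern like the last line.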

Features and Improvements

  • 🚀 Revolutionary Performance Improvements
    • Heap-optimized Best Fit Decreasing (BFD) Algorithm: O(n log k) complexity vs O(n×k) for large datasets, where n is the number of files and k is the number of groups

    • Performance gains of 10x or more: 10K files creating 1K groups takes ~0.1s vs ~10s

    • Scalability breakthrough: Handles 1000+ groups with minimal performance degradation

    • Memory optimization: Single-pass calculation for size, n_record, and fingerprint

  • 🏗️ New Architectural Features
    • Dual-file Manifest System: Manifest Summary (JSON) + Manifest Data (Parquet) for optimal storage and retrieval

    • Fingerprinting: Unique MD5 hashes for change detection and cache invalidation

    • Hierarchical Grouping: Support for million+ file scenarios with bounded memory usage

  • 🔧 Enhanced API and Integrations
    • Polars Integration: Native DataFrame operations with dump_many_to_dataframe() and load_many_from_dataframe()

    • S3-native Operations: Direct AWS S3 integration with automatic compression

    • Human-readable Properties: size_for_human across all data classes

    • Comprehensive Type Safety: 100% type coverage with modern Python type hints

  • 🎯 Divide-and-Conquer Philosophy
    • Separation of Concerns: Clean separation between metadata collection, partitioning algorithms, and execution

    • Composable Abstractions: Works as both in-memory calculator and persistent storage
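To give a feel for where the O(n log k) figure above can come from, here is a simplified sketch of heap-based size grouping. It is not the library's implementation: it sorts sizes in decreasing order and always tries the least-loaded group first, which is closer to a worst-fit-decreasing heuristic than to a true Best Fit Decreasing. The function name partition_by_size and its parameters are hypothetical. Each file costs one heap pop and push, or O(log k), instead of a scan over all k groups.

```python
import heapq
from typing import List, Tuple


def partition_by_size(sizes: List[int], target: int) -> List[Tuple[List[int], int]]:
    """Greedy size-based grouping; returns (member_sizes, total) per group."""
    heap: List[Tuple[int, int]] = []  # min-heap of (current_total, group_index)
    groups: List[List[int]] = []
    for size in sorted(sizes, reverse=True):
        if heap and heap[0][0] + size <= target:
            # The least-loaded group still has room: add the file to it.
            total, idx = heapq.heappop(heap)
            groups[idx].append(size)
            heapq.heappush(heap, (total + size, idx))
        else:
            # No room even in the least-loaded group: open a new group.
            groups.append([size])
            heapq.heappush(heap, (size, len(groups) - 1))
    return [(group, sum(group)) for group in groups]


result = partition_by_size([8, 5, 4, 3, 2, 1], target=10)
```

A file larger than the target ends up alone in its own group, since no existing group can accept it.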

Miscellaneous

  • 📚 Documentation and Developer Experience
    • Comprehensive Documentation: Philosophy guide, performance patterns, and real-world examples

    • Jupyter-style Tutorials: Progressive learning with 8 hands-on sections

    • Research Documentation: Algorithm comparisons and performance benchmarks translated to English

🔄 Migration Guide

Users upgrading from 0.x versions should:

  1. Update method calls: group_files_into_tasks_by_* → partition_files_by_*

  2. Handle new return types: Methods now return DataFileGroup objects with metadata

  3. Leverage fingerprints: Use manifest.fingerprint for change detection workflows

This 1.0.0 release establishes s3manifesto as the foundational tool for big data divide-and-conquer workflows, providing the missing link between file discovery and distributed execution in modern data lake architectures.

0.4.1 (2024-08-10)

Features and Improvements

  • Add ManifestFile.details attribute. It is a dictionary that contains additional information about the manifest file.

0.3.1 (2024-08-10)

💥 Breaking Changes

  • group_files, ManifestFile.group_files_into_tasks_by_size, and ManifestFile.group_files_into_tasks_by_n_record now return List[Tuple[List[T_DATA_FILE], int]] instead of List[List[T_DATA_FILE]].
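A short sketch of what this change means for calling code, reading the new return type as a list of (files, total) pairs. The sample data, the dictionary keys, and the meaning of the integer (a size or record total for the group) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

DataFile = Dict[str, object]  # stand-in for T_DATA_FILE

# Hypothetical result in the new shape: (files_in_group, total) pairs.
groups: List[Tuple[List[DataFile], int]] = [
    (
        [
            {"uri": "s3://bucket/a.json", "size": 60},
            {"uri": "s3://bucket/b.json", "size": 40},
        ],
        100,
    ),
    ([{"uri": "s3://bucket/c.json", "size": 90}], 90),
]

# Code written for the old List[List[T_DATA_FILE]] shape iterated the groups
# directly; with the new shape each element has to be unpacked.
summary = [(len(files), total) for files, total in groups]
```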

0.2.1 (2024-08-10)

💥 Breaking Changes

  • Remove md5 key from Data File typed dictionary, add etag key instead.

  • Remove ManifestFile.group_files_into_tasks from public API.

Features and Improvements

  • Add the following public APIs:
    • ManifestFile.group_files_into_tasks_by_size

    • ManifestFile.group_files_into_tasks_by_n_record

  • Add fingerprint attribute to ManifestFile class. It is a unique fingerprint for the manifest file. It is calculated based on the URI and ETag of the data files.
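One way a fingerprint over URIs and ETags could be computed is sketched below. The function name, the sorting step, and the NUL separator are assumptions and are not taken from the library; MD5 is used because the 1.0.0 notes mention MD5 hashes.

```python
import hashlib
from typing import Iterable, Tuple


def fingerprint(files: Iterable[Tuple[str, str]]) -> str:
    """Hash (uri, etag) pairs into one hex digest, independent of input order."""
    md5 = hashlib.md5()
    for uri, etag in sorted(files):
        # A separator byte keeps ("ab", "c") distinct from ("a", "bc").
        md5.update(uri.encode("utf-8"))
        md5.update(b"\x00")
        md5.update(etag.encode("utf-8"))
        md5.update(b"\x00")
    return md5.hexdigest()


before = fingerprint([("s3://bucket/a.json", "etag-1"), ("s3://bucket/b.json", "etag-2")])
after = fingerprint([("s3://bucket/a.json", "etag-1"), ("s3://bucket/b.json", "etag-3")])
changed = before != after
```

Because an S3 ETag changes when the object content changes, comparing two fingerprints is enough to tell whether any data file in the manifest was rewritten.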

Minor Improvements

Bugfixes

Miscellaneous

0.1.1 (2024-08-08)

  • First release