S3 Manifesto Quick Start Tutorial

This notebook demonstrates how to use s3manifesto for efficient file metadata management and intelligent partitioning in big data workflows.

Overview

s3manifesto provides:

  • Metadata Organization: Consolidate file metadata into manageable collections

  • Two-File Storage: Efficient JSON summary + Parquet data file system

  • Intelligent Partitioning: Best Fit Decreasing heuristic for grouping files into balanced batches close to a target size

  • Divide-and-Conquer Ready: Perfect for distributed processing frameworks

Let’s walk through a complete example!

1. Setup Mock AWS Environment

For this tutorial, we’ll use moto to create a mock AWS environment. In production, you would connect to your actual AWS S3 service.

[1]:
import moto

mock_aws = moto.mock_aws()
mock_aws.start()

2. Create S3 Bucket and Client

Set up our S3 client and create a bucket for storing manifest files.

[2]:
import boto3

boto_ses = boto3.Session(region_name="us-east-1")
s3_client = boto_ses.client("s3")

s3_bucket = "my-bucket"
s3_client.create_bucket(Bucket=s3_bucket)

print(f"✅ Created S3 bucket: {s3_bucket}")
✅ Created S3 bucket: my-bucket

3. Create Sample Data Files and Manifest

Let’s create a manifest containing metadata for 10 sample data files with varying sizes and record counts. The sizes and counts are drawn at random without a fixed seed, so the numbers you get will differ from the output shown here.

What is a Manifest?

A manifest is a collection of file metadata containing:

  • URI: S3 location of each data file

  • Size: File size in bytes for resource planning

  • Record Count: Number of records for workload estimation

  • ETag: S3 entity tag of the file, used to verify data integrity

[18]:
import s3manifesto.api as s3manifesto
import random

n_file = 10
uri = f"s3://{s3_bucket}/manifest.parquet"
uri_summary = f"s3://{s3_bucket}/manifest-summary.json"

data_file_list = list()
for ith in range(1, 1 + n_file):
    data_file_uri = f"s3://{s3_bucket}/data/{ith}.parquet"
    # 500K to 1.5M records per file
    n_record = random.randint(500_000, 1_500_000)
    # 5 MB to 15 MB per file (decimal megabytes)
    size = random.randint(5 * 1_000_000, 15 * 1_000_000)
    data_file = s3manifesto.DataFile(
        uri=data_file_uri,
        etag="...",
        size=size,
        n_record=n_record,
    )
    data_file_list.append(data_file)

manifest_file = s3manifesto.ManifestFile.new(
    uri=uri,  # URI for manifest data file (Parquet format)
    uri_summary=uri_summary,  # URI for manifest summary file (JSON format)
    data_file_list=data_file_list,
    calculate=True,  # Automatically calculate totals from data_file_list
)

print(f"📊 Created manifest with {manifest_file.n_data_file} files")
print(f"📏 Total size: {manifest_file.size_for_human}")
print(f"📦 Total records: {manifest_file.n_record:,}")
print(f"🔐 Fingerprint: {manifest_file.fingerprint}")
📊 Created manifest with 10 files
📏 Total size: 102.92 MB
📦 Total records: 10,036,408
🔐 Fingerprint: 14f938cb78d554e52039879fddb1f1a4
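
The totals printed above are derived from data_file_list when calculate=True is set. As a rough illustration of that arithmetic, the sketch below sums sizes and record counts over a stand-in dataclass. FileMeta and summarize are hypothetical names and not part of s3manifesto; the three entries are the first three rows of the manifest data file from this run (shown in section 4). How the fingerprint is computed is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for s3manifesto.DataFile (illustration only)."""

    uri: str
    size: int  # bytes
    n_record: int


def summarize(files: list[FileMeta]) -> dict[str, int]:
    """Aggregate per-file metadata into manifest-level totals."""
    return {
        "n_data_file": len(files),
        "size": sum(f.size for f in files),
        "n_record": sum(f.n_record for f in files),
    }


files = [
    FileMeta("s3://my-bucket/data/1.parquet", 9_673_507, 732_709),
    FileMeta("s3://my-bucket/data/2.parquet", 14_163_379, 1_297_214),
    FileMeta("s3://my-bucket/data/3.parquet", 10_807_337, 832_962),
]
totals = summarize(files)
print(totals)
```

Running the same sums over all ten files gives the 107,924,092 bytes and 10,036,408 records reported for the full manifest.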

4. Write Manifest to S3

A manifest consists of two files that work together:

  1. Manifest Summary File (JSON): Contains aggregate metadata and a reference to the data file

  2. Manifest Data File (Parquet): Contains detailed per-file metadata in a compact columnar format

This two-file design keeps metadata access cheap: you can check the summary statistics in the small JSON file without loading the full per-file metadata from the Parquet file.

[21]:
manifest_file.write(s3_client=s3_client)
print(f"✅ Written manifest to S3:")
print(f"   📄 Summary: {uri_summary}")
print(f"   📋 Data: {uri}")
✅ Written manifest to S3:
   📄 Summary: s3://my-bucket/manifest-summary.json
   📋 Data: s3://my-bucket/manifest.parquet
[22]:
import json
from rich import print as rprint

res = s3_client.get_object(Bucket=s3_bucket, Key="manifest-summary.json")
content = res["Body"].read().decode("utf-8")
print("content of manifest summary file:")
rprint(json.loads(content))
content of manifest summary file:
{
    'manifest': 's3://my-bucket/manifest.parquet',
    'size': 107924092,
    'n_record': 10036408,
    'fingerprint': '14f938cb78d554e52039879fddb1f1a4',
    'details': {}
}
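
Because the summary is plain JSON, aggregate questions can be answered from it alone. The sketch below parses the summary shown above and derives two figures from it. to_binary_mb is a hypothetical helper, and the 1024-based convention is an inference from this tutorial's output (107,924,092 bytes is displayed as 102.92 MB), not a documented guarantee of size_for_human.

```python
import json

# Summary document copied from the output above.
summary_text = """
{
    "manifest": "s3://my-bucket/manifest.parquet",
    "size": 107924092,
    "n_record": 10036408,
    "fingerprint": "14f938cb78d554e52039879fddb1f1a4",
    "details": {}
}
"""


def to_binary_mb(n_bytes: int) -> str:
    """Format bytes as MB using 1024 * 1024 bytes per MB (inferred convention)."""
    return f"{n_bytes / (1024 * 1024):.2f} MB"


summary = json.loads(summary_text)
avg_record_size = summary["size"] / summary["n_record"]

print(to_binary_mb(summary["size"]))  # 102.92 MB
print(f"{avg_record_size:.2f} bytes per record")
```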
[24]:
import polars as pl

res = s3_client.get_object(Bucket=s3_bucket, Key="manifest.parquet")
df = pl.read_parquet(res["Body"].read())
print("content of manifest data file:")
df
content of manifest data file:
[24]:
shape: (10, 4)
uri                               etag   size      n_record
str                               str    i64       i64
"s3://my-bucket/data/1.parquet"   "..."  9673507   732709
"s3://my-bucket/data/2.parquet"   "..."  14163379  1297214
"s3://my-bucket/data/3.parquet"   "..."  10807337  832962
"s3://my-bucket/data/4.parquet"   "..."  13934496  527477
"s3://my-bucket/data/5.parquet"   "..."  13297418  826390
"s3://my-bucket/data/6.parquet"   "..."  10755103  779863
"s3://my-bucket/data/7.parquet"   "..."  5198248   1460156
"s3://my-bucket/data/8.parquet"   "..."  11495402  1302776
"s3://my-bucket/data/9.parquet"   "..."  12036096  890199
"s3://my-bucket/data/10.parquet"  "..."  6563106   1386662

5. Read Manifest from S3

When reading a manifest, you only need to provide the summary file URI. The library automatically:

  1. Reads the summary file first to get aggregate statistics

  2. Uses the reference to locate and read the detailed data file

  3. Reconstructs the complete manifest object

This makes the summary file the single entry point: callers track one URI, and the location of the Parquet data file is resolved from the summary.

[25]:
manifest_file = s3manifesto.ManifestFile.read(
    uri_summary=uri_summary,
    s3_client=s3_client,
)

print(f"📖 Read manifest from S3:")
print(f"   📊 Files: {manifest_file.n_data_file}")
print(f"   📏 Total size: {manifest_file.size_for_human}")
print(f"   📦 Total records: {manifest_file.n_record:,}")
📖 Read manifest from S3:
   📊 Files: 10
   📏 Total size: 102.92 MB
   📦 Total records: 10,036,408
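
Step 2 of the read flow, following the reference from the summary to the data file, can be sketched with the standard library alone. split_s3_uri and locate_data_file are hypothetical helpers written for illustration; they are not s3manifesto functions, and the summary text is trimmed to the field that matters here.

```python
import json
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def locate_data_file(summary_text: str) -> tuple[str, str]:
    """Follow the summary's 'manifest' reference to the data file location."""
    summary = json.loads(summary_text)
    return split_s3_uri(summary["manifest"])


summary_text = '{"manifest": "s3://my-bucket/manifest.parquet"}'
bucket, key = locate_data_file(summary_text)
print(bucket, key)  # my-bucket manifest.parquet
```

The resulting bucket and key are what a get_object call, like the ones in section 4, would need to fetch the Parquet file.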

6. Intelligent File Partitioning by Size

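One of the most useful features of s3manifesto is file partitioning. It uses the Best Fit Decreasing (BFD) bin-packing heuristic: files are considered from largest to smallest, and each one goes into the group it fills most tightly without exceeding the target. The result is a set of balanced batches suited to parallel processing.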

Why Partitioning Matters

  • Consistent Resource Usage: Each group has similar total size/records

  • Optimal Parallel Processing: Balanced workloads across workers

  • Memory Planning: Predictable memory requirements per batch

  • Divide-and-Conquer: Ready for distributed processing frameworks

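Let’s partition our files by size with a target of 30,000,000 bytes (30 MB) per group. The human-readable sizes in the output are consistent with 1024-based units (107,924,092 bytes is shown as 102.92 MB), so in that notation the target is about 28.61 MB: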

[26]:
groups = manifest_file.partition_files_by_size(
    target_size=30_000_000,  # 30MB target per group
)

print(f"🔄 Partitioned {manifest_file.n_data_file} files into {len(groups)} groups")
print(f"🎯 Target size per group: 30MB")
print()

for ith, group in enumerate(groups, start=1):
    file_count = len(group.data_files)
    print(f"📦 Group {ith}: {file_count} files, {group.size_for_human}")

    # Show first few filenames for reference
    uris = [file.uri for file in group.data_files]
    if len(uris) <= 3:
        print(f"   Files: {uris}")
    else:
        print(f"   Files: {uris[:2]} ... and {len(uris)-2} more")
    print()
🔄 Partitioned 10 files into 4 groups
🎯 Target size per group: 30MB

📦 Group 1: 2 files, 26.80 MB
   Files: ['s3://my-bucket/data/2.parquet', 's3://my-bucket/data/4.parquet']

📦 Group 2: 2 files, 24.16 MB
   Files: ['s3://my-bucket/data/5.parquet', 's3://my-bucket/data/9.parquet']

📦 Group 3: 3 files, 27.53 MB
   Files: ['s3://my-bucket/data/8.parquet', 's3://my-bucket/data/3.parquet', 's3://my-bucket/data/10.parquet']

📦 Group 4: 3 files, 24.44 MB
   Files: ['s3://my-bucket/data/6.parquet', 's3://my-bucket/data/1.parquet', 's3://my-bucket/data/7.parquet']
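
For intuition, the grouping above can be reproduced with a textbook Best Fit Decreasing pass. This is a minimal sketch and not s3manifesto’s implementation: best_fit_decreasing is a hypothetical helper, files are keyed by their number for brevity, and the sizes are copied from the manifest data file in section 4.

```python
def best_fit_decreasing(sizes: dict[str, int], target: int) -> list[list[str]]:
    """Group items so each group's total stays within target where possible."""
    groups: list[list[str]] = []
    totals: list[int] = []
    # Largest first; sorted() is stable, so ties keep their input order.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        # Among groups with enough room, pick the fullest one (least slack left).
        best = None
        for i, total in enumerate(totals):
            if total + size <= target and (best is None or total > totals[best]):
                best = i
        if best is None:
            # Nothing fits, or the item alone exceeds target: open a new group.
            groups.append([name])
            totals.append(size)
        else:
            groups[best].append(name)
            totals[best] += size
    return groups


# File number -> size in bytes, from the manifest data file above.
sizes = {
    "1": 9_673_507, "2": 14_163_379, "3": 10_807_337, "4": 13_934_496,
    "5": 13_297_418, "6": 10_755_103, "7": 5_198_248, "8": 11_495_402,
    "9": 12_036_096, "10": 6_563_106,
}
groups = best_fit_decreasing(sizes, target=30_000_000)
for members in groups:
    print(members, sum(sizes[m] for m in members))
```

With these inputs the sketch yields the same four groups as the output above. A file larger than the target cannot fit anywhere, so it ends up alone in its own group, which is the only case where a group exceeds the target.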

7. Intelligent File Partitioning by Record Count

You can also partition files by record count, which is useful when processing time is more dependent on the number of records than file size.

This is particularly valuable for:

  • ETL pipelines where transformation time scales with record count

  • Database operations where insert/update performance depends on row count

  • Analytics workloads where computation scales with data points

[27]:
groups_by_records = manifest_file.partition_files_by_n_record(
    target_n_record=2_000_000,  # 2M records target per group
)

print(f"🔢 Partitioned by record count: {len(groups_by_records)} groups")
print(f"🎯 Target records per group: 2,000,000")
print()

for ith, group in enumerate(groups_by_records, start=1):
    file_count = len(group.data_files)
    # Note: group.value represents record count when partitioned by records
    print(f"📊 Group {ith}: {file_count} files, {group.value:,} records")

    # Total size of this group, reported below in decimal MB (bytes / 1,000,000)
    total_size = sum(file.size for file in group.data_files)
    size_mb = total_size / 1_000_000
    print(f"   Total size: {size_mb:.1f} MB")
    print()
🔢 Partitioned by record count: 7 groups
🎯 Target records per group: 2,000,000

📊 Group 1: 2 files, 1,987,633 records
   Total size: 19.1 MB

📊 Group 2: 1 files, 1,386,662 records
   Total size: 6.6 MB

📊 Group 3: 1 files, 1,302,776 records
   Total size: 11.5 MB

📊 Group 4: 1 files, 1,297,214 records
   Total size: 14.2 MB

📊 Group 5: 2 files, 1,723,161 records
   Total size: 22.8 MB

📊 Group 6: 2 files, 1,606,253 records
   Total size: 24.1 MB

📊 Group 7: 1 files, 732,709 records
   Total size: 9.7 MB
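
Once files are grouped, each group can be handed to a worker as one unit of work. The sketch below shows the shape of that dispatch with the standard library; process_group is a hypothetical placeholder for real processing logic, and the record counts are copied from the output above.

```python
from concurrent.futures import ThreadPoolExecutor

# Record counts per group, copied from the output above.
group_records = [
    1_987_633, 1_386_662, 1_302_776, 1_297_214,
    1_723_161, 1_606_253, 732_709,
]


def process_group(n_record: int) -> int:
    """Hypothetical worker; a real one would read and transform the group's files."""
    return n_record  # pretend every record was processed


with ThreadPoolExecutor(max_workers=4) as pool:
    processed = list(pool.map(process_group, group_records))

print(f"processed {sum(processed):,} records in {len(processed)} batches")
```

The seven batches add up to the manifest total of 10,036,408 records, so no file is dropped or counted twice by the partitioning.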

8. Summary: Key Benefits

🎯 Efficient Metadata Management

  • Consolidate scattered file metadata into manageable collections

  • Two-file system (JSON summary + Parquet data) for optimal access patterns

  • Automatic calculation of aggregate statistics

⚖️ Intelligent Partitioning

  • Best Fit Decreasing heuristic produces well-balanced groups

  • Never exceeds target size/records (except for naturally oversized files)

  • Perfect for divide-and-conquer parallel processing

🚀 Production Ready

  • Handles millions of files efficiently

  • Optimized for AWS S3 storage

  • Built for large-scale ETL and data processing workflows

Happy data processing! 🎉