S3 Manifesto Quick Start Tutorial

This notebook demonstrates how to use s3manifesto for efficient file metadata management and intelligent partitioning in big data workflows.

Overview

s3manifesto provides:

Metadata Organization: Consolidate file metadata into manageable collections
Two-File Storage: Efficient JSON summary + Parquet data file system
Intelligent Partitioning: Best Fit Decreasing algorithm for optimal file grouping
Divide-and-Conquer Ready: Perfect for distributed processing frameworks

Let’s walk through a complete example!

1. Set Up Mock AWS Environment

For this tutorial, we’ll use moto to create a mock AWS environment (the mock_aws entry point requires moto 5.0 or later). In production, you would connect to your actual AWS S3 service.

[1]:

import moto

mock_aws = moto.mock_aws()
mock_aws.start()

2. Create S3 Bucket and Client

Set up our S3 client and create a bucket for storing manifest files.

[2]:

import boto3

boto_ses = boto3.Session(region_name="us-east-1")
s3_client = boto_ses.client("s3")

s3_bucket = "my-bucket"
s3_client.create_bucket(Bucket=s3_bucket)

print(f"✅ Created S3 bucket: {s3_bucket}")

✅ Created S3 bucket: my-bucket

3. Create Sample Data Files and Manifest

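Let’s create a manifest containing metadata for 10 sample data files. No actual data files are uploaded: each DataFile entry only records a URI, a placeholder ETag, and a randomly generated size and record count. Because the values are random and unseeded, your numbers will differ from the outputs shown below.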

What is a Manifest?

A manifest is a collection of file metadata containing:

URI: S3 location of each data file
Size: File size in bytes for resource planning
Record Count: Number of records for workload estimation
ETag: S3 entity tag of the object, used to detect content changes and verify integrity

[18]:

import s3manifesto.api as s3manifesto
import random

n_file = 10
uri = f"s3://{s3_bucket}/manifest.parquet"
uri_summary = f"s3://{s3_bucket}/manifest-summary.json"

data_file_list = list()
for ith in range(1, 1 + n_file):
    data_file_uri = f"s3://{s3_bucket}/data/{ith}.parquet"
    # 500K ~ 1.5M records each file
    n_record = random.randint(500_000, 1_500_000)
    # 5MB ~ 15MB size each file
    size = random.randint(5 * 1_000_000, 15 * 1_000_000)
    data_file = s3manifesto.DataFile(
        uri=data_file_uri,
        etag="...",
        size=size,
        n_record=n_record,
    )
    data_file_list.append(data_file)

manifest_file = s3manifesto.ManifestFile.new(
    uri=uri,  # URI for manifest data file (Parquet format)
    uri_summary=uri_summary,  # URI for manifest summary file (JSON format)
    data_file_list=data_file_list,
    calculate=True,  # Automatically calculate totals from data_file_list
)

print(f"📊 Created manifest with {manifest_file.n_data_file} files")
print(f"📏 Total size: {manifest_file.size_for_human}")
print(f"📦 Total records: {manifest_file.n_record:,}")
print(f"🔐 Fingerprint: {manifest_file.fingerprint}")

📊 Created manifest with 10 files
📏 Total size: 102.92 MB
📦 Total records: 10,036,408
🔐 Fingerprint: 14f938cb78d554e52039879fddb1f1a4

4. Write Manifest to S3

A manifest consists of two files that work together:

Manifest Summary File (JSON): Contains aggregate metadata and references to the data file
Manifest Data File (Parquet): Contains detailed per-file metadata in high-performance format

This two-file system keeps metadata access cheap: you can check summary statistics from the small JSON file without loading the full per-file metadata.

[21]:

manifest_file.write(s3_client=s3_client)
print(f"✅ Written manifest to S3:")
print(f"   📄 Summary: {uri_summary}")
print(f"   📋 Data: {uri}")

✅ Written manifest to S3:
   📄 Summary: s3://my-bucket/manifest-summary.json
   📋 Data: s3://my-bucket/manifest.parquet

[22]:

import json
from rich import print as rprint

res = s3_client.get_object(Bucket=s3_bucket, Key="manifest-summary.json")
content = res["Body"].read().decode("utf-8")
print("content of manifest summary file:")
rprint(json.loads(content))

content of manifest summary file:

{
    'manifest': 's3://my-bucket/manifest.parquet',
    'size': 107924092,
    'n_record': 10036408,
    'fingerprint': '14f938cb78d554e52039879fddb1f1a4',
    'details': {}
}
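The summary stores the size in bytes (107,924,092), while the earlier cell printed 102.92 MB. The two figures agree if the human-readable value uses binary megabytes (1 MB = 1024 × 1024 bytes). The helper below is a hypothetical illustration of that conversion, not the library's own size_for_human implementation.

```python
def bytes_to_binary_mb(n_bytes: int) -> str:
    """Format a byte count as megabytes, assuming 1 MB = 1024 * 1024 bytes."""
    return f"{n_bytes / (1024 * 1024):.2f} MB"


# 'size' value copied from the manifest summary shown above
summary_size = 107_924_092
print(bytes_to_binary_mb(summary_size))
```

Keep this in mind when comparing printed sizes against byte-valued targets such as 30_000_000.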

[24]:

import polars as pl

res = s3_client.get_object(Bucket=s3_bucket, Key="manifest.parquet")
df = pl.read_parquet(res["Body"].read())
print("content of manifest data file:")
df

content of manifest data file:

[24]:

shape: (10, 4)

uri	etag	size	n_record
str	str	i64	i64
"s3://my-bucket/data/1.parquet"	"..."	9673507	732709
"s3://my-bucket/data/2.parquet"	"..."	14163379	1297214
"s3://my-bucket/data/3.parquet"	"..."	10807337	832962
"s3://my-bucket/data/4.parquet"	"..."	13934496	527477
"s3://my-bucket/data/5.parquet"	"..."	13297418	826390
"s3://my-bucket/data/6.parquet"	"..."	10755103	779863
"s3://my-bucket/data/7.parquet"	"..."	5198248	1460156
"s3://my-bucket/data/8.parquet"	"..."	11495402	1302776
"s3://my-bucket/data/9.parquet"	"..."	12036096	890199
"s3://my-bucket/data/10.parquet"	"..."	6563106	1386662
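As a quick sanity check, the per-file rows should add up to the totals stored in the summary file. The sketch below copies the size and n_record columns from the table above into plain Python tuples; the variable names are illustrative only.

```python
# (size, n_record) pairs copied from the manifest data file shown above
rows = [
    (9_673_507, 732_709),
    (14_163_379, 1_297_214),
    (10_807_337, 832_962),
    (13_934_496, 527_477),
    (13_297_418, 826_390),
    (10_755_103, 779_863),
    (5_198_248, 1_460_156),
    (11_495_402, 1_302_776),
    (12_036_096, 890_199),
    (6_563_106, 1_386_662),
]

# Totals copied from the manifest summary file
summary = {"size": 107_924_092, "n_record": 10_036_408}

total_size = sum(size for size, _ in rows)
total_records = sum(n_record for _, n_record in rows)

print(total_size == summary["size"], total_records == summary["n_record"])
```

Both comparisons hold for this run, which is what you would expect when the totals are computed from data_file_list.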

5. Read Manifest from S3

When reading a manifest, you only need to provide the summary file URI. The library automatically:

Reads the summary file first to get aggregate statistics
Uses the reference to locate and read the detailed data file
Reconstructs the complete manifest object

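This design lets a caller that only needs aggregate statistics read the small JSON summary alone, as the get_object call in step 4 does, and load the Parquet data file only when per-file details are required.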
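The first two steps of that read flow can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: split_s3_uri and locate_data_file are hypothetical helpers, and a dictionary stands in for S3. The only detail taken from the tutorial is that the summary JSON stores the data file URI under the 'manifest' key.

```python
import json
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def locate_data_file(summary_bytes: bytes) -> tuple[dict, str, str]:
    """Parse the summary JSON and resolve where the detailed data file lives."""
    summary = json.loads(summary_bytes)
    bucket, key = split_s3_uri(summary["manifest"])
    return summary, bucket, key


# In-memory stand-in for S3 (hypothetical): maps (bucket, key) to object bytes
fake_s3 = {
    ("my-bucket", "manifest-summary.json"): json.dumps(
        {
            "manifest": "s3://my-bucket/manifest.parquet",
            "size": 107924092,
            "n_record": 10036408,
        }
    ).encode("utf-8"),
}

summary_location = split_s3_uri("s3://my-bucket/manifest-summary.json")
summary, bucket, key = locate_data_file(fake_s3[summary_location])
print(summary["n_record"], bucket, key)
```

With a real client, the dictionary lookup would be replaced by a get_object call, and the resolved bucket and key would be used to fetch the Parquet file.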

[25]:

manifest_file = s3manifesto.ManifestFile.read(
    uri_summary=uri_summary,
    s3_client=s3_client,
)

print(f"📖 Read manifest from S3:")
print(f"   📊 Files: {manifest_file.n_data_file}")
print(f"   📏 Total size: {manifest_file.size_for_human}")
print(f"   📦 Total records: {manifest_file.n_record:,}")

📖 Read manifest from S3:
   📊 Files: 10
   📏 Total size: 102.92 MB
   📦 Total records: 10,036,408

6. Intelligent File Partitioning by Size

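One of the most useful features of s3manifesto is file partitioning. It uses the Best Fit Decreasing (BFD) bin-packing algorithm: files are sorted from largest to smallest, and each file is placed into the existing group with the least remaining room that can still hold it, or into a new group if none can. The resulting groups each stay within the target and are well suited to parallel processing.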
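To make the algorithm concrete, here is a minimal sketch of textbook Best Fit Decreasing. It is not s3manifesto's internal code: best_fit_decreasing is a hypothetical helper, and the file sizes are copied from the manifest data file produced in this run.

```python
def best_fit_decreasing(items, capacity):
    """Group (name, weight) pairs so each group's total stays within capacity.

    An item heavier than capacity ends up alone in its own group.
    """
    groups = []  # each group is a dict: {"names": [...], "total": int}
    # Decreasing: handle the heaviest items first
    for name, weight in sorted(items, key=lambda item: item[1], reverse=True):
        best = None
        for group in groups:
            room = capacity - group["total"]
            # Best fit: among groups that can hold the item, pick the tightest
            if weight <= room and (best is None or room < capacity - best["total"]):
                best = group
        if best is None:
            # No existing group can hold the item, so open a new one
            best = {"names": [], "total": 0}
            groups.append(best)
        best["names"].append(name)
        best["total"] += weight
    return groups


# File number -> size in bytes, copied from the manifest data file in this run
sizes = {
    1: 9_673_507, 2: 14_163_379, 3: 10_807_337, 4: 13_934_496, 5: 13_297_418,
    6: 10_755_103, 7: 5_198_248, 8: 11_495_402, 9: 12_036_096, 10: 6_563_106,
}

size_groups = best_fit_decreasing(list(sizes.items()), capacity=30_000_000)
for group in size_groups:
    print(group["names"], group["total"])
```

With a 30,000,000-byte capacity, this sketch yields four groups: files 2 and 4; 5 and 9; 8, 3, and 10; and 6, 1, and 7.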

Why Partitioning Matters

Consistent Resource Usage: Each group's total size or record count stays at or below the target
Optimal Parallel Processing: Balanced workloads across workers
Memory Planning: Predictable memory requirements per batch
Divide-and-Conquer: Ready for distributed processing frameworks

Let’s partition our files by size with a target of 30,000,000 bytes (30 MB in decimal units) per group:

[26]:

groups = manifest_file.partition_files_by_size(
    target_size=30_000_000,  # 30MB target per group
)

print(f"🔄 Partitioned {manifest_file.n_data_file} files into {len(groups)} groups")
print(f"🎯 Target size per group: 30MB")
print()

for ith, group in enumerate(groups, start=1):
    file_count = len(group.data_files)
    print(f"📦 Group {ith}: {file_count} files, {group.size_for_human}")

    # Show first few filenames for reference
    uris = [file.uri for file in group.data_files]
    if len(uris) <= 3:
        print(f"   Files: {uris}")
    else:
        print(f"   Files: {uris[:2]} ... and {len(uris)-2} more")
    print()

🔄 Partitioned 10 files into 4 groups
🎯 Target size per group: 30MB

📦 Group 1: 2 files, 26.80 MB
   Files: ['s3://my-bucket/data/2.parquet', 's3://my-bucket/data/4.parquet']

📦 Group 2: 2 files, 24.16 MB
   Files: ['s3://my-bucket/data/5.parquet', 's3://my-bucket/data/9.parquet']

📦 Group 3: 3 files, 27.53 MB
   Files: ['s3://my-bucket/data/8.parquet', 's3://my-bucket/data/3.parquet', 's3://my-bucket/data/10.parquet']

📦 Group 4: 3 files, 24.44 MB
   Files: ['s3://my-bucket/data/6.parquet', 's3://my-bucket/data/1.parquet', 's3://my-bucket/data/7.parquet']
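Once the groups exist, each one can be handed to a separate worker. The sketch below uses a standard-library thread pool and the URI groups from the output above. process_group is a hypothetical placeholder: a real worker would download and transform the files instead of just counting them.

```python
from concurrent.futures import ThreadPoolExecutor

# URI groups copied from the partitioning output above
uri_groups = [
    ["s3://my-bucket/data/2.parquet", "s3://my-bucket/data/4.parquet"],
    ["s3://my-bucket/data/5.parquet", "s3://my-bucket/data/9.parquet"],
    [
        "s3://my-bucket/data/8.parquet",
        "s3://my-bucket/data/3.parquet",
        "s3://my-bucket/data/10.parquet",
    ],
    [
        "s3://my-bucket/data/6.parquet",
        "s3://my-bucket/data/1.parquet",
        "s3://my-bucket/data/7.parquet",
    ],
]


def process_group(uris: list[str]) -> int:
    """Hypothetical worker: pretend to process each file and return the count."""
    return len(uris)


# One worker per group; pool.map returns results in the same order as uri_groups
with ThreadPoolExecutor(max_workers=len(uri_groups)) as pool:
    processed_counts = list(pool.map(process_group, uri_groups))

print(processed_counts)
```

The same pattern carries over to process pools or a distributed framework, since each group is an independent unit of work with a bounded size.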

7. Intelligent File Partitioning by Record Count

You can also partition files by record count, which is useful when processing time depends more on the number of records than on file size.

This is particularly valuable for:

ETL pipelines where transformation time scales with record count
Database operations where insert/update performance depends on row count
Analytics workloads where computation scales with data points

[27]:

groups_by_records = manifest_file.partition_files_by_n_record(
    target_n_record=2_000_000,  # 2M records target per group
)

print(f"🔢 Partitioned by record count: {len(groups_by_records)} groups")
print(f"🎯 Target records per group: 2,000,000")
print()

for ith, group in enumerate(groups_by_records, start=1):
    file_count = len(group.data_files)
    # Note: group.value represents record count when partitioned by records
    print(f"📊 Group {ith}: {file_count} files, {group.value:,} records")

    # Calculate total size for this group (decimal MB: 1 MB = 1,000,000 bytes)
    total_size = sum(file.size for file in group.data_files)
    size_mb = total_size / 1_000_000
    print(f"   Total size: {size_mb:.1f} MB")
    print()

🔢 Partitioned by record count: 7 groups
🎯 Target records per group: 2,000,000

📊 Group 1: 2 files, 1,987,633 records
   Total size: 19.1 MB

📊 Group 2: 1 files, 1,386,662 records
   Total size: 6.6 MB

📊 Group 3: 1 files, 1,302,776 records
   Total size: 11.5 MB

📊 Group 4: 1 files, 1,297,214 records
   Total size: 14.2 MB

📊 Group 5: 2 files, 1,723,161 records
   Total size: 22.8 MB

📊 Group 6: 2 files, 1,606,253 records
   Total size: 24.1 MB

📊 Group 7: 1 files, 732,709 records
   Total size: 9.7 MB
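A rough way to judge the result is to compare the number of groups with the theoretical minimum, which is the total record count divided by the target, rounded up. The sketch below does that arithmetic with the figures from this run; the variable names are illustrative.

```python
import math

total_records = 10_036_408
target_n_record = 2_000_000

# Per-group record counts copied from the output above
group_records = [
    1_987_633, 1_386_662, 1_302_776, 1_297_214, 1_723_161, 1_606_253, 732_709,
]

# Lower bound on the number of groups if records could be split freely
min_groups = math.ceil(total_records / target_n_record)

# How full each group is relative to the target
fill_ratios = [n / target_n_record for n in group_records]

print(min_groups, len(group_records))
print([f"{ratio:.0%}" for ratio in fill_ratios])
```

The bound here is 6 groups against the 7 produced. The bound ignores that files cannot be split: several files in this run hold about 1.3M records or more, which leaves too little room under the 2M target to pair them with most other files.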

8. Summary: Key Benefits

🎯 Efficient Metadata Management

Consolidate scattered file metadata into manageable collections
Two-file system (JSON summary + Parquet data) for optimal access patterns
Automatic calculation of aggregate statistics

⚖️ Intelligent Partitioning

Best Fit Decreasing algorithm packs files into groups close to the target
Stays within the target size or record count; a group exceeds it only when a single file is itself larger than the target
Perfect for divide-and-conquer parallel processing

🚀 Production Ready

Handles millions of files efficiently
Optimized for AWS S3 storage
Built for large-scale ETL and data processing workflows

Happy data processing! 🎉