S3 Manifesto Quick Start Tutorial
This notebook demonstrates how to use s3manifesto for efficient file metadata management and intelligent partitioning in big data workflows.
Overview
s3manifesto provides:
Metadata Organization: Consolidate file metadata into manageable collections
Two-File Storage: Efficient JSON summary + Parquet data file system
Intelligent Partitioning: Best Fit Decreasing algorithm for optimal file grouping
Divide-and-Conquer Ready: Perfect for distributed processing frameworks
Let’s walk through a complete example!
1. Set Up Mock AWS Environment
For this tutorial, we’ll use moto to create a mock AWS environment. In production, you would connect to your actual AWS S3 service.
[1]:
import moto
mock_aws = moto.mock_aws()
mock_aws.start()
2. Create S3 Bucket and Client
Set up our S3 client and create a bucket for storing manifest files.
[2]:
import boto3
boto_ses = boto3.Session(region_name="us-east-1")
s3_client = boto_ses.client("s3")
s3_bucket = "my-bucket"
s3_client.create_bucket(Bucket=s3_bucket)
print(f"✅ Created S3 bucket: {s3_bucket}")
✅ Created S3 bucket: my-bucket
3. Create Sample Data Files and Manifest
Let’s create a manifest containing metadata for 10 sample data files. Each file represents a typical big data scenario with varying sizes and record counts.
What is a Manifest?
A manifest is a collection of file metadata containing:
URI: S3 location of each data file
Size: File size in bytes for resource planning
Record Count: Number of records for workload estimation
ETag: Data integrity verification hash
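As a rough mental model, the four fields above can be pictured as a small record type, with the manifest-level totals derived from the list of records. The sketch below uses only the standard library; `FileMeta` and `summarize` are hypothetical names chosen for illustration and are not part of s3manifesto.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Illustrative stand-in for one manifest entry (not the library's class)."""

    uri: str       # S3 location of the data file
    etag: str      # integrity hash reported by S3
    size: int      # file size in bytes
    n_record: int  # number of records in the file


def summarize(files: list[FileMeta]) -> dict[str, int]:
    """Derive manifest-level totals from per-file metadata."""
    return {
        "n_data_file": len(files),
        "size": sum(f.size for f in files),
        "n_record": sum(f.n_record for f in files),
    }


# Placeholder entries; the etag values are made up for the example.
files = [
    FileMeta("s3://my-bucket/data/1.parquet", "etag-1", 9_673_507, 732_709),
    FileMeta("s3://my-bucket/data/2.parquet", "etag-2", 14_163_379, 1_297_214),
]
totals = summarize(files)
print(totals)
```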
[18]:
import s3manifesto.api as s3manifesto
import random
n_file = 10
uri = f"s3://{s3_bucket}/manifest.parquet"
uri_summary = f"s3://{s3_bucket}/manifest-summary.json"
data_file_list = list()
for ith in range(1, 1 + n_file):
    data_file_uri = f"s3://{s3_bucket}/data/{ith}.parquet"
    # 500K ~ 1.5M records each file
    n_record = random.randint(500_000, 1_500_000)
    # 5MB ~ 15MB size each file
    size = random.randint(5 * 1_000_000, 15 * 1_000_000)
    data_file = s3manifesto.DataFile(
        uri=data_file_uri,
        etag="...",
        size=size,
        n_record=n_record,
    )
    data_file_list.append(data_file)
manifest_file = s3manifesto.ManifestFile.new(
    uri=uri,  # URI for manifest data file (Parquet format)
    uri_summary=uri_summary,  # URI for manifest summary file (JSON format)
    data_file_list=data_file_list,
    calculate=True,  # Automatically calculate totals from data_file_list
)
print(f"📊 Created manifest with {manifest_file.n_data_file} files")
print(f"📏 Total size: {manifest_file.size_for_human}")
print(f"📦 Total records: {manifest_file.n_record:,}")
print(f"🔐 Fingerprint: {manifest_file.fingerprint}")
📊 Created manifest with 10 files
📏 Total size: 102.92 MB
📦 Total records: 10,036,408
🔐 Fingerprint: 14f938cb78d554e52039879fddb1f1a4
4. Write Manifest to S3
A manifest consists of two files that work together:
Manifest Summary File (JSON): Contains aggregate metadata and references to the data file
Manifest Data File (Parquet): Contains detailed per-file metadata in high-performance format
This two-file system keeps metadata access efficient: you can check the summary statistics quickly without loading the full per-file metadata.
[21]:
manifest_file.write(s3_client=s3_client)
print(f"✅ Written manifest to S3:")
print(f" 📄 Summary: {uri_summary}")
print(f" 📋 Data: {uri}")
✅ Written manifest to S3:
📄 Summary: s3://my-bucket/manifest-summary.json
📋 Data: s3://my-bucket/manifest.parquet
[22]:
import json
from rich import print as rprint
res = s3_client.get_object(Bucket=s3_bucket, Key="manifest-summary.json")
content = res["Body"].read().decode("utf-8")
print("content of manifest summary file:")
rprint(json.loads(content))
content of manifest summary file:
{
    'manifest': 's3://my-bucket/manifest.parquet',
    'size': 107924092,
    'n_record': 10036408,
    'fingerprint': '14f938cb78d554e52039879fddb1f1a4',
    'details': {}
}
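The summary stores `size` in raw bytes, while `size_for_human` printed 102.92 MB earlier. The two figures agree if the human-readable value uses 1024-based units. That is an inference from the numbers shown in this tutorial, not a statement about how the library formats sizes. A quick check, with `to_mib` as a hypothetical helper:

```python
def to_mib(n_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


size_in_bytes = 107_924_092  # 'size' value from the summary file
print(f"{to_mib(size_in_bytes):.2f} MiB")     # 1024-based -> 102.92 MiB
print(f"{size_in_bytes / 1_000_000:.2f} MB")  # 1000-based -> 107.92 MB
```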
[24]:
import polars as pl
res = s3_client.get_object(Bucket=s3_bucket, Key="manifest.parquet")
df = pl.read_parquet(res["Body"].read())
print("content of manifest data file:")
df
content of manifest data file:
| uri | etag | size | n_record |
|---|---|---|---|
| str | str | i64 | i64 |
| "s3://my-bucket/data/1.parquet" | "..." | 9673507 | 732709 |
| "s3://my-bucket/data/2.parquet" | "..." | 14163379 | 1297214 |
| "s3://my-bucket/data/3.parquet" | "..." | 10807337 | 832962 |
| "s3://my-bucket/data/4.parquet" | "..." | 13934496 | 527477 |
| "s3://my-bucket/data/5.parquet" | "..." | 13297418 | 826390 |
| "s3://my-bucket/data/6.parquet" | "..." | 10755103 | 779863 |
| "s3://my-bucket/data/7.parquet" | "..." | 5198248 | 1460156 |
| "s3://my-bucket/data/8.parquet" | "..." | 11495402 | 1302776 |
| "s3://my-bucket/data/9.parquet" | "..." | 12036096 | 890199 |
| "s3://my-bucket/data/10.parquet" | "..." | 6563106 | 1386662 |
5. Read Manifest from S3
When reading a manifest, you only need to provide the summary file URI. The library automatically:
Reads the summary file first to get aggregate statistics
Uses the reference to locate and read the detailed data file
Reconstructs the complete manifest object
This design enables efficient access patterns for different use cases.
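The read flow above can be sketched with the standard library alone. This illustrates the pattern rather than the library's implementation: it uses local files instead of S3, a JSON Lines file as a stand-in for the Parquet data file, and hypothetical helper names (`write_manifest`, `read_summary`, `read_manifest`).

```python
import json
import tempfile
from pathlib import Path


def write_manifest(summary_path: Path, data_path: Path, files: list[dict]) -> None:
    """Write per-file metadata, then a small summary that points at it."""
    data_path.write_text("\n".join(json.dumps(f) for f in files))
    summary = {
        "manifest": str(data_path),  # reference to the detailed data file
        "size": sum(f["size"] for f in files),
        "n_record": sum(f["n_record"] for f in files),
    }
    summary_path.write_text(json.dumps(summary))


def read_summary(summary_path: Path) -> dict:
    """Cheap path: aggregate statistics only, no per-file rows loaded."""
    return json.loads(summary_path.read_text())


def read_manifest(summary_path: Path) -> tuple[dict, list[dict]]:
    """Full path: read the summary, then follow its reference to the data file."""
    summary = read_summary(summary_path)
    lines = Path(summary["manifest"]).read_text().splitlines()
    return summary, [json.loads(line) for line in lines]


with tempfile.TemporaryDirectory() as tmp:
    summary_path = Path(tmp) / "manifest-summary.json"
    data_path = Path(tmp) / "manifest.jsonl"
    write_manifest(
        summary_path,
        data_path,
        [
            {"uri": "data/1.parquet", "size": 9_673_507, "n_record": 732_709},
            {"uri": "data/2.parquet", "size": 14_163_379, "n_record": 1_297_214},
        ],
    )
    summary, rows = read_manifest(summary_path)

print(summary["n_record"], len(rows))
```

A caller that only needs totals would stop at `read_summary` and never touch the larger data file.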
[25]:
manifest_file = s3manifesto.ManifestFile.read(
    uri_summary=uri_summary,
    s3_client=s3_client,
)
print(f"📖 Read manifest from S3:")
print(f" 📊 Files: {manifest_file.n_data_file}")
print(f" 📏 Total size: {manifest_file.size_for_human}")
print(f" 📦 Total records: {manifest_file.n_record:,}")
📖 Read manifest from S3:
📊 Files: 10
📏 Total size: 102.92 MB
📦 Total records: 10,036,408
6. Intelligent File Partitioning by Size
One of the most powerful features of s3manifesto is intelligent file partitioning. This uses the Best Fit Decreasing (BFD) algorithm to group files into balanced batches perfect for parallel processing.
Why Partitioning Matters
Consistent Resource Usage: Each group has similar total size/records
Optimal Parallel Processing: Balanced workloads across workers
Memory Planning: Predictable memory requirements per batch
Divide-and-Conquer: Ready for distributed processing frameworks
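To illustrate the divide-and-conquer idea, the sketch below hands each group to its own worker using the standard library's thread pool. The group contents and the `process_group` function are hypothetical placeholders; in a real pipeline the worker would read and transform the files in its group.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical groups: each is a list of (uri, size_in_bytes) pairs.
groups = [
    [("data/a.parquet", 14_000_000), ("data/b.parquet", 13_000_000)],
    [("data/c.parquet", 12_000_000), ("data/d.parquet", 11_000_000)],
    [("data/e.parquet", 26_000_000)],
]


def process_group(group: list[tuple[str, int]]) -> int:
    """Placeholder worker: returns the number of bytes it would have processed."""
    return sum(size for _uri, size in group)


# One task per group; map() returns results in the same order as the input.
with ThreadPoolExecutor(max_workers=len(groups)) as executor:
    bytes_per_worker = list(executor.map(process_group, groups))

print(bytes_per_worker)
```

When the groups have similar totals, no single worker dominates the overall run time.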
Let’s partition our files by size with a target of 30MB per group:
[26]:
groups = manifest_file.partition_files_by_size(
    target_size=30_000_000,  # 30MB target per group
)
print(f"🔄 Partitioned {manifest_file.n_data_file} files into {len(groups)} groups")
print(f"🎯 Target size per group: 30MB")
print()
for ith, group in enumerate(groups, start=1):
    file_count = len(group.data_files)
    print(f"📦 Group {ith}: {file_count} files, {group.size_for_human}")
    # Show first few filenames for reference
    uris = [file.uri for file in group.data_files]
    if len(uris) <= 3:
        print(f" Files: {uris}")
    else:
        print(f" Files: {uris[:2]} ... and {len(uris)-2} more")
    print()
🔄 Partitioned 10 files into 4 groups
🎯 Target size per group: 30MB
📦 Group 1: 2 files, 26.80 MB
Files: ['s3://my-bucket/data/2.parquet', 's3://my-bucket/data/4.parquet']
📦 Group 2: 2 files, 24.16 MB
Files: ['s3://my-bucket/data/5.parquet', 's3://my-bucket/data/9.parquet']
📦 Group 3: 3 files, 27.53 MB
Files: ['s3://my-bucket/data/8.parquet', 's3://my-bucket/data/3.parquet', 's3://my-bucket/data/10.parquet']
📦 Group 4: 3 files, 24.44 MB
Files: ['s3://my-bucket/data/6.parquet', 's3://my-bucket/data/1.parquet', 's3://my-bucket/data/7.parquet']
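For readers unfamiliar with Best Fit Decreasing, here is a minimal textbook sketch of the heuristic using only the standard library. It is not the library's implementation, and `best_fit_decreasing` is a hypothetical name. Items are sorted from largest to smallest, and each one goes into the group with the least remaining room that can still hold it; a new group is opened when none fits. Run on the ten sizes from the manifest table, it yields the same four groups as the output shown for this section.

```python
def best_fit_decreasing(sizes: dict[str, int], target: int) -> list[list[str]]:
    """Group keys so each group's total stays within target where possible."""
    groups: list[list[str]] = []
    remaining: list[int] = []  # free capacity of each group, parallel to groups

    for key, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        # Indexes of groups that can still hold this item.
        fits = [i for i, free in enumerate(remaining) if size <= free]
        if fits:
            best = min(fits, key=lambda i: remaining[i])  # tightest fit
            groups[best].append(key)
            remaining[best] -= size
        else:
            # Nothing fits (or the item alone exceeds target): open a new group.
            groups.append([key])
            remaining.append(target - size)
    return groups


# File number -> size in bytes, copied from the manifest table.
sizes = {
    "1": 9_673_507, "2": 14_163_379, "3": 10_807_337, "4": 13_934_496,
    "5": 13_297_418, "6": 10_755_103, "7": 5_198_248, "8": 11_495_402,
    "9": 12_036_096, "10": 6_563_106,
}
for group in best_fit_decreasing(sizes, target=30_000_000):
    print(group, sum(sizes[k] for k in group))
```

An item larger than the target ends up alone in its own group, which matches the "naturally oversized files" exception mentioned in the summary.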
7. Intelligent File Partitioning by Record Count
You can also partition files by record count, which is useful when processing time is more dependent on the number of records than file size.
This is particularly valuable for:
ETL pipelines where transformation time scales with record count
Database operations where insert/update performance depends on row count
Analytics workloads where computation scales with data points
[27]:
groups_by_records = manifest_file.partition_files_by_n_record(
    target_n_record=2_000_000,  # 2M records target per group
)
print(f"🔢 Partitioned by record count: {len(groups_by_records)} groups")
print(f"🎯 Target records per group: 2,000,000")
print()
for ith, group in enumerate(groups_by_records, start=1):
    file_count = len(group.data_files)
    # Note: group.value represents record count when partitioned by records
    print(f"📊 Group {ith}: {file_count} files, {group.value:,} records")
    # Calculate total size for this group in decimal megabytes (10^6 bytes);
    # the size_for_human values printed earlier appear to be 1024-based
    total_size = sum(file.size for file in group.data_files)
    size_mb = total_size / 1_000_000
    print(f" Total size: {size_mb:.1f} MB")
    print()
🔢 Partitioned by record count: 7 groups
🎯 Target records per group: 2,000,000
📊 Group 1: 2 files, 1,987,633 records
Total size: 19.1 MB
📊 Group 2: 1 files, 1,386,662 records
Total size: 6.6 MB
📊 Group 3: 1 files, 1,302,776 records
Total size: 11.5 MB
📊 Group 4: 1 files, 1,297,214 records
Total size: 14.2 MB
📊 Group 5: 2 files, 1,723,161 records
Total size: 22.8 MB
📊 Group 6: 2 files, 1,606,253 records
Total size: 24.1 MB
📊 Group 7: 1 files, 732,709 records
Total size: 9.7 MB
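A quick sanity check on the result above: no grouping can use fewer than ceil(total / target) groups. The arithmetic below uses the per-group record counts from the output; the variable names are illustrative.

```python
import math

target = 2_000_000
# Records per group, copied from the output above.
group_records = [
    1_987_633, 1_386_662, 1_302_776, 1_297_214,
    1_723_161, 1_606_253, 732_709,
]

total = sum(group_records)
lower_bound = math.ceil(total / target)  # fewest groups any packing could use

print(f"total records: {total:,}")
print(f"groups used: {len(group_records)}, lower bound: {lower_bound}")
for i, n in enumerate(group_records, start=1):
    print(f"group {i}: {n / target:.0%} of target")
```

The gap between 7 groups and the bound of 6 comes from the data rather than the heuristic: four files hold roughly 1.3M to 1.46M records each, so no two of them can share a 2M group, and most of the remaining files are too large to fill the leftover space.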
8. Summary: Key Benefits
🎯 Efficient Metadata Management
Consolidate scattered file metadata into manageable collections
Two-file system (JSON summary + Parquet data) for optimal access patterns
Automatic calculation of aggregate statistics
⚖️ Intelligent Partitioning
Best Fit Decreasing algorithm ensures balanced groups
Never exceeds target size/records (except for naturally oversized files)
Perfect for divide-and-conquer parallel processing
🚀 Production Ready
Handles millions of files efficiently
Optimized for AWS S3 storage
Built for large-scale ETL and data processing workflows
Happy data processing! 🎉