{ "cells": [ { "cell_type": "markdown", "id": "f79bc1c1-ef37-4cef-972a-89e4e6989474", "metadata": {}, "source": [ "# S3 Manifesto Quick Start Tutorial\n", "\n", "This notebook demonstrates how to use s3manifesto for efficient file metadata management and intelligent partitioning in big data workflows.\n", "\n", "## Overview\n", "\n", "s3manifesto provides:\n", "\n", "- **Metadata Organization**: Consolidate file metadata into manageable collections\n", "- **Two-File Storage**: A compact JSON summary file plus a Parquet data file\n", "- **Intelligent Partitioning**: Best Fit Decreasing bin-packing heuristic for grouping files\n", "- **Divide-and-Conquer Ready**: Well suited to distributed processing frameworks\n", "\n", "Let's walk through a complete example!" ] }, { "cell_type": "markdown", "id": "27a35a6e-3812-4897-8306-8936ce5e2bfb", "metadata": {}, "source": [ "## 1. Set Up a Mock AWS Environment\n", "\n", "For this tutorial, we'll use moto to create a mock AWS environment. In production, you would connect to the real Amazon S3 service instead." ] }, { "cell_type": "code", "execution_count": 1, "id": "ac4d5de1-3092-4637-ae0e-29506bd66850", "metadata": {}, "outputs": [], "source": [ "import moto\n", "\n", "mock_aws = moto.mock_aws()\n", "mock_aws.start()" ] }, { "cell_type": "markdown", "id": "3c42d68a-6c0e-40cf-9b78-7f151916f6fd", "metadata": {}, "source": [ "## 2. Create an S3 Client and Bucket\n", "\n", "Create an S3 client and a bucket for storing the manifest files."
] }, { "cell_type": "code", "execution_count": 2, "id": "cca1fb74-b84c-4af1-a33e-3a6c99df4044", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ Created S3 bucket: my-bucket\n" ] } ], "source": [ "import boto3\n", "\n", "boto_ses = boto3.Session(region_name=\"us-east-1\")\n", "s3_client = boto_ses.client(\"s3\")\n", "\n", "s3_bucket = \"my-bucket\"\n", "s3_client.create_bucket(Bucket=s3_bucket)\n", "\n", "print(f\"✅ Created S3 bucket: {s3_bucket}\")" ] }, { "cell_type": "markdown", "id": "ab68a683-4454-49e7-9cbc-9ec244ff2501", "metadata": {}, "source": [ "## 3. Create Sample Data Files and Manifest\n", "\n", "Let's create a manifest containing metadata for 10 sample data files. The files have randomly generated sizes and record counts to mimic a typical big data workload.\n", "\n", "### What Is a Manifest?\n", "\n", "A manifest is a collection of per-file metadata. Each entry contains:\n", "\n", "- **URI**: S3 location of the data file\n", "- **Size**: File size in bytes, for resource planning\n", "- **Record Count**: Number of records, for workload estimation\n", "- **ETag**: The S3 entity tag, used to verify data integrity" ] },
{ "cell_type": "code", "execution_count": 18, "id": "e54fb8fe-712e-4967-a4c1-ad6e757bbb0c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "📊 Created manifest with 10 files\n", "📏 Total size: 102.92 MB\n", "📦 Total records: 10,036,408\n", "🔐 Fingerprint: 14f938cb78d554e52039879fddb1f1a4\n" ] } ], "source": [ "import s3manifesto.api as s3manifesto\n", "import random\n", "\n", "n_file = 10\n", "uri = f\"s3://{s3_bucket}/manifest.parquet\"\n", "uri_summary = f\"s3://{s3_bucket}/manifest-summary.json\"\n", "\n", "data_file_list = []\n", "for ith in range(1, 1 + n_file):\n", "    data_file_uri = f\"s3://{s3_bucket}/data/{ith}.parquet\"\n", "    # 500K to 1.5M records per file\n", "    n_record = random.randint(500_000, 1_500_000)\n", "    # 5 MB to 15 MB per file\n", "    size = random.randint(5 * 1_000_000, 15 * 1_000_000)\n", "    data_file = s3manifesto.DataFile(\n", "        uri=data_file_uri,\n", "        etag=\"...\",  # placeholder value for this demo\n", "        size=size,\n", "        n_record=n_record,\n", "    )\n", "    data_file_list.append(data_file)\n", "\n", "manifest_file = s3manifesto.ManifestFile.new(\n", "    uri=uri,  # URI for manifest data file (Parquet format)\n", "    uri_summary=uri_summary,  # URI for manifest summary file (JSON format)\n", "    data_file_list=data_file_list,\n", "    calculate=True,  # Automatically calculate totals from data_file_list\n", ")\n", "\n", "print(f\"📊 Created manifest with {manifest_file.n_data_file} files\")\n", "print(f\"📏 Total size: {manifest_file.size_for_human}\")\n", "print(f\"📦 Total records: {manifest_file.n_record:,}\")\n", "print(f\"🔐 Fingerprint: {manifest_file.fingerprint}\")" ] }, { "cell_type": "markdown", "id": "149b68e8-4302-4b67-b471-1dae52b727f1", "metadata": {}, "source": [ "## 4. Write Manifest to S3\n", "\n", "A manifest consists of **two files** that work together:\n", "\n", "1. **Manifest Summary File** (JSON): Contains aggregate metadata and a reference to the manifest data file\n", "2. **Manifest Data File** (Parquet): Contains detailed per-file metadata in a columnar format\n", "\n", "This two-file system keeps metadata access efficient: you can check the summary statistics quickly without loading the detailed per-file metadata."
] }, { "cell_type": "code", "execution_count": 21, "id": "84fe7d00-80ca-49ee-b756-a582f5ee1969", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ Written manifest to S3:\n", " 📄 Summary: s3://my-bucket/manifest-summary.json\n", " 📋 Data: s3://my-bucket/manifest.parquet\n" ] } ], "source": [ "manifest_file.write(s3_client=s3_client)\n", "print(f\"✅ Written manifest to S3:\")\n", "print(f\" 📄 Summary: {uri_summary}\")\n", "print(f\" 📋 Data: {uri}\")" ] }, { "cell_type": "code", "execution_count": 22, "id": "016dd59a-af78-4662-a170-d42126f13074", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "content of manifest summary file:\n" ] }, { "data": { "text/html": [ "
{\n", " 'manifest': 's3://my-bucket/manifest.parquet',\n", " 'size': 107924092,\n", " 'n_record': 10036408,\n", " 'fingerprint': '14f938cb78d554e52039879fddb1f1a4',\n", " 'details': {}\n", "}\n", "\n" ], "text/plain": [ "\u001b[1m{\u001b[0m\n", " \u001b[32m'manifest'\u001b[0m: \u001b[32m's3://my-bucket/manifest.parquet'\u001b[0m,\n", " \u001b[32m'size'\u001b[0m: \u001b[1;36m107924092\u001b[0m,\n", " \u001b[32m'n_record'\u001b[0m: \u001b[1;36m10036408\u001b[0m,\n", " \u001b[32m'fingerprint'\u001b[0m: \u001b[32m'14f938cb78d554e52039879fddb1f1a4'\u001b[0m,\n", " \u001b[32m'details'\u001b[0m: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m}\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import json\n", "from rich import print as rprint\n", "\n", "res = s3_client.get_object(Bucket=s3_bucket, Key=\"manifest-summary.json\")\n", "content = res[\"Body\"].read().decode(\"utf-8\")\n", "print(\"content of manifest summary file:\")\n", "rprint(json.loads(content))" ] }, { "cell_type": "code", "execution_count": 24, "id": "48f51edf-6760-4b11-949b-a6f45e909cc9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "content of manifest data file:\n" ] }, { "data": { "text/html": [ "
| uri | etag | size | n_record |
|---|---|---|---|
| str | str | i64 | i64 |
| "s3://my-bucket/data/1.parquet" | "..." | 9673507 | 732709 |
| "s3://my-bucket/data/2.parquet" | "..." | 14163379 | 1297214 |
| "s3://my-bucket/data/3.parquet" | "..." | 10807337 | 832962 |
| "s3://my-bucket/data/4.parquet" | "..." | 13934496 | 527477 |
| "s3://my-bucket/data/5.parquet" | "..." | 13297418 | 826390 |
| "s3://my-bucket/data/6.parquet" | "..." | 10755103 | 779863 |
| "s3://my-bucket/data/7.parquet" | "..." | 5198248 | 1460156 |
| "s3://my-bucket/data/8.parquet" | "..." | 11495402 | 1302776 |
| "s3://my-bucket/data/9.parquet" | "..." | 12036096 | 890199 |
| "s3://my-bucket/data/10.parquet" | "..." | 6563106 | 1386662 |