Quick Start

What is a Manifest File

A manifest file stores metadata for a group of related files. This metadata acts as an index that tells a processor where each file is, how large it is, and how many records it holds.

Key components of file metadata include:

  1. URI (Uniform Resource Identifier): Specifies the file’s location, enabling processors to locate and read it efficiently.

  2. Size: Indicates the file’s size, typically in bytes.

  3. Record Count: The number of records in the file. Combined with the file size, it helps processors estimate processing time, required compute, and memory consumption. Orchestrators can use these estimates when scheduling processing tasks.

  4. ETag: An entity tag derived from the file's content, used to verify its integrity. If a file becomes corrupted or its content changes, its ETag will differ from the one recorded in the manifest file. On AWS S3 the ETag is often, but not always, an MD5 digest of the object data, and moving or copying an object can also produce a different ETag. See https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html for more information.
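
The four metadata fields above can be sketched as a small record type with an integrity check. This is an illustration only, not this library's API: `DataFileMeta`, `md5_hex`, and `is_unchanged` are hypothetical names, and the check assumes the recorded ETag is a plain MD5 hex digest, which on S3 holds only for some objects (for example, not for multipart uploads).

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileMeta:
    """Hypothetical record holding the four metadata fields described above."""

    uri: str
    size: int  # bytes
    n_record: int
    etag: str


def md5_hex(content: bytes) -> str:
    """Return the MD5 hex digest of the content (assumed ETag scheme)."""
    return hashlib.md5(content).hexdigest()


def is_unchanged(meta: DataFileMeta, content: bytes) -> bool:
    """Check that the content still matches the recorded size and ETag."""
    return len(content) == meta.size and md5_hex(content) == meta.etag


content = b'{"id": 1}\n{"id": 2}\n'
meta = DataFileMeta(
    uri="s3://bucket/prefix/file1.json",
    size=len(content),
    n_record=2,
    etag=md5_hex(content),
)
print(is_unchanged(meta, content))         # content matches the manifest
print(is_unchanged(meta, content + b"x"))  # content was modified
```

Checking the size first is a cheap way to catch truncated files before comparing digests.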

A complete “Manifest” consists of two files:

  1. Manifest Data File: This file stores a dataframe containing metadata for all files in the group. Each row represents a single file, with columns corresponding to the metadata fields listed above. While various formats can store this dataframe (e.g., CSV, JSON), this project uses the Parquet format for its I/O performance. See the “NDJson or Parquet” section for how we made the decision to use Parquet.

  2. Manifest Summary File: A concise JSON file that provides an overview of the entire file group. It includes aggregate information such as:

     - Total number of files
     - Combined size of all files
     - Total record count across all files
     - URI of the manifest data file, so that a processor can locate it
     - A unique fingerprint for the manifest, calculated from the URI and ETag of each data file

This two-file structure allows for efficient metadata management and quick access to both detailed and summary information about the file group.
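
As a rough sketch of how the summary fields could be derived from the per-file metadata, the snippet below aggregates a list of file records. The key names mirror the sample summary in this section. The helper `build_summary` is hypothetical, and the fingerprint scheme (MD5 over the sorted URI and ETag pairs) is an assumption for illustration; the source only says the fingerprint is based on the URI and ETag of the data files.

```python
import hashlib
import json


def build_summary(data_file_list, manifest_uri):
    """Aggregate per-file metadata into a summary dict (illustrative only)."""
    digest = hashlib.md5()
    # Sort by URI so the fingerprint does not depend on input order.
    for data_file in sorted(data_file_list, key=lambda f: f["uri"]):
        digest.update(f"{data_file['uri']}\t{data_file['etag']}\n".encode("utf-8"))
    return {
        "n_files": len(data_file_list),
        "total_size": sum(f["size"] for f in data_file_list),
        "total_records": sum(f["n_record"] for f in data_file_list),
        "uri": manifest_uri,
        "fingerprint": digest.hexdigest(),
    }


data_file_list = [
    {"uri": "s3://bucket/prefix/file1.json", "size": 1_000_000,
     "n_record": 1000, "etag": "8a53247196e46b53699d065ba3cc8e0d"},
    {"uri": "s3://bucket/prefix/file2.json", "size": 2_000_000,
     "n_record": 2000, "etag": "b3f20f3c7a8877c24504634edd067fcf"},
]
summary = build_summary(data_file_list, "s3://bucket/prefix/manifest.parquet")
print(json.dumps(summary, indent=4))
```

Because the fingerprint covers every data file's URI and ETag, two manifests that describe the same set of unchanged files get the same fingerprint under this scheme.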

Sample Manifest Summary File (total_size is in bytes; 600000000 bytes is 600 MB)

{
    "n_files": 50,
    "total_size": 600000000,
    "total_records": 100000,
    "uri": "s3://bucket/prefix/manifest.parquet",
    "fingerprint": "2d0175ad9416dc5fd7138546471738ca"
}

Sample Manifest Data File

+-------------------------------+--------------+----------+----------------------------------+
|              uri              | size (bytes) | n_record |               ETag               |
+-------------------------------+--------------+----------+----------------------------------+
| s3://bucket/prefix/file1.json |   1_000_000  |   1000   | 8a53247196e46b53699d065ba3cc8e0d |
+-------------------------------+--------------+----------+----------------------------------+
| s3://bucket/prefix/file2.json |   2_000_000  |   2000   | b3f20f3c7a8877c24504634edd067fcf |
+-------------------------------+--------------+----------+----------------------------------+
| s3://bucket/prefix/file3.json |   3_000_000  |   3000   | dd9b315f1d7ec573cb7305e6e238731f |
+-------------------------------+--------------+----------+----------------------------------+
|              ...              |      ...     |    ...   |                ...               |
+-------------------------------+--------------+----------+----------------------------------+
|              ...              |      ...     |    ...   |                ...               |
+-------------------------------+--------------+----------+----------------------------------+
|              ...              |      ...     |    ...   |                ...               |
+-------------------------------+--------------+----------+----------------------------------+

Create a Manifest File

This example shows how to create a manifest file and write it to AWS S3 using this library.

example.py
    import random

    # ManifestFile is provided by this library; self.bucket and self.s3_client
    # come from the enclosing test class this excerpt was taken from.

    # make a dummy manifest file
    n_file = 1000
    uri = f"s3://{self.bucket}/manifest-data.parquet"
    uri_summary = f"s3://{self.bucket}/manifest-summary.json"

    # collect data file metadata
    data_file_list = list()
    total_size = 0
    total_record = 0
    for ith in range(1, 1 + n_file):
        # use a separate name so the manifest data file uri is not overwritten
        data_file_uri = f"s3://{self.bucket}/data/{ith}.parquet"
        n_record = random.randint(1000, 10 * 1000)
        size = n_record * 1000
        total_size += size
        total_record += n_record
        data_file = dict(
            uri=data_file_uri,
            etag="...",
            size=size,
            n_record=n_record,
        )
        data_file_list.append(data_file)

    # create the manifest file object
    manifest_file = ManifestFile.new(
        uri=uri,  # uri of the manifest-data.parquet file
        uri_summary=uri_summary,  # uri of the manifest-summary.json file
        data_file_list=data_file_list,
        details={"owner": "Alice"},
        # if True, size and n_record are calculated from data_file_list;
        # otherwise, set them manually like this:
        # ManifestFile.new(size=total_size, n_record=total_record)
        calculate=True,
    )
    assert manifest_file.size == total_size
    assert manifest_file.n_record == total_record
    assert isinstance(manifest_file.fingerprint, str)
    assert manifest_file.details == {"owner": "Alice"}

    # write the manifest file to S3
    manifest_file.write(s3_client=self.s3_client)

Read a Manifest File

This example shows how to read a manifest file from AWS S3 using this library.

example.py
    # uri_summary and manifest_file come from the create example above.

    # read the manifest file from S3
    # you only need to provide uri_summary; the library reads the
    # manifest-summary.json file to locate the manifest-data.parquet file
    manifest_file1 = ManifestFile.read(
        uri_summary=uri_summary,
        s3_client=self.s3_client,
    )
    # the manifest read back from S3 should match the one that was written
    assert manifest_file1.size == manifest_file.size
    assert manifest_file1.n_record == manifest_file.n_record
    assert len(manifest_file1.data_file_list) == len(manifest_file.data_file_list)
    assert manifest_file1.fingerprint == manifest_file.fingerprint
    assert manifest_file1.details == {"owner": "Alice"}

Feature - Group Files Planner

The Group Files Planner organizes large numbers of files into manageable groups based on total size or record count. It is useful in two areas:

  1. Enhancing parallel processing by distributing file groups among multiple workers.

  2. Optimizing data lakes through strategic file compaction.

The Group Files Planner creates approximately equal-sized file groups, which makes it easier to allocate tasks evenly and to compact files to a target size. In the example below, each group's total stays within twice the target. The implementation can process millions of files quickly, so it is suited to orchestrating data operations at scale, from terabytes to petabytes.

See the Group Files Planner section for the benchmark of this algorithm.

This example shows how to use this library to group files.

example.py
    # KeyEnum is provided by this library; manifest_file is the object
    # created in the create example above.

    # group files into tasks by size
    target_size = 100_000_000  # 100 MB
    data_file_group_list = manifest_file.group_files_into_tasks_by_size(
        target_size=target_size,
    )
    for data_file_group, total_size in data_file_group_list:
        # each group's total size stays within twice the target
        assert (
            sum(data_file[KeyEnum.SIZE] for data_file in data_file_group)
            <= target_size * 2
        )
        assert total_size <= target_size * 2

    # group files into tasks by n_record
    target_n_record = 10_000_000  # 10M
    data_file_group_list = manifest_file.group_files_into_tasks_by_n_record(
        target_n_record=target_n_record,
    )
    for data_file_group, total_n_record in data_file_group_list:
        # each group's total record count stays within twice the target
        assert (
            sum(data_file[KeyEnum.N_RECORD] for data_file in data_file_group)
            <= target_n_record * 2
        )
        assert total_n_record <= target_n_record * 2
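
To make the grouping idea concrete, here is a minimal greedy sketch. It is not the algorithm this library uses, which is not described in this section. The function `group_by_target` is a hypothetical helper that walks the files in order and closes a group once its running total reaches the target. As long as no single file exceeds the target, each group's total stays below twice the target, which matches the bound asserted in the example above.

```python
def group_by_target(data_file_list, key, target):
    """Greedily pack files into groups whose totals reach the target.

    Returns a list of (group, total) tuples. Illustrative only.
    """
    groups = []
    current, total = [], 0
    for data_file in data_file_list:
        current.append(data_file)
        total += data_file[key]
        if total >= target:
            # The group reached the target: close it and start a new one.
            groups.append((current, total))
            current, total = [], 0
    if current:
        # Leftover files form a final, smaller group.
        groups.append((current, total))
    return groups


files = [
    {"uri": f"s3://bucket/data/{i}.parquet", "size": size}
    for i, size in enumerate([60, 50, 30, 80, 10], start=1)
]
groups = group_by_target(files, key="size", target=100)
for group, total in groups:
    print([f["uri"] for f in group], total)
```

The same helper works for record counts by passing a different `key`, for example `"n_record"`.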