TypedDict
When processing metadata for a large number (millions or more) of data files, dataclasses, attrs, and pydantic all add performance overhead. Because my data is trusted and I don't need serialization, deserialization, or validation, I want to replace these libraries with TypedDict. This script tests whether that approach is feasible.
typeddict_poc.py
# -*- coding: utf-8 -*-

"""
When processing metadata for a large number (millions or more) of data files,
dataclasses, attrs, and pydantic all add performance overhead. Because my data
is trusted, I don't need serialization, deserialization, or validation.
So I want to replace these libraries with TypedDict. This script tests whether
that approach is feasible.
"""

import typing_extensions as T


class T_DATA_FILE(T.TypedDict):
    # Required keys must be present; Optional means the value may be None.
    uri: T.Required[str]
    size: T.Required[T.Optional[int]]
    n_record: T.Required[T.Optional[int]]
    # NotRequired keys may be omitted entirely.
    format: T.NotRequired[T.Optional[str]]


def func(data_file: T_DATA_FILE):
    pass


# fmt: off
_ = func({"id": 1})  # IDE catches this error: unknown key "id", required keys missing
_ = func({"uri": "s3://bucket/key"})  # IDE catches this error: "size" and "n_record" are missing
_ = func({"uri": "s3://bucket/key", "size": 1, "n_record": 1})
_ = func({"uri": "s3://bucket/key", "size": None, "n_record": None})
_ = func({"uri": "s3://bucket/key", "size": 1, "n_record": 1, "format": "csv"})
_ = func({"uri": "s3://bucket/key", "size": 1, "n_record": 1, "format": "csv", "compression": "gzip"})  # IDE catches this error: unknown key "compression"
# fmt: on
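The errors marked above are reported by the IDE or a static type checker only. At runtime a TypedDict is a plain dict and performs no validation, which is why it avoids the overhead of the other libraries. The following is a minimal sketch of that behavior; the class name `DataFile` and the sample values are assumptions for illustration and are not part of the script above.

```python
from typing import Optional

try:
    # Required / NotRequired are in the standard library from Python 3.11.
    from typing import NotRequired, Required, TypedDict
except ImportError:
    from typing_extensions import NotRequired, Required, TypedDict


class DataFile(TypedDict):  # hypothetical name, mirrors T_DATA_FILE
    uri: Required[str]
    size: Required[Optional[int]]
    n_record: Required[Optional[int]]
    format: NotRequired[Optional[str]]


# Calling the class simply builds a built-in dict; no wrapper object is created.
data_file = DataFile(uri="s3://bucket/key", size=1, n_record=1)
is_plain_dict = type(data_file) is dict

# Nothing is validated at runtime: a wrong value type and missing required
# keys are accepted silently. Only a static type checker would flag this.
bad = DataFile(uri=123)

# The required / optional split is still available for introspection.
required_keys = DataFile.__required_keys__
optional_keys = DataFile.__optional_keys__
```

Because nothing is checked at runtime, this approach only suits data that is already trusted, as assumed in this document. If a cheap sanity check is ever needed, `__required_keys__` can be compared against the keys of an incoming dict.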