Binary Encodings for JSON and Variant

3 days ago

Binary JSON encodings accelerate repeated queries by avoiding costly text parsing, with a minimal design achieving 2,346x faster lookups.
BSON has limitations like storage overhead and lack of random access, while formats like CBOR and MessagePack optimize for compactness, not fast lookups.
Databases like Postgres with JSONB and YDB make trade-offs for internal compatibility, supporting random access but with design constraints.
Binary encoding design depends on workload goals: random access, storage efficiency, serialization cost, and data type handling.
Parquet's VARIANT type generalizes semi-structured data with richer types and optimizations like short string packing and shredding for performance.
The ecosystem is converging on binary JSON representations (e.g., BSON, JSONB, VARIANT) for fast retrieval, with tools like Apache Iceberg adopting VARIANT.
LLMs often use JSON for structured responses, leading to innovations like TOON for token-efficient encoding while maintaining readability.

Hasty Briefsbeta