Reverse Engineering iWork

a day ago

https://andrews.substack.com/p/reverse-engineering-iwork


#Protocol Buffers
#Swift
#iWork

The app discussed ingests files, but it has no way to parse .key, .numbers, or .pages files unless they are first exported to PDF or another format.
The author built a parser that reads iWork files natively, with no conversion or server-side processing. The idea was inspired by an earlier project that ported Perl to WebAssembly for client-side metadata extraction.
Apple switched the iWork document format from XML to a binary format based on Google’s Protocol Buffers in 2013, likely to optimize performance for early iPhones and iPads.
The parser recovers protobuf message descriptors from Apple's Pages, Keynote, and Numbers executables; these descriptors define the structure of every message type in the documents.
Parsing involves decompressing Snappy-compressed chunks, which requires handling Apple's custom Snappy implementation, and then decoding protobuf messages whose type IDs are mapped to Swift classes.
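
The decoding step can be sketched as follows. This is a minimal illustration in Python rather than the package's Swift code: `read_varint` implements the standard protobuf base-128 varint encoding, while the registry, the type ID `1`, and the names `register`, `decode_document`, and `decode_message` are hypothetical and chosen only to show how a type ID might select a decoder. Snappy decompression is not shown.

```python
from typing import Callable, Dict, Tuple


def read_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one protobuf base-128 varint; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data, least significant group first.
        value |= (byte & 0x7F) << shift
        # A clear high bit marks the final byte of the varint.
        if not byte & 0x80:
            return value, pos
        shift += 7
        if shift >= 64:
            raise ValueError("varint too long")


# Hypothetical registry: the type IDs and decoders are invented for illustration.
DECODERS: Dict[int, Callable[[bytes], dict]] = {}


def register(type_id: int):
    """Decorator that associates a numeric type ID with a decoder function."""
    def wrap(fn: Callable[[bytes], dict]) -> Callable[[bytes], dict]:
        DECODERS[type_id] = fn
        return fn
    return wrap


@register(1)
def decode_document(payload: bytes) -> dict:
    # A real decoder would parse protobuf fields; this one only records the size.
    return {"kind": "document", "size": len(payload)}


def decode_message(type_id: int, payload: bytes) -> dict:
    """Dispatch a payload to the decoder registered for its type ID."""
    decoder = DECODERS.get(type_id)
    if decoder is None:
        return {"kind": "unknown", "type_id": type_id}
    return decoder(payload)
```

Returning a placeholder for unknown type IDs, instead of raising, is one way to keep a parser tolerant of message types it does not yet model.
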
A document is built around a main Index.zip or directory containing .iwa files, metadata, and referenced media files; a two-pass loading system merges incremental updates.
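
The summary does not spell out what each pass does. One plausible reading, sketched here in Python under that assumption, is that the first pass collects every record by object identifier and the second folds later records over earlier ones. The record shape `(object_id, fields)` and the function name `load_objects` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (object identifier, decoded fields).
Record = Tuple[int, dict]


def load_objects(records: Iterable[Record]) -> Dict[int, dict]:
    """Merge base records and incremental updates in two passes."""
    # Pass 1: group every record by object identifier, keeping arrival order.
    grouped: Dict[int, List[dict]] = {}
    for object_id, fields in records:
        grouped.setdefault(object_id, []).append(fields)

    # Pass 2: fold each group so that later records override earlier ones.
    merged: Dict[int, dict] = {}
    for object_id, versions in grouped.items():
        result: dict = {}
        for fields in versions:
            result.update(fields)
        merged[object_id] = result
    return merged
```

Separating collection from merging means an update can appear anywhere in the input relative to the record it modifies, as long as the relative order of records for the same object is preserved.
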
The parser supports various document elements, including images, media (audio, video, 3D models), equations, tables, shapes, and charts, each with specific handling for its data structures and metadata.
Style inheritance and spatial information are key features: styles are resolved through inheritance chains, and elements are positioned using an infinite canvas model.
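
Resolving a style through an inheritance chain can be sketched as below. This is a Python illustration, not the package's Swift implementation; the `Style` class, its fields, and the `resolve` function are hypothetical, and the rule that a child's property overrides its parent's is an assumption about how such chains usually behave.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Style:
    """Hypothetical style node: its own properties plus an optional parent."""
    name: str
    properties: Dict[str, Any] = field(default_factory=dict)
    parent: Optional["Style"] = None


def resolve(style: Style) -> Dict[str, Any]:
    """Flatten a style by walking up to the root, then applying root-first."""
    chain: List[Style] = []
    seen = set()
    node: Optional[Style] = style
    while node is not None:
        if id(node) in seen:
            raise ValueError("cycle in style inheritance chain")
        seen.add(id(node))
        chain.append(node)
        node = node.parent

    resolved: Dict[str, Any] = {}
    # Apply ancestors first so that more specific styles override them.
    for ancestor in reversed(chain):
        resolved.update(ancestor.properties)
    return resolved
```

For example, a "Heading" style that sets only `size` and `bold` and inherits from a "Body" style would resolve to Body's `font` together with its own `size` and `bold`.
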
A visitor protocol provides callbacks for document traversal, offering fully resolved styles and decoded content in the correct reading order.
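
The visitor idea can be illustrated with a small Python sketch. The callback names `visit_paragraph` and `visit_image`, the `TextCollector` class, and the `(kind, payload)` element shape are all hypothetical; the real Swift protocol and its callbacks are described in the package documentation.

```python
from typing import Iterable, List, Protocol, Tuple


class DocumentVisitor(Protocol):
    """Hypothetical callbacks a traversal invokes for each element."""
    def visit_paragraph(self, text: str, style: dict) -> None: ...
    def visit_image(self, filename: str) -> None: ...


class TextCollector:
    """Example visitor that gathers a plain-text rendering of a document."""
    def __init__(self) -> None:
        self.lines: List[str] = []

    def visit_paragraph(self, text: str, style: dict) -> None:
        self.lines.append(text)

    def visit_image(self, filename: str) -> None:
        self.lines.append(f"[image: {filename}]")


def walk(elements: Iterable[Tuple[str, dict]], visitor: DocumentVisitor) -> None:
    """Call the matching visitor callback for each element, in input order."""
    for kind, payload in elements:
        if kind == "paragraph":
            visitor.visit_paragraph(payload["text"], payload.get("style", {}))
        elif kind == "image":
            visitor.visit_image(payload["filename"])
        # Other element kinds are ignored in this sketch.


collector = TextCollector()
walk([("paragraph", {"text": "Hello"}), ("image", {"filename": "cat.png"})], collector)
print(collector.lines)
```

Because the traversal owns the ordering and style resolution, a consumer only implements the callbacks it cares about, which is the main appeal of this design.
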
The code is available as a Swift package on GitHub, with documentation covering the visitor protocol and common use cases. Some features, such as support for the legacy XML format, are not yet implemented.
