Reverse Engineering iWork
a day ago
- #Protocol Buffers
- #Swift
- #iWork
- The app discussed ingests files but cannot parse .key, .numbers, or .pages files unless they are first exported to PDF or another format.
- The author built a parser that reads iWork files natively, with no conversion step and no server-side processing. The approach was inspired by a previous project that ported Perl to WebAssembly for client-side metadata extraction.
- Apple switched the iWork document format from XML to a binary format based on Google’s Protocol Buffers in 2013, likely to optimize performance for early iPhones and iPads.
- The parser recovers protobuf message descriptors from Apple's Pages, Keynote, and Numbers executables, which define the structure of every message type in the documents.
- Parsing means decompressing Snappy-compressed chunks (Apple uses its own custom Snappy implementation), then decoding protobuf messages whose type IDs are mapped to Swift classes.
- Each document has a main Index.zip or directory containing .iwa files, metadata, and referenced media files. A two-pass loading system merges incremental updates.
- The parser supports document elements such as images, media (audio, video, 3D models), equations, tables, shapes, and charts, each with handling specific to its data structures and metadata.
- Style inheritance and spatial information are key features, with styles resolved through inheritance chains and elements positioned using an infinite canvas model.
- A visitor protocol provides callbacks for document traversal, offering fully resolved styles and decoded content in the correct reading order.
- The code is available as a Swift package on GitHub, with documentation covering the visitor protocol and common use cases, though some features like legacy XML format support are not yet implemented.
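
Since the documents are built on Protocol Buffers, the lowest-level step in decoding any message is reading a base-128 varint, which the protobuf wire format uses for field tags, lengths, and integer values. The following is a minimal sketch of that one primitive, not the package's actual code; the names `decodeVarint` and `VarintError` are hypothetical.

```swift
/// Errors this illustrative decoder can report (hypothetical type).
enum VarintError: Error, Equatable {
    case truncated   // input ended before the final byte of the varint
    case overflow    // more than 10 bytes, which cannot fit in 64 bits
}

/// Decodes one protobuf base-128 varint starting at `offset`.
/// Returns the decoded value and the offset of the first byte after it.
func decodeVarint(_ bytes: [UInt8], at offset: Int = 0) throws -> (value: UInt64, next: Int) {
    var value: UInt64 = 0
    var shift: UInt64 = 0
    var index = offset
    while index < bytes.count {
        let byte = bytes[index]
        index += 1
        // The low 7 bits carry payload, least significant group first.
        value |= UInt64(byte & 0x7F) << shift
        // A clear high bit marks the last byte of the varint.
        if byte & 0x80 == 0 {
            return (value, index)
        }
        shift += 7
        if shift >= 64 { throw VarintError.overflow }
    }
    throw VarintError.truncated
}
```

For example, the two bytes `0xAC 0x02` decode to 300, and the returned `next` offset tells the caller where the following field or message begins.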
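
Resolving a style through an inheritance chain can be pictured as walking from the most specific style up through its parents and taking the first value found for each property. The sketch below illustrates that idea only; `StyleNode`, `ResolvedStyle`, `resolve`, and the three properties are hypothetical and are not taken from the package, whose real style types are likely much richer.

```swift
/// Hypothetical style record: every property is optional, and `parent`
/// points at the style this one inherits from (nil at the root).
final class StyleNode {
    let fontName: String?
    let fontSize: Double?
    let isBold: Bool?
    let parent: StyleNode?

    init(fontName: String? = nil, fontSize: Double? = nil,
         isBold: Bool? = nil, parent: StyleNode? = nil) {
        self.fontName = fontName
        self.fontSize = fontSize
        self.isBold = isBold
        self.parent = parent
    }
}

/// A style with no gaps left, ready to hand to a consumer.
struct ResolvedStyle: Equatable {
    var fontName: String
    var fontSize: Double
    var isBold: Bool
}

/// Walks the chain from `style` toward the root. For each property the
/// nearest style that sets it wins; `defaults` fills anything left unset.
func resolve(_ style: StyleNode, defaults: ResolvedStyle) -> ResolvedStyle {
    var fontName: String?
    var fontSize: Double?
    var isBold: Bool?
    var current: StyleNode? = style
    while let node = current {
        fontName = fontName ?? node.fontName
        fontSize = fontSize ?? node.fontSize
        isBold = isBold ?? node.isBold
        current = node.parent
    }
    return ResolvedStyle(fontName: fontName ?? defaults.fontName,
                         fontSize: fontSize ?? defaults.fontSize,
                         isBold: isBold ?? defaults.isBold)
}
```

Because `parent` is immutable and fixed at initialization, a chain built this way cannot contain a cycle, so the loop always terminates. A parser reading references from a file would need its own guard against cyclic or missing parents.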