Playing with the language modeling abilities of gzip

12 hours ago
  • #compression
  • #language-modeling
  • #artificial-intelligence
  • Compression and prediction are essentially the same task: a compressor that represents data compactly is implicitly predicting it, and a good predictor minimizes the information needed to encode what comes next.
  • Language Modeling Is Compression (2023) explored the generative abilities of gzip, but only with one-step-ahead predictions, which undersells what gzip can do.
  • gzipt improves on this by running beam search over bytes, which gives longer substrings the chance to show that they compress well and produces more structured text.
  • zlib can clone a compressor's state, which makes it well suited to text generation: the marginal cost of each candidate can be measured directly.
  • Other compression algorithms like bz2, zstd, brotli, and lzma fail in text generation due to quantization issues and lack of state cloning, leading to repetitive outputs.
  • Using 'span mode' with candidates from the corpus produces coherent text by making cost differences larger, though it reassembles existing text rather than generating novel content.
  • Compressors can answer multiple-choice questions like HellaSwag, with zstd and lzma achieving 32.5% and 33.0% accuracy, beating GPT-2-124M.
  • gzip can generate code in languages with flat, statement-local structures like SQL and CSS, producing valid and novel statements or rules, though not always useful.
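
The state-cloning idea in the bullets above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not gzipt's actual code: the helper names `marginal_cost` and `pick_best` are hypothetical, the candidates are scored greedily rather than with beam search, and the scores are not length-normalized. It relies only on the standard library's `zlib.compressobj()`, `Compress.copy()`, and `Compress.flush()`.

```python
import zlib


def marginal_cost(comp, candidate: bytes) -> int:
    """Extra compressed bytes needed to append `candidate` to what `comp` has seen.

    `comp` is a zlib compression object that has already been fed the context.
    It is never modified: each measurement runs on a clone.
    """
    # Size of the remaining output if the stream ended right now.
    baseline = len(comp.copy().flush())
    # Size of the remaining output if the candidate were appended first.
    trial = comp.copy()
    with_candidate = len(trial.compress(candidate)) + len(trial.flush())
    return with_candidate - baseline


def pick_best(context: bytes, candidates: list[bytes], level: int = 9) -> bytes:
    """Return the candidate that is cheapest to encode after `context`.

    The same scoring could be used to answer a multiple-choice question:
    the context is the prompt and the candidates are the answer options.
    """
    comp = zlib.compressobj(level)
    comp.compress(context)  # output discarded; only the internal state matters
    return min(candidates, key=lambda c: marginal_cost(comp, c))


if __name__ == "__main__":
    context = b"the quick brown fox jumps over the lazy dog. " * 20
    options = [b"the quick brown fox", b"zqxjv kwpyh mbgtf d"]
    print(pick_best(context, options))
```

Because the measured cost is a whole number of bytes, many candidates tie, especially short ones. That is one plausible reading of why the summary says longer spans make the cost differences larger and the output more coherent.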