Hasty Briefs

Chrome's New Embedding Model: Smaller, Faster, Same Quality

a year ago
  • #Machine Learning
  • #Chrome
  • #Quantization
  • Chrome's latest update introduces a new text embedding model that is 57% smaller (35.14 MB vs. 81.91 MB) than its predecessor while maintaining nearly identical performance in semantic search tasks.
  • The size reduction comes primarily from quantizing the embedding matrix from float32 to int8 precision, with no measurable degradation in embedding quality or search ranking (see the quantization sketch after this list).
  • The new model keeps an essentially identical architecture, with similar tensor counts (611 vs. 606) and the same input/output shapes ([1, 64] input, [1, 768] output), suggesting it was derived from the same base model, likely a transformer-based embedding architecture similar to BERT (see the model-inspection sketch below).
  • Despite the internal quantization, the new model's output embeddings remain full float32, with slightly higher effective precision (25.42 bits vs. 22.59 bits), pointing to quantization-aware training techniques (see the precision-estimation sketch below).
  • Testing on diverse queries showed virtually identical similarity scores (differences of 0.001-0.004), identical result rankings for most queries, and a slight speed improvement of 1-2% in inference (the similarity-comparison sketch below shows how such a check can be run).
  • The optimization delivers several benefits for Chrome users, including reduced storage footprint, faster browser updates, improved resource efficiency, consistent quality, and potential battery life improvements on mobile devices.
  • The approach demonstrates that selectively quantizing specific model components can be more effective than blanket quantization; this is particularly valuable for browsers and other edge applications where storage efficiency is critical but performance cannot be sacrificed (see the selective-vs-blanket sketch below).
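
A minimal sketch of the quantization step described above, assuming symmetric per-tensor int8 quantization of the embedding matrix. The matrix shape here is a placeholder; the actual vocabulary size and width of Chrome's model aren't stated in the summary:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder shape: a vocabulary of 30,522 tokens and 512-wide vectors.
embedding = rng.normal(0.0, 0.05, size=(30522, 512)).astype(np.float32)

# Symmetric per-tensor quantization: map the observed range onto [-127, 127].
scale = float(np.abs(embedding).max()) / 127.0
quantized = np.clip(np.round(embedding / scale), -127, 127).astype(np.int8)

# Dequantize at lookup time; the rest of the network still sees float32.
recovered = quantized.astype(np.float32) * scale

print(f"float32 size:  {embedding.nbytes / 1e6:.2f} MB")  # ~62.5 MB
print(f"int8 size:     {quantized.nbytes / 1e6:.2f} MB")  # ~15.6 MB, 4x smaller
print(f"max abs error: {np.abs(embedding - recovered).max():.6f}")
```

The 4x reduction applies only to the quantized tensor, which is why the whole model shrinks by 57% rather than 75%: the remaining tensors stay in float.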
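
The tensor-count and shape comparison could be reproduced along these lines, assuming the models ship as TensorFlow Lite flatbuffers (the format Chrome commonly uses for on-device models); the file names are placeholders:

```python
import tensorflow as tf

def summarize(path: str) -> None:
    """Print the tensor count and I/O shapes of a TFLite model."""
    interpreter = tf.lite.Interpreter(model_path=path)
    interpreter.allocate_tensors()
    tensors = interpreter.get_tensor_details()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    print(f"{path}: {len(tensors)} tensors, "
          f"input {inp['shape'].tolist()}, output {out['shape'].tolist()}")

summarize("old_embedding_model.tflite")  # expect ~606 tensors, [1, 64] -> [1, 768]
summarize("new_embedding_model.tflite")  # expect ~611 tensors, same shapes
```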
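
One plausible way to arrive at an "effective precision" figure like the bit counts above is to measure the finest gap between distinct output values relative to their range. This is an assumption about the methodology, not a confirmed reproduction of it:

```python
import numpy as np

def effective_bits(values: np.ndarray) -> float:
    """Bits needed to resolve the observed range at the finest observed gap."""
    u = np.unique(values.astype(np.float64))
    smallest_gap = np.diff(u).min()  # unique() guarantees strictly positive gaps
    return float(np.log2((u[-1] - u[0]) / smallest_gap))

# Sanity check on synthetic data: values snapped to a 2**-20 grid over a
# range of ~2 should measure roughly 21 effective bits.
rng = np.random.default_rng(0)
demo = np.round(rng.uniform(-1.0, 1.0, 10_000) * 2**20) / 2**20
print(f"{effective_bits(demo):.2f} effective bits")
```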
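
A sketch of the similarity and ranking check, where `embed_old` and `embed_new` are placeholders for whatever functions run the two models against a string (not a Chrome API):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare(query: str, docs: list[str], embed_old, embed_new) -> None:
    """Compare cosine scores and result rankings between the two models."""
    q_old, q_new = embed_old(query), embed_new(query)
    old_scores = [cosine(q_old, embed_old(d)) for d in docs]
    new_scores = [cosine(q_new, embed_new(d)) for d in docs]
    max_diff = max(abs(o - n) for o, n in zip(old_scores, new_scores))
    same_rank = np.argsort(old_scores).tolist() == np.argsort(new_scores).tolist()
    print(f"max score difference: {max_diff:.4f}")  # the article reports 0.001-0.004
    print(f"identical ranking:    {same_rank}")
```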
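
Finally, a toy illustration of why selective quantization pays off: the embedding matrix holds nearly all of the bytes and tolerates int8 well, while a small tensor with a wide value range (layer-norm style) contributes almost nothing to model size but suffers far more from quantization. All numbers here are synthetic:

```python
import numpy as np

def int8_roundtrip_error(w: np.ndarray) -> float:
    """Max absolute error after a symmetric int8 quantize/dequantize cycle."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return float(np.abs(w - q * scale).max())

rng = np.random.default_rng(1)
embedding = rng.normal(0.0, 0.05, size=(30522, 256))  # dominates the byte count
small = np.concatenate([rng.normal(1.0, 0.1, 256),    # layer-norm-style values...
                        rng.normal(0.0, 5.0, 4)])     # ...plus a few wide outliers

total = embedding.size + small.size
for name, w in [("embedding", embedding), ("small tensor", small)]:
    print(f"{name:12s}: {w.size:>9,} params ({w.size / total:.3%} of total), "
          f"max int8 error {int8_roundtrip_error(w):.4f}")
```

Quantizing only the embedding matrix captures essentially the entire size win, while leaving the sensitive small tensors in float32 costs almost nothing in storage, which is the trade-off the article describes.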