Semantic Search in Under 3MB
10 hours ago
- #quantization
- #model-optimization
- #semantic-search
- The project shrank a semantic search reranking model for a resume page from 11.4 MB to 2.79 MB gzipped, aiming to cut size while improving performance.
- Used term dropout to reduce overfitting and to discourage the model from falling back on exact keyword matching, which made it more robust on a small corpus.
- Used an LLM to mine queries from job postings as realistic training data, which gave an initial 21% boost in MRR (mean reciprocal rank).
- Architecture experiments: max pooling outperformed mean pooling, factorized embeddings saved parameters, SwiGLU showed no gain, and multi-vector late interaction improved token-level expressiveness.
- Reduced vocabulary from 30k to 5k tokens, decreased embedding dimensions, and applied aggressive quantization, including 1.58-bit ternary quantization for weights, cutting file size from 8.3 MB to 3.9 MB.
- Replaced ONNX Runtime Web with a custom WASM binary written in Rust, shrinking the inference code from 3.4 MB to 4 kB.
- The final model outperformed both the baseline and BM25, reaching nDCG@10 of 0.787 overall and 0.694 on a hard subset.
- Things that did not work: post-training factorization, attention pooling, SwiGLU, and a ternary cross-encoder. Extra training data also gave diminishing returns.
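
To make the term dropout idea above concrete, here is a minimal sketch. The summary does not say how the dropout was applied, so this assumes the simplest form: each input token is dropped independently with probability `p` during training, and at least one token is always kept. The function name `term_dropout` and the default `p=0.2` are illustrative, not taken from the project.

```python
import random

def term_dropout(tokens, p=0.2, rng=None):
    """Randomly drop tokens so the model cannot rely on exact keyword overlap.

    Assumption: independent per-token dropout; the original project may
    have used a different scheme (for example, dropping only shared terms).
    """
    if rng is None:
        rng = random.Random()
    # Keep each token with probability 1 - p, preserving order.
    kept = [t for t in tokens if rng.random() >= p]
    # Never return an empty sequence for a non-empty input.
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept

if __name__ == "__main__":
    query = ["senior", "rust", "engineer", "wasm", "inference"]
    print(term_dropout(query, p=0.4, rng=random.Random(7)))
```

Applied to training queries or documents, this forces the model to score a pair correctly even when the most obvious matching word is missing.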
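
The difference between the two pooling strategies compared above fits in a few lines. This is a generic sketch over plain lists of token vectors, not the project's code: mean pooling averages each dimension over all tokens, while max pooling keeps the largest value per dimension, so one strongly activated token can dominate. That may be part of why max pooling did better here, though the summary does not give a reason.

```python
def mean_pool(token_vecs):
    """Average each embedding dimension across all token vectors."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]

def max_pool(token_vecs):
    """Take the maximum of each embedding dimension across token vectors."""
    return [max(col) for col in zip(*token_vecs)]

if __name__ == "__main__":
    # Two tokens with 2-dimensional embeddings (made-up values).
    vecs = [[1.0, 0.0], [3.0, -2.0]]
    print(mean_pool(vecs))
    print(max_pool(vecs))
```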
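
Multi-vector late interaction, mentioned in the architecture bullet, keeps one vector per token instead of a single pooled vector. The sketch below assumes ColBERT-style MaxSim scoring (for each query token, take its best-matching document token, then sum). The summary does not confirm that this exact scoring rule was used.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

if __name__ == "__main__":
    # Made-up 2-dimensional token embeddings.
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc = [[1.0, 0.0], [0.5, 0.5]]
    print(maxsim_score(query, doc))
```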
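
The "1.58-bit" figure in the quantization bullet comes from log2(3) ≈ 1.585: each weight takes one of three values, -1, 0, or +1. The sketch below assumes an absmean scheme in the style of BitNet b1.58 (scale by the mean absolute weight, round, clip) and a base-3 packing of five weights per byte, which works because 3^5 = 243 fits in 256 and costs 1.6 bits per weight. Neither detail is confirmed by the summary, and `ternarize`, `pack5`, and `unpack5` are illustrative names.

```python
def ternarize(weights):
    """Quantize floats to {-1, 0, +1} plus one float scale (absmean scheme)."""
    if not weights:
        return [], 1.0
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def pack5(trits):
    """Pack ternary values into bytes, five per byte, as base-3 digits."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        # The first value of each chunk ends up in the lowest base-3 digit.
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)
        out.append(byte)
    return bytes(out)

def unpack5(data, n):
    """Inverse of pack5; n is the original number of ternary values."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]

if __name__ == "__main__":
    w = [0.9, -0.05, -1.2, 0.4, 0.0, 0.7]
    q, scale = ternarize(w)
    packed = pack5(q)
    print(q, round(scale, 4), len(packed), unpack5(packed, len(q)) == q)
```

At inference time the approximate weight is `scale * q[i]`, so a matrix multiply reduces to additions and subtractions followed by one multiplication by the scale.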
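
For readers unfamiliar with the two metrics quoted above, here is a small reference sketch of MRR and nDCG@k using their standard definitions. The nDCG version uses linear gains. Some evaluations use the exponential form 2^rel - 1 instead, and the summary does not say which one the project used. The document IDs in the demo are made up.

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked_ids, gains, k=10):
    """nDCG@k with linear gains; `gains` maps doc id -> graded relevance."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

if __name__ == "__main__":
    runs = [(["a", "b"], {"b"}), (["c"], {"c"})]
    print(mean_reciprocal_rank(runs))
    print(ndcg_at_k(["x", "a"], {"a": 1.0}))
```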