

TurboQuant
#24 in AI Inference Hardware · google · since arXiv preprint: Apr 28, 2025; Google Research Blog announcement: Mar 24, 2026; ICLR 2026 conference presentation: April · 11× · last seen Jun 30, 2026
TurboQuant is a vector quantization algorithm developed by Google Research that compresses the KV cache of large language models to 3–4 bits per value. The method combines PolarQuant (rotation-based scalar quantization) with a 1-bit QJL residual correction step. According to Google, this yields at least a 6× reduction in KV cache memory with no measurable accuracy loss. TurboQuant requires neither training nor calibration and works on any transformer architecture. As of Q2 2026, Google has not released an official reference implementation; community implementations exist for PyTorch, vLLM, and llama.cpp.
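The two stages described above (rotate and scalar-quantize, then correct the residual with one sign bit per projection) can be illustrated with a toy sketch. This is not Google's implementation and not the paper's exact quantizer: the randomized Hadamard rotation, the uniform codebook, the Gaussian projection matrix, and all function names (`rotate`, `quantize`, `qjl_encode`, `qjl_inner`) are assumptions chosen to keep the example dependency-free.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotate(x, flips):
    """Random sign flips followed by an orthonormal Walsh-Hadamard transform.
    Assumption: stands in for the random rotation; len(x) must be a power of two."""
    y = [v * s for v, s in zip(x, flips)]
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                y[j], y[j + h] = y[j] + y[j + h], y[j] - y[j + h]
        h *= 2
    return [v / math.sqrt(n) for v in y]


def quantize(y, bits):
    """Uniform scalar quantizer over [-scale, scale] (assumption: the paper
    describes a tuned codebook; uniform levels keep the sketch short)."""
    levels = 2 ** bits
    scale = max(abs(v) for v in y) or 1.0
    step = 2 * scale / levels
    codes = [min(levels - 1, int((v + scale) / step)) for v in y]
    return codes, scale


def dequantize(codes, scale, bits):
    step = 2 * scale / 2 ** bits
    return [-scale + (c + 0.5) * step for c in codes]


def qjl_encode(residual, proj):
    """Keep one sign bit per random projection, plus the residual norm."""
    norm = math.sqrt(dot(residual, residual))
    signs = [1.0 if dot(row, residual) >= 0 else -1.0 for row in proj]
    return signs, norm


def qjl_inner(q, signs, norm, proj):
    """Estimate <q, residual> from the sign bits; unbiased for Gaussian rows."""
    acc = sum(dot(row, q) * s for row, s in zip(proj, signs))
    return math.sqrt(math.pi / 2) * norm * acc / len(proj)


rng = random.Random(0)
d, bits, m = 64, 3, 64  # hypothetical head dimension, code width, projections
key = [rng.gauss(0, 1) for _ in range(d)]
query = [rng.gauss(0, 1) for _ in range(d)]
flips = [rng.choice((-1.0, 1.0)) for _ in range(d)]
proj = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]

# The rotation is orthogonal, so <query, key> equals <rq, rk>.
rk, rq = rotate(key, flips), rotate(query, flips)
codes, scale = quantize(rk, bits)
approx = dequantize(codes, scale, bits)
residual = [a - b for a, b in zip(rk, approx)]
res_signs, res_norm = qjl_encode(residual, proj)

exact = dot(query, key)
coarse = dot(rq, approx)
corrected = coarse + qjl_inner(rq, res_signs, res_norm, proj)
print(f"exact={exact:.4f} coarse={coarse:.4f} corrected={corrected:.4f}")
```

The sign-bit correction is unbiased only in expectation, so for a single key and query the corrected estimate is not guaranteed to be closer than the coarse one; the benefit shows up when averaging over many attention logits.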
Features
| Attribute | Details |
| --- | --- |
| License | No official Google open-source release (as of Q2 2026); community implementations under MIT license |
| Platform | Model-agnostic (any transformer architecture); benchmarks on NVIDIA H100; community ports: PyTorch, vLLM, MLX/Apple Silicon, llama.cpp |
| Price | Not a commercial product; algorithm freely available as a research paper |
| Compute Performance (FLOPS/TOPS) | Up to 8× speedup in attention logit computation (4-bit TurboQuant vs. 32-bit unquantized) on NVIDIA H100 |
| Release Date | arXiv preprint: Apr 28, 2025; Google Research Blog: Mar 24, 2026; ICLR 2026 conference: Apr 2026 |
| Memory | KV cache compression to 3–4 bits/value; at least 6× reduction vs. FP16 (e.g., Llama 3.1 70B 128k: ~40 GB → ~7.5 GB KV cache) |
| Availability | Algorithm/paper: public (arXiv 2504.19874, ICLR 2026); official Google implementation: not yet released; community implementations: PyPI/GitHub (not from Google) |
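
The memory figures in the table can be sanity-checked with a short calculation. The sketch below assumes the publicly documented Llama 3.1 70B attention layout (80 layers, 8 KV heads, head dimension 128), a 128k context of 131,072 tokens, and sizes reported in GiB; none of these parameters appear in the listing itself, and `kv_cache_bytes` is a hypothetical helper. It ignores any per-block metadata such as scales or norms.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value):
    """Raw KV cache size: one key and one value vector per layer, head, token."""
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8


GIB = 2 ** 30
# Assumed Llama 3.1 70B layout (not stated in the listing).
layers, kv_heads, head_dim, tokens = 80, 8, 128, 128 * 1024

for bits in (16, 4, 3):
    size = kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits)
    print(f"{bits:>2} bits/value: {size / GIB:.1f} GiB")
```

Under these assumptions the FP16 cache comes to 40 GiB and the 3-bit cache to 7.5 GiB, matching the table's example. The raw bit-width ratio there is 16/3 ≈ 5.3×, so this arithmetic alone does not reproduce the "at least 6×" figure attributed to Google.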