Synthszr Charts: the major AI brands competing for the podium

TurboQuant

#24 in AI inference hardware

Google · since arXiv preprint: April 28, 2025; Google Research Blog announcement: March 24, 2026; ICLR 2026 conference presentation: April · 11× · last seen June 30, 2026

16
Momentum

TurboQuant is a vector quantization algorithm developed by Google Research that compresses the KV cache of large language models down to 3–4 bits per value. The method combines PolarQuant (rotation-based scalar quantization) with a 1-bit QJL residual correction step. According to Google, it achieves at least 6× KV cache memory reduction with no measurable accuracy loss. TurboQuant is training-free and calibration-free and works on any transformer architecture. No official Google reference implementation has been released as of Q2 2026; community implementations exist for PyTorch, vLLM, and llama.cpp.
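To make the idea of rotation-based scalar quantization concrete, here is a minimal sketch. It is not TurboQuant or PolarQuant: as assumptions, it uses a randomized Hadamard transform as a stand-in for the rotation, a plain uniform 4-bit quantizer per coordinate, and it omits the 1-bit QJL residual correction entirely. All function names are hypothetical.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def rotate(vec, signs):
    # Random sign flips followed by Hadamard: a cheap pseudo-random rotation
    # (assumed stand-in, not the rotation used in the paper).
    return fwht([v * s for v, s in zip(vec, signs)])


def unrotate(vec, signs):
    # The orthonormal Hadamard transform is its own inverse.
    return [v * s for v, s in zip(fwht(vec), signs)]


def quantize(vec, bits):
    """Uniform scalar quantization of each coordinate to `bits` bits."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / levels or 1.0  # avoid a zero step for constant vectors
    codes = [round((v - lo) / step) for v in vec]
    return codes, lo, step


def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]


def norm(vec):
    return math.sqrt(sum(v * v for v in vec))


rng = random.Random(0)
dim = 128  # a typical attention head dimension, chosen for illustration
key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]

codes, lo, step = quantize(rotate(key, signs), bits=4)
restored = unrotate(dequantize(codes, lo, step), signs)
error_norm = norm([a - b for a, b in zip(restored, key)])
print(f"relative reconstruction error: {error_norm / norm(key):.3f}")
```

Because the rotation is orthonormal, the reconstruction error in the original space equals the quantization error in the rotated space, which is why a simple per-coordinate quantizer can be applied after rotating.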

Momentum trend
(chart; x-axis from 04.04. to 03.07.)

Properties

License: No official Google open-source release (as of Q2 2026); community implementations under MIT license
Platform: Model-agnostic (any transformer architecture); benchmarks on NVIDIA H100; community ports: PyTorch, vLLM, MLX/Apple Silicon, llama.cpp
Price: Not a commercial product; algorithm freely available as a research paper
Compute Performance (FLOPS/TOPS): Up to 8× speedup in attention logit computation (4-bit TurboQuant vs. 32-bit unquantized) on NVIDIA H100
Release Date: arXiv preprint: Apr 28, 2025; Google Research Blog: Mar 24, 2026; ICLR 2026 conference: Apr 2026
Memory: KV cache compression to 3–4 bits/value; at least 6× reduction vs. FP16 (e.g., Llama 3.1 70B 128k: ~40 GB → ~7.5 GB KV cache)
Availability: Algorithm/paper: public (arXiv 2504.19874, ICLR 2026); official Google implementation: not yet released; community implementations: PyPI/GitHub (not from Google)
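The Llama 3.1 70B figures in the Memory entry can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes the publicly documented attention shape of that model (80 layers, 8 KV heads via grouped-query attention, head dimension 128), a 128k context of 131,072 tokens, and 3 bits per value with no extra metadata overhead; the function name is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value):
    """Bytes needed to store keys and values for one sequence."""
    values = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return values * bits_per_value // 8


GIB = 1024 ** 3

# Assumed Llama 3.1 70B attention shape: 80 layers, 8 KV heads (GQA), head dim 128.
shape = dict(layers=80, kv_heads=8, head_dim=128, tokens=128 * 1024)

fp16_gib = kv_cache_bytes(**shape, bits_per_value=16) / GIB
q3_gib = kv_cache_bytes(**shape, bits_per_value=3) / GIB
print(f"FP16: {fp16_gib:.1f} GiB, 3-bit: {q3_gib:.1f} GiB")
# prints "FP16: 40.0 GiB, 3-bit: 7.5 GiB"
```

Real implementations typically store small amounts of additional data (for example scales or norms), so actual footprints may differ slightly from this idealized count.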

