

Miso TTS
#16 in Speech Synthesis (TTS) · unknown · since 2026-06-03 · 2× · most recently 29 June 2026
Momentum: 18
Miso TTS 8B is an open-weight text-to-speech model from Miso Labs with 8 billion parameters, released on June 3, 2026. It uses a hierarchical residual vector quantization (RVQ) Transformer architecture, inspired by Sesame CSM, that comprises a 7.7B-parameter temporal backbone (Llama 3.2-style) and a 300M-parameter audio decoder. Speech generation is conditioned on text and, optionally, on audio input (conversation history), which enables one-shot voice cloning. The model is currently English-only, and its weights are available on Hugging Face under a modified MIT license.
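To make the "RVQ" part of the architecture concrete, here is a toy sketch of residual vector quantization in plain Python. It is not Miso TTS code: the codebooks, their sizes, and the function names (`rvq_encode`, `rvq_decode`) are made up for illustration, and real audio codecs learn far larger codebooks over higher-dimensional frames. The idea shown is only that each level quantizes what the previous level left over, so a frame becomes a short list of code indices ordered from coarse to fine.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, v))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks: Sequence[Sequence[Vector]], x: Vector) -> List[int]:
    """Quantize x level by level; each level encodes the residual of the previous one."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks: Sequence[Sequence[Vector]], codes: Sequence[int]) -> Vector:
    """Reconstruct the vector by summing the selected codeword from every level."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-level codebooks for 2-D "frames": coarse first, then fine.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

codes = rvq_encode(CODEBOOKS, [1.2, 0.9])
recon = rvq_decode(CODEBOOKS, codes)
print(codes, recon)
```

In hierarchical designs of this kind, a large model typically handles the coarse codes over time while a smaller model fills in the finer levels; whether Miso TTS splits the work exactly this way is an assumption here, not something the listing states.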
[Chart: momentum trend, 04.04. to 03.07.]
Properties
| Property | Value |
| --- | --- |
| Real-Time Capability | ~110 ms latency (time-to-first-byte on H100 hardware per vendor); local inference on consumer GPUs significantly slower |
| Model Size (Parameters) | ~8B total (7.7B backbone + 300M audio decoder) |
| Price Tier | Open-weight / free to self-host (modified MIT license); API access announced but not yet available |
| Supported Languages | English only (v1) |
| Voice Cloning | One-shot voice cloning from ~10-second reference audio (optional, via audio context conditioning) |
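
The note about consumer GPUs can be put in rough numbers with a back-of-the-envelope estimate of weight memory. The sketch below uses only the component sizes listed above (7.7B backbone plus 300M audio decoder); the precision options and the helper name `weight_memory_gb` are assumptions for illustration, since the listing does not say in which precision the weights are distributed. The estimate covers weights only and ignores activations, the attention cache, and framework overhead, so real usage would be higher.

```python
# Bytes needed to store one parameter at common precisions (assumed options).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

BACKBONE_PARAMS = 7_700_000_000   # 7.7B temporal backbone (from the listing)
DECODER_PARAMS = 300_000_000      # 300M audio decoder (from the listing)
TOTAL_PARAMS = BACKBONE_PARAMS + DECODER_PARAMS


def weight_memory_gb(params: int, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>9}: {weight_memory_gb(TOTAL_PARAMS, size):5.1f} GB")
```

Under these assumptions the components sum to 8.0B parameters, which works out to about 16 GB of weights at 16-bit precision; that already meets or exceeds the memory of many consumer GPUs before any runtime overhead is counted.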