

Miso One
#10 in Multimodal Models · miso · since 2026-06-03 · 9× · most recently 2026-06-29
Miso One (official model name: MisoTTS 8B) is a text-to-speech model released by Miso Labs on June 3, 2026, featuring 8 billion parameters and open weights under a modified MIT license. The model is based on a hierarchical RVQ Transformer architecture (7.7B-parameter backbone + 300M-parameter audio decoder) and accepts both text and optional audio context as input to generate expressive, tone-conditioned English speech. Miso Labs claims a time-to-first-byte latency of 110 ms on H100-class hardware (hosted API); local inference on consumer GPUs is materially slower according to the GitHub repository. One-shot voice cloning is supported from approximately 10 seconds of reference audio; generated audio is watermarked by default via SilentCipher.
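To make the "RVQ" part of the architecture name concrete, here is a minimal sketch of residual vector quantization in general. It is not Miso One's or Mimi's implementation: the two tiny codebooks, the 2-D vectors, and the function names `nearest`, `rvq_encode`, and `rvq_decode` are all hypothetical and chosen only for illustration. Each stage quantizes whatever error the previous stage left behind, so a frame ends up described by one index per codebook.

```python
# Toy residual vector quantization (RVQ) on 2-D vectors.
# Assumption: codebooks are hand-picked for illustration and are unrelated
# to the actual Mimi / Miso One codebooks.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(codebooks, vec):
    """Quantize stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry; the next stage sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

CODEBOOKS = [
    # Stage 1: coarse grid
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    # Stage 2: fine corrections around the coarse choice
    [(-0.25, -0.25), (-0.25, 0.25), (0.25, -0.25), (0.25, 0.25)],
]

codes = rvq_encode(CODEBOOKS, (0.8, 0.3))
approx = rvq_decode(CODEBOOKS, codes)
print(codes, approx)
```

With more stages the reconstruction gets finer while each individual codebook stays small, which is the general appeal of stacking many codebooks per audio frame.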
Properties
| Property | Value |
| --- | --- |
| Context Window (Tokens) | Maximum sequence length: 2,048 tokens (Mimi audio tokenizer with 32 audio codebooks of 2,048 entries each; text vocabulary: 128,256 tokens) |
| Multimodal Inputs | Text plus optional audio context (for one-shot voice cloning and voice continuation). Output: Mimi audio codes or an audio file. No image or video input. Currently English only. |