

Ink-2
#3cartesia · v2 · since June 16, 2026 · 6× · most recently June 29, 2026
Cartesia Ink-2 is a streaming speech-to-text (STT) model purpose-built for real-time voice agents. It is based on a State Space Model (SSM) architecture rather than transformers, and Cartesia claims it has the lowest Word Error Rate (WER) of any streaming STT model. The model features native turn detection (turn.start, turn.eager_end, turn.end) with no external voice activity detection (VAD) required, and uses semantic endpointing, which judges whether a turn is complete by its meaning rather than by silence. Ink-2 was released alongside Sonic-3.5 on June 16, 2026, debuting at rank #1 on the Artificial Analysis streaming STT leaderboard. At launch it supports English only; multilingual support is announced as forthcoming.
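To make the three turn events more concrete, here is a minimal sketch of how a voice agent might react to them. Only the event names (turn.start, turn.eager_end, turn.end) come from the description above. The `TurnTracker` class, the `handle` method, the action strings, and the assumption that each event carries a transcript string are all hypothetical and are not taken from Cartesia's SDK or wire format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnTracker:
    """Hypothetical helper: maps turn.* events to agent actions."""

    in_turn: bool = False
    # Transcript that an early (speculative) LLM call was started with, if any.
    speculative: Optional[str] = None
    committed: List[str] = field(default_factory=list)

    def handle(self, event_type: str, transcript: str = "") -> str:
        """Return the action the agent should take for this event."""
        if event_type == "turn.start":
            # The user began speaking: reset any speculative state.
            self.in_turn = True
            self.speculative = None
            return "listen"

        if event_type == "turn.eager_end" and self.in_turn:
            # Assumed semantics: the turn is probably over, so the LLM
            # call can start early to hide latency.
            self.speculative = transcript
            return "start_speculative_reply"

        if event_type == "turn.end" and self.in_turn:
            # The turn is confirmed as finished.
            self.in_turn = False
            self.committed.append(transcript)
            if self.speculative is None:
                return "start_reply"
            if self.speculative == transcript:
                return "commit_reply"
            # The user kept talking after the eager end: redo the reply.
            return "restart_reply"

        # Unknown event, or a turn event outside an active turn.
        return "ignore"
```

In this sketch the speculative reply is only kept when the final transcript matches the one seen at turn.eager_end; otherwise the agent discards it and answers the final transcript instead. Whether the real events behave this way should be checked against Cartesia's documentation.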
Features
| Feature | Details |
| --- | --- |
| Latency (ms) | Time to final transcript: 100 ms (0.1 s); sub-350 ms partial latency; turn.eager_end further reduces the gap between the last word and the first LLM response |
| Multilingualism (Dialects) | English only at launch; other languages require fallback to ink-whisper; multilingual support for Ink-2 explicitly announced as 'in progress' |
| On-Device Execution | VPC/on-premise deployment available for enterprise customers (mentioned as a decision criterion for Cartesia vs. alternatives) |
| Languages | English only (at launch); multilingual support announced as in development |
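
The language rows above imply a simple routing rule: English audio goes to Ink-2 and everything else falls back to ink-whisper. A minimal sketch of that rule follows. The function name `pick_stt_model` is hypothetical, and the identifier string `"ink-2"` is an assumption (only `ink-whisper` is named in the table), so the actual model IDs should be verified before use.

```python
def pick_stt_model(language: str) -> str:
    """Pick an STT model name from a language tag such as 'en' or 'en-US'."""
    # Normalize 'EN_gb' or 'en-US' to the primary subtag 'en'.
    primary = language.strip().lower().replace("_", "-").split("-")[0]
    # "ink-2" is an assumed identifier; "ink-whisper" is the fallback
    # named in the table for non-English audio.
    return "ink-2" if primary == "en" else "ink-whisper"
```

For example, `pick_stt_model("en-US")` selects the Ink-2 identifier, while `pick_stt_model("de")` selects the fallback.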