

MAI-Voice-1
#22 in Real-Time Voice Agents · microsoft · v1 · first introduced August 28, 2025 (blog post "Two in-house models"); broader public-preview launch in Microsoft Foundr · 14× · most recently June 30, 2026
MAI-Voice-1 is Microsoft's first in-house text-to-speech model, developed by the Microsoft AI (MAI) team under Mustafa Suleyman. It generates highly expressive, natural-sounding speech and can produce 60 seconds of audio in under one second on a single GPU. The model supports voice cloning from just a few seconds of audio (Personal Voice feature, gated/approval-based), fine-grained per-turn emotion control via SSML, and long-form content generation with consistent speaker identity. It is available via Azure Speech / Microsoft Foundry in public preview and powers features such as Copilot Audio Expressions and Copilot Podcasts.
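Per-turn emotion control via SSML can be pictured as one styled element per conversational turn. The sketch below builds such a document with Python's standard library only. It assumes the model honors the `mstts:express-as` element documented for Azure Speech SSML, which this page does not confirm. The voice name `HYPOTHETICAL-MAI-VOICE-1`, the style values, and the helper `build_ssml` are placeholders for illustration, not documented identifiers.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"  # namespace used in Azure Speech SSML examples
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize with a default namespace for SSML and the "mstts" prefix for extensions.
ET.register_namespace("", SSML_NS)
ET.register_namespace("mstts", MSTTS_NS)


def build_ssml(turns, voice="HYPOTHETICAL-MAI-VOICE-1", lang="en-US"):
    """Return an SSML string with one express-as element per (style, text) turn.

    The voice name is a placeholder; look up the real one in the service's voice list.
    """
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    for style, text in turns:
        # Assumption: each turn's emotion is selected through the "style" attribute.
        turn = ET.SubElement(voice_el, f"{{{MSTTS_NS}}}express-as", {"style": style})
        turn.text = text  # ElementTree escapes &, <, > when serializing
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml([
        ("cheerful", "Welcome back to the show!"),
        ("sad", "Unfortunately, today's guest had to cancel."),
    ]))
```

Building the markup with an XML library rather than string concatenation keeps user-supplied text from breaking the document when it contains characters such as `&` or `<`.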
Properties
| Property | Details |
| --- | --- |
| Real-Time Streaming | Supports both streaming and batch synthesis; 60 sec of audio in <1 sec on a single GPU |
| Latency | Sub-100 ms latency for interactive workloads via the Azure Speech SDK |
| License | Proprietary; Microsoft holds full licensing rights for commercial use; currently in public preview with no SLA |
| Platform | Azure Speech, Microsoft Foundry, MAI Playground, Copilot (Audio Expressions, Podcasts) |
| Price | From $22 per 1M characters (Azure Speech / Foundry) |
| Release Date | Aug 28, 2025 (announcement); Apr 2, 2026 (public preview in Foundry) |
| Languages | Optimized for English (US); multilingual coverage comes only with the successor, MAI-Voice-2 (more than 10 languages) |
| Voice Cloning | Yes, via Personal Voice feature from a 10-second audio sample; requires approval (Responsible AI process) |
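
The price and throughput figures in the table lend themselves to a quick back-of-the-envelope estimate. The sketch below uses only the listed numbers ($22 per 1M characters as the starting price, and 60 seconds of audio per second of compute as a lower bound on speed). It assumes every character of the input is billed, which the page does not state, and the function names are illustrative.

```python
PRICE_PER_MILLION_CHARS_USD = 22.0       # "from $22 per 1M characters", so a floor
AUDIO_SECONDS_PER_COMPUTE_SECOND = 60.0  # from "60 sec of audio in <1 sec"


def estimate_cost_usd(text: str) -> float:
    """Estimated minimum cost, assuming every character (spaces included) is billed."""
    return len(text) / 1_000_000 * PRICE_PER_MILLION_CHARS_USD


def max_generation_seconds(audio_seconds: float) -> float:
    """Upper bound on single-GPU compute time implied by the quoted throughput."""
    return audio_seconds / AUDIO_SECONDS_PER_COMPUTE_SECOND


if __name__ == "__main__":
    script = "word " * 100_000  # 500,000 characters of placeholder text
    print(f"Estimated cost: ${estimate_cost_usd(script):.2f}")
    print(f"A 10-minute episode needs at most {max_generation_seconds(600):.0f} s of compute")
```

Both results are bounds rather than quotes: the price is listed as a starting rate, and the throughput figure excludes network and queueing time.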