

Qwen3-VL-30B-A3B
#41 in Multimodal models · qwen · v3 · vl 30b a3b · since 2025-10-04 · 2× · last seen 29 June 2026
10
Momentum
Qwen3-VL-30B-A3B is a multimodal vision-language model from Alibaba's Qwen team built on a Mixture-of-Experts (MoE) architecture: it has 30.5B total parameters, of which only about 3.3B are activated per token. It processes text, images, and video within a single context and natively supports a 256K-token context window (extensible to 1M tokens). It is released as an open-weight model under the Apache 2.0 license and can be deployed locally (for example, with 4-bit quantization on a machine with 32 GB of RAM) or accessed via cloud APIs.
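The parameter counts and the 4-bit local deployment mentioned above can be related with some back-of-the-envelope arithmetic. The sketch below is only a rough estimate: it counts weight storage alone and assumes exactly 4 or 16 bits per weight, ignoring quantization metadata, the KV cache, activations, and runtime overhead. The function name `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
# Rough weight-memory estimate for an MoE model (illustrative only).
# Assumption: every parameter is stored at the same bit width, and nothing
# besides the weights (KV cache, activations, runtime overhead) is counted.

TOTAL_PARAMS = 30.5e9   # total parameters, from the description above
ACTIVE_PARAMS = 3.3e9   # approximately activated parameters, from the description above


def weight_memory_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    total_bytes = total_params * bits_per_weight / 8
    return total_bytes / 2**30


w4 = weight_memory_gib(TOTAL_PARAMS, 4)     # 4-bit quantized weights
w16 = weight_memory_gib(TOTAL_PARAMS, 16)   # 16-bit weights, for comparison
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"4-bit weights:  ~{w4:.1f} GiB")
print(f"16-bit weights: ~{w16:.1f} GiB")
print(f"Active share of parameters: ~{active_fraction:.1%}")
```

Under these assumptions the 4-bit weights come to roughly 14 GiB, which is consistent with the 32 GB RAM figure once working memory is added on top. Only about a tenth of the parameters are active at a time, which reduces compute, but in a typical setup all experts still have to be resident in memory.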
Momentum trend
(chart not reproduced; axis labels: 04.04., 03.07.)
Properties
| Property | Value |
| --- | --- |
| Context Window (Tokens) | Native 256K tokens (262,144 per API documentation); expandable to 1M tokens; max output 32,768 tokens |
| Multimodal Inputs | Text, images, and videos (including interleaved / multi-image multi-turn); plus OCR in 32 languages, 2D/3D spatial grounding, GUI screenshots |
| On-Device vs. Cloud | Both possible: open-weight (Apache 2.0), locally deployable via vLLM / SGLang / llama.cpp / Ollama (runs with 4-bit quantization on 32 GB RAM); cloud API via OpenRouter, DeepInfra, SiliconFlow, and others |
| Price per Unit | $0.13 per million input tokens / $0.52 per million output tokens (Instruct variant via OpenRouter) |
| Vision-Language Benchmark Score (Instruct variant) | DocVQA (test): 95.0%; ScreenSpot: 94.7%; OCRBench: 90.3%; MMLU-Redux: 88.4%; MMBench-V1.1: 87.0% |
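The per-token prices and token limits listed in the table can be combined into a quick cost estimate for a single request. This is a minimal sketch: `request_cost_usd` is a hypothetical helper, and the rule that input plus output must fit in the native context window is an assumption made for illustration, since providers may count limits differently.

```python
# Illustrative cost estimate using the figures listed in the table above.
INPUT_USD_PER_M = 0.13     # USD per million input tokens (Instruct variant via OpenRouter)
OUTPUT_USD_PER_M = 0.52    # USD per million output tokens
CONTEXT_WINDOW = 262_144   # native context window, in tokens
MAX_OUTPUT = 32_768        # maximum output tokens


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one request at the listed prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_OUTPUT:
        raise ValueError("output exceeds the listed maximum of 32,768 tokens")
    # Assumption: input and output together must fit in the native window.
    if input_tokens + output_tokens > CONTEXT_WINDOW:
        raise ValueError("request exceeds the native 262,144-token window")
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Example: a long document (100,000 input tokens) with a 2,000-token answer.
cost = request_cost_usd(100_000, 2_000)
print(f"Estimated cost: ${cost:.5f}")
```

At these prices the example request costs about $0.014. Output tokens are four times as expensive as input tokens, so long generations dominate the bill faster than long prompts do.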