

LocateAnything
#32 in Multimodal Models · nvidia · since 2026-06-29 · 3× · latest: June 29, 2026
Momentum: 12
NVIDIA LocateAnything-3B is a 3-billion-parameter vision-language model (VLM) for visual grounding, part of the Eagle VLM family. It pairs a MoonViT-SO-400M vision encoder with a Qwen2.5-3B language model and introduces Parallel Box Decoding (PBD), which predicts complete bounding boxes in a single parallel step instead of token-by-token autoregressive generation. The model was trained on a dataset of 12 million images, 138 million queries, and 785 million bounding boxes. It is released under NVIDIA's non-commercial research and development license.
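The difference between token-by-token decoding and Parallel Box Decoding can be illustrated with a toy step counter. This is only a conceptual sketch, not NVIDIA's implementation: the function names, the one-token-per-coordinate layout, and the "one step per box" reading of PBD are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def decode_autoregressive(
    predict_coord: Callable[[int, int, Tuple[float, ...]], float],
    num_boxes: int,
) -> Tuple[List[Box], int]:
    """Emit each coordinate in its own sequential step (assumed: 1 token per coordinate)."""
    boxes: List[Box] = []
    steps = 0
    for b in range(num_boxes):
        coords: List[float] = []
        for i in range(4):
            # Each coordinate is conditioned on the coordinates emitted so far.
            coords.append(predict_coord(b, i, tuple(coords)))
            steps += 1
        boxes.append((coords[0], coords[1], coords[2], coords[3]))
    return boxes, steps


def decode_parallel(
    predict_box: Callable[[int], Box],
    num_boxes: int,
) -> Tuple[List[Box], int]:
    """Emit all four coordinates of a box in one step (assumed reading of PBD)."""
    boxes: List[Box] = []
    steps = 0
    for b in range(num_boxes):
        boxes.append(predict_box(b))
        steps += 1
    return boxes, steps


# Hypothetical stand-ins for the model's prediction heads.
def toy_coord(b: int, i: int, prev: Tuple[float, ...]) -> float:
    return float(10 * b + i)


def toy_box(b: int) -> Box:
    return (float(10 * b), float(10 * b + 1), float(10 * b + 2), float(10 * b + 3))


if __name__ == "__main__":
    ar_boxes, ar_steps = decode_autoregressive(toy_coord, num_boxes=3)
    pb_boxes, pb_steps = decode_parallel(toy_box, num_boxes=3)
    print(ar_steps, pb_steps)
```

Under these assumptions the parallel variant needs a quarter of the sequential steps for the same boxes. A real tokenizer may spend more than one token per coordinate, which would widen the gap.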
Features
| Feature | Details |
| --- | --- |
| Context Window (Tokens) | 25,000 tokens (25K); optional MagiAttention integration planned for long sequences ≥32K |
| Price per Unit | Free (open weights); NVIDIA license for non-commercial use (research & development); commercial use not permitted |
| Video Analysis Capability | Supports object localization (pointing) in images and videos; the official Hugging Face Space explicitly accepts photo and video uploads for localization |