Synthszr Charts: the major AI brands competing for the podium

LocateAnything

#32 in Multimodal Models

nvidia · since 2026-06-29 · 3× · last seen Jun 29, 2026

Momentum

NVIDIA LocateAnything-3B is a 3-billion-parameter vision-language model (VLM) for visual grounding, part of the Eagle VLM family. It pairs a MoonViT-SO-400M vision encoder with a Qwen2.5-3B language model and introduces Parallel Box Decoding (PBD), which predicts complete bounding boxes in a single parallel step instead of token-by-token autoregressive generation. The model was trained on a dataset of 12 million images, 138 million queries, and 785 million bounding boxes. It is released under NVIDIA's non-commercial research and development license.
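To make the difference between token-by-token generation and Parallel Box Decoding concrete, the following is a toy cost model only, not NVIDIA's implementation. It assumes that a bounding box is encoded as four coordinate tokens and that a parallel scheme emits all coordinates of one box in a single decode step. Both assumptions, and the function names, are illustrative and do not come from the model card.

```python
def autoregressive_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Sequential decode steps when each coordinate token is emitted one at a time.

    tokens_per_box=4 (x1, y1, x2, y2) is an assumption for illustration.
    """
    return num_boxes * tokens_per_box


def parallel_box_steps(num_boxes: int) -> int:
    """Sequential decode steps when all coordinates of a box are emitted together.

    Assumes one decode step per complete box (hypothetical simplification).
    """
    return num_boxes


if __name__ == "__main__":
    n = 10  # hypothetical number of boxes in one response
    ar = autoregressive_steps(n)
    pbd = parallel_box_steps(n)
    print(f"autoregressive: {ar} steps, parallel boxes: {pbd} steps")
    print(f"step reduction factor: {ar / pbd:.1f}")
```

Under these assumptions the number of sequential steps drops by the number of tokens per box; actual latency gains depend on the real token layout and decoding details, which this sketch does not model.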

Momentum trend


Features

Context Window (Tokens)	25,000 tokens (25K); optional MagiAttention integration planned for long sequences ≥32K
Price per Unit	Free (open weights) under NVIDIA's non-commercial license for research and development; commercial use is not permitted
Video Analysis Capability	Supports object localization (pointing) in images and videos; the official Hugging Face Space explicitly accepts photo and video uploads for localization

