LocateAnything
https://research.nvidia.com/labs/lpr/locate-anything/static/videos/demo.mp4
https://research.nvidia.com/labs/lpr/locate-anything/?linkId=100000424057485#method
Project
https://github.com/NVlabs/Eagle/tree/main/Embodied
NVlabs/Eagle/Embodied
https://research.nvidia.com/labs/lpr/locate-anything/LocateAnything.pdf
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
https://huggingface.co/nvidia/LocateAnything-3B
nvidia/LocateAnything-3B
A VLM that thoroughly redefines and retrains object detection as a language generation problem
Moon-ViT vision encoder + Qwen2.5 language decoder
PBD (Parallel Box Decoding)
https://gyazo.com/fe991ec34a5c6299583104df2e4d000f
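NTP (next-token prediction) emits output one token at a time, left to right
LocateAnything restricts its output to a localization format, so it decodes block by block to match that structure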
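The contrast between the two decoding styles can be illustrated with a toy sketch. This is not the LocateAnything implementation: the block size (`TOKENS_PER_BOX = 4`, one token per box coordinate), the idea that one block equals one box, and the functions `decode_ntp`, `decode_blockwise`, `toy_predict_token`, and `toy_predict_block` are all assumptions made for illustration. The sketch only counts how many sequential decoding steps each style needs to produce the same output.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption for illustration: each box is 4 coordinate tokens (x1, y1, x2, y2).
TOKENS_PER_BOX = 4


def decode_ntp(
    predict_token: Callable[[Sequence[int]], int], num_boxes: int
) -> Tuple[List[int], int]:
    """Next-token prediction: one sequential step per token."""
    tokens: List[int] = []
    steps = 0
    for _ in range(num_boxes * TOKENS_PER_BOX):
        tokens.append(predict_token(tokens))
        steps += 1
    return tokens, steps


def decode_blockwise(
    predict_block: Callable[[Sequence[int]], List[int]], num_boxes: int
) -> Tuple[List[int], int]:
    """Block-wise decoding: one sequential step per fixed-size block."""
    tokens: List[int] = []
    steps = 0
    for _ in range(num_boxes):
        block = predict_block(tokens)
        if len(block) != TOKENS_PER_BOX:
            raise ValueError("each block must contain exactly one box")
        tokens.extend(block)
        steps += 1
    return tokens, steps


# Hypothetical stand-ins for a model: they just continue a counter.
def toy_predict_token(prefix: Sequence[int]) -> int:
    return len(prefix)


def toy_predict_block(prefix: Sequence[int]) -> List[int]:
    start = len(prefix)
    return [start + i for i in range(TOKENS_PER_BOX)]


ntp_tokens, ntp_steps = decode_ntp(toy_predict_token, num_boxes=3)
blk_tokens, blk_steps = decode_blockwise(toy_predict_block, num_boxes=3)
print(ntp_steps, blk_steps)
```

Under these assumptions both decoders produce the same token sequence, but the block-wise one needs one sequential step per box rather than one per token, which is only possible because the output format is fixed in advance.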
Related
Florence-2