Video Foundation Models
Video Recognition Meta Survey
VideoCoCa
Lavender
InternVideo
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Awesome-LLMs-for-Video-Understanding
SAMURAI
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
LongVU
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
Video search with RAG: an explanation of "VideoRAG"
VideoRAG: Retrieval-Augmented Generation over Video Corpus
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
You can now fine-tune open-source video models
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
What is particularly notable is that this model does not merely understand video content; it can also lay out its understanding process as explicit logical steps. As a result:
The basis for the AI's judgments becomes transparent
The results are more trustworthy
The causes of errors are easier to pinpoint
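The source does not describe the model's actual interface or prompt format, but the general pattern behind those three points is a model that emits an explicit reasoning trace before its final answer. The sketch below is a minimal, hypothetical illustration of that output contract only: the "Step N: ... / Answer: ..." layout and the stand-in model response are assumptions, not video-SALMONN-o1's real prompt template or API.

```python
import re
from typing import List, Tuple

# Hypothetical prompt asking a video LLM to expose step-by-step reasoning before answering.
PROMPT = (
    "Watch the clip and answer the question.\n"
    "First list your reasoning as 'Step 1:', 'Step 2:', ... lines,\n"
    "then give the final answer on a line starting with 'Answer:'.\n"
    "Question: {question}"
)

def parse_reasoned_answer(output: str) -> Tuple[List[str], str]:
    """Split a reasoning-style response into (reasoning steps, final answer)."""
    steps = re.findall(r"^Step \d+:\s*(.+)$", output, flags=re.MULTILINE)
    match = re.search(r"^Answer:\s*(.+)$", output, flags=re.MULTILINE)
    answer = match.group(1).strip() if match else ""
    return steps, answer

if __name__ == "__main__":
    # Stand-in for a real model response; an actual call would pass PROMPT plus video frames/audio.
    fake_output = (
        "Step 1: The speaker points at a whiteboard diagram.\n"
        "Step 2: The audio mentions 'attention weights' while the camera zooms in.\n"
        "Answer: The clip explains an attention mechanism."
    )
    steps, answer = parse_reasoned_answer(fake_output)
    for i, step in enumerate(steps, 1):
        print(f"reasoning step {i}: {step}")  # the trace that can be logged and audited
    print("final answer:", answer)
```

Exposing the trace this way is what makes the points above actionable: the steps can be logged, audited, and compared against the final answer when diagnosing a wrong prediction.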
Large-scale Pre-training for Grounded Video Caption Generation
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild