Video Foundation Models
Foundation Models
Image Recognition
Image Processing Theory
Tracking
Vision Language Model
Robot Foundation Models
Spatial AI
Video Recognition Meta Survey
https://hirokatsukataoka.net/temp/presen/241011VideoRecognition_MetaSurvey2024.pdf
VideoCoca
Lavender
InternVideo
Foundational Models Defining a New Era in Vision: A Survey and Outlook
https://twitter.com/Jeande_d/status/1684661924292669440?s=20
Awesome-LLMs-for-Video-Understanding
https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding?tab=readme-ov-file
SAMURAI
https://github.com/yangchris11/samurai
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
https://arxiv.org/abs/2411.04923
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
https://arxiv.org/abs/2410.16268
LongVU
https://github.com/Vision-CAIR/LongVU
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
https://arxiv.org/abs/2501.02955
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
https://arxiv.org/abs/2501.03218
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
https://github.com/jh-yi/Video-Panda
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
https://arxiv.org/abs/2501.08326
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
https://arxiv.org/abs/2501.07888
Video search with RAG: an explanation of "VideoRAG"
https://zenn.dev/knowledgesense/articles/9616c810383b53
VideoRAG: Retrieval-Augmented Generation over Video Corpus
https://arxiv.org/abs/2501.05874
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
https://arxiv.org/abs/2501.05510
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
https://arxiv.org/abs/2501.09781
You can now fine-tune open-source video models
https://replicate.com/blog/fine-tune-video
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
https://arxiv.org/pdf/2502.11775
What is particularly groundbreaking is that this model not only understands video content but can also explain its reasoning process in logical steps. As a result:
The rationale behind the AI's decisions becomes transparent
More reliable results can be obtained
The cause of errors becomes easier to identify
Large-scale Pre-training for Grounded Video Caption Generation
https://arxiv.org/abs/2503.10781
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild
https://arxiv.org/abs/2504.11326