Video Foundation Models
Foundation Models
Image Recognition
Image Processing Theory
Tracking
Vision Language Model
Robot Foundation Models
Spatial AI
Video Recognition Meta Survey
https://hirokatsukataoka.net/temp/presen/241011VideoRecognition_MetaSurvey2024.pdf
VideoCoca
Lavender
InternVideo
Foundational Models Defining a New Era in Vision: A Survey and Outlook
https://twitter.com/Jeande_d/status/1684661924292669440?s=20
Awesome-LLMs-for-Video-Understanding
https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding?tab=readme-ov-file
SAMURAI
https://github.com/yangchris11/samurai
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
https://arxiv.org/abs/2411.04923
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
https://arxiv.org/abs/2410.16268
LongVU
https://github.com/Vision-CAIR/LongVU
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
https://arxiv.org/abs/2501.02955
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
https://arxiv.org/abs/2501.03218
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
https://github.com/jh-yi/Video-Panda
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
https://arxiv.org/abs/2501.08326
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding
https://arxiv.org/abs/2501.07888
Video search with RAG: an explanation of "VideoRAG"
https://zenn.dev/knowledgesense/articles/9616c810383b53
VideoRAG: Retrieval-Augmented Generation over Video Corpus
https://arxiv.org/abs/2501.05874
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
https://arxiv.org/abs/2501.05510
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
https://arxiv.org/abs/2501.09781
You can now fine-tune open-source video models
https://replicate.com/blog/fine-tune-video
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
https://arxiv.org/pdf/2502.11775
What is particularly groundbreaking is that this model not only understands video content but can also explain its reasoning process in logical steps. As a result:
The rationale behind the AI's decisions becomes transparent
More reliable results can be obtained
The cause of errors becomes easier to identify
Large-scale Pre-training for Grounded Video Caption Generation
https://arxiv.org/abs/2503.10781
PVUW 2025 Challenge Report: Advances in Pixel-level Understanding of Complex Videos in the Wild
https://arxiv.org/abs/2504.11326