Sora - 基素基

Sora

https://openai.com/sora

OpenAI's text-to-video model

It also does image-to-video

A video generation AI

What it can do

Generate video from a still image

Generate missing frames

https://thebridge.jp/2024/05/the-first-music-video-generated-with-openais-unreleased-sora-model-is-here

https://openai.com/research/video-generation-models-as-world-simulators

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks [1, 2, 3], generative adversarial networks [4, 5, 6, 7], autoregressive transformers [8, 9] and diffusion models [10, 11, 12]. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size.

Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

How it works

https://gyazo.com/a9500e2050711ba6360cf4b5c6074c7d

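Raw video is compressed into a lower-dimensional latent space and converted into patches

Sora is trained in the latent space and generates video in the latent space

A visual decoder maps the latent video back to raw video

The compressed representation is then decomposed into spacetime patches

These patches can be used as transformer tokens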
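To make the "spacetime patch" step concrete, here is a minimal sketch in plain Python. The patch sizes and the toy latent are assumptions chosen for illustration; the Sora report does not publish the actual sizes or the compression network.

```python
from itertools import product

def spacetime_patches(latent, pt, ph, pw):
    """Split a latent video latent[t][y][x] into flattened spacetime patches.

    Each patch covers pt frames x ph rows x pw columns and is flattened
    into one list of values, i.e. one "token" for a transformer.
    Patch sizes are hypothetical, not values from the Sora report.
    """
    T, H, W = len(latent), len(latent[0]), len(latent[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    tokens = []
    for t0, y0, x0 in product(range(0, T, pt), range(0, H, ph), range(0, W, pw)):
        token = [latent[t][y][x]
                 for t in range(t0, t0 + pt)
                 for y in range(y0, y0 + ph)
                 for x in range(x0, x0 + pw)]
        tokens.append(token)
    return tokens

# Toy latent: 4 frames of 4x6 scalars, each value encoding its (t, y, x) position.
latent = [[[100 * t + 10 * y + x for x in range(6)] for y in range(4)]
          for t in range(4)]
tokens = spacetime_patches(latent, pt=2, ph=2, pw=3)
print(len(tokens), "tokens of length", len(tokens[0]))
```

Because a patch spans several frames as well as a spatial window, a single image is just the special case of a one-frame video.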

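Inspired by the approach of building LLMs from large amounts of internet data

LLMs succeeded by unifying many kinds of text through text tokens

Sora does the same for video

The counterpart of the text token is the visual patch (known to be an effective representation for models of visual data)

Visual patches scale well for training generative models on many kinds of videos and images

Sora is a diffusion model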

https://gyazo.com/f1756db6fdc443349ad92562d32a9027

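Sora is a diffusion transformer. Just as transformers have scaled across many domains, they turn out to scale effectively for video models too

In other words, sample quality improves markedly as training compute increases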
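A rough sketch of what "diffusion model" means at training time: a clean patch is mixed with Gaussian noise, and the model is scored on how well it recovers the clean patch. The `alpha_bar` parameterization below is the common DDPM-style convention and is an assumption here; the report does not describe Sora's noise schedule. Both "models" are hypothetical stand-ins, not neural networks.

```python
import math
import random

def add_noise(clean, alpha_bar, rng):
    """Forward diffusion: blend a clean patch with Gaussian noise.

    alpha_bar in (0, 1] is the fraction of signal variance kept
    (1.0 = clean, near 0.0 = almost pure noise). Assumed convention.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [math.sqrt(alpha_bar) * c + math.sqrt(1.0 - alpha_bar) * n
             for c, n in zip(clean, noise)]
    return noisy, noise

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def training_loss(model, clean, alpha_bar, rng):
    """The model sees the noisy patch and the noise level, and is scored
    on how well its output matches the clean patch."""
    noisy, _ = add_noise(clean, alpha_bar, rng)
    return mse(model(noisy, alpha_bar), clean)

clean_patch = [0.5, -1.0, 0.25, 2.0]

def identity_model(noisy, alpha_bar):
    # Hypothetical baseline: returns the noisy input unchanged.
    return noisy

def oracle_model(noisy, alpha_bar):
    # Hypothetical oracle: already knows the clean patch.
    return clean_patch

rng = random.Random(0)
print("identity loss:", training_loss(identity_model, clean_patch, 0.5, rng))
print("oracle loss:", training_loss(oracle_model, clean_patch, 0.5, rng))
```

A real diffusion transformer sits somewhere between these two extremes, and training pushes its loss toward the oracle's.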

Commentary

https://zenn.dev/mattyamonaca/articles/e234e57834d7ad

https://x.com/_akhaliq/status/1762678991549354121?s=20

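Until now, image generation models were trained on data cropped to a small resolution (256x256, etc.)

Training at native resolution brought several benefits

In practice, it can sample at arbitrary resolutions

Content can be produced for multiple devices

Prototypes can be made at low resolution

Output is heavy, so this matters 基素.icon

Framing improves

It learns from videos made to be shown to other people, so that makes sense 基素.icon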
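Why low-resolution prototyping matters can be seen by counting tokens. The sketch below assumes the output size is set by the size of the patch grid, and every stride and patch value in it is a hypothetical placeholder, since the report does not publish them.

```python
def token_grid(frames, height, width,
               time_stride=4, space_stride=8,
               patch_t=1, patch_h=2, patch_w=2):
    """Return the (t, h, w) patch grid and the token count for one video.

    All stride and patch values are hypothetical placeholders.
    Floor division silently drops any remainder at the edges.
    """
    latent = (frames // time_stride,
              height // space_stride,
              width // space_stride)
    grid = (latent[0] // patch_t, latent[1] // patch_h, latent[2] // patch_w)
    return grid, grid[0] * grid[1] * grid[2]

sizes = {
    "widescreen": (1080, 1920),
    "vertical": (1920, 1080),
    "low-res draft": (360, 640),
}
for name, (h, w) in sizes.items():
    grid, count = token_grid(frames=64, height=h, width=w)
    print(name, grid, count)
```

Under these assumed numbers the widescreen and vertical grids cost the same number of tokens, while the low-resolution draft needs roughly a ninth as many, which is why drafting small before a full-size render is attractive.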

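Language understanding improves

Requires paired "caption - video" data

Applies the re-captioning technique from DALL·E 3: train a captioner model (visual to text), then use it to caption the videos

Betker, James, et al. "Improving image generation with better captions." Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2.3 (2023):

Training on highly descriptive captions improves text fidelity as well as the overall quality of the videos

GPT turns the user's prompt into a detailed caption, which is passed to the video model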
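The two places where captions enter the pipeline can be sketched as plain data flow. Every function here is a hypothetical stand-in (no real captioner, GPT call, or video model is invoked); the point is only who feeds what to whom.

```python
def recaption_dataset(videos, captioner):
    """Training side: pair every video with a caption from the captioner model."""
    return [(captioner(video), video) for video in videos]

def generate(user_prompt, expand_prompt, video_model):
    """Inference side: expand the short user prompt into a detailed caption,
    then condition the video model on that caption."""
    detailed_caption = expand_prompt(user_prompt)
    return video_model(detailed_caption)

# Hypothetical stand-ins for the three models.
def toy_captioner(video):
    return f"A detailed description of {video}"

def toy_expander(prompt):
    return f"{prompt}, shot in daylight with a slow camera pan"

def toy_video_model(caption):
    return {"conditioned_on": caption}

pairs = recaption_dataset(["clip_001.mp4", "clip_002.mp4"], toy_captioner)
result = generate("a cat on a sofa", toy_expander, toy_video_model)
print(pairs[0])
print(result)
```

The expansion step exists so that prompts at inference time look like the descriptive captions the model saw during training.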

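Not only text-to-video but also image-to-video

Generated videos can also be extended

Extend a video backward in time

Looks like it creates the preceding footage from a video of the ending 基素.icon

Extend a video both backward and forward to make an infinite loop

Edit a video's style zero-shot

Uses SDEdit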
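The idea behind SDEdit is to add noise to the input up to an intermediate level and then denoise from there, so coarse structure survives while details are redrawn. Below is a toy sketch on a list of numbers. The noising formula, the `strength` parameter, and the `toward_style` denoiser are all assumptions for illustration; a real setup would use a trained denoiser conditioned on the edit prompt.

```python
import math
import random

def sdedit(source, strength, denoise_step, num_steps=10, rng=None):
    """SDEdit-style edit: partially noise the source, then denoise it.

    strength in [0, 1]: 0 returns the source untouched, 1 starts from
    pure noise and keeps nothing of the source.
    """
    rng = rng or random.Random()
    # 1) Perturb the source up to an intermediate noise level.
    x = [math.sqrt(1.0 - strength) * v + math.sqrt(strength) * rng.gauss(0.0, 1.0)
         for v in source]
    # 2) Run only the last `strength` fraction of the denoising schedule.
    steps = round(strength * num_steps)
    for i in range(steps, 0, -1):
        x = denoise_step(x, i / num_steps)
    return x

def toward_style(x, t):
    # Hypothetical denoiser: each step halves the distance to a "style" value of 3.0.
    return [v + 0.5 * (3.0 - v) for v in x]

src = [0.0, 1.0, -2.0]
for strength in (0.0, 0.3, 1.0):
    print(strength, sdedit(src, strength, toward_style, rng=random.Random(0)))
```

Low strength stays close to the source and high strength follows the denoiser almost entirely, which is the trade-off between faithfulness and the amount of restyling.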

Seamlessly interpolate between two completely different videos

Like a movie 基素.icon*2

Image generation up to 2048x2048

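Because scaling is working well, the physical world does not fall apart

This is amazing 基素.icon*2

Long-range coherence

3D consistency

Occlusion

Characters do not change from one moment to the next

The same subject can be shot from different cameras (object permanence)

Can simulate physical actions

The wording is "can sometimes simulate", so it only works some of the time 基素.icon

I think this is a hard task

Even the discussion section says it fails to model basic physics such as glass shattering

For example, the butterfly in the demo shows no sign of feeling the water's viscosity, so it looks physically wrong

Though some artists would probably use this kind of expression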

Simulate the Minecraft world: control the player while rendering the world and its dynamics

Is this different from zero-shot editing?

world simulator

The thing where an AI generates your field of view and you live in a convenient AI-made reality is coming, isn't it 基素.icon

Research Leads Bill Peebles & Tim Brooks

Systems Lead Connor Holmes

Contributors

@OpenAI: Introducing Sora, our text-to-video model.

Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions.

https://openai.com/sora

Prompt: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”

https://video.twimg.com/ext_tw_video/1758190624732512256/pu/vid/avc1/1280x720/UkX1I85YBuFLY26w.mp4?tag=12#.mp4

Plans to embed C2PA metadata