VIMA: General Robot Manipulation with Multimodal Prompts

Abstract

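Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. In robotics, however, task specification comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. These are often treated as different tasks and tackled by specialized models.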

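In this work, we show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts that interleave textual and visual tokens. We design VIMA, a transformer-based generalist robot agent that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. VIMA achieves strong scalability in both model capacity and data size. In the hardest zero-shot generalization setting, it outperforms prior SOTA methods by up to 2.9× in task success rate given the same training data. With 10× less training data, VIMA still performs 2.7× better than the top competing approach.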

Architecture

https://vimalabs.github.io/assets/videos/vima_arch_animation.mp4

The multimodal prompt is encoded with a pre-trained T5 model

Three kinds of tokens

text token

object token

action token

prompt tokens: the input prompt

consist of text and object tokens

history tokens: the interaction history

consist of object and action tokens
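To make the token layout concrete, here is a minimal sketch of the two sequences. The `Token` class, the `ALLOWED` table, `check_sequence`, and the example payloads are all hypothetical names made up for illustration; in the real model each token would carry an embedding rather than a string.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    kind: str     # "text", "object", or "action"
    payload: str  # stand-in for the embedding a real token would carry


# Which token kinds each sequence may contain, per the notes above.
ALLOWED = {
    "prompt": {"text", "object"},
    "history": {"object", "action"},
}


def check_sequence(role: str, tokens: List[Token]) -> List[Token]:
    """Return the tokens unchanged, or raise if a kind does not fit the role."""
    for token in tokens:
        if token.kind not in ALLOWED[role]:
            raise ValueError(f"{role} sequence cannot contain {token.kind} tokens")
    return tokens


# Hypothetical prompt that interleaves text and object tokens.
prompt = check_sequence("prompt", [
    Token("text", "put"),
    Token("object", "red block"),
    Token("text", "into"),
    Token("object", "green bowl"),
])

# Hypothetical history: observed objects and the actions taken so far.
history = check_sequence("history", [
    Token("object", "scene at t=0"),
    Token("action", "action at t=0"),
    Token("object", "scene at t=1"),
])
```

The point of the split is that action tokens only ever appear in the history, while text tokens only ever appear in the prompt; object tokens are shared by both.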

VIMA controller

outputs actions conditioned on the prompt and the history

causal transformer decoder

alternates self-attention and cross-attention layers

Something like Perceiver IO? yosider.icon

https://gyazo.com/be5bcc4a732a9991f5d11f0a1c9dc969
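A rough sketch of the controller idea, not the paper's implementation: history tokens pass through causal self-attention, then query the prompt tokens through cross-attention, and this pair is repeated. Everything here is an assumption made for illustration: single-head attention with identity projections (no learned weights, no layer norm or MLP), the self-then-cross order inside a block, and the names `attend`, `add`, and `controller`.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: List[Vector], keys: List[Vector], causal: bool) -> List[Vector]:
    """Scaled dot-product attention; values are the keys themselves (assumption)."""
    dim = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        # With causal=True, query i may only look at keys 0..i.
        visible = keys[: i + 1] if causal else keys
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in visible]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, visible)) for d in range(dim)])
    return out


def add(xs: List[Vector], ys: List[Vector]) -> List[Vector]:
    """Element-wise residual connection."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def controller(prompt: List[Vector], history: List[Vector], num_blocks: int = 2) -> List[Vector]:
    h = history
    for _ in range(num_blocks):
        h = add(h, attend(h, h, causal=True))        # causal self-attention over the history
        h = add(h, attend(h, prompt, causal=False))  # cross-attention: history queries the prompt
    return h
```

Because the self-attention is causal and the prompt is only read through cross-attention, the output at history position i depends on the whole prompt but only on history positions 0..i, which is what allows actions to be decoded autoregressively.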

VIMA-Bench

Benchmark for Multimodal Robot Learning

tasks

Simple Object Manipulation

Visual Goal Reaching

Novel Concept Grounding

One-shot Video Imitation

Visual Constraint Satisfaction

Visual Reasoning

We answer three main questions during experiments:

How does VIMA compare with prior SOTA transformer-based agents (Gato, Flamingo, and Decision Transformer) on a diverse collection of multimodal-prompted tasks?

What are the scaling properties of our approach in model capacity and data size?

https://gyazo.com/9a9931f242887df047a920b6ce5a98fe

L: level?yosider.icon

How do different visual tokenizers, prompt conditioning, and prompt encoding affect decision making?

https://gyazo.com/a5bb53c6eef2fc4cef8c0bb7f420ac08

Ablation on visual tokenizers

performance of VIMA-200M model across different visual tokenizers

Our proposed object tokens outperform all methods that learn directly from raw pixels, and Object Perceiver that downsamples the object sequence to a fixed number of tokens.

https://gyazo.com/a7f6bc8b5220375c3e0ff67df669fd51

Ablation on prompt conditioning.

We compare our method (xattn: cross-attention prompt conditioning) with a vanilla transformer decoder (gpt-decoder) across different model sizes

Cross-attention is especially helpful in low-parameter regime and for harder generalization tasks.

Conclusion

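Similar to GPT-3, a generalist robot agent should have an intuitive and expressive interface for human users to convey their intent. In this work, we introduce a novel multimodal prompting formulation that converts diverse robot manipulation tasks into a uniform sequence modeling problem. We propose VIMA, a conceptually simple transformer-based agent capable of solving tasks such as visual goal reaching, one-shot video imitation, and novel concept grounding with a single model. VIMA exhibits superior model and data scaling properties and provides a strong starting point for future work.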

We trained a transformer called VIMA that ingests *multimodal* prompt and outputs controls for a robot arm. A single agent is able to solve visual goal, one-shot imitation from video, novel concept grounding, visual constraint, etc. Strong scaling with model capacity and data!🧵 pic.twitter.com/hQNnACB8Ud

— Jim (Linxi) Fan (@DrJimFan) October 7, 2022

VIMA: General Robot Manipulation with Multimodal Prompts

Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan

NVIDIA, Stanford University, Macalester College, Caltech, Tsinghua, UT Austin

Submitted on 6 Oct 2022

https://arxiv.org/abs/2210.03094