CLIPort - Research Notes

CLIPort

CoRL21

CLIPort: What and Where Pathways for Robotic Manipulation

A two-stream architecture with semantic and spatial pathways for vision-based manipulation

CLIPort is an imitation-learning agent that combines the semantic understanding of CLIP with the spatial understanding of Transporter

Experiments were run both in simulation and in the real world; the authors claim the method is data-efficient in few-shot settings and generalizes well

Data distribution for the multi-task policy

simulation: 10 tasks

real: 9 tasks

A framework applicable to a wide range of tabletop settings that take language as input

The 10 simulation tasks involve unique instances on the order of thousands, yet the multi-task model reportedly outperformed single-task models on every task.

https://scrapbox.io/files/64e7fe7a7e63a0001b83875f.png

Question: does the dense information not come out of the image computation?

The network grows two branches, a semantic stream and a spatial stream

Semantic information

Extracted from RGB and encoded with what the paper calls a CLIP ResNet

Joined by late fusion (the spatial branch goes through its U-Net before the two are combined)

Spatial information

Extracted from RGB-D and encoded with what the paper calls a Transporter ResNet

The final output is a map of dense pixelwise features used to predict pick-and-place affordances
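The late-fusion idea in these notes can be sketched in a few lines. This is a toy illustration, not CLIPort's implementation: the feature maps are plain nested lists, `late_fuse` is a hypothetical helper, and elementwise addition is an assumed fusion operator (these notes do not record which one the paper uses).

```python
from typing import List

FeatureMap = List[List[float]]  # H x W grid of per-pixel scores (toy stand-in for a tensor)


def late_fuse(semantic: FeatureMap, spatial: FeatureMap) -> FeatureMap:
    """Fuse two same-sized per-pixel maps by elementwise addition (assumed operator)."""
    if len(semantic) != len(spatial) or any(
        len(r_sem) != len(r_spa) for r_sem, r_spa in zip(semantic, spatial)
    ):
        raise ValueError("semantic and spatial maps must have the same shape")
    return [
        [s + p for s, p in zip(r_sem, r_spa)]
        for r_sem, r_spa in zip(semantic, spatial)
    ]


# Hypothetical 2x3 outputs of the two streams, each computed independently.
semantic_map = [[0.1, 0.9, 0.0], [0.2, 0.0, 0.1]]
spatial_map = [[0.0, 0.5, 0.1], [0.3, 0.1, 0.0]]

# Fusion happens only at the end, which keeps one value per pixel (a dense map).
dense_map = late_fuse(semantic_map, spatial_map)
```

The point of the sketch is only the ordering: each stream produces its own full-resolution map first, and the combination step comes last.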

Problem setting: the task is to generate an action $\bm{a}$ from a goal-conditioned policy $\pi$.

At time $t$, the policy takes a visual observation $\bm{o}_t$ and a language instruction $\bm{l}_t$ as input:

$\pi(\gamma_t)=\pi(\bm{o}_t, \bm{l}_t)\rightarrow\bm{a}_t=(\mathcal{T}_{\text{pick}}, \mathcal{T}_{\text{place}})\in\mathcal{A}$

The action $\bm{a}$ consists of $\mathcal{T}_{\text{pick}}$ and $\mathcal{T}_{\text{place}}$, the end-effector poses for picking and placing, respectively.
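To make the input and output types concrete, here is a minimal sketch of the policy interface. All names (`Action`, `argmax_pixel`, `toy_policy`) are hypothetical, and taking the highest-scoring pixel of an affordance map is an assumption about how dense maps get turned into poses. A real end-effector pose carries more than a pixel coordinate.

```python
from dataclasses import dataclass
from typing import List, Tuple

Heatmap = List[List[float]]  # H x W affordance scores


@dataclass(frozen=True)
class Action:
    """a_t = (T_pick, T_place), reduced here to (row, col) pixel coordinates."""
    pick: Tuple[int, int]
    place: Tuple[int, int]


def argmax_pixel(heatmap: Heatmap) -> Tuple[int, int]:
    """Return the (row, col) of the highest score; the first occurrence wins on ties."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best


def toy_policy(pick_map: Heatmap, place_map: Heatmap) -> Action:
    """Stand-in for pi(o_t, l_t): assumes both affordance maps were already
    computed from the observation and the instruction."""
    return Action(pick=argmax_pixel(pick_map), place=argmax_pixel(place_map))


# Hypothetical 2x2 affordance maps.
action = toy_policy(
    pick_map=[[0.1, 0.7], [0.2, 0.0]],
    place_map=[[0.0, 0.1], [0.9, 0.3]],
)
```

Here `action.pick` is the pixel with score 0.7 and `action.place` the pixel with score 0.9, mirroring the pair $(\mathcal{T}_{\text{pick}}, \mathcal{T}_{\text{place}})$ in the notes.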

The full architecture is shown below

https://scrapbox.io/files/64e811987c9667001b26d08b.png

The semantic information is

spatial: relating to space

imbue: to permeate, to soak in