MM-ReAct

Probably uses GPT-3.5-turbo

MM-ReAct is a system paradigm that composes numerous vision experts with ChatGPT for multimodal reasoning and action.

https://multimodal-react.github.io/images/model_figure_2.gif
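To make the "ChatGPT plus vision experts" idea concrete, here is a minimal sketch of a reason-and-act loop in which a text-only model asks for a vision expert and receives its result back as text. Everything in it is an assumption made for illustration: the `ACTION`/`OBSERVATION`/`ANSWER` line format, the expert names, the canned expert outputs, and `fake_llm` (a scripted stand-in for the chat model) are hypothetical and are not taken from the MM-ReAct code.

```python
import re
from typing import Callable, Dict, List

# Hypothetical vision experts. Real ones would call a captioning or
# detection model; these stubs return canned text for illustration.
def caption_expert(image_path: str) -> str:
    return "a bird perched on a branch"

def detection_expert(image_path: str) -> str:
    return "bird beak at (595, 594)"

EXPERTS: Dict[str, Callable[[str], str]] = {
    "caption": caption_expert,
    "detect": detection_expert,
}

# Assumed action format: "ACTION <expert_name> <image_path>".
ACTION_PATTERN = re.compile(r"^ACTION (\w+) (\S+)$")

def fake_llm(transcript: List[str]) -> str:
    """Scripted stand-in for the chat model (not a real API call)."""
    observations = [t for t in transcript if t.startswith("OBSERVATION ")]
    if len(observations) == 0:
        return "ACTION caption bird.jpg"
    if len(observations) == 1:
        return "ACTION detect bird.jpg"
    # Enough evidence gathered: answer using the latest observation.
    return "ANSWER " + observations[-1][len("OBSERVATION "):]

def run(question: str,
        llm: Callable[[List[str]], str] = fake_llm,
        max_steps: int = 5) -> str:
    """Alternate between model replies and expert calls until an answer."""
    transcript = ["USER " + question]
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript.append(reply)
        match = ACTION_PATTERN.match(reply)
        if match is None:
            # Not an action request, so treat the reply as the final answer.
            return reply
        name, image_path = match.groups()
        expert = EXPERTS.get(name)
        observation = expert(image_path) if expert else "unknown expert: " + name
        # The expert result goes back to the model as plain text.
        transcript.append("OBSERVATION " + observation)
    return "ANSWER (step limit reached)"

if __name__ == "__main__":
    print(run("Where is the bird's beak?"))
```

The point of the sketch is that the language model never sees pixels: it only reads the text that the experts return, so its answers can be no better than those observations.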

https://gyazo.com/5e202ca4ad30cdd0f57c8fdb5a92127a

nomadoor.icon: What are the coordinates of the bird’s beak with respect to the upper left corner of the image?

chatgpt.icon: The coordinates of the bird’s beak are 595, 594.

https://gyazo.com/a125acc1d883b21b4c8dea120aa59b58

Roughly correct

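However, it could not count the number of leaves, and it completely failed to answer questions about a graph with Japanese labels, so my impression was that it does little more than hand the words BLIP finds in the image to ChatGPT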

It depends on how GPT-4's Visual inputs are implemented, but letting an LLM that can only handle text work with images is still a big advantage