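Trying out the small VLM "MiniCPM-V"

This note summarizes how to use the lightweight VLM "MiniCPM-V" and my impressions after trying it. It is reportedly available through Ollama, so that is how I run it here.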

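What is MiniCPM-V

A small VLM developed by OpenBMB, described as rivaling GPT-4o in performance

It can caption images and videos

It has 8B parameters (GPT-4o's parameter count is not public, but MiniCPM-V is presumably far lighter)

Light enough to run even on mobile devices

It also supports video

For example, generating captions for a video in real time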
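
Through an image-only interface, one way to approximate video captioning is to caption individual frames taken at a fixed interval. The sketch below only chooses which frame indices to caption. `sample_frame_indices` and its parameters are hypothetical names for illustration, and this is an assumption about usage, not a description of how MiniCPM-V handles video internally.

```python
def sample_frame_indices(total_frames, fps, every_seconds=1.0):
    """Return indices of frames spaced roughly `every_seconds` apart.

    Hypothetical helper: each selected frame would then be captioned
    as a still image.
    """
    if total_frames <= 0 or fps <= 0 or every_seconds <= 0:
        return []
    # Number of frames between two samples, at least 1
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))
```

A shorter interval gives captions closer to real time, at the cost of more model calls per second of video.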

Setup

This walkthrough assumes Ollama is already installed

For installing Ollama, see here

Download the MiniCPM-V model

ollama pull minicpm-v

Once the download completes, output like the following is shown

https://scrapbox.io/files/6899a6193074dd9d5c80f69a.png
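
To confirm from a script that the model is available locally, one option is to read the output of `ollama list`. This is a minimal sketch, assuming that output is a table whose first line is a header and whose first column is the model name with a tag (such as `minicpm-v:latest`). The function names are hypothetical.

```python
import subprocess

def parse_model_names(list_output):
    """Return the first column of each row, skipping the header line."""
    lines = [ln for ln in list_output.splitlines() if ln.strip()]
    return [ln.split()[0] for ln in lines[1:]]

def is_model_installed(name, list_output):
    """Check whether `name` appears in the list, ignoring the tag after ':'."""
    return any(m.split(":")[0] == name for m in parse_model_names(list_output))

def ollama_list_output():
    """Run `ollama list` and return its standard output as text."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

With Ollama on the PATH, `is_model_installed("minicpm-v", ollama_list_output())` should be true after the pull above.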

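Usage example

This time, I have the model describe an image

Accuracy in Japanese is reportedly not that good, so both input and output are in English here

The target is the following image

https://scrapbox.io/files/673e8a3e004d4f98df37c354.png

Write the Python script as follows

Based on the AI-Bridge Lab article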

code:python
 # 20250811
 # Trial-run script for MiniCPM-V
 import os
 from ollama import Client
 # Initialize the Ollama client (talks to the local Ollama server by default)
 client = Client()
 image_model = "minicpm-v"
 # Ask the model to describe a single image
 def explaining_image(image_path):
     # `images` expects a list of image paths (or raw bytes), not a bare string
     response = client.generate(model=image_model, prompt="Please briefly describe this image.", images=[image_path])
     return response['response'].strip()
 # Main function
 def main():
     image_folder = "/path/to/img_folder"
     # Collect image files, sorted so the numbering is stable between runs
     image_files = sorted(f for f in os.listdir(image_folder) if f.lower().endswith(('.png', '.jpg', '.jpeg', '.gif')))
     if not image_files:
         print("No image files found.")
         return
     print("Available image files:")
     for i, file in enumerate(image_files, 1):
         print(f"{i}. {file}")
     # The menu is 1-based, the list index is 0-based
     choice = int(input("Enter the number of the image to analyze: ")) - 1
     if 0 <= choice < len(image_files):
         image_path = os.path.join(image_folder, image_files[choice])
         result = explaining_image(image_path)
         print(result)
     else:
         print("Invalid selection.")
 if __name__ == "__main__":
     main()

Save images in the folder → run the script → enter the number of the image to use

Output: The image depicts a red apple with two green leaves sprouting from its top, as if it were alive or magical. Uniquely, the apple is shown in mid-air and has white wings attached to its sides, suggesting that it might be an angelic representation of knowledge or education, drawing on common symbolism where "an apple" refers to a teacher's gift. The background shows a sky with clouds illuminated by what appears to be either sunrise or sunset light, adding to the ethereal quality of the image and implying a sense of hopefulness or enlightenment associated with learning or giving thanks for wisdom.

Time to text output is extremely short

Looks useful for dataset preprocessing and augmentation

Unless the granularity of the description is specified, the description gets quite long (over 1,000 characters observed)
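
The last two observations can be sketched together: captioning a whole folder for dataset work while keeping each description short. This is a minimal sketch under assumptions. `build_request`, `caption_folder`, the prompt wording, and the default limits are all hypothetical. `num_predict` is the Ollama option that caps the number of generated tokens. The `generate` argument is any callable with the same keyword interface as the client's `generate` in the script above, which keeps the functions independent of a running server.

```python
import json
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif"}

def build_request(image_path, max_sentences=2, max_tokens=120, model="minicpm-v"):
    """Build keyword arguments for a generate call with a bounded description."""
    return {
        "model": model,
        # State the desired granularity explicitly in the prompt
        "prompt": f"Describe this image in at most {max_sentences} sentence(s).",
        "images": [str(image_path)],
        # Hard upper bound on generated tokens, as a backstop
        "options": {"num_predict": max_tokens},
    }

def caption_folder(folder, generate, out_path):
    """Caption every image in `folder`, writing one JSON object per line."""
    paths = sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS
    )
    with open(out_path, "w", encoding="utf-8") as f:
        for p in paths:
            response = generate(**build_request(p))
            record = {"file": p.name, "caption": response["response"].strip()}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(paths)
```

With the client from the script above, this could be called as `caption_folder("/path/to/img_folder", client.generate, "captions.jsonl")`. The token cap may cut a sentence off mid-way, so the prompt constraint does most of the work and the cap is only a safety net.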

References

Ollama

https://ollama.com/library/minicpm-v

Official documentation

https://github.com/OpenBMB/MiniCPM-o

arXiv preprint

https://arxiv.org/pdf/2408.01800

AI-Bridge Lab, 『小型マルチモーダルAI「MiniCPM-V」とは？GPT-4Vを凌駕する性能と活用法』 (What is the small multimodal AI "MiniCPM-V"? Performance surpassing GPT-4V and how to use it)

https://note.com/doerstokyo_kb/n/nbbba979499b2

#Yuma_Oe