promptfoo

Secure & reliable LLMs | promptfoo

promptfoo/promptfoo

Promptfoo: AI Testing and Security Platform - NotebookLM: the documentation loaded into NotebookLM, as of 2025/9/5

Configuration Guide - Getting Started with Promptfoo | Promptfoo

Overall configuration

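JSON Schema for the configuration

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json is worth writing at the top of the config file

When tests are split out into a separate file and referenced, the schema can also be referenced partially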

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json#/definitions/PromptfooConfigSchema/properties/tests

Text can be written inline in the YAML, or a file can be referenced with file://prompts/hoge.txt

Nunjucks: the template engine used inside the configuration

prompts

file://generate_prompt.py:create_prompt uses the return value of the create_prompt function as the prompt
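
A minimal sketch of what such a prompt function might look like. It assumes promptfoo calls the function with a single context dict whose vars key holds the test case variables. The topic variable is hypothetical and only there for illustration.

```python
from typing import Any


def create_prompt(context: dict[str, Any]) -> str:
    """Build the prompt text from the test case variables."""
    # Assumption: promptfoo passes the test variables under context["vars"].
    variables = context.get("vars", {})
    # "topic" is a hypothetical variable name used only for illustration.
    topic = variables.get("topic", "promptfoo")
    return f"Explain {topic} in one short paragraph."
```

With a file like this saved as generate_prompt.py, the prompts entry above would point at create_prompt.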

tests

LLM Rubric | Promptfoo

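Surprising point: the grading model is decided by the assert type and by which credentials are available

So personally I would probably always set defaultTest.options.provider (the grader override)

Files can be passed in vars, but passing something like file://input.json does not decode it; the content arrives as a string

https://www.promptfoo.dev/docs/configuration/test-cases/#supported-file-types makes it look as if it would be decoded, though...

Maybe it gets stringified when it is handed to the Python Provider?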
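
One way to cope with this on the provider side is to decode the value yourself. The sketch below assumes the var reaches the Python Provider under context["vars"], either as a JSON string or as an already decoded object. The var name input and the helper load_json_var are hypothetical.

```python
import json
from typing import Any


def load_json_var(context: dict[str, Any], name: str) -> Any:
    """Return a var as a Python object, decoding it when it arrives as a JSON string."""
    value = context.get("vars", {}).get(name)
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            # Not JSON after all: hand the raw string back unchanged.
            return value
    return value


def call_api(prompt: str, options: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
    # "input" is a hypothetical var name, e.g. vars: { input: file://input.json }
    payload = load_json_var(context, "input")
    return {"output": {"prompt": prompt, "input": payload}}
```

Because the helper accepts both strings and decoded objects, the provider keeps working if the var is ever delivered already decoded.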

Assertions: picks I am likely to use

Assertions and Metrics - LLM Output Validation | Promptfoo

Deterministic Metrics for LLM Output Validation | Promptfoo: the deterministic ones

assert-set is interesting: it holds several assertions and can express "OK if at least 50% of them pass"
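
The "at least 50% pass" rule can be illustrated with a few lines of Python. This is a simplified sketch of the semantics described in the note, not promptfoo's implementation. The function name assert_set_passes is hypothetical, and treating an empty set as a failure is an assumption of the sketch.

```python
def assert_set_passes(results: list[bool], threshold: float = 0.5) -> bool:
    """Return True when the fraction of passing assertions reaches the threshold."""
    if not results:
        # Assumption of this sketch: an empty set counts as a failure.
        return False
    pass_ratio = sum(results) / len(results)
    return pass_ratio >= threshold
```

With a threshold of 0.5, one pass out of two is enough, while one pass out of three is not.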

Model-graded metrics | Promptfoo LLM-as-a-Judge のもの

CLI options

$ promptfoo eval

--model-outputs=

--output=file.html, also json, csv, xml, ...

https://www.promptfoo.dev/docs/configuration/outputs/

Filtering test cases

--filter-pattern='auth.*' runs only the test cases whose description matches

--filter-metadata=category=math selects test cases by their metadata field

--filter-errors-only / --filter-failing run only the errored or failed test cases; use together with --output

$ promptfoo eval --filter-errors-only=$(promptfoo list evals --ids-only -n 1)

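-c promptfooconfig.yaml, --config=evals/* runs multiple configs

Rather than forcing clever tricks into a single config, it may be better to split things into separate promptfooconfig.yaml files

-j / --max-concurrency=4 the default is apparently 4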

Exporting past results

$ promptfoo list evals what is this output??

$ promptfoo export eval ID -o myeval.json the format can also be chosen at this point, for example -o myeval.html

Get the latest ID

$ promptfoo list evals --ids-only -n 1

Handy in combination with export, filter-errors, and so on

$ npx promptfoo export eval $(npx promptfoo list evals --ids-only -n 1) -o myeval.json

WebUI

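What I use is mostly around the top right

One row is one TestCase, so when there are multiple assertions they have to be inspected individually (the highlighted part)

The output shown here is the output for the TestCase

The result of tests[].options.transform is reflected in the display (because it transforms the TestCase output)

tests[].assert[].transform is applied at run time, but I do not think it can be seen from the UI

I wish these could be viewed individually, but that would probably require an implementation that stores the transformed output per assert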

https://gyazo.com/ca27cd256d8687f4009a4ac8e60514b9

Transforming outputs

https://www.promptfoo.dev/docs/configuration/guide/#transforming-outputs

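providers[].config.transformResponse a transform that is always applied to the LLM output

tests[].options.transform

tests[].assert[].transform a per-assert transform of the output

Does this actually work? → It does, but it runs after defaultTest is applied, and it is not reflected in the view?

Others

tests[].assert[].contextTransform transforms the context for RAG and the like; processes context

The Python Provider may also return context itself

tests[].options.transformVars processes the test case's input vars; not sure I will use it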

Caching

Caching Configuration - Performance Optimization | Promptfoo

When in trouble, add --no-cache

Caching also applies to the Python Provider and the like

Cache key

python:${scriptPath}:${functionName}:${apiType}:${fileHash}:${prompt}:${options}:${context.vars}
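
To see why editing the script or changing vars leads to a cache miss, here is a sketch that assembles a key of the same shape. The hash algorithm (SHA-256) and the JSON serialization of options and vars are assumptions of this sketch, not promptfoo's actual choices. sketch_cache_key is a hypothetical name.

```python
import hashlib
import json
from typing import Any


def sketch_cache_key(
    script_path: str,
    function_name: str,
    api_type: str,
    script_source: bytes,
    prompt: str,
    options: dict[str, Any],
    variables: dict[str, Any],
) -> str:
    """Assemble a key shaped like the template above from its components."""
    # Assumption: the file hash is a SHA-256 hex digest of the script contents.
    file_hash = hashlib.sha256(script_source).hexdigest()
    # Assumption: options and vars are serialized as JSON with sorted keys.
    options_part = json.dumps(options, sort_keys=True)
    vars_part = json.dumps(variables, sort_keys=True)
    return ":".join(
        ["python", script_path, function_name, api_type, file_hash, prompt, options_part, vars_part]
    )
```

Any change to the script contents, the prompt, the options, or the vars yields a different key, so stale results are only a concern when something outside these components changes (an external API, for example).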

Python Provider

Python Provider | Promptfoo

async def call_api(prompt: str, options: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:

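Arguments

prompt is a string, or a JSON string

options is the provider's settings; it holds the providers[] element used at run time

config is the result after the global and per-test settings have been merged

context carries vars

Parameters other than those embedded in the prompt are passed this way too (for example, if search conditions need to be handed to an Agent)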

context.get('vars', {}).get('extra_params', default)

To inspect the contents, print on the Python side and run promptfoo eval --verbose
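
Putting the arguments together, a provider that reads an extra parameter from vars and prints what it received might look like the sketch below. The extra_params var, its top_k field, and the default value are hypothetical.

```python
import json
from typing import Any

# Hypothetical default used when the test case does not set extra_params.
DEFAULT_EXTRA_PARAMS: dict[str, Any] = {"top_k": 3}


def call_api(prompt: str, options: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
    config = options.get("config", {})
    extra_params = context.get("vars", {}).get("extra_params", DEFAULT_EXTRA_PARAMS)
    # Print what was received so it can be inspected when running with --verbose.
    print(json.dumps({"config": config, "extra_params": extra_params}, ensure_ascii=False, default=str))
    return {"output": f"{prompt} (top_k={extra_params.get('top_k')})"}
```

The default=str argument keeps the debug print from failing if a value is not JSON serializable.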

Return value

At minimum {"output": ... }

output is a string or an object

code:ProviderResponse.py
 class ProviderResponse:
     output: Optional[Union[str, Dict[str, Any]]]
     error: Optional[str]
     tokenUsage: Optional[TokenUsage]
     cost: Optional[float]
     cached: Optional[bool]
     logProbs: Optional[List[float]]

code:copy-paste.py
 from typing import Any

 class TokenUsage:
     total: int
     prompt: int
     completion: int

 class ProviderResponse:
     output: str | dict[str, Any] | None
     error: str | None
     tokenUsage: TokenUsage | None  # noqa: N815
     cost: float | None
     cached: bool | None
     logProbs: list[float] | None  # noqa: N815

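Returning and testing a complex structure

Put an arbitrary structure into the output dict, and read the needed part with transform in each assert

TODO when a structure containing None is returned in output, referencing that value raises an error on the promptfoo side

I wish the error distinguished between a key that exists with a null value and a key that is missing altogether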

code:mock.py
 from typing import Any

 async def call_api(prompt: str, options: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
     return {
         # Put an arbitrary structure into output and return it
         "output": {
             "default": f"Hello! {prompt}",
             "category": "greeting",
         }
     }

code:parameters.yaml
 providers:
   - id: file://mock.py
 prompts:
   - pokutuna
 defaultTest:
   options:
     transform: output.default
     # Writing transform every time also works, but the convention here is that
     # the output.default field is the test target, so it goes in defaultTest
 tests:
   - assert:
       - type: contains
         value: Hello!
         # The defaultTest transform applies, so this checks that output.default contains Hello!
   - assert:
       - type: equals
         value: greeting
         # transform: ... ← can also be written here, but it runs after the defaultTest transform has been applied
     options:
       transform: output.category
       # Each element under tests is merged with defaultTest, so this is the place to override

Using uv

I do not want to activate/deactivate a venv, so I place a wrapper script

code:uv-python.sh
 #!/bin/bash
 uv run python "$@"

code:uv-run-python.yaml
 providers:
   - id: file://generate.py
     config:
       pythonExecutable: ./uv-python.sh

Cloud Run & Litestream

TODO this should be possible

Self-hosting | Promptfoo

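Creating private services - Invoking with an HTTPS request | Cloud Run Documentation | Google Cloud

For viewing, put IAP in front; for sending results from the local CLI, create a route with gcloud run services proxy. That should work.

Gemini's temperature: does it go under config or under config.generationConfig?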

Google AI / Gemini | Promptfoo

There are examples that pass it under config and examples that pass it under config.generationConfig

The answer: both work

promptfoo/promptfoo@5168539 - src/providers/google/vertex.ts#L305-L315

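Open questions

I want to run only the evaluation against outputs generated by a separate workflow (so that the only LLM calls during the promptfoo run are for LLM-as-a-Judge)

--model-outputs=... --assertions=... would do it, but hmm, that means running each one individually

Process JSON output, then evaluate it

Can the tests transform be used?

I want to split things into a test set for output A / a test set for output B / ...

Use scenario? → Probably not the right fit

Is it more natural to split the whole config and pass a wildcard or multiple files with -c?

Can traces of the test run itself be sent to Langfuse or OTel?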

#LLM #LLM-as-a-Judge