OpenAI's published method for evaluating summarization tasks

How to evaluate a summarization task

https://cookbook.openai.com/examples/evaluation/how_to_eval_abstractive_summarization

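This notebook explores evaluation techniques for abstractive summarization tasks using a simple example. In addition to traditional metrics such as ROUGE and BERTScore, it presents a more recent approach that uses an LLM (large language model) as the evaluator.

Evaluating the quality of a summary is a time-consuming process. It involves several quality dimensions, such as coherence, conciseness, readability, and content. Traditional automatic metrics such as ROUGE and BERTScore are concrete and reliable, but they do not necessarily correlate with the actual quality of a summary. In particular, they have been shown to correlate relatively poorly with human judgments on open-ended generation tasks (G-Eval paper).

There is a growing need to rely on human evaluation, user feedback, or model-based metrics, while staying alert to potential biases. Human judgment provides valuable insight, but it is often infeasible and expensive at scale.

In addition to these traditional metrics, we introduce a method (G-Eval) that uses an LLM as a new, reference-free metric. Here we use gpt-4 to score candidate outputs. gpt-4 has effectively learned an internal model of language quality that distinguishes fluent, coherent text from low-quality text. Leveraging this internal scoring mechanism makes it possible to automatically evaluate new candidate outputs generated by an LLM.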

Setup

code:python

# Installing necessary packages for the evaluation
# rouge: For evaluating with ROUGE metric
# bert_score: For evaluating with BERTScore
# openai: To interact with OpenAI's API (pinned below 1.0 because the code
#         in this note uses the legacy openai.ChatCompletion interface)
!pip install rouge --quiet
!pip install bert_score --quiet
!pip install "openai<1" --quiet

code:python

import openai
import os
import re
import pandas as pd

# Python Implementation of the ROUGE Metric
from rouge import Rouge

# BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.
from bert_score import BERTScorer

openai.api_key = os.environ.get("OPENAI_API_KEY")

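Example task

This notebook uses the following summarization example.

We provide two generated summaries to compare, plus a human-written reference summary, which metrics such as ROUGE and BERTScore require.

Excerpt: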

OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.

Summaries:

https://scrapbox.io/files/659e0ade91ddc700232bd5b4.png

Take a moment to consider which summary you personally prefer, and which one best captures OpenAI's mission.

code:python

excerpt = "OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges."

ref_summary = "OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges."

eval_summary_1 = "OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good."

eval_summary_2 = "OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff."

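Evaluating using ROUGE

ROUGE primarily measures the overlap of words between a generated output and a reference text. It is a common metric for evaluating automatic summarization tasks. ROUGE-L is based on the longest common subsequence between the system-generated summary and the reference summary (words that appear in the same order, though not necessarily contiguously), and so gauges how well the system retains the essence of the reference summary.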
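To make the longest-common-subsequence idea concrete, here is a minimal sketch of a simplified ROUGE-L. It assumes plain whitespace tokenization and lowercasing, and the function names `lcs_length` and `rouge_l_sketch` are made up for this note. The `rouge` package applies its own preprocessing, so its numbers may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length between the tokens of `a` seen so far and b[:j]
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(candidate, reference):
    """Simplified ROUGE-L precision, recall and F1 on whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    p = lcs / len(cand)  # share of candidate tokens that lie on the LCS
    r = lcs / len(ref)   # share of reference tokens that lie on the LCS
    return {"p": p, "r": r, "f": 2 * p * r / (p + r)}


# Toy sentences (not taken from the notebook)
scores = rouge_l_sketch("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Here the shared subsequence is "the cat on the mat" (5 of 6 tokens on each side), even though the match is interrupted by a differing word.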

code:python

# Function to calculate the ROUGE scores
def get_rouge_scores(text1, text2):
    rouge = Rouge()
    return rouge.get_scores(text1, text2)

rouge_scores_out = []

# Calculate the ROUGE scores for both summaries using the reference
eval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)
eval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)

for metric in ["rouge-1", "rouge-2", "rouge-l"]:
    for label in ["F-Score"]:
        # get_scores returns a one-element list of dicts;
        # label[0].lower() turns "F-Score" into the key "f"
        eval_1_score = eval_1_rouge[0][metric][label[0].lower()]
        eval_2_score = eval_2_rouge[0][metric][label[0].lower()]
        row = {
            "Metric": f"{metric} ({label})",
            "Summary 1": eval_1_score,
            "Summary 2": eval_2_score,
        }
        rouge_scores_out.append(row)

# Highlight the highest value in each row of the table
def highlight_max(s):
    is_max = s == s.max()
    return [
        "background-color: lightgreen" if v else "background-color: white"
        for v in is_max
    ]

rouge_scores_out = (
    pd.DataFrame(rouge_scores_out)
    .set_index("Metric")
    .style.apply(highlight_max, axis=1)
)

rouge_scores_out

https://scrapbox.io/files/659e0c0752aaee0025101b71.png

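The table shows the ROUGE scores of the two summaries evaluated against the reference text. For rouge-1, Summary 2 outperforms Summary 1, which indicates a better overlap of individual words. For rouge-l, Summary 2 also scores higher, which means a closer match on the longest common subsequence and therefore potentially a better overall summary in terms of capturing the main content and order of the original text. Because Summary 2 contains many words and short phrases lifted directly from the excerpt, its overlap with the reference summary is likely to be higher, leading to higher ROUGE scores.

ROUGE and similar metrics such as BLEU and METEOR provide quantitative measures, but they often fail to capture the essence of a well-written summary. They also correlate poorly with human scores. Given the progress of LLMs, which are good at producing fluent and coherent summaries, traditional metrics such as ROUGE may inadvertently penalize these models. This is especially true when a summary is worded differently but still accurately conveys the core information.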
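The penalty on rewording can be seen with a small sketch of unigram overlap, which is the idea behind ROUGE-1. This is a simplified illustration, not the `rouge` package's implementation: the function `unigram_f1` and the three toy sentences are made up for this note, and tokenization is plain whitespace splitting.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 of clipped unigram overlap between two whitespace-tokenized texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)


reference = "the company will cut prices next year"
copy_like = "the company will cut prices"  # reuses the reference wording
paraphrase = "the firm plans to reduce what it charges in the coming year"  # similar meaning, different words

copy_score = unigram_f1(copy_like, reference)
paraphrase_score = unigram_f1(paraphrase, reference)
print(copy_score, paraphrase_score)
```

The paraphrase shares only "the" and "year" with the reference, so it scores far lower than the copy-like sentence, even though a human reader might judge both as acceptable.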

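Evaluating using BERTScore

ROUGE relies on the exact presence of words in both the predicted and reference texts, and cannot interpret the underlying semantics. This is where BERTScore comes in: it leverages contextual embeddings from a BERT model to evaluate the similarity between a predicted and a reference sentence in the context of machine-generated text. By comparing the embeddings of both sentences, BERTScore captures semantic similarities that traditional n-gram based metrics may miss.

💡 So because the words are replaced with vectors, it can take meaning into account...!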
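The matching step can be sketched as follows. This is only an illustration of greedy cosine matching under stated assumptions: the 2-dimensional vectors are hypothetical stand-ins for BERT token embeddings, the helper names are made up, and the real `bert_score` library offers further options (for example IDF weighting and baseline rescaling) that are left out here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy best-match cosine similarity."""
    # Precision: each candidate token is matched to its most similar reference token
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token
    recall = sum(
        max(cosine(c, r) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical token embeddings (real ones come from a BERT model)
ref_vecs = [[1.0, 0.0], [0.0, 1.0]]
cand_vecs = [[1.0, 0.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_scores(cand_vecs, ref_vecs)
print(precision, recall, f1)
```

The second candidate vector matches neither reference vector exactly but still earns partial credit, which is how a near-synonym can contribute to the score where exact word matching would give zero.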

code:python

# Instantiate the BERTScorer object for English language
scorer = BERTScorer(lang="en")

# Calculate BERTScore for summary 1 against the reference summary
# (score() takes a list of candidates and a list of references)
# P1, R1, F1_1 represent Precision, Recall, and F1 Score respectively
P1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])

# Calculate BERTScore for summary 2 against the reference summary
# P2, R2, F2_2 represent Precision, Recall, and F1 Score respectively
P2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])

print("Summary 1 F1 Score:", F1_1.tolist()[0])
print("Summary 2 F1 Score:", F2_2.tolist()[0])

-> Summary 1 F1 Score: 0.9227314591407776

-> Summary 2 F1 Score: 0.9189572930335999

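The close F1 scores of the two summaries suggest that they may perform similarly at capturing the key information. However, this small difference should be interpreted with caution. Because BERTScore may not fully grasp the subtleties and high-level concepts that a human evaluator would, relying on this metric alone could lead to misjudging the actual quality and nuances of a summary. An integrated approach that combines BERTScore with human judgment and other metrics could provide a more reliable evaluation.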

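Evaluating using GPT-4

Here we show an example implementation of a reference-free text evaluator using gpt-4.

Whereas metrics such as ROUGE and BERTScore depend on comparison with a reference summary, the gpt-4 based evaluator assesses the quality of generated content from the input prompt and the text alone, without any ground-truth reference. This makes it applicable to new datasets and tasks where human references are sparse or unavailable.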

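An overview of the method:

We define four distinct criteria:

Relevance: evaluates whether the summary includes only important information and excludes redundancies.

Coherence: evaluates the logical flow and organization of the summary.

Consistency: checks whether the summary agrees with the facts in the source document.

Fluency: evaluates the grammar and readability of the summary.

For each of these criteria, we write a prompt that takes the original document and the summary as input and, using CoT (Chain-of-Thought), instructs the model to output a numeric score from 1 to 5 (1 to 3 for Fluency in the prompts below).

We generate scores from gpt-4 with the defined prompts and compare them across summaries.

In this demonstration, we use a direct scoring function in which gpt-4 produces a discrete score (1 to 5) for each metric. Normalizing the scores and taking a weighted sum could yield more robust, continuous scores that better reflect the quality and diversity of the summaries.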
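As a minimal sketch of the weighted-sum idea, the function below turns a set of weights over discrete scores into one continuous score. It assumes the weights are already available, for example as token probabilities or as counts from sampling the evaluator several times; how to obtain them is not covered in this notebook. The function name `weighted_score` and the example weights are hypothetical.

```python
def weighted_score(score_weights):
    """Continuous score from a mapping {discrete score: weight}.

    The weights are normalized to sum to 1 and then used to take
    the weighted sum of the scores.
    """
    total = sum(score_weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(score * weight / total for score, weight in score_weights.items())


# Hypothetical weights: out of 10 samples, the evaluator answered
# 3 twice, 4 five times and 5 three times
example_weights = {3: 2, 4: 5, 5: 3}
continuous = weighted_score(example_weights)
print(continuous)
```

Two summaries that both receive a 4 under direct scoring can end up with different continuous scores this way, which gives a finer-grained comparison.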

code:python

# Evaluation prompt template based on G-Eval

EVALUATION_PROMPT_TEMPLATE = """

You will be given one summary written for an article. Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions very carefully.

Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

{criteria}

Evaluation Steps:

{steps}

Example:

Source Text:

{document}

Summary:

{summary}

Evaluation Form (scores ONLY):

- {metric_name}

"""

# Metric 1: Relevance

RELEVANCY_SCORE_CRITERIA = """

Relevance(1-5) - selection of important content from the source. \

The summary should include only important information from the source document. \

Annotators were instructed to penalize summaries which contained redundancies and excess information.

"""

RELEVANCY_SCORE_STEPS = """

1. Read the summary and the source document carefully.

2. Compare the summary to the source document and identify the main points of the article.

3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.

4. Assign a relevance score from 1 to 5.

"""

# Metric 2: Coherence

COHERENCE_SCORE_CRITERIA = """

Coherence(1-5) - the collective quality of all sentences. \

We align this dimension with the DUC quality question of structure and coherence \

whereby "the summary should be well-structured and well-organized. \

The summary should not just be a heap of related information, but should build from sentence to a\

coherent body of information about a topic."

"""

COHERENCE_SCORE_STEPS = """

1. Read the article carefully and identify the main topic and key points.

2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,

and if it presents them in a clear and logical order.

3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.

"""

# Metric 3: Consistency

CONSISTENCY_SCORE_CRITERIA = """

Consistency(1-5) - the factual alignment between the summary and the summarized source. \

A factually consistent summary contains only statements that are entailed by the source document. \

Annotators were also asked to penalize summaries that contained hallucinated facts.

"""

CONSISTENCY_SCORE_STEPS = """

1. Read the article carefully and identify the main facts and details it presents.

2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.

3. Assign a score for consistency based on the Evaluation Criteria.

"""

# Metric 4: Fluency

FLUENCY_SCORE_CRITERIA = """

Fluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.

1: Poor. The summary has many errors that make it hard to understand or sound unnatural.

2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.

3: Good. The summary has few or no errors and is easy to read and follow.

"""

FLUENCY_SCORE_STEPS = """

Read the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 3.

"""

def get_geval_score(
    criteria: str, steps: str, document: str, summary: str, metric_name: str
):
    # Fill the G-Eval style template for one metric and one summary
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        criteria=criteria,
        steps=steps,
        metric_name=metric_name,
        document=document,
        summary=summary,
    )
    # Note: openai.ChatCompletion is the legacy interface (openai<1.0)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        # messages must be a list of role/content dicts
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content

evaluation_metrics = {
    "Relevance": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),
    "Coherence": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),
    "Consistency": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),
    "Fluency": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),
}

summaries = {"Summary 1": eval_summary_1, "Summary 2": eval_summary_2}

data = {"Evaluation Type": [], "Summary Type": [], "Score": []}

# Score every summary on every metric
for eval_type, (criteria, steps) in evaluation_metrics.items():
    for summ_type, summary in summaries.items():
        data["Evaluation Type"].append(eval_type)
        data["Summary Type"].append(summ_type)
        result = get_geval_score(criteria, steps, excerpt, summary, eval_type)
        score_num = int(result.strip())
        data["Score"].append(score_num)

# One row per metric, one column per summary
pivot_df = pd.DataFrame(data, index=None).pivot(
    index="Evaluation Type", columns="Summary Type", values="Score"
)

styled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)
display(styled_pivot_df)

https://scrapbox.io/files/659e0ec45228010026f794dd.png

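Overall, Summary 1 appears to outperform Summary 2 in three of the four categories (Coherence, Relevance, and Fluency). The two summaries come out level on Consistency (whether the summary agrees with the facts in the source document). This result may suggest that Summary 1 is generally preferable under the given evaluation criteria.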
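One practical note on the scoring loop: it converts the model reply with `int(result.strip())`, which raises a `ValueError` if the reply is anything other than a bare integer. A more defensive parser could look like the sketch below. The helper `parse_score` is hypothetical and not part of the original notebook. It takes the first integer found in the reply, so a reply that echoes the scale (such as "1-5: 4") would still be misread.

```python
import re


def parse_score(raw, low=1, high=5):
    """Return the first integer in `raw` if it lies in [low, high], else None."""
    match = re.search(r"\d+", raw)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


print(parse_score(" 4\n"))
print(parse_score("- Relevance: 3"))
print(parse_score("N/A"))
```

Replies that yield `None` can then be logged or retried instead of stopping the whole evaluation run.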

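Limitations

Note that LLM-based metrics tend to prefer LLM-generated text over human-written text.

In addition, LLM-based metrics are sensitive to the system message and prompt.

We recommend experimenting with other techniques that can help improve performance and produce consistent scores.

It is also worth noting that this scoring method is currently limited by gpt-4's context window.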

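Conclusion

Evaluating abstractive summarization still has room for improvement. Traditional metrics such as ROUGE, BLEU, and BERTScore provide useful automatic evaluation, but they are limited in capturing semantic similarity and the subtler aspects of summary quality. They also require reference outputs, which are costly to collect and label. LLM-based metrics are promising as a reference-free way to evaluate coherence, fluency, and relevance. However, they too carry a potential bias toward text generated by LLMs. Ultimately, a combination of automatic metrics and human evaluation is ideal for reliably evaluating abstractive summarization systems. Human evaluation is essential for a comprehensive understanding of summary quality, but it should be complemented by automatic evaluation to enable efficient, large-scale testing. The field will continue to develop more robust evaluation methods that balance quality, scalability, and fairness. Advancing evaluation methods is critical to driving progress in production applications.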