COMET: A Neural Framework for MT Evaluation

#memo #paper

https://gyazo.com/b3ec4a0ff9ce479b61306e93fad187be

COMET

Two models are proposed: the Estimator Model and the Translation Ranking Model

source is the original text to be translated, hypothesis is the MT output, and reference is the human-written gold translation

Estimator Model

The loss is the MSE between the predicted scores and the quality assessments (DA, HTER or MQM)
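A minimal sketch of that loss in plain Python, just to pin down what is being minimized. The function name `mse_loss` and the example numbers are hypothetical (not from the paper); the actual model would compute this over batches inside a deep learning framework.

```python
from typing import Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted scores and human quality scores."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        raise ValueError("at least one score is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical values for illustration only (not taken from the paper).
predicted_scores = [0.5, 0.0, 1.0]   # Estimator outputs
human_scores = [1.0, 0.0, 0.0]       # e.g. DA, HTER or MQM targets
loss = mse_loss(predicted_scores, human_scores)
```

Whichever quality assessment is chosen only changes the target values; the loss itself stays the same.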

Related work

BERT Regressor

BERTScore

MoverScore

BLEURT

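A paper critical of regressing onto a single scalar value (e.g., DA)

深刻な誤訳の識別に向けた分類型翻訳評価データセットの構築 (roughly: "Building a classification-style translation evaluation dataset for identifying serious mistranslations")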

Training labels

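The current mainstream in human evaluation of translation is Direct Assessment (DA), a standardized single human evaluation score.

Treating DA prediction as a regression problem is the approach also used by BERT Regressor and BLEURT above, and it is a reasonable one from the standpoint of reproducing human evaluation with automatic evaluation.

Quoted (translated from Japanese) from: https://www.anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P7-5.pdf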

HTER

https://aclanthology.org/2006.amta-papers.25.pdf

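The evaluation is human-in-the-loop

First, a human annotator creates a targeted reference: a fluent translation that keeps the meaning of the reference while staying as close as possible to the hypothesis (typically by minimally post-editing the hypothesis)

Then TER is measured between the targeted reference and the hypothesis

TER (Translation Edit Rate) = number of word-level edits (insertions, deletions, substitutions, shifts) needed to turn the hypothesis into the reference, divided by the reference length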
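A rough sketch of the computation, assuming whitespace tokenization and ignoring shift edits (real TER also counts block shifts, so this simplified version can overestimate the edit rate). The names `word_edit_distance` and `simple_ter` and the example sentences are made up for illustration.

```python
from typing import List


def word_edit_distance(hyp_tokens: List[str], ref_tokens: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_word in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_ter(hypothesis: str, reference: str) -> float:
    """Simplified TER: edits / reference length. Shifts are NOT modeled."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


# HTER: the reference is the human-made targeted reference.
hypothesis = "the cat sat on mat"
targeted_reference = "the cat sat on the mat"
hter = simple_ter(hypothesis, targeted_reference)  # 1 edit / 6 reference words
```

The only difference between TER and HTER in this sketch is which reference is passed in: an independent reference for TER, the human-targeted one for HTER.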

MQM

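Errors are counted from human annotation under three error types (Style, Fluency, Accuracy) and turned into a score with a simple formula; the maximum is 100 (the paper gives the range as −∞ to 100, not 0 to 100)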

Errors are annotated under an internal typology defined under three main error types; ‘Style’, ‘Fluency’ and ‘Accuracy’. Our MQM scores range from −∞ to 100 and are defined as:
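The formula following this quote is missing from these notes. As far as I recall, the paper gives it in the form below, where $I_{\text{Minor}}$, $I_{\text{Major}}$ and $I_{\text{Crit.}}$ are the numbers of minor, major and critical errors; treat the weights and the denominator as recalled from memory and check them against the paper.

```latex
\[
\mathrm{MQM} = 100 - \frac{I_{\text{Minor}} + 5 \times I_{\text{Major}} + 10 \times I_{\text{Crit.}}}{\text{Sentence Length} \times 100}
\]
```

With no errors the score is 100, and it decreases without a lower bound as errors accumulate, which matches the stated range of −∞ to 100.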

Corpora

The QT21 corpus

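A translation corpus covering the information technology and life sciences domains

The authors compute HTER on it and use the scores as regression targets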

The WMT DARR corpus

Contains human judgments (DA), used in the form of relative rankings (DARR)

The MQM corpus

Each tuple consists of a source sentence, a human-generated reference, a MT hypothesis, and its MQM score, derived from error annotations by one (or more) trained annotators

A translation corpus annotated with MQM scores

https://github.com/Unbabel/COMET/issues/36

We train two versions of the Estimator model described in section 2.3: one that regresses on HTER (COMET-HTER) trained with the QT21 corpus, and another that regresses on our proprietary implementation of MQM (COMET-MQM) trained with our internal MQM corpus. For the Translation Ranking model, described in section 2.4, we train with the WMT DARR corpus from 2017 and 2018 (COMET-RANK).

memo

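Wouldn't COMET's own translation ability (or something like it) become an upper bound on what it can evaluate...?

It seems that anything COMET fundamentally cannot translate is something it cannot evaluate in the first place

But since the ref is given, maybe it is not an upper bound after all?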