翻訳タスクをどう評価するか？

評価基盤

翻訳タスクでの自動評価指標からわかる通り、これらは人間の判断との整合性が低い。

一方で、GPTを使った翻訳精度からもわかる通り、GPT-4でかなり精度の高い翻訳を実現できる。

そのため、翻訳タスクを対象としたLLMの評価指標の中から、GEMBA-DA (noref)を使う。

ChainForge: https://chainforge.ai/play/?f=214idd13ywlcs

GPT-4を使うと、参照なしで、87.6%と最高評価

https://scrapbox.io/files/65cec583f7cc8f002d1dc9b4.png

API費用が気になる方は、GPT-3.5-Turboでも(86.9%)良さそう

https://scrapbox.io/files/65cec5c3ec92a800256948b7.png

ただし、GPT-3.5の場合、スコアのみを回答しないで、余計な文章をいう傾向にある。

GPT-4-Turbo使えば、アウトプットもスコアだけなら制限できるかつ、GitHubでは、max_tokenを20に設定していることから、GPT-3.5-Turboにするメリットは小さい

GEMBA-MQM論文では、参照なしのさらに良い方法が考案されているが、roleの設定などやや複雑であり、採用を見送った

データセット

GEMBA-DA(noref)を使うのであれば、参照は不要となる。

そのため、特別なデータセットの用意は不要。

もし用意するなら

GEMBA論文で使われた、WMT23の英語->日本語の共有タスクのデータが良いか。https://wmt-metrics-task.github.io/

code:markdown

【Technical Discussion】

The hacked up version of Jedi Knight was crashing because it was calling a function off the end of a vtable.Turns out is was presuming that calling IDirect3D::CreateViewport() would return an IDirect3DViewport3, which has additional methods tacked onto the end compared to an IDirect3DViewport, which is what I've implemented.To me, this is a pretty big assumption because it is only creating the viewport using a Direct3D object, not a Direct3D3 object.

Now, I get that in practice, IDirectXObject2 is typically a proper superset of IDirectXObject, with no changed function signatures, and new methods only added to the end. But this is not universally true; for those cases it matters what interface you are using to create the object in question. So anyway, since it does hold true here, to fix it I had to extend my viewport implementation to contain the IDirect3DViewport3 methods so that the call to the new one was valid.

code:markdown

【Philosophical and Scientific Inquiry】

Suppose for the sake of argument that science at least in part consists of lists of objectively factual statements about the world, true apart from any theory they might support.　Even if it's true that such facts exist in science it's still possible to argue that scientific facts are theory-laden.　Scientific facts result from experiments.　The experiments don't create the facts on this reading, but the choice of which experiments to conduct controls which facts are discovered.　Some facts, e.g. about subatomic particles, can only result from experiments that are themselves only possible in capitalism because they require too much resources, too much organization, too much coercion, to pull off otherwise.　This is a very brief sketch of a plausible argument that theories of capitalism influencing the actual content of science are consistent with theories asserting the existence of objective scientific facts.　It's not an argument in favor of the existence of objective scientific facts, which I don't believe in.

code:markdown

There is significant evidence that real-world communication cannot be reduced to sending signals with context-independent meaning. In this work, based on a variant of the classical Lewis (1969) signaling model, we explore the conditions for the emergence of context-dependent communication in a situated scenario. In particular, we demonstrate that pressure to minimise the vocabulary size is sufficient for such emergence. At the same time, we study the environmental conditions and cognitive capabilities that enable contextual disambiguation of symbol meanings. We show that environmental constraints on the receiver's referent choice can be unilaterally exploited by the sender, without disambiguatio capabilities on the receiver's end. Consistent with common assumptions, the sender's awareness of the context appears to be required for contextual communication. We suggest that context-dependent communication is a situated multilayered phenomenon, crucially influenced by environment properties such as distribution of contexts. The model developed in this work is a demonstration of how signals may be ambiguous out of context, but still allow for near-perfect communication accuracy.

プロンプト

論文では、翻訳性能を上げたとされるプロンプトは見つからず

Role-Play Promptなどが使えそうか。