Yohei Kikuta, "原論文から解き明かす生成 AI" (Generative AI Unraveled from the Original Papers), 2025/8/23

https://gyazo.com/f5bd97a8436e2038c079c6298587535a

2. "Turning input data into features"

Herbert Rubenstein, John B. Goodenough "Contextual correlates of synonymy" 1965/10/1

Contextual correlates of synonymy | Communications of the ACM

Experimental corroboration was obtained for the hypothesis that the proportion of words common to the contexts of word A and to the contexts of word B is a function of the degree to which A and B are similar in meaning. The tests were carried out for variously defined contexts. The shapes of the functions, however, indicate that similarity of context is reliable as criterion only for detecting pairs of words that are very similar in meaning.

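Experimental verification supported the hypothesis that the proportion of words shared by the contexts of word A and the contexts of word B depends on the degree of semantic similarity between A and B. The verification was carried out for contexts defined in various ways. Judging from the shapes of the resulting functions, however, contextual similarity is reliable as a criterion only for detecting pairs of words that are very similar in meaning.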
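The hypothesis can be illustrated with a small overlap computation. This is a minimal sketch, assuming that a word's context is simply the other words in the same sentence and that overlap is measured Jaccard-style; neither choice is taken from the paper, and `context_words`, `context_overlap`, and the toy corpus are hypothetical.

```python
def context_words(sentences, target):
    """Collect every word that co-occurs with `target` in some sentence."""
    words = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if target in tokens:
            words.update(t for t in tokens if t != target)
    return words


def context_overlap(sentences, a, b):
    """Share of context words common to `a` and `b` (assumed Jaccard measure)."""
    ctx_a = context_words(sentences, a)
    ctx_b = context_words(sentences, b)
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0


# Toy corpus (illustrative only).
corpus = [
    "the car drove down the road",
    "the automobile drove down the street",
    "the cat slept on the sofa",
]
similar = context_overlap(corpus, "car", "automobile")
dissimilar = context_overlap(corpus, "car", "cat")
```

On this toy corpus the near-synonyms share most of their context words, while the unrelated pair shares only the article.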

Rico Sennrich, Barry Haddow, Alexandra Birch "Neural Machine Translation of Rare Words with Subword Units" 2015/8/31

［1508.07909］ Neural Machine Translation of Rare Words with Subword Units

Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm, and empirically show that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.1 and 1.3 BLEU, respectively.

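Neural machine translation (NMT) models typically operate with a fixed vocabulary, yet translation itself is an open-vocabulary problem. Previous work has handled out-of-vocabulary words by backing off to a dictionary. This paper proposes a simpler and more effective approach: open-vocabulary translation is achieved by letting the NMT model encode rare and unknown words as sequences of subword units. The approach rests on the intuition that various word classes can be translated via units smaller than words. For example, names can be translated by character copying or transliteration, compounds by compositional translation, and cognates and loanwords by phonological and morphological transformations. The paper discusses the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm. Experiments show that subword models outperform a back-off dictionary baseline on the WMT 15 English-German and English-Russian translation tasks by 1.1 and 1.3 BLEU, respectively.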
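The byte pair encoding segmentation mentioned above can be sketched as a loop that repeatedly merges the most frequent adjacent symbol pair. This is a minimal sketch in the spirit of that idea, not the authors' released code; the function names and the toy vocabulary (words pre-split into characters with an assumed end-of-word marker `</w>`) are illustrative.

```python
import re
from collections import Counter


def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(pair, vocab):
    """Replace every whole-symbol occurrence of `pair` with the merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Learn up to `num_merges` merge operations, most frequent pair first."""
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the first pair seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy vocabulary: space-separated symbols mapped to word frequency.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(toy_vocab, 3)
```

The number of merges is the single hyperparameter here; it directly controls the size of the resulting subword vocabulary.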

Philip Gage "A new algorithm for data compression" 1994/2/1

A new algorithm for data compression | The C Users Journal

This article describes a simple general-purpose data compression algorithm, called Byte Pair Encoding (BPE), which provides almost as much compression as the popular Lempel, Ziv compression.

Data compression is becoming increasingly important as a way to stretch disk space and speed up data transfers. This article describes a simple general-purpose data compression algorithm, called Byte Pair Encoding (BPE), which provides almost as much compression as the popular Lempel, Ziv.

This article describes Byte Pair Encoding (BPE), a simple general-purpose data compression algorithm that achieves almost as much compression as the widely used Lempel-Ziv compression.

Data compression is becoming increasingly important as a way to make better use of disk space and to speed up data transfers. This article describes Byte Pair Encoding (BPE), a simple general-purpose algorithm that compresses almost as well as the widely used Lempel-Ziv compression.

Taku Kudo "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates" 2018/4/29

［1804.10959］ Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates - ACL Anthology

Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as a noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially on low resource and out-of-domain settings.

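Subword units are an effective way to alleviate the open-vocabulary problem in neural machine translation (NMT). Sentences are usually converted into a unique subword sequence, but subword segmentation is potentially ambiguous, and multiple segmentations can arise even with the same vocabulary. The question this paper addresses is whether that segmentation ambiguity can be exploited as noise to improve the robustness of NMT. It proposes subword regularization, a simple regularization method that trains the model with multiple subword segmentations sampled probabilistically during training. For better subword sampling, it also proposes a new subword segmentation algorithm based on a unigram language model. Experiments on multiple corpora show consistent improvements, especially in low-resource and out-of-domain settings.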
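The idea of sampling a segmentation from a unigram language model can be sketched as follows. This is a rough illustration only: it enumerates every segmentation exhaustively, which is feasible just for short words and is not how the paper samples. The piece probabilities, the smoothing exponent `alpha`, and all function names are assumptions.

```python
import math
import random


def segmentations(word, vocab):
    """Yield every way to split `word` into pieces that appear in `vocab`."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                yield [piece] + rest


def sample_segmentation(word, vocab, alpha=0.5, rng=random):
    """Sample one segmentation with probability proportional to P(seg) ** alpha.

    P(seg) is the product of unigram piece probabilities; a smaller alpha
    flattens the distribution and produces more varied segmentations.
    """
    candidates = list(segmentations(word, vocab))
    weights = [math.prod(vocab[p] for p in seg) ** alpha for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical unigram piece probabilities (not estimated from any corpus).
unigram = {
    "h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
    "he": 0.1, "ll": 0.1, "lo": 0.15, "hell": 0.15, "hello": 0.3,
}
sample = sample_segmentation("hello", unigram, rng=random.Random(0))
```

Calling `sample_segmentation` once per training step would expose the model to different segmentations of the same word, which is the regularization effect the abstract describes.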

Taku Kudo, John Richardson "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" 2018/8/19

［1808.06226］ SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing - ACL Anthology

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at this https URL.

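This paper introduces SentencePiece, a language-independent subword tokenizer and detokenizer designed for neural text processing, including neural machine translation. It provides open-source C++ and Python implementations for subword units. Whereas existing subword segmentation tools assume that the input has been pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which makes a purely end-to-end, language-independent system possible. A validation experiment on English-Japanese NMT confirmed that accuracy comparable to direct subword training from raw sentences is achievable. The paper also compares the performance of subword training and segmentation under various configurations. SentencePiece is available under the Apache 2 license at this https URL.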

google/sentencepiece: Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE)［Sennrich et al.］) and unigram language model［Kudo.］) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems in which the vocabulary size is fixed before the neural model is trained. SentencePiece implements subword units (e.g., byte-pair encoding (BPE)［Sennrich et al.］ and the unigram language model［Kudo.］) and extends them with direct training from raw sentences. This makes it possible to build a purely end-to-end system that does not depend on language-specific pre- and postprocessing.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever "Language Models are Unsupervised Multitask Learners" 2019/2/14

Alec Radford, OpenAI "Language Models are Unsupervised Multitask Learners" 2019

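Natural language processing tasks such as question answering, machine translation, reading comprehension, and summarization are typically approached with supervised learning on task-specific datasets. This work demonstrates that language models begin to learn these tasks without any explicit supervision when trained on WebText, a new dataset of millions of web pages. When conditioned on a document plus questions, the generated answers reach 55 F1 on the CoQA dataset, matching or exceeding 3 of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer, and increasing it improves performance log-linearly across tasks. The largest model, GPT-2, is a 1.5-billion-parameter Transformer that achieves state-of-the-art results on 7 of 8 language modeling datasets in a zero-shot setting, yet it still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path toward building language processing systems that learn to perform tasks from naturally occurring demonstrations.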

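When a language model is trained on a large and diverse dataset, it can perform well across a wide range of domains and datasets. GPT-2 reaches state-of-the-art zero-shot performance on 7 of 8 language modeling datasets. The diversity of tasks the model can perform in a zero-shot setting suggests that high-capacity models trained to maximize the likelihood of a sufficiently varied text corpus begin to learn how to perform a surprising number of tasks without explicit supervision.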

Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran "Toward a Theory of Tokenization in LLMs" 2024/4/12

［2404.08335］ Toward a Theory of Tokenization in LLMs

While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple kth-order Markov processes for k>1, transformers exhibit a surprising phenomenon - in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from kth-order Markov sources near optimally. Our analysis provides a justification for the use of tokenization in practice through studying the behavior of transformers on Markovian data.

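A large body of research has tried to circumvent tokenization in language modeling (Clark et al., 2022; Xue et al., 2022), but the current consensus is that it is a necessary initial step in designing state-of-the-art language models. This paper examines tokenization from a theoretical point of view by studying how transformers behave on simple data-generating processes. When trained on data drawn from certain simple kth-order Markov processes with k>1, transformers exhibit a surprising phenomenon: without tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With tokenization added, however, transformers are observed to break through this barrier and model the probabilities of sequences drawn from the source near-optimally, achieving a small cross-entropy loss. Taking this observation as a starting point, the paper studies the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With appropriate tokenization, even the simplest unigram models (over tokens) learned by transformers can model the probability of sequences drawn from kth-order Markov sources near-optimally. By studying the behavior of transformers on Markovian data, the analysis provides a justification for the use of tokenization in practice.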

3. "Transformer, the basic premise of generative AI models"

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin "Attention is All You Need" 2017/6/12

Transformer#68ca889b0000000000576b21

4. "Generative Pre-trained Transformer and text generation"

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever "Improving language understanding by generative pre-training" 2018/6/11

language_understanding_paper.pdf

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

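Natural language understanding comprises a wide range of diverse tasks, such as textual entailment, question answering, semantic similarity assessment, and document classification. Large unlabeled text corpora are abundant, but labeled data for learning these specific tasks remains scarce, which makes it hard for discriminatively trained models to perform adequately. This work demonstrates that large gains on these tasks can be achieved by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. Unlike previous approaches, it applies task-aware input transformations during fine-tuning, achieving effective transfer with minimal changes to the model architecture. The effectiveness of the approach is shown on a wide range of natural language understanding benchmarks. The general task-agnostic model outperforms discriminatively trained models that use architectures crafted for each task, significantly improving on the state of the art in 9 of the 12 tasks studied. Specifically, it achieves absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).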

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever "Language Models are Unsupervised Multitask Learners" 2019/2/14

See above.

Tom B. Brown, OpenAI "Language Models are Few-Shot Learners" 2020/5/28

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.

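Recent work has shown substantial gains on many NLP tasks and benchmarks by pre-training on a large text corpus and then fine-tuning on a specific task. Although the method is task-agnostic in architecture, it still requires task-specific fine-tuning datasets of thousands to tens of thousands of examples. Humans, by contrast, can generally perform a new language task from only a few examples or simple instructions, something current NLP systems still largely struggle to do. This work shows that scaling up language models greatly improves task-agnostic few-shot performance, sometimes becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, it trains GPT-3, an autoregressive language model with 175 billion parameters, 10 times more than any previous non-sparse language model, and evaluates it in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning; tasks and few-shot demonstrations are specified purely through text interaction with the model. GPT-3 performs strongly on many NLP datasets, including translation, question answering, and cloze tasks, as well as on tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or 3-digit arithmetic. The work also identifies datasets where GPT-3's few-shot learning still struggles, and datasets where methodological issues related to training on large web corpora arise. Finally, GPT-3 can generate samples of news articles that human evaluators have difficulty distinguishing from articles written by humans. The broader societal impacts of this finding and of GPT-3 in general are discussed.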

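We presented a 175-billion-parameter language model that shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings. In some cases it nearly matches state-of-the-art fine-tuned systems, and it also generates high-quality samples and shows strong qualitative performance on tasks defined on the fly. We documented roughly predictable trends in how performance scales without fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in developing adaptable, general language systems.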
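Specifying a task "purely via text interaction" amounts to placing a few demonstrations in the prompt and leaving the last answer open. The sketch below shows one way to assemble such a prompt; the `Input:`/`Output:` layout and the function name are assumptions for illustration, not the format used in the paper.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a few-shot prompt: instruction, K solved examples, one open query.

    No model weights change; the demonstrations live only in the input text.
    """
    lines = [instruction, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)


demos = [("cheese", "fromage"), ("dog", "chien")]
prompt = build_few_shot_prompt("Translate English to French.", demos, "cat")
```

Passing an empty demonstration list yields the zero-shot variant, and a single pair yields the one-shot variant.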

Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever "Generating Long Sequences with Sparse Transformers" 2019/4/23

［1904.10509］ Generating Long Sequences with Sparse Transformers

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n\sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.

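Transformers are powerful sequence models, but their time and memory requirements grow quadratically with sequence length. This paper introduces sparse factorizations of the attention matrix that reduce this cost to $O(n\sqrt{n})$. It also introduces a) a variation on architecture and initialization for training deeper networks, b) recomputation of attention matrices to save memory, and c) fast attention kernels for training. Networks with these changes are called Sparse Transformers, and they are shown to model sequences tens of thousands of timesteps long using hundreds of layers. The same architecture models images, audio, and text directly from raw bytes, setting a new state of the art for density modeling on Enwik8, CIFAR-10, and ImageNet-64. Unconditional samples show global coherence and great diversity, and the paper shows that it is possible in principle to use self-attention to model sequences of one million timesteps or more.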
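The reduction from quadratic cost to $O(n\sqrt{n})$ can be checked by counting attended positions in a strided sparse pattern. This is a simplified sketch: it merges a local window and a strided set into one causal mask with stride close to $\sqrt{n}$, whereas the paper factorizes attention across heads; the function names are hypothetical.

```python
import math


def strided_sparse_pattern(n, stride=None):
    """For each query position i, return the set of key positions it may attend to.

    Combines a local window of the previous `stride` positions with every
    `stride`-th earlier position, all restricted to j <= i (causal).
    """
    stride = stride or max(1, math.isqrt(n))
    pattern = []
    for i in range(n):
        local = set(range(max(0, i - stride + 1), i + 1))
        strided = {j for j in range(i + 1) if (i - j) % stride == 0}
        pattern.append(local | strided)
    return pattern


def count_entries(pattern):
    """Total number of (query, key) pairs that are actually computed."""
    return sum(len(row) for row in pattern)


n = 1024
sparse_cost = count_entries(strided_sparse_pattern(n))
dense_cost = n * (n + 1) // 2  # full causal attention
```

Each row holds at most about $2\sqrt{n}$ entries, so the total stays within $2n\sqrt{n}$ instead of growing with $n^2$.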

OpenAI, et al. "GPT-4 Technical Report" 2023/3/15

［2303.08774］ GPT-4 Technical Report

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.

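This report describes the development of GPT-4, a large-scale multimodal model that accepts image and text inputs and produces text outputs. Although less capable than humans in many real-world scenarios, GPT-4 shows human-level performance on various professional and academic benchmarks, including a simulated bar exam on which it scores around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process improves performance on measures of factuality and adherence to desired behavior. A core component of the project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This made it possible to predict some aspects of GPT-4's performance accurately from models trained with no more than 1/1,000th of GPT-4's compute.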

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, Zhifang Sui "A Survey on In-context Learning" 2022/12/31

［2301.00234］ A Survey on In-context Learning

A Survey on In-context Learning - ACL Anthology

With the increasing capabilities of large language models (LLMs), in-context learning (ICL) has emerged as a new paradigm for natural language processing (NLP), where LLMs make predictions based on contexts augmented with a few examples. It has been a significant trend to explore ICL to evaluate and extrapolate the ability of LLMs. In this paper, we aim to survey and summarize the progress and challenges of ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques, including training strategies, prompt designing strategies, and related analysis. Additionally, we explore various ICL application scenarios, such as data engineering and knowledge updating. Finally, we address the challenges of ICL and suggest potential directions for further research. We hope that our work can encourage more research on uncovering how ICL works and improving ICL.

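As the capabilities of large language models (LLMs) have grown, in-context learning (ICL) has emerged as a new paradigm in natural language processing (NLP). In ICL, an LLM makes predictions based on a context augmented with a few examples. Exploring ICL has become a significant trend for evaluating and extrapolating the abilities of LLMs. This paper aims to survey and summarize the progress and challenges of ICL. It first presents a formal definition of ICL and clarifies its relation to related studies. It then organizes and discusses advanced techniques, including training strategies, prompt design strategies, and related analysis. It also explores various ICL application scenarios, such as data engineering and knowledge updating. Finally, it addresses the challenges ICL faces and suggests directions for further research. The authors hope the work will encourage more research on uncovering how ICL works and on improving it.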

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, Furu Wei "Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers" 2022/12/20

［2212.10559］ Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers - ACL Anthology

Large pretrained language models have shown surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand in-context learning as implicit finetuning. Theoretically, we figure out that Transformer attention has a dual form of gradient descent. On top of it, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We comprehensively compare the behaviors of in-context learning and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, we design a momentum-based attention by analogy with gradient descent with momentum. The improved performance over vanilla attention further supports our understanding from another perspective, and more importantly, shows the potential to utilize our understanding for future model design. The code is available at \url{this https URL}.

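Large pretrained language models have shown a surprising ability for in-context learning (ICL). Given a few demonstration input-label pairs, they can predict the label for an unseen input without any parameter updates. Despite this success in performance, the working mechanism remains an open question. This paper explains language models as meta-optimizers and interprets in-context learning as implicit fine-tuning. Theoretically, it shows that Transformer attention has a dual form of gradient descent. On that basis, ICL is understood as follows: GPT first produces meta-gradients from the demonstration examples, and these meta-gradients are then applied to the original GPT to build an ICL model. The paper comprehensively compares the behavior of in-context learning and explicit fine-tuning on real tasks to provide empirical evidence for this understanding. The experiments show that in-context learning behaves similarly to explicit fine-tuning from multiple perspectives. Inspired by the dual form between Transformer attention and gradient descent, the authors design a momentum-based attention by analogy with gradient descent with momentum. Its improvement over vanilla attention supports the understanding from another angle and, more importantly, shows the potential of using it in future model design. The code is available at \url{this https URL}.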
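The duality between attention and gradient descent is easiest to see once the softmax is dropped. The sketch below, which assumes that linear-attention simplification, checks numerically that adding outer-product updates to a weight matrix and then applying it to a query gives the same result as attending over the stored (input, error) pairs. All names and numbers are illustrative.

```python
def matvec(W, x):
    """Matrix-vector product for nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def outer(u, v):
    """Outer product u v^T."""
    return [[ui * vj for vj in v] for ui in u]


def add(A, B):
    """Element-wise matrix sum."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def updated_weights(W0, errors, inputs):
    """Gradient-descent view: W = W0 + sum_i e_i x_i^T."""
    W = W0
    for e, x in zip(errors, inputs):
        W = add(W, outer(e, x))
    return W


def linear_attention(W0, values, keys, query):
    """Attention view: W0 q + sum_i v_i (k_i . q), with no softmax."""
    out = matvec(W0, query)
    for v, k in zip(values, keys):
        score = sum(ki * qi for ki, qi in zip(k, query))
        out = [o + score * vi for o, vi in zip(out, v)]
    return out


W0 = [[1, 0], [0, 1]]
errors = [[1, 2], [0, -1]]  # play the role of attention values
inputs = [[1, 1], [2, 0]]   # play the role of attention keys
query = [3, 4]
via_update = matvec(updated_weights(W0, errors, inputs), query)
via_attention = linear_attention(W0, errors, inputs, query)
```

Both routes return the same vector because $(e_i x_i^\top) q = e_i (x_i^\top q)$, which is the identity the dual form rests on.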

LMOps/understand_icl at main · microsoft/LMOps

This repository contains the implementation of ACL 2023 Findings paper "Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers".

This repository contains the implementation code for the ACL 2023 Findings paper "Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers".

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei "Deep reinforcement learning from human preferences" 2017/6/12

［1706.03741］ Deep reinforcement learning from human preferences

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.

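For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, complex goals must be communicated to them. This work explores goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. It shows that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than 1% of the agent's interactions with the environment. This reduces the cost of human oversight far enough that the approach can be applied in practice to state-of-the-art RL systems. To demonstrate its flexibility, the authors show that complex novel behaviors can be trained successfully with about an hour of human time. These behaviors and environments are considerably more complex than any previously learned from human feedback.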
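A reward model can be fitted to pairwise preferences by treating the summed predicted reward of each segment as a score and comparing the two scores with a logistic function. The sketch below shows only that loss, under the assumption of a Bradley-Terry style preference model; the per-step reward predictor is left out, and the function names and reward values are hypothetical.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred, given per-step predicted rewards."""
    diff = sum(rewards_a) - sum(rewards_b)
    return 1.0 / (1.0 + math.exp(-diff))


def preference_loss(rewards_a, rewards_b, human_prefers_a):
    """Cross-entropy between the predicted preference and the human label."""
    p_a = preference_probability(rewards_a, rewards_b)
    return -math.log(p_a if human_prefers_a else 1.0 - p_a)


# Hypothetical per-step reward predictions for two trajectory segments.
segment_a = [0.2, 0.5, 0.9]
segment_b = [0.1, 0.0, 0.3]
loss_if_agree = preference_loss(segment_a, segment_b, human_prefers_a=True)
loss_if_disagree = preference_loss(segment_a, segment_b, human_prefers_a=False)
```

Minimizing this loss over many labeled pairs pushes the predicted rewards toward an ordering consistent with the human judgments, and the policy is then trained against the learned reward.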

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe "Training language models to follow instructions with human feedback" 2022/3/4

［2203.02155］ Training language models to follow instructions with human feedback

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

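Making language models bigger does not in itself make them better at following a user's intent. For example, large language models (LLMs) can generate outputs that are untruthful, toxic, or simply unhelpful to the user. In other words, these models are not aligned with their users. This paper shows a way to align language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting from a set of labeler-written prompts and prompts submitted through the OpenAI API, the authors collect a dataset of labeler demonstrations of the desired model behavior and use it to fine-tune GPT-3 with supervised learning. They then collect a dataset of rankings of model outputs and use it to fine-tune the supervised model further with reinforcement learning from human feedback. The resulting models are called InstructGPT. In human evaluations on the authors' prompt distribution, outputs from the 1.3-billion-parameter InstructGPT model are preferred to outputs from the 175-billion-parameter GPT-3, despite having 100 times fewer parameters. InstructGPT models also show improved truthfulness and reduced toxic output generation, with minimal performance regressions on public NLP datasets. Although InstructGPT still makes simple mistakes, the results suggest that fine-tuning with human feedback is a promising direction for aligning language models with human intent.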

5. "Diffusion models and image generation"

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" 2020/10/22

［2010.11929］ An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

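The Transformer architecture has become the de facto standard for natural language processing tasks, but its applications to computer vision remain limited. In vision, attention is typically either applied together with convolutional neural networks (CNNs) or used to replace certain components while keeping the overall network structure in place. This work shows that the reliance on CNNs is not necessary and that a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared with state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.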
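Treating an image as a "sequence of image patches" means cutting it into a grid of fixed-size squares and flattening each square into a vector. The sketch below shows that step for a single-channel image stored as nested lists, which is an assumption made to avoid dependencies; the linear projection and position embeddings that follow in ViT are omitted, and the function name is hypothetical.

```python
def image_to_patches(image, patch_size=16):
    """Split an H x W image into flattened patch_size x patch_size patches.

    Patches are returned in row-major order, giving the token sequence
    that a Transformer would consume.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 32 x 32 test image whose pixel value encodes its own position.
image = [[r * 32 + c for c in range(32)] for r in range(32)]
tokens = image_to_patches(image)
```

A 32 x 32 image yields 4 tokens of 256 values each; the sequence length therefore scales with image area divided by patch area.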

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" 2015/3/12

［1503.03585］ Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Deep unsupervised learning using nonequilibrium thermodynamics | Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37

A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable. Here, we develop an approach that simultaneously achieves both flexibility and tractability. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model. We additionally release an open source reference implementation of the algorithm.
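
The "iterative forward diffusion process" can be illustrated with a Gaussian noising chain. The sketch below assumes a Gaussian kernel with a linear variance schedule (the values `1e-4` to `0.02` over 1000 steps are illustrative assumptions, not taken from this abstract) and uses the closed form x_t ~ N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) that holds for such a chain. It covers only the forward, structure-destroying direction; the learned reverse process is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances; the endpoints are assumed, illustrative values."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """Return alpha_bar_t = prod_{s <= t} (1 - beta_s) for every step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * value + noise * rng.gauss(0.0, 1.0) for value in x0]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha(betas)
x0 = [1.0, -1.0, 0.5]  # toy data point
for t in (0, 499, 999):
    print(t, alpha_bars[t], diffuse(x0, alpha_bars[t], rng))
```

As t grows, alpha_bar_t shrinks toward zero, so the sample keeps less of x_0 and approaches pure Gaussian noise.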

W. Feller "On the Theory of Stochastic Processes, With Particular Reference to Applications" 1949

Jonathan Ho, Ajay Jain, Pieter Abbeel "Denoising Diffusion Probabilistic Models" 2020/6/19

［2006.11239］ Denoising Diffusion Probabilistic Models

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics, and our models naturally admit a progressive lossy decompression scheme that can be interpreted as a generalization of autoregressive decoding. On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality similar to ProgressiveGAN. Our implementation is available at this https URL

hojonathanho/diffusion: Denoising Diffusion Probabilistic Models

William Peebles, Saining Xie "Scalable Diffusion Models with Transformers" 2022/12/19

［2212.09748］ Scalable Diffusion Models with Transformers

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
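
The abstract ties Gflops to the number of input tokens. As a rough illustration, assume a square latent that is split into non-overlapping square patches. The 8x autoencoder downsampling of a 256x256 image and the helper name `num_tokens` are assumptions not stated in the abstract.

```python
def num_tokens(latent_size, patch_size):
    """Number of patch tokens for a square latent cut into square patches."""
    if latent_size % patch_size:
        raise ValueError("latent_size must be divisible by patch_size")
    return (latent_size // patch_size) ** 2


latent_size = 256 // 8  # assumed: 256x256 image, 8x spatial downsampling
for patch_size in (8, 4, 2):
    print(patch_size, num_tokens(latent_size, patch_size))
```

Halving the patch size quadruples the sequence length, which is one way the forward-pass cost of the Transformer rises.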

6. "Fusing text and images"

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever "Learning Transferable Visual Models From Natural Language Supervision" 2021/2/26

［2103.00020］ Learning Transferable Visual Models From Natural Language Supervision

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.
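
One common way to formalize "predicting which caption goes with which image" is a symmetric contrastive loss over a batch of paired embeddings. The sketch below is a minimal version under that assumption. The function names, the toy two-dimensional embeddings, and the fixed temperature of 0.07 are illustrative choices, and the abstract itself does not spell out the loss.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def cross_entropy_row(logits, target_index):
    """Softmax cross-entropy for one row, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy; pair k matches k."""
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    logits = [
        [sum(a * b for a, b in zip(image, text)) / temperature for text in texts]
        for image in images
    ]
    count = len(images)
    image_to_text = sum(cross_entropy_row(logits[k], k) for k in range(count)) / count
    columns = [list(column) for column in zip(*logits)]
    text_to_image = sum(cross_entropy_row(columns[k], k) for k in range(count)) / count
    return (image_to_text + text_to_image) / 2


# Toy batch: in `aligned` each caption matches its image, in `shuffled` they are swapped.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(image_batch, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(image_batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, shuffled)
```

The loss is small when each image is most similar to its own caption and large when the pairing is wrong, which is the signal that pushes matching images and texts together in the shared embedding space.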

openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen "Hierarchical Text-Conditional Image Generation with CLIP Latents" 2022/4/13

［2204.06125］ Hierarchical Text-Conditional Image Generation with CLIP Latents

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani "Imagic: Text-Based Real Image Editing with Diffusion Models" 2022/10/17

［2210.09276］ Imagic: Text-Based Real Image Editing with Diffusion Models

Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently either limited to specific editing types (e.g., object overlay, style transfer), or apply to synthetically generated images, or require multiple input images of a common object. In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image. For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc. -- each within its single high-resolution natural image provided by the user. Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Our method, which we call "Imagic", leverages a pre-trained text-to-image diffusion model for this task. It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework.

7. "Scaling laws for generative AI models"

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei "Scaling Laws for Neural Language Models" 2020/1/23

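We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.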
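
A power law between loss and model size appears as a straight line on log-log axes, so its exponent can be estimated with an ordinary least-squares fit. The sketch below assumes the simple form loss = coefficient * size^(-exponent) and uses synthetic data: the coefficient 12.0, the exponent 0.08, and the function name `fit_power_law` are made up for illustration and are not values from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = coefficient * size ** (-exponent) by least squares in log-log space."""
    xs = [math.log(size) for size in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free measurements generated from an assumed power law.
sizes = [10 ** k for k in range(5, 10)]
losses = [12.0 * size ** -0.08 for size in sizes]
coefficient, exponent = fit_power_law(sizes, losses)
print(coefficient, exponent)
```

With real training runs the points would scatter around the line, and the fitted exponent would summarize how quickly the loss falls as the model grows.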

Power law

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish "Scaling Laws for Autoregressive Generative Modeling" 2020/10/28

［2010.14701］ Scaling Laws for Autoregressive Generative Modeling

We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image↔text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains.

The cross-entropy loss has an information theoretic interpretation as $S(\mathrm{True}) + D_{\rm KL}(\mathrm{True}\,\|\,\mathrm{Model})$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an 8×8 resolution, and we can forecast the model size needed to achieve any given reducible loss (i.e. $D_{\rm KL}$) in nats/image for other resolutions.
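
The decomposition of cross-entropy into the entropy of the true distribution plus the KL divergence to the model can be checked numerically. The two small discrete distributions below are arbitrary examples chosen for illustration, and all quantities are in nats.

```python
import math


def entropy(p):
    """S(p) = -sum p_i log p_i, the irreducible part of the loss."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = sum p_i log(p_i / q_i), the reducible part of the loss."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i, the quantity a model is trained to minimize."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


true_dist = [0.5, 0.3, 0.2]   # arbitrary example of a "true" distribution
model_dist = [0.4, 0.4, 0.2]  # arbitrary example of a model's prediction
total = cross_entropy(true_dist, model_dist)
parts = entropy(true_dist) + kl_divergence(true_dist, model_dist)
print(total, parts)
```

The two printed values agree up to floating-point error. Only the KL term can be driven toward zero by a better model, which is why it is called the reducible loss.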

We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.

DeepSeek-AI, et al. "DeepSeek-V3 Technical Report" 2024/12/27

［2412.19437］ DeepSeek-V3 Technical Report

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at this https URL.
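
The gap between 671B total and 37B activated parameters comes from routing each token to only a few experts. The sketch below shows generic top-k softmax routing, which is one standard way to build a Mixture-of-Experts layer. It is not DeepSeekMoE's actual gating or its auxiliary-loss-free load balancing, and the toy scalar experts and the names `top_k_route` and `moe_layer` are hypothetical.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    peak = max(router_logits)
    exps = [math.exp(value - peak) for value in router_logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_layer(token, experts, router_logits, k):
    """Weighted sum of the outputs of the selected experts only."""
    weights = top_k_route(router_logits, k)
    return sum(weight * experts[i](token) for i, weight in weights.items())


# Toy scalar "experts"; a real expert would be a feed-forward network.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 1.5]  # assumed router scores for one token
weights = top_k_route(router_logits, k=2)
output = moe_layer(3.0, experts, router_logits, k=2)
print(weights, output)
print(f"activated fraction quoted in the abstract: {37 / 671:.1%}")
```

Only the selected experts run for a given token, so compute per token scales with the activated parameters rather than with the total.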

deepseek-ai/DeepSeek-V3

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" 2022/1/28

［2201.11903］ Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
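
Chain of thought prompting only changes the text given to the model: each exemplar shows intermediate reasoning before its answer. The sketch below assembles such a prompt as a plain string. The `Q:`/`A:` layout, the helper name `build_cot_prompt`, and both word problems are illustrative assumptions, and no model is called.

```python
def build_cot_prompt(exemplars, question):
    """Format (question, reasoning, answer) exemplars, then append the new question."""
    parts = []
    for exemplar_question, reasoning, answer in exemplars:
        parts.append(f"Q: {exemplar_question}\nA: {reasoning} The answer is {answer}.")
    # The final answer slot is left open for the model to continue.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Illustrative exemplar with worked reasoning steps.
exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11.",
        "11",
    ),
]
prompt = build_cot_prompt(
    exemplars,
    "A baker makes 4 trays of 6 rolls and sells 9. How many rolls are left?",
)
print(prompt)
```

A standard few-shot prompt would contain only the final answers, so the reasoning text inside each exemplar is the single difference being tested.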

DeepSeek-AI, et al. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" 2025/1/22

［2501.12948］ DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" 2024/2/5

［2402.03300］ DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
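
The memory saving attributed to GRPO comes from replacing PPO's learned value model with a baseline computed from a group of sampled answers to the same question. The sketch below shows only that group-relative advantage step, under the assumption that each answer receives one scalar reward. The use of the population standard deviation, the `eps` term, and the function name are illustrative choices, and the clipped policy update and KL penalty are not shown.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Hypothetical rewards for four sampled answers to one question (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat their group's average get a positive advantage and the rest get a negative one, so no separate critic network has to be trained or kept in memory.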

deepseek-ai/DeepSeek-Math: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

DeepSeekMath is initialized with DeepSeek-Coder-v1.5 7B and continues pre-training on math-related tokens sourced from Common Crawl, together with natural language and code data for 500B tokens. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. For research purposes, we release checkpoints of base, instruct, and RL models to the public.

deepseek-ai/deepseek-coder-7b-base-v1.5 · Hugging Face

8. "Evaluating generative AI models"

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, Ion Stoica "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference" 2024/3/7

［2403.04132］ Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at this https URL.
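
Pairwise votes can be turned into a ranking with a Bradley-Terry model, a classical method for paired comparisons. The sketch below assumes that model and fits it with a simple iterative update. The vote counts, the model names, and the function name `fit_bradley_terry` are invented for illustration, and ties and confidence intervals are ignored.

```python
def fit_bradley_terry(wins, iterations=200):
    """wins[a][b] = times model a beat model b. Returns strengths that sum to 1."""
    models = sorted(wins)
    strength = {model: 1.0 / len(models) for model in models}
    for _ in range(iterations):
        updated = {}
        for a in models:
            total_wins = sum(wins[a].get(b, 0) for b in models if b != a)
            # Each pairing contributes its number of games over the summed strengths.
            denominator = sum(
                (wins[a].get(b, 0) + wins[b].get(a, 0)) / (strength[a] + strength[b])
                for b in models
                if b != a
            )
            updated[a] = total_wins / denominator
        norm = sum(updated.values())
        strength = {model: value / norm for model, value in updated.items()}
    return strength


# Invented vote counts between three hypothetical models.
wins = {
    "model_a": {"model_b": 70, "model_c": 80},
    "model_b": {"model_a": 30, "model_c": 60},
    "model_c": {"model_a": 20, "model_b": 40},
}
strength = fit_bradley_terry(wins)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking, strength)
```

Under this model the probability that one model beats another is its strength divided by the sum of the two strengths, so a leaderboard can be read directly from the fitted values.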

LMArena

Long Phan, et al. "Humanity’s Last Exam" 2025/1/24

［2501.14249］ Humanity's Last Exam

Humanity's Last Exam - Wikipedia

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at this https URL.

#藏書