PaLM: Scaling Language Modeling with Pathways

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.

We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

Google AI Blog: Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

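Highest training efficiency achieved so far for an LLM at this scale

Hardware FLOPs utilization of 57.8%

Transformer block reformulated so that the attention layer and feedforward layer can be computed in parallel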
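
The reformulation noted above can be sketched as follows. This is a minimal illustration in plain Python, not PaLM's implementation: `layer_norm` is a simplified normalization without learned parameters, and `toy_attention` and `toy_mlp` are hypothetical stand-ins for the real sublayers. In the serial form the feedforward input depends on the attention output; in the parallel form both sublayers read the same normalized input, so they can be computed independently and then summed.

```python
import math

def layer_norm(x, eps=1e-6):
    """Simplified layer norm over one vector (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

# Hypothetical stand-ins for the real sublayers (assumptions for illustration).
def toy_attention(x):
    return [0.5 * v for v in x]

def toy_mlp(x):
    return [max(0.0, v) for v in x]

def serial_block(x, attention=toy_attention, mlp=toy_mlp):
    """Standard form: the MLP consumes the result of the attention step."""
    h = add(x, attention(layer_norm(x)))
    return add(h, mlp(layer_norm(h)))

def parallel_block(x, attention=toy_attention, mlp=toy_mlp):
    """Parallel form: both sublayers share one normalized input."""
    normed = layer_norm(x)
    return add(x, attention(normed), mlp(normed))

x = [1.0, -2.0, 3.0, 0.5]
print(serial_block(x))
print(parallel_block(x))
```

The two forms are not numerically identical, so this is a change to the architecture rather than a pure refactoring; the sketch only shows why the parallel form removes the dependency between the two sublayers.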

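Dataset

English and multilingual datasets

High-quality web documents, books, Wikipedia, conversations, GitHub code, and more

"lossless" vocabulary

Preserves all whitespace

Especially important for code

Splits out-of-vocabulary Unicode characters into bytes

Splits numbers into individual tokens, one per digit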
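
The three vocabulary properties above can be illustrated with a toy pre-tokenizer. This is a sketch in plain Python, not PaLM's actual tokenizer: the `toy_tokenize` and `toy_detokenize` functions, the tiny `VOCAB` set, and the `<0x..>` spelling of byte tokens are all assumptions made for illustration.

```python
import re

# Hypothetical tiny vocabulary of known pieces (an assumption).
VOCAB = {"x", "=", "def", "f", "(", ")", ":", "return"}

BYTE_TOKEN = re.compile(r"<0x[0-9A-F]{2}>")

def toy_tokenize(text):
    """Split text into tokens without discarding any character."""
    tokens = []
    # Pieces: whitespace runs, single digits, letter runs, any other character.
    for piece in re.findall(r"\s+|\d|[^\W\d]+|.", text):
        if piece.isspace() or piece.isdigit() or piece in VOCAB:
            tokens.append(piece)  # whitespace and each digit are kept as-is
        else:
            # Out-of-vocabulary piece: fall back to its UTF-8 bytes.
            tokens.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return tokens

def toy_detokenize(tokens):
    """Invert toy_tokenize, decoding byte tokens back into text."""
    out = bytearray()
    for tok in tokens:
        if BYTE_TOKEN.fullmatch(tok):
            out.append(int(tok[3:5], 16))
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8")

sample = "def f(x):\n    return 2024"
tokens = toy_tokenize(sample)
print(tokens)
print(toy_detokenize(tokens) == sample)
```

Because whitespace runs survive as tokens, the indentation of the sample function is recoverable exactly, which is the property the notes call out as important for code.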

https://gyazo.com/c43d47a06772efb26f3cbc5df7a32a69

Language understanding and generation

https://gyazo.com/a45e601ae7b53e660c37c1e2a0d40b80

BIG-bench

https://gyazo.com/c8f7a90c35a2e107b2493634734d0a2a

https://gyazo.com/a8561dcf27fd5a1acde7bf471be4ba32

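Examples showing PaLM 540B 1-shot performance on the BIG-bench tasks of labeling cause and effect, conceptual understanding, guessing movies from emoji, and finding synonyms and counterfactuals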

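Reasoning

By combining model scale with chain-of-thought prompting, PaLM shows breakthrough capabilities on reasoning tasks that require multi-step arithmetic or common-sense reasoning. Prior LLMs such as Gopher saw less benefit from model scale in improving performance.

For example, with 8-shot prompting, PaLM solves 58% of the problems in GSM8K, a benchmark of thousands of grade-school-level math problems.

This exceeds the previous top score of 55%, which was achieved by fine-tuning the GPT-3 175B model on a training set of 7,500 problems and combining it with an external calculator and verifier.

The new score is especially interesting because it approaches the 60% average of problems solved by 9 to 12 year olds, the target audience for the question set. The separate encoding of digits in the PaLM vocabulary is thought to contribute to this improvement.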

https://gyazo.com/865f213439fe0983cd5dd4225e9dcf95

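Standard prompting versus chain-of-thought prompting, on an example grade-school math problem. Chain-of-thought prompting decomposes the prompt for a multi-step reasoning problem into intermediate steps (highlighted in yellow), similar to how a person would approach it.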
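
The difference between the two prompt styles can be sketched in code. This is a minimal illustration, not an API of PaLM: the `build_prompt` helper and the `Q:`/`A:` layout are assumptions, and the exemplar is an illustrative grade-school problem with a hand-written reasoning chain.

```python
def build_prompt(exemplars, question, chain_of_thought):
    """Format few-shot exemplars followed by the new question.

    Each exemplar is a (question, reasoning, answer) tuple. With
    chain_of_thought=False the reasoning is left out of the prompt.
    """
    blocks = []
    for q, reasoning, answer in exemplars:
        if chain_of_thought:
            blocks.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
        else:
            blocks.append(f"Q: {q}\nA: The answer is {answer}.")
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Illustrative exemplar and question (assumptions for this sketch).
EXEMPLARS = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
    "tennis balls. 5 + 6 = 11.",
    "11",
)]
QUESTION = ("The cafeteria had 23 apples. If they used 20 to make lunch and "
            "bought 6 more, how many apples do they have?")

standard = build_prompt(EXEMPLARS, QUESTION, chain_of_thought=False)
cot = build_prompt(EXEMPLARS, QUESTION, chain_of_thought=True)
print(cot)
```

Both prompts end with an open `A:` for the model to complete. The only difference is whether the exemplar answers spell out the intermediate steps, which is what encourages the model to produce its own steps before the final answer.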

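Performance improves when the explanation is crafted well

prompt engineering

"Prompt engineering," in which humans assist AI's thinking: watch for dramatic advances in capability | Nikkei xTECH

Notably, PaLM can generate explicit explanations even for scenarios that require a complex combination of multi-step logical inference, world knowledge, and deep language understanding. For example, it can provide high-quality explanations for novel jokes that cannot be found on the web.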

https://gyazo.com/0de242d1021522bee9d07f1b450b061c

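Code generation

PaLM 540B shows strong performance across coding tasks and natural language tasks in a single model, even though code makes up only 5% of its pre-training dataset. Its few-shot performance is especially notable because it is on par with the fine-tuned Codex 12B while using 50 times less Python code for training. This result reinforces earlier findings that larger models can be more sample-efficient than smaller models because they transfer learning from other programming languages and natural language data more effectively.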

https://gyazo.com/e1a2c8eff4945cfc706456750a6eb4c3

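Examples of a fine-tuned PaLM 540B model on text-to-code tasks such as GSM8K-Python and HumanEval, and on code-to-code tasks such as Transcoder.

Performance improves further when PaLM is fine-tuned on a Python-only code dataset; the resulting model is called PaLM-Coder. On an example code-repair task called DeepFix, where the objective is to modify initially broken C programs until they compile successfully, PaLM-Coder 540B achieves a compile rate of 82.1%, outperforming the previous best of 71.7%. This opens up the possibility of fixing more complex errors that arise during software development.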

https://gyazo.com/d85e8a6ea2f4f41b9b1881cad5b33082

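The power of Google's "all-purpose AI": millions of tasks, multilingual support: Nikkei

The strongest

The power of Google's "all-purpose AI": millions of tasks, multilingual support https://t.co/1jxxMzmiRH

"PaLM is a single machine learning model that can handle a variety of tasks, including question answering, text generation, multi-step logical reasoning, translation, code generation, code repair, and even explaining jokes. It supports tasks not only in English but in multiple languages."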

— 小猫遊りょう（たかにゃし・りょう） (@jaguring1) April 27, 2022

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel

Google Research

Submitted on 5 Apr 2022 (v1), last revised 19 Apr 2022 (this version, v3)

https://arxiv.org/abs/2204.02311