Reading run_classifier in Japanese BERT

I am reading https://github.com/yoheikikuta/bert-japanese/blob/master/src/run_classifier.py

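The upstream bert repository is pulled in as a git submodule

That directory is added to sys.path so that modules such as modeling can be reused with import modeling

The BERT model definition and related code live there

Related: VSCode moves imports ahead of the sys.path setup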
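
The sys.path pattern above can be sketched as follows. This is a minimal illustration, not the repository's code: the temporary directory stands in for the submodule checkout, and the module name fake_modeling is hypothetical. The point is that the path must be registered before the import runs, which is why an import reordered above the sys.path line fails.

```python
import importlib
import os
import sys
import tempfile

# Stand-in for the submodule checkout: a temporary directory holding a tiny
# module. "fake_modeling" is a hypothetical name used only in this sketch.
submodule_dir = tempfile.mkdtemp()
with open(os.path.join(submodule_dir, "fake_modeling.py"), "w") as f:
    f.write("def describe():\n    return 'loaded from submodule dir'\n")

# The directory has to be on sys.path before the import is executed.
sys.path.insert(0, submodule_dir)
fake_modeling = importlib.import_module("fake_modeling")

result = fake_modeling.describe()
print(result)
```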

This run_classifier.py is based on the upstream run_classifier.py, with modifications such as using SentencePiece

tf.app.run() calls main

main→model_fn_builder→create_model

Inside create_model:

model = modeling.BertModel(...)

This is where the BERT model is built

modeling is imported from the upstream BERT repository

The BERT network structure is defined there

https://github.com/google-research/bert/blob/88a817c37f788702a363ff935fd173b6dc6ac0d6/modeling.py#L31

# In the demo, we are doing a simple classification task on the entire segment.

# If you want to use the token-level output, use model.get_sequence_output() instead.

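It uses get_pooled_output()

What is this?

It is just a fully connected layer with tanh activation that takes the output for the first token as input and produces an output of size hidden_size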

code:python
 # The "pooler" converts the encoded sequence tensor of shape
 # [batch_size, seq_length, hidden_size] to a tensor of shape
 # [batch_size, hidden_size]. This is necessary for segment-level
 # (or segment-pair-level) classification tasks where we need a fixed
 # dimensional representation of the segment.
 with tf.variable_scope("pooler"):
   # We "pool" the model by simply taking the hidden state corresponding
   # to the first token. We assume that this has been pre-trained
   first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
   self.pooled_output = tf.layers.dense(
       first_token_tensor,
       config.hidden_size,
       activation=tf.tanh,
       kernel_initializer=create_initializer(config.initializer_range))
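
To make the shapes concrete, here is a rough pure-Python sketch of what the pooler computes: slice out the first token's hidden state, apply a dense layer, then tanh. The function name pool_first_token and the toy weights are made up for illustration; the real layer uses trained weights inside TensorFlow.

```python
import math

def pool_first_token(sequence_output, weights, bias):
    """Toy pooler.

    sequence_output: nested list shaped [batch_size][seq_length][hidden_size]
    weights: nested list shaped [hidden_size][hidden_size] (input x output)
    bias: list of length hidden_size
    Returns a nested list shaped [batch_size][hidden_size].
    """
    pooled = []
    for sequence in sequence_output:
        first = sequence[0]  # hidden state of the first token only
        row = []
        for j in range(len(bias)):
            z = sum(first[i] * weights[i][j] for i in range(len(first))) + bias[j]
            row.append(math.tanh(z))
        pooled.append(row)
    return pooled

# Toy input: batch_size=1, seq_length=3, hidden_size=2, identity weights.
sequence_output = [[[0.5, -1.0], [9.0, 9.0], [7.0, 7.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled = pool_first_token(sequence_output, identity, [0.0, 0.0])
print(pooled)
```

With identity weights the result is simply tanh of the first token's vector, and the other two positions are ignored by this layer.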

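I thought, "Wait, is the first token really fine? Wouldn't the last token be better?", but that is wrong

I was still carrying an RNN-style mental model

BERT is built by stacking self-attention layers

Each self-attention layer acts like a variable-length convolution over the layer below

So it does not matter whether the position is first or last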
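
A toy sketch of why position does not matter: in self-attention, the output at every position is a weighted sum over all positions. The code below is a heavily simplified single-head version that assumes Q = K = V = x; real BERT adds learned projections, multiple heads, position embeddings, residual connections and layer normalization. It only shows that changing the last token changes the output at the first position.

```python
import math

def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x (a simplification)."""
    d = len(x[0])
    outputs = []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * value[i] for w, value in zip(weights, x)) for i in range(d)])
    return outputs

sentence_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]  # only the last token differs

first_a = self_attention(sentence_a)[0]
first_b = self_attention(sentence_b)[0]
print(first_a, first_b)
```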

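I also thought: so the first token's output has to carry both information about that word itself and information about the whole sentence, which sounds hard

Wrong

The first position holds the CLS token, so it is not a word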

tokens: [CLS] the dog is hairy . [SEP]

src

Discussion on whether the output for the first token can be interpreted as a vector embedding of the sentence

Features extracted from layer -1 represent sentence embedding for a sentence? · Issue #71 · google-research/bert

Why not use the hidden state of the first token as default strategy, i.e. the [CLS]?

Frequently Asked Questions — bert-as-service 1.6.1 documentation

Why not the last hidden layer? Why second-to-last?

Frequently Asked Questions — bert-as-service 1.6.1 documentation

My conclusion ended up being NO: BERT sentence vectors

l.728 use_one_hot_embeddings=FLAGS.use_tpu) Is this correct?

It is correct

The src says that one-hot embeddings are faster on TPU

tf.flags does not exist in TensorFlow 2.0

AttributeError: module 'tensorflow' has no attribute 'flags'

It should be replaced with something like argparse
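
A minimal sketch of what an argparse replacement could look like. Only a few flag names that appear in these notes are included, the rest of the script's flags are omitted, and the FLAGS variable is kept just so existing FLAGS.xxx references would keep working. This is an illustration, not a tested port of run_classifier.py.

```python
import argparse

def build_parser():
    """Build an argparse parser covering a small subset of the script's flags."""
    parser = argparse.ArgumentParser(description="argparse stand-in for tf.flags")
    parser.add_argument("--bert_config_file", required=True,
                        help="Path to the BERT config JSON.")
    parser.add_argument("--init_checkpoint", default=None,
                        help="Checkpoint prefix, not a directory.")
    parser.add_argument("--use_tpu", action="store_true",
                        help="Whether to run on TPU.")
    return parser

# Parse an explicit argument list here so the sketch runs without a command line.
FLAGS = build_parser().parse_args([
    "--bert_config_file=config.json",
    "--init_checkpoint=../model/model.ckpt-1400000",
])
print(FLAGS.bert_config_file, FLAGS.init_checkpoint, FLAGS.use_tpu)
```

One difference to keep in mind: a store_true flag takes no value, so a command line that passes something like --use_tpu=false would need adjusting.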

Porting to TensorFlow 2 would require changes in quite a few places, which is a hassle, so I decided to create a TF 1 environment with venv instead

code::
 python3 -m venv ./venv
 source ./venv/bin/activate
 pip install --upgrade pip
 pip install tensorflow==1.15rc2
 pip install -r ../requirements.txt

tokenize

tok = tokenization_sentencepiece.FullTokenizer(model_file="../model/wiki-ja.model", vocab_file="../model/wiki-ja.vocab")

In [9]: tok.tokenize("本日は晴天なりインターナルサーバーエラー")

Out[9]: ['▁本', '日', 'は', '晴', '天', 'なり', 'インター', 'ナル', 'サーバー', 'エラー']

bert_config_file

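--bert_config_file= ... needs to point at a config.json, but I could not find one in the repository

There is a config.ini

The beginning of run_classifier.py generates the JSON from the ini

I decided to save that output under the name config.json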
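
The ini-to-JSON step can be sketched roughly like this. It is not the repository's code: the section name BERT-CONFIG, the helper to_value and the demo values are assumptions for illustration. The idea is only that ini values are strings and need converting before being written out as config.json.

```python
import configparser
import json
import os
import tempfile

def to_value(text):
    """Hypothetical helper: ini values are strings, so convert numeric ones."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text

def ini_to_json(ini_path, json_path, section="BERT-CONFIG"):
    """Read one ini section and write it as a JSON object. The section name is assumed."""
    config = configparser.ConfigParser()
    config.read(ini_path, encoding="utf-8")
    values = {key: to_value(value) for key, value in config[section].items()}
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(values, f, indent=2)
    return values

# Demo with made-up values written to a temporary directory.
workdir = tempfile.mkdtemp()
ini_path = os.path.join(workdir, "config.ini")
with open(ini_path, "w", encoding="utf-8") as f:
    f.write("[BERT-CONFIG]\nhidden_size = 768\nhidden_act = gelu\nhidden_dropout_prob = 0.1\n")
json_path = os.path.join(workdir, "config.json")
written = ini_to_json(ini_path, json_path)
print(written)
```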

ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ../model

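I had put the downloaded model.ckpt-1400000.* files in ../model and passed --init_checkpoint=../model, but that is what caused the error above

--init_checkpoint=../model/model.ckpt-1400000 is the correct value: the checkpoint prefix, not the directory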
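
As a small aid for finding the right value, the sketch below lists checkpoint prefixes in a directory by looking for .index files and stripping the suffix. It assumes the usual TF 1 layout where one checkpoint consists of files such as .index, .meta and .data-00000-of-00001 sharing a prefix; the function name find_checkpoint_prefixes is made up, and the demo uses empty placeholder files.

```python
import glob
import os
import tempfile

def find_checkpoint_prefixes(model_dir):
    """Return checkpoint prefixes: paths of '*.index' files with '.index' removed."""
    index_files = sorted(glob.glob(os.path.join(model_dir, "*.index")))
    return [path[: -len(".index")] for path in index_files]

# Demo: empty placeholder files named like the downloaded checkpoint.
model_dir = tempfile.mkdtemp()
for suffix in (".index", ".meta", ".data-00000-of-00001"):
    open(os.path.join(model_dir, "model.ckpt-1400000" + suffix), "w").close()

prefixes = find_checkpoint_prefixes(model_dir)
print(prefixes)  # pass one of these to --init_checkpoint
```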

extract_features.py worked

The command looks like this

$ python3 extract_features.py --model_file=../model/wiki-ja.model --vocab_file=../model/wiki-ja.vocab --input_file=smallinput.txt --bert_config_file=config.json --init_checkpoint=../model/model.ckpt-1400000 --output_file=tmp/output

JSON is written to the file name given by --output_file

code:python
 x = json.load(open("tmp/output"))
 In [8]: x["features"][2]["token"]
 Out[8]: '正しい'
 In [10]: x["features"][2]["layers"][0]
 Out[10]:
 {'index': -1,
  'values': [-0.433769, ...]}

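Each token has a 768-dimensional vector

(If I carelessly run this on a large file, won't it produce an enormous JSON?)

Since what I want is a vector for the whole sentence, maybe I only need to take the first token's vector from layer -1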
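
A sketch of pulling out just that one vector, using the structure seen in the session above (features, then layers, each with index and values). It assumes the output file holds one JSON object per input line, which would also allow reading a large file line by line instead of loading it whole. The function names and the three-dimensional demo vectors are made up.

```python
import json

def first_token_vector(record, layer_index=-1):
    """Return the 'values' list of the given layer for the first token of one record."""
    first_token = record["features"][0]
    for layer in first_token["layers"]:
        if layer["index"] == layer_index:
            return layer["values"]
    raise KeyError("layer %d not found" % layer_index)

def iter_records(lines):
    """Parse one JSON object per non-empty line (an assumption about the file format)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Demo with a synthetic record; real vectors would have 768 values.
demo_lines = [json.dumps({
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2, 0.3]},
                                      {"index": -2, "values": [0.4, 0.5, 0.6]}]},
        {"token": "dog", "layers": [{"index": -1, "values": [0.7, 0.8, 0.9]}]},
    ]
})]
vectors = [first_token_vector(record) for record in iter_records(demo_lines)]
print(vectors)
```

With a real file, passing the open file object to iter_records keeps only one record in memory at a time.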

read_examples

If a line is delimited by |||, it is treated as a pair of two sentences; otherwise it is treated as a single sentence
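
The rule can be sketched as below. This is a simplified stand-in rather than the actual read_examples code: it splits on the first ||| and strips whitespace, and the real implementation may differ in detail. The name split_example is made up.

```python
def split_example(line):
    """Return (text_a, text_b); text_b is None when the line holds a single sentence."""
    line = line.strip()
    text_a, separator, text_b = line.partition("|||")
    if not separator:
        return line, None
    return text_a.strip(), text_b.strip()

pair = split_example("the dog is hairy . ||| it barks .")
single = split_example("the dog is hairy .")
print(pair, single)
```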

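To feed in my own data, I figured it would be better to swap out the call to read_examples inside main than to read from a text file

This is because I want to use a Scrapbox JSON file as the source data

I imported extract_features.py and overrode main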
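
A sketch of the data side of that replacement: turning a Scrapbox export into one text per page, which could then be wrapped into whatever read_examples would normally return. The export layout assumed here (a pages list whose entries have title and lines, with each line either a string or an object with a text field) is an assumption, and the function names are made up. The step that plugs the result into extract_features.py is not shown.

```python
def line_text(line):
    """Lines are assumed to be plain strings or dicts carrying a 'text' field."""
    return line if isinstance(line, str) else line.get("text", "")

def scrapbox_pages_to_texts(export):
    """Return a list of (title, text) with one entry per non-empty page."""
    results = []
    for page in export.get("pages", []):
        parts = [line_text(line).strip() for line in page.get("lines", [])]
        text = " ".join(part for part in parts if part)
        if text:
            results.append((page.get("title", ""), text))
    return results

# Demo with a synthetic export instead of a real Scrapbox JSON file.
export = {"pages": [
    {"title": "BERT", "lines": ["BERT", " first line", {"text": "second line"}]},
    {"title": "empty", "lines": []},
]}
texts = scrapbox_pages_to_texts(export)
print(texts)
```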

Vectorizing 6343 items took 8245 seconds (about 1.3 seconds per item)

MacBook Pro (15-inch, 2018) / 2.6 GHz Intel Core i7 / 16 GB 2400 MHz DDR4

Notes on the experiment: link creation support

I have not yet tried the additional training for fine-tuning in run_classifier.py

Once I try it, I will write it up in Fine-tuning Japanese BERT