Loading a TensorFlow pretrained model in PyTorch (transformers)

https://radiology-nlp.hatenablog.com/entry/2020/01/18/013039

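I followed this article as a reference.

I loaded the weights from yoheikikuta/bert-japanese into Huggingface/Transformers.

The article above builds its own pooler layer, but I was also able to load the weights into BertForSequenceClassification.

Loading them directly does not work.

I think this is because BertForSequenceClassification has a pooler layer, while the pretrained data does not.

A slightly odd workaround does the job:

First load the weights into an empty BERT model, then save it.

Then load from BertForSequenceClassification by pointing it at the saved folder.

This worked.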
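A minimal sketch of the save-then-reload workaround, under assumptions: a tiny randomly initialized BertModel stands in for the "empty BERT" (in the real flow the TensorFlow weights would be loaded into it before saving, which is not shown here), and the config sizes, the helper name `resave_and_load`, and the temporary directory are hypothetical choices for illustration.

```python
import tempfile

import torch
from transformers import BertConfig, BertForSequenceClassification, BertModel


def resave_and_load(base_model, save_dir, num_labels=2):
    """Save a plain BERT model, then reload the folder as a classifier."""
    # Step 1: write config and weights to a folder
    base_model.save_pretrained(save_dir)
    # Step 2: load that folder through the classification class
    return BertForSequenceClassification.from_pretrained(save_dir, num_labels=num_labels)


# Hypothetical tiny config so the sketch runs quickly; not the real model size
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
base = BertModel(config)  # stand-in for the model holding the converted weights

with tempfile.TemporaryDirectory() as save_dir:
    clf = resave_and_load(base, save_dir)

# The encoder weights survive the round trip; the classifier head is new
same_embeddings = torch.equal(
    clf.bert.embeddings.word_embeddings.weight,
    base.embeddings.word_embeddings.weight,
)
```

On reload, Transformers should print a warning listing the weights that were not found in the saved files and were newly initialized, which is a useful starting point for checking what happened to each layer.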

Does this mean PyTorch's save and load do not complain when some weights are missing?

And what is the pooler layer initialized with?

To check later.
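One way to start checking the first question, as a sketch with plain `torch.nn` modules. The classes `Encoder` and `EncoderWithPooler` are hypothetical toy models, not the real BERT classes, and the `from_pretrained` path in Transformers is separate loading code that may behave differently from `load_state_dict`.

```python
import torch
from torch import nn


class Encoder(nn.Module):
    """Toy model standing in for weights saved without a pooler."""

    def __init__(self):
        super().__init__()
        self.body = nn.Linear(4, 4)


class EncoderWithPooler(nn.Module):
    """Toy model that expects an extra pooler layer."""

    def __init__(self):
        super().__init__()
        self.body = nn.Linear(4, 4)
        self.pooler = nn.Linear(4, 4)


saved = Encoder().state_dict()
model = EncoderWithPooler()

# Default (strict=True): missing keys raise a RuntimeError
try:
    model.load_state_dict(saved)
    strict_ok = True
except RuntimeError:
    strict_ok = False

# strict=False: loading succeeds and the missing keys are reported
result = model.load_state_dict(saved, strict=False)
missing = sorted(result.missing_keys)
```

With `strict=False`, the layers listed in `missing` simply keep whatever values they received when the module was constructed.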

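I wrote the tokenizer part myself.

After reading through things like the BertJapaneseTokenizer from cl-tohoku/bert-japanese I mostly understood what it does, so I am tempted to open a pull request adding this pretrained model, but I am not sure whether that is a good idea.

I wrote up an improved version here: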

https://github.com/miyamonz/bert-japanese-finetune-example

Tokenizing the sentences

The Tohoku University version:

https://gyazo.com/85eca71834fe21fe42756a2e53e599ee

encode_plus returns input_ids and attention_mask.

Collect the results for each sentence into a list and torch.cat them.
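A small sketch of that batching step. The function `fake_encode_plus` is a hypothetical stand-in that only mimics the output shape of a tokenizer asked for PyTorch tensors, namely one row of shape (1, max_length) per sentence; the token ids are made up.

```python
import torch


def fake_encode_plus(token_ids, max_length=8):
    """Hypothetical stand-in: pad or truncate and return (1, max_length) tensors."""
    ids = token_ids[:max_length]
    pad = max_length - len(ids)
    return {
        "input_ids": torch.tensor([ids + [0] * pad]),
        "attention_mask": torch.tensor([[1] * len(ids) + [0] * pad]),
    }


# Made-up token ids for two sentences
sentences = [[2, 10, 11, 3], [2, 12, 3]]
encoded = [fake_encode_plus(s) for s in sentences]

# Concatenate the per-sentence rows along the batch dimension
input_ids = torch.cat([e["input_ids"] for e in encoded], dim=0)
attention_mask = torch.cat([e["attention_mask"] for e in encoded], dim=0)
```

Because each row already has a leading dimension of 1, `torch.cat` along `dim=0` gives tensors of shape (number of sentences, max_length).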

Now do the same thing with sentencepiece.

It ends up looking like this.

Handle [CLS] and the other special tokens yourself, and return torch tensors.

code:py
 import sentencepiece as sp
 import torch

 BASE_SPM = 'wiki-ja.model'
 BASE_VOCAB = 'wiki-ja.vocab'

 # DIR_BERT_KIKUTA is assumed to be defined earlier as the directory
 # that holds the yoheikikuta/bert-japanese files
 spm = sp.SentencePieceProcessor()
 spm.Load(DIR_BERT_KIKUTA + BASE_SPM)

 # Produce output similar to the BERT tokenizer's encode_plus
 def spm_encode(example, max_length=512):
     raw_pieces = spm.EncodeAsPieces(example)

     # If the input is longer than max_length, truncate it.
     # The "- 2" leaves room for [CLS] and [SEP].
     if len(raw_pieces) > max_length - 2:
         raw_pieces = raw_pieces[:max_length - 2]

     pieces = []
     # The first token must be [CLS]
     pieces.append("[CLS]")
     for piece in raw_pieces:
         pieces.append(piece)
     # The last token must be [SEP]
     pieces.append("[SEP]")

     # Convert pieces to ids
     input_ids = [spm.PieceToId(p) for p in pieces]
     attention_mask = [1] * len(input_ids)

     # Fill the rest of both lists with 0 up to max_length
     while len(input_ids) < max_length:
         input_ids.append(0)
         attention_mask.append(0)

     return {
         "input_ids": torch.tensor(input_ids),
         "attention_mask": torch.tensor(attention_mask),
     }
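A note on batching the output of this function, as a sketch: it returns 1-D tensors of length max_length, so `torch.cat` would glue them into one long vector, while `torch.stack` adds the batch dimension. The helper `collate` and the example ids below are hypothetical and only mimic the shape of what `spm_encode` returns.

```python
import torch


def collate(encoded_list):
    """Stack 1-D per-sentence tensors into (batch, max_length) tensors."""
    return {
        "input_ids": torch.stack([e["input_ids"] for e in encoded_list]),
        "attention_mask": torch.stack([e["attention_mask"] for e in encoded_list]),
    }


# Made-up examples with the same structure as the spm_encode return value
examples = [
    {"input_ids": torch.tensor([4, 10, 5, 0]), "attention_mask": torch.tensor([1, 1, 1, 0])},
    {"input_ids": torch.tensor([4, 11, 12, 5]), "attention_mask": torch.tensor([1, 1, 1, 1])},
]

batch = collate(examples)

# For comparison: cat on 1-D tensors loses the sentence boundary
flat = torch.cat([e["input_ids"] for e in examples])
```

An alternative would be to have the encoder return tensors of shape (1, max_length), in which case `torch.cat` works as it does with encode_plus.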