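Recognizing speech captured with SpeechRecognition's Microphone using an ESPnet ASR model (workaround)

Continuation of "I want to handle audio captured with SpeechRecognition's Microphone as a numpy array"

Recognition seems to fail because the audio is handled as np.int16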

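code:python
 >>> import numpy as np
 >>> import speech_recognition as sr
 >>> from espnet2.bin.asr_inference import Speech2Text
 >>> recognizer = sr.Recognizer()
 >>> with sr.Microphone(sample_rate=16_000) as source:
 ...     audio_data = recognizer.listen(source)
 ...
 >>> frame_bytes = audio_data.get_raw_data()
 >>> audio_array = np.frombuffer(frame_bytes, dtype=np.int16)  # raw samples as int16
 >>> speech2text = Speech2Text.from_pretrained(
 ...     "kan-bayashi/csj_asr_train_asr_transformer_raw_char_sp_valid.acc.ave"
 ... )
 >>> text, tokens, *_ = speech2text(audio_array)[0]
 >>> text  # the speaker said 「こんにちは」
 'で'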

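I want to convert the audio to np.float and pass it to the Speech2Text model (ref: "ESPnet's speech recognition models seem to require an np.float array as input")

Idea: can soundfile convert it from the bytes?

Example using io.BytesIO: sf.read(io.BytesIO(urlopen(url).read()))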

https://github.com/bastibe/python-soundfile/blob/0.10.3post1/README.rst#virtual-io

code:python
 >>> import io
 >>> import soundfile
 >>> _ = io.BytesIO(frame_bytes)
 >>> type(io.BytesIO(frame_bytes))
 <class '_io.BytesIO'>
 >>> # frame_bytes is headerless raw PCM, so soundfile cannot detect the format
 >>> _ = soundfile.read(io.BytesIO(frame_bytes))
 Traceback (most recent call last):
   ...
 RuntimeError: Error opening <_io.BytesIO object at 0x1480262c0>: File contains data in an unknown format.
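The error above is consistent with frame_bytes being bare samples with no container header. A minimal sketch of wrapping such bytes in a WAV container in memory, using only the standard library. Assumptions: 16-bit mono PCM at 16 kHz; `pcm16_to_wav_bytes` is a hypothetical helper name, and the sample bytes are a short stand-in for `audio_data.get_raw_data()`. SpeechRecognition's `AudioData.get_wav_data()` should give similar header-included bytes, though that is not tried in this note.

```python
import io
import wave


def pcm16_to_wav_bytes(frame_bytes: bytes, sample_rate: int = 16_000) -> bytes:
    """Wrap headerless 16-bit mono PCM in a WAV container, entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)      # assumption: mono
        wav_file.setsampwidth(2)      # assumption: 2 bytes per sample (int16)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frame_bytes)
    return buffer.getvalue()


# Stand-in for audio_data.get_raw_data(): five int16 samples (10 bytes).
frame_bytes = b"\xfc\xff\xd8\xff\xd8\xff\xf3\xff\xdc\xff"
wav_bytes = pcm16_to_wav_bytes(frame_bytes)

# Read the container back to confirm the header describes the samples.
with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
    frame_rate = wav_file.getframerate()
    frame_count = wav_file.getnframes()

print(wav_bytes[:4], frame_rate, frame_count)
```

With a header in place, a format-detecting reader given `io.BytesIO(wav_bytes)` has something to recognize, which the raw bytes did not provide.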

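Workaround: save the audio captured by Microphone to a file once, then read it back with soundfile

I don't really know how to do the conversion (I need to catch up on audio basics)

So I stitch together the pieces that are known to work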

code:python
 >>> import soundfile
 >>> from scipy.io import wavfile
 >>> # Convert to np.int16, which can be played back, and save
 >>> # (guess at the reason: because the bytes are hex? More precisely, they are 16-bit PCM, 2 bytes per sample)
 >>> frame_bytes[:10]
 b'\xfc\xff\xd8\xff\xd8\xff\xf3\xff\xdc\xff'
 >>> wavfile.write("tmp.wav", audio_data.sample_rate, np.frombuffer(audio_data.get_raw_data(), dtype=np.int16))
 >>> data2, rate2 = soundfile.read("tmp.wav")
 >>> rate2
 16000
 >>> data2.dtype
 dtype('float64')
 >>> data2.shape
 (36864,)
 >>> rate, data = wavfile.read("tmp.wav")  # for comparison
 >>> rate
 16000
 >>> data.shape
 (36864,)
 >>> data.dtype
 dtype('int16')
 >>> # Both data2 and data can be played with sounddevice.play
 >>> text, tokens, *_ = speech2text(data2)[0]  # pass the audio read with soundfile (the np.float one)
 >>> text
 'えーこんにちは'
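What the round trip through tmp.wav changes is the dtype and scale of the samples. A minimal sketch of the same change done in memory, using the 10 bytes printed above. Assumptions: the bytes are little-endian 16-bit PCM, and dividing by 32768 approximates the float64 values that soundfile.read returns by default. This is only an illustration.

```python
import numpy as np

# First 10 bytes of frame_bytes as printed in this note.
frame_bytes = b"\xfc\xff\xd8\xff\xd8\xff\xf3\xff\xdc\xff"

# Interpret each pair of bytes as one little-endian int16 sample.
samples_int16 = np.frombuffer(frame_bytes, dtype="<i2")

# Assumption: scale by 1/32768 so that int16 values map into [-1.0, 1.0).
samples_float = samples_int16.astype(np.float64) / 32768.0

print(samples_int16.tolist())  # [-4, -40, -40, -13, -36]
print(samples_float.dtype)     # float64
```

The int16 values are small integers while the float values are fractions near zero, which may be why the same recording is recognized so differently depending on the dtype passed to the model.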

A method that does not use a tempfile (i.e. not a workaround) has been found: "Recognizing speech captured with SpeechRecognition's Microphone using an ESPnet ASR model"