ESPnet's speech recognition (ASR) models seem to need a floating-point NumPy array (e.g. np.float32) as input

Two files, both saying 「こんにちは」 ("hello"), created with the macOS say command

say こんにちは -o hello.wav --data-format=LEF32@16000

say こんにちは -o i16_hello.wav --data-format=LEI16@16000

Reading them with scipy.io.wavfile gives different dtypes

code:python

>> from scipy.io import wavfile

>> sample_rate, frame_data = wavfile.read("i16_hello.wav")

<stdin>:1: WavFileWarning: Chunk (non-data) not understood, skipping it.

>> sample_rate2, frame_data2 = wavfile.read("hello.wav")

>> sample_rate, sample_rate2

(16000, 16000)

>> frame_data.shape, frame_data2.shape

((15551,), (15551,))

>> frame_data.dtype, frame_data2.dtype

(dtype('int16'), dtype('float32'))

Even though the dtypes differ, both play back as the same audio to the human ear (「こんにちは」)

code:python

>> import sounddevice as sd

>> sd.play(frame_data, 16_000)

>> sd.play(frame_data2, 16_000)

When passed to the ASR model, the np.int16 array is not recognized correctly

code:python

>> from espnet2.bin.asr_inference import Speech2Text

>> speech2text = Speech2Text.from_pretrained("kan-bayashi/csj_asr_train_asr_transformer_raw_char_sp_valid.acc.ave")

>> text, tokens, *_ = speech2text(frame_data2)[0]  # the np.float32 array

>> text

'今日は'

>> tokens

['今', '日', 'は']

>> retval = speech2text(frame_data)[0]  # the np.int16 array

>> retval[0]

'で'

>> retval[1]

['で']

>> retval[2]
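A likely explanation, not confirmed in these notes, is that the model expects float samples scaled to roughly [-1.0, 1.0], while the int16 array holds raw values up to ±32768. Under that assumption, a minimal sketch of converting the array by hand follows. The helper name `int16_to_float32` is hypothetical and not part of ESPnet or SciPy.

```python
import numpy as np


def int16_to_float32(samples: np.ndarray) -> np.ndarray:
    """Scale int16 PCM samples to float32 values in [-1.0, 1.0)."""
    if samples.dtype != np.int16:
        raise TypeError(f"expected int16 samples, got {samples.dtype}")
    # 32768 is the magnitude of the most negative int16 value,
    # so the result never reaches +1.0.
    return samples.astype(np.float32) / 32768.0


# Small stand-in for the array returned by wavfile.read("i16_hello.wav")
pcm = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)
floats = int16_to_float32(pcm)
print(floats.dtype, floats.min(), floats.max())
```

With the arrays above, the call to try would be speech2text(int16_to_float32(frame_data)). Whether that returns '今日は' was not checked here.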

With soundfile, both files are read as float

code:python

>> import soundfile

>> speech_array, sampling_rate = soundfile.read("i16_hello.wav")

>> speech_array2, sampling_rate2 = soundfile.read("hello.wav")

>> sampling_rate, sampling_rate2

(16000, 16000)

>> speech_array.shape, speech_array2.shape

((15551,), (15551,))

>> speech_array.dtype, speech_array2.dtype

(dtype('float64'), dtype('float64'))

>> # The ASR model works for both files

>> text, tokens, *_ = speech2text(speech_array2)[0]

>> text

'今日は'

>> text, tokens, *_ = speech2text(speech_array)[0]

>> text

'今日は'

np.testing.assert_array_equal(speech_array, speech_array2) raises AssertionError

Mismatched elements: 12246 / 15551 (78.7%)
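One plausible reason for the mismatch, which is an assumption and not something verified against these two files, is 16-bit quantization: the int16 file can only store multiples of 1/32768, while the float32 file keeps finer values. The sketch below simulates this with a synthetic sine wave (not the actual recordings) and shows that exact equality fails for most samples while a tolerance of one int16 step passes.

```python
import numpy as np

# Synthetic stand-in for the float32 recording: 1 second of a 440 Hz tone
t = np.arange(16000) / 16000.0
original = (0.5 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)

# Simulate storing it as 16-bit PCM, then reading it back as float64
as_int16 = np.round(original * 32768.0).astype(np.int16)
restored = as_int16.astype(np.float64) / 32768.0

# Most samples differ slightly, but never by more than one int16 step
exact_matches = int(np.sum(restored == original))
max_error = float(np.max(np.abs(restored - original.astype(np.float64))))
print(exact_matches, len(original), max_error)

# A tolerance-based comparison passes where assert_array_equal would not
np.testing.assert_allclose(restored, original, rtol=0, atol=1 / 32768)
```

If the same holds for the real files, np.testing.assert_allclose(speech_array, speech_array2, rtol=0, atol=1 / 32768) would be the comparison to try.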

Recognizing audio captured via SpeechRecognition's Microphone with an ESPnet ASR model (workaround)