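MemoryView support for GStreamer

#want-to-do

If this works, I might be able to combine it with Rabbit and the whispercpp gem to do real-time transcription of presentations (in other words, subtitles), or real-time translation (Whisper.cpp models can also translate).

First, just record from the microphone with GStreamer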

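GStreamer を使おう！その1 ～GStreamer の世界へようこそ編～ (Let's Use GStreamer! Part 1: Welcome to the World of GStreamer)

GStreamer で音声を扱ってみる (Trying Out Audio with GStreamer)

Video and the like seem to take quite a while to start up.

Sources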

code:sh

% gst-inspect-1.0 | rg src

autodetect: autoaudiosink: Auto audio sink

autodetect: autoaudiosrc: Auto audio source

Maybe these take care of things nicely?

Sinks

code:shell

% gst-inspect-1.0 | rg audio

audiovisualizers: spacescope: Stereo visualizer

audiovisualizers: spectrascope: Frequency spectrum scope

audiovisualizers: synaescope: Synaescope

audiovisualizers: wavescope: Waveform oscilloscope

For a quick check, I could try one of these four.

Checking that it works

code:sh

% gst-launch-1.0 autoaudiosrc ! autoaudiosink

Setting pipeline to PAUSED ...

Pipeline is live and does not need PREROLL ...

Pipeline is PREROLLED ...

Setting pipeline to PLAYING ...

New clock: GstAudioSrcClock

Redistribute latency...

0:00:02.5 / 99:99:99.

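This runs, but at first nothing seemed to happen.

In fact, with this the microphone was picking up sound and playing it through the speakers. Wonderful.

So autoaudiosrc presumably resolves to the microphone and autoaudiosink to the speakers.

Try putting an audiovisualizers element in between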

code:sh

% gst-launch-1.0 autoaudiosrc ! spacescope ! autoaudiosink

Setting pipeline to PAUSED ...

Pipeline is live and does not need PREROLL ...

Pipeline is PREROLLED ...

Setting pipeline to PLAYING ...

New clock: GstAudioSrcClock

Redistribute latency...

ERROR: from element /GstPipeline:pipeline0/GstAutoAudioSrc:autoaudiosrc0/GstOsxAudioSrc:autoaudiosrc0-actual-src-osxaudi: Internal data stream error.

Additional debug info:

../subprojects/gstreamer/libs/gst/base/gstbasesrc.c(3177): gst_base_src_loop (): /GstPipeline:pipeline0/GstAutoAudioSrc:autoaudiosrc0/GstOsxAudioSrc:autoaudiosrc0-actual-src-osxaudi:

streaming stopped, reason not-negotiated (-4)

Execution ended after 0:00:00.092222583

Setting pipeline to NULL ...

Freeing pipeline ...

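That didn't work.

I don't understand the underlying theory, so I can't tell what the problem is.

But the first one worked, so let's move on for now...

Next, do the same thing with the Ruby bindings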

https://github.com/ruby-gnome/ruby-gnome/tree/main/gstreamer

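There was an official sample: audio-example.rb

It played some odd sound and exited, lol

I think it succeeded.

Trying the microphone while looking at ↑

Gst::Bin#<< adds an element

Gst::Element#>> links elements

Together they correspond to the ! in autoaudiosrc ! autoaudiosink, I suppose.

For how to run the main loop, this looks like a useful reference: helloworld.rb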

code:microphone.rb

require "gst"

bin = Gst::Pipeline.new("pipeline")

clock = bin.pipeline_clock

src = Gst::ElementFactory.make("audiotestsrc", nil) # <- On closer look, this is wrong. It should be autoaudiosrc.

raise "need audiotestsrc from gst-plugins-base" if src.nil?

sink = Gst::ElementFactory.make("autoaudiosink", nil)

raise "need autoaudiosink from gst-plugins-good" if sink.nil?

bin << src << sink

src >> sink

loop = GLib::MainLoop.new

bin.play

begin

  loop.run

ensure

  bin.stop

end

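This doesn't work. It just keeps playing a tone at a constant pitch.

Ah, that's because it's audiotestsrc. I had misread it as autoaudiosrc.

After changing audiotestsrc -> autoaudiosrc, it behaved the same as gst-launch-1.0 autoaudiosrc ! autoaudiosink!

I thought that to support MemoryView I only needed to know the class of this autoaudiosrc, but inspecting it just shows #<GstAutoAudioSrc:0x000000011d0996d8 ptr=0x000000011cf4d5b0>, so I need to look into it more carefully.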

autoaudiosrc

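Also, I want a callback that receives audio from the microphone. Do I need to write a GStreamer element?

For now, maybe convert to 16 kHz 32-bit float PCM and save it to a WAV file.

I wonder if I can get the struct with Data_Get_Struct

Researching various things to support MemoryView

Filed a GitHub issue: gstreamer MemoryView support, or how to get GStreamer structs from Ruby objects? #1634

Making Gst::Sample support MemoryView seems like a good approach

Gst::Sample, as its singular class name suggests, represents a single sample.

Probably... But if so, what is Gst::MapInfo#size the size of?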

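It's something like 4096, so it might mean there are 4096/32 = 128 samples (or, if the size is a byte count, 4096/4 = 1024 32-bit samples)

It might also represent a collection of multiple samples. I can't rule that out.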
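As a rough check on the arithmetic above: assuming Gst::MapInfo#size is a byte count (as the size field of the C GstMapInfo struct is), the number of samples follows from the bytes per sample rather than the bits. The helper name frames_in_buffer is hypothetical, and the 32-bit mono/stereo formats are assumptions made only for illustration.

```ruby
# Hypothetical helper, not part of ruby-gnome.
# Returns the number of frames (one sample per channel) that fit in a
# buffer of size_in_bytes, assuming the size is a byte count.
def frames_in_buffer(size_in_bytes, bits_per_sample:, channels:)
  bytes_per_frame = (bits_per_sample / 8) * channels
  size_in_bytes / bytes_per_frame
end

mono = frames_in_buffer(4096, bits_per_sample: 32, channels: 1)
stereo = frames_in_buffer(4096, bits_per_sample: 32, channels: 2)
puts "mono: #{mono} frames, stereo: #{stereo} frames"
```

Under these assumptions a 4096-byte buffer holds far more than one sample, which is consistent with a buffer being a batch of samples.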

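The length of GstAudioFormatUnpack is "the amount of samples to unpack.", so MapInfo handles multiple samples together. So it doesn't seem appropriate to use it inside the new-sample (singular) callback

Something like audio_chain_get_samples fetches multiple samples at once

Whisper wants multiple samples.

If GStreamer has a class corresponding to that, use it; otherwise, prepare my own buffer class.

Or find a signal other than new-sample

Or find a way to pull multiple samples at once from the sink (element)

Such as audio_chain_get_samples.

Elements seem to take multiple samples at a time from pads, so read the plugin development documentation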

https://gstreamer.freedesktop.org/documentation/plugin-development/index.html?gi-language=c

After creating the pad, you have to set a _chain () function pointer that will receive and process the input data on the sinkpad.

https://gstreamer.freedesktop.org/documentation/plugin-development/basics/pads.html?gi-language=c

It might be good if I could get a GstBufferList out of GstAppSink

https://gstreamer.freedesktop.org/documentation/applib/gstappsink.html?gi-language=c#GstAppSink:buffer-list

https://gstreamer.freedesktop.org/documentation/gstreamer/gstbufferlist.html?gi-language=c

It's pointless unless upstream uses GstBufferList

GstAudioTestSrc

https://github.com/GStreamer/gstreamer/blob/main/subprojects/gst-plugins-base/gst/audiotestsrc/gstaudiotestsrc.h#L108

There's a member called samples_per_buffer, so a buffer seems to be the larger unit

Can multiple samples be fetched at once from a GstPad?

Looks doable with GstStream

https://gstreamer.freedesktop.org/documentation/gstreamer/gststreams.html?gi-language=c#GstStream

Reading the documentation didn't clear things up, so I'll read the source code

https://github.com/GStreamer/gstreamer/blob/f1f5d6002e644fe7e0ee8cffc0919791152f3f9c/subprojects/gst-plugins-base/gst/audioconvert/gstaudioconvert.c

https://github.com/GStreamer/gstreamer/blob/f1f5d6002e644fe7e0ee8cffc0919791152f3f9c/subprojects/gst-plugins-base/ext/vorbis/gstvorbisenc.h

vorbisenc only accepts F32LE, so the situation is similar to whisper.cpp

https://github.com/GStreamer/gstreamer/blob/f1f5d6002e644fe7e0ee8cffc0919791152f3f9c/subprojects/gst-plugins-base/ext/vorbis/gstvorbisenc.c#L733

Where does packet come from?

gst_vorbis_enc_handle_frame

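Where does the buffer passed to this come from?

When does that happen?

What does gst_audio_encoder_set_output_format do, or rather, what happens where as a result of it?

So a single buffer does contain multiple samples after all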

gst_vorbis_enc_output_buffers

Where is vorbisenc->samples_out set?

Inside the loop

Does the buffer read in the new-sample callback contain only one sample (ignoring channels)? This needs checking

It contains multiple samples

https://bookwor.ms/@KitaitiMakoto/113742393527297486

https://bookwor.ms/@KitaitiMakoto/113742395925825477

Looks like there's a lot to investigate and consider

Even with format set to F32LE, the buffer is a sequence of integers. Is this only in Ruby? Does the same happen in C?
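One possible explanation, stated as an assumption rather than a confirmed diagnosis: if the binding exposes the mapped data as an array of byte values (integers from 0 to 255), F32LE audio will look like an integer sequence until every four bytes are reinterpreted as one little-endian float. The sketch below uses only core Ruby. The byte array is built by hand to stand in for mapped buffer data, and bytes_to_f32le is a hypothetical helper.

```ruby
# Hypothetical helper: reinterpret an array of byte values (0..255)
# as little-endian 32-bit floats, four bytes per sample.
def bytes_to_f32le(bytes)
  bytes.pack("C*").unpack("e*")
end

# Stand-in for mapped buffer data: four F32LE samples seen as raw bytes.
original = [0.0, 1.0, -0.5, 0.25]
bytes = original.pack("e*").bytes

decoded = bytes_to_f32le(bytes)
puts "#{bytes.size} bytes -> #{decoded.size} samples"
```

If the values really are raw bytes, this kind of reinterpretation would be needed in C as well; there it is typically just a pointer cast.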

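Try unpacking via GstAudioFormat

GstBuffer has a member called offset; what is it?

Map the layout to MemoryView's column-major and row-major?

That would be a custom convention, so some explanation is needed.

Can item_desc be used?

Notes on handling audio data with Ruby's MemoryView

Found a class called GstAudioFormatInfo; can it be obtained? At any time?

Done naively, appsink's new-sample yields too few samples at a time, so I'll likely end up collecting multiple samples in the Ruby layer, creating a large GstSample holding a buffer_list, and passing it to whisper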
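The "collect in the Ruby layer" idea can be sketched without GStreamer at all. This is a minimal sketch that swaps in a plain Ruby Array for the GstSample with a buffer_list mentioned above; SampleAccumulator is a hypothetical class, and the chunk size of 4 is only for illustration (for real use it would be something like the sample rate times the number of seconds wanted).

```ruby
# Hypothetical accumulator: gathers samples that arrive in small
# batches and yields them in fixed-size chunks.
class SampleAccumulator
  def initialize(chunk_size)
    @chunk_size = chunk_size
    @samples = []
  end

  # Append the samples from one callback invocation and yield every
  # chunk that has become complete.
  def push(samples)
    @samples.concat(samples)
    while @samples.size >= @chunk_size
      yield @samples.shift(@chunk_size)
    end
  end

  # Number of samples still waiting for a full chunk.
  def pending
    @samples.size
  end
end

chunks = []
accumulator = SampleAccumulator.new(4)
# Each inner array stands in for the samples of one new-sample callback.
[[0.1, 0.2, 0.3], [0.4, 0.5], [0.6, 0.7, 0.8, 0.9]].each do |batch|
  accumulator.push(batch) { |chunk| chunks << chunk }
end
puts "#{chunks.size} chunks, #{accumulator.pending} pending"
```

In a real pipeline the block passed to push would be where the chunk is handed to the transcriber.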

There must be constraints such as "after negotiation there is only one GstCaps", so clarify them. Otherwise there are too many cases to handle. DONE

For pads, it can either be a list of possible caps (usually a copy of the pad template's capabilities), in which case the pad is not yet negotiated, or it is the type of media that currently streams over this pad, in which case the pad has been negotiated already.

https://gstreamer.freedesktop.org/documentation/application-development/basics/pads.html?gi-language=c#capabilities-of-a-pad

If you get the caps from the pad rather than from the template, those are the ones actually in use

Need to create GstAudio DONE

Trying to read unpack_format of GstAudioFormatInfo gives GLib-GObject-CRITICAL **: g_boxed_copy: assertion 'G_TYPE_IS_BOXED (boxed_type)' failed

https://github.com/ruby-gnome/ruby-gnome/pull/1641

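This probably can't be generalized much as "the MemoryView representation of audio data"; it will likely end up as "for GStreamer, it works like this".

MemoryView cannot carry the information of whether the data is interleaved or non-interleaved

It could if we decided that interleaved means column-major and non-interleaved means row-major, but I don't feel it's OK to decide that

In that case you would look at obj, see that it is a Gst::Sample, get the Gst::AudioInfo, read its layout member, and check whether it is interleaved or non-interleaved.

Naturally, a MemoryView from a different class in a different library would need a different way to check.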
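To illustrate why the layout has to be known from outside the flat data, here is a small pure-Ruby sketch. Both helper names are hypothetical, and the non-interleaved case assumes each channel's samples are stored contiguously one after another, which is a simplification of what GStreamer actually does.

```ruby
# Interleaved layout: L0 R0 L1 R1 ...
# Group the samples frame by frame, then transpose into per-channel arrays.
def deinterleave(samples, channels)
  samples.each_slice(channels).to_a.transpose
end

# Non-interleaved (planar) layout: L0 L1 ... R0 R1 ...
# Assumes equally sized, contiguous planes (a simplification).
def split_planar(samples, channels)
  samples.each_slice(samples.size / channels).to_a
end

interleaved = [0.1, -0.1, 0.2, -0.2, 0.3, -0.3]
planar = [0.1, 0.2, 0.3, -0.1, -0.2, -0.3]

from_interleaved = deinterleave(interleaved, 2)
from_planar = split_planar(planar, 2)
puts from_interleaved == from_planar
```

The two flat arrays have the same length and item type, so nothing in the data itself says which helper to apply; that is the piece of information that has to come from something like the layout member.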

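To be confirmed

In appsink's new-sample, is the buffer one sample's worth?

What happens with multiple channels?

In new-sample, when the Caps are set to F32LE, are the buffer values actually floating-point numbers? Do I need to convert them myself somehow?

Such as unpack_func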