Ruby bindings for Magika

There aren't any... a sign of the times, I guess.

There is a Rust implementation, so I could write bindings to that, but since the core model is in ONNX format, using ONNX Runtime Ruby would also work.

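That said, part of what makes the Rust implementation fast seems to be that it processes multiple files in parallel, so if I want that too, writing bindings looks like the better choice. For a single file, I suspect plain Ruby would not be slow enough to matter (I haven't measured it; in fact I haven't built anything yet).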

ONNX Runtime Ruby supports Numo::NArray, so the byte-level processing of file contents is best done with Numo::NArray.

The parts of the Python bindings that look important:

code:magika/python/src/magika/magika.py
 file_features = Magika._extract_features_from_seekable(
     seekable,
     self._model_config.beg_size,
     self._model_config.mid_size,
     self._model_config.end_size,
     self._model_config.padding_token,
     self._model_config.block_size,
     self._model_config.use_inputs_at_offsets,
 )
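
To get a feel for what a call like this has to produce, here is a minimal sketch of the beg/end part. `extract_features` is a hypothetical helper of mine, not Magika's function. The whitespace stripping and the padding sides (pad after `beg`, pad before `end`) are assumptions of this sketch, and `mid` and `use_inputs_at_offsets` are left out entirely.

```python
import io


def extract_features(seekable, beg_size, end_size, padding_token, block_size):
    """Hypothetical, simplified stand-in for feature extraction.

    Looks at up to block_size bytes at each end of the stream and returns
    two fixed-length lists of ints (byte values or padding_token).
    """
    # Find the stream size without reading the whole content.
    seekable.seek(0, io.SEEK_END)
    size = seekable.tell()
    n = min(block_size, size)

    # beg: first block, leading whitespace stripped (assumption),
    # truncated to beg_size and padded on the right.
    seekable.seek(0)
    beg = list(seekable.read(n).lstrip()[:beg_size])
    beg += [padding_token] * (beg_size - len(beg))

    # end: last block, trailing whitespace stripped (assumption),
    # last end_size bytes kept and padded on the left.
    seekable.seek(size - n)
    tail = seekable.read(n).rstrip()
    end = list(tail[-end_size:]) if end_size > 0 else []
    end = [padding_token] * (end_size - len(end)) + end

    return beg, end
```

Only `seek`, `tell` and `read` are needed, so the same shape should carry over to a Ruby `IO` or `StringIO` more or less directly.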

code:magika/python/src/magika/magika.py
 def _get_raw_predictions(
     self, features: List[Tuple[Path, ModelFeatures]]
 ) -> npt.NDArray:
     """Get raw predictions from features.
     Given a list of (path, features), return a (files_num, features_size)
     matrix encoding the predictions.
     """
     start_time = time.time()
     X_bytes = []
     for _, fs in features:
         sample_bytes = []
         if self._model_config.beg_size > 0:
             sample_bytes.extend(fs.beg[: self._model_config.beg_size])
         if self._model_config.mid_size > 0:
             sample_bytes.extend(fs.mid[: self._model_config.mid_size])
         if self._model_config.end_size > 0:
             sample_bytes.extend(fs.end[-self._model_config.end_size :])
         X_bytes.append(sample_bytes)
     X = np.array(X_bytes, dtype=np.int32)
     elapsed_time = 1000 * (time.time() - start_time)
     self._log.debug(f"DL input prepared in {elapsed_time:.03f} ms")
     raw_predictions_list = []
     samples_num = X.shape[0]
     max_internal_batch_size = 1000
     batches_num = samples_num // max_internal_batch_size
     if samples_num % max_internal_batch_size != 0:
         batches_num += 1
     for batch_idx in range(batches_num):
         self._log.debug(
             f"Getting raw predictions for (internal) batch {batch_idx + 1}/{batches_num}"
         )
         start_idx = batch_idx * max_internal_batch_size
         end_idx = min((batch_idx + 1) * max_internal_batch_size, samples_num)
         start_time = time.time()
         batch_raw_predictions = self._onnx_session.run(
             ["target_label"], {"bytes": X[start_idx:end_idx, :]}
         )[0]
         elapsed_time = 1000 * (time.time() - start_time)
         self._log.debug(f"DL raw prediction in {elapsed_time:.03f} ms")
         raw_predictions_list.append(batch_raw_predictions)
     return np.concatenate(raw_predictions_list)
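
Stripped of logging and timing, this method does two things: it glues the beg/mid/end slices into one fixed-length row per file, and it cuts the rows into batches of at most 1000. A sketch of just those two pieces, in plain Python so it runs without numpy; `build_row` and `batch_ranges` are names I made up for illustration.

```python
def build_row(beg, mid, end, beg_size, mid_size, end_size):
    """Concatenate the feature slices into one model input row."""
    row = []
    if beg_size > 0:
        row.extend(beg[:beg_size])
    if mid_size > 0:
        row.extend(mid[:mid_size])
    if end_size > 0:
        # The guard matters: end[-0:] would return the whole list.
        row.extend(end[-end_size:])
    return row


def batch_ranges(samples_num, max_batch_size=1000):
    """Return (start_idx, end_idx) pairs covering 0...samples_num."""
    # Ceiling division, same result as the "// then += 1" logic above.
    batches_num = -(-samples_num // max_batch_size)
    return [
        (i * max_batch_size, min((i + 1) * max_batch_size, samples_num))
        for i in range(batches_num)
    ]
```

For example, `batch_ranges(2500)` gives three ranges, the last one covering only 500 samples. In Ruby the batching part could probably be replaced by `each_slice(1000)`.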

Feature extraction from files and IO objects is done in plain Python

The model input is built with numpy

That input is passed to ONNX Runtime
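
Those three stages can be kept apart so that only the last one touches a runtime. Below is a rough sketch of that structure with the model injected as a callable. Every name here is hypothetical, `fake_model` merely stands in for an ONNX Runtime session, the feature step is reduced to "first N bytes", and 256 as the padding value is my assumption.

```python
PADDING_TOKEN = 256  # assumed padding value, outside the byte range 0..255


def read_features(data, size):
    """Stage 1 (plain Python): first `size` bytes, right-padded."""
    head = list(data[:size])
    return head + [PADDING_TOKEN] * (size - len(head))


def build_input(rows):
    """Stage 2 (numpy in the original): stack rows into a rectangular matrix."""
    width = len(rows[0]) if rows else 0
    if any(len(r) != width for r in rows):
        raise ValueError("all rows must have the same length")
    return [list(r) for r in rows]


def predict(blobs, size, run_model):
    """Stage 3: hand the matrix to whatever runs the model."""
    X = build_input([read_features(b, size) for b in blobs])
    return run_model(X)


def fake_model(X):
    # Stand-in for a real session: one 2-class score row per sample.
    return [[1.0, 0.0] if row[0] == PADDING_TOKEN else [0.0, 1.0] for row in X]
```

Calling `predict([b"", b"abc"], 4, fake_model)` returns one score row per input. Swapping `fake_model` for a real session call is then the only runtime-specific part, which is roughly the split I would expect a Ruby port to have as well.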

For now, I tried getting as far as running the ONNX model:

Trying out the ONNX model of Magika, a file format detection tool