A minimal benchmark case that hammers Wardite's i32 operations

Claude Sonnet 4 wrote it for me. The benchmark tooling as well.

https://github.com/udzura/wardite/pull/11

https://github.com/sharkdp/hyperfine

I had it generate a convoluted script around this, but a single one-shot run is probably enough...

Browser

https://scrapbox.io/files/689804c86dd69c8b5ba1a258.png

The results vary quite a bit.

wasmtime

code:result

$ hyperfine 'wasmtime examples/i32_bench.wasm --invoke detailed_arithmetic_loop'

Benchmark 1: wasmtime examples/i32_bench.wasm --invoke detailed_arithmetic_loop

Time (mean ± σ): 2.6 ms ± 0.2 ms User: 1.7 ms, System: 1.8 ms

Range (min … max): 2.3 ms … 4.0 ms 471 runs

Warning: Command took less than 5 ms to complete.

wasmtime is too fast for this benchmark to be meaningful...

wasmedge

code:result

$ wasmedge --enable-time-measuring examples/i32_bench.wasm detailed_arithmetic_loop

2025-08-10 11:35:45.359 info ==================== Statistics ====================

2025-08-10 11:35:45.359 info Total execution time: 28864250 ns

2025-08-10 11:35:45.359 info Wasm instructions execution time: 28864250 ns

2025-08-10 11:35:45.359 info Host functions execution time: 0 ns

2025-08-10 11:35:45.359 info ======================= End ======================

484490200

$ hyperfine 'wasmedge examples/i32_bench.wasm detailed_arithmetic_loop'

Benchmark 1: wasmedge examples/i32_bench.wasm detailed_arithmetic_loop

Time (mean ± σ): 34.2 ms ± 0.9 ms User: 29.9 ms, System: 4.1 ms

Range (min … max): 32.4 ms … 36.6 ms 78 runs

What about wardite?

code:hoge

$ hyperfine 'bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm'

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm

Time (mean ± σ): 772.5 ms ± 14.4 ms User: 715.4 ms, System: 28.8 ms

Range (min … max): 756.1 ms … 795.8 ms 10 runs

Nice, it is suitably slow.

code:res2

$ hyperfine 'bundle exec wardite --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm'

Benchmark 1: bundle exec wardite --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm

Time (mean ± σ): 1.863 s ± 0.034 s User: 1.805 s, System: 0.029 s

Range (min … max): 1.807 s … 1.909 s 10 runs

So YJIT does help (1.863 s without it vs. 772.5 ms with it).

I want to enumerate the instructions being used.

It would be nice to extract just detailed_arithmetic_loop...

I put a benchmark runner into the PR, but

https://github.com/udzura/wardite/actions/runs/16856525819/job/47750277353

I can't really tell whether YJIT is kicking in...

Added a warning.

And then...

https://github.com/udzura/wardite/actions/runs/16856562118/job/47750407274

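No warning was printed, so YJIT itself is enabled. The slowdown is probably because the job runs in a container (some resource limit, or disk speed) or because of a difference in CPU performance.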
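A check like the warning mentioned above could be sketched as follows. This is only an illustration: `yjit_status` is a hypothetical helper, not Wardite's actual code, and it assumes nothing beyond the standard `RubyVM::YJIT.enabled?` call.

```ruby
# Hypothetical helper (not from Wardite): report whether YJIT is really
# active, so CI logs make it obvious when a benchmark ran without it.
def yjit_status
  # RubyVM::YJIT is only defined on CRuby builds that include YJIT.
  return :unavailable unless defined?(RubyVM::YJIT)

  RubyVM::YJIT.enabled? ? :enabled : :disabled
end

status = yjit_status
warn "warning: YJIT was requested but its status is #{status}" unless status == :enabled
```

Printing the status next to the timing makes it easier to tell a missing JIT apart from a slow CI machine.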

A hyperfine run is too coarse. I should just measure each phase with the benchmark library, lol.

Well, in due course...

No, I need to do it soon or I won't understand what is going on.
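Per-phase measurement with the standard `benchmark` library might look roughly like this. The three lambdas are placeholders I made up to stand in for load, instantiate, and invoke; a real run would call into the runtime there.

```ruby
require "benchmark"

# Hypothetical phases standing in for load / instantiate / invoke.
# Each lambda is placeholder work, not a real Wardite call.
phases = {
  load:        -> { Array.new(10_000) { |i| i } },
  instantiate: -> { Hash.new(0) },
  invoke:      -> { (1..10_000).sum },
}

# Time each phase separately so a slow loader is not mistaken for slow
# instruction execution (hyperfine only sees the total wall-clock time).
timings = phases.transform_values { |work| Benchmark.realtime(&work) }

timings.each { |name, sec| puts format("%-12s %.6f s", name, sec) }
```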

Next, I want a list of which instructions are being executed.

code:u-n

trace instructions stats:

trace local_get: 2300001

trace local_set: 1000104

trace i32_const: 800104

trace i32_add: 800000

trace end: 200102

trace i32_mul: 200000

trace if: 200000

trace call: 100000

trace i32_sub: 100000

trace i32_eq: 100000

trace i32_div_s: 100000

trace i32_rem_s: 100000

trace i32_gts: 100000

trace i32_lts: 100000

trace br_if: 100000

trace loop: 1

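Hmm, not great, lol.

But I see, with a loop the number of local_set/local_get executions goes up...

Let's measure with this for now.

Or, if I used the stack cleverly... hmm...

There is no dup instruction: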

https://github.com/WebAssembly/design/issues/1365

so keeping values on the stack while the loop runs does not seem possible...

Then I'll give up on that.

First, let me try replacing a dead-simple case.

code:sample.wat

(module

(func $simple_math (export "simple_math") (param $x i32) (param $y i32) (result i32)

local.get $x

local.get $y

i32.add

i32.const 10

i32.sub

i32.const 5

i32.mul

i32.const 2

i32.div_s

)

)

$ WARDITE_TRACE=1 bundle exec wardite --yjit --no-wasi --invoke simple_math simple_math.wasm 12 12

return value: I32(35)

trace instructions stats:

trace i32_const: 3

trace local_get: 2

trace i32_add: 1

trace i32_sub: 1

trace i32_mul: 1

trace i32_div_s: 1

trace end: 1

https://github.com/udzura/wardite/commit/0c4ecac943d2f4450c6c982853ef8481471c3425

With this it runs again.
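To make the trace above concrete, here is a minimal stack-machine sketch that runs the same simple_math body and counts executed instructions. It is not Wardite's implementation: the instruction encoding and `run_simple_math` are made up for illustration, and 32-bit wrapping is omitted.

```ruby
# Minimal stack-machine sketch (illustrative only): executes the
# simple_math body and counts instructions, like WARDITE_TRACE=1 does.
def run_simple_math(x, y)
  code = [
    [:local_get, 0], [:local_get, 1], [:i32_add],
    [:i32_const, 10], [:i32_sub],
    [:i32_const, 5], [:i32_mul],
    [:i32_const, 2], [:i32_div_s],
    [:end],
  ]
  locals = [x, y]
  stack = []
  stats = Hash.new(0)

  code.each do |op, operand|
    stats[op] += 1
    case op
    when :local_get then stack.push(locals[operand])
    when :i32_const then stack.push(operand)
    when :end       then break
    else
      rhs = stack.pop
      lhs = stack.pop
      stack.push(
        case op
        when :i32_add   then lhs + rhs
        when :i32_sub   then lhs - rhs
        when :i32_mul   then lhs * rhs
        when :i32_div_s then lhs.fdiv(rhs).truncate # truncates toward zero
        end
      )
    end
  end
  [stack.pop, stats]
end

result, stats = run_simple_math(12, 12)
puts "return value: #{result}" # (12 + 12 - 10) * 5 / 2 = 35
stats.sort_by { |_, count| -count }.each { |op, count| puts "trace #{op}: #{count}" }
```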

Twists and turns

code:uyo

$ WARDITE_TRACE=1 bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm

return value: 484490200

trace instructions stats:

trace local_get: 2300001

trace local_set: 1000104

trace i32_const: 800104

trace i32_add: 800000

trace end: 200102

trace i32_mul: 200000

trace if: 200000

trace call: 100000

trace i32_sub: 100000

trace i32_eq: 100000

trace i32_div_s: 100000

trace i32_rem_s: 100000

trace i32_gts: 100000

trace i32_lts: 100000

trace br_if: 100000

trace loop: 1

external call count: 0

external call elapsed: 0.0(s)

$ rake basic_benchmark

wasm-tools parse examples/i32_bench.wat -o examples/i32_bench.wasm

hyperfine --warmup 3 'bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm'

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm

Time (mean ± σ): 783.0 ms ± 7.8 ms User: 758.2 ms, System: 22.0 ms

Range (min … max): 777.1 ms … 804.1 ms 10 runs

No change...

https://mametter.hatenablog.com/entry/2020/09/11/230139

Let me follow those who came before.

This is the profile before the change.

https://scrapbox.io/files/68984273b489da5e56a2d08a.png

How many instances get created?

{I32: 2400114}

At an order of 2 to 3 million, instance creation may not be dominant.

Or rather, isn't the cache kicking in?
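The caching effect suspected here can be illustrated with a small sketch. `I32Value` and its cache range are hypothetical; Wardite's real `cached_or_initialize` may work differently.

```ruby
# Sketch of a small-value instance cache (hypothetical design). Values in
# a fixed range are allocated once and shared, so a benchmark that only
# produces small numbers never exercises Class#new.
class I32Value
  CACHE_RANGE = (0..255)
  attr_reader :value

  def initialize(value)
    @value = value
  end

  CACHE = CACHE_RANGE.map { |v| new(v).freeze }.freeze

  def self.cached_or_initialize(value)
    CACHE_RANGE.cover?(value) ? CACHE[value] : new(value)
  end
end

a = I32Value.cached_or_initialize(10)
b = I32Value.cached_or_initialize(10)
c = I32Value.cached_or_initialize(100_000)
d = I32Value.cached_or_initialize(100_000)
puts a.equal?(b) # → true  (cache hit, same object)
puts c.equal?(d) # → false (cache miss, allocated each time)
```

Under this assumption, a benchmark only measures allocation cost once its values leave the cached range, which is why changing the numbers is a reasonable next step.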

https://scrapbox.io/files/68984a362acd326931854ea6.png

Change the numbers.

code:suuchi

$ hyperfine 'bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm'

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm

Time (mean ± σ): 791.7 ms ± 5.8 ms User: 735.0 ms, System: 28.8 ms

Range (min … max): 781.3 ms … 801.4 ms 10 runs

https://scrapbox.io/files/68984c04330a10fb48caf10e.png

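It increases a little, but Class.new still can hardly be called dominant.

Maybe the cost of the calculation itself just went up a bit.

push_frame is big to begin with.

When the instance cache misses, the cost of new appears to grow.

But whether it becomes dominant is very doubtful.

In the first place, as the numbers get more complex, the cost of the arithmetic itself appears to rise.

Nothing I can do about that...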

https://scrapbox.io/files/68984da8c91832f1cd99f4bc.png

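Re-measuring with more complex numbers, it just looks like some other cost went up, and on top of that Class#new dropped out of view.

Class#new is gone from the profile.

That does not mean it got faster (important).

It may get faster under some conditions, but even if I tried, say, adding loop iterations to increase the instruction count, the loop itself is probably the expensive part, so fixing that looks like the better move.

What does push_frame actually do?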

code:koko.rb

def push_frame(wasm_function)

local_start = stack.size - wasm_function.callsig.size

locals = stack[local_start..]

if !locals

raise LoadError, "stack too short"

end

self.stack = drained_stack(local_start)

locals.concat(wasm_function.default_locals)

arity = wasm_function.retsig.size

frame = Frame.new(-1, stack.size, wasm_function.body, arity, locals)

frame.findex = wasm_function.findex

self.call_stack.push(frame)

end

If locals.concat turns out to be the slow part, then...

https://scrapbox.io/files/68984f555aa775098d7531c9.png

Switching to each wouldn't change anything, would it...

I want to organize the benchmark cases.

Let N be one of:

100,000

500,000

1,000,000

(anything larger takes too long)

code:kaerumae

## 100000

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N1.wasm

Time (mean ± σ): 821.7 ms ± 20.5 ms User: 761.1 ms, System: 30.0 ms

Range (min … max): 795.1 ms … 854.7 ms 10 runs

## 500000

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N2.wasm

Time (mean ± σ): 3.525 s ± 0.031 s User: 3.448 s, System: 0.044 s

Range (min … max): 3.463 s … 3.579 s 10 runs

## 1000000

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N3.wasm

Time (mean ± σ): 7.064 s ± 0.048 s User: 6.959 s, System: 0.063 s

Range (min … max): 6.996 s … 7.139 s 10 runs

code:kaetaato

## 100000

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N1.wasm

Time (mean ± σ): 834.5 ms ± 9.1 ms User: 776.3 ms, System: 29.8 ms

Range (min … max): 821.4 ms … 848.7 ms 10 runs

## 500000

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm

Time (mean ± σ): 3.728 s ± 0.048 s User: 3.654 s, System: 0.042 s

Range (min … max): 3.631 s … 3.782 s 10 runs

## 1000000

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench.wasm

Time (mean ± σ): 7.194 s ± 0.143 s User: 7.106 s, System: 0.053 s

Range (min … max): 6.941 s … 7.329 s 10 runs

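The results after the change are slightly worse...

That may be because I am not properly capping values to the i32 range, so the numbers grow too large, lol.

Even so, there is too little improvement; the real bottleneck may not be Class#new.

From N1 to N3 the peaks in the profile stretch evenly, so there don't seem to be many places to optimize...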
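For reference, the i32 cap mentioned above amounts to wrapping each result into the signed 32-bit range. A minimal sketch follows; `wrap_i32` is a name I made up, not a Wardite method.

```ruby
# Wrap an arbitrary Ruby Integer into the signed 32-bit range that i32
# arithmetic requires. Without this, results keep growing as plain Ruby
# Integers instead of overflowing the way Wasm specifies.
def wrap_i32(value)
  value &= 0xFFFF_FFFF # keep only the low 32 bits
  value >= 0x8000_0000 ? value - 0x1_0000_0000 : value
end

puts wrap_i32(2_147_483_647 + 1) # → -2147483648
puts wrap_i32(65_536 * 65_536)   # → 0
puts wrap_i32(-1)                # → -1
```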

Incidentally, disabling GC actually makes it slower.

code:koreha

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N2.wasm

Time (mean ± σ): 4.106 s ± 0.102 s User: 3.320 s, System: 0.750 s

Range (min … max): 3.954 s … 4.267 s 10 runs

Is push_frame the big one after all? As an experiment I skipped concat when default_locals is empty, and the result changed slightly.

code:kore

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N2.wasm

Time (mean ± σ): 3.572 s ± 0.060 s User: 3.502 s, System: 0.040 s

Range (min … max): 3.486 s … 3.647 s 10 runs

The sample count also went from 826 to 756.
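The experiment described above can be sketched like this. `build_locals` is a hypothetical stand-in for the relevant part of push_frame, not the actual patch.

```ruby
# Sketch: only pay for Array#concat when the function declares locals.
# `default_locals` stands in for the per-function template array.
def build_locals(args, default_locals)
  return args if default_locals.empty? # nothing to append, reuse as-is

  args.concat(default_locals)
end

args = [1, 2]
puts build_locals(args, []).equal?(args)     # → true
puts build_locals([1, 2], [0, 0, 0]).inspect # → [1, 2, 0, 0, 0]
```

This only helps calls to functions without declared locals; a function with locals still pays for the concat on every call.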

The cost of concat is indeed large, but

code:kore.el

(func $detailed_arithmetic_loop (export "detailed_arithmetic_loop") (result i32)

(local $i i32) ;; Loop counter

(local $x i32) ;; First argument

(local $y i32) ;; Second argument

(local $total_result i32) ;; Accumulated result

(local $current_result i32) ;; Current arithmetic result

with a definition like this, an array of size 5 gets concatenated on every call.

If the variables could be folded away it would be faster, but this time...

code:koreda.rb

local_start = stack.size - wasm_function.callsig.size

...

locals.concat(wasm_function.default_locals)

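callsig.size is fixed, so hand over an array that is already extended to its full length from the start,

and only overwrite the values so it can be reused.

Something like that?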

code:koreha

diff --git a/lib/wardite.rb b/lib/wardite.rb

index f1efe1e..d42688a 100644

--- a/lib/wardite.rb

+++ b/lib/wardite.rb

@@ -327,13 +327,12 @@ module Wardite

# @rbs wasm_function: WasmFunction

# @rbs return: void

def push_frame(wasm_function)

+ locals = wasm_function.assign_locals

local_start = stack.size - wasm_function.callsig.size

- locals = stack[local_start..]

- if !locals

- raise LoadError, "stack too short"

- end

+ stack[local_start..]&.each_with_index do |v, i|

+ locals[i] = v

+ end || raise(LoadError, "stack too short")

self.stack = drained_stack(local_start)

- locals.concat(wasm_function.default_locals)

arity = wasm_function.retsig.size

frame = Frame.new(-1, stack.size, wasm_function.body, arity, locals)

@@ -1278,6 +1277,14 @@ module Wardite

code_body.locals_count

end

+ def assign_locals

+ Array.new(callsig.size + locals_count.size, nil)

+ end

+ def locals_all_count

+ @_locals_all_count ||= locals_count.sum

+ end

# @rbs return: Array[wasmValue]

def construct_default_locals

locals = [] #: Array[wasmValue]

https://scrapbox.io/files/68994afea9c54292740ae943.png

https://scrapbox.io/files/68994b08f65e739ec8d15446.png

According to stackprof, CPU time dropped sharply, yet wall-clock time did not get faster...

Building a flame graph with ruby-prof is a pain...

https://github.com/oozou/ruby-prof-flamegraph

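And it seems to hang...

Ah, it isn't hanging, it is slow because it uses the TracePoint API, lol.

The flame graph barely looks different now, and too much time is going into work that isn't visible in the first place...

The results even say concat's processing time was never dominant to begin with...

I can no longer trust the profiler, lol.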

https://scrapbox.io/files/689952aa3b01025b03355d63.png

What is this inexplicable blank space...

It takes too long, but I'll measure with N=500,000.

https://scrapbox.io/files/689955e631a291bf2433e26e.png

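Vernier gives an equivalent result (this one is nice to use).

There is a fair amount of Class#new, if you look for it.

If there are more instructions relative to function calls, this part could grow.

This blank space...

Is it work hidden by YJIT?

Turning YJIT off left just as much invisible work (it actually got longer), so YJIT is unrelated...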

code:uoooooooo

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N2.wasm

Time (mean ± σ): 3.537 s ± 0.049 s User: 3.469 s, System: 0.039 s

Range (min … max): 3.453 s … 3.595 s 10 runs

code:changed

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N2.wasm

Time (mean ± σ): 3.342 s ± 0.046 s User: 3.269 s, System: 0.041 s

Range (min … max): 3.251 s … 3.399 s 10 runs

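Just folding the most frequently executed functions up into their caller made it about 5.5% faster (3.537 s -> 3.342 s)...

I also tried removing nil guards and the like, but that doesn't seem to change much.

For example, something like this:

Build a version where instructions are folded from symbols into integers.

Clearly separate the integer ranges for frequent and infrequent instructions.

Can instruction dispatch also be made faster somehow?

Take a shortcut when the instruction is in the frequent range.

Otherwise, execute normally.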
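The idea in the list above can be sketched as follows. The opcode numbering, `HOT_MAX`, `fold`, and both execute methods are all hypothetical; this is not how Wardite currently dispatches.

```ruby
# Hypothetical opcode numbering: hot instructions get small integers so a
# single comparison separates the fast path from the general one.
OPCODES = { local_get: 0, local_set: 1, i32_const: 2, i32_add: 3,
            i32_mul: 100, i32_div_s: 101 }.freeze
HOT_MAX = 3

# Fold symbolic instructions into [integer, operand] pairs once, up front.
def fold(ops)
  ops.map { |sym, operand| [OPCODES.fetch(sym), operand] }
end

def execute(folded, locals)
  stack = []
  folded.each do |code, operand|
    if code <= HOT_MAX
      # Shortcut: frequent instructions are handled inline.
      case code
      when 0 then stack.push(locals[operand])
      when 1 then locals[operand] = stack.pop
      when 2 then stack.push(operand)
      when 3 then stack.push(stack.pop + stack.pop)
      end
    else
      execute_slow(code, stack)
    end
  end
  stack
end

# General path for everything outside the hot range.
def execute_slow(code, stack)
  rhs = stack.pop
  lhs = stack.pop
  case code
  when 100 then stack.push(lhs * rhs)
  when 101 then stack.push(lhs.fdiv(rhs).truncate)
  else raise ArgumentError, "unknown opcode #{code}"
  end
end

program = fold([[:local_get, 0], [:i32_const, 2], [:i32_add],
                [:i32_const, 3], [:i32_mul], [:local_set, 1]])
locals = [5, 0]
execute(program, locals)
puts locals[1] # → 21, i.e. (5 + 2) * 3
```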

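Other things I want to do:

Is omitting I32 object creation really ineffective?

Class#new still looks big to me.

What about when function calls are not dominant?

Then again, it may be that other costs just grow instead.

Speeding up the binary parser.

Let me make a sample with few function calls and many instructions...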

code:kansuusukunai

$ hyperfine 'bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N2-B.wasm'

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-

Time (mean ± σ): 3.397 s ± 0.105 s User: 3.323 s, System: 0.040 s

Range (min … max): 3.256 s … 3.527 s 10 runs

# after

N2-B.wasm

Time (mean ± σ): 2.917 s ± 0.039 s User: 2.844 s, System: 0.038 s

Range (min … max): 2.838 s … 2.987 s 10 runs

This reduces the relative impact of function calls.

And this is after folding the instructions:

https://scrapbox.io/files/689962fcb8839b00b19a736b.png

local_get/set should now be called even more often.

Maybe I'll try precomputing the local_get/set check...

code:kekka

Benchmark 1: bundle exec wardite --yjit --no-wasi --invoke detailed_arithmetic_loop examples/i32_bench-N2-B.wasm

Time (mean ± σ): 2.891 s ± 0.054 s User: 2.819 s, System: 0.037 s

Range (min … max): 2.789 s … 2.967 s 10 runs

Marginally...

https://scrapbox.io/files/689966553a9bafdc0a0d4088.png

This is probably about the limit of this approach.

At this point, run the grayscale benchmark again.

code:gs1

Benchmark 1: bundle exec ruby ./tmp/grayscale.rb

Time (mean ± σ): 21.822 s ± 0.236 s User: 21.585 s, System: 0.142 s

Range (min … max): 21.575 s … 22.046 s 3 runs

before

Benchmark 1: bundle exec ruby ./tmp/grayscale.rb

Time (mean ± σ): 22.802 s ± 0.214 s User: 22.618 s, System: 0.131 s

Range (min … max): 22.620 s … 23.038 s 3 runs

https://scrapbox.io/files/68998811084176e9c9f78f27.png

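Looking at this result, specifically tuning i32 evaluation and execution does look like it would make things faster.

It is handy that the execution time per type is visualized too.

If I could get rid of cached_or_initialize... what would happen?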

----

At the size of ruby.wasm, load also takes an enormous amount of time, or rather becomes fairly dominant.

https://scrapbox.io/files/68998ecb0421d3ebf0a5cdef.png

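This shows there are so many instructions that the small per-instruction costs pile up.

The major items:

fetch_ops_while_end is slow

Solving this quickly may be a bit tough?

operand_of is somehow heavy

What was I using a regexp for again...

to_sym is heavy

Not the table lookup, but the String#split part

Op.new and Op#initialize are heavy

There are operands, so simple caching is hard too.

The split used to divide instructions into multiple levels looks quite wasteful...

Some spots also seem to need changes on the eval side, so I need to decide what to do for now. Would Data.define be lighter here?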
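One way to answer the Data.define question is to measure it directly. The sketch below compares three candidate representations of a decoded instruction; `OpClass`, `OpData`, and `elapsed` are illustrative names, and the numbers depend on Ruby version and YJIT, so no outcome is assumed here.

```ruby
# Three candidate representations of one decoded instruction.
class OpClass
  attr_reader :code, :operand

  def initialize(code, operand)
    @code = code
    @operand = operand
  end
end

# Data.define exists on Ruby 3.2+; fall back to Struct on older versions.
OpData =
  if defined?(Data) && Data.respond_to?(:define)
    Data.define(:code, :operand)
  else
    Struct.new(:code, :operand)
  end

# Wall-clock time of a block, using a monotonic clock.
def elapsed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

ITERATIONS = 200_000
results = {
  class: elapsed { ITERATIONS.times { |i| OpClass.new(:i32_const, i) } },
  data:  elapsed { ITERATIONS.times { |i| OpData.new(:i32_const, i) } },
  array: elapsed { ITERATIONS.times { |i| [:i32_const, i] } },
}
results.each { |name, sec| puts format("%-6s %.4f s", name, sec) }
```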

code:loder.rb

require "wardite"

require "optparse"

require "ostruct"

$options = OpenStruct.new

opt = OptionParser.new

opt.on('--wasm-file FILE') {|v| $options.wasm_file = v }

opt.parse!

f = File.open($options.wasm_file)

require "vernier"

RubyVM::YJIT.enable

puts "YJIT enabled: #{RubyVM::YJIT.enabled?}"

Vernier.profile(out: "./tmp/load_perf.json") do

start = Time.now

_instance = Wardite::BinaryLoader::load_from_buffer(f);

puts "Profile saved to ./tmp/load_perf.json"

puts "Load time: #{Time.now.to_f - start.to_f} seconds"

end

p "OK"

https://scrapbox.io/files/689c0f43b454ba9837ce475f.png

File load only.

Total opcodes: 3314498, apparently.

Removed the regexp from operand detection.

https://scrapbox.io/files/689c11a62a7a27c98c702694.png

Operands should probably be determined from the raw opcode.
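As an illustration of deciding from the raw opcode rather than a name, consider the load/store family, which occupies the contiguous range 0x28..0x3E in the Wasm binary format. Both method names are hypothetical, and the regexp is only my guess at what a name-based check could look like.

```ruby
# Name-based check: needs a Symbol -> String conversion plus a regexp match.
def memarg_by_regexp?(sym)
  sym.to_s.match?(/_(load|store)/)
end

# Raw-opcode check: load/store instructions are 0x28 (i32.load) through
# 0x3E (i64.store32), so two integer comparisons are enough.
def memarg_by_opcode?(byte)
  byte >= 0x28 && byte <= 0x3E
end

puts memarg_by_regexp?(:i32_load8_s) # → true
puts memarg_by_opcode?(0x2C)         # → true  (0x2C is i32.load8_s)
puts memarg_by_opcode?(0x41)         # → false (0x41 is i32.const)
```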

And then...

code:hayai

YJIT enabled: true

Profile saved to ./tmp/load_perf.json

Load time: 5.0801379680633545 seconds

"OK"

Whoa, here we go!

https://scrapbox.io/files/689c136cbaeea18172df5b59.png

code:sarani

YJIT enabled: true

Profile saved to ./tmp/load_perf.json

Load time: 4.5271360874176025 seconds

"OK"

Dropping a method call made it even faster...

How much does a call cost...

https://scrapbox.io/files/689c14d4ae1d6ca2223ed487.png

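What remains:

Maybe the internal representation doesn't need to be symbols in the first place.

Maybe the Op class is unnecessary.

Something like that?

By the way, without Op creation and fetch_while_end it is about this: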

code:saisoku

YJIT enabled: true

Profile saved to ./tmp/load_perf.json

Load time: 1.1398019790649414 seconds

"OK"

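I changed it to something like dummy << [namespace, code, operand] so that Op.new isn't called, and the load time stayed almost the same...

Even including fetch_ops, might it land in the 3-second range?

Arrays with few elements cost almost nothing to create, was that it?

The code will change a lot, but this is worth doing.

memo

Rather than doing fetch_ops in a revisit pass, couldn't I push onto a stack while parsing?

This too, later.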
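The single-pass idea in the memo can be sketched with a stack of currently open blocks. `resolve_block_positions` and the flat symbol list are assumptions for illustration; the real loader works on a binary stream and stores the positions differently.

```ruby
# Sketch: resolve end/else positions while scanning, using a stack of
# open blocks instead of revisiting the instruction list afterwards.
def resolve_block_positions(codes)
  metas = {}
  open_blocks = []
  codes.each_with_index do |code, idx|
    case code
    when :block, :loop, :if
      metas[idx] = {}
      open_blocks.push(idx)
    when :else
      metas[open_blocks.last][:else_pos] = idx
    when :end
      start = open_blocks.pop
      next if start.nil? # the function body's own final `end`

      meta = metas[start]
      meta[:end_pos] = idx
      # An `if` without `else` falls through to its `end` when false.
      meta[:else_pos] ||= idx if codes[start] == :if
    end
  end
  metas
end

codes = [:block, :if, :nop, :end, :if, :nop, :else, :nop, :end, :end, :end]
positions = resolve_block_positions(codes)
positions.each { |idx, meta| puts "#{codes[idx]}@#{idx}: #{meta}" }
```

In this sketch an `if` with no `else` ends up with `else_pos` equal to `end_pos`, the same relationship the metadata dump later in these notes shows.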

Shall I start with reworking fetch_ops?

Reworking fetch_ops

https://scrapbox.io/files/68a9af310b482c1be2e37751.png

Well, Op creation is the bigger cost, though.

Write code to cross-check that the new and old methods agree.

I had forgotten everything, but it is done in def self.code_body(buf).

code:kekka.rb

irb(Wardite::BinaryLoader):004> pp revisitor.ops.select{ !_1.meta.empty? };

[#<Wardite::Op:0x0000000100f1f418

@code=:block,

@meta={debug_else_idx: -1, debug_end_idx: 67, end_pos: 67},

@namespace=:default,

@operand=[#<Wardite::Block:0x000000011bfb99d0 @block_types=127>]>,

#<Wardite::Op:0x0000000100f1efb8

@code=:if,

@meta={debug_else_idx: -1, debug_end_idx: 15, end_pos: 15, else_pos: 15},

@namespace=:default,

@operand=[#<Wardite::Block:0x000000011bfb9480 @block_types=nil>]>,

#<Wardite::Op:0x0000000100f1e298

@code=:if,

@meta={debug_else_idx: -1, debug_end_idx: 29, end_pos: 29, else_pos: 29},

@namespace=:default,

@operand=[#<Wardite::Block:0x000000011bfb7dd8 @block_types=nil>]>,

#<Wardite::Op:0x0000000100f1d6b8

@code=:if,

@meta={debug_else_idx: -1, debug_end_idx: 44, end_pos: 44, else_pos: 44},

@namespace=:default,

@operand=[#<Wardite::Block:0x000000011bfb6d98 @block_types=nil>]>,

#<Wardite::Op:0x0000000100f1ccb8

@code=:if,

@meta={debug_else_idx: -1, debug_end_idx: 59, end_pos: 59, else_pos: 59},

@namespace=:default,

@operand=[#<Wardite::Block:0x000000011bfb5c18 @block_types=nil>]>,

#<Wardite::Op:0x0000000100f1c3a8

@code=:if,

@meta={debug_else_idx: -1, debug_end_idx: 65, end_pos: 65, else_pos: 65},

@namespace=:default,

@operand=[#<Wardite::Block:0x000000011bfb51a0 @block_types=nil>]>]

Looks right. For an if without an else, else_pos apparently has to equal end_pos (was that so...?)

code:else.rb

[#<Wardite::Op:0x0000000125c3ec10

@code=:if,

@meta={debug_else_idx: 71, debug_end_idx: 75, end_pos: 75, else_pos: 71},

@namespace=:default,

@operand=[#<Wardite::Block:0x0000000125cf8f70 @block_types=127>]>]

else looks fine too.

code:keisoku

$ bundle exec ruby examples/load_perf.rb --wasm-file ./tmp/ruby.wasm

YJIT enabled: true

Profile saved to ./tmp/load_perf.json

Load time: 3.56905198097229 seconds

https://scrapbox.io/files/68a9b6ac06c523c0bb3a1e21.png

Cut about 1 second?

Now, stop creating Op.

Since the revisitor is gone...

HINT: [namespace, code, operand, meta]

namespace is unnecessary...

Now then.

code:kondake

$ bundle exec steep check --severity-level=error

# Type checking files:

......................................F....

lib/wardite.rb:1267:8: error Cannot allow method body have type ::Array[[::Symbol, ::Symbol, ::Array[::Wardite::operandItem], (::Hash[::Symbol, ::Integer] | nil)]] because declared as type ::Array[::Wardite::Op]

│ ::Array[[::Symbol, ::Symbol, ::Array[::Wardite::operandItem], (::Hash[::Symbol, ::Integer] | nil)]] <: ::Array[::Wardite::Op]

│ [::Symbol, ::Symbol, ::Array[::Wardite::operandItem], (::Hash[::Symbol, ::Integer] | nil)] <: ::Wardite::Op

│ ::Array[(::Symbol | ::Array[::Wardite::operandItem] | ::Hash[::Symbol, ::Integer] | nil)] <: ::Wardite::Op

│ ::Object <: ::Wardite::Op

│ ::BasicObject <: ::Wardite::Op

│

│ Diagnostic ID: Ruby::MethodBodyTypeMismatch

│

└ def body

~~~~

Detected 1 problem from 1 file

This error is unavoidable.

code:koko

lib/wardite.rb:339:40: error Cannot pass a value of type ::Array[[::Symbol, ::Symbol, ::Array[::Wardite::operandItem], (::Hash[::Symbol, ::Integer] | nil)]] as an argument of type ::Array[::Wardite::Op]

│ ::Array[[::Symbol, ::Symbol, ::Array[::Wardite::operandItem], (::Hash[::Symbol, ::Integer] | nil)]] <: ::Array[::Wardite::Op]

│ [::Symbol, ::Symbol, ::Array[::Wardite::operandItem], (::Hash[::Symbol, ::Integer] | nil)] <: ::Wardite::Op

│ ::Array[(::Symbol | ::Array[::Wardite::operandItem] | ::Hash[::Symbol, ::Integer] | nil)] <: ::Wardite::Op

│ ::Object <: ::Wardite::Op

│ ::BasicObject <: ::Wardite::Op

│

│ Diagnostic ID: Ruby::ArgumentTypeMismatch

│

└ frame = Frame.new(-1, stack.size, wasm_function.body, arity, locals)

~~~~~~~~~~~~~~~~~~

Detected 1 problem from 1 file

Let's stop here.

Verify load only.

code:koko

$ bundle exec ruby examples/load_perf.rb --wasm-file ./tmp/ruby.wasm

YJIT enabled: true

Profile saved to ./tmp/load_perf.json

Load time: 2.4444618225097656 seconds

That's a long way down.

https://scrapbox.io/files/68a9ba162c50f5546bf053a1.png

Maybe 4 elements is heavy?

code:toiuka

$ bundle exec ruby examples/load_perf.rb --wasm-file ./tmp/ruby.wasm

YJIT enabled: true

Profile saved to ./tmp/load_perf.json

Load time: 2.1779117584228516 seconds

Not creating the Hash improved it further.

I want to get under 2 seconds...

https://scrapbox.io/files/68a9bc22eb4051e7ca5deda1.png

Beyond that, well, keep numbers instead of converting to symbols...

Never thought I'd wish Ruby had a preprocessor.

Can I drop namespace?

code:resolver.rb

def self.resolve_code(c, buf)

namespace, code = Op.to_sym(c)

if namespace == :fc

lower = fetch_uleb128(buf)

return Op.resolve_fc_sym(lower) #: [Symbol, Symbol]

end

return namespace, code #: [Symbol, Symbol]

end

Getting rid of namespace, plus handling fc, that's about it.

Op.to_sym is mostly just a table lookup, though...

code:koreha.rb

SYMS.each_with_index do |sym, i|

$table[i] = sym

end

Is this table even necessary? lol

code:yossha

$ bundle exec ruby examples/load_perf.rb --wasm-file ./tmp/ruby.wasm

YJIT enabled: true

Profile saved to ./tmp/load_perf.json

Load time: 1.5799622535705566 seconds

Finally under 2 seconds.

Since I stopped computing namespace, some inlining is needed...

https://scrapbox.io/files/68a9c1f9eb4051e7ca5e0283.png

Making operand_of a single table should do it.
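A single-table operand_of might look like the sketch below. The table is hypothetical and only covers entries whose layouts are spelled out in these notes; the remaining opcodes would need their own entries.

```ruby
# Sketch of one lookup table from opcode symbol to operand layout,
# built once at load time. Unknown codes fall back to "no operands".
U32 = [:u32].freeze
U32_PAIR = [:u32, :u32].freeze
U8_BLOCK = [:u8_block].freeze
NO_OPERANDS = [].freeze

OPERAND_TABLE = Hash.new(NO_OPERANDS)
%i[local_get local_set local_tee global_get global_set call br br_if].each do |code|
  OPERAND_TABLE[code] = U32
end
%i[memory_init memory_copy i32_load i32_store].each { |code| OPERAND_TABLE[code] = U32_PAIR }
%i[if block loop].each { |code| OPERAND_TABLE[code] = U8_BLOCK }
OPERAND_TABLE.freeze

def operand_of(code)
  OPERAND_TABLE[code]
end

puts operand_of(:local_get).inspect # → [:u32]
puts operand_of(:i32_add).inspect   # → []
```

Sharing frozen layout arrays also means no new array is allocated per lookup.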

Let's think about LEB.

code:i32.rb

def self.operand_of(code)

case code

when :local_get, :local_set, :local_tee, :global_get, :global_set, :call, :br, :br_if

[:u32]

when :memory_init, :memory_copy

[:u32, :u32]

when :memory_size, :memory_grow, :memory_fill

when :call_indirect

when :br_table

when :i32_const

when :i64_const

when :f32_const

when :f64_const

when :if, :block, :loop

[:u8_block]

when :i32_load, :i64_load, :f32_load, :f64_load, :i32_load8_s, :i32_load8_u, :i32_load16_s, :i32_load16_u,

:i64_load8_s, :i64_load8_u, :i64_load16_s, :i64_load16_u, :i64_load32_s, :i64_load32_u, :i32_store, :i64_store,

:f32_store, :f64_store, :i32_store8, :i32_store16, :i64_store8, :i64_store16, :i64_store32

[:u32, :u32]

else

[]

end

For const, i32/u32 are distinguished, which is where the sleb/uleb distinction comes from. Is that necessary?

Looking at uleb:

code:kou.diff

diff --git a/lib/wardite/leb128.rb b/lib/wardite/leb128.rb

index a13f427..41afc4d 100644

--- a/lib/wardite/leb128.rb

+++ b/lib/wardite/leb128.rb

@@ -8,10 +8,9 @@ module Wardite

dest = 0

level = 0

while b = buf.read(1)

- if b == nil

- raise LoadError, "buffer too short"

- end

+ raise LoadError, "buffer too short" unless b

c = b.ord

+ return c if c < 0x80 && level.zero?

upper, lower = (c >> 7), (c & (1 << 7) - 1)

dest |= lower << (7 * level)

@@ -34,9 +33,7 @@ module Wardite

dest = 0

level = 0

while b = buf.read(1)

- if b == nil

- raise LoadError, "buffer too short"

- end

+ raise LoadError, "buffer too short" unless b

c = b.ord

upper, lower = (c >> 7), (c & (1 << 7) - 1)

With this change:

https://scrapbox.io/files/68a9c414d67ea47bd7cbb123.png

code:nantonaku

$ bundle exec ruby examples/load_perf.rb --wasm-file ./tmp/ruby.wasm

YJIT enabled: true

Profile saved to ./tmp/load_perf.json

Load time: 1.567669153213501 seconds

Even on a bad run it went from about 1.7 s to 1.5 s.

At last the cost of StringIO#read has risen to the top...
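For reference, here is a self-contained sketch of both LEB128 decoders with a single-byte fast path on the unsigned side. `read_uleb128` and `read_sleb128` are illustrative names, not Wardite's methods, and they use StringIO#getbyte instead of read(1) plus ord.

```ruby
require "stringio"

# Unsigned LEB128 with a fast path: most operands fit in one byte.
def read_uleb128(buf)
  first = buf.getbyte or raise "buffer too short"
  return first if first < 0x80

  dest = first & 0x7F
  shift = 7
  loop do
    c = buf.getbyte or raise "buffer too short"
    dest |= (c & 0x7F) << shift
    return dest if c < 0x80

    shift += 7
  end
end

# Signed LEB128: same byte layout, but bit 6 of the last byte is the sign.
def read_sleb128(buf)
  dest = 0
  shift = 0
  loop do
    c = buf.getbyte or raise "buffer too short"
    dest |= (c & 0x7F) << shift
    shift += 7
    next if c >= 0x80

    return (c & 0x40).zero? ? dest : dest - (1 << shift)
  end
end

single = [0x7F].pack("C*")
puts read_uleb128(StringIO.new(single)) # → 127
puts read_sleb128(StringIO.new(single)) # → -1
puts read_uleb128(StringIO.new([0xE5, 0x8E, 0x26].pack("C*"))) # → 624485
puts read_sleb128(StringIO.new([0xC0, 0xBB, 0x78].pack("C*"))) # → -123456
```

The same byte 0x7F decodes to 127 unsigned and -1 signed, so the sleb/uleb distinction does matter for signed constants such as i32.const.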

Fixed operand_of.

code:korede

$ bundle exec ruby examples/load_perf.rb --wasm-file ./tmp/ruby.wasm

YJIT enabled: true

Profile saved to ./tmp/load_perf.json

Load time: 1.4801809787750244 seconds

Finally under 1.5 seconds almost consistently...

https://scrapbox.io/files/68a9c7dd898e07b11961e70f.png

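This is the limit!

It fluctuates by 0.1 to 0.2 s depending on the Mac's mood, which is infuriating.

That's about it!

I'm thinking of starting on the slides. Also, having dropped the Op struct and removed namespace, it presumably doesn't run correctly anymore, so I'll have to fix that by sheer willpower...

Some dead code has appeared too...

The "dropped the Op struct" part will get fixed by working through the types.

namespace needs some copy-paste for now...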

#Warditeの計測

#RubyKaigiFollowUp2025