Triton - 基素基

Triton

The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs.

https://openai.com/research/triton

A GPU is organized like this:

https://gyazo.com/f9a8f381083491482284753c4fc683c2

To optimize CUDA code, you have to take the following points into account, based on the architecture above:

Memory transfers from DRAM must be coalesced into large transactions to leverage the large bus width of modern memory interfaces.
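To make "coalesced" concrete, here is a simplified model that counts how many memory transactions one warp's loads need. It assumes 32 threads per warp, float32 elements, and one transaction per 128-byte aligned segment. That segment size is an illustrative assumption; real hardware tracks accesses in smaller sectors, but the contrast between contiguous and strided access is the same.

```python
# Simplified coalescing model (assumed parameters, not exact hardware behavior).
WARP_SIZE = 32
SEGMENT_BYTES = 128  # assumption: one transaction per 128-byte aligned segment
ELEM_BYTES = 4       # float32


def transactions(addresses):
    """Number of distinct aligned segments touched by a list of byte addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


def warp_addresses(stride_elems, base=0):
    """Byte address read by each thread when thread i loads element i * stride_elems."""
    return [base + i * stride_elems * ELEM_BYTES for i in range(WARP_SIZE)]


print(transactions(warp_addresses(1)))          # 1: 32 contiguous floats fill one segment
print(transactions(warp_addresses(32)))         # 32: every thread hits its own segment
print(transactions(warp_addresses(1, base=4)))  # 2: contiguous but misaligned by one float
```

With stride 1 the whole warp is served by a single wide transaction, while stride 32 spends 32 transactions to move the same 128 useful bytes.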

Data must be manually stashed to SRAM prior to being re-used, and managed so as to minimize shared memory bank conflicts upon retrieval.
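Shared memory bank conflicts can be illustrated the same way. The sketch below assumes the common layout of 32 banks, each 4 bytes wide, where bank = (byte address / 4) mod 32 and several threads reading the same word is a broadcast rather than a conflict. It compares a warp reading one column of a 32x32 float tile with the classic trick of padding each row by one extra word.

```python
# Shared memory bank-conflict model (assumes 32 banks x 4-byte words).
NUM_BANKS = 32
WORD_BYTES = 4


def conflict_degree(byte_addrs):
    """Max number of distinct words landing in one bank (1 means conflict-free)."""
    words_per_bank = {}
    for addr in byte_addrs:
        word = addr // WORD_BYTES
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


def column_read(row_len, col=0, rows=32):
    """Byte addresses when thread i reads tile[i][col] from a row-major float32 tile."""
    return [(i * row_len + col) * WORD_BYTES for i in range(rows)]


print(conflict_degree(column_read(32)))  # 32: all threads hit bank `col`, so accesses serialize
print(conflict_degree(column_read(33)))  # 1: padding rows to 33 words spreads them over all banks
```

The padding wastes one word per row but turns a 32-way serialized access into a single conflict-free one. This is the kind of bookkeeping the quote above says must be managed by hand in CUDA.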

Computations must be partitioned and scheduled carefully, both across and within Streaming Multiprocessors (SMs), so as to promote instruction/thread-level parallelism and leverage special-purpose ALUs (e.g., tensor cores).
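One standard way to partition a matrix multiply is by output tiles. The CPU-only sketch below (plain Python, no GPU; the function and block-size names are illustrative, not from any library) shows the two levels: each (pid_m, pid_n) output tile is independent, so on a GPU different tiles could go to different SMs, and inside a tile the K dimension is walked in BLOCK_K chunks, which is where a real kernel would stage data in shared memory and feed tensor cores.

```python
def matmul_tiled(A, B, BLOCK_M=2, BLOCK_N=2, BLOCK_K=2):
    """Tiled C = A @ B on nested lists; illustrates the decomposition only."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    grid_m = (M + BLOCK_M - 1) // BLOCK_M  # ceil-divide so ragged edges get a tile
    grid_n = (N + BLOCK_N - 1) // BLOCK_N
    # Across SMs: output tiles are independent and could run in parallel.
    for pid_m in range(grid_m):
        for pid_n in range(grid_n):
            rows = range(pid_m * BLOCK_M, min((pid_m + 1) * BLOCK_M, M))
            cols = range(pid_n * BLOCK_N, min((pid_n + 1) * BLOCK_N, N))
            acc = {(i, j): 0.0 for i in rows for j in cols}  # per-tile accumulator
            # Within an SM: step through K chunk by chunk; each chunk of A and B
            # is what a GPU kernel would load once and reuse for the whole tile.
            for k0 in range(0, K, BLOCK_K):
                for k in range(k0, min(k0 + BLOCK_K, K)):
                    for i in rows:
                        for j in cols:
                            acc[(i, j)] += A[i][k] * B[k][j]
            for (i, j), value in acc.items():
                C[i][j] = value
    return C


print(matmul_tiled([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

Choosing BLOCK_M, BLOCK_N and BLOCK_K well (large enough for reuse, small enough to fit in SRAM and registers) is exactly the hand-tuning burden described above.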

This is very hard, and even veteran CUDA programmers struggle with it.

So Triton's goal is to automate these optimizations and abstract them away.
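What the programmer writes instead is block-level code. Triton's vector-addition tutorial kernel uses `tl.program_id`, `tl.arange`, and masked `tl.load`/`tl.store`; since real Triton needs a GPU, the sketch below only emulates that programming model in plain Python, with comments mapping each step to its Triton counterpart. The function names here are hypothetical. The point is that each program instance just describes "add this block of elements"; coalescing, SRAM staging and scheduling are left to the compiler.

```python
def add_program(pid, x, y, out, n_elements, BLOCK_SIZE):
    """Emulates one Triton program instance handling one block of elements."""
    block_start = pid * BLOCK_SIZE                          # like tl.program_id(axis=0) * BLOCK_SIZE
    offsets = [block_start + i for i in range(BLOCK_SIZE)]  # like block_start + tl.arange(0, BLOCK_SIZE)
    mask = [off < n_elements for off in offsets]            # guards the ragged last block
    for off, valid in zip(offsets, mask):
        if valid:                                           # like tl.load/tl.store with mask=mask
            out[off] = x[off] + y[off]


def add(x, y, BLOCK_SIZE=4):
    """Launches one emulated program per block, like a 1-D Triton grid."""
    n = len(x)
    out = [0.0] * n
    grid = (n + BLOCK_SIZE - 1) // BLOCK_SIZE               # like triton.cdiv(n, BLOCK_SIZE)
    for pid in range(grid):                                 # on a GPU these run in parallel
        add_program(pid, x, y, out, n, BLOCK_SIZE)
    return out


print(add([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]))
# [11.0, 22.0, 33.0, 44.0, 55.0]
```

Compared with CUDA, there is no per-thread index arithmetic and no explicit shared memory: the unit of work is a whole block, which is what lets the compiler decide how to map it onto threads and memory.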

It uses Triton-IR, an LLVM-based intermediate representation.

The code is ultimately lowered to a representation called PTX and executed on NVIDIA GPUs.

https://gyazo.com/53c247aeb98bec30973021a5847bc5fb

OpenAI