Triton
The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs.
https://gyazo.com/f9a8f381083491482284753c4fc683c2
Optimizing CUDA code requires taking the following into account, based on the architecture shown above:
Memory transfers from DRAM must be coalesced into large transactions to leverage the large bus width of modern memory interfaces.
Data must be manually stashed to SRAM prior to being re-used, and managed so as to minimize shared memory bank conflicts upon retrieval.
Computations must be partitioned and scheduled carefully, both across and within Streaming Multiprocessors (SMs), so as to promote instruction/thread-level parallelism and leverage special-purpose ALUs (e.g., tensor cores).
This is very hard to get right, and even veteran CUDA programmers struggle with it.
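To make the first two points concrete, here is a rough Python sketch of a simplified cost model. It is not Triton or CUDA code. It only counts how many DRAM transactions and how many shared memory bank conflicts a single warp would cause for a given access stride. The constants (32 threads per warp, 128-byte transactions, 32 banks of 4 bytes each) are assumptions that match common Nvidia GPUs, and the function names are made up for this note.

```python
from collections import defaultdict

WARP_SIZE = 32          # threads per warp (assumed)
SEGMENT_BYTES = 128     # size of one DRAM transaction (assumed)
NUM_BANKS = 32          # number of shared memory banks (assumed)
BANK_WIDTH_BYTES = 4    # width of one bank (assumed)


def dram_transactions(stride_elems, elem_bytes=4):
    """Count the distinct 128-byte segments touched by one warp.

    Thread t reads the element at index t * stride_elems.
    Fewer segments means better coalescing.
    """
    addresses = [t * stride_elems * elem_bytes for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


def bank_conflict_degree(stride_elems, elem_bytes=4):
    """Return the worst-case number of distinct words mapped to one bank.

    1 means no conflict. Threads reading the same word are treated as a
    broadcast and do not count as a conflict.
    """
    words_per_bank = defaultdict(set)
    for t in range(WARP_SIZE):
        word = (t * stride_elems * elem_bytes) // BANK_WIDTH_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        print(stride, dram_transactions(stride), bank_conflict_degree(stride))
```

In this model a stride of 1 costs a single transaction and causes no bank conflict, while a stride of 32 costs 32 transactions and serializes all 32 threads on one bank. A stride of 33 removes the bank conflict again, which is why padding a shared memory tile by one element is a common trick. Having to reason about these details by hand is the kind of work that Triton aims to take off the programmer.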
Ultimately the code is lowered to a PTX representation and executed on an Nvidia GPU.
https://gyazo.com/53c247aeb98bec30973021a5847bc5fb