FL33TW00D/wgpu-mm

wgpu-mm

How many FLOPS can we squeeze out of wgpu? The test harness is inspired by Bram Wasti's work here.

GEMM

The M1 8 core GPU can supposedly hit 2.6 TFLOPS of FP32.

A custom Metal shader from Tinygrad can hit 2000 GFLOPS, or ~77% of theoretical peak. This shader uses SIMD groups, which WebGPU doesn't support yet, but support has been proposed a few times, e.g. here.

The best shader we have is an altered version of the one from TensorFlow.js, which reaches ~900 GFLOPS on my M1.
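As a rough illustration of how the GFLOPS figures above relate to a timed kernel, here is a minimal sketch. It assumes the usual 2·M·N·K operation count for a GEMM; the function names and the 2.4 ms timing are hypothetical and are not taken from this repository's harness.

```rust
/// Floating-point operations in C = A @ B with A: [m, k] and B: [k, n].
/// Each output element needs k multiplies and k adds, hence the factor 2.
fn gemm_flops(m: u64, n: u64, k: u64) -> f64 {
    2.0 * (m as f64) * (n as f64) * (k as f64)
}

/// Achieved throughput in GFLOPS for one GEMM that took `seconds`.
fn gemm_gflops(m: u64, n: u64, k: u64, seconds: f64) -> f64 {
    gemm_flops(m, n, k) / seconds * 1e-9
}

/// Fraction of a quoted peak, both given in GFLOPS.
fn fraction_of_peak(gflops: f64, peak_gflops: f64) -> f64 {
    gflops / peak_gflops
}

fn main() {
    // 2.6 TFLOPS FP32 figure quoted above for the 8 core M1 GPU.
    const PEAK_GFLOPS: f64 = 2600.0;

    // Hypothetical timing (not a measurement): a 1024^3 GEMM taking 2.4 ms.
    let g = gemm_gflops(1024, 1024, 1024, 2.4e-3);
    println!("{:.0} GFLOPS, {:.1}% of peak", g, 100.0 * fraction_of_peak(g, PEAK_GFLOPS));

    // The two figures quoted in the text, expressed as a share of peak.
    println!("Metal shader: {:.1}% of peak", 100.0 * fraction_of_peak(2000.0, PEAK_GFLOPS));
    println!("wgpu shader:  {:.1}% of peak", 100.0 * fraction_of_peak(900.0, PEAK_GFLOPS));
}
```

Timing should cover only the kernel itself, and averaging over several dispatches helps smooth out launch overhead.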

GEMV

GEMV is a different problem since it is entirely memory-bound.

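The M1 7 core GPU has a memory bandwidth of 66.7 GB/s. We compute bandwidth as $M = 10^{-9} \cdot (m \cdot n + m + n) \cdot \text{sizeof(scalar type)} / T$, where $M$ is in GB/s, $m \times n$ is the shape of the matrix, and $T$ is the runtime in seconds.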

For the problem size [1,384] @ [384, 51868] (Whisper logits GEMV), this gives a minimum possible runtime of 1198266.33 ns. The best kernel in this repository, gemv_2, hits ~1300000 ns.
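The lower bound comes from solving the bandwidth formula above for T. The sketch below does that arithmetic for the Whisper logits shape, assuming 4-byte (f32) scalars; the function names are illustrative only. It evaluates to roughly 1.2 ms, though the last digits depend on how the bandwidth figure is rounded, so it may not reproduce the quoted value exactly.

```rust
/// Bytes that must cross the memory bus for y = x @ A,
/// with x: [1, m], A: [m, n], y: [1, n].
fn gemv_bytes(m: u64, n: u64, scalar_size: u64) -> u64 {
    (m * n + m + n) * scalar_size
}

/// Lower bound on runtime in nanoseconds at a given bandwidth in GB/s.
/// bytes / (GB/s * 1e9 bytes per GB) is seconds, so bytes / (GB/s) is nanoseconds.
fn min_runtime_ns(bytes: u64, bandwidth_gb_s: f64) -> f64 {
    bytes as f64 / bandwidth_gb_s
}

/// Achieved bandwidth in GB/s for a kernel that took `runtime_ns`.
fn achieved_bandwidth_gb_s(bytes: u64, runtime_ns: f64) -> f64 {
    bytes as f64 / runtime_ns
}

fn main() {
    // Problem size and bandwidth quoted in the text; f32 is assumed to be 4 bytes.
    let bytes = gemv_bytes(384, 51868, 4);
    let bound_ns = min_runtime_ns(bytes, 66.7);
    println!("bytes moved:        {}", bytes);
    println!("minimum runtime:    {:.0} ns", bound_ns);

    // gemv_2 timing quoted in the text (~1300000 ns).
    let achieved = achieved_bandwidth_gb_s(bytes, 1_300_000.0);
    println!("gemv_2 bandwidth:   {:.1} GB/s", achieved);
    println!("share of peak:      {:.1}%", 100.0 * achieved / 66.7);
}
```

Running the same formula in reverse, as `achieved_bandwidth_gb_s` does, turns a measured runtime into an effective bandwidth that can be compared directly with the hardware figure.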

Because GEMV is memory-bound, lower precision is extremely important. Our HGEMV performs the same [1,384] @ [384, 51868] product in ~694500 ns, ~2x faster.

Our QGEMV performs the same [1,384] @ [384, 51868] product in ~342000 ns, ~2x faster again.
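To see why halving the scalar size roughly halves the runtime, the sketch below compares the bytes moved at f32 and f16 with the speedups implied by the timings quoted above. It assumes every operand is stored at the stated width (4 bytes for f32, 2 bytes for f16). The storage format of the quantized kernel is not described here, so only its measured speedup is shown. All names are illustrative.

```rust
/// Traffic model from the bandwidth formula: (m*n + m + n) * bytes per scalar.
fn gemv_bytes(m: u64, n: u64, scalar_size: u64) -> u64 {
    (m * n + m + n) * scalar_size
}

/// How many times faster `faster_ns` is than `baseline_ns`.
fn speedup(baseline_ns: f64, faster_ns: f64) -> f64 {
    baseline_ns / faster_ns
}

fn main() {
    let (m, n) = (384u64, 51868u64);

    // Approximate timings quoted in the text, in nanoseconds.
    let f32_ns = 1_300_000.0; // gemv_2
    let f16_ns = 694_500.0; // HGEMV
    let q_ns = 342_000.0; // QGEMV

    // Assumption: all operands are stored at the given width.
    let traffic_ratio = gemv_bytes(m, n, 4) as f64 / gemv_bytes(m, n, 2) as f64;

    println!("f32 -> f16 traffic ratio:          {:.2}", traffic_ratio);
    println!("f32 -> f16 measured speedup:       {:.2}", speedup(f32_ns, f16_ns));
    println!("f16 -> quantized measured speedup: {:.2}", speedup(f16_ns, q_ns));
}
```

A measured speedup close to the traffic ratio is what a memory-bound kernel should show, since the arithmetic itself is not the limiting factor.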

Read More

NVIDIA Performance Guide

TODO

[ ] - Flash Attention
[ ] - Fast transposed GEMV
