WebGL2-compute GEMM shaders based on
Cedric Nugteren's tutorial
(see his test results on Tesla below).
GEMMs are tested on MSI GeForce RTX 2070 Aero + Intel i3-8100, Windows 10 64-bit,
OpenGL backend, N = 4096.
For N = 1024, shader 6 is not optimized yet. As the tuner output below shows, the tuned OpenCL kernel reaches 3.94 TFLOPS at that size.
The D3D11 backend has higher overhead for large N than the OpenGL one.
* Found best result 20.25 ms: 6786.0 GFLOPS
* Best parameters: GEMMK=0 KREG=1 KWG=16 KWI=2 MDIMA=32 MDIMC=8 MWG=128 NDIMB=32 NDIMC=16 NWG=128 PRECISION=32 SA=1 SB=1 STRM=1 STRN=0 VWM=4 VWN=1

For N = 1024 the tuner found the same parameters and 3942.5 GFLOPS:

* Found best result 0.54 ms: 3942.5 GFLOPS
* Best parameters: GEMMK=0 KREG=1 KWG=16 KWI=2 MDIMA=32 MDIMC=8 MWG=128 NDIMB=32 NDIMC=16 NWG=128 PRECISION=32 SA=1 SB=1 STRM=1 STRN=0 VWM=4 VWN=1
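
The GFLOPS figures above can be cross-checked against the timings. The sketch below assumes the usual convention of counting 2·M·N·K floating-point operations for a GEMM (one multiply and one add per inner-product term); the function names `gemm_flops` and `gflops` are illustrative and are not part of the tuner or the shaders.

```python
def gemm_flops(m, n, k):
    """Operation count for C = A * B with A (m x k) and B (k x n).

    Assumes one multiply and one add per inner-product term.
    """
    return 2 * m * n * k


def gflops(n, elapsed_ms):
    """Throughput in GFLOPS for a square N x N x N GEMM timed in milliseconds."""
    seconds = elapsed_ms * 1e-3
    return gemm_flops(n, n, n) / seconds / 1e9


if __name__ == "__main__":
    # Timings taken from the tuner output above.
    for n, ms in [(4096, 20.25), (1024, 0.54)]:
        print(f"N = {n}: {ms} ms -> {gflops(n, ms):.1f} GFLOPS")
```

With the printed timings this gives values close to the reported 6786.0 and 3942.5 GFLOPS. Small differences are expected because the tuner prints the time rounded to two decimals.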