Tested on GeForce GT 710 (Windows 10, 64 bit)
(192 cores at 953 MHz, peak performance 366 GFLOPS).
Results from the SiSoftware Sandra OpenCL FP32 GPU test:

    GEMM      104 GFLOPS
    FFT       10.8 GFLOPS
    N-Body    143 GFLOPS
FMA (float multiply + add) is counted as 2 operations. Overheads in the D3D11 backend depend on SSBO size, therefore the OpenGL backend is used for the benchmarks.
              SSBO          TFjs          RGBA32F       RGBA16F
    N=1024    19.8 GFLOPS   9 GFLOPS      7.6 GFLOPS    10 GFLOPS
    N=2048    62 GFLOPS     22 GFLOPS     20.4 GFLOPS   ~34 GFLOPS
    N=4096    113 GFLOPS    22.8 GFLOPS   23.6 GFLOPS   ~45 GFLOPS
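For orientation, a minimal sketch of how such GFLOPS figures are obtained from a measured kernel time (the function name and the millisecond timing are assumptions; only the 2·N³ operation count comes from the FMA note above):

    // Sketch: convert a measured N x N GEMM time into GFLOPS.
    // Assumes `elapsedMs` is the kernel execution time in milliseconds.
    function gemmGflops(n: number, elapsedMs: number): number {
      const ops = 2 * n * n * n;      // N^3 fused multiply-adds, counted as 2 ops each
      const seconds = elapsedMs / 1000;
      return ops / seconds / 1e9;     // operations per second, expressed in GFLOPS
    }

    // Example: an N = 2048 multiply finishing in ~277 ms corresponds to ~62 GFLOPS.
    console.log(gemmGflops(2048, 277).toFixed(1));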
HGEMM with HALF_FLOAT textures is almost 2x faster than SGEMM with FLOAT ones on the GT 710, but HGEMM and SGEMM are likely similar on AMD and Intel GPUs. See also the GEMM tests on Google Pixel. TFjs is said to use HALF_FLOAT textures on mobile devices. Python + CUDA will be faster on desktop...
Unfortunately, HALF_FLOAT SSBOs are not supported by WebGL2-compute
(see https://bugs.chromium.org/p/angleproject/issues/detail?id=3160).
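Since half-float SSBOs are unavailable, half-precision data has to go through textures instead. A minimal TypeScript sketch (helper name, w, h and halfData are placeholders, not code from the benchmark) of allocating an RGBA16F / HALF_FLOAT texture in WebGL2:

    // Sketch: allocate an RGBA16F (HALF_FLOAT) texture in WebGL2.
    // `gl` is a WebGL2RenderingContext; `w`, `h` and `halfData` are placeholders.
    function createHalfFloatTexture(gl: WebGL2RenderingContext,
                                    w: number, h: number,
                                    halfData: Uint16Array | null): WebGLTexture {
      const tex = gl.createTexture()!;
      gl.bindTexture(gl.TEXTURE_2D, tex);
      // No filtering/mipmaps: the texture is used only as a data array.
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
      // Internal format RGBA16F, data supplied as 16-bit half floats.
      gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA16F, w, h, 0,
                    gl.RGBA, gl.HALF_FLOAT, halfData);
      return tex;
    }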
Surprisingly, benchmarks on a small AMD A6-5200 APU
and the GT 710 are similar to Cedric's (Shaders 6 and 7 are ~10 times faster than Shader 1).
For some reason GT 710 performance depends strongly on N.
To get "pure" performance, overheads are subtracted below.
IMHO even N = 2048 is too large for real ML applications. To accelerate smaller problems, batched routines are used. In the simplest case we just need to multiply rectangular matrices (see the example to the right).
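A minimal CPU reference sketch of such a batched rectangular multiply (row-major layout; the names and signature are illustrative, not the shader code used in the benchmarks):

    // Sketch: batched multiplication of rectangular row-major matrices,
    // C[b] = A[b] * B[b] for b = 0..batch-1 (CPU reference, not a GPU shader).
    function batchedGemm(batch: number, M: number, K: number, N: number,
                         A: Float32Array, B: Float32Array, C: Float32Array): void {
      for (let b = 0; b < batch; b++) {
        const a0 = b * M * K, b0 = b * K * N, c0 = b * M * N;
        for (let i = 0; i < M; i++)
          for (let j = 0; j < N; j++) {
            let s = 0;
            for (let k = 0; k < K; k++)
              s += A[a0 + i * K + k] * B[b0 + k * N + j];
            C[c0 + i * N + j] = s;
          }
      }
    }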
Tuning GEMM for Intel GPU
GEMM tests on RTX 2070