Tuning matrix multiplication (GEMM) for Intel GPUs

GEMMs are tested on Intel i3-8100, Windows 10 64-bit, D3D11 backend, N = 1024.

The WebGL2-compute GEMM shaders based on Cedric Nugteren's tutorial work well on low-end hardware but need more tuning for other GPUs (e.g. Intel and AMD). Newer optimized OpenCL kernels can be found, for example, in the CLBlast library ("Note that CLBlast evolved quite a bit from the tutorials" Cedric).
Intel has many highly optimised kernels (see below). One of them is used for the SLM_8x8_4x16 shader (I just replaced pointers with indexes and "unrolled" the mad() functions, which are not supported by GLSL 310).

"Semi-quantitative" plot. Unfortunately, execution times oscillate strongly (especially for the TFjs GEMM at 1280 < N < 2048), so the "highest reproducible numbers" are used :)

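One possible reading of "highest reproducible numbers" can be sketched as follows: out of many repeated runs, take the largest GFLOPS value that several runs agree on within a small tolerance, so that a single lucky outlier is ignored. The function name, the tolerance, the repeat count and the sample readings are all hypothetical; the page does not say how the numbers were actually picked.

```python
def highest_reproducible(samples, min_repeats=3, rel_tol=0.02):
    """Return the largest sample matched by at least `min_repeats` samples.

    A sample "matches" a candidate when it differs from it by at most
    `rel_tol` (relative to the candidate). The candidate counts as one of
    its own matches. Returns None if no value is reproduced often enough.
    """
    for candidate in sorted(samples, reverse=True):
        matches = sum(
            1 for s in samples if abs(s - candidate) <= rel_tol * candidate
        )
        if matches >= min_repeats:
            return candidate
    return None


# Hypothetical GFLOPS readings from repeated runs of one shader.
runs = [100.0, 250.0, 248.0, 251.0, 400.0, 120.0]
print(highest_reproducible(runs))  # 251.0 (the single 400.0 spike is ignored)
```

A plain maximum would report the 400.0 spike here, while a median would be pulled down by the slow runs; requiring a few agreeing runs is a middle ground.
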
Open questions:

  1. Why is HGEMM with RGBA16F textures so slow on the i3-8100? On the GT710 it is 2 times faster than RGBA32F. Qin Jiajia from Intel wrote: "we are planning to add fp16 support in d3d backend".

  2. The fastest OpenCL kernels use the subgroups extension. Jiajia wrote:
    "we are planning to try (subgroups) on d3d backend since Intel has this extension support."

  3. We can also try using textures to hold the data.

  4. Much more shader tuning is necessary for the TensorFlow.js WebGL2-compute backend.

Intel OpenCL GEMMs

Test results from SGEMM for Intel Processor Graphics by LINGYI K. (Intel), Robert I. (May 18, 2015)
Platforms (1):
    [0] Intel(R) OpenCL [Selected]
Devices (1; filtered by type gpu):
    [0] Intel(R) UHD Graphics 630 [Selected]
-----------------------------------------
matrix size: ( 1024x1024 ) * ( 1024x1024 )
Algorithm                Peak Kernel GFlops
gemm_naive                     19.144

L3_SLM_8x8_8x16               227.814
L3_SLM_8x8_4x16               250.566
L3_SLM_8x8_16x16              172.774

L3_SIMD_32x2_1x8              249.879
L3_SIMD_16x2_1x8              245.959
L3_SIMD_16x2_4x8              248.497
L3_SIMD_8x4_1x8               264.936
L3_SIMD_8x4_8x8               266.951
L3_SIMD_8x4_8x8_barrier       254.461
block_read_32x1_1x8           316.018
block_read_32x2_1x8           322.534
block_read_32x2_4x8           327.082
block_read_32x2_8x8           325.702
block_read_16x2_1x8           318.779
block_read_16x2_4x8           321.469
block_read_16x2_8x8           320.089
block_read_16x4_1x8           323.488
block_read_16x4_4x8           326.225
block_read_16x4_8x8           325.595

Optimizing Matrix Multiply for Intel Processor Graphics Architecture Gen9 by Jeffrey M. (Dec 23, 2016)

# device name: Intel(R) UHD Graphics 630
# device slm size: 65536
# device max work group size: 256
# Max compute units  (GPU): 23
# Max clock freqency (GPU): 1100.000000
# Peak float perf    (GPU): 404.800000
# build options:  -cl-mad-enable -cl-fast-relaxed-math
# matrix size: 512x512x512
# name                                 time(ms) GFLOPS  Efficiency
Unoptimized                             10.2    26.4     6.5 %
L3_SIMD_4x8x8                            0.9   293.2    72.4 %
MediaBlockRW_SIMD_2x32                   0.9   303.0    74.9 %
MediaBlockRead_SIMD_1x16_2_fp16          0.4   596.7   147.4 %
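
The "Peak float perf" and "Efficiency" columns in the log above can be reproduced with a little arithmetic. The sketch below assumes the usual Gen9 figure of 16 fp32 operations per clock per execution unit (2 FPUs x SIMD-4 x 2 for a multiply-add) and treats the 23 "compute units" as EUs; the variable and function names are made up for illustration.

```python
EXECUTION_UNITS = 23            # "Max compute units" from the log
CLOCK_GHZ = 1.1                 # "Max clock frequency" from the log, 1100 MHz
FP32_FLOPS_PER_CLOCK = 2 * 4 * 2  # assumed: 2 FPUs x SIMD-4 x (mul + add)

# Theoretical fp32 peak in GFLOPS.
peak_fp32 = EXECUTION_UNITS * CLOCK_GHZ * FP32_FLOPS_PER_CLOCK


def efficiency_percent(gflops, peak_gflops):
    """Measured throughput as a percentage of a theoretical peak."""
    return 100.0 * gflops / peak_gflops


print(round(peak_fp32, 1))                             # 404.8
print(round(efficiency_percent(293.2, peak_fp32), 1))  # 72.4 (L3_SIMD_4x8x8)
print(round(efficiency_percent(596.7, peak_fp32), 1))  # 147.4 (fp16 kernel)

# Assumption: Gen9 runs fp16 at twice the fp32 rate, so the fp16 kernel
# is better compared against a doubled peak.
peak_fp16 = 2 * peak_fp32
print(round(efficiency_percent(596.7, peak_fp16), 1))  # 73.7
```

This would explain the 147.4 % entry: the fp16 kernel is measured against the fp32 peak, and against a doubled fp16 peak it lands in the same 70-75 % range as the fp32 kernels.
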

CLBlast GEMMs

Test results from CLBlast (N = 1024). Precompiled GEMM tuners (for Linux and Windows) are available on the CLBlast releases page. See also the issue Faster sgemm for intel? #257.
* Found best result 6.70 ms: 320.6 GFLOPS
* Best parameters: GEMMK=1 KREG=4 KWG=1 KWI=1 MDIMA=16 MDIMC=16 MWG=64
  NDIMB=4 NDIMC=4 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4
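
For comparing the numbers on this page, it may help to convert between kernel time and GFLOPS. The sketch below uses the conventional count of 2*M*N*K floating-point operations per GEMM (one multiply and one add per inner-loop step); the helper names are hypothetical and not part of CLBlast.

```python
def gemm_gflops(m, n, k, time_ms):
    """GFLOPS achieved by an (m x k) * (k x n) GEMM that took `time_ms`."""
    flops = 2.0 * m * n * k            # one multiply + one add per step
    return flops / (time_ms * 1e-3) / 1e9


def gemm_time_ms(m, n, k, gflops):
    """Kernel time in milliseconds implied by a GFLOPS figure."""
    return 2.0 * m * n * k / (gflops * 1e9) * 1e3


# CLBlast tuner result above: 6.70 ms for N = 1024.
print(round(gemm_gflops(1024, 1024, 1024, 6.70), 1))     # 320.5

# Best Intel kernel above (block_read_32x2_4x8, 327.082 GFLOPS), as a time.
print(round(gemm_time_ms(1024, 1024, 1024, 327.082), 2))  # 6.57
```

The first result is about 320.5 rather than the reported 320.6, presumably because the tuner prints the time rounded to two decimals.
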

under construction

"Intel Driver and Support Assistant" and "Intel System Studio 2019" (intel-sw-tools-installation-bundle-win) were used.

https://github.com/intel/clGPU
https://github.com/intel/clGPU/tree/master/experimental/kernels
https://software.intel.com/en-us/iocl-opg-local-memory

https://www.phoronix.com/scan.php?page=news_item&px=Intel-Memory-Regions-Local-Dev

Subgroups
https://www.khronos.org/blog/vulkan-subgroup-tutorial
https://developer.nvidia.com/reading-between-threads-shader-intrinsics


SGEMM in WebGL2-compute     updated 14 July 2019