smish.dev
elements_per_block

Problem Statement

After a naive initial port of femto's kernels to CUDA, we had an implementation that looked something like:
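The kernel itself isn't reproduced here, so the snippet below is only a rough sketch of that one-element-per-block structure: one block per element, one thread per quadrature point, with the element's nodal values staged in shared memory. The kernel name, the interpolation-only body, and the array layouts are placeholders, not the actual femto code.

```cpp
// Sketch of the naive port: one thread block per element.
// u   : nodal values, laid out element by element      [num_elements][nodes_per_elem]
// B   : shape functions evaluated at quadrature points [num_qpts][nodes_per_elem]
// u_q : interpolated values at quadrature points       [num_elements][num_qpts]
template <int nodes_per_elem, int num_qpts>
__global__ void interpolate_naive(const double* u, const double* B, double* u_q) {
  __shared__ double u_elem[nodes_per_elem];

  int e = blockIdx.x;   // one element per block
  int q = threadIdx.x;  // one quadrature point per thread

  // cooperatively stage this element's nodal values in shared memory
  if (q < nodes_per_elem) {
    u_elem[q] = u[e * nodes_per_elem + q];
  }
  __syncthreads();

  // evaluate the interpolant at quadrature point q
  double value = 0.0;
  for (int i = 0; i < nodes_per_elem; i++) {
    value += B[q * nodes_per_elem + i] * u_elem[i];
  }
  u_q[e * num_qpts + q] = value;
}

// launched with one block per element, e.g. for linear triangles (3 nodes, 3 qpts):
//   interpolate_naive<3, 3><<<num_elements, 3>>>(u, B, u_q);
```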

This implementation already shows significant speedups for many combinations of element geometry and polynomial order, but it has an issue: it is written so that each thread block processes a single element. That may work okay when the element is a high-order tetrahedron or hexahedron, but for low-order triangles and quadrilaterals it amounts to launching a kernel with a block size as low as 3.

This may not be a big deal when running on a CPU, but it is problematic for GPUs. On a GPU, calculations are carried out by "warps" (groups of 32 parallel threads), which can consequently operate on up to 32 pieces of data at a time. However, if we launch a kernel with a block size of only 3, then most of the threads in a given warp will be inactive, which means we attain only a fraction of the performance the hardware is capable of.

So, how can we better utilize the GPU hardware for elements that only involve 3-4 values per element?

Well, fundamentally there is not enough work in a single 3-node triangle to keep a warp busy, but we can modify the kernel to process multiple elements at a time, rather than just one. That way, we can ensure that each thread block has enough work to keep the hardware busy.

Here's a sketch of how to do that:

  1. Modify the kernel launch parameters

  2. Inspect threadIdx.z in the kernel to figure out which element within the block to work on

  3. Load data and perform calculations on the element corresponding to threadIdx.z

In order to make step 3 work, we also have to increase the amount of shared memory we allocate, to make sure each element has space for its own intermediates. Some of the data (like shape function evaluations) are common to all of the elements in the thread block, so those don't need to be duplicated. Here's some code for the updated body of the kernel, taking into account processing multiple elements per block:
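The actual femto kernel isn't shown here either; the sketch below just extends the earlier placeholder kernel along the lines described above, with threadIdx.z selecting the element within the block and the element-local shared memory gaining one extra dimension. The shape function table is common to every element in the block, so it is not duplicated.

```cpp
// Sketch of the multi-element-per-block version of the placeholder kernel above.
template <int nodes_per_elem, int num_qpts, int elems_per_block>
__global__ void interpolate_batched(const double* u, const double* B, double* u_q,
                                    int num_elements) {
  // each element in the block gets its own slice of shared memory
  __shared__ double u_elem[elems_per_block][nodes_per_elem];

  int local_elem = threadIdx.z;                        // which element within this block
  int e = blockIdx.x * elems_per_block + local_elem;   // global element id
  int q = threadIdx.x;                                 // quadrature point

  if (e < num_elements && q < nodes_per_elem) {
    u_elem[local_elem][q] = u[e * nodes_per_elem + q];
  }
  __syncthreads();

  if (e < num_elements) {
    double value = 0.0;
    for (int i = 0; i < nodes_per_elem; i++) {
      value += B[q * nodes_per_elem + i] * u_elem[local_elem][i];
    }
    u_q[e * num_qpts + q] = value;
  }
}

// launch parameters (step 1): threadIdx.x covers quadrature points,
// threadIdx.z covers the elements handled by the block
//   constexpr int elems_per_block = 16;  // tuning parameter, see below
//   int num_blocks = (num_elements + elems_per_block - 1) / elems_per_block;
//   dim3 block(num_qpts, 1, elems_per_block);
//   interpolate_batched<nodes_per_elem, num_qpts, elems_per_block>
//       <<<num_blocks, block>>>(u, B, u_q, num_elements);
```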

This modification doesn't require significant changes to the kernel. The main difference is that the shared memory arrays gain one extra dimension, which each thread indexes by its local element id.

Great, now the kernel definition is able to handle multiple elements per block, but how many elements should each thread block process?

Intuitively, we should pick enough elements per block that the warps in that thread block are utilized efficiently. This is guaranteed when the block size nodes_per_elem * elems_per_block is a multiple of 32, but that isn't a necessary condition. Also, recall that processing more elements per block requires more shared memory. So, after a point, increasing the number of elements per block ends up using enough shared memory to hurt performance through decreased occupancy.
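To make the utilization criterion concrete, here is a small host-side helper (illustrative only, not part of femto) that estimates the fraction of active threads per warp for a candidate elems_per_block, assuming one thread per quadrature point and as many quadrature points as nodes:

```cpp
#include <cstdio>

// fraction of active threads per warp for a block of
// nodes_per_elem * elems_per_block threads (warps hold 32 threads)
double warp_utilization(int nodes_per_elem, int elems_per_block) {
  int block_size = nodes_per_elem * elems_per_block;
  int warps      = (block_size + 31) / 32;  // warps per block, rounded up
  return double(block_size) / double(warps * 32);
}

int main() {
  // e.g. linear triangles: 3 nodes per element
  for (int n = 1; n <= 32; n++) {
    std::printf("elems/block = %2d -> %.0f%% of threads active\n",
                n, 100.0 * warp_utilization(3, n));
  }
}
```

For linear triangles, for example, 10 elements per block gives a 30-thread block with about 94% of one warp active, while 11 elements spills into a second, nearly empty warp (33/64, about 52% active); that is consistent with the dip at 11 elements/block in the triangle table below.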

From those criteria (pick the smallest element count that maximizes the expected number of active threads per warp), we could guess some parameter values for each kind of element, but instead I just set up a performance benchmark and exhaustively tried out different numbers of elements per block (up to a maximum of 32 for the 2D elements and up to 16 for the 3D elements).
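The benchmark harness isn't shown, but the measurement presumably boils down to something like the following sketch, which times an arbitrary kernel launch with CUDA events and would be run once per candidate value of elems_per_block (the launch callable is a stand-in for whatever configures and launches the interpolation kernel):

```cpp
#include <cuda_runtime.h>
#include <functional>

// average time per launch, in microseconds
float average_time_us(const std::function<void()>& launch, int num_repeats = 100) {
  launch();  // warm-up
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  for (int i = 0; i < num_repeats; i++) {
    launch();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 1000.0f * ms / num_repeats;
}
```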

Tuning Data

The following sections tabulate data from a performance benchmark on an unstructured mesh of ~1,000,000 elements of the specified geometry.

The tests were performed with as many quadrature points as nodes in the respective element, and the listed values are the timings (in microseconds) as measured on an NVIDIA GV100 GPU.

The labels "scalar" and "vector" indicate which timings are from interpolation of scalar-valued and vector-valued (with 2 components for tri/quad and 3 components for tet/hex) fields, respectively.

Triangles

elems/block | p=1 scalar | p=1 vector | p=2 scalar | p=2 vector | p=3 scalar | p=3 vector
----------- | ---------- | ---------- | ---------- | ---------- | ---------- | ----------
1  | 1827.86 | 1650.38 | 1657.97 | 1842.93 | 2519.34 | 3418.98
2  | 928.676 | 841.768 | 846.261 | 1059.29 | 1197.07 | 1854.88
3  | 630.497 | 576.831 | 580.467 | 969.75  | 903.116 | 1626.01
4  | 481.159 | 500.158 | 461.676 | 913.121 | 928.326 | 1632.54
5  | 392.553 | 503.027 | 412.534 | 911.477 | 878.766 | 1600.71
6  | 333.971 | 477.727 | 468.3   | 889.149 | 819.912 | 1547.96
7  | 326.042 | 480.17  | 447.254 | 899.226 | 892.56  | 1528.68
8  | 262.445 | 470.527 | 415.596 | 878.654 | 841.106 | 1476.84
9  | 254.571 | 478.638 | 411.888 | 895.126 | 834.023 | 1511.27
10 | 243.835 | 473.16  | 393.825 | 882.801 | 861.805 | 1489.36
11 | 287.049 | 482.172 | 438.747 | 899.418 | 836.573 | 1504.07
12 | 270.981 | 465.057 | 424.349 | 883.528 | 811.112 | 1485.53
13 | 265.044 | 475.665 | 411.909 | 893.59  | 851.768 | 1502.35
14 | 256.67  | 469.396 | 404.22  | 880.548 | 837.254 | 1485.52
15 | 254.173 | 470.795 | 397.545 | 891.093 | 818.696 | 1495.94
16 | 247.275 | 459.956 | 384.163 | 888.107 | 784.377 | 1480.86
17 | 251.62  | 470.28  | 416.948 | 892.995 | 871.313 | 1520.84
18 | 244.949 | 467.163 | 404.171 | 878.305 | 839.328 | 1491.74
19 | 256.185 | 486.186 | 402.933 | 889.179 | 825.923 | 1505.27
20 | 252.626 | 478.833 | 392.897 | 881.011 | 852.22  | 1521.06
21 | 252.091 | 486.237 | 389.807 | 885.707 | 852.066 | 1530.15
22 | 249.021 | 468.597 | 406.985 | 874.233 | 836.657 | 1515.23
23 | 249.692 | 471.702 | 406.936 | 883.057 | 834.185 | 1561.02
24 | 245.635 | 461.386 | 397.223 | 862.413 | 813.613 | 1534.48
25 | 247.615 | 473.833 | 398.911 | 873.469 | 813.435 | 1565.22
26 | 245.002 | 467.378 | 391.892 | 862.249 | 863.425 | 1559.67
27 | 246.895 | 469.076 | 423.608 | 883.015 | 859.184 | 1569.57
28 | 242.812 | 459.521 | 415.038 | 870.332 | 837.382 | 1553.05
29 | 243.912 | 467.67  | 412.989 | 878.295 | 832.407 | 1575.21
30 | 242.796 | 465.201 | 408.271 | 867.699 | 825.056 | 1565.11
31 | 244.91  | 467.52  | 407.235 | 876.127 | 818.293 | 1577.99
32 | 239.306 | 464.493 | 398.204 | 878.589 | 795.347 | 1567.85


Quadrilaterals

elems/block | p=1 scalar | p=1 vector | p=2 scalar | p=2 vector | p=3 scalar | p=3 vector
----------- | ---------- | ---------- | ---------- | ---------- | ---------- | ----------
1  | 1656.78 | 2218.67 | 1652.46 | 2321.88 | 1945.38 | 3257.68
2  | 845.11  | 1143.16 | 874.964 | 1447.73 | 1155.96 | 2370.33
3  | 597.446 | 815.569 | 655.854 | 1301.68 | 1319.27 | 2414.98
4  | 466.877 | 677.232 | 804.536 | 1349.37 | 1098.34 | 2273.14
5  | 407.264 | 619.296 | 693.071 | 1305.25 | 1211.96 | 2303.77
6  | 357.307 | 590.028 | 625.527 | 1278.11 | 1084.49 | 2247.2
7  | 335.737 | 579.683 | 574.287 | 1269.97 | 1181.23 | 2290.01
8  | 320.743 | 583.936 | 652.137 | 1270.68 | 1089.45 | 2184.34
9  | 424.108 | 630.525 | 625.074 | 1284.37 | 1192.98 | 2322.53
10 | 399.961 | 606.691 | 584.138 | 1265.55 | 1120.31 | 2286.34
11 | 375.348 | 593.815 | 645.648 | 1294.83 | 1180.93 | 2330.26
12 | 342.942 | 585.122 | 611.476 | 1254.76 | 1123.15 | 2307.1
13 | 337.111 | 581.674 | 595.632 | 1274.77 | 1209.03 | 2393.54
14 | 331.777 | 582.223 | 568.566 | 1256.25 | 1153.51 | 2370.01
15 | 337.931 | 576.605 | 640.121 | 1296.06 | 1285.81 | 2524.15
16 | 331.768 | 583.422 | 610.019 | 1250.67 | 1236.86 | 2487.09
17 | 359.151 | 599.496 | 601.05  | 1272.71 | 1233.75 | 2486.11
18 | 343.784 | 594.612 | 642.87  | 1290.28 | 1192.57 | 2470.38
19 | 341.156 | 586.575 | 626.814 | 1285.81 | 1359.22 | 2814.6
20 | 334.723 | 582.002 | 606.753 | 1256.02 | 1318.22 | 2787.66
21 | 343.475 | 581.999 | 596.031 | 1268.99 | 1310.35 | 2855.63
22 | 338.722 | 582.531 | 638.675 | 1306.55 | 1273.36 | 2831.69
23 | 341.062 | 578.47  | 628.988 | 1303.96 | 1268.08 | 2830.22
24 | 336.373 | 582.976 | 608.801 | 1271.78 | 1235.44 | 2785.61
25 | 344.9   | 591.105 | 686.101 | 1366.73 | 1618.81 | 3021.18
26 | 333.62  | 589.11  | 668.891 | 1344.01 | 1577.15 | 2978.97
27 | 338.581 | 585.48  | 661.15  | 1346.66 | 1558.62 | 2931.92
28 | 336.218 | 581.596 | 641.111 | 1311.83 | 1517.17 | 2900.92
29 | 340.686 | 581.868 | 651.779 | 1353.14 | 1603.42 | 2930.69
30 | 336.04  | 580.796 | 636.678 | 1337.77 | 1568.33 | 2918.19
31 | 340.078 | 578.632 | 633.525 | 1339.32 | 1538.87 | 2908.85
32 | 333.727 | 561.618 | 607.717 | 1315.42 | 1506.12 | 2885.17


Tetrahedra

elems/block | p=1 scalar | p=1 vector | p=2 scalar | p=2 vector | p=3 scalar | p=3 vector
----------- | ---------- | ---------- | ---------- | ---------- | ---------- | ----------
1  | 1918.18 | 2579.59 | 3598.5  | 6195.63 | 18699.4 | 28302.8
2  | 956.533 | 1884.27 | 1906.29 | 4638.2  | 8870.11 | 15803.4
3  | 740.434 | 1734.03 | 1530.37 | 4319.47 | 6070.27 | 12413.5
4  | 612.491 | 1644.02 | 1416.06 | 4164.69 | 5095.22 | 11988.8
5  | 648.159 | 1680.53 | 1297.22 | 4165.55 | 4902.87 | 11987.2
6  | 593.74  | 1728.37 | 1238.6  | 4176.59 | 4556.27 | 11918.2
7  | 550.643 | 1726.98 | 1289.43 | 4142.84 | 4430.8  | 11620
8  | 543.714 | 1731.24 | 1190.69 | 4145.25 | 4246.14 | 11527.8
9  | 558.379 | 1664.43 | 1192.71 | 4417.58 | 4361.39 | 11920.3
10 | 560.016 | 1717.08 | 1222.56 | 4164.6  | 4251.01 | 11878.3
11 | 551.588 | 1717.03 | 1201.95 | 4299.83 | 4214.91 | 11852.8
12 | 528.626 | 1701.24 | 1166.93 | 4529.63 | 4291.91 | 12215.7
13 | 534.903 | 1723.37 | 1226.06 | 4292.24 | 4278.57 | 12079.4
14 | 532.193 | 1740.74 | 1200.94 | 4435.75 | 4224.92 | 12078.4
15 | 541.353 | 1778.26 | 1171.28 | 4633.82 | 4221.01 | 12217.9
16 | 532.543 | 1817.12 | 1139.31 | 4785.83 | 4177.31 | 12249.3


Hexahedra

elems/block | p=1 scalar | p=1 vector | p=2 scalar | p=2 vector | p=3 scalar | p=3 vector
----------- | ---------- | ---------- | ---------- | ---------- | ---------- | ----------
1  | 2973.04 | 6554.48 | 3589.08 | 11388.5 | 13286.9 | 36235.7
2  | 1728.23 | 4022.05 | 3522.07 | 11097.3 | 12991.4 | 34682.7
3  | 1346.82 | 3423.51 | 3536.49 | 10902.8 | 14004.4 | 35837.6
4  | 1171.54 | 3179.03 | 3477.84 | 10801.5 | 14363   | 37169.5
5  | 1466.2  | 3598.96 | 3865.85 | 11473.8 | 17600.5 | 43768.8
6  | 1293.77 | 3314.87 | 4127.49 | 11899.6 | 16425.1 | 42558.9
7  | 1262.31 | 3252.9  | 3778.38 | 11359.4 | 15396.6 | 42125.9
8  | 1173.44 | 3135.41 | 3504.4  | 11232.2 | 21478   | 56800.4
9  | 1323.42 | 3333.4  | 3993.72 | 11901.2 | 20393.1 | 54862.6
10 | 1263.75 | 3202.47 | 3805.68 | 11709.6 | 19505.7 | 53456.8
11 | 1255.96 | 3205.36 | 4870.32 | 13488.4 | 18882.5 | 52406.2
12 | 1187.45 | 3129.54 | 4709.11 | 13191.2 | 18342.1 | 51557.9
13 | 1275.19 | 3269.11 | 4538.37 | 12974.7 | 18026.9 | 50993.4
14 | 1236.69 | 3185.27 | 4426.16 | 13116.7 | *       | *
15 | 1224.62 | 3221.43 | 4328.54 | 12747.9 | *       | *
16 | 1182.91 | 3165.8  | 4164.77 | 12591.6 | *       | *

Note: the entries marked with * require more shared memory than the GV100 has available.


Summary

The naive CUDA port of the CPU code was a good start, but that implementation didn't perform very well for the low-order elements (which are the most commonly used case). Luckily, it didn't take much effort to modify the kernel to accept a parameter that controls how many elements to process in a single thread block. However, introducing a new parameter meant that we also needed to provide appropriate values for that parameter in each situation. By exhaustively sweeping through different numbers of elements processed per thread block, we were able to select values that significantly improved performance (7.5x for linear triangles, 5x for linear quads, ...).