smish.dev
initial_cuda_port

Getting femto running on GPUs

In my spare time I've been developing a finite element library called "femto". It's always been part of my plan to make femto run on GPUs, but I believe that implementing GPU support too early on in a project can be detrimental, as it makes it more costly to consider fundamental changes to the library's design. However, after iterating on femto for a few months, I feel like it's at a point where I'm relatively happy with the core FEM abstractions and interfaces, which makes it a natural time to start considering GPU support.

This document goes over the main FEM kernels in femto and what was involved in making them run on GPUs. I've also done several experiments to optimize performance of those GPU kernels, but I will explain those considerations in separate documents.

Core FEM kernels

The finite element method is a remarkably simple idea. The fundamental pieces common to all finite element analyses are:

  1. Meshes to discretize the regions of interest

  2. Solution fields defined over those meshes (or some parts of them)

  3. Ways to express (and linearize) residual equations for the relevant physics

Although mesh generation is an incredibly challenging problem, the rest of steps 1 and 2 are relatively easy: the mesh object just needs to track the various kinds of connectivity between elements and the solution field's primary responsibility is to enumerate degrees of freedom over a mesh.

The part that I find most challenging from a software design perspective is step 3, since the finite element method is used for many different applications (e.g. heat transfer, solid mechanics, fluid flow, electromagnetics, ... ).

femto's approach to step 3 is to provide a way to perform the operations that are common to every residual calculation (regardless of physics): interpolation and integration. Combining these interpolation and integration kernels in different ways lets us succinctly express many different kinds of physics.

So with that in mind, this document will cover the CUDA-fication of an interpolation kernel in femto.

Interpolation

Interpolation kernels map solution fields to the {value, gradient, curl, or divergence} of that solution field at a collection of quadrature points. We'll consider the gradient variant of interpolation in this document, but the other variants are practically identical. The serial CPU implementation of this kernel looks something like this
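In simplified form, it is a loop over elements that gathers each element's dof values and then evaluates the gradient of the interpolated field at every quadrature point. (The names below, like elements, gradients, and the FiniteElement::gradient signature, are illustrative stand-ins rather than femto's exact code.)

    // serial gradient-interpolation kernel (simplified sketch)
    for (int e = 0; e < num_elements; e++) {
      // gather this element's dof values from the global solution vector
      double element_values[max_dofs_per_elem];
      for (int i = 0; i < dofs_per_elem; i++) {
        element_values[i] = u[elements(e, i)];
      }

      // evaluate the gradient of the interpolated field at each quadrature point
      for (int q = 0; q < num_qpts; q++) {
        gradients(e, q) = FiniteElement::gradient(element_values, q);
      }
    }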

For-Loop → CUDA Kernel

The first step in getting a calculation to run on a GPU is replacing a parallel for-loop with a kernel launch. The element for-loop from the serial version above is the natural one to replace with a CUDA kernel.

The kernel launch parameters can impact performance significantly, but we'll consider that in a separate document. For now, let's just assume that each thread block will be responsible for performing the calculations for a single element, as in the sketch below.
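Under that assumption, a minimal sketch of the kernel and its launch might look like this (the kernel name, block size, and argument lists are placeholders, not femto's actual interfaces):

    __global__ void interpolate_gradient_kernel(/* views of u, elements, gradients, ... */) {
      int e = blockIdx.x;   // each thread block handles one element
      // ... body discussed in the sections below ...
    }

    // launch: one block per element, with a fixed number of threads per block
    int threads_per_block = 128;   // choosing this well is the subject of a separate document
    interpolate_gradient_kernel<<<num_elements, threads_per_block>>>(/* ... */);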

__device__ function specifiers

Of course, there is more to be done than simply copying the body of the original for-loop into this new kernel. To start, any function called inside a CUDA kernel must be marked with a __device__ specifier. The loop body above involves three function calls, so we first need to go to those implementations and mark them appropriately. In some cases this is trivial, but in other cases it is more involved, because __device__ functions can themselves only call other __device__ functions. So, in order to make a function like FiniteElement::gradient work in a CUDA kernel, we have to recurse into its implementation and make everything it calls work in a kernel as well.

In this particular case, these three functions only really involve array accesses and simple calculations, so we only need to sprinkle a few __device__ specifiers here and there.
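For example, a member function like FiniteElement::gradient just needs the extra specifiers (the signature and the vec3 return type shown here are simplified stand-ins for the real ones):

    // before: callable only from host (CPU) code
    vec3 FiniteElement::gradient(const double * element_values, int q) const { /* ... */ }

    // after: callable from host code and from inside CUDA kernels
    __host__ __device__
    vec3 FiniteElement::gradient(const double * element_values, int q) const { /* ... */ }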

Memory Spaces and Containers

Another thing to consider is that the GPU has a separate memory space from the CPU. Many of the default containers (std::vector, std::map, std::string, etc) won't be usable inside a CUDA kernel and many of the CPU algorithms (std::sort, std::unique, etc) won't work with data that lives on the GPU. In addition to marking my library's custom containers with __device__ on the relevant member functions, we also need to use cudaMalloc/cudaMallocManaged/cudaFree for memory allocations instead of new/delete.

Most of femto uses a custom container class that already has an abstraction around the way memory is allocated, so we only need to modify that allocation routine slightly to make the container classes allocate memory in a GPU-compatible way.
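As a sketch of that change (the function names and signatures here are illustrative, and error checking and non-trivial constructors are ignored), a new/delete-based allocation routine becomes one based on cudaMallocManaged/cudaFree, so the same buffers are reachable from both the CPU and the GPU:

    #include <cuda_runtime.h>
    #include <cstddef>

    // previously: return new T[n];
    template <typename T>
    T * allocate(std::size_t n) {
      T * ptr = nullptr;
      cudaMallocManaged(&ptr, n * sizeof(T));   // accessible from both host and device
      return ptr;
    }

    // previously: delete[] ptr;
    template <typename T>
    void deallocate(T * ptr) {
      cudaFree(ptr);
    }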


"But wait, don't you know that memory allocated with cudaMallocManaged is slow? Don't you want the best performance?" - a coworker of mine

I don't know where this myth came from, but it's not true. It's easy to run experiments demonstrating that buffers allocated with cudaMallocManaged achieve the same performance as buffers allocated with cudaMalloc, but some people seem to believe that because cudaMallocManaged offers conveniences that cudaMalloc doesn't, it must be slower.


Another important consideration when it comes to containers is how to pass data into a CUDA kernel. On the CPU side, my nd::array container class has value semantics. That is, an expression like auto B = A; creates a copy of A's data. This is a little problematic when it comes to CUDA kernels, for two reasons:

  1. passing nd::arrays into CUDA kernels by-value would be expensive (since it makes a copy)

  2. using nd::arrays passed into a CUDA kernel by-reference will result in dereferencing host data on the device (which isn't allowed)

A common workaround is to have two related container classes. The "view" type encapsulates most of the behavior of the container (array access, striding, etc) without actually owning its data, while the "array" type is itself a "view", but includes ownership. This way, the "view" type has shallow-copy semantics and can be passed by-value to CUDA kernels as an input or output (Note: __global__ functions always return void, so "returning through a pointer" is the canonical way to get data out of a CUDA kernel).
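A rough sketch of that two-class pattern, assuming managed-memory allocation and eliding femto's multidimensional indexing and deep-copy machinery:

    #include <cuda_runtime.h>
    #include <cstddef>

    // non-owning "view": a pointer + size, shallow-copied, safe to pass to kernels by value
    template <typename T>
    struct view {
      T * data;
      std::size_t size;
      __host__ __device__ T & operator[](std::size_t i) const { return data[i]; }
    };

    // owning "array": is-a view, but allocates and frees its own storage
    // (deep-copy semantics are omitted from this sketch)
    template <typename T>
    struct array : view<T> {
      explicit array(std::size_t n) {
        cudaMallocManaged(&this->data, n * sizeof(T));
        this->size = n;
      }
      ~array() { cudaFree(this->data); }
      array(const array &) = delete;
      array & operator=(const array &) = delete;
    };

    // kernels take views by value; writing through a view is how results get out,
    // since __global__ functions cannot return anything
    __global__ void scale(view<double> x, double alpha) {
      std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < x.size) { x[i] *= alpha; }
    }

Since the array "is-a" view, it can be passed directly to a kernel expecting a view, and only the view part (a pointer and a size) gets copied to the device.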

Shared Memory and Synchronization

Now that the containers and functions are ready to go, we can try launching our kernel. However, CPU code that is naively ported to CUDA rarely performs well (except for trivial calculations like SAXPY). In particular, since we're assigning a group of threads to process each element in the mesh, most of the data used by those threads (element values, node ids, element connectivity, shape function evaluations) is the same for each thread in the block. CUDA lets us store quantities like this in __shared__ memory, which is beneficial for two reasons:

  1. data shared by the whole block only needs to be loaded from global memory once, rather than once per thread

  2. shared memory has much lower latency and higher bandwidth than global memory

So, to take advantage of shared memory in our new kernel, we first do some arithmetic to figure out how much shared memory we'll need for the quantities used in the kernel. We pass this shmem_size (in bytes) as the third argument in the triple-chevron launch statement, and then put an extern __shared__ declaration in our kernel, which we can slice up into pieces and use in the kernel body, as sketched below.
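Putting the pieces together: the kernel declares an extern __shared__ buffer and slices it, while the host computes the byte count and passes it inside the triple chevrons. (The staged quantities, their sizes, and the vec3 type here are illustrative rather than femto's exact layout.)

    // inside the kernel: declare the dynamic shared memory buffer and slice it into pieces
    __global__ void interpolate_gradient_kernel(/* ... */) {
      extern __shared__ double shmem[];
      double * element_values  = shmem;                                           // dofs_per_elem entries
      vec3   * shape_gradients = reinterpret_cast<vec3 *>(shmem + dofs_per_elem); // dofs_per_elem * num_qpts entries
      // ... kernel body (shown in full below) ...
    }

    // on the host: compute the number of bytes needed and pass it as the
    // third launch parameter inside the triple chevrons
    std::size_t shmem_size = sizeof(double) * dofs_per_elem
                           + sizeof(vec3)   * dofs_per_elem * num_qpts;
    interpolate_gradient_kernel<<<num_elements, threads_per_block, shmem_size>>>(/* ... */);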

The other thing to be mindful of when using __shared__ memory is that, since every thread in the block can access it, it's possible to create race conditions or to read and write values out of turn. In general, threads within a block may be executing at different points in the kernel, so if you write code like the snippet below, where each thread fills in part of a shared array data and then reads entries written by other threads, it is probable that some threads reach the reading statement before other threads have finished their writes, meaning the values in data may not be completely populated in time. This can be addressed with an appropriate synchronization mechanism: inserting a __syncthreads() between the writes and the reads ensures that every thread has finished initializing its entries of data before any thread tries to access them, protecting against the read-before-write problem.
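For instance, in a sketch like the following (BLOCK_SIZE, some_per_thread_value, and output are placeholders), removing the __syncthreads() would let a thread read its neighbor's entry before that neighbor has written it:

    __shared__ double data[BLOCK_SIZE];

    // each thread initializes one entry of the shared array ...
    data[threadIdx.x] = some_per_thread_value(threadIdx.x);

    // without this barrier, the statement below may read entries that
    // other threads have not written yet (a read-before-write hazard)
    __syncthreads();

    // ... and then every thread reads an entry written by a different thread
    output[threadIdx.x] = 0.5 * (data[threadIdx.x] + data[(threadIdx.x + 1) % BLOCK_SIZE]);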

These kinds of bugs can also be difficult to catch, as they're not deterministic! It's possible to write kernels that contain bugs due to improper shared memory access, but that still pass their tests most of the time.

For help identifying this sort of bug, regularly run your applications through compute-sanitizer (its racecheck tool detects exactly this kind of shared memory hazard).

Final Kernel Implementation

After applying all of the ideas described above, we can put together the body of the new interpolation kernel.
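Here is a condensed sketch of that kernel. The container types (view1d/view2d), the vec3 struct, and the dNdx table of shape function gradients are simplified stand-ins for femto's actual types, and the mapping from reference to physical coordinates is omitted:

    __global__ void interpolate_gradient_kernel(view2d<vec3> gradients,        // output: (element, qpt)
                                                view1d<const double> u,        // global solution values
                                                view2d<const int> elements,    // element-to-dof connectivity
                                                view2d<const vec3> dNdx,       // shape fn gradients: (dof, qpt)
                                                int dofs_per_elem,
                                                int num_qpts) {
      int e = blockIdx.x;   // one thread block per element

      // slice the dynamic shared memory into the pieces this block needs
      extern __shared__ double shmem[];
      double * element_values  = shmem;                                           // dofs_per_elem entries
      vec3   * shape_gradients = reinterpret_cast<vec3 *>(shmem + dofs_per_elem); // dofs_per_elem * num_qpts entries

      // stage the element's dof values and the shape function gradients in shared memory
      for (int i = threadIdx.x; i < dofs_per_elem; i += blockDim.x) {
        element_values[i] = u[elements(e, i)];
      }
      for (int k = threadIdx.x; k < dofs_per_elem * num_qpts; k += blockDim.x) {
        shape_gradients[k] = dNdx(k / num_qpts, k % num_qpts);
      }

      // make sure the staged data is fully written before any thread reads it
      __syncthreads();

      // each thread evaluates the gradient at a subset of the quadrature points
      for (int q = threadIdx.x; q < num_qpts; q += blockDim.x) {
        vec3 grad_u{0.0, 0.0, 0.0};
        for (int i = 0; i < dofs_per_elem; i++) {
          vec3 dN = shape_gradients[i * num_qpts + q];
          grad_u.x += element_values[i] * dN.x;
          grad_u.y += element_values[i] * dN.y;
          grad_u.z += element_values[i] * dN.z;
        }
        gradients(e, q) = grad_u;   // write the result out through the output view
      }
    }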

Summary and Further Optimizations

With a little bit of work, we were able to get our CPU calculation running reasonably well on the GPU. Initial timings indicate that performance for high-order elements is already close to the manually-tuned FEM kernels in mfem, but the low-order elements were underperforming at first (for reasons explained in the documents below). For more information on the subsequent kernel optimizations, please refer to the following documents:

Processing Multiple Elements Per Block

Deterministic Assembly


After making those optimizations, we have the following timings (in microseconds) taken from unstructured meshes with 1,000,000 elements each:

Triangles

  interpolation              p = 1       p = 2       p = 3
  CPU (serial)               11200       24600       46900
  CPU (threaded)              5220        6600       10600
  GPU                          277         437         809

  integration                p = 1       p = 2       p = 3
  CPU (serial)                9600       23900       65500
  CPU (threaded, atomics)    13500       16700       24800
  CPU (threaded, gather)      3720        8640       15900
  GPU (atomics)                248         401         671
  GPU (gather)*           {150, 211}  {267, 469}  {473, 800}

Quadrilaterals

  interpolation              p = 1       p = 2       p = 3
  CPU (serial)               23400       56500      105000
  CPU (threaded)              8500        8740       14900
  GPU                          360         615        1140

  integration                p = 1       p = 2       p = 3
  CPU (serial)               25500       61500      111000
  CPU (threaded, atomics)    10300       22900       39900
  CPU (threaded, gather)      5320       12100       24800
  GPU (atomics)                304         582        1060
  GPU (gather)*           {191, 126}  {412, 644}  {710, 1130}

Tetrahedra

  interpolation              p = 1       p = 2       p = 3
  CPU (serial)               18300       75500      271000
  CPU (threaded)              5740       12100       24300
  GPU                          606        1270        4330

  integration                p = 1       p = 2       p = 3
  CPU (serial)               17800       73700      296000
  CPU (threaded, atomics)    12200       28800       63900
  CPU (threaded, gather)      5270       21600       56200
  GPU (atomics)                510        1170        2760
  GPU (gather)*           {246, 722}  {588, 1850}  {1930, 2770}

Hexahedra

  interpolation              p = 1       p = 2       p = 3
  CPU (serial)               88600      327000      784000
  CPU (threaded)             13900       29300       64500
  GPU                         1200        3490       13000

  integration                p = 1       p = 2       p = 3
  CPU (serial)              105000      389000      930000
  CPU (threaded, atomics)    26700       76700      169000
  CPU (threaded, gather)     16500       60200      135000
  GPU (atomics)                901        3240       13100
  GPU (gather)*           {593, 431}  {2230, 2330}  {8590, 4450}

* the GPU (gather) rows have two entries per column. The first is the time associated with the kernel that calculates element forces, the second is the time for the gather-based assembly kernel. The GPU (atomics) kernel does both integration + assembly in a single kernel (but is not deterministic).


So, in one weekend, we were able to port an entire FEM library to use CUDA (except for sparse matrix assembly). After a few small tweaks, these kernels already perform 10-20x faster than their threaded CPU counterparts, and rival / surpass the hand-optimized kernel implementations in other mature FEM libraries!