In my spare time I've been developing a finite element library called "femto". It's always been part of my plan to make femto run on GPUs, but I believe that implementing GPU support too early on in a project can be detrimental, as it makes it more costly to consider fundamental changes to the library's design. However, after iterating on femto for a few months, I feel like it's at a point where I'm relatively happy with the core FEM abstractions and interfaces, which makes it a natural time to start considering GPU support.
This document goes over the main FEM kernels in femto and what was involved in making them run on GPUs. I've also done several experiments to optimize performance of those GPU kernels, but I will explain those considerations in separate documents.
The finite element method is a remarkably simple idea. The fundamental pieces common to all finite element analyses are:
1. Meshes to discretize the regions of interest
2. Solution fields defined over those meshes (or some parts of them)
3. Ways to express (and linearize) residual equations for the relevant physics
Although mesh generation is an incredibly challenging problem, the rest of steps 1 and 2 are relatively easy: the mesh object just needs to track the various kinds of connectivity between elements, and the solution field's primary responsibility is to enumerate degrees of freedom over a mesh.
The part that I find most challenging from a software design perspective is step 3, since the finite element method is used for many different applications (e.g. heat transfer, solid mechanics, fluid flow, electromagnetics, ... ).
femto's approach to step 3 is to provide a way to perform the operations that are common to every residual calculation (regardless of physics): interpolation and integration. Combining these interpolation and integration kernels in different ways lets us succinctly express many different kinds of physics.
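As a purely illustrative sketch (these function names are made up for this document, not femto's actual API), a residual calculation for, say, heat conduction built from those two primitives might look like:
// hypothetical composition of the two primitives (names are illustrative only):
// interpolate the temperature field to quadrature points, evaluate the physics
// pointwise, then integrate the fluxes against the test functions
auto du_dx_q  = interpolate_gradients(u, mesh, quadrature_rule);    // temperature gradients at quadrature points
auto flux_q   = -conductivity * du_dx_q;                            // Fourier's law, applied pointwise
auto residual = integrate_gradients(flux_q, mesh, quadrature_rule); // weighted integration back to the dofs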
So with that in mind, this document will cover the CUDA-fication of an interpolation kernel in femto.
Interpolation kernels map solution fields to the {value, gradient, curl, or divergence} of that solution field at a collection of quadrature points. We'll consider the gradient variant of interpolation in this document, but the other variants are practically identical. The serial CPU implementation of this kernel looks something like this
FiniteElement< geom, family > el{polynomial_degree};

// precalculate shape function gradients for the provided quadrature rule
auto shape_fn_grads = el.evaluate_shape_function_gradients(xi);

// preallocate a buffer to be used for intermediates in the interpolation
nd::array< double > scratch({el.batch_interpolation_scratch_space(xi)});

// for each element with this geometry
for (uint32_t i = 0; i < num_elements; i++) {

  // figure out which nodal values belong to this element
  el.indices(values.offsets, connectivity(elements(i)).data(), ids.data());

  for (int c = 0; c < components; c++) {

    // load the nodal values for this element
    for (int j = 0; j < nodes_per_element; j++) {
      values_e(j) = values.data(ids(j), c);
    }

    // interpolate the quadrature point values for this element
    nd::range r{i*qpts_per_element, (i+1)*qpts_per_element};
    el.gradient(gradients_q(r, c), values_e, shape_fn_grads, scratch.data());
  }
}
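For reference, the operation el.gradient performs for each component is essentially the standard finite element interpolation of the field's gradient at each quadrature point $\xi_q$,

$$\nabla u(\xi_q) = \sum_j u_j \, \nabla \phi_j(\xi_q),$$

where $u_j$ are the element's nodal values and $\phi_j$ its shape functions, so shape_fn_grads holds the precomputed $\nabla \phi_j(\xi_q)$ values.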
The first step in getting a calculation to run on a GPU is replacing a parallelizable for-loop with a kernel launch. The element for-loop
// for each element with this geometry
for (uint32_t i = 0; i < num_elements; i++) {
  ... // work to be done for each element
}
is the natural one to replace by a CUDA kernel
// define the kernel
__global__ void interpolate_gradient_kernel(...) {
  ... // work to be done for each element
}
// launch the kernel
uint32_t block_size = ...;
uint32_t grid_size = num_elements;
interpolate_gradient_kernel<<<grid_size, block_size>>>(...);
The kernel launch parameters can impact the performance significantly, but we'll consider that in a separate document. For now, let's just assume that each thread block will be responsible for performing the calculations for a single element.
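For example (an illustrative heuristic, not necessarily the choice femto ends up with), one could size the block to cover the larger of the per-element node and quadrature point counts, rounded up to a full warp:
// illustrative heuristic for the launch parameters: one block per element, with
// enough threads to cover the per-element work, rounded up to a multiple of 32
uint32_t work_per_element = (nodes_per_element > qpts_per_element) ? nodes_per_element
                                                                   : qpts_per_element;
uint32_t block_size = ((work_per_element + 31) / 32) * 32;
uint32_t grid_size = num_elements;
interpolate_gradient_kernel<<<grid_size, block_size>>>(...);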
__device__ function specifiers

Of course, there is more to be done than simply copying the body of the original for-loop into this new kernel. To start, any function used inside a CUDA kernel must be marked with a __device__ specifier. This loop body involves 3 function calls:
// FiniteElement::indices
el.indices(values.offsets, connectivity(elements(i)).data(), ids.data());
// nd::array::operator()
values_e(j) = values.data(ids(j), c);
// FiniteElement::gradient
el.gradient(gradients_q(r, c), values_e, shape_fn_grads, scratch.data());
so we first need to go to those implementations and mark them appropriately. In some cases this is trivial, but in other cases it is more involved, because __device__ functions can themselves only call other __device__ functions. So, in order to make a function like FiniteElement::gradient work in a CUDA kernel, we have to recurse into its implementation and make everything it calls work in a kernel as well.

In this particular case, these three functions only really involve array accesses and simple calculations, so we only need to sprinkle a few __device__ specifiers here and there.
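As a representative sketch (not femto's exact source), a gradient-like routine and the helper it calls both end up needing the annotation:
// marking a function __host__ __device__ is only enough if everything it calls
// is also callable on the device (representative sketch, not femto's source)
__host__ __device__ double dot(const double * a, const double * b, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; i++) { sum += a[i] * b[i]; }
  return sum;
}

__host__ __device__ void gradient_1d(double * du_dxi, const double * u_e,
                                     const double * dphi_dxi, int nodes, int qpts) {
  // each quadrature point gradient is a dot product of nodal values with
  // shape function gradients, so dot() must be __device__-callable as well
  for (int q = 0; q < qpts; q++) {
    du_dxi[q] = dot(u_e, dphi_dxi + q * nodes, nodes);
  }
}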
Another thing to consider is that the GPU has a separate memory space from the CPU. Many of the standard containers (std::vector, std::map, std::string, etc.) won't be usable inside a CUDA kernel, and many of the CPU algorithms (std::sort, std::unique, etc.) won't work with data that lives on the GPU. In addition to marking the relevant member functions of my library's custom containers with __device__, we also need to use cudaMalloc/cudaMallocManaged/cudaFree for memory allocations instead of new/delete.
Most of femto uses a custom container class that already has an abstraction around the way memory is allocated:
namespace memory {

  void * allocate(uint64_t n) {
    return malloc(n);
  }

  void deallocate(void * ptr) {
    free(ptr);
  }

  void memcpy(void * dest, void * src, uint64_t n) {
    std::memcpy(dest, src, n);
  }

  void zero(void * ptr, uint64_t n) {
    std::memset(ptr, 0, n);
  }

}
So, we need only modify this slightly to
inline void gpu_assert(cudaError_t code, const char * file, int line) {
  if (code != cudaSuccess) {
    fprintf(stderr, "gpu_assert: %s %s %d\n", cudaGetErrorString(code), file, line);
    exit(code);
  }
}

// error_check is assumed to be a convenience macro that forwards to gpu_assert
// with the file/line of the failing call
#define error_check(code) gpu_assert((code), __FILE__, __LINE__)

namespace memory {

  void * allocate(uint64_t n) {
    void * ptr;
    error_check(cudaMallocManaged(&ptr, n));
    return ptr;
  }

  void deallocate(void * ptr) {
    error_check(cudaFree(ptr));
  }

  void memcpy(void * dest, void * src, uint64_t n) {
    error_check(cudaMemcpy(dest, src, n, cudaMemcpyDefault));
  }

  void zero(void * ptr, uint64_t n) {
    error_check(cudaMemset(ptr, 0, n));
  }

}
so that the container classes allocate their memory in a GPU-compatible way.
"But wait, don't you know that memory allocated with cudaMallocManaged is slow? Don't you want the best performance?" - a coworker of mine
I don't know where this myth came from, but it's not true. It's easy to run experiments demonstrating that buffers allocated by cudaMallocManaged achieve the same performance as buffers allocated by cudaMalloc, but it seems some people believe that just because cudaMallocManaged offers some conveniences over cudaMalloc, it must be slower.
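A minimal sketch of such an experiment (benchmark_kernel and the buffer size are placeholders): time the same kernel on a buffer from cudaMalloc and one from cudaMallocManaged, after making sure the managed buffer is resident on the device.
uint64_t N = 1 << 24;
uint64_t bytes = N * sizeof(double);

double *x_device, *x_managed;
error_check(cudaMalloc((void**)&x_device, bytes));
error_check(cudaMallocManaged((void**)&x_managed, bytes));

// move the managed pages to the GPU before timing
int device;
error_check(cudaGetDevice(&device));
error_check(cudaMemPrefetchAsync(x_managed, bytes, device));

cudaEvent_t start, stop;
error_check(cudaEventCreate(&start));
error_check(cudaEventCreate(&stop));

double * buffers[] = { x_device, x_managed };
for (double * x : buffers) {
  error_check(cudaEventRecord(start));
  benchmark_kernel<<< N / 256, 256 >>>(x, N); // placeholder kernel that reads and writes x
  error_check(cudaEventRecord(stop));
  error_check(cudaEventSynchronize(stop));

  float ms;
  error_check(cudaEventElapsedTime(&ms, start, stop));
  printf("kernel time: %f ms\n", ms);
}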
Another important consideration when it comes to containers is how to pass data into a CUDA kernel. On the CPU side, my nd::array container class has value semantics. That is, when you write expressions like
nd::array< double, 3 > my_3D_array({10, 10, 10}); // create a 10x10x10 array
nd::array< double, 3 > copy = my_3D_array;        // create a copy of that array
it will create a copy. This is a little problematic when it comes to CUDA kernels for two reasons:
passing nd::array objects into CUDA kernels by value would be expensive (since it makes a copy)
using nd::array objects passed into a CUDA kernel by reference will result in dereferencing host data on the device (which isn't allowed)
A common workaround is to have two related container classes. The "view" type encapsulates most of the behavior of the container (array access, striding, etc.) without actually owning its data, while the "array" type is itself a "view", but includes ownership. This way, the "view" type has shallow-copy semantics and can be passed by value to CUDA kernels as an input or output. (Note: __global__ functions always return void, so "returning through a pointer" is the canonical way to get data out of a CUDA kernel.)
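A stripped-down sketch of that split (illustrative only, not femto's exact class definitions):
// "view": non-owning handle with shallow-copy semantics, safe to pass to kernels by value
template < typename T >
struct view {
  T * ptr = nullptr; // points at GPU-accessible memory
  uint64_t n = 0;    // number of entries

  __host__ __device__ T & operator[](uint64_t i) const { return ptr[i]; }
  __host__ __device__ uint64_t size() const { return n; }
};

// "array": a view that additionally owns (and eventually frees) its allocation
template < typename T >
struct array : public view< T > {
  array(uint64_t n) {
    this->n = n;
    this->ptr = (T *)memory::allocate(n * sizeof(T));
  }
  array(const array & other) : array(other.n) {
    memory::memcpy(this->ptr, other.ptr, other.n * sizeof(T));
  }
  ~array() { memory::deallocate(this->ptr); }
};

// kernels accept the non-owning view by value; the owning array converts to it implicitly
__global__ void scale(view< double > x, double factor) {
  uint64_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < x.size()) { x[i] *= factor; }
}
Because the view only holds a pointer and a size, copying it into the kernel's parameter space is cheap, and the copy still refers to the same GPU-accessible data.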
Now that the containers and functions are ready to go, we can try launching our kernel. However, CPU code that is naively ported to CUDA rarely performs well (except for trivial calculations like SAXPY). In particular, since we're assigning a group of threads to process each element in the mesh, most of the data used by those threads (element values, node ids, element connectivity, shape function evaluations) is the same for each thread in the block. CUDA lets us store quantities like this in __shared__ memory, which is beneficial for two reasons:
values in __shared__ memory can be accessed by each thread in the block, which avoids the alternative of duplicating values across threads
accesses to __shared__ memory are significantly faster than accesses to global memory, so for quantities that are accessed frequently, it can be beneficial to first stage them in __shared__ memory
So, to take advantage of shared memory in our new kernel, we first do some arithmetic to figure out how much shared memory is needed for the quantities used in the kernel
int shmem_size =
  shape_fn_grads.size() * sizeof(double) +     // shape functions
  connectivity.shape[1] * sizeof(Connection) + // connectivity
  nodes_per_element * sizeof(uint32_t) +       // node_ids
  nodes_per_element * sizeof(double) +         // u_e
  qpts_per_element * sizeof(grad_type) +       // du_dxi_q
  scratch_size * sizeof(double);               // scratch space
We pass this shmem_size (in bytes) as the 3rd argument in the triple chevron launch statement
interpolate_gradient_kernel<<<grid_size, block_size, shmem_size>>>(...);
//                                                   ^^^^^^^^^^ dynamic shared memory size (bytes)
and then put an extern __shared__ declaration in our kernel like
// define the kernel
__global__ void interpolate_gradient_kernel(...) {
  extern __shared__ char shmem[]; // this buffer will have shmem_size bytes
  ... // work to be done for each element
}
which we can then slice up into pieces and use in our kernel.
The other thing to be mindful of when using __shared__ memory is that, since each thread in the block can access it, it's possible to create race conditions or read/write values out of turn. In general, threads within a block may be executing at different locations in the kernel, so if you write code like
constexpr int block_size = 256;
__shared__ double data[block_size];
data[threadIdx.x] = threadIdx.x;
output[threadIdx.x] = data[block_size - 1 - threadIdx.x];
it is probable that some threads reach the final line before other threads have finished writing their entries of data, meaning that the values in data may not be completely populated in time. This can be addressed by an appropriate synchronization mechanism. For example, inserting a __syncthreads()
data[threadIdx.x] = threadIdx.x;

// wait until every thread in the block
// reaches this statement before proceeding
__syncthreads();

output[threadIdx.x] = data[block_size - 1 - threadIdx.x];
ensures that each thread will have finished initializing its entry of data before any thread tries to read them, protecting against the read-before-write problem described above.
These kinds of bugs can also be difficult to catch, as they're not deterministic: it's possible to write kernels with improper shared memory accesses that still pass tests most of the time.
For help identifying this sort of bug, regularly run your applications through compute-sanitizer.
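For example, its racecheck tool is designed to catch exactly this kind of shared memory hazard (the binary name here is just a placeholder):
compute-sanitizer --tool racecheck ./femto_tests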
After applying the ideas described above, here's the body of the new interpolation kernel
// start off by setting up our shared memory buffers
uint32_t shmem_offset = 0;
extern __shared__ char shmem[];

uint32_t local_elem = threadIdx.z;
uint32_t elem_per_block = blockDim.z;

nd::view< Connection > shr_connectivity(
  (Connection *)(shmem + shmem_offset),
  {connectivity.shape[1]}
);
shmem_offset += shr_connectivity.size() * sizeof(Connection);

uint32_t nodes_per_elem = el.num_nodes();
nd::view< uint32_t > shr_node_ids(
  (uint32_t *)(shmem + shmem_offset),
  {nodes_per_elem}
);
shmem_offset += shr_node_ids.size() * sizeof(uint32_t);

nd::view< grad_type > shr_du_dxi_q(
  (grad_type *)(shmem + shmem_offset),
  {qpts_per_elem}
);
shmem_offset += shr_du_dxi_q.size() * sizeof(grad_type);

nd::view< double > shr_u_e(
  (double *)(shmem + shmem_offset),
  {nodes_per_elem}
);
shmem_offset += shr_u_e.size() * sizeof(double);

nd::view< double, n > shr_shape_fn_grads(
  (double *)(shmem + shmem_offset),
  shape_functions.shape
);
shmem_offset += shr_shape_fn_grads.size() * sizeof(double);

nd::view< double > shr_scratch(
  (double *)(shmem + shmem_offset),
  {scratch_size}
);

int elem_tid = threadIdx.x + blockDim.x * threadIdx.y;
int elem_stride = blockDim.x * blockDim.y;
int block_tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
int block_stride = blockDim.x * blockDim.y * blockDim.z;

// 0. load shape function evaluations into shared memory
for (int i = block_tid; i < shr_shape_fn_grads.size(); i += block_stride) {
  shr_shape_fn_grads[i] = shape_functions[i];
}

// 1. then figure out which element we're interpolating and load its connectivity info
uint32_t e = blockIdx.x;
uint32_t elem_id = elements[e];
for (int i = elem_tid; i < shr_connectivity.size(); i += elem_stride) {
  shr_connectivity[i] = connectivity(elem_id, i);
}
__syncthreads();

// 2. from the connectivity, figure out which nodes belong to this element
if (elem_tid == 0) {
  el.indices(u_offsets, shr_connectivity.data(), shr_node_ids.data());
}
__syncthreads();

int num_components = u.shape[1];
for (int c = 0; c < num_components; c++) {

  // 3. load the nodal values for this element
  for (int i = elem_tid; i < nodes_per_elem; i += elem_stride) {
    shr_u_e(i) = u(shr_node_ids(i), c);
  }
  __syncthreads();

  // 4. interpolate the quadrature point values for this element
  el.cuda_gradient(shr_du_dxi_q, shr_u_e, shr_shape_fn_grads, shr_scratch.data());

  // 5. write those quadrature values out to global memory
  for (int q = elem_tid; q < qpts_per_elem; q += elem_stride) {
    du_dxi_q(qpts_per_elem * e + q, c) = shr_du_dxi_q[q];
  }
}
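With the shared memory carved up this way, a corresponding launch (with illustrative block dimensions; picking them well is the subject of the follow-up document) looks something like:
// illustrative launch: blockDim.x * blockDim.y threads cooperate on one element,
// blockDim.z is the number of elements per block (1 for now), and shmem_size is
// the byte count computed earlier
dim3 block_size(32, 4, 1);
dim3 grid_size(num_elements, 1, 1);
interpolate_gradient_kernel<<< grid_size, block_size, shmem_size >>>(...);
error_check(cudaGetLastError());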
With a little bit of work, we were able to get our CPU calculation running reasonably well on the GPU. Initial timings indicate that performance for high-order elements is already close to that of the manually-tuned FEM kernels in mfem, but the low-order elements were underperforming initially (for reasons addressed in the document linked below). For more information on subsequent kernel optimizations, please refer to the following documents:
Processing Multiple Elements Per Block
After making those optimizations, we have the following timings (in microseconds) taken from unstructured meshes with 1,000,000 elements each:
interpolation | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 11200 | 24600 | 46900 |
CPU (threaded) | 5220 | 6600 | 10600 |
GPU | 277 | 437 | 809 |
integration | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 9600 | 23900 | 65500 |
CPU (threaded, atomics) | 13500 | 16700 | 24800 |
CPU (threaded, gather) | 3720 | 8640 | 15900 |
GPU (atomics) | 248 | 401 | 671 |
GPU (gather)* | {150, 211} | {267, 469} | {473, 800} |
interpolation | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 23400 | 56500 | 105000 |
CPU (threaded) | 8500 | 8740 | 14900 |
GPU | 360 | 615 | 1140 |
integration | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 25500 | 61500 | 111000 |
CPU (threaded, atomics) | 10300 | 22900 | 39900 |
CPU (threaded, gather) | 5320 | 12100 | 24800 |
GPU (atomics) | 304 | 582 | 1060 |
GPU (gather)* | {191, 126} | {412, 644} | {710, 1130} |
interpolation | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 18300 | 75500 | 271000 |
CPU (threaded) | 5740 | 12100 | 24300 |
GPU | 606 | 1270 | 4330 |
integration | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 17800 | 73700 | 296000 |
CPU (threaded, atomics) | 12200 | 28800 | 63900 |
CPU (threaded, gather) | 5270 | 21600 | 56200 |
GPU (atomics) | 510 | 1170 | 2760 |
GPU (gather)* | {246, 722} | {588, 1850} | {1930, 2770} |
interpolation | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 88600 | 327000 | 784000 |
CPU (threaded) | 13900 | 29300 | 64500 |
GPU | 1200 | 3490 | 13000 |
integration | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 105000 | 389000 | 930000 |
CPU (threaded, atomics) | 26700 | 76700 | 169000 |
CPU (threaded, gather) | 16500 | 60200 | 135000 |
GPU (atomics) | 901 | 3240 | 13100 |
GPU (gather)* | {593, 431} | {2230, 2330} | {8590, 4450} |
* the GPU (gather) rows have two entries per column: the first is the time for the kernel that calculates element forces, the second is the time for the gather-based assembly kernel. The GPU (atomics) variant does both integration and assembly in a single kernel (but is not deterministic).
So, in one weekend, we were able to port an entire FEM library to CUDA (except for sparse matrix assembly). After a few small tweaks, these kernels already perform 10-20x faster than their threaded CPU counterparts, and rival or even surpass the hand-optimized kernel implementations in other mature FEM libraries!