In my spare time I've been developing a finite element library called "femto". It's always been part of my plan to make femto run on GPUs, but I believe that implementing GPU support too early on in a project can be detrimental, as it makes it more costly to consider fundamental changes to the library's design. However, after iterating on femto for a few months, I feel like it's at a point where I'm relatively happy with the core FEM abstractions and interfaces, which makes it a natural time to start considering GPU support.
This document goes over the main FEM kernels in femto and what was involved in making them run on GPUs. I've also done several experiments to optimize performance of those GPU kernels, but I will explain those considerations in separate documents.
The finite element method is a remarkably simple idea. The fundamental pieces common to all finite element analyses are:
1. Meshes to discretize the regions of interest
2. Solution fields defined over those meshes (or some parts of them)
3. Ways to express (and linearize) residual equations for the relevant physics
Although mesh generation is an incredibly challenging problem, the rest of steps 1 and 2 are relatively easy: the mesh object just needs to track the various kinds of connectivity between elements, and the solution field's primary responsibility is to enumerate degrees of freedom over a mesh.
The part that I find most challenging from a software design perspective is step 3, since the finite element method is used for many different applications (e.g. heat transfer, solid mechanics, fluid flow, electromagnetics, ... ).
femto's approach to step 3 is to provide a way to perform the operations that are common to every residual calculation (regardless of physics): interpolation and integration. Combining these interpolation and integration kernels in different ways lets us succinctly express many different kinds of physics.
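As a purely illustrative sketch (these function names are made up for this document, not femto's actual API), a residual calculation for, say, heat conduction built from those two primitives might look like:
// hypothetical composition of the two primitives (names are illustrative only):
// interpolate the temperature field to quadrature points, evaluate the physics
// pointwise, then integrate the fluxes against the test functions
auto du_dx_q  = interpolate_gradients(u, mesh, quadrature_rule);    // temperature gradients at quadrature points
auto flux_q   = -conductivity * du_dx_q;                            // Fourier's law, applied pointwise
auto residual = integrate_gradients(flux_q, mesh, quadrature_rule); // weighted integration back to the dofs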
So with that in mind, this document will cover the CUDA-fication of an interpolation kernel in femto.
Interpolation kernels map solution fields to the {value, gradient, curl, or divergence} of that solution field at a collection of quadrature points. We'll consider the gradient variant of interpolation in this document, but the other variants are practically identical. The serial CPU implementation of this kernel looks something like this
FiniteElement< geom, family > el{polynomial_degree};

// precalculate shape function gradients for the provided quadrature rule
auto shape_fn_grads = el.evaluate_shape_function_gradients(xi);

// preallocate a buffer to be used for intermediates in the interpolation
nd::array< double > scratch({el.batch_interpolation_scratch_space(xi)});

// for each element with this geometry
for (uint32_t i = 0; i < num_elements; i++) {

  // figure out which nodal values belong to this element
  el.indices(values.offsets, connectivity(elements(i)).data(), ids.data());

  for (int c = 0; c < components; c++) {

    // load the nodal values for this element
    for (int j = 0; j < nodes_per_element; j++) {
      values_e(j) = values.data(ids(j), c);
    }

    // interpolate the quadrature point values for this element
    nd::range r{i*qpts_per_element, (i+1)*qpts_per_element};
    el.gradient(gradients_q(r, c), values_e, shape_fn_grads, scratch.data());
  }
}
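For reference, the operation el.gradient performs for each component is essentially the standard finite element interpolation of the field's gradient at each quadrature point $\xi_q$,

$$\nabla u(\xi_q) = \sum_j u_j \, \nabla \phi_j(\xi_q),$$

where $u_j$ are the element's nodal values and $\phi_j$ its shape functions, so shape_fn_grads holds the precomputed $\nabla \phi_j(\xi_q)$ values.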
The first step in getting a calculation to run on a GPU is replacing a parallelizable for-loop with a kernel launch. The element for-loop
// for each element with this geometry
for (uint32_t i = 0; i < num_elements; i++) {
  ... // work to be done for each element
}
is the natural one to replace by a CUDA kernel
// define the kernel
__global__ void interpolate_gradient_kernel(...) {
  ... // work to be done for each element
}
// launch the kernel
uint32_t block_size = ...;
uint32_t grid_size = num_elements;
interpolate_gradient_kernel<<<grid_size, block_size>>>(...);
The kernel launch parameters can impact the performance significantly, but we'll consider that in a separate document. For now, let's just assume that each thread block will be responsible for performing the calculations for a single element.
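For example (an illustrative heuristic, not necessarily the choice femto ends up with), one could size the block to cover the larger of the per-element node and quadrature point counts, rounded up to a full warp:
// illustrative heuristic for the launch parameters: one block per element, with
// enough threads to cover the per-element work, rounded up to a multiple of 32
uint32_t work_per_element = (nodes_per_element > qpts_per_element) ? nodes_per_element
                                                                   : qpts_per_element;
uint32_t block_size = ((work_per_element + 31) / 32) * 32;
uint32_t grid_size = num_elements;
interpolate_gradient_kernel<<<grid_size, block_size>>>(...);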
__device__ function specifiers

Of course, there is more to be done than simply copying the body of the original for-loop into this new kernel. To start, any function used inside a CUDA kernel must be marked with a __device__ specifier. This loop body involves 3 function calls:
// FiniteElement::indices
el.indices(values.offsets, connectivity(elements(i)).data(), ids.data());
// nd::array::operator()
values_e(j) = values.data(ids(j), c);
// FiniteElement::gradient
el.gradient(gradients_q(r, c), values_e, shape_fn_grads, scratch.data());
so we first need to go to those implementations and mark them appropriately. In some cases this is trivial, but in other cases it is more involved, because __device__ functions can themselves only call other __device__ functions. So, in order to make a function like FiniteElement::gradient work in a CUDA kernel, we have to recurse into its implementation and make everything it calls work in a kernel as well.

In this particular case, these three functions only really involve array accesses and simple calculations, so we only need to sprinkle a few __device__ specifiers here and there.
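As a representative sketch (not femto's exact source), a gradient-like routine and the helper it calls both end up needing the annotation:
// marking a function __host__ __device__ is only enough if everything it calls
// is also callable on the device (representative sketch, not femto's source)
__host__ __device__ double dot(const double * a, const double * b, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; i++) { sum += a[i] * b[i]; }
  return sum;
}

__host__ __device__ void gradient_1d(double * du_dxi, const double * u_e,
                                     const double * dphi_dxi, int nodes, int qpts) {
  // each quadrature point gradient is a dot product of nodal values with
  // shape function gradients, so dot() must be __device__-callable as well
  for (int q = 0; q < qpts; q++) {
    du_dxi[q] = dot(u_e, dphi_dxi + q * nodes, nodes);
  }
}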
Another thing to consider is that the GPU has a separate memory space from the CPU. Many of the standard containers (std::vector, std::map, std::string, etc.) won't be usable inside a CUDA kernel, and many of the CPU algorithms (std::sort, std::unique, etc.) won't work with data that lives on the GPU. In addition to marking the relevant member functions of my library's custom containers with __device__, we also need to use cudaMalloc/cudaMallocManaged/cudaFree for memory allocations instead of new/delete.
Most of femto uses a custom container class that already has an abstraction around the way memory is allocated:
namespace memory {

  void * allocate(uint64_t n) {
    return malloc(n);
  }

  void deallocate(void * ptr) {
    free(ptr);
  }

  void memcpy(void * dest, void * src, uint64_t n) {
    std::memcpy(dest, src, n);
  }

  void zero(void * ptr, uint64_t n) {
    std::memset(ptr, 0, n);
  }

}
So, we need only modify this slightly to
inline void gpu_assert(cudaError_t code, const char * file, int line) {
  if (code != cudaSuccess) {
    fprintf(stderr, "gpu_assert: %s %s %d\n", cudaGetErrorString(code), file, line);
    exit(code);
  }
}

// error_check is assumed to be a convenience macro that forwards to gpu_assert
// with the file/line of the failing call
#define error_check(code) gpu_assert((code), __FILE__, __LINE__)

namespace memory {

  void * allocate(uint64_t n) {
    void * ptr;
    error_check(cudaMallocManaged(&ptr, n));
    return ptr;
  }

  void deallocate(void * ptr) {
    error_check(cudaFree(ptr));
  }

  void memcpy(void * dest, void * src, uint64_t n) {
    error_check(cudaMemcpy(dest, src, n, cudaMemcpyDefault));
  }

  void zero(void * ptr, uint64_t n) {
    error_check(cudaMemset(ptr, 0, n));
  }

}
so that the container classes allocate their memory in a GPU-compatible way.
"But wait, don't you know that memory allocated with cudaMallocManaged is slow? Don't you want the best performance?" - a coworker of mine
I don't know where this myth came from, but it's not true. It's easy to run experiments demonstrating that buffers allocated by cudaMallocManaged achieve the same performance as buffers allocated by cudaMalloc, but it seems some people believe that just because cudaMallocManaged offers some conveniences over cudaMalloc, it must be slower.
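A minimal sketch of such an experiment (benchmark_kernel and the buffer size are placeholders): time the same kernel on a buffer from cudaMalloc and one from cudaMallocManaged, after making sure the managed buffer is resident on the device.
uint64_t N = 1 << 24;
uint64_t bytes = N * sizeof(double);

double *x_device, *x_managed;
error_check(cudaMalloc((void**)&x_device, bytes));
error_check(cudaMallocManaged((void**)&x_managed, bytes));

// move the managed pages to the GPU before timing
int device;
error_check(cudaGetDevice(&device));
error_check(cudaMemPrefetchAsync(x_managed, bytes, device));

cudaEvent_t start, stop;
error_check(cudaEventCreate(&start));
error_check(cudaEventCreate(&stop));

double * buffers[] = { x_device, x_managed };
for (double * x : buffers) {
  error_check(cudaEventRecord(start));
  benchmark_kernel<<< N / 256, 256 >>>(x, N); // placeholder kernel that reads and writes x
  error_check(cudaEventRecord(stop));
  error_check(cudaEventSynchronize(stop));

  float ms;
  error_check(cudaEventElapsedTime(&ms, start, stop));
  printf("kernel time: %f ms\n", ms);
}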
Another important consideration when it comes to containers is how to pass data into a CUDA kernel. On the CPU side, my nd::array container class has value semantics. That is, when you write expressions like
nd::array< double, 3 > my_3D_array({10, 10, 10}); // create a 10x10x10 array
nd::array< double, 3 > copy = my_3D_array;        // create a copy of that array
it will create a copy. This is a little problematic when it comes to CUDA kernels for two reasons:
passing nd::array objects into CUDA kernels by value would be expensive (since it makes a copy)
using nd::array objects passed into a CUDA kernel by reference will result in dereferencing host data on the device (which isn't allowed)
A common workaround is to have two related container classes. The "view" type encapsulates most of the behavior of the container (array access, striding, etc.) without actually owning its data, while the "array" type is itself a "view", but includes ownership. This way, the "view" type has shallow-copy semantics and can be passed by value to CUDA kernels as an input or output. (Note: __global__ functions always return void, so "returning through a pointer" is the canonical way to get data out of a CUDA kernel.)
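A stripped-down sketch of that split (illustrative only, not femto's exact class definitions):
// "view": non-owning handle with shallow-copy semantics, safe to pass to kernels by value
template < typename T >
struct view {
  T * ptr = nullptr; // points at GPU-accessible memory
  uint64_t n = 0;    // number of entries

  __host__ __device__ T & operator[](uint64_t i) const { return ptr[i]; }
  __host__ __device__ uint64_t size() const { return n; }
};

// "array": a view that additionally owns (and eventually frees) its allocation
template < typename T >
struct array : public view< T > {
  array(uint64_t n) {
    this->n = n;
    this->ptr = (T *)memory::allocate(n * sizeof(T));
  }
  array(const array & other) : array(other.n) {
    memory::memcpy(this->ptr, other.ptr, other.n * sizeof(T));
  }
  ~array() { memory::deallocate(this->ptr); }
};

// kernels accept the non-owning view by value; the owning array converts to it implicitly
__global__ void scale(view< double > x, double factor) {
  uint64_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < x.size()) { x[i] *= factor; }
}
Because the view only holds a pointer and a size, copying it into the kernel's parameter space is cheap, and the copy still refers to the same GPU-accessible data.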
Now that the containers and functions are ready to go, we can try launching our kernel. However, CPU code that is naively ported to CUDA rarely performs well (except for trivial calculations like SAXPY). In particular, since we're assigning a group of threads to process each element in the mesh, most of the data used by those threads (element values, node ids, element connectivity, shape function evaluations) is the same for each thread in the block. CUDA lets us store quantities like this in __shared__ memory, which is beneficial for two reasons:
values in __shared__ memory can be accessed by each thread in the block, which avoids the alternative of duplicating values across threads
accesses to __shared__ memory are significantly faster than accesses to global memory, so for quantities that are accessed frequently, it can be beneficial to first stage them in __shared__ memory
So, to take advantage of shared memory in our new kernel, we first do some arithmetic to figure out how much shared memory is needed for the quantities used in the kernel
int shmem_size =
  shape_fn_grads.size() * sizeof(double) +     // shape functions
  connectivity.shape[1] * sizeof(Connection) + // connectivity
  nodes_per_element * sizeof(uint32_t) +       // node_ids
  nodes_per_element * sizeof(double) +         // u_e
  qpts_per_element * sizeof(grad_type) +       // du_dxi_q
  scratch_size * sizeof(double);               // scratch space
We pass this shmem_size (in bytes) as the 3rd argument in the triple chevron launch statement
interpolate_gradient_kernel<<<grid_size, block_size, shmem_size>>>(...);
//                                                   ^^^^^^^^^^ dynamic shared memory size (bytes)
and then put an extern __shared__ declaration in our kernel like
// define the kernel
__global__ void interpolate_gradient_kernel(...) {
  extern __shared__ char shmem[]; // this buffer will have shmem_size bytes
  ... // work to be done for each element
}
which we can then slice up into pieces and use in our kernel.
The other thing to be mindful of when using __shared__ memory is that, since each thread in the block can access it, it's possible to create race conditions or read/write values out of turn. In general, threads within a block may be executing at different locations in the kernel, so if you write code like
constexpr int block_size = 256;
__shared__ double data[block_size];
data[threadIdx.x] = threadIdx.x;
output[threadIdx.x] = data[block_size - 1 - threadIdx.x];
it is probable that some threads reach the final line before other threads have finished writing their entries of data, meaning that the values in data may not be completely populated in time. This can be addressed by an appropriate synchronization mechanism. For example, inserting a __syncthreads()
data[threadIdx.x] = threadIdx.x;

// wait until every thread in the block
// reaches this statement before proceeding
__syncthreads();

output[threadIdx.x] = data[block_size - 1 - threadIdx.x];
ensures that each thread will have finished initializing its entry of data before any thread tries to read them, protecting against the read-before-write problem described above.
These kinds of bugs can also be difficult to catch, as they're not deterministic: it's possible to write kernels with improper shared memory accesses that still pass tests most of the time.
For help identifying this sort of bug, regularly run your applications through compute-sanitizer.
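For example, its racecheck tool is designed to catch exactly this kind of shared memory hazard (the binary name here is just a placeholder):
compute-sanitizer --tool racecheck ./femto_tests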
After applying the ideas described above, here's the body of the new interpolation kernel
// start off by setting up our shared memory buffers
uint32_t shmem_offset = 0;
extern __shared__ char shmem[];

uint32_t local_elem = threadIdx.z;
uint32_t elem_per_block = blockDim.z;

nd::view< Connection > shr_connectivity(
  (Connection *)(shmem + shmem_offset),
  {connectivity.shape[1]}
);
shmem_offset += shr_connectivity.size() * sizeof(Connection);

uint32_t nodes_per_elem = el.num_nodes();
nd::view< uint32_t > shr_node_ids(
  (uint32_t *)(shmem + shmem_offset),
  {nodes_per_elem}
);
shmem_offset += shr_node_ids.size() * sizeof(uint32_t);

nd::view< grad_type > shr_du_dxi_q(
  (grad_type *)(shmem + shmem_offset),
  {qpts_per_elem}
);
shmem_offset += shr_du_dxi_q.size() * sizeof(grad_type);

nd::view< double > shr_u_e(
  (double *)(shmem + shmem_offset),
  {nodes_per_elem}
);
shmem_offset += shr_u_e.size() * sizeof(double);

nd::view< double, n > shr_shape_fn_grads(
  (double *)(shmem + shmem_offset),
  shape_functions.shape
);
shmem_offset += shr_shape_fn_grads.size() * sizeof(double);

nd::view< double > shr_scratch(
  (double *)(shmem + shmem_offset),
  {scratch_size}
);

int elem_tid = threadIdx.x + blockDim.x * threadIdx.y;
int elem_stride = blockDim.x * blockDim.y;
int block_tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
int block_stride = blockDim.x * blockDim.y * blockDim.z;

// 0. load shape function evaluations into shared memory
for (int i = block_tid; i < shr_shape_fn_grads.size(); i += block_stride) {
  shr_shape_fn_grads[i] = shape_functions[i];
}

// 1. then figure out which element we're interpolating and load its connectivity info
uint32_t e = blockIdx.x;
uint32_t elem_id = elements[e];
for (int i = elem_tid; i < shr_connectivity.size(); i += elem_stride) {
  shr_connectivity[i] = connectivity(elem_id, i);
}
__syncthreads();

// 2. from the connectivity, figure out which nodes belong to this element
if (elem_tid == 0) {
  el.indices(u_offsets, shr_connectivity.data(), shr_node_ids.data());
}
__syncthreads();

int num_components = u.shape[1];
for (int c = 0; c < num_components; c++) {

  // 3. load the nodal values for this element
  for (int i = elem_tid; i < nodes_per_elem; i += elem_stride) {
    shr_u_e(i) = u(shr_node_ids(i), c);
  }
  __syncthreads();

  // 4. interpolate the quadrature point values for this element
  el.cuda_gradient(shr_du_dxi_q, shr_u_e, shr_shape_fn_grads, shr_scratch.data());

  // 5. write those quadrature values out to global memory
  for (int q = elem_tid; q < qpts_per_elem; q += elem_stride) {
    du_dxi_q(qpts_per_elem * e + q, c) = shr_du_dxi_q[q];
  }
}
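With the shared memory carved up this way, a corresponding launch (with illustrative block dimensions; picking them well is the subject of the follow-up document) looks something like:
// illustrative launch: blockDim.x * blockDim.y threads cooperate on one element,
// blockDim.z is the number of elements per block (1 for now), and shmem_size is
// the byte count computed earlier
dim3 block_size(32, 4, 1);
dim3 grid_size(num_elements, 1, 1);
interpolate_gradient_kernel<<< grid_size, block_size, shmem_size >>>(...);
error_check(cudaGetLastError());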
With a little bit of work, we were able to get our CPU calculation running reasonably well on the GPU. Initial timings indicate that performance for high-order elements is already close to that of the manually-tuned FEM kernels in mfem, but the low-order elements were underperforming initially (for reasons addressed in the document linked below). For more information on subsequent kernel optimizations, please refer to the following documents:
Processing Multiple Elements Per Block
After making those optimizations, we have the following timings (in microseconds) taken from unstructured meshes with 1,000,000 elements each:
interpolation | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 11200 | 24600 | 46900 |
CPU (threaded) | 5220 | 6600 | 10600 |
GPU | 277 | 437 | 809 |
integration | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 9600 | 23900 | 65500 |
CPU (threaded, atomics) | 13500 | 16700 | 24800 |
CPU (threaded, gather) | 3720 | 8640 | 15900 |
GPU (atomics) | 248 | 401 | 671 |
GPU (gather)* | {150, 211} | {267, 469} | {473, 800} |
interpolation | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 23400 | 56500 | 105000 |
CPU (threaded) | 8500 | 8740 | 14900 |
GPU | 360 | 615 | 1140 |
integration | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 25500 | 61500 | 111000 |
CPU (threaded, atomics) | 10300 | 22900 | 39900 |
CPU (threaded, gather) | 5320 | 12100 | 24800 |
GPU (atomics) | 304 | 582 | 1060 |
GPU (gather)* | {191, 126} | {412, 644} | {710, 1130} |
interpolation | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 18300 | 75500 | 271000 |
CPU (threaded) | 5740 | 12100 | 24300 |
GPU | 606 | 1270 | 4330 |
integration | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 17800 | 73700 | 296000 |
CPU (threaded, atomics) | 12200 | 28800 | 63900 |
CPU (threaded, gather) | 5270 | 21600 | 56200 |
GPU (atomics) | 510 | 1170 | 2760 |
GPU (gather)* | {246, 722} | {588, 1850} | {1930, 2770} |
interpolation | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 88600 | 327000 | 784000 |
CPU (threaded) | 13900 | 29300 | 64500 |
GPU | 1200 | 3490 | 13000 |
integration | p = 1 | p = 2 | p = 3 |
---|---|---|---|
CPU (serial) | 105000 | 389000 | 930000 |
CPU (threaded, atomics) | 26700 | 76700 | 169000 |
CPU (threaded, gather) | 16500 | 60200 | 135000 |
GPU (atomics) | 901 | 3240 | 13100 |
GPU (gather)* | {593, 431} | {2230, 2330} | {8590, 4450} |
* the GPU (gather) rows have two entries per column: the first is the time for the kernel that calculates element forces, the second is the time for the gather-based assembly kernel. The GPU (atomics) variant does both integration and assembly in a single kernel (but is not deterministic).
So, in one weekend, we were able to port an entire FEM library to CUDA (except for sparse matrix assembly). After a few small tweaks, these kernels already perform 10-20x faster than their threaded CPU counterparts, and rival or even surpass the hand-optimized kernel implementations in other mature FEM libraries!