Consider the following scenario. You are a developer for a large C++ template library that performs computationally intensive processing on custom, complex data types, and you want to accelerate some of the slower functions with GPUs through CUDA. However, you don’t want to suddenly introduce the CUDA toolkit as a hard dependency because you expect many of your users will continue to use CPU-only implementations. You simply want to provide GPU acceleration to those who wish to leverage it. What’s the best way to accomplish this from a software engineering perspective?
A really bad way would be to maintain two separate projects. A better way would be to provide compile-time options to target either the GPU or the CPU implementation of each function; however, that would force users to pick a version and stick with it, with recompilation required to change targets. Furthermore, a function might run faster on the CPU for small array sizes and faster on the GPU for large ones, so ideally the user would have access to both implementations. Okay, so what if we just provide two functions, foo_cpu() and foo_gpu(), so that the developer can choose which version of foo() to use? This is close to a good answer, but without one additional improvement you're now expecting somebody to go through their (potentially huge) codebase, change function names, and possibly introduce extra logic. As we will see, there is a functional programming solution that allows for infectious runtime determination of the implementation and requires minimal change to existing codebases.
A second concern relates to the development of the CUDA implementation of the library itself. CUDA is a very low-level language, and if our library has complex data structures then it can be difficult to manage data allocation and memory transfers. On the host side (CPU), C++ classes make development easier by abstracting away this kind of housekeeping, and ideally we want to do the same on the device side (GPU) so that we can push our accelerated library out faster. With template specialization, we can create a CUDA interface to our existing classes that greatly simplifies GPU development by abstracting away the tedium of low-level GPU memory management with cudaMalloc, cudaMemcpy, etc.
In the rest of this article I’ll walk through an example of how this problem might come up in practice and how to solve it.
Demonstration
To demonstrate solutions to both of these topics, let’s consider a simple example of a template library that works on 2D arrays, and a single function that squares every element in the array. The actual CUDA code for this is trivial, but is not the main focus. The overall software design concept is more important.
You can find the code for this article here
The C++ Code
First, we build a basic 2D array class.
Disclaimer: This is an example and is not intended to be used in a real library without further modification. You generally don't want to be passing/freeing raw pointers in constructors/destructors without some sort of reference-counting mechanism. There's also no move constructor, no operator[] overload, no error checking, etc.
//Array2D.h
#ifndef ARRAY2D_H
#define ARRAY2D_H
template <class T>
class Array2D {
public:
Array2D(T* _data,
const size_t& _nrows,
const size_t& _ncols);
Array2D(const Array2D<T>& other);
Array2D<T>& operator=(const Array2D<T>& other);
~Array2D(){delete[] this->data;}
size_t get_nrows() const {return this->nrows;}
size_t get_ncols() const {return this->ncols;}
size_t size() const {return this->N;}
T* begin(){return data;}
T* begin()const{return data;}
T* end(){return data + this->size();}
T* end()const{return data + this->size();}
private:
T* data;
size_t nrows;
size_t ncols;
size_t N;
};
template <class T>
Array2D<T>::Array2D(T* _data,
const size_t& _nrows,
const size_t& _ncols):data(_data), nrows(_nrows), ncols(_ncols){
this->N = _nrows * _ncols;
};
template <class T>
Array2D<T>::Array2D(const Array2D<T>& other):nrows(other.nrows), ncols(other.ncols), N(other.N){
data = new T[N];
auto i = this->begin();
for (auto& o:other)*i++=o;
};
template <class T>
Array2D<T>& Array2D<T>::operator=(const Array2D<T>& other){
this->ncols = other.ncols;
this->nrows = other.nrows;
this->N = other.N;
// here should compare the sizes of the arrays and reallocate if necessary
delete[] data;
data = new T[N];
auto i = this->begin();
for (auto& o:other)*i++=o;
return *this;
};
#endif //ARRAY2D_H
The class stores a pointer to the raw data as well as the number of rows and columns for the 2D array it represents. There's a copy constructor and an assignment overload, and I define begin() and end() methods so I can use range-based for syntax. The destructor deletes the pointer, so Array2D objects should be constructed with a pointer returned from new or similar.
Now that the class exists, the implementation for ArrayPow2, a function that squares every element in an Array2D, is trivial.
//ArrayPow2.h
#include <algorithm>
#include "Array2D.h"
template <class T>
void ArrayPow2(Array2D<T>& in, Array2D<T>& result){
std::transform(in.begin(), in.end(), result.begin(), [](const T& a){return a*a;});
}
std::transform applies the function-like object given as the fourth argument to every element in the range defined by the first two arguments, and writes the results through the output iterator given as the third. Here I used a lambda function to express the squaring operation.
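If std::transform is unfamiliar, the call above is equivalent to the hand-written loop below (an illustrative sketch of my own, not part of the library; the name ArrayPow2_loop is made up):
//equivalent hand-written loop for ArrayPow2 (illustrative only)
#include "Array2D.h"
template <class T>
void ArrayPow2_loop(Array2D<T>& in, Array2D<T>& result){
  T* out = result.begin();
  for (const T& a : in) *out++ = a * a;   // square each element and write it to the result array
}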
A simple driver program verifies that what we have so far is okay:
// driver1.cpp
#include <iostream>
#include "Array2D.h"
#include "ArrayPow2.h"
using namespace std;
int main() {
Array2D<float> arr(new float[100], 10, 10);
int a = 2;
for (auto& i:arr)i=++a;
Array2D<float> result(arr);
ArrayPow2(arr, result);
cout << "arr[0] = " << *arr.begin() << endl;
cout << "arr[0]^2 = " << *result.begin() << endl;
return 0;
}
arr[0] = 3
arr[0]^2 = 9
CUDA Template Specialization
To make developing/transcribing the CUDA version of our code easier, we first want to implement a CUDA version of our Array2D class. This can be done by creating a specialization of Array2D for CUDA. To differentiate between the CPU and GPU versions of Array2D, I introduce a trivial class that just contains a single value.
template <class T>
struct Cutype{
T val;
};
The effect this has is subtle, but powerful. I can now create an Array2D< Cutype<T> > that instantiates an entirely different class than Array2D<T>, and in this case that specialization will be used to abstract away calls to cudaMalloc, cudaMemcpy, etc. In my implementation both types store data of type float, but you might instead decide that the CUDA version stores its data as Cutype<float>. There is no additional overhead for that extra layer: the compiler optimizes it away and treats it as a plain float (you can check the output assembly if you don't believe me), so you could always cast one type to the other. An additional positive side effect of that choice is that Cutype then serves as a compile-time error boundary, preventing accidental assignment or dereferencing of device data from the host, a mistake that otherwise shows up as a hard-to-diagnose segmentation fault at runtime. Either choice here seems fine to me.
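To see why this matters, here is a short sketch of my own (the full specialization it relies on is shown just below): the two instantiations are entirely unrelated class types, so the compiler stops you from mixing them by accident.
Array2D<float> host(new float[16], 4, 4);      // ordinary host-side array
Array2D< Cutype<float> > dev(host);            // CUDA specialization, defined below
// host = dev;              // does not compile: there is no assignment from the device type
// Array2D<float> h(dev);   // does not compile either, so device arrays cannot silently masquerade as host arrays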
Here is the full specialization (as before this class is incomplete, but contains the code relevant to this discussion):
// Array2D_CUDA.h
#ifndef ARRAY2D_CUDA_H
#define ARRAY2D_CUDA_H
#include <iostream>
#include "Array2D.h"
#include "cuda.h"
#include "cuda_runtime_api.h"
using namespace std;
template <class T>
struct Cutype{
T val;
};
template <class U>
class Array2D< Cutype<U> > {
public:
Array2D(U* _data,
const size_t& _nrows,
const size_t& _ncols);
Array2D(const Array2D<U>&);
Array2D< Cutype<U> >& operator=(const Array2D<U>& other);
~Array2D();
// nrows/ncols/N live in device memory, so copy them back before using them on the host
size_t get_nrows() const {size_t v; cudaMemcpy(&v, nrows, sizeof(size_t), cudaMemcpyDeviceToHost); return v;}
size_t get_ncols() const {size_t v; cudaMemcpy(&v, ncols, sizeof(size_t), cudaMemcpyDeviceToHost); return v;}
size_t size() const {size_t v; cudaMemcpy(&v, N, sizeof(size_t), cudaMemcpyDeviceToHost); return v;}
U* begin()const{return data;}
U* end()const{return data + this->size();}
U* begin(){return data;}
U* end(){return data + this->size();}
private:
U* data;
size_t* nrows;
size_t* ncols;
size_t* N;
};
template <class U>
Array2D< Cutype<U> >::Array2D(U* _data,
const size_t& _nrows,
const size_t& _ncols):data(_data){
size_t N_tmp = _nrows * _ncols;
cudaMalloc((void**)&nrows, sizeof(size_t));
cudaMalloc((void**)&ncols, sizeof(size_t));
cudaMalloc((void**)&N , sizeof(size_t));
cudaMalloc((void**)&data , sizeof(U) * N_tmp);
cudaMemcpy(nrows, &_nrows, sizeof(size_t) , cudaMemcpyHostToDevice);
cudaMemcpy(ncols, &_ncols, sizeof(size_t) , cudaMemcpyHostToDevice);
cudaMemcpy(N, &N_tmp , sizeof(size_t) , cudaMemcpyHostToDevice);
cudaMemcpy(data, _data , sizeof(U)*N_tmp, cudaMemcpyHostToDevice);
};
template <class U>
Array2D< Cutype<U> >::Array2D(const Array2D<U>& other){
size_t N_tmp = other.size();
cudaMalloc((void**)&nrows, sizeof(size_t));
cudaMalloc((void**)&ncols, sizeof(size_t));
cudaMalloc((void**)&N , sizeof(size_t));
cudaMalloc((void**)&data , sizeof(U) * N_tmp);
const size_t other_nrows = other.get_nrows();
const size_t other_ncols = other.get_ncols();
const size_t other_N = other.size();
U *other_data = other.begin();
cudaMemcpy(nrows, &other_nrows, sizeof(size_t) , cudaMemcpyHostToDevice);
cudaMemcpy(ncols, &other_ncols, sizeof(size_t) , cudaMemcpyHostToDevice);
cudaMemcpy(N, &other_N , sizeof(size_t) , cudaMemcpyHostToDevice);
cudaMemcpy(data, other_data , sizeof(U)*N_tmp, cudaMemcpyHostToDevice);
}
template <class U>
Array2D< Cutype<U> >& Array2D< Cutype<U> >::operator=(const Array2D<U>& other){
size_t N_tmp = other.size();
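// as with the host version, this should free (or reuse) the existing device allocations before allocating again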
cudaMalloc((void**)&nrows, sizeof(size_t));
cudaMalloc((void**)&ncols, sizeof(size_t));
cudaMalloc((void**)&N , sizeof(size_t));
cudaMalloc((void**)&data , sizeof(U) * N_tmp);
const size_t other_nrows = other.get_nrows();
const size_t other_ncols = other.get_ncols();
const size_t other_N = other.size();
U *other_data = other.begin();
cudaMemcpy(nrows, &other_nrows, sizeof(size_t) , cudaMemcpyHostToDevice);
cudaMemcpy(ncols, &other_ncols, sizeof(size_t) , cudaMemcpyHostToDevice);
cudaMemcpy(N, &other_N , sizeof(size_t) , cudaMemcpyHostToDevice);
cudaMemcpy(data, other_data , sizeof(U)*N_tmp, cudaMemcpyHostToDevice);
return *this;
}
template <class U>
Array2D< Cutype<U> >::~Array2D(){
cudaFree(nrows);
cudaFree(ncols);
cudaFree(N);
cudaFree(data);
}
#endif //ARRAY2D_CUDA_H
Most of the interface should look familiar. A key difference is that in place of the copy constructor and assignment operator for Array2D<T>, which took in another Array2D<T>, I have defined how an Array2D< Cutype<T> > is copied from an Array2D<T>. This way we can construct a 2D array object on the GPU simply by passing in an existing host-side array. The calls to new and delete[] are replaced by cudaMalloc and cudaFree, and the data copying is handled with cudaMemcpy.
Other than the way the object is built, the way you interact with such a class is largely unchanged. Taking the time to build a class like this will save a lot of time in the long run compared to manually dealing with memory copying for every new kernel.
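To make the construction-by-copy concrete, here is a minimal usage sketch of my own (not code from the repository), assuming a CUDA-capable build and the headers above:
//usage_sketch.cu (illustrative only)
#include "Array2D.h"
#include "Array2D_CUDA.h"
void sketch(){
  Array2D<float> host(new float[100], 10, 10);   // host array; data allocated with new
  Array2D< Cutype<float> > device(host);         // one line: cudaMalloc + cudaMemcpy host -> device
  // device.begin() returns a raw device pointer, ready to be handed to a kernel
}   // leaving scope: device's destructor calls cudaFree, host's destructor calls delete[]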
CUDA Implementation
Although we have written a bunch of code for the GPU, we haven't actually written any CUDA yet – we've just used the CUDA runtime API. To make a version of ArrayPow2 from earlier on the GPU, we create a C++ wrapper function with the same signature in a header file.
//ArrayPow2_CUDA.cuh
#include "Array2D_CUDA.h"
#include "Array2D.h"
template <class T>
void ArrayPow2_CUDA(Array2D<T>& in, Array2D<T>& result);
And then the implementation in a .cu file
//ArrayPow2_CUDA.cu
#include "Array2D_CUDA.h"
#include "Array2D.h"
#include "cuda.h"
#include "cuda_runtime_api.h"
#define BLOCK_SIZE 1024
template <class T>
__global__ void pow2(T* in, T* out, size_t N){
int idx = threadIdx.x + blockIdx.x*blockDim.x;
if (idx < N)out[idx] = in[idx] * in[idx];
}
template <class T>
void ArrayPow2_CUDA(Array2D<T>& in, Array2D<T>& result) {
std::cout << "Using the GPU version\n";
Array2D< Cutype<T> > in_d(in);   // one line: allocate and copy the input array to the device
size_t N = in.size();
pow2 <<< (N - 1) / BLOCK_SIZE + 1, BLOCK_SIZE >>> (in_d.begin(), in_d.begin(), N);
cudaMemcpy(result.begin(), in_d.begin(), sizeof(T) * N, cudaMemcpyDeviceToHost);
}
template void ArrayPow2_CUDA(Array2D<float>&, Array2D<float>&);
template __global__ void pow2(float*, float*, size_t);
Now you can see where the template specialization really shines. The wrapper function is called in exactly the same way as our original CPU implementation, and a GPU copy of the array is created in just one line. We launch the kernel using the same .begin() accessor that we used with std::transform; the launch configuration (N - 1) / BLOCK_SIZE + 1 simply rounds up to enough blocks of BLOCK_SIZE threads to cover all N elements, which is why the kernel guards with if (idx < N). We then copy the result back into the host-side result array. I explicitly used cudaMemcpy here because it's just one line, but you could always write a helper function to hide that too if you please.
The one important, and somewhat painful, bit is the necessary addition of the last two lines. If you have worked with templates much, you'll quickly find that it is a pain to separate prototypes and implementations into different files. The reason is that a template itself just defines how the compiler can construct a class or function from a given type. If the template code lives in its own compilation unit, it gets compiled before the compiler knows which instantiations are actually needed. By the time the linker is trying to connect your main() with whatever templates it needs, it will usually yell at you about “undefined symbols”, meaning it is looking for an object that was never generated. In C++ there are ways to get around this, but usually the easiest solution is to put the full template implementation in the header file, so that every instantiation that is needed is visible at compile time. With CUDA code this solution doesn't work because nvcc compiles the CUDA and C++ parts separately, so there must be multiple files. The solution is to forcibly instantiate the template types that will be used, which is what the last two lines do: they are explicit instantiations of the kernel and wrapper function for the types I intend to use in my library.
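For example, if you later wanted a double-precision version of the function as well, you would (hypothetically) add one more explicit instantiation of each template to the .cu file:
template void ArrayPow2_CUDA(Array2D<double>&, Array2D<double>&);
template __global__ void pow2(double*, double*, size_t);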
Putting it all together
Now that we have implemented our CUDA template specialization and kernel, we want to tie all of this back into our original library in a way that adds no CUDA dependencies for users who just want to keep using the CPU-only version, and that provides a way to choose between the CPU and GPU implementations at runtime without demanding that developers go through their code and change function names everywhere. The solution I used for integrating the GPU code is the following:
- Change the name of the existing C++ function. I append _CPU for clarity.
- Create a pointer to a function with the same signature as the pure C++ version and the GPU wrapper (remember, they have the same signature). Name it whatever the C++ function was called before you changed it.
- Add a preprocessor flag, ENABLE_GPU. If it is not defined at compile time, don't include anything related to CUDA and set the function pointer to the CPU version. If it is defined, add a runtime check on a command-line argument and use it to set the function pointer to either the CPU or the CUDA version, as appropriate.
All of this goes into the driver program, which follows
// main.cpp
#include <iostream>
#include <cstring>
#include "Array2D.h"
#include "ArrayPow2_CPU.h"
#ifdef ENABLE_GPU
#include "Array2D_CUDA.h"
#include "ArrayPow2_CUDA.cuh"
#endif //ENABLE_GPU
template <class T>
using ArrayPow2_F = void(*)(Array2D<T>&, Array2D<T>&);
ArrayPow2_F<float> ArrayPow2;
using namespace std;
int main(int argc, char** argv) {
#ifdef ENABLE_GPU
if (argc>2 && !strcmp(argv[1],"gpu")){
if (!strcmp(argv[2],"1")){
ArrayPow2 = ArrayPow2_CUDA;
} else{
ArrayPow2 = ArrayPow2_CPU;
}
} else
{
ArrayPow2 = ArrayPow2_CUDA;
}
#else
ArrayPow2 = ArrayPow2_CPU;
#endif //ENABLE_GPU
Array2D<float> arr(new float[120], 60, 2);
int a = 2;
for (auto& i:arr)i=++a;
cout << "arr[0]" << *arr.begin()<< endl;
Array2D<float> result(arr);
cout << "arr[0]" << *arr.begin()<< endl;
cout << "result[0]" << *result.begin()<< endl;
ArrayPow2(arr, result);
cout << "arr[0] = " << *arr.begin() << endl;
cout << "arr[0]^2 = " << *result.begin() << endl;
return 0;
}
I also add a short Makefile for convenience
# Makefile
all: cpu
clean:
-rm demo
cpu:
g++ -std=c++11 main.cpp -o demo
gpu:
nvcc -std=c++11 main.cpp -D ENABLE_GPU ArrayPow2_CUDA.cu -o demo
With this setup we now have three ways to compile and run the program:
- Change nothing. Compile with make.
- Compile for the GPU with make gpu. The default behavior is now to run on the GPU.
- Run the program compiled in option 2) on the CPU by adding gpu 0 to the command line call.
Here's the output from each of these (I added a line to ArrayPow2_CPU and ArrayPow2_CUDA to print which version was triggered):
$ make
$ ./demo
Using the CPU version
arr[0] = 3
arr[0]^2 = 9
$ make gpu
$ ./demo
Using the GPU version
arr[0] = 3
arr[0]^2 = 9
$ make gpu
$ ./demo gpu 0
Using the CPU version
arr[0] = 3
arr[0]^2 = 9
If you think about it, this is an incredibly powerful programming pattern. We still only have to maintain a single code base as we continue to add more and more GPU functions. Users who don't have CUDA, or don't want to use GPU support for any reason, are not affected in any way. Most importantly, for those who do use the GPU support, the technique of replacing each accelerated function with a function pointer and renaming the original C++ version is infectious. A project could be millions of lines long and contain many calls to a function like ArrayPow2, but once that function is replaced with a function pointer set to the GPU implementation, the entire project immediately uses the GPU version with no further changes. Now, in this example I only considered a single function, but the extension to many is trivial: all that is required is configuration and assignment of the function pointers at the beginning of the end user's program.
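As one concrete illustration of that startup configuration (a sketch of my own, reusing the ArrayPow2 function pointer and the ENABLE_GPU flag from the driver above), the selection could even fall back to the CPU automatically when no usable GPU is present:
#ifdef ENABLE_GPU
// query the runtime for available devices and fall back to the CPU if none is usable
int ndevices = 0;
if (cudaGetDeviceCount(&ndevices) == cudaSuccess && ndevices > 0){
  ArrayPow2 = ArrayPow2_CUDA;
} else {
  ArrayPow2 = ArrayPow2_CPU;
}
#else
ArrayPow2 = ArrayPow2_CPU;
#endif //ENABLE_GPU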
Furthermore, this configuration could be done however the developer wanted. For example, a check could be added on the array size, with the CPU or GPU version chosen based on optimization/tuning results. Another use case is code running on a large cluster where some nodes have GPUs and others do not: a simple query, like the one sketched above, can check whether a valid GPU is present and use it, falling back to the CPU version otherwise. A data center concerned about power consumption could dynamically change where data is processed in real time. There are a lot of possibilities, most of which I probably haven't thought of. After all, writing software that is flexible enough to be easily adapted to a particular user's needs, without the library developer having to anticipate them, is a sign of good software engineering.