Use python to drive your GPU with CUDA for accelerated, parallel computing. Notebook ready to run on the Google Colab platform
In this post, you will learn how to do accelerated, parallel computing on your GPU with CUDA, all in python!
This is the second part of my series on accelerated computing with python:
CUDA is the computing platform and programming model provided by nvidia for their GPUs. It provides low-level access to the GPU, and is the base for other libraries such as cuDNN or, at an even higher level, TensorFlow.
GPUs are not only for games and neural networks. They have a highly parallel architecture that can be used for almost any kind of data crunching. To be able to write custom algorithms for the GPU, we need to learn CUDA. And if you've heard about it already, you might be thinking that it's not going to be easy, and that you'll need to do some C++.
In June 2019, Gunter and Romuald from nvidia came to CCIN2P3 and gave us a great tutorial about accelerated computing with CUDA in python, and I was just blown away! So today, I'd like to share what I learnt with you.
Gunter and Romuald used the excellent material from nvidia's deep learning institute. Obviously, I have been inspired by the DLI tutorial, but I designed my own exercises and organized things a bit differently.
In this article, you will:

learn how to compile numpy ufuncs for the GPU with numba;
write generalized ufuncs and device functions;
learn how to manage data transfers between the host and the GPU with device arrays.
You can execute the code below in a jupyter notebook on the Google Colab platform by simply following this link. Open this link in Chrome rather than Firefox, and make sure to select GPU as execution environment.
But you can also just keep reading through here if you prefer!
By default, Google Colab is not able to run numba + CUDA, because two libraries are not found, libdevice and libnvvm.so. So we need to make sure that these libraries are found in the notebook.
First, we look for these libraries on the system. To do that, we simply run the find command, to recursively look for these libraries starting at the root of the filesystem. The exclamation mark escapes the line so that it's executed by the Linux shell, and not by the jupyter notebook.
!find / -iname 'libdevice'
!find / -iname 'libnvvm.so'
Then, we add the two libraries to numba environment variables:
import os
os.environ['NUMBAPRO_LIBDEVICE'] = "/usr/local/cuda-10.0/nvvm/libdevice"
os.environ['NUMBAPRO_NVVM'] = "/usr/local/cuda-10.0/nvvm/lib64/libnvvm.so"
And we're done! Now let's get started.
Numpy universal functions or ufuncs are functions that operate on a numpy array in an element-by-element fashion. For example, when we take the square of a numpy array, a ufunc computes the square of each element before returning the resulting array:
import numpy as np

x = np.arange(10)
x**2
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
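The element-by-element behavior can be checked against an explicit python loop, as in this small sketch:

```python
import numpy as np

# a ufunc applies its operation to every element in one vectorized call
x = np.arange(10)
squares = x**2

# same result as an explicit python loop over the elements
looped = [xx * xx for xx in x]
assert list(squares) == looped
```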
Most math functions (if not all) are available as ufuncs in numpy. For example, to exponentiate all elements in a numpy array:

np.exp(x)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01, 5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03, 2.98095799e+03, 8.10308393e+03])
Most ufuncs are implemented in compiled C code, so they are already quite fast, and much faster than plain python. For example, let's consider a large array, and let's compute the square root of each element, both in plain python and in numpy:
import math

x = np.arange(int(1e6))
%timeit np.sqrt(x)
%timeit [math.sqrt(xx) for xx in x]
100 loops, best of 3: 2.73 ms per loop 10 loops, best of 3: 178 ms per loop
We see that the numpy ufunc is about 65 times faster. Still, in the ufunc, the calculation is not parallelized: the square root is computed sequentially on the CPU for each element in the array.
The GPU, contrary to the CPU, is able to perform a large number of operations simultaneously. In the following, we will see how to create and compile ufuncs for the GPU to perform the calculation on many elements at the same time.
Our first ufunc for the GPU will again compute the square root for a large number of points.
We start by building a sample of points ranging from 0 to 10 million. GPUs are more efficient with numbers that are encoded on a small number of bits. And often, a very high precision is not needed. So we create a sample of float32 numbers (the default being float64):
import numpy as np

npoints = int(1e7)
a = np.arange(npoints, dtype=np.float32)
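To see what we save with float32, we can compare memory footprints: halving the element size halves the data that will later have to travel to the GPU. A quick check in plain numpy, nothing CUDA-specific:

```python
import numpy as np

npoints = int(1e7)
a32 = np.arange(npoints, dtype=np.float32)  # 4 bytes per element
a64 = np.arange(npoints, dtype=np.float64)  # 8 bytes per element, the numpy default

# float32 uses half the memory of float64
print(a32.nbytes / 1e6, "MB vs", a64.nbytes / 1e6, "MB")
```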
With numba, we can create ufuncs compiled for the CPU using the vectorize decorator. Let's start by doing this:
import math
from numba import vectorize

@vectorize
def cpu_sqrt(x):
    return math.sqrt(x)

cpu_sqrt(x)
array([ 0. , 1. , 1.41421356, ..., 999.9985 , 999.999 , 999.9995 ])
Creating a ufunc for the GPU is almost as straightforward:
import math
from numba import vectorize

@vectorize(['float32(float32)'], target='cuda')
def gpu_sqrt(x):
    return math.sqrt(x)

gpu_sqrt(a)
array([0.0000000e+00, 1.0000000e+00, 1.4142135e+00, ..., 3.1622771e+03, 3.1622773e+03, 3.1622776e+03], dtype=float32)
It is important to note that, contrary to the CPU case, the input and return types of the function have to be specified when compiling for the GPU. In the string 'float32(float32)', the first float32 corresponds to the return type, and the second one to the input type. You can think of it as: the returned value is a function of the input value. Also, please note that these types need to be adapted to your data. In the case above, we have created an array of float32 values, so we are fine.
Edit the type definition in the vectorize decorator to read, for instance, 'int32(int32)', and see what happens when you call the function. Then change it back to 'float32(float32)'.
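As a point of comparison, numpy's own ufuncs adapt to the input dtype automatically, which is exactly what the explicit CUDA signature gives up; a quick check:

```python
import numpy as np

a32 = np.arange(5, dtype=np.float32)
a64 = np.arange(5, dtype=np.float64)

# numpy ufuncs preserve the precision of their input,
# while a CUDA ufunc only accepts the types listed in its signature
assert np.sqrt(a32).dtype == np.float32
assert np.sqrt(a64).dtype == np.float64
```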
Now let's see how much we gained. For this, we use the timeit magic command:

%timeit gpu_sqrt(a)
10 loops, best of 3: 24.7 ms per loop
The GPU managed to compute the sqrt for 10 million points in about 25 ms. Now let's see what we get with numpy, which is compiled for the CPU, and with our CPU ufunc:

%timeit np.sqrt(a)
100 loops, best of 3: 9.19 ms per loop
%timeit cpu_sqrt(a)

The slowest run took 6.34 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 8.92 ms per loop
Wait! We do not gain anything, and the CPU version is actually more than twice as fast!
There is a simple reason for this. When running on the GPU, the following happens under the hood:

the input array a is transferred to the GPU memory;
the calculation is performed on the GPU;
the results are transferred back from the GPU to the host memory.
If the calculation is too simple, there is no use shipping our data to the GPU for fast parallel processing, if we are to wait so long for the data transfers to complete. In other words, most of the time is spent in the data transfers, and the GPU is basically useless.
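A back-of-envelope estimate makes this concrete. Assuming an effective PCIe bandwidth of about 6 GB/s (an assumption; the real figure depends on your hardware), just moving 10 million float32 values one way already costs several milliseconds:

```python
# rough transfer-time estimate; 6 GB/s effective PCIe bandwidth is an assumption
npoints = int(1e7)
nbytes = npoints * 4          # float32: 4 bytes per element
bandwidth = 6e9               # bytes per second
transfer_ms = nbytes / bandwidth * 1e3

print(round(transfer_ms, 1), "ms one way")
```

With a round trip on top of that, the few milliseconds of actual computation are dwarfed by the transfers.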
Let's see what happens with a more involved calculation.
Let's build an array of 1000 points in 2D, described by the cartesian coordinates x and y. We choose to draw the points according to a 2D Gaussian distribution, introducing some correlation between the x and y components:
points = np.random.multivariate_normal([0,0], [[1.,0.9], [0.9,1.]], 1000).astype(np.float32)
import matplotlib.pyplot as plt

plt.scatter(points[:,0], points[:,1])
<matplotlib.collections.PathCollection at 0x7fbac45bd358>
It's often useful to convert cartesian coordinates into the polar coordinates (r, theta), where r is the distance of the point to the origin, and where theta is the angle between the x axis and the direction of the point.
r is easily obtained as the quadratic sum of x and y: r = sqrt(x**2 + y**2).
But to get theta, one has to use the arctan2 function. This function is available in numpy, and we can use it to easily get the results (please note that the y coordinate has to be provided as the first argument):
theta = np.arctan2(points[:,1], points[:,0])
_ = plt.hist(theta, bins=100)
Because of the correlation between x and y, we see two peaks at $\pi/4$ and $-3\pi/4$.
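These two angles can be spot-checked in plain numpy, without any GPU:

```python
import numpy as np

# points on the two diagonals where the peaks appear
x = np.array([1., -1.], dtype=np.float32)
y = np.array([1., -1.], dtype=np.float32)
theta = np.arctan2(y, x)   # y comes first

assert np.isclose(theta[0], np.pi / 4)        # point (1, 1)
assert np.isclose(theta[1], -3 * np.pi / 4)   # point (-1, -1)
```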
Now let's try and perform the same calculation on the GPU. This time, we have two input values, and we define the function signature accordingly.
import math

@vectorize(['float32(float32, float32)'], target='cuda')
def gpu_arctan2(y, x):
    theta = math.atan2(y, x)
    return theta
As before with np.arctan2, we need to slice our points array to provide first the array of y coordinates, and then the array of x coordinates:
theta = gpu_arctan2(points[:,1], points[:,0])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-2811dabcd6c9> in <module>()
----> 1 theta = gpu_arctan2(points[:,1], points[:,0])
[...]
ValueError: Array contains non-contiguous buffer and cannot be transferred as a single memory region. Please ensure contiguous buffer with numpy .ascontiguousarray()
Hmm this doesn't work. As the exception says, it is because points[:,1] contains values that are not contiguous in memory, and the same goes for points[:,0]. So we do as instructed and convert these slices into contiguous arrays:
x = np.ascontiguousarray(points[:,0])
y = np.ascontiguousarray(points[:,1])
theta = gpu_arctan2(y, x)
_ = plt.hist(theta, bins=200)
And now it works!
As a general rule, one should remember that CUDA operates on data buffers that are contiguous in memory, like a C array, or a numpy array before any slicing.
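You can check contiguity directly from python with the array's flags; a minimal sketch:

```python
import numpy as np

points = np.zeros((1000, 2), dtype=np.float32)

# a column slice strides over memory: its elements are 8 bytes apart
col = points[:, 0]
assert not col.flags['C_CONTIGUOUS']

# ascontiguousarray makes a compact copy, safe to transfer to the GPU
col = np.ascontiguousarray(col)
assert col.flags['C_CONTIGUOUS']
```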
Now let's be a bit more ambitious, and compute theta for 10 million points:
points = np.random.multivariate_normal([0,0], [[1.,0.9], [0.9,1.]], int(1e7)).astype(np.float32)
x = np.ascontiguousarray(points[:,0])
y = np.ascontiguousarray(points[:,1])
_ = plt.hist(gpu_arctan2(y, x), bins=200)
And finally, let's quantify how much time we gain by running on the GPU:
%timeit gpu_arctan2(y, x) # compiled for the GPU
10 loops, best of 3: 36.8 ms per loop
%timeit np.arctan2(y, x) # compiled for the CPU
1 loop, best of 3: 287 ms per loop
%timeit [math.atan2(point[1], point[0]) for point in points] # plain python
1 loop, best of 3: 5.35 s per loop
Nice! This time we gain a factor of almost 8 on the GPU with respect to the numpy version, which is compiled for the CPU. And this comes on top of a factor of almost 20 gained by numpy with respect to plain python.
Remember: To use the GPU efficiently, you need to give it enough data to process, and complicated tasks.
In regular ufuncs, the calculation is done on each element of the input array, and returns a scalar.
In generalized ufuncs (gufuncs), however, the calculation can deal with a sub-array of the input array, and return an array of different dimensions.
Ok, I know that sometimes what I write does not make sense... But at least I realize it! So let's have a look at a few examples, things are going to become much clearer.
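For intuition, numpy itself already ships generalized ufuncs. np.matmul, for instance, has signature (n,k),(k,m)->(n,m): it operates on 2D sub-arrays, and loops over the leading "stack" dimension:

```python
import numpy as np

# a stack of 10 (2,3) matrices times a stack of 10 (3,4) matrices
a = np.ones((10, 2, 3))
b = np.ones((10, 3, 4))
c = np.matmul(a, b)

assert c.shape == (10, 2, 4)   # the sub-array dimensions changed: (2,3)x(3,4)->(2,4)
```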
Remember how we computed the polar angle for our 2D points above? It would have been nice to get both rho and theta from this calculation, to get a real conversion from cartesian to polar coordinates. This kind of thing is not possible with regular ufuncs (or maybe I just don't know how to do it). But with gufuncs, it's easy:
from numba import guvectorize

@guvectorize(['(float32[:], float32[:])'], '(i)->(i)', target='cuda')
def gpu_polar(vec, out):
    x = vec[0]
    y = vec[1]
    out[0] = math.sqrt(x**2 + y**2)
    out[1] = math.atan2(y, x)
There are two important differences between vectorize and guvectorize:

The layout string (i)->(i) means that a 1D array is taken in input, and that a 1D array with the same size is provided in the output. The 1D array corresponds to the last dimension, or innermost dimension, of the input array. For example, our points array is of shape (10000000, 2), so the last dimension is of size 2.

vec, the input array, contains the cartesian coordinates x and y of a given point, and the results are not returned, but written in place into the output array out.
Let's do the conversion between cartesian and polar coordinates:
polar_coords = gpu_polar(points)
_ = plt.hist(polar_coords[:,0], bins=200)
_ = plt.xlabel('R')
_ = plt.hist(polar_coords[:,1], bins=200)
_ = plt.xlabel('theta')
To understand better how gufuncs work, let's make one that computes the average of the values on each line of a 2D array:
@guvectorize(['(float32[:], float32[:])'], '(n)->()', target='cuda')
def gpu_average(array, out):
    acc = 0.
    for val in array:
        acc += val
    out[0] = acc/len(array)
To test our gufunc, we create a 2D array:
a = np.arange(100).reshape(20, 5).astype(np.float32)
a
array([[ 0., 1., 2., 3., 4.], [ 5., 6., 7., 8., 9.], [10., 11., 12., 13., 14.], [15., 16., 17., 18., 19.], [20., 21., 22., 23., 24.], [25., 26., 27., 28., 29.], [30., 31., 32., 33., 34.], [35., 36., 37., 38., 39.], [40., 41., 42., 43., 44.], [45., 46., 47., 48., 49.], [50., 51., 52., 53., 54.], [55., 56., 57., 58., 59.], [60., 61., 62., 63., 64.], [65., 66., 67., 68., 69.], [70., 71., 72., 73., 74.], [75., 76., 77., 78., 79.], [80., 81., 82., 83., 84.], [85., 86., 87., 88., 89.], [90., 91., 92., 93., 94.], [95., 96., 97., 98., 99.]], dtype=float32)
gpu_average(a)

array([ 2., 7., 12., 17., 22., 27., 32., 37., 42., 47., 52., 57., 62., 67., 72., 77., 82., 87., 92., 97.], dtype=float32)
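We can cross-check the gufunc against numpy's own mean on the CPU:

```python
import numpy as np

a = np.arange(100).reshape(20, 5).astype(np.float32)
row_means = a.mean(axis=1)   # CPU reference for the (n)->() reduction

assert list(row_means[:4]) == [2., 7., 12., 17.]
```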
So far, we have run a single function, either a ufunc or a gufunc, on the GPU, but we are not forced to put all of our code in a single function.
Indeed, it is also possible to compile helper functions for the GPU. These functions, called device functions, can then be used on the GPU to make the code cleaner and more modular.
As an example, let's take again the gufunc defined just above, that computes the average of the values of each line of a 2D array.
We define a device function using the cuda.jit decorator, to sum up the elements of a 1D array. Then, we modify the gpu_average gufunc to make use of the add device function. And finally, we create another gufunc to sum up the elements on each line of a 2D array:
from numba import cuda

@cuda.jit(device=True)
def add(array):
    acc = 0.
    for val in array:
        acc += val
    return acc

@guvectorize(['(float32[:], float32[:])'], '(n)->()', target='cuda')
def gpu_average_2(array, out):
    out[0] = add(array)/len(array)

@guvectorize(['(float32[:], float32[:])'], '(n)->()', target='cuda')
def gpu_sum(array, out):
    out[0] = add(array)
gpu_average_2(a)

array([ 2., 7., 12., 17., 22., 27., 32., 37., 42., 47., 52., 57., 62., 67., 72., 77., 82., 87., 92., 97.], dtype=float32)
gpu_sum(a)

array([ 10., 35., 60., 85., 110., 135., 160., 185., 210., 235., 260., 285., 310., 335., 360., 385., 410., 435., 460., 485.], dtype=float32)
The device function allows us to avoid code duplication.
Obviously, it is a bit artificial to use a device function in such an easy case. But when implementing complex algorithms, these functions can prove very useful. Indeed, just like on the CPU, the general principles of programming apply: functions should be simple and to the point, and code duplication should be avoided.
As we have seen in the first ufunc example given in this article (parallel square root calculation), the GPU does not always provide a gain in performance.
Indeed, before using the raw computing power of the GPU, we need to ship the data to the device. And afterwards, we need to get the results back.
A good way to improve performance is to minimize data transfers between the host system and the GPU, and this can be done with device arrays.
To illustrate this, we will use an example provided by nvidia in its DLI course. Let's assume we want to implement a neural network for image processing from scratch. A hidden layer in the network might have to do the following:

normalize the greyscale values of the input pixels;
weigh the normalized values;
apply an activation function to the weighted values.
Each of these three tasks can be done in parallel on the GPU.
But first, let's see how to do that on the CPU with plain numpy. For simplicity, we will generate the greyscale values and the weights randomly.
n = 1000000

# random values between 0. and 255.
greyscales = np.floor(np.random.uniform(0, 256, n).astype(np.float32))

# random weights following a Gaussian distribution
# centred on 0.5 and with width 0.1
weights = np.random.normal(.5, .1, n).astype(np.float32)

def normalize(greyscales):
    return greyscales / 255

def weigh(values, weights):
    return values * weights

def activate(values):
    return ( np.exp(values) - np.exp(-values) ) / \
           ( np.exp(values) + np.exp(-values) )
%%timeit
normalized = normalize(greyscales)
weighted = weigh(normalized, weights)
activated = activate(weighted)
10 loops, best of 3: 28.7 ms per loop
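As an aside, this activate function is the hyperbolic tangent written out explicitly; numpy ships it as np.tanh, as a quick check confirms (we keep the explicit form because the GPU version will need scalar math calls anyway):

```python
import numpy as np

v = np.linspace(-2, 2, 5).astype(np.float32)
manual = (np.exp(v) - np.exp(-v)) / (np.exp(v) + np.exp(-v))

# ( e^x - e^-x ) / ( e^x + e^-x ) is tanh(x)
assert np.allclose(manual, np.tanh(v))
```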
Now, we implement a parallel version of this algorithm for the GPU, as we have seen above.
@vectorize(['float32(float32)'], target='cuda')
def gpu_normalize(x):
    return x / 255

@vectorize(['float32(float32, float32)'], target='cuda')
def gpu_weigh(x, w):
    return x * w

@vectorize(['float32(float32)'], target='cuda')
def gpu_activate(x):
    return ( math.exp(x) - math.exp(-x) ) / ( math.exp(x) + math.exp(-x) )
Please note that the code in these ufuncs operates on scalar values, so we replaced the numpy ufuncs like np.exp by their math equivalent (the division by 255 and the multiplication between the values and the weights were also numpy ufuncs, though hidden a bit).
And we check the performance:
%%timeit
normalized = gpu_normalize(greyscales)
weighted = gpu_weigh(normalized, weights)
activated = gpu_activate(weighted)
The slowest run took 30.06 times longer than the fastest. This could mean that an intermediate result is being cached. 1 loop, best of 3: 12 ms per loop
That's already quite nice, we gained more than a factor of two!
But we realize that we spend time transferring data back and forth between the host and the GPU for nothing:

transfer greyscales to the GPU;
transfer normalized to the host, and then back to the GPU, together with weights;
transfer weighted to the host, and then back to the GPU;
transfer activated to the host.
Actually, we only need to:

transfer greyscales and weights to the GPU;
transfer activated back to the host.
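A rough accounting of the traffic (hypothetical counts, treating every host-device crossing of one 4 MB float32 array equally) shows why this pays off:

```python
array_mb = 1000000 * 4 / 1e6   # one float32 array of 1e6 elements is 4 MB

# naive version: greyscales in; normalized out and back in;
# weights in; weighted out and back in; activated out
naive_mb = 7 * array_mb

# minimal version: greyscales and weights in; activated out
minimal_mb = 3 * array_mb

print(naive_mb, "MB vs", minimal_mb, "MB")
```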
So let's do that:
%%timeit
# create intermediate arrays on the GPU
normalized_gpu = cuda.device_array(shape=(n,), dtype=np.float32)
weighted_gpu = cuda.device_array(shape=(n,), dtype=np.float32)
# note that output device arrays are provided as arguments
gpu_normalize(greyscales, out=normalized_gpu)
gpu_weigh(normalized_gpu, weights, out=weighted_gpu)
activated = gpu_activate(weighted_gpu)
100 loops, best of 3: 6.16 ms per loop
We gain a factor of two by eliminating unnecessary data transfers!
Another important thing to know is that we can also take full control of the transfers to and from the GPU, like this:
# transfer inputs to the gpu
greyscales_gpu = cuda.to_device(greyscales)
weights_gpu = cuda.to_device(weights)

# create intermediate arrays and output array on the GPU
normalized_gpu = cuda.device_array(shape=(n,), dtype=np.float32)
weighted_gpu = cuda.device_array(shape=(n,), dtype=np.float32)
activated_gpu = cuda.device_array(shape=(n,), dtype=np.float32)
Now that everything we need is on the GPU, we do the calculation:
%%timeit
gpu_normalize(greyscales_gpu, out=normalized_gpu)
gpu_weigh(normalized_gpu, weights_gpu, out=weighted_gpu)
gpu_activate(weighted_gpu, out=activated_gpu)
1000 loops, best of 3: 1.32 ms per loop
You might be thinking that this factor 5 gain is artificial, because we did not include the necessary transfer times for the input and output data. That's right! Still, this illustrates how you can take full control of your data transfers, which might prove useful in more complex processing workflows.
For instance, if we wanted to re-use any of the device arrays defined above, we could do it now, as they are still residing on the GPU as I'm writing this!
Finally, here is how to retrieve the results:
activated = activated_gpu.copy_to_host()
activated
array([0.27804664, 0.40110067, 0.00198978, ..., 0.27639735, 0.41989788, 0.22209653], dtype=float32)
In this post, you have learnt how to:

create numpy ufuncs compiled for the GPU with numba's vectorize decorator;
write generalized ufuncs (gufuncs) with guvectorize;
compile device functions to make your GPU code more modular;
minimize data transfers between the host and the GPU with device arrays.
In the next part of this tutorial series, we will dig deeper and see how to write our own CUDA kernels for the GPU, effectively using it as a tiny highly-parallel computer!
Please let me know what you think in the comments! I’ll try and answer all questions.
And if you liked this article, you can subscribe to my mailing list to be notified of new posts (no more than one mail per week I promise.)