cupy
CuPy is a GPU array backend that implements many of the most commonly used NumPy (and some SciPy) functions. The following plot illustrates the CuPy speedup for the creation, and the creation plus inversion, of square arrays of different sizes:
CuPy has a built-in experimental profiler (cupyx.time) that can accurately assess the GPU runtime. For simplicity, however, the benchmark here only measures the elapsed wall-clock time, reported as the mean over multiple iterations. The overhead of the first invocation is excluded, but the transfer to the host is not.
While this is all impressive, it is worth noting that NumPy inversions can be made faster (by a factor of a few) by skipping the re-creation of the identity matrix (calling np.linalg.solve directly) or by calling the bare LAPACK routine directly.
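One way to read the remark about np.linalg.solve is sketched below: the identity matrix is built once and passed to np.linalg.solve, and the result is compared against np.linalg.inv. This is only a minimal illustration of the equivalence, not a timing claim; the matrix size, the seed, and the diagonal shift that keeps the test matrix well conditioned are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 200                         # hypothetical size for this demo

# Shift the diagonal so the random matrix is well conditioned
# (an assumption for this demo; the benchmark uses plain random matrices).
A = rng.random((n, n)) + n * np.eye(n)

# Build the identity once; it could be reused for every inversion of this size.
identity = np.eye(n)

# Reference: explicit inverse.
A_inv_ref = np.linalg.inv(A)

# Alternative: solve A X = I directly with the pre-built identity.
A_inv_solve = np.linalg.solve(A, identity)

max_diff = float(np.max(np.abs(A_inv_ref - A_inv_solve)))
print(f"max |inv(A) - solve(A, I)| = {max_diff:.2e}")
```

Whether this route is actually faster depends on the NumPy build and the matrix size, so it is worth timing on the target machine before relying on it.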
All calculations were performed on a Google Colab instance with:
CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
GPU_0: Tesla T4
numpy == 1.19.5
cupy == 9.1.0
cuda == 11.0.221
import numpy as np
import cupy as cp
import time
print(f' numpy: {np.__version__}')
print(f' cupy: {cp.__version__}')
!nvcc --version
>>> numpy: 1.19.5
>>> cupy: 9.1.0
>>> nvcc: NVIDIA (R) Cuda compiler driver
>>> Copyright (c) 2005-2020 NVIDIA Corporation
>>> Built on Wed_Jul_22_19:09:09_PDT_2020
>>> Cuda compilation tools, release 11.0, V11.0.221
>>> Build cuda_11.0_bu.TC445_37.28845127_0
!nvidia-smi -L
!grep -m 1 'model name' /proc/cpuinfo
!awk '/MemFree/ { printf "RAM: %.3f GB \n", $2/1024/1024 }' /proc/meminfo
>>> GPU 0: Tesla T4 (UUID: GPU-d5b82767-6438-f1e7-be11-e7a979c96612)
>>> model name : Intel(R) Xeon(R) CPU @ 2.20GHz
>>> RAM: 10.044 GB
"""
create or create+invert (invert=True) a square 2d
array of size (sz, sz) with random values through
    numpy - f_numpy or
    cupy  - f_cupy
"""
def f_cupy(sz, invert=False):
    A = cp.random.random((sz, sz))
    # wait for the GPU to finish creating the array
    cp.cuda.Stream.null.synchronize()
    if invert:
        return cp.linalg.inv(A)
    else:
        return A

def f_numpy(sz, invert=False):
    A = np.random.random((sz, sz))
    if invert:
        return np.linalg.inv(A)
    else:
        return A
def measure(pckg, sz, invert=False, Niter=20):
    """
    perform (Niter) iterations and report mean/std
    for each size (sz) excluding the first overhead
    """
    def _measureOneIter(pckg, sz, invert):
        # measure single iteration by elapsed time
        start = time.time()
        if pckg == 'cupy': f_cupy(sz, invert)
        if pckg == 'numpy': f_numpy(sz, invert)
        return time.time() - start

    t = [_measureOneIter(pckg, sz, invert) for _ in range(Niter)]
    t = t[1:]  # drop the first iteration (start-up overhead)
    print(f'{sz:5d} {np.mean(t):5.3e} {np.std(t):5.3e}')
sizes = [int(10**(i/2)) for i in range(9)]
# creation
for sz in sizes: measure('numpy', sz)
>>> 1 9.700e-07 3.763e-06
>>> 3 2.987e-06 2.879e-06
>>> 10 2.121e-06 9.214e-07
>>> 31 1.856e-05 3.479e-05
>>> 100 7.099e-05 6.601e-06
>>> 316 7.554e-04 4.299e-05
>>> 1000 7.384e-03 7.381e-04
>>> 3162 7.268e-02 2.791e-03
>>> 10000 7.127e-01 9.859e-03
for sz in sizes: measure('cupy', sz)
>>> 1 2.935e-05 1.713e-05
>>> 3 5.860e-05 9.230e-05
>>> 10 3.367e-05 2.097e-05
>>> 31 2.582e-05 5.631e-06
>>> 100 2.965e-05 1.544e-05
>>> 316 3.486e-05 8.348e-06
>>> 1000 1.656e-04 3.200e-06
>>> 3162 1.448e-03 5.501e-06
>>> 10000 1.233e-02 1.348e-03
# creation + inversion
for sz in sizes: measure('numpy', sz, invert=True)
>>> 1 2.010e-05 8.241e-06
>>> 3 2.797e-05 2.049e-05
>>> 10 2.658e-05 6.128e-05
>>> 31 7.163e-05 1.711e-05
>>> 100 4.809e-04 4.052e-05
>>> 316 6.534e-03 1.246e-03
>>> 1000 1.061e-01 3.304e-03
>>> 3162 2.454e+00 4.122e-02
>>> 10000 6.855e+01 8.369e-01
for sz in sizes: measure('cupy', sz, invert=True)
>>> 1 1.604e-04 4.458e-05
>>> 3 1.720e-04 2.769e-05
>>> 10 2.592e-04 1.478e-05
>>> 31 4.638e-04 5.610e-06
>>> 100 2.029e-03 2.060e-05
>>> 316 7.933e-03 2.112e-03
>>> 1000 2.781e-02 2.234e-03
>>> 3162 4.750e-01 3.144e-03
>>> 10000 1.221e+01 1.113e-01
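The four timing tables above can be condensed into NumPy-to-CuPy speedup ratios. The sketch below copies the mean elapsed times from those tables and divides them element-wise; the helper names speedup and first_faster are hypothetical and exist only for this illustration. The standard deviations are ignored, so ratios for the smallest sizes should be read loosely.

```python
# Mean elapsed times in seconds, copied from the tables above.
sizes = [1, 3, 10, 31, 100, 316, 1000, 3162, 10000]
numpy_create = [9.700e-07, 2.987e-06, 2.121e-06, 1.856e-05, 7.099e-05,
                7.554e-04, 7.384e-03, 7.268e-02, 7.127e-01]
cupy_create = [2.935e-05, 5.860e-05, 3.367e-05, 2.582e-05, 2.965e-05,
               3.486e-05, 1.656e-04, 1.448e-03, 1.233e-02]
numpy_invert = [2.010e-05, 2.797e-05, 2.658e-05, 7.163e-05, 4.809e-04,
                6.534e-03, 1.061e-01, 2.454e+00, 6.855e+01]
cupy_invert = [1.604e-04, 1.720e-04, 2.592e-04, 4.638e-04, 2.029e-03,
               7.933e-03, 2.781e-02, 4.750e-01, 1.221e+01]


def speedup(cpu_times, gpu_times):
    """Ratio of CPU time to GPU time per size (> 1 means the GPU run was faster)."""
    return [c / g for c, g in zip(cpu_times, gpu_times)]


def first_faster(ratios):
    """Smallest size at which the ratio exceeds 1, or None if it never does."""
    return next((sz for sz, r in zip(sizes, ratios) if r > 1), None)


create_ratio = speedup(numpy_create, cupy_create)
invert_ratio = speedup(numpy_invert, cupy_invert)

print(f"{'size':>6} {'create':>10} {'create+inv':>12}")
for sz, rc, ri in zip(sizes, create_ratio, invert_ratio):
    print(f"{sz:6d} {rc:10.2f} {ri:12.2f}")

print("CuPy first faster (creation):", first_faster(create_ratio))
print("CuPy first faster (creation + inversion):", first_faster(invert_ratio))
```

With these numbers, the ratio first exceeds 1 at size 100 for creation and at size 1000 for creation plus inversion; below those sizes the GPU call overhead outweighs the gain.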