cupy

CuPy is a GPU array backend that implements many of the most commonly used NumPy (and some SciPy) functions. The following plot illustrates the CuPy speedup for the creation, and for the creation plus inversion, of square random arrays of different sizes (figure: CuPySpeedup).

CuPy has a built-in experimental profiler (cupyx.time) that can accurately measure the GPU runtime. For simplicity, however, the benchmark here measures only the elapsed wall-clock time, reported as the mean over multiple iterations. The overhead of the first invocation is excluded, but the transfer to the host is not.
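The measurement scheme described above (time several calls, drop the first one, report mean and standard deviation) can be sketched with the standard library alone. This is only an illustration: `time_call` and its `sync` argument are hypothetical names, not part of CuPy. For GPU work one would presumably pass `cp.cuda.Stream.null.synchronize` as `sync`, so that the timer stops only after the device has finished.

```python
import statistics
import time


def time_call(func, n_iter=20, sync=None):
    """Return (mean, std) of the elapsed time of func() in seconds.

    The first call is treated as warm-up and discarded.
    `sync` is an optional callable run after func() and before the
    timer stops (hypothetical hook for a GPU synchronization call).
    """
    if n_iter < 2:
        raise ValueError("n_iter must be at least 2 (one warm-up + one sample)")

    samples = []
    for _ in range(n_iter):
        start = time.perf_counter()
        func()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)

    samples = samples[1:]  # drop the warm-up iteration
    # population standard deviation, the same convention as np.std
    return statistics.mean(samples), statistics.pstdev(samples)


mean, std = time_call(lambda: sum(range(10_000)))
print(f'{mean:5.3e}   {std:5.3e}')
```

`time.perf_counter` is used here instead of `time.time` because it is a monotonic, high-resolution clock, which matters for the microsecond-scale timings of small arrays.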

While this is all impressive, it is worth noting that NumPy inversions can be made faster (by a factor of a few) by skipping the re-creation of the identity matrix (calling np.linalg.solve directly) or by calling the bare LAPACK routine directly.
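As a rough sketch of the np.linalg.solve route mentioned above: the identity matrix is built once and reused, and the inverse is obtained by solving A X = I. The matrix size, the seed, and the diagonal shift that keeps A well conditioned are assumptions made for this illustration; they are not part of the benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
n = 200

# Random square matrix; the diagonal shift makes it diagonally dominant
# and therefore safely invertible (an assumption of this sketch).
A = rng.random((n, n)) + n * np.eye(n)

# Build the identity once and reuse it for every inversion of this size.
identity = np.eye(n)

A_inv_solve = np.linalg.solve(A, identity)  # solve A X = I for X
A_inv_ref = np.linalg.inv(A)                # reference result

max_diff = float(np.max(np.abs(A_inv_solve - A_inv_ref)))
print(f'max |solve - inv| = {max_diff:.2e}')
```

Whether this is actually faster depends on the NumPy build, the underlying LAPACK library, and the matrix size, so it is worth timing on your own setup before relying on it.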

All calculations were performed on a Google Colab instance with:

CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
GPU_0: Tesla T4

numpy == 1.19.5
cupy == 9.1.0
cuda == 11.0.221
import numpy as np
import cupy  as cp
import time


print(f' numpy: {np.__version__}')
print(f' cupy:  {cp.__version__}')
!nvcc --version
>>> numpy: 1.19.5
>>> cupy:  9.1.0
>>> nvcc: NVIDIA (R) Cuda compiler driver
>>> Copyright (c) 2005-2020 NVIDIA Corporation
>>> Built on Wed_Jul_22_19:09:09_PDT_2020
>>> Cuda compilation tools, release 11.0, V11.0.221
>>> Build cuda_11.0_bu.TC445_37.28845127_0


!nvidia-smi -L
!grep -m 1 'model name' /proc/cpuinfo
!awk '/MemFree/ { printf "RAM: %.3f GB \n", $2/1024/1024 }' /proc/meminfo
>>> GPU 0: Tesla T4 (UUID: GPU-d5b82767-6438-f1e7-be11-e7a979c96612)
>>> model name      : Intel(R) Xeon(R) CPU @ 2.20GHz
>>> RAM: 10.044 GB


"""
 create, or create + invert (invert=True), a square 2d
 array of size (sz, sz) with random values through
 numpy - f_numpy or
 cupy  - f_cupy
"""
def f_cupy(sz, invert=False):
  A = cp.random.random((sz, sz))
  cp.cuda.Stream.null.synchronize()
  if invert:
    return cp.linalg.inv(A)
  else:
    return A

def f_numpy(sz, invert=False):
  A = np.random.random((sz, sz))
  if invert:
    return np.linalg.inv(A)
  else:
    return A

def measure(pckg, sz, invert=False, Niter=20):
  """
   perform (Niter) iterations and report mean/std of the
   elapsed time for each size (sz), excluding the first
   (warm-up) iteration
  """
  def _measureOneIter(pckg, sz, invert):
    # measure a single iteration by elapsed wall-clock time
    start = time.time()
    if pckg == 'cupy':
      f_cupy(sz, invert)
    elif pckg == 'numpy':
      f_numpy(sz, invert)
    return time.time() - start

  # time every iteration, then drop the first one (warm-up overhead)
  t = [_measureOneIter(pckg, sz, invert) for _ in range(Niter)]
  t = t[1:]

  print(f'{sz:5d}   {np.mean(t):5.3e}   {np.std(t):5.3e}')


sizes = [int(10**(i/2)) for i in range(9)]


# creation
for sz in sizes: measure('numpy', sz)
>>>     1   9.700e-07   3.763e-06
>>>     3   2.987e-06   2.879e-06
>>>    10   2.121e-06   9.214e-07
>>>    31   1.856e-05   3.479e-05
>>>   100   7.099e-05   6.601e-06
>>>   316   7.554e-04   4.299e-05
>>>  1000   7.384e-03   7.381e-04
>>>  3162   7.268e-02   2.791e-03
>>> 10000   7.127e-01   9.859e-03


for sz in sizes: measure('cupy',  sz)
>>>     1   2.935e-05   1.713e-05
>>>     3   5.860e-05   9.230e-05
>>>    10   3.367e-05   2.097e-05
>>>    31   2.582e-05   5.631e-06
>>>   100   2.965e-05   1.544e-05
>>>   316   3.486e-05   8.348e-06
>>>  1000   1.656e-04   3.200e-06
>>>  3162   1.448e-03   5.501e-06
>>> 10000   1.233e-02   1.348e-03


# creation + inversion
for sz in sizes: measure('numpy', sz, invert=True)
>>>     1   2.010e-05   8.241e-06
>>>     3   2.797e-05   2.049e-05
>>>    10   2.658e-05   6.128e-05
>>>    31   7.163e-05   1.711e-05
>>>   100   4.809e-04   4.052e-05
>>>   316   6.534e-03   1.246e-03
>>>  1000   1.061e-01   3.304e-03
>>>  3162   2.454e+00   4.122e-02
>>> 10000   6.855e+01   8.369e-01


for sz in sizes: measure('cupy',  sz, invert=True)
>>>     1   1.604e-04   4.458e-05
>>>     3   1.720e-04   2.769e-05
>>>    10   2.592e-04   1.478e-05
>>>    31   4.638e-04   5.610e-06
>>>   100   2.029e-03   2.060e-05
>>>   316   7.933e-03   2.112e-03
>>>  1000   2.781e-02   2.234e-03
>>>  3162   4.750e-01   3.144e-03
>>> 10000   1.221e+01   1.113e-01
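
The speedup per size follows directly from the mean times printed above, as the ratio of the NumPy time to the CuPy time. A small sketch, with the numbers copied from the four result tables (the variable and function names are chosen for this illustration):

```python
sizes = [1, 3, 10, 31, 100, 316, 1000, 3162, 10000]

# Mean elapsed times in seconds, copied from the printed results.
numpy_create = [9.700e-07, 2.987e-06, 2.121e-06, 1.856e-05, 7.099e-05,
                7.554e-04, 7.384e-03, 7.268e-02, 7.127e-01]
cupy_create = [2.935e-05, 5.860e-05, 3.367e-05, 2.582e-05, 2.965e-05,
               3.486e-05, 1.656e-04, 1.448e-03, 1.233e-02]
numpy_invert = [2.010e-05, 2.797e-05, 2.658e-05, 7.163e-05, 4.809e-04,
                6.534e-03, 1.061e-01, 2.454e+00, 6.855e+01]
cupy_invert = [1.604e-04, 1.720e-04, 2.592e-04, 4.638e-04, 2.029e-03,
               7.933e-03, 2.781e-02, 4.750e-01, 1.221e+01]


def speedups(t_numpy, t_cupy):
    """Map each size to the ratio t_numpy / t_cupy (> 1 means CuPy is faster)."""
    return {sz: tn / tc for sz, tn, tc in zip(sizes, t_numpy, t_cupy)}


create_speedup = speedups(numpy_create, cupy_create)
invert_speedup = speedups(numpy_invert, cupy_invert)

print(' size   creation   creation+inversion')
for sz in sizes:
    print(f'{sz:5d}   {create_speedup[sz]:8.2f}   {invert_speedup[sz]:8.2f}')
```

For the largest size (10000) this gives roughly 58x for creation and 5.6x for creation plus inversion. For small arrays the ratio is below 1, meaning NumPy is faster: CuPy only pulls ahead between sizes 31 and 100 for creation, and between 316 and 1000 for creation plus inversion.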