I’d like to show how to use an HPC part written in C++ with CUDA from Python code. This way, every heavy computation can be done on the GPU with CUDA, while all the gluing tasks (including beautiful matplotlib plots) are done on the CPU in Python.

We will load a shared object, compiled from the C++ CUDA code, into Python. The only tricky part here is the conversion of types from «high-level» Python ones to «low-level» C++ ones. As an example, we will write an application that computes the elementwise sum of two arrays in parallel.
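
To make the type conversion concrete, here is a minimal sketch that needs neither a GPU nor the shared object. It assumes NumPy is installed, and the array `a` with its values is made up for illustration. It shows that a float32 NumPy array can be handed to C as a `float *` without copying, and that writes through the pointer are visible in the array.

```python
import ctypes
import numpy as np

# A contiguous float32 NumPy array has the same memory layout as a C float[] buffer.
a = np.arange(4, dtype=np.float32)

# View the array's buffer as a C `float *` (no copy is made).
a_p = a.ctypes.data_as(ctypes.POINTER(ctypes.c_float))

# Reading through the pointer returns the array's elements.
first_three = [a_p[i] for i in range(3)]

# Writing through the pointer changes the NumPy array itself, which is how
# a C function can return results in a preallocated output array.
a_p[0] = 42.0
```

The same mechanism is what lets a C function fill an output array that was allocated on the Python side.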

First, the CUDA code. The CUDA kernel cuda_sum_kernel does the work on the GPU; the wrapper cuda_sum copies the arrays to the GPU, launches the kernel, copies the result back, and frees the GPU memory after the calculation is done. Note the extern "C" line. It disables C++ name mangling, so the function is exported under the plain name cuda_sum in the compiled shared object.

#include <cuda.h>
#include <cuda_runtime_api.h>

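__global__ void cuda_sum_kernel(float *a, float *b, float *c, size_t size)
{
    // global index of this thread across all blocks
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= size) {
        return;  // threads past the end of the arrays do nothing
    }

    c[idx] = a[idx] + b[idx];
}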

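extern "C" {
void cuda_sum(float *a, float *b, float *c, size_t size)
{
    float *d_a, *d_b, *d_c;

    // allocate device buffers
    cudaMalloc((void **)&d_a, size * sizeof(float));
    cudaMalloc((void **)&d_b, size * sizeof(float));
    cudaMalloc((void **)&d_c, size * sizeof(float));

    // copy the inputs from host to device
    cudaMemcpy(d_a, a, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size * sizeof(float), cudaMemcpyHostToDevice);

    // enough blocks of 256 threads to cover all elements
    cuda_sum_kernel <<< ceil(size / 256.0), 256 >>> (d_a, d_b, d_c, size);

    // copy the result back to the host (this also waits for the kernel to finish)
    cudaMemcpy(c, d_c, size * sizeof(float), cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
}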

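The launch configuration and the bounds check in the kernel can be verified with plain arithmetic. The sketch below emulates the index calculation on the CPU in Python (assuming Python 3). `launch_indices` is a hypothetical helper written for illustration, not part of CUDA, and 256 is the block size used in the kernel launch above.

```python
import math

BLOCK_SIZE = 256  # threads per block, as in the kernel launch above


def launch_indices(size, block_size=BLOCK_SIZE):
    """Emulate idx = blockIdx.x * blockDim.x + threadIdx.x on the CPU.

    Returns (number of blocks, threads that do work, threads that the
    `idx >= size` guard turns away).
    """
    num_blocks = math.ceil(size / block_size)
    active, idle = 0, 0
    for block_idx in range(num_blocks):
        for thread_idx in range(block_size):
            idx = block_idx * block_size + thread_idx
            if idx >= size:
                idle += 1      # the kernel returns early here
            else:
                active += 1    # the kernel computes c[idx]
    return num_blocks, active, idle


print(launch_indices(1000))   # (4, 1000, 24)
print(launch_indices(1024))   # (4, 1024, 0)
```

When the array length is not a multiple of the block size, the last block contains threads whose index falls past the end of the arrays, which is why the kernel needs the guard.
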
Compile it into a shared object (*.so file) with the nvcc compiler:

/usr/local/cuda/bin/nvcc -Xcompiler -fPIC -shared -o cuda_sum.so cuda_sum.cu

The last step is to call the function cuda_sum from the generated cuda_sum.so file in a Python script. Example (with comments):

import numpy as np
import ctypes
from ctypes import *

# load the shared object cuda_sum.so and get a handle to its cuda_sum function
def get_cuda_sum():
    dll = ctypes.CDLL('./cuda_sum.so', mode=ctypes.RTLD_GLOBAL)
    func = dll.cuda_sum
    # declare the C signature: void cuda_sum(float *, float *, float *, size_t)
    func.argtypes = [POINTER(c_float), POINTER(c_float), POINTER(c_float), c_size_t]
    func.restype = None
    return func

# create __cuda_sum function with get_cuda_sum()
__cuda_sum = get_cuda_sum()

# convenient Python wrapper for __cuda_sum
# it does all the type conversion
# from Python types to C++ ones
def cuda_sum(a, b, c, size):
    a_p = a.ctypes.data_as(POINTER(c_float))
    b_p = b.ctypes.data_as(POINTER(c_float))
    c_p = c.ctypes.data_as(POINTER(c_float))

    __cuda_sum(a_p, b_p, c_p, size)

# test: sum two arrays of ones and print the head of the resulting array
if __name__ == '__main__':
    size = 1024 * 1024  # number of elements (example value)

    a = np.ones(size).astype('float32')
    b = np.ones(size).astype('float32')
    c = np.zeros(size).astype('float32')

    cuda_sum(a, b, c, size)

    print(c[:10])

Now you can turn the code above into a full-featured Python module.
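
As one possible step in that direction, the wrapper could validate its inputs before passing raw pointers to C, because ctypes does not check dtype, shape, or memory layout. The helpers `as_c_float_array` and `check_same_length` below are hypothetical names introduced for illustration. The sketch assumes NumPy only and does not call the GPU code.

```python
import numpy as np


def as_c_float_array(x):
    """Return a C-contiguous float32 array (copying only if needed)."""
    return np.ascontiguousarray(x, dtype=np.float32)


def check_same_length(a, b):
    """Return the common length of two 1-D arrays, or raise ValueError."""
    if a.ndim != 1 or b.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two 1-D arrays of equal length")
    return a.shape[0]


# float64 input is converted; a strided view becomes contiguous
a = as_c_float_array(np.ones(8))          # float64 -> float32
b = as_c_float_array(np.arange(16)[::2])  # non-contiguous int view -> float32
size = check_same_length(a, b)
```

With helpers like these, cuda_sum could allocate the output itself with `np.empty_like(a)` and return it, instead of requiring a preallocated array and an explicit size.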