====
GPUs
====

.. .. contents::


Charm4py supports GPUs through the Charm++ HAPI (Hybrid API) interface.
Currently, this support provides asynchronous completion detection of GPU kernels through Charm4py futures,
using the function ``charm.hapiAddCudaCallback``.

The HAPI Charm4py API is:

.. code-block:: python

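    # stream: the CUDA stream to insert the callback into
    # future: the Charm4py future that is set when the callback is reached
    def hapiAddCudaCallback(stream, future):
        ...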

.. note::

    For now, ``charm.hapiAddCudaCallback`` supports only Numba and PyTorch streams as input. The function inserts a callback
    into the stream; when the stream reaches the callback, the corresponding Charm4py future is set.
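
The ordering behavior described in the note can be illustrated without a GPU. The following is a minimal CPU-only sketch, not Charm4py code: ``FakeStream`` and its methods are hypothetical stand-ins invented for this illustration, and a standard-library ``concurrent.futures.Future`` takes the place of a Charm4py future. It only shows that a callback enqueued after earlier work fires once that work has run.

.. code-block:: python

    import queue
    import threading
    from concurrent.futures import Future


    class FakeStream:
        """Hypothetical CPU stand-in for a CUDA stream: runs items in FIFO order."""

        def __init__(self):
            self._work = queue.Queue()
            self._worker = threading.Thread(target=self._run, daemon=True)
            self._worker.start()

        def _run(self):
            while True:
                item = self._work.get()
                if item is None:  # sentinel: stop the worker thread
                    break
                item()

        def enqueue(self, fn):
            # Stand-in for launching a kernel on the stream.
            self._work.put(fn)

        def add_callback(self, future):
            # Stand-in for charm.hapiAddCudaCallback(stream, future):
            # the future is set only when the stream reaches this point.
            self._work.put(lambda: future.set_result(None))

        def close(self):
            self._work.put(None)
            self._worker.join()


    results = []
    stream = FakeStream()
    stream.enqueue(lambda: results.append("kernel 1"))
    stream.enqueue(lambda: results.append("kernel 2"))

    done = Future()
    stream.add_callback(done)
    done.result(timeout=5)  # blocks until both stand-in kernels have run

    completed_before_callback = list(results)
    stream.close()

With a real stream the pattern has the same shape: enqueue the GPU work, call ``charm.hapiAddCudaCallback`` with the stream and a future, then wait on the future.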

Enabling HAPI
-------------
To build Charm4py with HAPI support, add ``cuda`` to the Charm++ build options and follow the steps to build Charm4py from source:

.. code-block:: shell

   export CHARM_EXTRA_BUILD_OPTS="cuda"
   pip install .

.. warning:: 

    To ensure that the underlying Charm++ build has CUDA enabled, remove any pre-existing builds in ``charm_src/charm`` before setting the CUDA option and running the install.

Examples
--------

.. code-block:: python

    from charm4py import charm
    import time
    import numba.cuda as cuda
    import numpy as np

    @cuda.jit
    def elementwise_sum_kernel(x_in, x_out):
        idx = cuda.grid(1)
        if idx < x_in.shape[0]:
            x_out[idx] = x_in[idx] + x_in[idx]

    def main(args):
        N = 1_000_000
        array_size = (N,)

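        s = cuda.stream()
        # Integer value of the underlying CUDA stream handle; this is what
        # gets passed to charm.hapiAddCudaCallback below.
        stream_handle = s.handle.value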

        A_host = np.arange(N, dtype=np.float32)

        A_gpu = cuda.device_array(array_size, dtype=np.float32, stream=s)
        B_gpu = cuda.device_array(array_size, dtype=np.float32, stream=s)
        A_gpu.copy_to_device(A_host, stream=s)

        threads_per_block = 128
        blocks_per_grid = (N + (threads_per_block - 1)) // threads_per_block

        print("Launching kernel and inserting callback...")
        start_time = time.perf_counter()
        elementwise_sum_kernel[blocks_per_grid, threads_per_block, s](A_gpu, B_gpu)

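        # Insert a callback into the stream right after the kernel launch.
        # The future is set once the stream reaches the callback, that is,
        # after the kernel has finished.
        return_fut = charm.Future()
        charm.hapiAddCudaCallback(stream_handle, return_fut)
        return_fut.get()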
        kernel_done_time = time.perf_counter()
        print(f"Callback received, kernel finished in {kernel_done_time - start_time:.6f} seconds.")

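        # Copy the result back on the same stream, then wait for the copy
        # to finish before reading B_host.
        B_host = B_gpu.copy_to_host(stream=s)

        s.synchronize()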

        sum_result = np.sum(B_host)
        print(f"Sum of result is {sum_result}")

        charm.exit()

    charm.start(main)


The example above uses the Charm4py HAPI interface to insert a callback into a CUDA stream and wait, through a
Charm4py future, for a Numba kernel launch to complete.
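
The launch configuration in the example rounds the block count up so that every element gets a thread, which is why the kernel checks ``idx`` against the array length. The arithmetic can be checked on the host with a small sketch; ``blocks_for`` is a hypothetical helper name used only here.

.. code-block:: python

    def blocks_for(n_elements, threads_per_block):
        """Smallest number of blocks whose threads cover n_elements."""
        return (n_elements + threads_per_block - 1) // threads_per_block


    N = 1_000_000
    threads_per_block = 128

    blocks = blocks_for(N, threads_per_block)
    total_threads = blocks * threads_per_block

    # Threads launched beyond the end of the array; the bounds check in the
    # kernel makes these threads do nothing.
    surplus_threads = total_threads - N

    print(blocks, total_threads, surplus_threads)  # 7813 1000064 64

For the values in the example, 7813 blocks of 128 threads are launched, and the last 64 threads fall outside the array.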