qCUDA: GPGPU Virtualization at a New API Remoting Method with Para-virtualization

qCUDA

qCUDA is built on the virtio framework: a para-virtualized driver serves as the front end and a device module serves as the back end, and together they handle API remoting and memory management. In our test environment, qCUDA achieves more than 95% of native bandwidth in most tests. In addition, compared with prior work, qCUDA offers more flexibility and interposition: it can run CUDA-compatible programs in both Linux and Windows VMs on the QEMU-KVM hypervisor for GPGPU virtualization.

System Components

The qCUDA framework has three components: qCUlibrary, qCUdriver, and qCUdevice. Their functions are as follows:

qCUlibrary (qcu-library) – The interposer library in the VM (guest OS). It provides access to the CUDA runtime and a memory-allocation interface, builds the qCUDA command (qCUcmd), and passes the qCUcmd to the qCUdriver.
qCUdriver (qcu-driver) – The front-end driver. It is responsible for memory management and data movement, analyzes the qCUcmd received from the qCUlibrary, and passes it over the control channel connected to the qCUdevice.
qCUdevice (qcu-device) – The virtual device that acts as the back end. It sends and receives qCUcmds through the control channel and, depending on the qCUcmd received, triggers the related operations in the host: registering the GPU binary, converting guest physical addresses (GPA) into host virtual addresses (HVA), and handling the CUDA runtime/driver APIs that access the GPU.

Installation

Prerequisites

Host

This branch

CUDA 9.0 (for Ubuntu 17.04, gcc version=5.5.0)
Ubuntu 18.04 LTS (GNU/Linux 4.15.0-136-generic x86_64)
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64"
export CUDA_HOME=/usr/local/cuda
Install required packages

sudo apt install -y pkg-config bridge-utils uml-utilities zlib1g-dev libglib2.0-dev autoconf \
automake libtool libsdl1.2-dev libsasl2-dev libcurl4-openssl-dev libaio-dev libvde-dev \
libsdl2-dev libattr1-dev libbrlapi-dev libcap-ng-dev libgnutls28-dev libgtk-3-dev libiscsi-dev liblttng-ust-dev \
libncurses5-dev libnfs-dev libnss3-dev libpixman-1-dev libpng-dev librados-dev libseccomp-dev \
libspice-protocol-dev libspice-server-dev libssh2-1-dev liburcu-dev libusb-1.0-0-dev libvte-dev sparse uuid-dev \
flex bison

Guest

Ubuntu 18.04 LTS image (guest OS)

How to install

Host

qcu-device is a modified version of QEMU 2.12.1; for further information, refer to the QEMU installation steps.

Clone this repo.
cd qcu-device
mkdir build && pushd build
../configure --enable-cuda --target-list=x86_64-softmmu && make -j16
sudo make install
popd
sudo mkdir /dev/qcuvf
sudo chmod 777 /dev/qcuvf

Guest

Clone this repo.
Enter qcu-driver and execute the commands:
- make all
- make i
Enter qcu-library and execute the commands:
- make all
- make install

A CUDA sample in guest OS

In the guest OS, nvcc compiles source files containing host and device code that use the standard CUDA runtime APIs. Unlike on a native OS, a CUDA program compiled in the guest VM must be built with the nvcc flag "-cudart=shared" so that it is dynamically linked against the qCUlibrary as a shared library. The qCUlibrary provides wrapper functions that intercept the CPU code's dynamic memory allocation and the CUDA runtime APIs. For instance, take cdpSimpleQuicksort from the CUDA samples in the NVIDIA CUDA SDK: after installing qCUdriver and qCUlibrary in the guest OS, go to the cdpSimpleQuicksort directory and modify the internal flags in the Makefile as below:

# internal flags
NVCCFLAGS   := -m${TARGET_SIZE} -cudart=shared

Finally, run make and run the resulting executable; no source code changes are needed.

How to use qCUDA framework

The current version of qCUDA implements 32 CUDA runtime APIs, listed in the table below:

Classification	CUDA runtime API on qCUDA
Memory Management	cudaMalloc
	cudaMemset
	cudaMemcpy
	cudaMemcpyAsync
	cudaFree
Device Management	cudaGetDevice
	cudaGetDeviceCount
	cudaSetDevice
	cudaGetDeviceProperties
	cudaDeviceSynchronize
	cudaDeviceReset
Version Management	cudaDriverGetVersion
	cudaRuntimeGetVersion
Stream Management	cudaStreamCreate
	cudaStreamDestroy
	cudaStreamSynchronize
Event Management	cudaEventCreate
	cudaEventCreateWithFlags
	cudaEventRecord
	cudaEventSynchronize
	cudaEventElapsedTime
	cudaEventDestroy
Error Handling	cudaGetLastError
Zero-copy	cudaHostRegister
	cudaHostGetDevicePointer
	cudaHostUnregister
	cudaSetDeviceFlags
Thread Management	cudaThreadSynchronize
Module & Execution Control	cudaRegisterFatBinary
	cudaUnregisterFatBinary
	cudaRegisterFunction
	cudaLaunch

By design, adding a new CUDA runtime API to the framework is straightforward. Only the three main components described in the previous section need to be modified: qCUlibrary, qCUdriver, and qCUdevice. Their source code is located in "qcu-library/libcudart.c", "qcu-driver/qcuda_driver.c", and "qcu-device/hw/misc/virtio-qcuda.c", respectively; the command identifier for the new API is added to the shared header "qcuda_common.h". A programmer who wants to make a new CUDA API available to users in the guest OS should follow the conventions of the qCUDA framework when modifying these files. The example below shows how a CUDA runtime API, "cudaThreadSynchronize", was added and which files were modified:

qcu-library/libcudart.c

The qCUlibrary component of the qCUDA system provides the interface that wraps the CUDA runtime APIs. A CUDA application in the guest links against the functions implemented in "libcudart.c". The CUDA function "cudaThreadSynchronize" is added as follows:

cudaError_t cudaThreadSynchronize() {
    VirtioQCArg arg;
    memset(&arg, 0, sizeof(VirtioQCArg));
    send_cmd_to_device(VIRTQC_cudaThreadSynchronize, &arg);
    return (cudaError_t)arg.cmd;
}

The qCUDA command, qCUcmd, is represented by VirtioQCArg, a structure defined in "qcuda_common.h". This structure serves as the buffer that carries the parameters exchanged with the qCUdriver. VirtioQCArg is defined as follows:

typedef struct VirtioQCArg VirtioQCArg;

struct VirtioQCArg {
    int32_t cmd;
    uint64_t rnd;
    uint64_t para;
    
    uint64_t pA;
    uint32_t pASize;
    
    uint64_t pB;
    uint32_t pBSize;
    
    uint32_t flag;    
};
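As a rough illustration of how such a buffer can carry a pointer argument and its size, the sketch below packs a buffer into the pA/pASize pair. The helpers pack_buffer and unpack_buffer are hypothetical and not part of qCUDA, and the choice of which member carries which parameter is an assumption made only for this example. It also stays within one address space, whereas qCUDA translates addresses between guest and host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same members as the VirtioQCArg structure shown above. */
typedef struct VirtioQCArg {
    int32_t  cmd;
    uint64_t rnd;
    uint64_t para;
    uint64_t pA;
    uint32_t pASize;
    uint64_t pB;
    uint32_t pBSize;
    uint32_t flag;
} VirtioQCArg;

/* Hypothetical helper: clear the command buffer, then store a pointer
 * and its size in the pA/pASize pair (assumed mapping). */
static void pack_buffer(VirtioQCArg *arg, const void *buf, uint32_t size)
{
    assert(arg != NULL);
    memset(arg, 0, sizeof(*arg));
    arg->pA = (uint64_t)(uintptr_t)buf;  /* widen the pointer to 64 bits */
    arg->pASize = size;
}

/* Hypothetical helper: recover the pointer on the receiving side. */
static void *unpack_buffer(const VirtioQCArg *arg)
{
    assert(arg != NULL);
    return (void *)(uintptr_t)arg->pA;
}
```

Storing pointers as uint64_t keeps the structure layout identical for 32-bit and 64-bit callers, which is one plausible reason for the fixed-width members.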

The "cudaThreadSynchronize" function takes no parameters, so nothing needs to be passed to the driver, but the value returned by the driver is still received. send_cmd_to_device sends the qCUcmd to the qCUdriver. Its first parameter is the identifier of cudaThreadSynchronize, formed by adding the prefix "VIRTQC_" to the function name; its second parameter is the VirtioQCArg structure that was just declared and initialized. After send_cmd_to_device returns, the return value is read from a member of VirtioQCArg (here, cmd).
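The wrapper pattern can be tried outside a VM by replacing the transport with a stub. The following is a minimal sketch under stated assumptions: the send_cmd_to_device signature is inferred from the call site above, the identifier value is a placeholder rather than the one from qcuda_common.h, and mock_cudaThreadSynchronize, stub_status, and last_cmd are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same members as the VirtioQCArg structure shown above. */
typedef struct VirtioQCArg {
    int32_t  cmd;
    uint64_t rnd;
    uint64_t para;
    uint64_t pA;
    uint32_t pASize;
    uint64_t pB;
    uint32_t pBSize;
    uint32_t flag;
} VirtioQCArg;

/* Placeholder value for this sketch only; the real identifier is
 * defined in qcuda_common.h. */
enum { VIRTQC_cudaThreadSynchronize = 1 };

static int stub_status;  /* status the pretend host reports */
static int last_cmd;     /* last command identifier the stub received */

/* Stand-in for send_cmd_to_device (assumed signature): records the
 * command and writes the status back into arg->cmd. */
static void send_cmd_to_device(int cmd, VirtioQCArg *arg)
{
    assert(arg != NULL);
    last_cmd = cmd;
    arg->cmd = stub_status;
}

/* Mirrors the wrapper above, returning a plain int instead of
 * cudaError_t so that no CUDA headers are needed. */
static int mock_cudaThreadSynchronize(void)
{
    VirtioQCArg arg;
    memset(&arg, 0, sizeof(arg));
    send_cmd_to_device(VIRTQC_cudaThreadSynchronize, &arg);
    return arg.cmd;
}
```

Setting stub_status before the call shows that the wrapper returns whatever the lower layers write into cmd, without interpreting it.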

qcu-driver/qcuda_driver.c

The qCUdriver component of the qCUDA system provides the driver interface in the guest. Through the driver module, the guest passes messages to the virtual device. The qCUdriver is modified as follows:

// @_cmd: device command
// @_arg: argument of the CUDA function
// This function returns a cudaError_t.
static long qcu_misc_ioctl(struct file *filp,
                           unsigned int _cmd,
                           unsigned long _arg)
{
    VirtioQCArg *arg;
    int err;

    arg = kmalloc_safe(sizeof(VirtioQCArg));
    copy_from_user_safe(arg, (void *)_arg, sizeof(VirtioQCArg));
    arg->cmd = _cmd;

    switch (arg->cmd)
    {
        ....
        case VIRTQC_cudaThreadSynchronize:
            qcu_cudaThreadSynchronize(arg);
            break;
        ....
    }
    ....
}

Add a new case, "VIRTQC_cudaThreadSynchronize", to the switch statement in qcu_misc_ioctl, and on the next line add the function call "qcu_cudaThreadSynchronize(arg)". Note that the prefixes "VIRTQC_" and "qcu_" must be prepended to the function name. Next, "qcu_cudaThreadSynchronize(arg)" is implemented as follows:

void qcu_cudaThreadSynchronize(VirtioQCArg *arg) {
    qcu_misc_send_cmd(arg);
}

This case is simple: the driver needs no extra information and just calls "qcu_misc_send_cmd" to pass the qCUcmd to the qCUdevice.

qcu-device/hw/misc/virtio-qcuda.c

The qCUdevice component of the qCUDA system provides the interface that executes the actual CUDA runtime API and passes messages back to the guest. Unlike the other components, it runs in the host and is implemented as part of the QEMU source. The qCUdevice is modified as follows:


static void virtio_qcuda_cmd_handle(VirtIODevice *vdev,
                                    VirtQueue *vq)
{
    VirtQueueElement elem;
    VirtioQCArg *arg;

    arg = malloc(sizeof(VirtioQCArg));
    while (virtqueue_pop(vq, &elem))
    {
        iov_to_buf(elem.out_sg,
                   elem.out_num,
                   0,
                   arg,
                   sizeof(VirtioQCArg));

        ...

        case VIRTQC_cudaThreadSynchronize:
            qcu_cudaThreadSynchronize(arg);
            break;

        ...

    }

    ...

}

Add a new case, "case VIRTQC_cudaThreadSynchronize", to the switch statement in virtio_qcuda_cmd_handle, and on the next line add the function call "qcu_cudaThreadSynchronize(arg)". Note that the prefixes "VIRTQC_" and "qcu_" must be prepended to the function name. Next, "qcu_cudaThreadSynchronize(arg)" is implemented as follows:

static void qcu_cudaThreadSynchronize(VirtioQCArg *arg)
{
    cudaError_t err;
    cudaError(err = cudaThreadSynchronize());

    arg->cmd = err;
}

This is simple: the required CUDA function is called here. cudaThreadSynchronize returns a value, which is passed back to the guest through a member of the VirtioQCArg structure (here, cmd).

qcuda_common.h

Common arguments and macros are defined here. Command identifiers are defined with an enumeration, as shown below; each identifier is the function name with the prefix "VIRTQC_" prepended.

enum
{
    // Module & Execution control (driver API)
    VIRTQC_cudaRegisterFatBinary = 200,
    VIRTQC_cudaUnregisterFatBinary,
    VIRTQC_cudaRegisterFunction,
    VIRTQC_cudaRegisterVar,
    VIRTQC_cudaLaunch,

    // Memory Management (runtime API)
    VIRTQC_cudaMalloc,
    VIRTQC_cudaMemcpy,
    VIRTQC_cudaMemcpyAsync,
    VIRTQC_cudaMemset,
    VIRTQC_cudaFree,

    // Device Management (runtime API)
    VIRTQC_cudaGetDevice,
    VIRTQC_cudaGetDeviceCount,
    VIRTQC_cudaGetDeviceProperties,
    VIRTQC_cudaSetDevice,
    VIRTQC_cudaDeviceSynchronize,
    VIRTQC_cudaDeviceReset,
    VIRTQC_cudaDeviceSetLimit,

    // Version Management (runtime API)
    VIRTQC_cudaDriverGetVersion,
    VIRTQC_cudaRuntimeGetVersion,

    // Event Management (runtime API)
    VIRTQC_cudaEventCreate,
    VIRTQC_cudaEventCreateWithFlags,
    VIRTQC_cudaEventRecord,
    VIRTQC_cudaEventSynchronize,
    VIRTQC_cudaEventElapsedTime,
    VIRTQC_cudaEventDestroy,

    // Error Handling (runtime API)
    VIRTQC_cudaGetLastError,

    // Zero-copy
    VIRTQC_cudaHostRegister,
    VIRTQC_cudaHostGetDevicePointer,
    VIRTQC_cudaHostUnregister,
    VIRTQC_cudaSetDeviceFlags,
    VIRTQC_cudaFreeHost,

    // Stream Management (runtime API)
    VIRTQC_cudaStreamCreate,
    VIRTQC_cudaStreamDestroy,
    VIRTQC_cudaStreamSynchronize,

    // Thread Management
    VIRTQC_cudaThreadSynchronize,
};
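Because only the first enumerator has an explicit value, every later identifier is numbered by its position, so inserting a new entry in the middle shifts all identifiers after it. The sketch below repeats just the first five entries of the enumeration above to show the numbering; virtqc_name is a hypothetical debugging helper, not part of qCUDA.

```c
#include <assert.h>
#include <stddef.h>

/* First five entries of the enumeration above, in the same order. */
enum {
    VIRTQC_cudaRegisterFatBinary = 200,
    VIRTQC_cudaUnregisterFatBinary,  /* 201 */
    VIRTQC_cudaRegisterFunction,     /* 202 */
    VIRTQC_cudaRegisterVar,          /* 203 */
    VIRTQC_cudaLaunch                /* 204 */
};

/* Compile-time check (C11) that enumerators count up from 200. */
static_assert(VIRTQC_cudaLaunch == 204, "enumerators count up from 200");

/* Hypothetical helper: map an identifier to a printable name,
 * or NULL if the identifier is not in this subset. */
static const char *virtqc_name(int cmd)
{
    switch (cmd) {
    case VIRTQC_cudaRegisterFatBinary:   return "cudaRegisterFatBinary";
    case VIRTQC_cudaUnregisterFatBinary: return "cudaUnregisterFatBinary";
    case VIRTQC_cudaRegisterFunction:    return "cudaRegisterFunction";
    case VIRTQC_cudaRegisterVar:         return "cudaRegisterVar";
    case VIRTQC_cudaLaunch:              return "cudaLaunch";
    default:                             return NULL;
    }
}
```

Since the guest and host are built separately, both sides need to be rebuilt from the same header whenever the enumeration changes, otherwise the identifiers would no longer match.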

As the example above shows, with the qCUDA framework the programmer does not need to care about how a qCUcmd travels between guest and host; the details of VirtIO are likewise hidden behind the framework's high-level abstraction.

Contributors

Yu-Shiang Lin
Jordan Huang
Luis Herrera
Jia-Chi Chen

Open Source Agenda is not affiliated with "QCUDA" Project. README Source: coldfunction/qCUDA
