CUDA API support: cudaHostAlloc and cudaFreeHost #1304

chengchen666 · 2024-05-31T23:47:39Z

We need to implement cudaHostAlloc and cudaFreeHost to support vLLM.
The test case is in:
16bf3d2

To build:
nvcc -cudart shared test_cudahostalloc.cpp -o test_cudahostalloc -lcuda

To run:
./test_cudahostalloc 1024 1024


chengchen666 · 2024-06-01T01:26:16Z

This is not high priority. It is very likely that this API is called by NCCL rather than by vLLM directly: I could not find it in the vLLM source code, but I did find it in the NCCL source code. So once we finish NCCL support, we might not need to support this API for now.

mehryar72 · 2024-06-05T00:18:28Z

Branch merge issue with mab_hostalloc

The mab_hostalloc branch is a merge of hostalloc with multithread and nccl. The throughput test that exercises cudaHostAlloc, test_cudahostalloc, runs successfully the first time. On the second run, however, the container crashes.

Log lines that appear right after the crash:

[INFO] [0/60932323] unmap ptr is 4000000000, len is 1000
[INFO] [0/60932402] unmap ptr is 4000001000, len is 28f000
[INFO] [0/60932563] unmap ptr is 4000290000, len is 1a000
[INFO] [0/60932585] unmap ptr is 40002aa000, len is 2000
[INFO] [0/60932594] unmap ptr is 40002ac000, len is 3000
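
The unmap lines above can be checked mechanically. The following is a minimal sketch, not part of Quark: it assumes that both ptr and len are printed in hexadecimal without a 0x prefix (the log does not say so explicitly), and the names UnmapRegion, parse_unmap_line and regions_contiguous are hypothetical helpers introduced only for illustration. It parses each line and tests whether every region starts exactly where the previous one ends.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// One region reported by an "unmap ptr is ..., len is ..." log line.
struct UnmapRegion {
    std::uint64_t ptr;
    std::uint64_t len;
};

// Parses a single log line into `out`. Returns false if the line does not
// contain an unmap record.
// Assumption: both fields are hexadecimal without a 0x prefix.
bool parse_unmap_line(const std::string& line, UnmapRegion& out) {
    const std::string key = "unmap ptr is ";
    const std::size_t pos = line.find(key);
    if (pos == std::string::npos) {
        return false;
    }
    unsigned long long ptr = 0;
    unsigned long long len = 0;
    const int fields = std::sscanf(line.c_str() + pos,
                                   "unmap ptr is %llx, len is %llx",
                                   &ptr, &len);
    if (fields != 2) {
        return false;
    }
    out.ptr = ptr;
    out.len = len;
    return true;
}

// Returns true if each region begins exactly at the end of the previous one.
bool regions_contiguous(const std::vector<UnmapRegion>& regions) {
    for (std::size_t i = 1; i < regions.size(); ++i) {
        if (regions[i - 1].ptr + regions[i - 1].len != regions[i].ptr) {
            return false;
        }
    }
    return true;
}
```

Under the hexadecimal assumption, the five regions in the log are contiguous: they cover 0x4000000000 up to 0x40002af000, which is 0x2af000 bytes in total. This only describes the layout of the unmapped range; it does not by itself explain the crash.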

QuarkContainer · 2024-06-05T13:55:19Z

@mehryar72 Thank you! Could you please provide more detailed repro steps? It would also be great if you could attach the whole Quark log.

mehryar72 · 2024-06-05T17:32:11Z

@QuarkContainer
How to reproduce:
Build Quark from the mab_hostalloc branch.
Inside a container using the Quark runtime, run the cudaHostAlloc throughput test:
LD_PRELOAD=/path_to_libcudaproxy/libcudaproxy.so ./test_cudahostalloc 1024 1024
The first run is successful. The second time, the container gets stuck.
The Quark log is attached:
quark_log.txt

QuarkContainer · 2024-06-09T15:33:25Z

@mehryar72
I tried to build the mab_hostalloc branch, but it fails with the following error. It looks like I need to install the NCCL library (the linker cannot find -lnccl). Could you please update the steps to do that?

Compiling containerd-shim v0.3.0 (https://github.com/QuarkContainer/rust-extensions.git#b3ac82d9)
Compiling quark v0.6.0 (/home/brad/rust/Quark/qvisor)
error: linking with cc failed: exit status: 1
|
= note: LC_ALL="C" PATH="/home/brad/.rustup/toolchains/nightly-2023-12-11-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/bin:/home/brad/.pyenv/shims:/home/brad/.pyenv/bin:/home/brad/.cargo/bin:/home/brad/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin:/usr/local/go/bin" VSLANG="1033" "cc" "-m64" "/tmp/rustc7pziLu/symbols.o" "/home/brad/rust/Quark/qvisor/../target/release/deps/quark-15c31bd88d58b28c.quark.ad02c6ded2946f8b-cgu.0.rcgu.o" "-Wl,--as-needed" "-L" "/home/brad/rust/Quark/qvisor/../target/release/deps" "-L" "/usr/local/cuda/lib64" "-L" "/usr/local/cuda/lib64/stubs" "-L" "/usr/local/cuda/targets/x86_64-linux/lib" "-L" "/usr/local/cuda/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12/lib64" "-L" "/usr/local/cuda-12/lib64/stubs" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12.3/lib64" "-L" "/usr/local/cuda-12.3/lib64/stubs" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda/lib64" "-L" "/usr/local/cuda/lib64/stubs" "-L" "/usr/local/cuda/targets/x86_64-linux/lib" "-L" "/usr/local/cuda/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12/lib64" "-L" "/usr/local/cuda-12/lib64/stubs" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda-12.3/lib64" "-L" "/usr/local/cuda-12.3/lib64/stubs" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib" "-L" "/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs" "-L" "/usr/local/cuda/lib64" "-L" "/usr/lib/x86_64-linux-gnu" "-L" "/usr/lib/x86_64-linux-gnu/stubs" "-L" "/home/brad/.rustup/toolchains/nightly-2023-12-11-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bdynamic" "-lnccl" "-lcuda" "-lcudart" "-lnvidia-ml" "-lcublas" "-lcublasLt" "-Wl,-Bstatic" 
"/tmp/rustc7pziLu/libcompiler_builtins-8ebeba8f78436673.rlib" "-Wl,-Bdynamic" "-lcuda" "-lcublas" "-lcuda" "-lcublasLt" "-lelf" "-lcudart" "-lc" "-lcap" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-z,noexecstack" "-L" "/home/brad/.rustup/toolchains/nightly-2023-12-11-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o" "/home/brad/rust/Quark/qvisor/../target/release/deps/quark-15c31bd88d58b28c" "-Wl,--gc-sections" "-pie" "-Wl,-z,relro,-z,now" "-Wl,-O1" "-nodefaultlibs"
= note: /usr/bin/ld: cannot find -lnccl: No such file or directory
collect2: error: ld returned 1 exit status

My test on the hostalloc branch passes, as shown below.

root@brad-MS-7D46:/var/log/quark# rm quark.log; docker run --net=host --cpus=0.8 -P --runtime=quark_d --mount type=bind,source="/home/brad/rust/Quark",target=/Quark --rm -it nvidia/cuda:12.1.0-devel-ubuntu22.04 /bin/bash

==========
== CUDA ==

CUDA Version 12.1.0

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .

** DEPRECATION NOTICE! **

THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.3543 GB/s
Average throughput from device to host (cudaHostAlloc): 24.2204 GB/s
root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.2447 GB/s
Average throughput from device to host (cudaHostAlloc): 24.2431 GB/s
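
For reference, an average GB/s figure like the ones above is typically derived from the number of bytes copied and the measured copy time. The following is a minimal sketch of that arithmetic only, not the code of test_cudahostalloc (its source is not shown in this thread, and the meaning of the two 1024 arguments is not stated). The function names are hypothetical, and the sketch assumes 1 GB = 1e9 bytes; the actual test may use 2^30 instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Converts one transfer of `bytes` bytes that took `seconds` seconds to GB/s.
// Assumption: 1 GB = 1e9 bytes.
double throughput_gb_per_s(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Averages the per-copy throughput over several timed copies of equal size.
double average_throughput_gb_per_s(std::size_t bytes_per_copy,
                                   const std::vector<double>& seconds_per_copy) {
    assert(!seconds_per_copy.empty());
    double sum = 0.0;
    for (double s : seconds_per_copy) {
        sum += throughput_gb_per_s(bytes_per_copy, s);
    }
    return sum / static_cast<double>(seconds_per_copy.size());
}
```

For instance, under this convention a copy of 1,000,000,000 bytes that takes 0.5 s corresponds to 2 GB/s.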

chengchen666 · 2024-06-09T18:41:54Z

Maybe we should make NCCL optional when building Quark, because not all CUDA users require NCCL.

QuarkContainer · 2024-06-11T14:28:18Z

When tested with the latest GPUVirtNew branch, the test code fails in a weird place.

root@brad-MS-7D46:/Quark/target/release# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
failed to replaced dlopen call to libcudaproxy.so
CUDA error at test_cuda.cpp:104 - �ViY

QuarkContainer · 2024-06-11T15:41:20Z

@mehryar72 @chengchen666 With PR #1315, cudaHostAlloc works, as shown below.

root@brad-MS-7D46:/var/log/quark# rm quark.log; docker run --net=host --cpus=0.8 -P --runtime=quark_d --mount type=bind,source="/home/brad/rust/Quark",target=/Quark --rm -it nvidia/cuda:12.1.0-devel-ubuntu22.04 /bin/bash

==========
== CUDA ==

CUDA Version 12.1.0

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .

** DEPRECATION NOTICE! **

THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.3902 GB/s
Average throughput from device to host (cudaHostAlloc): 23.9117 GB/s
root@brad-MS-7D46:/# LD_PRELOAD=/Quark/target/release/libcudaproxy.so /Quark/test/c/test_cudahostalloc 1024 1024
Average throughput from host to device (cudaHostAlloc): 22.31 GB/s
Average throughput from device to host (cudaHostAlloc): 23.875 GB/s

chengchen666 assigned QuarkContainer May 31, 2024
