Use CUDA 11 wheels to avoid statically linking CUDA components #137

Open
bdice opened this issue Jan 22, 2025 · 1 comment
bdice commented Jan 22, 2025

This issue proposes to use CUDA 11 wheels as dependencies for RAPIDS wheels. This is an extension of #35. Originally, that issue's scope was reduced to focus on only using CUDA wheels for CUDA 12 packages, because at that time CUDA 11 ARM wheels (specifically ARM!) did not exist for all of the math libraries that RAPIDS depends on. That was rectified as of about August 2024, but we had already done the migration for just CUDA 12. We did not attempt to go back and add support for CUDA 11.

As a part of the work for #33, we came across a pitfall that we previously recognized, but forgot about: cuBLAS only works properly across DSOs if it is using shared linkage. (An upstream nvbug is linked in that comment.)
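One way to check how a given shared object pulls in cuBLAS is to inspect its dynamic section. The sketch below is illustrative only: `cublas_linkage` is a hypothetical helper, not part of any RAPIDS tooling, and it assumes `readelf -d` output is piped to it. A `NEEDED` entry for `libcublas` or `libcublasLt` indicates shared linkage; its absence means cuBLAS is either statically linked or not used at all.

```shell
# Hypothetical helper: reads `readelf -d` output on stdin and reports
# whether the library declares a dynamic dependency on cuBLAS / cuBLASLt.
cublas_linkage() {
    if grep -Eq 'NEEDED.*libcublas(Lt)?\.so'; then
        echo "dynamic"
    else
        # No NEEDED entry: cuBLAS is statically linked in, or not used.
        echo "static-or-absent"
    fi
}

# Usage (the path is a placeholder, adjust to the installed wheel):
#   readelf -d /path/to/site-packages/libraft/lib64/libraft.so | cublas_linkage
```

A result of `static-or-absent` for a library known to call cuBLAS would be consistent with the static-linkage situation described above.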

We are observing this issue in CI for cuVS, cuML, and cuGraph, with errors like those below.

cuVS CI failures

This points directly to pylibraft, which is built using libraft C++ wheels.
https://github.com/rapidsai/cuvs/actions/runs/12836113145/job/35800080495#step:9:666

FAILED python/cuvs/cuvs/test/test_ivf_pq.py::test_ivf_pq_search_params[params0] - cuvs.common.exceptions.CuvsException: cuBLAS error encountered at: file=/__w/cuvs/cuvs/python/cuvs/build/cp312-cp312-linux_aarch64/_deps/raft-src/cpp/include/raft/linalg/detail/cublaslt_wrappers.hpp line=261: call='cublasLtMatmul(resource::get_cublaslt_handle(res), mm_desc->desc, alpha, a_ptr, mm_desc->a, b_ptr, mm_desc->b, beta, c_ptr, mm_desc->c, c_ptr, mm_desc->c, &(mm_desc->heuristics.algo), nullptr, 0, stream)', Reason=13:CUBLAS_STATUS_EXECUTION_FAILED
Obtained 49 stack frames
#1 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/neighbors/brute_force/../../libcuvs.so: raft::cublas_error::cublas_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0xb0 [0xfffdc31bc360]
#2 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/neighbors/brute_force/../../libcuvs.so: void raft::linalg::detail::legacy_matmul<false, float, float, float, float>(raft::resources const&, bool, bool, unsigned long, unsigned long, unsigned long, float const*, float const*, unsigned long, float const*, unsigned long, float const*, float*, unsigned long, CUstream_st*) +0x6c0 [0xfffdc31db900]
#3 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/neighbors/brute_force/../../libcuvs.so: void cuvs::neighbors::ivf_pq::detail::train_per_subset<long>(raft::resources const&, cuvs::neighbors::ivf_pq::index<long>&, unsigned long, float const*, unsigned int const*, unsigned int, unsigned int) +0x57c [0xfffdc3e5616c]
#4 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/neighbors/brute_force/../../libcuvs.so: cuvs::neighbors::ivf_pq::index<long> cuvs::neighbors::ivf_pq::detail::build<float, long, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >(raft::resources const&, cuvs::neighbors::ivf_pq::index_params const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >) +0xc64 [0xfffdc3e59964]
#5 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/neighbors/brute_force/../../libcuvs.so: cuvs::neighbors::ivf_pq::build(raft::resources const&, cuvs::neighbors::ivf_pq::index_params const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >) +0x30 [0xfffdc3e37800]
#6 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/neighbors/brute_force/../../libcuvs.so: void cuvs::neighbors::ivf_pq::detail::build<float, long, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >(raft::resources const&, cuvs::neighbors::ivf_pq::index_params const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, cuvs::neighbors::ivf_pq::index<long>*) +0x44 [0xfffdc3e5a7c4]
#7 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/neighbors/brute_force/../../libcuvs.so: cuvs::neighbors::ivf_pq::build(raft::resources const&, cuvs::neighbors::ivf_pq::index_params const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, cuvs::neighbors::ivf_pq::index<long>*) +0x28 [0xfffdc3e37838]
#8 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/neighbors/brute_force/../../libcuvs_c.so: cuvsIvfPqBuild +0x204 [0xfffe2f4b8204]
#9 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/neighbors/ivf_pq/ivf_pq.cpython-312-aarch64-linux-gnu.so(+0x4e1e0) [0xfffdafd9b1e0]
#10 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/sklearn/__check_build/_check_build.cpython-312-aarch64-linux-gnu.so(+0x4074) [0xffffa885a074]
#11 in /pyenv/versions/3.12.8/lib/python3.12/site-packages/cuvs/common/resources.cpython-312-aarch64-linux-gnu.so(+0x44d90) [0xfffdb021dd90]
#12 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_MakeTpCall +0x84 [0xffffaa38c42c]
#13 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyEval_EvalFrameDefault +0x2f1c [0xffffaa32c82c]
#14 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_FastCallDictTstate +0xf8 [0xffffaa38e2b8]
#15 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_Call_Prepend +0x10c [0xffffaa38e4b4]
#16 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0(+0x1eee74) [0xffffaa40be74]
#17 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_MakeTpCall +0x84 [0xffffaa38c42c]
#18 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyEval_EvalFrameDefault +0x2f1c [0xffffaa32c82c]
#19 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_FastCallDictTstate +0xf8 [0xffffaa38e2b8]
#20 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_Call_Prepend +0x10c [0xffffaa38e4b4]
#21 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0(+0x1eee74) [0xffffaa40be74]
#22 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_Call +0x74 [0xffffaa38e544]
#23 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyEval_EvalFrameDefault +0x738 [0xffffaa32a048]
#24 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_FastCallDictTstate +0xf8 [0xffffaa38e2b8]
#25 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_Call_Prepend +0x10c [0xffffaa38e4b4]
#26 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0(+0x1eee74) [0xffffaa40be74]
#27 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_MakeTpCall +0x84 [0xffffaa38c42c]
#28 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyEval_EvalFrameDefault +0x2f1c [0xffffaa32c82c]
#29 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_FastCallDictTstate +0xf8 [0xffffaa38e2b8]
#30 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_Call_Prepend +0x10c [0xffffaa38e4b4]
#31 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0(+0x1eee74) [0xffffaa40be74]
#32 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_MakeTpCall +0x84 [0xffffaa38c42c]
#33 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyEval_EvalFrameDefault +0x2f1c [0xffffaa32c82c]
#34 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_FastCallDictTstate +0xf8 [0xffffaa38e2b8]
#35 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_Call_Prepend +0x10c [0xffffaa38e4b4]
#36 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0(+0x1eee74) [0xffffaa40be74]
#37 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_MakeTpCall +0x84 [0xffffaa38c42c]
#38 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyEval_EvalFrameDefault +0x2f1c [0xffffaa32c82c]
#39 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: PyEval_EvalCode +0x224 [0xffffaa49d24c]
#40 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0(+0x27d560) [0xffffaa49a560]
#41 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0(+0x1c621c) [0xffffaa3e321c]
#42 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: PyObject_Vectorcall +0x54 [0xffffaa38c70c]
#43 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyEval_EvalFrameDefault +0x2f1c [0xffffaa32c82c]
#44 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: _PyObject_Call +0xf4 [0xffffaa38e5c4]
#45 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0(+0x30065c) [0xffffaa51d65c]
#46 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: Py_RunMain +0x540 [0xffffaa51de60]
#47 in /pyenv/versions/3.12.8/lib/libpython3.12.so.1.0: Py_BytesMain +0x58 [0xffffaa51e5b8]
#48 in /usr/lib/aarch64-linux-gnu/libc.so.6: __libc_start_main +0xe8 [0xffffaa0cae10]
#49 in /pyenv/versions/3.12.8/bin/python(+0x8d8) [0xaaaac4bf08d8]

cuML CI failures

This points directly to pylibraft, which is built using libraft C++ wheels.
https://github.com/rapidsai/cuml/actions/runs/12883195240/job/35916934900#step:9:4966

FAILED test_dask_serialization.py::test_serialize_before_training - RuntimeError: 1 of 1 worker jobs failed: cuBLAS error encountered at: file=/tmp/pip-build-env-sznqd6nk/normal/lib/python3.10/site-packages/libraft/include/raft/linalg/detail/cublaslt_wrappers.hpp line=261: call='cublasLtMatmul(resource::get_cublaslt_handle(res), mm_desc->desc, alpha, a_ptr, mm_desc->a, b_ptr, mm_desc->b, beta, c_ptr, mm_desc->c, c_ptr, mm_desc->c, &(mm_desc->heuristics.algo), nullptr, 0, stream)', Reason=13:CUBLAS_STATUS_EXECUTION_FAILED
Obtained 60 stack frames
#1 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/cuml/internals/../libcuml++.so: raft::cublas_error::cublas_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0xbd [0x7f7bd887b45d]
#2 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/cuml/internals/../libcuml++.so: void raft::linalg::detail::legacy_matmul<false, double, double, double, double>(raft::resources const&, bool, bool, unsigned long, unsigned long, unsigned long, double const*, double const*, unsigned long, double const*, unsigned long, double const*, double*, unsigned long, CUstream_st*) +0x666 [0x7f7bd88c9fc6]
#3 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/cuml/internals/../libcuml++.so: MLCommon::LinAlg::opg::mv_aTb(raft::handle_t const&, MLCommon::Matrix::Data<double>&, std::vector<MLCommon::Matrix::Data<double>*, std::allocator<MLCommon::Matrix::Data<double>*> > const&, MLCommon::Matrix::PartDescriptor const&, std::vector<MLCommon::Matrix::Data<double>*, std::allocator<MLCommon::Matrix::Data<double>*> > const&, CUstream_st**, int) +0x2d4 [0x7f7bda9e6b44]
#4 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/cuml/internals/../libcuml++.so: void MLCommon::LinAlg::opg::lstsqEig_impl<double>(raft::handle_t const&, std::vector<MLCommon::Matrix::Data<double>*, std::allocator<MLCommon::Matrix::Data<double>*> > const&, MLCommon::Matrix::PartDescriptor const&, std::vector<MLCommon::Matrix::Data<double>*, std::allocator<MLCommon::Matrix::Data<double>*> > const&, double*, CUstream_st**, int) +0xb67 [0x7f7bda9f3c87]
#5 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/cuml/internals/../libcuml++.so: void ML::OLS::opg::fit_impl<double>(raft::handle_t&, std::vector<MLCommon::Matrix::Data<double>*, std::allocator<MLCommon::Matrix::Data<double>*> >&, MLCommon::Matrix::PartDescriptor&, std::vector<MLCommon::Matrix::Data<double>*, std::allocator<MLCommon::Matrix::Data<double>*> >&, double*, double*, bool, bool, int, CUstream_st**, int, bool) +0x927 [0x7f7bd98d2da7]
#6 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/cuml/internals/../libcuml++.so: void ML::OLS::opg::fit_impl<double>(raft::handle_t&, std::vector<MLCommon::Matrix::Data<double>*, std::allocator<MLCommon::Matrix::Data<double>*> >&, MLCommon::Matrix::PartDescriptor&, std::vector<MLCommon::Matrix::Data<double>*, std::allocator<MLCommon::Matrix::Data<double>*> >&, double*, double*, bool, bool, int, bool) +0x148 [0x7f7bd98d3de8]
#7 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/cuml/linear_model/linear_regression_mg.cpython-310-x86_64-linux-gnu.so(+0xcbff) [0x7f7ba2a88bff]
#8 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0xac [0x7f7d9889360c]
#9 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x16c3 [0x7f7d98839bc3]
#10 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#11 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0xc360b) [0x7f7d9889660b]
#12 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0xac [0x7f7d9889360c]
#13 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/cuml/linear_model/base_mg.cpython-310-x86_64-linux-gnu.so(+0x738e) [0x7f7ba2a6738e]
#14 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/cuml/linear_model/base_mg.cpython-310-x86_64-linux-gnu.so(+0x12c0f) [0x7f7ba2a72c0f]
#15 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0x5c [0x7f7d988935bc]
#16 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x16c3 [0x7f7d98839bc3]
#17 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#18 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0xc360b) [0x7f7d9889660b]
#19 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x64d51) [0x7f7d98837d51]
#20 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x6d5b [0x7f7d9883f25b]
#21 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#22 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0x5c [0x7f7d988935bc]
#23 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x16c3 [0x7f7d98839bc3]
#24 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#25 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyObject_FastCallDictTstate +0x52 [0x7f7d98893a72]
#26 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyObject_Call_Prepend +0xe4 [0x7f7d98893d34]
#27 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x12e494) [0x7f7d98901494]
#28 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyObject_MakeTpCall +0x82 [0x7f7d98893922]
#29 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x64dce) [0x7f7d98837dce]
#30 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x182d [0x7f7d98839d2d]
#31 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#32 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x64d51) [0x7f7d98837d51]
#33 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x182d [0x7f7d98839d2d]
#34 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#35 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1d26d7) [0x7f7d989a56d7]
#36 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x10c86d) [0x7f7d988df86d]
#37 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0x5c [0x7f7d988935bc]
#38 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x16c3 [0x7f7d98839bc3]
#39 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#40 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0x5c [0x7f7d988935bc]
#41 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x16c3 [0x7f7d98839bc3]
#42 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#43 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x64d51) [0x7f7d98837d51]
#44 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x1068 [0x7f7d98839568]
#45 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#46 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0x5c [0x7f7d988935bc]
#47 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x16c3 [0x7f7d98839bc3]
#48 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#49 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x64d51) [0x7f7d98837d51]
#50 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x1068 [0x7f7d98839568]
#51 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#52 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x64d51) [0x7f7d98837d51]
#53 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x1068 [0x7f7d98839568]
#54 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1ba894) [0x7f7d9898d894]
#55 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0xc36ab) [0x7f7d988966ab]
#56 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0x5c [0x7f7d988935bc]
#57 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x266956) [0x7f7d98a39956]
#58 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x20d367) [0x7f7d989e0367]
#59 in /usr/lib64/libpthread.so.0(+0x81ca) [0x7f7d983921ca]
#60 in /usr/lib64/libc.so.6: clone +0x43 [0x7f7d978638d3]

cuGraph CI failures

https://github.com/rapidsai/cugraph/actions/runs/12882240011/job/35914143656#step:9:16398

 File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/cugraph/community/spectral_clustering.py", line 193, in spectralModularityMaximizationClustering
  Failed example:
      df = cugraph.spectralModularityMaximizationClustering(G, 5)
  Exception raised:
      Traceback (most recent call last):
        File "/pyenv/versions/3.10.16/lib/python3.10/doctest.py", line 1350, in __run
          exec(compile(example.source, filename, "single",
        File "<doctest spectralModularityMaximizationClustering[2]>", line 1, in <module>
          df = cugraph.spectralModularityMaximizationClustering(G, 5)
        File "/pyenv/versions/3.10.16/lib/python3.10/site-packages/cugraph/community/spectral_clustering.py", line 207, in spectralModularityMaximizationClustering
          vertex, partition = pylibcugraph_spectral_modularity_maximization(
        File "spectral_modularity_maximization.pyx", line 145, in pylibcugraph.spectral_modularity_maximization.spectral_modularity_maximization
        File "utils.pyx", line 53, in pylibcugraph.utils.assert_success
      RuntimeError: non-success value returned from cugraph_spectral_modularity_maximization: CUGRAPH_UNKNOWN_ERROR cuBLAS error encountered at: file=/pyenv/versions/3.12.8/lib/python3.12/site-packages/libraft/include/raft/sparse/solver/detail/lanczos.cuh line=1225: call='raft::linalg::detail::cublasnrm2(cublas_h, n, lanczosVecs_dev, 1, &normQ1, stream)', Reason=13:CUBLAS_STATUS_EXECUTION_FAILED
      Obtained 63 stack frames
      #1 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/libcugraph/lib64/libcugraph.so: raft::cublas_error::cublas_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0xb0 [0xfffefbb62cf0]
      #2 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/libcugraph/lib64/libcugraph.so: raft::spectral::lanczos_solver_t<int, float, int>::solve_largest_eigenvectors(raft::resources const&, raft::spectral::matrix::detail::sparse_matrix_t<int, float> const&, float*, float*) const +0xed0 [0xfffefbbc0aa0]
      #3 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/libcugraph/lib64/libcugraph.so: std::tuple<int, float, int> raft::spectral::detail::modularity_maximization<int, float, raft::spectral::lanczos_solver_t<int, float, int>, raft::spectral::kmeans_solver_t<int, float, int> >(raft::resources const&, raft::spectral::matrix::detail::sparse_matrix_t<int, float> const&, raft::spectral::lanczos_solver_t<int, float, int> const&, raft::spectral::kmeans_solver_t<int, float, int> const&, int*, float*, float*) +0x2c4 [0xfffefbc2f3d4]
      #4 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/libcugraph/lib64/libcugraph.so: void cugraph::ext_raft::detail::spectralModularityMaximization_impl<int, int, float>(cugraph::legacy::GraphCSRView<int, int, float> const&, int, int, float, int, float, int, int*, float*, float*) +0x5a4 [0xfffefbc302e4]
      #5 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/libcugraph/lib64/libcugraph.so: void cugraph::ext_raft::spectralModularityMaximization<int, int, float>(cugraph::legacy::GraphCSRView<int, int, float> const&, int, int, float, int, float, int, int*) +0x9c [0xfffefbc30edc]
      #6 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/libcugraph/lib64/libcugraph_c.so: cugraph_spectral_modularity_maximization +0xe28 [0xfffef790d6c8]
      #7 in /pyenv/versions/3.10.16/lib/python3.10/site-packages/pylibcugraph/spectral_modularity_maximization.cpython-310-aarch64-linux-gnu.so(+0x6a54) [0xfffd4df6fa54]
      #8 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #9 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x1f48 [0xffff86b593e0]
      #10 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #11 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #12 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x5950 [0xffff86b5cde8]
      #13 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #14 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyEval_EvalCode +0x70 [0xffff86ca02a8]
      #15 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1a9d88) [0xffff86c9ad88]
      #16 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1075d8) [0xffff86bf85d8]
      #17 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #18 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x279c [0xffff86b59c34]
      #19 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #20 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #21 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x1d8c [0xffff86b59224]
      #22 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #23 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #24 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x1d8c [0xffff86b59224]
      #25 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #26 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0xc1068) [0xffff86bb2068]
      #27 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0xc0 [0xffff86bae9b0]
      #28 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x2084 [0xffff86b5951c]
      #29 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #30 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0x68 [0xffff86bae958]
      #31 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x2084 [0xffff86b5951c]
      #32 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #33 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #34 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x5950 [0xffff86b5cde8]
      #35 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #36 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0xc1068) [0xffff86bb2068]
      #37 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #38 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x5950 [0xffff86b5cde8]
      #39 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #40 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyObject_FastCallDictTstate +0xe4 [0xffff86baefcc]
      #41 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyObject_Call_Prepend +0xe8 [0xffff86baf230]
      #42 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x129bb8) [0xffff86c1abb8]
      #43 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyObject_MakeTpCall +0x8c [0xffff86baed8c]
      #44 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65e9c) [0xffff86b56e9c]
      #45 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x1f48 [0xffff86b593e0]
      #46 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #47 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #48 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x1d8c [0xffff86b59224]
      #49 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #50 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: PyVectorcall_Call +0x68 [0xffff86bae958]
      #51 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x2084 [0xffff86b5951c]
      #52 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #53 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #54 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x5950 [0xffff86b5cde8]
      #55 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #56 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0xc1068) [0xffff86bb2068]
      #57 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x65df4) [0xffff86b56df4]
      #58 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyEval_EvalFrameDefault +0x5950 [0xffff86b5cde8]
      #59 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x1af0f8) [0xffff86ca00f8]
      #60 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyObject_FastCallDictTstate +0xe4 [0xffff86baefcc]
      #61 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyObject_Call_Prepend +0xe8 [0xffff86baf230]
      #62 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0(+0x129bb8) [0xffff86c1abb8]
      #63 in /pyenv/versions/3.10.16/lib/libpython3.10.so.1.0: _PyObject_Call +0x74 [0xffff86baeb0c]

Currently, our proposed solution is to add support for CUDA wheels to our CUDA 11 builds, which should mitigate the problem and unify our code paths between CUDA 11 and CUDA 12. This should be the lowest-effort path that allows us to continue forward with dynamic linking between RAPIDS C++ wheels (#33).

In the immediate term, we will disable CUDA 11 wheels CI for cuVS, cuML, and cuGraph so they are not blocked.

bdice changed the title from "Use CUDA 11 wheels" to "Use CUDA 11 wheels to avoid statically linking CUDA components" on Jan 22, 2025
@jameslamb
Member

Here's a minimal, reproducible example for the cuVS failures.

docker run \
    --rm \
    --gpus all \
    -v $(pwd):/opt/work \
    -w /opt/work \
    -it rapidsai/citestwheel:cuda11.8.0-rockylinux8-py3.12 \
    bash

# pinning to the latest libraft / pylibraft with the problematic linking
pip install \
    'cuvs-cu11[test]==25.2.*,>=0.0.0a0' \
    'libraft-cu11==25.2.0a41' \
    'pylibraft-cu11==25.2.0a41'

cd ./python/cuvs/cuvs
pytest 'test/test_ivf_pq.py::test_ivf_pq_search_params'
# cuvs.common.exceptions.CuvsException: cuBLAS error

Using the packages built from rapidsai/raft#2548, I saw the tests pass 🎉

LIBRAFT_WHEELHOUSE=$(RAPIDS_PY_WHEEL_NAME="libraft_cu11" rapids-get-pr-wheel-artifact raft 2548 cpp)
PYLIBRAFT_WHEELHOUSE=$(RAPIDS_PY_WHEEL_NAME="pylibraft_cu11" rapids-get-pr-wheel-artifact raft 2548 python)
RAFT_DASK_WHEELHOUSE=$(RAPIDS_PY_WHEEL_NAME="raft_dask_cu11" rapids-get-pr-wheel-artifact raft 2548 python)

pip install \
    'cuvs-cu11[test]==25.2.*,>=0.0.0a0' \
    "$(echo ${LIBRAFT_WHEELHOUSE}/*.whl)" \
    "$(echo ${PYLIBRAFT_WHEELHOUSE}/*.whl)" \
    "$(echo ${RAFT_DASK_WHEELHOUSE}/*.whl)"

cd ./python/cuvs/cuvs
pytest 'test/test_ivf_pq.py::test_ivf_pq_search_params'
# === 4 passed in 1.45s ===

So I think rapidsai/raft#2548 will fix this (at least for cuVS).

rapids-bot pushed a commit to rapidsai/raft that referenced this issue on Jan 22, 2025
Contributes to rapidsai/build-planning#137

Follow-up to #2531.

See the linked issue for many more details, but in short... using a dynamically-loaded libraft which has statically-linked cuBLAS causes issues for other libraries.

There are now aarch64 CUDA 11 wheels for cuBLAS and other CUDA libraries, so it's possible to have RAFT wheels dynamically link against them. This PR does that.

## Notes for Reviewers

This has other side benefits in addition to fixing runtime issues... it also simplifies the wheel-building scripts and CMake, and makes CUDA 11 wheels noticeably smaller 😊

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #2548

rapids-bot pushed a commit to rapidsai/cugraph that referenced this issue on Jan 22, 2025
Contributes to rapidsai/build-planning#137

Follow-up to #4804

Wheel builds here currently list out some shared libraries to exclude in `auditwheel repair`, which they pick up transitively via linking to `libraft`.

https://github.com/rapidsai/cugraph/blob/a9c923bb3f4a6a6f5a9d46337adc65d969717567/ci/build_wheel.sh#L42-L49

The version components of those library names can change when those libraries have ABI breakages, for example across CUDA major version boundaries. This proposes replacing specific versions with wildcards, to exclude *all* versions of those libraries.

## Notes for Reviewers

This is especially relevant given this: rapidsai/raft#2548

For example, the latest `nvidia-cublas-cu11` has `libcublas.so.11` while `nvidia-cublas-cu12` has `libcublas.so.12`.
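
As a rough sketch of the wildcard idea, the hypothetical helper below (`soname_to_wildcard`, not taken from the cuGraph build scripts) turns a versioned library name into a version-agnostic pattern. It assumes an `auditwheel` release whose `--exclude` option accepts wildcard patterns; the wheel paths in the usage comment are placeholders.

```shell
# Hypothetical helper: replace the version suffix of a shared library name
# with a wildcard, e.g. libcublas.so.11 becomes libcublas.so.*
soname_to_wildcard() {
    # Strip everything from the trailing ".so" onward, then append ".so.*".
    printf '%s\n' "${1%.so*}.so.*"
}

# Usage (output directory and wheel path are placeholders):
#   auditwheel repair \
#       --exclude "$(soname_to_wildcard libcublas.so.11)" \
#       --exclude "$(soname_to_wildcard libcublasLt.so.11)" \
#       -w final_dist dist/*.whl
```

Quoting the pattern keeps the shell from expanding `*` against local files, so the wildcard reaches `auditwheel` unchanged.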

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #4877

rapids-bot pushed a commit to rapidsai/cuvs that referenced this issue on Jan 22, 2025
Due to some failures coming from libraft C++ wheels, CUDA 11 wheel CI will not pass. This PR temporarily disables CUDA 11 wheel tests until those issues can be resolved.

See rapidsai/build-planning#137.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #599

rapids-bot pushed a commit to rapidsai/cugraph that referenced this issue on Jan 22, 2025
Due to some failures coming from libraft C++ wheels, CUDA 11 wheel CI will not pass. This PR temporarily disables CUDA 11 wheel tests until those issues can be resolved.

See rapidsai/build-planning#137.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #4876