[Bug] mmcv-full 1.7.2 with H20 GPU and torch 2.1.x: focal loss error #3221

Open
2 tasks done
Jenny0420 opened this issue Jan 2, 2025 · 2 comments

Comments

@Jenny0420

Prerequisite

Environment

/usr/local/lib/python3.10/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
{'sys.platform': 'linux', 'Python': '3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]', 'CUDA available': True, 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA H20', 'CUDA_HOME': '/usr/local/cuda', 'NVCC': 'Cuda compilation tools, release 12.1, V12.1.105', 'GCC': 'x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0', 'PyTorch': '2.1.2+cu121', 'PyTorch compiling details': 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX512\n - CUDA Runtime 12.1\n - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n - CuDNN 8.9.2\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces 
-fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n', 'TorchVision': '0.16.2+cu121', 'OpenCV': '4.8.1', 'MMCV': '1.7.2', 'MMCV Compiler': 'GCC 9.3', 'MMCV CUDA Compiler': '12.1'}

Reproduces the problem - code sample

The problem occurs during loss computation in training: the focal loss forward call fails (see the error message section of this report).

Reproduces the problem - command or script

Run the Sparse4D code on a single GPU.

Reproduces the problem - error message

"
  File "/usr/local/lib/python3.10/dist-packages/mmdet/models/losses/focal_loss.py", line 233, in forward
    loss_cls = self.loss_weight * calculate_loss_func(
  File "/usr/local/lib/python3.10/dist-packages/mmdet/models/losses/focal_loss.py", line 139, in sigmoid_focal_loss
    loss = _sigmoid_focal_loss(pred.contiguous(), target.contiguous(), gamma,
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/mmcv/ops/focal_loss.py", line 59, in forward
    ext_module.sigmoid_focal_loss_forward(
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

"
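A note on reading this error: "no kernel image is available" generally means the compiled extension contains no binary (or JIT-compilable PTX) for the GPU's compute capability. The environment dump shows PyTorch itself was built with sm_90, so one plausible reading is that the mmcv-full extension was built without it; that is an assumption, not something this report confirms. The sketch below encodes the usual CUDA compatibility rules in a hypothetical helper, `kernel_image_available`, and applies it to what PyTorch reports. It does not inspect the mmcv binary itself.

```python
import re


def kernel_image_available(capability, arch_list):
    """Return True if any arch in arch_list can run on a device with `capability`.

    capability: (major, minor) tuple, e.g. (9, 0).
    arch_list: strings such as "sm_80" (binary) or "compute_80" (PTX).
    Suffixed variants such as "sm_90a" are not handled in this sketch.
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        match = re.fullmatch(r"(sm|compute)_(\d+)(\d)", arch)
        if match is None:
            continue  # skip entries this sketch does not understand
        kind = match.group(1)
        major, minor = int(match.group(2)), int(match.group(3))
        if kind == "sm":
            # Binary code runs on the same major version with equal or newer minor.
            if major == dev_major and minor <= dev_minor:
                return True
        else:
            # PTX can be JIT-compiled for any equal or newer capability.
            if (major, minor) <= (dev_major, dev_minor):
                return True
    return False


try:
    import torch
except ImportError:  # allows the helper to be used without PyTorch installed
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()  # arch list of the PyTorch build, not of mmcv
    print("device capability:", cap)
    print("PyTorch arch list:", archs)
    print("PyTorch kernels usable:", kernel_image_available(cap, archs))
```

If the mmcv-full build does turn out to lack sm_90, rebuilding it from source with `TORCH_CUDA_ARCH_LIST` including `9.0` is a commonly suggested remedy; this has not been tested against the setup in this report.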

Additional information

I tried replacing the focal loss in mmcv/ops/focal_loss.py with another focal loss implementation, and it works.
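For readers who want to try the same workaround, here is a minimal pure-PyTorch sketch of a sigmoid focal loss that avoids the compiled mmcv op. It is not necessarily the implementation used above; the function name `sigmoid_focal_loss_py` is hypothetical, and the label convention (class indices, with an index equal to the number of classes meaning background) is an assumption modeled on how mmdet calls the op.

```python
import torch
import torch.nn.functional as F


def sigmoid_focal_loss_py(pred, target, gamma=2.0, alpha=0.25):
    """Element-wise sigmoid focal loss using only PyTorch ops.

    pred: (N, C) logits.
    target: (N,) int64 class indices in [0, C]; C is assumed to mean background.
    Returns an (N, C) tensor of unreduced losses.
    """
    num_classes = pred.size(1)
    # One-hot encode with an extra background column, then drop that column.
    one_hot = F.one_hot(target, num_classes + 1)[:, :num_classes].type_as(pred)
    prob = pred.sigmoid()
    # pt is the probability assigned to the wrong outcome for each element.
    pt = (1 - prob) * one_hot + prob * (1 - one_hot)
    focal_weight = (alpha * one_hot + (1 - alpha) * (1 - one_hot)) * pt.pow(gamma)
    bce = F.binary_cross_entropy_with_logits(pred, one_hot, reduction="none")
    return bce * focal_weight
```

Because it uses only built-in PyTorch kernels, this version does not depend on which GPU architectures the mmcv extension was compiled for, at the cost of being slower and using more memory than a fused CUDA kernel.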

@furh20

furh20 commented Jan 9, 2025

I'd like to ask: how is the training efficiency on 8 H20 GPUs, and how does the performance compare with the A100?

@Jenny0420

Jenny0420 commented Jan 15, 2025 via email
