
HipVMM bug prevents loading any model on desktop systems #105

Closed
tonatiuhmira opened this issue Feb 12, 2025 · 6 comments
@tonatiuhmira

Hi, I'm running koboldcpp-rocm on Arch Linux with my RX 6800 and had no major issues until now. After a recent upgrade of the system ROCm packages, I'm getting an out-of-memory error from ggml-cuda.cu:

ggml/src/ggml-cuda/ggml-cuda.cu:444: HipVMM Failure: out of memory

As far as I can tell, this is a ROCm bug that has also been mentioned by the llama.cpp developers here. The proposed solution is to avoid using HIP VMM by setting -DGGML_CUDA_NO_VMM=1, but I'm not sure how to apply that to koboldcpp-rocm (and specifically on Arch; they are probably on Windows, I'm not sure). I have also tried building with

make GGML_USE_VMM=OFF LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1030 -j10

but I still get the same error, so I cannot load any model now.

What is the intended way to disable HipVMM?

@YellowRoseCx
Owner

Try adding -DGGML_CUDA_NO_VMM=1 to line 261 of the Makefile, where HIPFLAGS is defined, and see if it works: https://github.com/YellowRoseCx/koboldcpp-rocm/blob/main/Makefile#L261

like this:
HIPFLAGS += -DGGML_USE_HIPBLAS -DGGML_USE_HIP -DGGML_CUDA_NO_VMM=1 -DGGML_USE_CUDA -DSD_USE_CUBLAS $(shell $(ROCM_PATH)/bin/hipconfig -C)

Then save it, and in a terminal run make clean followed by make LLAMA_HIPBLAS=1 -j10.
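One way to double-check that the define actually reaches the compiler is to capture the build output and grep it. This is just a sketch (it assumes make echoes the compile commands; build.log is an arbitrary scratch file name):

make clean
make LLAMA_HIPBLAS=1 -j10 2>&1 | tee build.log
grep -c 'DGGML_CUDA_NO_VMM' build.log   # a non-zero count means the flag is present in the compile commands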

@tonatiuhmira
Author

Sorry, still getting the same error on a clean install. I verified that the flag was present in all steps of the compilation; however, when loading the model it still seems to use HipVMM:

Set AMD HSA_OVERRIDE_GFX_VERSION to 10.3.0
***
Welcome to KoboldCpp - Version 1.83.1.yr0-ROCm
Loading Chat Completions Adapter: /home/tonatiuh/workspace/koboldcpp-rocm/kcpp_adapters/AutoGuess.json
Chat Completions Adapter Loaded
No GPU or CPU backend was selected. Trying to assign one for you automatically...
Auto Selected CUDA Backend...

...
llama_init_from_model: n_ctx_per_seq (10368) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 10368, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  1944.00 MiB
llama_init_from_model: KV self size  = 1944.00 MiB, K (f16):  972.00 MiB, V (f16):  972.00 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.58 MiB
llama_init_from_model:      ROCm0 compute buffer size =   870.25 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    30.26 MiB
llama_init_from_model: graph nodes  = 1686
llama_init_from_model: graph splits = 2
ggml/src/ggml-cuda/ggml-cuda.cu:444: HipVMM Failure: out of memory

ptrace: Operation not permitted.
No stack.
The program is not being run.
[1]    186707 IOT instruction (core dumped)  python koboldcpp.py --gpulayers 49 --contextsize 10240 --model 

By the way, I tried with the AUR version (koboldcpp-hipblas), and it just works. The version in AUR is 1.82.4.yr0-1.

Set AMD HSA_OVERRIDE_GFX_VERSION to 10.3.0
***
Welcome to KoboldCpp - Version 1.82.4.yr0-ROCm
No GPU or CPU backend was selected. Trying to assign one for you automatically...
Unable to detect VRAM, please set layers manually.
Auto Selected CUDA Backend...
...
llama_init_from_model: KV self size  = 1944.00 MiB, K (f16):  972.00 MiB, V (f16):  972.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.58 MiB
llama_init_from_model:      ROCm0 compute buffer size =   916.08 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    30.26 MiB
llama_init_from_model: graph nodes  = 1686
llama_init_from_model: graph splits = 46 (with bs=512), 3 (with bs=1)
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
======
Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

(Model loaded is DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf)

@ManGuyNY

I'm experiencing the same issue on Arch with my 6800 XT, even with the added HIPFLAGS change. It does not seem to matter which model I load.


load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 1 of 339
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:          CPU model buffer size =   426.36 MiB
load_tensors:        ROCm0 model buffer size =  5532.43 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: using async uploads for device ROCm0, buffer type ROCm0, backend ROCm0
........................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4224
llama_init_from_model: n_ctx_per_seq = 4224
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4224) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4224, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   231.00 MiB
llama_init_from_model: KV self size  =  231.00 MiB, K (f16):  115.50 MiB, V (f16):  115.50 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.58 MiB
llama_init_from_model:      ROCm0 compute buffer size =   304.00 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    15.26 MiB
llama_init_from_model: graph nodes  = 986
llama_init_from_model: graph splits = 2
ggml/src/ggml-cuda/ggml-cuda.cu:444: HipVMM Failure: out of memory

@YellowRoseCx
Owner

> Sorry, still getting the same error on a clean install. I verified that the flag was present in all steps of the compilation; however, when loading the model it still seems to use HipVMM:

Looks like the reason might be that the flag to disable VMM with HIP/ROCm is GGML_HIP_NO_VMM.

Could you please try it again, but change -DGGML_CUDA_NO_VMM=1 to -DGGML_HIP_NO_VMM=1?
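For reference, the Makefile line from my earlier suggestion would then look like this, with only the flag name swapped and everything else unchanged:

HIPFLAGS += -DGGML_USE_HIPBLAS -DGGML_USE_HIP -DGGML_HIP_NO_VMM=1 -DGGML_USE_CUDA -DSD_USE_CUBLAS $(shell $(ROCM_PATH)/bin/hipconfig -C)

followed by the same make clean and make LLAMA_HIPBLAS=1 -j10 rebuild.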

@tonatiuhmira
Author

Sorry I took so long to respond.
I just checked, and you've added the -DGGML_HIP_NO_VMM flag to the Makefile. The version from git now works out of the box. Thank you!
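For anyone else hitting this, the update I did was roughly the following (a sketch of my steps; gfx1030 matches my RX 6800, so adjust GPU_TARGETS for your card):

cd koboldcpp-rocm
git pull
make clean
make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1030 -j10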

@YellowRoseCx
Owner

> Sorry I took so long to respond. I just checked, and you've added the -DGGML_HIP_NO_VMM flag to the Makefile. The version from git now works out of the box. Thank you!

Awesome! I'm glad it fixed it for you. Thanks for informing me about the new flags.
