
HipVMM bug prevents loading any model on desktop systems #105

Closed
tonatiuhmira opened this issue Feb 12, 2025 · 6 comments
@tonatiuhmira

Hi, I'm running koboldcpp-rocm on Arch Linux with my RX 6800 and had no major issues until now. After a recent upgrade of the system ROCm packages, I'm getting an out-of-memory error from ggml-cuda.cu:

ggml/src/ggml-cuda/ggml-cuda.cu:444: HipVMM Failure: out of memory

As far as I can tell, this is a ROCm bug that has also been mentioned by the llama.cpp developers here. The proposed solution is to avoid using HIP VMM by setting -DGGML_CUDA_NO_VMM=1, but I'm not sure how to apply that to koboldcpp-rocm (and specifically on Arch; they are probably on Windows, I'm not sure). I have also tried building with

make GGML_USE_VMM=OFF LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1030 -j10

but I still get the same error, so I cannot load any model now.

What is the intended way to disable HipVMM?

@YellowRoseCx
Owner

Try adding -DGGML_CUDA_NO_VMM=1 to line 261 of the Makefile, where HIPFLAGS is defined, and see if it works: https://github.com/YellowRoseCx/koboldcpp-rocm/blob/main/Makefile#L261

like this:
HIPFLAGS += -DGGML_USE_HIPBLAS -DGGML_USE_HIP -DGGML_CUDA_NO_VMM=1 -DGGML_USE_CUDA -DSD_USE_CUBLAS $(shell $(ROCM_PATH)/bin/hipconfig -C)

Then save it, and in a terminal run make clean followed by make LLAMA_HIPBLAS=1 -j10.
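One way to double-check that the define actually reaches the compiler is to capture the build output and grep it. This is just a sketch (it assumes make echoes the compile commands; build.log is an arbitrary scratch file name):

make clean
make LLAMA_HIPBLAS=1 -j10 2>&1 | tee build.log
grep -c 'DGGML_CUDA_NO_VMM' build.log   # a non-zero count means the flag is present in the compile commands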

@tonatiuhmira
Author

Sorry, still getting the same error on a clean install. I verified that the flag was present in all steps of the compilation; however, when loading the model it still seems to use HipVMM:

Set AMD HSA_OVERRIDE_GFX_VERSION to 10.3.0
***
Welcome to KoboldCpp - Version 1.83.1.yr0-ROCm
Loading Chat Completions Adapter: /home/tonatiuh/workspace/koboldcpp-rocm/kcpp_adapters/AutoGuess.json
Chat Completions Adapter Loaded
No GPU or CPU backend was selected. Trying to assign one for you automatically...
Auto Selected CUDA Backend...

...
llama_init_from_model: n_ctx_per_seq (10368) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 10368, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  1944.00 MiB
llama_init_from_model: KV self size  = 1944.00 MiB, K (f16):  972.00 MiB, V (f16):  972.00 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.58 MiB
llama_init_from_model:      ROCm0 compute buffer size =   870.25 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    30.26 MiB
llama_init_from_model: graph nodes  = 1686
llama_init_from_model: graph splits = 2
ggml/src/ggml-cuda/ggml-cuda.cu:444: HipVMM Failure: out of memory

ptrace: Operation not permitted.
No stack.
The program is not being run.
[1]    186707 IOT instruction (core dumped)  python koboldcpp.py --gpulayers 49 --contextsize 10240 --model 

By the way, I tried with the AUR version (koboldcpp-hipblas), and it just works. The version in AUR is 1.82.4.yr0-1.

Set AMD HSA_OVERRIDE_GFX_VERSION to 10.3.0
***
Welcome to KoboldCpp - Version 1.82.4.yr0-ROCm
No GPU or CPU backend was selected. Trying to assign one for you automatically...
Unable to detect VRAM, please set layers manually.
Auto Selected CUDA Backend...
...
llama_init_from_model: KV self size  = 1944.00 MiB, K (f16):  972.00 MiB, V (f16):  972.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.58 MiB
llama_init_from_model:      ROCm0 compute buffer size =   916.08 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    30.26 MiB
llama_init_from_model: graph nodes  = 1686
llama_init_from_model: graph splits = 46 (with bs=512), 3 (with bs=1)
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
======
Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

(Model loaded is DeepSeek-R1-Distill-Qwen-14B-Q6_K.gguf)

@ManGuyNY

I'm experiencing the same issue on Arch with my 6800 XT, even with the added HIPFLAGS change. It does not seem to matter which model I load.


load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 1 of 339
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:          CPU model buffer size =   426.36 MiB
load_tensors:        ROCm0 model buffer size =  5532.43 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: using async uploads for device ROCm0, buffer type ROCm0, backend ROCm0
........................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4224
llama_init_from_model: n_ctx_per_seq = 4224
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4224) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4224, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   231.00 MiB
llama_init_from_model: KV self size  =  231.00 MiB, K (f16):  115.50 MiB, V (f16):  115.50 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.58 MiB
llama_init_from_model:      ROCm0 compute buffer size =   304.00 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    15.26 MiB
llama_init_from_model: graph nodes  = 986
llama_init_from_model: graph splits = 2
ggml/src/ggml-cuda/ggml-cuda.cu:444: HipVMM Failure: out of memory

@YellowRoseCx
Owner

> Sorry, still getting the same error on a clean install. I verified that the flag was present in all steps of the compilation; however, when loading the model it still seems to use HipVMM:

Looks like the reason might be that the flag to disable VMM with HIP/ROCm is GGML_HIP_NO_VMM.

Could you please try it again, but change -DGGML_CUDA_NO_VMM=1 to -DGGML_HIP_NO_VMM=1?
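For reference, the Makefile line from my earlier suggestion would then look like this, with only the flag name swapped and everything else unchanged:

HIPFLAGS += -DGGML_USE_HIPBLAS -DGGML_USE_HIP -DGGML_HIP_NO_VMM=1 -DGGML_USE_CUDA -DSD_USE_CUBLAS $(shell $(ROCM_PATH)/bin/hipconfig -C)

followed by the same make clean and make LLAMA_HIPBLAS=1 -j10 rebuild.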

@tonatiuhmira
Author

Sorry I took so long to respond.
I just checked, and you've added the -DGGML_HIP_NO_VMM flag to the Makefile. The version from git now works out of the box. Thank you!
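For anyone else hitting this, the update I did was roughly the following (a sketch of my steps; gfx1030 matches my RX 6800, so adjust GPU_TARGETS for your card):

cd koboldcpp-rocm
git pull
make clean
make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1030 -j10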

@YellowRoseCx
Owner

> Sorry I took so long to respond. I just checked, and you've added the -DGGML_HIP_NO_VMM flag to the Makefile. The version from git now works out of the box. Thank you!

Awesome! I'm glad it fixed it for you. Thanks for informing me about the new flags.
