
Eval bug: llama-server generating single letter in a loop and halting (ROCm/Windows) #11421

Closed
SteelPh0enix opened this issue Jan 25, 2025 · 12 comments


SteelPh0enix commented Jan 25, 2025

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: yes
version: 4548 (5f0db95)
built with clang version 19.0.0git ([email protected]:Compute-Mirrors/llvm-project 5353ca3e0e5ae54a31eeebe223da212fa405567a) for x86_64-pc-windows-msvc

Operating systems

Windows

GGML backends

HIP

Hardware

Ryzen 9 5900X w/ RX 7900XT

Models

DeepSeek-R1 Llama3.1 8B quant (q6_k)
Hermes Llama3.2 3B quant (q8)
Both were quantized using llama-quantize from raw weights; however, the model probably doesn't matter in this case.
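
For completeness, the quantization step looks roughly like this (the conversion script name is the one shipped with llama.cpp; paths and file names are placeholders, not the exact ones from my setup):

# Convert the raw HF weights to GGUF first, then quantize to Q6_K (placeholder paths).
python convert_hf_to_gguf.py "C:/models/DeepSeek-R1-Distill-Llama-8B" --outfile "C:/models/deepseek-r1-llama-8b-f16.gguf"
llama-quantize "C:/models/deepseek-r1-llama-8b-f16.gguf" "C:/models/deepseek-r1-llama-8b-q6_k.gguf" Q6_K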

Problem description & steps to reproduce

The model generates a single letter in a loop. After trying to stop it, the server halts indefinitely and stops responding; stopping the generation via the web UI does not stop it (even though the "stop" event is logged) and the GPU keeps working. It's also impossible to kill the server via Ctrl+C; killing the parent process is required (and in some cases even that doesn't help, so I have to kill it from Task Manager).

UPDATE: The halting issue has already been resolved thanks to @ngxson.
However, the main generation issue still persists.

Image

This is how I build llama.cpp:

Function llama-cpp-build-rocm {
    llm-venv-activate
    Write-Host "Building llama.cpp for ROCm..."

    Push-Location $env:LLAMA_CPP_PATH
    cmake -S . -B build -G Ninja `
        -DCMAKE_BUILD_TYPE=Release `
        -DCMAKE_CXX_COMPILER=clang++ `
        -DCMAKE_C_COMPILER=clang `
        -DCMAKE_INSTALL_PREFIX="C:/Users/phoen/AppData/Local/llama-cpp" `
        -DLLAMA_BUILD_TESTS=OFF `
        -DLLAMA_BUILD_EXAMPLES=ON `
        -DLLAMA_BUILD_SERVER=ON `
        -DLLAMA_STANDALONE=ON `
        -DLLAMA_CURL=OFF `
        -DGGML_CCACHE=ON `
        -DGGML_NATIVE=ON `
        -DGGML_OPENMP=ON `
        -DGGML_AVX=ON `
        -DGGML_AVX2=ON `
        -DGGML_FMA=ON `
        -DGGML_HIP=ON `
        -DAMDGPU_TARGETS=gfx1100 `
        -DGGML_CUDA_FA_ALL_QUANTS=ON 

    cmake --build build --config Release --parallel 24
    cmake --install build --config Release
    Pop-Location
    Write-Host "llama.cpp build completed!"
}
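
The server launch command itself is not part of that function; a rough reconstruction from the settings visible in the log below looks like this (the model path and host binding are placeholders):

# Launch parameters matching the log: 64k context, batch 2048, ubatch 256, flash attention on, full GPU offload.
llama-server `
    --model "C:/models/deepseek-r1-llama-8b-q6_k.gguf" `
    --ctx-size 65536 `
    --batch-size 2048 `
    --ubatch-size 256 `
    --flash-attn `
    --n-gpu-layers 99 `
    --host 0.0.0.0 `
    --port 51536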

First Bad Commit

I've pinpointed it to the b4548 release; the previous one works fine.
5f0db95
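
For anyone who wants to narrow it down the same way, a plain git bisect between the last good release and this one works (a sketch; b4547 as the last good tag is an assumption):

# Bisect between the last known-good release tag (assumed to be b4547) and the first bad commit (5f0db95),
# rebuilding with the function above and re-testing llama-server at each step.
git bisect start
git bisect bad 5f0db95
git bisect good b4547
# ... rebuild, test, then mark the current commit:
git bisect good    # or: git bisect bad
git bisect reset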

Relevant log output

llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 65536
llama_init_from_model: n_ctx_per_seq = 65536
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 256
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (65536) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 65536, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  7168.00 MiB
llama_init_from_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.49 MiB
llama_init_from_model:      ROCm0 compute buffer size =   128.25 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    67.00 MiB
llama_init_from_model: graph nodes  = 791
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 65536
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 65536
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://steelph0enix.pc:51536 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 22
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 22, n_tokens = 22, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 22, n_tokens = 22
srv  cancel_tasks: cancel task, id_task = 0
request: POST /v1/chat/completions 192.168.0.150 200
srv  cancel_tasks: cancel task, id_task = 597
request: POST /v1/chat/completions 192.168.0.150 200
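
For reference, the /v1/chat/completions requests at the end of the log can be reproduced with something like this (the payload is illustrative, not the exact one from my session):

# Minimal OpenAI-compatible chat request against the server instance from the log above.
$body = @{
    model    = "deepseek-r1-llama-8b"   # placeholder; llama-server serves whatever model it has loaded
    messages = @(
        @{ role = "user"; content = "Hello, how are you?" }
    )
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Method Post `
    -Uri "http://steelph0enix.pc:51536/v1/chat/completions" `
    -ContentType "application/json" `
    -Body $body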
ngxson (Collaborator) commented Jan 25, 2025

It was already fixed in #11418 an hour ago. You should update your llama-server.

SteelPh0enix (Author) commented Jan 25, 2025

It was already fixed in #11418 an hour ago. You should update your llama-server.

This does not fix the issue; I tested the latest commit while I was writing up this issue.
The last working commit is the one before the broken one.

SteelPh0enix (Author) commented:

It was already fixed in #11418 an hour ago. You should update your llama-server.

I should clarify: this does not fix the generation issue.
Your MR does fix the issue with the server halting and not being able to stop the generation.

ngxson (Collaborator) commented Jan 25, 2025

Your version is:
version: 4548 (5f0db95)

It is from 18 hours ago, not the latest.

SteelPh0enix (Author) commented Jan 25, 2025

Your version is: version: 4548 (5f0db95)

It is from 18 hours ago, not the latest.

Yes, because that's the first version that breaks generation on the ROCm build.
All versions after it are broken as well.

ngxson (Collaborator) commented Jan 25, 2025

As for the problem with repeated generation (GGGGG....), it is not a server problem. You should test again without the GPU (-ngl 0) or with llama-cli to see if it makes any difference.
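
A sketch of that check (the model path is a placeholder):

# CPU-only run to take the ROCm backend out of the picture (-ngl 0 keeps all layers on the CPU).
llama-cli -m "C:/models/deepseek-r1-llama-8b-q6_k.gguf" -ngl 0 -p "Hello, how are you?" -n 64

# Same prompt with full GPU offload for comparison.
llama-cli -m "C:/models/deepseek-r1-llama-8b-q6_k.gguf" -ngl 99 -p "Hello, how are you?" -n 64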

SteelPh0enix (Author) commented Jan 25, 2025

As for the problem with repeated generation (GGGGG....), it is not a server problem. You should test again without the GPU (-ngl 0) or with llama-cli to see if it makes any difference.

Yes, I am aware it's not a server-specific issue but rather a backend-related one.
I have tested the latest version with the Vulkan backend and it works fine, but the performance is noticeably worse.
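
(For reference, a Vulkan build only needs the backend flags swapped in the build function above; a sketch, not my exact configure line:)

# Same source tree, separate build directory, Vulkan backend instead of HIP.
cmake -S . -B build-vulkan -G Ninja `
    -DCMAKE_BUILD_TYPE=Release `
    -DLLAMA_BUILD_SERVER=ON `
    -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release --parallel 24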

ngxson (Collaborator) commented Jan 25, 2025

Could be related to #11420

SteelPh0enix (Author) commented:

Might be; I'll pull that MR and check if it fixes the issue.
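
(Checking out the PR locally for a test build can be done straight from the GitHub pull refs, assuming origin points at the upstream llama.cpp repo:)

# Fetch PR #11420 as a local branch, then rebuild with the ROCm build function above.
git fetch origin pull/11420/head:pr-11420-vmm
git checkout pr-11420-vmm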

SteelPh0enix (Author) commented:

Yup, the VMM fix also fixes this issue.
Closing this, since the VMM fix is going to be merged soon.
Thanks @ngxson!

IMbackK (Collaborator) commented Jan 28, 2025

@SteelPh0enix, could you try the reproducer here (the original one with 32GB VM): ROCm/ROCR-Runtime#287, and add a "me too" with your system configuration to that issue if something goes wrong?

SteelPh0enix (Author) commented:

@SteelPh0enix, could you try the reproducer here (the original one with 32GB VM): ROCm/ROCR-Runtime#287, and add a "me too" with your system configuration to that issue if something goes wrong?

Sure, I'll try to do that in the meantime, but for now I'm extremely busy. Ping me closer to the weekend if I haven't done it by then.
