
Eval bug: llama-server generating single letter in a loop and halting (ROCm/Windows) #11421

Closed
SteelPh0enix opened this issue Jan 25, 2025 · 12 comments


SteelPh0enix commented Jan 25, 2025

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: yes
version: 4548 (5f0db95)
built with clang version 19.0.0git ([email protected]:Compute-Mirrors/llvm-project 5353ca3e0e5ae54a31eeebe223da212fa405567a) for x86_64-pc-windows-msvc

Operating systems

Windows

GGML backends

HIP

Hardware

Ryzen 9 5900X w/ RX 7900XT

Models

DeepSeek-R1 Llama3.1 8B quant (q6_k)
Hermes Llama3.2 3B quant (q8)
Both were quantized using llama-quantize from raw weights; however, the model probably doesn't matter in this case.
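
For completeness, the quantization step looks roughly like this (the conversion script name is the one shipped with llama.cpp; paths and file names are placeholders, not the exact ones from my setup):

# Convert the raw HF weights to GGUF first, then quantize to Q6_K (placeholder paths).
python convert_hf_to_gguf.py "C:/models/DeepSeek-R1-Distill-Llama-8B" --outfile "C:/models/deepseek-r1-llama-8b-f16.gguf"
llama-quantize "C:/models/deepseek-r1-llama-8b-f16.gguf" "C:/models/deepseek-r1-llama-8b-q6_k.gguf" Q6_K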

Problem description & steps to reproduce

The model generates a single letter in a loop. After trying to stop it, the server halts indefinitely and stops responding; stopping the generation via the web UI does not stop it (even though the "stop" event is logged) and the GPU keeps working. It's also impossible to kill the server via Ctrl+C; killing the parent process is required (and in some cases even that doesn't help, so I have to kill it from Task Manager).

UPDATE: The halting issue has already been resolved thanks to @ngxson.
However, the main generation issue still persists.

Image

This is how I build llama.cpp:

Function llama-cpp-build-rocm {
    llm-venv-activate
    Write-Host "Building llama.cpp for ROCm..."

    Push-Location $env:LLAMA_CPP_PATH
    cmake -S . -B build -G Ninja `
        -DCMAKE_BUILD_TYPE=Release `
        -DCMAKE_CXX_COMPILER=clang++ `
        -DCMAKE_C_COMPILER=clang `
        -DCMAKE_INSTALL_PREFIX="C:/Users/phoen/AppData/Local/llama-cpp" `
        -DLLAMA_BUILD_TESTS=OFF `
        -DLLAMA_BUILD_EXAMPLES=ON `
        -DLLAMA_BUILD_SERVER=ON `
        -DLLAMA_STANDALONE=ON `
        -DLLAMA_CURL=OFF `
        -DGGML_CCACHE=ON `
        -DGGML_NATIVE=ON `
        -DGGML_OPENMP=ON `
        -DGGML_AVX=ON `
        -DGGML_AVX2=ON `
        -DGGML_FMA=ON `
        -DGGML_HIP=ON `
        -DAMDGPU_TARGETS=gfx1100 `
        -DGGML_CUDA_FA_ALL_QUANTS=ON 

    cmake --build build --config Release --parallel 24
    cmake --install build --config Release
    Pop-Location
    Write-Host "llama.cpp build completed!"
}
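
The server launch command itself is not part of that function; a rough reconstruction from the settings visible in the log below looks like this (the model path and host binding are placeholders):

# Launch parameters matching the log: 64k context, batch 2048, ubatch 256, flash attention on, full GPU offload.
llama-server `
    --model "C:/models/deepseek-r1-llama-8b-q6_k.gguf" `
    --ctx-size 65536 `
    --batch-size 2048 `
    --ubatch-size 256 `
    --flash-attn `
    --n-gpu-layers 99 `
    --host 0.0.0.0 `
    --port 51536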

First Bad Commit

I've pinpointed it to the b4548 release; the previous one works fine.
5f0db95
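
For anyone who wants to narrow it down the same way, a plain git bisect between the last good release and this one works (a sketch; b4547 as the last good tag is an assumption):

# Bisect between the last known-good release tag (assumed to be b4547) and the first bad commit (5f0db95),
# rebuilding with the function above and re-testing llama-server at each step.
git bisect start
git bisect bad 5f0db95
git bisect good b4547
# ... rebuild, test, then mark the current commit:
git bisect good    # or: git bisect bad
git bisect reset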

Relevant log output

llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 65536
llama_init_from_model: n_ctx_per_seq = 65536
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 256
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (65536) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 65536, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  7168.00 MiB
llama_init_from_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_init_from_model:  ROCm_Host  output buffer size =     0.49 MiB
llama_init_from_model:      ROCm0 compute buffer size =   128.25 MiB
llama_init_from_model:  ROCm_Host compute buffer size =    67.00 MiB
llama_init_from_model: graph nodes  = 791
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 65536
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 65536
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://steelph0enix.pc:51536 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 22
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 22, n_tokens = 22, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 22, n_tokens = 22
srv  cancel_tasks: cancel task, id_task = 0
request: POST /v1/chat/completions 192.168.0.150 200
srv  cancel_tasks: cancel task, id_task = 597
request: POST /v1/chat/completions 192.168.0.150 200
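
For reference, the /v1/chat/completions requests at the end of the log can be reproduced with something like this (the payload is illustrative, not the exact one from my session):

# Minimal OpenAI-compatible chat request against the server instance from the log above.
$body = @{
    model    = "deepseek-r1-llama-8b"   # placeholder; llama-server serves whatever model it has loaded
    messages = @(
        @{ role = "user"; content = "Hello, how are you?" }
    )
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Method Post `
    -Uri "http://steelph0enix.pc:51536/v1/chat/completions" `
    -ContentType "application/json" `
    -Body $body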
ngxson (Collaborator) commented Jan 25, 2025

It was already fixed in #11418 an hour ago. You should update your llama-server.

SteelPh0enix (Author) commented Jan 25, 2025

It was already fixed in #11418 an hour ago. You should update your llama-server.

This does not fix the issue; I tested the latest commit while I was writing up this issue.
The last working commit is the one before the broken one.

SteelPh0enix (Author) commented:

It was already fixed in #11418 an hour ago. You should update your llama-server.

I should clarify: this does not fix the generation issue.
Your MR does fix the issue with the server halting and not being able to stop the generation.

ngxson (Collaborator) commented Jan 25, 2025

Your version is:
version: 4548 (5f0db95)

It is from 18 hours ago, not the latest.

SteelPh0enix (Author) commented Jan 25, 2025

Your version is: version: 4548 (5f0db95)

It is from 18 hours ago, not the latest.

Yes, because that's the first version that breaks generation on the ROCm build.
All versions after it are broken as well.

ngxson (Collaborator) commented Jan 25, 2025

As for the problem with repeated generation (GGGGG....), it is not a server problem. You should test again without the GPU (-ngl 0) or with llama-cli to see if it makes any difference.
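
A sketch of that check (the model path is a placeholder):

# CPU-only run to take the ROCm backend out of the picture (-ngl 0 keeps all layers on the CPU).
llama-cli -m "C:/models/deepseek-r1-llama-8b-q6_k.gguf" -ngl 0 -p "Hello, how are you?" -n 64

# Same prompt with full GPU offload for comparison.
llama-cli -m "C:/models/deepseek-r1-llama-8b-q6_k.gguf" -ngl 99 -p "Hello, how are you?" -n 64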

SteelPh0enix (Author) commented Jan 25, 2025

As for the problem with repeated generation (GGGGG....), it is not a server problem. You should test again without the GPU (-ngl 0) or with llama-cli to see if it makes any difference.

Yes, I am aware it's not a server-specific issue but rather a backend-related one.
I have tested the latest version with the Vulkan backend and it works fine, but the performance is noticeably worse.
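
(For reference, a Vulkan build only needs the backend flags swapped in the build function above; a sketch, not my exact configure line:)

# Same source tree, separate build directory, Vulkan backend instead of HIP.
cmake -S . -B build-vulkan -G Ninja `
    -DCMAKE_BUILD_TYPE=Release `
    -DLLAMA_BUILD_SERVER=ON `
    -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release --parallel 24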

ngxson (Collaborator) commented Jan 25, 2025

Could be related to #11420

SteelPh0enix (Author) commented:

Might be; I'll pull that MR and check if it fixes the issue.
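
(Checking out the PR locally for a test build can be done straight from the GitHub pull refs, assuming origin points at the upstream llama.cpp repo:)

# Fetch PR #11420 as a local branch, then rebuild with the ROCm build function above.
git fetch origin pull/11420/head:pr-11420-vmm
git checkout pr-11420-vmm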

SteelPh0enix (Author) commented:

Yup, the VMM fix also fixes this issue.
Closing this, since the VMM fix is going to be merged soon.
Thanks @ngxson!

IMbackK (Collaborator) commented Jan 28, 2025

@SteelPh0enix, could you try the reproducer here (the original one with 32GB VM): ROCm/ROCR-Runtime#287, and add a "me too" with your system configuration to that issue if something goes wrong?

SteelPh0enix (Author) commented:

@SteelPh0enix, could you try the reproducer here (the original one with 32GB VM): ROCm/ROCR-Runtime#287, and add a "me too" with your system configuration to that issue if something goes wrong?

Sure, I'll try to do that in the meantime, but for now I'm extremely busy. Ping me closer to the weekend if I haven't done it by then.
