Eval bug: llama-server generating single letter in a loop and halting (ROCm/Windows) #11421
Comments
It's already fixed in #11418, 1 hour ago. You should update your llama-server.
This does not fix that issue; I tested the latest commit while I was writing up this issue.
I should clarify: this does not fix the generation issue.
Your version is from 18 hours ago, not the latest.
Yes, because that's the first version that breaks generation on the ROCm build.
For the problem with repeated generation (GGGGG....), it is not a server problem. You should test again without the GPU (-ngl 0) or with llama-cli to see if it makes any difference.
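For illustration, a CPU-only test along these lines would rule the HIP backend in or out (a sketch; the model path is a placeholder):

```
# run llama-cli with zero layers offloaded to the GPU (pure CPU inference)
llama-cli -m ./model-q6_k.gguf -ngl 0 -p "Hello"

# or restart the server the same way and retry the request
llama-server -m ./model-q6_k.gguf -ngl 0
```

If generation is clean at -ngl 0 but broken with offload, the regression is in the HIP backend rather than in the server.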
Yes, I am aware it's not a server-specific issue but rather a backend-related one.
Could be related to #11420 |
Might be; I'll pull this PR and check if it fixes the issue.
Yup, the VMM fix also fixes this issue. |
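For reference, another way to test the VMM connection, independently of that PR, is to build with VMM disabled outright. This assumes the GGML_CUDA_NO_VMM CMake option, which the HIP backend shares with the CUDA code path:

```
# sketch: rebuild with virtual-memory-management allocation turned off
cmake -S . -B build -DGGML_HIP=ON -DGGML_CUDA_NO_VMM=ON
cmake --build build --config Release
```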
@SteelPh0enix, could you try the reproducer here (the original one, with the 32GB VM): ROCm/ROCR-Runtime#287, and add your "me too" with your system configuration to that issue if something goes wrong?
Sure, I'll try to do that in the meantime, but for now I'm extremely busy. Ping me closer to the weekend if I haven't done it by then.
Name and Version
```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: yes
version: 4548 (5f0db95)
built with clang version 19.0.0git ([email protected]:Compute-Mirrors/llvm-project 5353ca3e0e5ae54a31eeebe223da212fa405567a) for x86_64-pc-windows-msvc
```
Operating systems
Windows
GGML backends
HIP
Hardware
Ryzen 9 5900X w/ RX 7900XT
Models
DeepSeek-R1 Llama3.1 8B quant (q6_k)
Hermes Llama3.2 3B quant (q8)
Both quantized from raw weights using llama-quantize; however, the model probably doesn't matter in this case.
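For reference, the quantization step looks roughly like this (a sketch; file names are hypothetical, and the raw weights are assumed to have been converted to an f16 GGUF first via convert_hf_to_gguf.py from the llama.cpp repo):

```
# quantize the f16 GGUF down to Q6_K; input/output names are placeholders
llama-quantize ./model-f16.gguf ./model-Q6_K.gguf Q6_K
```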
Problem description & steps to reproduce
The model generates a single letter in a loop. After trying to stop it, the server halts indefinitely and stops responding; stopping the generation via the web UI does not stop it (even though the "stop" event is logged), and the GPU keeps working. It's also impossible to kill via Ctrl+C; killing the parent process is required (in some cases even that doesn't help and I have to kill it from Task Manager).
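A minimal reproduction sketch, assuming full GPU offload (the model path, -ngl value, and default port are assumptions, not the exact invocation used here):

```
# start the server with all layers offloaded to the GPU
llama-server -m ./DeepSeek-R1-Llama3.1-8B-q6_k.gguf -ngl 99
# then open the built-in web UI at http://127.0.0.1:8080 and submit any prompt;
# generation degenerates into one repeated letter
```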
UPDATE: The halting issue is already resolved, thanks to @ngxson.
However, the main generation issue still persists.
This is how I build llama.cpp:
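The build script itself wasn't captured in this report. As a sketch, a typical HIP build on Windows per llama.cpp's build documentation of this era would look like the following (gfx1100 matches the RX 7900 XT; the generator and compiler choices are assumptions):

```
REM configure the HIP backend for RDNA3 (gfx1100 = RX 7900 XT) with ROCm's clang
cmake -S . -B build -G Ninja ^
  -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ^
  -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build
```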
First Bad Commit
I've pinpointed it to the b4548 release; the previous one works fine.
5f0db95
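For anyone retracing the regression, a bisect between the release tags is one way to confirm the first bad commit (a sketch; the preceding release is assumed to be tagged b4547):

```
# walk the commit range between the last good and first bad releases
git bisect start
git bisect bad  b4548    # first release where generation breaks
git bisect good b4547    # previous release, reported as working
# rebuild and test generation at each step, then mark it: git bisect good|bad
```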
Relevant log output