Misc. bug: commit 5f0db95 breaks model loading on some AMD gpus #11405
cc @IMbackK
The RX 5700 XT, and RDNA1 in general, is not supported by ROCm, so VMM is probably just broken there. But to cover our bases, could you specify the versions of rocm/rocr you are using? Also, are you running amdgpu-dkms with kfd, or amdgpu from the mainline Linux kernel? I would recommend trying the second reproducer here ROCm/ROCR-Runtime#285 (the one that's supposed to work) and filing another issue against rocr. Anyhow, we will have to disable VMM on RDNA1 in the meantime.
I'm currently building rocm 6.1.2. rocm, as you said, isn't officially supported on RDNA1.
OK, could you try rocm 6.2.4 as shipped in the Arch repos? Also, not a fix, but could you try with https://github.com/ROCm/ROCK-Kernel-Driver? This is what AMD themselves use, and rocr on this kernel uses an entirely different kernel interface to allocate VRAM. We need to know exactly which configurations to disable VMM on. I personally only tested MI100 and RX 6800 XT on rocm 6.3 with current git rocr, with both the kfd and the regular drm kernel paths.
I'm experiencing the same on my Framework 16, both on the discrete (RX 7700S, gfx1102) and integrated (Radeon 780M, gfx1103) GPUs.
Using kernel Disabling VMM with
Hmm, OK, that's annoying. I guess I'll disable it for anything except gfx9 and gfx103x for now.
Could you guys try the reproducer I linked above and file an issue against rocr?
I might be stupid, but this means I have to build the kernel in the repo, right?
I can also test on a gfx1030 machine (RX 6700 iirc)
I've just tested on arch's package and the issue is the same. Here's the output of the reproducer that you've posted in the rocr issue:
These are different things: the reproducer is useful to file a bug against rocr, while using the kfd kernel may solve the issue (but it works fine on the mainline kernel here).
Please take this result and the reproducer and file another issue against rocr with as much info as possible.
Is it worth reporting even if the card isn't supported? |
...Yeah, it failed too.
With the same HIP from the official repos.
I don't understand how this works then... Sorry, I haven't touched the kernel side of Linux much at all.
Hmm, that's even more strange now, since it works fine on gfx1030 here.
Disabling VMM on the gfx1030 machine works too |
OK, so I downgraded everything to the regular Arch Linux packages, including the Arch Linux 6.12.10.arch1-1 kernel, and it still works fine:
So, lacking a reasonable explanation for why this works here but not in your cases, I'm just going to disable it by default for now.
Please do; especially with the reports from the other machines here, that should be fine.
If it helps, here's the backtrace from the gfx1102:
And from the gfx1030:
I'm currently installing Ubuntu on another partition to install the official 6.3.1 packages and create the report from there. Is it necessary to use the kernel driver for the issue to be useful? |
No, but mention that you are using the mainline kernel.
Could you guys try with "iommu=pt" on the kernel cmdline?
And have the IOMMU enabled in the BIOS.
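For anyone trying this, a sketch of how the parameter could be added on a GRUB-based distro (bootloader and file paths vary; `iommu=pt` switches the IOMMU into passthrough mode):

```shell
# Sketch: add iommu=pt to the kernel command line on a GRUB system.
# Edit /etc/default/grub and append the parameter, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"
# Then regenerate the config and reboot:
sudo grub-mkconfig -o /boot/grub/grub.cfg
# After rebooting, confirm the parameter is active:
grep -o 'iommu=pt' /proc/cmdline
```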
No BIOS setting for it, but with the kernel setting it still fails on the Framework.
I get some more info about the error on Ubuntu.
Does your dmesg contain perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank) ... ?
yup
The gfx1030 machine is the same (same dmesg, same crash)
anything else for me to try? |
No, I'm out of ideas, so I'm just going to disable it for now. I am on a Zen2 Epyc; maybe for some reason it doesn't work on consumer platforms.
@MangoTCF Please add your me too with details on your configuration to the issue created by @daniandtheweb |
This issue is also related to this commit: #11421 However, in my case I'm able to load the model, but the generation doesn't work properly at all.
This one? |
@MangoTCF He means this: ROCm/ROCR-Runtime#287 |
Oops. Did that now
Name and Version
version: 4549 (466ea66)
built with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server, llama-bench
Command line
./llama-bench -m ~/Applications/chat/gguf/llama-2-7b.Q4_0.gguf -ngl 100
Problem description & steps to reproduce
Commit 5f0db95, specifically VMM support, seems to break model loading on Radeon RX 5700 XT.
No model I try loads properly, no matter how small it is.
Disabling VMM at build time with GGML_CUDA_NO_VMM=ON solves the issue.
First Bad Commit
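For reference, a sketch of the workaround build (assuming a current CMake-based ROCm build of llama.cpp; the exact backend flag name may differ by version — only GGML_CUDA_NO_VMM comes from this report):

```shell
# Sketch: configure llama.cpp with the ROCm backend and VMM disabled.
cmake -B build -DGGML_HIP=ON -DGGML_CUDA_NO_VMM=ON
cmake --build build --config Release -j
```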
5f0db95
Relevant log output