GPU: Error response from daemon: invalid volume specification #1564

Open
mzernovx opened this issue Oct 12, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@mzernovx

mzernovx commented Oct 12, 2023

Environment:

  • kubernetes 1.27.3
  • docker v20.10.20

Steps to reproduce:

  • Setup Intel Device Plugins
  • Create any pod with gpu.intel.com/i915 resource allocated
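
A minimal sketch of such a pod, for anyone trying to reproduce this. The pod name, container name, image, and command are hypothetical placeholders; only the gpu.intel.com/i915 resource name comes from the report.

```shell
#!/bin/sh
# Print a minimal pod manifest that requests one Intel GPU.
# Pod name, container name, image, and command are placeholders.
gpu_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-test
    image: busybox
    command: ["ls", "-l", "/dev/dri"]
    resources:
      limits:
        gpu.intel.com/i915: 1
EOF
}

gpu_pod_manifest
# To submit it to a cluster (requires kubectl and the GPU plugin):
#   gpu_pod_manifest | kubectl apply -f -
```

Running the script only prints the manifest; the commented kubectl line is what would actually create the pod.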

Expected behaviour: pod running

Actual behaviour:
pod in CreateContainerError state
Warning Failed 2m49s (x12 over 5m3s) kubelet Error: Error response from daemon: invalid volume specification: '/dev/dri/by-path/pci-0000:b7:00.0-card:/dev/dri/by-path/pci-0000:b7:00.0-card:ro'

Likely caused by this commit: 943e34f

@tkatila added the bug label Oct 12, 2023
@tkatila
Contributor

tkatila commented Oct 12, 2023

Thanks for reporting this. Did you verify that it only happens with the Docker runtime?

@tkatila
Contributor

tkatila commented Oct 12, 2023

The change causing this was introduced in version 0.26.1. You can work around it by using 0.26.0 in the meantime.
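
One possible way to pin the older release, sketched under the assumption that the plugin was deployed with the kustomize overlay from the intel/intel-device-plugins-for-kubernetes repository. The thread does not say how the plugin was installed, so the URL below is an assumption, not a confirmed procedure.

```shell
#!/bin/sh
# Build the kustomize URL for a given GPU plugin release tag.
# Assumption: the repository path and the deployments/gpu_plugin
# overlay are not stated in this thread.
gpu_plugin_url() {
  printf 'https://github.com/intel/intel-device-plugins-for-kubernetes/deployments/gpu_plugin?ref=%s\n' "$1"
}

gpu_plugin_url v0.26.0
# To apply it to a cluster (requires kubectl):
#   kubectl apply -k "$(gpu_plugin_url v0.26.0)"
```

Installations made through an operator or Helm chart would need the version pinned there instead.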

@mythi
Contributor

mythi commented Oct 12, 2023

I remember we have had similar cases with volume mounts where the paths contained colons and Docker was used. Is Docker mandatory here, or could a proper CRI runtime be used?

@mzernovx
Author

@tkatila I can confirm that with containerd it's working fine.
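
For readers who want to check which runtime their nodes use, here is a small sketch. Kubernetes reports the runtime per node as a string such as docker://24.0.6 in .status.nodeInfo.containerRuntimeVersion; the helper name runtime_name is made up for illustration.

```shell
#!/bin/sh
# Extract the runtime name from a containerRuntimeVersion string
# such as "docker://24.0.6" or "containerd://1.7.2".
runtime_name() {
  printf '%s\n' "${1%%://*}"
}

runtime_name 'docker://24.0.6'
# On a live cluster (requires kubectl), list the raw strings per node:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
```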

@mzernovx
Author

> I remember we have had similar cases with volume mounts where the paths contained colons and Docker was used. Is Docker mandatory here, or could a proper CRI runtime be used?

BMRA/VMRA uses docker as a default container runtime.

@eero-t
Contributor

eero-t commented Oct 12, 2023

> docker v20.10.20

That's a bit old. The oldest Docker version listed on, for example, the Ubuntu packages site is v20.10.21, and Ubuntu 20.04 LTS updates are already at 24.0.5: https://packages.ubuntu.com/focal-updates/docker.io

Have you tried any newer Docker version?

> kubernetes 1.27.3
> ...
> BMRA/VMRA uses docker as a default container runtime.

They could consider updating that default, as Kubernetes deprecated its built-in Docker support (dockershim) in v1.20 and removed it in v1.24: https://kubernetes.io/blog/2020/12/02/dont-panic-kubernetes-and-docker/

@tkatila
Contributor

tkatila commented Oct 13, 2023

> Have you tried any newer Docker version?

I tried a newer version and it reproduces with it:

$ dpkg --list | grep Docker
ii  docker-buildx-plugin                             0.11.2-1~ubuntu.22.04~jammy                 amd64        Docker Buildx cli plugin.
ii  docker-ce                                        5:24.0.6-1~ubuntu.22.04~jammy               amd64        Docker: the open-source application container engine
ii  docker-ce-cli                                    5:24.0.6-1~ubuntu.22.04~jammy               amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras                        5:24.0.6-1~ubuntu.22.04~jammy               amd64        Rootless support for Docker.
ii  docker-compose-plugin                            2.21.0-1~ubuntu.22.04~jammy                 amd64        Docker Compose (V2) plugin for the Docker CLI.

Pod fails with:

  Warning  Failed     8s (x2 over 9s)  kubelet            Error: Error response from daemon: invalid volume specification: '/dev/dri/by-path/pci-0000:00:02.0-card:/dev/dri/by-path/pci-0000:00:02.0-card:ro'

Docker Engine is still listed among the container runtimes in the Kubernetes docs (https://kubernetes.io/docs/setup/production-environment/container-runtimes/#docker), which suggests it's still "ok" to use.

To me, though, this is a Docker Engine bug, since the same mount works fine with containerd and CRI-O. My plan is:

  1. File a bug against Docker Engine about its inability to mount paths containing a colon (:).
  2. In https://github.com/intel/container-experience-kits, stick with the 0.26.0 GPU plugin for Docker installations.
  3. If/when the Docker Engine bug is resolved, update the GPU plugin to the latest version.

I do not want to remove the "by-path" mounting, as it's required by distributed training. And adding a CLI argument or environment variable to temporarily disable it feels icky.

@tkatila
Contributor

tkatila commented Oct 13, 2023

It seems that a colon in volumes/binds is a known issue:
docker/docker-py#2041
moby/moby#39293
moby/moby#22825
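
A simplified illustration of why the colon is a problem. The bind specification in the error message has the form source:target:mode, so extra colons inside the paths make the field boundaries ambiguous. The sketch below only counts colon-separated fields; it is a model of the ambiguity, not Docker's actual parser, and count_fields is a made-up helper.

```shell
#!/bin/sh
# Count the colon-separated fields in a bind specification.
# This models the ambiguity only; Docker's real parser is more involved.
count_fields() {
  printf '%s\n' "$1" | awk -F: '{ print NF }'
}

# A path without colons yields the expected three fields.
count_fields '/dev/dri/card0:/dev/dri/card0:ro'
# The by-path name from the error message yields seven.
count_fields '/dev/dri/by-path/pci-0000:00:02.0-card:/dev/dri/by-path/pci-0000:00:02.0-card:ro'
```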

@mzernovx
Author

It looks like Docker's --mount argument works around this, but there is no clear way to use it from the Kubernetes side.
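
For reference, a sketch of what the --mount form looks like when Docker is invoked by hand. It uses comma-separated key=value pairs, so colons in the paths are not field separators. The helper name bind_mount_arg and the busybox image are placeholders, and the commented docker command has not been verified against a GPU device here.

```shell
#!/bin/sh
# Build a --mount argument for a read-only bind mount.
# $1 = source path on the host, $2 = target path in the container.
bind_mount_arg() {
  printf 'type=bind,source=%s,target=%s,readonly' "$1" "$2"
}

dev='/dev/dri/by-path/pci-0000:00:02.0-card'
echo "--mount $(bind_mount_arg "$dev" "$dev")"
# Manual check outside Kubernetes (image name is a placeholder):
#   docker run --rm --mount "$(bind_mount_arg "$dev" "$dev")" busybox ls -l "$dev"
```

This only helps for manual testing; the kubelet builds the bind specification itself, which is why it does not solve the problem for pods.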

The most suitable fix for this bug seems to be to avoid using /dev/dri/by-path/xxx, as those paths are basically symlinks to devices in /dev/dri.
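
A quick way to see which device node each by-path entry resolves to, as a sketch. The function name resolve_links is made up, and readlink -f assumes GNU coreutils or a compatible implementation.

```shell
#!/bin/sh
# Print the device node each symlink in a directory points to.
resolve_links() {
  for link in "$1"/*; do
    # Skip anything that is not a symlink (including an unmatched glob).
    [ -L "$link" ] || continue
    printf '%s -> %s\n' "$link" "$(readlink -f "$link")"
  done
}

# Prints nothing on hosts without /dev/dri/by-path.
resolve_links /dev/dri/by-path
```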

@mythi
Contributor

mythi commented Oct 13, 2023

> The most suitable fix for this bug seems to be to avoid using /dev/dri/by-path/xxx

Is avoiding Docker not an option?

@mzernovx
Author

mzernovx commented Oct 13, 2023

> Is avoiding Docker not an option?

@mythi BMRA/VMRA still uses Docker as its "primary" container runtime. The product is built around customers and their needs, so avoiding Docker is not an option for us.

Downgrading Intel DP to 0.26.0 can be considered a workaround, but not a fix.

@stefb69

stefb69 commented Oct 20, 2024

Workaround: Prevent Creation of /dev/dri/by-path Symlinks with Colons

Issue:
Docker has a bug that prevents containers from starting when bind-mounting paths containing colons (:), such as /dev/dri/by-path/pci-0000:00:02.0-card. This affects applications like Plex that require access to /dev/dri devices for hardware acceleration.

Solution:
Modify the udev rules to prevent the creation of /dev/dri/by-path symlinks with colons by replacing them with hyphens (-). This ensures compatibility with Docker and Kubernetes by avoiding paths with colons.

Steps to Implement the Workaround:

  1. Identify the Culprit udev Rule:

    On Ubuntu, and probably Debian, the symlinks are created by the 60-drm.rules file, which contains rules for DRM (Direct Rendering Manager) devices.

  2. Create a Custom udev Rule:

    To override the existing rules without modifying system files, create a new rules file whose name sorts before 60-drm.rules, so that udev processes it first.

    sudo nano /etc/udev/rules.d/59-drm-custom.rules
  3. Add the Following Content to the Custom Rule:

    # Prevent creation of original by-path symlinks and create modified symlinks without colons
    
    ACTION!="remove", SUBSYSTEM=="drm", SUBSYSTEMS=="pci|usb|platform", IMPORT{builtin}="path_id"
    
    # Replace colons with hyphens in ID_PATH
    ENV{ID_PATH}=="?*", PROGRAM="/bin/sh -c 'echo $env{ID_PATH} | sed \"s/:/-/g\"'", ENV{ID_PATH}="%c"
    
    # Create new symlinks with modified ID_PATH
    ENV{ID_PATH}=="?*", KERNEL=="card*", SYMLINK+="dri/by-path/$env{ID_PATH}-card"
    ENV{ID_PATH}=="?*", KERNEL=="controlD*", SYMLINK+="dri/by-path/$env{ID_PATH}-control"
    ENV{ID_PATH}=="?*", KERNEL=="renderD*", SYMLINK+="dri/by-path/$env{ID_PATH}-render"
    
    # Stop further processing to prevent original symlinks from being created
    OPTIONS+="last_rule"
    

    Explanation:

    • Import path_id: Ensures ID_PATH is available.
    • Replace Colons: Uses sed to replace all : with - in ID_PATH.
    • Create Modified Symlinks: Generates new symlinks without colons.
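    • Stop Further Processing: OPTIONS+="last_rule" is meant to prevent the original 60-drm.rules from adding the problematic symlinks. The udev(7) manual in current systemd releases does not list last_rule among the supported options, so confirm the outcome with the verification step rather than relying on it.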
  4. Set Proper Permissions for the Rule File:

    sudo chmod 644 /etc/udev/rules.d/59-drm-custom.rules
  5. Reload udev Rules and Apply Changes:

    sudo udevadm control --reload-rules
    sudo udevadm trigger --subsystem-match=drm
  6. Verify the New Symlinks:

    Check that the symlinks in /dev/dri/by-path no longer contain colons.

    ls /dev/dri/by-path

    Expected Output:

    pci-0000-00-02.0-card
    pci-0000-00-02.0-render
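
The substitution and the verification step above can be sketched as a small script. The function names hyphenate and no_colons are made up for illustration; the sed expression is the same one used in the rule.

```shell
#!/bin/sh
# Apply the same substitution the udev rule applies to ID_PATH.
hyphenate() {
  printf '%s\n' "$1" | sed 's/:/-/g'
}

# Succeed only if no entry name in the directory contains a colon.
# A missing or empty directory also counts as "no colons".
no_colons() {
  for entry in "$1"/*; do
    case "${entry##*/}" in
      *:*) return 1 ;;
    esac
  done
  return 0
}

hyphenate 'pci-0000:00:02.0'
if no_colons /dev/dri/by-path; then
  echo 'no colons found'
else
  echo 'colon symlinks still present'
fi
```

If the second message appears after reloading the rules, the original symlinks are still being created and the workaround did not take effect on that system.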
    

@tkatila pinned this issue Jan 13, 2025