Issues with BUDA #6028

Open
mrinaldi97 opened this issue Jan 27, 2025 · 12 comments

@mrinaldi97
Contributor

Describe the bug

Hello,
I am a researcher and I am trying to set up a new Boinc project for academic purposes.
In addition to a standard C++ app still in development, I'd like to use the new Docker + Buda feature to deploy a large Python app, potentially working also on GPU (so Vbox cannot be used).

However, I am facing some issues related to BUDA.

1. Docker doesn't support image names containing uppercase letters. Unfortunately, this means that if the project name contains an uppercase letter (a very common situation), creation of images will fail with a Reference Error. This is very easy to fix: a minor modification to the get_image_name() function in docker_wrapper.cpp is enough to ensure that only lowercase strings are used for image names.
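
A minimal sketch of the kind of change meant here. The helper name `lowercase_image_name` and the example string are hypothetical; the real logic lives in get_image_name() in docker_wrapper.cpp, whose exact code is not reproduced here.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not the actual BOINC code): return a copy of `name`
// with every uppercase ASCII letter converted to lowercase, so the result
// can be used as a Docker image name.
std::string lowercase_image_name(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}
```

For example, `lowercase_image_name("boinc__MyProject__wu_1")` would give `boinc__myproject__wu_1`.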

2. The presence of Docker was not recognized by Boinc Client 8.0.2 on EndeavourOS Linux [6.12.9-arch1-1|libc 2.40]. Looking at the source code, I realized that the server expects the tags <docker_version>%s</docker_version> and <docker_type>%s</docker_type> in scheduler requests, but neither tag is passed. This means that the enum DOCKER_TYPE in common_defs.h will be set to NONE, so hostinfo.cpp will just run the command "unknown":

const char* docker_cli_prog(DOCKER_TYPE type) {
    switch (type) {
    case DOCKER: return "docker";
    case PODMAN: return "podman";
    default: break;
    }
    return "unknown";
}

Now, considering that the client may fail to find Docker, what about falling back to "docker" instead of "unknown"? I know it's not an elegant solution (especially considering that podman is open source...), but docker is more common, so at least we increase the chance that something will run. With "unknown" we can be sure that the program will not be executed, unless the volunteer creates a symbolic link from "unknown" to the docker/podman executable.

3. I think the case of my science app is quite common nowadays: many contemporary science libraries are unfortunately cursed by dependency greediness, which means that the data needed in a Docker image just to run the program can often be several gigabytes.
In my case the image is 2 GB. I saw that the default behaviour of docker_wrapper is to create the image from the Dockerfile, create a container from the image, and then delete both the container and the image (sprintf(cmd, "image rm %s", image_name);). What about an option in the app configuration to skip this line in docker_wrapper and keep the image? Cases such as mine could then be handled so that only a single image per science app is built and a new container is created for each WU, saving gigabytes of volunteers' bandwidth and avoiding time wasted on I/O operations.

What I did for now (and it works) is to put a fixed image name in the docker wrapper instead of using aid.wu_name. This approach, however, is not optimal, because one could need (as in my case) a different image for each BUDA app to be deployed, with these images remaining fixed except for updates (as a quick workaround, adding a version number to the app's name would suffice). So I wonder: what is the correct way, from the docker wrapper running on the client, to access the name of the single science app deployed using BUDA? The only drawback is that images would not be deleted if the project is removed, but I guess this would unfortunately need some major fix to the client code...
TLDR: What I want is 1) the same base image across WUs of the same family, and 2) a different container created for each WU.
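
The requested behaviour could look roughly like the sketch below. The function `cleanup_commands` and its `keep_image` flag are hypothetical and only illustrate the idea; they are not taken from docker_wrapper.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: list the Docker CLI argument strings to run after a
// job finishes. The per-WU container is always removed; the image is removed
// only when the app has not asked to keep it.
std::vector<std::string> cleanup_commands(
    const std::string& container_name,
    const std::string& image_name,
    bool keep_image
) {
    std::vector<std::string> cmds;
    cmds.push_back("container rm " + container_name);
    if (!keep_image) {
        cmds.push_back("image rm " + image_name);
    }
    return cmds;
}
```

With `keep_image` set, every WU still gets a fresh container, but the 2 GB image is built once and reused.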

4. In one of my experiments, I tried to solve problem (3) by uploading the .tar file of the image to the sandbox, so that at least it doesn't have to be fully redownloaded and rebuilt every time. This approach didn't work. First, even if I add the file manually, I get a fatal memory error every time I try to download the file from the sandbox (even though the PHP max memory variable was correctly set). Then, even when the client correctly downloaded the 2 GB tar file, I got "Disk usage limit quota exceeded"; however, the error message (I don't have the interface in English) seemed to be about an I/O quota rather than a disk usage quota. Is it perhaps related to some default <rsc_disk_bound> for BUDA apps?

Thanks and sorry for this non-canonical issue report + discussion in the same thread, but I wasn't sure if opening 4 different discussions was the best thing to do.
Cheers,
Matteo

@AenBleidd
Member

@mrinaldi97,

The presence of Docker was not recognized by Boinc Client 8.0.2 on EndeavourOS Linux [6.12.9-arch1-1|libc 2.40].

Docker support is not yet released, so if you are interested in testing this, please use the latest master to build your own client, or check our nightly builds in case your OS is supported (I know nothing about EndeavourOS, but if it's DEB or RPM based, you can try this: https://github.com/BOINC/boinc/wiki/Linux-DEB-and-RPM-support )

@davidpanderson, please take a look at the other points. I think some of them are definitely valid.

@davidpanderson
Contributor

1) is fixed in "client and docker_wrapper: lower-case image and container names" #6033.
Vitalii addressed 2).
I don't understand 4), but it may not be relevant.

For 3): I could add options to job.toml so that you can specify the image name,
and tell docker_wrapper not to delete the image after the job.

Before I do this I want to make sure this is actually a problem.
My impression is that Docker stores images as a bunch of layers, in separate files.
It would be inefficient to combine these into one big file.

I can't find anything online about how Docker stores images.
Do we have any experimental evidence?

@mrinaldi97
Contributor Author

Thanks @AenBleidd, unfortunately I overlooked the fact that Buda is still an experimental feature. Nonetheless, I think it's worth trying to set up the project with this feature: I first have to go through the whole testing phase anyway, and I am sure it's not a problem for beta-testing volunteers to perform some extra steps in order to run the project. In this way my project (if it becomes a real thing, and I am confident it will) could also serve as a testbed for Buda and thus help improve Boinc itself. I am happy to help in any way I can.

So to recap:

1 is solved. 2 is due to the experimental status of the feature; I think an easy fix is to ask volunteers to run "ln -s /bin/unknown /usr/bin/docker", and that's it.

On 4, I agree that it's not so relevant: it's just that the sandbox doesn't work very well with very big files, but that may only be a configuration issue, and I also don't think that delivering an image of several gigabytes through the sandbox is a good way to work. I just don't know whether it should be considered an issue that all the traffic generated by the Docker container is not managed by the boinc-client but externally. This way, any network usage quota set by the user will not be respected. That's another reason why I was trying to put big files into the sandbox.

On 3, what I noticed is that before my modification Docker was re-creating a new image from scratch for each WU, spending several minutes every time just to download all the files, install dependencies and so on. It was very inefficient, but I agree we should look more closely at how Docker works, because I noticed that in some cases Docker is able to recreate an image very quickly by restarting from already computed checkpoints.
In any case, the idea of providing the option in the job.toml file is great: it can at least solve this issue very well until it is better understood. With this extra option I could already publish the project with more than one test app. The only drawback is the inability to automatically delete the images if the project is updated or removed, so I will have to ask early volunteers to run a "docker rmi" command when old images should be removed.

@davidpanderson
Contributor

Here's an interesting article:
https://docs.docker.com/get-started/docker-concepts/building-images/using-the-build-cache/

Apparently the order of commands in the Dockerfile can affect the cache efficiency.

@davidpanderson
Contributor

Proposal: the job.toml file can contain

image_name = "x"
image_retain_days = 7

in which case the docker_wrapper will

  • use the given image name, prepended with boinc__<project>__
  • create a file projects/x/docker_images/x containing the 'retain until' time.
  • not delete the image when the job is done

The client (which periodically removes unused images) won't remove this one until 7 days after its last use.

That way volunteers won't have to manually delete images.
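
As a rough illustration of the retention rule in this proposal (not actual client code), the 'retain until' time and the removal check could be computed as below. The function names are hypothetical, and times are assumed to be Unix timestamps in seconds.

```cpp
#include <ctime>

const std::time_t SECONDS_PER_DAY = 86400;

// Hypothetical: the time after which the client may remove the image,
// given when it was last used and the image_retain_days setting.
std::time_t retain_until(std::time_t last_use, int image_retain_days) {
    return last_use
        + static_cast<std::time_t>(image_retain_days) * SECONDS_PER_DAY;
}

// Hypothetical: the periodic cleanup removes the image only once the
// retention period has fully elapsed.
bool may_remove_image(std::time_t now, std::time_t retain_until_time) {
    return now >= retain_until_time;
}
```

Each new job that uses the image would rewrite the stored time, so an image in regular use is never removed.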

@davidpanderson
Contributor

... but this is the sort of complexity that I want to keep out of BUDA.
So I'll hold off on this for now.

@mrinaldi97
Contributor Author

Hello to everyone,
unfortunately I have been very busy these last days.

@davidpanderson I see your point about keeping as little complexity as possible in Buda. I think that image_retain_days, although an effective problem-solver, may indeed add too much complexity to the code as well as to the user experience.
On the other hand, I think the possibility of picking a custom image name from job.toml would be good, and it is also a very simple function. It is true that the problem of deleting old images remains, but for example the TOML file could contain "image_name" as well as "image_version", so that if the new version has a bigger number then the old one is replaced. Not perfect, but it should work.
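
A sketch of the version rule, under the assumption that image_version is a plain integer. The struct `ImageSpec` and both functions are hypothetical names used only for illustration.

```cpp
#include <string>

// Hypothetical pairing of the two job.toml values discussed above.
struct ImageSpec {
    std::string name;
    int version;
};

// The installed image is replaced only when the job asks for the same
// image name with a strictly bigger version number.
bool should_replace_image(const ImageSpec& installed,
                          const ImageSpec& requested) {
    return installed.name == requested.name
        && requested.version > installed.version;
}

// One possible way to encode the version: as a Docker tag, "name:version".
std::string image_tag(const ImageSpec& spec) {
    return spec.name + ":" + std::to_string(spec.version);
}
```

Here `should_replace_image({"myapp", 1}, {"myapp", 2})` is true, and the new image would be tagged `myapp:2`.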

Because I am very close to deploying this science app to a small group of beta testers (a matter of days or weeks), I think for now I can work on modifying the docker_wrapper so that it gets these extra parameters from the job.toml file. I haven't re-read the code with this modification in mind yet, but it looks easy and it could work at least for my users. In the meantime I will also study Docker better, so that perhaps this will not be needed at all. Before I make this kind of modification in my cloned codebase, do you know if there is some nice variable (such as aid.wu_name) that contains the name of the BUDA app instead of the wu_name? In that case I could avoid using the job.toml file and keep things even simpler.

In general, I think that these tests can be beneficial for the development of Boinc: we can check whether these solutions work and, if so, integrate them into Boinc's codebase.
Thanks!

@davidpanderson
Contributor

I changed the client so that, for BUDA jobs, APP_INIT_DATA.plan_class is the BUDA app variant
(which is a plan class).

In docker_wrapper, I added a config option
image_name = "xxx"
If specified, it uses this image name and doesn't delete it when done.
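
The described behaviour can be pictured with the simplified sketch below. It is an illustration only, not the code that was committed: `ImageChoice` and `choose_image` are hypothetical names, and the default per-job image name is reduced to the bare wu_name for brevity.

```cpp
#include <string>

// Hypothetical result type: which image name to use, and whether the
// wrapper should delete that image when the job is done.
struct ImageChoice {
    std::string name;
    bool delete_when_done;
};

// If image_name was given in the config, use it and keep the image;
// otherwise fall back to a per-job name and delete the image afterwards.
ImageChoice choose_image(const std::string& configured_image_name,
                         const std::string& wu_name) {
    if (!configured_image_name.empty()) {
        return {configured_image_name, false};
    }
    return {wu_name, true};
}
```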

@mrinaldi97
Contributor Author

Thank you @davidpanderson! I am updating my app right now but I reviewed the code and it should work like a charm.

@Toby-Broom

The workaround is the wrong way around; it should be ln -s /usr/bin/docker /bin/unknown

@davidpanderson
Contributor

Sorry, what workaround?

@Toby-Broom

Linking docker to unknown so it can find docker. At LHC@home dev they upgraded the server code, and I have 8.0.4, so I get "docker is missing" messages.

Projects
Status: Backlog
Development

No branches or pull requests

4 participants