torch.distributed.launch #8383
Taichi can't use multiple GPUs at the moment. To use multiple GPUs you need to run Taichi in separate processes, so it doesn't play well with torch's multi-GPU solution.
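A minimal sketch (not from this thread) of the "one process per GPU" pattern the comment describes: each worker pins itself to a different GPU via `CUDA_VISIBLE_DEVICES` before importing and initializing Taichi, so the runtimes never share a device. The pinning workaround is an assumption on my part, not a documented Taichi feature.

```python
import multiprocessing as mp
import os

def worker(gpu_index: int) -> None:
    # Pin this process to a single GPU before any CUDA context exists;
    # inside the process, that GPU becomes "device 0".
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    try:
        import taichi as ti  # imported after pinning: one runtime per process
        ti.init(arch=ti.cuda)
        # ... define and launch kernels here ...
    except Exception:
        pass  # taichi not installed, or no CUDA device; sketch only

if __name__ == "__main__":
    # "spawn" gives each worker a fresh interpreter with no inherited CUDA state.
    mp.set_start_method("spawn", force=True)
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Each process then owns exactly one GPU and one Taichi runtime, sidestepping the shared-device conflict described above.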
It would be really appreciated if Taichi supported multiple GPUs. I sometimes write custom kernels by hand, including Makefiles, CMake, and/or PyTorch extensions. Taichi could do away with the cumbersome binding code, which would let us focus on the algorithm and keep debugging time to a minimum. Recently, I first tried calling a Taichi kernel from multiple processes spawned by PyTorch Lightning. However, I could not fix the illegal memory access error, which seems to be caused by ti.init in a multi-process environment, so I moved back to the classical approach, which can take 2-3x more time than using Taichi.
I have been running into the same issue. Is there any way around this? In theory, Taichi should be able to bind to the correct GPU and run on it, but there seems to be some hardcoded logic making it bind to the first GPU, which results in an illegal memory access error. With torch's
Are there any workarounds that would let us force Taichi onto a specific GPU index in each process? For example, something that would allow us to set the GPU index in
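One workaround worth trying, sketched below under the assumption that Taichi always binds to the process-visible device 0: restrict each launched process to a single physical GPU with `CUDA_VISIBLE_DEVICES` before `ti.init`, so "device 0" maps to a different card in each process. `LOCAL_RANK` is set by newer versions of `torch.distributed.launch` (older versions pass a `--local_rank` argument instead); nothing here is official Taichi API.

```python
import os

def pin_process_to_gpu(local_rank: int) -> str:
    """Expose a single physical GPU to this process, so a library that
    hardcodes device 0 (as Taichi appears to here) lands on the right
    card. Must run before any CUDA context is created in the process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)
    return os.environ["CUDA_VISIBLE_DEVICES"]

# torch.distributed.launch exposes the per-process rank as LOCAL_RANK.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
pin_process_to_gpu(local_rank)

# Only now initialize Taichi; within this process, "GPU 0" is the GPU
# selected above. Guarded so the sketch runs without taichi installed.
try:
    import taichi as ti
    ti.init(arch=ti.cuda)
except Exception:
    pass  # taichi missing or no CUDA device available; sketch only
```

The same pinning must happen in every rank before torch or Taichi touches CUDA, otherwise the first GPU still ends up shared.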
I used the code from blog_code as a layer in PyTorch, and then ran it multi-GPU to see the result, but an error occurs:
```shell
python -m torch.distributed.launch --nproc_per_node=2 main.py
```
We can see that the memory used on gpu-6 is twice that on gpu-7. When the images get bigger or the batch size increases, this becomes a big problem. I suppose ti.init may be the source of the problem, but I cannot find a solution in the related issues. Does anyone know about this?
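A quick diagnostic sketch (my own, not from the thread) to confirm the suspected cause: run this under the same launch command and check what each rank actually exposes to CUDA. The `LOCAL_RANK` variable is set by `torch.distributed.launch`; everything else is plain stdlib.

```python
import os

def visible_devices() -> str:
    # What this process will hand to Taichi/torch via CUDA_VISIBLE_DEVICES;
    # "<unset>" means every physical GPU is visible to the process.
    return os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")

rank = os.environ.get("LOCAL_RANK", "0")
print(f"rank={rank} CUDA_VISIBLE_DEVICES={visible_devices()}")
# If both ranks report "<unset>", any library that hardcodes device 0
# will allocate on the first GPU in every process, which would match
# the doubled memory usage observed on gpu-6.
```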