torch.distributed.launch #8383

Open
lidc54 opened this issue Oct 24, 2023 · 4 comments
Labels
question Question on using Taichi

Comments


lidc54 commented Oct 24, 2023

I am using the code from blog_code as a layer in PyTorch, and then running it on multiple GPUs to check the result, but an error occurs. Launch command: `python -m torch.distributed.launch --nproc_per_node=2 main.py`

[screenshot: GPU memory usage per device]

We can see that the GPU memory used on gpu-6 is twice that on gpu-7. When the images get bigger or the batch size increases, this will become a big problem. I suppose ti.init may be the source of the problem, but I cannot find a solution in the related issues. Does anyone know about it?
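The reporter's actual main.py is not included in the issue, so the following is only a hypothetical minimal sketch of the suspected pattern: each process spawned by torch.distributed.launch pins its PyTorch work to its local rank's GPU but calls ti.init(arch=ti.cuda) with no device selection, so if Taichi always binds to the default CUDA device, every rank's Taichi context lands on the same GPU, which would explain one GPU holding roughly twice the memory of the other.

```python
# Hypothetical minimal reproduction (not the reporter's actual main.py).
# Launch with: python -m torch.distributed.launch --nproc_per_node=2 repro.py
import os

import taichi as ti
import torch
import torch.distributed as dist

# Depending on the PyTorch version, the local rank arrives either as the
# LOCAL_RANK environment variable or as a --local_rank argument.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

torch.cuda.set_device(local_rank)     # PyTorch tensors go to this rank's GPU
dist.init_process_group(backend="nccl")

ti.init(arch=ti.cuda)                 # no device index: Taichi may bind to GPU 0 in every rank

field = ti.field(dtype=ti.f32, shape=(4096, 4096))

@ti.kernel
def fill(value: ti.f32):
    for i, j in field:
        field[i, j] = value

fill(float(local_rank))               # under this assumption, both ranks' Taichi allocations sit on GPU 0
```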

@lidc54 lidc54 added the question Question on using Taichi label Oct 24, 2023
@github-project-automation github-project-automation bot moved this to Untriaged in Taichi Lang Oct 24, 2023

bobcao3 commented Oct 25, 2023

Taichi can't use multiple GPUs at the moment. To use multiple GPUs you need to run Taichi in separate processes, so it doesn't play well with torch's multi-GPU solution.

@bobcao3 bobcao3 moved this from Untriaged to Backlog in Taichi Lang Oct 25, 2023
@KazukiYoshiyama-sony

@turbo0628

It would be really appreciated if Taichi supported multiple GPUs.

I sometimes write custom kernels by hand, involving a Makefile, CMake, and/or a PyTorch extension. Taichi could do away with the cumbersome binding code, letting us focus on the algorithm and keeping debugging time to a minimum.

Recently I made my first Taichi kernel called from multiple processes spawned by PyTorch Lightning. However, I could not fix the illegal memory access error, which seems to be caused by ti.init in a multi-process environment, so I moved back to the classical approach, which can take 2-3x more time than using Taichi.


keunhong commented May 7, 2024

I have been running into the same issue. Is there any way around this? In theory it seems like Taichi should be able to bind to the correct GPU and run on it, but there seems to be some hardcoded logic making it bind to the first GPU, which results in an illegal memory access error. With torch's DistributedDataParallel it would be fine as long as Taichi's context could be bound to the correct GPU given by the local rank. Since this doesn't currently work, Taichi can't be used in the implementation of any large model that requires multi-GPU training.


keunhong commented May 9, 2024

@turbo0628

Are there any workarounds that would allow us to force Taichi onto a specific GPU index for each process? For example, something that would allow us to set the GPU index in ti.init(ti.cuda).
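One workaround sometimes used with CUDA libraries that lack an explicit device-selection API is to restrict each process to a single GPU via CUDA_VISIBLE_DEVICES before anything initializes CUDA. Whether Taichi honors this is not confirmed anywhere in this thread, so the following is an untested sketch under that assumption; reading LOCAL_RANK from the environment is also an assumption about how the processes are launched, not an official Taichi API.

```python
# Untested sketch of a possible workaround: make each rank see only its own GPU.
# CUDA_VISIBLE_DEVICES must be set before anything CUDA-related is imported or
# initialized, so both Taichi and PyTorch treat that GPU as device 0.
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))   # set by torchrun / torch.distributed.launch
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

import taichi as ti   # imported after the env var so its CUDA context sees one GPU
import torch

ti.init(arch=ti.cuda)      # assumption: Taichi respects CUDA_VISIBLE_DEVICES
torch.cuda.set_device(0)   # device 0 is the only visible GPU in this process
```

With this layout each process only ever touches its single visible device, which is also a common setup for DistributedDataParallel with the NCCL backend.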
