torch.distributed.launch #8383

Open
lidc54 opened this issue Oct 24, 2023 · 4 comments
Labels
question Question on using Taichi

Comments


lidc54 commented Oct 24, 2023

I am using the code from blog_code as a layer in PyTorch, and then running it on multiple GPUs to check the result, but an error occurs. Launch command: `python -m torch.distributed.launch --nproc_per_node=2 main.py`

[screenshot: GPU memory usage per device]

We can see that the GPU memory used on gpu-6 is twice that on gpu-7. When the images get bigger or the batch size increases, this will become a big problem. I suppose ti.init may be the source of the problem, but I cannot find a solution in the related issues. Does anyone know about it?
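The reporter's actual main.py is not included in the issue, so the following is only a hypothetical minimal sketch of the suspected pattern: each process spawned by torch.distributed.launch pins its PyTorch work to its local rank's GPU but calls ti.init(arch=ti.cuda) with no device selection, so if Taichi always binds to the default CUDA device, every rank's Taichi context lands on the same GPU, which would explain one GPU holding roughly twice the memory of the other.

```python
# Hypothetical minimal reproduction (not the reporter's actual main.py).
# Launch with: python -m torch.distributed.launch --nproc_per_node=2 repro.py
import os

import taichi as ti
import torch
import torch.distributed as dist

# Depending on the PyTorch version, the local rank arrives either as the
# LOCAL_RANK environment variable or as a --local_rank argument.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

torch.cuda.set_device(local_rank)     # PyTorch tensors go to this rank's GPU
dist.init_process_group(backend="nccl")

ti.init(arch=ti.cuda)                 # no device index: Taichi may bind to GPU 0 in every rank

field = ti.field(dtype=ti.f32, shape=(4096, 4096))

@ti.kernel
def fill(value: ti.f32):
    for i, j in field:
        field[i, j] = value

fill(float(local_rank))               # under this assumption, both ranks' Taichi allocations sit on GPU 0
```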

@lidc54 lidc54 added the question Question on using Taichi label Oct 24, 2023
@github-project-automation github-project-automation bot moved this to Untriaged in Taichi Lang Oct 24, 2023

bobcao3 commented Oct 25, 2023

Taichi can't use multiple GPUs at the moment. To use multiple GPUs you need to run Taichi in separate processes, so it doesn't play well with torch's multi-GPU solution.

@bobcao3 bobcao3 moved this from Untriaged to Backlog in Taichi Lang Oct 25, 2023
@KazukiYoshiyama-sony

@turbo0628

It would be really appreciated if Taichi supported multiple GPUs.

I sometimes write custom kernels by hand, involving a Makefile, CMake, and/or a PyTorch extension. Taichi could do away with the cumbersome binding code, letting us focus on the algorithm and keeping debugging time to a minimum.

Recently I made my first Taichi kernel called from multiple processes spawned by PyTorch Lightning. However, I could not fix the illegal memory access error, which seems to be caused by ti.init in a multi-process environment, so I moved back to the classical approach, which can take 2-3x more time than using Taichi.


keunhong commented May 7, 2024

I have been running into the same issue. Is there any way around this? In theory it seems like Taichi should be able to bind to the correct GPU and run on it, but there seems to be some hardcoded logic making it bind to the first GPU, which results in an illegal memory access error. With torch's DistributedDataParallel it would be fine as long as Taichi's context could be bound to the correct GPU given by the local rank. Since this doesn't currently work, Taichi can't be used in the implementation of any large model that requires multi-GPU training.


keunhong commented May 9, 2024

@turbo0628

Are there any workarounds that would allow us to force Taichi onto a specific GPU index for each process? For example, something that would allow us to set the GPU index in ti.init(ti.cuda).
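One workaround sometimes used with CUDA libraries that lack an explicit device-selection API is to restrict each process to a single GPU via CUDA_VISIBLE_DEVICES before anything initializes CUDA. Whether Taichi honors this is not confirmed anywhere in this thread, so the following is an untested sketch under that assumption; reading LOCAL_RANK from the environment is also an assumption about how the processes are launched, not an official Taichi API.

```python
# Untested sketch of a possible workaround: make each rank see only its own GPU.
# CUDA_VISIBLE_DEVICES must be set before anything CUDA-related is imported or
# initialized, so both Taichi and PyTorch treat that GPU as device 0.
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))   # set by torchrun / torch.distributed.launch
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

import taichi as ti   # imported after the env var so its CUDA context sees one GPU
import torch

ti.init(arch=ti.cuda)      # assumption: Taichi respects CUDA_VISIBLE_DEVICES
torch.cuda.set_device(0)   # device 0 is the only visible GPU in this process
```

With this layout each process only ever touches its single visible device, which is also a common setup for DistributedDataParallel with the NCCL backend.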
