NVBit hangs when creating Cuda Contexts in parallel (multi-gpu) #111

Open
mktip opened this issue Feb 4, 2023 · 0 comments

Tested on: NVBit 1.5.5 and 1.5.4

Greetings,

While experimenting with NVBit, I ran a tool I am working on against some multi-GPU code. NVBit seems to hang when multiple CUDA contexts are created in parallel.

Here is a sample application that seems to hang when instrumented by any NVBit tool on a multi-GPU system:

#include <iostream>
#include <omp.h>
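#include <cuda.h>          // driver API: CUcontext, cuDevicePrimaryCtxRetain
#include <cuda_runtime.h>  // runtime API: cudaGetDeviceCount, cudaSetDevice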

int main() {
    std::cout << "Started multi context in parallel" << std::endl;

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

#pragma omp parallel num_threads(num_devices)
    {
        int dev_id = omp_get_thread_num();
        std::cout << "Switching to device: " << dev_id << std::endl;
#pragma omp barrier
        cudaSetDevice(dev_id);
        CUcontext ctx;
        cuDevicePrimaryCtxRetain(&ctx, dev_id); // or `cudaFree(0)` just to force the instantiation of the cuda context
#pragma omp barrier
        std::cout << "Switched to device: "  << dev_id << std::endl;
    }

    std::cout << "Ended multi context in parallel" << std::endl;

    return 0;
}

Build and run (reproducing the hang might require multiple runs):

$ nvcc  -lcuda -lcudart -Xcompiler=-fopenmp -O3 -arch=sm_75 pmcs.cu -o pmcs
$ LD_PRELOAD=./tools/noop/noop.so ./pmcs # or any other tool within ./tools could be used

Note: noop.so is a tool that does nothing; its instrumentation functions simply return when entered.

I tried to find where the hang occurs. The only place I could trace it to is the cudaDeviceSynchronize call in ./core/nvbit_tool.h +82, but I think that is a symptom rather than the cause:

...

extern "C" void nvbit_at_context_init_hook() {
    __nvbit_start();
    load_module_nvbit_kernel<<<1, 1>>>(0);
    cudaDeviceSynchronize();  // this line
    assert(cudaGetLastError() == cudaSuccess);
}

If this problem is known, are there any known workarounds for it?
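
One candidate to compare against, purely as an assumption, is serializing the context creation on the application side so that only one thread is inside the initialization at a time. Whether this avoids the hang under NVBit is not established here. The sketch below shows only the locking pattern: `init_device_context` and `init_contexts_serialized` are hypothetical names, and the former is a stand-in for the real `cudaSetDevice(dev_id)` plus `cuDevicePrimaryCtxRetain` (or `cudaFree(0)`) calls so that the example compiles without CUDA.

```cpp
#include <vector>

// Hypothetical stand-in for the real per-device initialization. In the
// reproducer this would be cudaSetDevice(dev_id) followed by
// cuDevicePrimaryCtxRetain(&ctx, dev_id) or cudaFree(0).
static void init_device_context(int dev_id, std::vector<int>& order) {
    order.push_back(dev_id);  // record which device was initialized, in order
}

// Run one initialization per device. Iterations may be spread across OpenMP
// threads, but the named critical section lets only one thread at a time
// execute the initialization itself. Returns the order in which devices
// were handled.
std::vector<int> init_contexts_serialized(int num_devices) {
    std::vector<int> order;
#pragma omp parallel for
    for (int dev_id = 0; dev_id < num_devices; ++dev_id) {
#pragma omp critical(ctx_init)
        {
            init_device_context(dev_id, order);
        }
        // Per-device work that is safe to run concurrently would go here.
    }
    return order;
}
```

Compiled with `-fopenmp` the loop iterations are distributed over threads and the recorded order can be any permutation of the device ids; without it the pragmas are ignored and the devices are handled sequentially. If the hang disappears with the critical section in place, that would point toward concurrent context initialization as the trigger.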
