This repository has been archived by the owner on Jan 11, 2022. It is now read-only.

[manyblock mode] how are you ensuring that all the blocks are executed in order? #80

Open
isaacleeai opened this issue Dec 4, 2018 · 2 comments

Comments

@isaacleeai

I posted a question on the NVIDIA developer forum about the order in which blocks execute when a grid has more blocks than can be resident on the device at one time. An NVIDIA developer answered that: 1) we cannot assume anything about the order in which blocks execute, and 2) once a block becomes resident, it does not retire until all threads in the block have run to completion (https://devtalk.nvidia.com/default/topic/1044740/performance-cost-of-too-many-blocks-/?offset=8#5301239).

However, manyblock mode launches hundreds of thousands of blocks, and they have to execute in sample and layer order (the 1st layer of the 1st sample must complete before the 2nd layer of the 1st sample can proceed, and the 1st sample must complete before the 2nd sample can proceed). I have read every line of nv_wavenet_persistent.cuh, and it seems that one of the two things the dev said has to be wrong: either you can specify the order of blocks, or a block can be taken out of execution before it has run to completion. (You use an infinite while-loop to make a block wait for the previous layer, and block-wise synchronization to make sure the previous sample has been generated. Maybe one of these causes a block to retire early?) Or is it the "barrier.sync" PTX instruction that ensures the blocks execute correctly?

Thanks

@isaacleeai isaacleeai changed the title how are you ensuring that all the blocks are executed in order? [manyblock mode] how are you ensuring that all the blocks are executed in order? Dec 4, 2018
@BrianPharris
Contributor

The code is relying on thread block launch ordering, which, as was described in that thread, is a gray area that we shouldn't be depending on, so it's a bug in the code :)

A correct implementation will use atomics to determine the ordered block indices.
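
For readers following along, here is a minimal sketch of what "atomics to determine the ordered block indices" could look like. It is not code from nv-wavenet; the kernel `ordered_chain`, the helper `run_ordered_chain`, and the toy workload (each logical block adds 1 to its predecessor's value) are all hypothetical. The idea is that each block draws a ticket with `atomicAdd` when it actually starts running and uses that ticket, rather than `blockIdx.x`, as its position in the dependency order. A block holding ticket k then only ever waits on blocks that drew smaller tickets, and those have already started, so the spin-wait should not depend on launch order.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sketch: logical block k computes data[k] = data[k-1] + 1,
// so logical block k must not run its work before logical block k-1 is done.
__global__ void ordered_chain(unsigned int* ticket_counter,
                              unsigned int* done_count,
                              int* data)
{
    __shared__ unsigned int logical_block;

    // One thread per block draws a ticket. Tickets are issued in the order
    // blocks actually begin executing, which need not match blockIdx.x.
    if (threadIdx.x == 0) {
        logical_block = atomicAdd(ticket_counter, 1u);
    }
    __syncthreads();
    const unsigned int k = logical_block;

    if (threadIdx.x == 0) {
        // volatile forces each read to go to memory instead of a stale copy.
        volatile unsigned int* done = done_count;
        volatile int* vdata = data;

        // Spin until logical blocks 0..k-1 have all published their results.
        // Every one of them drew its ticket earlier, so it is already running
        // (or finished) and will eventually increment done_count.
        while (*done < k) { }
        __threadfence();

        const int prev = (k == 0) ? 0 : vdata[k - 1];
        vdata[k] = prev + 1;

        // Make the result visible before announcing completion.
        __threadfence();
        atomicAdd(done_count, 1u);
    }
}

// Launches the chain and returns the number of wrong entries
// (0 on success, -1 on a CUDA error).
int run_ordered_chain(int num_blocks)
{
    unsigned int* d_counters = nullptr;  // [0] = ticket counter, [1] = done count
    int* d_data = nullptr;
    if (cudaMalloc(&d_counters, 2 * sizeof(unsigned int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_data, num_blocks * sizeof(int)) != cudaSuccess) {
        cudaFree(d_counters);
        return -1;
    }
    cudaMemset(d_counters, 0, 2 * sizeof(unsigned int));
    cudaMemset(d_data, 0, num_blocks * sizeof(int));

    ordered_chain<<<num_blocks, 32>>>(d_counters, d_counters + 1, d_data);
    cudaError_t err = cudaDeviceSynchronize();

    std::vector<int> host(num_blocks);
    if (err == cudaSuccess) {
        err = cudaMemcpy(host.data(), d_data, num_blocks * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_counters);
    cudaFree(d_data);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < num_blocks; ++i) {
        if (host[i] != i + 1) ++mismatches;
    }
    return mismatches;
}
```

Calling `run_ordered_chain(4096)` launches more blocks than most devices can keep resident at once. The sketch leans on the second guarantee quoted earlier in this thread (a resident block stays resident until it completes); a real kernel would map the ticket to a (sample, layer) pair instead of a single chain index.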

@isaacleeai
Author

isaacleeai commented Dec 5, 2018

Oh okay, thanks.

I wanted to make sure there wasn't any code I overlooked, as I am trying to build a parallel version of wavenet from studying your code :)

I was thinking cooperative groups would do the trick (by launching only as many blocks as can be resident on the device at one time and having each block loop over multiple iterations for each sample). What do you mean by using atomics? Could you please point me to some literature on "atomics" (you aren't talking about atomic arithmetic operations and mutexes, or are you?)?
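
The cooperative-groups idea described above can be sketched roughly as follows. This is an illustrative toy, not nv-wavenet code: the kernel `persistent_layers`, the helper `run_persistent_layers`, and the stand-in "layer" (add 1 to every element) are hypothetical. The grid is sized from the occupancy API so that every block is resident at once, each block loops over the layers, and `grid.sync()` acts as the barrier between one layer and the next. Depending on the toolkit version, grid-wide sync may require compiling with relocatable device code enabled.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Hypothetical toy network: every layer writes out[i] = in[i] + 1.
__global__ void persistent_layers(float* buf_a, float* buf_b,
                                  int width, int num_layers)
{
    cg::grid_group grid = cg::this_grid();
    float* in = buf_a;
    float* out = buf_b;
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const int stride = gridDim.x * blockDim.x;

    for (int layer = 0; layer < num_layers; ++layer) {
        // Grid-stride loop: the fixed set of resident blocks covers the layer.
        for (int i = tid; i < width; i += stride) {
            out[i] = in[i] + 1.0f;
        }
        // No thread starts layer + 1 until every thread has finished this one.
        grid.sync();
        float* tmp = in; in = out; out = tmp;
    }
}

// Returns the number of wrong outputs (0 on success), or -1 on a CUDA error
// or when the device does not support cooperative launches.
int run_persistent_layers(int width, int num_layers)
{
    const int threads = 128;
    int device = 0, coop = 0, num_sms = 0, blocks_per_sm = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, device);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  persistent_layers, threads, 0);
    if (!coop || num_sms * blocks_per_sm == 0) return -1;

    // Launch no more blocks than can be co-resident on the device.
    dim3 grid(num_sms * blocks_per_sm), block(threads);

    float* d_a = nullptr;
    float* d_b = nullptr;
    const size_t bytes = width * sizeof(float);
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return -1; }
    cudaMemset(d_a, 0, bytes);  // all-zero bytes represent 0.0f
    cudaMemset(d_b, 0, bytes);

    void* args[] = { &d_a, &d_b, &width, &num_layers };
    cudaError_t err = cudaLaunchCooperativeKernel((void*)persistent_layers,
                                                  grid, block, args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    // The buffers are swapped after each layer, so the final result lands
    // in buf_b after an odd number of layers and in buf_a after an even one.
    std::vector<float> host(width);
    float* d_final = (num_layers % 2 == 0) ? d_a : d_b;
    if (err == cudaSuccess) {
        err = cudaMemcpy(host.data(), d_final, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_a);
    cudaFree(d_b);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < width; ++i) {
        if (host[i] != static_cast<float>(num_layers)) ++mismatches;
    }
    return mismatches;
}
```

A call such as `run_persistent_layers(1000, 5)` should leave every element equal to 5. Compared with spin-waiting on flags, this approach trades some parallelism (a full-grid barrier per layer) for an ordering guarantee that the runtime itself enforces.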
