-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding DS Feature API in accelerator #5423
base: master
Are you sure you want to change the base?
Conversation
accelerator/abstract_accelerator.py
Outdated
@@ -12,7 +12,12 @@ class DeepSpeedAccelerator(ABC): | |||
def __init__(self): | |||
self._name = None | |||
self._communication_backend_name = None | |||
|
|||
self._capabilities: dict[str, bool] = { | |||
"zero1": False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These hardcoded constants, such as "zero1", should be defined as symbolic constants in a global location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tjruwase As we discussed, I collected the names in *OpBuilder and put it in constants.py. Do we need add other features like optimizers as ds features too?
Hi @duli2012 thanks for adding this interface. I have always been worring accelerator interface# may grow too big when we propose more and more capabilities into it, this interface is a good way to put all capabilities into one interface. My comments below:
What should we do with already existing accelerator interface that falls into category of capabilities? Should they be added to the new inferface or just keep them that way? I think that's an open to discuss. |
2. used __compatible_ops__ 3. adding more op name to constants.py
Thanks @delock for your comments. I integrated most of them. please take a look. |
2. used __compatible_ops__ 3. adding more op name to constants.py
OP_FUSED_LAMB = "fused_lamb" | ||
OP_FUSED_LION = "fused_lion" | ||
OP_INFERENCE_CORE_OPS = "inference_core_ops" | ||
OP_CUTLASS_OPS = "cutlass_ops" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general name is needed to cover the non-cuda devices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rogerxfeng8 - is there a specific one you're referencing? I believe we call all devices (cuda and non-cuda) accelerators.
Thanks @duli2012 , my intuition is zero 1/2/3 should not among accelerator feature list. Zero stage code is shared between different accelerators and there is no interface specific to zero stage, so I wonder what makes an accelerator not support zero features? For OP related features i.e. OP_ASYNC_IO, etc. I think there needs to be an mechanism sync them with opbuilder implementation state automatically, otherwise there will be manual maintenance cost each time a new op is introduced. |
|
||
class DeepSpeedAccelerator(ABC): | ||
|
||
def __init__(self): | ||
self._name = None | ||
self._communication_backend_name = None | ||
self._ds_features: dict[str, bool] = {ZERO_1: False, ZERO_2: False, ZERO_3: False} | ||
self._ds_features.update({op: compatibility for op, compatibility in __compatible_ops__}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reflection mechanism better be lazy initialized. Otherwise there might be circular dependence because this init function be called before __compatible_ops__
being initialized.
This PR is a prototype of adding API for capabilities in accelerators including:
Welcome hardware vendors to define capabilities for their own hardware.