
Skip the unit tests that use unimplemented op builders. #1

Open
wants to merge 72 commits into
base: master

Conversation

Owner

@foin6 foin6 commented Mar 9, 2024

No description provided.


@rogerxfeng8 rogerxfeng8 left a comment


looks ok to me

mosheisland and others added 21 commits April 8, 2024 15:35
This PR enhances DeepSpeed to support MoE for pipeline models (e.g.
GPTModelPipe from Megatron-DeepSpeed).
Main changes:

- Enhance expert group creation for pipeline models (both flavors: DP/PP/EP and DP/TP/PP/EP)
- Fix MoE save/load checkpoint for PipelineModule based models.
- Display MoE loss for PipelineModule based models.
- Support gradient reduction with BF16_Optimizer for PipelineModule. Note that the same commit also fixes a gradient-reduction error when using Megatron-DeepSpeed GPTModelPipe with BF16_Optimizer, including for a dense (no-MoE) model.
- When using no-drop tokens, all-reduce the capacity (op=max) using the expert parallel group instead of the world group (see the sketch after this list)
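
A minimal sketch of the capacity synchronization described in the last item, assuming a torch.distributed process group for expert parallelism; the surrounding MoE gating code is omitted and the names are illustrative:

```python
import torch
import torch.distributed as dist

def sync_capacity(capacity: torch.Tensor, expert_parallel_group) -> torch.Tensor:
    # With no-drop tokens every participating rank must agree on the largest
    # capacity, but only ranks that share experts need to take part, so the
    # all-reduce runs over the expert parallel group rather than the world group.
    dist.all_reduce(capacity, op=dist.ReduceOp.MAX, group=expert_parallel_group)
    return capacity
```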

---------

Signed-off-by: Moshe Island <[email protected]>
Co-authored-by: Moshe Island <[email protected]>
…edai#5164)

The underlying issue preventing us from using pytest 8.0.1+ is in
transformers. Here is a [PR that fixes the
issue](huggingface/transformers#29154); it will need to be merged,
followed by a new transformers release, before we can complete this PR.

No impact to nv-nightly; see the run
[here](https://github.com/microsoft/DeepSpeed/actions/runs/8575192380/job/23503659644).
Currently errors on nv-inference in [the
PR](https://github.com/microsoft/DeepSpeed/actions/runs/8575187611/job/23503647378),
but the same tests are broken in
[master](https://github.com/microsoft/DeepSpeed/actions/runs/8575185551/job/23503640238).
This change increases CI coverage for the HPU accelerator.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Add a basic workflow for tests on Intel XPU. Currently only part of the
tests are enabled; we will add more tests in later PRs.

---------

Co-authored-by: Logan Adams <[email protected]>
…#5386)

- Fixes xpu-max1100 not running on PRs because of an incorrect yml name.
- Only workflows running Ubuntu 20.04 or later can be updated, as the
GLIBC needed for Node 20+ is available there.
- Workflows that aren't updated are running Ubuntu 18.04 or older; those
will need to be moved to updated images shortly and will be updated
later in the original PR, deepspeedai#5021

Sample warning that is resolved:

```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
```
- Move `required_torch_version` check from deepspeed.runtime.utils to
deepspeed.utils.torch (newly created).
- Remove unused duplicate definition from `tests/unit/util.py`.
- Update all references to this function.
- Switch checks in `deepspeed/runtime/pipe/p2p.py` to use this function.
- Switch checks in `deepspeed/comm/torch.py` to use this function.
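
For reference, a minimal sketch of what such a helper can look like, assuming an interface of the form `required_torch_version(min_version=..., max_version=...)`:

```python
from packaging import version

import torch


def required_torch_version(min_version=None, max_version=None) -> bool:
    # Compare the installed torch version (ignoring local build suffixes like
    # "+cu121") against optional lower and upper bounds.
    torch_version = version.parse(torch.__version__.split("+")[0])
    if min_version is not None and torch_version < version.parse(str(min_version)):
        return False
    if max_version is not None and torch_version > version.parse(str(max_version)):
        return False
    return True
```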

---------

Co-authored-by: Lev Kurilenko <[email protected]>
1. Fix a type error.
2. Update the Intel XPU HW support status.
…eedai#5373)

This PR removes the for loop inside the dequantizer kernel and uses as
many threads and blocks as needed to dequantize the quantized matrix.
The previous implementation processed one group per thread block, which
can reduce efficiency with smaller group sizes and also processes more
data per thread than necessary; we can use more parallelism to improve
the dequantization performance.
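
For reference, the math the kernel performs is plain group-wise dequantization; below is a rough PyTorch illustration of that math (not the CUDA kernel itself, and the group layout and dtypes here are assumptions):

```python
import torch

def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor, group_size: int) -> torch.Tensor:
    # q holds quantized values, scales holds one scale per contiguous group of
    # group_size elements; each output element is its quantized value times the
    # scale of the group it belongs to.
    groups = q.float().view(-1, group_size)             # [num_groups, group_size]
    out = (groups * scales.float().view(-1, 1)).to(torch.bfloat16)
    return out.view(q.shape)
```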

Based on my testing results, for a 4K by 4K matrix, dequantizing from
fp8 to bf16 gives a 2.5x speedup (improving the BW efficiency from 1 TB/s
to 2.5 TB/s on an Nvidia H100 GPU).

---------

Co-authored-by: Reza Yazdani <[email protected]>
This PR adds more flexibility to define weight tensor reshaping for
universal checkpointing.

Currently universal checkpointing assumes a few patterns of partitioning
for tensor parallelism, such as column/row wise partitioning of a 2-dim
tensor. However, these are not flexible enough to define partitioning
for more complex usages. Here are some examples:

1) MoE: The user may define the weight tensor for MoE's FFN as
[n_experts * hidden_out, hidden_in]. For TP, we need to *view* this
tensor as 3-dim tensor and partition it along `hidden_out` dimension.
2) GQA: The weights for QKV are often represented as one tensor, and Q,
K, and V may have different sizes. The tensor shape will be
[q_size + k_size + v_size, hidden]. We partition this along the first
dimension, but separately for each of Q, K, and V. In this case, we first
need to partition Q, K, and V separately and then concatenate the pieces
to get a shard for TP.

We propose a new pattern `PARAMETER_WITH_SUB_PARAMS` to support this.
Here is the usage to cover the above use cases. You can define the view
of the weight tensor and specify the dimension for partitioning based on
the view.
```python
from deepspeed.checkpoint import PARAMETER_WITH_SUB_PARAMS, SubparamShape

info[PARAMETER_WITH_SUB_PARAMS] = [
    asdict(SubparamShape(patterns=[layers_prefix + r"\d+moe.fc1.weight"],
                shape=(num_experts, hidden_out, hidden_in), partition_dim=1)),
    asdict(SubparamShape(patterns=[layers_prefix + r"\d+.qkv.weight"],
                shape=((q_size, k_size, v_size), hidden_size), partition_dim=0)),
...
]
```

The conversion script (`ds_to_universal.py`) merges TP-sharded weight
tensors and the loader of universal checkpoints also partitions them
following the information.

Co-authored-by: Olatunji Ruwase <[email protected]>
…i#5351)

In UTs, removed the hardcoded 'cuda' string by replacing it with a device
variable set to get_accelerator().device_name()
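
Illustration of the pattern (the tensor creation is a made-up example, not a specific test):

```python
import torch
from deepspeed.accelerator import get_accelerator

device = get_accelerator().device_name()   # e.g. "cuda", "xpu", "hpu"
x = torch.ones(4, device=device)           # previously: device="cuda"
```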

Co-authored-by: Shaik Raza Sikander <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
ZeRO offload case

Fix the issue of pageable h2d memcpy in the step process. Now the h2d
memcpy uses pinned memory.

This speeds up the h2d memcpy by 6x on a single GPU and 4-5x on an 8-GPU node.
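
The underlying idea, sketched in plain PyTorch; the buffer management here is simplified relative to the optimizer's step path:

```python
import torch

src = torch.randn(1 << 22)                        # pageable host tensor
staging = torch.empty_like(src).pin_memory()      # page-locked staging buffer
staging.copy_(src)                                # host-to-host copy into pinned memory
dst = staging.to("cuda", non_blocking=True)       # fast, async-capable H2D transfer
```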

cc @tjruwase

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Ubuntu <deepspeed@deepspeed-login.2d1icxc5dsxehnpuwt3ifc34ph.gvxx.internal.cloudapp.net>
Co-authored-by: Logan Adams <[email protected]>
Use torch.norm instead of an inefficient for loop.
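
Roughly the difference, as an illustration rather than the exact DeepSpeed code:

```python
import torch

grads = [torch.randn(1024) for _ in range(32)]

# Before: accumulate squared norms in a Python loop.
loop_norm = sum(float(g.float().norm(2)) ** 2 for g in grads) ** 0.5

# After: a single vectorized norm over the stacked per-tensor norms.
fast_norm = torch.norm(torch.stack([g.float().norm(2) for g in grads]), 2)
```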

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
…dai#5333)

Refine the guards of FP6 kernel compilation. Fix the `undefined symbol`
problem of FP6 kernels on non-Ampere architectures.

Related issue: deepspeedai/DeepSpeed-MII#443.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.14.1
Author           - @loadams

Co-authored-by: loadams <[email protected]>
…eedai#5329)

When the dtype is bf16 or fp32, the if condition is not satisfied, so
execution continues instead of skipping when Triton is not installed.

Co-authored-by: Shaik Raza Sikander <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Set non_daemonic_proc=True by default on the XPU device, using
non-daemonic processes for unit tests.

Co-authored-by: Logan Adams <[email protected]>
deepspeedai#5411)

Some users are concerned that changes in TP topology during MoE training
may interfere with experiments, after noticing similar issues:
deepspeedai/Megatron-DeepSpeed#151
https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files

We found a grad_norm calculation error after enabling TP. The error
occurs because the flattened grad of a params group is used, and the
group contains both non-TP and TP parameters; therefore a single
attribute cannot determine whether the flattened grad needs to
contribute to the norm. In the current code logic, all params are
assumed to be non-TP, so only the tp_rank0 grad participates in the
grad_norm computation and the other tp_rank grads have grad_norm_sum
equal to 0. We tested and found that between TP=1 and TP=4, the
difference in grad_norm is approximately a factor of two (sqrt(4)),
which aligns with the issue above. This problem should also affect
dense models.
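
A toy illustration of the size of the error, assuming identically distributed gradient shards across TP ranks:

```python
import torch

tp_size = 4
shards = [torch.randn(1 << 16) for _ in range(tp_size)]   # one grad shard per TP rank

full_norm = torch.norm(torch.cat(shards))   # all TP ranks contribute
rank0_norm = torch.norm(shards[0])          # only tp_rank0 contributes (the bug)
print(full_norm / rank0_norm)               # ~sqrt(tp_size) == 2.0
```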

Because bf16 does not flatten the params-group grads, it avoids this
problem.

We tested the loss curve on a 1.3B model; as the TP size increases, the
inconsistency gap should grow.

With this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)

Without this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)

---------

Co-authored-by: Conglong Li <[email protected]>
…5415)

A [warning is
shown](https://github.com/microsoft/DeepSpeed/actions/runs/8695213322/job/23845782048#step:10:31)
when we do releases:

```
[deploy](https://github.com/microsoft/DeepSpeed/actions/runs/8695213322/job/23845782048)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: peter-evans/create-pull-request@v4. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
```

To resolve this, we update the create-pull-request action to `@v6`; see
the release notes
[here](https://github.com/peter-evans/create-pull-request/releases).
jeffra and others added 30 commits April 23, 2024 12:24
Optimized version of `nn.Linear` that adds features such as:
      * LoRA with base weight sharding
      * FP [6,8,12] quantization

Depends on deepspeedai#5336 being merged first

Co-authored-by: @rajhans
Co-authored-by: @aurickq

---------

Co-authored-by: Rajhans Samdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Fix a minor typo in the README file.

Co-authored-by: Logan Adams <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.14.2
Author           - @loadams

Co-authored-by: loadams <[email protected]>
deepspeedai#5299)

Add getter and setter methods for `compile_backend` across accelerators,
which provide a mechanism to retrieve the compile backend. These APIs
handle user-defined backend selection and raise a `ValueError` with
informative error messages for unsupported backends.
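
A sketch of the intended usage; the method names follow the description above and should be treated as an assumption rather than the final API:

```python
from deepspeed.accelerator import get_accelerator

accel = get_accelerator()
print(accel.get_compile_backend())          # e.g. "inductor"

try:
    accel.set_compile_backend("not-a-real-backend")
except ValueError as err:
    print(err)                              # unsupported backends raise ValueError
```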

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
PyTorch v2.3 throws an error when it tries to compile `iter_params` used
for ZeRO3.
This PR excludes the function from the compilation targets.

After this PR is merged, we can [unpin the torch version for unit
tests](deepspeedai#5459).
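
One common way to keep a helper out of graph capture is `torch.compiler.disable`; whether DeepSpeed uses exactly this mechanism is not stated here, so treat the snippet as an illustration:

```python
import torch

@torch.compiler.disable
def iter_params(modules):
    # Generator excluded from torch.compile tracing; compiled callers fall back
    # to eager execution for this helper.
    for module in modules:
        yield from module.parameters(recurse=False)
```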
…edai#5256)" (deepspeedai#5461)

This reverts commit 54c0687 due to
deepspeedai#5256 causing bugs when the ZeRO3 + ZeRO Offload features are enabled.

This bug was discovered due to failures in the DS Chat CI workflow.
Failing tests across CI failures:
| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |

Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```

It seems that `torch.stack()` or `torch.norm()` has issues when the
offload feature is enabled and tensors are split between CPU and GPU;
however, this is just an initial guess and would require more
investigation.

@nelyahu Since you are the original author of the PR, if you have some
bandwidth, any help here is greatly appreciated!

After reverting this commit, all tests pass in the DS Chat CI workflow:

https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763

@tjruwase for context.
…pspeedai#5462)

This PR updates the ds-chat CI workflow to run when ZeRO stage 1-3 files
are updated.
…speedai#5465)

The order of parameters in the create_dir_symlink method looks wrong.
Because of this, we get the error "PermissionError: [WinError 5] Denied
access: '.\\deepspeed\\ops\\csrc'" when installing deepspeed >= 0.4.0 in
a Windows environment.
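
For reference, the standard-library call such a helper presumably wraps takes the link target first and the link name second; swapping them makes the call try to create the link on top of the existing source directory, which can surface as the permission error above. The paths below are illustrative only:

```python
import os

# os.symlink(src, dst): create a link named dst that points at src.
os.symlink("deepspeed/ops/csrc", "deepspeed/ops/csrc_link", target_is_directory=True)
```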

Please check this out @eltonzheng and @jeffra.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
…compile_zero tests on v100 (deepspeedai#5459)

Torch 2.3.0 broke some test_compile_zero tests, so we pinned it;
@tohtana pushed fixes in deepspeedai#5463, and this should un-pin torch
and move us back to the latest version.

A failing test indicating the generated code cannot run bf16 on V100 is
[here](https://github.com/microsoft/DeepSpeed/actions/runs/8838672379/job/24270349996?pr=5459#step:8:5157).
…eepspeedai#5493)

Reverting the previous revert of this feature:

nelyahu@bc48371

In addition, this includes a bug fix for offload mode.
…or().current_device() (deepspeedai#5464)

Creating a Torch tensor with the parameter
`device=get_accelerator().current_device()` can result in a crash when
using an NPU.

This issue arises because the `current_device` API across all
accelerators is expected to return a device id as an integer, according
to the [interface
docs](https://github.com/microsoft/DeepSpeed/blob/fa8458b1a80d6ba55091b17f092de19bbf95eb3d/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103).

However, specifying `device` as an integer when creating tensors directs
Torch to use the CUDA backend by default, which leads to crashes on NPUs
(and potentially other accelerators as well).

To resolve this, we should use `get_accelerator().current_device_name()`
instead, which returns the correct device identifier string, such as
`"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate
context needed for creating tensors on specific hardware accelerators.
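
The pattern in code, using the accelerator abstraction described above:

```python
import torch
from deepspeed.accelerator import get_accelerator

# current_device() returns an integer index, which torch interprets as a CUDA
# device; current_device_name() returns a full identifier such as "npu:0".
x = torch.empty(16, device=get_accelerator().current_device_name())
```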

I also notice that `device=get_accelerator().current_device()` is used
across several files under deepspeed/inference, and may also lead to
crash on other accelerators.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
The compile wrapper will inherit from the user module class and copy its
`__dict__`.

This should resolve most issues in deepspeedai#5383, except for
potential extra user forward hooks.

@tohtana @loadams

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
This PR aims to enable phi3 mini autotp.

Phi3 mini uses a chunked MLP; we adjust this linear layer's weight order
to support this model (see the sketch below).
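
A conceptual sketch of why a fused (chunked) MLP weight needs its rows regrouped before tensor-parallel sharding; the shapes and names are illustrative, not the actual Phi-3 module:

```python
import torch

hidden, inter, tp = 8, 16, 2
fused = torch.randn(2 * inter, hidden)   # gate and up projections stacked along dim 0

# A naive row split would give rank 0 all of "gate" and rank 1 all of "up".
# Instead, split gate and up separately and pair the matching halves per rank.
gate, up = fused.split(inter, dim=0)
shards = [torch.cat([g, u], dim=0)
          for g, u in zip(gate.chunk(tp, dim=0), up.chunk(tp, dim=0))]
```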

Please kindly review~ Thanks!

---------

Co-authored-by: Lev Kurilenko <[email protected]>
Update the manifest to cover hpp files in csrc.
This PR aims to enable phi2 model autotp.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Fixes: deepspeedai#5517 

Link to PyPI for nvidia-ml-py
[here](https://pypi.org/project/nvidia-ml-py/), showing that usage
remains the same as with the previous pynvml package.
…#5533)

Hi @loadams, could you please help review this PR?
After adding hpp files in csrc, we found that the hpp headers are
sometimes still excluded from the op source packaging, so we add the hpp
files in deepspeed to make sure the headers ship in the deepspeed
package, ensuring JIT load can compile the xpu/fused_adam ops in 0.14.2.
This PR introduces a new monitoring option, `CometMonitor`, which is an
official integration with
[CometML](https://www.comet.com/site/).

The new monitor is covered with unit tests.

Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't look
like it's used anymore.
* We've updated the "Monitoring Module" section name in `config-json.md`
to be generic so the next integration won't require updating it.

---------

Co-authored-by: Boris Feld <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Was providing the optimizer name that was configured, not the optimizer
actually in effect after this function's processing; the two are not
always aligned.

Co-authored-by: Logan Adams <[email protected]>
…eedai#5159)

Enhance testing: skip fused_optimizer tests if not supported.

Added a condition check to skip fused_optimizer tests if FusedAdam and
FusedLamb are not supported by the accelerator. This ensures the tests
are appropriately skipped when the hardware configuration does not
support these optimizers, preventing potential issues (see the sketch
after the details list below).

Details:
- Introduced a condition check to determine support for FusedAdam and
FusedLamb.
- If not supported, fused_optimizer tests are skipped to improve test
reliability.
- Improved compatibility and stability across different hardware
configurations.
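
A hedged sketch of such a guard; the op-builder compatibility check is an assumption about how support is detected:

```python
import pytest
from deepspeed.ops.op_builder import FusedAdamBuilder, FusedLambBuilder

# Skip the whole module when the accelerator cannot build the fused optimizers.
pytestmark = pytest.mark.skipif(
    not (FusedAdamBuilder().is_compatible() and FusedLambBuilder().is_compatible()),
    reason="FusedAdam/FusedLamb are not supported by this accelerator",
)
```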

---------

Co-authored-by: Logan Adams <[email protected]>