
Skip the unit tests that use unimplemented op builders. #1

Open
wants to merge 72 commits into
base: master

Conversation

Owner

@foin6 foin6 commented Mar 9, 2024

No description provided.


@rogerxfeng8 rogerxfeng8 left a comment


looks ok to me

mosheisland and others added 21 commits April 8, 2024 15:35
This PR enhances DeepSpeed to support MoE for pipeline models (e.g.
GPTModelPipe from Megatron-DeepSpeed).
Main changes:

- Enhance expert group creation for pipeline models (both flavors: DP/PP/EP and DP/TP/PP/EP)
- Fix MoE save/load checkpoint for PipelineModule based models.
- Display MoE loss for PipelineModule based models.
- Support gradient reduction with BF16_Optimizer for PipelineModule. Note that the same commit also fixes a gradient-reduction error when using Megatron-DeepSpeed GPTModelPipe with BF16_Optimizer, including for a dense (no-MoE) model.
- When using no-drop tokens, all-reduce the capacity (op=max) using the expert parallel group instead of the world group (see the sketch after this list)
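
A minimal sketch of the capacity synchronization described in the last item, assuming a torch.distributed process group for expert parallelism; the surrounding MoE gating code is omitted and the names are illustrative:

```python
import torch
import torch.distributed as dist

def sync_capacity(capacity: torch.Tensor, expert_parallel_group) -> torch.Tensor:
    # With no-drop tokens every participating rank must agree on the largest
    # capacity, but only ranks that share experts need to take part, so the
    # all-reduce runs over the expert parallel group rather than the world group.
    dist.all_reduce(capacity, op=dist.ReduceOp.MAX, group=expert_parallel_group)
    return capacity
```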

---------

Signed-off-by: Moshe Island <[email protected]>
Co-authored-by: Moshe Island <[email protected]>
…edai#5164)

The underlying issue preventing us from using pytest 8.0.1+ is in
transformers. Here is a [PR that fixes the
issue](huggingface/transformers#29154); it will need to be merged,
followed by a new transformers release, before we can complete this PR.

No impact to nv-nightly; see the run
[here](https://github.com/microsoft/DeepSpeed/actions/runs/8575192380/job/23503659644).
Currently errors on nv-inference in [the
PR](https://github.com/microsoft/DeepSpeed/actions/runs/8575187611/job/23503647378),
but the same tests are broken in
[master](https://github.com/microsoft/DeepSpeed/actions/runs/8575185551/job/23503640238).
This change increases CI coverage for the HPU accelerator.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Add a basic workflow for tests on Intel XPU. Currently only part of the
tests are enabled; we will add more tests in later PRs.

---------

Co-authored-by: Logan Adams <[email protected]>
…#5386)

- Fixes xpu-max1100 not running on PRs because of an incorrect yml name.
- Only workflows running Ubuntu 20.04 or later can be updated, as the
GLIBC needed for Node 20+ is available there.
- Workflows that aren't updated are running Ubuntu 18.04 or older; those
will need to be moved to updated images shortly and will be updated
later in the original PR, deepspeedai#5021

Sample warning that is resolved:

```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
```
- Move `required_torch_version` check from deepspeed.runtime.utils to
deepspeed.utils.torch (newly created).
- Remove unused duplicate definition from `tests/unit/util.py`.
- Update all references to this function.
- Switch checks in `deepspeed/runtime/pipe/p2p.py` to use this function.
- Switch checks in `deepspeed/comm/torch.py` to use this function.
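
For reference, a minimal sketch of what such a helper can look like, assuming an interface of the form `required_torch_version(min_version=..., max_version=...)`:

```python
from packaging import version

import torch


def required_torch_version(min_version=None, max_version=None) -> bool:
    # Compare the installed torch version (ignoring local build suffixes like
    # "+cu121") against optional lower and upper bounds.
    torch_version = version.parse(torch.__version__.split("+")[0])
    if min_version is not None and torch_version < version.parse(str(min_version)):
        return False
    if max_version is not None and torch_version > version.parse(str(max_version)):
        return False
    return True
```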

---------

Co-authored-by: Lev Kurilenko <[email protected]>
1. Fix a type error.
2. Update the Intel XPU HW support status.
…eedai#5373)

This PR removes the for loop inside the dequantizer kernel and uses as
many threads and blocks as needed to dequantize the quantized matrix.
The previous implementation processed one group per thread block, which
can reduce efficiency with smaller group sizes and also processes more
data per thread than necessary; we can use more parallelism to improve
the dequantization performance.
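
For reference, the math the kernel performs is plain group-wise dequantization; below is a rough PyTorch illustration of that math (not the CUDA kernel itself, and the group layout and dtypes here are assumptions):

```python
import torch

def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor, group_size: int) -> torch.Tensor:
    # q holds quantized values, scales holds one scale per contiguous group of
    # group_size elements; each output element is its quantized value times the
    # scale of the group it belongs to.
    groups = q.float().view(-1, group_size)             # [num_groups, group_size]
    out = (groups * scales.float().view(-1, 1)).to(torch.bfloat16)
    return out.view(q.shape)
```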

Based on my testing results, for a 4K by 4K matrix, dequantizing from
fp8 to bf16 gives a 2.5x speedup (improving the BW efficiency from 1 TB/s
to 2.5 TB/s on an Nvidia H100 GPU).

---------

Co-authored-by: Reza Yazdani <[email protected]>
This PR adds more flexibility to define weight tensor reshaping for
universal checkpointing.

Currently universal checkpointing assumes a few patterns of partitioning
for tensor parallelism, such as column/row wise partitioning of a 2-dim
tensor. However, these are not flexible enough to define partitioning
for more complex usages. Here are some examples:

1) MoE: The user may define the weight tensor for MoE's FFN as
[n_experts * hidden_out, hidden_in]. For TP, we need to *view* this
tensor as 3-dim tensor and partition it along `hidden_out` dimension.
2) GQA: The weights for QKV are often represented as one tensor, and Q,
K, and V may have different sizes. The tensor shape will be
[q_size + k_size + v_size, hidden]. We partition this along the first
dimension, but separately for each of Q, K, and V. In this case, we first
need to partition Q, K, and V separately and then concatenate the pieces
to get a shard for TP.

We propose a new pattern `PARAMETER_WITH_SUB_PARAMS` to support this.
Here is the usage to cover the above use cases. You can define the view
of the weight tensor and specify the dimension for partitioning based on
the view.
```python
from deepspeed.checkpoint import PARAMETER_WITH_SUB_PARAMS, SubparamShape

info[PARAMETER_WITH_SUB_PARAMS] = [
    asdict(SubparamShape(patterns=[layers_prefix + r"\d+moe.fc1.weight"],
                shape=(num_experts, hidden_out, hidden_in), partition_dim=1)),
    asdict(SubparamShape(patterns=[layers_prefix + r"\d+.qkv.weight"],
                shape=((q_size, k_size, v_size), hidden_size), partition_dim=0)),
...
]
```

The conversion script (`ds_to_universal.py`) merges TP-sharded weight
tensors and the loader of universal checkpoints also partitions them
following the information.

Co-authored-by: Olatunji Ruwase <[email protected]>
…i#5351)

In UTs, removed the hardcoded 'cuda' string by replacing it with a device
variable set to get_accelerator().device_name()
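
Illustration of the pattern (the tensor creation is a made-up example, not a specific test):

```python
import torch
from deepspeed.accelerator import get_accelerator

device = get_accelerator().device_name()   # e.g. "cuda", "xpu", "hpu"
x = torch.ones(4, device=device)           # previously: device="cuda"
```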

Co-authored-by: Shaik Raza Sikander <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
ZeRO offload case

Fix the issue of pageable h2d memcpy in the step process. Now the h2d
memcpy uses pinned memory.

This speeds up the h2d memcpy by 6x on a single GPU and 4-5x on an 8-GPU node.
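
The underlying idea, sketched in plain PyTorch; the buffer management here is simplified relative to the optimizer's step path:

```python
import torch

src = torch.randn(1 << 22)                        # pageable host tensor
staging = torch.empty_like(src).pin_memory()      # page-locked staging buffer
staging.copy_(src)                                # host-to-host copy into pinned memory
dst = staging.to("cuda", non_blocking=True)       # fast, async-capable H2D transfer
```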

cc @tjruwase

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Ubuntu <deepspeed@deepspeed-login.2d1icxc5dsxehnpuwt3ifc34ph.gvxx.internal.cloudapp.net>
Co-authored-by: Logan Adams <[email protected]>
Use torch.norm instead of an inefficient for loop.
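
Roughly the difference, as an illustration rather than the exact DeepSpeed code:

```python
import torch

grads = [torch.randn(1024) for _ in range(32)]

# Before: accumulate squared norms in a Python loop.
loop_norm = sum(float(g.float().norm(2)) ** 2 for g in grads) ** 0.5

# After: a single vectorized norm over the stacked per-tensor norms.
fast_norm = torch.norm(torch.stack([g.float().norm(2) for g in grads]), 2)
```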

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
…dai#5333)

Refine the guards of FP6 kernel compilation. Fix the `undefined symbol`
problem of FP6 kernels on non-Ampere architectures.

Related issue: deepspeedai/DeepSpeed-MII#443.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.14.1
Author           - @loadams

Co-authored-by: loadams <[email protected]>
…eedai#5329)

When the dtype is bf16 or fp32, the if condition is not satisfied, so
execution continues instead of skipping when Triton is not installed.

Co-authored-by: Shaik Raza Sikander <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Set non_daemonic_proc=True by default on the XPU device, using
non-daemonic processes for unit tests.

Co-authored-by: Logan Adams <[email protected]>
deepspeedai#5411)

Some users are concerned that changes in TP topology during MoE training
may interfere with experiments, after noticing similar issues:
deepspeedai/Megatron-DeepSpeed#151
https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files

We found a grad_norm calculation error after enabling TP. The error
occurs because the flattened grad of a params group is used, and the
group contains both non-TP and TP parameters; therefore a single
attribute cannot determine whether the flattened grad needs to
contribute to the norm. In the current code logic, all params are
assumed to be non-TP, so only the tp_rank0 grad participates in the
grad_norm computation and the other tp_rank grads have grad_norm_sum
equal to 0. We tested and found that between TP=1 and TP=4, the
difference in grad_norm is approximately a factor of two (sqrt(4)),
which aligns with the issue above. This problem should also affect
dense models.
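
A toy illustration of the size of the error, assuming identically distributed gradient shards across TP ranks:

```python
import torch

tp_size = 4
shards = [torch.randn(1 << 16) for _ in range(tp_size)]   # one grad shard per TP rank

full_norm = torch.norm(torch.cat(shards))   # all TP ranks contribute
rank0_norm = torch.norm(shards[0])          # only tp_rank0 contributes (the bug)
print(full_norm / rank0_norm)               # ~sqrt(tp_size) == 2.0
```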

Because bf16 does not flatten the params-group grads, it avoids this
problem.

We tested the loss curve on a 1.3B model; as the TP size increases, the
inconsistency gap should grow.

With this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)

Without this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16

![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)

---------

Co-authored-by: Conglong Li <[email protected]>
…5415)

A [warning is
shown](https://github.com/microsoft/DeepSpeed/actions/runs/8695213322/job/23845782048#step:10:31)
when we do releases:

```
[deploy](https://github.com/microsoft/DeepSpeed/actions/runs/8695213322/job/23845782048)
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: peter-evans/create-pull-request@v4. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
```

To resolve this, we update the create-pull-request action to `@v6`; see
the release notes
[here](https://github.com/peter-evans/create-pull-request/releases).
jeffra and others added 30 commits April 23, 2024 12:24
Optimized version of `nn.Linear` that adds features such as:
      * LoRA with base weight sharding
      * FP [6,8,12] quantization

Depends on deepspeedai#5336 being merged first

Co-authored-by: @rajhans
Co-authored-by: @aurickq

---------

Co-authored-by: Rajhans Samdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Fix a minor typo in the README file.

Co-authored-by: Logan Adams <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.14.2
Author           - @loadams

Co-authored-by: loadams <[email protected]>
deepspeedai#5299)

Add getter and setter methods for `compile_backend` across accelerators,
which provide a mechanism to retrieve the compile backend. These APIs
handle user-defined backend selection and raise a `ValueError` with
informative error messages for unsupported backends.
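
A sketch of the intended usage; the method names follow the description above and should be treated as an assumption rather than the final API:

```python
from deepspeed.accelerator import get_accelerator

accel = get_accelerator()
print(accel.get_compile_backend())          # e.g. "inductor"

try:
    accel.set_compile_backend("not-a-real-backend")
except ValueError as err:
    print(err)                              # unsupported backends raise ValueError
```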

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
PyTorch v2.3 throws an error when it tries to compile `iter_params` used
for ZeRO3.
This PR excludes the function from the compilation targets.

After this PR is merged, we can [unpin the torch version for unit
tests](deepspeedai#5459).
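
One common way to keep a helper out of graph capture is `torch.compiler.disable`; whether DeepSpeed uses exactly this mechanism is not stated here, so treat the snippet as an illustration:

```python
import torch

@torch.compiler.disable
def iter_params(modules):
    # Generator excluded from torch.compile tracing; compiled callers fall back
    # to eager execution for this helper.
    for module in modules:
        yield from module.parameters(recurse=False)
```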
…edai#5256)" (deepspeedai#5461)

This reverts commit 54c0687 due to
deepspeedai#5256 causing bugs when the ZeRO3 + ZeRO Offload features are enabled.

This bug was discovered due to failures in the DS Chat CI workflow.
Failing tests across CI failures:
| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |

Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```

It seems that `torch.stack()` or `torch.norm()` has issues when the
offload feature is enabled and tensors are split between CPU and GPU;
however, this is just an initial guess and would require more
investigation.

@nelyahu Since you are the original author of the PR, if you have some
bandwidth, any help here is greatly appreciated!

After reverting this commit, all tests pass in the DS Chat CI workflow:

https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763

@tjruwase for context.
…pspeedai#5462)

This PR updates the ds-chat CI workflow to run when ZeRO stage 1-3 files
are updated.
…speedai#5465)

The order of parameters in the create_dir_symlink method looks wrong.
Because of this, we get the error "PermissionError: [WinError 5] Denied
access: '.\\deepspeed\\ops\\csrc'" when installing deepspeed >= 0.4.0 in
a Windows environment.
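
For reference, the standard-library call such a helper presumably wraps takes the link target first and the link name second; swapping them makes the call try to create the link on top of the existing source directory, which can surface as the permission error above. The paths below are illustrative only:

```python
import os

# os.symlink(src, dst): create a link named dst that points at src.
os.symlink("deepspeed/ops/csrc", "deepspeed/ops/csrc_link", target_is_directory=True)
```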

Please check this out @eltonzheng and @jeffra.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
…compile_zero tests on v100 (deepspeedai#5459)

Torch 2.3.0 broke some test_compile_zero tests, so we pinned it;
@tohtana pushed fixes in deepspeedai#5463, and this should un-pin torch
and move us back to the latest version.

A failing test indicating the generated code cannot run bf16 on V100 is
[here](https://github.com/microsoft/DeepSpeed/actions/runs/8838672379/job/24270349996?pr=5459#step:8:5157).
…eepspeedai#5493)

Reverting the previous revert of this feature:

nelyahu@bc48371

In addition, this includes a bug fix for offload mode.
…or().current_device() (deepspeedai#5464)

Creating a Torch tensor with the parameter
`device=get_accelerator().current_device()` can result in a crash when
using an NPU.

This issue arises because the `current_device` API across all
accelerators is expected to return a device id as an integer, according
to the [interface
docs](https://github.com/microsoft/DeepSpeed/blob/fa8458b1a80d6ba55091b17f092de19bbf95eb3d/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103).

However, specifying `device` as an integer when creating tensors directs
Torch to use the CUDA backend by default, which leads to crashes on NPUs
(and potentially other accelerators as well).

To resolve this, we should use `get_accelerator().current_device_name()`
instead, which returns the correct device identifier string, such as
`"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate
context needed for creating tensors on specific hardware accelerators.
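
The pattern in code, using the accelerator abstraction described above:

```python
import torch
from deepspeed.accelerator import get_accelerator

# current_device() returns an integer index, which torch interprets as a CUDA
# device; current_device_name() returns a full identifier such as "npu:0".
x = torch.empty(16, device=get_accelerator().current_device_name())
```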

I also notice that `device=get_accelerator().current_device()` is used
across several files under deepspeed/inference, and may also lead to
crash on other accelerators.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
The compile wrapper will inherit from the user module class and copy its
`__dict__`.

This should resolve most issues in deepspeedai#5383, except for
potential extra user forward hooks.

@tohtana @loadams

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
This PR aims to enable phi3 mini autotp.

Phi3 mini uses a chunked MLP; we adjust this linear layer's weight order
to support this model (see the sketch below).
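
A conceptual sketch of why a fused (chunked) MLP weight needs its rows regrouped before tensor-parallel sharding; the shapes and names are illustrative, not the actual Phi-3 module:

```python
import torch

hidden, inter, tp = 8, 16, 2
fused = torch.randn(2 * inter, hidden)   # gate and up projections stacked along dim 0

# A naive row split would give rank 0 all of "gate" and rank 1 all of "up".
# Instead, split gate and up separately and pair the matching halves per rank.
gate, up = fused.split(inter, dim=0)
shards = [torch.cat([g, u], dim=0)
          for g, u in zip(gate.chunk(tp, dim=0), up.chunk(tp, dim=0))]
```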

Please kindly review~ Thanks!

---------

Co-authored-by: Lev Kurilenko <[email protected]>
Update the manifest to cover hpp files in csrc.
This PR aims to enable phi2 model autotp.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Fixes: deepspeedai#5517 

Link to PyPI for nvidia-ml-py
[here](https://pypi.org/project/nvidia-ml-py/), showing that usage
remains the same as with the previous pynvml package.
…#5533)

Hi @loadams, could you please help review this PR?
After adding hpp files in csrc, we found that the hpp headers are
sometimes still excluded from the op source packaging, so we add the hpp
files in deepspeed to make sure the headers ship in the deepspeed
package, ensuring JIT load can compile the xpu/fused_adam ops in 0.14.2.
This PR introduces a new monitoring option, `CometMonitor`, which is an
official integration with
[CometML](https://www.comet.com/site/).

The new monitor is covered with unit tests.

Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't look
like it's used anymore.
* We've updated the "Monitoring Module" section name in `config-json.md`
to be generic so the next integration won't require updating it.

---------

Co-authored-by: Boris Feld <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Was providing the optimizer name that was configured, not the optimizer
actually in effect after this function's processing; the two are not
always aligned.

Co-authored-by: Logan Adams <[email protected]>
…eedai#5159)

Enhance testing: skip fused_optimizer tests if not supported.

Added a condition check to skip fused_optimizer tests if FusedAdam and
FusedLamb are not supported by the accelerator. This ensures the tests
are appropriately skipped when the hardware configuration does not
support these optimizers, preventing potential issues (see the sketch
after the details list below).

Details:
- Introduced a condition check to determine support for FusedAdam and
FusedLamb.
- If not supported, fused_optimizer tests are skipped to improve test
reliability.
- Improved compatibility and stability across different hardware
configurations.
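
A hedged sketch of such a guard; the op-builder compatibility check is an assumption about how support is detected:

```python
import pytest
from deepspeed.ops.op_builder import FusedAdamBuilder, FusedLambBuilder

# Skip the whole module when the accelerator cannot build the fused optimizers.
pytestmark = pytest.mark.skipif(
    not (FusedAdamBuilder().is_compatible() and FusedLambBuilder().is_compatible()),
    reason="FusedAdam/FusedLamb are not supported by this accelerator",
)
```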

---------

Co-authored-by: Logan Adams <[email protected]>