forked from deepspeedai/DeepSpeed
Skip the unit tests that use unimplemented op builders. #1
Open
foin6 wants to merge 72 commits into master from dev
Conversation
rogerxfeng8 reviewed Apr 4, 2024
looks ok to me
This PR enhances DeepSpeed to support MoE for pipeline models (e.g. GPTModelPipe from Megatron-DeepSpeed). Main changes:
- Enhance expert group creation for pipeline (enhance both flavors: DP/PP/EP and DP/TP/PP/EP)
- Fix MoE save/load checkpoint for PipelineModule based models.
- Display MoE loss for PipelineModule based models.
- Support gradients reduce for BF16_Optimizer for PipelineModule. Note that the same commit also fixes a gradients reduction error when using Megatron-DeepSpeed GPTModelPipe with BF16_Optimizer for a dense (no MoE) model.
- When using no-drop tokens, all-reduce the capacity (op=max) using the expert parallel group instead of the world group.

---------

Signed-off-by: Moshe Island <[email protected]> Co-authored-by: Moshe Island <[email protected]>
…edai#5164) The underlying issue preventing us from using pytest 8.0.1+ is in transformers. Here is a [PR that fixes the issue](huggingface/transformers#29154) that will need to be merged and then a new release of transformers before we can complete this PR. No impact to nv-nightly, run [here](https://github.com/microsoft/DeepSpeed/actions/runs/8575192380/job/23503659644) Currently errors on nv-inference from [the PR](https://github.com/microsoft/DeepSpeed/actions/runs/8575187611/job/23503647378), but the same tests are broken in [master](https://github.com/microsoft/DeepSpeed/actions/runs/8575185551/job/23503640238).
This change increases CI coverage for the HPU accelerator. --------- Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Logan Adams <[email protected]>
Add a basic workflow for tests on Intel XPU. Currently we have only part of the tests enabled; we will add more tests in later PRs. --------- Co-authored-by: Logan Adams <[email protected]>
…#5386) - Fixes xpu-max1100 not running on PR because of incorrect yml name.
This reverts commit eb216a8.
- Only workflows running Ubuntu 20.04 or later can be updated, as the GLIBC needed for Node 20+ can now be updated.
- Workflows that aren't updated are running Ubuntu 18.04 or older; those will need to be moved to updated images shortly and will be updated later in the original PR, deepspeedai#5021.

Sample warning that is resolved:
```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
```
- Move the `required_torch_version` check from deepspeed.runtime.utils to deepspeed.utils.torch (newly created).
- Remove the unused duplicate definition from `tests/unit/util.py`.
- Update all references to this function.
- Switch checks in `deepspeed/runtime/pipe/p2p.py` to use this function.
- Switch checks in `deepspeed/comm/torch.py` to use this function.

---------

Co-authored-by: Lev Kurilenko <[email protected]>
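A minimal usage sketch of the relocated helper; the surrounding function and the version threshold are illustrative, not taken from this PR:

```python
from deepspeed.utils.torch import required_torch_version

def supports_new_comm_api() -> bool:
    # Gate a feature behind a minimum torch version instead of hand-parsing
    # torch.__version__ strings; 1.10 here is just an example threshold.
    return required_torch_version(min_version=1.10)
```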
1. Fix a type error
2. Update the Intel XPU HW support status
…eedai#5373) This PR removes the for loop inside the dequantizer kernel and uses as many threads and blocks as needed to dequantize the quantized matrix. The previous implementation processed each group per thread block, which can reduce efficiency with smaller group sizes, and also processed more data per thread than necessary; we can use more parallelism to improve dequantization performance. Based on my testing results, for a 4K by 4K matrix, dequantizing from fp8 to bf16 gives a 2.5x speedup (improving the BW efficiency from 1 TB/s to 2.5 TB/s on an Nvidia H100 GPU). --------- Co-authored-by: Reza Yazdani <[email protected]>
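A rough sketch of the launch-configuration idea in plain Python arithmetic (the CUDA kernel itself is not reproduced here; the thread-block size is an assumed typical value):

```python
import math

def dequant_launch_config(rows: int, cols: int, threads_per_block: int = 1024):
    # One thread per output element instead of one block per quantization group
    # looping over its elements: launch as many blocks as the matrix needs.
    total_elements = rows * cols
    num_blocks = math.ceil(total_elements / threads_per_block)
    return num_blocks, threads_per_block

blocks, threads = dequant_launch_config(4096, 4096)  # e.g. the 4K x 4K case above
```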
This PR adds more flexibility to define weight tensor reshaping for universal checkpointing. Currently, universal checkpointing assumes a few patterns of partitioning for tensor parallelism, such as column/row-wise partitioning of a 2-dim tensor. However, these are not flexible enough to define partitioning for more complex usages. Here are some examples:

1) MoE: The user may define the weight tensor for MoE's FFN as [n_experts * hidden_out, hidden_in]. For TP, we need to *view* this tensor as a 3-dim tensor and partition it along the `hidden_out` dimension.
2) GQA: The weights for QKV are often represented as one tensor, and Q, K and V may have different sizes. The tensor shape will be [q_size + k_size + v_size, hidden]. We partition this along the first dimension, but separately for each of Q, K, and V. In this case, we first need to partition Q, K, and V separately and then concatenate them to get a shard for TP.

We propose a new pattern `PARAMETER_WITH_SUB_PARAMS` to support this. Here is the usage to cover the above use cases. You can define the view of the weight tensor and specify the dimension for partitioning based on the view.

```python
from deepspeed.checkpoint import PARAMETER_WITH_SUB_PARAMS, SubparamShape

info[PARAMETER_WITH_SUB_PARAMS] = [
    asdict(SubparamShape(patterns=[layers_prefix + r"\d+moe.fc1.weight"],
                         shape=(num_experts, hidden_out, hidden_in),
                         partition_dim=1)),
    asdict(SubparamShape(patterns=[layers_prefix + r"\d+.qkv.weight"],
                         shape=((q_size, k_size, v_size), hidden_size),
                         partition_dim=0)),
    ...
]
```

The conversion script (`ds_to_universal.py`) merges TP-sharded weight tensors, and the loader of universal checkpoints also partitions them following this information.

Co-authored-by: Olatunji Ruwase <[email protected]>
…i#5351) In unit tests, removed the hardcoded 'cuda' string by replacing it with a device variable set to get_accelerator().device_name(). Co-authored-by: Shaik Raza Sikander <[email protected]> Co-authored-by: Logan Adams <[email protected]>
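A minimal sketch of the pattern, assuming the standard DeepSpeed accelerator accessor (the tensor itself is just an example):

```python
import torch
from deepspeed.accelerator import get_accelerator

# device_name() returns the backend string (e.g. "cuda", "xpu", "hpu") instead of
# hardcoding "cuda", so the same test runs on any supported accelerator.
device = get_accelerator().device_name()
x = torch.ones(8, device=device)
```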
ZeRO offload case: fix pageable h2d memcpy in the step process. Now h2d memcpy uses pinned memory. Speeds up h2d memcpy by 6x on a single GPU and 4-5x on an 8-GPU node. cc @tjruwase --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Ubuntu <deepspeed@deepspeed-login.2d1icxc5dsxehnpuwt3ifc34ph.gvxx.internal.cloudapp.net> Co-authored-by: Logan Adams <[email protected]>
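A minimal sketch of the idea, not the actual optimizer code: staging the host tensor in page-locked (pinned) memory lets the host-to-device copy use fast, asynchronous DMA instead of a slow pageable-memory copy.

```python
import torch

def h2d_copy_pinned(src_cpu: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    # .pin_memory() returns a copy of the tensor in page-locked host memory.
    pinned = src_cpu.pin_memory()
    # non_blocking=True lets the transfer overlap with other work on the stream.
    return pinned.to(device, non_blocking=True)
```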
Use torch.norm instead of an inefficient for loop. --------- Co-authored-by: Olatunji Ruwase <[email protected]>
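A minimal sketch, assuming the goal is the global L2 norm over a list of gradient tensors: a single vectorized torch.norm over the stacked per-tensor norms replaces a Python loop that accumulates squared norms one tensor at a time.

```python
import torch

def global_grad_norm(grads):
    per_tensor_norms = torch.stack([torch.norm(g.float(), 2) for g in grads])
    return torch.norm(per_tensor_norms, 2)
```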
…dai#5333) Refine the guards of FP6 kernel compilation. Fix the `undefined symbol` problem of FP6 kernels on non-Ampere architectures. Related issue: deepspeedai/DeepSpeed-MII#443. --------- Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Michael Wyatt <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release** Released version - 0.14.1 Author - @loadams Co-authored-by: loadams <[email protected]>
…eedai#5329) When the dtype is bf16 or fp32, the if condition is not satisfied, so execution continues instead of skipping when Triton is not installed. Co-authored-by: Shaik Raza Sikander <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
Set non_daemonic_proc=True by default on the XPU device, using non-daemonic processes for unit tests. Co-authored-by: Logan Adams <[email protected]>
deepspeedai#5411) Some users are concerned that changes in TP topology during MoE training may interfere with experiments, having noticed similar issues: deepspeedai/Megatron-DeepSpeed#151 and https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files

We found a grad_norm calculation error after enabling TP. This error occurs because the flattened grad of a params group is used, where the group contains both non-TP and TP parameters. Therefore, it is not possible to use a single attribute to determine whether the flattened grad needs to compute the norm. In the current code logic, all params are assumed to be non-TP, resulting in only the tp_rank0 grad participating in the grad_norm computation; the other tp_rank grads have grad_norm_sum equal to 0. We tested and found that with TP=1 and TP=4, the difference in grad_norm is approximately a factor of two (sqrt(4)). This aligns with the aforementioned issue.

This problem should also affect dense models. Because bf16 does not flatten the params_group grad, the problem is avoided there. We tested the loss curve on the 1.3B model; in cases where the TP size increases, the inconsistency gap should be larger.

With this change, 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16: ![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)

Without this change, 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16: ![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)

---------

Co-authored-by: Conglong Li <[email protected]>
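An illustrative sketch of the corrected accounting (not the actual DeepSpeed code; the `model_parallel` attribute used to mark TP-sharded parameters is an assumption): TP-sharded parameters contribute from every TP rank, replicated parameters are counted once, and the partial squared norms are summed across the TP group.

```python
import torch
import torch.distributed as dist

def group_grad_norm(params, tp_rank: int, tp_group) -> torch.Tensor:
    total_sq = torch.zeros(1, device=params[0].device)
    for p in params:
        if p.grad is None:
            continue
        is_tp_sharded = getattr(p, "model_parallel", False)  # assumed marker attribute
        # Replicated params are identical on every TP rank, so count them only once.
        if is_tp_sharded or tp_rank == 0:
            total_sq += torch.norm(p.grad.float(), 2) ** 2
    # Sum the partial squared norms across the TP group so sharded params are fully counted.
    dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=tp_group)
    return total_sq.sqrt()
```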
…5415) A [warning is shown](https://github.com/microsoft/DeepSpeed/actions/runs/8695213322/job/23845782048#step:10:31) when we do releases:
```
[deploy](https://github.com/microsoft/DeepSpeed/actions/runs/8695213322/job/23845782048) Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: peter-evans/create-pull-request@v4. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
```
To resolve this, we update the create-pull-request action to `@v6`; see the release notes [here](https://github.com/peter-evans/create-pull-request/releases).
Optimized version of `nn.Linear` that adds features such as:
* LoRA w. base weight sharding
* FP [6,8,12] quantization

Depends on deepspeedai#5336 being merged first.

Co-authored-by: @rajhans Co-authored-by: @aurickq --------- Co-authored-by: Rajhans Samdani <[email protected]> Co-authored-by: Jeff Rasley <[email protected]>
Fixing a minor typo in the README file. Co-authored-by: Logan Adams <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release** Released version - 0.14.2 Author - @loadams Co-authored-by: loadams <[email protected]>
deepspeedai#5299) Add getter and setter methods for `compile_backend` across accelerators, which provide a mechanism to retrieve the compile backend. These APIs handle user-defined backend selection and raise a `ValueError` with informative error messages for unsupported backends. --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]>
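An illustrative sketch of the accessor pattern described here; the method and attribute names are assumptions for illustration, not the exact accelerator API:

```python
class ExampleAccelerator:
    # Hypothetical accelerator with a configurable torch.compile backend.
    _supported_compile_backends = ["inductor", "eager"]

    def __init__(self):
        self._compile_backend = "inductor"

    def get_compile_backend(self) -> str:
        return self._compile_backend

    def set_compile_backend(self, backend: str) -> None:
        if backend not in self._supported_compile_backends:
            raise ValueError(
                f"{backend} is not supported by this accelerator; "
                f"supported backends: {self._supported_compile_backends}")
        self._compile_backend = backend
```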
PyTorch v2.3 throws an error when it tries to compile `iter_params` used for ZeRO3. This PR excludes the function from the compilation targets. After this PR is merged, we can [unpin the torch version for unit tests](deepspeedai#5459).
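A minimal sketch of the exclusion mechanism, assuming torch >= 2.1 where `torch.compiler.disable` is available; `iter_params` here is a simplified stand-in for the real helper:

```python
import torch

@torch.compiler.disable  # tell TorchDynamo not to trace/compile this helper
def iter_params(modules):
    for module in modules:
        yield from module.parameters(recurse=False)
```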
…edai#5256)" (deepspeedai#5461) This reverts commit 54c0687 due to deepspeedai#5256 causing bugs when the ZeRO3 + ZeRO Offload features are enabled. This bug was discovered due to failures in the DS Chat CI workflow.

Failing tests across CI failures:

| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |

Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```

It seems that `torch.stack()` or `torch.norm()` is having issues when the offload feature is enabled and tensors are split between CPU/GPU; however, this is just an initial guess and would require more investigation. @nelyahu Since you are the original author of the PR, if you have some bandwidth, any help here is greatly appreciated!

After reverting this commit, all tests pass in the DS Chat CI workflow: https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763

@tjruwase for context.
…pspeedai#5462) This PR updates the ds-chat CI workflow to run when ZeRO stage 1-3 files are updated.
…speedai#5465) The order of parameters in the create_dir_symlink method looks wrong. Because of this, we get the error "PermissionError: [WinError 5] Denied access: '.\\deepspeed\\ops\\csrc'" when installing deepspeed >= 0.4.0 in a Windows environment. Please check this out @eltonzheng and @jeffra. --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]>
…compile_zero tests on v100 (deepspeedai#5459) Torch updating to 2.3.0 broke some test_compile_zero tests, so we pinned it. @tohtana pushed fixes in deepspeedai#5463; this should un-pin and move us back to the latest. A failing test that indicates the generated code cannot run bf16 on V100 is [here](https://github.com/microsoft/DeepSpeed/actions/runs/8838672379/job/24270349996?pr=5459#step:8:5157).
…eepspeedai#5493) Reverting the previous revert of this feature: nelyahu@bc48371. In addition, a bug fix for offload mode.
…or().current_device() (deepspeedai#5464) Creating a Torch tensor with the parameter `device=get_accelerator().current_device()` can result in a crash when using an NPU. This issue arises because the `current_device` API across all accelerators is expected to return a device id as an integer, according to the [interface docs](https://github.com/microsoft/DeepSpeed/blob/fa8458b1a80d6ba55091b17f092de19bbf95eb3d/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103). However, specifying `device` as an integer when creating tensors by default directs Torch to use the CUDA backend, which leads to crashes on NPUs (and potentially other accelerators as well). To resolve this, we should use `get_accelerator().current_device_name()` instead, which returns the correct device identifier strings such as `"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate context needed for creating tensors on specific hardware accelerators. I also notice that `device=get_accelerator().current_device()` is used across several files under deepspeed/inference, and may also lead to crashes on other accelerators. --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]>
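A minimal sketch of the recommended pattern (the tensor is just an example):

```python
import torch
from deepspeed.accelerator import get_accelerator

# current_device_name() returns a backend-qualified string such as "cuda:0",
# "xpu:0" or "npu:0", which torch resolves to the right backend; a bare integer
# from current_device() would be interpreted as a CUDA device index.
device = get_accelerator().current_device_name()
buf = torch.empty(1024, device=device)
```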
The compile wrapper will inherit from the user module class and copy its __dict__. This should resolve most issues in deepspeedai#5383 except potential extra user forward hooks. @tohtana @loadams Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]>
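An illustrative sketch of the wrapping idea (not the actual DeepSpeed implementation): derive the wrapper type from the user's module class and copy the original module's `__dict__` so isinstance checks and attribute access keep working.

```python
import torch

def wrap_for_compile(module: torch.nn.Module) -> torch.nn.Module:
    class CompiledModuleWrapper(module.__class__):
        def forward(self, *args, **kwargs):
            # A real wrapper would dispatch to a torch.compile-d callable here.
            return super().forward(*args, **kwargs)

    wrapper = CompiledModuleWrapper.__new__(CompiledModuleWrapper)
    wrapper.__dict__.update(module.__dict__)  # copy parameters, buffers, submodules
    return wrapper
```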
This PR aims to enable Phi3 mini autotp. Phi3 mini uses a chunked MLP; we adjust the linear layer weight order to support this model. Please kindly review~ Thanks! --------- Co-authored-by: Lev Kurilenko <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Update the manifest to cover hpp files in csrc.
This PR aims to enable phi2 model autotp. --------- Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
Fixes: deepspeedai#5517. Link to PyPI for nvidia-ml-py [here](https://pypi.org/project/nvidia-ml-py/), showing that usage remains the same as the previous pynvml package.
…#5533) Hi @loadams, could you please help review this PR? After adding hpp files in csrc, we found that sometimes the hpp headers would still be excluded from the op source packaging, so we add the hpp files in deepspeed to make sure the hpp headers are in the deepspeed package, ensuring JIT load can compile the xpu/fused_adam ops in 0.14.2.
This PR introduces a new monitoring option - `CometMonitor` - which comes as an official integration with [CometML](https://www.comet.com/site/). The new monitor is covered with unit tests.

Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't look like it is used anymore.
* We've updated the "Monitoring Module" section name in `config-json.md` to be generic so the next integration won't require updating it.

---------

Co-authored-by: Boris Feld <[email protected]> Co-authored-by: Logan Adams <[email protected]>
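A hypothetical sketch of enabling the monitor from a DeepSpeed config dict; the `"comet"` section and its keys are assumptions modeled on the existing monitor sections (tensorboard/wandb/csv_monitor), not copied from this PR.

```python
ds_config = {
    "train_batch_size": 8,
    "comet": {                               # assumed section name
        "enabled": True,
        "project": "my-deepspeed-runs",      # hypothetical project name
        "experiment_name": "moe-baseline",   # hypothetical experiment name
    },
}
```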
Was reporting the optimizer name that was configured, not the optimizer actually in effect after this function's processing; the two are not always aligned. Co-authored-by: Logan Adams <[email protected]>
…eedai#5159) Enhance testing: Skip fused_optimizer tests if not supported. Added condition check to skip fused_optimizer tests if FusedAdam and FusedLamb are not supported by the accelerator. This enhancement ensures that the tests are appropriately skipped when the hardware configuration does not support these optimizers, preventing potential issues. Details: - Introduced a condition check to determine support for FusedAdam and FusedLamb. - If not supported, fused_optimizer tests are skipped to improve test reliability. - Improved compatibility and stability across different hardware configurations. --------- Co-authored-by: Logan Adams <[email protected]>
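A minimal sketch of the skip guard, assuming the usual DeepSpeed unit-test pattern of consulting the op-builder compatibility table; treat the exact lookup as an assumption.

```python
import pytest
import deepspeed
from deepspeed.ops.op_builder import FusedAdamBuilder, FusedLambBuilder

# Skip every test in this module when the accelerator cannot build the fused ops.
pytestmark = pytest.mark.skipif(
    not deepspeed.ops.__compatible_ops__[FusedAdamBuilder.NAME]
    or not deepspeed.ops.__compatible_ops__[FusedLambBuilder.NAME],
    reason="FusedAdam/FusedLamb are not supported by this accelerator")
```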