With rocm/vllm-dev:nightly_aiter_intergration_final_20250130, Mixtral 8x7B FP8 TP1 throughput dropped from ~9100 tok/s to ~7500 tok/s with AITER vs. without AITER
Problem Description
With the rocm/vllm-dev:nightly_aiter_intergration_final_20250130 image, Mixtral 8x7B FP8 TP1 throughput drops from roughly 9,100 tokens/s to roughly 7,500 tokens/s (about a 17.6% regression) when AITER is enabled compared to when it is disabled.
Operating System
Ubuntu 22.04.2 LTS (Jammy Jellyfish)
CPU
Intel(R) Xeon(R) Platinum 8480C
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.3.1
ROCm Component
No response
Steps to Reproduce
With AITER:

export VLLM_USE_TRITON_FLASH_ATTN=0
export VLLM_USE_AITER=1
python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model /models/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    --distributed-executor-backend mp \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --num-scheduler-steps 10 \
    --max-model-len 8192 \
    --max-num-batched-tokens 32768 \
    --input-len 128 \
    --output-len 128 \
    --tensor-parallel-size 1 \
    --num-prompts 30000 \
    --max-num-seqs 2048 \
    --block_size 16
Without AITER:

export VLLM_USE_TRITON_FLASH_ATTN=0
export VLLM_USE_AITER=0
python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model /models/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    --distributed-executor-backend mp \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --dtype float16 \
    --gpu-memory-utilization 0.90 \
    --num-scheduler-steps 10 \
    --max-model-len 8192 \
    --max-num-batched-tokens 32768 \
    --input-len 128 \
    --output-len 128 \
    --tensor-parallel-size 1 \
    --num-prompts 30000 \
    --max-num-seqs 2048
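For convenience, here is a minimal sketch of a helper script that runs both configurations back to back and captures the benchmark summary for side-by-side comparison. It is hypothetical and not part of the original report: it assumes benchmark_throughput.py prints a summary line containing "Throughput:", and it deliberately holds --dtype at bfloat16 and drops --block_size for both runs so that the only difference between the two runs is VLLM_USE_AITER (the original commands above differ in dtype and block size as well).

#!/usr/bin/env bash
# Hypothetical comparison helper (not from the original report).
# Runs the same benchmark with AITER on and off and extracts the
# summary line, assuming benchmark_throughput.py prints "Throughput: ...".
set -euo pipefail

export VLLM_USE_TRITON_FLASH_ATTN=0

for use_aiter in 1 0; do
    export VLLM_USE_AITER=${use_aiter}
    echo "=== VLLM_USE_AITER=${use_aiter} ==="
    python3 /app/vllm/benchmarks/benchmark_throughput.py \
        --model /models/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
        --distributed-executor-backend mp \
        --quantization fp8 \
        --kv-cache-dtype fp8 \
        --dtype bfloat16 \
        --gpu-memory-utilization 0.90 \
        --num-scheduler-steps 10 \
        --max-model-len 8192 \
        --max-num-batched-tokens 32768 \
        --input-len 128 \
        --output-len 128 \
        --tensor-parallel-size 1 \
        --num-prompts 30000 \
        --max-num-seqs 2048 \
        | tee "throughput_aiter_${use_aiter}.log" \
        | grep -i "Throughput:" || true
done

Because this sketch keeps the dtype and flag set identical across the two runs, its numbers may differ slightly from the figures quoted above, which were produced with the two commands as originally written.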
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response