[Issue]: mixtral 8x7b fp8 tp1 performance degraded with aiter #91

andyluo7 · 2025-02-05T20:36:18Z

Problem Description

With rocm/vllm-dev:nightly_aiter_intergration_final_20250130, Mixtral 8x7B fp8 tp1 throughput drops from 9100 tks without aiter to 7500 tks with aiter.
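
For scale, the two figures above correspond to a regression of roughly 18%. A minimal sketch of that arithmetic (the helper name `relative_drop` is only illustrative and is not part of vLLM or aiter):

```python
def relative_drop(baseline: float, degraded: float) -> float:
    """Return the fractional throughput loss relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline - degraded) / baseline


# Figures reported in this issue: 9100 tks without aiter, 7500 tks with aiter.
drop = relative_drop(9100, 7500)
print(f"relative drop: {drop:.1%}")
```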

Operating System

OS: NAME="Ubuntu" VERSION="22.04.2 LTS (Jammy Jellyfish)"

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.3.1

ROCm Component

No response

Steps to Reproduce

with aiter:

export VLLM_USE_TRITON_FLASH_ATTN=0
export VLLM_USE_AITER=1

python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model /models/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    --distributed-executor-backend mp \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.90 \
    --num-scheduler-steps 10 \
    --max-model-len 8192 \
    --max-num-batched-tokens 32768 \
    --input-len 128 \
    --output-len 128 \
    --tensor-parallel-size 1 \
    --num-prompts 30000 \
    --max-num-seqs 2048 \
    --block_size 16

without aiter:

export VLLM_USE_TRITON_FLASH_ATTN=0
export VLLM_USE_AITER=0

python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model /models/Mixtral-8x7B-Instruct-v0.1-FP8-KV \
    --distributed-executor-backend mp \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --dtype float16 \
    --gpu-memory-utilization 0.90 \
    --num-scheduler-steps 10 \
    --max-model-len 8192 \
    --max-num-batched-tokens 32768 \
    --input-len 128 \
    --output-len 128 \
    --tensor-parallel-size 1 \
    --num-prompts 30000 \
    --max-num-seqs 2048

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

valarLip · 2025-02-07T14:37:11Z

aiter's fused MoE does not currently include a large tile M to support cases with bs > 256, so performance drops in those cases.
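
To relate this to the reproduction commands, here is a rough sketch, assuming that "bs" here means the number of sequences batched together and that it is bounded by `--max-num-seqs`. The threshold constant and the function name are illustrative only and are not taken from aiter's code:

```python
# Assumption: 256 is the batch-size limit mentioned in the comment above.
AITER_FUSED_MOE_BS_LIMIT = 256


def may_exceed_bs_limit(max_num_seqs: int, limit: int = AITER_FUSED_MOE_BS_LIMIT) -> bool:
    """Return True if the scheduler is allowed to form batches larger than `limit`."""
    return max_num_seqs > limit


# The benchmark in this issue uses --max-num-seqs 2048.
print(may_exceed_bs_limit(2048))  # True: batches above 256 are possible
print(may_exceed_bs_limit(256))   # False: batches stay within the limit
```

Under that assumption, the reported configuration can run well above the bs > 256 range described here, which would be consistent with the lower throughput.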
