Why is ITL's first token so long? #62

Open
sunshenao opened this issue Jan 3, 2025 · 3 comments

@sunshenao

model: Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4
GPU: 8 * H20
prefill (P): 4 * H20
decode (D): 4 * H20
input_len: 1024, output_len: 6

sudo sh disagg_performance_benchmark.sh

This is what I get when qps=10.

[benchmark charts: mean_itl_ms, median_itl_ms]

The mean ITL is much larger for the disaggregated setup than for the non-disaggregated one, but the median ITL for the disaggregated setup is much smaller.
This is what I see in the generated result file:

[screenshot of the generated result file]

The first token of the decode part always takes a long time.
This phenomenon does not occur at lower QPS, e.g., when qps = 2:

[screenshot of results at qps = 2]

But when the QPS is small, the ITL increase is not very obvious, while the TTFT increases a lot:

[screenshot of TTFT results]

May I ask why this is so? Is there any way to reduce the time of the first decode token?
I look forward to your answer. Thank you.

@ShangmingCai
Collaborator

In disaggregated prefilling scenarios, the first token of the decode stage (i.e., TTFT) consists of the prefill stage overhead, the KVCache transfer cost, and the first-run overhead of the decode stage. Since layer-wise KVCache transfer is not ready yet, the unsatisfactory TTFT performance is expected for now.
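As a rough illustration of that breakdown (a sketch with made-up timings, not measured values or Mooncake's actual code):

```python
# Back-of-envelope decomposition of TTFT under disaggregated prefill.
# All numbers below are hypothetical placeholders, not measurements.
prefill_ms = 120.0           # prefill forward pass on the P node
kv_transfer_ms = 45.0        # blocking KVCache transfer from P to D
decode_first_step_ms = 30.0  # first decode forward pass on the D node

ttft_ms = prefill_ms + kv_transfer_ms + decode_first_step_ms
print(f"TTFT ~= {ttft_ms} ms")  # the benchmark attributes this whole gap to the first decode token
```

With layer-wise transfer, most of the KVCache transfer cost could overlap with the prefill forward pass instead of adding to TTFT.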

Also, the current implementation of KVCache transfer is not zero-copy. And there is a buffer_lock in the implementation of simple_buffer.py, which might cause trouble when QPS is large.
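A minimal sketch of why a single shared lock hurts at high QPS (illustration only, assuming a simplified buffer; this is not the actual simple_buffer.py code):

```python
import threading
import time

# Illustration only: one lock guarding the shared transfer buffer means
# concurrent requests serialize their KVCache copies instead of overlapping.
buffer_lock = threading.Lock()

def send_kvcache(copy_seconds: float) -> None:
    with buffer_lock:
        time.sleep(copy_seconds)  # stands in for the non-zero-copy buffer copy

threads = [threading.Thread(target=send_kvcache, args=(0.05,)) for _ in range(10)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock held across the whole copy, ten "concurrent" 50 ms transfers
# take roughly 0.5 s end to end rather than ~0.05 s.
print(f"elapsed: {time.time() - start:.2f}s")
```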

Please refer to the roadmaps of the disaggregated prefilling feature in vLLM (vllm-project/vllm#10818) and Mooncake (#44); there remains much work to do before this feature is production-ready.

BTW, how many GPUs do you use for the set of chunked prefill experiments?

@sunshenao
Author

Thanks for your answer. I started 2 chunked prefill instances with 4 * H20 each, just like the disagg_performance_benchmark.sh configuration.
My other question is: so far, my test results do not show any performance increase compared to the chunked prefill instances. What is the reason for this?

For example, here are my results running on A30:

gpu: A30
prefill: 1 * A30
decode: 1 * A30
model: Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
mooncake.json:
{
    "prefill_url": "localhost:13003",
    "decode_url": "localhost:13103",
    "metadata_server": "localhost:2379",
    "metadata_backend": "etcd",
    "protocol": "tcp",
    "device_name": ""
}
input_len: 256, output_len: 6
sudo sh disagg_performance_benchmark.sh

[benchmark result screenshots for the A30 setup]

@ShangmingCai
Collaborator

Thanks for your answer. I started 2 chunked prefill instances with 4 * H20 each, just like the disagg_performance_benchmark.sh configuration. My other question is: so far, my test results do not show any performance increase compared to the chunked prefill instances. What is the reason for this?

Yes. Since PD separation is similar to pipeline processing at the current stage, even with high QPS there exist computing bubbles on both nodes. For the two chunked prefill instances, however, the resources of each GPU can be fully utilized, and because no KVCache is transmitted between nodes, performance is not limited by the network.
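A rough back-of-envelope model of that effect (all per-request times below are hypothetical, chosen only to illustrate the shape of the comparison):

```python
# Hypothetical per-request stage times on a single GPU (not measurements).
prefill_s = 0.10   # prefill time per request
decode_s = 0.06    # decode time per request
transfer_s = 0.03  # KVCache transfer between the P and D nodes

# 1P1D pipeline: steady-state throughput is bounded by the slower stage,
# while the faster stage idles (the "computing bubble").
pd_throughput = 1.0 / max(prefill_s, decode_s + transfer_s)

# Two chunked prefill instances: each GPU serves whole requests with no
# inter-node transfer, so both GPUs stay fully utilized.
chunked_throughput = 2.0 / (prefill_s + decode_s)

print(f"1P1D:       {pd_throughput:.1f} req/s")    # ~10 req/s with these numbers
print(f"2x chunked: {chunked_throughput:.1f} req/s")  # ~12.5 req/s
```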

Therefore, 1P1D tests will not show any increase in performance compared to 2 chunked prefill instances. To better evaluate the practicality of PD separation, we need the XpYd implementation, as well as a heterogeneous GPU environment and a high-speed network.
