Releases: vllm-project/vllm

v0.6.6.post1

27 Dec 06:24
2339d59

This release restores functionality for other quantized MoE models, which was broken by the initial DeepSeek V3 support 🙇.

What's Changed

  • [Docs] Document Deepseek V3 support by @simon-mo in #11535
  • Update openai_compatible_server.md by @robertgshaw2-neuralmagic in #11536
  • [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
  • [V1] Fix yapf by @WoosukKwon in #11538
  • [CI] Fix broken CI by @robertgshaw2-neuralmagic in #11543
  • [misc] fix typing by @youkaichao in #11540
  • [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-neuralmagic in #11534
  • [BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-neuralmagic in #11547

Full Changelog: v0.6.6...v0.6.6.post1

v0.6.6

27 Dec 00:12
f49777b

Highlights

  • Support for the DeepSeek V3 model (#11523, #11502).

    • On 8xH200s or MI300x: vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --trust-remote-code --max-model-len 8192. The context length can be increased to about 32K before running into memory issues.
    • For other devices, follow our distributed inference guide to enable tensor parallel and/or pipeline parallel inference.
    • We are just getting started on enhancing this support and unlocking more performance. See #11539 for planned work.
  • Last-mile stretch for the V1 engine refactoring: API Server (#11529, #11530), penalties for sampler (#10681), prefix caching for vision language models (#11187, #11305), TP Ray executor (#11107, #11472).

  • Breaking change: X-Request-ID echoing is now opt-in instead of on by default for performance reasons. Set --enable-request-id-headers to enable it.
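
To illustrate the opt-in header behavior, the sketch below builds (but does not send) a completion request that carries an X-Request-ID header. It assumes a server started with the vllm serve command above plus --enable-request-id-headers, listening on http://localhost:8000. The helper names, the port, the prompt, and the max_tokens value are hypothetical choices for this example, and only the Python standard library is used.

```python
import json
import uuid
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server on port 8000


def build_completion_request(prompt, model="deepseek-ai/DeepSeek-V3", request_id=None):
    """Build (without sending) a POST to the OpenAI-compatible /v1/completions route."""
    request_id = request_id or uuid.uuid4().hex
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": 32}).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json", "X-Request-ID": request_id},
        method="POST",
    )
    return req, request_id


def echoed_request_id(response_headers):
    """Return the echoed ID, or None when the header is absent from the response."""
    for name, value in response_headers.items():
        if name.lower() == "x-request-id":  # header names are case-insensitive
            return value
    return None
```

With a running server, the request could be sent with urllib.request.urlopen(req) and the response headers passed to echoed_request_id. If the server was started without --enable-request-id-headers, the helper would be expected to return None.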

Model Support

  • IBM Granite 3.1 (#11307), JambaForSequenceClassification model (#10860)
  • Add QVQ and QwQ to the list of supported models (#11509)

Performance

  • Cutlass 2:4 Sparsity + FP8/INT8 Quant Support (#10995)

Production Engine

  • Support streaming model from S3 using RunAI Model Streamer as optional loader (#10192)
  • Online Pooling API (#11457)
  • Load video from base64 (#11492)

Others

  • Add pypi index for every commit and nightly build (#11404)

What's Changed


v0.6.5

17 Dec 23:10
2d1b9ba

Highlights

Model Support

Hardware Support

Performance & Scheduling

  • Prefix-cache aware scheduling (#10128), sliding window support (#10462), disaggregated prefill enhancements (#10502, #10884), evictor optimization (#7209).

Benchmark & Frontend

Documentation & Plugins

Bugfixes & Misc

What's Changed


v0.6.4.post1

15 Nov 17:50
a6221a1

This patch release covers bug fixes (#10347, #10349, #10348, #10352, #10363) and keeps compatibility for vLLMConfig usage in out-of-tree models (#10356).

What's Changed

New Contributors

Full Changelog: v0.6.4...v0.6.4.post1

v0.6.4

15 Nov 07:32
02dbf30

Highlights

Model Support

  • New LLMs and VLMs: Idefics3 (#9767), H2OVL-Mississippi (#9747), Qwen2-Audio (#9248), Pixtral models in the HF Transformers format (#9036), FalconMamba (#9325), Florence-2 language backbone (#9555)
  • New encoder-only embedding models: BERT (#9056), RoBERTa & XLM-RoBERTa (#9387)
  • Expanded task support: Llama embeddings (#9806), Math-Shepherd (Mistral reward modeling) (#9697), Qwen2 classification (#9704), Qwen2 embeddings (#10184), VLM2Vec (Phi-3-Vision embeddings) (#9303), E5-V (LLaVA-NeXT embeddings) (#9576), Qwen2-VL embeddings (#9944)
    • Add user-configurable --task parameter for models that support both generation and embedding (#9424)
    • Chat-based Embeddings API (#9759)
  • Tool calling parser for Granite 3.0 (#9027), Jamba (#9154), granite-20b-functioncalling (#8339)
  • LoRA support for Granite 3.0 MoE (#9673), Idefics3 (#10281), Llama embeddings (#10071), Qwen (#9622), Qwen2-VL (#10022)
  • BNB quantization support for Idefics3 (#10310), Mllama (#9720), Qwen2 (#9467, #9574), MiniCPMV (#9891)
  • Unified multi-modal processor for VLM (#10040, #10044)
  • Simplify model interface (#9933, #10237, #9938, #9958, #10007, #9978, #9983, #10205)
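
As a small illustration of the user-configurable --task parameter, the sketch below composes a vllm serve command line that pins a model to one task. The helper name is hypothetical, the model is only an illustrative choice, and the task values shown ("auto", "embedding") are assumptions about the accepted choices in this release; vllm serve --help lists the actual ones.

```python
import shlex


def serve_command(model, task="auto", extra_args=()):
    """Compose a `vllm serve` command line that passes an explicit --task value."""
    argv = ["vllm", "serve", model, "--task", task, *extra_args]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(argv)


# Assumption: "embedding" is an accepted --task value in this release.
command = serve_command("intfloat/e5-mistral-7b-instruct", task="embedding")
print(command)
```

The resulting string can be pasted into a shell or split again with shlex.split before handing it to subprocess.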

Hardware Support

  • Gaudi: Add Intel Gaudi (HPU) inference backend (#6143)
  • CPU: Add embedding models support for CPU backend (#10193)
  • TPU: Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
  • Triton: Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)

Performance

  • Combine chunked prefill with speculative decoding (#9291)
  • fused_moe Performance Improvement (#9384)

Engine Core

  • Override HF config.json via CLI (#5836)
  • Add goodput metric support (#9338)
  • Move parallel sampling out from vllm core, paving way for V1 engine (#9302)
  • Add stateless process group for easier integration with RLHF and disaggregated prefill (#10216, #10072)

Others

  • Improvements to the pull request experience with DCO, mergify, stale bot, etc. (#9436, #9512, #9513, #9259, #10082, #10285, #9803)
  • Dropped support for Python 3.8 (#10038, #8464)
  • Basic Integration Test For TPU (#9968)
  • Document the class hierarchy in vLLM (#10240), explain the integration with Hugging Face (#10173).
  • Benchmark throughput now supports image input (#9851)

What's Changed

  • [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
  • [Frontend] merge beam search implementations by @LunrEclipse in #9296
  • [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
  • [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
  • [Frontend] Clarify model_type error messages by @stevegrubb in #9345
  • [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
  • [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
  • [BugFix] Fix chat API continuous usage stats by @njhill in #9357
  • pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
  • [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
  • [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
  • [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
  • [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
  • [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
  • [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
  • [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
  • [Core] Rename input data types by @DarkLight1337 in #8688
  • [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
  • [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
  • Support mistral interleaved attn by @patrickvonplaten in #9414
  • [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
  • [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
  • [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
  • [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
  • [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
  • [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
  • Add notes on the use of Slack by @terrytangyuan in #9442
  • [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
  • [Misc] Print stack trace using logger.exception by @DarkLight1337 in #9461
  • [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
  • [Bugfix] Allow prefill of assistant response when using mistral_common by @sasha0552 in #9446
  • [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
  • [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
  • [Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
  • [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
  • [Misc] Remove commit id file by @DarkLight1337 in #9470
  • [torch.compile] Fine-grained CustomOp enabling mechanism by @ProExpertProg in #9300
  • [Bugfix] Fix support for dimension like integers and ScalarType by @bnellnm in #9299
  • [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by @wukaixingxp in #9013
  • [Bugfix] Print warnings related to mistral_common tokenizer only once by @sasha0552 in #9468
  • [Hardware][Neuron] Simplify model load for transformers-neuronx library by @sssrijan-amazon in #9380
  • Support BERTModel (first encoder-only embedding model) by @robertgshaw2-neuralmagic in #9056
  • [BugFix] Stop silent failures on compressed-tensors parsing by @dsikka in #9381
  • [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by @joerunde in #9352
  • [Qwen2.5] Support bnb quant for Qwen2.5 by @blueyo0 in #9467
  • [CI/Build] Use commit hash references for github actions by @russellb in #9430
  • [BugFix] Typing fixes to RequestOutput.prompt and beam search by @njhill in #9473
  • [Frontend][Feature] Add jamba tool parser by @tomeras91 in #9154
  • [BugFix] Fix and simplify completion API usage streaming by @njhill in #9475
  • [CI/Build] Fix lint errors in mistral tokenizer by @DarkLight1337 in #9504
  • [Bugfix] Fix offline_inference_with_prefix.py by @tlrmchlsmth in #9505
  • [Misc] benchmark: Add option to set max concurrency by @russellb in #9390
  • [Model] Add user-configurable task for models that support both generation and embedding by @DarkLight1337 in #9424
  • [CI/Build] Add error matching config for mypy by @russellb in #9512
  • [Model] Support Pixtral models ...

v0.6.3.post1

17 Oct 17:26
a2c71c5

Highlights

New Models

  • Support Ministral 3B and Ministral 8B via interleaved attention (#9414)
  • Support multiple and interleaved images for Llama3.2 (#9095)
  • Support VLM2Vec, the first multimodal embedding model in vLLM (#9303)

Important bug fixes

  • Fix chat API continuous usage stats (#9357)
  • Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids (#9034)
  • Fix Molmo text-only input bug (#9397)
  • Fix CUDA 11.8 Build (#9386)
  • Fix _version.py not found issue (#9375)

Other Enhancements

  • Remove block manager v1 and make block manager v2 default (#8704)
  • Spec Decode: optimize ngram lookup performance (#9333)

What's Changed

New Contributors

  • @gracehonv made their first contribution in #9349
  • @streaver91 made their first contribution in #9396

Full Changelog: v0.6.3...v0.6.3.post1

v0.6.3

14 Oct 20:20
fd47e57

Highlights

Model Support

  • New Models:
  • Expansion in functionality:
    • Add Gemma2 embedding model (#9004)
    • Support input embeddings for qwen2vl (#8856), minicpmv (#9237)
    • LoRA:
      • LoRA support for MiniCPMV2.5 (#7199), MiniCPMV2.6 (#8943)
      • Expand lora modules for mixtral (#9008)
    • Pipeline parallelism support to remaining text and embedding models (#7168, #9090)
    • Expanded bitsandbytes quantization support for Falcon, OPT, Gemma, Gemma2, and Phi (#9148)
    • Tool use:
      • Add support for Llama 3.1 and 3.2 tool use (#8343)
      • Support tool calling for InternLM2.5 (#8405)
  • Out-of-tree support enhancements: explicit interface for vLLM models and support for OOT embedding models (#9108)

Documentation

  • New compatibility matrix for mutually exclusive features (#8512)
  • Reorganized installation doc, note that we publish a per-commit docker image (#8931)

Hardware Support

  • Cross-attention and Encoder-Decoder models support on x86 CPU backend (#9089)
  • Support AWQ for CPU backend (#7515)
  • Add async output processor for xpu (#8897)
  • Add on-device sampling support for Neuron (#8746)

Architectural Enhancements

  • Progress in refactoring vLLM's core:
    • Spec decode removing batch expansion (#8839, #9298).
    • We have made block manager V2 the default. This is an internal refactoring for a cleaner and better-tested code path (#8678).
    • Moving beam search from the core to the API level (#9105, #9087, #9117, #8928)
    • Move guided decoding params into sampling params (#8252)
  • Torch Compile:
    • You can now set the environment variable VLLM_TORCH_COMPILE_LEVEL to control the level of torch.compile compilation and integration (#9058). Together with various improvements (#8982, #9258, #906, #8875), setting VLLM_TORCH_COMPILE_LEVEL=3 turns on Inductor's full-graph compilation without vLLM's custom ops.
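
One way to apply the VLLM_TORCH_COMPILE_LEVEL setting from Python is to prepare the environment for a child process, as in this minimal sketch. The helper name is hypothetical, and the commented-out launch line is only an illustration with a placeholder model.

```python
import os


def torch_compile_env(level, base_env=None):
    """Return a copy of the environment with VLLM_TORCH_COMPILE_LEVEL set."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_TORCH_COMPILE_LEVEL"] = str(int(level))  # env values must be strings
    return env


env = torch_compile_env(3)
# The dict can then be passed to a child process, for example:
# subprocess.run(["vllm", "serve", "<model>"], env=env, check=True)
```

Building a copy rather than mutating os.environ keeps the setting scoped to the launched process.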

Others

  • Performance enhancements to turn on multi-step scheduling by default (#8804, #8645, #8378)
  • Enhancements towards priority scheduling (#8965, #8956, #8850)

What's Changed


v0.6.2

25 Sep 21:50
7193774

Highlights

Model Support

  • Support Llama 3.2 models (#8811, #8822)

     vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --enforce-eager --max-num-seqs 16
    
  • Beam search has been soft deprecated. We are moving towards a version of beam search that is more performant and that also simplifies vLLM's core. (#8684, #8763, #8713)

    • ⚠️ You will now see the following error; this is a breaking change!

      Using beam search as a sampling parameter is deprecated, and will be removed in the future release. Please use the vllm.LLM.use_beam_search method for dedicated beam search instead, or set the environment variable VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1 to suppress this error. For more details, see #8306

  • Support for Solar Model (#8386), minicpm3 (#8297), LLaVA-Onevision model support (#8486)

  • Enhancements: pp for qwen2-vl (#8696), multiple images for qwen-vl (#8247), mistral function calling (#8515), bitsandbytes support for Gemma2 (#8338), tensor parallelism with bitsandbytes quantization (#8434)
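
For code that still relies on the deprecated beam search sampling parameter, the environment variable named in the error message above can be set for a limited scope. The sketch below is one way to do that; the context manager name is hypothetical, and it assumes the variable is read when the request is validated. If it is read earlier, for instance at import time, it would need to be set before importing vllm.

```python
import os
from contextlib import contextmanager

FLAG = "VLLM_ALLOW_DEPRECATED_BEAM_SEARCH"


@contextmanager
def allow_deprecated_beam_search():
    """Temporarily set the opt-out variable named in the error message."""
    previous = os.environ.get(FLAG)
    os.environ[FLAG] = "1"
    try:
        yield
    finally:
        # Restore the prior state so the override does not leak to later code.
        if previous is None:
            os.environ.pop(FLAG, None)
        else:
            os.environ[FLAG] = previous
```

Legacy calls would go inside a with allow_deprecated_beam_search(): block; since the error message says the parameter will be removed in a future release, this is a stopgap rather than a migration.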

Hardware Support

  • TPU: implement multi-step scheduling (#8489), use Ray for default distributed backend (#8389)
  • CPU: Enable mrope and support Qwen2-VL on CPU backend (#8770)
  • AMD: custom paged attention kernel for rocm (#8310), and fp8 kv cache support (#8577)

Production Engine

  • Initial support for priority scheduling (#5958)
  • Support LoRA lineage and base model metadata management (#6315)
  • Batch inference for llm.chat() API (#8648)

Performance

  • Introduce MQLLMEngine for API Server, boost throughput 30% in single step and 7% in multistep (#8157, #8761, #8584)
  • Multi-step scheduling enhancements
    • Prompt logprobs support in Multi-step (#8199)
    • Add output streaming support to multi-step + async (#8335)
    • Add flashinfer backend (#7928)
  • Add cuda graph support during decoding for encoder-decoder models (#7631)

Others

  • Support sample from HF datasets and image input for benchmark_serving (#8495)
  • Progress in torch.compile integration (#8488, #8480, #8384, #8526, #8445)

What's Changed


v0.6.1.post2

13 Sep 18:35
9ba0817

Highlights

  • This release contains an important bugfix related to token streaming combined with stop string (#8468)

What's Changed

Full Changelog: v0.6.1.post1...v0.6.1.post2

v0.6.1.post1

13 Sep 04:40
acda0b3

Highlights

This release features important bug fixes and enhancements for:

  • Pixtral models. (#8415, #8425, #8399, #8431)
    • Chunked scheduling has been turned off for vision models. Please replace --max_num_batched_tokens 16384 with --max-model-len 16384.
  • Multistep scheduling. (#8417, #7928, #8427)
  • Tool use. (#8423, #8366)
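
The flag replacement recommended for Pixtral above can be applied to an existing launch script's argument list with a small helper like the one sketched here. The function name is hypothetical, the model name is only an example, and treating the hyphenated spelling --max-num-batched-tokens as equivalent is an assumption.

```python
def migrate_vision_args(argv):
    """Replace --max_num_batched_tokens N with --max-model-len N in an argument list."""
    old_names = ("--max_num_batched_tokens", "--max-num-batched-tokens")
    migrated = []
    for arg in argv:
        name, sep, value = arg.partition("=")
        if name in old_names:
            # Covers both the "--flag value" and "--flag=value" spellings.
            migrated.append("--max-model-len" + sep + value)
        else:
            migrated.append(arg)
    return migrated


old = ["vllm", "serve", "mistralai/Pixtral-12B-2409", "--max_num_batched_tokens", "16384"]
print(" ".join(migrate_vision_args(old)))
```

The helper does not check whether --max-model-len is already present, so a command that sets both flags would end up with a duplicate and should be reviewed by hand.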

Also

  • Support multiple images for qwen-vl (#8247)
  • Remove engine_use_ray (#8126)
  • Add engine option to return only deltas or final output (#7381)
  • Add bitsandbytes support for Gemma2 (#8338)

What's Changed

New Contributors

Full Changelog: v0.6.1...v0.6.1.post1