Requirements: I previously deployed the deepseek-r1:32b model with Ollama, which is convenient, fast, and well suited to quick personal deployment. But how should it be deployed in an enterprise production environment? vLLM and SGLang are the usual choices, and this article uses vLLM to deploy the DeepSeek-R1 model.
Ollama vs. vLLM
The differences are as follows:
| Dimension | Ollama | vLLM |
| --- | --- | --- |
| Core positioning | Lightweight local tool for individual developers and small-scale experiments | Production-grade inference framework focused on enterprise scenarios with high concurrency and low latency |
| Hardware requirements | Supports CPU and GPU; low memory footprint (uses quantized models by default) | Requires NVIDIA GPUs; high GPU memory usage |
| Model support | Built-in pre-trained model library (1700+ models); automatically downloads quantized versions (mainly int4) | Original model files (e.g. HuggingFace format) must be downloaded manually; supports a wider range of models |
| Deployment difficulty | One-click installation, works out of the box, no programming background required | Requires a Python environment, CUDA drivers, and some technical experience |
| Performance characteristics | Fast single-request inference, but weak concurrency handling | High throughput; supports dynamic batching and thousands of concurrent requests |
| Resource management | Adjusts resource usage flexibly and automatically releases GPU memory when idle | GPU memory usage is fixed; resources must be reserved for peak load |
A brief introduction to vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
With its new algorithms, vLLM redefines the state of the art for LLM serving: compared to HuggingFace Transformers, it delivers up to 24x higher throughput without any changes to the model architecture, and has been promoted as roughly halving compute requirements while multiplying throughput. In the original evaluation, the vLLM team compared its throughput against the most popular LLM library, HuggingFace Transformers (HF), and the previous state-of-the-art serving system, HuggingFace Text Generation Inference (TGI). The experiments used two setups: LLaMA-7B on an NVIDIA A10G GPU, and LLaMA-13B on an NVIDIA A100 GPU (40GB), with input/output lengths sampled from the ShareGPT dataset. The results showed vLLM's throughput to be up to 24x higher than HF and 3.5x higher than TGI.
vLLM documentation: https://docs.vllm.ai/ Source code: https://github.com/vllm-project/vllm Performance benchmarks: see the vLLM project blog.
You don't need to understand every detail of the benchmark chart; the takeaway is simply that vLLM is very fast.
Environment preparation
I purchased a Tencent Cloud high-performance application service instance with the following configuration:
- OS: Ubuntu 20.04
- Environment: driver 525.105.17, Python 3.8, CUDA 12.0, cuDNN 8
- Compute type: dual-GPU basic, 2 × 16 GB+ GPU memory, 16+ TFLOPS SP, 16 CPU cores, 64 GB RAM
Install Conda
Create a python environment with conda, paste the script directly:
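The original script is not reproduced here; a minimal sketch using the official Miniconda installer (the installer filename and install path are typical defaults, not taken from the original) would look like this:

```bash
# Download the latest Miniconda installer for Linux x86_64
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run it in batch mode and install into ~/miniconda3
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
# Initialize conda for bash and reload the shell configuration
"$HOME/miniconda3/bin/conda" init bash
source ~/.bashrc
```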
Deploy DeepSeek-R1 using vLLM
Create a Python environment with conda using the following command:
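The exact command is not preserved; a typical sketch (the environment name `vllm` and Python 3.10 are my assumptions) is:

```bash
# Create and activate an isolated environment for vLLM
conda create -n vllm python=3.10 -y
conda activate vllm
```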
Install vllm and modelscope with the following commands:
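A likely form of the install commands (no versions are pinned here; pin them if you need reproducibility):

```bash
# vLLM for serving, ModelScope for downloading the model weights
pip install vllm modelscope
```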
Download the DeepSeek-R1 model using modelscope with the following command:
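The original command is not shown; a sketch using the ModelScope CLI (the distilled 7B variant and the local directory are assumptions — pick the variant that fits your GPUs):

```bash
# Download DeepSeek-R1-Distill-Qwen-7B from ModelScope into a local directory
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --local_dir ./DeepSeek-R1-Distill-Qwen-7B
```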
Start serving the DeepSeek model with vLLM using the following command:
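The original command is not preserved; below is a sketch of an OpenAI-compatible server launch that matches the two-GPU setup and the flags discussed after this section (the model path, served model name, and port are assumptions):

```bash
# Serve the downloaded model across both GPUs via vLLM's OpenAI-compatible API
vllm serve ./DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name deepseek-r1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```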
If you encounter the warning "Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.", simply add the `--dtype=half` parameter as the warning suggests.
Remark:
- --tensor-parallel-size: tensor parallelism degree; set it to the number of GPUs (2 here)
- --gpu-memory-utilization: fraction of GPU memory vLLM is allowed to use
- --served-model-name: the model name exposed through the API
- --disable-log-requests: disables request logging
For the vLLM Linux GPU installation guide and the full list of engine arguments, see the vLLM documentation: https://docs.vllm.ai/
Check out the GPU status as shown below:
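The screenshot is not reproduced here; the usual way to inspect GPU load and memory while the server is running is:

```bash
# Show current GPU utilization and memory usage
nvidia-smi
# Refresh the view every second while requests are being served
watch -n 1 nvidia-smi
```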
Testing with Postman
Open http://ip:8000/ in a browser; the interactive API documentation is available at http://ip:8000/docs
Call the API from Postman, as shown in the following image:
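The Postman screenshot is not reproduced here; an equivalent request via curl against the OpenAI-compatible endpoint looks like this (the model name matches the --served-model-name used above, and `ip` is a placeholder):

```bash
# Chat completion request against the vLLM OpenAI-compatible API
curl http://ip:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 256
      }'
```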
Benchmarking
Download the test code with the following command:
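The original command is not shown; since the benchmark scripts ship with the vLLM repository, a plausible sketch is:

```bash
# Fetch the vLLM source, which contains the benchmark scripts
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
```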
The command is executed as follows:
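The exact invocation is not preserved; a sketch of a benchmark_throughput.py run consistent with the setup above (the prompt lengths and request count are assumptions):

```bash
# Measure offline throughput with synthetic prompts on both GPUs
python benchmark_throughput.py \
  --backend vllm \
  --model ./DeepSeek-R1-Distill-Qwen-7B \
  --input-len 128 \
  --output-len 512 \
  --num-prompts 100 \
  --tensor-parallel-size 2 \
  --dtype half
```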
Result: Throughput: 2.45 requests/s, 1569.60 total tokens/s, 1255.68 output tokens/s
(End)