

[AI] (9) Enterprise-level deployment of the DeepSeek-R1 model with vLLM

Posted on 2025-3-6 11:23:03
Requirements: I previously deployed the deepseek-r1:32b model with Ollama, which is convenient and fast and well suited to quick personal deployments. But how should deployment be done in an enterprise production environment? vLLM and SGLang are the frameworks generally used for that purpose; this article uses vLLM to deploy the DeepSeek-R1 model.

Ollama vs. vLLM

The differences are as follows:

Dimension | Ollama | vLLM
--------- | ------ | ----
Core positioning | Lightweight local tool for individual developers and small-scale experiments | Production-grade inference framework focused on enterprise scenarios with high concurrency and low latency
Hardware requirements | Supports CPU and GPU; low memory footprint (uses quantized models by default) | Requires NVIDIA GPUs; high GPU memory usage
Model support | Built-in pre-trained model library (1700+ models); automatically downloads quantized versions (mostly int4) | Model files must be downloaded manually (e.g. HuggingFace format); supports a wider range of models
Deployment difficulty | One-click installation, works out of the box, no programming background required | Requires a Python environment and CUDA drivers; some technical experience needed
Performance characteristics | Fast single-request inference but weak concurrency handling | High throughput; supports dynamic batching and thousands of concurrent requests
Resource management | Adjusts resource usage flexibly and releases GPU memory automatically when idle | GPU memory allocation is fixed; resources must be reserved for peak load


A brief introduction to vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving.

Equipped with new algorithms, vLLM redefined the state of the art for LLM serving: compared with HuggingFace Transformers, it delivers up to 24x higher throughput without any change to the model architecture, roughly halving the compute needed while multiplying throughput. The benchmark compared vLLM's throughput with the most popular LLM library, HuggingFace Transformers (HF), and the previous throughput SOTA, HuggingFace Text Generation Inference (TGI). Two experimental setups were used: LLaMA-7B on an NVIDIA A10G GPU, and LLaMA-13B on an NVIDIA A100 GPU (40GB), with input/output lengths sampled from the ShareGPT dataset. The results showed vLLM's throughput to be up to 24x higher than HF and up to 3.5x higher than TGI.

vLLM documentation: (link visible after login)
Source code: (link visible after login)
Performance testing: (link visible after login)



You don't need to understand every detail of the chart; the takeaway is that the performance is impressive.

Environmental preparation

I purchased a Tencent Cloud High-Performance Application Service instance with the following configuration:

Environment: Ubuntu 20.04, Driver 525.105.17, Python 3.8, CUDA 12.0, cuDNN 8
Compute type: dual-GPU basic tier - 2 x 16GB+ VRAM | 16+ TFLOPS SP | 16-core CPU | 64 GB RAM

Install Conda

Conda will be used to manage the Python environment; install it first with a script like the following:
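
A minimal sketch, assuming Miniconda on x86_64 Linux (the installer filename and install prefix are assumptions, not necessarily what the author used):

# Download and run the Miniconda installer (filename/arch are assumptions; adjust as needed)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
# Make conda available in the current shell
$HOME/miniconda3/bin/conda init bash
source ~/.bashrc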


Deploy DeepSeek-R1 using vLLM

Create a Python environment with conda using the following command:
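
A sketch of the environment step; the environment name and Python version are assumptions (the instance image ships Python 3.8, which recent vLLM releases no longer support, so a newer interpreter is created inside conda):

# Create and activate a dedicated environment (name "vllm" and Python 3.10 are assumptions)
conda create -n vllm python=3.10 -y
conda activate vllm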


Install vllm and modelscope with the following commands:
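
Presumably a plain pip install inside the activated environment; no extra index or version pin is assumed here:

# Install vLLM (this pulls in PyTorch and the CUDA wheels) and the ModelScope client
pip install vllm modelscope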


Download the DeepSeek-R1 model using modelscope with the following command:
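
A sketch using the ModelScope command-line client; the model ID and local directory are assumptions (a distilled variant is shown for illustration, since the full DeepSeek-R1 cannot fit on two 16 GB GPUs):

# Download the weights from ModelScope into a local directory (ID and path are assumptions)
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --local_dir /data/models/DeepSeek-R1-Distill-Qwen-7B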


Reference: (link visible after login)

Start the DeepSeek model with vLLM using the following command:
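
A sketch of the serve command under the assumptions above (model path, served name, context length, and port are assumptions); --dtype=half is included up front because the T4-class GPUs in this instance do not support bfloat16:

# Serve the model across both GPUs with an OpenAI-compatible API on port 8000
vllm serve /data/models/DeepSeek-R1-Distill-Qwen-7B \
    --tensor-parallel-size 2 \
    --dtype=half \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --served-model-name DeepSeek-R1 \
    --disable-log-requests \
    --port 8000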




If you encounter the warning "Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.", just add the parameter the warning suggests (i.e. --dtype=half).

Remarks:

  • --tensor-parallel-size sets the tensor-parallel degree and should match the number of GPUs used
  • --gpu-memory-utilization controls the fraction of GPU memory vLLM pre-allocates
  • --served-model-name sets the model name exposed through the API
  • --disable-log-requests disables request logging


vLLM Linux GPU installation documentation: (link visible after login)
Engine arguments: (link visible after login)

Check out the GPU status as shown below:
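
The status screenshot is presumably the output of nvidia-smi; it can be checked on the instance like this:

# Show per-GPU utilization and memory usage, refreshing every 2 seconds
watch -n 2 nvidia-smi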



Testing with Postman

Open in a browser: http://ip:8000/
API documentation: http://ip:8000/docs
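
Besides Postman, the OpenAI-compatible endpoint can be exercised with curl; the model name below matches the --served-model-name assumed earlier, and ip should be replaced with the server address:

# Call the OpenAI-compatible chat completions endpoint
curl http://ip:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 256
      }'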



Postman call, as shown in the following image:




Benchmarking

Download the test code with the following command:
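
Presumably the benchmark scripts that ship in the vLLM repository; cloning the source is one way to obtain them:

# Fetch the vLLM source; the benchmark scripts live under benchmarks/
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks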


The benchmark command is as follows:
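
A sketch of an offline throughput run with benchmark_throughput.py, whose output format matches the result quoted below; the model path, lengths, and prompt count are assumptions, and flag names vary between vLLM versions:

# Offline throughput benchmark with synthetic prompts (all values are assumptions)
python benchmark_throughput.py \
    --model /data/models/DeepSeek-R1-Distill-Qwen-7B \
    --tensor-parallel-size 2 \
    --dtype half \
    --input-len 512 \
    --output-len 256 \
    --num-prompts 200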


Result: Throughput: 2.45 requests/s, 1569.60 total tokens/s, 1255.68 output tokens/s



(End)




OP | Posted on 2025-3-12 15:14:42
Running vLLM or SGLang natively on Windows is not currently supported; if you want to run them on Windows, use WSL (Windows Subsystem for Linux) instead.
OP | Posted on 2025-8-18 11:46:22
Other inference frameworks: TensorRT, vLLM, LMDeploy, MLC-LLM, SGLang