Requirements: I previously deployed the deepseek-r1:32b model with Ollama, which is convenient, fast, and well suited to quick personal deployment. But how should it be deployed in an enterprise production environment? vLLM and SGLang are the usual choices, and this article uses vLLM to deploy the DeepSeek-R1 model.
Ollama vs. vLLM
The differences are as follows:
| Dimension | Ollama | vLLM |
| --- | --- | --- |
| Core positioning | Lightweight local tool for individual developers and small-scale experiments | Production-grade inference framework focused on enterprise scenarios with high concurrency and low latency |
| Hardware requirements | Supports CPU and GPU; low memory footprint (uses quantized models by default) | Requires NVIDIA GPUs; high GPU memory usage |
| Model support | Built-in pre-trained model library (1700+ models); automatically downloads quantized versions (mainly int4) | Original model files (e.g. HuggingFace format) must be downloaded manually; supports a wider range of models |
| Deployment difficulty | One-click installation, works out of the box, no programming background required | Requires a Python environment, CUDA drivers, and some technical experience |
| Performance characteristics | Fast single-request inference, but weak concurrency handling | High throughput; supports dynamic batching and thousands of concurrent requests |
| Resource management | Adjusts resource usage flexibly and automatically releases GPU memory when idle | GPU memory usage is fixed; resources must be reserved for peak load |
A brief introduction to vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
With its new algorithms, vLLM redefines the state of the art for LLM serving: compared to HuggingFace Transformers, it delivers up to 24x higher throughput without any changes to the model architecture, and has been promoted as roughly halving compute requirements while multiplying throughput. In the original evaluation, the vLLM team compared its throughput against the most popular LLM library, HuggingFace Transformers (HF), and the previous state-of-the-art serving system, HuggingFace Text Generation Inference (TGI). The experiments used two setups: LLaMA-7B on an NVIDIA A10G GPU, and LLaMA-13B on an NVIDIA A100 GPU (40GB), with input/output lengths sampled from the ShareGPT dataset. The results showed vLLM's throughput to be up to 24x higher than HF and 3.5x higher than TGI.
vLLM documentation: https://docs.vllm.ai/ Source code: https://github.com/vllm-project/vllm Performance benchmarks: see the vLLM project blog.
You don't need to understand every detail of the benchmark chart; the takeaway is simply that vLLM is very fast.
Environment preparation
I purchased a Tencent Cloud high-performance application service instance with the following configuration:
- OS: Ubuntu 20.04
- Environment: driver 525.105.17, Python 3.8, CUDA 12.0, cuDNN 8
- Compute type: dual-GPU basic, 2 × 16 GB+ GPU memory, 16+ TFLOPS SP, 16 CPU cores, 64 GB RAM
Install Conda
Create a python environment with conda, paste the script directly:
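The original script is not reproduced here; a minimal sketch using the official Miniconda installer (the installer filename and install path are typical defaults, not taken from the original) would look like this:

```bash
# Download the latest Miniconda installer for Linux x86_64
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Run it in batch mode and install into ~/miniconda3
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda3"
# Initialize conda for bash and reload the shell configuration
"$HOME/miniconda3/bin/conda" init bash
source ~/.bashrc
```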
Deploy DeepSeek-R1 using vLLM
Create a Python environment with conda using the following command:
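The exact command is not preserved; a typical sketch (the environment name `vllm` and Python 3.10 are my assumptions) is:

```bash
# Create and activate an isolated environment for vLLM
conda create -n vllm python=3.10 -y
conda activate vllm
```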
Install vllm and modelscope with the following commands:
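A likely form of the install commands (no versions are pinned here; pin them if you need reproducibility):

```bash
# vLLM for serving, ModelScope for downloading the model weights
pip install vllm modelscope
```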
Download the DeepSeek-R1 model using modelscope with the following command:
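The original command is not shown; a sketch using the ModelScope CLI (the distilled 7B variant and the local directory are assumptions — pick the variant that fits your GPUs):

```bash
# Download DeepSeek-R1-Distill-Qwen-7B from ModelScope into a local directory
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --local_dir ./DeepSeek-R1-Distill-Qwen-7B
```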
Start serving the DeepSeek model with vLLM using the following command:
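The original command is not preserved; below is a sketch of an OpenAI-compatible server launch that matches the two-GPU setup and the flags discussed after this section (the model path, served model name, and port are assumptions):

```bash
# Serve the downloaded model across both GPUs via vLLM's OpenAI-compatible API
vllm serve ./DeepSeek-R1-Distill-Qwen-7B \
  --served-model-name deepseek-r1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```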
If you encounter the warning "Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.", simply add the `--dtype=half` parameter as the warning suggests.
Remark:
- --tensor-parallel-size: tensor parallelism degree; set it to the number of GPUs (2 here)
- --gpu-memory-utilization: fraction of GPU memory vLLM is allowed to use
- --served-model-name: the model name exposed through the API
- --disable-log-requests: disables request logging
For the vLLM Linux GPU installation guide and the full list of engine arguments, see the vLLM documentation: https://docs.vllm.ai/
Check out the GPU status as shown below:
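The screenshot is not reproduced here; the usual way to inspect GPU load and memory while the server is running is:

```bash
# Show current GPU utilization and memory usage
nvidia-smi
# Refresh the view every second while requests are being served
watch -n 1 nvidia-smi
```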
Testing with Postman
Open http://ip:8000/ in a browser; the interactive API documentation is available at http://ip:8000/docs
Call the API from Postman, as shown in the following image:
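The Postman screenshot is not reproduced here; an equivalent request via curl against the OpenAI-compatible endpoint looks like this (the model name matches the --served-model-name used above, and `ip` is a placeholder):

```bash
# Chat completion request against the vLLM OpenAI-compatible API
curl http://ip:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 256
      }'
```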
Benchmarking
Download the test code with the following command:
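The original command is not shown; since the benchmark scripts ship with the vLLM repository, a plausible sketch is:

```bash
# Fetch the vLLM source, which contains the benchmark scripts
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
```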
The command is executed as follows:
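The exact invocation is not preserved; a sketch of a benchmark_throughput.py run consistent with the setup above (the prompt lengths and request count are assumptions):

```bash
# Measure offline throughput with synthetic prompts on both GPUs
python benchmark_throughput.py \
  --backend vllm \
  --model ./DeepSeek-R1-Distill-Qwen-7B \
  --input-len 128 \
  --output-len 512 \
  --num-prompts 100 \
  --tensor-parallel-size 2 \
  --dtype half
```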
Result: Throughput: 2.45 requests/s, 1569.60 total tokens/s, 1255.68 output tokens/s
(End)