

[AI] (7) Use llama.cpp to deploy the DeepSeek-R1 model on-premises

Posted on 2025-2-7 13:58:06
llama.cpp Introduction

llama.cpp runs inference of Meta's LLaMA model (and others) in pure C/C++. Its primary goal is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, locally and in the cloud.

  • Plain C/C++ implementation with no external dependencies
  • Apple silicon is a first-class citizen, optimized via the ARM NEON, Accelerate, and Metal frameworks
  • AVX, AVX2, AVX-512, and AMX support for x86 architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (AMD GPUs supported via HIP, Moore Threads MTT GPUs via MUSA)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity (see the example after this list)
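The last point can be illustrated with the layer-offload option: -ngl (--n-gpu-layers) sets how many model layers are placed on the GPU, while the remaining layers run on the CPU. A minimal sketch, assuming a CUDA build of llama.cpp and a placeholder model file name:

llama-cli -m model-Q4_K_M.gguf -ngl 20 -p "Hello"

Raising or lowering the -ngl value trades GPU memory usage against speed.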


Github address: https://github.com/ggerganov/llama.cpp
Download address: https://github.com/ggerganov/llama.cpp/releases

Download llama.cpp

First, download the llama.cpp build that matches your computer's hardware configuration, as shown in the figure below:



AVX supports 256-bit wide operations.
AVX2 also operates on 256-bit registers, but adds integer operations and a number of additional instructions.
AVX-512 supports 512-bit wide operations, offering greater parallelism and higher performance, especially when processing large amounts of data or floating-point workloads.
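If you are unsure which instruction sets your CPU supports, check before picking a build. A quick check, assuming a Linux shell (on Windows, tools such as CPU-Z report the same flags):

grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u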

My computer runs on the CPU only and supports the AVX-512 instruction set, so I downloaded the llama-b4658-bin-win-avx512-x64.zip build from the releases page above and unzipped it to the D:\llama-b4658-bin-win-avx512-x64 directory.

Download the DeepSeek-R1 model

Download the GGUF model file from one of the model hubs listed in the reply below (Hugging Face, its domestic mirror, or ModelScope). This article uses DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf as an example.

Just download the quantization that matches your hardware. The more bits a quantization keeps, the larger the file and the more accurate the model.

Deploy the DeepSeek-R1 model with llama.cpp

Run the following command in the directory that contains the DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf file:
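A minimal sketch of the launch command, assuming the llama-server.exe binary from the unzipped release directory and the default port 8080 (adjust the flags as needed):

D:\llama-b4658-bin-win-avx512-x64\llama-server.exe -m DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf --port 8080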

As shown below:



Open http://127.0.0.1:8080/ in a browser to test it, as shown below:



For reference, the full list of runtime parameters is documented in the llama.cpp repository and can be printed with llama-server --help.
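In addition to the web page, llama-server also exposes an HTTP API that can be tested from the command line. A minimal sketch, assuming the OpenAI-compatible chat completions endpoint provided by recent llama-server builds and a POSIX-style shell (adjust the quoting for cmd.exe or PowerShell):

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello, who are you?"}]}'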




Original poster | Posted on 2025-3-5 10:48:53
AI model community

Hugging Face Official Website:https://huggingface.co/
Hugging Face Domestic Mirror:https://hf-mirror.com/
ModelScope (魔搭):https://www.modelscope.cn/
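If you prefer the command line, the Hugging Face mirror above can be used through the standard huggingface_hub client by pointing HF_ENDPOINT at it. A sketch, assuming Windows cmd; the repository name below is a placeholder:

pip install -U huggingface_hub
set HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download some-org/DeepSeek-R1-Distill-Qwen-1.5B-GGUF DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf --local-dir .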