

[AI] (7) Use llama.cpp to deploy the DeepSeek-R1 model on-premises

Posted on 2025-2-7 13:58:06
llama.cpp Introduction

llama.cpp runs inference of Meta's LLaMA model (and others) in pure C/C++. Its primary goal is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, locally and in the cloud.

  • Plain C/C++ implementation with no external dependencies
  • Apple silicon is a first-class citizen, optimized via the ARM NEON, Accelerate, and Metal frameworks
  • AVX, AVX2, AVX-512, and AMX support for x86 architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory usage
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (AMD GPUs supported via HIP, Moore Threads MTT GPUs via MUSA)
  • Vulkan and SYCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity (see the example after this list)
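The last point can be illustrated with the layer-offload option: -ngl (--n-gpu-layers) sets how many model layers are placed on the GPU, while the remaining layers run on the CPU. A minimal sketch, assuming a CUDA build of llama.cpp and a placeholder model file name:

llama-cli -m model-Q4_K_M.gguf -ngl 20 -p "Hello"

Raising or lowering the -ngl value trades GPU memory usage against speed.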


Github address: https://github.com/ggerganov/llama.cpp
Download address: https://github.com/ggerganov/llama.cpp/releases

Download llama.cpp

First, download the llama.cpp build that matches your computer's hardware configuration, as shown in the figure below:



AVX supports 256-bit wide operations.
AVX2 also operates on 256-bit registers, but adds integer operations and a number of additional instructions.
AVX-512 supports 512-bit wide operations, offering greater parallelism and higher performance, especially when processing large amounts of data or floating-point workloads.
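If you are unsure which instruction sets your CPU supports, check before picking a build. A quick check, assuming a Linux shell (on Windows, tools such as CPU-Z report the same flags):

grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u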

My computer runs on the CPU only and supports the AVX-512 instruction set, so I downloaded the llama-b4658-bin-win-avx512-x64.zip build from the releases page above and unzipped it to the D:\llama-b4658-bin-win-avx512-x64 directory.

Download the DeepSeek-R1 model

Download the GGUF model file from one of the model hubs listed in the reply below (Hugging Face, its domestic mirror, or ModelScope). This article uses DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf as an example.

Just download the quantization that matches your hardware. The more bits a quantization keeps, the larger the file and the more accurate the model.

Deploy the DeepSeek-R1 model with llama.cpp

Run the following command in the directory that contains the DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf file:
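A minimal sketch of the launch command, assuming the llama-server.exe binary from the unzipped release directory and the default port 8080 (adjust the flags as needed):

D:\llama-b4658-bin-win-avx512-x64\llama-server.exe -m DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf --port 8080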

As shown below:



Open http://127.0.0.1:8080/ in a browser to test it, as shown below:



For reference, the full list of runtime parameters is documented in the llama.cpp repository and can be printed with llama-server --help.
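In addition to the web page, llama-server also exposes an HTTP API that can be tested from the command line. A minimal sketch, assuming the OpenAI-compatible chat completions endpoint provided by recent llama-server builds and a POSIX-style shell (adjust the quoting for cmd.exe or PowerShell):

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello, who are you?"}]}'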




Original poster | Posted on 2025-3-5 10:48:53
AI model community

Hugging Face Official Website:https://huggingface.co/
Hugging Face Domestic Mirror:https://hf-mirror.com/
ModelScope (魔搭):https://www.modelscope.cn/
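If you prefer the command line, the Hugging Face mirror above can be used through the standard huggingface_hub client by pointing HF_ENDPOINT at it. A sketch, assuming Windows cmd; the repository name below is a placeholder:

pip install -U huggingface_hub
set HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download some-org/DeepSeek-R1-Distill-Qwen-1.5B-GGUF DeepSeek-R1-Distill-Qwen-1.5B-Q3_K_L.gguf --local-dir .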