

[AI] (3) Tutorial: Deploying DeepSeek-R1 on Tencent Cloud HAI

Posted on 2025-2-5 21:14:04
Hyper Application Inventor (HAI) is a GPU application service product for AI and scientific computing, providing plug-and-play computing power and common environments to help small and medium-sized enterprises and developers quickly deploy LLMs.

Address: (link visible after login)

HAI vs GPU servers

Compared with plain GPU cloud servers, HAI greatly lowers the barrier to entry, optimizes the product experience from multiple angles, and works out of the box, as shown in the figure below:



Purchase HAI computing power

Go to the purchase page and select the "Ubuntu 20.04" base-environment image (Ubuntu 20.04, Driver 525.105.17, Python 3.8, CUDA 12.0, cuDNN 8). The image already has the GPU driver installed. Choose pay-as-you-go billing, as shown in the figure below:



GPU memory: 32 GB+
Compute power: 15+ TFLOPS (single precision)
CPU: 8-10 cores
RAM: 40 GB

After waiting a few minutes, the instance is created successfully; Academic Acceleration is then turned on, as shown in the following figure:



The first time you use the instance, you need to reset the password; the login username is ubuntu. Log in to the server and check the NVIDIA GPU driver information with the following command:
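A minimal check, assuming the stock NVIDIA tooling that ships with the image (the same tool whose output appears later in this thread):

$ nvidia-smi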


As shown below:


Install Ollama

Ollama Official Website: (link visible after login)

Log in to the server with PuTTY and install Ollama with the following command:
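Presumably the official one-line installer (its output matches the log below):

$ curl -fsSL https://ollama.com/install.sh | sh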


The installation is complete, and the output is as follows:
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.

Check the version: ollama -v
List the models currently loaded into memory: ollama ps

Create a custom model storage folder with the following command:
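A minimal sketch; the path /data/ollama/models is an assumption, and ownership is handed to the ollama user created by the installer:

$ sudo mkdir -p /data/ollama/models
$ sudo chown -R ollama:ollama /data/ollama/models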

Modify the default listening address and model storage path with the following commands (the default port cannot be changed, otherwise the command will fail):
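A sketch of the usual systemd drop-in approach, assuming the storage folder created above (the values are illustrative):

$ sudo systemctl edit ollama
# add to the drop-in (override.conf):
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama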


Deploy the deepseek-r1 model

Run the deepseek-r1:8b model with the following command:
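Presumably the standard Ollama invocation for this model:

$ ollama run deepseek-r1:8b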


As shown below:



Test the dialog as shown below:



Open TCP port 11434 in the firewall and call the HTTP API, as shown in the following figure:
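For example, from another machine (replace <server-ip> with the instance's public IP); /api/ps lists the models currently loaded and produces the JSON below:

$ curl http://<server-ip>:11434/api/ps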



{
  "models": [
    {
      "name": "deepseek-r1:8b",
      "model": "deepseek-r1:8b",
      "size": 6930032640,
      "digest": "28f8fd6cdc677661426adab9338ce3c013d7e69a5bea9e704b364171a5d61a10",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": [
          "llama"
        ],
        "parameter_size": "8.0B",
        "quantization_level": "Q4_K_M"
      },
      "expires_at": "2025-02-05T21:14:50.715753614+08:00",
      "size_vram": 6930032640
    }
  ]
}





Previous:[AI] (2) The difference between DeepSeek-V3 vs R1 versions
Next:[AI] (4) Use Open WebUI to call the DeepSeek-R1 model
OP | Posted on 2025-2-5 21:22:49
If the model receives no requests or input for a period of time, Ollama automatically unloads it to save resources.
OP | Posted on 2025-2-6 09:03:57
Ollama environment variable configuration items

Variable | Default | Description / Effect / Scenario
OLLAMA_HOST | http://127.0.0.1:11434 | Configures the host and scheme for the Ollama server. Effect: Determines the URL used for connecting to the Ollama server. Scenario: Useful when deploying Ollama in a distributed environment or when you need to expose the service on a specific network interface.
OLLAMA_ORIGINS | [localhost, 127.0.0.1, 0.0.0.0] + app://, file://, tauri:// | Configures allowed origins for CORS. Effect: Controls which origins are allowed to make requests to the Ollama server. Scenario: Critical when integrating Ollama with web applications to prevent unauthorized access from different domains.
OLLAMA_MODELS | $HOME/.ollama/models | Sets the path to the models directory. Effect: Determines where model files are stored and loaded from. Scenario: Useful for managing disk space on different drives or setting up shared model repositories in multi-user environments.
OLLAMA_KEEP_ALIVE | 5 minutes | Sets how long models stay loaded in memory. Effect: Controls the duration models remain in memory after use. Scenario: Longer durations improve response times for frequent queries but increase memory usage. Shorter durations free up resources but may increase initial response times.
OLLAMA_DEBUG | false | Enables additional debug information. Effect: Increases verbosity of logging and debugging output. Scenario: Invaluable for troubleshooting issues or understanding the system's behavior during development or deployment.
OLLAMA_FLASH_ATTENTION | false | Enables the experimental flash attention feature. Effect: Activates an experimental optimization for attention mechanisms. Scenario: Can potentially improve performance on compatible hardware but may introduce instability.
OLLAMA_NOHISTORY | false | Disables readline history. Effect: Prevents command history from being saved. Scenario: Useful in security-sensitive environments where command history should not be persisted.
OLLAMA_NOPRUNE | false | Disables pruning of model blobs on startup. Effect: Keeps all model blobs, potentially increasing disk usage. Scenario: Helpful when you need to maintain all model versions for compatibility or rollback purposes.
OLLAMA_SCHED_SPREAD | false | Allows scheduling models across all GPUs. Effect: Enables multi-GPU usage for model inference. Scenario: Beneficial in high-performance computing environments with multiple GPUs to maximize hardware utilization.
OLLAMA_INTEL_GPU | false | Enables experimental Intel GPU detection. Effect: Allows usage of Intel GPUs for model inference. Scenario: Useful for organizations leveraging Intel GPU hardware for AI workloads.
OLLAMA_LLM_LIBRARY | "" (auto-detect) | Sets the LLM library to use. Effect: Overrides automatic detection of the LLM library. Scenario: Useful when you need to force a specific library version or implementation for compatibility or performance reasons.
OLLAMA_TMPDIR | System default temp directory | Sets the location for temporary files. Effect: Determines where temporary files are stored. Scenario: Important for managing I/O performance or when the system temp directory has limited space.
CUDA_VISIBLE_DEVICES | All available | Sets which NVIDIA devices are visible. Effect: Controls which NVIDIA GPUs can be used. Scenario: Critical for managing GPU allocation in multi-user or multi-process environments.
HIP_VISIBLE_DEVICES | All available | Sets which AMD devices are visible. Effect: Controls which AMD GPUs can be used. Scenario: Similar to CUDA_VISIBLE_DEVICES but for AMD hardware.
OLLAMA_RUNNERS_DIR | System-dependent | Sets the location for runners. Effect: Determines where runner executables are located. Scenario: Important for custom deployments or when runners need to be isolated from the main application.
OLLAMA_NUM_PARALLEL | 0 (unlimited) | Sets the number of parallel model requests. Effect: Controls concurrency of model inference. Scenario: Critical for managing system load and ensuring responsiveness in high-traffic environments.
OLLAMA_MAX_LOADED_MODELS | 0 (unlimited) | Sets the maximum number of loaded models. Effect: Limits the number of models that can be simultaneously loaded. Scenario: Helps manage memory usage in environments with limited resources or many different models.
OLLAMA_MAX_QUEUE | 512 | Sets the maximum number of queued requests. Effect: Limits the size of the request queue. Scenario: Prevents system overload during traffic spikes and ensures timely processing of requests.
OLLAMA_MAX_VRAM | 0 (unlimited) | Sets a maximum VRAM override in bytes. Effect: Limits the amount of VRAM that can be used. Scenario: Useful in shared GPU environments to prevent a single process from monopolizing GPU memory.
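For example, a couple of these variables can be set for a single foreground run (the values are illustrative):

$ OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=30m ollama serve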


Source: (link visible after login)

$ ollama help serve
Start ollama

Usage:
  ollama serve [flags]

Aliases:
  serve, start

Flags:
  -h, --help   help for serve

Environment Variables:
      OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
      OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
      OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
      OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models per GPU
      OLLAMA_MAX_QUEUE           Maximum number of queued requests
      OLLAMA_MODELS              The path to the models directory
      OLLAMA_NUM_PARALLEL        Maximum number of parallel requests
      OLLAMA_NOPRUNE             Do not prune model blobs on startup
      OLLAMA_ORIGINS             A comma separated list of allowed origins
      OLLAMA_SCHED_SPREAD        Always schedule model across all GPUs
      OLLAMA_TMPDIR              Location for temporary files
      OLLAMA_FLASH_ATTENTION     Enabled flash attention
      OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
      OLLAMA_GPU_OVERHEAD        Reserve a portion of VRAM per GPU (bytes)
      OLLAMA_LOAD_TIMEOUT        How long to allow model loads to stall before giving up (default "5m")


Reference: (link visible after login)
OP | Posted on 2025-2-6 09:19:49
Ollama commands

ollama list: list models
ollama show: show information about a model
ollama pull: pull a model
ollama push: push a model
ollama cp: copy a model
ollama rm: delete a model
ollama run: run a model
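For example, using the model deployed above:

$ ollama pull deepseek-r1:8b
$ ollama show deepseek-r1:8b
$ ollama list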
OP | Posted on 2025-2-6 09:33:17
Model optimization: edit the /etc/systemd/system/ollama.service.d/override.conf configuration file and add the following:
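A sketch of typical tuning entries for the drop-in; the specific variables and values here are assumptions drawn from the environment-variable table earlier in this thread, not the original snippet:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_FLASH_ATTENTION=1"

Afterwards, reload and restart the service: sudo systemctl daemon-reload && sudo systemctl restart ollama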


The complete configuration is as follows:
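As a sketch only, the merged configuration is roughly the stock unit file created by the installer plus the drop-in above (the contents below are an approximation, not the original listing):

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target

# plus the drop-in /etc/systemd/system/ollama.service.d/override.conf shown above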



OP | Posted on 2025-2-6 09:35:26
The --verbose parameter displays the inference speed (tokens/second). For example:
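Presumably invoked as (model name per the deployment above):

$ ollama run deepseek-r1:8b --verbose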

>>> Hello, who are you?
<think>

</think>

Hello! I am an AI assistant independently developed by DeepSeek in China, and I am happy to serve you!

total duration:       489.831897ms
load duration:        24.116423ms
prompt eval count:    10 token(s)
prompt eval duration: 4ms
prompt eval rate:     2500.00 tokens/s
eval count:           36 token(s)
eval duration:        460ms
eval rate:            78.26 tokens/s
>>> I'm a little scumbag, what's my name?
<think>
Hmm, is "Little Scum" my name? Let me think about this carefully. I remember that when I was a child my parents called me "Little Scumbag" because the family thought I had big ears, but that isn't actually the real reason. Maybe it's because as a child I always put my fingers in my mouth like garbage, so I was called "Little Scum".

However, the name sounds a bit strange now and doesn't really fit modern naming conventions. I should consider renaming myself with something more suitable. First, I need to understand my own personality traits and interests so that I can choose a name that is both meaningful and unique.

I like helping others and I am good at listening and communicating, so perhaps I should pick a name that conveys "wisdom", "warmth" or "friendliness". At the same time, I also want to reflect my personality; for example, if I like books, music or sports, those elements could be incorporated into the name.

Furthermore, I need to make sure the new name is not easily misremembered or misunderstood. That means it must be unique and recognizable, neither too complex nor too ordinary. I could look at the names of some well-known figures to see which carry both cultural depth and a modern feel.

In addition, I need to consider whether the pronunciation flows well, whether it has a positive meaning, and how it comes across in different contexts. In other words, a name should not only sound good but also carry meaning and value.

Finally, I should talk it over with family and friends to make sure they are happy with the new name and that it can become part of my life. After all, a name is not only an identifier but also a form of self-expression.

In short, "Little Scum" is not my real name, and I have the right to choose a name that suits me better. It takes time and thought, but by carefully analyzing my personality, my interests, and the meaning of the name, I believe I will find the one that fits me best.
</think>

Based on the above reasoning, I have decided to name myself "Wisdom Star". The name reflects both my ingenuity and my warmth and hope, fitting my personality traits perfectly. At the same time, the "star" symbolizes light and infinite possibility, inspiring me to keep pursuing progress and growth.

**Finalized name: Wisdom Star**

This name not only matches my personality but is also full of positive energy, which makes me feel proud and fulfilled.

total duration:       7.600299527s
load duration:        24.37534ms
prompt eval count:    59 token(s)
prompt eval duration: 10ms
prompt eval rate:     5900.00 tokens/s
eval count:           557 token(s)
eval duration:        6.618s
eval rate:            84.16 tokens/s

OP | Posted on 2025-2-6 10:22:02
Deploy Ollama models with AMD GPUs
ollama-for-amd: (link visible after login)

Reference: (link visible after login)
OP | Posted on 2025-2-6 13:26:17
Run the deepseek-r1:32b model
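Presumably the same invocation as before, with the larger tag:

$ ollama run deepseek-r1:32b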




root@VM-0-8-ubuntu:~# nvidia-smi
Thu Feb  6 13:25:04 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                  Off |
| N/A   65C    P0   205W / 300W |  21822MiB / 32768MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     91457      C   ... 1_avx/ollama_llama_server    21820MiB |
+-----------------------------------------------------------------------------+
root@VM-0-8-ubuntu:~# ollama show deepseek-r1:32b
  Model
    architecture        qwen2
    parameters          32.8B
    context length      131072
    embedding length    5120
    quantization        Q4_K_M

  Parameters
    stop    "<|begin▁of▁sentence|>"
    stop    "<|end▁of▁sentence|>"
    stop    "<|User|>"
    stop    "<|Assistant|>"

  License
    MIT License
    Copyright (c) 2023 DeepSeek

root@VM-0-8-ubuntu:~# ollama ps
NAME               ID              SIZE     PROCESSOR    UNTIL
deepseek-r1:32b    38056bbcbb2d    23 GB    100% GPU     Forever


OP | Posted on 2025-2-8 08:34:18
How to solve the Ollama model pull problem
https://www.itsvse.com/thread-10939-1-1.html
OP | Posted on 2025-2-13 09:25:04
Experience the DeepSeek-R1 32b model on the Jetson AGX Orin (32G): (link visible after login)
Running large language models on Jetson: https://www.jetson-ai-lab.com/models.html
