

[AI] (3) Tutorial: Deploying DeepSeek-R1 on Tencent Cloud HAI

Posted on 2025-2-5 21:14:04
Hyper Application Inventor (HAI) is a GPU application service product for AI and scientific computing, providing plug-and-play computing power and common environments to help small and medium-sized enterprises and developers quickly deploy LLMs.

Address: (link visible after login)

HAI vs GPU servers

Compared with plain GPU cloud servers, HAI greatly lowers the barrier to entry, optimizes the product experience from multiple angles, and works out of the box, as shown in the figure below:



Purchase HAI computing power

Go to the purchase page and select the "Ubuntu 20.04" base-environment image (Ubuntu 20.04, Driver 525.105.17, Python 3.8, CUDA 12.0, cuDNN 8). The image already has the GPU driver installed. Choose pay-as-you-go billing, as shown in the figure below:



GPU memory: 32 GB+
Compute power: 15+ TFLOPS (single precision)
CPU: 8-10 cores
RAM: 40 GB

After waiting a few minutes, the instance is created successfully; Academic Acceleration is then turned on, as shown in the following figure:



The first time you use the instance, you need to reset the password; the login username is ubuntu. Log in to the server and check the NVIDIA GPU driver information with the following command:
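A minimal check, assuming the stock NVIDIA tooling that ships with the image (the same tool whose output appears later in this thread):

$ nvidia-smi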


As shown below:


Install Ollama

Ollama Official Website: (link visible after login)

Log in to the server with PuTTY and install Ollama with the following command:
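Presumably the official one-line installer (its output matches the log below):

$ curl -fsSL https://ollama.com/install.sh | sh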


The installation is complete, and the output is as follows:
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.

Check the version: ollama -v
List the models currently loaded into memory: ollama ps

Create a custom model storage folder with the following command:
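A minimal sketch; the path /data/ollama/models is an assumption, and ownership is handed to the ollama user created by the installer:

$ sudo mkdir -p /data/ollama/models
$ sudo chown -R ollama:ollama /data/ollama/models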

Modify the default listening address and model storage path with the following commands (the default port cannot be changed, otherwise the command will fail):
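A sketch of the usual systemd drop-in approach, assuming the storage folder created above (the values are illustrative):

$ sudo systemctl edit ollama
# add to the drop-in (override.conf):
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama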


Deploy the deepseek-r1 model

Run the deepseek-r1:8b model with the following command:
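Presumably the standard Ollama invocation for this model:

$ ollama run deepseek-r1:8b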


As shown below:



Test the dialog as shown below:



Open TCP port 11434 in the firewall and call the HTTP API, as shown in the following figure:
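For example, from another machine (replace <server-ip> with the instance's public IP); /api/ps lists the models currently loaded and produces the JSON below:

$ curl http://<server-ip>:11434/api/ps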



{
  "models": [
    {
      "name": "deepseek-r1:8b",
      "model": "deepseek-r1:8b",
      "size": 6930032640,
      "digest": "28f8fd6cdc677661426adab9338ce3c013d7e69a5bea9e704b364171a5d61a10",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "llama",
        "families": [
          "llama"
        ],
        "parameter_size": "8.0B",
        "quantization_level": "Q4_K_M"
      },
      "expires_at": "2025-02-05T21:14:50.715753614+08:00",
      "size_vram": 6930032640
    }
  ]
}





Previous:[AI] (2) The difference between DeepSeek-V3 vs R1 versions
Next:[AI] (4) Use Open WebUI to call the DeepSeek-R1 model
OP | Posted on 2025-2-5 21:22:49
If the model receives no requests or input for a period of time, Ollama automatically unloads it to save resources.
OP | Posted on 2025-2-6 09:03:57
Ollama environment variable configuration items

Variable | Default | Description / Effect / Scenario
OLLAMA_HOST | http://127.0.0.1:11434 | Configures the host and scheme for the Ollama server. Effect: Determines the URL used for connecting to the Ollama server. Scenario: Useful when deploying Ollama in a distributed environment or when you need to expose the service on a specific network interface.
OLLAMA_ORIGINS | [localhost, 127.0.0.1, 0.0.0.0] + app://, file://, tauri:// | Configures allowed origins for CORS. Effect: Controls which origins are allowed to make requests to the Ollama server. Scenario: Critical when integrating Ollama with web applications to prevent unauthorized access from different domains.
OLLAMA_MODELS | $HOME/.ollama/models | Sets the path to the models directory. Effect: Determines where model files are stored and loaded from. Scenario: Useful for managing disk space on different drives or setting up shared model repositories in multi-user environments.
OLLAMA_KEEP_ALIVE | 5 minutes | Sets how long models stay loaded in memory. Effect: Controls the duration models remain in memory after use. Scenario: Longer durations improve response times for frequent queries but increase memory usage. Shorter durations free up resources but may increase initial response times.
OLLAMA_DEBUG | false | Enables additional debug information. Effect: Increases verbosity of logging and debugging output. Scenario: Invaluable for troubleshooting issues or understanding the system's behavior during development or deployment.
OLLAMA_FLASH_ATTENTION | false | Enables the experimental flash attention feature. Effect: Activates an experimental optimization for attention mechanisms. Scenario: Can potentially improve performance on compatible hardware but may introduce instability.
OLLAMA_NOHISTORY | false | Disables readline history. Effect: Prevents command history from being saved. Scenario: Useful in security-sensitive environments where command history should not be persisted.
OLLAMA_NOPRUNE | false | Disables pruning of model blobs on startup. Effect: Keeps all model blobs, potentially increasing disk usage. Scenario: Helpful when you need to maintain all model versions for compatibility or rollback purposes.
OLLAMA_SCHED_SPREAD | false | Allows scheduling models across all GPUs. Effect: Enables multi-GPU usage for model inference. Scenario: Beneficial in high-performance computing environments with multiple GPUs to maximize hardware utilization.
OLLAMA_INTEL_GPU | false | Enables experimental Intel GPU detection. Effect: Allows usage of Intel GPUs for model inference. Scenario: Useful for organizations leveraging Intel GPU hardware for AI workloads.
OLLAMA_LLM_LIBRARY | "" (auto-detect) | Sets the LLM library to use. Effect: Overrides automatic detection of the LLM library. Scenario: Useful when you need to force a specific library version or implementation for compatibility or performance reasons.
OLLAMA_TMPDIR | System default temp directory | Sets the location for temporary files. Effect: Determines where temporary files are stored. Scenario: Important for managing I/O performance or when the system temp directory has limited space.
CUDA_VISIBLE_DEVICES | All available | Sets which NVIDIA devices are visible. Effect: Controls which NVIDIA GPUs can be used. Scenario: Critical for managing GPU allocation in multi-user or multi-process environments.
HIP_VISIBLE_DEVICES | All available | Sets which AMD devices are visible. Effect: Controls which AMD GPUs can be used. Scenario: Similar to CUDA_VISIBLE_DEVICES but for AMD hardware.
OLLAMA_RUNNERS_DIR | System-dependent | Sets the location for runners. Effect: Determines where runner executables are located. Scenario: Important for custom deployments or when runners need to be isolated from the main application.
OLLAMA_NUM_PARALLEL | 0 (unlimited) | Sets the number of parallel model requests. Effect: Controls concurrency of model inference. Scenario: Critical for managing system load and ensuring responsiveness in high-traffic environments.
OLLAMA_MAX_LOADED_MODELS | 0 (unlimited) | Sets the maximum number of loaded models. Effect: Limits the number of models that can be simultaneously loaded. Scenario: Helps manage memory usage in environments with limited resources or many different models.
OLLAMA_MAX_QUEUE | 512 | Sets the maximum number of queued requests. Effect: Limits the size of the request queue. Scenario: Prevents system overload during traffic spikes and ensures timely processing of requests.
OLLAMA_MAX_VRAM | 0 (unlimited) | Sets a maximum VRAM override in bytes. Effect: Limits the amount of VRAM that can be used. Scenario: Useful in shared GPU environments to prevent a single process from monopolizing GPU memory.
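For example, a couple of these variables can be set for a single foreground run (the values are illustrative):

$ OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=30m ollama serve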


Source: (link visible after login)

$ ollama help serve
Start ollama

Usage:
  ollama serve [flags]

Aliases:
  serve, start

Flags:
  -h, --help   help for serve

Environment Variables:
      OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
      OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
      OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
      OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models per GPU
      OLLAMA_MAX_QUEUE           Maximum number of queued requests
      OLLAMA_MODELS              The path to the models directory
      OLLAMA_NUM_PARALLEL        Maximum number of parallel requests
      OLLAMA_NOPRUNE             Do not prune model blobs on startup
      OLLAMA_ORIGINS             A comma separated list of allowed origins
      OLLAMA_SCHED_SPREAD        Always schedule model across all GPUs
      OLLAMA_TMPDIR              Location for temporary files
      OLLAMA_FLASH_ATTENTION     Enabled flash attention
      OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
      OLLAMA_GPU_OVERHEAD        Reserve a portion of VRAM per GPU (bytes)
      OLLAMA_LOAD_TIMEOUT        How long to allow model loads to stall before giving up (default "5m")


Reference: (link visible after login)
OP | Posted on 2025-2-6 09:19:49
Ollama commands

ollama list: list models
ollama show: show information about a model
ollama pull: pull a model
ollama push: push a model
ollama cp: copy a model
ollama rm: delete a model
ollama run: run a model
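For example, using the model deployed above:

$ ollama pull deepseek-r1:8b
$ ollama show deepseek-r1:8b
$ ollama list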
OP | Posted on 2025-2-6 09:33:17
Model optimization: edit the /etc/systemd/system/ollama.service.d/override.conf configuration file and add the following:
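A sketch of typical tuning entries for the drop-in; the specific variables and values here are assumptions drawn from the environment-variable table earlier in this thread, not the original snippet:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_FLASH_ATTENTION=1"

Afterwards, reload and restart the service: sudo systemctl daemon-reload && sudo systemctl restart ollama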


The complete configuration is as follows:
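As a sketch only, the merged configuration is roughly the stock unit file created by the installer plus the drop-in above (the contents below are an approximation, not the original listing):

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target

# plus the drop-in /etc/systemd/system/ollama.service.d/override.conf shown above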



OP | Posted on 2025-2-6 09:35:26
The --verbose parameter displays the inference speed (tokens/second). For example:
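Presumably invoked as (model name per the deployment above):

$ ollama run deepseek-r1:8b --verbose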

>>> Hello, who are you?
<think>

</think>

Hello! I am an AI assistant independently developed by DeepSeek in China, and I am happy to serve you!

total duration:       489.831897ms
load duration:        24.116423ms
prompt eval count:    10 token(s)
prompt eval duration: 4ms
prompt eval rate:     2500.00 tokens/s
eval count:           36 token(s)
eval duration:        460ms
eval rate:            78.26 tokens/s
>>> I'm a little scumbag, what's my name?
<think>
Hmm, is "Little Scum" my name? Let me think about this carefully. I remember that when I was a child my parents called me "Little Scumbag" because the family thought I had big ears, but that isn't actually the real reason. Maybe it's because as a child I always put my fingers in my mouth like garbage, so I was called "Little Scum".

However, the name sounds a bit strange now and doesn't really fit modern naming conventions. I should consider renaming myself with something more suitable. First, I need to understand my own personality traits and interests so that I can choose a name that is both meaningful and unique.

I like helping others and I am good at listening and communicating, so perhaps I should pick a name that conveys "wisdom", "warmth" or "friendliness". At the same time, I also want to reflect my personality; for example, if I like books, music or sports, those elements could be incorporated into the name.

Furthermore, I need to make sure the new name is not easily misremembered or misunderstood. That means it must be unique and recognizable, neither too complex nor too ordinary. I could look at the names of some well-known figures to see which carry both cultural depth and a modern feel.

In addition, I need to consider whether the pronunciation flows well, whether it has a positive meaning, and how it comes across in different contexts. In other words, a name should not only sound good but also carry meaning and value.

Finally, I should talk it over with family and friends to make sure they are happy with the new name and that it can become part of my life. After all, a name is not only an identifier but also a form of self-expression.

In short, "Little Scum" is not my real name, and I have the right to choose a name that suits me better. It takes time and thought, but by carefully analyzing my personality, my interests, and the meaning of the name, I believe I will find the one that fits me best.
</think>

Based on the above reasoning, I have decided to name myself "Wisdom Star". The name reflects both my ingenuity and my warmth and hope, fitting my personality traits perfectly. At the same time, the "star" symbolizes light and infinite possibility, inspiring me to keep pursuing progress and growth.

**Finalized name: Wisdom Star**

This name not only matches my personality but is also full of positive energy, which makes me feel proud and fulfilled.

total duration:       7.600299527s
load duration:        24.37534ms
prompt eval count:    59 token(s)
prompt eval duration: 10ms
prompt eval rate:     5900.00 tokens/s
eval count:           557 token(s)
eval duration:        6.618s
eval rate:            84.16 tokens/s

OP | Posted on 2025-2-6 10:22:02
Deploy Ollama models with AMD GPUs
ollama-for-amd: (link visible after login)

Reference: (link visible after login)
OP | Posted on 2025-2-6 13:26:17
Run the deepseek-r1:32b model
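Presumably the same invocation as before, with the larger tag:

$ ollama run deepseek-r1:32b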




root@VM-0-8-ubuntu:~# nvidia-smi
Thu Feb  6 13:25:04 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                  Off |
| N/A   65C    P0   205W / 300W |  21822MiB / 32768MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     91457      C   ... 1_avx/ollama_llama_server    21820MiB |
+-----------------------------------------------------------------------------+
root@VM-0-8-ubuntu:~# ollama show deepseek-r1:32b
  Model
    architecture        qwen2
    parameters          32.8B
    context length      131072
    embedding length    5120
    quantization        Q4_K_M

  Parameters
    stop    "<|begin▁of▁sentence|>"
    stop    "<|end▁of▁sentence|>"
    stop    "<|User|>"
    stop    "<|Assistant|>"

  License
    MIT License
    Copyright (c) 2023 DeepSeek

root@VM-0-8-ubuntu:~# ollama ps
NAME               ID              SIZE     PROCESSOR    UNTIL
deepseek-r1:32b    38056bbcbb2d    23 GB    100% GPU     Forever


OP | Posted on 2025-2-8 08:34:18
How to solve the Ollama model pull problem
https://www.itsvse.com/thread-10939-1-1.html
OP | Posted on 2025-2-13 09:25:04
Experience the DeepSeek-R1 32b model on the Jetson AGX Orin (32G): (link visible after login)
Running large language models on Jetson: https://www.jetson-ai-lab.com/models.html
