diff --git a/docs/.vitepress/en.ts b/docs/.vitepress/en.ts
index 4b65e7c0b0..24516d0ab9 100644
--- a/docs/.vitepress/en.ts
+++ b/docs/.vitepress/en.ts
@@ -216,6 +216,10 @@ const side = {
text: "Clone applications",
link: "/manual/olares/market/clone-apps",
},
+ {
+ text: "Use LLM Base applications",
+ link: "/manual/olares/market/llm-base-apps",
+ },
{
text: "Manage paid applications",
link: "/manual/olares/market/purchase-paid-apps",
diff --git a/docs/.vitepress/zh.ts b/docs/.vitepress/zh.ts
index bcd72e3a19..aca4fe16ba 100644
--- a/docs/.vitepress/zh.ts
+++ b/docs/.vitepress/zh.ts
@@ -214,6 +214,10 @@ const side = {
text: "克隆应用",
link: "/zh/manual/olares/market/clone-apps",
},
+ {
+ text: "使用大模型底座",
+ link: "/zh/manual/olares/market/llm-base-apps",
+ },
{
text: "管理付费应用",
link: "/zh/manual/olares/market/purchase-paid-apps",
diff --git a/docs/manual/olares/market/llm-base-apps.md b/docs/manual/olares/market/llm-base-apps.md
new file mode 100644
index 0000000000..09532f54d6
--- /dev/null
+++ b/docs/manual/olares/market/llm-base-apps.md
@@ -0,0 +1,304 @@
+---
+outline: [2, 3]
+description: Learn how to use the LLM Base applications on Olares to self-host large language models and run different inference engines by cloning the base apps.
+---
+
+# Host local large language models with LLM Base apps
+
+Olares V1.12.6 introduces the local hosting and management platform for large language models (LLMs), a self-hosting solution powered by the `llm-init` project. This platform provides four LLM Base applications, each for one inference engine: **Ollama LLM Base**, **vLLM LLM Base**, **llama.cpp LLM Base**, and **SGLang LLM Base**. Select the base app for the engine you want, use it to deploy different models, and then monitor model performance through a dedicated console.
+
+## Before you start
+
+- Your Olares system has been upgraded to V1.12.6 or later.
+
+## Locate LLM Base apps
+
+1. Open Market and search for "LLM Base". Four base apps appear: vLLM LLM Base (llm-init), SGLang LLM Base (llm-init), Ollama LLM Base (llm-init), and llama.cpp LLM Base (llm-init).
+
+ 
+
+2. Each base app is optimized for a different inference scenario. Choose one based on your model source, performance needs, and hardware.
+
+ | Base app | When to choose |
+ | :--- | :--- |
+ | **llama.cpp LLM Base (llm-init)** | Choose llama.cpp when you are running lightweight
GGUF models or deploying with limited GPU memory. |
+ | **Ollama LLM Base (llm-init)** | Choose Ollama when you want to get started quickly with
broad model compatibility. It pulls models automatically
using native model tags, making it ideal for chat and embedding tasks. |
+ | **vLLM LLM Base (llm-init)** | Choose vLLM when you need high-throughput serving
of Hugging Face models under heavy concurrent load. |
+ | **SGLang LLM Base (llm-init)** | Choose SGLang when you need efficient structured
generation or advanced reasoning optimizations. |
+
+## Create a new model instance
+
+An LLM Base app serves as a template. To run a model, you must first clone the base app into an independent running instance.
+
+1. Select the base app that matches your preferred inference engine, and then click **View** on it. For example, **llama.cpp LLM Base (llm-init)**.
+2. Click **Create** to initialize a new instance.
+
+ 
+
+3. Specify the instance identity settings:
+
+ - **New app name**: Enter a unique name for the instance. This name is displayed as the app name in Market and Settings. For example, `Qwen3.6-35B-A3B`.
+ - **Shortcut name for {client}**: Enter a unique shortcut name for the instance. This name is displayed on the Launchpad. For example, `qwen3.6-35b-a3b`.
+
+4. Click **Create** to proceed to the environment configuration.
+
+## Configure engine environment variables
+
+After creating the instance, the configuration window opens. Define where your engine pulls the model, how much memory it uses, and what capabilities it exposes to other client apps.
+
+1. In the **Configure environment variables for {New-app-name}** window, fill in the following details according to the target model and engine:
+
+ | Variable | Description |
+ | :--- | :--- |
+ | **MODEL_SOURCE** | Specify where the engine pulls the model.
The format depends on the selected engine:
- **Ollama**: `ollama://:`
Example: `ollama://qwen3.5:2b` - **llama.cpp**: `hf:// --include .gguf`
Example: `hf://unsloth/Qwen3.6-35B-A3B-GGUF --include Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf` - **vLLM** / **SGLang**: `hf://`
Example: `hf://Qwen/Qwen3.5-2B`
|
+ | **MODEL_NAME** | Define the name that client apps use to call this instance.
Derive it from `MODEL_SOURCE` per engine:- **Ollama**: Use the string after `ollama://`.
`MODEL_SOURCE`: `ollama://qwen3.5:2b`
`MODEL_NAME`: `qwen3.5:2b` - **llama.cpp**: Use the repo name plus the quantization tag (one quantization per instance).
`MODEL_SOURCE`: `hf://unsloth/Qwen3.6-35B-A3B-GGUF --include Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf`
`MODEL_NAME`: `unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL` - **vLLM** / **SGLang**: Use the string after `hf://`.
`MODEL_SOURCE`: `hf://Qwen/Qwen3.5-2B`
`MODEL_NAME`: `Qwen/Qwen3.5-2B`
|
+ | **MODEL_MODE** | Select **Chat** or **Embedding**. |
+ | **MODEL_SUPPORTS** | Select the capabilities the model supports: **Vision**, **Tools**,
**Thinking**, or **Embedding**. |
+ | **ENGINE_ARGS** | Specify the engine startup parameters, separated by spaces.
The format depends on the engine:- **Ollama**: `OLLAMA_CONTEXT_LENGTH=8192`
- **llama.cpp**: `-c 65536 -ngl all`
- **vLLM**: `--max-model-len 65536`
- **SGLang**: `--context-length 65536`
For more arguments, see [Engine tuning arguments](#reference-engine-tuning-arguments). |
+ | **{ENGINE}_REQUIRED
_GPU_MEMORY** | Enter the minimum GPU memory the instance needs to start,
in MB or Gi. For example, `20Gi`.- In time slicing or exclusive mode, set it below your total VRAM.
- In memory slicing mode, set it below your remaining VRAM.
|
+
+2. Click **Confirm** to save the configuration and start the instance installation.
+
+ An **Instances** panel appears on the right side of the page, showing the installation progress. Once the setup completes, the instance's operation button changes to **Open**, indicating that the base service is running. A model app with the same name also appears on the Launchpad.
+
+ 
+
+ :::info
+ Model instances created from LLM Base apps show a `From template` tag next to the app name. You can see this tag when viewing the app in Market or Settings.
+
+ {width=70%}
+ :::
+
+:::tip Update variables later
+To change these variables after installation, go to Olares **Settings** > **Applications** > **[App-Name]** > **Manage environment variables**. Click the edit icon next to a variable, update its value, save your change, and then click **Apply**.
+:::
+
+### Reference: Engine tuning arguments
+
+Use the `ENGINE_ARGS` variable to add custom settings that adjust memory usage, context limits, and processing behaviors. Separate multiple arguments with spaces. Select your inference engine below to view the available tuning arguments.
+
+
+
+
+| Argument | Purpose | Recommended Example |
+| :--- | :--- | :--- |
+| `OLLAMA_CONTEXT_LENGTH` | Sets the default context window
size in tokens.
Default scales by VRAM:- Less than 24G: 4096
- Between 24G and 48G: 32768
- 48G and more: 262144
| `8192` to `131072` |
+| `OLLAMA_KEEP_ALIVE` | Sets model resident duration in memory
after the last request. Use `-1`
for permanent retention.
Default: `5m`. | `30m` or `-1` |
+| `OLLAMA_FLASH_ATTENTION` | Enables Flash Attention to optimize
memory efficiency during long-context
operations.
Default: `0` (Disabled). | `1` (Enabled) |
+| `OLLAMA_KV_CACHE_TYPE` | Sets the KV cache quantization type
to save video memory.
Default: `f16`. | `q8_0` (minor precision loss) or `q4_0` |
+| `OLLAMA_NUM_PARALLEL` | Sets the number of concurrent
requests processed per model.
Ollama determines this automatically
based on your available VRAM,
typically `1` or `4`.
Default: `0` | `1` |
+
+For other Ollama arguments, see the [official documentation](https://github.com/ollama/ollama/blob/main/docs/faq.mdx).
+
+
+
+| Argument | Purpose | Recommended Example |
+| :--- | :--- | :--- |
+| `--max-model-len` | Sets the maximum context length.
Lower it if you hit out-of-memory errors. | `65536` |
+| `--gpu-memory-utilization` | Sets the fraction of GPU memory the
vLLM engine may use. | `0.9` |
+| `--tensor-parallel-size` | Sets the tensor-parallel size, that is,
how many GPUs split and run one
model together. | `1` |
+| `--max-num-batched-tokens` | Caps the number of tokens processed
per batch, preventing sharp latency
spikes. | `8192` |
+| `--enable-prefix-caching` | Caches the KV Cache of shared prompt
prefixes and reuses it across requests. | Enabled |
+| `--kv-cache-dtype` | Sets the KV Cache data type. Using
`fp8` raises throughput while preserving
quality.
Accepted values: `auto`, `bfloat16`,
`fp8`, `fp8_ds_mla`, `fp8_e4m3`,
`fp8_e5m2`, `fp8_inc`.
With `auto` (default), the KV Cache type
matches the model weights (usually
`float16` or `bfloat16`). | `fp8` |
+
+For other vLLM arguments, see the [official documentation](https://docs.vllm.ai/en/v0.17.0/configuration/engine_args/).
+
+
+
+| Argument | Purpose | Recommended Example |
+| :--- | :--- | :--- |
+| `-c` | Sets the maximum context length
in tokens. | `65536` |
+| `-ngl` | Offloads all model layers to the GPU
to avoid CPU-bound slowdowns. | `all` |
+| `-fa` | Enables Flash Attention to speed up
attention computation. | `on` |
+| `-ctk` / `-ctv` | Quantizes the KV Cache to 8-bit,
balancing GPU memory use and precision. | `q8_0` |
+| `--spec-type` | Enables MTP (speculative decoding). | `draft-mtp` |
+| `--spec-draft-n-max` | Sets the maximum number of tokens
the drafter guesses ahead per
speculative step. | `3` |
+
+For other llama.cpp arguments, see the [official documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
+
+
+
+| Argument | Purpose | Recommended Example |
+| :--- | :--- | :--- |
+| `--context-length` | Sets the maximum context length. | `65536` |
+| `--mem-fraction-static` | Sets the fraction of GPU memory
pre-allocated for static usage, similar
to vLLM's `--gpu-memory-utilization`. | `0.85` |
+| `--chunked-prefill-size` | Splits very long inputs into chunks so
they don't block the GPU for long,
keeping concurrent requests' streaming
smooth. | `4096` |
+| `--reasoning-parser` | Separates chain-of-thought output:
writes the model's reasoning to the
`reasoning_content` field and the final
answer to the `content` field. Set it
to match the model. | `gpt-oss` |
+| `--tool-call-parser` | Enables parsing of function-call
(tool use) output. Set it to match
the model. | `gpt-oss` |
+
+For other SGLang arguments, see the [official documentation](https://docs.sglang.io/docs/advanced_features/server_arguments).
+
+
+
+
+
+### Sample engine configurations
+
+
+
+
+Ollama pulls models automatically using native model tags.
+
+**Chat Model Example**
+```text
+MODEL_SOURCE=ollama://qwen3.5:2b
+MODEL_NAME=qwen3.5-2b
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_function_calling,supports_tool_choice
+ENGINE_ARGS=OLLAMA_CONTEXT_LENGTH=8192
+OLLAMA_REQUIRED_GPU_MEMORY=4096
+```
+
+**Embedding Model Example**
+```text
+MODEL_SOURCE=ollama://nomic-embed-text
+MODEL_NAME=nomic-embed-text
+MODEL_MODE=supports_embedding
+MODEL_SUPPORTS=embedding
+ENGINE_ARGS=OLLAMA_KEEP_ALIVE=-1
+OLLAMA_REQUIRED_GPU_MEMORY=4096
+```
+
+
+
+
+```text
+MODEL_SOURCE=hf://Qwen/Qwen3.5-2B
+MODEL_NAME=Qwen/Qwen3.5-2B
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_reasoning
+ENGINE_ARGS=--max-model-len 8192 --gpu-memory-utilization 0.9 --tensor-parallel-size 1
+VLLM_REQUIRED_GPU_MEMORY=10Gi
+```
+
+
+
+
+```text
+MODEL_SOURCE=hf://unsloth/Qwen3.5-2B-GGUF --include Qwen3.5-2B-UD-Q4_K_XL.gguf,hf://unsloth/Qwen3.5-2B-GGUF --include mmproj-F16.gguf
+MODEL_NAME=unsloth/Qwen3.5-2B-GGUF:UD-Q4_K_XL
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_vision,supports_reasoning
+ENGINE_ARGS=-c 65536 -ngl all -fa on
+LLAMACPP_REQUIRED_GPU_MEMORY=8192
+```
+
+
+
+
+```text
+MODEL_SOURCE=hf://Qwen/Qwen3.5-2B
+MODEL_NAME=Qwen/Qwen3.5-2B
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_function_calling,supports_tool_choice,supports_reasoning,supports_thinking
+ENGINE_ARGS=--context-length 32768 --mem-fraction-static 0.85 --max-running-requests 64 --reasoning-parser qwen3
+SGLANG_REQUIRED_GPU_MEMORY=8192
+```
+
+
+
+## Monitor deployment and configure settings
+
+Track model downloads, verify engine readiness, and manage operational parameters through the built-in model console.
+
+1. Locate the model instance in the **Instances** panel on the LLM Base app details page, or find it on the Launchpad.
+2. Open it to launch the dedicated model console.
+
+ The console opens on the **Status** tab by default. The model files are start downloading automatically.
+
+ 
+
+3. Tracks the readiness of the model and the engine:
+
+ - **Model**: Shows `READY` once the files are downloaded and verified. Copy the **Model name** to use for client app connections.
+ - **Engine**: Shows `RUNNING` once the inference service is online. Configure how client apps reach it:
+ - **WHO IS CALLING**: Select who can access the API, **Apps in Olares**, **Devices in LAN**, or **Remote**.
+ - **WHAT API FORMAT**: Select the API format. The available options depend on the engine, for example **OpenAI-Compatible**, **Anthropic-Compatible**, or **Ollama**.
+ - **Base URL**: Copy this URL to use for client app connections.
+ - **Supported Endpoints**: Expand to see the available API endpoints.
+
+ 
+
+4. Select the **Config** tab to review the model's capabilities and parameters, check how it sits on the GPU, and measure its performance.
+
+ 
+
+ - **Model card** (top): Shows the model name, mode (**Chat** or **Embedding**), and the capability tags the instance exposes, such as `function_calling`, `parallel_function_calling`, `reasoning`, `reasoning_effort`, and `tool_choice`.
+ - **Parameters**: View the engine parameters. Expand **Advanced parameters** for the full set, and use the **Form** / **Raw** toggle to switch the view.
+ - **GPU Residency**: Confirm whether the model is actually running on the GPU. Click **Detect** to refresh, then check **Mode**:
+ - `full GPU`: All layers run on the GPU. This is the expected, fastest state.
+ - `partial` or `cpu_only`: Part or all of the model fell back to the CPU, which makes inference much slower. On a GPU host this usually means an environment mis-mount, so review your `{ENGINE}_REQUIRED_GPU_MEMORY` setting and engine arguments.
+
+ The panel also reports **VRAM**, **KV cache used**, and **GPU mem util**, so you can see how much memory the model occupies and how much headroom is left for longer contexts or more concurrent requests.
+ - **Performance**: Click **Run test** to benchmark responsiveness:
+ - **TTFT** (Time To First Token): How long a user waits before the first word appears. Lower means a snappier experience.
+ - **perf.code**: How long the engine takes to load the model from scratch, for example after a restart.
+
+ Use these numbers to compare quantization levels, context sizes, or engine arguments, and to confirm that a change actually improved speed before you rely on it.
+
+## Connect client apps to the model service
+
+Once the model instance is running, any client app that speaks the OpenAI-compatible API can connect to it through the base URL.
+
+The following example uses [OpenCode](../../../use-cases/opencode.md) as the client.
+
+1. In the model console, go to the **Status** tab. Under **Service status**:
+
+ - **WHO IS CALLING**: Select **Apps in Olares**, because OpenCode runs inside Olares.
+ - **WHAT API FORMAT**: Select **OpenAI-Compatible**.
+ - Copy the **Base URL** and note down the **Model name**.
+
+2. In OpenCode, click settings in the bottom-left corner, select **Providers**, then scroll down and select **Connect** next to **Custom Provider**.
+
+3. Enter the following details:
+
+ - **Provider ID**: A unique identifier for this provider. For example, `olares-llm`.
+ - **Display name**: The name shown in the provider list. For example, `Olares LLM`.
+ - **Base URL**: The **Base URL** you copied from the model console.
+ - **Models**:
+ - **Model ID**: Your `MODEL_NAME`. For example, `Qwen3.6-35B-A3B`.
+ - **Display Name**: The name shown for this model. For example, `Qwen3.6 35B A3B`.
+
+4. Click **Submit** to save the configuration. The provider appears in the provider list.
+5. Run a task to test the connection. This example uses the Olares skill to upload and deploy an app to Olares.
+
+ a. At the top, click the **Search** field and select **Toggle terminal** to open a terminal.
+
+ b. Log in to the Olares CLI. Replace `alice123@olares.com` with your own Olares ID: `olares-cli profile login --olares-id alice123@olares.com`.
+
+ c. When prompted, type your Olares password and press **Enter**. The input stays hidden.
+
+ d. Below the chat box, select **Big Pickle** to open the model selector, and select **Qwen3.6 35B A3B** from the list.
+
+ e. Send a task:
+
+ ```text
+ Upload and deploy this app to Olares:
+ https://github.com/chandruk4321/dockerize-static-web-project
+ ```
+
+ f. Respond to OpenCode's questions, decisions, and approvals until the task finishes.
+
+ 
+
+ In this example, the Todo app is uploaded and deployed to **My Olares**. Open it from **My Olares** or the Launchpad to use the running app.
+
+ 
+
+## Uninstall model instances
+
+1. Open Market, go to **My Olares**, and then locate the model instance app.
+2. Click the drop-down arrow next to the operation button, and then select **Uninstall**.
diff --git a/docs/public/images/manual/olares/llm-base-apps-create-instance.png b/docs/public/images/manual/olares/llm-base-apps-create-instance.png
new file mode 100644
index 0000000000..7fb39fa78e
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-apps-create-instance.png differ
diff --git a/docs/public/images/manual/olares/llm-base-apps-create-instance1.png b/docs/public/images/manual/olares/llm-base-apps-create-instance1.png
new file mode 100644
index 0000000000..8cd564d482
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-apps-create-instance1.png differ
diff --git a/docs/public/images/manual/olares/llm-base-apps.png b/docs/public/images/manual/olares/llm-base-apps.png
new file mode 100644
index 0000000000..9109b0a03d
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-apps.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-console-config.png b/docs/public/images/manual/olares/llm-base-model-console-config.png
new file mode 100644
index 0000000000..08e0cdf699
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-console-config.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-console-ready.png b/docs/public/images/manual/olares/llm-base-model-console-ready.png
new file mode 100644
index 0000000000..fe2e80a051
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-console-ready.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-console-status.png b/docs/public/images/manual/olares/llm-base-model-console-status.png
new file mode 100644
index 0000000000..5a4b80fb54
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-console-status.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-inst-task.png b/docs/public/images/manual/olares/llm-base-model-inst-task.png
new file mode 100644
index 0000000000..bebb9bd569
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-inst-task.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-inst-test.png b/docs/public/images/manual/olares/llm-base-model-inst-test.png
new file mode 100644
index 0000000000..6ebdb900a3
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-inst-test.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-instance-installed.png b/docs/public/images/manual/olares/llm-base-model-instance-installed.png
new file mode 100644
index 0000000000..94f1b150d0
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-installed.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-instance-installed1.png b/docs/public/images/manual/olares/llm-base-model-instance-installed1.png
new file mode 100644
index 0000000000..b7ffe08841
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-installed1.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-instance-tag.png b/docs/public/images/manual/olares/llm-base-model-instance-tag.png
new file mode 100644
index 0000000000..7bf69eb34f
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-tag.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-instance-tag1.png b/docs/public/images/manual/olares/llm-base-model-instance-tag1.png
new file mode 100644
index 0000000000..45aeb08396
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-tag1.png differ
diff --git a/docs/zh/manual/olares/market/llm-base-apps.md b/docs/zh/manual/olares/market/llm-base-apps.md
new file mode 100644
index 0000000000..4b1664f25f
--- /dev/null
+++ b/docs/zh/manual/olares/market/llm-base-apps.md
@@ -0,0 +1,224 @@
+---
+outline: [2, 3]
+description: 了解如何在 Olares 中使用大模型底座(LLM Base App)托管本地大语言模型,并通过克隆底座运行 Ollama、vLLM、llama.cpp 或 SGLang 等推理引擎。
+---
+
+# 使用大模型底座托管本地大语言模型
+
+Olares V1.12.6 推出了基于 `llm-init` 项目的本地大语言模型(LLM)托管与管理平台。该平台提供四个大模型底座应用,分别对应四种推理引擎:**Ollama LLM Base**、**vLLM LLM Base**、**llama.cpp LLM Base** 和 **SGLang LLM Base**。选择对应引擎的底座,部署不同模型,并通过专属面板监控模型运行状态。
+
+## 开始之前
+
+- 你的 Olares 系统已升级至 V1.12.6 或更高版本。
+
+## 找到大模型底座
+
+1. 打开 **Market**,搜索“LLM Base”。
+
+ 
+
+2. 每个底座应用针对不同的推理场景做了优化。根据你的模型来源、性能需求和硬件条件进行选择:
+
+ | 底座应用 | 适用场景 |
+ | :--- | :--- |
+ | **Ollama LLM Base (llm-init)** | 快速上手和广泛的模型兼容。Ollama 可通过原生模型标签自动拉取模型,最适合聊天和嵌入任务。 |
+ | **vLLM LLM Base (llm-init)** | 高并发场景下对 HuggingFace 模型进行高吞吐量推理服务。 |
+ | **SGLang LLM Base (llm-init)** | 需要高效结构化生成和高级推理服务优化的场景。 |
+ | **llama.cpp LLM Base (llm-init)** | 轻量 GGUF 模型、显存有限或资源紧张的部署环境。 |
+
+## 创建新的模型实例
+
+大模型底座只是一个模板。要运行模型,你需要先将底座克隆为独立的运行实例。
+
+1. 选择与你所需推理引擎匹配的底座,然后点击 **View**。例如,**Ollama LLM Base (llm-init)**。
+2. 点击 **Create**,初始化一个新的实例。
+3. 配置实例标识:
+
+ - **New app name**:输入实例的唯一名称。该名称会显示在 **Market** > **My Olares** 中。
+ - **Shortcut name for {client}**:输入在启动台上显示的唯一快捷方式名称。
+
+4. 点击 **Create**,进入环境变量配置。
+
+## 配置引擎环境变量
+
+创建实例后,会弹出配置窗口。你需要定义引擎从哪里拉取模型、使用多少显存,以及向其他客户端应用暴露哪些能力。
+
+1. 在 **Configure environment variables for {New-app-name}** 窗口中,根据目标模型和引擎填写以下信息:
+
+ | 变量 | 说明 |
+ | :--- | :--- |
+ | **MODEL_SOURCE** | 指定模型源地址。
格式取决于所选引擎。
示例:`ollama://qwen3.5:0.8b` 或 `hf://Qwen/Qwen3.5-2B`。 |
+ | **MODEL_NAME** | 指定客户端应用连接时使用的模型名称。
示例:`qwen3.5-2b`。 |
+ | **MODEL_MODE** | 选择 **Chat** 或 **Embedding**。
示例:`chat`。 |
+ | **MODEL_SUPPORTS** | 输入逗号分隔的模型能力标志:- **推理模型**:包含 `supports_reasoning`。
- **工具调用模型**:包含 `supports_function_calling`。
如需同时处理多个任务,再添加 `supports_parallel_function_calling`。 - **视觉模型**:包含 `supports_vision`。
- **嵌入模型**:留空,不要填写聊天相关能力标志。
示例:`supports_function_calling,supports_tool_choice`。
参考:[模型能力标志](#model-capability-flags)。 |
+ | **ENGINE_ARGS** | 指定引擎启动参数。
多个参数之间用空格分隔。
示例:`OLLAMA_CONTEXT_LENGTH=4096`。
参考:[引擎调优参数](#engine-tuning-arguments)。 |
+ | **{ENGINE}_REQUIRED
_GPU_MEMORY** | 设置实例所需的最低显存,单位为 MB 或 Gi。
示例:`8192`。 |
+
+2. 点击 **Confirm** 保存配置并开始安装实例。
+
+ 页面右侧会出现 **Instances** 面板,显示安装进度。安装完成后,实例操作按钮会变为 **Open**,表示底层服务正在运行。
+
+ 
+
+### 参考:模型能力标志 {#model-capability-flags}
+
+`MODEL_SUPPORTS` 变量声明模型向外部客户端暴露的能力。这些标志对所有推理引擎通用。
+
+| 类别 | 支持的标志 |
+| --- | --- |
+| **核心** | `supports_vision`、`supports_function_calling`、
`supports_reasoning`、`supports_native_streaming`、
`supports_response_schema`、`supports_prompt_caching`、
`supports_web_search`、`supports_parallel_function_calling` |
+| **多模态** | `supports_audio_input`、`supports_audio_output`、
`supports_video_input`、`supports_pdf_input`、
`supports_computer_use`、`supports_url_context` |
+| **推理与控制 token** | `supports_reasoning_effort`、`supports_thinking`、
`supports_assistant_prefill`、`supports_tool_choice`、
`supports_tokenizer` |
+| **采样控制** | `supports_system_messages`、`supports_temperature`、
`supports_top_p`、`supports_top_k`、
`supports_stop_sequences`、`supports_frequency_penalty`、
`supports_presence_penalty` |
+| **响应形式** | `supports_n`、`supports_logprobs`、`supports_seed`、
`supports_response_format`、`supports_logit_bias`、`supports_user` |
+
+### 参考:引擎调优参数 {#engine-tuning-arguments}
+
+使用 `ENGINE_ARGS` 变量来调整显存占用、上下文长度和处理行为。点击下方推理引擎查看可用调优参数。
+
+
+
+
+| 参数 | 用途 | 推荐示例 |
+| :--- | :--- | :--- |
+| `OLLAMA_CONTEXT_LENGTH` | 设置默认上下文窗口大小(以 token 为单位)。
默认根据显存自动调整:- 小于 24G:4096
- 24G 到 48G 之间:32768
- 48G 及以上:262144
| `8192` 到 `131072` |
+| `OLLAMA_NUM_PARALLEL` | 设置每个模型可同时处理的并发请求数。
Ollama 会根据可用显存自动决定,通常为 `1` 或 `4`。
默认值:`0` | `1` |
+| `OLLAMA_KV_CACHE_TYPE` | 设置 KV 缓存量化类型以节省显存。
默认值:`f16`。 | `q8_0`(轻微精度损失)或 `q4_0` |
+| `OLLAMA_FLASH_ATTENTION` | 启用 Flash Attention,以优化长上下文场景下的显存效率。
默认值:`0`(关闭)。 | `1`(开启) |
+| `OLLAMA_MAX_LOADED_MODELS` | 设置可同时加载在内存中的模型数量上限。
默认会根据每块可用 GPU 自动扩展到约 3 个模型。
默认值:`0`。 | `1` |
+| `OLLAMA_MAX_QUEUE` | 设置处理队列中允许的最大请求数。
默认值:`512`。 | `512` |
+| `OLLAMA_KEEP_ALIVE` | 设置最后一次请求后模型在内存中保留的时长。使用 `-1` 表示永久保留。
默认值:`5m`。 | `30m` 或 `-1` |
+| `OLLAMA_LOAD_TIMEOUT` | 设置等待模型加载完成的最大时长。
默认值:`5m`。 | `5m` |
+| `OLLAMA_GPU_OVERHEAD` | 设置预留的安全显存余量(字节)。
默认值:`0`。 | `0` |
+| `OLLAMA_DEBUG` | 设置系统日志级别,用于排查问题。
默认值:`0`(Info)。 | `1`(Debug) |
+
+
+
+
+占位符
+
+| 参数 | 用途 | 推荐示例 |
+| :--- | :--- | :--- |
+| `--max-model-len` | 设置最大上下文窗口大小。如遇到显存不足,可适当减小。 | `8192` |
+| `--gpu-memory-utilization` | 设置模型可使用的显存比例上限。 | `0.9` |
+| `--tensor-parallel-size` | 设置用于张量并行的 GPU 数量。 | `1` |
+
+
+
+
+占位符
+
+| 参数 | 用途 | 推荐示例 |
+| :--- | :--- | :--- |
+| `-c` | 设置最大上下文窗口大小(以 token 为单位)。 | `65536` |
+| `-ngl` | 将模型层 offload 到 GPU。 | `all` |
+| `-fa` | 启用 Flash Attention 以降低显存占用。 | `on` |
+
+
+
+
+占位符
+
+| 参数 | 用途 | 推荐示例 |
+| :--- | :--- | :--- |
+| `--context-length` | 设置最大上下文长度。 | `32768` |
+| `--mem-fraction-static` | 设置用于静态用途(模型权重和 KV 缓存)的显存比例。 | `0.85` |
+| `--max-running-requests` | 设置并发处理的最大请求数。 | `64` |
+| `--reasoning-parser` | 配置推理模型的解析器。 | `qwen3` |
+
+
+
+
+### 引擎配置示例
+
+
+
+
+Ollama 使用原生模型标签自动拉取模型。
+
+**聊天模型示例**
+```text
+MODEL_SOURCE=ollama://qwen3.5:2b
+MODEL_NAME=qwen3.5-2b
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_function_calling,supports_tool_choice
+ENGINE_ARGS=OLLAMA_CONTEXT_LENGTH=8192
+OLLAMA_REQUIRED_GPU_MEMORY=4096
+```
+
+**嵌入模型示例**
+```text
+MODEL_SOURCE=ollama://nomic-embed-text
+MODEL_NAME=nomic-embed-text
+MODEL_MODE=embedding
+MODEL_SUPPORTS=
+ENGINE_ARGS=OLLAMA_KEEP_ALIVE=-1
+OLLAMA_REQUIRED_GPU_MEMORY=4096
+```
+
+
+
+
+```text
+MODEL_SOURCE=hf://Qwen/Qwen3.5-2B
+MODEL_NAME=Qwen/Qwen3.5-2B
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_reasoning
+ENGINE_ARGS=--max-model-len 8192 --gpu-memory-utilization 0.9 --tensor-parallel-size 1
+VLLM_REQUIRED_GPU_MEMORY=10Gi
+```
+
+
+
+
+```text
+MODEL_SOURCE=hf://unsloth/Qwen3.5-2B-GGUF --include Qwen3.5-2B-UD-Q4_K_XL.gguf,hf://unsloth/Qwen3.5-2B-GGUF --include mmproj-F16.gguf
+MODEL_NAME=unsloth/Qwen3.5-2B-GGUF:UD-Q4_K_XL
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_vision,supports_reasoning
+ENGINE_ARGS=-c 65536 -ngl all -fa on
+LLAMACPP_REQUIRED_GPU_MEMORY=8192
+```
+
+
+
+
+```text
+MODEL_SOURCE=hf://Qwen/Qwen3.5-2B
+MODEL_NAME=Qwen/Qwen3.5-2B
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_function_calling,supports_tool_choice,supports_reasoning,supports_thinking
+ENGINE_ARGS=--context-length 32768 --mem-fraction-static 0.85 --max-running-requests 64 --reasoning-parser qwen3
+SGLANG_REQUIRED_GPU_MEMORY=8192
+```
+
+
+
+
+## 监控下载与初始化状态
+
+你可以通过实例内置面板跟踪模型下载、查看性能指标,并获取 API 连接信息。
+
+1. 在底座详情页右侧的 **Instances** 面板中找到你的部署。
+2. 当状态显示底层服务正在运行时,点击 **Open**。模型实例的 **llm-init** 页面会随之打开。
+3. 在 **STATUS** 标签页中确认部署进度:
+ - **DOWNLOAD**:实时显示下载百分比、速度和预计完成时间(ETA)。
+ - **STATUS**:跟踪失败或重试次数。如果出现网络中断或模型源地址格式错误,修复后点击 **Retry**。
+ - **ENGINE**:显示初始化状态。
+
+ - 确认两个追踪标签均显示 **Engine alive: yes** 和 **Model exists: yes**,这表示引擎已在线并可接受请求。
+ - 复制模型名称和 OpenAI 兼容 API 基础地址。
+
+4. 进入 **CONFIG** 标签页,查看运行限制、执行性能探测、查看基准历史或更新变量。
+
+## 将客户端应用连接到模型服务
+
+本地模型实例成功运行后,其他客户端应用可以使用标准 OpenAI API 模式连接该服务。
+
+1. 打开 Olares **Settings**,然后进入 **Applications** > **{Your-New-Model-Instance}** > **Shared entrance** > **{Engine} LLM API**。
+2. 复制端点 URL,并与你定义的 `MODEL_NAME` 一起填入客户端应用的模型配置部分。
+
+## 卸载模型实例
+
+1. 从应用市场打开目标底座应用。
+2. 在**实例**部分,找到目标模型实例,点击操作按钮旁的下拉箭头,然后点击**卸载**。