diff --git a/docs/.vitepress/en.ts b/docs/.vitepress/en.ts
index 4b65e7c0b0..24516d0ab9 100644
--- a/docs/.vitepress/en.ts
+++ b/docs/.vitepress/en.ts
@@ -216,6 +216,10 @@ const side = {
               text: "Clone applications",
               link: "/manual/olares/market/clone-apps",
             },
+            {
+              text: "Use LLM Base applications",
+              link: "/manual/olares/market/llm-base-apps",
+            },
             {
               text: "Manage paid applications",
               link: "/manual/olares/market/purchase-paid-apps",
diff --git a/docs/.vitepress/zh.ts b/docs/.vitepress/zh.ts
index bcd72e3a19..aca4fe16ba 100644
--- a/docs/.vitepress/zh.ts
+++ b/docs/.vitepress/zh.ts
@@ -214,6 +214,10 @@ const side = {
               text: "克隆应用",
               link: "/zh/manual/olares/market/clone-apps",
             },
+            {
+              text: "使用大模型底座",
+              link: "/zh/manual/olares/market/llm-base-apps",
+            },
             {
               text: "管理付费应用",
               link: "/zh/manual/olares/market/purchase-paid-apps",
diff --git a/docs/manual/olares/market/llm-base-apps.md b/docs/manual/olares/market/llm-base-apps.md
new file mode 100644
index 0000000000..09532f54d6
--- /dev/null
+++ b/docs/manual/olares/market/llm-base-apps.md
@@ -0,0 +1,304 @@
+---
+outline: [2, 3]
+description: Learn how to use the LLM Base applications on Olares to self-host large language models and run different inference engines by cloning the base apps.
+---
+
+# Host local large language models with LLM Base apps
+
+Olares V1.12.6 introduces the local hosting and management platform for large language models (LLMs), a self-hosting solution powered by the `llm-init` project. This platform provides four LLM Base applications, each for one inference engine: **Ollama LLM Base**, **vLLM LLM Base**, **llama.cpp LLM Base**, and **SGLang LLM Base**. Select the base app for the engine you want, use it to deploy different models, and then monitor model performance through a dedicated console. 
+
+## Before you start
+
+- Your Olares system has been upgraded to V1.12.6 or later.
+
+## Locate LLM Base apps
+
+1. Open Market and search for "LLM Base". Four base apps appear: vLLM LLM Base (llm-init), SGLang LLM Base (llm-init), Ollama LLM Base (llm-init), and llama.cpp LLM Base (llm-init).
+
+    ![LLM Base apps in Market](/images/manual/olares/llm-base-apps.png#bordered)
+
+2. Each base app is optimized for a different inference scenario. Choose one based on your model source, performance needs, and hardware.
+
+    | Base app | When to choose |
+    | :--- | :--- |
+    | **llama.cpp LLM Base (llm-init)** | Choose llama.cpp when you are running lightweight<br> GGUF models or deploying with limited GPU memory. |    
+    | **Ollama LLM Base (llm-init)** | Choose Ollama when you want to get started quickly with<br> broad model compatibility. It pulls models automatically<br> using native model tags, making it ideal for chat and embedding tasks. |
+    | **vLLM LLM Base (llm-init)** | Choose vLLM when you need high-throughput serving <br>of Hugging Face models under heavy concurrent load. |
+    | **SGLang LLM Base (llm-init)** | Choose SGLang when you need efficient structured<br> generation or advanced reasoning optimizations. |
+
+## Create a new model instance
+
+An LLM Base app serves as a template. To run a model, you must first clone the base app into an independent running instance.
+
+1. Select the base app that matches your preferred inference engine, and then click **View** on it. For example, **llama.cpp LLM Base (llm-init)**.
+2. Click **Create** to initialize a new instance.
+
+    ![Create a model instance](/images/manual/olares/llm-base-apps-create-instance1.png#bordered)
+
+3. Specify the instance identity settings:
+
+    - **New app name**: Enter a unique name for the instance. This name is displayed as the app name in Market and Settings. For example, `Qwen3.6-35B-A3B`.
+    - **Shortcut name for {client}**: Enter a unique shortcut name for the instance. This name is displayed on the Launchpad. For example, `qwen3.6-35b-a3b`.
+
+4. Click **Create** to proceed to the environment configuration.
+
+## Configure engine environment variables
+
+After creating the instance, the configuration window opens. Define where your engine pulls the model, how much memory it uses, and what capabilities it exposes to other client apps.
+
+1. In the **Configure environment variables for {New-app-name}** window, fill in the following details according to the target model and engine:
+
+    | Variable | Description |
+    | :--- | :--- |
+    | **MODEL_SOURCE** | Specify where the engine pulls the model.<br><br>The format depends on the selected engine:<ul><li>**Ollama**: `ollama://<model>:<size>`<br>Example: `ollama://qwen3.5:2b`</li><li>**llama.cpp**: `hf://<repo> --include <file>.gguf`<br>Example: `hf://unsloth/Qwen3.6-35B-A3B-GGUF --include Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf`</li><li>**vLLM** / **SGLang**: `hf://<repo>`<br>Example: `hf://Qwen/Qwen3.5-2B`</li></ul> |
+    | **MODEL_NAME** | Define the name that client apps use to call this instance.<br><br> Derive it from `MODEL_SOURCE` per engine:<ul><li>**Ollama**: Use the string after `ollama://`.<br>`MODEL_SOURCE`: `ollama://qwen3.5:2b`<br>`MODEL_NAME`: `qwen3.5:2b`</li><li>**llama.cpp**: Use the repo name plus the quantization tag (one quantization per instance).<br>`MODEL_SOURCE`: `hf://unsloth/Qwen3.6-35B-A3B-GGUF --include Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf`<br>`MODEL_NAME`: `unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL`</li><li>**vLLM** / **SGLang**: Use the string after `hf://`.<br>`MODEL_SOURCE`: `hf://Qwen/Qwen3.5-2B`<br>`MODEL_NAME`: `Qwen/Qwen3.5-2B`</li></ul> |
+    | **MODEL_MODE** | Select **Chat** or **Embedding**. |
+    | **MODEL_SUPPORTS** | Select the capabilities the model supports: **Vision**, **Tools**,<br>**Thinking**, or **Embedding**. |
+    | **ENGINE_ARGS** | Specify the engine startup parameters, separated by spaces.<br><br>The format depends on the engine:<ul><li>**Ollama**: `OLLAMA_CONTEXT_LENGTH=8192`</li><li>**llama.cpp**: `-c 65536 -ngl all`</li><li>**vLLM**: `--max-model-len 65536`</li><li>**SGLang**: `--context-length 65536`</li></ul>For more arguments, see [Engine tuning arguments](#reference-engine-tuning-arguments). |
+    | **{ENGINE}_REQUIRED<br>_GPU_MEMORY** | Enter the minimum GPU memory the instance needs to start,<br>in MB or Gi. For example, `20Gi`.<ul><li>In time slicing or exclusive mode, set it below your total VRAM.</li><li>In memory slicing mode, set it below your remaining VRAM.</li></ul> |
+
+2. Click **Confirm** to save the configuration and start the instance installation.
+
+    An **Instances** panel appears on the right side of the page, showing the installation progress. Once the setup completes, the instance's operation button changes to **Open**, indicating that the base service is running. A model app with the same name also appears on the Launchpad.
+
+    ![Model instance installed](/images/manual/olares/llm-base-model-instance-installed1.png#bordered)
+
+    :::info
+    Model instances created from LLM Base apps show a `From template` tag next to the app name. You can see this tag when viewing the app in Market or Settings.
+
+    ![Model instance tag](/images/manual/olares/llm-base-model-instance-tag1.png#bordered){width=70%}   
+    :::
+
+:::tip Update variables later
+To change these variables after installation, go to Olares **Settings** > **Applications** > **[App-Name]** > **Manage environment variables**. Click the edit icon next to a variable, update its value, save your change, and then click **Apply**.
+:::
+
+### Reference: Engine tuning arguments
+
+Use the `ENGINE_ARGS` variable to add custom settings that adjust memory usage, context limits, and processing behaviors. Separate multiple arguments with spaces. Select your inference engine below to view the available tuning arguments.
+
+<tabs>
+<template #Ollama>
+
+| Argument | Purpose | Recommended Example |
+| :--- | :--- | :--- |
+| `OLLAMA_CONTEXT_LENGTH` | Sets the default context window<br> size in tokens. <br><br>Default scales by VRAM:<ul><li>Less than 24G: 4096</li><li>Between 24G and 48G: 32768</li><li>48G and more: 262144</li></ul> | `8192` to `131072` |
+| `OLLAMA_KEEP_ALIVE` | Sets model resident duration in memory<br> after the last request. Use `-1` <br>for permanent retention. <br><br>Default: `5m`. | `30m` or `-1` |
+| `OLLAMA_FLASH_ATTENTION` | Enables Flash Attention to optimize<br> memory efficiency during long-context<br> operations. <br><br>Default: `0` (Disabled). | `1` (Enabled) |
+| `OLLAMA_KV_CACHE_TYPE` | Sets the KV cache quantization type<br> to save video memory. <br><br>Default: `f16`. | `q8_0` (minor precision loss) or `q4_0` |
+| `OLLAMA_NUM_PARALLEL` | Sets the number of concurrent<br> requests processed per model. <br><br>Ollama determines this automatically<br> based on your available VRAM, <br>typically `1` or `4`.<br><br>Default: `0` | `1` |
+
+For other Ollama arguments, see the [official documentation](https://github.com/ollama/ollama/blob/main/docs/faq.mdx).
+</template>
+<template #vLLM>
+
+| Argument | Purpose | Recommended Example |
+| :--- | :--- | :--- |
+| `--max-model-len` | Sets the maximum context length.<br>Lower it if you hit out-of-memory errors. | `65536` |
+| `--gpu-memory-utilization` | Sets the fraction of GPU memory the<br> vLLM engine may use. | `0.9` |
+| `--tensor-parallel-size` | Sets the tensor-parallel size, that is,<br> how many GPUs split and run one<br> model together. | `1` |
+| `--max-num-batched-tokens` | Caps the number of tokens processed<br> per batch, preventing sharp latency<br> spikes. | `8192` |
+| `--enable-prefix-caching` | Caches the KV Cache of shared prompt<br> prefixes and reuses it across requests. | Enabled |
+| `--kv-cache-dtype` | Sets the KV Cache data type. Using<br> `fp8` raises throughput while preserving<br> quality. <br><br>Accepted values: `auto`, `bfloat16`,<br> `fp8`, `fp8_ds_mla`, `fp8_e4m3`,<br> `fp8_e5m2`, `fp8_inc`. <br><br>With `auto` (default), the KV Cache type<br> matches the model weights (usually<br> `float16` or `bfloat16`). | `fp8` |
+
+For other vLLM arguments, see the [official documentation](https://docs.vllm.ai/en/v0.17.0/configuration/engine_args/).
+</template>
+<template #Llama-cpp>
+
+| Argument | Purpose | Recommended Example |
+| :--- | :--- | :--- |
+| `-c` | Sets the maximum context length<br> in tokens. | `65536` |
+| `-ngl` | Offloads all model layers to the GPU<br> to avoid CPU-bound slowdowns. | `all` |
+| `-fa` | Enables Flash Attention to speed up<br> attention computation. | `on` |
+| `-ctk` / `-ctv` | Quantizes the KV Cache to 8-bit,<br> balancing GPU memory use and precision. | `q8_0` |
+| `--spec-type` | Enables MTP (speculative decoding). | `draft-mtp` |
+| `--spec-draft-n-max` | Sets the maximum number of tokens<br> the drafter guesses ahead per<br> speculative step. | `3` |
+
+For other llama.cpp arguments, see the [official documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
+</template>
+<template #SGLang>
+
+| Argument | Purpose | Recommended Example |
+| :--- | :--- | :--- |
+| `--context-length` | Sets the maximum context length. | `65536` |
+| `--mem-fraction-static` | Sets the fraction of GPU memory<br> pre-allocated for static usage, similar<br> to vLLM's `--gpu-memory-utilization`. | `0.85` |
+| `--chunked-prefill-size` | Splits very long inputs into chunks so<br> they don't block the GPU for long,<br> keeping concurrent requests' streaming<br> smooth. | `4096` |
+| `--reasoning-parser` | Separates chain-of-thought output:<br> writes the model's reasoning to the<br> `reasoning_content` field and the final<br> answer to the `content` field. Set it<br> to match the model. | `gpt-oss` |
+| `--tool-call-parser` | Enables parsing of function-call<br> (tool use) output. Set it to match<br> the model. | `gpt-oss` |
+
+For other SGLang arguments, see the [official documentation](https://docs.sglang.io/docs/advanced_features/server_arguments).
+</template>
+</tabs>
+
+<!--
+| Argument | Purpose | Recommended Example |
+| :--- | :--- | :--- |
+| `OLLAMA_CONTEXT_LENGTH` | Sets the default context window<br> size in tokens. <br><br>Default scales by VRAM:<ul><li>Less than 24G: 4096</li><li>Between 24G and 48G: 32768</li><li>48G and more: 262144</li></ul> | `8192` to `131072` |
+| `OLLAMA_NUM_PARALLEL` | Sets the number of concurrent<br> requests processed per model. <br><br>Ollama determines this automatically<br> based on your available VRAM, <br>typically `1` or `4`.<br><br>Default: `0` | `1` |
+| `OLLAMA_KV_CACHE_TYPE` | Sets the KV cache quantization type<br> to save video memory. <br><br>Default: `f16`. | `q8_0` (minor precision loss) or `q4_0` |
+| `OLLAMA_FLASH_ATTENTION` | Enables Flash Attention to optimize<br> memory efficiency during long-context<br> operations. <br><br>Default: `0` (Disabled). | `1` (Enabled) |
+| `OLLAMA_MAX_LOADED_MODELS` | Sets the maximum number of models<br> kept loaded in memory simultaneously.<br>It automatically scales to roughly 3<br> models per available GPU<br><br>Default: `0`. | `1` |
+| `OLLAMA_MAX_QUEUE` | Sets the maximum number of incoming<br> requests allowed in the processing queue. <br><br>Default: `512`. | `512` |
+| `OLLAMA_KEEP_ALIVE` | Sets model resident duration in memory<br> after the last request. Use `-1` <br>for permanent retention. <br><br>Default: `5m`. | `30m` or `-1` |
+| `OLLAMA_LOAD_TIMEOUT` | Sets the maximum duration to wait for<br> a model to finish loading before giving up. <br><br>Default: `5m`. | `5m` |
+| `OLLAMA_GPU_OVERHEAD` | Sets the amount of video memory <br>in bytes reserved as a safety margin<br> overhead. <br><br>Default: `0`. | `0` |
+| `OLLAMA_DEBUG` | Sets the system log level for <br>troubleshooting. <br><br>Default: `0` (Info). | `1` (Debug) |
+-->
+
+### Sample engine configurations
+
+<tabs>
+<template #Ollama>
+
+Ollama pulls models automatically using native model tags.
+
+**Chat Model Example**
+```text
+MODEL_SOURCE=ollama://qwen3.5:2b
+MODEL_NAME=qwen3.5-2b
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_function_calling,supports_tool_choice
+ENGINE_ARGS=OLLAMA_CONTEXT_LENGTH=8192
+OLLAMA_REQUIRED_GPU_MEMORY=4096
+```
+
+**Embedding Model Example**
+```text
+MODEL_SOURCE=ollama://nomic-embed-text
+MODEL_NAME=nomic-embed-text
+MODEL_MODE=supports_embedding
+MODEL_SUPPORTS=embedding
+ENGINE_ARGS=OLLAMA_KEEP_ALIVE=-1
+OLLAMA_REQUIRED_GPU_MEMORY=4096
+```
+
+</template>
+<template #vLLM>
+
+```text
+MODEL_SOURCE=hf://Qwen/Qwen3.5-2B
+MODEL_NAME=Qwen/Qwen3.5-2B
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_reasoning
+ENGINE_ARGS=--max-model-len 8192 --gpu-memory-utilization 0.9 --tensor-parallel-size 1
+VLLM_REQUIRED_GPU_MEMORY=10Gi
+```
+
+</template>
+<template #Llama-cpp>
+
+```text
+MODEL_SOURCE=hf://unsloth/Qwen3.5-2B-GGUF --include Qwen3.5-2B-UD-Q4_K_XL.gguf,hf://unsloth/Qwen3.5-2B-GGUF --include mmproj-F16.gguf
+MODEL_NAME=unsloth/Qwen3.5-2B-GGUF:UD-Q4_K_XL
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_vision,supports_reasoning
+ENGINE_ARGS=-c 65536 -ngl all -fa on
+LLAMACPP_REQUIRED_GPU_MEMORY=8192
+```
+
+</template>
+<template #SGLang>
+
+```text
+MODEL_SOURCE=hf://Qwen/Qwen3.5-2B
+MODEL_NAME=Qwen/Qwen3.5-2B
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_function_calling,supports_tool_choice,supports_reasoning,supports_thinking
+ENGINE_ARGS=--context-length 32768 --mem-fraction-static 0.85 --max-running-requests 64 --reasoning-parser qwen3
+SGLANG_REQUIRED_GPU_MEMORY=8192
+```
+</template>
+</tabs>
+
+## Monitor deployment and configure settings
+
+Track model downloads, verify engine readiness, and manage operational parameters through the built-in model console.
+
+1. Locate the model instance in the **Instances** panel on the LLM Base app details page, or find it on the Launchpad.
+2. Open it to launch the dedicated model console.
+
+    The console opens on the **Status** tab by default. The model files are start downloading automatically.
+
+    ![Model console status tab](/images/manual/olares/llm-base-model-console-status.png#bordered)
+
+3. Tracks the readiness of the model and the engine:
+
+    - **Model**: Shows `READY` once the files are downloaded and verified. Copy the **Model name** to use for client app connections.
+    - **Engine**: Shows `RUNNING` once the inference service is online. Configure how client apps reach it:
+        - **WHO IS CALLING**: Select who can access the API, **Apps in Olares**, **Devices in LAN**, or **Remote**.
+        - **WHAT API FORMAT**: Select the API format. The available options depend on the engine, for example **OpenAI-Compatible**, **Anthropic-Compatible**, or **Ollama**.
+        - **Base URL**: Copy this URL to use for client app connections.
+        - **Supported Endpoints**: Expand to see the available API endpoints.
+
+    ![Model console ready](/images/manual/olares/llm-base-model-console-ready.png#bordered)
+
+4. Select the **Config** tab to review the model's capabilities and parameters, check how it sits on the GPU, and measure its performance.
+
+    ![Model console, config page](/images/manual/olares/llm-base-model-console-config.png#bordered)
+
+    - **Model card** (top): Shows the model name, mode (**Chat** or **Embedding**), and the capability tags the instance exposes, such as `function_calling`, `parallel_function_calling`, `reasoning`, `reasoning_effort`, and `tool_choice`.
+    - **Parameters**: View the engine parameters. Expand **Advanced parameters** for the full set, and use the **Form** / **Raw** toggle to switch the view.
+    - **GPU Residency**: Confirm whether the model is actually running on the GPU. Click **Detect** to refresh, then check **Mode**:
+        - `full GPU`: All layers run on the GPU. This is the expected, fastest state.
+        - `partial` or `cpu_only`: Part or all of the model fell back to the CPU, which makes inference much slower. On a GPU host this usually means an environment mis-mount, so review your `{ENGINE}_REQUIRED_GPU_MEMORY` setting and engine arguments.
+
+        The panel also reports **VRAM**, **KV cache used**, and **GPU mem util**, so you can see how much memory the model occupies and how much headroom is left for longer contexts or more concurrent requests.
+    - **Performance**: Click **Run test** to benchmark responsiveness:
+        - **TTFT** (Time To First Token): How long a user waits before the first word appears. Lower means a snappier experience.
+        - **perf.code**: How long the engine takes to load the model from scratch, for example after a restart.
+
+        Use these numbers to compare quantization levels, context sizes, or engine arguments, and to confirm that a change actually improved speed before you rely on it.
+
+## Connect client apps to the model service
+
+Once the model instance is running, any client app that speaks the OpenAI-compatible API can connect to it through the base URL. 
+
+The following example uses [OpenCode](../../../use-cases/opencode.md) as the client.
+
+1. In the model console, go to the **Status** tab. Under **Service status**:
+
+    - **WHO IS CALLING**: Select **Apps in Olares**, because OpenCode runs inside Olares.
+    - **WHAT API FORMAT**: Select **OpenAI-Compatible**.
+    - Copy the **Base URL** and note down the **Model name**.
+
+2. In OpenCode, click <i class="material-symbols-outlined">settings</i> in the bottom-left corner, select **Providers**, then scroll down and select **Connect** next to **Custom Provider**.
+
+3. Enter the following details:
+
+    - **Provider ID**: A unique identifier for this provider. For example, `olares-llm`.
+    - **Display name**: The name shown in the provider list. For example, `Olares LLM`.
+    - **Base URL**: The **Base URL** you copied from the model console.
+    - **Models**:
+        - **Model ID**: Your `MODEL_NAME`. For example, `Qwen3.6-35B-A3B`.
+        - **Display Name**: The name shown for this model. For example, `Qwen3.6 35B A3B`.
+
+4. Click **Submit** to save the configuration. The provider appears in the provider list.
+5. Run a task to test the connection. This example uses the Olares skill to upload and deploy an app to Olares.
+
+    a. At the top, click the **Search** field and select **Toggle terminal** to open a terminal.
+
+    b. Log in to the Olares CLI. Replace `alice123@olares.com` with your own Olares ID: `olares-cli profile login --olares-id alice123@olares.com`.
+
+    c. When prompted, type your Olares password and press **Enter**. The input stays hidden.
+
+    d. Below the chat box, select **Big Pickle** to open the model selector, and select **Qwen3.6 35B A3B** from the list.
+
+    e. Send a task:
+
+    ```text
+    Upload and deploy this app to Olares:
+    https://github.com/chandruk4321/dockerize-static-web-project
+    ```
+
+    f. Respond to OpenCode's questions, decisions, and approvals until the task finishes.
+
+    ![Running the task in OpenCode](/images/manual/olares/llm-base-model-inst-test.png#bordered)
+
+    In this example, the Todo app is uploaded and deployed to **My Olares**. Open it from **My Olares** or the Launchpad to use the running app.
+
+    ![Todo app deployed to My Olares](/images/manual/olares/llm-base-model-inst-task.png#bordered)
+
+## Uninstall model instances
+
+1. Open Market, go to **My Olares**, and then locate the model instance app.
+2. Click the drop-down arrow next to the operation button, and then select **Uninstall**.
diff --git a/docs/public/images/manual/olares/llm-base-apps-create-instance.png b/docs/public/images/manual/olares/llm-base-apps-create-instance.png
new file mode 100644
index 0000000000..7fb39fa78e
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-apps-create-instance.png differ
diff --git a/docs/public/images/manual/olares/llm-base-apps-create-instance1.png b/docs/public/images/manual/olares/llm-base-apps-create-instance1.png
new file mode 100644
index 0000000000..8cd564d482
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-apps-create-instance1.png differ
diff --git a/docs/public/images/manual/olares/llm-base-apps.png b/docs/public/images/manual/olares/llm-base-apps.png
new file mode 100644
index 0000000000..9109b0a03d
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-apps.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-console-config.png b/docs/public/images/manual/olares/llm-base-model-console-config.png
new file mode 100644
index 0000000000..08e0cdf699
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-console-config.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-console-ready.png b/docs/public/images/manual/olares/llm-base-model-console-ready.png
new file mode 100644
index 0000000000..fe2e80a051
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-console-ready.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-console-status.png b/docs/public/images/manual/olares/llm-base-model-console-status.png
new file mode 100644
index 0000000000..5a4b80fb54
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-console-status.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-inst-task.png b/docs/public/images/manual/olares/llm-base-model-inst-task.png
new file mode 100644
index 0000000000..bebb9bd569
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-inst-task.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-inst-test.png b/docs/public/images/manual/olares/llm-base-model-inst-test.png
new file mode 100644
index 0000000000..6ebdb900a3
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-inst-test.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-instance-installed.png b/docs/public/images/manual/olares/llm-base-model-instance-installed.png
new file mode 100644
index 0000000000..94f1b150d0
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-installed.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-instance-installed1.png b/docs/public/images/manual/olares/llm-base-model-instance-installed1.png
new file mode 100644
index 0000000000..b7ffe08841
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-installed1.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-instance-tag.png b/docs/public/images/manual/olares/llm-base-model-instance-tag.png
new file mode 100644
index 0000000000..7bf69eb34f
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-tag.png differ
diff --git a/docs/public/images/manual/olares/llm-base-model-instance-tag1.png b/docs/public/images/manual/olares/llm-base-model-instance-tag1.png
new file mode 100644
index 0000000000..45aeb08396
Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-tag1.png differ
diff --git a/docs/zh/manual/olares/market/llm-base-apps.md b/docs/zh/manual/olares/market/llm-base-apps.md
new file mode 100644
index 0000000000..4b1664f25f
--- /dev/null
+++ b/docs/zh/manual/olares/market/llm-base-apps.md
@@ -0,0 +1,224 @@
+---
+outline: [2, 3]
+description: 了解如何在 Olares 中使用大模型底座（LLM Base App）托管本地大语言模型，并通过克隆底座运行 Ollama、vLLM、llama.cpp 或 SGLang 等推理引擎。
+---
+
+# 使用大模型底座托管本地大语言模型
+
+Olares V1.12.6 推出了基于 `llm-init` 项目的本地大语言模型（LLM）托管与管理平台。该平台提供四个大模型底座应用，分别对应四种推理引擎：**Ollama LLM Base**、**vLLM LLM Base**、**llama.cpp LLM Base** 和 **SGLang LLM Base**。选择对应引擎的底座，部署不同模型，并通过专属面板监控模型运行状态。
+
+## 开始之前
+
+- 你的 Olares 系统已升级至 V1.12.6 或更高版本。
+
+## 找到大模型底座
+
+1. 打开 **Market**，搜索“LLM Base”。
+
+   ![应用市场中的大模型底座](/images/manual/olares/llm-base-apps.png#bordered)
+
+2. 每个底座应用针对不同的推理场景做了优化。根据你的模型来源、性能需求和硬件条件进行选择：
+
+   | 底座应用 | 适用场景 |
+   | :--- | :--- |
+   | **Ollama LLM Base (llm-init)** | 快速上手和广泛的模型兼容。Ollama 可通过原生模型标签自动拉取模型，最适合聊天和嵌入任务。 |
+   | **vLLM LLM Base (llm-init)** | 高并发场景下对 HuggingFace 模型进行高吞吐量推理服务。 |
+   | **SGLang LLM Base (llm-init)** | 需要高效结构化生成和高级推理服务优化的场景。 |
+   | **llama.cpp LLM Base (llm-init)** | 轻量 GGUF 模型、显存有限或资源紧张的部署环境。 |
+
+## 创建新的模型实例
+
+大模型底座只是一个模板。要运行模型，你需要先将底座克隆为独立的运行实例。
+
+1. 选择与你所需推理引擎匹配的底座，然后点击 **View**。例如，**Ollama LLM Base (llm-init)**。
+2. 点击 **Create**，初始化一个新的实例。
+3. 配置实例标识：
+
+   - **New app name**：输入实例的唯一名称。该名称会显示在 **Market** > **My Olares** 中。
+   - **Shortcut name for {client}**：输入在启动台上显示的唯一快捷方式名称。
+
+4. 点击 **Create**，进入环境变量配置。
+
+## 配置引擎环境变量
+
+创建实例后，会弹出配置窗口。你需要定义引擎从哪里拉取模型、使用多少显存，以及向其他客户端应用暴露哪些能力。
+
+1. 在 **Configure environment variables for {New-app-name}** 窗口中，根据目标模型和引擎填写以下信息：
+
+   | 变量 | 说明 |
+   | :--- | :--- |
+   | **MODEL_SOURCE** | 指定模型源地址。<br>格式取决于所选引擎。<br>示例：`ollama://qwen3.5:0.8b` 或 `hf://Qwen/Qwen3.5-2B`。 |
+   | **MODEL_NAME** | 指定客户端应用连接时使用的模型名称。<br>示例：`qwen3.5-2b`。 |
+   | **MODEL_MODE** | 选择 **Chat** 或 **Embedding**。<br>示例：`chat`。 |
+   | **MODEL_SUPPORTS** | 输入逗号分隔的模型能力标志：<ul><li>**推理模型**：包含 `supports_reasoning`。</li><li>**工具调用模型**：包含 `supports_function_calling`。<br>如需同时处理多个任务，再添加 `supports_parallel_function_calling`。</li><li>**视觉模型**：包含 `supports_vision`。</li><li>**嵌入模型**：留空，不要填写聊天相关能力标志。</li></ul>示例：`supports_function_calling,supports_tool_choice`。<br>参考：[模型能力标志](#model-capability-flags)。 |
+   | **ENGINE_ARGS** | 指定引擎启动参数。<br>多个参数之间用空格分隔。<br>示例：`OLLAMA_CONTEXT_LENGTH=4096`。<br>参考：[引擎调优参数](#engine-tuning-arguments)。 |
+   | **{ENGINE}_REQUIRED<br>_GPU_MEMORY** | 设置实例所需的最低显存，单位为 MB 或 Gi。<br>示例：`8192`。 |
+
+2. 点击 **Confirm** 保存配置并开始安装实例。
+
+   页面右侧会出现 **Instances** 面板，显示安装进度。安装完成后，实例操作按钮会变为 **Open**，表示底层服务正在运行。
+
+   ![大模型底座实例安装完成](/images/manual/olares/llm-base-model-instance-installed.png#bordered)
+
+### 参考：模型能力标志 {#model-capability-flags}
+
+`MODEL_SUPPORTS` 变量声明模型向外部客户端暴露的能力。这些标志对所有推理引擎通用。
+
+| 类别 | 支持的标志 |
+| --- | --- |
+| **核心** | `supports_vision`、`supports_function_calling`、<br>`supports_reasoning`、`supports_native_streaming`、<br>`supports_response_schema`、`supports_prompt_caching`、<br>`supports_web_search`、`supports_parallel_function_calling` |
+| **多模态** | `supports_audio_input`、`supports_audio_output`、<br>`supports_video_input`、`supports_pdf_input`、<br>`supports_computer_use`、`supports_url_context` |
+| **推理与控制 token** | `supports_reasoning_effort`、`supports_thinking`、<br>`supports_assistant_prefill`、`supports_tool_choice`、<br>`supports_tokenizer` |
+| **采样控制** | `supports_system_messages`、`supports_temperature`、<br>`supports_top_p`、`supports_top_k`、<br>`supports_stop_sequences`、`supports_frequency_penalty`、<br>`supports_presence_penalty` |
+| **响应形式** | `supports_n`、`supports_logprobs`、`supports_seed`、<br>`supports_response_format`、`supports_logit_bias`、`supports_user` |
+
+### 参考：引擎调优参数 {#engine-tuning-arguments}
+
+使用 `ENGINE_ARGS` 变量来调整显存占用、上下文长度和处理行为。点击下方推理引擎查看可用调优参数。
+
+<tabs>
+<template #Ollama>
+
+| 参数 | 用途 | 推荐示例 |
+| :--- | :--- | :--- |
+| `OLLAMA_CONTEXT_LENGTH` | 设置默认上下文窗口大小（以 token 为单位）。<br><br>默认根据显存自动调整：<ul><li>小于 24G：4096</li><li>24G 到 48G 之间：32768</li><li>48G 及以上：262144</li></ul> | `8192` 到 `131072` |
+| `OLLAMA_NUM_PARALLEL` | 设置每个模型可同时处理的并发请求数。<br><br>Ollama 会根据可用显存自动决定，通常为 `1` 或 `4`。<br><br>默认值：`0` | `1` |
+| `OLLAMA_KV_CACHE_TYPE` | 设置 KV 缓存量化类型以节省显存。<br><br>默认值：`f16`。 | `q8_0`（轻微精度损失）或 `q4_0` |
+| `OLLAMA_FLASH_ATTENTION` | 启用 Flash Attention，以优化长上下文场景下的显存效率。<br><br>默认值：`0`（关闭）。 | `1`（开启） |
+| `OLLAMA_MAX_LOADED_MODELS` | 设置可同时加载在内存中的模型数量上限。<br>默认会根据每块可用 GPU 自动扩展到约 3 个模型。<br><br>默认值：`0`。 | `1` |
+| `OLLAMA_MAX_QUEUE` | 设置处理队列中允许的最大请求数。<br><br>默认值：`512`。 | `512` |
+| `OLLAMA_KEEP_ALIVE` | 设置最后一次请求后模型在内存中保留的时长。使用 `-1` 表示永久保留。<br><br>默认值：`5m`。 | `30m` 或 `-1` |
+| `OLLAMA_LOAD_TIMEOUT` | 设置等待模型加载完成的最大时长。<br><br>默认值：`5m`。 | `5m` |
+| `OLLAMA_GPU_OVERHEAD` | 设置预留的安全显存余量（字节）。<br><br>默认值：`0`。 | `0` |
+| `OLLAMA_DEBUG` | 设置系统日志级别，用于排查问题。<br><br>默认值：`0`（Info）。 | `1`（Debug） |
+
+</template>
+<template #vLLM>
+
+占位符
+
+| 参数 | 用途 | 推荐示例 |
+| :--- | :--- | :--- |
+| `--max-model-len` | 设置最大上下文窗口大小。如遇到显存不足，可适当减小。 | `8192` |
+| `--gpu-memory-utilization` | 设置模型可使用的显存比例上限。 | `0.9` |
+| `--tensor-parallel-size` | 设置用于张量并行的 GPU 数量。 | `1` |
+
+</template>
+<template #Llama-cpp>
+
+占位符
+
+| 参数 | 用途 | 推荐示例 |
+| :--- | :--- | :--- |
+| `-c` | 设置最大上下文窗口大小（以 token 为单位）。 | `65536` |
+| `-ngl` | 将模型层 offload 到 GPU。 | `all` |
+| `-fa` | 启用 Flash Attention 以降低显存占用。 | `on` |
+
+</template>
+<template #SGLang>
+
+占位符
+
+| 参数 | 用途 | 推荐示例 |
+| :--- | :--- | :--- |
+| `--context-length` | 设置最大上下文长度。 | `32768` |
+| `--mem-fraction-static` | 设置用于静态用途（模型权重和 KV 缓存）的显存比例。 | `0.85` |
+| `--max-running-requests` | 设置并发处理的最大请求数。 | `64` |
+| `--reasoning-parser` | 配置推理模型的解析器。 | `qwen3` |
+
+</template>
+</tabs>
+
+### 引擎配置示例
+
+<tabs>
+<template #Ollama>
+
+Ollama 使用原生模型标签自动拉取模型。
+
+**聊天模型示例**
+```text
+MODEL_SOURCE=ollama://qwen3.5:2b
+MODEL_NAME=qwen3.5-2b
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_function_calling,supports_tool_choice
+ENGINE_ARGS=OLLAMA_CONTEXT_LENGTH=8192
+OLLAMA_REQUIRED_GPU_MEMORY=4096
+```
+
+**嵌入模型示例**
+```text
+MODEL_SOURCE=ollama://nomic-embed-text
+MODEL_NAME=nomic-embed-text
+MODEL_MODE=embedding
+MODEL_SUPPORTS=
+ENGINE_ARGS=OLLAMA_KEEP_ALIVE=-1
+OLLAMA_REQUIRED_GPU_MEMORY=4096
+```
+
+</template>
+<template #vLLM>
+
+```text
+MODEL_SOURCE=hf://Qwen/Qwen3.5-2B
+MODEL_NAME=Qwen/Qwen3.5-2B
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_reasoning
+ENGINE_ARGS=--max-model-len 8192 --gpu-memory-utilization 0.9 --tensor-parallel-size 1
+VLLM_REQUIRED_GPU_MEMORY=10Gi
+```
+
+</template>
+<template #Llama-cpp>
+
+```text
+MODEL_SOURCE=hf://unsloth/Qwen3.5-2B-GGUF --include Qwen3.5-2B-UD-Q4_K_XL.gguf,hf://unsloth/Qwen3.5-2B-GGUF --include mmproj-F16.gguf
+MODEL_NAME=unsloth/Qwen3.5-2B-GGUF:UD-Q4_K_XL
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_vision,supports_reasoning
+ENGINE_ARGS=-c 65536 -ngl all -fa on
+LLAMACPP_REQUIRED_GPU_MEMORY=8192
+```
+
+</template>
+<template #SGLang>
+
+```text
+MODEL_SOURCE=hf://Qwen/Qwen3.5-2B
+MODEL_NAME=Qwen/Qwen3.5-2B
+MODEL_MODE=chat
+MODEL_SUPPORTS=supports_function_calling,supports_tool_choice,supports_reasoning,supports_thinking
+ENGINE_ARGS=--context-length 32768 --mem-fraction-static 0.85 --max-running-requests 64 --reasoning-parser qwen3
+SGLANG_REQUIRED_GPU_MEMORY=8192
+```
+
+</template>
+</tabs>
+
+## 监控下载与初始化状态
+
+你可以通过实例内置面板跟踪模型下载、查看性能指标，并获取 API 连接信息。
+
+1. 在底座详情页右侧的 **Instances** 面板中找到你的部署。
+2. 当状态显示底层服务正在运行时，点击 **Open**。模型实例的 **llm-init** 页面会随之打开。
+3. 在 **STATUS** 标签页中确认部署进度：
+   - **DOWNLOAD**：实时显示下载百分比、速度和预计完成时间（ETA）。
+   - **STATUS**：跟踪失败或重试次数。如果出现网络中断或模型源地址格式错误，修复后点击 **Retry**。
+   - **ENGINE**：显示初始化状态。
+
+     - 确认两个追踪标签均显示 **Engine alive: yes** 和 **Model exists: yes**，这表示引擎已在线并可接受请求。
+     - 复制模型名称和 OpenAI 兼容 API 基础地址。
+
+4. 进入 **CONFIG** 标签页，查看运行限制、执行性能探测、查看基准历史或更新变量。
+
+## 将客户端应用连接到模型服务
+
+本地模型实例成功运行后，其他客户端应用可以使用标准 OpenAI API 模式连接该服务。
+
+1. 打开 Olares **Settings**，然后进入 **Applications** > **{Your-New-Model-Instance}** > **Shared entrance** > **{Engine} LLM API**。
+2. 复制端点 URL，并与你定义的 `MODEL_NAME` 一起填入客户端应用的模型配置部分。
+
+## 卸载模型实例
+
+1. 从应用市场打开目标底座应用。
+2. 在**实例**部分，找到目标模型实例，点击操作按钮旁的下拉箭头，然后点击**卸载**。