diff --git a/docs/.vitepress/en.ts b/docs/.vitepress/en.ts index 4b65e7c0b0..24516d0ab9 100644 --- a/docs/.vitepress/en.ts +++ b/docs/.vitepress/en.ts @@ -216,6 +216,10 @@ const side = { text: "Clone applications", link: "/manual/olares/market/clone-apps", }, + { + text: "Use LLM Base applications", + link: "/manual/olares/market/llm-base-apps", + }, { text: "Manage paid applications", link: "/manual/olares/market/purchase-paid-apps", diff --git a/docs/.vitepress/zh.ts b/docs/.vitepress/zh.ts index bcd72e3a19..aca4fe16ba 100644 --- a/docs/.vitepress/zh.ts +++ b/docs/.vitepress/zh.ts @@ -214,6 +214,10 @@ const side = { text: "克隆应用", link: "/zh/manual/olares/market/clone-apps", }, + { + text: "使用大模型底座", + link: "/zh/manual/olares/market/llm-base-apps", + }, { text: "管理付费应用", link: "/zh/manual/olares/market/purchase-paid-apps", diff --git a/docs/manual/olares/market/llm-base-apps.md b/docs/manual/olares/market/llm-base-apps.md new file mode 100644 index 0000000000..09532f54d6 --- /dev/null +++ b/docs/manual/olares/market/llm-base-apps.md @@ -0,0 +1,304 @@ +--- +outline: [2, 3] +description: Learn how to use the LLM Base applications on Olares to self-host large language models and run different inference engines by cloning the base apps. +--- + +# Host local large language models with LLM Base apps + +Olares V1.12.6 introduces the local hosting and management platform for large language models (LLMs), a self-hosting solution powered by the `llm-init` project. This platform provides four LLM Base applications, each for one inference engine: **Ollama LLM Base**, **vLLM LLM Base**, **llama.cpp LLM Base**, and **SGLang LLM Base**. Select the base app for the engine you want, use it to deploy different models, and then monitor model performance through a dedicated console. + +## Before you start + +- Your Olares system has been upgraded to V1.12.6 or later. + +## Locate LLM Base apps + +1. Open Market and search for "LLM Base". Four base apps appear: vLLM LLM Base (llm-init), SGLang LLM Base (llm-init), Ollama LLM Base (llm-init), and llama.cpp LLM Base (llm-init). + + ![LLM Base apps in Market](/images/manual/olares/llm-base-apps.png#bordered) + +2. Each base app is optimized for a different inference scenario. Choose one based on your model source, performance needs, and hardware. + + | Base app | When to choose | + | :--- | :--- | + | **llama.cpp LLM Base (llm-init)** | Choose llama.cpp when you are running lightweight
GGUF models or deploying with limited GPU memory. | + | **Ollama LLM Base (llm-init)** | Choose Ollama when you want to get started quickly with
broad model compatibility. It pulls models automatically
using native model tags, making it ideal for chat and embedding tasks. | + | **vLLM LLM Base (llm-init)** | Choose vLLM when you need high-throughput serving
of Hugging Face models under heavy concurrent load. | + | **SGLang LLM Base (llm-init)** | Choose SGLang when you need efficient structured
generation or advanced reasoning optimizations. | + +## Create a new model instance + +An LLM Base app serves as a template. To run a model, you must first clone the base app into an independent running instance. + +1. Select the base app that matches your preferred inference engine, and then click **View** on it. For example, **llama.cpp LLM Base (llm-init)**. +2. Click **Create** to initialize a new instance. + + ![Create a model instance](/images/manual/olares/llm-base-apps-create-instance1.png#bordered) + +3. Specify the instance identity settings: + + - **New app name**: Enter a unique name for the instance. This name is displayed as the app name in Market and Settings. For example, `Qwen3.6-35B-A3B`. + - **Shortcut name for {client}**: Enter a unique shortcut name for the instance. This name is displayed on the Launchpad. For example, `qwen3.6-35b-a3b`. + +4. Click **Create** to proceed to the environment configuration. + +## Configure engine environment variables + +After creating the instance, the configuration window opens. Define where your engine pulls the model, how much memory it uses, and what capabilities it exposes to other client apps. + +1. In the **Configure environment variables for {New-app-name}** window, fill in the following details according to the target model and engine: + + | Variable | Description | + | :--- | :--- | + | **MODEL_SOURCE** | Specify where the engine pulls the model.

The format depends on the selected engine: | + | **MODEL_NAME** | Define the name that client apps use to call this instance.

Derive it from `MODEL_SOURCE` per engine: | + | **MODEL_MODE** | Select **Chat** or **Embedding**. | + | **MODEL_SUPPORTS** | Select the capabilities the model supports: **Vision**, **Tools**,
**Thinking**, or **Embedding**. | + | **ENGINE_ARGS** | Specify the engine startup parameters, separated by spaces.

The format depends on the engine:For more arguments, see [Engine tuning arguments](#reference-engine-tuning-arguments). | + | **{ENGINE}_REQUIRED
_GPU_MEMORY** | Enter the minimum GPU memory the instance needs to start,
in MB or Gi. For example, `20Gi`. | + +2. Click **Confirm** to save the configuration and start the instance installation. + + An **Instances** panel appears on the right side of the page, showing the installation progress. Once the setup completes, the instance's operation button changes to **Open**, indicating that the base service is running. A model app with the same name also appears on the Launchpad. + + ![Model instance installed](/images/manual/olares/llm-base-model-instance-installed1.png#bordered) + + :::info + Model instances created from LLM Base apps show a `From template` tag next to the app name. You can see this tag when viewing the app in Market or Settings. + + ![Model instance tag](/images/manual/olares/llm-base-model-instance-tag1.png#bordered){width=70%} + ::: + +:::tip Update variables later +To change these variables after installation, go to Olares **Settings** > **Applications** > **[App-Name]** > **Manage environment variables**. Click the edit icon next to a variable, update its value, save your change, and then click **Apply**. +::: + +### Reference: Engine tuning arguments + +Use the `ENGINE_ARGS` variable to add custom settings that adjust memory usage, context limits, and processing behaviors. Separate multiple arguments with spaces. Select your inference engine below to view the available tuning arguments. + + + + + + + + + + +### Sample engine configurations + + + + + + + + +## Monitor deployment and configure settings + +Track model downloads, verify engine readiness, and manage operational parameters through the built-in model console. + +1. Locate the model instance in the **Instances** panel on the LLM Base app details page, or find it on the Launchpad. +2. Open it to launch the dedicated model console. + + The console opens on the **Status** tab by default. The model files are start downloading automatically. + + ![Model console status tab](/images/manual/olares/llm-base-model-console-status.png#bordered) + +3. Tracks the readiness of the model and the engine: + + - **Model**: Shows `READY` once the files are downloaded and verified. Copy the **Model name** to use for client app connections. + - **Engine**: Shows `RUNNING` once the inference service is online. Configure how client apps reach it: + - **WHO IS CALLING**: Select who can access the API, **Apps in Olares**, **Devices in LAN**, or **Remote**. + - **WHAT API FORMAT**: Select the API format. The available options depend on the engine, for example **OpenAI-Compatible**, **Anthropic-Compatible**, or **Ollama**. + - **Base URL**: Copy this URL to use for client app connections. + - **Supported Endpoints**: Expand to see the available API endpoints. + + ![Model console ready](/images/manual/olares/llm-base-model-console-ready.png#bordered) + +4. Select the **Config** tab to review the model's capabilities and parameters, check how it sits on the GPU, and measure its performance. + + ![Model console, config page](/images/manual/olares/llm-base-model-console-config.png#bordered) + + - **Model card** (top): Shows the model name, mode (**Chat** or **Embedding**), and the capability tags the instance exposes, such as `function_calling`, `parallel_function_calling`, `reasoning`, `reasoning_effort`, and `tool_choice`. + - **Parameters**: View the engine parameters. Expand **Advanced parameters** for the full set, and use the **Form** / **Raw** toggle to switch the view. + - **GPU Residency**: Confirm whether the model is actually running on the GPU. Click **Detect** to refresh, then check **Mode**: + - `full GPU`: All layers run on the GPU. This is the expected, fastest state. + - `partial` or `cpu_only`: Part or all of the model fell back to the CPU, which makes inference much slower. On a GPU host this usually means an environment mis-mount, so review your `{ENGINE}_REQUIRED_GPU_MEMORY` setting and engine arguments. + + The panel also reports **VRAM**, **KV cache used**, and **GPU mem util**, so you can see how much memory the model occupies and how much headroom is left for longer contexts or more concurrent requests. + - **Performance**: Click **Run test** to benchmark responsiveness: + - **TTFT** (Time To First Token): How long a user waits before the first word appears. Lower means a snappier experience. + - **perf.code**: How long the engine takes to load the model from scratch, for example after a restart. + + Use these numbers to compare quantization levels, context sizes, or engine arguments, and to confirm that a change actually improved speed before you rely on it. + +## Connect client apps to the model service + +Once the model instance is running, any client app that speaks the OpenAI-compatible API can connect to it through the base URL. + +The following example uses [OpenCode](../../../use-cases/opencode.md) as the client. + +1. In the model console, go to the **Status** tab. Under **Service status**: + + - **WHO IS CALLING**: Select **Apps in Olares**, because OpenCode runs inside Olares. + - **WHAT API FORMAT**: Select **OpenAI-Compatible**. + - Copy the **Base URL** and note down the **Model name**. + +2. In OpenCode, click settings in the bottom-left corner, select **Providers**, then scroll down and select **Connect** next to **Custom Provider**. + +3. Enter the following details: + + - **Provider ID**: A unique identifier for this provider. For example, `olares-llm`. + - **Display name**: The name shown in the provider list. For example, `Olares LLM`. + - **Base URL**: The **Base URL** you copied from the model console. + - **Models**: + - **Model ID**: Your `MODEL_NAME`. For example, `Qwen3.6-35B-A3B`. + - **Display Name**: The name shown for this model. For example, `Qwen3.6 35B A3B`. + +4. Click **Submit** to save the configuration. The provider appears in the provider list. +5. Run a task to test the connection. This example uses the Olares skill to upload and deploy an app to Olares. + + a. At the top, click the **Search** field and select **Toggle terminal** to open a terminal. + + b. Log in to the Olares CLI. Replace `alice123@olares.com` with your own Olares ID: `olares-cli profile login --olares-id alice123@olares.com`. + + c. When prompted, type your Olares password and press **Enter**. The input stays hidden. + + d. Below the chat box, select **Big Pickle** to open the model selector, and select **Qwen3.6 35B A3B** from the list. + + e. Send a task: + + ```text + Upload and deploy this app to Olares: + https://github.com/chandruk4321/dockerize-static-web-project + ``` + + f. Respond to OpenCode's questions, decisions, and approvals until the task finishes. + + ![Running the task in OpenCode](/images/manual/olares/llm-base-model-inst-test.png#bordered) + + In this example, the Todo app is uploaded and deployed to **My Olares**. Open it from **My Olares** or the Launchpad to use the running app. + + ![Todo app deployed to My Olares](/images/manual/olares/llm-base-model-inst-task.png#bordered) + +## Uninstall model instances + +1. Open Market, go to **My Olares**, and then locate the model instance app. +2. Click the drop-down arrow next to the operation button, and then select **Uninstall**. diff --git a/docs/public/images/manual/olares/llm-base-apps-create-instance.png b/docs/public/images/manual/olares/llm-base-apps-create-instance.png new file mode 100644 index 0000000000..7fb39fa78e Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-apps-create-instance.png differ diff --git a/docs/public/images/manual/olares/llm-base-apps-create-instance1.png b/docs/public/images/manual/olares/llm-base-apps-create-instance1.png new file mode 100644 index 0000000000..8cd564d482 Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-apps-create-instance1.png differ diff --git a/docs/public/images/manual/olares/llm-base-apps.png b/docs/public/images/manual/olares/llm-base-apps.png new file mode 100644 index 0000000000..9109b0a03d Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-apps.png differ diff --git a/docs/public/images/manual/olares/llm-base-model-console-config.png b/docs/public/images/manual/olares/llm-base-model-console-config.png new file mode 100644 index 0000000000..08e0cdf699 Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-console-config.png differ diff --git a/docs/public/images/manual/olares/llm-base-model-console-ready.png b/docs/public/images/manual/olares/llm-base-model-console-ready.png new file mode 100644 index 0000000000..fe2e80a051 Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-console-ready.png differ diff --git a/docs/public/images/manual/olares/llm-base-model-console-status.png b/docs/public/images/manual/olares/llm-base-model-console-status.png new file mode 100644 index 0000000000..5a4b80fb54 Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-console-status.png differ diff --git a/docs/public/images/manual/olares/llm-base-model-inst-task.png b/docs/public/images/manual/olares/llm-base-model-inst-task.png new file mode 100644 index 0000000000..bebb9bd569 Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-inst-task.png differ diff --git a/docs/public/images/manual/olares/llm-base-model-inst-test.png b/docs/public/images/manual/olares/llm-base-model-inst-test.png new file mode 100644 index 0000000000..6ebdb900a3 Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-inst-test.png differ diff --git a/docs/public/images/manual/olares/llm-base-model-instance-installed.png b/docs/public/images/manual/olares/llm-base-model-instance-installed.png new file mode 100644 index 0000000000..94f1b150d0 Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-installed.png differ diff --git a/docs/public/images/manual/olares/llm-base-model-instance-installed1.png b/docs/public/images/manual/olares/llm-base-model-instance-installed1.png new file mode 100644 index 0000000000..b7ffe08841 Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-installed1.png differ diff --git a/docs/public/images/manual/olares/llm-base-model-instance-tag.png b/docs/public/images/manual/olares/llm-base-model-instance-tag.png new file mode 100644 index 0000000000..7bf69eb34f Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-tag.png differ diff --git a/docs/public/images/manual/olares/llm-base-model-instance-tag1.png b/docs/public/images/manual/olares/llm-base-model-instance-tag1.png new file mode 100644 index 0000000000..45aeb08396 Binary files /dev/null and b/docs/public/images/manual/olares/llm-base-model-instance-tag1.png differ diff --git a/docs/zh/manual/olares/market/llm-base-apps.md b/docs/zh/manual/olares/market/llm-base-apps.md new file mode 100644 index 0000000000..4b1664f25f --- /dev/null +++ b/docs/zh/manual/olares/market/llm-base-apps.md @@ -0,0 +1,224 @@ +--- +outline: [2, 3] +description: 了解如何在 Olares 中使用大模型底座(LLM Base App)托管本地大语言模型,并通过克隆底座运行 Ollama、vLLM、llama.cpp 或 SGLang 等推理引擎。 +--- + +# 使用大模型底座托管本地大语言模型 + +Olares V1.12.6 推出了基于 `llm-init` 项目的本地大语言模型(LLM)托管与管理平台。该平台提供四个大模型底座应用,分别对应四种推理引擎:**Ollama LLM Base**、**vLLM LLM Base**、**llama.cpp LLM Base** 和 **SGLang LLM Base**。选择对应引擎的底座,部署不同模型,并通过专属面板监控模型运行状态。 + +## 开始之前 + +- 你的 Olares 系统已升级至 V1.12.6 或更高版本。 + +## 找到大模型底座 + +1. 打开 **Market**,搜索“LLM Base”。 + + ![应用市场中的大模型底座](/images/manual/olares/llm-base-apps.png#bordered) + +2. 每个底座应用针对不同的推理场景做了优化。根据你的模型来源、性能需求和硬件条件进行选择: + + | 底座应用 | 适用场景 | + | :--- | :--- | + | **Ollama LLM Base (llm-init)** | 快速上手和广泛的模型兼容。Ollama 可通过原生模型标签自动拉取模型,最适合聊天和嵌入任务。 | + | **vLLM LLM Base (llm-init)** | 高并发场景下对 HuggingFace 模型进行高吞吐量推理服务。 | + | **SGLang LLM Base (llm-init)** | 需要高效结构化生成和高级推理服务优化的场景。 | + | **llama.cpp LLM Base (llm-init)** | 轻量 GGUF 模型、显存有限或资源紧张的部署环境。 | + +## 创建新的模型实例 + +大模型底座只是一个模板。要运行模型,你需要先将底座克隆为独立的运行实例。 + +1. 选择与你所需推理引擎匹配的底座,然后点击 **View**。例如,**Ollama LLM Base (llm-init)**。 +2. 点击 **Create**,初始化一个新的实例。 +3. 配置实例标识: + + - **New app name**:输入实例的唯一名称。该名称会显示在 **Market** > **My Olares** 中。 + - **Shortcut name for {client}**:输入在启动台上显示的唯一快捷方式名称。 + +4. 点击 **Create**,进入环境变量配置。 + +## 配置引擎环境变量 + +创建实例后,会弹出配置窗口。你需要定义引擎从哪里拉取模型、使用多少显存,以及向其他客户端应用暴露哪些能力。 + +1. 在 **Configure environment variables for {New-app-name}** 窗口中,根据目标模型和引擎填写以下信息: + + | 变量 | 说明 | + | :--- | :--- | + | **MODEL_SOURCE** | 指定模型源地址。
格式取决于所选引擎。
示例:`ollama://qwen3.5:0.8b` 或 `hf://Qwen/Qwen3.5-2B`。 | + | **MODEL_NAME** | 指定客户端应用连接时使用的模型名称。
示例:`qwen3.5-2b`。 | + | **MODEL_MODE** | 选择 **Chat** 或 **Embedding**。
示例:`chat`。 | + | **MODEL_SUPPORTS** | 输入逗号分隔的模型能力标志:示例:`supports_function_calling,supports_tool_choice`。
参考:[模型能力标志](#model-capability-flags)。 | + | **ENGINE_ARGS** | 指定引擎启动参数。
多个参数之间用空格分隔。
示例:`OLLAMA_CONTEXT_LENGTH=4096`。
参考:[引擎调优参数](#engine-tuning-arguments)。 | + | **{ENGINE}_REQUIRED
_GPU_MEMORY** | 设置实例所需的最低显存,单位为 MB 或 Gi。
示例:`8192`。 | + +2. 点击 **Confirm** 保存配置并开始安装实例。 + + 页面右侧会出现 **Instances** 面板,显示安装进度。安装完成后,实例操作按钮会变为 **Open**,表示底层服务正在运行。 + + ![大模型底座实例安装完成](/images/manual/olares/llm-base-model-instance-installed.png#bordered) + +### 参考:模型能力标志 {#model-capability-flags} + +`MODEL_SUPPORTS` 变量声明模型向外部客户端暴露的能力。这些标志对所有推理引擎通用。 + +| 类别 | 支持的标志 | +| --- | --- | +| **核心** | `supports_vision`、`supports_function_calling`、
`supports_reasoning`、`supports_native_streaming`、
`supports_response_schema`、`supports_prompt_caching`、
`supports_web_search`、`supports_parallel_function_calling` | +| **多模态** | `supports_audio_input`、`supports_audio_output`、
`supports_video_input`、`supports_pdf_input`、
`supports_computer_use`、`supports_url_context` | +| **推理与控制 token** | `supports_reasoning_effort`、`supports_thinking`、
`supports_assistant_prefill`、`supports_tool_choice`、
`supports_tokenizer` | +| **采样控制** | `supports_system_messages`、`supports_temperature`、
`supports_top_p`、`supports_top_k`、
`supports_stop_sequences`、`supports_frequency_penalty`、
`supports_presence_penalty` | +| **响应形式** | `supports_n`、`supports_logprobs`、`supports_seed`、
`supports_response_format`、`supports_logit_bias`、`supports_user` | + +### 参考:引擎调优参数 {#engine-tuning-arguments} + +使用 `ENGINE_ARGS` 变量来调整显存占用、上下文长度和处理行为。点击下方推理引擎查看可用调优参数。 + + + + + + + + +### 引擎配置示例 + + + + + + + + +## 监控下载与初始化状态 + +你可以通过实例内置面板跟踪模型下载、查看性能指标,并获取 API 连接信息。 + +1. 在底座详情页右侧的 **Instances** 面板中找到你的部署。 +2. 当状态显示底层服务正在运行时,点击 **Open**。模型实例的 **llm-init** 页面会随之打开。 +3. 在 **STATUS** 标签页中确认部署进度: + - **DOWNLOAD**:实时显示下载百分比、速度和预计完成时间(ETA)。 + - **STATUS**:跟踪失败或重试次数。如果出现网络中断或模型源地址格式错误,修复后点击 **Retry**。 + - **ENGINE**:显示初始化状态。 + + - 确认两个追踪标签均显示 **Engine alive: yes** 和 **Model exists: yes**,这表示引擎已在线并可接受请求。 + - 复制模型名称和 OpenAI 兼容 API 基础地址。 + +4. 进入 **CONFIG** 标签页,查看运行限制、执行性能探测、查看基准历史或更新变量。 + +## 将客户端应用连接到模型服务 + +本地模型实例成功运行后,其他客户端应用可以使用标准 OpenAI API 模式连接该服务。 + +1. 打开 Olares **Settings**,然后进入 **Applications** > **{Your-New-Model-Instance}** > **Shared entrance** > **{Engine} LLM API**。 +2. 复制端点 URL,并与你定义的 `MODEL_NAME` 一起填入客户端应用的模型配置部分。 + +## 卸载模型实例 + +1. 从应用市场打开目标底座应用。 +2. 在**实例**部分,找到目标模型实例,点击操作按钮旁的下拉箭头,然后点击**卸载**。