YaYa-Ubuntu 站-楊奕農: 2026/6/1

2026年6月6日

Hermes Agent + 本地 Ollama 問題總結與解決方案

開始發現的問題

我的主要目標是測試及使用「本地AI」

電腦是 AMD Ryzen 7 5700-8-Core
主板：PRIME B550M-K
GPU：NVIDIA GeForce RTX 3060/PCIe/SSE2 v: 4.6.0 /12G

NVIDIA Driver: v570.133.07

CUDA Driver Version: 12.8

OS: Linux Mint 20.3 Una

開始使用 Hermes 時，發現「完全無法使用」ollama 下的本地模型，一直有錯，不然就是很笨！
改用 Hermes 的免費 deepseek-v4-flash-free 模型來偵查，發現問題有些奇怪，可能有一部份是 ollama 的預測值導致，有的是 Hermes 的設定，經過幾小時的排查後，找出解決方有三層：

1. ✅ Ollama Modelfile (num_ctx)

2. ✅ Ollama runtime (ollama ps context_length)

3. 🔴 Hermes config (context_length)

簡單地說，要用 ollama create modelfile 的方式來包裝原生下載的 LLM, 再配合修改 Hermes config.yaml 檔。

最後留下給 Hermes agent 可用的模型

模型	參數	原生 ctx	num_ctx	溫度	top_k	top_p	視覺	音訊	工具	推理
gemma4:12b_hermes	11.9B	262144	131072	1	64	0.95	✅	✅	✅	✅
gemma4:e2b_hermes	5.1B	131072	131072	1	64	0.95	✅	✅	✅	✅
gemma4:e4b_hermes	8.0B	131072	131072	1	64	0.95	✅	✅	✅	✅
gemma3:4b_hermes	4.3B	131072	131072	1	64	0.95	✅	✗	✗	✗
llama3.2:2b_hermes	3.2B	131072	131072	1	64	0.95	✗	✗	✅	✗
qwen3:8b_hermes	8.2B	40960	65536	0.6	20	0.95	✗	✗	✅	✅
qwen2.5:8b_hermes	7.6B	32768	65536	1	64	0.95	✗	✗	✅	✗
qwen2.5-coder:8b_hermes	7.6B	32768	65536	1	64	0.95	✗	✗	✅	✗

問題一：回應被截斷在 4096 tokens

現象： finish_reason='length'，精準卡在 total_tokens=4096，無法輸出長內容。

根本原因分析： Hermes 內部程式碼 chat_completion_helpers.py:589 有硬編碼 fallback：

max_tokens = agent.max_tokens or 4096

當 model.max_tokens 未在 config.yaml 設定時，自動降為 4096。

此外，Ollama 的 /v1/chat/completions endpoint 不支援 runtime 的 options.num_ctx，模型載入記憶體後，context length 永遠是原始模型預設值（gemma4:12b = 4096）。

解法：

在 Hermes agent 中的config.yaml 中必須設定

model:

max_tokens: 32768 # 防止 Hermes 內部 fallback 到 4096

同時用 Modelfile alias 取代原始模型名（見問題二）

問題二：Context window 無法擴充

現象：即使請求中帶了 options.num_ctx: 65536，模型載入後 ollama ps 仍然顯示 context_length=4096。

根因分析：

- /v1/chat/completions（OpenAI-compatible endpoint）忽略 request body 中的 options.* 參數

- 原生 API /api/chat 和 /api/generate 則正確支援

- Hermes 使用 OpenAI SDK 走 /v1/chat/completions，所以 extra_body 中的 num_ctx / num_predict 無效

解法：使用 Modelfile alias 在模型載入「前」就固定參數：

bash

ollama create gemma4:12b_hermes -f - <<'EOF'

FROM gemma4:12b

PARAMETER num_ctx 65536

PARAMETER num_predict 32768

EOF

然後 config 中 model.default 指向 gemma4:12b_hermes 而非 gemma4:12b。

問題三：extra_body 多餘且可能衝突

現象：很多教學建議在 config.yaml 中設：

extra_body: '{"num_predict": 16384, "num_ctx": 65536}'

根因分析：

1. 因為問題二，extra_body 中的 num_ctx 在 /v1/chat/completions 上無效

2. num_predict: 16384 反而會覆蓋 Modelfile 的預設值，限制輸出長度

3. custom provider 的 build_api_kwargs_extras 已經自動透過 ollama_num_ctx 機制送出 options.num_ctx

解法：直接移除 extra_body，完全交由 Modelfile 和 custom provider 處理。

問題四：多個 profile 需重複設定

現象：每次新增 profile 都要重新設定所有參數。

解法：

bash

# 建立一個模板 profile，所有 Ollama 設定一次到位

hermes profile create ollama-template --clone-from ha-xxx

未來可在 Hermes agent 中新增

hermes profile create 新名稱 --clone-from ollama-template

最終 ha-xxx config.yaml 範本

model:

default: gemma4:12b_hermes

provider: custom

base_url: http://localhost:11434/v1

api_key: your-api-key

context_length: 65536

max_tokens: 32768

# 不用設 extra_body — 因 ollama 中的 Modelfile 已提供 num_ctx/num_predict