設定
| 項目 | 規格 |
|---|---|
| 機器 | MacBook Pro 14" M1 Pro, 16GB |
| Ollama | 0.30.6 |
| 模型 | gemma4:e2b (Q4_K_M, 7.2 GB) + gemma4:12b (Q4_K_M, 7.6 GB) |
| 條件 | (a) 不加前綴 (b) 加「請務必使用繁體中文台灣用語回答。」 |
| Temperature | 0.2 |
| Prompt 數量 | 20 |
| 總執行 | 20 × 2 模型 × 2 條件 = 80 次推論 |
Prompt 設計原則:
- 全部繁體中文,刻意避開使用任何陸式用語在 prompt 裡(否則會 seed 模型)
- 情境刻意誘發詞彙:例如「想跟 Apple 客服反應一個問題,怎麼聯絡?」目標誘發聯繫;「列印機怎麼接 Mac?」目標誘發打印
- 20 個 prompt 覆蓋 17 個 hard term + 部分 soft term
Hard terms 清單(取自 knowledge/master/prohibited.md):
維權、信息、軟件、硬件、視頻、聯繫、打印、內存、固件、屏幕、鼠標、賬號、默認、移動端、互聯網、雲計算、服務器
Soft terms(情境相依,記錄但不算違規):
搞定、用戶、程序、通過、質量、客戶端
審計方法:每個輸出對 hard term 與 soft term 各跑一輪 text.count(term),記錄詞頻。任何 hard term ≥ 1 次出現就算「該 prompt 被偏移」。
腳本:tests/scripts/run_audit.py;prompt:tests/data/prompts.json;結果:tests/outputs/responses.jsonl、tests/outputs/scores.jsonl、tests/outputs/summary.json。
Setup
| Item | Spec |
|---|---|
| Machine | MacBook Pro 14" M1 Pro, 16GB |
| Ollama | 0.30.6 |
| Models | gemma4:e2b (Q4_K_M, 7.2 GB) + gemma4:12b (Q4_K_M, 7.6 GB) |
| Conditions | (a) no prefix (b) prefixed with "請務必使用繁體中文台灣用語回答。" |
| Temperature | 0.2 |
| Prompts | 20 |
| Total runs | 20 × 2 models × 2 conditions = 80 |
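The run matrix in the table can be sketched as below. This is a minimal illustration, not the contents of tests/scripts/run_audit.py: `build_runs` is a hypothetical helper, the condition labels `none` and `tw` are borrowed from the result tables, and joining the prefix directly onto the prompt is an assumption. The payload fields follow the shape of Ollama's `/api/generate` request body (`model`, `prompt`, `stream`, `options.temperature`); no request is actually sent here.

```python
from itertools import product

MODELS = ["gemma4:e2b", "gemma4:12b"]
PREFIX = "請務必使用繁體中文台灣用語回答。"
# Condition labels as used in the result tables; "none" means no prefix.
CONDITIONS = {"none": "", "tw": PREFIX}
TEMPERATURE = 0.2


def build_runs(prompts):
    """Return one request description per (prompt, model, condition) combination.

    `prompts` maps a prompt id to its text. How the prefix is joined to the
    prompt (plain concatenation) is an assumption made for this sketch.
    """
    runs = []
    for (pid, text), model, (cond, prefix) in product(
        prompts.items(), MODELS, CONDITIONS.items()
    ):
        runs.append(
            {
                "prompt_id": pid,
                "model": model,
                "condition": cond,
                "payload": {
                    "model": model,
                    "prompt": prefix + text,
                    "stream": False,
                    "options": {"temperature": TEMPERATURE},
                },
            }
        )
    return runs


if __name__ == "__main__":
    # Placeholder prompts; the real ones live in tests/data/prompts.json.
    demo = {f"P{i:02d}": f"placeholder prompt {i}" for i in range(1, 21)}
    print(len(build_runs(demo)))  # 20 prompts x 2 models x 2 conditions
```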
Prompt design:
- All Traditional Chinese; deliberately avoid using any mainland term in the prompt itself (otherwise we'd seed the model)
- Each prompt is designed to elicit specific vocabulary: "How do I contact Apple support about a problem?" targets 聯繫; "How do I hook up the printer to my Mac?" targets 打印
- 20 prompts cover 17 hard terms + some soft terms
Hard terms (from knowledge/master/prohibited.md):
維權 (rights advocacy), 信息 (info), 軟件 (software), 硬件 (hardware), 視頻 (video), 聯繫 (contact), 打印 (print), 內存 (memory), 固件 (firmware), 屏幕 (screen), 鼠標 (mouse), 賬號 (account), 默認 (default), 移動端 (mobile), 互聯網 (internet), 雲計算 (cloud computing), 服務器 (server)
Soft terms (context-dependent — logged but not counted as violations):
搞定, 用戶, 程序, 通過, 質量, 客戶端
Audit method: for each output, run text.count(term) for every hard and soft term, record frequencies. Any prompt with ≥1 hard hit counts as "drifted from the Taiwan baseline."
Script: tests/scripts/run_audit.py. Prompts: tests/data/prompts.json. Results: tests/outputs/responses.jsonl, tests/outputs/scores.jsonl, tests/outputs/summary.json.
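The scoring rule described above is simple enough to sketch in a few lines. This is an illustration of the counting method, not the actual run_audit.py; `score_output` and the shape of its return value are assumptions. Note that `str.count` does plain substring matching, so a term is counted wherever it appears, including inside a longer word.

```python
HARD_TERMS = [
    "維權", "信息", "軟件", "硬件", "視頻", "聯繫", "打印", "內存", "固件",
    "屏幕", "鼠標", "賬號", "默認", "移動端", "互聯網", "雲計算", "服務器",
]
SOFT_TERMS = ["搞定", "用戶", "程序", "通過", "質量", "客戶端"]


def score_output(text):
    """Count hard and soft terms in one model response.

    A response is flagged as drifted when at least one hard term occurs.
    Soft terms are recorded but never affect the drift flag.
    """
    hard = {t: text.count(t) for t in HARD_TERMS if t in text}
    soft = {t: text.count(t) for t in SOFT_TERMS if t in text}
    return {
        "hard": hard,
        "soft": soft,
        "hard_total": sum(hard.values()),
        "soft_total": sum(soft.values()),
        "drifted": bool(hard),
    }


if __name__ == "__main__":
    # Made-up sample response, for illustration only.
    sample = "您可以透過官方網站聯繫客服,或打電話聯繫門市。用戶也可以使用線上對話。"
    print(score_output(sample))
```

The sample contains 透過, which does not match the soft term 通過 because the first character differs; only exact substrings are counted.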
結果
| 模型 × 條件 | 偏移率(≥1 hard term) | Hard term 總次數 | Soft term 總次數 |
|---|---|---|---|
| gemma4:e2b 無前綴 | 3/20 | 10 | 31 |
| gemma4:e2b 加前綴 | 5/20 | 9 | 28 |
| gemma4:12b 無前綴 | 7/20 | 10 | 29 |
| gemma4:12b 加前綴 | 5/20 | 10 | 24 |
三個反直覺的觀察:
- 12B 無前綴的偏移率比 e2b 高(7 vs 3)。比較大的模型不等於比較好的繁中用語感——可能因為大模型輸出更冗長,給「踩到誘發詞」更多機會。
- 前綴讓 e2b 的偏移率變高、12b 的變低——兩個方向相反,淨效應接近 0。前綴影響的是輸出風格與長度,不是詞彙偏好。
- Hard hits 與 drifted prompts 數字不貼合:e2b/none 只有 3 個 prompt 偏移,但 hard hits 達 10——表示單個受影響的 prompt 會重複用同一個禁用詞多次(P03 用了 6 次聯繫、P10 用了 3 次)。
Results
| Model × Condition | Drift rate (≥1 hard hit) | Total hard hits | Total soft hits |
|---|---|---|---|
| gemma4:e2b, no prefix | 3/20 | 10 | 31 |
| gemma4:e2b, with prefix | 5/20 | 9 | 28 |
| gemma4:12b, no prefix | 7/20 | 10 | 29 |
| gemma4:12b, with prefix | 5/20 | 10 | 24 |
Three counterintuitive observations:
- 12B drifts more than e2b without the prefix (7 vs 3 prompts), even though both produce 10 hard hits in total. A bigger model does not imply a better instinct for Taiwanese vocabulary. One possible explanation is that the larger model produces longer responses, which gives it more chances to land on a trigger word.
- The prefix raises e2b's drift rate (3 → 5 prompts) while lowering 12B's (7 → 5): opposite directions, net effect close to zero. The prefix seems to influence style and length rather than vocabulary preference.
- Hard hits and drifted prompts do not track each other: e2b/none has only 3 drifted prompts but 10 hard hits, because a single affected prompt repeats the same prohibited word (P03 used 聯繫 six times, P10 three times).
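The gap between the two metrics is easy to see with the e2b/none numbers. The sketch below uses the per-prompt counts quoted above (P03 = 6, P10 = 3); the third drifted prompt is not named in the text, so `P_other` is a placeholder id and its count of 1 is simply what remains of the 10 total hits. `summarize` is a hypothetical helper.

```python
def summarize(per_prompt_hits, n_prompts=20):
    """Collapse per-prompt hard-hit counts into the two table metrics."""
    drifted = sum(1 for n in per_prompt_hits.values() if n >= 1)
    return {
        "drift_rate": f"{drifted}/{n_prompts}",
        "hard_total": sum(per_prompt_hits.values()),
    }


# e2b, no prefix. "P_other" is a placeholder id: the source does not say
# which prompt produced the remaining hit (10 - 6 - 3 = 1).
e2b_none = {"P03": 6, "P10": 3, "P_other": 1}

print(summarize(e2b_none))
```

Drift rate counts prompts, hard total counts occurrences, so a single repetitive response can inflate the second without moving the first.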
最常踩雷的詞
整批 80 次推論的 hard term 詞頻:
| 詞 | e2b/none | e2b/tw | 12b/none | 12b/tw | 合計 |
|---|---|---|---|---|---|
| 聯繫 | 10 | 9 | 9 | 10 | 38 |
| 固件 | 0 | 0 | 1 | 0 | 1 |
| 其他 15 個 hard term | 0 | 0 | 0 | 0 | 0 |
這是整篇文章的核心發現:17 個禁用詞清單裡,只有 1 個詞造成 95% 以上的偏移——「聯繫」。其他像「軟件」「打印」「視頻」「屏幕」「默認」「服務器」「互聯網」「移動端」「鼠標」「賬號」「內存」這些我預期會看到的詞,整批 80 次推論都沒踩到一次。
也就是說,Gemma 4 對台灣繁中的「字面用詞」其實掌握得不錯——它知道用「軟體」「列印機」「影片」「螢幕」「預設」「伺服器」「網際網路」「行動裝置」「滑鼠」「帳號」「記憶體」。但它死認定「聯繫」是中性繁中,前綴沒辦法把它趕走。
Soft term 詞頻(兩模型合計):
| 詞 | none | tw | 觀察 |
|---|---|---|---|
| 用戶 | 38 | 36 | 平均每個 prompt 出現 ~1 次,soft 但極普遍 |
| 程序 | 15 | 16 | 程式碼脈絡裡偶爾出現 |
| 客戶端 | 4 | 0 | 加了前綴後消失 |
| 通過 | 3 | 0 | 同上 |
「用戶」雖然是 soft(在台灣可接受),但出現頻率比「聯繫」還高——若你的編輯規範堅持改成「使用者」,那是一個比 hard term 更大的後製工作量。
Most common offenders
Hard-term frequencies across all 80 runs:
| Term | e2b/none | e2b/tw | 12b/none | 12b/tw | Total |
|---|---|---|---|---|---|
| 聯繫 (contact) | 10 | 9 | 9 | 10 | 38 |
| 固件 (firmware) | 0 | 0 | 1 | 0 | 1 |
| Other 15 hard terms | 0 | 0 | 0 | 0 | 0 |
This is the headline finding: of the 17 prohibited terms on the list, a single word, 聯繫, accounts for 38 of the 39 hard hits (about 97%). The ones I expected to show up (軟件, 打印, 視頻, 屏幕, 默認, 服務器, 互聯網, 移動端, 鼠標, 賬號, 內存) didn't appear once across 80 runs.
Practical translation: Gemma 4 actually handles Taiwan vocabulary well at the surface level — it correctly uses 軟體, 列印機, 影片, 螢幕, 預設, 伺服器, 網際網路, 行動裝置, 滑鼠, 帳號, 記憶體. But it insists 聯繫 is neutral Chinese and no prefix dislodges it.
Soft-term counts (both models combined):
| Term | no-prefix | with-prefix | Note |
|---|---|---|---|
| 用戶 (user) | 38 | 36 | ~1 per prompt, soft but extremely common |
| 程序 (program/procedure) | 15 | 16 | Surfaces in code contexts |
| 客戶端 (client) | 4 | 0 | Disappears with prefix |
| 通過 (through/by) | 3 | 0 | Same |
用戶 is "soft" (acceptable in Taiwan) but more frequent than 聯繫. If your editorial policy mandates 使用者, that's a much bigger post-process load than the hard-term cleanup.
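The proportions behind these two tables can be recomputed directly from the counts. The snippet below only restates numbers already shown above; the variable and function names are made up for this sketch.

```python
# Hard-term totals across all 80 runs (every other hard term is 0).
HARD_HITS = {"聯繫": 38, "固件": 1}
# Soft-term totals as (no-prefix, with-prefix), both models combined.
SOFT_HITS = {"用戶": (38, 36), "程序": (15, 16), "客戶端": (4, 0), "通過": (3, 0)}
TOTAL_RUNS = 80


def share(term, hits):
    """Fraction of all hits in `hits` that belong to `term`."""
    return hits[term] / sum(hits.values())


lianxi_share = share("聯繫", HARD_HITS)
yonghu_total = sum(SOFT_HITS["用戶"])
yonghu_per_response = yonghu_total / TOTAL_RUNS

# Cross-check against the results table: soft totals per condition.
soft_no_prefix = sum(pair[0] for pair in SOFT_HITS.values())
soft_with_prefix = sum(pair[1] for pair in SOFT_HITS.values())

print(f"聯繫 share of hard hits: {lianxi_share:.1%}")
print(f"用戶 total: {yonghu_total}, per response: {yonghu_per_response:.3f}")
print(f"soft totals: {soft_no_prefix} (no prefix), {soft_with_prefix} (with prefix)")
```

The soft totals come out to 60 and 52, which matches the per-model sums in the results table (31 + 29 and 28 + 24).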
前綴的實際效果
| 模型 | 無前綴 hard 總計 | 加前綴 hard 總計 | 變化 | 偏移率變化 |
|---|---|---|---|---|
| gemma4:e2b | 10 | 9 | –1 | 3/20 → 5/20(↑ 2 個 prompt) |
| gemma4:12b | 10 | 10 | 0 | 7/20 → 5/20(↓ 2 個 prompt) |
數字攤開來看更刺眼:
- 前綴在「詞彙選擇」上幾乎沒效。 兩個模型的 hard hits 都在 9-10 之間,變動量在 ±1 範圍內——這個變動可能只是 sampling noise(雖然 temperature=0.2 但模型不是 100% 確定性)。
- 聯繫的頑固程度跨模型一致。 不管 e2b 或 12b、有沒有前綴,聯繫的出現都在 9-10 次。這代表這不是模型大小問題、不是 prompt 工程問題,是訓練資料分布問題。
- 前綴可能改變的是其他效果:例如它確實讓模型用繁中而不是英文回應(這個我們在 Ollama 三指令 那篇驗證過)、可能也讓 soft term 客戶端、通過變少(從 4→0、3→0),但對主要偏移源沒影響。
給 Resonance 寫作 pipeline 的建議
- 前綴照加(避免英文回應)但不要對它的詞彙效果有期待
- 加一個 post-process 詞表替換在生成器(tools/generate_article.py)裡,至少把聯繫 → 聯絡處理掉
- 詞表先聚焦在聯繫。固件只出現 1 次,其他 15 個 hard term 觀測到的次數為 0,加進詞表沒壞處(防禦未來模型版本變化)但目前的優先級低
- 用戶 → 使用者 是另一條編輯線。要不要改,看你對 Resonance Stack 的「台味濃度」期望,這已經是審美而不是合規
How much does the prefix actually help?
| Model | No-prefix hard total | With-prefix hard total | Δ | Drift rate shift |
|---|---|---|---|---|
| gemma4:e2b | 10 | 9 | –1 | 3/20 → 5/20 (↑ 2 prompts) |
| gemma4:12b | 10 | 10 | 0 | 7/20 → 5/20 (↓ 2 prompts) |
The numbers are stark:
- The prefix has almost no effect on vocabulary choice. Both models stay at 9-10 hard hits, Δ ≤ 1, which may be nothing more than sampling noise (temperature 0.2 is low, but the model is not fully deterministic).
- 聯繫's stickiness is consistent across models. Whether e2b or 12B, prefix or no prefix, 聯繫 lands 9-10 times. That points away from model size and prompt engineering and toward the training-data distribution.
- What the prefix does change are other things: it prevents English replies (verified in the quickstart), and it may suppress some soft terms (客戶端 4→0, 通過 3→0). It does not touch the main drift source.
Practical recommendation for the Resonance Stack pipeline
- Keep the prefix (it stops English output) but don't expect it to fix vocabulary
- Add a post-process replacement step in the article generator (tools/generate_article.py); at minimum, map 聯繫 → 聯絡
- The term list can start with just 聯繫. 固件 appeared once and the other 15 hard terms saw zero hits; adding them to the table is cheap insurance against future model versions but low priority now
- 用戶 → 使用者 is a separate editorial line. Whether to change it depends on how strongly you want the "Taiwan-flavored" voice — that's aesthetic, not compliance
共振站結論
波形評級:INTERFERENCE
不是「本機模型不能寫繁中」,也不是「加了前綴就萬無一失」。是一個更實際的狀況:Gemma 4 對台灣用語的覆蓋率比預期高,但有 1-2 個頑固詞會跨越任何 prompt 設定。 寫作 pipeline 不要單靠 prompt 約束做語言品質控管,要在 generator 那層加一個小型詞表替換。
對 Resonance Stack 的具體 action:
- tools/generate_article.py 加入 _normalize_taiwanese_terms() 步驟,至少包含聯繫 → 聯絡
- 不修改現有的 prompt 模板;前綴繼續加(其他用途)
- 用戶 / 使用者 的選擇延後到「voice-master.md」釐清後再實施
- 季度檢核:當 ollama library 更新模型版本時重跑這套 audit,看詞頻是否變化
Verdict
Waveform: INTERFERENCE
Not "local models can't write Traditional Chinese," not "the prefix fully solves it." A more practical picture: Gemma 4's Taiwan-vocabulary coverage is higher than expected, but 1-2 stubborn terms survive every prompt setting tested. Don't rely on prompt-level constraints alone for language quality; add a small term-replacement step in the generator.
Concrete actions for the Resonance Stack:
- Add a _normalize_taiwanese_terms() step in tools/generate_article.py with at least 聯繫 → 聯絡
- Don't change existing prompt templates; keep the prefix (it serves other purposes)
- Defer the 用戶 / 使用者 decision until voice-master.md is settled
- Quarterly re-audit: when ollama library publishes a new model version, rerun this audit and watch the term frequencies