Featured Teardown · Strategy Signature № 03 · INTERFERENCE

你的本地 LLM 在輸出陸式用語嗎?Gemma 4 e2b vs 12b 繁中偏移審計

Is Your Local LLM Quietly Outputting Mainland Vocabulary? Auditing Gemma 4 e2b vs 12b

「Ollama 三指令快速上手」那篇我講過:Gemma 預設講英文,要繁中要在 prompt 前加一句指令。但「加了那句之後是不是真的繁中台灣用語」這件事,沒人量過。我用 20 個情境誘發 prompt 把 gemma4:e2b 與 gemma4:12b 在「沒加前綴」與「加了前綴」兩種條件下各跑一輪,再用 grep 自動審計輸出有多少陸式用語。

In the Ollama three-command quickstart, I noted that Gemma defaults to English — and you fix it with a Chinese-prefix instruction. But whether "fixed" actually means Taiwanese vocabulary, nobody measured. I ran gemma4:e2b and gemma4:12b against 20 trigger prompts in both "no prefix" and "with prefix" conditions, then grep-audited the outputs for mainland-Chinese terms.


設定

| 項目 | 規格 |
| --- | --- |
| 機器 | MacBook Pro 14" M1 Pro, 16GB |
| Ollama | 0.30.6 |
| 模型 | gemma4:e2b (Q4_K_M, 7.2 GB) + gemma4:12b (Q4_K_M, 7.6 GB) |
| 條件 | (a) 不加前綴 (b) 加「請務必使用繁體中文台灣用語回答。」 |
| Temperature | 0.2 |
| Prompt 數量 | 20 |
| 總執行 | 20 × 2 模型 × 2 條件 = 80 次推論 |

Prompt 設計原則

  • 全部繁體中文,刻意避免在 prompt 裡使用任何陸式用語(否則會 seed 模型)
  • 情境刻意誘發詞彙:例如「想跟 Apple 客服反應一個問題,怎麼聯絡?」目標誘發 聯繫;「列印機怎麼接 Mac?」目標誘發 打印
  • 20 個 prompt 覆蓋 17 個 hard term + 部分 soft term

Hard terms 清單(取自 knowledge/master/prohibited.md):

維權、信息、軟件、硬件、視頻、聯繫、打印、內存、固件、屏幕、鼠標、賬號、默認、移動端、互聯網、雲計算、服務器

Soft terms(情境相依,記錄但不算違規):

搞定、用戶、程序、通過、質量、客戶端

審計方法:每個輸出對 hard term 與 soft term 各跑一輪 text.count(term),記錄詞頻。任何 hard term ≥ 1 次出現就算「該 prompt 被偏移」。

腳本:tests/scripts/run_audit.py;prompt:tests/data/prompts.json;結果:tests/outputs/responses.jsonl、tests/outputs/scores.jsonl、tests/outputs/summary.json

Setup

| Item | Spec |
| --- | --- |
| Machine | MacBook Pro 14" M1 Pro, 16GB |
| Ollama | 0.30.6 |
| Models | gemma4:e2b (Q4_K_M, 7.2 GB) + gemma4:12b (Q4_K_M, 7.6 GB) |
| Conditions | (a) no prefix (b) prefixed with "請務必使用繁體中文台灣用語回答。" |
| Temperature | 0.2 |
| Prompts | 20 |
| Total runs | 20 × 2 models × 2 conditions = 80 |
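
To make the run matrix concrete, here is a minimal sketch of how the 80 requests could be enumerated. It is not the contents of run_audit.py. The `build_runs` helper, the placeholder prompt texts, and the choice to prepend the prefix directly (with no separator) are assumptions for illustration; the payload fields follow Ollama's `/api/generate` REST endpoint, and actually sending the requests is left out.

```python
from itertools import product

MODELS = ["gemma4:e2b", "gemma4:12b"]
PREFIX = "請務必使用繁體中文台灣用語回答。"
# Condition labels mirror the column names used in the result tables.
CONDITIONS = {"none": "", "tw": PREFIX}


def build_runs(prompts):
    """Return one request description per (model, condition, prompt)."""
    runs = []
    for model, (cond, prefix), (pid, text) in product(
        MODELS, CONDITIONS.items(), prompts.items()
    ):
        runs.append({
            "prompt_id": pid,
            "model": model,
            "condition": cond,
            # Body for POST http://localhost:11434/api/generate
            "payload": {
                "model": model,
                "prompt": prefix + text,  # assumption: no separator
                "stream": False,
                "options": {"temperature": 0.2},
            },
        })
    return runs


# Placeholder prompts; the real ones live in tests/data/prompts.json.
prompts = {f"P{i:02d}": f"<prompt text {i}>" for i in range(1, 21)}
runs = build_runs(prompts)
print(len(runs))  # 80
```

Keeping the condition as an explicit field on each run makes it straightforward to group results by model × condition later.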

Prompt design:

  • All Traditional Chinese; deliberately avoid using any mainland term in the prompt itself (otherwise we'd seed the model)
  • Each prompt designed to elicit specific vocabulary: "How do I contact Apple support about a problem?" targets 聯繫; "How do I hook up the printer to my Mac?" targets 打印
  • 20 prompts cover 17 hard terms + some soft

Hard terms (from knowledge/master/prohibited.md):

維權 (rights advocacy), 信息 (info), 軟件 (software), 硬件 (hardware), 視頻 (video), 聯繫 (contact), 打印 (print), 內存 (memory), 固件 (firmware), 屏幕 (screen), 鼠標 (mouse), 賬號 (account), 默認 (default), 移動端 (mobile), 互聯網 (internet), 雲計算 (cloud computing), 服務器 (server)

Soft terms (context-dependent — logged but not counted as violations):

搞定, 用戶, 程序, 通過, 質量, 客戶端

Audit method: for each output, run text.count(term) for every hard and soft term, record frequencies. Any prompt with ≥1 hard hit counts as "drifted from the Taiwan baseline."

Script: tests/scripts/run_audit.py. Prompts: tests/data/prompts.json. Results: tests/outputs/responses.jsonl, tests/outputs/scores.jsonl, tests/outputs/summary.json.
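
The counting rule described above is small enough to sketch in full. This is an illustration under assumptions, not the actual run_audit.py: the `audit` function name and the shape of the returned record are made up here, while the term lists and the "≥1 hard hit means drifted" rule come from the text.

```python
HARD_TERMS = [
    "維權", "信息", "軟件", "硬件", "視頻", "聯繫", "打印", "內存", "固件",
    "屏幕", "鼠標", "賬號", "默認", "移動端", "互聯網", "雲計算", "服務器",
]
SOFT_TERMS = ["搞定", "用戶", "程序", "通過", "質量", "客戶端"]


def audit(text):
    """Count hard and soft terms in one model response."""
    # str.count counts non-overlapping substring occurrences.
    hard = {t: text.count(t) for t in HARD_TERMS if t in text}
    soft = {t: text.count(t) for t in SOFT_TERMS if t in text}
    return {
        "hard": hard,
        "soft": soft,                         # logged, never a violation
        "drifted": sum(hard.values()) >= 1,   # any hard hit flags the prompt
    }


result = audit("請聯繫客服,用戶可以更新固件後再聯繫我們。")
print(result["hard"], result["soft"], result["drifted"])
```

Plain substring counting is deliberately blunt: it cannot tell a mainland usage from a coincidental character sequence, which is one reason the soft terms are logged rather than scored.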


結果

| 模型 × 條件 | 偏移率(≥1 hard term) | Hard term 總次數 | Soft term 總次數 |
| --- | --- | --- | --- |
| gemma4:e2b 無前綴 | 3/20 | 10 | 31 |
| gemma4:e2b 加前綴 | 5/20 | 9 | 28 |
| gemma4:12b 無前綴 | 7/20 | 10 | 29 |
| gemma4:12b 加前綴 | 5/20 | 10 | 24 |

三個反直覺的觀察:

  1. 12B 無前綴的偏移率比 e2b 高(7 vs 3)。比較大的模型不等於比較好的繁中用語感——可能因為大模型輸出更冗長,給「踩到誘發詞」更多機會。
  2. 前綴讓 e2b 的偏移率變高、12b 的變低——兩個方向相反,淨效應接近 0。前綴影響的是輸出風格與長度,不是詞彙偏好。
  3. Hard hits 與 drifted prompts 數字不貼合:e2b/none 只有 3 個 prompt 偏移,但 hard hits 達 10——表示單個受影響的 prompt 會重複用同一個禁用詞多次(P03 用了 6 次聯繫、P10 用了 3 次)。

Results

| Model × Condition | Drift rate (≥1 hard hit) | Total hard hits | Total soft hits |
| --- | --- | --- | --- |
| gemma4:e2b, no prefix | 3/20 | 10 | 31 |
| gemma4:e2b, with prefix | 5/20 | 9 | 28 |
| gemma4:12b, no prefix | 7/20 | 10 | 29 |
| gemma4:12b, with prefix | 5/20 | 10 | 24 |

Three counterintuitive observations:

  1. 12b drifts more than e2b without the prefix (7 vs 3 prompts). A bigger model does not imply a better instinct for Taiwanese vocabulary, possibly because the larger model produces longer responses, which gives it more chances to land on a trigger word.
  2. Prefix raises e2b's drift rate while lowering 12B's — opposite directions, net effect close to zero. The prefix influences style and length, not vocabulary preference.
  3. Hard hits ≠ prompts with drift: e2b/none has only 3 prompts that drifted but 10 hard hits — a single affected prompt uses the same prohibited word multiple times (P03 used 聯繫 six times, P10 used it three times).
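
The gap between the two metrics is easy to see with the e2b/no-prefix numbers. In this sketch the P03 and P10 counts come from the text; the third drifted prompt must account for the remaining single hit (10 − 6 − 3 = 1), but its ID is not given, so `PXX` is a placeholder. The `summarize` helper is hypothetical.

```python
def summarize(hard_hits_by_prompt, n_prompts=20):
    """Return (drifted prompts, total hard hits, drift rate as a string)."""
    drifted = sum(1 for n in hard_hits_by_prompt.values() if n >= 1)
    total_hits = sum(hard_hits_by_prompt.values())
    return drifted, total_hits, f"{drifted}/{n_prompts}"


# e2b, no prefix. Prompts with zero hits are simply absent.
e2b_none = {"P03": 6, "P10": 3, "PXX": 1}  # PXX: placeholder ID
print(summarize(e2b_none))  # (3, 10, '3/20')
```

Drift rate counts affected prompts once each, while total hits weight a prompt by how often it repeats the term, so one chatty response can dominate the second number without moving the first.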

最常踩雷的詞

整批 80 次推論的 hard term 詞頻:

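| 詞 | e2b/none | e2b/tw | 12b/none | 12b/tw | 合計 |
| --- | --- | --- | --- | --- | --- |
| 聯繫 | 10 | 9 | 9 | 10 | 38 |
| 固件 | 0 | 0 | 1 | 0 | 1 |
| 其他 15 個 hard term | 0 | 0 | 0 | 0 | 0 |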

這是整篇文章的核心發現:17 個禁用詞清單裡,只有 1 個詞造成 95% 以上的偏移(39 次 hard hit 中佔 38 次):「聯繫」。其他像「軟件」「打印」「視頻」「屏幕」「默認」「服務器」「互聯網」「移動端」「鼠標」「賬號」「內存」這些我預期會看到的詞,整批 80 次推論都沒踩到一次。

也就是說,Gemma 4 對台灣繁中的「字面用詞」其實掌握得不錯——它知道用「軟體」「列印機」「影片」「螢幕」「預設」「伺服器」「網際網路」「行動裝置」「滑鼠」「帳號」「記憶體」。但它死認定「聯繫」是中性繁中,前綴沒辦法把它趕走。

Soft term 詞頻(兩模型合計):

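| 詞 | none | tw | 觀察 |
| --- | --- | --- | --- |
| 用戶 | 38 | 36 | 平均每次回應出現約 1 次(40 次推論共 38 次),soft 但極普遍 |
| 程序 | 15 | 16 | 程式碼脈絡裡偶爾出現 |
| 客戶端 | 4 | 0 | 加了前綴後消失 |
| 通過 | 3 | 0 | 同上 |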

「用戶」雖然是 soft(在台灣可接受),但出現頻率比「聯繫」還高——若你的編輯規範堅持改成「使用者」,那是一個比 hard term 更大的後製工作量。

Most common offenders

Hard-term frequencies across all 80 runs:

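| Term | e2b/none | e2b/tw | 12b/none | 12b/tw | Total |
| --- | --- | --- | --- | --- | --- |
| 聯繫 (contact) | 10 | 9 | 9 | 10 | 38 |
| 固件 (firmware) | 0 | 0 | 1 | 0 | 1 |
| Other 15 hard terms | 0 | 0 | 0 | 0 | 0 |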

This is the headline finding: of the 17 prohibited terms on the list, one word causes 95%+ of the drift — 聯繫. The ones I expected to show up — 軟件, 打印, 視頻, 屏幕, 默認, 服務器, 互聯網, 移動端, 鼠標, 賬號, 內存 — didn't appear once across 80 runs.

Practical translation: Gemma 4 actually handles Taiwan vocabulary well at the surface level — it correctly uses 軟體, 列印機, 影片, 螢幕, 預設, 伺服器, 網際網路, 行動裝置, 滑鼠, 帳號, 記憶體. But it insists 聯繫 is neutral Chinese and no prefix dislodges it.
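
The "95%+" figure follows directly from the frequency table: 38 of the 39 hard hits are 聯繫. A short sketch of that arithmetic, with a hypothetical `term_shares` helper:

```python
def term_shares(totals):
    """Map each term to its percentage share of all hits."""
    grand_total = sum(totals.values())
    return {t: round(100 * n / grand_total, 1) for t, n in totals.items()}


# Totals across all 80 runs, taken from the table above.
hard_totals = {"聯繫": 38, "固件": 1}
print(term_shares(hard_totals))  # {'聯繫': 97.4, '固件': 2.6}
```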

Soft-term counts (both models combined):

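| Term | no-prefix | with-prefix | Note |
| --- | --- | --- | --- |
| 用戶 (user) | 38 | 36 | ~1 per response (38 across 40 runs), soft but extremely common |
| 程序 (program/procedure) | 15 | 16 | Surfaces in code contexts |
| 客戶端 (client) | 4 | 0 | Disappears with prefix |
| 通過 (through/by) | 3 | 0 | Same |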

用戶 is "soft" (acceptable in Taiwan) but more frequent than 聯繫. If your editorial policy mandates 使用者, that's a much bigger post-process load than the hard-term cleanup.


前綴的實際效果

| 模型 | 無前綴 hard 總計 | 加前綴 hard 總計 | 變化 | 偏移率變化 |
| --- | --- | --- | --- | --- |
| gemma4:e2b | 10 | 9 | –1 | 3/20 → 5/20(↑ 2 個 prompt) |
| gemma4:12b | 10 | 10 | 0 | 7/20 → 5/20(↓ 2 個 prompt) |

數字攤開來看更刺眼:

  1. 前綴在「詞彙選擇」上幾乎沒效。 兩個模型的 hard hits 都在 9-10 之間,變動量在 ±1 範圍內——這個變動可能只是 sampling noise(雖然 temperature=0.2 但模型不是 100% 確定性)。
  2. 聯繫的頑固程度跨模型一致。 不管 e2b 或 12b、有沒有前綴,聯繫的出現都在 9-10 次。這代表這不是模型大小問題、不是 prompt 工程問題,是訓練資料分布問題。
  3. 前綴可能改變的是其他效果——例如它確實讓模型用繁中而不是英文回應(這個我們在 Ollama 三指令 那篇驗證過)、可能也讓 soft term 客戶端 通過 變少(從 4→0、3→0)——但對主要偏移源沒影響。

給 Resonance 寫作 pipeline 的建議

  • 前綴照加(避免英文回應)但不要對它的詞彙效果有期待
  • 加一個 post-process 詞表替換 在生成器(tools/generate_article.py)裡,至少把 聯繫 → 聯絡 處理掉
  • 詞表先聚焦在 聯繫。其餘 16 個 hard term 中,固件 出現 1 次,另外 15 個完全沒出現;加進詞表沒壞處(防禦未來模型版本變化),但目前的優先級低
  • 用戶 → 使用者 是另一條編輯線。要不要改,看你對 Resonance Stack 的「台味濃度」期望——這已經是審美而不是合規

How much does the prefix actually help?

| Model | No-prefix hard total | With-prefix hard total | Δ | Drift rate shift |
| --- | --- | --- | --- | --- |
| gemma4:e2b | 10 | 9 | –1 | 3/20 → 5/20 (↑ 2 prompts) |
| gemma4:12b | 10 | 10 | 0 | 7/20 → 5/20 (↓ 2 prompts) |

The numbers are stark:

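  1. The prefix has almost no effect on vocabulary choice. Both models stay at 9-10 hard hits, so |Δ| ≤ 1. A shift that small could plausibly be sampling noise: temperature 0.2 does not make the model fully deterministic, and each condition was run only once.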
  2. 聯繫's stickiness is consistent across models. Whether e2b or 12B, prefix or no prefix, 聯繫 lands 9-10 times. This isn't a model-size problem, it isn't a prompt-engineering problem — it's a training-data distribution problem.
  3. What the prefix does change are other things — it does prevent English replies (verified in the quickstart), and it does suppress some soft terms (客戶端 4→0, 通過 3→0) — but it does not touch the main drift source.

Practical recommendation for the Resonance Stack pipeline

  • Keep the prefix (it stops English output) but don't expect it to fix vocabulary
  • Add a post-process replacement step in the article generator (tools/generate_article.py) — at minimum, map 聯繫 → 聯絡
  • The term list can start with just 聯繫. Of the remaining 16 hard terms, 固件 appeared once and the other 15 never appeared; adding them to the table is cheap insurance against future model versions but low-priority now
  • 用戶 → 使用者 is a separate editorial line. Whether to change it depends on how strongly you want the "Taiwan-flavored" voice — that's aesthetic, not compliance

共振站結論

波形評級:INTERFERENCE

不是「本機模型不能寫繁中」,也不是「加了前綴就萬無一失」。是一個更實際的狀況:Gemma 4 對台灣用語的覆蓋率比預期高,但有 1-2 個頑固詞會跨越任何 prompt 設定。 寫作 pipeline 不要單靠 prompt 約束做語言品質控管,要在 generator 那層加一個小型詞表替換。

對 Resonance Stack 的具體 action:

  1. tools/generate_article.py 加入 _normalize_taiwanese_terms() 步驟,至少包含 聯繫 → 聯絡
  2. 不修改現有的 prompt 模板——前綴繼續加(其他用途)
  3. 用戶 / 使用者 的選擇延後到「voice-master.md」釐清後再實施
  4. 季度檢核:當 ollama library 更新模型版本時重跑這套 audit,看詞頻是否變化

完整審計腳本與資料 →

Verdict

Waveform: INTERFERENCE

Not "local models can't write Traditional Chinese," and not "the prefix fully solves it." The practical picture: Gemma 4's Taiwan-vocabulary coverage is higher than expected, but one or two stubborn terms persist under every prompt setting tested. Don't rely on prompt-level constraints alone for language quality; add a small term-replacement step in the generator.

Concrete actions for the Resonance Stack:

  1. Add a _normalize_taiwanese_terms() step in tools/generate_article.py with at least 聯繫 → 聯絡
  2. Don't change existing prompt templates — keep the prefix (it serves other purposes)
  3. Defer the 用戶 / 使用者 decision until voice-master.md is settled
  4. Quarterly re-audit: when ollama library publishes a new model version, rerun this audit and watch the term frequencies
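
For the re-audit, the comparison of interest is how each term's total moves between two runs. A sketch under assumptions: the schema of summary.json is not shown in this article, so the inputs here are plain term-to-count dicts, `frequency_shift` is a hypothetical helper, and the `later` numbers are invented purely to show the output shape.

```python
def frequency_shift(baseline, current):
    """Return {term: current - baseline} for terms whose count changed."""
    terms = set(baseline) | set(current)
    shift = {}
    for term in terms:
        delta = current.get(term, 0) - baseline.get(term, 0)
        if delta != 0:
            shift[term] = delta
    return shift


# Baseline: hard-term totals from this audit.
baseline = {"聯繫": 38, "固件": 1}
# Invented numbers for a future model version, for illustration only.
later = {"聯繫": 12, "軟件": 2}

changes = frequency_shift(baseline, later)
print(changes)
```

A term that newly appears (like 軟件 in the invented data) is the signal to promote it from "cheap insurance" to an active entry in the replacement table.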

Full audit script and data →