Featured Teardown · Strategy Signature № 03 · INTERFERENCE

你的本地 LLM 在輸出陸式用語嗎？Gemma 4 e2b vs 12b 繁中偏移審計

Is Your Local LLM Quietly Outputting Mainland Vocabulary? Auditing Gemma 4 e2b vs 12b

Josh Chen · June 11, 2026 · 11 min read

Ollama 三指令快速上手那篇我講過：Gemma 預設講英文，要繁中要在 prompt 前加一句指令。但「加了那句之後是不是真的繁中台灣用語」這件事，沒人量過。我用 20 個情境誘發 prompt 把 gemma4:e2b 與 gemma4:12b 在「沒加前綴」與「加了前綴」兩種條件下各跑一輪，再用 grep 自動審計輸出有多少陸式用語。

In the Ollama three-command quickstart, I noted that Gemma defaults to English — and you fix it with a Chinese-prefix instruction. But whether "fixed" actually means Taiwanese vocabulary, nobody measured. I ran gemma4:e2b and gemma4:12b against 20 trigger prompts in both "no prefix" and "with prefix" conditions, then grep-audited the outputs for mainland-Chinese terms.

設定

項目	規格
機器	MacBook Pro 14" M1 Pro, 16GB
Ollama	0.30.6
模型	`gemma4:e2b` (Q4_K_M, 7.2 GB) + `gemma4:12b` (Q4_K_M, 7.6 GB)
條件	(a) 不加前綴 (b) 加「請務必使用繁體中文台灣用語回答。」
Temperature	0.2
Prompt 數量	20
總執行	20 × 2 模型 × 2 條件 = 80 次推論

Prompt 設計原則：

全部繁體中文，刻意避開使用任何陸式用語在 prompt 裡（否則會 seed 模型）
情境刻意誘發詞彙：例如「想跟 Apple 客服反應一個問題，怎麼聯絡？」目標誘發 聯繫；「列印機怎麼接 Mac？」目標誘發 打印
20 個 prompt 覆蓋 17 個 hard term + 部分 soft term

Hard terms 清單（取自 knowledge/master/prohibited.md）：

維權、信息、軟件、硬件、視頻、聯繫、打印、內存、固件、屏幕、鼠標、賬號、默認、移動端、互聯網、雲計算、服務器

Soft terms（情境相依，記錄但不算違規）：

搞定、用戶、程序、通過、質量、客戶端

審計方法：每個輸出對 hard term 與 soft term 各跑一輪 text.count(term)，記錄詞頻。任何 hard term ≥ 1 次出現就算「該 prompt 被偏移」。

腳本：tests/scripts/run_audit.py；prompt：tests/data/prompts.json；結果：tests/outputs/responses.jsonl、tests/outputs/scores.jsonl、tests/outputs/summary.json。

Setup

Item	Spec
Machine	MacBook Pro 14" M1 Pro, 16GB
Ollama	0.30.6
Models	`gemma4:e2b` (Q4_K_M, 7.2 GB) + `gemma4:12b` (Q4_K_M, 7.6 GB)
Conditions	(a) no prefix (b) prefixed with "請務必使用繁體中文台灣用語回答。"
Temperature	0.2
Prompts	20
Total runs	20 × 2 models × 2 conditions = 80
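The 80-run matrix in the table can be sketched as below. This is a minimal illustration, not the contents of tests/scripts/run_audit.py: the shape of the prompt dictionary, the helper names (`build_runs`, `build_payload`, `generate`), and the choice to concatenate the prefix directly in front of the prompt are all assumptions. The request targets Ollama's local `/api/generate` endpoint on its default port.

```python
import itertools
import json
import urllib.request

MODELS = ["gemma4:e2b", "gemma4:12b"]
PREFIX = "請務必使用繁體中文台灣用語回答。"
# Condition labels "none" / "tw" follow the column names used in the frequency tables.
CONDITIONS = {"none": "", "tw": PREFIX}
OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_runs(prompts):
    """Cross every prompt with every model and condition (assumed layout: id -> text)."""
    runs = []
    for model, (cond, prefix), (pid, text) in itertools.product(
        MODELS, CONDITIONS.items(), prompts.items()
    ):
        runs.append({
            "model": model,
            "condition": cond,
            "prompt_id": pid,
            "prompt": prefix + text,  # assumption: prefix is glued directly to the prompt
        })
    return runs


def build_payload(run, temperature=0.2):
    """Request body for a single non-streaming generation."""
    return {
        "model": run["model"],
        "prompt": run["prompt"],
        "stream": False,
        "options": {"temperature": temperature},
    }


def generate(run):
    """Send one run to a local Ollama server and return the response text."""
    data = json.dumps(build_payload(run)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With 20 prompts loaded, `build_runs` yields 20 × 2 × 2 = 80 entries, half of them carrying the prefix.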

Prompt design:

All Traditional Chinese; deliberately avoid using any mainland term in the prompt itself (otherwise we'd seed the model)
Each prompt designed to elicit specific vocabulary: "How do I contact Apple support about a problem?" targets 聯繫; "How do I hook up the printer to my Mac?" targets 打印
The 20 prompts cover all 17 hard terms plus some of the soft terms

Hard terms (from knowledge/master/prohibited.md):

維權 (rights advocacy), 信息 (info), 軟件 (software), 硬件 (hardware), 視頻 (video), 聯繫 (contact), 打印 (print), 內存 (memory), 固件 (firmware), 屏幕 (screen), 鼠標 (mouse), 賬號 (account), 默認 (default), 移動端 (mobile), 互聯網 (internet), 雲計算 (cloud computing), 服務器 (server)

Soft terms (context-dependent — logged but not counted as violations):

搞定, 用戶, 程序, 通過, 質量, 客戶端

Audit method: for each output, run text.count(term) for every hard and soft term, record frequencies. Any prompt with ≥1 hard hit counts as "drifted from the Taiwan baseline."
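The counting rule above can be sketched in a few lines. The term lists come straight from the article; the function name `score_output` and the shape of the returned record are assumptions, and the actual tests/scripts/run_audit.py may differ.

```python
HARD_TERMS = [
    "維權", "信息", "軟件", "硬件", "視頻", "聯繫", "打印", "內存", "固件",
    "屏幕", "鼠標", "賬號", "默認", "移動端", "互聯網", "雲計算", "服務器",
]
SOFT_TERMS = ["搞定", "用戶", "程序", "通過", "質量", "客戶端"]


def score_output(text):
    """Count every hard and soft term in one model output.

    Only terms that actually occur are kept. An output counts as drifted
    as soon as any hard term appears at least once.
    """
    hard = {t: text.count(t) for t in HARD_TERMS if t in text}
    soft = {t: text.count(t) for t in SOFT_TERMS if t in text}
    return {"hard": hard, "soft": soft, "drifted": bool(hard)}


sample = "請聯繫客服，用戶可以再聯繫一次。"
print(score_output(sample))
```

`str.count` is plain substring matching, so it cannot tell a mainland usage of 通過 or 程序 from a legitimate one. That is consistent with logging soft terms without counting them as violations.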

Script: tests/scripts/run_audit.py. Prompts: tests/data/prompts.json. Results: tests/outputs/responses.jsonl, tests/outputs/scores.jsonl, tests/outputs/summary.json.

結果

模型 × 條件	偏移率（≥1 hard term）	Hard term 總次數	Soft term 總次數
gemma4:e2b 無前綴	3/20	10	31
gemma4:e2b 加前綴	5/20	9	28
gemma4:12b 無前綴	7/20	10	29
gemma4:12b 加前綴	5/20	10	24

三個反直覺的觀察：

12B 無前綴的偏移率比 e2b 高（7 vs 3）。比較大的模型不等於比較好的繁中用語感——可能因為大模型輸出更冗長，給「踩到誘發詞」更多機會。
前綴讓 e2b 的偏移率變高、12b 的變低——兩個方向相反，淨效應接近 0。前綴影響的是輸出風格與長度，不是詞彙偏好。
Hard hits 與 drifted prompts 數字不貼合：e2b/none 只有 3 個 prompt 偏移，但 hard hits 達 10——表示單個受影響的 prompt 會重複用同一個禁用詞多次（P03 用了 6 次聯繫、P10 用了 3 次）。

Results

Model × Condition	Drift rate (≥1 hard hit)	Total hard hits	Total soft hits
gemma4:e2b, no prefix	3/20	10	31
gemma4:e2b, with prefix	5/20	9	28
gemma4:12b, no prefix	7/20	10	29
gemma4:12b, with prefix	5/20	10	24

Three counterintuitive observations:

12b drifts more than e2b without the prefix (7 vs 3 prompts). A bigger model does not guarantee a better instinct for Taiwanese vocabulary, possibly because the larger model produces longer responses, which gives it more chances to land on a trigger word.
The prefix raises e2b's drift rate (3 → 5 prompts) and lowers 12b's (7 → 5 prompts): opposite directions, and a net change of zero across the two models. The prefix appears to influence style and length, not vocabulary preference.
Hard hits ≠ prompts with drift: e2b/none has only 3 drifted prompts but 10 hard hits, because a single affected prompt repeats the same prohibited word several times (P03 used 聯繫 six times, P10 used it three times).
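The gap between drifted prompts and hard hits is easy to see once both are computed from the same per-prompt counts. The sketch below uses the e2b/none figures quoted above; the third drifted prompt's single hit is inferred from the totals (10 − 6 − 3 = 1), and its ID is a placeholder because the article does not name it. `summarize` is a hypothetical helper, not part of the audit script.

```python
def summarize(hard_hits_per_prompt, n_prompts=20):
    """Return the drift rate and total hard hits for one model/condition."""
    drifted = sum(1 for hits in hard_hits_per_prompt.values() if hits >= 1)
    total_hits = sum(hard_hits_per_prompt.values())
    return {"drift_rate": f"{drifted}/{n_prompts}", "hard_hits": total_hits}


# e2b, no prefix. Prompts with zero hits are simply omitted.
e2b_none = {
    "P03": 6,
    "P10": 3,
    "unnamed_third_prompt": 1,  # placeholder ID; count inferred from 10 - 6 - 3
}

print(summarize(e2b_none))
```

Two prompts account for nine of the ten hits, which is why the hit total says little about how many prompts are affected.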

最常踩雷的詞

整批 80 次推論的 hard term 詞頻：

詞	e2b/none	e2b/tw	12b/none	12b/tw	合計
聯繫	10	9	9	10	38
固件	0	0	1	0	1
其他 15 個 hard term	0	0	0	0	0

這是整篇文章的核心發現：17 個禁用詞清單裡，只有 1 個詞造成 95% 以上的偏移——「聯繫」。其他像「軟件」「打印」「視頻」「屏幕」「默認」「服務器」「互聯網」「移動端」「鼠標」「賬號」「內存」這些我預期會看到的詞，整批 80 次推論都沒踩到一次。

也就是說，Gemma 4 對台灣繁中的「字面用詞」其實掌握得不錯——它知道用「軟體」「列印機」「影片」「螢幕」「預設」「伺服器」「網際網路」「行動裝置」「滑鼠」「帳號」「記憶體」。但它死認定「聯繫」是中性繁中，前綴沒辦法把它趕走。

Soft term 詞頻（兩模型合計）：

詞	none	tw	觀察
用戶	38	36	平均每個 prompt 出現 ~1 次，soft 但極普遍
程序	15	16	程式碼脈絡裡偶爾出現
客戶端	4	0	加了前綴後消失
通過	3	0	同上

「用戶」雖然是 soft（在台灣可接受），但出現頻率比「聯繫」還高——若你的編輯規範堅持改成「使用者」，那是一個比 hard term 更大的後製工作量。

Most common offenders

Hard-term frequencies across all 80 runs:

Term	e2b/none	e2b/tw	12b/none	12b/tw	Total
聯繫 (contact)	10	9	9	10	38
固件 (firmware)	0	0	1	0	1
Other 15 hard terms	0	0	0	0	0

This is the headline finding: of the 17 prohibited terms on the list, one word, 聯繫, accounts for 38 of the 39 hard hits (about 97%). The ones I expected to show up (軟件, 打印, 視頻, 屏幕, 默認, 服務器, 互聯網, 移動端, 鼠標, 賬號, 內存) didn't appear once across 80 runs.
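The concentration figure can be checked directly from the frequency table. The snippet below only re-adds the published counts; the variable names are illustrative.

```python
# Hard-term hits per condition, in table order: e2b/none, e2b/tw, 12b/none, 12b/tw
hard_hits = {
    "聯繫": [10, 9, 9, 10],
    "固件": [0, 0, 1, 0],
}

totals = {term: sum(counts) for term, counts in hard_hits.items()}
grand_total = sum(totals.values())
share = totals["聯繫"] / grand_total

print(totals, grand_total, f"{share:.1%}")
```

The per-condition column sums (10, 9, 10, 10) also match the "Total hard hits" column in the results table.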

Practical translation: Gemma 4 actually handles Taiwan vocabulary well at the surface level. It correctly uses 軟體, 列印機, 影片, 螢幕, 預設, 伺服器, 網際網路, 行動裝置, 滑鼠, 帳號, 記憶體. But it treats 聯繫 as neutral Traditional Chinese, and the prefix does not dislodge it.

Soft-term counts (both models combined):

Term	no-prefix	with-prefix	Note
用戶 (user)	38	36	~1 per prompt, soft but extremely common
程序 (program/procedure)	15	16	Surfaces in code contexts
客戶端 (client)	4	0	Disappears with prefix
通過 (through/by)	3	0	Same

用戶 is "soft" (acceptable in Taiwan) but more frequent than 聯繫. If your editorial policy mandates 使用者, that's a much bigger post-process load than the hard-term cleanup.

前綴的實際效果

模型	無前綴 hard 總計	加前綴 hard 總計	變化	偏移率變化
gemma4:e2b	10	9	–1	3/20 → 5/20（↑ 2 個 prompt）
gemma4:12b	10	10	0	7/20 → 5/20（↓ 2 個 prompt）

數字攤開來看更刺眼：

前綴在「詞彙選擇」上幾乎沒效。 兩個模型的 hard hits 都在 9-10 之間，變動量在 ±1 範圍內——這個變動可能只是 sampling noise（雖然 temperature=0.2 但模型不是 100% 確定性）。
聯繫的頑固程度跨模型一致。 不管 e2b 或 12b、有沒有前綴，聯繫的出現都在 9-10 次。這代表這不是模型大小問題、不是 prompt 工程問題，是訓練資料分布問題。
前綴可能改變的是其他效果——例如它確實讓模型用繁中而不是英文回應（這個我們在 Ollama 三指令那篇驗證過）、可能也讓 soft term 客戶端 通過 變少（從 4→0、3→0）——但對主要偏移源沒影響。

給 Resonance 寫作 pipeline 的建議

前綴照加（避免英文回應）但不要對它的詞彙效果有期待
加一個 post-process 詞表替換 在生成器（tools/generate_article.py）裡，至少把 聯繫 → 聯絡 處理掉
詞表先聚焦在 聯繫。其他 15 個 hard term 觀測到的偏移率為 0，加進詞表沒壞處（防禦未來模型版本變化）但目前的優先級低
用戶 → 使用者 是另一條編輯線。要不要改，看你對 Resonance Stack 的「台味濃度」期望——這已經是審美而不是合規

How much does the prefix actually help?

Model	No-prefix hard total	With-prefix hard total	Δ	Drift rate shift
gemma4:e2b	10	9	–1	3/20 → 5/20 (↑ 2 prompts)
gemma4:12b	10	10	0	7/20 → 5/20 (↓ 2 prompts)

The numbers are stark:

The prefix has almost no effect on vocabulary choice. Both models stay at 9-10 hard hits, and a change of at most 1 could simply be sampling noise, since temperature 0.2 is not fully deterministic.
聯繫's stickiness is consistent across models. Whether e2b or 12b, prefix or no prefix, 聯繫 lands 9-10 times. That points away from model size and prompt engineering and toward the training-data distribution.
What the prefix may change lies elsewhere: it prevents English replies (verified in the quickstart) and may suppress some soft terms (客戶端 4→0, 通過 3→0), but it leaves the main drift source untouched.

Practical recommendation for the Resonance Stack pipeline

Keep the prefix (it stops English output) but don't expect it to fix vocabulary
Add a post-process replacement step in the article generator (tools/generate_article.py) — at minimum, map 聯繫 → 聯絡
Term list can start with just 聯繫. The other 15 hard terms saw zero drift; adding them to the table is cheap insurance against future model versions but low-priority now
用戶 → 使用者 is a separate editorial line. Whether to change it depends on how strongly you want the "Taiwan-flavored" voice — that's aesthetic, not compliance

共振站結論

波形評級：INTERFERENCE

不是「本機模型不能寫繁中」，也不是「加了前綴就萬無一失」。是一個更實際的狀況：Gemma 4 對台灣用語的覆蓋率比預期高，但有 1-2 個頑固詞會跨越任何 prompt 設定。 寫作 pipeline 不要單靠 prompt 約束做語言品質控管，要在 generator 那層加一個小型詞表替換。

對 Resonance Stack 的具體 action：

tools/generate_article.py 加入 _normalize_taiwanese_terms() 步驟，至少包含 聯繫 → 聯絡
不修改現有的 prompt 模板——前綴繼續加（其他用途）
用戶 / 使用者 的選擇延後到「voice-master.md」釐清後再實施
季度檢核：當 ollama library 更新模型版本時重跑這套 audit，看詞頻是否變化

完整審計腳本與資料 →

Verdict

Waveform: INTERFERENCE

Not "local models can't write Traditional Chinese," and not "the prefix fully solves it." The practical picture: Gemma 4's Taiwan-vocabulary coverage is higher than expected, but one or two stubborn terms persist under every prompt setting. Don't rely on prompt-level constraints alone for language quality; add a small term-replacement step in the generator.

Concrete actions for the Resonance Stack:

Add a _normalize_taiwanese_terms() step in tools/generate_article.py with at least 聯繫 → 聯絡
Don't change existing prompt templates — keep the prefix (it serves other purposes)
Defer the 用戶 / 使用者 decision until voice-master.md is settled
Quarterly re-audit: when ollama library publishes a new model version, rerun this audit and watch the term frequencies
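For the quarterly re-audit, the comparison itself is a small diff over term totals. The sketch below assumes each audit run can be reduced to a flat term → count dictionary; the real layout of tests/outputs/summary.json is not shown in the article, so `diff_term_counts` and its input shape are hypothetical. The baseline values are the hard-term totals from this audit.

```python
def diff_term_counts(old, new):
    """Return the change in count for every term whose total moved."""
    changes = {}
    for term in sorted(set(old) | set(new)):
        delta = new.get(term, 0) - old.get(term, 0)
        if delta:
            changes[term] = delta
    return changes


# Hard-term totals from this audit (80 runs).
baseline = {"聯繫": 38, "固件": 1}

# Feed in the totals from the rerun here. Comparing the baseline with
# itself shows the "nothing changed" case.
print(diff_term_counts(baseline, baseline))
```

A non-empty result after a model update is the signal to revisit the replacement table.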

Full audit script and data →

Verdict

Signature № 03 · INTERFERENCE

整批 80 個輸出，禁用詞 95% 以上都是「聯繫」：前綴無效，要靠 post-process 替換。

Over 95% of all prohibited-term hits across 80 runs are one word: 聯繫. The prefix doesn't fix it; post-process replacement does.

Gemma 4 e2b 與 12b 在 80 次推論裡，禁用詞偏移集中在一個詞：聯繫（39 次 hard hit 中占 38 次）。其他 16 個 hard term 幾乎不出現（只有 1 次固件）。「請務必使用繁體中文台灣用語回答」前綴在兩個模型上幾乎不改變聯繫的出現頻率。實務結論：前綴是必要的（避免英文回應），但對台灣用語的「詞彙選擇」沒效，必須補一個 post-process 步驟把聯繫→聯絡替換掉，這個改動成本極低。Soft term 用戶是另一個未爆彈：每個 prompt 平均出現 ~1 次，但因為「用戶」在台灣可接受，是否替換成「使用者」要另外決定。

Across 80 runs on Gemma 4 e2b and 12b, prohibited-term drift concentrates on a single word: 聯繫 (38 of 39 hard hits). The other 16 hard terms basically don't show up (one 固件 hit in total). The 'please answer in Traditional Chinese using Taiwanese vocabulary' prefix barely moves the 聯繫 needle on either model. Practical conclusion: the prefix is still necessary (it prevents English replies) but doesn't reach into vocabulary choice, so you need a post-process step replacing 聯繫→聯絡, which is a trivial change. Soft term 用戶 is a secondary issue at ~1 hit per prompt, but since 用戶 is acceptable in Taiwan, replacing it is a separate editorial decision.
