Featured Teardown · Strategy Signature № 03 · INTERFERENCE

一個模型，四種專家：用 Prompt 把 gemma4 變成稅務、法律、營養、客服顧問One Model, Four Experts: Prompting gemma4 Into Tax, Legal, Nutrition, and Support Specialists

Josh Chen · June 15, 2026 · 11 min read

你還在每次 chat 都重新解釋背景嗎？

Are you still re-explaining your background on every chat?

每次問 ChatGPT 一個法律問題，你都要先寫一句「我是台灣租客，房東加了一條……」。每次問營養師問題，又要說明「我是台灣人，飲食以米飯為主」。這就是為什麼系統 prompt 存在——把這些「背景脈絡」一次設定好，剩下的對話就能直接聊正事。

這篇文章用同一個 gemma4:e2b（本地 Ollama），分別設定 4 種專業角色，看看一個 7B 通用模型在 prompt 加持下能不能變成可用的領域助手。

Every time you ask ChatGPT a legal question, you start with "I'm a tenant in Taiwan, my landlord added a clause that...". Every time you ask a nutrition question, you preface with "I'm Taiwanese, my diet is rice-heavy". That's exactly what system prompts solve — set the background context once, then talk shop.

This article takes the same gemma4:e2b (local Ollama) and configures it as 4 domain experts. Can a 7B generic model become a usable specialist with the right prompt?

實驗：同一個 gemma4，四種角色

角色	領域	System Prompt 重點
Tax Advisor	台灣綜所稅	引用財政部規定、提醒查證、不給投資建議
Legal Helper	台灣租賃法	引用內政部定型化契約、提醒尋求律師
Nutritionist	飲食建議	台灣飲食習慣、強調醫師確診優先
Support Lead	客服 escalation	同理→確認→方案三步驟、避免承諾賠償

每個 prompt 約 30 行，包含：身份、知識邊界、回答風格、必要免責聲明。

Experiment: Same gemma4, Four Roles

System prompts for four roles compared side-by-side

Role	Domain	System Prompt Focus
Tax Advisor	Taiwan income tax	Quote MOF regulations, recommend verification, no investment advice
Legal Helper	Taiwan rental law	Quote MOI standard contracts, recommend consulting a lawyer
Nutritionist	Diet advice	Taiwan eating habits, defer to doctor diagnosis
Support Lead	Customer support escalation	Empathy → Confirm → Solve, no compensation promises

Each prompt is ~30 lines covering: identity, knowledge boundaries, response style, mandatory disclaimers.

角色 A：稅務顧問

問題：「我去年薪資 80 萬、房租支出 12 萬，可以列舉扣除嗎？」

通用 gemma4 給的是泛泛的「依照規定可以扣除」，沒有金額上限與條件。角色化 prompt 引用財政部 113 年規定：每年上限 NT$18 萬，需房東配合開立繳款證明，且要在 5 月前完成設算扣除選項比較。

Role A: Tax Advisor

Tax role response vs generic gemma4 response

Question: "Last year I made NT$800K, paid NT$120K in rent — can I itemize that as a deduction?"

Generic gemma4 gives a vague "yes, per regulations." The role-prompted version cites Taiwan MOF 113 rules: NT$180K annual cap, requires landlord cooperation on payment certificates, must be compared against standard deduction before May filing.

角色 B：租賃法律助手

問題：「房東在我簽的契約裡加了『承租人不得申請任何政府補貼』，這合法嗎？」

通用版本：「應該不太合法，建議找律師。」角色版本：直接引用內政部定型化契約「不得記載事項」第 10 點——這條無效，且可向地方政府消保官申訴。

Role B: Rental Law Helper

Question: "My landlord added 'tenant shall not apply for any government subsidy' to the contract — is that legal?"

Generic version: "Probably not, consult a lawyer." Role version: cites MOI standard contract Forbidden Clause #10 directly — this clause is unenforceable and can be reported to the local consumer protection office.

角色 C：營養師

問題：「我是糖尿病前期，每天應該吃多少碳水？」

通用版本傾向直接給數字（45-65%），無風險提醒。角色版本：先說「請先看新陳代謝科醫師」，再提供 ADA 建議範圍 + 台灣常見糖友飲食地雷（粥、芋圓、市售飲料）。

Role C: Nutritionist

Question: "I'm pre-diabetic, how many carbs should I eat per day?"

Generic version: gives numbers (45-65%), no risk caveat. Role version: leads with "please see an endocrinologist first," then provides ADA-recommended range plus Taiwan-specific common pitfalls (congee, taro balls, sweetened beverages).

角色 D：客服 escalation 助手

情境：客戶投訴外送送錯餐、要求賠償雙倍。

通用版本：直接道歉並提議退款。角色版本：依照「同理→確認→方案」三步驟——先共情、確認損失、提供「重送 + NT$100 抵用券」這種可控的補償框架，不承諾雙倍。

Role D: Support Escalation Lead

Scenario: customer received wrong delivery, demands double refund.

Generic version: apologizes and offers refund. Role version: follows "Empathize → Confirm → Solve" steps — empathy first, confirm loss, propose controlled compensation ("redeliver + NT$100 voucher") without committing to double.

評分結果

用 gemma4 自己當 LLM-as-judge，5 維度 10 分制評分 60 個回答（每角色 12 個）：

維度	通用	稅務	法律	營養	客服
領域準確性	9.33	9.00	9.33	9.00	9.75
引用具體性	3.83	4.25	5.67	3.50	5.83
風險意識	9.75	9.92	10.00	9.92	10.00
在地化	8.58	9.25	9.75	9.58	9.33
行動可執行	9.92	9.42	9.92	9.67	10.00
總分（/50）	41.42	41.83	44.67	41.67	44.92

反直覺發現：通用版總分 41.42，角色版平均 43.27，差距僅 1.85 分（4.5%）。但實際看內容——通用版開頭「我無法提供正式建議」，角色版直接列方案——質的差距遠大於分數差距。

這揭露了 LLM-as-judge 的友善偏誤：gemma4 評自己生成的內容時，傾向給高分（accuracy、risk、actionable 三個維度都 9 分以上），難拉開差距。唯一拉開的是「引用具體性」——角色 prompt 強制引用條號的效果。

Scoring Results

4 roles × 5 dimensions scoring table with LLM-as-judge bias analysis

Used gemma4 itself as LLM-as-judge, scoring 60 answers (12 per role) on a 10-point scale:

Dimension	Generic	Tax	Legal	Nutrition	Support
Domain Accuracy	9.33	9.00	9.33	9.00	9.75
Citation Specificity	3.83	4.25	5.67	3.50	5.83
Risk Awareness	9.75	9.92	10.00	9.92	10.00
Localization	8.58	9.25	9.75	9.58	9.33
Actionability	9.92	9.42	9.92	9.67	10.00
Total (/50)	41.42	41.83	44.67	41.67	44.92

Counterintuitive finding: Generic scored 41.42 vs role average 43.27 — only a 1.85-point gap (4.5%). But the content quality gap is far larger: generic opens with "I cannot provide formal advice," role-specific versions list concrete options immediately.

This reveals LLM-as-judge friendliness bias: gemma4 scoring its own generations tends toward high scores (accuracy, risk, actionability all above 9.0), making differentiation hard. The only dimension that genuinely separated was Citation Specificity — direct evidence that role prompts forcing condition-code references actually work.

決策框架：什麼任務適合「角色化 prompt」

值得做的情境：

重複性高的對話（每天問同類問題）
領域有明確邊界（稅務 / 法律 / 客服等）
你能寫出該領域的「免責聲明」（這代表你了解風險邊界）

不要期待的事：

補不了訓練資料的時間性（2024 後的新法規）
救不了數學能力（複雜計算照樣會算錯）
不能取代專業認證（律師、醫師、會計師）

和 RAG / Fine-tune 的關係：

Prompt engineering 是 0 成本起點，先試這個
RAG 補「最新資訊」的不足（下一篇文章）
Fine-tune 改變模型行為，門檻最高

Decision Framework: When Role Prompting Pays Off

Worth doing when:

Conversations repeat (you ask similar questions daily)
Domain has clear boundaries (tax / legal / support)
You can write that domain's "disclaimer" (proves you understand the risk boundary)

Don't expect it to:

Fix knowledge cutoff (regulations after 2024)
Fix math weakness (complex calculations still fail)
Replace professional certification (lawyer, doctor, CPA)

Relationship to RAG / Fine-tune:

Prompt engineering is the zero-cost starting point — try it first
RAG fills the "current information" gap (next article)
Fine-tuning changes model behavior — highest barrier

Verdict

Signature № 03 · INTERFERENCE

角色化 prompt 內容質感明顯提升，但 LLM-as-judge 給的分數只差 4.5%——揭露自評偏誤的工程現實。Role prompts dramatically improve content quality, but LLM-as-judge only shows a 4.5% score gap — exposing the self-evaluation bias problem.

用同一個 gemma4:e2b 配 4 種系統 prompt（稅務、法律、營養、客服），對 12 個問題跑出 60 個回答，再用 gemma4 自己當評審打 5 維度分數。結果：通用版 41.42 分，角色版平均 43.27（差距 1.85 分）。但看內容——通用版「我無法提供」、角色版直接列方案——質的差距遠大於分數。唯一拉開的維度是「引用具體性」（角色版 5.83 vs 通用 3.83）。結論：LLM-as-judge 有友善偏誤，難客觀評估；要量化提升，最好的指標是「結構化合規率」而非「整體分數」。

Used one gemma4:e2b with 4 system prompts (tax, legal, nutrition, support). 60 answers across 12 questions, all scored by gemma4 itself across 5 dimensions. Generic: 41.42. Role average: 43.27. A mere 1.85-point gap (4.5%). But the content quality gap is far larger — generic refuses, roles list options. The one dimension that genuinely separated was Citation Specificity (5.83 role vs 3.83 generic). Conclusion: LLM-as-judge has a friendliness bias and struggles to differentiate. For quantifying improvements, structural compliance rate is a better metric than overall scores.