Setup
| Item | Spec |
|---|---|
| Machine | MacBook Pro 14" M1 Pro, 16GB |
| Ollama | 0.30.6 |
| Models | gemma4:e2b (Q4_K_M, 7.2 GB) + gemma4:12b (Q4_K_M, 7.6 GB) |
| Temperature | 0.2 |
| Repetitions | 5 per combination |
| Total runs | 7 formats × 2 models × 5 = 70 |
Shared task: "List three books you'd recommend to a beginner programmer, each with title, author, and rationale (≤30 chars)." The task is structurally simple on purpose: the test is whether the model honors the requested format, not whether the book picks are good.
7 formats:
- JSON: `{"books": [{"title", "author", "why"}, ...]}`
- YAML: same shape in YAML
- XML: `<books><book><title>...</title>...</book></books>`
- Markdown table: header row + 3 data rows
- CSV: header + 3 rows with quote escaping
- Mermaid: mindmap syntax
- Plain (control): free-form, one book per line
Validation (all machine-decided):
| Format | Validator |
|---|---|
| JSON | json.loads succeeds + has books[3] each with title/author/why |
| YAML | yaml.safe_load succeeds + same shape |
| XML | ElementTree.fromstring succeeds + 3 <book> each with title/author/why |
| Markdown table | regex finds header + separator + ≥3 data rows × ≥3 cols |
| CSV | csv.reader parses + header has title/author/why + ≥3 data rows × ≥3 cols |
| Mermaid | first line is mindmap/graph/flowchart + ≥4 content lines |
| Plain | ≥3 non-empty lines (basically always passes) |
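The validator rules in the table can be sketched with the standard library alone. This is a minimal reconstruction from the table, not the contents of run_format_test.py: the function names, the `REQUIRED` tuple, and the separator regex are all assumptions made for illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

REQUIRED = ("title", "author", "why")  # assumed field names, taken from the table

def validate_xml(text: str) -> bool:
    """Pass if the text parses and holds 3 <book> elements with all required children."""
    try:
        root = ET.fromstring(text.strip())
    except ET.ParseError:
        return False
    books = root.findall(".//book")
    return len(books) == 3 and all(
        book.find(tag) is not None for book in books for tag in REQUIRED
    )

def validate_csv(text: str) -> bool:
    """Pass if the header names every required field and >= 3 data rows have >= 3 columns."""
    try:
        rows = [row for row in csv.reader(io.StringIO(text.strip())) if row]
    except csv.Error:
        return False
    if not rows:
        return False
    header = [cell.strip().lower() for cell in rows[0]]
    if not all(name in header for name in REQUIRED):
        return False
    data = rows[1:]
    return len(data) >= 3 and all(len(row) >= 3 for row in data)

# A separator row such as |---|---|---| (optional colons for alignment).
_SEPARATOR = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?$")

def _cells(line: str) -> list:
    """Split one table row into trimmed cells, ignoring the outer pipes."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]

def validate_markdown_table(text: str) -> bool:
    """Pass if a header row, a separator row, and >= 3 data rows of >= 3 cells are found."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    for i in range(1, len(lines)):
        if _SEPARATOR.match(lines[i]) and "|" in lines[i - 1]:
            data = [ln for ln in lines[i + 1:] if "|" in ln]
            return len(data) >= 3 and all(len(_cells(ln)) >= 3 for ln in data)
    return False
```

Each validator returns a plain boolean, so a pass/fail matrix can be filled in without any human judgment, which is the point of "all machine-decided".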
Lenience: if the model wraps its output in markdown fences (```` ```json ... ``` ````), we strip them before parsing. Any downstream caller can do the same easily, so this counts as an "acceptable blemish." Leading or trailing prose ("Here's the JSON:") breaks the parser and counts as FAIL.
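A minimal sketch of that lenience rule, assuming the fence wraps the entire response. `strip_fences` and `validate_json` are illustrative names, not functions from the test script, and the shape check simply follows the JSON row of the validation table.

```python
import json
import re

# One opening fence with an optional language tag, the payload, one closing fence.
_FENCE = re.compile(r"^```[\w-]*[ \t]*\n(.*?)\n?```$", re.DOTALL)

def strip_fences(text: str) -> str:
    """Remove a single wrapping markdown fence; leave everything else untouched."""
    stripped = text.strip()
    match = _FENCE.match(stripped)
    return match.group(1).strip() if match else stripped

def validate_json(raw: str) -> bool:
    """Pass if the (unfenced) text parses and has 3 books with title/author/why."""
    try:
        data = json.loads(strip_fences(raw))
    except json.JSONDecodeError:
        return False
    books = data.get("books") if isinstance(data, dict) else None
    if not isinstance(books, list) or len(books) != 3:
        return False
    return all(
        isinstance(book, dict) and all(key in book for key in ("title", "author", "why"))
        for book in books
    )
```

Because the pattern is anchored at the start of the response, a prose prefix is not stripped and still fails the parse, matching the FAIL rule above.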
Script: tests/scripts/run_format_test.py. Prompts: tests/data/format_prompts.json. Results: tests/outputs/format_results.jsonl.
Pass rate matrix
| Format | gemma4:e2b | gemma4:12b |
|---|---|---|
| JSON | 5/5 | 5/5 |
| YAML | 3/5 | 5/5 |
| XML | 5/5 | 4/5 |
| Markdown table | 5/5 | 4/5 |
| CSV | 5/5 | 4/5 |
| Mermaid | 5/5 | 4/5 |
| Plain | 5/5 | 4/5 |
| Total | 33/35 (94%) | 30/35 (86%) |
The first-glance reading, "12B is less reliable than e2b," is misleading: only the YAML row reflects a real format-compliance gap. The five 4/5 cells for 12B (XML, Markdown table, CSV, Mermaid, Plain) all share one failure mode, in which the model returned an empty string.
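One way to keep the two failure modes apart when reading the results is to label each run before counting. The sketch below assumes each record carries `model`, `passed`, and `response` fields; those names are hypothetical, since the actual schema of format_results.jsonl is not shown here.

```python
from collections import Counter

def classify(record: dict) -> str:
    """Label one run as 'pass', 'empty' (no tokens came back), or 'format'."""
    if record["passed"]:
        return "pass"
    return "empty" if not record["response"].strip() else "format"

def summarize(records: list) -> dict:
    """Count outcome labels per model."""
    summary = {}
    for record in records:
        summary.setdefault(record["model"], Counter())[classify(record)] += 1
    return summary

# Illustrative records only, not rows from the real results file.
sample = [
    {"model": "gemma4:12b", "passed": False, "response": ""},
    {"model": "gemma4:e2b", "passed": False, "response": "books:\n- title: a: b"},
    {"model": "gemma4:e2b", "passed": True, "response": "books: []"},
]
summary = summarize(sample)
```

Splitting the counts this way makes it visible at a glance whether a low pass rate comes from bad formatting or from runs that produced nothing at all.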
Two kinds of failure: format errors vs the model not responding
e2b YAML failures (×2): book titles with colons
Both YAML failures had the same cause. The model produced:

    books:
      - title: Clean Code: A Handbook of Agile Software Craftsmanship
        author: Robert C. Martin
        why: ...

The YAML parser breaks at title: Clean Code: A Handbook, because a colon followed by a space inside an unquoted value reads as a second key. The value needs quoting. Correct version:

    books:
      - title: "Clean Code: A Handbook of Agile Software Craftsmanship"

This isn't an LLM glitch; it's YAML's inherent fragility around string content. An unquoted value that contains a colon followed by a space, contains a # after whitespace, or has a leading - or { can break parsing or be silently misread. 12B reflexively quotes such values; e2b doesn't. That's e2b's only structural weakness in this test.
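If a pipeline has to accept YAML from a model that does not quote, one option is a repair pass before parsing. The sketch below is an assumption-heavy illustration, not part of the test: `quote_risky_scalars` is a hypothetical helper, it only handles simple one-line `key: value` pairs, and it relies on JSON string syntax being valid as a YAML double-quoted scalar.

```python
import json
import re

# "key: value" on one line, optionally preceded by indentation and a "- " list marker.
_KV_LINE = re.compile(r"^(?P<prefix>\s*(?:-\s+)?[A-Za-z_][\w-]*:\s+)(?P<value>.+?)\s*$")

def quote_risky_scalars(yaml_text: str) -> str:
    """Double-quote unquoted values that contain ': ' or ' #'."""
    fixed = []
    for line in yaml_text.splitlines():
        match = _KV_LINE.match(line)
        if match:
            value = match.group("value")
            # Leave quoted strings, flow collections, and block scalars alone.
            leave_alone = value[0] in "\"'[{|>"
            risky = ": " in value or " #" in value
            if risky and not leave_alone:
                line = match.group("prefix") + json.dumps(value, ensure_ascii=False)
        fixed.append(line)
    return "\n".join(fixed)

broken = (
    "books:\n"
    "  - title: Clean Code: A Handbook of Agile Software Craftsmanship\n"
    "    author: Robert C. Martin\n"
    "    why: ..."
)
repaired = quote_risky_scalars(broken)
```

A repair pass like this is a patch, not a cure: it cannot know what the model meant in ambiguous cases, so switching the format to JSON stays the more dependable route.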
12B failures (×5): empty responses
XML rep 4, Markdown table rep 4, CSV rep 5, Mermaid rep 4, Plain rep 4 — every one returned an empty string (""). At first I assumed these were format errors; checking the raw response showed the model emitted no tokens.
Wall times for the empties:
| Failure | Wall time |
|---|---|
| XML rep 4 | 291.7 s |
| Markdown table rep 4 | 293.8 s |
| CSV rep 5 | 339.4 s |
| Mermaid rep 4 | 291.7 s |
| Plain rep 4 | 344.5 s |
Successful 12B runs have a median wall time of roughly 100-150 s; the empty responses took 290-340 s. After 30+ consecutive calls, 12B on the M1 Pro 16GB occasionally stalls, times out, and returns an empty string. That points to a reliability problem in Ollama or the inference runtime under sustained load, not a format-compliance failure.
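The practical fix is client-side: retry whenever the response comes back empty (if response == "": retry). In this test, that is a bigger lever than format engineering, since empties caused five of the seven failures.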
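A minimal sketch of retry-on-empty, assuming the model call is wrapped in a function that takes a prompt and returns a string. The `generate` callable, `max_attempts`, and `backoff_s` are placeholders for illustration; nothing here is tied to a specific Ollama client.

```python
import time
from typing import Callable

def generate_with_retry(
    generate: Callable[[str], str],  # hypothetical wrapper around the model call
    prompt: str,
    max_attempts: int = 3,
    backoff_s: float = 2.0,
) -> str:
    """Call generate() until it returns non-empty text or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = generate(prompt)
        if response.strip():
            return response
        if attempt < max_attempts:
            # Give the runtime a moment to recover before trying again.
            time.sleep(backoff_s * attempt)
    raise RuntimeError(f"empty response after {max_attempts} attempts")
```

Whitespace-only output is treated as empty here, on the assumption that it is as useless to a downstream parser as a true empty string.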
Decision table
| Situation | Pick | Why |
|---|---|---|
| Strict schema, no retry | JSON | 10/10 across both models, strictest parser, mature tooling |
| Tolerates one retry | JSON or YAML (on 12B) | YAML hits 5/5 on 12B; avoid on e2b |
| Human-facing list | Markdown table | 9/10 across both models, friendly to read |
| Best LLM portability | JSON | Likely the most heavily represented structured format in LLM training data |
| Diagrams / visualization | Mermaid (with retry) | 9/10 across both models, renders directly in many clients |
| Small local model (≤ 4B) | Avoid YAML | e2b will break on values containing colons |
| High-frequency batch | JSON + retry-on-empty | Sustained 12B runs produce occasional empties; need a fallback |
Overall: JSON is the no-brainer default. Other formats have their niches, but they're all the second pick after JSON.
Verdict
Waveform: STEADY_PULSE
Two takeaways:
- JSON is the default. No situation in this test would make you regret it.
- "Format failure" and "model didn't respond" are two distinct problems. Treat them separately:
  - Format failure → strict validator + retry
  - Empty response → timeout + retry-on-empty
  - Combined coverage exceeds 99%
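The two retry paths can share one loop while still being logged separately. This is a sketch under assumptions: `generate` and `validate` are hypothetical callables supplied by the caller, and a timeout is assumed to surface as a `TimeoutError` raised by `generate`.

```python
from typing import Callable, List, Optional, Tuple

def run_with_retries(
    generate: Callable[[str], str],   # hypothetical model call; may raise TimeoutError
    validate: Callable[[str], bool],  # strict format validator for the chosen format
    prompt: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Return the first valid response plus a log of why earlier attempts failed."""
    failures: List[str] = []
    for _ in range(max_attempts):
        try:
            response = generate(prompt)
        except TimeoutError:
            failures.append("timeout")
            continue
        if not response.strip():
            failures.append("empty")    # the model emitted no tokens
        elif not validate(response):
            failures.append("format")   # tokens came back, but the validator rejected them
        else:
            return response, failures
    return None, failures
```

Keeping the failure log lets a batch job report how many retries were spent on empties versus format errors, instead of folding both into a single pass rate.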
If you're designing a local LLM pipeline:
- Default to JSON (with a tight schema)
- e2b: don't touch YAML
- 12B in batch (30+ calls): add retry-on-empty
- Mermaid / Markdown table are "for humans" formats, not production paths