Featured Teardown · Local AI Signature № 02 · STEADY-PULSE

Which Output Formats Can Local LLMs Hit Cleanly? JSON/YAML/XML/CSV/Mermaid Measured

Josh Chen · June 9, 2026 · 12 min read

Asking a local LLM for structured output always feels like a bet: "should work, right?" JSON is usually fine, YAML hides occasional landmines, XML depends on the day, and Mermaid is a matter of luck. This piece runs 7 formats × 2 local models × 5 repetitions, machine-validates whether each output can be fed straight to a downstream parser, and gives you a defensible decision table.

Setup

Item	Spec
Machine	MacBook Pro 14" M1 Pro, 16GB
Ollama	0.30.6
Models	`gemma4:e2b` (Q4_K_M, 7.2 GB) + `gemma4:12b` (Q4_K_M, 7.6 GB)
Temperature	0.2
Repetitions	5 per combination
Total runs	7 formats × 2 models × 5 = 70

Shared task: "List three books you'd recommend to a beginner programmer, each with title, author, and rationale (≤30 chars)." The task itself is simple — the test is whether the model honors the requested format.

7 formats:

JSON: {"books": [{"title", "author", "why"}, ...]}
YAML: same shape in YAML
XML: <books><book><title>...</title>...</book></books>
Markdown table: header row + 3 data rows
CSV: header + 3 rows with quote escaping
Mermaid: mindmap syntax
Plain (control): free-form, one book per line
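For readers who want to reproduce the run plan, here is a minimal sketch of how the 70 calls could be enumerated and sent to a local Ollama server. It is not the article's `run_format_test.py`; the helper names (`build_payload`, `plan_runs`, `generate`), the format labels, and the 600-second timeout are assumptions made for illustration. The endpoint and the `stream` / `options.temperature` fields follow Ollama's REST API.

```python
import itertools
import json
import urllib.request

MODELS = ["gemma4:e2b", "gemma4:12b"]
# Hypothetical labels for the seven formats listed above.
FORMATS = ["json", "yaml", "xml", "markdown_table", "csv", "mermaid", "plain"]
REPETITIONS = 5
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Request body for a single non-streaming generation."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }


def plan_runs() -> list[tuple[str, str, int]]:
    """Every (model, format, repetition) combination: 2 x 7 x 5 = 70."""
    return list(itertools.product(MODELS, FORMATS, range(1, REPETITIONS + 1)))


def generate(model: str, prompt: str, timeout_s: float = 600.0) -> str:
    """Send one prompt to the local server and return the raw response text."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"]
```

Keeping the payload builder separate from the network call makes the plan easy to check without a running server.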

Validation (all machine-decided):

Format	Validator
JSON	`json.loads` succeeds + has books[3] each with title/author/why
YAML	`yaml.safe_load` succeeds + same shape
XML	`ElementTree.fromstring` succeeds + 3 `<book>` each with title/author/why
Markdown table	regex finds header + separator + ≥3 data rows × ≥3 cols
CSV	`csv.reader` parses + header has title/author/why + ≥3 data rows × ≥3 cols
Mermaid	first line is `mindmap`/`graph`/`flowchart` + ≥4 content lines
Plain	≥3 non-empty lines (basically always passes)
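To make the table concrete, here is a rough sketch of three of those validators using only the standard library. The function names and the exact shape checks are my assumptions about how such rules could be written, not a copy of the test script; YAML is left out because it needs the third-party PyYAML package.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

FIELDS = ("title", "author", "why")


def valid_json(text: str) -> bool:
    """Parses as JSON and has exactly 3 books, each with all fields."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    books = data.get("books") if isinstance(data, dict) else None
    return (
        isinstance(books, list)
        and len(books) == 3
        and all(isinstance(b, dict) and all(f in b for f in FIELDS) for b in books)
    )


def valid_xml(text: str) -> bool:
    """Parses as XML and has 3 <book> children, each with all fields."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    books = root.findall("book")
    return len(books) == 3 and all(
        b.find(f) is not None for b in books for f in FIELDS
    )


def valid_csv(text: str) -> bool:
    """Header names all fields, followed by at least 3 rows of 3+ columns."""
    try:
        rows = [row for row in csv.reader(io.StringIO(text)) if row]
    except csv.Error:
        return False
    if len(rows) < 4:
        return False
    header = [cell.strip().lower() for cell in rows[0]]
    return all(f in header for f in FIELDS) and all(len(r) >= 3 for r in rows[1:])
```

Each validator returns a plain boolean, so a leading line of prose fails in the same way as a structural error.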

Lenience: if the model wraps output in markdown fences (```json ... ```), we strip them before parsing. Any downstream caller can do this easily, so it counts as an "acceptable blemish." Prose before or after the payload ("Here's the JSON:") breaks the parser and counts as FAIL.
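The lenience rule could look something like the sketch below. The regex and the `strip_fences` name are assumptions; the point is only that a fence wrapping the whole response is removed, while surrounding prose is left in place so the parser still rejects it.

```python
import re

# Matches a response that is nothing but one fenced block, with an optional
# language tag after the opening fence.
FENCE_RE = re.compile(r"\A\s*```[A-Za-z0-9_+-]*[ \t]*\n(.*?)\n?```\s*\Z", re.DOTALL)


def strip_fences(text: str) -> str:
    """Return the fenced body if the whole response is fenced, else the trimmed text."""
    match = FENCE_RE.match(text)
    return match.group(1) if match else text.strip()
```

Because the pattern is anchored at both ends, a response like "Here's the JSON:" followed by a fence is returned unchanged and fails validation, which matches the FAIL rule.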

Script: tests/scripts/run_format_test.py. Prompts: tests/data/format_prompts.json. Results: tests/outputs/format_results.jsonl.

Pass rate matrix

Format	gemma4:e2b	gemma4:12b
JSON	5/5	5/5
YAML	3/5	5/5
XML	5/5	4/5
Markdown table	5/5	4/5
CSV	5/5	4/5
Mermaid	5/5	4/5
Plain	5/5	4/5
Total	33/35 (94%)	30/35 (86%)

The first-glance reading "12B is less reliable than e2b" is misleading: only the YAML row captures a real format-compliance gap, and that gap is on e2b. The five 4/5 cells on 12B all share one failure mode: the model returned an empty string.
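Re-adding the matrix cell by cell is a quick sanity check on the totals. The sketch below copies the per-format pass counts from the table; the `tally` helper is just an illustrative name.

```python
RUNS_PER_CELL = 5

# (e2b passes, 12b passes) per format, copied from the matrix above.
PASSES = {
    "JSON": (5, 5),
    "YAML": (3, 5),
    "XML": (5, 4),
    "Markdown table": (5, 4),
    "CSV": (5, 4),
    "Mermaid": (5, 4),
    "Plain": (5, 4),
}


def tally(passes: dict[str, tuple[int, int]]) -> dict[str, str]:
    """Sum each model's column and format it as 'passed/total (pct)'."""
    total = RUNS_PER_CELL * len(passes)
    result = {}
    for index, model in enumerate(("gemma4:e2b", "gemma4:12b")):
        passed = sum(cells[index] for cells in passes.values())
        result[model] = f"{passed}/{total} ({passed / total:.0%})"
    return result


print(tally(PASSES))
# {'gemma4:e2b': '33/35 (94%)', 'gemma4:12b': '30/35 (86%)'}
```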

Two kinds of failure: format errors vs the model not responding

e2b YAML failures (×2): book titles with colons

Both YAML failures had the same cause. The model produced:

books:
  - title: Clean Code: A Handbook of Agile Software Craftsmanship
    author: Robert C. Martin
    why: ...

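YAML's parser breaks at title: Clean Code: A Handbook because a colon followed by a space inside an unquoted value reads as the start of a second mapping key. The value needs quotes. Correct version: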

books:
  - title: "Clean Code: A Handbook of Agile Software Craftsmanship"

This is less a random LLM glitch than YAML's inherent fragility around unquoted strings. A value that contains a colon followed by a space, a # preceded by a space, or that starts with - or { can trip the parser or silently change meaning. 12B quoted such values in every run here; e2b did not. That's e2b's only structural weakness in this test.
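If you generate YAML yourself rather than trusting the model, one defensive option is to quote any scalar that looks risky. The sketch below is a conservative heuristic, not a full implementation of the YAML spec, and `needs_quotes` / `yaml_scalar` are hypothetical helper names. It leans on the fact that a JSON string literal is also a valid YAML double-quoted scalar.

```python
import json

# Characters that have special meaning at the start of a YAML plain scalar.
INDICATORS = "-?:,[]{}#&*!|>'\"%@`"
# Bare words that YAML loaders commonly turn into booleans or null.
RESERVED = {"true", "false", "null", "yes", "no", "on", "off", "~"}


def needs_quotes(value: str) -> bool:
    """Heuristic: True if the string is unsafe to emit as an unquoted scalar."""
    if value == "" or value != value.strip():
        return True
    if value[0] in INDICATORS:
        return True
    if ": " in value or value.endswith(":") or " #" in value:
        return True
    if value.lower() in RESERVED:
        return True
    try:
        float(value)  # "1984" would load as a number, not a string
        return True
    except ValueError:
        return False


def yaml_scalar(value: str) -> str:
    """Quote the value with JSON string syntax only when it looks risky."""
    return json.dumps(value, ensure_ascii=False) if needs_quotes(value) else value


print("title: " + yaml_scalar("Clean Code: A Handbook of Agile Software Craftsmanship"))
# title: "Clean Code: A Handbook of Agile Software Craftsmanship"
```

The heuristic over-quotes on purpose: an unnecessary pair of quotes is harmless, while a missing pair breaks the parse.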

12B failures (×5): empty responses

XML rep 4, Markdown table rep 4, CSV rep 5, Mermaid rep 4, Plain rep 4 — every one returned an empty string (""). At first I assumed these were format errors; checking the raw response showed the model emitted no tokens.

Wall times for the empties:

Failure	Wall time
XML rep 4	291.7 s
Markdown table rep 4	293.8 s
CSV rep 5	339.4 s
Mermaid rep 4	291.7 s
Plain rep 4	344.5 s

Median successful 12B wall times sit around 100-150 s; these failures took 290-340 s, roughly two to three times as long. After 30+ sustained calls, 12B on M1 Pro 16GB occasionally stalls, times out, and returns an empty string. That points to an Ollama / runtime reliability issue under sustained load, not a format-compliance failure.

The practical fix is client-side: if response == "": retry. That's a bigger lever than format engineering.
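A minimal sketch of that retry, assuming a `generate` callable that takes a prompt and returns the raw response text. The attempt limit and the linear backoff are illustrative defaults, not values measured in this test.

```python
import time
from typing import Callable


def generate_with_retry(
    generate: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    backoff_s: float = 2.0,
) -> str:
    """Call generate until it returns non-blank text or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = generate(prompt)
        if response.strip():
            return response
        if attempt < max_attempts:
            # Give the runtime a moment to recover before trying again.
            time.sleep(backoff_s * attempt)
    raise RuntimeError(f"empty response after {max_attempts} attempts")
```

Raising after the last attempt keeps a persistent stall visible instead of passing an empty string to the parser.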

Decision table

Situation	Pick	Why
Strict schema, no retry	JSON	10/10 across both models, strictest parser, mature tooling
Tolerates one retry	JSON or YAML (on 12B)	YAML hits 5/5 on 12B; avoid on e2b
Human-facing list	Markdown table	9/10 across both models, friendly to read
Best LLM portability	JSON	Heaviest representation in any LLM's training data
Diagrams / visualization	Mermaid (with retry)	9/10, renders directly in many clients
Small local model (≤ 4B)	Avoid YAML	e2b will break on values containing colons
High-frequency batch	JSON + retry-on-empty	Sustained 12B runs produce occasional empties; need a fallback

Overall: JSON is the no-brainer default. Other formats have their niches, but they're all the second pick after JSON.

Verdict

Waveform: STEADY_PULSE

Two takeaways:

JSON is the default. No situation in this test would make you regret it.
"Format failure" and "model didn't respond" are two distinct problems. Treat them separately:

- Format failure → strict validator + retry
- Empty response → timeout + retry-on-empty
- With both retries in place, combined coverage should exceed 99%

If you're designing a local LLM pipeline:

Default to JSON (with a tight schema)
e2b: don't touch YAML
12B in batch (30+ calls): add retry-on-empty
Mermaid / Markdown table are "for humans" formats, not production paths
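Put together, the checklist could be sketched as one loop that keeps the two failure types on separate budgets. Everything here is an assumption for illustration: the `classify` and `run_pipeline` names, the limits, and the shape-only JSON check (a real pipeline would validate the full schema).

```python
import json
from typing import Callable


def classify(raw: str) -> str:
    """Return 'empty', 'format_error', or 'ok' for a JSON-format response."""
    if not raw.strip():
        return "empty"
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "format_error"
    books = data.get("books") if isinstance(data, dict) else None
    # Shape-only check for brevity: a list of exactly three entries.
    return "ok" if isinstance(books, list) and len(books) == 3 else "format_error"


def run_pipeline(
    generate: Callable[[str], str],
    prompt: str,
    max_empty: int = 3,
    max_format: int = 2,
) -> tuple[dict, dict[str, int]]:
    """Retry until the output validates; give up once either budget is spent."""
    counts = {"empty": 0, "format_error": 0}
    limits = {"empty": max_empty, "format_error": max_format}
    while True:
        raw = generate(prompt)
        outcome = classify(raw)
        if outcome == "ok":
            return json.loads(raw), counts
        counts[outcome] += 1
        if counts[outcome] >= limits[outcome]:
            raise RuntimeError(f"gave up after {counts[outcome]} {outcome} results")
```

Returning the counters alongside the data makes it easy to log how often each failure type shows up in a long batch.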



JSON is the safest. When the other formats fail, it is often because the model returned nothing, not because it returned the wrong format.

70 inferences in: JSON passes 5/5 on both e2b and 12b. YAML passes 5/5 on 12b; e2b trips twice on book titles containing colons — the classic YAML escaping landmine. The counterintuitive finding: 12B fails XML / Markdown table / CSV / Mermaid / Plain once each — but every failure is the model returning an *empty string*, not malformed output. When 12B does respond, the format is nearly always correct; the cost of failure is silence, not garbage. Practical takeaway: prefer JSON, add timeout + retry, and you've covered 12B's inference-layer flakiness.
