Featured Teardown · Local AI Signature № 02 · STEADY-PULSE

Which Output Formats Can Local LLMs Hit Cleanly? JSON/YAML/XML/CSV/Mermaid Measured

When you ask a local LLM for structured output, there is always a "should work, right?" gamble. JSON is usually fine, YAML hits the occasional landmine, XML depends on the day, and with Mermaid, good luck. This piece runs 7 formats × 2 local models × 5 repetitions, machine-validates whether each output can be fed straight to a downstream parser, and gives you a defensible decision table.


Setup

| Item | Spec |
| --- | --- |
| Machine | MacBook Pro 14" M1 Pro, 16 GB |
| Ollama | 0.30.6 |
| Models | gemma4:e2b (Q4_K_M, 7.2 GB) + gemma4:12b (Q4_K_M, 7.6 GB) |
| Temperature | 0.2 |
| Repetitions | 5 per combination |
| Total runs | 7 formats × 2 models × 5 = 70 |

Shared task: "List three books you'd recommend to a beginner programmer, each with title, author, and rationale (≤30 characters)." The task itself is structurally simple; the test is whether the model honors the requested format, not whether the book picks are any good.

7 formats:

  1. JSON: {"books": [{"title", "author", "why"}, ...]}
  2. YAML: same shape in YAML
  3. XML: <books><book><title>...</title>...</book></books>
  4. Markdown table: header row + 3 data rows
  5. CSV: header + 3 rows with quote escaping
  6. Mermaid: mindmap syntax
  7. Plain (control): free-form, one book per line

Validation (all machine-decided):

| Format | Validator |
| --- | --- |
| JSON | json.loads succeeds + has books[3], each with title/author/why |
| YAML | yaml.safe_load succeeds + same shape |
| XML | ElementTree.fromstring succeeds + 3 <book>, each with title/author/why |
| Markdown table | regex finds header + separator + ≥3 data rows × ≥3 cols |
| CSV | csv.reader parses + header has title/author/why + ≥3 data rows × ≥3 cols |
| Mermaid | first line is mindmap/graph/flowchart + ≥4 content lines |
| Plain | ≥3 non-empty lines (basically always passes) |
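
The JSON, XML, and CSV checks in the table can be sketched with the standard library alone. This is an illustration rather than the contents of run_format_test.py, which is not shown here: the function names and the REQUIRED tuple are hypothetical, and reading "books[3]" as "exactly three entries" is an assumption.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Field names taken from the validator table; the constant name is illustrative.
REQUIRED = ("title", "author", "why")


def validate_json(text: str) -> bool:
    """Pass if text parses as JSON and holds three complete book objects."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    books = data.get("books") if isinstance(data, dict) else None
    # Assumption: "books[3]" means exactly three entries.
    if not isinstance(books, list) or len(books) != 3:
        return False
    return all(isinstance(b, dict) and all(k in b for k in REQUIRED) for b in books)


def validate_xml(text: str) -> bool:
    """Pass if text parses as XML with three <book> elements carrying all fields."""
    try:
        root = ET.fromstring(text.strip())
    except ET.ParseError:
        return False
    books = list(root.iter("book"))
    return len(books) == 3 and all(
        b.find(k) is not None for b in books for k in REQUIRED
    )


def validate_csv(text: str) -> bool:
    """Pass if text has a header with all fields and >= 3 rows of >= 3 columns."""
    try:
        rows = [r for r in csv.reader(io.StringIO(text.strip(), newline="")) if r]
    except csv.Error:
        return False
    if len(rows) < 4:  # header plus at least three data rows
        return False
    header = [cell.strip().lower() for cell in rows[0]]
    if not all(k in header for k in REQUIRED):
        return False
    return all(len(r) >= 3 for r in rows[1:])
```

Each function returns a plain boolean, so a harness can record pass or fail per run without interpreting parser exceptions itself.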

Lenience: if the model wraps its output in markdown fences (```json ... ```), we strip them before parsing. Any downstream caller can do this easily, so it counts as an "acceptable blemish." Leading or trailing prose ("Here's the JSON:") breaks the parser and counts as FAIL.
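
A minimal sketch of that lenience step, assuming the model emits at most one fence that wraps the whole reply. The helper name strip_fences is hypothetical, and the article's actual script may handle this differently.

```python
import re

# One fence wrapping the entire reply: optional language tag, body, closing fence.
_FENCE = re.compile(r"^```[\w+-]*[ \t]*\n(.*?)\n?```[ \t]*$", re.DOTALL)


def strip_fences(text: str) -> str:
    """Return the body of a wrapping markdown fence, or the trimmed text as-is."""
    stripped = text.strip()
    match = _FENCE.match(stripped)
    return match.group(1).strip() if match else stripped
```

Prose before or after the fence is deliberately left in place: the pattern only matches when the fence spans the whole reply, so "Here's the JSON:" still reaches the parser and fails there, as described above.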

Script: tests/scripts/run_format_test.py. Prompts: tests/data/format_prompts.json. Results: tests/outputs/format_results.jsonl.


Pass rate matrix

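| Format | gemma4:e2b | gemma4:12b |
| --- | --- | --- |
| JSON | 5/5 | 5/5 |
| YAML | 3/5 | 5/5 |
| XML | 5/5 | 4/5 |
| Markdown table | 5/5 | 4/5 |
| CSV | 5/5 | 4/5 |
| Mermaid | 5/5 | 4/5 |
| Plain | 5/5 | 4/5 |
| Total | 33/35 (94%) | 30/35 (86%) |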

The first-glance reading, "12B is less reliable than e2b," is misleading: only the YAML row captures a real format-compliance gap. The five 4/5 cells for 12B all share one failure mode: the model returned an empty string.
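
To keep the two failure modes apart when reading the results, a tally along these lines may help. The record fields (model, format, passed, response) are assumptions about what format_results.jsonl contains, not its documented schema, and the function names are illustrative.

```python
import json
from collections import Counter


def classify(record: dict) -> str:
    """Label one run as pass, empty (no output at all), or format_error."""
    if record["passed"]:
        return "pass"
    if not record["response"].strip():
        return "empty"
    return "format_error"


def tally(records) -> Counter:
    """Count outcomes per (model, format, outcome) triple."""
    return Counter((r["model"], r["format"], classify(r)) for r in records)


def load_jsonl(path: str) -> list:
    """Read one JSON object per non-blank line (field names are assumed)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

With a split like this, an empty reply shows up under "empty" and is never mistaken for a format error.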


Two kinds of failure: format errors vs the model not responding

e2b YAML failures (×2): book titles with colons

Both YAML failures had the same cause. The model produced:

books:
  - title: Clean Code: A Handbook of Agile Software Craftsmanship
    author: Robert C. Martin
    why: ...

The YAML parser breaks at title: Clean Code: because a colon followed by a space inside an unquoted value reads as the start of a second mapping. The value needs quotes:

books:
  - title: "Clean Code: A Handbook of Agile Software Craftsmanship"

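This isn't a random LLM glitch; it's YAML's inherent fragility around string content. An unquoted value that contains a colon followed by a space, contains a space followed by #, or starts with "- ", { or [ will either fail to parse or be read as something other than the intended string. 12B reflexively quotes such values; e2b doesn't. That's e2b's only structural weakness in this test.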
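
If YAML from a small model is unavoidable, one option is a pre-parse repair pass that quotes risky values. The sketch below is a line-based heuristic under stated assumptions, not a YAML parser: it only handles single-line "key: value" pairs, the name quote_risky_scalars is hypothetical, and it was not part of the test described here.

```python
import json
import re

# "key: value" on one line, with optional indentation and a "- " list marker.
_KEY_VALUE = re.compile(r"^(\s*(?:-\s+)?[A-Za-z_][\w-]*:\s+)(.+?)\s*$")


def quote_risky_scalars(yaml_text: str) -> str:
    """Double-quote unquoted values that contain ': ' or ' #'."""
    fixed = []
    for line in yaml_text.splitlines():
        match = _KEY_VALUE.match(line)
        if match:
            prefix, value = match.groups()
            # Skip values that are already quoted or use other YAML syntax.
            has_syntax = value[0] in "\"'[{|>&*!"
            risky = ": " in value or " #" in value
            if risky and not has_syntax:
                # A JSON string literal is also a valid YAML double-quoted scalar.
                line = prefix + json.dumps(value, ensure_ascii=False)
        fixed.append(line)
    return "\n".join(fixed)
```

Applied to the failing output above, the title line becomes the quoted form while the author line is left untouched. Multi-line scalars and flow collections are out of scope for this heuristic.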

12B failures (×5): empty responses

XML rep 4, Markdown table rep 4, CSV rep 5, Mermaid rep 4, Plain rep 4 — every one returned an empty string (""). At first I assumed these were format errors; checking the raw response showed the model emitted no tokens.

Wall times for the empties:

| Failure | Wall time |
| --- | --- |
| XML rep 4 | 291.7 s |
| Markdown table rep 4 | 293.8 s |
| CSV rep 5 | 339.4 s |
| Mermaid rep 4 | 291.7 s |
| Plain rep 4 | 344.5 s |

Successful 12B runs typically take roughly 100-150 s; these took 290-340 s. After 30+ sustained calls, 12B on the M1 Pro 16 GB occasionally stalls, times out, and returns an empty string. That is an Ollama / runtime reliability issue under sustained load, not a format-compliance failure.

The practical fix is client-side: if response == "": retry. That's a bigger lever than format engineering.
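
A minimal sketch of that client-side guard. The generate argument stands in for whatever function sends the prompt to the local model; it is a placeholder, not an Ollama API call, and the attempt count and pause are arbitrary defaults.

```python
import time
from typing import Callable


def generate_with_retry(
    generate: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    pause_s: float = 0.0,
) -> str:
    """Call generate(prompt) until it returns non-blank text or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        response = generate(prompt)
        if response.strip():
            return response
        if attempt < max_attempts:
            # Optional pause to let a stalled runtime recover before retrying.
            time.sleep(pause_s)
    raise RuntimeError(f"empty response after {max_attempts} attempts")
```

Raising after the last attempt, rather than returning an empty string, keeps a persistent stall from passing silently into the downstream parser.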


Decision table

| Situation | Pick | Why |
| --- | --- | --- |
| Strict schema, no retry | JSON | 10/10 across both models, strictest parser, mature tooling |
| Tolerates one retry | JSON or YAML (on 12B) | YAML hits 5/5 on 12B; avoid on e2b |
| Human-facing list | Markdown table | 9/10 across both models (the one miss was an empty response), friendly to read |
| Best LLM portability | JSON | Likely the most heavily represented structured format in LLM training data (not measured here) |
| Diagrams / visualization | Mermaid (with retry) | 9/10 across both models, renders directly in many clients |
| Small local model (≤ 4B) | Avoid YAML | e2b breaks on unquoted values containing colons |
| High-frequency batch | JSON + retry-on-empty | Sustained 12B runs produce occasional empties; need a fallback |

Overall: JSON is the no-brainer default. Other formats have their niches, but they're all the second pick after JSON.


Verdict

Waveform: STEADY_PULSE

Two takeaways:

  1. JSON is the default. No situation in this test would make you regret it.
  2. "Format failure" and "model didn't respond" are two distinct problems. Treat them separately:

     - Format failure → strict validator + retry
     - Empty response → timeout + retry-on-empty
     - Together, the two retries should push coverage above 99%
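
The two retries can share one loop while still being logged separately. This is a sketch under assumptions: generate and validate are caller-supplied placeholders (a model call and a format check), the name run_validated is hypothetical, and timeouts are left to whatever client makes the actual request.

```python
from typing import Callable, List, Optional, Tuple


def run_validated(
    generate: Callable[[str], str],
    validate: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Retry on empty or invalid output; return the result and a failure log."""
    failures: List[str] = []
    for _ in range(max_attempts):
        response = generate(prompt)
        if not response.strip():
            failures.append("empty")         # runtime problem: nothing came back
        elif not validate(response):
            failures.append("format_error")  # model problem: output did not parse
        else:
            return response, failures
    return None, failures
```

The failure log makes it possible to see afterwards whether a pipeline is suffering from runtime stalls or from genuine format drift.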

If you're designing a local LLM pipeline:

  • Default to JSON (with a tight schema)
  • e2b: don't touch YAML
  • 12B in batch (30+ calls): add retry-on-empty
  • Mermaid / Markdown table are "for humans" formats, not production paths
