Setup
| Item | Value |
|---|---|
| Machine | MacBook Pro 14" M1 Pro, 16GB |
| Ollama | 0.30.6 |
| Model | gemma4:e2b (Q4_K_M, 7.2GB) |
| Endpoint | http://localhost:11434/api/generate |
| Temperature | 0.2 |
| Repetitions | 5 per combination, after one warm-up call to avoid a cold model load |
Two prompts:
- Short: "Answer in one sentence: what is Traditional Chinese?"
- Long: "In 400-500 characters across at least three paragraphs, explain how Apple Silicon's unified memory affects local LLM inference."
Two modes:
- `stream: true`: push tokens to the client as they are generated
- `stream: false`: wait for generation to finish, then return one complete JSON object
Script: tests/scripts/run_streaming_test.py. Raw data: tests/outputs/results.jsonl.
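The actual script is not reproduced here, so the following is only a rough sketch of how TTFT and total time could be measured for a single call. The names `time_stream`, `open_stream`, and `clock` are hypothetical and not taken from run_streaming_test.py; the sketch assumes the response arrives as NDJSON lines with `response` and `done` fields, as in the client code later in this post.

```python
import json
import time
from typing import Callable, Iterable, Tuple


def time_stream(
    open_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Return (ttft_seconds, total_seconds, text) for one request.

    open_stream is a hypothetical callable that sends the request and
    returns an iterable of NDJSON lines. The clock starts before it is
    called so that request latency counts toward TTFT.
    """
    start = clock()
    ttft = None
    parts = []
    for line in open_stream():
        if not line.strip():
            continue  # ignore blank lines, if any
        chunk = json.loads(line)
        piece = chunk.get("response", "")
        if piece and ttft is None:
            ttft = clock() - start  # first non-empty text fragment
        parts.append(piece)
        if chunk.get("done"):
            break
    total = clock() - start
    # If no text fragment was seen, fall back to the total time.
    return (ttft if ttft is not None else total), total, "".join(parts)
```

With a real server, `open_stream` could be something like `lambda: urllib.request.urlopen(req)`, since the response object can be iterated line by line. Running this five times per combination and taking `statistics.median` of each column would give figures comparable in shape to those reported in this post.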
The numbers (median of 5)
| Scenario | TTFT (first token) | Total time | tokens/s | Output tokens |
|---|---|---|---|---|
| Short + stream | 7.69 s | 8.15 s | 44.5 | 349 |
| Short + blocking | 7.65 s | 7.65 s | 45.3 | 324 |
| Long + stream | 9.71 s | 20.13 s | 50.8 | 997 |
| Long + blocking | 19.48 s | 19.48 s | 50.9 | 968 |
Two things up front:
- tokens/s is nearly identical across modes (short 44-45, long 50-51): streaming doesn't make the model generate faster.
- Total time is nearly identical (within about 0.5-0.65 s): streaming doesn't reduce the total work.
The only thing streaming changes is when the first token arrives. On the long prompt, blocking leaves the client with nothing to show for 19.5 seconds, while streaming starts producing output at 9.7 seconds, so the user sees the first text in about half the time.
On the short prompt the gap disappears: the first token arrives at ~7.7 seconds in either mode, and the whole response finishes within about half a second of that, so streaming has no head start to offer.
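To make the trade-off concrete, here is a small worked calculation using only the medians from the table above. The helper names `head_start` and `total_overhead` are made up for this illustration.

```python
# Medians copied from the table above: (TTFT seconds, total seconds).
MEDIANS = {
    ("short", "stream"): (7.69, 8.15),
    ("short", "blocking"): (7.65, 7.65),
    ("long", "stream"): (9.71, 20.13),
    ("long", "blocking"): (19.48, 19.48),
}


def head_start(prompt: str) -> float:
    """Seconds earlier that streaming shows its first token than blocking."""
    stream_ttft, _ = MEDIANS[(prompt, "stream")]
    blocking_ttft, _ = MEDIANS[(prompt, "blocking")]
    return round(blocking_ttft - stream_ttft, 2)


def total_overhead(prompt: str) -> float:
    """Seconds by which the streaming total exceeds the blocking total."""
    _, stream_total = MEDIANS[(prompt, "stream")]
    _, blocking_total = MEDIANS[(prompt, "blocking")]
    return round(stream_total - blocking_total, 2)


for prompt in ("short", "long"):
    print(prompt, head_start(prompt), total_overhead(prompt))
```

On the long prompt this gives a head start of 9.77 s for 0.65 s of extra total time; on the short prompt the head start is -0.04 s (effectively zero) for 0.5 s of extra total time.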
Decision rule
| Expected response length | I'd use | Why |
|---|---|---|
| < 100 tokens / < 3s | blocking | TTFT and total time nearly coincide, so streaming buys nothing; blocking code is simpler |
| 100-300 tokens / 3-8s | either is fine | Perceived difference is small; pick what's cleaner for your client |
| 300-1000 tokens / 8-30s | stream | Without streaming the user stares at a frozen UI; UX gap is real |
| > 1000 tokens / > 30s | stream (mandatory) | Blocking trips client timeouts; streaming doubles as progress feedback |
"How do I know the length in advance" is a fair question:
- Summaries, tags, short translations → mostly < 200 tokens → blocking is enough
- Long-form rewrites, code generation, explanations → mostly 500+ tokens → stream
- When uncertain, default to stream: a few extra lines of streaming code on the client are cheaper than debugging blocking-mode timeouts across the whole stack
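The table can be encoded as a small helper. This is a minimal sketch, not part of the test script: `choose_mode` is a hypothetical name, it keys only on the token column, and it treats the shared boundary of 300 tokens as "either".

```python
from typing import Optional


def choose_mode(expected_tokens: Optional[int]) -> str:
    """Map an expected response length to a mode, following the table above.

    None means the length is unknown, in which case the rule is to
    default to streaming.
    """
    if expected_tokens is None:
        return "stream"
    if expected_tokens < 100:
        return "blocking"
    if expected_tokens <= 300:
        return "either"
    # 300-1000 tokens and > 1000 tokens both call for streaming;
    # above 1000 the table treats it as mandatory.
    return "stream"


print(choose_mode(50), choose_mode(200), choose_mode(800), choose_mode(None))
```

A caller might pass a rough per-task estimate (summaries and tags on the low end, code generation on the high end) and fall back to `None` when it has no idea.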
A side finding: the model ignores "one sentence"
The short prompt, "Answer in one sentence: what is Traditional Chinese?", was designed to elicit a response of about 50 tokens. Instead, gemma4:e2b returned 324-349 tokens (roughly a paragraph) in both modes.
This isn't the article's main point, but it is worth flagging: length constraints in prompts are unreliable on small models. If your client needs short output, either truncate after the fact or tighten the prompt format ("output exactly N characters", "no explanation"). From a streaming perspective, even a "short" prompt can quietly turn into a 7-8 second medium-length response, which already falls in the "default to stream when uncertain" zone.
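Both mitigations can be sketched in a few lines. This is an illustration under assumptions rather than something measured in this post: `first_sentence` and `short_request_body` are hypothetical helpers, the sentence splitter is deliberately naive, and the `num_predict` option caps the number of generated tokens (cutting the output off rather than making the model more concise).

```python
import json
import re

# Naive terminator set (CJK and ASCII). It will also split on the "."
# in things like "3.5", so treat it as a starting point only.
_SENTENCE_END = re.compile(r"[。!?.!?]")


def first_sentence(text: str) -> str:
    """Keep everything up to and including the first sentence terminator."""
    match = _SENTENCE_END.search(text)
    return text[: match.end()] if match else text


def short_request_body(prompt: str, max_tokens: int = 64) -> bytes:
    """Build a blocking /api/generate body with a hard cap on output tokens."""
    return json.dumps({
        "model": "gemma4:e2b",
        "prompt": prompt,
        "stream": False,
        # max_tokens=64 is an arbitrary example value, not a recommendation.
        "options": {"temperature": 0.2, "num_predict": max_tokens},
    }).encode()


print(first_sentence("繁體中文是一種書寫系統。它主要用於台灣與香港。"))
```

The cap bounds the worst-case wait, while the truncation keeps whatever did come back presentable.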
Code
Stream (Python)
```python
import json
import urllib.request

# stream=True asks Ollama for NDJSON: one JSON object per line.
body = json.dumps({
    "model": "gemma4:e2b",
    "prompt": "...",
    "stream": True,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:
        chunk = json.loads(line)
        # Print each fragment as soon as it arrives.
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break
```
Blocking (Python)
```python
import json
import urllib.request

# stream=False returns a single JSON object once generation has finished.
body = json.dumps({
    "model": "gemma4:e2b",
    "prompt": "...",
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    payload = json.loads(resp.read())
print(payload["response"])
```
Both use only the standard library, with no extra dependencies. Streaming adds line-by-line NDJSON parsing and a done: true exit check, which comes to a few extra lines.
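If the streaming loop is going to be reused, it may be worth pulling the NDJSON handling into a generator. The sketch below is one possible shape, with hypothetical names (`iter_text`, `StreamError`); it assumes that a failure partway through a stream shows up as a JSON line carrying an `error` field, which is worth verifying against the Ollama version in use.

```python
import json
from typing import Iterable, Iterator


class StreamError(RuntimeError):
    """Raised when a streamed line carries an error (hypothetical helper)."""


def iter_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from NDJSON lines until a done: true chunk."""
    for raw in lines:
        if not raw.strip():
            continue  # tolerate blank lines
        chunk = json.loads(raw)
        if "error" in chunk:
            raise StreamError(chunk["error"])
        piece = chunk.get("response", "")
        if piece:
            yield piece
        if chunk.get("done"):
            return  # stop reading once the final chunk arrives
```

The client loop then shrinks to `for piece in iter_text(resp): print(piece, end="", flush=True)`, and the same generator can be fed a list of byte strings in a unit test without a running server.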
FAQ
Q: Doesn't the streaming client code slow things down?
A: NDJSON parsing and the extra print/flush calls are negligible on Apple Silicon. Measured stream totals are only ~0.5 s (short) or ~0.65 s (long) above blocking. That gap sits after the final chunk on the client side rather than in inference, and the stream runs also emitted slightly more tokens (349 vs 324, 997 vs 968), which may account for part of it. Either way it is marginal next to model time.
Q: Will 12B results look the same?
A: Same principle (streaming gives the same total time with an earlier first token), but a 12B model is expected to run slower on M1 Pro (~12-15 tok/s vs e2b's ~50), so the crossover shifts down: even a 200-token response can land in "should stream" territory.
Q: Does /api/chat behave the same as /api/generate?
A: It should. Both endpoints share the same inference engine and streaming mechanism. /api/chat adds a message-history structure, and each streamed chunk carries its text under message.content rather than response, but the streaming behavior is the same.
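The 12B answer rests on simple arithmetic, sketched here. The throughput figures are the estimates quoted above, not new measurements, and `generation_seconds` is a hypothetical helper that ignores time to first token.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Rough generation time, ignoring prompt processing and TTFT."""
    return tokens / tokens_per_second


# Throughput values are the estimates quoted in the FAQ above.
for label, rate in (("e2b", 50.0), ("12B, low end", 12.0), ("12B, high end", 15.0)):
    seconds = generation_seconds(200, rate)
    print(f"{label}: {seconds:.1f} s for 200 tokens")
```

At ~50 tok/s a 200-token response takes about 4.0 s, which sits in the "either is fine" band; at 12-15 tok/s the same response takes roughly 13.3-16.7 s, which is well inside the band where streaming is recommended.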
Verdict
Waveform: STEADY_PULSE
Rules:
- Expected response > 5s → stream
- Expected response < 3s → blocking
- In between → either, pick whichever client code is cleaner
- When uncertain → default to stream
Streaming is a UX optimization, not a performance one. It turns "waiting" into "watching", which matters especially for local LLMs: their tokens/s is 2-5x lower than cloud APIs, and users tolerate "watching" far better than "waiting."