Featured Teardown · Automation Signature № 02 · STEADY-PULSE

Ollama HTTP API：stream=true 真的比較快嗎？M1 Pro 實測拆給你看Ollama HTTP API: Is stream=true Actually Faster? M1 Pro Numbers

Josh Chen · June 7, 2026 · 9 min read

Ollama HTTP API 文件第一個範例就是 stream: true——多數人複製貼上後就不再回頭想「另一條路是什麼」。我把兩條路同樣的 prompt 各跑 5 次，發現大家的直覺有一半是錯的。

The first example in Ollama's HTTP API docs uses stream: true — most people copy-paste it and never look back. I ran both modes on the same prompts five times each. Half the common intuition is wrong.

設定

項目	規格
機器	MacBook Pro 14" M1 Pro, 16GB
Ollama	0.30.6
模型	`gemma4:e2b` (Q4_K_M, 7.2GB)
Endpoint	`http://localhost:11434/api/generate`
Temperature	0.2
重複次數	每組合 5 次，跑前先 warm up 一次（避開冷載入）

兩種 prompt：

短：「請用一句話回答：什麼是繁體中文？」
長：「請用約 400-500 字介紹 Apple Silicon 的 unified memory 架構對本地 LLM 推論的影響，至少分三段。」

兩種模式：

stream: true — 一收到 token 就 push 給 client
stream: false — 等模型生成完才回完整 JSON

完整腳本在 tests/scripts/run_streaming_test.py；原始數據在 tests/outputs/results.jsonl。

Setup

Item	Value
Machine	MacBook Pro 14" M1 Pro, 16GB
Ollama	0.30.6
Model	`gemma4:e2b` (Q4_K_M, 7.2GB)
Endpoint	`http://localhost:11434/api/generate`
Temperature	0.2
Repetitions	5 per combination, one warm-up call first

Two prompts:

Short: "Answer in one sentence: what is Traditional Chinese?"
Long: "In 400-500 characters across at least three paragraphs, explain how Apple Silicon's unified memory affects local LLM inference."

Two modes:

stream: true — push tokens as they're generated
stream: false — wait for full generation, return one JSON

Script: tests/scripts/run_streaming_test.py. Raw data: tests/outputs/results.jsonl.

數字（5 次中位數）

場景	TTFT（首字出現）	總時間	tokens/s	輸出 token 數
短 + stream	7.69 s	8.15 s	44.5	349
短 + blocking	7.65 s	7.65 s	45.3	324
長 + stream	9.71 s	20.13 s	50.8	997
長 + blocking	19.48 s	19.48 s	50.9	968

兩件事先點出來：

tokens/s 兩種模式幾乎相同（短 44-45、長 50-51）——streaming 不會讓模型跑更快。
總時間幾乎相同——streaming 也不會讓總工作少做。

streaming 唯一改變的是「第一個字什麼時候到」。長 prompt 上，blocking 模式讓 client 乾等 19.5 秒才看到第一個字；stream 模式 9.7 秒就開始出字——使用者看見輸出的時間提前一半。

短 prompt 上這個差距消失：兩邊都是 7.7 秒左右。原因是這個 prompt 的回應總長度本來就接近 TTFT 的尺度，stream 與 blocking 的差別小到不可感知。

The numbers (median of 5)

Scenario	TTFT (first token)	Total time	tokens/s	Output tokens
Short + stream	7.69 s	8.15 s	44.5	349
Short + blocking	7.65 s	7.65 s	45.3	324
Long + stream	9.71 s	20.13 s	50.8	997
Long + blocking	19.48 s	19.48 s	50.9	968

Two things up front:

tokens/s is nearly identical across modes (short 44-45, long 50-51) — streaming doesn't make the model go faster.
Total time is nearly identical — streaming doesn't reduce total work.

The only thing streaming changes is when the first token arrives. On the long prompt, blocking makes the client stare at nothing for 19.5 seconds; stream starts producing output at 9.7 seconds — the user sees half the wait.

On the short prompt the gap disappears: ~7.7 seconds either way. The response is short enough that streaming's head-start has nothing to amortize against.

決策規則

預期回應長度	我會用	理由
< 100 token / < 3 秒	blocking	TTFT 與總時間貼合，stream 拿不到優勢；blocking 程式碼簡單
100-300 token / 3-8 秒	二選一都可以	體感差距不明顯，看你 client 哪邊好寫
300-1000 token / 8-30 秒	stream	不 stream 等於要使用者乾瞪螢幕；UX 落差顯著
> 1000 token / > 30 秒	stream（必）	blocking 在 client 端常踩 timeout；stream 還能順便顯示進度

「我怎麼知道會多長」這題其實很實用：

摘要、tag、翻譯短句 → 多數 < 200 token → blocking 夠
改寫長文、generate code、解釋 → 多數 500+ token → stream
不確定時預設 stream，因為 stream 的 client 處理多一點程式碼就好，blocking 撞 timeout 是要 debug 整個 stack

Decision rule

Expected response length	I'd use	Why
< 100 tokens / < 3s	blocking	TTFT and total time collapse together — streaming buys nothing; blocking code is simpler
100-300 tokens / 3-8s	either is fine	Perceived difference is small; pick what's cleaner for your client
300-1000 tokens / 8-30s	stream	Without streaming the user stares at a frozen UI; UX gap is real
> 1000 tokens / > 30s	stream (mandatory)	Blocking trips client timeouts; streaming doubles as progress feedback

"How do I know the length in advance" is a fair question:

Summaries, tags, short translations → mostly < 200 tokens → blocking is enough
Long-form rewrites, code generation, explanations → mostly 500+ tokens → stream
When uncertain, default to stream — extra streaming code on the client is cheaper than debugging blocking-mode timeouts

一個發現：模型不太守「一句話」

附帶數據：我給的 prompt 是「請用一句話回答：什麼是繁體中文？」——「短」prompt 的設計目標是 ~50 token 回應。但 gemma4:e2b 兩種模式都回了 324-349 token（約一段話）。

這不算這篇的主題，但意味著「短 vs 長」用 prompt 限制長度在小模型上不可靠。如果你的 client 程式需要短回應，要嘛截斷、要嘛在 prompt 加更嚴格的格式約束（例如「只回 N 個字」+「不要解釋」）。從 stream 觀察角度看：即便你以為要的是短回應，實際出來可能是 7-8 秒的中等回應——這已經到了「不確定時預設 stream」的區間。

A side finding: the model ignores "one sentence"

Side data: my prompt asked "Answer in one sentence: what is Traditional Chinese?" — intended ~50-token response. gemma4:e2b returned 324-349 tokens (a paragraph) in both modes.

Not this article's main point, but worth flagging: length constraints in prompts are unreliable on small models. If your client needs short output, either truncate post-hoc, or tighten the prompt format ("output exactly N characters", "no explanation"). From a streaming perspective: even "short" prompts can quietly turn into 7-8 second medium responses — which is already in the "default to stream when uncertain" zone.

程式碼

Stream (Python)

import json, urllib.request

body = json.dumps({
    "model": "gemma4:e2b",
    "prompt": "...",
    "stream": True,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break

Blocking (Python)

import json, urllib.request

body = json.dumps({
    "model": "gemma4:e2b",
    "prompt": "...",
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    payload = json.loads(resp.read())
print(payload["response"])

兩邊都用 stdlib，沒有額外依賴。streaming 多了 NDJSON 的逐行解析、需要處理 done: true 結束訊號——大約多 5 行程式碼。

Code

Stream (Python)

import json, urllib.request

body = json.dumps({
    "model": "gemma4:e2b",
    "prompt": "...",
    "stream": True,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break

Blocking (Python)

import json, urllib.request

body = json.dumps({
    "model": "gemma4:e2b",
    "prompt": "...",
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    payload = json.loads(resp.read())
print(payload["response"])

Both use stdlib — no extra deps. Streaming adds NDJSON line-by-line parsing and a done: true exit check — roughly 5 extra lines.

常見問題

Q：那 stream 的 client 程式不會更慢嗎？ A：解析 NDJSON 與額外的 print/flush 在 M 系列上對 wall time 影響可忽略。實測 stream 總時間只比 blocking 多 ~0.5 秒（短）或 ~0.6 秒（長），這部分多在「結束 chunk 之後 client 拿到完整輸出」這段，跟模型推論時間相比微不足道。

Q：12B 模型結果會一樣嗎？ A：原理一樣（streaming = 同樣總時間，更早出字），但 12B 在 M1 Pro 預期 tok/s 較低（~12-15 vs e2b 的 ~50），所以 breakpoint 會往下移——可能 200 token 的回應就到「該 stream」門檻。

Q：Ollama 的 /api/chat 跟 /api/generate 結論一樣嗎？ A：應該一樣，因為兩者底層是同一推論引擎、stream 機制相同。/api/chat 多了訊息歷史結構，但 streaming 行為一致。

FAQ

Q: Doesn't the streaming client code slow things down? A: NDJSON parsing and the extra prints/flushes are negligible on Apple Silicon. Measured stream totals are only ~0.5s (short) or ~0.6s (long) above blocking. That gap lives in the post-final-chunk assembly on the client, not in inference, and it's marginal next to model time.

Q: Will 12B results look the same? A: Same principle (stream = identical total time, earlier first token), but 12B runs slower on M1 Pro (~12-15 tok/s vs e2b's ~50), so the crossover shifts down — even a 200-token response can land in "should stream" territory.

Q: Does /api/chat behave the same as /api/generate? A: Should match. The underlying inference engine and streaming mechanism are the same. /api/chat adds message-history structure but streaming behaves identically.

共振站結論

波形評級：STEADY_PULSE

規則：

預期回應 > 5 秒 → stream
預期回應 < 3 秒 → blocking
中間區段 → 二選一，看 client 程式碼哪邊好寫
不確定時 → 預設 stream

streaming 不是性能優化，是 UX 優化。它讓你的使用者從「在等」變成「在看」——這對本機 LLM 特別重要，因為 tok/s 比雲端慢 2-5 倍，「在看」的耐受度比「在等」高得多。

Ollama HTTP API 文件 →

Verdict

Waveform: STEADY_PULSE

Rules:

Expected response > 5s → stream
Expected response < 3s → blocking
In between → either, pick whichever client code is cleaner
When uncertain → default to stream

Streaming is a UX optimization, not a performance one. It turns "waiting" into "watching" — a meaningful distinction for local LLMs, which run 2-5x slower than cloud APIs and where users tolerate "watching" far better than "waiting."

Ollama HTTP API docs →

Verdict

Signature № 02 · STEADY-PULSE

stream 不會更快，只是讓你早點看到字。長回應一定要 stream，短回應別。Streaming isn't faster — it just shows output sooner. Long responses must stream; short ones don't need to.

實測 stream=true 與 stream=false 的總時間幾乎相同——差距在第一個 token 何時出現。短回應（~350 tokens、7-8 秒）兩種模式體感一樣，blocking 簡單一點。長回應（~1000 tokens、~20 秒）blocking 會讓使用者乾等 20 秒不見動靜，stream 在 10 秒就開始出字。breakpoint 在「預期回應時間超過 5 秒」這條線。

stream=true and stream=false take the same total time — the difference is when the first token appears. For short responses (~350 tokens, 7-8s), both modes feel identical and blocking is simpler. For long responses (~1000 tokens, ~20s), blocking makes the user stare at nothing for 20 seconds while stream shows output starting at 10s. The crossover sits around 'expected response > 5 seconds.'

設定

Setup

數字（5 次中位數）

The numbers (median of 5)

決策規則

Decision rule

一個發現：模型不太守「一句話」

A side finding: the model ignores "one sentence"

程式碼

Stream (Python)

Blocking (Python)

Code

Stream (Python)

Blocking (Python)

常見問題

FAQ

共振站結論

Verdict

Tune in. 每週一篇深度評測，沒有廢話。Tune in. One deep review per week. No filler.