Featured Teardown · Automation Signature № 02 · STEADY-PULSE

Ollama HTTP API: Is stream=true Actually Faster? M1 Pro Numbers

The first example in Ollama's HTTP API docs uses stream: true. Most people copy-paste it and never ask what the other path looks like. I ran both modes on the same prompts five times each, and half of the common intuition turns out to be wrong.


Setup

| Item | Value |
|---|---|
| Machine | MacBook Pro 14" M1 Pro, 16GB |
| Ollama | 0.30.6 |
| Model | gemma4:e2b (Q4_K_M, 7.2GB) |
| Endpoint | http://localhost:11434/api/generate |
| Temperature | 0.2 |
| Repetitions | 5 per combination, after one warm-up call to avoid cold-load time |

Two prompts (issued in Traditional Chinese; English translations shown):

  • Short: "Answer in one sentence: what is Traditional Chinese?"
  • Long: "In about 400-500 characters across at least three paragraphs, explain how Apple Silicon's unified memory architecture affects local LLM inference."

Two modes:

  • stream: true: tokens are pushed to the client as they are generated
  • stream: false: the server waits for generation to finish, then returns one complete JSON object

Script: tests/scripts/run_streaming_test.py. Raw data: tests/outputs/results.jsonl.
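The test script itself is not reproduced in this post. As a rough sketch of how TTFT and total time could be measured from a streamed NDJSON response, something like the following would work. The function name `time_stream` and the injectable `clock` parameter are assumptions made for illustration, not the contents of run_streaming_test.py.

```python
import json
import time
from typing import Callable, Iterable


def time_stream(
    lines: Iterable[bytes],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume NDJSON chunks and record TTFT and total time.

    `lines` is any iterable of raw NDJSON lines, for example the response
    object returned by urllib.request.urlopen for a stream=true request.
    `start` is the clock reading taken just before the request was sent.
    """
    ttft = None
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        chunk = json.loads(raw)
        text = chunk.get("response", "")
        if text and ttft is None:
            ttft = clock() - start  # first non-empty piece of output
        parts.append(text)
        if chunk.get("done"):
            break
    return {"ttft_s": ttft, "total_s": clock() - start, "text": "".join(parts)}
```

To use it, take `start = time.perf_counter()` just before calling `urllib.request.urlopen(req)` and pass the response object as `lines`. In blocking mode there is nothing to iterate over, so TTFT is simply the total time.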


The numbers (median of 5)

| Scenario | TTFT (first token) | Total time | tokens/s | Output tokens |
|---|---|---|---|---|
| Short + stream | 7.69 s | 8.15 s | 44.5 | 349 |
| Short + blocking | 7.65 s | 7.65 s | 45.3 | 324 |
| Long + stream | 9.71 s | 20.13 s | 50.8 | 997 |
| Long + blocking | 19.48 s | 19.48 s | 50.9 | 968 |

In blocking mode TTFT equals total time by construction, since nothing arrives until the full JSON does.

Two things up front:

  1. tokens/s is nearly identical across modes (short 44-45, long 50-51): streaming doesn't make the model generate faster.
  2. Total time is nearly identical: streaming doesn't reduce the total work.

The only thing streaming changes is when the first token arrives. On the long prompt, blocking leaves the client waiting 19.5 seconds before anything appears, while streaming starts producing output at 9.7 seconds, so the user waits about half as long to see the first output.

On the short prompt the gap disappears: ~7.7 seconds either way. The total response time is already on the same scale as the TTFT, so streaming has almost no head start to offer and the difference is imperceptible.
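The post doesn't say how tokens/s was derived. One plausible way, sketched here as an assumption, is to read `eval_count` and `eval_duration` (reported in nanoseconds) from the final response object Ollama returns and take the median over the five runs. The `runs` values below are made-up placeholders, not the data in results.jsonl.

```python
from statistics import median


def tokens_per_second(final_chunk: dict) -> float:
    """Generation rate from Ollama's final chunk; eval_duration is in nanoseconds."""
    seconds = final_chunk["eval_duration"] / 1e9
    return final_chunk["eval_count"] / seconds


# Hypothetical final chunks from five runs (illustrative values only).
runs = [
    {"eval_count": 100, "eval_duration": 2_000_000_000},
    {"eval_count": 98, "eval_duration": 2_000_000_000},
    {"eval_count": 102, "eval_duration": 2_000_000_000},
    {"eval_count": 100, "eval_duration": 2_500_000_000},
    {"eval_count": 110, "eval_duration": 2_000_000_000},
]

rates = [tokens_per_second(r) for r in runs]
median_rate = median(rates)  # the median damps the one slow outlier run
```

Using the median rather than the mean keeps a single slow run (the fourth one here) from dragging the reported figure down.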


Decision rule

| Expected response length | I'd use | Why |
|---|---|---|
| < 100 tokens / < 3 s | blocking | TTFT and total time converge, so streaming buys nothing; blocking code is simpler |
| 100-300 tokens / 3-8 s | either | Perceived difference is small; pick whichever is cleaner for your client |
| 300-1000 tokens / 8-30 s | stream | Without streaming the user stares at a frozen UI; the UX gap is significant |
| > 1000 tokens / > 30 s | stream (mandatory) | Blocking often trips client timeouts; streaming doubles as progress feedback |

"How do I know the length in advance?" is a practical question:

  • Summaries, tags, short translations → mostly < 200 tokens → blocking is enough
  • Long-form rewrites, code generation, explanations → mostly 500+ tokens → stream
  • When uncertain, default to stream: a little extra streaming code on the client is cheaper than debugging blocking-mode timeouts across the whole stack
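The table can be folded into a small helper. This is a minimal sketch, assuming the time bands (3 s and 8 s) are the deciding thresholds and that generation time is roughly expected tokens divided by tokens/s. It ignores model load and prompt-processing time, and `choose_mode` is a hypothetical name.

```python
from typing import Optional


def choose_mode(expected_tokens: Optional[int], tokens_per_s: float = 50.0) -> str:
    """Pick a mode from an expected response length, using the table's time bands."""
    if expected_tokens is None:
        return "stream"  # length unknown: default to stream
    est_seconds = expected_tokens / tokens_per_s
    if est_seconds < 3:
        return "blocking"
    if est_seconds < 8:
        return "either"
    return "stream"
```

Passing a lower `tokens_per_s` shows how the crossover moves for slower models: at 13 tokens/s, `choose_mode(200, 13.0)` already returns "stream", because 200 tokens take about 15 seconds.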

A side finding: the model ignores "one sentence"

Side data: the short prompt asked "Answer in one sentence: what is Traditional Chinese?" and was designed to elicit a response of about 50 tokens. gemma4:e2b returned 324-349 tokens (roughly a paragraph) in both modes.

This isn't the article's main point, but it's worth flagging: length constraints in prompts are unreliable on small models. If your client needs short output, either truncate after the fact or tighten the prompt format (for example "answer in only N characters" plus "no explanation"). From a streaming perspective, even a prompt you think of as "short" can quietly turn into a 7-8 second medium response, which is already in the "default to stream when uncertain" zone.
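A third option, which this post did not test, is a hard server-side ceiling through Ollama's `num_predict` option, which caps the number of generated tokens. The sketch below only builds the request body; `build_body` and the cap of 64 used in the usage note are illustrative assumptions.

```python
import json


def build_body(prompt: str, max_tokens: int, stream: bool = False) -> bytes:
    """Build an /api/generate request body with a hard cap on output tokens."""
    return json.dumps({
        "model": "gemma4:e2b",
        "prompt": prompt,
        "stream": stream,
        "options": {
            "num_predict": max_tokens,  # upper bound on generated tokens
            "temperature": 0.2,
        },
    }).encode()
```

A call such as `build_body("...", 64)` gives a body that can be passed as `data` to `urllib.request.Request`. A cap like this cuts generation off wherever the limit falls, possibly mid-sentence, so it limits latency rather than guaranteeing a well-formed short answer.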


Code

Stream (Python)

import json, urllib.request

body = json.dumps({
    "model": "gemma4:e2b",
    "prompt": "...",
    "stream": True,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for line in resp:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break

Blocking (Python)

import json, urllib.request

body = json.dumps({
    "model": "gemma4:e2b",
    "prompt": "...",
    "stream": False,
}).encode()
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    payload = json.loads(resp.read())
print(payload["response"])

Both use only the standard library, with no extra dependencies. Streaming adds line-by-line NDJSON parsing and a check for the done: true end signal, which comes to a few extra lines.
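If the streaming loop is needed in more than one place, it can be wrapped in a generator. This is a sketch, not part of the test script. `iter_text` is a hypothetical helper, and the handling of an `{"error": ...}` object arriving mid-stream is an assumption about how failures surface that should be checked against the Ollama docs.

```python
import json
from typing import Iterable, Iterator


def iter_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text pieces from an NDJSON stream until the done chunk arrives."""
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        chunk = json.loads(raw)
        if "error" in chunk:
            # Assumption: a mid-stream failure arrives as an {"error": ...} object.
            raise RuntimeError(chunk["error"])
        piece = chunk.get("response", "")
        if piece:
            yield piece
        if chunk.get("done"):
            return
```

With this in place the streaming client reduces to `for piece in iter_text(resp): print(piece, end="", flush=True)`.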


FAQ

Q: Doesn't the streaming client code slow things down? A: NDJSON parsing and the extra print/flush calls have a negligible effect on wall time on Apple Silicon. Measured streaming totals are only ~0.5 s (short) or ~0.65 s (long) above blocking. I attribute most of that gap to the stretch after the final chunk, when the client ends up with the complete output, and it is marginal next to inference time. Note that the streaming runs also produced slightly more tokens (349 vs 324 and 997 vs 968), which at 45-50 tokens/s accounts for roughly half a second on its own, so the gap may owe as much to output length as to client overhead.

Q: Will results look the same on a 12B model? A: Same principle (streaming gives the same total time but an earlier first token), but a 12B model is expected to run slower on M1 Pro (~12-15 tok/s vs e2b's ~50), so the crossover shifts down: even a 200-token response can land in "should stream" territory.

Q: Does /api/chat behave the same as /api/generate? A: It should, though I didn't measure it. Both endpoints sit on the same inference engine and use the same streaming mechanism. /api/chat adds message-history structure, but the streaming behavior should be the same.
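One practical difference on the client side is the chunk shape: /api/chat carries the text under `message.content` rather than `response`. A minimal sketch of collecting a streamed chat reply follows. `collect_chat` is a hypothetical helper, and nothing here was benchmarked in this post.

```python
import json
from typing import Iterable


def collect_chat(lines: Iterable[bytes]) -> str:
    """Join the message.content pieces of a streamed /api/chat response."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank lines
        chunk = json.loads(raw)
        # /api/chat nests the text inside a message object
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

The request body changes in the same spirit: instead of a `prompt` string, /api/chat takes a `messages` list of objects with `role` and `content` keys.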


Verdict

Waveform: STEADY_PULSE

Rules:

  • Expected response > 5s → stream
  • Expected response < 3s → blocking
  • In between → either, pick whichever client code is cleaner
  • When uncertain → default to stream

Streaming is a UX optimization, not a performance optimization. It turns "waiting" into "watching". That distinction matters for local LLMs in particular, because their tok/s is 2-5x lower than cloud APIs and users tolerate "watching" far better than "waiting."

Ollama HTTP API docs →