Featured Teardown · Local AI Signature № 01 · CLEAN-SIGNAL

Does gemma4:12b Really Use Its Context Window? Needle-in-Haystack on M1 Pro

Josh Chen · June 8, 2026 · 11 min read

"12B supports an 8K context" reads like a promise on the Ollama Library page, but practically it might just mean "the model won't crash" — not "if you hide a fact in the middle, it'll find it." I planted one explicit fact in 1K, 2.5K, and 4.5K-character Chinese documents at five positions each, and measured the actual recall rate on gemma4:12b.

Setup

Item	Spec
Machine	MacBook Pro 14" M1 Pro, 16GB
Ollama	0.30.6
Model	`gemma4:12b` (Q4_K_M, 7.6 GB)
num_ctx	8192
temperature	0.0 (for reproducibility)
Doc lengths	1K / 2.5K / 4.5K Chinese chars (874 / 1,998 / 3,489 prompt tokens as measured)
Needle positions	0% / 25% / 50% / 75% / 100%
Per combination	2 different needles (swap temp + city to defeat caching)
Total runs	3 × 5 × 2 = 30 inferences
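
The 3 × 5 × 2 grid in the table can be enumerated in a few lines. This is a minimal sketch, not the contents of tests/scripts/run_needle.py; the constant names and the dict layout are assumptions made for illustration.

```python
from itertools import product

DOC_LENGTHS = [1000, 2500, 4500]          # nominal document lengths in characters
POSITIONS = [0.0, 0.25, 0.5, 0.75, 1.0]   # needle depth as a fraction of the document
NEEDLE_VARIANTS = [0, 1]                  # two needles per cell (different city and temperature)

def build_grid():
    """Return every (length, position, variant) combination as a list of dicts."""
    return [
        {"length": length, "position": pos, "variant": variant}
        for length, pos, variant in product(DOC_LENGTHS, POSITIONS, NEEDLE_VARIANTS)
    ]

grid = build_grid()
print(len(grid))  # 30
```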

Needle template (2 variants):

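Important: on {date}, {city} reached {temp}°C, breaking that year's single-day high-temperature record. (Original Chinese template: 重要提示：在 {date}，{city} 的氣溫達到攝氏 {temp} 度，破了該年度的單日高溫紀錄。)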

Filler: 10 hand-written mundane Chinese paragraphs (grocery shopping, appliance repair, hiking, note-taking…) — deliberately avoiding any tech, weather, or legal vocabulary so the needle stands out topically and the model can't cheat via topic contrast.
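
One way to assemble such a document is to repeat the filler paragraphs until the target length is reached, then insert the needle at the paragraph boundary nearest the requested depth. The sketch below assumes that approach; the real tests/scripts/build_haystack.py is not shown in this article, so the function name, the `seed` parameter, and the paragraph-boundary placement are all hypothetical. Only the needle template string is taken from the article.

```python
import random

# Needle template as used in the test (Chinese original).
NEEDLE_TEMPLATE = "重要提示：在 {date}，{city} 的氣溫達到攝氏 {temp} 度，破了該年度的單日高溫紀錄。"

def build_haystack(filler_paragraphs, needle, target_chars, position, seed=0):
    """Build a document of roughly target_chars characters with the needle at a given depth.

    position runs from 0.0 (very start) to 1.0 (very end).
    """
    if not filler_paragraphs:
        raise ValueError("need at least one filler paragraph")
    rng = random.Random(seed)  # fixed seed so the same document can be rebuilt
    parts = []
    total = 0
    # Cycle through shuffled filler until the target length is reached.
    while total < target_chars:
        batch = list(filler_paragraphs)
        rng.shuffle(batch)
        for paragraph in batch:
            parts.append(paragraph)
            total += len(paragraph)
            if total >= target_chars:
                break
    # Place the needle at the paragraph boundary nearest the requested depth.
    index = round(position * len(parts))
    parts.insert(index, needle)
    return "\n\n".join(parts)
```

Inserting at paragraph boundaries keeps the needle a whole sentence; a character-exact offset would place it more precisely but could split a filler sentence in two.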

Query:

From the text above, what was {city}'s temperature on {date}? Just give the number with units. No explanation.

Pass criterion: regex-extract every number from the response and check if the needle's exact temperature (e.g. 37.2) appears. Yes → PASS.
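
The pass criterion described above can be sketched as follows. The regex and the function name are assumptions; the actual check in the runner script may differ in detail.

```python
import re

# Integers or decimals, e.g. "37" or "37.2".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

def is_pass(response: str, expected_temp: str) -> bool:
    """Return True if the exact needle temperature is among the numbers in the response."""
    numbers = NUMBER_RE.findall(response)
    return expected_temp in numbers
```

Comparing whole extracted numbers as strings means "137.2" or "37.25" does not count as a hit for "37.2". It also means "37.20" would be rejected; comparing parsed floats instead is a reasonable alternative if the model reformats numbers.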

Data generator: tests/scripts/build_haystack.py. Runner: tests/scripts/run_needle.py. Results: tests/outputs/needle_results.jsonl.
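
For readers who want to reproduce a single trial, here is a hedged sketch of one request against Ollama's local REST endpoint with the settings from the table. It assumes the `/api/generate` route with `stream` disabled and a prompt built by appending the question to the document; whether the original runner used this route or this prompt layout is not stated in the article.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(document: str, question: str) -> dict:
    """Assemble a non-streaming generate request using the article's settings."""
    return {
        "model": "gemma4:12b",
        "prompt": f"{document}\n\n{question}",  # assumed layout: document, blank line, question
        "stream": False,
        "options": {"num_ctx": 8192, "temperature": 0.0},
    }

def run_once(document: str, question: str, timeout: float = 300.0) -> dict:
    """Send one request and return the answer, token counts, and wall time."""
    body = json.dumps(build_payload(document, question)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        result = json.loads(reply.read().decode("utf-8"))
    wall = time.perf_counter() - start
    return {
        "response": result.get("response", ""),
        "prompt_tokens": result.get("prompt_eval_count"),
        "output_tokens": result.get("eval_count"),
        "wall_time_s": round(wall, 1),
    }
```

Each returned dict could be written as one line of a JSONL file, which matches the shape of tests/outputs/needle_results.jsonl in name only; the actual field names in that file are not given here.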

Results (15 position cells × 2 reps each)

Context	0%	25%	50%	75%	100%	Total
1K	2/2	2/2	2/2	2/2	2/2	10/10
2.5K	2/2	2/2	2/2	2/2	2/2	10/10
4.5K	2/2	2/2	2/2	2/2	2/2	10/10
Total						30/30

Honestly didn't expect a perfect sweep. The literature's "lost in the middle" effect (Liu et al., 2023) is the classic long-context pain point — accuracy dips when info sits at 25–75%. In this range and task shape, 12B + 8K context didn't hit it.

A few reads on the result:

The needle was too distinctive. I formatted it as a fixed phrase ("Important: on {date}…") embedded in mundane Chinese paragraphs (groceries, repairs, hiking). Once the model spots that phrase template inside thousands of characters of small-talk, it locks on. Rewriting the needle to sound like normal prose would probably look much worse — a follow-up worth running.
The task is literal extraction. Compared with "synthesize across paragraphs" or "reason about multiple facts," number extraction is the easiest mode.
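4.5K characters is only 3,489 prompt tokens, about 43% of the 8,192-token window (the 56% you get from 4.5K / 8K treats characters as tokens). The window was not pushed to the ceiling. To find a real degradation point, I'd retest at 6–7K.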
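
How much of the window each run actually occupied is easier to judge in tokens than in characters. The sketch below uses the prompt and output token counts from the timing table later in the article and assumes both count against `num_ctx`.

```python
NUM_CTX = 8192

# (prompt tokens, output tokens) per document length, from the article's timing table.
RUNS = {"1K": (874, 281), "2.5K": (1998, 248), "4.5K": (3489, 256)}

def utilization(prompt_tokens: int, output_tokens: int, num_ctx: int = NUM_CTX) -> float:
    """Fraction of the context window occupied by prompt plus generated tokens."""
    return (prompt_tokens + output_tokens) / num_ctx

for label, (prompt, output) in RUNS.items():
    print(f"{label}: {utilization(prompt, output):.1%} of {NUM_CTX} tokens")
```

Even with the generated tokens included, the longest document stays under half of the window, which is why a 6–7K retest is needed before saying anything about the ceiling.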

That said, "no lost-in-the-middle in this range" is a useful conclusion for local RAG: when you feed the retrieval Top-K to 12B for summarization or QA, at least within 4.5K you don't need to worry about answers being buried in the middle.

Failure analysis

There are no failures to analyze. All 30 trials retrieved the correct temperature.

A side observation though: the model ignores "just give the number" instructions. The prompt was explicit:

From the text above, what was {city}'s temperature on {date}? Just give the number with units (e.g. "37.2 度"). No explanation.

Actual responses average 250-280 tokens — 60-70× the target format ("37.2 度" = 4 tokens). The content usually answers correctly first ("37.2 度") then appends explanations like "the text mentions…" and "this broke the annual single-day high record."

Tying back to the streaming article, this reinforces the conclusion: even when you expect a short response, local models deliver medium-length output — so the threshold for using streaming is lower than intuition suggests.
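
Since the model answers first and explains afterwards, downstream code can keep just the first number-plus-unit it sees. This is a minimal sketch under that assumption; the function name and the set of accepted unit spellings are hypothetical.

```python
import re

# A number, optional whitespace, then a Celsius unit as the model tends to write it.
TEMP_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(度|°C|℃)")

def extract_answer(response: str):
    """Return the first 'number unit' found in the response, or None if there is none."""
    match = TEMP_RE.search(response)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"
```

Requiring a unit after the number skips dates such as "7月15日" that may precede the temperature. Capping generation with Ollama's `num_predict` option is another route, though it cuts the response off by token count rather than by content.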

Speed and memory impact

Measured wall time vs context length (median):

Context	Prompt tokens	Wall time	Output tokens
1K	874	28.3 s	281
2.5K	1998	35.5 s	248
4.5K	3489	49.3 s	256

How to read it: stretching context from 1K to 4.5K characters (4.5×, or about 4× in prompt tokens: 874 → 3,489) only pushes wall time from 28.3 s to 49.3 s (about 1.74×). Output token count is similar across all three (248 to 281), so generation time barely moves and the added time is almost all prompt evaluation. M1 Pro chews through long Chinese context faster than I'd guessed.
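
A rough marginal prompt-processing rate falls out of the first and last table rows. The sketch assumes generation time is about the same in both rows, which is only approximately true (281 vs 256 output tokens), so treat the result as an order-of-magnitude figure rather than a benchmark.

```python
# (prompt tokens, wall time in seconds) from the timing table.
SHORT = (874, 28.3)
LONG = (3489, 49.3)

def marginal_prompt_rate(short, long):
    """Extra prompt tokens processed per extra second of wall time."""
    extra_tokens = long[0] - short[0]
    extra_seconds = long[1] - short[1]
    return extra_tokens / extra_seconds

rate = marginal_prompt_rate(SHORT, LONG)
print(f"{rate:.0f} prompt tokens per additional second")
```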

Memory: with num_ctx=8192, gemma4:12b ran alongside Safari and Obsidian on the 16GB M1 Pro. The memory-pressure graph stayed green throughout, with no swapping. The setup is fine for daily use, but adding e2b or a third heavy app would tip it into yellow.

Verdict

Waveform: CLEAN_SIGNAL

12B + 8K context is trustworthy in this task range (≤ 4.5K input):

✅ Hit rate is 100% at every tested position (0/25/50/75/100%)
✅ Speed is acceptable: 4.5K input → ~50s wall time, dominated by prompt eval
✅ M1 Pro 16GB holds Safari + Obsidian + 12B without swapping
⚠️ Model completely ignores the "just give the number" format constraint — downstream code must truncate
⚠️ Didn't probe 6-7K (near the 8K ceiling) — future work

Implications for local RAG design:

After Top-K retrieval, feeding the passages to 12B for summarization or QA is fine as long as you stay under 4.5K — no need to worry about middle-of-context burial
No need for "put the important passage first or last" gymnastics that target lost-in-the-middle
If your RAG context exceeds 4.5K: chunk, or rerun this audit at 6-7K to find your decay point
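
The 4.5K guideline can be enforced before the prompt is built. The sketch below greedily keeps ranked passages until the next one would exceed the budget; the function name and the choice to count characters rather than tokens are assumptions, chosen because the tested lengths in this article are given in characters.

```python
def pack_passages(passages, budget_chars=4500):
    """Keep ranked passages in order until one would exceed budget_chars.

    Returns (kept, dropped) so the caller can decide whether to chunk or re-audit.
    """
    kept, dropped = [], []
    used = 0
    for passage in passages:
        # Once one passage fails to fit, everything after it is dropped too,
        # so the kept list always preserves retrieval rank order without gaps.
        if not dropped and used + len(passage) <= budget_chars:
            kept.append(passage)
            used += len(passage)
        else:
            dropped.append(passage)
    return kept, dropped
```

A non-empty `dropped` list is the signal that the request falls outside the range tested here.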

See gemma4:12b on Ollama Library →

Verdict

Signature № 01 · CLEAN-SIGNAL

30 of 30 hits across 1K/2.5K/4.5K × 5 positions. The 8K context is real; just expect it to ignore your length cap.

Gemma 4 12B with num_ctx=8192 hit every needle across three document lengths and five positions: 30 of 30. No "lost in the middle" effect was observed; all five positions, from 0% through 100%, scored 100%. Two costs: (1) wall time grows from 28 s (1K context) to 49 s (4.5K), dominated by prompt eval. (2) Despite an explicit "just give the number, no explanation" instruction, the model still returns about 250 to 280 tokens; truncate downstream if you need terse output. Within this range, local RAG can trust the model's in-context recall for literal extraction without chunking heuristics.

Tune in. One deep review per week. No filler.