設定
| 項目 | 規格 |
|---|---|
| 機器 | MacBook Pro 14" M1 Pro, 16GB |
| Ollama | 0.30.6 |
| Embedding 模型 | nomic-embed-text (274 MB, 768 維) |
| Endpoint | http://localhost:11434/api/embed |
| Corpus | 30 段繁中段落,6 主題各 5 篇 |
| Queries | 15 個查詢,每個對應 1 個 ground-truth doc |
Corpus 主題:
- A(Apple ecosystem):M1 Pro vs M1、unified memory、macOS Spotlight、iPhone 健康 app、iCloud 同步
- B(Local AI / LLM):Ollama HTTP API、K-quants、nomic embedding、streaming UX、Gemma 預設英文
- C(Taiwan 法規):租賃契約條款、不得記載事項、租金補貼、悠遊卡月票、新版身分證
- D(台灣料理):滷肉飯、鹽酥雞、牛肉麵、蘿蔔糕、肉圓
- E(個人理財):勞退新制、健保負擔、ETF(0050/0056)、信用卡循環利率、所得稅
- F(攝影基礎):光圈、ISO、RAW/JPEG、白平衡、等效焦距
每篇段落都嵌入一個明確的「事實 hook」——一個可被精確查詢的資料點,讓 query → doc 的對應關係可機器判定。
Query 設計原則:
- 每個 query 對應一個明確的 ground-truth doc
- query 用詞與 doc 段落故意拉開差異(測語意,不測字面)
- 範例:doc 用「M1 是 68.25 GB/s,M1 Pro 拉到 200 GB/s」,query 問「我想知道 M1 Pro 比 M1 強的地方在哪」
指標:
- Top-1 命中率:rank=1 的 query 比例
- Top-3 命中率:ground truth 出現在 top 3 的比例
- Top-5 命中率:同上但 top 5
- 平均 rank:所有 query ground truth 排名的平均
完整 corpus 與 query 在 tests/data/;腳本 tests/scripts/run_retrieval.py;結果 tests/outputs/retrieval_results.jsonl。
Setup
| Item | Spec |
|---|---|
| Machine | MacBook Pro 14" M1 Pro, 16GB |
| Ollama | 0.30.6 |
| Embedding model | nomic-embed-text (274 MB, 768-dim) |
| Endpoint | http://localhost:11434/api/embed |
| Corpus | 30 Traditional Chinese paragraphs, 5 per theme × 6 themes |
| Queries | 15, each mapped to exactly one ground-truth doc |
Corpus themes:
- A (Apple): M1 Pro vs M1, unified memory, macOS Spotlight, iPhone health app, iCloud sync
- B (Local AI/LLM): Ollama HTTP API, K-quants, nomic embedding, streaming UX, Gemma defaulting to English
- C (Taiwan regulation): rental contract clauses, prohibited clauses, rent subsidy, EasyCard pass, new ID card
- D (Taiwanese food): braised pork rice, popcorn chicken, beef noodle soup, daikon cake, meatballs
- E (Personal finance): pension system, NHI premiums, ETFs (0050/0056), credit card APR, income tax
- F (Photography basics): aperture, ISO, RAW/JPEG, white balance, equivalent focal length
Each paragraph embeds one explicit "fact hook" — a data point that a precise query can target — making query → doc mapping machine-decidable.
Query design rules:
- Each query maps to exactly one ground-truth doc
- Query phrasing is intentionally lexically distant from the doc (testing semantic match, not surface overlap)
- Example: doc says "M1 is 68.25 GB/s, M1 Pro is 200 GB/s"; query asks "I want to know where M1 Pro outperforms M1"
Metrics:
- Top-1 hit rate: queries where the ground truth ranks first
- Top-3 hit rate: ground truth in top 3
- Top-5 hit rate: ground truth in top 5
- Mean rank: average ground-truth rank across all queries
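
The four metrics above can be computed from nothing more than the 1-based rank of each query's ground-truth doc. The following is a minimal sketch, not the contents of tests/scripts/run_retrieval.py; the function names `hit_rate` and `summarize` and the toy rank list are illustrative assumptions.

```python
from statistics import mean

def hit_rate(ranks, k):
    """Fraction of queries whose ground-truth doc sits at rank k or better (1-based)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def summarize(ranks):
    """Bundle the four metrics used in this write-up into one dict."""
    return {
        "top1": hit_rate(ranks, 1),
        "top3": hit_rate(ranks, 3),
        "top5": hit_rate(ranks, 5),
        "mean_rank": mean(ranks),
    }

# Toy input for illustration only (hypothetical ranks, not the measured ones).
example_ranks = [1, 1, 2, 4, 7]
print(summarize(example_ranks))
```

Keeping the raw ranks around, rather than only the hit rates, is what makes it possible to tell a near miss (rank 2) from a complete miss (rank 30) later on.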
Corpus and queries: tests/data/. Script: tests/scripts/run_retrieval.py. Results: tests/outputs/retrieval_results.jsonl.
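
For readers who want to reproduce the setup without opening the script, here is a rough sketch of the embed-and-rank loop. It assumes the Ollama `/api/embed` request shape (`model` plus `input`, returning `embeddings`) and assumes cosine similarity as the scoring function; the original script may differ on both counts. The helper names and the two-dimensional toy vectors are hypothetical.

```python
import json
import math
import urllib.request

EMBED_URL = "http://localhost:11434/api/embed"  # endpoint from the setup table
MODEL = "nomic-embed-text"

def embed(texts, url=EMBED_URL, model=MODEL):
    """Return one embedding per input string (needs a running Ollama server)."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_docs(query_vec, doc_vecs):
    """doc_vecs maps doc_id -> vector; returns [(doc_id, score)], best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def ground_truth_rank(ranking, gt_id):
    """1-based position of the ground-truth doc in a ranking."""
    return [doc_id for doc_id, _ in ranking].index(gt_id) + 1

# Offline demo with made-up 2-d vectors, so no server is required here.
toy_docs = {"A02": [1.0, 0.0], "D02": [0.6, 0.8], "F04": [0.0, 1.0]}
toy_ranking = rank_docs([0.9, 0.1], toy_docs)
print(toy_ranking, ground_truth_rank(toy_ranking, "A02"))
```

In a real run, `embed()` would be called once on the 30 paragraphs and once on the 15 queries, and `ground_truth_rank` would supply the per-query rank that feeds the metrics.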
結果
| 指標 | 值 |
|---|---|
| Top-1 命中率 | 12/15 = 80.0% |
| Top-3 命中率 | 12/15 = 80.0% |
| Top-5 命中率 | 12/15 = 80.0% |
注意三個數字一樣——這比數字本身更有意思。命中的 12 個查詢全部 rank=1,沒命中的 3 個直接掉到 rank 6 以後。完全沒有「差一點」的情況。語意檢索常見的「rank 2 邊緣未命中」這個分布形狀,在 nomic 對繁中身上不存在。
主題拆分:
| 主題 | n | Top-1 | 平均 rank |
|---|---|---|---|
| B · Local AI | 5 | 5/5 | 1.0 |
| C · Taiwan regulation | 3 | 3/3 | 1.0 |
| D · 台灣料理 | 1 | 1/1 | 1.0 |
| E · 理財 | 1 | 1/1 | 1.0 |
| A · Apple ecosystem | 4 | 2/4 | 11.25 |
| F · 攝影 | 1 | 0/1 | 6.0 |
法規、本地 AI、料理、理財——四個主題全中。Apple 主題踩雷兩個。攝影那個是基礎光學概念,沒中。
Results
| Metric | Value |
|---|---|
| Top-1 hit rate | 12/15 = 80.0% |
| Top-3 hit rate | 12/15 = 80.0% |
| Top-5 hit rate | 12/15 = 80.0% |
The three numbers being identical is more interesting than the headline figure. All 12 hits ranked first; the 3 misses landed at ranks 6, 13, and 30. There are no near misses at rank 2 or 3. The usual gentle-decay shape of semantic retrieval doesn't show up for nomic on Traditional Chinese in this test.
Per-theme:
| Theme | n | Top-1 | Mean rank |
|---|---|---|---|
| B · Local AI | 5 | 5/5 | 1.0 |
| C · Taiwan regulation | 3 | 3/3 | 1.0 |
| D · Taiwanese food | 1 | 1/1 | 1.0 |
| E · Finance | 1 | 1/1 | 1.0 |
| A · Apple ecosystem | 4 | 2/4 | 11.25 |
| F · Photography | 1 | 0/1 | 6.0 |
Regulation, Local AI, food, and finance are all clean. Apple accounts for two of the three misses. The third is the photography query, a basic optics question.
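
The per-theme table can be rebuilt from (theme, rank) pairs. The sketch below uses the ranks reported in this post (twelve rank-1 hits, plus ranks 13 and 30 in theme A and rank 6 in theme F); the `per_theme` helper and the way the pairs are laid out are illustrative assumptions, not the format of tests/outputs/retrieval_results.jsonl.

```python
from collections import defaultdict
from statistics import mean

# (theme, ground-truth rank) for all 15 queries, taken from the tables in this post.
results = (
    [("B", 1)] * 5
    + [("C", 1)] * 3
    + [("D", 1), ("E", 1)]
    + [("A", 1), ("A", 1), ("A", 13), ("A", 30)]
    + [("F", 6)]
)

def per_theme(pairs):
    """Group ranks by theme and report n, Top-1 count, and mean rank."""
    by_theme = defaultdict(list)
    for theme, rank in pairs:
        by_theme[theme].append(rank)
    return {
        theme: {
            "n": len(ranks),
            "top1": sum(1 for r in ranks if r == 1),
            "mean_rank": mean(ranks),
        }
        for theme, ranks in by_theme.items()
    }

for theme, row in sorted(per_theme(results).items()):
    print(theme, row)
```

The Apple mean rank of 11.25 is just (1 + 1 + 13 + 30) / 4, which shows how much a single last-place result dominates a four-query average.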
三個失敗案例拆給你看
Q02 — rank 13
Query:為什麼蘋果晶片跑大模型比較有優勢?
Ground truth (A02):「Apple Silicon 的統一記憶體架構讓 CPU 與 GPU 共用同一塊 RAM……本地跑 7B 以上模型時這個架構特別有用。」
實際 Top-3:
- D05 · 肉圓(0.642)
- D02 · 鹽酥雞(0.631)
- F04 · 白平衡(0.628)
GT 排在第 13 名。明顯踩到「中英術語對應」的雷——查詢用了「蘋果晶片」,文件用了「Apple Silicon」。nomic 在繁中應該要能把這對映上,但顯然沒抓到。
Q03 — rank 30(最後一名)
Query:電腦剛裝完軟體變得很卡,跟系統搜尋功能有關嗎?
Ground truth (A03):「macOS 的 Spotlight 索引服務 mds 在背景跑時會吃 CPU,安裝大型程式後常出現幾分鐘明顯卡頓……」
實際 Top-3:
- D02 · 鹽酥雞(0.673)
- A04 · iPhone 健康 app(0.646)
- F04 · 白平衡(0.637)
GT 在 30 篇中排最後。這個失敗最有意思——查詢用的是日常語言(「卡」「系統搜尋功能」),文件用的是技術名詞(「Spotlight 索引服務 mds」)。nomic 在「日常 ↔ 技術」這層映射上似乎沒辦法跨。
Q15 — rank 6
Query:拍照背景模糊要怎麼調?
Ground truth (F01):「光圈用 f 值表示,數字越小光圈越大、進光量越多;同時景深越淺,背景虛化效果越明顯。」
實際 Top-3:
- D02 · 鹽酥雞(0.635)
- D03 · 牛肉麵(0.607)
- C05 · 新版身分證(0.604)
「背景模糊」與「景深越淺、背景虛化」是攝影裡的同義概念,但 nomic 沒對到。鹽酥雞為什麼會排第一……我也想知道。
一個怪現象:鹽酥雞文件是「未匹配磁鐵」
三個失敗案例的 Top-3 都有 D02 · 鹽酥雞:Q03、Q15 排在 Top-1,Q02 則排第二,僅次於 D05 · 肉圓。把 D02 的 embedding 跟其他 doc 比,它的相似度分布看起來特別「中庸」:當 query 跟所有 doc 都對不太上時,D02 就會浮上來。實務上的意涵:檢索失敗時,nomic 不會輸出「找不到」訊號,而是會給你一個語意上最「平庸」的文件當答案。這對 RAG pipeline 是個陷阱:LLM 拿到鹽酥雞文件回答 macOS 卡頓的問題,會直接幻覺出一篇看似合理的胡扯。
Three failures, unpacked
Q02 — rank 13
Query: "Why do Apple chips have an advantage for running large models?"
Ground truth (A02): "Apple Silicon's unified memory architecture lets CPU and GPU share one RAM pool… especially useful for 7B+ models locally."
Actual Top-3:
- D05 · meatballs (0.642)
- D02 · popcorn chicken (0.631)
- F04 · white balance (0.628)
GT lands at 13 out of 30. This looks like a Chinese-English vocabulary gap: the query says "蘋果晶片" ("Apple chip" in Chinese), the doc says "Apple Silicon" (in English). An embedder with solid Traditional Chinese coverage should map the two; nomic clearly doesn't.
Q03 — rank 30 (dead last)
Query: "My computer just slowed down after installing software — could it be the system search?"
Ground truth (A03): "macOS's Spotlight indexer mds eats CPU in the background; large installs cause noticeable lag for minutes…"
Actual Top-3:
- D02 · popcorn chicken (0.673)
- A04 · iPhone health app (0.646)
- F04 · white balance (0.637)
GT is dead last out of 30. This is the most telling failure: the query uses everyday language ("slow", "system search"), the doc uses technical terms ("Spotlight indexer", "mds"). On this evidence nomic doesn't seem able to bridge the everyday ↔ technical gap.
Q15 — rank 6
Query: "How do I make the background blurry when taking photos?"
Ground truth (F01): "Aperture is expressed as an f-number; smaller f-number means wider aperture, more light, shallower depth of field, more background blur."
Actual Top-3:
- D02 · popcorn chicken (0.635)
- D03 · beef noodle soup (0.607)
- C05 · new ID card (0.604)
"Background blur" and "shallow depth of field / bokeh" are synonyms in photography vocabulary. Nomic didn't connect them. Why popcorn chicken ranked first — your guess is as good as mine.
A weird pattern: the popcorn chicken doc is an "unmatched magnet"
D02 · popcorn chicken appears in the Top-3 of all three failures: it is Top-1 for Q03 and Q15, and second for Q02 behind D05 · meatballs. Examining D02's embedding against other docs, its similarity distribution is unusually "neutral": when a query doesn't fit any doc well, D02 rises toward the top by default. The practical takeaway: when retrieval misses, nomic doesn't signal "no match"; it returns the most semantically average doc. This is a trap for RAG pipelines: feed a macOS slowdown question to an LLM with the popcorn chicken doc as context and you'll get a confident hallucinated answer.
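
One cheap way to spot a magnet doc like this is to tally how often each doc ID shows up in the Top-K lists of failed queries. The sketch below does that for the three Top-3 lists shown above; the helper names are hypothetical, and a three-query sample is far too small to call the result a statistic.

```python
from collections import Counter

# Top-3 doc IDs for the three failed queries, copied from the listings above.
failed_top3 = {
    "Q02": ["D05", "D02", "F04"],
    "Q03": ["D02", "A04", "F04"],
    "Q15": ["D02", "D03", "C05"],
}

def topk_frequency(topk_by_query):
    """Count in how many queries each doc appears anywhere in the Top-K."""
    counts = Counter()
    for doc_ids in topk_by_query.values():
        counts.update(set(doc_ids))
    return counts

def top1_frequency(topk_by_query):
    """Count how often each doc is the Top-1 result."""
    return Counter(doc_ids[0] for doc_ids in topk_by_query.values())

print(topk_frequency(failed_top3).most_common(2))  # [('D02', 3), ('F04', 2)]
print(top1_frequency(failed_top3))
```

Run over a full query log, the same tally would flag any doc that keeps surfacing for unrelated queries, which is a reasonable candidate for manual inspection or down-weighting.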
共振站結論
波形評級:INTERFERENCE
可以用,但不能直接信任:
- 適合:RAG 第一階段「召回」,例如把可能相關的 doc 從 1000 篇縮到 20 篇。這次測試的 15 個 ground truth 有 14 個排在 30 篇中的前 13 名,放寬 K 可以救回大部分未命中;但 Q03 排第 30,代表 K 再大也不保證全中
- 不適合:直接把 nomic 的 Top-1 拿去當答案。失敗 20%、且失敗時的錯誤訊號是「鹽酥雞」這種荒謬等級——你的 LLM 會被誤導
- 必須補:Top-K 之後接一個 rerank 步驟。可以是 LLM rerank(把 Top-20 餵 gemma4:12b 讓它挑),或另一個專門的 cross-encoder
踩雷情境清單(要特別小心):
- 查詢用中文俗稱、文件用英文技術名詞(蘋果晶片 ↔ Apple Silicon)
- 查詢用日常語言、文件用領域術語(卡 ↔ Spotlight 索引服務)
- 查詢與文件是領域同義詞(背景模糊 ↔ 景深淺)
這三類我會建議在 prompt template 加一句「請幫我用更技術的用語改寫這個查詢」,讓 LLM 先擴寫 query 再丟給 nomic——這是廉價的補救方法。
Verdict
Waveform: INTERFERENCE
Usable, but don't trust it directly:
- Fits: first-stage RAG recall, for example shrinking 1000 docs to 20. In this test 14 of 15 ground truths ranked 13th or better out of 30, so a generous K recovers most misses on Traditional Chinese; Q03 at rank 30 shows it will not recover all of them.
- Doesn't fit: handing the raw Top-1 back as your answer. A 20% miss rate, combined with misses that surface a street-food doc for an Apple question, will mislead any downstream LLM.
- Must add: a rerank stage after Top-K. Either LLM-rerank (feed Top-20 to gemma4:12b and let it pick) or a dedicated cross-encoder.
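
The rerank stage described in the last bullet can be sketched as a function that takes first-stage candidates and a pluggable scorer. Everything here is an assumption made for illustration: `retrieve_then_rerank` and `char_overlap` are hypothetical names, and the character-overlap scorer is only a stand-in so the example runs offline. A real pipeline would plug in a cross-encoder or an LLM judge in its place.

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    candidates: List[Tuple[str, str]],      # (doc_id, text) from first-stage recall
    score_fn: Callable[[str, str], float],  # reranker: higher means more relevant
    keep: int = 3,
) -> List[Tuple[str, float]]:
    """Re-score first-stage candidates and keep the best few."""
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]

def char_overlap(query: str, text: str) -> float:
    """Placeholder scorer (Jaccard overlap of characters); NOT a real reranker."""
    q, t = set(query), set(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0

# Toy candidates in English, standing in for a Top-K list from the embedder.
candidates = [
    ("D02", "fried chicken snack"),
    ("F01", "aperture and background blur"),
]
print(retrieve_then_rerank("background blur", candidates, char_overlap, keep=1))
```

Passing the scorer as a callable keeps the recall stage and the rerank stage independent, so the LLM rerank and cross-encoder options can be swapped without touching the retrieval code.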
Failure patterns to watch for:
- Query uses casual Chinese name, doc uses English technical term (蘋果晶片 ↔ Apple Silicon)
- Query is colloquial, doc is domain jargon (卡, "laggy" ↔ Spotlight indexer)
- Query and doc are domain synonyms (background blur ↔ shallow depth of field)
For these three patterns I'd add a "rewrite this query using technical vocabulary" prompt step before nomic — a cheap mitigation.
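
A minimal sketch of that rewrite step follows, assuming Ollama's `/api/generate` endpoint on the same host with `stream` set to false. The prompt wording, the function names, and the choice of which model to pass in are all hypothetical; only `build_rewrite_prompt` runs without a server.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # assumed: same host as the embed endpoint

def build_rewrite_prompt(query: str) -> str:
    """Wrap the user query in a rewrite instruction (wording is illustrative)."""
    return (
        "Rewrite the following search query using more technical vocabulary. "
        "Return only the rewritten query.\n\n"
        f"Query: {query}"
    )

def rewrite_query(query: str, model: str, url: str = GENERATE_URL) -> str:
    """Ask a local LLM to rewrite the query (needs a running Ollama server)."""
    payload = json.dumps(
        {"model": model, "prompt": build_rewrite_prompt(query), "stream": False}
    ).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"].strip()

print(build_rewrite_prompt("拍照背景模糊要怎麼調?"))
```

The rewritten query would then be embedded in place of, or alongside, the original one. Whether that actually recovers Q02, Q03, and Q15 is untested here.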