Featured Teardown · Local AI Signature № 03 · INTERFERENCE

nomic-embed-text 對繁體中文的語意檢索準不準？30 doc + 15 query 實測

Does nomic-embed-text Handle Traditional Chinese Search? 30 Docs × 15 Queries

Josh Chen · June 10, 2026 · 10 min read

要在本機跑繁中語意搜尋，最常見的開放權重選擇是 nomic-embed-text——ollama pull nomic-embed-text 一個指令就拿得到、不用 API、不用 GPU。但 nomic 的訓練主要是英文，繁中到底準不準？我用 30 篇繁中段落 + 15 個刻意換句的查詢測一下 Top-K 命中率。

For local-only Traditional Chinese semantic search, the most common open-weight choice is nomic-embed-text: one ollama pull nomic-embed-text and you're done, no API, no GPU. But nomic is primarily English-trained, so how well does it actually handle Traditional Chinese? I ran it on 30 hand-written Traditional Chinese paragraphs and 15 deliberately paraphrased queries and measured Top-K hit rates.

設定

項目	規格
機器	MacBook Pro 14" M1 Pro, 16GB
Ollama	0.30.6
Embedding 模型	`nomic-embed-text` (274 MB, 768 維)
Endpoint	`http://localhost:11434/api/embed`
Corpus	30 段繁中段落，6 主題各 5 篇
Queries	15 個查詢，每個對應 1 個 ground-truth doc

Corpus 主題：

A（Apple ecosystem）：M1 Pro vs M1、unified memory、macOS Spotlight、iPhone 健康 app、iCloud 同步
B（Local AI / LLM）：Ollama HTTP API、K-quants、nomic embedding、streaming UX、Gemma 預設英文
C（Taiwan 法規）：租賃契約條款、不得記載事項、租金補貼、悠遊卡月票、新版身分證
D（台灣料理）：滷肉飯、鹽酥雞、牛肉麵、蘿蔔糕、肉圓
E（個人理財）：勞退新制、健保負擔、ETF（0050/0056）、信用卡循環利率、所得稅
F（攝影基礎）：光圈、ISO、RAW/JPEG、白平衡、等效焦距

每篇段落都嵌入一個明確的「事實 hook」——一個可被精確查詢的資料點，讓 query → doc 的對應關係可機器判定。

Query 設計原則：

每個 query 對應一個明確的 ground-truth doc
query 用詞與 doc 段落故意拉開差異（測語意，不測字面）
範例：doc 用「M1 是 68.25 GB/s，M1 Pro 拉到 200 GB/s」，query 問「我想知道 M1 Pro 比 M1 強的地方在哪」

指標：

Top-1 命中率：rank=1 的 query 比例
Top-3 命中率：ground truth 出現在 top 3 的比例
Top-5 命中率：同上但 top 5
平均 rank：所有 query ground truth 排名的平均

完整 corpus 與 query 在 tests/data/；腳本 tests/scripts/run_retrieval.py；結果 tests/outputs/retrieval_results.jsonl。

Setup

Item	Spec
Machine	MacBook Pro 14" M1 Pro, 16GB
Ollama	0.30.6
Embedding model	`nomic-embed-text` (274 MB, 768-dim)
Endpoint	`http://localhost:11434/api/embed`
Corpus	30 Chinese paragraphs, 5 per theme × 6 themes
Queries	15, each mapped to exactly one ground-truth doc

Corpus themes:

A (Apple): M1 Pro vs M1, unified memory, macOS Spotlight, iPhone health app, iCloud sync
B (Local AI/LLM): Ollama HTTP API, K-quants, nomic embedding, streaming UX, Gemma defaulting to English
C (Taiwan regulation): rental contract clauses, prohibited clauses, rent subsidy, EasyCard pass, new ID card
D (Taiwanese food): braised pork rice, popcorn chicken, beef noodle soup, daikon cake, meatballs
E (Personal finance): pension system, NHI premiums, ETFs (0050/0056), credit card APR, income tax
F (Photography basics): aperture, ISO, RAW/JPEG, white balance, equivalent focal length

Each paragraph embeds one explicit "fact hook" — a data point that a precise query can target — making query → doc mapping machine-decidable.

Query design rules:

Each query maps to exactly one ground-truth doc
Query phrasing is intentionally lexically distant from the doc (testing semantic match, not surface overlap)
Example: doc says "M1 is 68.25 GB/s, M1 Pro is 200 GB/s"; query asks "I want to know where M1 Pro outperforms M1"

Metrics:

Top-1 hit rate: queries where the ground truth ranks first
Top-3 hit rate: ground truth in top 3
Top-5 hit rate: ground truth in top 5
Mean rank: average ground-truth rank across all queries

Corpus and queries: tests/data/. Script: tests/scripts/run_retrieval.py. Results: tests/outputs/retrieval_results.jsonl.
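The retrieval loop described in this setup can be sketched roughly as follows. This is not the contents of tests/scripts/run_retrieval.py, only a minimal illustration. It assumes the Ollama /api/embed endpoint accepts a JSON body with "model" and "input" and returns an "embeddings" array, and that ranking is done by cosine similarity. The helper names embed, cosine, and rank_of are made up for this example.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # endpoint from the setup table
MODEL = "nomic-embed-text"


def embed(texts):
    """Embed a list of strings via a local Ollama server (must be running)."""
    body = json.dumps({"model": MODEL, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embeddings"]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_of(query_vec, doc_vecs, truth_id):
    """1-based rank of the ground-truth doc when docs are sorted by similarity.

    doc_vecs is a dict mapping doc ID -> embedding vector.
    """
    ordered = sorted(
        doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True
    )
    return ordered.index(truth_id) + 1
```

With a server running, something like doc_vecs = dict(zip(doc_ids, embed(doc_texts))) followed by rank_of(embed([query])[0], doc_vecs, truth_id) would give one ground-truth rank per query. Here doc_ids, doc_texts, query, and truth_id stand in for whatever the corpus files provide.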

結果

指標	值
Top-1 命中率	12/15 = 80.0%
Top-3 命中率	12/15 = 80.0%
Top-5 命中率	12/15 = 80.0%

注意三個數字一樣——這比數字本身更有意思。命中的 12 個查詢全部 rank=1，沒命中的 3 個直接掉到 rank 6 以後。完全沒有「差一點」的情況。語意檢索常見的「rank 2 邊緣未命中」這個分布形狀，在 nomic 對繁中身上不存在。

主題拆分：

主題	n	Top-1	平均 rank
B · Local AI	5	5/5	1.0
C · Taiwan regulation	3	3/3	1.0
D · 台灣料理	1	1/1	1.0
E · 理財	1	1/1	1.0
A · Apple ecosystem	4	2/4	11.25
F · 攝影	1	0/1	6.0

法規、本地 AI、料理、理財——四個主題全中。Apple 主題踩雷兩個。攝影那個是基礎光學概念，沒中。

Results

Metric	Value
Top-1 hit rate	12/15 = 80.0%
Top-3 hit rate	12/15 = 80.0%
Top-5 hit rate	12/15 = 80.0%

The three numbers being identical is more interesting than the headline. All 12 hits ranked first; the 3 misses landed at ranks 6, 13, and 30. There are no near misses at rank 2 or 3. On this small sample, the usual gentle-decay shape of semantic retrieval doesn't show up.

Per-theme:

Theme	n	Top-1	Mean rank
B · Local AI	5	5/5	1.0
C · Taiwan regulation	3	3/3	1.0
D · Taiwanese food	1	1/1	1.0
E · Finance	1	1/1	1.0
A · Apple ecosystem	4	2/4	11.25
F · Photography	1	0/1	6.0

Regulation, Local AI, food, and finance are all clean. Apple misses two of its four queries. The single photography query, a basic optics question, also missed.
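The headline metrics can be recomputed from the ranks reported in this article alone: 12 queries at rank 1, plus Q02 at 13, Q03 at 30, and Q15 at 6. The sketch below is only that arithmetic. The function names are illustrative, and the Top-20 and overall mean rank figures are derived here rather than taken from the results tables.

```python
# Ground-truth ranks as reported: 12 hits at rank 1, then Q02=13, Q03=30, Q15=6.
ranks = [1] * 12 + [13, 30, 6]


def hit_rate(ranks, k):
    """Fraction of queries whose ground truth is within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_rank(ranks):
    """Average ground-truth rank across queries."""
    return sum(ranks) / len(ranks)


for k in (1, 3, 5, 20):
    print(f"Top-{k}: {hit_rate(ranks, k):.1%}")
print(f"Mean rank: {mean_rank(ranks):.2f}")

# Apple theme only: two hits at rank 1, plus Q02 (13) and Q03 (30).
print(f"Apple mean rank: {mean_rank([1, 1, 13, 30]):.2f}")
```

This prints 80.0% for Top-1, Top-3, and Top-5, 93.3% for Top-20, an overall mean rank of 4.07, and 11.25 for the Apple theme, which matches the per-theme table.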

三個失敗案例拆給你看

Q02 — rank 13

Query：為什麼蘋果晶片跑大模型比較有優勢？

Ground truth (A02)：「Apple Silicon 的統一記憶體架構讓 CPU 與 GPU 共用同一塊 RAM……本地跑 7B 以上模型時這個架構特別有用。」

實際 Top-3：

D05 · 肉圓（0.642）
D02 · 鹽酥雞（0.631）
F04 · 白平衡（0.628）

GT 排在第 13 名。明顯踩到「中英術語對應」的雷——查詢用了「蘋果晶片」，文件用了「Apple Silicon」。nomic 在繁中應該要能把這對映上，但顯然沒抓到。

Q03 — rank 30（最後一名）

Query：電腦剛裝完軟體變得很卡，跟系統搜尋功能有關嗎？

Ground truth (A03)：「macOS 的 Spotlight 索引服務 mds 在背景跑時會吃 CPU，安裝大型程式後常出現幾分鐘明顯卡頓……」

實際 Top-3：

D02 · 鹽酥雞（0.673）
A04 · iPhone 健康 app（0.646）
F04 · 白平衡（0.637）

GT 在 30 篇中排最後。這個失敗最有意思——查詢用的是日常語言（「卡」「系統搜尋功能」），文件用的是技術名詞（「Spotlight 索引服務 mds」）。nomic 在「日常 ↔ 技術」這層映射上似乎沒辦法跨。

Q15 — rank 6

Query：拍照背景模糊要怎麼調？

Ground truth (F01)：「光圈用 f 值表示，數字越小光圈越大、進光量越多；同時景深越淺，背景虛化效果越明顯。」

實際 Top-3：

D02 · 鹽酥雞（0.635）
D03 · 牛肉麵（0.607）
C05 · 新版身分證（0.604）

「背景模糊」與「景深越淺、背景虛化」是攝影裡的同義概念，但 nomic 沒對到。鹽酥雞為什麼會排第一……我也想知道。

一個怪現象：鹽酥雞文件是「未匹配磁鐵」

三個失敗案例的 Top-3 裡都有 D02 · 鹽酥雞：Q03 和 Q15 排第一，Q02 排第二（第一名是 D05 · 肉圓）。把 D02 的 embedding 跟其他 doc 比，它的相似度分布看起來特別「中庸」，當 query 跟所有 doc 都對不太上時，D02 就會浮上來。實務上的意涵：檢索失敗時，nomic 不會輸出「找不到」訊號，而是會給你一個語意上最「平庸」的文件當答案。這對 RAG pipeline 是個陷阱：LLM 拿到小吃文件回答 Apple Silicon 的問題，很可能直接幻覺出一篇看似合理的胡扯。

Three failures, unpacked

Q02 — rank 13

Query: "Why do Apple chips have an advantage for running large models?"

Ground truth (A02): "Apple Silicon's unified memory architecture lets CPU and GPU share one RAM pool… especially useful for 7B+ models locally."

Actual Top-3:

D05 · meatballs (0.642)
D02 · popcorn chicken (0.631)
F04 · white balance (0.628)

GT lands at 13 out of 30. Classic Chinese-English vocab gap — the query says "蘋果晶片" (Apple chip in Chinese), the doc says "Apple Silicon" (English). A properly TC-trained embedder should map these; nomic clearly doesn't.

Q03 — rank 30 (dead last)

Query: "My computer just slowed down after installing software — could it be the system search?"

Ground truth (A03): "macOS's Spotlight indexer mds eats CPU in the background; large installs cause noticeable lag for minutes…"

Actual Top-3:

D02 · popcorn chicken (0.673)
A04 · iPhone health app (0.646)
F04 · white balance (0.637)

GT is dead last out of 30. This is the most telling failure: the query uses everyday language ("slow", "system search"), while the doc uses technical terms ("Spotlight indexer", "mds"). On this query at least, nomic seems unable to bridge the everyday ↔ technical gap.

Q15 — rank 6

Query: "How do I make the background blurry when taking photos?"

Ground truth (F01): "Aperture is expressed as an f-number; smaller f-number means wider aperture, more light, shallower depth of field, more background blur."

Actual Top-3:

D02 · popcorn chicken (0.635)
D03 · beef noodle soup (0.607)
C05 · new ID card (0.604)

"Background blur" and "shallow depth of field / bokeh" are synonyms in photography vocabulary. Nomic didn't connect them. Why popcorn chicken ranked first — your guess is as good as mine.

A weird pattern: the popcorn chicken doc is an "unmatched magnet"

D02 · popcorn chicken shows up in the Top-3 of all three failures: it is Top-1 for Q03 and Q15, and second for Q02 (behind D05 · meatballs). Comparing D02's embedding against the other docs, its similarity distribution looks unusually "neutral": when a query doesn't fit any doc well, D02 floats toward the top. The practical takeaway: when retrieval misses, nomic doesn't signal "no match"; it returns the most semantically average doc. This is a trap for RAG pipelines: hand an LLM a street-food doc as context for an Apple Silicon question and it will likely hallucinate a plausible-sounding answer.
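One hedged way to spot a magnet doc is to count how often each doc appears near the top for queries that missed. The tally below uses only the Top-3 lists reported for Q02, Q03, and Q15; the variable names are illustrative, and three queries is far too few to call this a general property of the model.

```python
from collections import Counter

# Top-3 doc IDs for the three failed queries, in ranked order, as listed above.
failed_top3 = {
    "Q02": ["D05", "D02", "F04"],
    "Q03": ["D02", "A04", "F04"],
    "Q15": ["D02", "D03", "C05"],
}

# How often each doc is the Top-1 result on a failed query.
top1_counts = Counter(docs[0] for docs in failed_top3.values())

# How often each doc appears anywhere in the Top-3 on a failed query.
top3_counts = Counter(doc for docs in failed_top3.values() for doc in docs)

print(top1_counts.most_common(1))  # [('D02', 2)]
print(top3_counts.most_common(2))  # [('D02', 3), ('F04', 2)]
```

D02 is Top-1 in two of the three failures and in the Top-3 of all three, with F04 · white balance appearing twice. On a larger query set the same count could flag docs worth down-weighting or inspecting.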

共振站結論

波形評級：INTERFERENCE

可以用，但不能直接信任：

適合：RAG 第一階段「召回」，目標是把可能相關的 doc 從 1000 篇縮到 20 篇。在這次 30 篇的 corpus 裡，Top-20 可以撈回 15 個查詢中的 14 個（Q03 排第 30，仍然漏掉）；1000 篇規模這次沒有測。
不適合：直接把 nomic 的 Top-1 拿去當答案。失敗 20%、且失敗時的錯誤訊號是「鹽酥雞」這種荒謬等級——你的 LLM 會被誤導
必須補：Top-K 之後接一個 rerank 步驟。可以是 LLM rerank（把 Top-20 餵 gemma4:12b 讓它挑），或另一個專門的 cross-encoder

踩雷情境清單（要特別小心）：

查詢用中文俗稱、文件用英文技術名詞（蘋果晶片 ↔ Apple Silicon）
查詢用日常語言、文件用領域術語（卡 ↔ Spotlight 索引服務）
查詢與文件是領域同義詞（背景模糊 ↔ 景深淺）

這三類我會建議在 prompt template 加一句「請幫我用更技術的用語改寫這個查詢」，讓 LLM 先擴寫 query 再丟給 nomic——這是廉價的補救方法。

在 Ollama Library 看 nomic-embed-text →

Verdict

Waveform: INTERFERENCE

Usable, but don't trust it directly:

Fits: first-stage RAG recall, e.g. shrinking 1000 docs to 20. In this 30-doc corpus, Top-20 would recover 14 of 15 ground truths (Q03 at rank 30 still misses); recall at 1000-doc scale was not measured here.
Doesn't fit: handing the raw Top-1 back as your answer. A 20% miss rate plus the catastrophic failure mode (a street-food doc returned for an Apple Silicon question) will mislead any downstream LLM.
Must add: a rerank stage after Top-K. Either LLM-rerank (feed Top-20 to gemma4:12b and let it pick) or a dedicated cross-encoder.

Failure patterns to watch for:

Query uses casual Chinese name, doc uses English technical term (蘋果晶片 ↔ Apple Silicon)
Query is colloquial, doc is domain jargon (卡, "laggy" ↔ Spotlight indexer)
Query and doc are domain synonyms (background blur ↔ shallow depth of field)

For these three patterns I'd add a "rewrite this query using technical vocabulary" prompt step before nomic — a cheap mitigation.
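The recall, rerank, and query-rewrite steps recommended above could be wired together along these lines. This is a structural sketch only: recall, rerank_score, and rewrite are hypothetical callables you would supply yourself (for example an embedding Top-K search, a cross-encoder or LLM scorer, and an LLM prompt that adds technical vocabulary). Nothing here calls a real model.

```python
from typing import Callable, List, Tuple

Doc = Tuple[str, str]  # (doc_id, text)


def retrieve_then_rerank(
    query: str,
    recall: Callable[[str, int], List[Doc]],
    rerank_score: Callable[[str, str], float],
    rewrite: Callable[[str], str] = lambda q: q,
    recall_k: int = 20,
    final_k: int = 3,
) -> List[Doc]:
    """Recall candidates with the embedder, then reorder them with a second scorer."""
    # Optional query expansion before the embedding search.
    expanded = rewrite(query)
    # First stage: embedding Top-K over the expanded query.
    candidates = recall(expanded, recall_k)
    # Second stage: score each candidate against the ORIGINAL query (an
    # assumption of this sketch, so the rewrite cannot drift the final ranking).
    ordered = sorted(
        candidates, key=lambda d: rerank_score(query, d[1]), reverse=True
    )
    return ordered[:final_k]
```

Scoring against the original query rather than the rewritten one is a design choice, not something the tests in this article evaluated; the rerank stage also cannot recover a doc that the recall stage failed to return.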

See nomic-embed-text on Ollama Library →

Verdict

Signature № 03 · INTERFERENCE

命中率 80%，但失敗的 20% 是雪崩式錯到底——適合 RAG 第一階段粗篩，不適合 production 排名。

80% Top-1 hit rate, but when it misses, it misses by a mile. Good for first-stage recall, not for production ranking.

30 doc × 15 query 跑下來，12 個查詢 Top-1 命中、3 個直接掉到 rank 6 以後。沒有「rank 2 邊緣未命中」這種狀況——nomic 在繁中要嘛精準對到、要嘛把肉圓鹽酥雞排在 Apple Silicon 前面。實務建議：用它做 RAG 第一階段（Top-20 召回）撐得住，但不要把排名直接拿去當答案，後面要接一個重排序步驟或 LLM rerank。

12 of 15 queries hit Top-1; the other 3 fell to rank 6 or worse. There's no 'near miss at rank 2' — nomic on Traditional Chinese either nails it or puts popcorn chicken above Apple Silicon. Practical rule: fine for a first-stage RAG recall pass (Top-20), but don't use raw rank as your answer ordering — add an LLM-rerank step downstream.

