Turn Your Blog Into a Searchable Knowledge Base: Local RAG with Ollama

Tired of digging through old articles for the same answer?

I've written 5 articles on Ollama, n8n, and LangGraph. A month later, when I want to recall "what was LangGraph's retry success rate again?", I open the blog, hit Cmd+F, and scroll up and down hunting for the answer.

That's exactly the problem RAG (Retrieval-Augmented Generation) solves: turn everything you've written into a searchable knowledge base for an LLM.

This article does it with fully local tooling: Ollama (gemma4 + nomic-embed-text) + ChromaDB + ~200 lines of Python.


The Three Steps of RAG

RAG flow: articles → chunks → embeddings → ChromaDB → LLM
  1. Chunk: split articles into 500-800 token pieces, each self-contained
  2. Embed: convert each chunk into a vector (768-dim) using an embedding model
  3. Retrieve: at query time, embed the question too, find the top-3 most similar chunks, feed them as context to the LLM

Key insight: the embedding model and the generation LLM are two different models. nomic-embed-text only produces semantic vectors; it can't answer questions. gemma4:e2b only generates text; it plays no part in semantic retrieval.
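To make the retrieve step concrete, here is a minimal sketch of top-k retrieval by cosine distance. The `cosine_distance` and `top_k` helpers, the chunk ids, and the toy 3-dimensional vectors are all illustrative assumptions; real nomic-embed-text vectors have 768 dimensions, and ChromaDB runs the nearest-neighbour search internally with whatever distance metric the collection is configured for.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, store, k=3):
    """Rank (chunk_id, vector) pairs by distance to the query, closest first."""
    scored = [(cosine_distance(query_vec, vec), chunk_id) for chunk_id, vec in store]
    return sorted(scored)[:k]

# Toy 3-dim vectors standing in for 768-dim embeddings (made-up values).
store = [
    ("langgraph-retry", [0.9, 0.1, 0.0]),
    ("ollama-setup", [0.1, 0.9, 0.0]),
    ("n8n-webhooks", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.0]  # stands in for the embedded question

results = top_k(query, store, k=2)
for dist, chunk_id in results:
    print(f"{chunk_id}: {dist:.3f}")
```

The question and the chunks must be embedded by the same model; otherwise the distances are meaningless.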


Ingest: Writing Articles to the Vector Store

Running ingest_articles.py: 5 articles parsed into 50 chunks, all written to ChromaDB

ingest_articles.py does this:

  1. Scans content///index.md
  2. Strips frontmatter and shortcodes (keeps clean text)
  3. Chunks (500 tokens each, 50-token overlap)
  4. Calls ollama embeddings nomic-embed-text for each chunk
  5. Writes to ChromaDB with metadata (article slug, category, chunk index)
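Steps 2 and 3 above can be sketched as follows. This is not the code from ingest_articles.py: the regular expressions assume YAML frontmatter delimited by `---` and Hugo-style `{{< ... >}}` / `{{% ... %}}` shortcodes, and "tokens" are approximated by whitespace-separated words, since the article does not say which tokenizer it uses. The function names `clean` and `chunk` are hypothetical.

```python
import re

# Assumed formats: YAML frontmatter between '---' lines, Hugo-style shortcodes.
FRONTMATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
SHORTCODE = re.compile(r"\{\{[<%].*?[%>]\}\}", re.DOTALL)

def clean(markdown):
    """Drop the frontmatter block and any shortcodes, keeping the body text."""
    text = FRONTMATTER.sub("", markdown)
    return SHORTCODE.sub("", text).strip()

def chunk(text, size=500, overlap=50):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than size")
    tokens = text.split()  # whitespace words as a rough stand-in for tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

doc = '---\ntitle: Demo\n---\nHello {{< figure src="a.png" >}} world'
print(clean(doc).split())

pieces = chunk(" ".join(f"w{i}" for i in range(12)), size=5, overlap=2)
print(pieces)
```

The overlap means a sentence that straddles a boundary appears in full in at least one chunk, at the cost of storing some text twice.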

5 articles → 50 chunks → ~30 seconds total.
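Steps 4 and 5 (embed, then write with metadata) might look like the sketch below. The `ingest` function takes the embedder and the collection as parameters so it can be exercised without a running Ollama server. `real_backends`, the `chroma_db` path, and the `articles` collection name are placeholders of mine, not values from the article; the metadata keys mirror the three fields listed in step 5.

```python
def ingest(slug, category, chunks, embed, collection):
    """Embed each chunk and store it with slug, category, and chunk index."""
    if not chunks:
        return []
    ids = [f"{slug}-{i}" for i in range(len(chunks))]
    collection.add(
        ids=ids,
        embeddings=[embed(text) for text in chunks],
        documents=chunks,
        metadatas=[
            {"slug": slug, "category": category, "chunk_index": i}
            for i in range(len(chunks))
        ],
    )
    return ids

def real_backends(db_path="chroma_db", name="articles"):
    """Wire up Ollama and ChromaDB; path and collection name are placeholders."""
    import chromadb  # imported lazily so ingest() works without them installed
    import ollama

    collection = chromadb.PersistentClient(path=db_path).get_or_create_collection(name)

    def embed(text):
        return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

    return embed, collection
```

With both services running, usage would be along the lines of `embed, collection = real_backends()` followed by one `ingest(...)` call per article.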

ChromaDB preview: 3 chunks with metadata and first 100 chars

RAG Chatbot: Asking Questions

RAG demo CLI: asking 'LangGraph retry success rate?' and getting a precise answer

rag_chatbot.py flow:

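```python
import chromadb
import ollama

# Assumption: the store path and collection name below are placeholders.
collection = chromadb.PersistentClient(path="chroma_db").get_collection("articles")

question = input("> ")
# 1. embed the question with the same model used at ingest time
q_vec = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
# 2. retrieve top-3 chunks (query returns one list of documents per query)
hits = collection.query(query_embeddings=[q_vec], n_results=3)
chunks = "\n\n".join(hits["documents"][0])
# 3. build context-augmented prompt
prompt = f"Reference material:\n{chunks}\n\nQuestion: {question}\nAnswer based on the material."
# 4. generate with gemma4
answer = ollama.generate(model="gemma4:e2b", prompt=prompt)["response"]
print(answer)
```

Top-3 retrieved chunks with similarity scores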
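A listing like the one in that screenshot can be produced from the query result itself. The sketch below assumes the result has the shape ChromaDB's `collection.query()` returns for a single query (lists nested one level per query) and that each chunk carries the `slug` and `chunk_index` metadata written at ingest time. `format_hits` is a hypothetical helper, and the sample values are made up.

```python
def format_hits(result, preview=100):
    """Turn one Chroma-style query result into printable lines, closest first."""
    rows = zip(result["distances"][0], result["metadatas"][0], result["documents"][0])
    lines = []
    for rank, (dist, meta, doc) in enumerate(rows, start=1):
        snippet = doc[:preview].replace("\n", " ")
        lines.append(f"{rank}. dist={dist:.3f} {meta['slug']}#{meta['chunk_index']}: {snippet}")
    return lines

# Made-up result in the shape returned for a single query.
sample = {
    "distances": [[0.21, 0.29]],
    "metadatas": [[
        {"slug": "article-a", "chunk_index": 4},
        {"slug": "article-b", "chunk_index": 0},
    ]],
    "documents": [["First chunk text...", "Second chunk text..."]],
}
hit_lines = format_hits(sample)
print("\n".join(hit_lines))
```

Printing the slugs next to the distances makes it easy to see at a glance whether the retrieved chunks come from the article you expected.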

RAG vs Pure LLM

Same question 'What security vulnerabilities does OpenClaude have?' Pure gemma4 vs RAG answer

Question: "What security vulnerabilities does OpenClaude have?"

Pure gemma4: "Since I cannot perform real-time security audits on specific running software, I cannot provide an accurate, up-to-date list of vulnerabilities for OpenClaude." (Refusal, followed by generic LLM risk categories.)

RAG version: "Vulnerability type: Sandbox Bypass → Path Traversal. Severity: High. ID: GHSA-m6rx-7pvw-2f73. Disclosed: 2026-04-20. Mitigation: Docker container isolation." (Cites the openclaude-ollama article.)

This is RAG at its best: the pure LLM refuses to answer, while RAG gives verifiable specifics. But this is one of only 2 successes out of 5 in-corpus questions.


Two Kinds of Failure: OOC vs Chunk Granularity

Two failure modes: in-corpus but wrong chunks retrieved (LangGraph retry count) vs OOC but LLM still hallucinates (Stripe pricing)

Across the 7 test questions, two classes stand out, and they behave very differently:

Class 1 — Out-of-corpus (topics Resonance Stack never covered)

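Q: "How does Stripe charge for subscription plans?" The distances [0.274, 0.319, 0.334] suggest the retrieved chunks are reasonably close, but the LLM reads the actual chunks, judges that they cannot answer the question, and correctly responds "no answer in the material" ✅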

Meanwhile, pure gemma4 (without RAG) "tries to answer" and produces a generic Stripe overview that may contain outdated or fabricated specifics. This is RAG's real value — the ability to refuse.
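One way to encourage this refusal behaviour is to spell out the fallback in the prompt. The sketch below is an assumption on my part, not the prompt rag_chatbot.py uses: `build_prompt`, `is_refusal`, and the exact refusal wording are hypothetical, and a small model may not follow the instruction every time.

```python
REFUSAL = "no answer in the material"

def build_prompt(chunks, question):
    """Number the retrieved chunks and tell the model how to decline."""
    material = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        f"Reference material:\n{material}\n\n"
        f"Question: {question}\n"
        "Answer using only the reference material. "
        f"If it does not contain the answer, reply exactly: {REFUSAL}"
    )

def is_refusal(answer):
    """Detect the agreed refusal phrase, ignoring case and extra text."""
    return REFUSAL in answer.lower()

demo_prompt = build_prompt(["Chunk about Ollama.", "Chunk about n8n."], "How does Stripe charge?")
print(demo_prompt)
```

A fixed refusal phrase also makes the out-of-corpus cases easy to score automatically.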

Class 2 — In-corpus but wrong chunk granularity

Q: "How many times did the retry mechanism trigger in the LangGraph dynamic workflow?" Retrieval should pull from the static-vs-dynamic article, but the top-3 chunks came from mcp-local-workflow and openclaude-ollama. The fact "retry triggered 0 times" exists in the corpus but not in the top-3 chunks, so RAG answers "no answer in the material" ❌

This isn't an embedding model problem (all distances < 0.32), nor an LLM reasoning problem. It's a chunking strategy problem: chunk size, overlap, and metadata all affect retrieval quality.
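The reasoning above can be turned into a small diagnostic that separates retrieval misses from generation misses. `classify_failure` and its labels are hypothetical; the slugs in the example are the ones named for the LangGraph retry question.

```python
def classify_failure(expected_slug, retrieved_slugs, answered_correctly):
    """Say where an in-corpus question went wrong, if it did."""
    if answered_correctly:
        return "ok"
    if expected_slug not in retrieved_slugs:
        # The right article never reached the LLM: look at chunking and retrieval.
        return "retrieval miss"
    # The right article was retrieved but the answer was still wrong.
    return "generation miss"

verdict = classify_failure(
    expected_slug="static-vs-dynamic",
    retrieved_slugs=["mcp-local-workflow", "openclaude-ollama"],
    answered_correctly=False,
)
print(verdict)
```

A "retrieval miss" points at chunk size, overlap, or metadata, while a "generation miss" points at the prompt or the model.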

Actual hit rate:

| Category | Hit / Total |
|---|---|
| Out-of-corpus refusal | 2 / 2 |
| In-corpus retrieval | 2 / 5 |
| Overall | 4 / 7 (57%) |
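As a quick check of the arithmetic, the overall figure is the sum of the per-category hits over the sum of the totals. The snippet uses only the counts reported above; the variable names are mine.

```python
# (hits, total) per category, taken from the results reported above.
results = {
    "Out-of-corpus refusal": (2, 2),
    "In-corpus retrieval": (2, 5),
}

hits = sum(h for h, _ in results.values())
total = sum(t for _, t in results.values())

for name, (h, t) in results.items():
    print(f"{name}: {h}/{t} = {h / t:.0%}")
print(f"Overall: {hits}/{total} = {hits / total:.0%}")
```

The 57% headline hides the split: refusals are 100% while in-corpus retrieval is only 40%, which is where the chunking work needs to go.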

Decision Framework: RAG vs Fine-tune

| | Prompt Engineering | RAG | Fine-tune |
|---|---|---|---|
| Cost | $0 | Low (one-time embed) | High (training data + GPU) |
| Best for | Change response style | Add fresh / private info | Change model behavior |
| Maintenance | Edit the prompt | Re-ingest | Re-train |
| Try order | First | Second | Last resort |

90% of "I want the AI to know about our company" needs can be solved with RAG.