Featured Teardown · Local AI Signature № 03 · INTERFERENCE

Turn Your Blog Into a Searchable Knowledge Base: Local RAG with Ollama

Josh Chen · June 25, 2026 · 13 min read

Tired of digging through old articles for the same answer?

I've written 5 articles on Ollama, n8n, and LangGraph. A month later, when I want to recall "what was LangGraph's retry success rate again?", I open the blog, hit Cmd+F, and scroll up and down hunting for the answer.

That's exactly the problem RAG (Retrieval-Augmented Generation) solves: turn everything you've written into a searchable knowledge base for an LLM.

This article does it with fully local tooling: Ollama (gemma4 + nomic-embed-text) + ChromaDB + ~200 lines of Python.

The Three Steps of RAG

RAG flow: articles → chunks → embeddings → ChromaDB → LLM

Chunk: split articles into 500-800 token pieces, each self-contained
Embed: convert each chunk into a vector (768-dim) using an embedding model
Retrieve: at query time, embed the question too, find the top-3 most similar chunks, feed them as context to the LLM

Key insight: the embedding model and the generation LLM are different. nomic-embed-text only produces semantic vectors; it can't answer questions. gemma4:e2b only generates; it can't do semantic search.
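
The retrieve step can be illustrated without any model at all. The sketch below is a minimal illustration that uses made-up 3-dimensional vectors in place of real 768-dimensional embeddings and ranks them by cosine similarity. The names `cosine_similarity` and `top_k`, the chunk ids, and the vectors are all hypothetical and are not taken from the article's scripts. ChromaDB reports distances (lower is closer) rather than similarities, so the numbers here are not comparable to the dists quoted later.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy vectors standing in for real embeddings (assumption: illustrative only).
chunk_vecs = {
    "langgraph-retry": [0.9, 0.1, 0.0],
    "ollama-ram": [0.1, 0.9, 0.0],
    "n8n-webhook": [0.0, 0.2, 0.9],
    "langgraph-state": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a LangGraph question

print(top_k(query_vec, chunk_vecs, k=3))
# ['langgraph-retry', 'langgraph-state', 'ollama-ram']
```

Only the ids of the winning chunks come back; their text is what gets pasted into the prompt as context.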

Ingest: Writing Articles to the Vector Store

Running ingest_articles.py: 5 articles parsed into 50 chunks, all written to ChromaDB

ingest_articles.py does this:

Scans content///index.md
Strips frontmatter and shortcodes (keeps clean text)
Chunks (500 tokens each, 50-token overlap)
Calls ollama embeddings nomic-embed-text for each chunk
Writes to ChromaDB with metadata (article slug, category, chunk index)

5 articles → 50 chunks → ~30 seconds total.
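
The cleaning and chunking steps listed above could look roughly like the sketch below. It is not the actual ingest_articles.py: the function names are hypothetical, the shortcode pattern assumes Hugo-style `{{< ... >}}` and `{{% ... %}}` tags, and whitespace-separated words stand in for real tokenizer tokens.

```python
import re

def strip_frontmatter(text):
    """Drop a leading '---' delimited frontmatter block, if present."""
    return re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)

def strip_shortcodes(text):
    """Remove shortcode tags such as {{< figure ... >}} (assumed Hugo syntax)."""
    return re.sub(r"\{\{[<%].*?[%>]\}\}", "", text, flags=re.DOTALL)

def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

raw = '---\ntitle: demo\n---\nHello {{< figure src="a.png" >}} world'
clean = strip_shortcodes(strip_frontmatter(raw))
tokens = clean.split()  # crude word-level stand-in for model tokens

demo = chunk_tokens(list(range(1200)), size=500, overlap=50)
print(tokens, len(demo), [len(c) for c in demo])
```

With a 500-token window and 50-token overlap the window advances 450 tokens at a time, so each chunk repeats the tail of the previous one. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.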

ChromaDB preview: 3 chunks with metadata and first 100 chars
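
As a rough picture of what each stored record might contain, the sketch below shapes one article's chunks into the parallel lists that a ChromaDB collection's `add()` method accepts (`ids`, `embeddings`, `documents`, `metadatas`). The id format, the metadata key names, the `local-ai` category, and the fake embedder are assumptions made for illustration; the real script may name things differently.

```python
def build_records(slug, category, chunks, embed):
    """Shape chunks into keyword arguments for a ChromaDB collection.add() call."""
    ids = [f"{slug}-{i}" for i in range(len(chunks))]  # hypothetical id scheme
    embeddings = [embed(chunk) for chunk in chunks]
    metadatas = [
        {"slug": slug, "category": category, "chunk_index": i}
        for i in range(len(chunks))
    ]
    return {
        "ids": ids,
        "embeddings": embeddings,
        "documents": list(chunks),
        "metadatas": metadatas,
    }

def ingest(collection, records):
    """Write one article's records; `collection` is assumed to be a ChromaDB collection."""
    collection.add(**records)

# Fake embedder so the sketch runs without Ollama (assumption: demo only).
def fake_embed(text):
    return [float(len(text)), 0.0, 1.0]

records = build_records("openclaude-ollama", "local-ai", ["chunk a", "chunk bb"], fake_embed)
print(records["ids"], records["metadatas"][1])

# With chromadb installed, the write itself would look something like:
#   import chromadb
#   client = chromadb.PersistentClient(path="./chroma_db")
#   collection = client.get_or_create_collection(name="articles")
#   ingest(collection, records)
```

Storing the slug and chunk index as metadata is what makes it possible to see, later on, which article each retrieved chunk came from.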

RAG Chatbot: Asking Questions

RAG demo CLI: asking 'LangGraph retry success rate?' and getting a precise answer

rag_chatbot.py flow:

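question = input("> ")
# 1. embed the question with the same model used at ingest time
q_vec = ollama_embed(question)
# 2. retrieve the top-3 chunks (assumes `chroma` is a ChromaDB collection,
#    whose query() takes a list of query embeddings)
results = chroma.query(query_embeddings=[q_vec], n_results=3)
chunks = "\n\n".join(results["documents"][0])  # chunk texts for the first (only) query
# 3. build context-augmented prompt
prompt = f"Reference material:\n{chunks}\n\nQuestion: {question}\nAnswer based on the material."
# 4. generate with gemma4
answer = ollama_generate(prompt)
print(answer)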

Top-3 retrieved chunks with similarity scores
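
The flow calls `ollama_embed` and `ollama_generate` without showing them. One plausible way to write them, assuming Ollama is running on its default local port and using its `/api/embeddings` and `/api/generate` REST endpoints, is sketched below. The payload helpers and `_post` are hypothetical names, and the author's script may use the `ollama` Python package or the CLI instead.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address

def embed_payload(text, model="nomic-embed-text"):
    """Request body for the embeddings endpoint."""
    return {"model": model, "prompt": text}

def generate_payload(prompt, model="gemma4:e2b"):
    """Request body for a single, non-streamed completion."""
    return {"model": model, "prompt": prompt, "stream": False}

def _post(path, payload):
    """POST JSON to the local Ollama server and decode the JSON reply."""
    req = urllib.request.Request(
        OLLAMA_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def ollama_embed(text):
    """Return the embedding vector for one piece of text."""
    return _post("/api/embeddings", embed_payload(text))["embedding"]

def ollama_generate(prompt):
    """Return the generated answer text for one prompt."""
    return _post("/api/generate", generate_payload(prompt))["response"]
```

Nothing here touches the network until one of the two `ollama_*` functions is called, so the module can be imported and the payloads inspected on a machine without Ollama.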

RAG vs Pure LLM

Question: "What security vulnerabilities does OpenClaude have?"

Pure gemma4: "Since I cannot perform real-time security audits on specific running software, I cannot provide an accurate, up-to-date list of vulnerabilities for OpenClaude." (Refusal, followed by generic LLM risk categories.)

RAG version: "Vulnerability type: Sandbox Bypass → Path Traversal. Severity: High. ID: GHSA-m6rx-7pvw-2f73. Disclosed: 2026-04-20. Mitigation: Docker container isolation." (Cites the openclaude-ollama article.)

This is RAG at its best: the pure LLM refuses to answer, while RAG gives verifiable specifics. But this is one of only 2 successes out of 5 in-corpus questions.

Two Kinds of Failure: OOC vs Chunk Granularity

Two failure modes: in-corpus but wrong chunks retrieved (LangGraph retry count) vs OOC, where the pure LLM still hallucinates (Stripe pricing)

After running 7 questions, failures fall into two categories with completely different natures:

Class 1 — Out-of-corpus (topics Resonance Stack never covered)

Q: "How does Stripe charge?" — dists [0.274, 0.319, 0.334] look reasonably close, but the LLM reads the actual chunks and correctly responds "not in the knowledge base" ✅

Meanwhile, pure gemma4 (without RAG) "tries to answer" and produces a generic Stripe overview that may contain outdated or fabricated specifics. This is RAG's real value — the ability to refuse.
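
The refusal behavior depends on the model being told to stay inside the retrieved material. The prompt shown earlier only says "Answer based on the material", so the explicit refusal line in the sketch below is an assumption about one way to encourage that behavior, not the prompt the article used. `build_prompt` and `REFUSAL` are hypothetical names.

```python
REFUSAL = "not in the knowledge base"

def build_prompt(question, chunks):
    """Build a grounded prompt; `chunks` is a list of (slug, text) pairs."""
    context = "\n\n".join(f"[{slug}]\n{text}" for slug, text in chunks)
    return (
        "Reference material:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the reference material and name the source slug. "
        f'If the material does not contain the answer, reply exactly: "{REFUSAL}".'
    )

prompt = build_prompt(
    "How does Stripe charge?",
    [("openclaude-ollama", "placeholder chunk text")],
)
print(prompt)
```

Labeling each chunk with its slug also lets the answer point back to a specific article, which is what makes a RAG answer checkable.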

Class 2 — In-corpus but wrong chunk granularity

Q: "How many times did LangGraph's retry trigger?" — Should retrieve from static-vs-dynamic, but top-3 returned mcp-local-workflow and openclaude-ollama. The fact "retry triggered 0 times" exists in the corpus, but not in the top-3 chunks. RAG answers "no answer in the material" ❌

This isn't an embedding model problem (all distances < 0.32), nor an LLM reasoning problem. It's a chunking strategy problem — chunk size, overlap, metadata all affect retrieval quality.
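
A small check makes this kind of miss visible while tuning chunk size or overlap. The sketch below is a minimal illustration that assumes each retrieved chunk carries its article slug as metadata. The function names are hypothetical, and the list holds only the two slugs the article reports seeing in the top-3, not the full ranking.

```python
def hit_at_k(retrieved_slugs, expected_slug, k=3):
    """True if the article that holds the answer appears in the top-k results."""
    return expected_slug in retrieved_slugs[:k]

def first_hit_rank(ranked_slugs, expected_slug):
    """1-based rank of the first chunk from the expected article, or None."""
    for rank, slug in enumerate(ranked_slugs, start=1):
        if slug == expected_slug:
            return rank
    return None

# Distinct slugs reported in the top-3 for the LangGraph retry question.
seen_in_top3 = ["mcp-local-workflow", "openclaude-ollama"]

print(hit_at_k(seen_in_top3, "static-vs-dynamic"))        # False
print(first_hit_rank(seen_in_top3, "static-vs-dynamic"))  # None
```

Re-running the same check after each change to chunk size, overlap, or top-k shows whether the expected article moves into the retrieved set.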

Actual hit rate:

Category	Hit / Total
Out-of-corpus refusal	2 / 2 ✅
In-corpus retrieval	2 / 5
Overall	4 / 7 (57%)
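
The overall figure is just the two rows added together, as this short check shows. The dictionary layout is an illustrative choice; the counts come from the table.

```python
# (hits, total) per category, copied from the table above.
results = {
    "Out-of-corpus refusal": (2, 2),
    "In-corpus retrieval": (2, 5),
}

hits = sum(h for h, _ in results.values())
total = sum(t for _, t in results.values())
overall = f"{hits} / {total} ({hits / total:.0%})"

print(overall)
# 4 / 7 (57%)
```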

Decision Framework: RAG vs Fine-tune

	Prompt Engineering	RAG	Fine-tune
Cost	$0	Low (one-time embed)	High (training data + GPU)
Best for	Change response style	Add fresh / private info	Change model behavior
Maintenance	Edit the prompt	Re-ingest	Re-train
Try order	First	Second	Last resort

90% of "I want the AI to know about our company" needs can be solved with RAG.

Verdict

Signature № 03 · INTERFERENCE

RAG isn't a silver bullet: it elegantly refuses out-of-corpus questions, but its in-corpus hit rate depends entirely on how you chunk.

Built with nomic-embed-text + ChromaDB + gemma4:e2b. Chunked 9 articles into 50 pieces, ran 7 questions (5 in-corpus, 2 OOC). OOC refusal: 2/2 perfect. In-corpus accuracy: only 2/5 — OpenClaude vulnerability and Ollama RAM advice. The 3 failures weren't retrieval misses (all distances < 0.4) — the top-3 chunks simply didn't contain the key fact. Lesson: embedding model is fine, reasoning model is fine, chunking strategy is the actual engineering problem.