Featured Teardown · Automation Signature № 03 · INTERFERENCE

Local LLM Drives the Browser: Can gemma4 Navigate the Web?

Josh Chen · August 1, 2026 · 13 min read

Why Are Browser Agents So Hot?

Over the past year, "AI that browses the web by itself" has gone from research project to production tool: browser-use, Anthropic Computer Use, OpenAI Operator, Microsoft NLWeb, with every vendor pushing its own. For an individual user, though, the real question is whether this can run locally, or whether your data has to leave the machine.

This article uses Playwright and a local gemma4 served by Ollama to build two versions of the same experiment, extracting the verdict table from Resonance Stack: v1 hardcodes its CSS selectors, v2 lets the LLM infer them. Then I deliberately change the DOM and see which version survives.

Why Not Use the browser-use Framework?

Browser Agent architecture: task → LLM reasoning → Playwright action → observation → loop

browser-use is the most popular open-source framework right now, but it's overkill for a task as simple as "extract 6 cards" — it needs LangChain, LLM provider config, action space management.

I used the minimal approach: Playwright directly queries the DOM, Ollama steps in only when needed. Three layers:

Task: natural language ("extract all verdicts from Resonance Stack")
LLM reasoning: feed gemma4 an HTML snippet, ask it to infer CSS selectors
Playwright execution: use the LLM's selectors to extract data

The LLM here isn't the decision-maker — it's a translator, converting an uncertain DOM into Playwright-compatible selectors.
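The "translator" layer can be sketched roughly as follows. This is a minimal sketch, not the article's actual script: it assumes Ollama's local HTTP API at its default address with the `/api/generate` endpoint, and the prompt wording, function names, and required key set are illustrative choices.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
REQUIRED_KEYS = ("card_selector", "link_selector", "tag_selector")  # assumed schema


def build_prompt(html_snippet: str) -> str:
    """Wrap an HTML fragment in the selector-inference instruction."""
    return (
        "Identify the CSS selectors for article cards in the HTML below. "
        "Return pure JSON with the keys card_selector, link_selector, "
        "tag_selector.\n\n" + html_snippet
    )


def parse_selectors(raw: str) -> dict:
    """Parse the model's reply and check every expected key is a non-empty string."""
    text = raw.strip()
    # Some models wrap JSON in a Markdown fence even when asked not to.
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in REQUIRED_KEYS:
        if not isinstance(data.get(key), str) or not data[key].strip():
            raise ValueError(f"missing or empty selector: {key}")
    return {key: data[key].strip() for key in REQUIRED_KEYS}


def infer_selectors(html_snippet: str, model: str = "gemma4:e2b") -> dict:
    """Ask the local model for selectors. Requires a running Ollama server."""
    payload = json.dumps({
        "model": model,
        "prompt": build_prompt(html_snippet),
        "format": "json",   # ask Ollama to constrain the output to valid JSON
        "stream": False,    # one response object instead of a token stream
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return parse_selectors(body["response"])
```

Validating the reply before handing it to Playwright keeps a malformed answer from silently turning into a 0/6 run.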

Two Versions: Hardcoded vs LLM-driven

v1 walkthrough with hardcoded selectors: navigate → query → filter → visit → output

Task: extract verdict, tag, and signature from 6 article cards into a markdown table.

v1 (scraper-style, ~100 lines):

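# Runs inside the browser: collect each card's link and tag with hardcoded selectors.
cards = page.evaluate("""() => {
    return Array.from(document.querySelectorAll('article.card')).map(c => ({
        href: c.querySelector('a')?.getAttribute('href'),        // first link in the card
        tag:  c.querySelector('.card__tag')?.textContent.trim(), // category label
    }));
}""")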

The selectors article.card and .card__tag are hardcoded. The query runs in about 0.01 seconds.

v2 (LLM-driven, ~150 lines):

Feed the HTML snippet to gemma4: "Identify the CSS selectors for article cards. Return JSON." LLM returns {"card_selector": ".card", "link_selector": ".card__title a", "tag_selector": ".card__tag"}. Then Playwright uses those.

LLM reasoning takes 15–30 seconds, but every run looks at the current HTML — no hardcoded values.
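The Playwright half of v2 might look like the sketch below, which passes the model's selectors into the page as an argument to `page.evaluate`. The helper names and the localhost address are hypothetical. The selector dict is hardcoded here only to keep the example short.

```python
# JavaScript run inside the page; the selectors arrive as a single argument.
EXTRACT_JS = """(sel) => {
    return Array.from(document.querySelectorAll(sel.card_selector)).map(card => ({
        href: card.querySelector(sel.link_selector)?.getAttribute('href') ?? null,
        tag: card.querySelector(sel.tag_selector)?.textContent.trim() ?? null,
    }));
}"""


def extract_cards(page, selectors: dict) -> list:
    """Run the extraction in the browser and drop cards that have no link."""
    cards = page.evaluate(EXTRACT_JS, selectors)
    return [card for card in cards if card.get("href")]


def main() -> None:
    # Imported here so the helper above can be exercised without Playwright installed.
    from playwright.sync_api import sync_playwright

    selectors = {  # in v2 these would come from the model's JSON reply
        "card_selector": ".card",
        "link_selector": ".card__title a",
        "tag_selector": ".card__tag",
    }
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:8000/")  # hypothetical local address
        for card in extract_cards(page, selectors):
            print(card["href"], card["tag"])
        browser.close()
```

Calling `main()` runs it against a live page. Nothing in `extract_cards` knows a class name in advance, which is the whole difference from v1.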

What the v2 LLM Reasoning Looks Like

LLM reasoning process: input prompt → gemma4 response → extracted selectors

Feed the homepage .grid HTML (~3500 chars) to gemma4, with the instruction "Identify article card CSS selectors, return pure JSON."

After breaking the DOM, gemma4's response is particularly interesting:

{
  "card_selector": "article.tile-v2",
  "link_selector": ".card__title a",
  "tag_selector": ".pill"
}

Note: I renamed .card → .tile-v2 and .card__tag → .pill, but left .card__title alone. gemma4 never sees the old class names. It reads only the current HTML, and its answer picks up the two renamed classes while keeping the untouched .card__title, which is the LLM's common-sense reasoning at work.

Extracted Results

Playwright headless screenshot of the Resonance Stack homepage

Auto-extracted verdict table for 6 articles

Both versions extracted 6/6 on the original DOM. The markdown table is directly usable:

| Slug | Category | Signature | Verdict |
| --- | --- | --- | --- |
| static-vs-dynamic | Strategy | № 03 INTERFERENCE | Dynamic beat static by 22%... |
| prompt-engineering-domain-expert | Strategy | № 03 INTERFERENCE | Role prompts improve quality... |
| rag-chatbot-ollama | Local AI | № 03 INTERFERENCE | RAG isn't a silver bullet... |
| ... | ... | ... | ... |

Written to outputs/verdicts.md — directly usable for internal article references, social sharing, SEO summaries.

DOM Change Test: Speed vs Resilience

v1 vs v2 matrix: both 6/6 on original DOM, v1 dies and v2 survives on broken DOM

Using page.evaluate to inject JS, I renamed .card → .tile-v2 and .card__tag → .pill (simulating a real "site refactor" event). Then reran both versions:

| Scenario | Hits | Time |
| --- | --- | --- |
| v1 original DOM | 6/6 | 0.01s |
| v1 broken DOM | 0/6 | 0.01s |
| v2 original DOM | 6/6 | 16.25s |
| v2 broken DOM | 6/6 | 15.23s |
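The simulated refactor could be injected with something like the following. This is a sketch under assumptions: the exact JavaScript used in the experiment is not shown in the article, and `break_dom` is a hypothetical helper name. It renames only the two classes mentioned and leaves .card__title untouched.

```python
BREAK_DOM_JS = """() => {
    // Rename the card container class: .card -> .tile-v2
    document.querySelectorAll('.card').forEach(el => {
        el.classList.remove('card');
        el.classList.add('tile-v2');
    });
    // Rename the tag class: .card__tag -> .pill
    document.querySelectorAll('.card__tag').forEach(el => {
        el.classList.remove('card__tag');
        el.classList.add('pill');
    });
    return document.querySelectorAll('.tile-v2').length;
}"""


def break_dom(page) -> int:
    """Apply the simulated refactor; return how many elements now carry .tile-v2."""
    return page.evaluate(BREAK_DOM_JS)
```

Because the mutation happens in the live page rather than in the site's source, both versions can be rerun against the same browser session right after calling it.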

The takeaway is clear: v1 is roughly 1,500× faster than v2 (0.01 s versus 15 to 16 s), but it dies on DOM changes. v2 is slow but still works the day the site refactors.
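The speed ratio follows directly from the timings in the table. A quick check, using only the numbers reported above:

```python
V1_SECONDS = 0.01
V2_SECONDS = {"original": 16.25, "broken": 15.23}

# How many times slower v2 is than v1 on each DOM state.
ratios = {name: seconds / V1_SECONDS for name, seconds in V2_SECONDS.items()}
for name, ratio in ratios.items():
    print(f"v2 on {name} DOM: about {ratio:,.0f}x slower than v1")
```

The ratios come out near 1,625 and 1,523, so the headline figure is a round-down. Since v1's 0.01 s is reported to only two decimals, the ratio is best read as an order of magnitude rather than a precise multiplier.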

This comparison explains why browser agents have been hot: the real pain isn't "I can't extract data" — it's "I can extract data, but every site refactor breaks my code." v2 shifts the cost of "adapting to changes" from "engineer intervenes every time" to "auto-handled every run."

Local Model Comparison: e2b vs 12b

The original v2 experiment used gemma4:e2b (2B-class). When Gemma 4 12B landed I reran the same task — the comparison isn't "local vs cloud," it's "size trade-off within local." Reran on 2026-06-08; the homepage has since grown from 6 cards to 14, but the task shape is identical: feed the grid HTML snippet to the LLM, infer CSS selectors, let Playwright execute.

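| Model | Original DOM inference time | Original DOM hits | Broken DOM inference time | Broken DOM hits | Model file size |
| --- | --- | --- | --- | --- | --- |
| `gemma4:e2b` | 28.9 s | 6/6 | 19.4 s | 6/6 | 7.2 GB |
| `gemma4:12b` | 67.1 s | 14/14 | 61.8 s | 14/14 | 7.6 GB |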

Two observations worth unpacking.

1. 12B pays a speed cost without a quality payoff

12B is 2.3-3.2× slower than e2b on this task. But both models tie at 100% extraction on both DOM states. For "which local model for a browser agent," the answer is clean: you don't need to upgrade to 12B for this. e2b suffices.

2. 12B picks more discriminating selectors

The link selector each model chose:

| Scenario | e2b's link selector | 12b's link selector |
| --- | --- | --- |
| Original DOM | `'a'` | `'.card__title a'` |
| Broken DOM | `'a'` | `'h3.headline a'` |

e2b used the broadest possible selector, a bare a. It works here, but would break on any site where a card contains more than one anchor. 12B added an ancestor qualifier on its own (.card__title on the original DOM, h3.headline on the broken one), which holds up much better on multi-link cards.
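To make the multi-link failure concrete, here is a small standard-library illustration. The card HTML and its URLs are hypothetical, not taken from the site: the card has a tag link before the title link, so a bare `a` lands on the wrong anchor while a title-scoped selector does not.

```python
from html.parser import HTMLParser

# Hypothetical card with two anchors: a category link, then the article link.
CARD_HTML = """
<article class="card">
  <a class="card__tag" href="/category/strategy">Strategy</a>
  <h3 class="card__title"><a href="/articles/static-vs-dynamic">Static vs Dynamic</a></h3>
</article>
"""


class AnchorCollector(HTMLParser):
    """Record every <a href> plus whether it sits inside a .card__title element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, classes) for currently open elements
        self.anchors = []    # list of (href, inside_title)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a":
            inside_title = any("card__title" in c for _, c in self.open_tags)
            self.anchors.append((attrs.get("href"), inside_title))
        self.open_tags.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; good enough for well-formed snippets.
        while self.open_tags:
            if self.open_tags.pop()[0] == tag:
                break


collector = AnchorCollector()
collector.feed(CARD_HTML)

# A bare `a` matches the first anchor in document order.
first_any = collector.anchors[0][0]
# `.card__title a` only matches anchors nested under the title element.
first_in_title = next(href for href, in_title in collector.anchors if in_title)
print(first_any)
print(first_in_title)
```

On this card the bare selector returns the category link and the scoped one returns the article link. On the actual homepage the two evidently agree, which is why e2b still scored 6/6.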

A genuinely interesting "same hit rate, different quality" case: both models hit 100% (6/6 for e2b and 14/14 for 12B in the table above), but 12B's selectors are more durable. For a throwaway experiment (this one), e2b is plenty. For a long-running pipeline that needs to absorb unseen site structures, 12B's extra ~40s may earn its keep.

Takeaway: both work on this task. Don't choose a local LLM based on "browser agent needs" — this is one of the lightest tasks an LLM does. Size choice should be driven by other parts of your workflow (code generation, reasoning, long context).

Full comparison script and outputs: tests/scripts/browser_agent_v2_compare.py, tests/outputs/agent_v2_gemma4_12b_original.json, tests/outputs/agent_v2_gemma4_12b_broken.json.

Decision Framework: When to Use a Browser Agent

2×2 matrix: DOM stability × task complexity, four quadrants with strategies

Choose v1 (scraper-style) when:

You control the DOM (own site, internal tool)
High frequency (daily, hourly) — LLM costs add up fast
Fixed task (no shifting requirements)

Choose v2 (LLM-driven) when:

Third-party site you don't control
Low frequency (weekly, monthly) — slowness is acceptable
Selectors are expected to change

Hybrid mode (use both):

Try v1 first, fall back to v2 on failure
Reasonable for real production systems
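The hybrid mode can be sketched as a small wrapper. The function name, the callables, and the `min_cards` threshold are assumptions for illustration. The article does not show a hybrid implementation.

```python
from typing import Callable, List


def extract_with_fallback(
    run_v1: Callable[[], List[dict]],
    run_v2: Callable[[], List[dict]],
    min_cards: int = 1,
) -> tuple:
    """Try the hardcoded path first; use the LLM path only if it comes back short."""
    try:
        cards = run_v1()
    except Exception:
        cards = []  # treat a crashed scraper the same as an empty result
    if len(cards) >= min_cards:
        return "v1", cards
    return "v2", run_v2()
```

With this shape the 15 to 30 second LLM call is paid only on the runs where the fast path returns too few cards, for example `extract_with_fallback(run_v1, run_v2, min_cards=6)` for the six-card homepage.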

Not suitable when:

Login required / cookie banners / multi-tab state — that's where you need a full agent framework like browser-use
Multimodal understanding (read image to fill form) — local gemma4's vision isn't there yet

Verdict

Signature № 03 · INTERFERENCE

Hard-coded selectors are roughly 1,500× faster, but LLM-driven extraction survives DOM changes. The real trade-off in browser agents is speed vs resilience.

Used Playwright + gemma4:e2b on Resonance Stack localhost, extracting verdict + tag + signature from 6 articles into a table. Compared two versions: v1 (hardcoded .card selector) — 6/6 on original, 0.01s; broken DOM (.card → .tile-v2) → 0/6 dead. v2 (LLM infers selectors) — 6/6 on original, 16s; broken DOM, LLM finds .tile-v2 on its own, still 6/6. Lesson: stable sites use v1; only sites that change frequently need v2's agent-style approach.