web: create readable extraction ticket

2026-05-31 05:41:29 +09:00 · 2026-05-31 05:41:29 +09:00 · dc5ce2ba72
commit dc5ce2ba72
parent 787abf4f7d
3 changed files with 92 additions and 0 deletions
--- a/work-items/open/20260530-204045-webfetch-readable-extraction/artifacts/.gitkeep
+++ b/work-items/open/20260530-204045-webfetch-readable-extraction/artifacts/.gitkeep
--- a/work-items/open/20260530-204045-webfetch-readable-extraction/item.md
+++ b/work-items/open/20260530-204045-webfetch-readable-extraction/item.md
@@ -0,0 +1,71 @@
+---
+id: 20260530-204045-webfetch-readable-extraction
+slug: webfetch-readable-extraction
+title: WebFetch: extract main HTML content with lightweight readability
+status: open
+kind: task
+priority: P2
+labels: [web, tools, html]
+created_at: 2026-05-30T20:40:45Z
+updated_at: 2026-05-30T20:41:21Z
+assignee: null
+legacy_ticket: null
+---
+
+## Background
+
+`WebFetch` currently returns bounded, safety-checked content but the HTML path is still close to raw page text: it strips tags with a small local formatter and does not try to isolate the article/main content. For LLM use, a reader-mode style extraction layer is more useful than raw boilerplate-heavy page text.
+
+`readability-js` was investigated but brings QuickJS / bundled JavaScript dependency weight. The desired direction is a lightweight pure-Rust extraction backend with fallback to the current `html_to_text` behavior.
+
+Reference implementations checked out with ghq under `.worktree/ghq-root/` for planning:
+
+- `github.com/quambene/readability-rs` — crate `readability-rs`, MIT, small arc90-style extractor (`Readable { title, content, text }`).
+- `github.com/theiskaa/readabilityrs` — crate `readabilityrs`, Apache-2.0, larger Mozilla Readability port with metadata/markdown support.
+- `github.com/readable-app/readability.rs` — crate `readable-readability`, MIT, kuchiki-based extractor, but with sparse docs and less recent maintenance.
+
+## Requirements
+
+- Keep `WebSearch` and `WebFetch` as separate tools. Do not add an automatic summarization/research tool in this ticket.
+- Add a reader-mode extraction path for HTML responses in `WebFetch`.
+- Use a pure-Rust dependency or local extraction implementation; do not use `readability-js`, QuickJS, Node, Python, or subprocess-based extraction.
+- Prefer the lightweight `readability-rs` crate if it builds cleanly and produces usable `title` + main `text`; escalate if the crate is incompatible or obviously too low-quality for the included fixtures.
+- Preserve the current network safety behavior: configured provider requirement, private/local host rejection, bounded redirects, response size limits, binary rejection, output truncation, and untrusted-content warning semantics.
+- Preserve fallback behavior. If readability extraction fails or returns empty/too-short text, return HTML text produced by the existing local fallback rather than failing the entire fetch.
+- Structure HTML fetch output so the LLM can distinguish extraction metadata from document content. At minimum include:
+  - extraction method (`readability` or fallback name)
+  - fallback indicator / reason when applicable
+  - title when available
+  - main text
+  - existing fetch metadata such as URL/final URL/status/content type/truncation
+- Keep HTML returned to the LLM as text by default. Do not expose full extracted HTML unless an existing output field clearly requires it.
+- Add focused tests with small HTML fixtures covering:
+  - article/main content is preferred over nav/footer/sidebar boilerplate
+  - fallback is used when readability extraction is not useful
+  - output remains bounded/truncated under the existing output limit
+
+## Non-goals
+
+- Provider expansion or changes to `WebSearch` provider selection.
+- LLM-generated summaries inside `WebFetch`.
+- Browser rendering, JavaScript execution, or dynamic page support.
+- Large benchmark suite or exhaustive readability quality comparison.
+- Public API/protocol changes beyond the tool result JSON shape.
+
+## Implementation plan
+
+1. Add the selected pure-Rust readability dependency to `crates/tools`.
+2. Introduce a small internal HTML extraction helper, e.g. `extract_html_document(html, base_url, output_limit)`, wrapping readability success and fallback.
+3. Update the `ContentKind::Html` branch in `WebFetch` rendering to use the helper.
+4. Keep the existing `html_to_text` as the fallback and as a testable utility.
+5. Update tests in `crates/tools/src/web.rs` or a focused tools test module.
+6. Validate with formatting, focused tools tests, and broader checks appropriate to the dependency change.
+
+## Acceptance criteria
+
+- `WebFetch` HTML responses prefer extracted main content over navigation/footer/sidebar boilerplate in tests.
+- `WebFetch` still returns useful bounded text when readability extraction fails or is empty.
+- Tool output clearly reports extraction method and fallback status.
+- No JavaScript engine/runtime dependency is introduced.
+- `Cargo.lock` and Nix cargo hash implications are handled or explicitly reported.
+- `cargo fmt --check`, focused tools tests, `cargo check -p tools`, and `./tickets.sh doctor` pass, or any failure is clearly reported as unrelated/pre-existing.
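Note on `item.md`: the fallback rule in the requirements and step 2 of the implementation plan could be sketched roughly as below. This is a minimal illustration and not the ticket's implementation. `ExtractionMethod`, `ExtractedDocument`, `MIN_READABLE_CHARS`, and `truncate_chars` are hypothetical names. The readability backend and the existing `html_to_text` are passed in as closures so the sketch stays dependency-free, and the `base_url` parameter from the plan is omitted for brevity. The 40-character threshold is an assumption, since the ticket does not define "too-short".

```rust
/// Which path produced the text (hypothetical names, not from the ticket).
#[derive(Debug, PartialEq)]
enum ExtractionMethod {
    Readability,
    HtmlToText,
}

/// Extraction metadata kept apart from the document text.
#[derive(Debug)]
struct ExtractedDocument {
    method: ExtractionMethod,
    fallback_reason: Option<String>,
    title: Option<String>,
    text: String,
    truncated: bool,
}

/// Assumed cutoff for "too-short" text; the ticket does not give a value.
const MIN_READABLE_CHARS: usize = 40;

/// Keep at most `limit` characters, cutting on a char boundary.
fn truncate_chars(text: &str, limit: usize) -> (String, bool) {
    match text.char_indices().nth(limit) {
        Some((byte_idx, _)) => (text[..byte_idx].to_string(), true),
        None => (text.to_string(), false),
    }
}

/// Try the readability backend first; on error or short output, use the
/// fallback formatter instead of failing the whole fetch.
fn extract_html_document<R, F>(
    html: &str,
    output_limit: usize,
    readability: R,
    fallback: F,
) -> ExtractedDocument
where
    R: Fn(&str) -> Result<(Option<String>, String), String>,
    F: Fn(&str) -> String,
{
    let (method, fallback_reason, title, raw) = match readability(html) {
        Ok((title, text)) if text.trim().chars().count() >= MIN_READABLE_CHARS => {
            (ExtractionMethod::Readability, None, title, text)
        }
        Ok(_) => (
            ExtractionMethod::HtmlToText,
            Some("readability text empty or too short".to_string()),
            None,
            fallback(html),
        ),
        Err(err) => (
            ExtractionMethod::HtmlToText,
            Some(format!("readability failed: {err}")),
            None,
            fallback(html),
        ),
    };
    // Bound the output on both paths, not only on the readability path.
    let (text, truncated) = truncate_chars(raw.trim(), output_limit);
    ExtractedDocument { method, fallback_reason, title, text, truncated }
}
```

Passing the backend as a closure also lets the fallback tests force a readability failure without a crafted fixture. In the real change, the `readability` argument would wrap whichever crate is selected, and `fallback` would be the existing `html_to_text`.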
--- a/work-items/open/20260530-204045-webfetch-readable-extraction/thread.md
+++ b/work-items/open/20260530-204045-webfetch-readable-extraction/thread.md
@@ -0,0 +1,21 @@
+<!-- event: create author: tickets.sh at: 2026-05-30T20:40:45Z -->
+
+## Created
+
+Created by tickets.sh create.
+
+---
+
+<!-- event: plan author: hare at: 2026-05-30T20:41:21Z -->
+
+## Plan
+
+Planning note:
+
+- ghq checkouts for prior art were placed under `.worktree/ghq-root/` so they stay inside the repository write scope and under the ignored `.worktree/` area.
+- `readability-js` is intentionally excluded from the implementation path because it pulls in QuickJS/rquickjs and bundled JavaScript.
+- Candidate preference for this ticket is `readability-rs` first because it is small, MIT licensed, and exposes a simple `extract` API returning `title`, extracted HTML, and text. If it fails to build or extraction is unusable on the ticket fixtures, the coder should stop and report rather than silently switching to a heavier dependency.
+- `readabilityrs` is the heavier pure-Rust backup candidate and useful for reference, but adopting it changes the dependency footprint more significantly.
+
+
+---