| id | slug | title | status | kind | priority | labels | created_at | updated_at | assignee | legacy_ticket |
|---|---|---|---|---|---|---|---|---|---|---|
| 20260530-204045-webfetch-readable-extraction | webfetch-readable-extraction | WebFetch: extract main HTML content with lightweight readability | closed | task | P2 | | 2026-05-30T20:40:45Z | 2026-05-30T20:55:13Z | null | null |
Background
WebFetch currently returns bounded, safety-checked content but the HTML path is still close to raw page text: it strips tags with a small local formatter and does not try to isolate the article/main content. For LLM use, a reader-mode style extraction layer is more useful than raw boilerplate-heavy page text.
readability-js was investigated but brings QuickJS / bundled JavaScript dependency weight. The desired direction is a lightweight pure-Rust extraction backend with fallback to the current html_to_text behavior.
Reference implementations checked out with ghq under .worktree/ghq-root/ for planning:
- `github.com/quambene/readability-rs`: crate `readability-rs`, MIT, small arc90-style extractor (`Readable { title, content, text }`).
- `github.com/theiskaa/readabilityrs`: crate `readabilityrs`, Apache-2.0, larger Mozilla Readability port with metadata/markdown support.
- `github.com/readable-app/readability.rs`: crate `readable-readability`, MIT, kuchiki-based extractor but sparse docs and older maintenance surface.
Requirements
- Keep `WebSearch` and `WebFetch` as separate tools. Do not add an automatic summarization/research tool in this ticket.
- Add a reader-mode extraction path for HTML responses in `WebFetch`.
- Use a pure-Rust dependency or local extraction implementation; do not use `readability-js`, QuickJS, Node, Python, or subprocess-based extraction.
- Prefer the lightweight `readability-rs` crate if it builds cleanly and produces usable `title` + main `text`; escalate if the crate is incompatible or obviously too low-quality for the included fixtures.
- Preserve the current network safety behavior: configured provider requirement, private/local host rejection, bounded redirects, response size limits, binary rejection, output truncation, and untrusted-content warning semantics.
- Preserve fallback behavior. If readability extraction fails or returns empty/too-short text, return HTML text produced by the existing local fallback rather than failing the entire fetch.
- Structure HTML fetch output so the LLM can distinguish extraction metadata from document content. At minimum include:
  - extraction method (`readability` or fallback name)
  - fallback indicator / reason when applicable
  - title when available
  - main text
  - existing fetch metadata such as URL/final URL/status/content type/truncation
- Keep HTML returned to the LLM as text by default. Do not expose full extracted HTML unless there is a clear existing output field need.
- Add focused tests with small HTML fixtures covering:
  - article/main content is preferred over nav/footer/sidebar boilerplate
  - fallback is used when readability extraction is not useful
  - output remains bounded/truncated under the existing output limit
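
The output-shape and truncation requirements above could be sketched as follows. This is a minimal illustration, not the actual `WebFetch` result type: the struct name `HtmlFetchOutput`, its field names, and the helper `bound_text` are all assumptions made for the example, and the limit is treated as a byte count.

```rust
/// Hypothetical result shape for an HTML fetch. Field names are illustrative
/// only; the real tool result JSON may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct HtmlFetchOutput {
    // Existing fetch metadata.
    pub url: String,
    pub final_url: String,
    pub status: u16,
    pub content_type: String,
    pub truncated: bool,
    // Extraction metadata, kept apart from the document content.
    pub extraction_method: String, // "readability" or a fallback name
    pub fallback_reason: Option<String>,
    pub title: Option<String>,
    // Document content.
    pub text: String,
}

/// Cut `text` to at most `limit` bytes without splitting a UTF-8 character.
/// Returns the bounded text and a flag saying whether anything was dropped.
pub fn bound_text(text: &str, limit: usize) -> (String, bool) {
    if text.len() <= limit {
        return (text.to_string(), false);
    }
    // Walk back from the limit to the nearest character boundary.
    let mut end = limit;
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    (text[..end].to_string(), true)
}
```

Keeping `extraction_method`, `fallback_reason`, and `title` as separate fields from `text` is one way to let the LLM tell metadata from content; the `truncated` flag would be set from the second value returned by `bound_text`.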
Non-goals
- Provider expansion or changes to `WebSearch` provider selection.
- LLM-generated summaries inside `WebFetch`.
- Browser rendering, JavaScript execution, or dynamic page support.
- Large benchmark suite or exhaustive readability quality comparison.
- Public API/protocol changes beyond the tool result JSON shape.
Implementation plan
- Add the selected pure-Rust readability dependency to `crates/tools`.
- Introduce a small internal HTML extraction helper, e.g. `extract_html_document(html, base_url, output_limit)`, wrapping readability success and fallback.
- Update the `ContentKind::Html` branch in `WebFetch` rendering to use the helper.
- Keep existing `html_to_text` as fallback and testable utility.
- Update tests in `crates/tools/src/web.rs` or a focused tools test module.
- Validate with formatting, focused tools tests, and broader checks appropriate to the dependency change.
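
The fallback decision inside such a helper might look roughly like the sketch below. It covers only the "readability success or fallback" choice and leaves out `base_url` and `output_limit` handling. Everything named here is an assumption for illustration: `extract_with_fallback`, `ReadableDoc` (a stand-in, not the crate's own type), `ExtractedDocument`, and the `MIN_READABLE_CHARS` threshold. The readability backend and the existing `html_to_text` are passed in as closures so the example does not depend on any particular crate API.

```rust
/// Minimum extracted length treated as useful. The value is an assumption.
const MIN_READABLE_CHARS: usize = 200;

/// Stand-in for whatever the chosen readability backend returns.
#[derive(Debug, Clone, PartialEq)]
pub struct ReadableDoc {
    pub title: Option<String>,
    pub text: String,
}

/// Hypothetical extraction result: method, fallback reason, title, text.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractedDocument {
    pub method: &'static str,
    pub fallback_reason: Option<String>,
    pub title: Option<String>,
    pub text: String,
}

/// Try the readability backend first; fall back to plain text conversion
/// when it errors or yields empty/too-short text. Never fails the fetch.
pub fn extract_with_fallback<R, F>(html: &str, readability: R, fallback: F) -> ExtractedDocument
where
    R: Fn(&str) -> Result<ReadableDoc, String>,
    F: Fn(&str) -> String,
{
    let reason = match readability(html) {
        Ok(doc) => {
            let trimmed = doc.text.trim();
            if trimmed.chars().count() >= MIN_READABLE_CHARS {
                return ExtractedDocument {
                    method: "readability",
                    fallback_reason: None,
                    title: doc.title,
                    text: trimmed.to_string(),
                };
            }
            if trimmed.is_empty() {
                "readability returned empty text".to_string()
            } else {
                "readability text too short".to_string()
            }
        }
        Err(e) => format!("readability failed: {e}"),
    };
    // Fallback path: reuse the existing tag-stripping formatter.
    ExtractedDocument {
        method: "html_to_text",
        fallback_reason: Some(reason),
        title: None,
        text: fallback(html),
    }
}
```

Passing the backend as a closure also keeps the fallback path testable with small fixtures, since a test can supply a backend that errors or returns short text without touching the network.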
Acceptance criteria
- `WebFetch` HTML responses prefer extracted main content over navigation/footer/sidebar boilerplate in tests.
- `WebFetch` still returns useful bounded text when readability extraction fails or is empty.
- Tool output clearly reports extraction method and fallback status.
- No JavaScript engine/runtime dependency is introduced.
- `Cargo.lock` and Nix cargo hash implications are handled or explicitly reported.
- `cargo fmt --check`, focused tools tests, `cargo check -p tools`, and `./tickets.sh doctor` pass or any failure is clearly reported as unrelated/pre-existing.