4.3 KiB
Created
Created by tickets.sh create.
Plan
Planning note:
- ghq checkouts for prior art were placed under `.worktree/ghq-root/` so they stay inside the repository write scope and under the ignored `.worktree/` area. `readability-js` is intentionally excluded from the implementation path because it pulls in QuickJS/rquickjs and bundled JavaScript.
- Candidate preference for this ticket is `readability-rs` first because it is small, MIT licensed, and exposes a simple `extract` API returning `title`, extracted HTML, and text. If it fails to build or extraction is unusable on the ticket fixtures, the coder should stop and report rather than silently switching to a heavier dependency. `readabilityrs` is the heavier pure-Rust backup candidate and useful for reference, but adopting it changes the dependency footprint more significantly.
Implementation report
Implementation report from coder Pod webfetch-readable-coder-20260530:
- Branch: `webfetch-readable-extraction`
- Commit: `7906ca532666669417c20d831a08103c2f0f80dd` (web: extract readable html content)
- Changed files: `Cargo.lock`, `crates/tools/Cargo.toml`, `crates/tools/src/web.rs`, `package.nix`
- Added `readability-rs = 0.5.0` to `tools` and updated the Nix cargo hash.
- Added a WebFetch HTML extraction helper that uses readability for main text when useful and falls back to the existing `html_to_text` when readability fails or returns too-short text.
- Added `html_extraction` metadata with method/fallback/reason/title and kept output bounded.
- Full extracted HTML is not returned.
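A minimal sketch of the fallback selection described in the list above. Everything here is hypothetical and not taken from `crates/tools/src/web.rs`: the `Readable` and `HtmlExtraction` types, the `extract_html_text` function, the `MIN_READABLE_CHARS` threshold, and the reason strings are assumptions made for illustration. The readability extractor and `html_to_text` are passed in as closures so the sketch does not assume any particular crate API.

```rust
/// Stand-in for what a reader-mode extractor returns (hypothetical type).
#[derive(Debug, Clone)]
pub struct Readable {
    pub title: Option<String>,
    pub text: String,
}

/// Metadata reported next to the text. The field names mirror the
/// method/fallback/reason/title list in the report; the exact types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct HtmlExtraction {
    pub method: &'static str,
    pub fallback: bool,
    pub reason: Option<String>,
    pub title: Option<String>,
}

/// Assumed threshold below which reader-mode text counts as "too short".
pub const MIN_READABLE_CHARS: usize = 200;

/// Prefer reader-mode text; fall back to plain HTML-to-text conversion when
/// the extractor errors or yields too little text.
pub fn extract_html_text<R, F>(
    html: &str,
    readable: R,
    html_to_text: F,
) -> (String, HtmlExtraction)
where
    R: Fn(&str) -> Result<Readable, String>,
    F: Fn(&str) -> String,
{
    let (reason, title) = match readable(html) {
        Ok(doc) => {
            let text = doc.text.trim();
            if text.chars().count() >= MIN_READABLE_CHARS {
                let meta = HtmlExtraction {
                    method: "readability",
                    fallback: false,
                    reason: None,
                    title: doc.title,
                };
                return (text.to_string(), meta);
            }
            // Keep the title even when the body text is unusable.
            ("readability text too short".to_string(), doc.title)
        }
        Err(err) => (format!("readability error: {}", err), None),
    };

    let meta = HtmlExtraction {
        method: "html_to_text",
        fallback: true,
        reason: Some(reason),
        title,
    };
    (html_to_text(html), meta)
}
```

Passing the two extractors as closures also makes the error path easy to exercise in tests with a stub that always returns `Err`.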
Validation reported by coder:
- `cargo fmt --check` passed
- `cargo test -p tools web` passed (10 passed)
- `cargo check -p tools` passed, with only the existing `llm-worker` dead_code warning
- `./tickets.sh doctor` passed
- `git diff --check` passed
- `nix build .#insomnia` passed
Unresolved issues: none.
Review: approve
External review by reviewer Pod webfetch-readable-reviewer-20260530: approve.
Summary:
- The change adds a pure-Rust `readability-rs` extraction path for `WebFetch` HTML responses.
- HTML responses use reader-mode text when extraction is useful and fall back to the existing local `html_to_text` otherwise.
- Output JSON includes separate `html_extraction` metadata plus document `text`, while preserving fetch metadata and the untrusted-content warning.
Requirements check:
- `WebSearch`/`WebFetch` separation preserved.
- Pure Rust dependency only; no QuickJS, Node, Python, browser, or subprocess path.
- Existing WebFetch safety behavior remains in place.
- Fallback behavior exists for readability errors and too-short/empty text.
- Output separates extraction metadata from text.
- Full extracted HTML is not exposed.
- Tests cover fallback metadata, article/main preference over nav/footer, truncation, and existing WebSearch/fetch safety behavior.
- Dependency and Nix hash changes are reasonable.
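For reference, one way the bounded-output and truncation behavior checked above could look. This is a sketch under assumptions: `bound_text` is a hypothetical helper name, and the ticket does not say whether the real limit is counted in characters or bytes.

```rust
/// Truncate `text` to at most `max_chars` Unicode scalar values.
/// Returns the bounded text and a flag saying whether anything was cut.
/// Cutting at a `char_indices` offset keeps the slice on a valid UTF-8 boundary.
pub fn bound_text(text: &str, max_chars: usize) -> (String, bool) {
    match text.char_indices().nth(max_chars) {
        // A character exists at position `max_chars`, so the text is too long:
        // cut just before that character's byte offset.
        Some((byte_idx, _)) => (text[..byte_idx].to_string(), true),
        // Fewer than or exactly `max_chars` characters: keep everything.
        None => (text.to_string(), false),
    }
}
```

Returning the flag alongside the text lets the caller record truncation in metadata rather than silently dropping content.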
Blockers: none.
Non-blocking follow-up:
- Optional future direct test for a stable readability error path; current fallback coverage is sufficient for this ticket.
Implementation report
Main workspace validation after merge:
- `cargo fmt --check` passed
- `cargo test -p tools web` passed (10 passed)
- `cargo check -p tools` passed with the existing `llm-worker` dead_code warning
- `./tickets.sh doctor` passed
- `git diff --check` passed
- `nix build .#insomnia` passed (with a dirty tree warning due to an unrelated local modification to `.insomnia/workflow/multi-agent-workflow.md`)
Closed
Implemented WebFetch HTML reader-mode extraction with pure-Rust readability-rs, preserving existing safety checks and fallback to local html_to_text. Output now reports html_extraction metadata and bounded main text without exposing extracted HTML by default. Reviewed externally and approved; validation passed including focused tools tests and nix build .#insomnia.