diff --git a/work-items/open/20260530-204045-webfetch-readable-extraction/item.md b/work-items/open/20260530-204045-webfetch-readable-extraction/item.md
index 84bf3868..3e77217c 100644
--- a/work-items/open/20260530-204045-webfetch-readable-extraction/item.md
+++ b/work-items/open/20260530-204045-webfetch-readable-extraction/item.md
@@ -7,7 +7,7 @@ kind: task
 priority: P2
 labels: [web, tools, html]
 created_at: 2026-05-30T20:40:45Z
-updated_at: 2026-05-30T20:41:21Z
+updated_at: 2026-05-30T20:54:26Z
 assignee: null
 legacy_ticket: null
 ---
diff --git a/work-items/open/20260530-204045-webfetch-readable-extraction/thread.md b/work-items/open/20260530-204045-webfetch-readable-extraction/thread.md
index 0f1ee6a7..16f7d90d 100644
--- a/work-items/open/20260530-204045-webfetch-readable-extraction/thread.md
+++ b/work-items/open/20260530-204045-webfetch-readable-extraction/thread.md
@@ -18,4 +18,61 @@ Planning note:
 - `readabilityrs` is the heavier pure-Rust backup candidate and useful for reference, but adopting it changes the dependency footprint more significantly.
 
+---
+
+
+
+## Implementation report
+
+Implementation report from coder Pod `webfetch-readable-coder-20260530`:
+
+- Branch: `webfetch-readable-extraction`
+- Commit: `7906ca532666669417c20d831a08103c2f0f80dd` (`web: extract readable html content`)
+- Changed files: `Cargo.lock`, `crates/tools/Cargo.toml`, `crates/tools/src/web.rs`, `package.nix`
+- Added `readability-rs = 0.5.0` to `tools` and updated Nix cargo hash.
+- Added a WebFetch HTML extraction helper that uses readability for main text when useful and falls back to existing `html_to_text` when readability fails or returns too-short text.
+- Added `html_extraction` metadata with method/fallback/reason/title and kept output bounded.
+- Full extracted HTML is not returned.
+
+Validation reported by coder:
+
+- `cargo fmt --check` passed
+- `cargo test -p tools web` passed (10 passed)
+- `cargo check -p tools` passed, with only existing `llm-worker` dead_code warning
+- `./tickets.sh doctor` passed
+- `git diff --check` passed
+- `nix build .#insomnia` passed
+
+Unresolved issues: none.
+
+
+---
+
+
+
+## Review: approve
+
+External review by reviewer Pod `webfetch-readable-reviewer-20260530`: approve.
+
+Summary:
+- The change adds a pure-Rust `readability-rs` extraction path for `WebFetch` HTML responses.
+- HTML responses use reader-mode text when extraction is useful and fall back to existing local `html_to_text` otherwise.
+- Output JSON includes separate `html_extraction` metadata plus document `text`, while preserving fetch metadata and untrusted-content warning.
+
+Requirements check:
+- `WebSearch` / `WebFetch` separation preserved.
+- Pure Rust dependency only; no QuickJS, Node, Python, browser, or subprocess path.
+- Existing WebFetch safety behavior remains in place.
+- Fallback behavior exists for readability errors and too-short/empty text.
+- Output separates extraction metadata from text.
+- Full extracted HTML is not exposed.
+- Tests cover fallback metadata, article/main preference over nav/footer, truncation, and existing WebSearch/fetch safety behavior.
+- Dependency and Nix hash changes are reasonable.
+
+Blockers: none.
+
+Non-blocking follow-up:
+- Optional future direct test for a stable readability error path; current fallback coverage is sufficient for this ticket.
+
+
 ---
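
Reader note (not part of the patch): the fallback rule described in the implementation report and review can be sketched roughly as below. This is a minimal illustration, not the code in `crates/tools/src/web.rs`. The type and function names (`Method`, `HtmlExtraction`, `ReadableResult`, `choose_text`), the reason strings, and the `min_chars` / `max_chars` thresholds are all assumptions; only the metadata fields (method, fallback, reason, title) and the "readability first, `html_to_text` on error or too-short text, bounded output" behavior come from the thread. The readability pass itself is modeled as an already-computed `Result` so that no `readability-rs` API is assumed.

```rust
/// Which extractor produced the final text (hypothetical names).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Method {
    Readability,
    HtmlToText,
}

/// Assumed shape of the `html_extraction` metadata: method/fallback/reason/title.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HtmlExtraction {
    pub method: Method,
    pub fallback: bool,
    pub reason: Option<String>,
    pub title: Option<String>,
}

/// Outcome of a readability pass: Ok((title, text)) or Err(message).
/// Stands in for whatever the real extractor returns.
pub type ReadableResult = Result<(Option<String>, String), String>;

/// Pick reader-mode text when it is long enough, otherwise the plain-text
/// fallback, and cap the returned text at `max_chars` characters.
pub fn choose_text(
    readable: ReadableResult,
    plain_fallback: &str,
    min_chars: usize,
    max_chars: usize,
) -> (HtmlExtraction, String) {
    let (meta, text) = match readable {
        // Readability succeeded and returned enough text to be useful.
        Ok((title, text)) if text.trim().chars().count() >= min_chars => (
            HtmlExtraction {
                method: Method::Readability,
                fallback: false,
                reason: None,
                title,
            },
            text.trim().to_string(),
        ),
        // Readability succeeded but the text is empty or too short.
        Ok((title, _)) => (
            HtmlExtraction {
                method: Method::HtmlToText,
                fallback: true,
                reason: Some("readability_text_too_short".to_string()),
                title,
            },
            plain_fallback.trim().to_string(),
        ),
        // Readability failed outright.
        Err(err) => (
            HtmlExtraction {
                method: Method::HtmlToText,
                fallback: true,
                reason: Some(format!("readability_error: {err}")),
                title: None,
            },
            plain_fallback.trim().to_string(),
        ),
    };
    // Truncate by characters so a multi-byte UTF-8 sequence is never split.
    let bounded: String = text.chars().take(max_chars).collect();
    (meta, bounded)
}
```

Passing the readability outcome in as a value keeps the decision logic testable without network access or a real HTML parser, which is one way the "stable readability error path" follow-up could be covered: a test can hand `choose_text` an `Err(...)` directly and assert on the resulting metadata.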