yoi/.yoi/tickets/closed/20260530-215928-webfetch-local-reader-markdown/item.md

id: 20260530-215928-webfetch-local-reader-markdown
slug: webfetch-local-reader-markdown
title: WebFetch: replace readability dependency with Markdown-preserving local reader
status: closed
kind: task
priority: P2
labels: web, tools, html
created_at: 2026-05-30T21:59:28Z
updated_at: 2026-05-30T22:21:39Z
assignee: null
legacy_ticket: null

Background

webfetch-readable-extraction added readability-rs to improve WebFetch HTML output. It proved the direction, but the next design step is to own the reader behavior instead of depending on an article extractor that flattens links to plain text.

For LLM research workflows, article text without links is lossy: links inside the readable body often point to RFCs, docs, downloads, related pages, or citations that the agent must be able to follow. At the same time, navigation/sidebar content should be omitted by default, while still being discoverable when the page is documentation/book-like and navigation links are important.

Requirements

  • Replace the readability-rs dependency with a local, pure-Rust HTML reader extractor in crates/tools.
  • Keep WebSearch and WebFetch separate. Do not add summarization or research orchestration in this ticket.
  • WebFetch HTML output should be Markdown-ish text, not plain text:
    • preserve inline links as [label](absolute-url);
    • preserve useful headings/lists/paragraph breaks enough for LLM readability;
    • do not expose full HTML by default.
  • Add optional include_navigation: Option<bool> to WebFetchInput, defaulting to false.
  • Detect navigation-like content (nav, sidebar/toc/menu/breadcrumb-ish class/id/role, previous/next chapter areas, etc.) generically.
    • With include_navigation=false (the default), omit navigation from the main text.
    • If navigation was detected and omitted, include metadata/notice in the tool result such as “navigation was detected and omitted; re-run with include_navigation=true if navigation/sidebar links are needed.”
    • With include_navigation=true, include a bounded ## Navigation section containing navigation links rendered as Markdown.
  • Treat reader failure as a page-selection/readability signal, not as a second hidden reader mode:
    • report readable=false or equivalent metadata/reason when no useful main content was selected;
    • fallback text may remain as a diagnostic last resort, but metadata must make clear that it is fallback/raw-ish output.
  • Preserve current WebFetch safety behavior:
    • configured provider requirement;
    • private/local host rejection;
    • bounded redirects, response size, and output size;
    • binary rejection;
    • untrusted-content warning semantics.
  • Preserve output bounding for both main text and navigation content.
  • Avoid site-specific branches for mdBook/docs.rs/rustdoc/etc.; use generic DOM/tag/class/id/role heuristics only.
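
A minimal sketch of the navigation requirements above, assuming attribute values have already been pulled out of the DOM. The function names, the hint list, the MAX_NAV_LINKS cap, and the list layout under ## Navigation are illustrative assumptions, not part of the ticket; only the include_navigation default and the notice wording come from the requirements.

```rust
/// Assumed hint words, taken from the examples in the requirements.
const NAV_HINTS: [&str; 5] = ["nav", "sidebar", "toc", "menu", "breadcrumb"];

/// Hypothetical cap that keeps the navigation section bounded.
const MAX_NAV_LINKS: usize = 50;

const NAV_NOTICE: &str = "navigation was detected and omitted; re-run with \
include_navigation=true if navigation/sidebar links are needed.";

/// Generic check based only on tag, class, id, and role (no site-specific rules).
fn looks_like_navigation(tag: &str, class: &str, id: &str, role: &str) -> bool {
    if tag.eq_ignore_ascii_case("nav") || role.eq_ignore_ascii_case("navigation") {
        return true;
    }
    for value in [class, id] {
        // Compare whole words so that "canvas" does not match the "nav" hint.
        for token in value.split(|c: char| !c.is_ascii_alphanumeric()) {
            let token = token.to_ascii_lowercase();
            if NAV_HINTS.iter().any(|hint| token.starts_with(*hint)) {
                return true;
            }
        }
    }
    false
}

/// Appends a bounded "## Navigation" section when requested; otherwise returns
/// the main text unchanged plus a notice if navigation links were detected.
/// `nav_links` holds (label, absolute URL) pairs.
fn render_with_navigation(
    main_markdown: &str,
    nav_links: &[(&str, &str)],
    include_navigation: Option<bool>,
) -> (String, Option<&'static str>) {
    let mut out = String::from(main_markdown);
    if nav_links.is_empty() {
        return (out, None);
    }
    if include_navigation.unwrap_or(false) {
        out.push_str("\n\n## Navigation\n\n");
        for (label, url) in nav_links.iter().take(MAX_NAV_LINKS) {
            out.push_str(&format!("- [{}]({})\n", label, url));
        }
        (out, None)
    } else {
        (out, Some(NAV_NOTICE))
    }
}
```

Prefix matching on whole words is a deliberate trade-off in this sketch: it catches names such as navbar or sidebar-left, at the cost of occasional false positives such as a class named navy.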

Non-goals

  • Firefox/Mozilla Readability compatibility.
  • JavaScript execution, browser rendering, QuickJS, Node, Python, or subprocess extraction.
  • Search result ranking changes or provider expansion.
  • LLM summarization inside WebFetch.
  • Exhaustive benchmark/quality suite.

Implementation guidance

  • Prefer a lightweight DOM parser that is already in the current dependency graph, if possible (html5ever / rcdom or similar). It is acceptable to retain such parser dependencies directly while removing readability-rs.
  • Build a small local extractor with clear stages:
    1. parse HTML;
    2. classify nodes as navigation/skipped/main candidates;
    3. select the best main candidate using simple scoring (text length, paragraph count, link density, positive tags like main/article, negative class/id words);
    4. render selected content as bounded Markdown with absolute links;
    5. optionally render bounded navigation links under ## Navigation.
  • Keep the existing simple html_to_text path only as an explicit diagnostic fallback when local reader extraction cannot find useful content.
  • Keep result JSON compatibility where practical, but update html_extraction metadata to expose method, readable status, navigation status, fallback status/reason, and title when available.
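
Stage 3 of the extractor could be sketched roughly as follows. The Candidate struct, the weights, the negative word list, and the MIN_READABLE_SCORE threshold are all hypothetical values chosen for illustration; the ticket only names the scoring signals (text length, paragraph count, link density, positive tags, negative class/id words).

```rust
/// Hypothetical per-candidate statistics collected while walking the DOM.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    tag: &'static str,
    class_and_id: &'static str,
    text_len: usize,      // visible text length
    link_text_len: usize, // text length inside <a> elements
    paragraphs: usize,
}

/// Assumed negative class/id words.
const NEGATIVE_WORDS: [&str; 4] = ["sidebar", "footer", "menu", "comment"];

/// Assumed threshold below which no candidate counts as readable.
const MIN_READABLE_SCORE: f64 = 100.0;

fn score(c: &Candidate) -> f64 {
    if c.text_len == 0 {
        return 0.0;
    }
    // Link density: the share of the text that sits inside links (0.0 to 1.0).
    let link_density = (c.link_text_len as f64 / c.text_len as f64).min(1.0);
    let mut s = c.text_len as f64 * (1.0 - link_density) + 25.0 * c.paragraphs as f64;
    // Positive tags get a boost.
    if matches!(c.tag, "main" | "article") {
        s *= 1.5;
    }
    // Negative class/id words get a penalty.
    let lowered = c.class_and_id.to_ascii_lowercase();
    if NEGATIVE_WORDS.iter().any(|w| lowered.contains(w)) {
        s *= 0.25;
    }
    s
}

/// Returns the best candidate, or None when nothing clears the threshold,
/// which the caller would report as readable=false.
fn select_main(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .map(|c| (score(c), c))
        .filter(|(s, _)| *s >= MIN_READABLE_SCORE)
        .max_by(|a, b| a.0.total_cmp(&b.0))
        .map(|(_, c)| c)
}
```

Returning None instead of the highest-scoring weak candidate is what keeps a link-heavy, navigation-only page from being reported as readable main content.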

Acceptance criteria

  • readability-rs is removed from direct dependencies and no JavaScript runtime dependency is introduced.
  • HTML article fixture renders body links as Markdown [label](absolute-url).
  • Navigation/sidebar/footer are omitted from main text by default.
  • When navigation is omitted, result metadata or notice clearly says navigation was detected and can be included via include_navigation=true.
  • With include_navigation=true, bounded navigation links appear under a separate ## Navigation section.
  • Link-heavy navigation-only pages are not misreported as successfully readable main content.
  • Existing safety and bounds tests continue to pass.
  • Focused tests cover link preservation, navigation omission notice, navigation inclusion, reader failure/fallback metadata, and truncation/bounds.
  • cargo fmt --check, focused tools tests, cargo check -p tools, ./tickets.sh doctor, git diff --check, and Nix build/hash handling pass, or any failures are clearly reported.
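
For the truncation/bounds criterion, one possible shape of the bounding helper is sketched below. The name bound_output and the byte-based limit are assumptions; the existing WebFetch bounding code may measure or report truncation differently.

```rust
/// Truncates `text` to at most `max_bytes` bytes without splitting a UTF-8
/// character. The returned flag reports whether truncation happened, so the
/// caller can surface it in result metadata.
fn bound_output(text: &str, max_bytes: usize) -> (&str, bool) {
    if text.len() <= max_bytes {
        return (text, false);
    }
    // Step back to the nearest character boundary; index 0 is always one.
    let mut end = max_bytes;
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    (&text[..end], true)
}
```

The same helper could be applied separately to the main text and to the navigation section, so that each has its own bound.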