---
id: 20260530-215928-webfetch-local-reader-markdown
slug: webfetch-local-reader-markdown
title: 'WebFetch: replace readability dependency with Markdown-preserving local reader'
status: closed
kind: task
priority: P2
labels: [web, tools, html]
created_at: 2026-05-30T21:59:28Z
updated_at: 2026-05-30T22:21:39Z
assignee: null
---

## Background

`webfetch-readable-extraction` added `readability-rs` to improve `WebFetch` HTML output. It proved the direction, but the next design step is to own the reader behavior instead of depending on an article extractor that flattens links to plain text.

For LLM research workflows, article text without links is lossy: links inside the readable body often point to RFCs, docs, downloads, related pages, or citations that the agent must be able to follow. At the same time, navigation/sidebar content should be omitted by default, while still being discoverable when the page is documentation/book-like and navigation links are important.

## Requirements

- Replace the `readability-rs` dependency with a local, pure-Rust HTML reader extractor in `crates/tools`.
- Keep `WebSearch` and `WebFetch` separate. Do not add summarization or research orchestration in this ticket.
- `WebFetch` HTML output should be Markdown-ish text, not plain text:
  - preserve inline links as `[label](absolute-url)`;
  - preserve useful headings/lists/paragraph breaks enough for LLM readability;
  - do not expose full HTML by default.
- Add optional `include_navigation: Option<bool>` to `WebFetchInput`, defaulting to `false`.
- Detect navigation-like content (`nav`, sidebar/toc/menu/breadcrumb-ish class/id/role, previous/next chapter areas, etc.) generically.
- With `include_navigation=false` (the default), omit navigation from the main text.
- If navigation was detected and omitted, include metadata/notice in the tool result such as “navigation was detected and omitted; re-run with include_navigation=true if navigation/sidebar links are needed.”
- With `include_navigation=true`, include a bounded `## Navigation` section containing navigation links rendered as Markdown.
- Treat reader failure as a page-selection/readability signal, not as a second hidden reader mode:
  - report `readable=false` or equivalent metadata/reason when no useful main content was selected;
  - fallback text may remain as a diagnostic last resort, but metadata must make clear that it is fallback/raw-ish output.
- Preserve current `WebFetch` safety behavior:
  - configured provider requirement;
  - private/local host rejection;
  - bounded redirects, response size, and output size;
  - binary rejection;
  - untrusted-content warning semantics.
- Preserve output bounding for both main text and navigation content.
- Avoid site-specific branches for mdBook/docs.rs/rustdoc/etc.; use generic DOM/tag/class/id/role heuristics only.

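As a rough illustration of the navigation requirements above, the sketch below shows one way generic detection and the omission notice might fit together. `ElementHints`, `NAV_WORDS`, `is_navigation_like`, and `navigation_notice` are hypothetical names, and the word list is an assumption; only `include_navigation` and the example notice wording come from this ticket.

```rust
/// Hypothetical, simplified view of an element's identifying attributes.
/// A real extractor would read these from the parsed DOM node.
pub struct ElementHints<'a> {
    pub tag: &'a str,
    pub class: &'a str,
    pub id: &'a str,
    pub role: &'a str,
}

/// Word prefixes treated as navigation signals in class/id tokens.
/// Illustrative only; the ticket does not fix an exact list.
const NAV_WORDS: &[&str] = &["nav", "sidebar", "toc", "menu", "breadcrumb", "pagination"];

/// Returns true when the element looks like navigation, using only
/// generic tag/class/id/role hints (no site-specific branches).
pub fn is_navigation_like(el: &ElementHints<'_>) -> bool {
    if el.tag.eq_ignore_ascii_case("nav") {
        return true;
    }
    let role = el.role.to_ascii_lowercase();
    if role == "navigation" || role == "menu" || role == "menubar" {
        return true;
    }
    // Split class/id into tokens so that "toc" matches "toc-wrapper"
    // but not an unrelated word such as "stock".
    let hints = format!("{} {}", el.class, el.id).to_ascii_lowercase();
    hints
        .split(|c: char| !c.is_ascii_alphanumeric())
        .any(|token| NAV_WORDS.iter().any(|w| token.starts_with(*w)))
}

/// Returns the omission notice when navigation was detected but not requested.
/// `include_navigation` mirrors the optional input field and defaults to false.
pub fn navigation_notice(detected: bool, include_navigation: Option<bool>) -> Option<&'static str> {
    let include = include_navigation.unwrap_or(false);
    if detected && !include {
        Some("navigation was detected and omitted; re-run with include_navigation=true if navigation/sidebar links are needed.")
    } else {
        None
    }
}
```

Token-prefix matching is a deliberate simplification: it misses camelCase names such as `mainNav`, so a real implementation would likely need a more careful tokenizer.
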
## Non-goals

- Firefox/Mozilla Readability compatibility.
- JavaScript execution, browser rendering, QuickJS, Node, Python, or subprocess extraction.
- Search result ranking changes or provider expansion.
- LLM summarization inside `WebFetch`.
- Exhaustive benchmark/quality suite.

## Implementation guidance

- Prefer using a lightweight DOM parser dependency already implied by the current dependency graph if possible (`html5ever` / rcdom or similar). It is acceptable to retain such parser dependencies directly while removing `readability-rs`.
- Build a small local extractor with clear stages:
  1. parse HTML;
  2. classify nodes as navigation/skipped/main candidates;
  3. select the best main candidate using simple scoring (text length, paragraph count, link density, positive tags like `main`/`article`, negative class/id words);
  4. render selected content as bounded Markdown with absolute links;
  5. optionally render bounded navigation links under `## Navigation`.
- Keep the existing simple `html_to_text` path only as an explicit diagnostic fallback when local reader extraction cannot find useful content.
- Keep result JSON compatibility where practical, but update `html_extraction` metadata to expose method, readable status, navigation status, fallback status/reason, and title when available.

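The candidate-selection stage above could be sketched roughly as follows. This is a minimal illustration, not the intended implementation: `CandidateStats`, `link_density`, `score`, `select_main`, the word lists, and every numeric weight are assumptions chosen for readability.

```rust
/// Hypothetical per-candidate statistics collected while classifying nodes.
#[derive(Debug, Clone)]
pub struct CandidateStats {
    pub tag: String,
    pub class_and_id: String,
    pub text_len: usize,        // characters of visible text
    pub link_text_len: usize,   // characters of text inside links
    pub paragraph_count: usize,
}

// Illustrative lists; the ticket only names `main`/`article` as positive tags.
const POSITIVE_TAGS: &[&str] = &["main", "article"];
const NEGATIVE_WORDS: &[&str] = &["nav", "sidebar", "menu", "footer", "comment"];

/// Fraction of the candidate's text that sits inside links (0.0 to 1.0).
/// Empty candidates are treated as fully link-dense so they never win.
pub fn link_density(c: &CandidateStats) -> f64 {
    if c.text_len == 0 {
        return 1.0;
    }
    (c.link_text_len as f64 / c.text_len as f64).min(1.0)
}

/// Simple additive score, scaled down by link density.
/// All weights are placeholders.
pub fn score(c: &CandidateStats) -> f64 {
    let mut s = (c.text_len as f64).sqrt() + 5.0 * c.paragraph_count as f64;
    if POSITIVE_TAGS.iter().any(|t| c.tag.eq_ignore_ascii_case(t)) {
        s += 25.0;
    }
    let hints = c.class_and_id.to_ascii_lowercase();
    if NEGATIVE_WORDS.iter().any(|w| hints.contains(*w)) {
        s -= 25.0;
    }
    s.max(0.0) * (1.0 - link_density(c))
}

/// Returns the best candidate scoring at least `min_score`.
/// `None` is the signal a caller could map to `readable=false`.
pub fn select_main(cands: &[CandidateStats], min_score: f64) -> Option<&CandidateStats> {
    cands
        .iter()
        .map(|c| (score(c), c))
        .filter(|(s, _)| *s >= min_score)
        .max_by(|a, b| a.0.total_cmp(&b.0))
        .map(|(_, c)| c)
}
```

Multiplying by `1.0 - link_density` is one way to keep link-heavy, navigation-only blocks from being selected as main content; the threshold passed as `min_score` would need tuning against real fixtures.
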
## Acceptance criteria

- `readability-rs` is removed from direct dependencies and no JavaScript runtime dependency is introduced.
- HTML article fixture renders body links as Markdown `[label](absolute-url)`.
- Navigation, sidebar, and footer content is omitted from the main text by default.
- When navigation is omitted, result metadata or notice clearly says navigation was detected and can be included via `include_navigation=true`.
- With `include_navigation=true`, bounded navigation links appear under a separate `## Navigation` section.
- Link-heavy navigation-only pages are not misreported as successfully readable main content.
- Existing safety and bounds tests continue to pass.
- Focused tests cover link preservation, navigation omission notice, navigation inclusion, reader failure/fallback metadata, and truncation/bounds.
- `cargo fmt --check`, focused tools tests, `cargo check -p tools`, `./tickets.sh doctor`, `git diff --check`, and Nix build/hash handling pass, or failures are clearly reported.

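The link-preservation and navigation-inclusion criteria above could be exercised against helpers shaped roughly like the ones below. `md_link` and `render_navigation` are hypothetical names, URLs are assumed to be resolved to absolute form before rendering, and the bound is expressed as a link count purely for illustration.

```rust
/// Renders a Markdown inline link. Assumes `url` is already absolute.
pub fn md_link(label: &str, url: &str) -> String {
    // Escape square brackets so the label cannot end the link text early.
    let label = label.trim().replace('[', "\\[").replace(']', "\\]");
    format!("[{}]({})", label, url)
}

/// Renders at most `max_links` links under a `## Navigation` heading.
/// Returns the section text and whether any links were dropped.
pub fn render_navigation(links: &[(&str, &str)], max_links: usize) -> (String, bool) {
    let mut out = String::from("## Navigation\n\n");
    for (label, url) in links.iter().take(max_links) {
        out.push_str("- ");
        out.push_str(&md_link(label, url));
        out.push('\n');
    }
    (out, links.len() > max_links)
}
```

Returning the truncation flag alongside the text lets a caller record in result metadata that the navigation section was bounded.
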