From 4887aa33d953c940099d1f92d4f9c7483c7807a6 Mon Sep 17 00:00:00 2001
From: Hare
Date: Sat, 20 Jun 2026 21:24:09 +0900
Subject: [PATCH] ticket: record webfetch pdf implementation report

---
 .yoi/tickets/00001KVJA7V2R/item.md   |  2 +-
 .yoi/tickets/00001KVJA7V2R/thread.md | 50 ++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/.yoi/tickets/00001KVJA7V2R/item.md b/.yoi/tickets/00001KVJA7V2R/item.md
index 4811dabf..a41133bf 100644
--- a/.yoi/tickets/00001KVJA7V2R/item.md
+++ b/.yoi/tickets/00001KVJA7V2R/item.md
@@ -2,7 +2,7 @@
 title: 'WebFetch: PDF を page-delimited text として取得できるようにする'
 state: 'inprogress'
 created_at: '2026-06-20T10:46:48Z'
-updated_at: '2026-06-20T12:09:50Z'
+updated_at: '2026-06-20T12:24:02Z'
 assignee: null
 readiness: 'implementation_ready'
 risk_flags: ['security', 'dependency', 'public-api', 'output-bounds']
diff --git a/.yoi/tickets/00001KVJA7V2R/thread.md b/.yoi/tickets/00001KVJA7V2R/thread.md
index 93f33c55..3cfbcea0 100644
--- a/.yoi/tickets/00001KVJA7V2R/thread.md
+++ b/.yoi/tickets/00001KVJA7V2R/thread.md
@@ -105,3 +105,53 @@ Next action:
 - Wait for Coder implementation report, then inspect branch diff/validation evidence and route to Reviewer。
 
 ---
+
+
+
+## Implementation report
+
+Coder implementation report received from `yoi-coder-00001KVJA7V2R`.
+
+Implementation commit:
+- `b1af95ad web: fetch pdf text by pages`
+
+Changed areas reported:
+- `crates/tools/src/web.rs`:
+  - Added `application/pdf` handling for `WebFetch`。
+  - PDF bytes bypass UTF-8 / `reject_binary()` text path。
+  - Uses `pdf_extract::extract_text_from_mem_by_pages()` inside `tokio::task::spawn_blocking`。
+  - Returns Markdown-ish page sections like `## Page 1`, `## Page 2`。
+  - Adds `pdf_extraction` metadata with method/page/readability/diagnostic fields。
+  - Keeps existing `html_extraction` semantics intact。
+  - Preserves unsupported binary MIME rejection。
+  - Added deterministic in-memory PDF test fixtures and tests for PDF page-delimited output, output truncation, malformed PDF diagnostic error, and unsupported binary MIME rejection。
+- `crates/tools/Cargo.toml`: added `pdf-extract = "0.10.0"`。
+- `Cargo.lock`: updated for `pdf-extract` and transitive dependencies。
+- `package.nix`: updated `cargoHash` to `sha256-rvsjn4BBxd9vt4nytPgUh4l/OQCRpqHbUR4jHoH589U=`。
+
+Coder validation reported as passing:
+- `cargo fmt --check`
+- `cargo test -p tools web`
+- `cargo check -p tools`
+- `git diff --check`
+- `nix build .#yoi --no-link`
+
+Dependency / package impact:
+- New Rust dependency: `pdf-extract 0.10.0`。
+- Nix vendor hash updated and `nix build .#yoi --no-link` passed。
+
+Known risks / deferrals:
+- Only `application/pdf` is supported; no extension sniffing or `application/octet-stream` PDF guessing。
+- No OCR, scanned-PDF support, table reconstruction, cache, subprocess, Poppler, or Pdfium integration。
+- Malformed PDFs return diagnostic extraction error; textless PDFs are represented with `readable=false` metadata when extraction succeeds but no text is found。
+
+Orchestrator evidence checked before review dispatch:
+- Implementation worktree is clean。
+- HEAD is `b1af95ad`。
+- Diff from acceptance `e752a720..HEAD` is one implementation commit touching 4 files, about 552 insertions / 23 deletions。
+- `git diff --check e752a720..HEAD` produced no diagnostics。
+
+Next action:
+- Dispatch Reviewer for r1 review against Ticket requirements, with focus on WebFetch network/binary safety preservation, PDF extraction bounds/metadata, dependency/Nix impact, malformed/textless behavior, unsupported binary rejection, and HTML/text regression safety。
+
+---
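
Reviewer note (illustrative only, not taken from `b1af95ad`): the report describes page-delimited output with `## Page N` headings, bounded output, and a `readable=false` flag for textless PDFs. A minimal sketch of that rendering step might look like the following. It assumes the per-page strings have already been extracted (for example by the `pdf_extract` call mentioned above, which is omitted here). The names `RenderedPdf`, `render_pages`, and `max_chars` are hypothetical and do not reflect the actual `pdf_extraction` metadata shape.

```rust
/// Hypothetical result shape; the real `pdf_extraction` metadata may differ.
#[derive(Debug, PartialEq)]
struct RenderedPdf {
    text: String,
    page_count: usize,
    /// False when extraction succeeded but no page contained any text.
    readable: bool,
    /// True when the rendered text was cut to fit `max_chars`.
    truncated: bool,
}

/// Joins already-extracted page texts into `## Page N` sections and
/// bounds the result to at most `max_chars` characters.
fn render_pages(pages: &[String], max_chars: usize) -> RenderedPdf {
    let readable = pages.iter().any(|p| !p.trim().is_empty());

    let mut full = String::new();
    for (i, page) in pages.iter().enumerate() {
        if i > 0 {
            full.push_str("\n\n");
        }
        // Page numbers are 1-based in the rendered headings.
        full.push_str(&format!("## Page {}\n\n{}", i + 1, page.trim()));
    }

    // Count characters, not bytes, so the cut never splits a UTF-8 sequence.
    let truncated = full.chars().count() > max_chars;
    let text = if truncated {
        full.chars().take(max_chars).collect()
    } else {
        full
    };

    RenderedPdf { text, page_count: pages.len(), readable, truncated }
}
```

Under these assumptions, two pages `"Hello"` and `" World \n"` with a generous limit would render as `## Page 1`, a blank line, `Hello`, a blank line, `## Page 2`, a blank line, `World`. A list of blank pages would still produce headings but report `readable: false`.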