ticket: close webfetch pdf text

2026-06-20 21:31:42 +09:00 · 2026-06-20 21:31:42 +09:00 · 7d087afbf6
commit 7d087afbf6
parent 59c59a6a70
3 changed files with 98 additions and 2 deletions
--- a/.yoi/tickets/00001KVJA7V2R/item.md
+++ b/.yoi/tickets/00001KVJA7V2R/item.md
@ -1,8 +1,8 @@
 ---
 title: 'WebFetch: make PDFs fetchable as page-delimited text'
-state: 'done'
+state: 'closed'
 created_at: '2026-06-20T10:46:48Z'
-updated_at: '2026-06-20T12:31:02Z'
+updated_at: '2026-06-20T12:31:33Z'
 assignee: null
 readiness: 'implementation_ready'
 risk_flags: ['security', 'dependency', 'public-api', 'output-bounds']
--- a/.yoi/tickets/00001KVJA7V2R/resolution.md
+++ b/.yoi/tickets/00001KVJA7V2R/resolution.md
@ -0,0 +1,40 @@
+## Resolution
+
+Completed `00001KVJA7V2R`.
+
+Implementation:
+- Added `application/pdf` handling to `WebFetch`.
+- PDF bytes bypass the UTF-8 / `reject_binary()` text path.
+- Extraction calls `pdf_extract::extract_text_from_mem_by_pages()` inside `tokio::task::spawn_blocking`.
+- PDF output is returned as page-delimited text, e.g. `## Page 1`, `## Page 2`.
+- `transformed_as` / `pdf_extraction.method` use `pdf_text_by_pages` and do not claim to be semantic Markdown.
+- Added method/page/readability/diagnostic information to the `pdf_extraction` metadata.
+- Preserved the existing WebFetch safety pipeline, including `max_response_bytes` / `max_output_bytes` / redirects / private-local host rejection / embedded credential rejection.
+- Only `application/pdf` is supported; no extension sniffing or `application/octet-stream` PDF guessing was added.
+- Preserved unsupported binary MIME rejection.
+- Preserved existing HTML/text behavior and `html_extraction` metadata.
+- Added tests for valid page-delimited PDF output, PDF truncation, malformed PDF diagnostic error, and unsupported binary rejection.
+- Added the `pdf-extract = "0.10.0"` dependency and updated `Cargo.lock` and the `cargoHash` in `package.nix`.
+
+Main commits:
+- `b1af95ad web: fetch pdf text by pages`
+- `97edfe8a merge: webfetch pdf text`
+
+Review:
+- r1: `approve`.
+- The reviewer confirmed the WebFetch safety pipeline, exact `application/pdf` handling, binary path separation, `pdf_text_by_pages` metadata, output bounds, unsupported binary rejection, HTML metadata preservation, and the absence of a native PDF runtime dependency.
+
+Final validation:
+- `cargo fmt --check`
+- `git diff --check HEAD^1..HEAD`
+- `cargo test -p tools web`
+- `cargo check -p tools`
+- `cargo tree -p pdf-extract`
+- `nix build .#yoi --no-link`
+
+Package impact:
+- New Rust dependency: `pdf-extract 0.10.0`
+- `nix path-info -S .#yoi`: `115259736`
+
+Validation log:
+- `/run/user/1000/yoi/yoi-orchestrator/bash-output/bash-z7rcEU.log`
--- a/.yoi/tickets/00001KVJA7V2R/thread.md
+++ b/.yoi/tickets/00001KVJA7V2R/thread.md
@ -256,3 +256,59 @@ Final state:
 Implementation was merged into Orchestrator branch at `97edfe8a`, review approved, and final Orchestrator validation passed: `cargo fmt --check`, `git diff --check HEAD^1..HEAD`, `cargo test -p tools web`, `cargo check -p tools`, `cargo tree -p pdf-extract`, and `nix build .#yoi --no-link`.

 ---
+
+<!-- event: state_changed author: hare at: 2026-06-20T12:31:33Z from: done to: closed reason: closed field: state -->
+
+## State changed
+
+Set the ticket to closed.
+
+
+---
+
+<!-- event: close author: hare at: 2026-06-20T12:31:33Z status: closed -->
+
+## Completed
+
+## Resolution
+
+Completed `00001KVJA7V2R`.
+
+Implementation:
+- Added `application/pdf` handling to `WebFetch`.
+- PDF bytes bypass the UTF-8 / `reject_binary()` text path.
+- Extraction calls `pdf_extract::extract_text_from_mem_by_pages()` inside `tokio::task::spawn_blocking`.
+- PDF output is returned as page-delimited text, e.g. `## Page 1`, `## Page 2`.
+- `transformed_as` / `pdf_extraction.method` use `pdf_text_by_pages` and do not claim to be semantic Markdown.
+- Added method/page/readability/diagnostic information to the `pdf_extraction` metadata.
+- Preserved the existing WebFetch safety pipeline, including `max_response_bytes` / `max_output_bytes` / redirects / private-local host rejection / embedded credential rejection.
+- Only `application/pdf` is supported; no extension sniffing or `application/octet-stream` PDF guessing was added.
+- Preserved unsupported binary MIME rejection.
+- Preserved existing HTML/text behavior and `html_extraction` metadata.
+- Added tests for valid page-delimited PDF output, PDF truncation, malformed PDF diagnostic error, and unsupported binary rejection.
+- Added the `pdf-extract = "0.10.0"` dependency and updated `Cargo.lock` and the `cargoHash` in `package.nix`.
+
+Main commits:
+- `b1af95ad web: fetch pdf text by pages`
+- `97edfe8a merge: webfetch pdf text`
+
+Review:
+- r1: `approve`.
+- The reviewer confirmed the WebFetch safety pipeline, exact `application/pdf` handling, binary path separation, `pdf_text_by_pages` metadata, output bounds, unsupported binary rejection, HTML metadata preservation, and the absence of a native PDF runtime dependency.
+
+Final validation:
+- `cargo fmt --check`
+- `git diff --check HEAD^1..HEAD`
+- `cargo test -p tools web`
+- `cargo check -p tools`
+- `cargo tree -p pdf-extract`
+- `nix build .#yoi --no-link`
+
+Package impact:
+- New Rust dependency: `pdf-extract 0.10.0`
+- `nix path-info -S .#yoi`: `115259736`
+
+Validation log:
+- `/run/user/1000/yoi/yoi-orchestrator/bash-output/bash-z7rcEU.log`
+
+---
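
The resolution describes three behaviors that are easy to picture in isolation: matching only `application/pdf`, emitting `## Page N` delimited text, and bounding output by a byte limit. The following is a minimal std-only sketch of those ideas, not the code merged in `b1af95ad`. The function names `is_pdf_content_type`, `render_pages`, and `truncate_to_bytes` are hypothetical, and ignoring media type parameters, matching case-insensitively, and trimming page text are assumptions made for illustration. The page strings stand in for whatever the PDF extraction step returns.

```rust
/// Hypothetical helper: accepts only the `application/pdf` media type.
/// Assumption: parameters after `;` are ignored and the comparison is
/// case-insensitive. `application/octet-stream` is never treated as PDF.
fn is_pdf_content_type(header: &str) -> bool {
    let media_type = header.split(';').next().unwrap_or("").trim();
    media_type.eq_ignore_ascii_case("application/pdf")
}

/// Hypothetical helper: joins per-page text under `## Page N` headings.
/// Assumption: pages are numbered from 1 and separated by a blank line.
fn render_pages(pages: &[String]) -> String {
    let mut out = String::new();
    for (index, text) in pages.iter().enumerate() {
        if index > 0 {
            out.push_str("\n\n");
        }
        out.push_str(&format!("## Page {}\n\n{}", index + 1, text.trim()));
    }
    out
}

/// Hypothetical helper: cuts `text` to at most `max_bytes` bytes without
/// splitting a UTF-8 character. Returns the slice and a truncation flag.
fn truncate_to_bytes(text: &str, max_bytes: usize) -> (&str, bool) {
    if text.len() <= max_bytes {
        return (text, false);
    }
    let mut end = max_bytes;
    // Walk back to the nearest character boundary (index 0 always is one).
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    (&text[..end], true)
}
```

In this sketch a caller would check the `Content-Type` header first, render the extracted pages, and then apply the byte bound to the rendered string, so that truncation is decided on the final output rather than on the raw PDF bytes.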