| id | slug | title | status | kind | priority | labels | created_at | updated_at | assignee | legacy_ticket |
|---|---|---|---|---|---|---|---|---|---|---|
| 20260529-061224-responses-reasoning-context-safety | responses-reasoning-context-safety | Fix context safety accounting for Responses reasoning | open | bug | P1 | | 2026-05-29T06:12:24Z | 2026-05-29T06:12:24Z | null | null |
Background
A long-running gpt-5.5 session hit context_length_exceeded while the TUI still showed roughly 190k/400k. The failing request was in session 019e6bcf-fc62-7f93-b117-39369699c2c3, segment 019e6e18-c777-7be0-af32-9a2585e19ff7, turn=1195, llm_call=9.
The immediate trace showed the last successful usage event reported input_tokens=197700, while the failed request returned no usage. The request diagnostics also showed reasoning.context="current_turn" and a large request body (items_len=2617, items_json_bytes=1775947, raw_json_bytes=1834360, wire_bytes=686528). The same segment contained hundreds of persisted reasoning items with substantial encrypted_content.
A cross-check against /home/hare/ghq/github.com/openai/codex found that upstream Codex does not assume every configured context window is directly usable. Its model metadata has both context_window and max_context_window, and ModelInfo::resolve_context_window() clamps user model_context_window by max_context_window when present. Upstream also carries a GPT_5_BEDROCK_CONTEXT_WINDOW = 272_000, which matches the observed successful-session ceiling much better than the locally configured 1M window. Insomnia needs to distinguish advertised/configured window, backend max window, and compact/request thresholds.
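The clamp described above can be illustrated with a minimal sketch. The type and method names here (ModelWindow, effective_window) are hypothetical and are neither Insomnia's nor upstream Codex's actual code; only the context_window / max_context_window split is taken from the cross-check.

```rust
/// Hypothetical model metadata. Field names mirror the upstream split,
/// but this type is an illustration, not an existing Insomnia or Codex type.
#[derive(Debug, Clone, Copy)]
pub struct ModelWindow {
    /// Advertised window from model metadata.
    pub context_window: u64,
    /// Hard backend ceiling, when one is known.
    pub max_context_window: Option<u64>,
}

impl ModelWindow {
    /// Resolve the window that display, compact thresholds, and request
    /// safety checks should all share. A user-configured value overrides the
    /// advertised window, but never exceeds the backend ceiling.
    pub fn effective_window(&self, configured: Option<u64>) -> u64 {
        let requested = configured.unwrap_or(self.context_window);
        match self.max_context_window {
            Some(max) => requested.min(max),
            None => requested,
        }
    }
}
```

With context_window=272000 and max_context_window=272000, a configured 1M window would resolve to 272000 under this sketch, so the configured value could no longer mask the backend limit.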
Two implementation areas need to be corrected together so context safety checks match what the Responses backend actually receives:
- openai_responses request construction appears to project persisted Item::Reasoning entries, including encrypted_content, back into the next request without enforcing the intended reasoning.context / current-turn / function-call adjacency policy documented in docs/ref/model-reasoning-context.md.
- Pod request-threshold safety checks appear to use persisted usage history and can miss in-flight usage records from earlier LLM calls in the same run, so a long tool loop can keep issuing requests based on stale token occupancy.
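To make the first area concrete, here is a minimal sketch of an explicit reasoning filter in the request builder. The exact policy is still to be defined by this ticket, so everything here is an assumption: the Item enum is a simplified stand-in for the real type, and select_items and keep_old_call_adjacent are hypothetical names. The flag exists only to surface the open question of whether reasoning from earlier turns must be kept when it directly precedes a function call.

```rust
/// Simplified, hypothetical stand-in for persisted session items.
#[derive(Debug, Clone, PartialEq)]
pub enum Item {
    Message { turn: u32 },
    Reasoning { turn: u32, encrypted_content: Option<String> },
    FunctionCall { turn: u32 },
}

/// Candidate policy for reasoning.context="current_turn" (an assumption,
/// not the documented policy): reasoning from the current turn is kept;
/// reasoning from older turns is dropped unless `keep_old_call_adjacent`
/// is set and the item directly precedes a function call.
/// Non-reasoning items always pass through unchanged.
pub fn select_items(
    history: &[Item],
    current_turn: u32,
    keep_old_call_adjacent: bool,
) -> Vec<Item> {
    let mut out = Vec::with_capacity(history.len());
    for (i, item) in history.iter().enumerate() {
        let keep = match item {
            Item::Reasoning { turn, .. } => {
                let in_current_turn = *turn == current_turn;
                let precedes_call =
                    matches!(history.get(i + 1), Some(Item::FunctionCall { .. }));
                in_current_turn || (keep_old_call_adjacent && precedes_call)
            }
            _ => true,
        };
        if keep {
            out.push(item.clone());
        }
    }
    out
}
```

Keeping the decision in one small function makes the policy testable in isolation, independent of how the request body is serialized.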
Requirements
- Reconcile docs/ref/model-reasoning-context.md with crates/llm-worker/src/llm_client/scheme/openai_responses/request.rs.
  - Define exactly which reasoning items may be sent for reasoning.context="current_turn".
  - Preserve the provider requirements for tool/function-call continuity.
  - Do not silently resend old reasoning encrypted_content outside the documented policy.
- Reconcile Insomnia model metadata/config semantics with upstream Codex's context_window / max_context_window split.
  - Support or document a backend max-window clamp so a user-visible 1M configured window cannot mask an effective backend limit such as 272k.
  - Ensure TUI displayed context window, compact thresholds, and request safety checks all use consistent effective-window semantics.
- Update request construction so persisted reasoning items are included only when required by the documented policy.
- Add focused tests covering old reasoning items, current-turn reasoning, function-call adjacency, and encrypted reasoning content.
- Update Pod context safety accounting so request-threshold / pre-request checks include in-flight UsageTracker records from the current run, not only persisted session-log usage history.
  - Ensure long same-run tool loops can trigger compact/prune/stop decisions using the latest successful usage before the next request is sent.
  - Preserve the existing principle that Usage.input_tokens is request prompt occupancy, while acknowledging failed context_length_exceeded responses may not include usage.
- Improve diagnostics for context overflow and near-overflow cases.
  - Record at least items count, item JSON bytes, raw/wire request bytes, reasoning item count, reasoning encrypted-content bytes, and whether provider usage was absent.
  - Keep diagnostics out of model context unless they are intentionally logged as normal visible events.
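The in-flight accounting requirement above could be sketched as follows. This is an illustration under stated assumptions: usage history is reduced to plain input_tokens values, and PreRequestDecision, latest_input_tokens, pre_request_decision, and the percent-based threshold are hypothetical names and shapes, not the existing Pod or UsageTracker API.

```rust
/// Hypothetical outcome of a pre-request safety check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PreRequestDecision {
    Proceed,
    Compact,
}

/// Latest successful prompt occupancy. In-flight records from the current
/// run take precedence over persisted session-log history, because they
/// are newer. Failed requests contribute nothing (they carry no usage).
pub fn latest_input_tokens(persisted: &[u64], in_flight: &[u64]) -> Option<u64> {
    in_flight.last().or(persisted.last()).copied()
}

/// Decide whether the next request may be sent. `threshold_percent` is the
/// share of the effective window at which compaction should start.
pub fn pre_request_decision(
    persisted: &[u64],
    in_flight: &[u64],
    effective_window: u64,
    threshold_percent: u64,
) -> PreRequestDecision {
    let threshold = effective_window * threshold_percent / 100;
    match latest_input_tokens(persisted, in_flight) {
        Some(occupancy) if occupancy >= threshold => PreRequestDecision::Compact,
        _ => PreRequestDecision::Proceed,
    }
}
```

For example, with an effective window of 272000 and a 70% threshold (190400 tokens), an in-flight record of 197700 input tokens yields Compact, whereas a check that only sees a persisted record of 120000 would still return Proceed.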
Implementation notes
- Upstream Codex references for comparison:
  - /home/hare/ghq/github.com/openai/codex/codex-rs/models-manager/models.json defines gpt-5.5 with context_window=272000 and max_context_window=272000.
  - codex-rs/models-manager/src/model_info.rs clamps configured model_context_window by max_context_window when applying config overrides.
  - codex-rs/protocol/src/openai_models.rs derives auto_compact_token_limit() from the resolved context window.
  - codex-rs/core/src/context_manager/history.rs tracks server_reasoning_included and uses encrypted reasoning estimates only when the server usage does not already include them.
- Do not blindly port Codex internals. Preserve Insomnia's existing manifest/model layering and session-log authority; add the smallest typed concepts needed to represent an effective backend max window and to make safety accounting conservative enough.
- If exact reasoning inclusion policy is ambiguous, make the request builder policy explicit in code and tests, and update docs/ref/model-reasoning-context.md alongside the implementation.
- Treat provider context_length_exceeded responses with usage=null as expected; diagnostics must rely on request-shape counters rather than nonexistent failed-request token usage.
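A minimal sketch of request-shape counters that do not depend on provider usage is shown below. The RequestItem and OverflowDiagnostics types and the collect_diagnostics function are hypothetical; the counter names follow the fields listed in the requirements and in the observed trace (items_len, items_json_bytes, raw_json_bytes, wire_bytes).

```rust
/// Hypothetical view of one serialized request item.
#[derive(Debug, Clone)]
pub struct RequestItem {
    /// Serialized JSON for this item.
    pub json: String,
    /// Whether this item is a reasoning item.
    pub is_reasoning: bool,
    /// Encrypted reasoning payload, if any.
    pub encrypted_content: Option<String>,
}

/// Counters recorded for overflow and near-overflow requests.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OverflowDiagnostics {
    pub items_len: usize,
    pub items_json_bytes: usize,
    pub raw_json_bytes: usize,
    pub wire_bytes: usize,
    pub reasoning_items: usize,
    pub reasoning_encrypted_bytes: usize,
    pub provider_usage_absent: bool,
}

/// Build diagnostics from the request shape alone. `provider_input_tokens`
/// is None when the provider returned no usage, as expected for a
/// context_length_exceeded failure.
pub fn collect_diagnostics(
    items: &[RequestItem],
    raw_json_bytes: usize,
    wire_bytes: usize,
    provider_input_tokens: Option<u64>,
) -> OverflowDiagnostics {
    let mut d = OverflowDiagnostics {
        items_len: items.len(),
        items_json_bytes: 0,
        raw_json_bytes,
        wire_bytes,
        reasoning_items: 0,
        reasoning_encrypted_bytes: 0,
        provider_usage_absent: provider_input_tokens.is_none(),
    };
    for item in items {
        d.items_json_bytes += item.json.len();
        if item.is_reasoning {
            d.reasoning_items += 1;
            d.reasoning_encrypted_bytes +=
                item.encrypted_content.as_ref().map_or(0, |c| c.len());
        }
    }
    d
}
```

Because every field is derived from the outgoing request, the same record can be produced for both successful and failed calls, which makes near-overflow cases comparable with actual overflows.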
Acceptance criteria
- reasoning.context="current_turn" no longer causes old persisted reasoning encrypted_content to be resent outside the documented policy.
- Function/tool-call continuity still works for Responses models that require adjacent reasoning/function-call state.
- Request safety checks include current-run in-flight usage before sending subsequent LLM calls.
- A focused regression test covers a single run with multiple LLM calls where later calls would exceed the threshold if in-flight usage were ignored.
- A focused regression test covers a history containing old reasoning items and verifies request input contains only the allowed reasoning subset.
- Context overflow diagnostics make it clear when provider usage is absent and expose request-size/reasoning-size counters.
- cargo fmt --check
- Relevant cargo test / cargo check for llm-worker and pod pass.