Brief #89

176 articles analyzed

Context engineering is entering a maturity crisis: practitioners are discovering that the bottleneck isn't model capability or orchestration frameworks—it's the systematic preservation and structure of information across execution boundaries. The shift from 'better prompts' to 'context architecture' is happening faster than tooling can catch up, forcing teams to build their own context management infrastructure.

Contract-Based Context Isolation Prevents Agent Degradation

Practitioners are abandoning open-ended agent sessions in favor of bounded task contracts with explicit acceptance criteria, discovering that context pollution—not capability limits—causes agent performance decay over long interactions. Session isolation with deterministic completion criteria outperforms continuous context accumulation.

Replace continuous agent sessions with explicit task contracts: define acceptance criteria upfront, impose session boundaries with git checkpoints, require plan approval before execution. Measure session length vs. output quality to find your degradation threshold.
15 Battle-Tested Lessons from 6 Real Production Projects

Chinese practitioner reports 50% failure rate when agents operate without bounded contracts. Success pattern: {TASK}_CONTRACT.md with explicit acceptance criteria, neutral prompting, and session checkpoints. 'Open-ended 24-hour sessions destroy compounding; contract-based closure preserves it.'
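The contract pattern above can be sketched as a deterministic completion gate. This is a minimal illustration, not the practitioner's actual tooling: it assumes the {TASK}_CONTRACT.md lists acceptance criteria as markdown checkboxes, and refuses to close the session until every box is ticked.

```python
import re

def parse_contract(text: str) -> list[tuple[str, bool]]:
    """Parse markdown checkboxes from a {TASK}_CONTRACT.md body.

    Returns (criterion, checked) pairs; assumes '- [ ]' / '- [x]' syntax.
    """
    items = []
    for line in text.splitlines():
        m = re.match(r"-\s*\[( |x|X)\]\s*(.+)", line.strip())
        if m:
            items.append((m.group(2).strip(), m.group(1).lower() == "x"))
    return items

def session_may_close(contract_text: str) -> bool:
    """Deterministic completion check: every acceptance criterion ticked."""
    items = parse_contract(contract_text)
    return bool(items) and all(done for _, done in items)

contract = """
# Acceptance criteria
- [x] All unit tests pass
- [ ] Docs updated
"""
print(session_may_close(contract))  # False until every box is checked
```

A git checkpoint at each approved contract boundary then gives the session its explicit reset point.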

Strictly Separate Planning from Execution: Never Let Claude Write a Line of Code Before You Review and Approve a Written Plan

Practitioner enforces plan-review-annotate loops before execution, treating implementation as mechanical once plan is locked. Multi-cycle annotation phase forces precise problem definition before code generation starts, preventing context drift.

Hanson is a magician: solving a neural network challenge under 15-min/5K constraints

Hard time/token constraints (15 min, 5K chars) forced model into structured phases: exploratory → hypothesis → reference → implementation → optimization. Bounded context enabled intelligent prioritization; unlimited context would have caused exploration sprawl.


MCP Configuration Is Context Poisoning Attack Surface

Project-level MCP server configs and environment variables—designed to preserve context across sessions—create security vulnerabilities when treated as trusted without re-validation. The same persistence mechanisms that enable intelligence compounding become injection vectors when repositories are untrusted.

Treat MCP configs and project settings as untrusted input requiring validation at each session start. Implement content-addressable verification (hash-based, not path-based) for MCP servers. Never auto-execute code from project-level configs without explicit user re-approval.
Claude Code Flaws Allow Remote Code Execution and API Key Exfiltration

Check Point Research disclosed that .mcp.json, .claude/settings.json, and environment variables persist across sessions without re-validation, enabling malicious repositories to execute code and exfiltrate credentials before user approval. Configuration poisoning attacks exploit this trust boundary.
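The hash-based verification recommended above can be sketched with stdlib tooling. This is an illustrative gate, not Check Point's mitigation: a pin is recorded only when the user explicitly approves a config, and every later session start must hash the bytes back to the same pin.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class ConfigGate:
    """Session-start gate for project-level configs (e.g. .mcp.json).

    Content-addressable: trust is tied to the bytes, not the file path,
    so a repo swapping in a new config cannot inherit old approval.
    """

    def __init__(self):
        self.pins: dict[str, str] = {}

    def approve(self, name: str, data: bytes) -> None:
        """Record a pin at the moment of explicit user approval."""
        self.pins[name] = sha256_of(data)

    def trusted(self, name: str, data: bytes) -> bool:
        """Unknown or mutated configs are untrusted until re-approved."""
        return self.pins.get(name) == sha256_of(data)

gate = ConfigGate()
gate.approve(".mcp.json", b'{"servers": {}}')
```

Any mismatch should route back through explicit user re-approval rather than auto-execution.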

Tool Description Drift Breaks Agent Context Silently

MCP tool schemas and API contracts change frequently, but agents continue operating with stale mental models, producing silent failures that only surface after user impact. Context compounding breaks when the context layer itself is mutable without detection.

Implement contract verification for MCP servers: hash tool descriptions, detect breaking changes, version control tool schemas. Build diffs into CI/CD to catch schema mutations before agents see them. Treat tool definitions as code requiring change management.
Your MCP Server's Tool Descriptions Changed Last Night. Nobody Noticed.

Practitioner discovered that MCP tool descriptions change silently, breaking agent assumptions without visibility. Agents continue with outdated context, causing failures only visible after user discovery. Contract verification pattern needed: treat tool schemas like API versioning.
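A CI-side drift check along the lines described could look like the sketch below. The fingerprinting scheme (canonical JSON, sorted keys) is an assumption, not a described implementation; the point is that tool definitions get diffed like any other versioned contract.

```python
import hashlib
import json

def schema_fingerprints(tools: list[dict]) -> dict[str, str]:
    """Hash each tool's full definition (name, description, schema).

    Canonical JSON (sorted keys) keeps the hash stable across key order.
    """
    return {
        t["name"]: hashlib.sha256(
            json.dumps(t, sort_keys=True).encode()
        ).hexdigest()
        for t in tools
    }

def diff_tools(pinned: dict[str, str], live: dict[str, str]) -> dict[str, list[str]]:
    """CI-style report: which tools appeared, vanished, or mutated."""
    return {
        "added": sorted(live.keys() - pinned.keys()),
        "removed": sorted(pinned.keys() - live.keys()),
        "changed": sorted(n for n in pinned.keys() & live.keys() if pinned[n] != live[n]),
    }
```

Running this against a pinned snapshot in CI surfaces the overnight description change before any agent sees it.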

Evaluation Suites Are Context Preservation Mechanisms

Practitioners are discovering that eval suites function as persistent memory about what works—not just validation gates. The 50-test-case library becomes the compounding intelligence layer that prevents regression and enables systematic prompt evolution.

Build eval suites before optimizing prompts. Treat 50+ test cases as the artifact that compounds knowledge, not the prompt itself. Version control evals alongside prompts. Use pass rate trends to detect when capabilities have been internalized by base models.
Changed one word in your GPT-4 prompt and accuracy dropped 15%?

Practitioner reports uncontrolled prompt changes cause silent 15% accuracy drops. Solution: 50 input/output pairs as eval suite. Without this, cannot distinguish broken prompts from legitimate variation. Eval suite IS preserved intelligence across iterations.
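The eval-suite-as-artifact idea reduces to a small harness. This is a minimal sketch, not the practitioner's setup: `model` stands in for any callable prompt pipeline, exact match is the simplest scorer, and a real suite would load its 50+ pairs from version control alongside the prompt.

```python
def pass_rate(cases: list[tuple[str, str]], model) -> float:
    """Run a prompt variant over pinned input/output pairs.

    `cases` is the versioned eval suite; `model` is any str -> str
    callable. Swap in a fuzzier comparison for free-form outputs.
    """
    hits = sum(1 for x, expected in cases if model(x).strip() == expected.strip())
    return hits / len(cases)

# Toy suite; a real one pins 50+ pairs next to the prompt in git.
suite = [("2+2", "4"), ("3+3", "6")]
```

Tracking `pass_rate` over prompt revisions is what turns a one-word change from a silent 15% regression into a failed check.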

Parallel Agent Execution Requires External State Coordination

Running multiple agents in parallel without shared external state causes context fragmentation and duplicate work. Practitioners are building custom session managers because terminal multiplexers and IDEs weren't architected for stateful, long-running agent coordination.

If running 3+ concurrent agents, invest in external state coordination infrastructure before adding more agents. Use git worktree or equivalent for workspace isolation. Monitor memory pressure and session degradation as early warning signals.
WezTerm wasn't made to withstand multi-day concurrent AI agents

Practitioner building FrankenTerm because WezTerm degrades under dozens of concurrent agents running for days. Session state and memory management fail. Generic terminal multiplexers weren't designed for this constraint class.
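The shared-state coordination the takeaway calls for can start very small. The sketch below is an assumption-laden illustration, not FrankenTerm's design: agents claim tasks through atomic file creation (`O_EXCL`), so two concurrent agents can never duplicate work; a production version would add heartbeats and stale-claim reaping.

```python
import json
import os

class TaskClaims:
    """Minimal external state shared by concurrent agents.

    Each agent claims a task by creating <task>.claim atomically; the
    filesystem, not any one agent's context, is the source of truth.
    """

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def claim(self, task_id: str, agent_id: str) -> bool:
        path = os.path.join(self.root, f"{task_id}.claim")
        try:
            # O_EXCL makes creation atomic: exactly one claimant wins.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False  # another agent got there first
        with os.fdopen(fd, "w") as f:
            json.dump({"task": task_id, "agent": agent_id}, f)
        return True
```

Pairing a claim registry like this with git worktree isolation keeps each agent's workspace and its claimed work in sync.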

Context Metadata Pollution Degrades Agent Reasoning

API parsing outputs (bounding boxes, OCR confidence, layout metadata) designed for rendering actively harm agent reasoning by consuming context window before inference begins. Practitioners are building preprocessing pipelines to separate clean content from queryable metadata.

Preprocess document parsing outputs before feeding to agents: extract clean text/markdown for reasoning context, store metadata (coordinates, confidence, layout) in separate queryable store accessible via tools. Don't inline metadata into agent context.
Shared by @raunakdoesdev: legal contract review, insurance claims, financial 10-K reports (document parsing API metadata harms agent reasoning)

Raunak reports customer friction: raw API metadata (coordinates, confidence, blocks) fills context before agent reasoning begins. Solution: split into (a) clean Markdown for inference, (b) structured metadata as queryable tool. Agents reason on content, query metadata only when needed.
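The split Raunak describes can be sketched in a few lines. The block shape here (`text` plus layout fields) is a hypothetical parser output, not his API: clean text goes into the agent's context, everything else into a metadata list destined for a queryable store behind a tool.

```python
def split_parse_output(blocks: list[dict]) -> tuple[str, list[dict]]:
    """Separate reasoning content from rendering metadata.

    Each block carries 'text' plus layout fields (bbox, confidence,
    page). Only the joined text reaches the agent's context window.
    """
    clean_text = "\n\n".join(b["text"] for b in blocks)
    metadata = [
        {"id": i, **{k: v for k, v in b.items() if k != "text"}}
        for i, b in enumerate(blocks)
    ]
    return clean_text, metadata

blocks = [
    {"text": "Section 4.2: Indemnification", "bbox": [40, 120, 560, 150], "confidence": 0.98, "page": 7},
    {"text": "The supplier shall hold the buyer harmless.", "bbox": [40, 160, 560, 420], "confidence": 0.91, "page": 7},
]
context, meta = split_parse_output(blocks)
```

The `id` field lets a metadata tool answer "where on the page is block 1?" without the coordinates ever entering the reasoning context.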

Scheduled Execution Without Result Persistence Is Context Reset

Scheduled agent tasks that don't automatically persist results to external systems (Telegram, Slack, databases) lose intelligence between runs. Practitioners are discovering that automation without memory is just repeated single-shot execution.

When implementing scheduled agents, design result persistence upfront: route structured outputs to external databases, messaging systems, or file stores. Don't rely on agent memory alone. Treat each scheduled run as a session boundary requiring explicit state handoff.
Claude Code Scheduled Tasks + Telegram notification

Practitioner routes Claude scheduled task results to Telegram bot to ensure outputs persist in external always-on system. Without this, task results are ephemeral. .env file becomes critical context bridge between Claude execution and persistent external state.

Multi-Model Routing Requires Skill-Level Context Tagging

Practitioners routing across multiple models (Sonnet/Codex/GLM) are building skill-tagging systems to match tasks to appropriate model contexts, discovering that model selection is a context engineering problem requiring metadata about what each model understands.

If using multiple models, build skill tagging at the task level before routing. Don't assume all models have equivalent context understanding. Create skill metadata that explicitly maps task requirements to model capabilities.
For those wondering: my agentic workflow uses Sonnet 4.6 + Codex + GLM 5

Practitioner uses different models for different functions (Sonnet for decisions, Codex for implementation, GLM for search). Recommends building personal tools/hooks/skills—suggesting value is in abstraction layer allowing model swap without context disruption.
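A skill-tagging router of the kind described can be as simple as a lookup table. The registry below is hypothetical and only mirrors the source's split (Sonnet for decisions, Codex for implementation, GLM for search); the abstraction layer is what matters, since it lets a model swap happen behind `route` without disrupting task context.

```python
# Hypothetical skill registry: task-level tags mapped to models.
SKILL_TO_MODEL = {
    "decision": "sonnet-4.6",
    "implementation": "codex",
    "search": "glm-5",
}

def route(task: dict, default: str = "sonnet-4.6") -> str:
    """Pick a model from a task's skill tags; first matching tag wins."""
    for tag in task.get("skills", []):
        if tag in SKILL_TO_MODEL:
            return SKILL_TO_MODEL[tag]
    return default

task = {"goal": "add retry logic to the fetcher", "skills": ["implementation"]}
```

Tagging happens at task creation, before routing, so the skill metadata travels with the task rather than living in any one model's context.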

Documentation Quality Is Agent Performance Bottleneck

Practitioners are discovering that well-documented codebases dramatically improve agent output quality—not because agents 'read' docs, but because explicit context (docstrings, inline comments) clarifies intent agents would otherwise hallucinate. The question 'does documentation help LLMs?' is being actively tested.

Run controlled experiments in your codebase: compare agent output quality on well-documented vs undocumented modules. If documentation helps, invest in inline context density (docstrings, architecture comments) as agent performance optimization.
I want to see a benchmark on whether LLMs do better with documented code

Practitioner hypothesizes that explicit documentation (docstrings, inline comments) improves LLM reasoning by clarifying intent. Suggests treating documentation as AI interface, not just human interface. Core question: does richer explicit context = better agent reasoning?
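The controlled experiment the takeaway recommends needs only a thin harness to summarize. A sketch under stated assumptions: the arm names are placeholders, and `results` holds per-task pass/fail outcomes from the same agent on matched documented vs undocumented modules.

```python
def doc_ab_report(results: dict[str, list[bool]]) -> dict[str, float]:
    """Summarize a documentation A/B experiment as per-arm pass rates."""
    return {arm: sum(outcomes) / len(outcomes) for arm, outcomes in results.items()}

# Placeholder outcomes; real data comes from running the agent on both arms.
report = doc_ab_report({
    "documented": [True, True, True, False],
    "undocumented": [True, False, False, False],
})
```

A persistent gap between arms is the signal to invest in inline context density (docstrings, architecture comments) as an agent performance optimization.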

Voice Input Enables Context Density Without Typing Overhead

Practitioners using voice notes to interact with agents report being able to provide richer context faster than text, suggesting voice may be the preferred input modality for high-context instructions—not because of transcription quality, but because of reduced friction in expressing nuanced requirements.

Test voice input for complex multi-step instructions where typing would be slow. Measure whether richer context expression in voice improves agent output quality vs text-based prompts of equivalent length.
Voice note chats in Telegram with my self-developing AI agent

Mike Kelly reports using voice notes in Telegram to manage multiple parallel AI initiatives and self-improvement loops. Voice input lowers friction for providing unstructured context and task updates to agents working on themselves.