Brief #66

30 articles analyzed

Context engineering has matured from prompt optimization to supply chain security and architecture patterns. Practitioners are discovering that context boundaries—what gets preserved, shared, and executed—are now security attack surfaces, coordination mechanisms, and the primary design constraint outweighing model capability.

Markdown Documentation Becomes Executable Malware Vector

Agent ecosystems that consume skills/plugins from untrusted sources face supply chain attacks where documentation context (Markdown) doubles as code execution instructions, bypassing traditional security boundaries because agents treat installation steps as legitimate commands.

Audit any agent system consuming external skills/context: implement explicit permission boundaries separating documentation (read-only) from execution (requires confirmation). Never trust 'most downloaded' as a security signal. Require cryptographic signing for executable context.
@shao__meng: How dangerous can a single Markdown file be? A firsthand account of an Agent Skills supply chain attack

A 1Password VP documented actual malware in the agent skill ecosystem: Markdown installation instructions executed shell commands and downloaded crypto miners. The six-step attack chain exploited the fact that agents don't distinguish informational from executable context.
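The recommended permission boundary can be sketched as a gate that treats documentation and executable steps differently; `SkillAction`, `review_skill`, and the `confirm` callback are hypothetical names for illustration, not part of any real skill ecosystem:

```python
from dataclasses import dataclass

@dataclass
class SkillAction:
    kind: str     # "doc" = read-only documentation, "exec" = shell step
    payload: str  # rendered text or a command line

def review_skill(actions, confirm):
    """Gate skill-install steps: documentation is always safe to keep,
    but anything executable passes through a human-in-the-loop gate."""
    approved = []
    for action in actions:
        if action.kind == "doc":
            approved.append(action)        # read-only: never executed
        elif confirm(action.payload):      # explicit confirmation required
            approved.append(action)
    return approved
```

A cryptographic-signature check could slot in alongside `confirm` to satisfy the signing requirement for executable context.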

@Nick_Davidov: Asked Claude Cowork to organize my wife's desktop

Practitioner lost irreplaceable family photos when Claude Code executed filesystem operations without safety constraints. Demonstrates that capability without explicit destructive-operation boundaries creates catastrophic risk—same context boundary failure as Markdown malware.

@badlogicgames: If you give Gemini a tool called 'calculator' it will entirely ignore the input schema

LLMs ignoring tool schema context shows that context-boundary failures aren't only malicious: systems fail to preserve structural constraints across execution boundaries even in benign scenarios.
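A defense is to validate tool-call arguments in the harness rather than trusting the model to honor the schema. This minimal sketch assumes a schema expressed as parameter-name-to-type mappings; `validate_tool_call` and `calc_schema` are illustrative names:

```python
def validate_tool_call(schema, args):
    """Reject tool calls whose arguments don't match the declared
    input schema, instead of assuming the model respected it."""
    errors = []
    for name, expected in schema.items():
        if name not in args:
            errors.append(f"missing required parameter: {name}")
        elif not isinstance(args[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    for name in args:
        if name not in schema:
            errors.append(f"unexpected parameter: {name}")
    return errors

# Hypothetical schema for a 'calculator' tool
calc_schema = {"expression": str}
```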


Shared Context File Beats Agent-Specific Prompts

Multi-agent coordination scales better through a single shared context document (CLAUDE.md) that all agents reference, rather than duplicating system prompts per agent—small tweaks cascade improvements because intelligence compounds in one place.

Consolidate multi-agent system prompts into one shared context file all agents read. Make it version-controlled and observable. Test whether changes to shared context improve coordination faster than tuning individual agents.
@dani_avila7: Each agent shares the CLAUDE.md file as a common resource

Practitioner discovered centralized CLAUDE.md as coordination hub—'tweaks' to shared context improve collaboration across agents without per-agent reconfiguration.
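A minimal sketch of the single-shared-context pattern, assuming the harness reads the version-controlled CLAUDE.md once and prepends it to every agent's prompt; `build_system_prompt` is a hypothetical helper, not the practitioner's actual setup:

```python
def build_system_prompt(shared_context: str, agent_role: str) -> str:
    """Compose every agent's prompt from one shared context string
    (e.g. the contents of a version-controlled CLAUDE.md) plus the
    only per-agent variation: its role. Editing the shared file
    cascades to all agents without per-agent reconfiguration."""
    return f"{shared_context}\n\n## Role\nYou are the {agent_role} agent."
```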

Context Phase Decomposition Creates Depth, Not Model Quality

Deep research outputs require multi-phase context decomposition where each phase builds on prior outputs (source selection → analysis → synthesis) with explicit structural schemas—speed-optimized context produces shallow results regardless of model quality.

For complex tasks, decompose into explicit phases with intermediate outputs that become inputs to next phase. Define structural schemas for each output. Avoid tools that optimize for speed by skipping context—they trade depth for velocity.
@EXM7777: deep research is the most underused tool in AI right now

Practitioner workflow: context clarity (audience/problem) → source specificity → phase decomposition (each builds on last) → output schema = depth. Without phase context preservation, AI scrapes shallowly regardless of capability.
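The phase pipeline can be sketched as a chain where each phase consumes the prior phase's output and must satisfy a declared output schema; the phase names and the set-based schema check are illustrative assumptions, not the practitioner's actual tooling:

```python
def run_phases(task, phases):
    """Run phases in order. Each phase receives the accumulated
    context dict and must return a dict containing the keys its
    schema promises; a missing key fails fast instead of letting
    a later phase run on shallow context."""
    output = {"task": task}
    for name, fn, schema in phases:
        output = fn(output)
        missing = schema - output.keys()
        if missing:
            raise ValueError(f"{name} missing keys: {missing}")
    return output

# source selection -> analysis -> synthesis, each building on the last
phases = [
    ("select_sources", lambda ctx: {**ctx, "sources": ["a", "b"]}, {"sources"}),
    ("analyze", lambda ctx: {**ctx, "claims": ["x"]}, {"claims"}),
    ("synthesize", lambda ctx: {**ctx, "report": "..."}, {"report"}),
]
```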

Humans Stop Reading Code They're Shipping

Practitioners are shipping production codebases they've never read, relying entirely on AI-maintained context and artifact inspection—this works for greenfield single-contributor projects but creates invisible knowledge debt and single-point-of-failure risk.

If adopting this pattern: (1) constrain to greenfield solo projects, (2) maintain test/scenario documentation as primary knowledge artifact, (3) explicitly plan context transfer strategy for handoffs, (4) recognize this inverts traditional code review and creates AI-as-single-point-of-failure risk.
@trq212: I now have several useful codebases that I literally have not read

Practitioner admits shipping multiple useful codebases without reading any code—interaction limited to artifacts. Works because: greenfield (clear problem), sole contributor (no coordination), AI maintains context. Feels 'alien' but functional.

Multi-Agent Orchestration Creates Context Waste, Not Intelligence

Practitioners are abandoning multi-agent patterns because agent switching forces context reloading, duplicated reads, and wasted tokens—for linear workflows, single-threaded context preservation outperforms orchestration complexity.

Default to single-agent for linear workflows. Only introduce multi-agent when: (1) task parallelism provides measurable benefit, (2) you can externalize coordination to observable layer (channels/state management), (3) context boundaries are explicit and necessary. Prototype and measure token consumption before committing to multi-agent architecture.
@badlogicgames: and now you know why pi doesn't have subagents built-in

Experienced engineer (libGDX creator) reports multi-agent delegation created redundant operations: context switching, file re-reading after context clears, duplicated processing. Linear feature implementation didn't need orchestration—needed single-threaded context.
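Measuring the token waste before committing to multi-agent can be as simple as a per-operation ledger run against both variants of the same workflow; `TokenLedger` is a hypothetical sketch, with whitespace-split word counts standing in for a real tokenizer:

```python
from collections import Counter

class TokenLedger:
    """Attribute token consumption to operations so a single-agent
    and a multi-agent run of the same workflow can be compared.
    Duplicated file reads after context clears show up as repeated
    charges against the same operation."""
    def __init__(self):
        self.by_op = Counter()

    def record(self, op: str, text: str):
        self.by_op[op] += len(text.split())  # crude token proxy

    def total(self) -> int:
        return sum(self.by_op.values())
```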

Context Window Degradation Breaks Agent Loops Mid-Task

Local models in agentic workflows fail not from reasoning quality but from cache-clearing when context fills—agents enter loops instead of progressing, revealing context budget management as distinct failure mode from hallucination.

When evaluating models for agentic work, benchmark context window behavior under extended tasks: measure whether agents complete multi-step workflows or degrade into loops as context fills. Test cache clearing behavior. Track token consumption per task completion, not just per-token pricing.
@slow_developer: glm 4.7 flash is a really underrated local model for agentic work

Practitioner testing local models found glm 4.7 doesn't hallucinate tool calls but degrades mid-task: as context fills and cache clears, agent loops instead of completing. This is context budget failure, not reasoning failure.
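A loop-versus-progress check for such benchmarks might look like this sketch, which flags an agent whose recent window of actions repeats back-to-back instead of advancing; `detect_loop` and its thresholds are illustrative assumptions:

```python
def detect_loop(actions, window=3, repeats=2):
    """Return True if the last `window` actions have repeated
    `repeats` times in a row—the degradation mode where an agent
    cycles instead of completing after cache clears."""
    need = window * repeats
    if len(actions) < need:
        return False
    tail = actions[-need:]
    return all(tail[i] == tail[i + window]
               for i in range(window * (repeats - 1)))
```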

Agents Should Pull Context Not Receive It

Inversion-of-control pattern from software engineering applies to context: instead of harnesses pushing context to agents, agents should decide what context they need and pull it—this enables organic accumulation and better retrieval strategies across turns.

Redesign agent harnesses to expose context as queryable resources (APIs, vector stores, function libraries) rather than pre-loaded prompts. Let agents decide what context to retrieve based on their reasoning. Measure whether agents improve retrieval strategies across iterations.
@irl_danB: the first time I read about inversion-of-control early in my career

AI researcher maps inversion-of-control to agent context: current harnesses push context to agents (like hardcoded dependencies), but better pattern is agents pull what they need (like dependency injection). RLM experiments show agents build better retrieval loops when controlling their own context decisions.
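The pull pattern can be sketched as context exposed behind query/fetch calls the agent invokes as tools, instead of being pre-loaded into the prompt; `ContextStore` is hypothetical, with substring matching standing in for real retrieval:

```python
class ContextStore:
    """Context as a queryable resource: the agent decides what it
    needs and pulls it, analogous to dependency injection versus
    hardcoded dependencies."""
    def __init__(self, documents: dict[str, str]):
        self.documents = documents

    def query(self, needle: str) -> list[str]:
        """Return names of documents matching the agent's query."""
        return [name for name, text in self.documents.items()
                if needle.lower() in text.lower()]

    def fetch(self, name: str) -> str:
        """Pull one document's full text into the agent's context."""
        return self.documents[name]
```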

Token Consumption Not Price Determines Real Cost

Per-token pricing is misleading because models consume vastly different token volumes on identical tasks—teams must benchmark actual token usage on their workload or risk 3x cost surprises despite identical sticker prices.

Before deploying a model: benchmark total token consumption (input + output) on representative tasks from YOUR workload. Track tokens-per-successful-completion, not just per-token price. Models that loop, retry, or use verbose reasoning patterns will cost more than sticker price suggests.
@charlespacker: per-token cost is a small part of the overall cost story

Context Bench leaderboard operator reports Opus 4.6 costs 3x more than 4.5 despite same per-token price—because it uses 3x tokens on code tasks. Total cost = unit price × consumption. Consumption varies by model AND task type.
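Tokens-per-successful-completion can be sketched as below; `cost_per_completion`, the run tuples, and the price figure are illustrative assumptions, not Context Bench's methodology:

```python
def cost_per_completion(runs, price_per_mtok):
    """Real cost = unit price x consumption. `runs` is a list of
    (input_tokens, output_tokens, succeeded) per attempt; failed
    attempts still burn tokens, so they inflate the cost of each
    success rather than disappearing."""
    total_tokens = sum(i + o for i, o, _ in runs)
    successes = sum(1 for _, _, ok in runs if ok)
    if successes == 0:
        return float("inf")
    return (total_tokens / 1_000_000) * price_per_mtok / successes
```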