Brief #93
Practitioners are discovering that the context-engineering bottleneck has shifted from 'what the model can do' to 'what context the system preserves across boundaries': session boundaries (compaction destroying accumulated intelligence), organizational boundaries (agents breaking infrastructure), and maintainer boundaries (AI-generated code creating tech debt). The sharpest signal: context-gathering ability now differentiates model versions more than raw capability does.
Context Gathering Separates Model Versions More Than Benchmarks
The same model family shows stark performance differences based solely on each version's ability to identify the context it needs on its own. This capability, not speed and not reasoning, determines production viability for complex engineering work.
A practitioner found Opus 4.5 and 4.6 differ dramatically at context gathering on identical workloads; model selection now hinges on this, not reasoning capability
The author reframes evaluation from speed to 'task-completion horizon': how long a model maintains context coherence before requiring a debugging loop
To fix hallucinations, practitioners must trace from output back to source data; models that gather the correct context automatically avoid this debugging tax entirely
Invisible Context Compaction Destroys Compound Intelligence
Automatic context pruning without user visibility breaks the fundamental promise of session-based intelligence. Users can't debug 'AI forgetting' when the system silently destroys context.
A user reports Codex compacting context automatically 'without doing anything'; losing visibility into what is preserved destroys the ability to maintain coherence
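One mitigation for this failure mode is to make compaction an explicit, logged event rather than a silent one. A minimal sketch, assuming a simple session wrapper; the `VisibleSession` class and the characters-per-token heuristic are illustrative, not any real Codex API:

```python
# Sketch: a session wrapper that compacts context explicitly and records
# what was dropped, so "AI forgetting" stays debuggable.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token (an assumption, not a real tokenizer).
    return max(1, len(text) // 4)

class VisibleSession:
    def __init__(self, budget_tokens: int = 1000):
        self.budget = budget_tokens
        self.messages: list[str] = []
        self.compaction_log: list[str] = []  # audit trail of dropped context

    def add(self, message: str) -> None:
        self.messages.append(message)
        self._compact_if_needed()

    def _compact_if_needed(self) -> None:
        # Drop oldest messages first, but never silently: log every drop.
        while sum(estimate_tokens(m) for m in self.messages) > self.budget:
            dropped = self.messages.pop(0)
            self.compaction_log.append(dropped)

session = VisibleSession(budget_tokens=10)
session.add("decided to use Postgres for the job queue")
session.add("migration 0042 renames the users table")
session.add("deploy window is Friday 14:00 UTC")
# The user can inspect session.compaction_log to see exactly what was pruned.
print(session.compaction_log)
```

The point is not the eviction policy (oldest-first is naive) but the audit trail: coherence failures become diagnosable when the pruned context is visible.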
Multi-Agent Token Costs Compound Super-Linearly Through Context Overlap
Parallel agent execution doesn't scale linearly: token usage explodes because agents redundantly process overlapping context. Cost control requires profiling token spend per agent role, not just per task.
A practitioner found parallel agents cause token costs to 'compound'; switching models (Claude → GLM-5) became the lever for controlling spend
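Per-role profiling can be a thin accounting layer around each agent call. A sketch with hypothetical role names and token counts; in practice the numbers would come from your provider's usage metadata:

```python
from collections import defaultdict

class TokenProfiler:
    """Accumulate token spend per agent role, not just per task."""
    def __init__(self):
        self.by_role = defaultdict(int)

    def record(self, role: str, prompt_tokens: int, completion_tokens: int) -> None:
        self.by_role[role] += prompt_tokens + completion_tokens

    def report(self) -> dict:
        # Roles sorted by spend, highest first.
        return dict(sorted(self.by_role.items(), key=lambda kv: -kv[1]))

profiler = TokenProfiler()
# Hypothetical numbers: three parallel agents, two re-reading the same repo context.
profiler.record("planner", prompt_tokens=4_000, completion_tokens=500)
profiler.record("coder", prompt_tokens=12_000, completion_tokens=2_000)
profiler.record("reviewer", prompt_tokens=11_000, completion_tokens=800)
print(profiler.report())
```

In this toy breakdown the coder and reviewer dominate because each redundantly carries the shared repo context in its prompt; deduplicating that shared prefix, or moving those roles to a cheaper model, is where the savings are.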
Evaluation Context Framing Changes Agent Behavior More Than Instructions
Telling agents they are being tested, with explicit evaluation criteria (quality plus efficiency), changes behavior in ways task instructions alone do not: agents optimize for stated meta-goals when the evaluation framework is visible.
A practitioner observed improved performance after providing meta-context about evaluation criteria; the agent internalized goals beyond the task instructions
Human Review Gates Are Context Architecture for Risk Awareness
Organizations respond to AI infrastructure failures by adding human approval layers. This is actually a context-engineering pattern: injecting the decision-severity awareness the AI lacks.
An AWS AI tool lacked context about decision severity (it was modifying production infrastructure). The organizational fix: a human gate that carries the 'this matters' context
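The gate pattern can be sketched as a severity-aware dispatcher: routine actions run automatically, while anything touching production requires an explicit human decision. The severity taxonomy and function names below are hypothetical:

```python
from enum import Enum

class Severity(Enum):
    ROUTINE = 1     # e.g. read-only queries, dry runs
    PRODUCTION = 2  # e.g. modifying live infrastructure

def execute_action(action: str, severity: Severity, approve) -> str:
    """Run routine actions directly; gate production-severity actions
    behind a human approval callback that carries 'this matters' context."""
    if severity is Severity.PRODUCTION:
        if not approve(f"PRODUCTION change requested: {action}. Proceed?"):
            return "blocked by human gate"
    return f"executed: {action}"

# A real system would page an operator; here the callback is stubbed to deny.
always_deny = lambda msg: False
print(execute_action("terminate instance i-abc123", Severity.PRODUCTION, always_deny))
```

The design choice worth noting: the gate lives in the dispatch layer, not in the model prompt, so severity awareness cannot be lost to compaction or prompt drift.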
Retrieval Plus Hierarchy Elevates Weak Models to Strong-Model Performance
9B-parameter models with retrieval plus hierarchical planning match larger models on complex tasks. The better the context engineering, the smaller the model a task requires.
Offline deployment plus weak-model constraints were solved with a hierarchical-planning-plus-retrieval architecture; context design lifted the capability ceiling without a larger model
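The shape of that architecture: a planner decomposes the task into sub-goals, and each sub-goal retrieves only the context it needs, so a small model never holds the whole problem at once. A toy sketch, with a keyword-overlap retriever standing in for a real embedding index and a stubbed planner standing in for the model:

```python
# Toy sketch: hierarchical planning + retrieval. A weak model handles each
# sub-goal with only its retrieved snippet in context, not everything at once.

KNOWLEDGE = {
    "auth": "Sessions are stored in Redis with a 30 minute TTL",
    "billing": "Invoices are generated nightly by the cron worker",
    "deploy": "Rollbacks use the previous tagged container image",
}

def retrieve(query: str) -> str:
    """Keyword-overlap retrieval; a real system would use an embedding index."""
    words = set(query.lower().split())
    scores = {k: len(words & set(v.lower().split())) for k, v in KNOWLEDGE.items()}
    return KNOWLEDGE[max(scores, key=scores.get)]

def plan(task: str) -> list:
    """Planner stub: decompose the task into ordered sub-goals."""
    return [f"inspect: {task}", f"modify: {task}", f"verify: {task}"]

def solve(task: str) -> list:
    # Each sub-goal gets only its own retrieved context, keeping the
    # per-step window small enough for a 9B-class model.
    return [(step, retrieve(step)) for step in plan(task)]

for step, context in solve("fix the sessions TTL in Redis"):
    print(step, "->", context)
```

The capability lift comes from the decomposition, not the retriever: each step is a small, well-scoped prompt, which is exactly the regime where weak models stay coherent.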