Brief #115
MCP's security model is collapsing under real-world pressure while practitioners discover that context preservation requires architectural discipline, not framework magic—the bottleneck has shifted from protocol design to operational hardening and multi-agent state management.
Tool Definition Bloat Consumes 50% of Context
EXTENDS context-window-management — confirms context limits remain the constraint, but reveals MCP specifically creates a new bloat vector.
MCP's eager-loading architecture burns 55K-134K tokens on tool definitions before any work begins. The 'universal protocol' dream created a context-engineering nightmare that requires lazy-loading workarounds.
5 servers with 58 tools = 55K tokens upfront. Context inspection via /context revealed a tax that was otherwise invisible to practitioners.
Tool search introduced as fix—lazy loading pattern required because preloading exhausted context budget
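The lazy-loading fix can be sketched in a few lines: instead of injecting every tool's full JSON schema upfront, the agent sees only a one-line summary per tool, plus a meta-tool that returns full schemas on demand. All names here (TOOLS, search_tools) are illustrative, not any MCP SDK's actual API.

```python
import json

# Hypothetical registry: full JSON schemas stay out of the prompt;
# only a one-line summary per tool is eagerly loaded.
TOOLS = {
    "create_issue": {
        "summary": "Create a ticket in the issue tracker",
        "schema": {"type": "object", "properties": {"title": {"type": "string"},
                                                    "body": {"type": "string"}}},
    },
    "query_db": {
        "summary": "Run a read-only SQL query",
        "schema": {"type": "object", "properties": {"sql": {"type": "string"}}},
    },
}

def eager_manifest() -> str:
    """What the model sees upfront: summaries only, not full schemas."""
    return "\n".join(f"{name}: {t['summary']}" for name, t in TOOLS.items())

def search_tools(query: str) -> list[dict]:
    """Meta-tool: return full schemas only for tools matching the query."""
    q = query.lower()
    return [{"name": n, "schema": t["schema"]}
            for n, t in TOOLS.items()
            if q in n or q in t["summary"].lower()]

full = sum(len(json.dumps(t["schema"])) for t in TOOLS.values())
lazy = len(eager_manifest())
print(f"eager schema bytes: {full}, lazy manifest bytes: {lazy}")
```

With realistic servers the gap is far larger than this toy registry shows: dozens of multi-kilobyte schemas collapse to a few hundred bytes of manifest.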
MCP Security Model Fails at Production Scale
MCP servers are shipping with no authentication by default, creating a massive attack surface. 1,000+ publicly exposed servers with zero auth controls, plus agent skills that bypass MCP entirely via shell execution, reveal that the protocol's security assumptions don't survive contact with reality.
1,000+ MCP servers exposed publicly with no authorization—protocol designed for local use is being deployed without security controls
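The minimum viable fix for a network-exposed server is a pre-shared bearer token checked before any request is handled. This is a hedged, generic sketch (the handler and env var names are invented for illustration), not the MCP SDK's actual auth API.

```python
import hmac
import os

# Hypothetical gateway check: every request to a network-exposed MCP
# server must carry a pre-shared bearer token. Illustrative only.
EXPECTED = os.environ.get("MCP_TOKEN", "change-me")

def authorized(headers: dict[str, str]) -> bool:
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return False
    # constant-time compare to avoid timing side channels
    return hmac.compare_digest(auth.removeprefix("Bearer "), EXPECTED)

def handle(headers: dict[str, str], body: dict) -> dict:
    """Reject unauthenticated requests before touching any tool logic."""
    if not authorized(headers):
        return {"error": "unauthorized", "code": 401}
    return {"result": "ok"}
```

Note this guards only the MCP transport; the agent-skills path via shell execution bypasses it entirely, which is the deeper problem the item above flags.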
Claude Thinking Budget Cuts Break Context Conventions
Reducing Claude's internal reasoning tokens causes it to ignore CLAUDE.md conventions and burn tokens on retry loops. The bottleneck isn't prompt engineering—it's whether the model has budget to cross-reference its own constraints.
Quantization test shows Claude 4.6 ignores CLAUDE.md conventions, creates retry loops, burns tokens when thinking budget is reduced
Multi-Agent Systems Require Governance Layers, Not Just Orchestration
Multi-agent coordination fails without formal consensus protocols and auditability. The bottleneck in regulated/production environments isn't agent capability—it's provable governance over coordination state.
Academic research shows regulated domains require formal coordination protocols with auditable consensus, not informal emergence
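What "auditable consensus" means concretely can be sketched as a quorum vote whose outcome lands in a hash-chained, append-only log. This is a minimal illustration of the governance-layer idea, not any specific framework from the research.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field

@dataclass
class ConsensusLog:
    """Append-only, hash-chained audit log of coordination decisions.
    Illustrative sketch: each decision records the votes, the outcome,
    and the previous entry's hash, so the history is tamper-evident."""
    entries: list[dict] = field(default_factory=list)

    def decide(self, proposal: str, votes: dict[str, bool],
               quorum: float = 0.66) -> bool:
        approved = sum(votes.values()) / len(votes) >= quorum
        prev = self.entries[-1]["hash"] if self.entries else "genesis"
        record = {"proposal": proposal, "votes": votes,
                  "approved": approved, "prev": prev, "ts": time.time()}
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.entries.append(record)
        return approved

log = ConsensusLog()
ok = log.decide("deploy_model_v2", {"planner": True, "critic": True, "risk": False})
print(ok, len(log.entries))
```

The point is not the crypto; it is that coordination state becomes provable after the fact, which informal emergence never gives you.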
Agent Self-Forking Reveals Emergent Context Branching
Agents are autonomously creating execution forks to explore parallel reasoning paths. This isn't planned multi-agent architecture—it's agents meta-reasoning about their own context management needs.
Practitioner observed agents spontaneously forking execution context to handle parallel reasoning—emergent behavior, not designed feature
Attention-Weighted KV Cache Compression for Hierarchical Agents
Multi-agent orchestrators accumulate rich reasoning trajectories but can't efficiently transmit context to workers. Attention matching identifies relevant trajectory segments for selective KV cache compression—preserving orchestrator intelligence without token explosion.
Use worker's attention patterns on historical reasoning to compress only relevant context into KV cache—solves hierarchical delegation bottleneck
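The attention-matching step reduces to: score each orchestrator trajectory segment by the worker query's attention to its key vector, then keep only the top-weighted segments. A hedged toy sketch of the idea (deterministic one-hot keys for clarity), not a specific published algorithm.

```python
import numpy as np

def select_segments(worker_q: np.ndarray, segment_keys: np.ndarray,
                    keep: int) -> list[int]:
    """Score each historical trajectory segment by the worker's attention
    to it (dot product of the worker's query with each segment's key),
    then retain only the top-`keep` segments for the compressed KV cache."""
    scores = segment_keys @ worker_q                 # one score per segment
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                # softmax attention weights
    kept = np.argsort(weights)[-keep:]               # highest-weight segments
    return sorted(int(i) for i in kept)              # preserve trajectory order

# 10 segments with one-hot key vectors; the worker's query attends mostly
# to segment 3 and somewhat to segment 7 (toy, deterministic setup)
keys = np.eye(10, 16)
q = 2.0 * keys[3] + 1.0 * keys[7]
print(select_segments(q, keys, keep=2))
```

In a real system the keys would be the orchestrator's cached KV entries and the query a projection of the worker's task, but the selection logic is the same: transmit only what the worker will actually attend to.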
LLM-Maintained Knowledge Graphs Scale Without RAG
Practitioners report LLMs autonomously maintaining structured wikis with backlinks and indices at 400K+ word scale. Intelligence compounds when outputs feed back into the knowledge base—context engineering through architecture, not retrieval.
LLM maintains wiki structure with auto-indexing and backlinks. Query outputs become KB inputs, creating compounding loop at 400K words without RAG.
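The compounding loop can be sketched with two mechanisms: pages that link via [[wikilinks]], and a backlink index rebuilt on every write, so the structure itself is the retrieval layer. A minimal illustration of the pattern, assuming nothing about any particular tool.

```python
import re
from collections import defaultdict

LINK = re.compile(r"\[\[([^\]]+)\]\]")

class Wiki:
    """Minimal sketch of an LLM-maintained wiki: pages link with
    [[wikilinks]]; backlinks are re-indexed on every write, so answering
    a query means loading a page plus its neighbours (no embeddings)."""
    def __init__(self) -> None:
        self.pages: dict[str, str] = {}
        self.backlinks: dict[str, set[str]] = defaultdict(set)

    def write(self, title: str, body: str) -> None:
        self.pages[title] = body
        for target in LINK.findall(body):
            self.backlinks[target].add(title)

    def context_for(self, title: str) -> list[str]:
        """Pages to load when answering about `title`: the page itself
        plus everything that links to it."""
        return sorted(self.backlinks[title] | ({title} & self.pages.keys()))

w = Wiki()
w.write("MCP", "A protocol for tool access. See [[Security]].")
w.write("Security", "Auth gaps in [[MCP]] deployments.")
# a query answer is written back, compounding the knowledge base
w.write("Brief-115", "Notes on [[MCP]] and [[Security]].")
print(w.context_for("MCP"))
```

The compounding effect is the last write: each answer becomes a page, which adds backlinks, which enriches the context for the next query.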
Notebook-to-Production Gap is Context Architecture Gap
AI prototypes fail in production not due to model limits but because notebook context (cell-dependent, manual, environment-specific) doesn't translate. Deployment requires explicit context preservation through logging, config, and error handling.
Practitioners can prototype but fail to deploy. Gap is not AI capability—it's missing context architecture (logging, error handling, config management).
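The missing context architecture is mundane and mechanical: config passed explicitly instead of living in notebook globals, inputs logged, failures caught and retried. A hedged sketch of one pipeline step with those three things made explicit (all names illustrative).

```python
import json
import logging
import os
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

@dataclass(frozen=True)
class Config:
    """Explicit config instead of notebook globals. Names are illustrative."""
    model: str = os.environ.get("MODEL_NAME", "default-model")
    max_retries: int = int(os.environ.get("MAX_RETRIES", "2"))

def run_step(cfg: Config, payload: dict) -> dict:
    """One pipeline step carrying the context a notebook cell leaves
    implicit: config passed in, inputs logged, failures surfaced."""
    log.info("step start model=%s payload=%s", cfg.model, json.dumps(payload))
    for attempt in range(cfg.max_retries + 1):
        try:
            if "text" not in payload:
                raise ValueError("missing 'text' field")
            result = {"summary": payload["text"][:40], "model": cfg.model}
            log.info("step ok attempt=%d", attempt)
            return result
        except ValueError as exc:
            log.warning("attempt %d failed: %s", attempt, exc)
    return {"error": "step failed after retries"}

out = run_step(Config(), {"text": "hello world"})
print(out["summary"])
```

None of this is AI-specific, which is the point: the gap is ordinary software engineering, not model capability.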
BM25 with Re-Ranking Beats Neural Retrieval
Properly tuned lexical retrieval (BM25) outperforms dense embeddings for RAG context retrieval. Re-ranking layers provide cheap, effective context refinement. The bottleneck isn't sophisticated neural methods—it's appropriate baseline tuning.
Research shows BM25 with proper setup beats BERT-based retrievers. Re-ranking is highly effective additional layer. Simpler methods work.
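The two-stage pattern fits in one small sketch: plain Okapi BM25 as the first stage, then a cheap re-rank over the short list. The re-ranker here is a trivial exact-phrase boost standing in for a cross-encoder; k1 and b are the usual BM25 tuning knobs.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Plain Okapi BM25 over whitespace-tokenised docs. Minimal sketch
    of the 'properly tuned lexical baseline'."""
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter()
    for t in toks:
        df.update(set(t))                 # document frequency per term
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(t) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[int]:
    """First stage: BM25. Second stage: promote short-listed docs that
    contain the exact query phrase (stand-in for a cross-encoder)."""
    scores = bm25_scores(query, docs)
    head = sorted(range(len(docs)), key=lambda i: -scores[i])[:top_k]
    return sorted(head, key=lambda i: query.lower() not in docs[i].lower())

DOCS = [
    "the cat sat on the mat",
    "lexical retrieval with bm25 works well",
    "dense retrieval needs a gpu",
    "bm25 retrieval baseline",
]
print(retrieve("bm25 retrieval", DOCS))
```

Swapping the phrase boost for a real cross-encoder keeps the same shape: lexical recall first, expensive precision only over the handful of survivors.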