
Brief #107

29 articles analyzed

Agent reliability bottlenecks shifted from model capability to architectural discipline. Practitioners discovered that state management, context portability, and security boundaries matter more than framework features—but vendor tooling still optimizes for the wrong problems.

Agent Skills Supply Chain Is Already Compromised

CONTRADICTS model-context-protocol — baseline assumes MCP provides security guarantees; skills bypass MCP entirely

MCP standardization created an attack surface before security primitives existed. Skills bypass MCP boundaries entirely via embedded shell commands, making tool directories a vector for arbitrary code execution.

Audit every skill in your agent directories for embedded shell commands or bundled executables. Treat skill imports like dependency pulls—verify source, inspect contents, sandbox execution. Don't trust skill marketplaces.
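A minimal audit sketch, assuming skills live as per-directory Markdown files plus bundled scripts (the directory layout and detection patterns here are illustrative, not a complete scanner):

```python
import re
from pathlib import Path

# Patterns that suggest embedded execution in a skill's Markdown:
# fenced shell blocks and commands commonly used to fetch/run payloads.
SHELL_FENCE = re.compile(r"```(?:bash|sh|shell|zsh)\b", re.IGNORECASE)
SUSPICIOUS = re.compile(r"\b(curl|wget|chmod \+x|eval|base64 -d)\b")

def audit_skill_dir(skill_dir):
    """Return a list of (path, reason) findings for one skill directory."""
    findings = []
    for path in Path(skill_dir).rglob("*"):
        if path.suffix in {".sh", ".exe", ".bin"}:
            findings.append((str(path), "bundled executable"))
        elif path.suffix == ".md":
            text = path.read_text(errors="ignore")
            if SHELL_FENCE.search(text):
                findings.append((str(path), "embedded shell block"))
            if SUSPICIOUS.search(text):
                findings.append((str(path), "suspicious command"))
    return findings
```

Run it over every imported skill directory before the agent can load it; a non-empty result means manual review, not automatic rejection.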
@shao__meng: How dangerous can a single Markdown file be? A firsthand account of an Agent Skills supply chain attack. Are your Agent Skills actually safe?

Demonstrated that Agent Skills bypass MCP tool-calling boundaries completely—Markdown files can contain direct shell commands, bundled scripts, hardcoded credentials without any MCP mediation

Design patterns for securing LLM agents - cusy

Context-Minimization pattern shows architectural constraints prevent injection cascades—but only if tools are isolated. Skills that bundle execution bypass these protections

Claude Code Flaws Allow Remote Code Execution and API Key Exfiltration

Repository-defined .mcp.json configurations can be exploited to override MCP servers and exfiltrate credentials—attack vector enabled by trusting skill definitions


Context Portability Breaks on Harness-Specific Tool Design

CONTRADICTS tool-integration-patterns — baseline assumes MCP standardizes tools; reality shows tools remain tightly coupled to execution context

Shared skill directories assume tools are context-independent, but most tools embed implicit harness dependencies (CLI paths, prompts referencing specific frameworks) that break when moved between agents.

Design skills with explicit context manifests: declare required environment variables, expected file paths, and harness assumptions. Test portability by running each skill in a second harness before sharing it.
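A sketch of what such a manifest and pre-load check could look like; the manifest schema and field names are hypothetical, not an existing standard:

```python
import os
from pathlib import Path

# Hypothetical manifest: each skill declares its implicit dependencies
# so a second harness can check them before loading, instead of the
# skill failing silently at runtime.
EXAMPLE_MANIFEST = {
    "name": "deploy-helper",
    "required_env": ["API_TOKEN"],
    "required_paths": ["/usr/local/bin/kubectl"],  # harness-specific CLI
    "harness_assumptions": ["supports tool-use messages"],
}

def check_portability(manifest, env=None):
    """Return the list of unmet requirements for this environment."""
    env = os.environ if env is None else env
    problems = []
    for var in manifest.get("required_env", []):
        if var not in env:
            problems.append(f"missing env var: {var}")
    for p in manifest.get("required_paths", []):
        if not Path(p).exists():
            problems.append(f"missing path: {p}")
    return problems
```

A non-empty result from `check_portability` is exactly the "silent failure" surfaced up front: the second harness knows what is missing before the skill ever runs.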
@sarahwooders: Common misconception: skill portability across agent harnesses

Identified three types of implicit context coupling: harness-specific CLI tools, prompts that reference the harness, hardcoded paths/ports. Skills designed for one harness fail silently when moved.

Explicit State Files Prevent Context Drift Disasters

EXTENDS state-management — baseline mentions problem; this provides the explicit solution pattern

Agent frameworks that rely on implicit state tracking (conversation history, cached assumptions) accumulate technical debt that manifests as catastrophic failures. Explicit state files—tracked in version control—are the only reliable context preservation mechanism.

Implement .state.json files for every multi-session agent project. Document existing resources, configuration assumptions, deployment topology. Version control state files alongside code. Never deploy agents without state context.
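A minimal sketch of the check-before-create discipline, assuming a flat `.state.json` layout (the field names are illustrative):

```python
import json
from pathlib import Path

STATE_FILE = ".state.json"

def load_state(project_dir):
    """Read the ground truth for 'what exists'; empty state if new."""
    path = Path(project_dir) / STATE_FILE
    if path.exists():
        return json.loads(path.read_text())
    return {"resources": {}, "assumptions": [], "topology": {}}

def record_resource(project_dir, name, details):
    """Update the state file whenever the agent creates something."""
    state = load_state(project_dir)
    state["resources"][name] = details
    (Path(project_dir) / STATE_FILE).write_text(json.dumps(state, indent=2))
    return state

def exists(project_dir, name):
    """Check before creating -- the duplicate-resource guard."""
    return name in load_state(project_dir)["resources"]
```

Because the file lives alongside the code, version control gives you a diffable history of every resource the agent believes it owns.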
Claude Code deletes developers' production setup, including its database and snapshots

A missing state file caused Claude to create duplicate resources, destroying production. The disaster resulted directly from implicit state management: no ground truth for 'what exists.'

Demand-Driven Tool Discovery Beats Context Stuffing

EXTENDS context-window-management — baseline identifies context bloat problem; this shows practical solution

Loading full API specs (2.3M tokens) into context destroys agent performance. Successful systems query tool catalogs on-demand when agents need capabilities, not upfront.

Build tool registries with semantic search, not static imports. Let agents query 'what tools help with X' rather than seeing all tools. Measure context token usage—if >30% is tool definitions, implement lazy loading.
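A toy sketch of demand-driven discovery. Keyword overlap stands in for the semantic search a real registry would use (e.g. embeddings); the catalog entries are invented:

```python
# Lazy tool registry: agents query by need instead of loading every
# definition up front.
TOOL_CATALOG = {  # hypothetical tool descriptions
    "dns_update": "update DNS records for a zone",
    "worker_deploy": "deploy a serverless worker script",
    "cache_purge": "purge cached assets from the edge cache",
}

def find_tools(query, catalog=TOOL_CATALOG, top_k=2):
    """Return tool names ranked by word overlap with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(desc.lower().split())), name)
              for name, desc in catalog.items()]
    scored = [(s, n) for s, n in scored if s > 0]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]
```

Only the matching definitions then get injected into context, so a 2.3M-token catalog costs a few hundred tokens per task instead of the whole spec.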
@NodeCongress: At @CloudflareDev their OpenAPI spec is 2.3M+ tokens

Cloudflare's 2.3M token OpenAPI spec can't fit in context. Solution: agents query available tools when needed rather than loading full catalog upfront.

RAG Lost to Pre-Indexed Integrations

Retrieval pipelines proved less effective than pre-indexed connectors (Microsoft 365, filesystem abstractions) because chunk-based retrieval loses document structure that users actually see.

Evaluate whether you actually need RAG or just need structured access. If documents have clear hierarchy (docs, wikis, codebases), try filesystem abstractions or pre-indexed connectors before building retrieval pipelines.
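A sketch of the filesystem-abstraction alternative, assuming docs are Markdown files in a directory tree (two list/read primitives, labeled illustrative, are often all an agent needs):

```python
from pathlib import Path

def list_sections(doc_root):
    """Expose the document tree the way users see it (dirs = sections),
    instead of flattening everything into retrieval chunks."""
    root = Path(doc_root)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.md"))

def read_section(doc_root, rel_path):
    """Fetch one whole section on demand, structure intact."""
    return (Path(doc_root) / rel_path).read_text()
```

The agent navigates the hierarchy (list, then read) rather than receiving top-k chunks with the surrounding structure stripped away.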
@emollick: The RAG era was short-lived, but intense

Mollick declares the RAG era over, with Microsoft 365 connectors now on every Claude plan. Signals a shift from retrieve-on-demand to pre-indexed access.

Task Complexity Classification Determines Agent Strategy

EXTENDS multi-agent-orchestration — baseline covers coordination mechanisms; this adds task-complexity-driven selection logic

Agents fail when practitioners use the same interaction strategy (planning mode, one-shot) for all tasks. Order-0 problems need one-shot, order-1 needs planning, order-n needs recursive decomposition—clarity about task order is the bottleneck.

Before spawning agents, classify task complexity: trivial (one API call) → one-shot. Moderate (3-5 steps, known path) → planning mode. Complex (unknown decomposition) → recursive multi-agent. Don't over-engineer simple tasks.
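The triage above can be sketched as a dispatch function; the inputs and the 5-step threshold are illustrative assumptions, not from the source:

```python
def classify_task(steps_known: bool, step_count: int) -> str:
    """Map the order-0/1/n taxonomy to an interaction strategy.

    steps_known: whether the decomposition path is known in advance.
    step_count: estimated number of steps (only meaningful if known).
    """
    if steps_known and step_count <= 1:
        return "one-shot"             # order-0: a single known action
    if steps_known:
        return "planning"             # order-1: known path, several steps
    return "recursive-decomposition"  # order-n: decomposition unknown
```

Running the classifier before spawning anything is the point: the trivial case never pays the token cost of planning mode.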
@dexhorthy: this guy gets it

A practitioner discovered that using planning mode for simple problems wastes tokens. Task complexity (order-0, order-1, order-n) determines the optimal context strategy.

LLM-Maintained Context Indices Beat External Search

Using LLMs to maintain summary indices and metadata files (rather than external vector stores) enables knowledge compounding—wiki grows smarter across sessions because the LLM both consumes and produces context.

For knowledge-work domains (research, documentation), let LLMs maintain summary files and indices alongside raw data. Closed-loop: LLM reads raw → writes summaries → queries summaries → discovers gaps → updates summaries. This compounds rather than retrieves.
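The closed loop can be sketched as follows; the first-sentence `summarize` is a stand-in for a model call, and the whole structure is illustrative:

```python
def summarize(text):
    """Stand-in for an LLM call: first sentence as the summary."""
    return text.split(".")[0]

class Wiki:
    def __init__(self):
        self.raw = {}    # raw/ documents
        self.index = {}  # LLM-maintained summaries

    def add(self, name, text):
        self.raw[name] = text
        self.index[name] = summarize(text)  # write summary on ingest

    def query(self, word):
        # Search the compact index, not the raw corpus
        return [n for n, s in self.index.items() if word in s]

    def refresh(self, name):
        # Gap discovered -> re-summarize; knowledge compounds per session
        self.index[name] = summarize(self.raw[name])
```

The loop (read raw, write summaries, query summaries, refresh on gaps) is what makes the wiki compound across sessions instead of resetting.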
@shao__meng: Papers, codebases, datasets, and images go into a raw/ directory

LLM-maintained wiki with auto-updating summaries queried 400K words without RAG. Knowledge accumulated across sessions—research compounded rather than resetting.