inference optimization

28 articles · 15 co-occurring · 2 contradictions · 48 briefs

"Same model, you choose how hard it thinks, and now with xhigh for gnarly agent loops, step down for latency/cost" — Claude Opus 4.7 effort parameter allows direct control over inference compute alloca

Mystery solved: Anthropic reveals changes to Claude's harnesses and operating instructions likely caused degradation | VentureBeat

[strong] "On March 4, Anthropic changed the default reasoning effort from `high` to `medium` for Claude Code to address UI latency issues. This change was intended to prevent the interface from appearing "frozen" while the model thought, but it resulted in a noticeable drop in intelligence for complex tasks." — Documents a real-world case where inference optimization for latency explicitly reduced reasoning depth and task performance, showing the trade-off between speed and capability.

@emollick: I was told by Anthropic that they are looking at ways of fixing this, which i...

[observed] "regularly decides that non-math/code stuff is 'low effort' & produces worse results" — Demonstrates failure mode where adaptive router misclassifies task difficulty, reducing output quality on non-technical work

2026-W22: 28
2026-W21: 187
2026-W20: 165
2026-W19: 115
2026-W18: 156
2026-W17: 141
2026-W16: 96
2026-W15: 76

"Instead of one API call, optillm makes multiple calls using different techniques and combines the results. You're trading compute for accuracy" — optillm is a concrete implementation of inference-time
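
The "multiple calls, combined results" idea in the optillm quote can be sketched as majority voting over repeated samples (often called self-consistency). This is not optillm's API: `noisy_model` is a hypothetical stand-in for a real model call, and its 60% single-call accuracy is an assumption chosen only for illustration.

```python
import random
from collections import Counter


def noisy_model(question: str, rng: random.Random) -> str:
    """Hypothetical stand-in for one LLM call: correct ("42") 60% of the time."""
    if rng.random() < 0.6:
        return "42"
    return rng.choice(["41", "43", "44"])


def majority_vote(question: str, n_calls: int, rng: random.Random) -> str:
    """Make n_calls independent calls and return the most common answer."""
    answers = [noisy_model(question, rng) for _ in range(n_calls)]
    return Counter(answers).most_common(1)[0][0]


def accuracy(n_calls: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the voted answer is the correct one."""
    rng = random.Random(seed)
    hits = sum(majority_vote("q", n_calls, rng) == "42" for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 15):
        print(f"{n:>2} calls per question -> accuracy {accuracy(n):.3f}")
```

Under these assumptions, accuracy rises with the number of calls while cost rises linearly, which is the compute-for-accuracy trade the quote describes.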

The Inference Pipeline reads features at request time, plus the latest registered artifact, and produces a prediction (or a generation, or an action) within whatever latency and cost budget the busine

"Through context engineering, they doubled the intelligence of existing foundation models" — Poetiq case study demonstrates context engineering in practice: a specialized harness forced LLM logic verif

"algorithmic framework for context engineering that models placement, compression, and scheduling as coupled optimization problems" — Demonstrates practical algorithmic approach to optimizing context h

"Real reasoning is a dynamic search and should live as external infrastructure you plug models into." — Article presents a novel architectural pattern: decoupling reasoning from model weights and expos

"Filtering 6.8k doc/sec on an m4 max" — Luxical demonstrates practical inference optimization achieving high-throughput document filtering on CPU hardware (M4 Max)

"transformers track Bayes with 10⁻³-bit precision. And we now know why." — Research demonstrates transformers execute Bayesian inference with measurable precision through empirical testing in controlle

"New AI chips and software aim to make large AI models faster and cheaper to run" — Article highlights infrastructure improvements for model efficiency, a core concern of inference optimization

[inferred] "our 4B RLM matches Sonnet 4.6 in quality while running significantly faster and cheaper" — Provides evidence that smaller models optimized with recursive training can achieve superior infe

[DIRECT] "Optimization of [context engineering] in LLM Inference" — Core focus is algorithmic optimization specifically during the inference phase of LLMs

"Practical gains—speed, efficiency, and targeted models—are driving real investment and deployment" — Article cites speed and efficiency as key drivers of investment decisions, showing inference optimi

"We used this to develop an adaptive sampling algorithm for test-time compute." — Paper demonstrates practical implementation of adaptive computation strategy to optimize inference-time resource usage.
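
A minimal sketch of what adaptive sampling for test-time compute can look like: keep drawing answers until one clearly leads, so easy inputs stop early and hard inputs use the full budget. This is a generic early-stopping rule, not the algorithm from the cited paper; `adaptive_sample` and its parameters (`max_samples`, `min_samples`, `lead`) are hypothetical names.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_sample(
    sample_fn: Callable[[], str],
    max_samples: int = 16,
    min_samples: int = 3,
    lead: int = 3,
) -> Tuple[str, int]:
    """Draw answers one at a time; stop once the top answer leads the
    runner-up by `lead` votes, or when the sample budget is exhausted.
    Returns (chosen answer, number of samples used)."""
    counts: Counter = Counter()
    used = 0
    for used in range(1, max_samples + 1):
        counts[sample_fn()] += 1
        ranked = counts.most_common(2)
        top = ranked[0][1]
        second = ranked[1][1] if len(ranked) > 1 else 0
        if used >= min_samples and top - second >= lead:
            break
    return counts.most_common(1)[0][0], used


if __name__ == "__main__":
    # An "easy" input: the sampler always agrees, so sampling stops early.
    print(adaptive_sample(lambda: "a"))
    # A "hard" input: answers alternate, so the full budget is spent.
    flip = iter(["a", "b"] * 8)
    print(adaptive_sample(lambda: next(flip)))
```

The design choice is the stopping rule: a vote margin is simple to reason about, but a real system might instead use a confidence estimate or a verifier score.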

The shared global mask lets us stack all 320 solves into batched tensor operations...typical GPU memory per batch, reducing latency from 30+ seconds on an A100 GPU to practical levels for real-time ag

"for a "differentiation" among a pool of candidates - RL / model post training - inference engineering" — Article positions inference engineering as a differentiation skill for competitive advantage in

"llama.cpp adds MTP for the Qwen3.6 family" — Article demonstrates MTP (multi-token prediction, used for speculative decoding) implementation in llama.cpp achieving significant throughput improvement (25→45 tok/s, +78%), a concrete ex
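
The draft-and-verify idea behind speculative decoding can be sketched with toy models. This is not llama.cpp's MTP implementation: `target_next` and `draft_next` are hypothetical deterministic stand-ins for a large model and a cheap drafter, and only greedy decoding is shown. The property to notice is that the output matches plain greedy decoding while needing fewer target passes.

```python
from typing import List, Tuple


def target_next(ctx: List[int]) -> int:
    """Toy 'large' model: deterministic next token for a context."""
    return (sum(ctx) * 31 + len(ctx)) % 7


def draft_next(ctx: List[int]) -> int:
    """Toy 'small' model: agrees with the target except at every 4th position."""
    t = target_next(ctx)
    return (t + 1) % 7 if len(ctx) % 4 == 0 else t


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens, keep the prefix the target agrees with, then add the
    target's own token. Returns (tokens, number of target verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # A real system scores all k positions in one batched target pass;
        # here that single pass is simulated position by position.
        target_passes += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(out + accepted)
            accepted.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's correction and stop
        else:
            accepted.append(target_next(out + accepted))  # all accepted: bonus token
        out.extend(accepted)
    return out[:goal], target_passes


if __name__ == "__main__":
    tokens, passes = speculative_decode([1, 2, 3], 20)
    print(tokens == greedy_decode([1, 2, 3], 20), passes)
```

How much throughput improves in practice depends on how often the drafter agrees with the target and on the cost of drafting, neither of which this toy models.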

"the 2025 "Test-Time Compute" breakthrough" — Article identifies Test-Time Compute as a major 2025 breakthrough reshaping prompt engineering strategy, providing evidence of paradigm shift.

"Unweight to attack the real bottleneck on H100s: memory bandwidth, not compute" — Article identifies and addresses memory bandwidth as the primary constraint in LLM inference on modern GPUs, providing
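
A back-of-the-envelope calculation shows why memory bandwidth can dominate: at batch size 1, each generated token requires reading roughly every weight once, so decode speed is capped near bandwidth divided by weight size. The figures below are assumptions, not taken from the article: an 8B-parameter model, and about 3.35 TB/s peak HBM bandwidth for an H100 SXM per public spec sheets. KV-cache traffic and compute time are ignored.

```python
def max_decode_tokens_per_s(
    params_billion: float, bytes_per_param: float, bandwidth_gb_s: float
) -> float:
    """Upper bound on batch-1 decode speed if every weight is read once per token."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    return bandwidth_gb_s / weight_gb


# Assumption: approximate peak HBM bandwidth of an H100 SXM, in GB/s.
H100_SXM_BANDWIDTH_GB_S = 3350.0

fp16_bound = max_decode_tokens_per_s(8, 2.0, H100_SXM_BANDWIDTH_GB_S)
int8_bound = max_decode_tokens_per_s(8, 1.0, H100_SXM_BANDWIDTH_GB_S)

if __name__ == "__main__":
    print(f"8B model, 16-bit weights: at most ~{fp16_bound:.0f} tok/s")
    print(f"8B model,  8-bit weights: at most ~{int8_bound:.0f} tok/s")
```

Under this simplified model, halving the bytes read per token doubles the ceiling, which is why shrinking weights targets bandwidth rather than arithmetic throughput.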

"talks from PhD researchers at @berkeley_ai and @StanfordAILab on agent memory / continual learning and local inference" — Article announces academic research talks on local inference, indicating activ

"The token use and latency improvements in 5.4 make a huge difference here" — Article evidence that improved token efficiency and latency are critical for solving complex real-world tasks within time c

"builtin support for local LLMs (@ollama @lmstudio)" — Letta's feature announcement directly enables running LLMs locally, supporting privacy-preserving and offline-capable agent deployments.

"The thesis here is 'spend as much compute as you need to solve a task'" — Article introduces the compute-first optimization thesis as opposed to token minimization — a novel strategy that reframes inf

"local Qwen3.6-35b-a3b on a M4 Max 128GB" — Demonstrates running local inference on consumer hardware (M4 Max) as an optimization strategy instead of API calls, reducing latency and costs.

"when you ask for a model that's not loaded it'll automatically load it up, clear the vram, and use your recipes" — vllm studio demonstrates practical inference optimization through automatic model loa

[INFERRED] "that's how you get much higher margins over time" — Article connects efficient model orchestration to long-term profitability, indicating optimization directly impacts unit economics.

query this concept
$ db.articles("inference-optimization")
$ db.cooccurrence("inference-optimization")
$ db.contradictions("inference-optimization")