testing strategies

37 articles · 15 co-occurring · 4 contradictions · 56 briefs

"if you're building 100% of an app with AI you should at least build solid harnesses around it to avoid production disasters" — Article directly argues that AI-generated code requires robust testing ha

Related concepts

multi agent orchestration 10 · tool integration patterns 7 · context window management 6 · token efficiency 3 · safety guardrails 3 · retrieval augmented generation 3 · error handling 3 · workflow automation 2 · task decomposition 2 · system prompt architecture 2 · prompt optimization 2 · multi turn conversation management 2 · mcp servers 2 · deployment patterns 2 · code quality assurance 2

Contradictions

@paoloanzn: Doing /goal refactor might be the worst way ever to actually refactor code. N...

[INFERRED] "No matter if you have tests to avoid regression, the bigger the refactor size the worse the result." — Tests alone are insufficient to prevent failure in large refactoring operations; scope and granularity matter more than test coverage

@colin_fraser: Newspapers used to get really mad at you for doing stuff like this

[STRONG] "between 68% and 84% of the pods had potential issues" — The Washington Post's own testing revealed severe quality issues in its AI podcast product, yet the paper proceeded with release, contradicting best practices for QA before deployment

On one end, the Anthropic team is a massive user of AI to write code (80%+ of...

[STRONG] "beyond terrible reliability numbers suggests there might be a downside to all this speed" — Article directly challenges the speed-first approach by highlighting severe reliability issues as a hidden cost

@Grokton: "The benchmarks we use to evaluate AI coding tools are measuring memorization...

[INFERRED] "The benchmarks we use to evaluate AI coding tools are measuring memorization, not understanding." — Article challenges the validity of current benchmarking approaches for AI coding tools, arguing they measure surface-level memorization rather than genuine comprehension. This contradicts the assumption that existing benchmark scores reflect true model understanding.

Signal history

Week       Value
2026-W22   (no value)
2026-W21   239
2026-W20   235
2026-W19   165
2026-W18   231
2026-W17   225
2026-W16   217
2026-W15   227
2026-W14   (no value)

Evidence chain (37 articles, showing 37)

@unclebobmartin: Static or dynamically typed? Who cares? supports

"What matters are the scenarios and the testing discipline, the coverage" — Article explicitly argues that testing discipline and coverage are what fundamentally matter in code quality, prioritizing th

@EleanorKonik: "Our study identifies quality assurance as a major bottleneck for early Curso... extends

"quality assurance as a major bottleneck for early Cursor adopters and calls for it to be a first-class citizen in the design of agentic AI coding tools" — Study identifies QA as critical design requir

@Hesamation: "100% OF MY CODE IS WRITTEN BY CLADE CODE" so I guess the lesson is, if you'r... supports

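@shao__meng: Have Claude Code / Codex / Cursor write the code first, then have the AI review its own code in the same session? example_of

"First session: write only the test cases (clearly defining requirements and boundary conditions); another session: write the implementation from those test cases, ensuring the code passes every test" — Article demonstrates test-first workflow where separate AI sessions write tests before implementation, ensuring objective veri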

@paoloanzn: In this repo's AGENTS.md file I explicitly define two hard rules for committi... supports

"([ai model], [human reviewed: T | F], [tested: T | F])" — Proposal includes [tested: T|F] as mandatory metadata field, treating test verification as critical criterion for AI code commits
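A minimal sketch of how the metadata tuple quoted above could be checked automatically, for example from a commit-msg hook. The source only gives the tuple format; the concrete syntax accepted by the regular expression, the names `parse_commit_metadata` and `is_acceptable`, and the policy of rejecting commits marked `tested: F` are all assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed concrete syntax for the quoted format, with the square brackets
# read as placeholders:  (model-name, human reviewed: T|F, tested: T|F)
METADATA_RE = re.compile(
    r"\(\s*(?P<model>[^,()]+?)\s*,"
    r"\s*human reviewed:\s*(?P<reviewed>[TF])\s*,"
    r"\s*tested:\s*(?P<tested>[TF])\s*\)"
)


class CommitMetadata(NamedTuple):
    model: str
    human_reviewed: bool
    tested: bool


def parse_commit_metadata(message: str) -> Optional[CommitMetadata]:
    """Return the first metadata tuple found in a commit message, or None."""
    match = METADATA_RE.search(message)
    if match is None:
        return None
    return CommitMetadata(
        model=match["model"],
        human_reviewed=match["reviewed"] == "T",
        tested=match["tested"] == "T",
    )


def is_acceptable(message: str) -> bool:
    """Hypothetical policy: metadata must be present and tested must be T."""
    meta = parse_commit_metadata(message)
    return meta is not None and meta.tested
```

A hook built on this would call `is_acceptable` with the commit message text and exit non-zero when it returns False. Whether an untested commit should be blocked or merely flagged is a policy choice the source does not settle.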

@IntuitMachine: A new kind of machine learning loop. example_of

"Along with hill climbing, evals also explicitly capture and protect against regressions over time. Once our agent handles a case correctly, we don't want to lose that gain. The eval becomes a regressi
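The idea in the quote above, that a case the agent once handled correctly should never silently start failing, can be sketched as a stored baseline that each new eval run is compared against. This is an illustrative sketch only: the exact-match scoring, the JSON baseline file, and the names `run_evals`, `find_regressions`, and `check_and_save` are assumptions, not the quoted author's tooling.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_evals(agent: Callable[[str], str], cases: Dict[str, str]) -> Dict[str, bool]:
    """Run every eval case and record whether the agent's answer matches exactly."""
    return {prompt: agent(prompt) == expected for prompt, expected in cases.items()}


def find_regressions(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Cases that passed in the previous run but fail (or are missing) now."""
    return sorted(p for p, ok in previous.items() if ok and not current.get(p, False))


def check_and_save(
    agent: Callable[[str], str], cases: Dict[str, str], baseline_path: Path
) -> List[str]:
    """Compare a fresh run against the stored baseline and return regressions."""
    previous = json.loads(baseline_path.read_text()) if baseline_path.exists() else {}
    current = run_evals(agent, cases)
    regressions = find_regressions(previous, current)
    if not regressions:
        # Only advance the baseline when no earlier gain was lost.
        baseline_path.write_text(json.dumps(current, indent=2))
    return regressions
```

Refusing to overwrite the baseline when a regression is found keeps the earlier passing result on record until the failure is resolved.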

@jeffreyhuber: one feature of Chroma which is underrated is our copy-on-write forking of col... example_of

"experimentation and A/B testing" — Copy-on-write forking enables trivial A/B testing by allowing isolated collection variants without performance overhead

MCP Inspector - Model Context Protocol supports

"Start Development: Launch Inspector with your server. Verify basic connectivity. Check capability negotiation. Iterative testing: Make server changes. Rebuild the server. Reconnect the Inspector." — A

AddyOsmani.com - The Code Agent Orchestra - what makes multi-agent coding work supports

"The bottleneck is no longer generation. It's verification. Agents can produce impressive output at incredible speed. Knowing with confidence whether that output is correct is the hard part." — Article

Tools – Model Context Protocol (MCP) example_of

"A comprehensive testing strategy for MCP tools should cover: Functional testing, Integration testing with external systems, Security testing for authentication and sanitization, Performance testing un

Building your own LLM evaluation framework - n8n Blog supports

"one small change to a prompt, a model swap, or a slight tweak to a node can turn a perfectly functional workflow into an unpredictable mess" — Article identifies the fragility of AI systems to changes

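@shao__meng: Rereading the official OpenAI Codex and Anthropic Claude articles on testing, evaluating, and improving Agent Skills, along with @mgechev @its... supports

"Change only one variable at a time so attribution stays clear; keep a complete change log recording the change, reason, and result of every attempt" — Article presents systematic methodology for iterative skill improvement through isolated variable changes and comprehensive logging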
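The one-variable-at-a-time rule and the change log from the quote above can be enforced mechanically. The sketch below is a hypothetical helper, not something described in the source: the `ChangeLog` class, its field names, and the `keep` flag are assumptions chosen to show the idea.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ChangeLog:
    """Records each attempt and rejects attempts that change more than one variable."""

    baseline: Dict[str, Any]
    entries: List[Dict[str, Any]] = field(default_factory=list)

    def record(
        self, config: Dict[str, Any], reason: str, result: str, keep: bool = True
    ) -> str:
        """Log one attempt; return the name of the single variable that changed."""
        keys = set(self.baseline) | set(config)
        changed = sorted(k for k in keys if self.baseline.get(k) != config.get(k))
        if len(changed) != 1:
            raise ValueError(f"expected exactly one changed variable, got {changed}")
        name = changed[0]
        self.entries.append(
            {
                "changed": name,
                "from": self.baseline.get(name),
                "to": config.get(name),
                "reason": reason,
                "result": result,
            }
        )
        if keep:
            # The accepted configuration becomes the baseline for the next attempt.
            self.baseline = dict(config)
        return name
```

Passing `keep=False` logs a failed experiment without moving the baseline, so the next attempt is still compared against the last accepted configuration.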

@unclebobmartin: I have a data corruption problem. I detect it in the main loop with a regula... example_of

"So I have instrumented each and have told the AI to run several long sweeps to debug." — Article demonstrates a concrete debugging strategy: instrumentation of suspected code sites combined with repea

@kieranklaassen: I really like this technique that I learned from @nbaschez , where you keep s... example_of

"at the end of the day, you just check if everything looks good in the next branch and ship that to main" — Article describes validating multiple merged PRs at once before shipping to production, reduc

@doodlestein: Agent Coding Life Hack: extends

"Also I need you, once you've fixed and verified each of those problems is completely resolved and working properly, to create extremely in-depth e2e integration tests that would have caught each of th

@bcherny: 👋 Appreciate the feedback. extends

"There was a subtle bug that missed several rounds of manual review. We're working on how we can better catch it automatically next time." — Highlights gap in manual code review catching subtle bugs an

Claude Code supports

"All tests are now passing! The refactoring was successful." — Article demonstrates validation of changes through comprehensive test suite execution (800 tests)

@corbtt: This is the most delightfully creative post-training experiment I've seen in ... supports

"sadly, vindication for the "black-box testing isn't enough to trust models released by actors we don't trust"" — Provides empirical evidence that black-box evaluation cannot detect implanted backdoors

Stop Shipping AI Slop: The Claude Code QA Tool That Fixes the Biggest Mistake GTM Engineers Are Making supports

"The key is to use AI to save some time, then spend enough time checking the work carefully" — Article advocates for human review and careful checking as essential guardrail against AI-generated errors

@mattpocockuk: Uncle Bob gets it supports

"The software I'm creating nowadays is vastly more robust than I'd ever been able to create manually. I don't mean that the code is better. I mean the surrounding tests are vastly better." — Evidence t

@colin_fraser: Newspapers used to get really mad at you for doing stuff like this contradicts

"between 68% and 84% of the pods had potential issues" — The Washington Post's own testing revealed severe quality issues in its AI podcast product, yet the paper proceeded with release, contradicting best practices for

@simonw: I see a lot of complaints about untested AI slop in pull requests. Submitting... supports

[INFERRED] "Your job is to deliver code you have proven to work" — Article argues that submitting untested AI-generated code violates professional engineering standards; emphasizes testing as a duty b

@dbreunig: Why I'm excited about RLMs: they're a simple, generally applicable test-time ... supports

[DIRECT] "simple, generally applicable test-time strategy with tons of low-hanging fruit for optimization" — Article directly describes RLMs as a test-time optimization strategy with significant room

@Grady_Booch: This is why I keep an air gap between Claude and my release production code supports

"This is why I keep an air gap between Claude and my release production code" — Article advocates for isolation/validation layer (air gap) between AI code generation and production systems — empirical

@unclebobmartin: I always thought that mutation testing was a good idea; but it was always har... example_of

"Now it's just one more Claude agent working down a list. And the benefit is amazing." — Demonstrates practical automation of mutation testing using Claude agents, showing cost reduction and scalabilit

On one end, the Anthropic team is a massive user of AI to write code (80%+ of... contradicts

"beyond terrible reliability numbers suggests there might be a downside to all this speed" — Article directly challenges the speed-first approach by highlighting severe reliability issues as a hidden c

@unclebobmartin: Earlier I posted that codex was doing things faster than I expected. The rea... supports

"I have my project seriously over-constrained with tests, and independent tools that check integrity." — Author employs multiple layers of automated testing and integrity verification as primary defens

@shao__meng: https://t.co/IoBUd945Ki supports

[INFERRED] "test coverage gaps" — Article identifies test coverage gaps as one of four critical review dimensions, supporting the concept that coverage analysis is essential to code quality assurance

@jessitron: "We care more about measuring correctness now because agents write unreliable... extends

[INFERRED] "testing used to be 50% of the work and now it's 99%" — Article presents empirical observation that testing proportion has dramatically increased in development workflows, suggesting fundam

@haider1: i gave up on opus 4.6 supports

[INFERRED] "opus 4.5 at least tests different things and tries to fix them" — User implies broader regression coverage in 4.5; systematic testing and incremental fixes shown as superior to 4.6's react

@mtm_io: Add Sentry self hosted, connect @clawdbot , have it pull issues via api or in... supports

[INFERRED] "Nice thing is that you raise test coverage at the same time" — Suggests that automated issue fixing and code commits by AI agents can improve test coverage metrics as a beneficial side eff

@paoloanzn: Doing /goal refactor might be the worst way ever to actually refactor code. N... contradicts

@simonw: This seems like a good bet to me - coding agents make it no longer remotely e... extends

[INFERRED] "most change that AI tools will bring for software engineers are likely to be making the practices that the best eng teams did until now, the baseline for those that want to stay competitiv

@Grokton: "The benchmarks we use to evaluate AI coding tools are measuring memorization... contradicts

I feel proud to have made this the last 45 days @garryslist Almost 1:1 test... supports

[INFERRED] "Almost 1:1 test to code ratio is a big unlock" — Author demonstrates that maintaining 1:1 test-to-code ratio with AI assistance (Claude Code) is achievable and valuable in Rails projects

@rileybrown: Keep going example_of

[INFERRED] "collect all the metrics needed to calculate the winning one" — ThumbLoop demonstrates practical metrics collection for A/B testing thumbnail performance, showing how automated systems gath

@Hesamation: The "agent might fuck up" anxiety is real. supports

[INFERRED] "These agents don't work as they promised." — Social commentary expressing practitioner concern that deployed agents fail to meet advertised capabilities. Reflects real-world reliability ga

query this concept

$ db.articles("testing-strategies")

$ db.cooccurrence("testing-strategies")

$ db.contradictions("testing-strategies")