model evaluation benchmarking

2 articles · 3 co-occurring · 0 contradictions · 5 briefs

2026-W15
10

"...scores 87/120 on this year's Putnam, one of the world's most prestigious math competitions" — Nomos 1 is evaluated on the Putnam mathematical reasoning benchmark, demonstrating model performance asses

[INFERRED] "i gave my students a challenge which claude code, codex *and* gemini failed" — Article describes comparative evaluation of three different AI code models on a single challenging task, prov

query this concept
$ db.articles("model-evaluation-benchmarking")
$ db.cooccurrence("model-evaluation-benchmarking")
$ db.contradictions("model-evaluation-benchmarking")