2 articles · 3 co-occurring · 0 contradictions · 5 briefs
scores 87/120 on this year's Putnam, one of the world's most prestigious math competitions" — Nomos 1 is evaluated on the Putnam mathematical reasoning benchmark, demonstrating model performance asses
[INFERRED] "i gave my students a challenge which claude code, codex *and* gemini failed" — Article describes comparative evaluation of three different AI code models on a single challenging task, prov