model evaluation benchmarking

2 articles · 3 co-occurring · 0 contradictions · 99 briefs

"...scores 87/120 on this year's Putnam, one of the world's most prestigious math competitions" — Nomos 1 is evaluated on the Putnam mathematical reasoning benchmark, demonstrating model performance asses...

Related concepts

specialized models 1 · reasoning and planning 1 · agent capability framework 1

Signal history

[Chart: weekly signal, 2026-W19 to 2026-W30]

Evidence chain (2 articles, showing 2)

@NousResearch: Today we open source Nomos 1. At just 30B parameters, it scores 87/120 on thi... example_of

@xeophon: okay, small thread! example_of

[INFERRED] "i gave my students a challenge which claude code, codex *and* gemini failed" — Article describes comparative evaluation of three different AI code models on a single challenging task, prov

Query this concept

$ db.articles("model-evaluation-benchmarking")

$ db.cooccurrence("model-evaluation-benchmarking")

$ db.contradictions("model-evaluation-benchmarking")