In-depth research from 42 Robots AI on applied AI, model evaluation, and production ML systems.
Methodology and findings from 30 evaluations and 20,110 tests across 9 production models from OpenAI, Google, and Anthropic. Three assumptions about LLM evaluation that the data falsifies.