Model Performance Benchmarks (rank per category; lower is better)
Model | Overall | Overall w/ Style Control | Hard Prompts | Hard Prompts w/ Style Control | Coding | Math | Creative Writing | Instruction Following | Longer Query | Multi-Turn |
---|---|---|---|---|---|---|---|---|---|---|
gemini-2.5-pro-preview-05-06 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
o3-2025-04-16 | 2 | 1 | 2 | 1 | 1 | 1 | 5 | 2 | 5 | 4 |
chatgpt-4o-latest-20250326 | 2 | 3 | 3 | 3 | 2 | 5 | 2 | 3 | 2 | 1 |
grok-3-preview-02-24 | 2 | 5 | 2 | 4 | 2 | 4 | 2 | 3 | 2 | 3 |
gpt-4.5-preview-2025-02-27 | 4 | 3 | 2 | 3 | 3 | 2 | 2 | 2 | 2 | 2 |
gemini-2.5-flash-preview-04-17 | 4 | 5 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 4 |
deepseek-v3-0324 | 7 | 5 | 7 | 4 | 3 | 5 | 5 | 7 | 6 | 4 |
gpt-4.1-2025-04-14 | 7 | 5 | 5 | 3 | 7 | 8 | 5 | 6 | 2 | 5 |
hunyuan-turbos-20240416 | 7 | 13 | 5 | 6 | 7 | 10 | 4 | 7 | 5 | 4 |
deepseek-r1 | 8 | 8 | 8 | 4 | 8 | 3 | 6 | 7 | 8 | 6 |
gemini-2.0-flash-001 | 9 | 16 | 8 | 20 | 8 | 10 | 6 | 11 | 9 | 10 |
o4-mini-2025-04-16 | 9 | 5 | 7 | 3 | 6 | 1 | 14 | 11 | 12 | 8 |
o1-2024-12-17 | 10 | 8 | 7 | 5 | 8 | 5 | 7 | 7 | 7 | 11 |
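
To compare models across categories, the rows above can be loaded into a small lookup structure. The sketch below is illustrative only: it copies three rows from the table, and the helper names `parse_rows` and `mean_rank` are hypothetical. Averaging ranks is an assumption made here for demonstration; it is not how the leaderboard itself computes the Overall column.

```python
# Minimal sketch: parse pipe-delimited rows from the table and compute an
# unweighted mean rank per model. Helper names are hypothetical.

TABLE = """\
Model | Overall | Overall w/ Style Control | Hard Prompts | Hard Prompts w/ Style Control | Coding | Math | Creative Writing | Instruction Following | Longer Query | Multi-Turn |
---|---|---|---|---|---|---|---|---|---|---|
gemini-2.5-pro-preview-05-06 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
o3-2025-04-16 | 2 | 1 | 2 | 1 | 1 | 1 | 5 | 2 | 5 | 4 |
deepseek-r1 | 8 | 8 | 8 | 4 | 8 | 3 | 6 | 7 | 8 | 6 |
"""


def split_cells(line):
    # Drop the trailing pipe, then split on the remaining separators.
    return [cell.strip() for cell in line.rstrip().rstrip("|").split("|")]


def parse_rows(text):
    """Return {model: {category: rank}} for every data row in the table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    categories = split_cells(lines[0])[1:]  # skip the "Model" column
    rows = {}
    for line in lines[2:]:  # skip the header and the ---|--- separator
        cells = split_cells(line)
        rows[cells[0]] = dict(zip(categories, (int(c) for c in cells[1:])))
    return rows


def mean_rank(ranks):
    """Unweighted mean of a model's ranks (an illustrative aggregate)."""
    return sum(ranks.values()) / len(ranks)


if __name__ == "__main__":
    rows = parse_rows(TABLE)
    # List models from best (lowest) to worst mean rank.
    for model in sorted(rows, key=lambda m: mean_rank(rows[m])):
        print(f"{model}: {mean_rank(rows[model]):.1f}")
```

Because several models share the same rank within a category, a mean rank is only a rough summary; sorting by a single column such as `Coding` or `Math` is often more informative for a specific use case.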