Model Performance Benchmarks

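| Model | Overall | Overall w/ Style Control | Hard Prompts | Hard Prompts w/ Style Control | Coding | Math | Creative Writing | Instruction Following | Longer Query | Multi-Turn |
|---|---|---|---|---|---|---|---|---|---|---|
| gemini-2.5-pro-preview-05-06 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| o3-2025-04-16 | 2 | 1 | 2 | 1 | 1 | 1 | 5 | 2 | 5 | 4 |
| chatgpt-4o-latest-20250326 | 2 | 3 | 3 | 3 | 2 | 5 | 2 | 3 | 2 | 1 |
| grok-3-preview-02-24 | 2 | 5 | 2 | 4 | 2 | 4 | 2 | 3 | 2 | 3 |
| gpt-4.5-preview-2025-02-27 | 4 | 3 | 2 | 3 | 3 | 2 | 2 | 2 | 2 | 2 |
| gemini-2.5-flash-preview-04-17 | 4 | 5 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 4 |
| deepseek-v3-0324 | 7 | 5 | 7 | 4 | 3 | 5 | 5 | 7 | 6 | 4 |
| gpt-4.1-2025-04-14 | 7 | 5 | 5 | 3 | 7 | 8 | 5 | 6 | 2 | 5 |
| hunyan-turbos-20240416 | 7 | 13 | 5 | 6 | 7 | 10 | 4 | 7 | 5 | 4 |
| deepseek-r1 | 8 | 8 | 8 | 4 | 8 | 3 | 6 | 7 | 8 | 6 |
| gemini-2.0-flash-001 | 9 | 16 | 8 | 20 | 8 | 10 | 6 | 11 | 9 | 10 |
| o4-mini-2025-04-16 | 9 | 5 | 7 | 3 | 6 | 1 | 14 | 11 | 12 | 8 |
| o1-2024-12-17 | 10 | 8 | 7 | 5 | 8 | 5 | 7 | 7 | 7 | 11 |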