OpenAI is facing renewed scrutiny after independent benchmark results for its o3 model fell well below initial claims, raising transparency concerns in the AI community.
When OpenAI unveiled o3 in December, it reported that the model could solve over 25% of questions in the FrontierMath benchmark — far surpassing other models, which scored under 2%. However, new independent testing by Epoch AI found o3’s score closer to 10%, sparking debate over how OpenAI evaluates and presents its models.
Epoch noted differences in testing methodology, including the amount of compute used and the version of FrontierMath tested, which may explain the gap. OpenAI's internal tests used a more powerful, unreleased version of o3, while the public version, launched last week, is tuned for speed and usability, according to OpenAI staff.
The ARC Prize Foundation, which tested an earlier version of o3, confirmed that the public release differs significantly from the internal variant it benchmarked, lending weight to the explanation that different compute tiers produce different scores.
Despite the shortfall, OpenAI’s newer models — including o3-mini-high and o4-mini — outperform o3 on the same benchmark, with o3-pro expected soon. Still, the incident adds to growing skepticism around AI benchmarking, where companies often set the narrative with limited transparency.
Benchmark controversy isn't new. Earlier this year, Epoch drew criticism for delaying disclosure that OpenAI had funded FrontierMath, funding many of the benchmark's academic contributors were unaware of.
The lesson: AI benchmarks are useful — but best taken with a grain of salt.