Was xAI misleading about Grok 3’s benchmarks?

Debates about AI benchmarks and how they are reported by AI labs are becoming more public. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of sharing misleading results about its latest AI model, Grok 3. Igor Babuschkin, one of the co-founders of xAI, defended the company’s actions.

The truth likely lies somewhere in the middle. xAI posted a graph on its blog showing Grok 3’s performance on AIME 2025, a set of challenging questions from a recent invitational math exam. While some experts doubt AIME’s validity as an AI benchmark, it is commonly used to test a model’s math skills. In the graph, two versions of Grok 3 were shown outperforming OpenAI’s best available model, o3-mini-high, on AIME 2025. However, OpenAI employees pointed out that xAI’s graph omitted o3-mini-high’s score at “cons@64,” a metric that gives a model 64 attempts at each question and takes the most common answer as its final one. Because it aggregates many samples, cons@64 can inflate a model’s score considerably, making one model appear better than another when that may not be the case.
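To make the metric concrete, here is a minimal sketch of how a consensus@k score like cons@64 can be computed. This is an illustrative reconstruction, not xAI’s or OpenAI’s evaluation code; the function names and the toy data are hypothetical, and a real harness would sample 64 model completions per question rather than 5.

```python
from collections import Counter

def consensus_answer(answers):
    """Majority vote: the most common answer among the k samples wins."""
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common

def score_cons_at_k(samples_per_question, correct_answers):
    """Fraction of questions whose consensus answer matches the ground truth."""
    hits = sum(
        consensus_answer(samples) == truth
        for samples, truth in zip(samples_per_question, correct_answers)
    )
    return hits / len(correct_answers)

# Toy example: 3 questions with 5 sampled answers each (a real cons@64
# run would draw 64 samples per question from the model).
samples = [
    ["42", "42", "17", "42", "9"],  # consensus: "42"
    ["7", "8", "8", "7", "7"],      # consensus: "7"
    ["3", "5", "5", "5", "2"],      # consensus: "5"
]
truth = ["42", "7", "4"]
print(score_cons_at_k(samples, truth))  # 2 of 3 consensus answers correct
```

The point of contention is that a single-attempt score and a cons@64 score are not comparable: majority voting over many samples usually lifts accuracy, so charts should compare models at the same sampling budget.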


Looking at first-attempt scores on the benchmark, Grok 3 Reasoning Beta and Grok 3 mini Reasoning both fall below o3-mini-high. Grok 3 Reasoning Beta also slightly trails OpenAI’s o1 model at its medium compute setting. Despite this, xAI is marketing Grok 3 as the “world’s smartest AI.”

Babuschkin argued that OpenAI has also shared misleading benchmark charts in the past, comparing the performance of its own models. A neutral party in the debate created a more accurate graph showing the performance of almost every model at cons@64.

However, AI researcher Nathan Lambert emphasized that an essential metric remains unknown: the computational and financial cost each model incurred to achieve its best score. That gap underscores how little most AI benchmarks actually reveal about a model’s strengths and limitations.
