Was xAI misleading about Grok 3’s benchmarks?

Debates about AI benchmarks and how they are reported by AI labs are becoming more public. This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of sharing misleading results about its latest AI model, Grok 3. Igor Babuschkin, one of the co-founders of xAI, defended the company’s actions.

The truth likely lies somewhere in the middle. xAI posted a graph on its blog showing Grok 3’s performance on AIME 2025, a set of challenging questions from a recent invitational math exam. While some experts doubt AIME’s validity as an AI benchmark, it is commonly used to test a model’s math skills. In the graph, two versions of Grok 3 were shown outperforming OpenAI’s best available model, o3-mini-high, on AIME 2025. However, OpenAI employees pointed out that xAI’s graph omitted o3-mini-high’s score at “cons@64,” a metric that gives a model 64 attempts at each question and takes the most common answer as its final one. Because it aggregates many samples, cons@64 can inflate a model’s score considerably, making one model appear better than another when that may not be the case.
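To make the metric concrete, here is a minimal sketch of how a consensus@k score like cons@64 can be computed. This is an illustrative reconstruction, not xAI’s or OpenAI’s evaluation code; the function names and the toy data are hypothetical, and a real harness would sample 64 model completions per question rather than 5.

```python
from collections import Counter

def consensus_answer(answers):
    """Majority vote: the most common answer among the k samples wins."""
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common

def score_cons_at_k(samples_per_question, correct_answers):
    """Fraction of questions whose consensus answer matches the ground truth."""
    hits = sum(
        consensus_answer(samples) == truth
        for samples, truth in zip(samples_per_question, correct_answers)
    )
    return hits / len(correct_answers)

# Toy example: 3 questions with 5 sampled answers each (a real cons@64
# run would draw 64 samples per question from the model).
samples = [
    ["42", "42", "17", "42", "9"],  # consensus: "42"
    ["7", "8", "8", "7", "7"],      # consensus: "7"
    ["3", "5", "5", "5", "2"],      # consensus: "5"
]
truth = ["42", "7", "4"]
print(score_cons_at_k(samples, truth))  # 2 of 3 consensus answers correct
```

The point of contention is that a single-attempt score and a cons@64 score are not comparable: majority voting over many samples usually lifts accuracy, so charts should compare models at the same sampling budget.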


Looking at first-attempt scores on the benchmark, Grok 3 Reasoning Beta and Grok 3 mini Reasoning both fall below o3-mini-high. Grok 3 Reasoning Beta also slightly trails OpenAI’s o1 model at its medium compute setting. Despite this, xAI is marketing Grok 3 as the “world’s smartest AI.”

Babuschkin argued that OpenAI has also shared misleading benchmark charts in the past, comparing the performance of its own models. A neutral party in the debate created a more accurate graph showing the performance of almost every model at cons@64.

However, AI researcher Nathan Lambert emphasized that an essential metric remains unknown: the computational and financial cost each model incurred to achieve its best score. That gap underscores how little most AI benchmarks actually reveal about a model’s strengths and limitations.
