
AI labs like OpenAI assert that their so-called "reasoning" AI models, which can methodically work through problems step by step, are more proficient in certain domains, like physics, than non-reasoning models. But reasoning models are considerably more costly to evaluate, which poses challenges for anyone trying to verify these claims independently.
According to data from Artificial Analysis, a third-party AI testing entity, evaluating OpenAI's o1 reasoning model across a set of seven popular AI benchmarks (MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500) costs $2,767.05.
In a similar testing scenario, benchmarking Anthropic's recent hybrid reasoning model, Claude 3.7 Sonnet, on the same tests cost $1,485.35, while evaluating OpenAI's o3-mini-high cost $344.59, as reported by Artificial Analysis. Some reasoning models have lower benchmarking costs than others (the $141.22 evaluation of OpenAI's o1-mini, for instance), but on average, they tend to be expensive.
The high costs associated with testing reasoning models lie primarily in their token generation. Tokens signify raw text elements, like breaking down "fantastic" into "fan," "tas," and "tic." During Artificial Analysis' benchmarking tests, OpenAI's o1 generated over 44 million tokens, almost eight times more than GPT-4o.
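As a rough illustration, tokenization can be inspected directly. The sketch below assumes OpenAI's open source tiktoken library; actual subword splits vary by tokenizer, and a common word like "fantastic" may well come back as a single token rather than the three syllables used in the example above.

```python
# pip install tiktoken
import tiktoken

# Load the tokenizer associated with GPT-4o; other models use different
# encodings, so the same word can yield a different number of tokens.
enc = tiktoken.encoding_for_model("gpt-4o")

token_ids = enc.encode("fantastic")
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

print(token_ids)  # integer token IDs; the list length is the token count
print(pieces)     # the subword pieces the tokenizer actually produced
```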
Token volume isn't the only factor. Modern benchmarks also elicit a high token output from models because they include intricate, multi-step tasks, as noted by Jean-Stanislas Denain, a senior researcher at Epoch AI. Denain also mentioned that the expense per token for the most sophisticated models has increased over time; Anthropic's Claude 3 Opus, the costliest model upon its release in May 2024, ran $75 per million output tokens.
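To see how token volume and per-token pricing compound, here is a back-of-the-envelope sketch using the figures cited in this piece. It deliberately pairs o1's token count with Claude 3 Opus' launch price, so the result is illustrative only, not any lab's actual bill.

```python
# Back-of-the-envelope benchmarking cost: output tokens times price per token.
# The figures are the ones cited above; real invoices also include input
# tokens and per-benchmark overhead, so treat this as a rough floor.

def output_cost(tokens: int, price_per_million_usd: float) -> float:
    """Cost in USD of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_million_usd

O1_BENCHMARK_TOKENS = 44_000_000  # o1's output across Artificial Analysis' tests
OPUS_PRICE_PER_M = 75.0           # Claude 3 Opus at launch, USD per million output tokens

# What o1's token volume would cost at Claude 3 Opus' launch pricing:
print(f"${output_cost(O1_BENCHMARK_TOKENS, OPUS_PRICE_PER_M):,.2f}")  # -> $3,300.00
```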
George Cameron, a co-founder of Artificial Analysis, told TechCrunch that the organization intends to escalate its benchmarking expenditures as more AI labs introduce reasoning models. "At Artificial Analysis, we run hundreds of evaluations monthly and allot a significant budget to these," Cameron said. "We are anticipating an increase in this expenditure as more models are released."
The escalating costs of AI benchmarking aren't unique to Artificial Analysis. Ross Taylor, the CEO of AI startup General Reasoning, disclosed that evaluating Claude 3.7 Sonnet on about 3,700 unique prompts cost $580. Taylor estimated that a single run-through of MMLU-Pro, a question set aimed at assessing a model's language comprehension skills, could surpass $1,800.