The increasing capabilities of the latest artificial intelligence systems are stretching traditional evaluation methods to breaking point, confronting businesses and public bodies with the question of how best to work with the fast-evolving technology.
As more models come to market, flaws in the metrics commonly used to assess performance, accuracy, and safety are being exposed, according to the people who build, test, and invest in AI tools. Traditional benchmarks are easy to manipulate and too narrow in scope for the complexity of the latest models, they said.
The technology race sparked by the 2022 release of OpenAI’s chatbot ChatGPT, and fueled by tens of billions of dollars in funding from venture capitalists and big tech companies such as Microsoft, Google, and Amazon, has swept away many of the older yardsticks for assessing advances in AI.
“Public benchmarks have a shelf life,” said Aidan Gomez, founder and CEO of AI startup Cohere. “They are useful until people optimize [their models] to them or game them. That used to take a couple of years; now it takes a couple of months.”
Google, Anthropic, Cohere, and Mistral have each released new AI models over the past two months in an effort to unseat Microsoft-backed OpenAI from the top of the public rankings of large language models (LLMs), the technology that powers systems such as ChatGPT.
New AI systems that “completely outperform” existing benchmarks emerge regularly, Gomez said. “As models improve, their capabilities make these ratings obsolete,” he added.
With generative AI a top investment priority for 70 percent of chief executives, according to a KPMG survey of more than 1,300 global CEOs, the question of how to evaluate LLMs is moving out of academic circles and into the boardroom.
“People don’t use technology they don’t trust,” said Shelley McKinley, chief legal officer of GitHub, the Microsoft-owned code repository. “It is companies’ duty to bring trustworthy products to market.”
Governments are also grappling with how to deploy the latest AI models and manage their risks. Last week, the US and UK signed a landmark bilateral agreement on AI safety, building on the new AI Safety Institutes the two countries established last year to “minimize surprises from rapid and unexpected advances in AI.”
Last year, President Joe Biden issued an executive order requiring government agencies such as the National Institute of Standards and Technology to create benchmarks to assess the risks of AI tools.
Whether assessing safety, performance, or efficiency, groups responsible for stress testing AI systems are scrambling to keep up with the cutting edge of technology.
“The top-level decision that many companies are making is whether to use LLMs at all, and which ones,” said Rishi Bommasani, who leads a team at the Stanford Center for Research on Foundation Models.
Bommasani’s team developed the Holistic Evaluation of Language Models, which tests models against criteria such as reasoning, memorization, and susceptibility to disinformation.
Other public tests include the Massive Multitask Language Understanding benchmark, a dataset built in 2020 by Berkeley students that quizzes models on questions drawn from 57 subject areas, and HumanEval, which measures coding ability across 164 programming problems.
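As a rough illustration of how such fixed benchmarks are scored, the sketch below runs a model over multiple-choice items and reports accuracy. The query_model function and the sample question are hypothetical placeholders, not any benchmark’s official harness.

```python
# Minimal sketch of fixed-benchmark scoring in the style of a multiple-choice
# test. `query_model` and the sample item are illustrative placeholders.

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for the model under test; should return a letter A-D."""
    raise NotImplementedError

QUESTIONS = [
    {
        "question": "Which data structure offers O(1) average-case lookup by key?",
        "choices": {"A": "Linked list", "B": "Hash table", "C": "Binary heap", "D": "Stack"},
        "answer": "B",
    },
    # ...a real benchmark has hundreds or thousands of items
]

def accuracy(questions) -> float:
    """Ask the model each question and compare its answer to the reference key."""
    correct = 0
    for item in questions:
        prompt = (
            item["question"]
            + "\n"
            + "\n".join(f"{letter}. {text}" for letter, text in item["choices"].items())
            + "\nAnswer with a single letter."
        )
        prediction = query_model(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]
    return correct / len(questions)
```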
But assessments have struggled to keep up with the sophistication of today’s AI models, which can perform a series of connected tasks over long periods of time. Such complex tasks are difficult to assess in controlled settings.
“The first thing to realize is that it’s very difficult to properly evaluate models, just as it’s very difficult to properly evaluate humans,” said Mike Volpi, a partner at venture capital firm Index Ventures. “If you focus on one thing, like ‘Can you jump high or run fast?’, it’s easy. But assessing something like human intelligence is an almost impossible task.”
Another growing concern about public testing is that the training data for the model may include the exact questions used in the assessment.
“It may not be intentional cheating. It may be more benign,” said Stanford’s Bommasani. “But we are still learning how to limit this contamination problem between what the model is trained on and what it is tested on.”
Benchmarks are also “very monolithic,” he added. “We evaluate how strong an LLM is, but your decision as a company involves more than that. You have to consider cost [and] whether you want open source [where code is publicly available] or closed source.”
Hugging Face, a $4.5 billion startup that provides tools for AI development and an influential platform for open-source models, hosts a leaderboard called LMSys, which ranks models on their ability to complete bespoke tests set by individual users rather than a fixed set of questions. As a result, it captures users’ actual preferences more directly.
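Leaderboards of this kind typically turn head-to-head user preferences into rankings with Elo-style ratings: a user compares two models’ answers, picks a winner, and the ratings shift accordingly. The sketch below shows that general mechanism; the constants and model names are illustrative and not LMSys’s published implementation.

```python
# Sketch of Elo-style rating updates from pairwise user votes, the general
# mechanism behind preference-based leaderboards. Names and constants are
# illustrative only.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift ratings after one user vote: winner moves up, loser moves down."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - exp_win)
    ratings[loser] -= k * (1 - exp_win)

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
for winner, loser in [("model_a", "model_b"), ("model_c", "model_a")]:
    update(ratings, winner, loser)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```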
Cohere’s Gomez said the leaderboards are useful for individual users, but are of limited use for companies with specific requirements for their AI models.
Instead, he recommends that companies build “in-house test sets that only require hundreds of samples rather than thousands.”
“We always say human evaluation is best,” he said. “It gives the strongest signal and is the most representative way to judge performance.”
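A minimal sketch of what such an in-house evaluation might look like, assuming a few hundred business-specific prompts and human reviewers scoring each candidate model’s output; the function names and the 1-5 scale are illustrative, not Cohere’s methodology.

```python
# Sketch of a small in-house evaluation harness: run each candidate model over
# business-specific prompts, collect human scores in a CSV, then average them.
# All names and the 1-5 scale are illustrative assumptions.

import csv
from statistics import mean

def generate(model_name: str, prompt: str) -> str:
    """Placeholder for calling whichever candidate model is being evaluated."""
    raise NotImplementedError

def build_review_sheet(models: list[str], prompts: list[str], path: str) -> None:
    """Write model outputs to a CSV for human reviewers to score (1-5)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "prompt", "output", "score_1_to_5"])
        for model in models:
            for prompt in prompts:
                writer.writerow([model, prompt, generate(model, prompt), ""])

def summarize(path: str) -> dict[str, float]:
    """Average the human scores per model once the sheet is filled in."""
    scores: dict[str, list[float]] = {}
    with open(path) as f:
        for row in csv.DictReader(f):
            if row["score_1_to_5"]:
                scores.setdefault(row["model"], []).append(float(row["score_1_to_5"]))
    return {model: mean(vals) for model, vals in scores.items()}
```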
Index Ventures’ Volpi said model selection by individual companies is as much an art as it is a science.
“These metrics are like when you buy a car, it has this much horsepower and this much torque and it goes from 0 to 100 km/h,” he said. “The only way to actually decide whether to buy it or not is to take it for a drive.”