Nearly two dozen researchers from Tsinghua University, Ohio State University and the University of California at Berkeley collaborated to create a method for measuring the capabilities of large language models (LLMs) as real-world agents.
LLMs such as OpenAI’s ChatGPT and Anthropic’s Claude have taken the technology world by storm over the past year, as cutting-edge “chatbots” have proven useful at a variety of tasks, including coding, cryptocurrency trading and text generation.
Typically, these models are benchmarked on their ability to output text perceived as humanlike or on their scores on plain-language tests designed for humans. By comparison, far fewer papers have been published on LLMs as agents.
Artificial intelligence (AI) agents perform specific tasks, such as following a set of instructions within a specific environment. For example, researchers will often train an AI agent to navigate a complex digital environment as a method for studying the use of machine learning to develop autonomous robots safely.
Traditional machine learning agents aren’t typically built as LLMs due to the prohibitive costs involved with training models such as ChatGPT and Claude. However, the largest LLMs have shown promise as agents.
The team from Tsinghua, Ohio State and UC Berkeley developed a tool called AgentBench to evaluate and measure LLMs’ capabilities as real-world agents, which the team claims is the first of its kind.
According to the researchers’ preprint paper, the main challenge in creating AgentBench was going beyond traditional AI learning environments — video games