GPT-4 is way ahead of the competition (and here's the proof)

AgentBench ranks 25 LLMs across 8 different tasks

Aug 16, 2023

In the world of artificial intelligence, Large Language Models (LLMs) have taken centre stage. These powerful models have demonstrated their prowess in a wide range of applications, from question answering to text summarization. But their potential doesn't stop there. The growing interest in using LLMs as agents for autonomous goal-completion tasks has opened up new possibilities. Enter AgentBench, a multi-dimensional evolving benchmark designed to evaluate LLMs' reasoning and decision-making abilities.

What is AgentBench?

AgentBench is a benchmark that consists of eight distinct environments, each designed to assess LLMs' reasoning and decision-making abilities in a multi-turn open-ended generation setting. The environments are Operating System (OS), Database (DB), Knowledge Graph (KG), Digital Card Game (DCG), Lateral Thinking Puzzles (LTP), House Holding (HH), Web Shopping (WS), and Web Browsing (WB). For example, in the OS environment, an LLM might be asked to find the total number of non-empty directories inside the '/etc' directory. The benchmark was used to evaluate 25 LLMs, including both API-based and open-sourced models.
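To give a feel for what the OS example asks of a model, here is a minimal Python sketch of the same task. It is not taken from AgentBench: the function name count_non_empty_dirs is hypothetical, and the sketch assumes that "inside" means every directory below the root at any depth, and that a directory counts as non-empty if it holds at least one file, subdirectory, or symlink. The benchmark's own reference answer may use a different definition.

```python
import os


def count_non_empty_dirs(root: str) -> int:
    """Count directories strictly below `root` that contain at least one entry.

    Assumptions (not from the AgentBench paper): the search is recursive,
    the root itself is not counted, and symlinked directories are not entered.
    Directories that cannot be read are silently skipped by os.walk.
    """
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            continue  # only count directories inside the root, not the root itself
        if dirnames or filenames:
            count += 1
    return count


if __name__ == "__main__":
    # The result depends on the machine and on read permissions.
    print(count_non_empty_dirs("/etc"))
```

An agent in the OS environment has to reach the same answer by issuing shell commands over several turns, which is what makes the task a test of planning rather than recall.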

The Rising Potential of LLMs and the Need for AgentBench

LLMs have made significant strides in recent years, showcasing impressive abilities in question answering and text summarization. Their ability to understand human intent and execute instructions has spurred the development of applications like AutoGPT, BabyAGI, and AgentGPT, which use LLMs for autonomous goal-completion tasks. As these models continue to evolve and their potential applications expand, the question arises: how do we know which LLM is the best for such tasks?

This is where a comprehensive benchmark like AgentBench comes in. Despite the remarkable achievements of LLMs, there was no standard way to assess how well they perform as agents across different environments. AgentBench fills this gap with a multi-dimensional benchmark that evaluates LLMs in a range of environments, from operating systems to web browsing.

As LLMs continue to evolve and their applications become more diverse, it is crucial to have a tool like AgentBench that can assess their performance as agents. By providing a comprehensive assessment of LLMs' capabilities across different environments, AgentBench offers a valuable resource for researchers, developers, and businesses looking to harness the power of LLMs for autonomous goal-completion tasks.

Experimentation and Results

For the AgentBench experiment, LLMs were divided into two categories: API-based and open-sourced models. The API-based models are closed-source LLMs available only through APIs, while the open-sourced models come from academia and companies such as Meta and are not served through a public API. During the experiments, several common failure modes were observed, such as the model not understanding the task instructions or outputting incorrect or incomplete actions.
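The failure modes above are easier to picture with a toy multi-turn loop. The sketch below is purely illustrative and is not AgentBench's evaluation harness: the "ACTION:" reply format, the Episode record, and the functions parse_action and run_episode are all hypothetical names chosen for this example. It shows how a harness could count replies that contain no usable action while still letting the model recover on a later turn.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Episode:
    success: bool          # did the agent finish the task within the turn limit?
    turns: int             # number of turns used
    invalid_actions: int   # replies that contained no usable action


def parse_action(reply: str) -> Optional[str]:
    """Return the command after 'ACTION:' (a hypothetical format), or None."""
    for line in reply.splitlines():
        if line.startswith("ACTION:"):
            command = line[len("ACTION:"):].strip()
            return command or None  # an empty command counts as incomplete
    return None


def run_episode(
    agent: Callable[[str], str],
    step: Callable[[str], Tuple[str, bool]],
    max_turns: int = 5,
) -> Episode:
    """Alternate between agent replies and environment steps for up to max_turns."""
    observation = "Task started."
    invalid = 0
    for turn in range(1, max_turns + 1):
        action = parse_action(agent(observation))
        if action is None:
            invalid += 1
            observation = "Invalid reply. Answer with 'ACTION: <command>'."
            continue
        observation, done = step(action)
        if done:
            return Episode(True, turn, invalid)
    return Episode(False, max_turns, invalid)


if __name__ == "__main__":
    # Scripted stand-in for a model: one unusable reply, then two valid actions.
    replies = iter(["I am not sure what to do.", "ACTION: look", "ACTION: finish"])
    result = run_episode(
        agent=lambda obs: next(replies),
        step=lambda action: ("ok", action == "finish"),
    )
    print(result)
```

Keeping the invalid-action count separate from the success flag is a design choice in this sketch: it distinguishes a model that fails because it misreads the instructions from one that follows the format but picks the wrong actions.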

The results of the study are presented in the image below:

GPT-4's Dominance

Among the 25 LLMs evaluated in the AgentBench benchmark, GPT-4 stood out as the clear winner. Its performance outshone other models, showcasing its superior ability to understand task instructions and execute actions accurately. Its dominance in the AgentBench benchmark is proof of its potential as an agent for autonomous goal-completion tasks.

Conclusion

The AgentBench benchmark has provided valuable insights into the performance of LLMs as agents. GPT-4's dominance in the benchmark is a testament to its potential in the field of autonomous goal completion. As LLMs continue to evolve, the possibilities for their applications are endless. The results of the AgentBench benchmark have set the stage for the future development of LLMs and their applications in goal-oriented tasks.

If you're interested in learning more about the AgentBench benchmark, I encourage you to read the full research paper. Share your thoughts on the results and the potential of LLMs as agents. The future of LLMs is bright, and I am excited to see what's next.