Retrieval-Augmented Generation (RAG) vs. Long-Context LLMs: Which to Choose?
Is RAG superior to longer context LLMs?
TL;DR: LLMs like GPT-4, LLAMA2 LONG, and Claude 2 boast long context windows thanks to advances in GPUs and memory-efficient exact attention. Retrieval augmentation employs retrievers like Dragon, Contriever, and OpenAI’s text-embedding-ada-002 to provide LLMs with crucial context. A pivotal finding is that a 4K-context LLM augmented by retrieval can rival the efficacy of a 16K-context LLM. Models like Nemo GPT-43B and LLAMA2-70B were used in these comparisons, indicating that RAG can enhance an LLM's performance irrespective of its context length.
Business Implications
Retrieval-augmented LLMs can achieve top-tier AI capabilities without escalating operational costs, ensuring a competitive advantage.
Harnessing open-source retriever tools can drive superior AI performance, elevating customer satisfaction and trust while keeping costs low.
Retrieval-augmented models can provide highly precise responses grounded in expansive or up-to-date content.
In the rapidly evolving world of Large Language Models (LLMs), two competing strategies have been contending for the spotlight: extending the context window of LLMs and augmenting LLMs with retrieval mechanisms. Since the latter has been a familiar solution for a long time, two key questions arise:
Is retrieval augmentation superior to longer context LLMs?
Could a synthesis of both strategies unlock unprecedented capabilities?
Long Context LLMs
Long-context LLMs have recently become the focal point of discussions within research, production, and open-source domains. Spearheading this momentum is the advent of faster GPUs complemented by memory-efficient exact attention. These advances have enabled long-context LLMs with windows of up to 32K tokens (LLAMA2 LONG, GPT-4) and even 100K 😱 (Claude 2).
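To make "memory-efficient exact attention" concrete, here is a minimal PyTorch sketch (an illustration, not code from any of these models). torch.nn.functional.scaled_dot_product_attention computes exact attention but dispatches to fused, FlashAttention-style kernels when the hardware and dtypes allow, so the full attention matrix over a long sequence is never materialized in memory.

```python
# Minimal sketch: exact attention over a 32K-token context without materializing
# the full 32K x 32K attention matrix. Assumes a CUDA GPU with fp16 support.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 32_768, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Dispatches to a fused memory-efficient / FlashAttention kernel when available;
# the result is still exact (not approximate) attention.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 32768, 64])
```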
Conceptual Understanding of Retrieval Augmentation
Beyond context window expansion lies the retrieval method, a well-established alternative. In this approach, the LLM is given only the relevant context fetched by a retriever, offering greater scalability and speed. This strategy essentially turns a retrieval-augmented decoder-only LLM into a model with sparse attention, where the attention pattern is determined not beforehand but by the retriever's choices. In simpler terms, the non-retrieved data (context) is treated as irrelevant and receives attention weights of zero.
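This sparse-attention framing can be illustrated with a toy NumPy sketch (purely illustrative; the chunk sizes and retrieved indices below are made up). Masking out the tokens of non-retrieved chunks before the softmax gives them exactly zero attention weight, which is equivalent to never placing them in the prompt at all.

```python
# Toy illustration of RAG as sparse attention: tokens from chunks the retriever
# did not select receive an attention weight of exactly zero.
import numpy as np

np.random.seed(0)
n_chunks, tokens_per_chunk, d = 6, 4, 8
chunk_tokens = np.random.randn(n_chunks * tokens_per_chunk, d)  # the full document
query = np.random.randn(d)

retrieved = {1, 4}  # hypothetical chunk indices chosen by the retriever
keep = np.zeros(n_chunks * tokens_per_chunk, dtype=bool)
for c in retrieved:
    keep[c * tokens_per_chunk:(c + 1) * tokens_per_chunk] = True

scores = chunk_tokens @ query
scores[~keep] = -np.inf                        # mask non-retrieved tokens
weights = np.exp(scores - scores[keep].max())  # softmax over retrieved tokens only
weights /= weights.sum()

print(weights.round(3))  # non-retrieved positions carry exactly zero weight
```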
Experiment Overview
The primary objective is to compare the retrieval-augmented method (using a retriever) with the everything-in-context method (placing all content directly in the LLM's context). Venturing beyond smaller models, the investigation centres on models larger than 40B parameters.
The two models used are:
Nemo GPT-43B: A model boasting 43 billion parameters and trained on a staggering 1.1T tokens. Its training data is roughly 70% English, enriched with multilingual and code data from sources like Common Crawl, Wikipedia, and StackExchange.
LLAMA2-70B: A publicly accessible model with 70B parameters, trained on approximately 2T tokens dominated by English data.
The experimental landscape spanned seven datasets, encompassing single-document QA, multi-document QA, and query-based summarization for zero-shot evaluations.
Retrieval Mechanisms Explored
Three retrievers were employed:
Dragon: Renowned for benchmark-setting performances across supervised and zero-shot information retrieval.
Contriever model: Another proficient retriever.
OpenAI embedding: Specifically text-embedding-ada-002.
The retrieval process: questions and document chunks are encoded into vectors, rankings are established via similarity measures like cosine similarity, and the top-ranked chunks are then selected and passed alongside the question in the LLM prompt.
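Here is a hedged sketch of that pipeline using the publicly available Contriever encoder from Hugging Face (facebook/contriever). The chunking, number of retrieved chunks, and prompt template below are illustrative assumptions, not the exact settings used in the experiments.

```python
# Sketch of the retrieve-then-prompt flow: embed question and chunks, rank by
# cosine similarity, keep the top chunks, and build the LLM prompt from them.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    """Mean-pool token embeddings into one vector per text (Contriever's recipe)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

question = "What does the report conclude about retrieval augmentation?"  # example query
chunks = ["...chunk 1 of the document...", "...chunk 2...", "...chunk 3..."]

q_vec, c_vecs = embed([question]), embed(chunks)
scores = torch.nn.functional.cosine_similarity(q_vec, c_vecs)  # rank by cosine similarity
top = scores.topk(k=2).indices.tolist()                         # keep the best chunks

prompt = "\n\n".join(chunks[i] for i in top) + f"\n\nQuestion: {question}\nAnswer:"
# `prompt` now carries only the retrieved context plus the question and is sent to the LLM.
```

Swapping in Dragon or text-embedding-ada-002 would only change the embed() step; the ranking and prompting logic stays the same.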
Key Findings and Observations
The findings are not either/or, but both/and:
A retrieval-augmented LLM with a 4K context window astonishingly mirrored the prowess of an LLM with a 16K context window on long-context tasks, albeit with significantly reduced computational demands.
Retrieval-augmented LLAMA2-70B with a 32K context window overshadowed GPT-3.5-Turbo-16K and Davinci003 across various long-context tasks.
The experiments cemented the understanding that irrespective of context window size, retrieval enhances LLM performance.
Interestingly, when presented with identical evidence chunks, long-context LLMs (16K, 32K) outshone their 4K-context counterparts.
Public retrievers often performed better than proprietary solutions like OpenAI embeddings.
When To Use RAG?
When you have question-answering tasks based on long-form documents (PDFs, text files).
When the LLM needs extra information, not already available to it, in order to reply to a user.
When you want the LLM to always have the latest information (see the sketch after this list).
When you want the LLM to avoid hallucinating and making up random information, though using RAG is not a guarantee of eliminating hallucination completely.
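The "latest information" point is worth making concrete: because the LLM only sees what the retriever fetches, keeping answers current is a matter of indexing new documents, with no retraining involved. The sketch below is self-contained and uses a toy bag-of-words embedding as a stand-in for a real retriever such as Contriever or text-embedding-ada-002; the documents and query are made up.

```python
# Toy sketch: freshness in RAG comes from updating the index, not the model.
import re
import numpy as np

VOCAB = {}

def embed(text: str) -> np.ndarray:
    """Toy bag-of-words vector; a real system would call a neural retriever here."""
    vec = np.zeros(1024)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[VOCAB.setdefault(word, len(VOCAB) % 1024)] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

corpus = ["LLAMA2-70B was trained on roughly 2T tokens of mostly English data."]
index = [embed(doc) for doc in corpus]

# A new fact appears after the LLM was trained: simply embed and index it.
corpus.append("The 2024 pricing page lists the enterprise plan at a new rate.")
index.append(embed(corpus[-1]))

query = embed("What is the latest enterprise plan pricing?")
best = int(np.argmax([query @ doc for doc in index]))
print(corpus[best])  # the newly indexed document is what gets placed in the prompt
```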
When Is LLM Context Just Enough?
When answering questions based on short texts that can fit well inside the context of LLMs.
When performing summarization tasks that require the LLM to see the whole document.
When the conversation with the LLM relies on knowledge it already acquired during training or fine-tuning.
Closing Thoughts…
Retrieval-Augmented Generation significantly amplifies the strengths of LLMs, improving perplexity, accuracy, and learning capacity. The evidence is compelling: a 4K-context LLM, when powered by retrieval, can go toe-to-toe with a 16K-context counterpart while remaining computationally efficient at inference.