Extending LLAMA to 32K tokens - Catching Up with ChatGPT
LLAMA 2 LONG outperforms the GPT-3.5-16K ChatGPT model
TL;DR: Meta has extended its LLAMA 2 models to handle long contexts, outperforming the GPT-3.5-16K model. The newly introduced models, LLAMA 2 LONG 7B, 13B, 34B, and 70B, are built around long-context continual pretraining, which removes the need for a vast volume of long texts. The smaller 7B/13B variants were trained on 32,768-token sequences and the larger 34B/70B variants on 16,384-token sequences, and performance improves significantly as the context window grows.
Business Implications
Enhanced Data Security: By self-hosting LLAMA 2 LONG models, companies can bolster data security while sustaining performance comparable to GPT, reducing reliance on external providers.
Cost-Effectiveness: Adopting LLAMA 2 LONG models is a more economical alternative to OpenAI's GPT, with expenses limited primarily to server hosting, which lowers operational costs.
Efficient Long-Context Handling: LLAMA 2 LONG's proficiency with long-context tasks lets developers build chat solutions without semantic retrievers: the model can answer questions over extensive documents directly, which streamlines development.
Meta's recent stride in Large Language Models (LLMs) lays down a milestone in extending the capabilities of open-source LLMs, particularly LLAMA 2, to handle long contexts efficiently, nudging closer to the prowess of models like GPT-4. There are 4 new models:
LLAMA 2 LONG 7B
LLAMA 2 LONG 13B
LLAMA 2 LONG 34B
LLAMA 2 LONG 70B
Expanding the Horizons of Language Models
Meta has taken a giant leap by introducing a series of long-context LLMs supporting effective context windows of up to 32,768 tokens. Through continual pretraining, LLAMA 2 has been extended with longer training sequences on a dataset where long texts were upsampled, opening a new frontier for open-source language models.
Achieving Superior Performance
Meta's research challenges the assumption that a wealth of long texts in the pre-train dataset is crucial for excelling in long-context tasks. It reveals that the method of long-context continual pretraining, not the volume of long texts, is key to achieving superior performance.
The strategy of long-context continual pretraining builds upon the existing knowledge and architecture of LLAMA 2 rather than starting from scratch with long sequences. This approach is less resource-intensive and more time-efficient, and it shows notable improvement on long-context tasks. It utilizes a dataset where long texts are upsampled, giving the LLAMA models a solid foundation for handling extended contexts efficiently.
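Meta has not released the data pipeline itself, but the upsampling idea can be sketched in a few lines. The 4,096-token threshold, the 3x weight, and the record fields below are illustrative assumptions, not Meta's actual recipe.

```python
import random

# Hypothetical document records: {"text": str, "num_tokens": int}
# The threshold and upsampling factor are illustrative assumptions,
# not the weights used by Meta.
LONG_DOC_THRESHOLD = 4_096
LONG_DOC_UPSAMPLE = 3.0  # draw long documents ~3x more often than their natural share

def build_sampling_weights(documents):
    """Assign a larger sampling weight to long documents so they are
    upsampled in the continual-pretraining mix."""
    return [
        LONG_DOC_UPSAMPLE if doc["num_tokens"] >= LONG_DOC_THRESHOLD else 1.0
        for doc in documents
    ]

def sample_batch(documents, weights, batch_size=8):
    """Sample a training batch with long documents over-represented."""
    return random.choices(documents, weights=weights, k=batch_size)

# Usage with toy data:
docs = [
    {"text": "short article ...", "num_tokens": 800},
    {"text": "book chapter ...", "num_tokens": 30_000},
]
batch = sample_batch(docs, build_sampling_weights(docs))
```

The key design point is that the base corpus stays the same; only the sampling distribution shifts toward longer documents.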
Furthermore, the research uncovers a significant relationship between context length and performance: performance continues to improve as the context length increases, up to 32,768 tokens.
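One rough way to picture this trend is a power-law-style relationship between validation loss and context length; the specific form below is an illustrative assumption on my part, not a formula quoted from Meta's report:

```latex
L(c) \approx \alpha\, c^{-\beta} + \gamma, \qquad \beta > 0
```

Here L(c) is the validation loss at context length c, and alpha, beta, gamma are fitted constants: loss keeps dropping, with diminishing returns, as the context window grows toward 32,768 tokens.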
A Glimpse into the Training Arena
The training regime was both simple and cost-effective. A total of 400 billion tokens, packed into long training sequences, were used to continually pretrain the existing LLAMA 2 model. The smaller 7B/13B variants were trained on 32,768-token sequences, while the larger 34B/70B variants used 16,384-token sequences. A modification to the positional encoding was the cornerstone that lets the model attend over longer sequences.
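The positional-encoding change amounts to slowing down the rotation in rotary position embeddings (RoPE) by raising the base frequency, so attention can still resolve tokens that sit far apart. The sketch below is a minimal NumPy illustration of RoPE with a configurable base; the 500,000 default reflects the adjusted base frequency reported for LLAMA 2 LONG, while the tensor shapes and helper names are my own.

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 500_000.0):
    """Per-dimension rotation frequencies for rotary position embeddings.
    A larger base slows the rotation, letting attention reach farther tokens."""
    exponents = np.arange(0, head_dim, 2) / head_dim
    return 1.0 / (base ** exponents)          # shape: (head_dim / 2,)

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 500_000.0):
    """Rotate query/key vectors by position-dependent angles.
    x: (seq_len, head_dim), positions: (seq_len,)."""
    freqs = rope_frequencies(x.shape[-1], base)   # (head_dim / 2,)
    angles = np.outer(positions, freqs)           # (seq_len, head_dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated = np.empty_like(x)
    rotated[..., 0::2] = x_even * cos - x_odd * sin
    rotated[..., 1::2] = x_even * sin + x_odd * cos
    return rotated

# Usage: rotate a toy query tensor across a 16,384-token position range.
q = np.random.randn(16_384, 128).astype(np.float32)
q_rot = apply_rope(q, np.arange(16_384))
```

Compared with the stock base of 10,000, the larger base keeps the rotation angles small at distant positions, which is what lets the continually pretrained model generalize to much longer sequences.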
Instruction Tuning for Long-Context Tasks
Instruction tuning emerged as a key ingredient in navigating the challenges of LLM alignment, especially under long-context scenarios. A simple yet effective approach leveraging a pre-built short-prompt dataset exhibited surprising efficacy on long-context benchmarks.
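Meta's write-up does not include the tuning code, but one way to repurpose short-prompt instruction data for long-context finetuning is to pair each short question/answer with a long document and mask the loss so only the answer tokens are trained on. Everything in the sketch, from the prompt template to the field names, is an assumption for illustration.

```python
IGNORE_INDEX = -100  # conventional label value for tokens excluded from the loss

def build_long_context_sample(tokenizer, document: str, question: str,
                              answer: str, max_len: int = 32_768):
    """Pair a long document with a short instruction-style QA example.
    Only the answer tokens contribute to the training loss."""
    prompt = f"{document}\n\nQuestion: {question}\nAnswer: "  # assumed template
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer)

    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```

The -100 label value follows the common PyTorch convention for tokens ignored by the cross-entropy loss, so the long document serves purely as context while the short answer drives the gradient.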
Comparative Analysis and Results
When pitted against benchmarks, the LLAMA 2 models trained with long sequences showed significant improvements, especially on long-context tasks. The end result was a chat model, tuned without any human-annotated long-context data, displaying stronger overall performance than gpt-3.5-turbo-16k and other open-source models across a suite of long-context benchmarks. However, against the GPT-4 32K model, the LLAMA 2 LONG 70B model fell short, indicating there's still ground to cover.
Bridging the Future: LLMs in Complex Use Cases and Beyond
The narrative of LLMs is evolving rapidly with each passing day. They now stand on the verge of serving more intricate use cases, from analyzing knowledge-rich documents to powering more genuine chat interactions. Meta's effort reflects not just a technical advancement but a step towards a future where human-digital interactions are more intuitive and enriched.
The journey of extending LLAMA 2 to 32k tokens while keeping an eye on GPT models reflects a competitive spirit driving the field towards uncharted territories. The ingenuity in training methodologies and instruction tuning, as demonstrated, not only propels LLAMA 2 closer to the prowess of GPT but also lights the way for future endeavours in the domain.