PDFTriage: Elevating ChatGPT's Question Answering Capabilities

Unlocking Advanced Document Insights with Enhanced QA Techniques

Sep 20, 2023

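TL;DR: PDFTriage helps LLMs answer questions over documents that are too long for their context length. It uses the Adobe Extract API to recover a document's structure and exposes functions such as fetch_pages and fetch_table, so the model can pull in only the content a question needs. This lets it handle document structure and table reasoning questions that plain-text retrieval misses. In tests with the gpt-35-turbo-0613 model, PDFTriage produced answers from fewer retrieved tokens, which makes it more efficient.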

Business Implications

By mirroring users' perceptions of documents, PDFTriage can elevate product UX, potentially leading to increased adoption and retention.
Its robust performance across varying document lengths offers businesses a versatile tool, reducing the need for multiple solutions.
Precise and efficient data extraction can provide CXOs with actionable insights, optimizing business strategies.

Extracting precise information from documents is a core use case for Large Language Models (LLMs), but they face challenges when a document is longer than their context length. This limitation often leads to inefficiencies in document question answering (QA).

The Problem with Current LLM Approaches

LLMs, despite their prowess, falter when the document doesn't fit within their context length. The prevailing solution has been to retrieve relevant contexts from the document and present them as plain text.

However, this approach overlooks the inherent structure of many documents. When users think of a PDF or a webpage, they visualize pages, tables, and sections. Representing these as mere text creates a disconnect between the user's mental model and the system's representation. This incongruity becomes glaringly evident when seemingly simple questions stump the QA system.

For instance:

"Can you summarize the key takeaways from pages 5-7?"
"What year [in Table 3] has the maximum revenue?"

Both questions require an understanding of the document's structure, something that plain text representation lacks.

Introducing PDFTriage: Bridging the Gap

PDFTriage is an approach that enables models to retrieve context based on either the document's structure or its content. By giving models access to a document's structural metadata, it can handle a variety of questions that stump plain retrieval-augmented LLMs.

How PDFTriage Works

PDFTriage works in three steps:

Generate Document Metadata: Using the Adobe Extract API, PDFs are transformed into an HTML-like tree. This tree, rich with metadata like section titles, tables, and figures, is then parsed to extract valuable structural information.
LLM-based Triage: Given the question and the document metadata, the LLM calls retrieval functions to select the precise content it needs.
Answer Using Retrieved Content: With the relevant context retrieved, the LLM generates a comprehensive answer.
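
The first step can be sketched as follows. This is a minimal illustration, not the Adobe Extract API's actual output format: the Element class, the build_metadata and render_metadata helpers, and the sample elements are all hypothetical stand-ins for an already-parsed document tree.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One node of a simplified, hypothetical document tree."""
    kind: str   # "heading", "paragraph", "table", or "figure"
    page: int   # 1-based page number
    text: str   # heading title, caption, or body text


def build_metadata(elements):
    """Collect the structural facts an LLM would need to plan its retrieval."""
    return {
        "num_pages": max(e.page for e in elements),
        "sections": [(e.text, e.page) for e in elements if e.kind == "heading"],
        "tables": [(e.text, e.page) for e in elements if e.kind == "table"],
        "figures": [(e.text, e.page) for e in elements if e.kind == "figure"],
    }


def render_metadata(meta):
    """Turn the metadata into short text lines that could be placed in a prompt."""
    lines = [f"Pages: {meta['num_pages']}"]
    for label in ("sections", "tables", "figures"):
        for text, page in meta[label]:
            # label[:-1] drops the plural "s": "sections" -> "section"
            lines.append(f"{label[:-1]} (page {page}): {text}")
    return "\n".join(lines)


# Sample data, invented for illustration only.
elements = [
    Element("heading", 1, "Overview"),
    Element("paragraph", 1, "The company grew steadily."),
    Element("heading", 2, "Financials"),
    Element("table", 2, "Table 3: Revenue by year"),
    Element("figure", 3, "Figure 1: Projected growth"),
]

meta = build_metadata(elements)
print(render_metadata(meta))
```

The point of the sketch is that the model first sees a compact outline (pages, section titles, captions) rather than the full text, and only later asks for the parts it needs.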

PDFTriage employs five functions to achieve this:

fetch_pages: Retrieves text from specified pages.
fetch_sections: Extracts text from a given section.
fetch_table: Gathers text from a specified table caption.
fetch_figure: Obtains text surrounding a particular figure caption.
retrieve: Issues a natural language query over the document, fetching pertinent chunks.
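
A toy version of these five functions might look like the sketch below. Only the function names come from the list above; the parameter names, the Element class, the DOC sample data, and the matching logic are assumptions made for illustration. In particular, retrieve here ranks elements by simple word overlap, a stand-in for whatever retrieval method a real system would use.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    kind: str      # "heading", "paragraph", "table", or "figure"
    page: int
    section: str   # title of the section the element belongs to
    text: str


# Sample document, invented for illustration only.
DOC = [
    Element("heading", 1, "Overview", "Overview"),
    Element("paragraph", 1, "Overview", "The company grew steadily over three years."),
    Element("table", 2, "Financials", "Table 3: Revenue by year. 2020: 10; 2021: 14; 2022: 12"),
    Element("paragraph", 2, "Financials", "Revenue peaked before costs rose."),
    Element("figure", 3, "Outlook", "Figure 1: Projected growth."),
    Element("paragraph", 3, "Outlook", "Growth is expected to continue."),
]


def _words(text):
    """Lowercased word set, used for the toy overlap score."""
    return set(re.findall(r"\w+", text.lower()))


def fetch_pages(pages):
    """Text of every element on the requested pages."""
    return "\n".join(e.text for e in DOC if e.page in pages)


def fetch_sections(sections):
    """Text of every element under the requested section titles."""
    return "\n".join(e.text for e in DOC if e.section in sections)


def fetch_table(query):
    """Text of tables whose caption contains the query string."""
    return "\n".join(
        e.text for e in DOC if e.kind == "table" and query.lower() in e.text.lower()
    )


def fetch_figure(query):
    """Text on the same page as figures matching the query (assumed meaning of 'surrounding')."""
    pages = {
        e.page for e in DOC if e.kind == "figure" and query.lower() in e.text.lower()
    }
    return "\n".join(e.text for e in DOC if e.page in pages)


def retrieve(query, top_k=2):
    """Top-k elements by word overlap with the query (a toy ranking)."""
    q = _words(query)
    ranked = sorted(DOC, key=lambda e: len(q & _words(e.text)), reverse=True)
    return "\n".join(e.text for e in ranked[:top_k])


print(fetch_table("Table 3"))
```

With structure preserved this way, a question such as "What year [in Table 3] has the maximum revenue?" maps directly to fetch_table, instead of relying on a text search that happens to surface the right chunk.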

An LLM such as GPT calls one or more of these functions to gather the pieces of information it needs, then synthesises them into the final answer.
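
The calling pattern can be sketched as a small dispatcher. In this sketch a hard-coded list of dictionaries stands in for the function calls a model would emit, and no model API is contacted. The call shape (a name plus a JSON string of arguments), the REGISTRY table, the dispatch and triage helpers, and the PAGES data are all assumptions for illustration.

```python
import json

# Sample page text, invented for illustration only.
PAGES = {1: "Intro text.", 2: "Methods text.", 3: "Results text."}


def fetch_pages(pages):
    """Return the text of the requested pages, skipping pages that do not exist."""
    return "\n".join(PAGES[p] for p in pages if p in PAGES)


# Maps function names the model may request to local Python callables.
REGISTRY = {"fetch_pages": fetch_pages}


def dispatch(call):
    """Run one requested call; return an error string instead of raising."""
    name = call["name"]
    if name not in REGISTRY:
        return f"Error: unknown function {name!r}"
    try:
        args = json.loads(call["arguments"])
        return REGISTRY[name](**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names are reported back as text.
        return f"Error: bad arguments for {name}: {exc}"


def triage(calls):
    """Execute each call in order and join the results into one context string."""
    return "\n".join(dispatch(call) for call in calls)


# Hard-coded stand-ins for what a model might request.
calls = [
    {"name": "fetch_pages", "arguments": '{"pages": [2, 3]}'},
]
context = triage(calls)
print(context)
```

Returning errors as text rather than raising is a design choice in this sketch: the message could be fed back to the model so it can retry with corrected arguments.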

PDFTriage in Action: Testing and Results

To validate PDFTriage's capabilities, a dataset of roughly 900 human-written questions over 90 documents was curated. The questions covered categories such as "document structure questions," "table reasoning questions," and even "trick questions". PDFTriage was tested with the gpt-35-turbo-0613 model.

The results were illuminating. Human evaluators consistently favoured PDFTriage over traditional retrieval methods. Specifically, PDFTriage was preferred 50.7% of the time, outperforming both Page Retrieval and Chunk Retrieval methods.

Moreover, PDFTriage showcased its efficiency by requiring fewer tokens to produce superior answers. Impressively, the length of the document had a negligible effect on PDFTriage's performance, underscoring its adaptability to both short and long documents.

Paper Link

Benefits of PDFTriage

PDFTriage's approach aligns seamlessly with the user's perception of structured documents. By recognizing and utilizing a document's structure, it offers:

Enhanced answer quality and accuracy.
Improved readability and informativeness.
Efficient answers with fewer retrieved tokens.
Consistent performance across varying document lengths.

Conclusion

PDFTriage shows how document QA can improve when the model sees documents the way users do. By aligning more closely with users' perceptions of structured documents, it offers a more intuitive and efficient approach to information retrieval, and it points toward QA systems that are more accurate, efficient, and user-friendly.

For those keen on harnessing the power of advanced QA systems, it's time to delve deeper into PDFTriage. Its applications are vast, and its promise is undeniable. Embrace the future of structured document querying today.