Fact Checking ChatGPT: Using FacTool to Detect Factual Errors
FacTool - A Framework Designed to Detect Factual Errors in LLM Outputs.
The rise of Large Language Models (LLMs) like ChatGPT has revolutionized the AI domain. These models offer high-quality text outputs, but they're not without challenges. Key among these is the issue of factual errors in generated texts. This article delves into the world of FacTool, a beacon in the murky waters of LLM-generated content.
Challenges and Limitations of LLM-Generated Texts
LLMs have ushered in a new age of content creation, handling a vast array of tasks from question answering to code generation. However, with their prowess come certain limitations:
A heightened risk of factual inconsistencies across diverse tasks.
Outputs that, while detailed, make it hard to isolate individual facts, since the boundaries between claims are rarely explicit.
A notable absence of concrete sources/evidence accompanying the generated content.
An inherent tendency to produce text that, while sounding credible, might be riddled with inaccuracies.
These limitations are especially critical in high-stakes domains like healthcare, finance, and law, emphasizing the dire need for rigorous fact-checking.
The FacTool Solution
Addressing the above challenges is FacTool — a domain-agnostic framework designed to detect factual errors in texts produced by LLMs. Think of FacTool as a guardian, ensuring the veracity of every piece of information generated by models like ChatGPT.
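FacTool is also released as an open-source Python package. The snippet below is a minimal usage sketch based on my recollection of the project's README: the package name factool, the Factool class, the run() method, and the input dictionary format are assumptions here, so check the repository (and supply the required API keys for the LLM and search backends) before relying on the exact names.

```python
# Minimal usage sketch; names and input format are assumptions based on the
# FacTool README and may differ from the released package.
from factool import Factool

# The foundation model that drives claim extraction and verification.
checker = Factool("gpt-4")

inputs = [
    {
        "prompt": "Who has won the most FIFA World Cups?",
        "response": "Argentina has not won the World Cup since 1986.",
        "category": "kbqa",  # knowledge-based QA; other categories cover code, math, scientific claims
    },
]

# Each result is expected to contain the extracted claims, the retrieved
# evidence, and a per-claim factuality verdict.
results = checker.run(inputs)
print(results)
```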
Tested Domains for FacTool
FacTool's versatility shines through its evaluation across four tasks:
Knowledge-based QA: Validating the accuracy of answers.
Code Generation: Verifying the correctness of generated code (see the sketch after this list).
Mathematical Reasoning: Checking that calculations and problem-solving steps are accurate.
Scientific Literature Review: Validating the credibility of AI-generated citations and reviews.
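Verification looks different in each domain. For code generation, for example, the FacTool paper checks a generated snippet by executing it against synthesized test cases rather than by searching the web. The sketch below illustrates that idea in plain Python; the buggy function, the test cases, and the helper name are invented for illustration and are not FacTool's internals.

```python
# Illustrative only: treat a generated snippet as a "claim" and verify it by
# executing synthesized test cases. All names here are hypothetical.
generated_code = """
def add_numbers(a, b):
    return a - b   # buggy: should be a + b
"""

test_cases = [
    ((2, 3), 5),
    ((0, 0), 0),
    ((-1, 1), 0),
]

def verify_code_claim(code: str, tests) -> bool:
    namespace: dict = {}
    exec(code, namespace)              # run the snippet in an isolated namespace
    func = namespace["add_numbers"]
    return all(func(*args) == expected for args, expected in tests)

print(verify_code_claim(generated_code, test_cases))  # False: the snippet fails its tests
```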
How FacTool Works
Harnessing the power of tool augmentation, FacTool operates through four steps (a code sketch follows this list):
Claim extraction: An LLM, such as ChatGPT, extracts claims from the generated response. Claims are the individual, checkable statements the text makes. For instance, in Knowledge-based QA each claim is an atomic fact stated in the answer, while in code generation every snippet becomes a claim to verify.
Query generation: Each claim is transformed into queries that can be sent to external tools to retrieve supporting or contradicting evidence.
Tool querying & evidence collection: Relevant evidence is collected by sending the generated queries to external tools such as Google Search, a Python interpreter, or the Google Scholar API.
Agreement verification: The final step presents the claim and the collected evidence back to the LLM, which judges whether the evidence supports the claim.
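Put together, a stripped-down version of this pipeline for Knowledge-based QA might look like the sketch below. It is not FacTool's implementation: the prompts are abbreviated, search_web() is a placeholder for whatever search backend is wired in, and chat() wraps an OpenAI-style chat-completion call; all of these names are assumptions for illustration.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(prompt: str) -> str:
    """Single-turn helper around an OpenAI-style chat completion."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def search_web(query: str) -> str:
    """Placeholder for an external search tool (e.g. a Google Search API)."""
    raise NotImplementedError("plug in your search backend here")

def fact_check(response: str) -> list[dict]:
    # 1. Claim extraction: ask the LLM to break the response into checkable claims.
    claims = json.loads(chat(
        "List the atomic factual claims in the text below as a JSON array of strings.\n\n"
        + response
    ))

    results = []
    for claim in claims:
        # 2. Query generation: turn each claim into a search query.
        query = chat("Write a short web search query to verify this claim: " + claim)

        # 3. Tool querying & evidence collection.
        evidence = search_web(query)

        # 4. Agreement verification: judge the claim against the evidence only.
        verdict = chat(
            f"Claim: {claim}\nEvidence: {evidence}\n"
            "Answer SUPPORTED or REFUTED based only on the evidence."
        )
        results.append({"claim": claim, "evidence": evidence, "verdict": verdict})
    return results
```

In FacTool itself, the code, mathematics, and scientific-literature pipelines swap the search step for tools such as a Python interpreter or the Google Scholar API, as described above.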
FacTool's Stellar Performance
Benchmarked against self-check baselines such as 3-shot chain-of-thought prompting, FacTool, especially when powered by GPT-4, consistently outperformed them. Key findings include:
Superior performance in scientific literature review tasks.
Enhanced sensitivity in error detection compared to self-check chain-of-thought methods.
In Knowledge-based QA, FacTool debunked false claims, such as "Argentina has not won the World Cup since 1986" (Argentina won again in 2022).
In code generation, it surpassed other baselines, showcasing its capability in technical domains.
Both the GPT-4 and ChatGPT versions of FacTool excelled in mathematical reasoning, identifying errors in calculations (a minimal illustration follows below).
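For the mathematical-reasoning case, verification can lean on the Python interpreter mentioned earlier: a calculation claim is checked by executing the arithmetic and comparing results. The claim format and helper below are illustrative, not FacTool's actual code.

```python
# Illustrative check of a calculation claim of the form "expression = value".
def verify_calculation(claim: str) -> bool:
    expression, expected = claim.split("=")
    # eval() is tolerable here only because the input is simple arithmetic;
    # a real checker would parse the expression safely.
    return eval(expression.strip()) == float(expected.strip())

print(verify_calculation("17 * 23 = 391"))  # True: the calculation holds
print(verify_calculation("17 * 23 = 401"))  # False: flagged as a calculation error
```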
Conclusion
With the world leaning heavily on LLMs for varied tasks, the importance of tools like FacTool cannot be overstated. Ensuring the factuality of generated content paves the way for a future where we can trust AI-generated outputs without second-guessing.
For those venturing into the world of LLMs, integrating FacTool can be a game-changer. Especially in critical domains, it's more than a tool—it's a necessity. As generative AI continues to evolve, ensuring the accuracy of generated content will remain paramount, and FacTool is leading the charge in this endeavour.