Google DeepMind has unveiled SAFE, an AI-based system designed to fact-check the outputs of large language models (LLMs) such as ChatGPT.

The new system aims to address the persistent accuracy problems that plague LLM-generated content.


Google DeepMind's Search-Augmented Factuality Evaluator (SAFE)

LLMs, celebrated for their ability to generate text, answer questions, and tackle mathematical problems, have long been criticized for their shaky factual accuracy.

Verifying LLM-generated content typically requires manual scrutiny, a burden that significantly diminishes its reliability and utility, according to the research team.

SAFE, short for Search-Augmented Factuality Evaluator, performs its fact-checking by using an LLM to scrutinize responses and cross-reference them against search engine results for verification.

This methodology mirrors the fact-checking process adopted by human users who use search engines to corroborate information.

To assess its effectiveness, the DeepMind team subjected SAFE to rigorous testing, fact-checking approximately 16,000 assertions derived from multiple LLMs. Comparative analysis against human fact-checkers revealed that SAFE aligned with human assessments 72% of the time. 

Notably, when discrepancies arose between SAFE and human evaluators, SAFE emerged as the more accurate judge in 76% of cases.

DeepMind has made the SAFE code openly accessible on GitHub, inviting broader utilization of its fact-checking capabilities within the AI community.

"SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results," the researchers wrote.


Employing an LLM in SAFE

DeepMind's process involves employing an LLM, such as GPT-4, to deconstruct long-form responses into individual facts. These facts are then subjected to a multi-step evaluation process, wherein search queries are dispatched to Google Search to ascertain factual accuracy based on search results.
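
A minimal sketch of that pipeline in Python might look like the following. This is an illustrative outline of the steps described above, not DeepMind's released code; call_llm and google_search are hypothetical placeholders for the LLM (e.g., GPT-4) and Google Search calls, and the prompts are invented for illustration.

```python
# Illustrative sketch of the SAFE-style pipeline described above.
# call_llm() and google_search() are placeholder stubs, not real APIs.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("wire this to an actual LLM API")

def google_search(query: str) -> str:
    """Placeholder: run a web search and return a snippet of results."""
    raise NotImplementedError("wire this to an actual search API")

def split_into_facts(response: str) -> list[str]:
    """Step 1: ask the LLM to break a long-form response into individual facts."""
    prompt = ("List each individual factual claim in the text below, "
              f"one per line:\n\n{response}")
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

def check_fact(fact: str, max_steps: int = 5) -> bool:
    """Step 2: multi-step check -- issue search queries, then judge support."""
    evidence: list[str] = []
    for _ in range(max_steps):
        query = call_llm(f"Write a Google Search query to verify: {fact}\n"
                         f"Evidence gathered so far: {evidence}")
        evidence.append(google_search(query))
    verdict = call_llm(f"Fact: {fact}\nSearch results: {evidence}\n"
                       "Answer 'supported' or 'not supported'.")
    return "not supported" not in verdict.lower()

def rate_response(response: str) -> dict[str, bool]:
    """Label every extracted fact in a response as supported or not."""
    return {fact: check_fact(fact) for fact in split_into_facts(response)}
```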

Moreover, DeepMind proposes extending the F1 score into an aggregate metric for long-form factuality. The metric balances precision, the percentage of a response's facts that are supported, against recall, measured relative to a hyperparameter representing a user's preferred number of supported facts in a response.
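
A minimal sketch of that idea, assuming precision is the fraction of a response's facts that are supported and recall is the supported count capped relative to a hyperparameter K, could be computed as follows (the function name and the example numbers are illustrative, not taken from the paper):

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Aggregate long-form factuality score (sketch of the extended-F1 idea).

    precision: share of the response's facts that are supported.
    recall:    supported facts relative to K, the hyperparameter for how
               many supported facts a user wants in a response.
    """
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# Example: 45 supported facts, 5 unsupported, K = 64 desired supported facts.
print(round(f1_at_k(45, 5, 64), 3))  # prints 0.789
```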

Empirical testing showcased the potential of LLM agents to achieve superhuman performance in fact-checking tasks. Across a dataset comprising approximately 16,000 individual facts, SAFE's alignment with human annotators stood at an impressive 72%. 

Furthermore, in a subset of 100 contentious cases, SAFE demonstrated a superior accuracy rate of 76% compared to human evaluators.

The research team also notes that SAFE presents a cost-effective alternative to human annotators, coming in at more than 20 times cheaper while maintaining robust performance.

Additionally, benchmarking across thirteen language models underscored the correlation between model size and factuality, with larger models generally outperforming smaller ones.

The DeepMind team's findings are detailed in a paper posted to the preprint server arXiv, and the SAFE code is available as open source on GitHub.

