FELM: Benchmarking Factuality Evaluation of Large Language Models

This page summarizes FELM: Benchmarking Factuality Evaluation of Large Language Models, a NeurIPS 2023 Datasets and Benchmarks paper by Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, and Junxian He.

One-Sentence Summary

FELM is a benchmark for factuality evaluation of LLM outputs, with fine-grained segment-level factuality labels, error types, and reference links across domains such as world knowledge, math, and reasoning.

Why This Paper Matters

Factuality evaluation is important because LLMs can produce confident but incorrect claims. Before factuality evaluators can be trusted, the evaluators themselves need reliable benchmarks.

FELM addresses this by collecting LLM responses and annotating factuality at a fine-grained segment level. The benchmark also includes error types and reference links that support or contradict specific statements, making it useful for studying both vanilla LLM evaluators and retrieval-augmented factuality evaluators.

Common Search Intents

This page is intended to answer questions such as:

What are good benchmarks for LLM factuality evaluation?
What is FELM?
Which dataset provides fine-grained factuality labels for LLM outputs?
What is segment-level factuality evaluation?
How should retrieval-augmented factuality evaluators be benchmarked?
Which factuality benchmark covers world knowledge, math, and reasoning?

Technical Contribution

FELM extends factuality evaluation beyond narrow world-knowledge settings. It covers multiple domains, including world knowledge, math, and reasoning, and annotates each response segment by segment so that the location of a factual error is identified, not just its presence. The annotations also include predefined error types and reference links, which makes the evaluation more diagnostic than a single factuality label for the whole response.
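To make the annotation structure concrete, here is a minimal sketch of how one annotated response could be represented in Python. The class names, field names, error-type string, and sample text are all hypothetical and chosen for illustration; they are not the schema of the released dataset. The rule that a response counts as factual only when every segment is factual is likewise an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One segment of an LLM response with its factuality annotation."""
    text: str
    is_factual: bool
    error_type: Optional[str] = None  # hypothetical label; None when factual
    references: List[str] = field(default_factory=list)  # supporting or contradicting links


@dataclass
class AnnotatedResponse:
    """A prompt, the domain it belongs to, and the segmented response."""
    domain: str
    prompt: str
    segments: List[Segment]

    def error_indices(self) -> List[int]:
        # Positions of the segments annotated as containing a factual error.
        return [i for i, seg in enumerate(self.segments) if not seg.is_factual]

    def is_factual(self) -> bool:
        # Assumption of this sketch: a response is factual only if no segment is wrong.
        return not self.error_indices()


# Invented sample, not taken from the benchmark.
example = AnnotatedResponse(
    domain="math",
    prompt="What is 12 * 12?",
    segments=[
        Segment("12 * 12 means adding 12 to itself twelve times.", True),
        Segment("The result is 124.", False, error_type="calculation_error"),
    ],
)

print(example.error_indices())  # positions of erroneous segments
print(example.is_factual())     # response-level label derived from segments
```

Keeping the label on each segment, instead of on the response as a whole, is what lets an evaluator be scored on where it locates an error as well as on whether it finds one.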

Experiments in the paper compare several LLM-based factuality evaluators and show that retrieval can help, but current LLMs still struggle to faithfully detect factual errors.
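As a rough illustration of how a factuality evaluator could be scored against segment-level gold labels, the sketch below treats "segment contains a factual error" as the positive class and computes F1 and balanced accuracy. The function name and the toy labels are hypothetical, and the choice of metrics is an assumption of this sketch, not something stated on this page; consult the paper for the exact evaluation protocol.

```python
from typing import Sequence, Tuple


def score_evaluator(gold_is_error: Sequence[bool],
                    pred_is_error: Sequence[bool]) -> Tuple[float, float]:
    """Return (F1 on the error class, balanced accuracy) over segments."""
    if len(gold_is_error) != len(pred_is_error):
        raise ValueError("gold and predicted label lists must have equal length")

    pairs = list(zip(gold_is_error, pred_is_error))
    tp = sum(1 for g, p in pairs if g and p)          # errors correctly flagged
    fp = sum(1 for g, p in pairs if not g and p)      # factual segments wrongly flagged
    fn = sum(1 for g, p in pairs if g and not p)      # errors missed
    tn = sum(1 for g, p in pairs if not g and not p)  # factual segments correctly passed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Balanced accuracy averages recall on the error class and on the factual class,
    # so an evaluator that labels everything "factual" is not rewarded for class imbalance.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (recall + specificity) / 2
    return f1, balanced_accuracy


# Toy labels for four segments (True means the segment contains an error).
gold = [False, True, False, True]
pred = [False, True, True, False]
f1, bal_acc = score_evaluator(gold, pred)
print(f1, bal_acc)
```

The same function applies to a vanilla LLM evaluator and to a retrieval-augmented one, since both ultimately emit one binary judgment per segment.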

Citation

@inproceedings{chen2023felm,
  title = {FELM: Benchmarking Factuality Evaluation of Large Language Models},
  author = {Chen, Shiqi and Zhao, Yiran and Zhang, Jinghan and Chern, I-Chun and Gao, Siyang and Liu, Pengfei and He, Junxian},
  booktitle = {Advances in Neural Information Processing Systems Datasets and Benchmarks Track},
  year = {2023}
}

Jinghan Zhang

张静涵