FAITH Hallucination Benchmark

ICAIF '25 · Singapore

FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in Finance

A domain-specific benchmark and methodology for measuring hallucinations in large language models tasked with financial table reasoning.

Companies 9 (pilot) · 453 (main)

Sampled from S&P 500 annual reports (Item 7) with paired financial tables.

Answerable spans 300 · 2,406

Context-aware masked spans validated for uniqueness, consistency, and answerability.

Reasoning scenarios 4

Direct lookup, comparative, bivariate, and multivariate calculations over tables.

Key Idea

Evaluate intrinsic hallucinations by asking LLMs to recover masked spans in financial narratives, grounded by the surrounding tables and context sentences.

Annotation Pipeline

  • Mask spans that satisfy uniqueness and consistency.
  • Filter with LLM-based answerability annotation.
  • Normalize units and numerics for precision-relaxed scoring.

Abstract

Reliable tabular reasoning is central to financial AI

Financial applications demand trustworthy numerical reasoning. FAITH introduces a rigorous, scalable benchmark that probes intrinsic hallucinations in LLMs by framing evaluation as masked span prediction over real 10-K filings. The framework joins structured table context with neighboring sentences, revealing how models mishandle values, units, and cross-table calculations.

Contributions

  • Automated dataset creation via principled masking of financial narratives.
  • New hallucination benchmark built from S&P 500 annual reports.
  • Cross-model study of intrinsic hallucination patterns across reasoning scenarios.

Methodology

Masked spans grounded by tables and surrounding text

Task Definition

Context-aware masked span recovery

Each instance replaces a numerical span in a sentence with [MASK]. The model must restore the value using: the source table, its pre-text, and the preceding/following sentences.
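The masking step itself is simple to sketch. A minimal illustration (the helper name and the character offsets are ours, not from the released code):

```python
# Illustrative sketch of FAITH-style span masking: replace a character
# span in a sentence with the [MASK] token. Offsets follow Python's
# half-open slicing convention, as in the dataset's "mask_span" field.

def mask_span(sentence: str, start: int, end: int, token: str = "[MASK]") -> str:
    """Return the sentence with sentence[start:end] replaced by the mask token."""
    return sentence[:start] + token + sentence[end:]

sentence = "Total reserves were 55.5 MMBOE at year end."
masked = mask_span(sentence, 20, 30)
# masked == "Total reserves were [MASK] at year end."
```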

Design Guardrails

Uniqueness · Consistency · Answerability

Spans are retained only if they have a unique solution, align with nearby evidence, and are solvable from the provided context—preventing over-penalization from ambiguous phrasing.

Reasoning Profiles

From lookups to multivariate logic

Instances cover four reasoning levels: direct lookup, comparative change, bivariate ratios, and multi-step calculations. Models self-report a scenario label for each response; labels are aggregated over correct answers.

Task formulation illustration from the paper, showing how spans are masked and recovered with surrounding context and linked tables.

Dataset

Built from Item 7 sections of S&P 500 10-K filings

Pilot vs. Main split

| Metric                  | Pilot        | Main         |
|-------------------------|--------------|--------------|
| Companies               | 9            | 453          |
| Avg. context length     | 14,148 chars | 12,843 chars |
| Avg. tables per filing  | 14.9         | 19.2         |
| Sentences               | 164          | 1,122        |
| Answerable spans        | 300          | 2,406        |

Answerability validated with GPT-4.1, Claude Sonnet, and Gemini Pro; human audit on the pilot split.

Instance schema

{
  "metadata": {"cik": "1800", "filing_date": "2024-02-16"},
  "tables": [{"pre_text": "The following table...", "cells": [["Header1", "Header2"]]}],
  "instances": [{
    "uid": "...",
    "masked_sentence": "Total reserves were [MASK] ...",
    "ground_truth": "55.5 MMBOE",
    "mask_type": "A",
    "mask_span": [182, 192]
  }]
}
  • Tables keep both structure and lead-in text to preserve reporting context.
  • Instances capture neighboring sentences to anchor reasoning and reduce spurious spans.
  • Unit groups handle currencies, magnitudes (k/mn/bn), percentages, energy units, and rate postfixes.
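A record in this schema can be consumed directly. A minimal sketch that pairs an instance with its tables to build model context (the `build_context` helper and the `ex-1` uid are illustrative, not part of the release):

```python
import json

# Parse a FAITH-style record (structure as in the schema above) and join
# each table's lead-in text and cells with the masked sentence. In practice
# you would json.load a dataset file instead of an inline literal.
record = json.loads("""{
  "metadata": {"cik": "1800", "filing_date": "2024-02-16"},
  "tables": [{"pre_text": "The following table...",
              "cells": [["Header1", "Header2"]]}],
  "instances": [{"uid": "ex-1",
                 "masked_sentence": "Total reserves were [MASK] ...",
                 "ground_truth": "55.5 MMBOE",
                 "mask_type": "A",
                 "mask_span": [182, 192]}]
}""")

def build_context(rec: dict, inst: dict) -> str:
    """Concatenate table pre-text and CSV-style cells, then the masked sentence."""
    parts = [t["pre_text"] + "\n" + "\n".join(",".join(row) for row in t["cells"])
             for t in rec["tables"]]
    parts.append(inst["masked_sentence"])
    return "\n\n".join(parts)

context = build_context(record, record["instances"][0])
```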

Evaluation

Precision-relaxed scoring for financial numerics

Numeric Matching

Normalize values by parsing magnitudes (thousand/million/billion), determine precision from least significant digits, and compare at the coarser precision to avoid unfair rounding penalties.
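The comparison rule above can be sketched as follows; the suffix map and function names are our assumptions, not the released implementation:

```python
import re
from decimal import Decimal

# Hedged sketch of precision-relaxed numeric matching: scale magnitude
# words to absolute values, find the place value of each number's least
# significant digit, then compare after rounding both to the coarser place.

_MAGNITUDES = {"thousand": 3, "k": 3, "million": 6, "mn": 6,
               "billion": 9, "bn": 9}

def _parse(text: str) -> tuple[Decimal, int]:
    """Return (absolute value, exponent of the least significant digit)."""
    m = re.match(r"\$?\s*([\d,]+(?:\.\d+)?)\s*([a-zA-Z]*)", text.strip())
    if m is None:
        raise ValueError(f"unparseable numeric: {text!r}")
    digits = m.group(1).replace(",", "")
    exp = _MAGNITUDES.get(m.group(2).lower(), 0)
    decimals = len(digits.split(".")[1]) if "." in digits else 0
    return Decimal(digits) * Decimal(10) ** exp, exp - decimals

def numeric_match(a: str, b: str) -> bool:
    """True if a and b agree once both are rounded to the coarser precision."""
    va, pa = _parse(a)
    vb, pb = _parse(b)
    scale = Decimal(10) ** max(pa, pb)  # coarser of the two precisions
    return (va / scale).quantize(Decimal(1)) == (vb / scale).quantize(Decimal(1))
```

Under this rule, "1.23 billion" (precise to 0.01 billion) matches "$1,230 million" (precise to 1 million) because both round to the same value at the coarser 10-million place.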

Unit Matching

Aliases for currencies, percentages, basis points, energy units, and rate postfixes are grouped so equivalent expressions—e.g., “USD 1.23 billion” vs. “$1,230 million”—are treated consistently.

Model Study

Benchmarked GPT-4.1, Claude Sonnet-4, and Gemini 2.5 Pro on both pilot and main splits, with scenario-level breakdowns to surface where reasoning depth drives intrinsic hallucinations.

Reproduce

Use FAITH in your own evaluation loop

1) Generate prompts

python src/formulate_prompt.py \
  --dataset_path data/main.json \
  --prompt_template_path src/prompt.yaml \
  --unit_group_path src/unit_groups.yaml \
  --table_format csv \
  --output_path data/main_prompt.jsonl

Outputs JSONL with system/user prompts and ground truth spans for direct inference.
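Between the two scripts you supply your own inference step. A minimal sketch of that loop, assuming each prompt record carries `uid`, `system`, and `user` fields (field names are an assumption; check the actual output of `formulate_prompt.py`):

```python
import json

def call_model(system: str, user: str) -> str:
    # Stub: replace with a real LLM client call. Returns a constant here
    # purely so the loop is runnable end to end.
    return "55.5 MMBOE"

def run_inference(prompt_path: str, prediction_path: str) -> None:
    """Read prompt records line by line, query the model, write one
    {"uid", "prediction"} JSON object per line for eval.py to consume."""
    with open(prompt_path) as fin, open(prediction_path, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            pred = call_model(rec.get("system", ""), rec.get("user", ""))
            fout.write(json.dumps({"uid": rec["uid"], "prediction": pred}) + "\n")
```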

2) Evaluate predictions

python src/eval.py \
  --dataset_path data/main_prompt.jsonl \
  --prediction_path data/main_prediction.jsonl \
  --unit_group_path src/unit_groups.yaml

Reports accuracy with numeric tolerance and unit normalization; configure unit_groups.yaml to extend domain coverage.

Team

Asian Institute of Digital Finance · National University of Singapore

Mengao Zhang Jiayu Fu Tanya Warrier Yuwen Wang Tianhui Tan Ke-Wei Huang

Citation

Reference FAITH in your work

The FAITH dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

For more information, please contact the author: Mengao Zhang mengaoz@nus.edu.sg.

Please cite our work if you use our dataset or code.

@inproceedings{faith2025,
  title        = {FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in Finance},
  author       = {Mengao Zhang and Jiayu Fu and Tanya Warrier and Yuwen Wang and Tianhui Tan and Ke-wei Huang},
  booktitle    = {ACM International Conference on AI in Finance (ICAIF)},
  year         = {2025},
  url          = {https://www.arxiv.org/abs/2508.05201}
}