1 minute read

📄 Read the full paper on ACL Proceedings or on arXiv!


22/08/2025 - Paper accepted at EMNLP 2025!

Our paper, LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA, has been accepted to the EMNLP Main Conference 2025!

👏 Huge thanks to my co-authors Tommaso Bonomo and Roberto Navigli.

See you in 苏州 (Suzhou)! 🇨🇳


Are we correctly evaluating LLMs on Narrative Understanding? 🤔

Narrative Question Answering is one of the most popular tasks for assessing the performance of LLMs on lengthy documents. The de facto standard benchmark is NarrativeQA, a dataset containing (book, summary) pairs with summary-based questions. While valuable, NarrativeQA suffers from two potential issues:

  1. Quality issues: document-summary pairs were created semi-automatically, texts were scraped from webpages, and questions were generated by non-expert annotators
  2. Evaluation issues: all metrics used on the dataset were borrowed from other, sometimes unrelated tasks (e.g., translation) and used as-is, without any proper correlation analysis

To solve these problems, we introduce LiteraryQA, a high-quality, automatically refined and manually validated subset of NarrativeQA. After fixing the data, we carry out an extensive meta-evaluation of 4 n-gram-based metrics (EM, F1, ROUGE-L, METEOR), 1 neural metric (BERTScore), and three LLMs in the LLM-as-a-Judge paradigm.
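To give a flavor of what such a meta-evaluation looks like, here is a minimal toy sketch (not the paper's actual pipeline): score candidate answers against references with an automatic metric, then correlate those scores with human judgments. The packages (`rouge-score`, `scipy`) and all data values below are illustrative assumptions.

```python
# Toy sketch of metric-vs-human meta-evaluation; not the paper's actual setup.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Hypothetical system answers, gold references, and human ratings.
answers = [
    "He hides the letter in the attic.",
    "She never leaves the island.",
    "The captain betrays his crew.",
]
references = [
    "He conceals the letter in the attic.",
    "She remains on the island forever.",
    "The first mate betrays the crew.",
]
human_scores = [5, 2, 3]  # e.g., correctness judgments on a Likert scale

# Score each answer with an n-gram metric (ROUGE-L F1 here).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
metric_scores = [
    scorer.score(ref, ans)["rougeL"].fmeasure
    for ans, ref in zip(answers, references)
]

# Correlate metric scores with human judgments.
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation with humans: {rho:.3f}")
```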

Our Key Findings:

  1. Traditional evaluation strategies poorly align with human judgments
  2. LLM-as-a-Judge shows the highest correlation, but at a computational cost
  3. METEOR is the only n-gram-based metric still usable for quick and cheap evaluation
  4. Data quality plays a crucial role and enables higher correlation with humans on all metrics, allowing small local models to surpass closed ones
  5. Evaluating an answer against two references may not be enough in the narrative domain

Our findings highlight the need for more reliable and standardized evaluation strategies to ensure fair model comparison.