1 minute read

📄 Read the full paper on ACL Proceedings or on arXiv!


22/08/2025 - Paper accepted at EMNLP 2025!

Our paper, LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA, has been accepted to the EMNLP Main Conference 2025!

👏 Huge thanks to my co-authors Tommaso Bonomo and Roberto Navigli.

See you in 苏州 (Suzhou)! 🇨🇳


Are we correctly evaluating LLMs on Narrative Understanding? 🤔

Narrative Question Answering is one of the most popular tasks for assessing the performance of LLMs on lengthy documents. The de facto standard benchmark is NarrativeQA, a dataset containing (book, summary) pairs with summary-based questions. While valuable, NarrativeQA suffers from two potential issues:

  1. Quality issues: document-summary pairs were created semi-automatically, texts were scraped from webpages, and questions were generated by non-expert annotators
  2. Evaluation issues: all metrics used on the dataset were borrowed from other, sometimes unrelated tasks (e.g., translation) and used as-is, without any proper correlation analysis

To solve these problems, we introduce LiteraryQA, a high-quality, automatically refined and manually validated subset of NarrativeQA. After fixing the data, we carry out an extensive meta-evaluation of 4 n-gram-based metrics (EM, F1, ROUGE-L, METEOR), 1 neural metric (BERTScore), and three LLMs in the LLM-as-a-Judge paradigm.
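To give a flavor of what such a meta-evaluation looks like, here is a minimal toy sketch (not the paper's actual pipeline): score candidate answers against references with an automatic metric, then correlate those scores with human judgments. The packages (`rouge-score`, `scipy`) and all data values below are illustrative assumptions.

```python
# Toy sketch of metric-vs-human meta-evaluation; not the paper's actual setup.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Hypothetical system answers, gold references, and human ratings.
answers = [
    "He hides the letter in the attic.",
    "She never leaves the island.",
    "The captain betrays his crew.",
]
references = [
    "He conceals the letter in the attic.",
    "She remains on the island forever.",
    "The first mate betrays the crew.",
]
human_scores = [5, 2, 3]  # e.g., correctness judgments on a Likert scale

# Score each answer with an n-gram metric (ROUGE-L F1 here).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
metric_scores = [
    scorer.score(ref, ans)["rougeL"].fmeasure
    for ans, ref in zip(answers, references)
]

# Correlate metric scores with human judgments.
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation with humans: {rho:.3f}")
```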

Our Key Findings:

  1. Traditional evaluation strategies poorly align with human judgments
  2. LLM-as-a-Judge shows the highest correlation, but at a computational cost
  3. METEOR is the only n-gram-based metric still usable for quick and cheap evaluation
  4. Data quality plays a crucial role and enables higher correlation with humans on all metrics, allowing small local models to surpass closed ones
  5. Evaluating an answer against two references may not be enough in the narrative domain

Our findings highlight the need for more reliable and standardized evaluation strategies to ensure fair model comparison.