Publications
Preprints
Can’t spoil them yet!
2025
- Luca Moroni*, Tommaso Bonomo, Luca Gioffré, Lu Xu, Domenico Fedele, Leonardo Colosi, Andrei Stefan Bejgu, Alessandro Scirè, and Roberto Navigli. 2025. What we Learned from Continually Training Minerva: a Case Study on Italian. In Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2025), pages xxx–xxx, Cagliari, Italy. CEUR Workshop Proceedings.
We explore continual pretraining strategies to improve Italian-language performance using Minerva by testing different data mixtures (mathematical, encyclopedic, and narrative) and extended context windows. We introduce INDAQA, a new Italian narrative QA benchmark, and find that both data composition and longer context significantly enhance performance on Italian tasks. We also convert the ITALIC benchmark from multiple-choice (MC) to open-ended (OE) format to disentangle whether models struggle with format adherence or with recalling cultural knowledge.
- Tommaso Bonomo*, Luca Gioffré*, and Roberto Navigli. 2025. LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages xxx–xxx, Suzhou, China. Association for Computational Linguistics.
We introduce LiteraryQA, a high-quality subset of NarrativeQA that addresses the benchmark’s reliability issues through systematic cleaning of documents and validation of question-answer pairs. Our meta-evaluation reveals that traditional n-gram metrics correlate poorly with human judgment, while LLM-based evaluation, even using smaller open-weight models, achieves strong agreement with human rankings. We provide benchmark results for state-of-the-art long-context LLMs and establish best practices for evaluating narrative question answering systems.
- Francesco Maria Molfese, Luca Moroni, Luca Gioffré, Alessandro Scirè, Simone Conia, and Roberto Navigli. 2025. Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18477–18494, Vienna, Austria. Association for Computational Linguistics.
We show that traditional MCQA evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, there exists a fundamental trade-off between constraining a model’s output to simplify answer extraction and allowing it to generate freely to improve reasoning. These findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.