
Preprints

Can’t spoil them yet! :eyes:

2025

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering
Francesco Maria Molfese, Luca Moroni, Luca Gioffré, Alessandro Scirè, Simone Conia, Roberto Navigli – ACL Findings 2025
ACL · arXiv · Hugging Face Dataset · GitHub · post

Traditional MCQA evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, there is a fundamental trade-off between constraining a model’s output to simplify answer extraction and letting it generate freely to improve its reasoning. These findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.
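
To give an idea of the trade-off, here is a minimal illustrative sketch (not the paper’s code) of the two MCQA evaluation styles: constrained scoring, which is trivial to grade but leaves no room for reasoning, and free-form generation followed by answer extraction, where the extractor itself can introduce errors.

```python
import re

CHOICES = ["A", "B", "C", "D"]

def constrained_answer(letter_logprobs: dict[str, float]) -> str:
    """Constrained setup: the model is forced to emit a single option letter,
    and we simply take the most likely one. Easy to score, but it prevents
    the model from reasoning before answering."""
    return max(CHOICES, key=lambda c: letter_logprobs.get(c, float("-inf")))

def extracted_answer(free_form_output: str) -> str | None:
    """Free-form setup: the model may reason at length, so the predicted
    option must be extracted afterwards. Rule-based extractors like this
    regex (or an LLM acting as extractor) can misread the output, which is
    one source of the inconsistencies discussed in the paper."""
    match = re.search(r"answer\s*(?:is|:)?\s*\(?([ABCD])\)?",
                      free_form_output, re.IGNORECASE)
    return match.group(1).upper() if match else None

if __name__ == "__main__":
    print(constrained_answer({"A": -2.3, "B": -0.4, "C": -3.1, "D": -5.0}))      # B
    print(extracted_answer("Let's think step by step... so the answer is (C).")) # C
```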