Publications
Preprints
Can’t spoil them yet!
2025
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering
Francesco Maria Molfese, Luca Moroni, Luca Gioffré, Alessandro Scirè, Simone Conia, Roberto Navigli
ACL Findings 2025
Traditional MCQA evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, there is a fundamental trade-off between constraining a model’s output to simplify answer extraction and allowing it to generate freely to improve reasoning. These findings highlight the need for standardized, more reliable, and consistent MCQA evaluation practices.