
đź“„ Read the full paper on ACL Proceedings or on arXiv!

Conference Dataset

22/08/2025 - Paper accepted at CLiC-it 2025!

Our paper, What we Learned from Continually Training Minerva: a Case Study on Italian, has been accepted to the CLiC-it Conference 2025!

👏 Huge thanks to my co-authors Luca Moroni*, Tommaso Bonomo, Lu Xu, Domenico Fedele, Leonardo Colosi, Andrei Stefan Bejgu, Alessandro Scirè and Roberto Navigli.

See you in Cagliari! 🇮🇹

CLiC-it 2025 Logo

What We Learned from Continually Training Minerva: Insights for Italian LLM Development

Training large language models for less-represented languages presents unique challenges. In this work, we investigated how different data recipes and context length extensions affect Italian LLM performance.

We used Minerva-7B, a fully open-source bilingual model pretrained on 50% Italian and 50% English content, to test three data recipes during continual pretraining: mathematical, encyclopedic, and copyrighted literary content, in both Italian and English. We also explored extending the model’s context window to handle longer documents.
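As an aside, a common recipe for extending a pretrained model's context window (the paper does not specify that Minerva used exactly this) is to raise the base frequency of its rotary position embeddings (RoPE), so that positions far beyond the original training length still receive distinct, slowly varying rotation angles. A minimal sketch of that idea:

```python
def rope_frequencies(head_dim: int, base: float = 10_000.0) -> list[float]:
    """Per-pair rotation frequencies used by rotary position embeddings (RoPE).

    Each pair of dimensions i rotates at rate base^(-2i / head_dim);
    the angle at position p is p * frequency.
    """
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Illustrative values only: raising the base ("theta") lowers every
# frequency, so rotations complete more slowly and longer sequences
# stay within well-behaved angle ranges -- the intuition behind
# base-frequency scaling for context extension.
short_ctx = rope_frequencies(head_dim=128, base=10_000.0)
long_ctx = rope_frequencies(head_dim=128, base=500_000.0)

# Every dimension rotates no faster under the larger base.
assert all(lo <= hi for lo, hi in zip(long_ctx, short_ctx))
```

The `head_dim` and base values here are generic illustrations, not Minerva's actual configuration.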

To evaluate long-context understanding, we created INDAQA, the Italian Narrative Dataset for Question-Answering, the first narrative long-context benchmark for Italian.

Our Key Findings:

  1. Context Extension Beats Brute Force: Extending Minerva’s context window to handle chapter- or book-length texts achieved state-of-the-art performance on long Italian documents. Our models outperformed both Italian-adapted models fine-tuned from English foundations and models trained on trillions more tokens. The takeaway: strategic continual pretraining on well-designed Italian data can compete with, and even surpass, the brute-force approach of adapting massive English-centric models.
  2. Multiple-Choice Tests Mislead on Cultural Knowledge: When we tested cultural knowledge with multiple-choice questions, the results were misleading: models could score well through pattern matching, without genuine understanding. With open-ended question answering, where models generate free-form responses, Minerva excelled and surpassed all competitors. For a fair evaluation of language-specific capabilities, we need formats that truly test comprehension and generation.

We contribute INDAQA to the community and demonstrate the importance of evaluation format when assessing language-specific models.