
Spanish Government promotes open access to its ALIA AI models


Published on: 22/04/2025

ALIA is the public AI infrastructure for Spanish and the co-official languages of Spain (Catalan, Basque and Galician), developed by the Spanish government. The project aims to facilitate the creation of IT solutions and services and to promote the use of Spain’s official and co-official languages. Six years after the ALIA project started, the Spanish authorities have shared its AI models with the general public, following the ‘Public Money, Public Code’ approach.

Public and open AI project

The Secretary of State for Digitalisation and Artificial Intelligence leads the project, while the Barcelona Supercomputing Centre - Centro Nacional de Supercomputación (BSC-CNS) coordinates it. The ALIA project is one of the main deliverables of the Artificial Intelligence Strategy 2024, which set out, as one of its primary strategic objectives, the development of foundation and language models that aim to:

‘generate ethical and trustworthy AI standards, with open-source and transparent models, guaranteeing the protection of fundamental rights, the protection of intellectual property rights and the protection of personal data, and developing a framework of best practices in this field’.

The Spanish government announced its intention to apply ALIA in two pilot projects: an internal chatbot that will streamline the work of the Tax Agency and its citizen service; and an application in primary care medicine that, based on advanced data analysis, will facilitate the diagnosis of heart failure.

Released models

During the ‘HispanIA 2040’ event, the President of the Government of Spain, Pedro Sánchez, announced the publication and availability to all users of the first batch of AI models. In this regard, the ALIA project has released the following models under the Apache License, Version 2.0 on Hugging Face:

  • ALIA-40B: Transformer-based decoder-only language model that has been pre-trained from scratch on 9.37 trillion tokens of highly curated data. The pre-training corpus contains text in 35 European languages and code.
  • Salamandra-7b and Salamandra-2b: Transformer-based decoder-only language models that have been pre-trained from scratch on 12.875 trillion tokens of highly curated data. The pre-training corpus contains text in 35 European languages and code. All training scripts and configuration files are publicly available on GitHub.
  • mRoBERTa: Multilingual foundational model based on the RoBERTa architecture. It has been pre-trained from scratch on 35 European languages and code. The pre-training corpus consists of 12.8 TB of high-quality data, significantly larger than the training data of previous state-of-the-art encoder-only foundational models such as XLM-RoBERTa-base and XLM-RoBERTa-large, whose multilingual training datasets amounted to 2.5 TB.
  • RoBERTa-ca: Foundational Catalan language model built on the RoBERTa architecture. It uses vocabulary adaptation from mRoBERTa, a method that initialises all weights from mRoBERTa while applying a specialised treatment to the embedding matrix to handle the differences between the two tokenisers. The model is then continually pre-trained on a Catalan-only corpus consisting of 95 GB of high-quality data.

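As an illustration, the minimal sketch below shows how one of the decoder-only checkpoints could be loaded and queried with the Hugging Face transformers library. The repository identifier BSC-LT/salamandra-2b, the bfloat16 precision setting and the Catalan prompt are assumptions made for the example and are not taken from the announcement; the official model cards list the exact identifiers and recommended settings.

    # Minimal sketch: loading one of the released decoder-only checkpoints
    # with the Hugging Face `transformers` library.
    # Assumption: the repository ID "BSC-LT/salamandra-2b" is used here for
    # illustration; consult the official model cards for the exact identifiers.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "BSC-LT/salamandra-2b"  # assumed repository ID

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # reduced-precision weights on supported hardware
        device_map="auto",           # place weights on available devices (needs accelerate)
    )

    # The decoder-only models are plain text generators: given a prompt,
    # they continue the text. The prompt below is an arbitrary example.
    prompt = "La intel·ligència artificial en les llengües oficials d'Espanya"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
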
These models have been verified by the Spanish Agency for the Supervision of Artificial Intelligence (AESIA). 
