Open-source AI for Bulgaria: INSAIT’s BgGPT can foster public and private sector innovation

INSAIT developed the first Bulgarian large language model and promotes its applications across sectors.

The Responsible Organisation

The Institute for Computer Science, Artificial Intelligence and Technology (INSAIT) is a research institute in Sofia, Bulgaria. It is funded by the Bulgarian government and has received grants from Google, Amazon Web Services, and VMware. With government funding, INSAIT has developed BgGPT, a large language model for the Bulgarian language.

The National Revenue Agency is a Bulgarian government agency that is piloting the use of BgGPT to improve public services.

For the purpose of this article, an interview was conducted with Borislav Petrov, the Executive Director of INSAIT, and Emiliyan Pavlov, Data Scientist at INSAIT.

The problem

The Bulgarian research centre INSAIT developed BgGPT, a large language model (LLM) designed specifically for the Bulgarian language, marking a significant advancement in natural language processing for Bulgarian speakers. The project originated from the Bulgarian government's initiative to use AI technology to enhance public services.

In this context, BgGPT was planned to address the lack of language models built for Bulgarian and to maximise opportunities for innovation across sectors and applications. Existing language models are not adequate for the Bulgarian language and fail to generate fluent and accurate text. BgGPT was proposed to overcome this limitation and provide a model capable of producing high-quality Bulgarian text. In addition, INSAIT recognised the importance of open-source accessibility. Freely available AI models offered an opportunity to contribute to the equitable adoption of AI and to foster innovation across a wide range of applications in Bulgaria, without the constraints of proprietary models. These opportunities are considered particularly relevant for encouraging public sector applications that improve public services and administrative processes.

The solution and its implementation

INSAIT launched the BgGPT LLM for the Bulgarian language in March 2024. This initiative is among the first nationwide deployments of such a model in Europe and is part of a broader strategy to create locally tailored AI solutions that address specific public and private sector needs. It puts a large language model trained on a large amount of Bulgarian data at the service of Bulgarian private and public organisations. Its development was supported by substantial funding, including €150 million from the Bulgarian government, as well as large donations from, and partnerships with, researchers and organisations such as Google and Amazon Web Services. With this solution, INSAIT aims to go beyond launching an LLM and to build outreach, partnerships and technical collaboration that promote the widespread adoption of the model.

Regarding its technical features, a 7B language model was initially chosen for the BgGPT solution. A 7B model has 7 billion parameters, which are the values that the model learns during training and that determine how it processes and generates text. This size allows flexibility and ease of fine-tuning, which could facilitate rapid and cost-effective adoption by both public and private sector organisations. From Mr. Petrov’s perspective, the 7B model provided the right balance between performance and accessibility for a wide range of users and applications. In the future, the intention is to expand the model’s capabilities by incorporating more parameters and training data, so that the model can perform a wider range of tasks.
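
As a rough illustration of why a 7-billion-parameter model is comparatively accessible, the sketch below estimates the memory needed just to hold the model weights at common numeric precisions. The bytes-per-parameter values are the standard sizes of these formats; the function name is illustrative, and the estimate deliberately ignores activations, caches and training state, so real requirements are higher.

```python
# Back-of-the-envelope estimate of the memory needed to store model weights.
# Assumption: only the weights are counted; activations, caches and any
# optimizer state used during fine-tuning are ignored.

# Standard storage size of one parameter in each numeric format, in bytes.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_gb(7e9, dtype)
        print(f"7B parameters in {dtype}: ~{size:.1f} GB")
```

At 16-bit precision the weights of a 7B model take about 14 GB, which is within reach of a single high-memory GPU, whereas models with many times more parameters typically need several accelerators.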

BgGPT is open source and free to use, with the goal of making it easier for Bulgarian companies, individuals and public organisations to build products and services on top of the model, such as specialised chatbots and tailored text-generation tools in Bulgarian. For this reason, BgGPT is released on Hugging Face under the Apache 2.0 license, an open-source license that grants users the freedom to use the model for both commercial and non-commercial purposes. This fosters wider accessibility and encourages the development of innovative applications based on BgGPT.

Additionally, BgGPT was trained on a massive dataset of over 3 billion Bulgarian sentences, including both text and code from various sources, such as the Bulgarian National Corpus of texts, the Bulgarian Wikipedia, and online content from the Bulgarian web. Trained on these data, BgGPT can generate text, translate between languages, write various kinds of creative content, and answer questions.

Technical support for the development of specific use cases

INSAIT is currently supporting other public organisations and private sector actors in developing specific applications based on the BgGPT model. Besides making the model available as open source, the research institute offers hands-on collaboration to develop use cases tailored to specific organisational needs. Mr. Petrov explains that, in these pilot projects, they also work on integrating the BgGPT model with other components, such as retrieval-augmented generation (RAG), which supplies the LLM with information from different knowledge bases. This method makes up-to-date knowledge bases available to the model in a scalable way, which helps keep the generated outputs accurate.
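
To make the retrieval step concrete, the following is a minimal sketch of the general technique, not INSAIT's implementation. It assumes a toy in-memory knowledge base with invented placeholder passages, and it ranks passages by simple word-overlap (bag-of-words cosine) similarity, whereas production systems typically use learned embeddings and a vector store. All names in it are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base of (source label, passage) pairs.
# The passages are invented placeholders, not real agency content.
KNOWLEDGE_BASE = [
    ("tax-code-example-1", "Placeholder passage about filing annual income tax returns."),
    ("faq-example-2", "Placeholder passage about signing in to online services with an identification code."),
    ("faq-example-3", "Placeholder passage about value added tax registration thresholds."),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, k=2):
    """Return up to k (source, passage) pairs that share words with the question."""
    query = Counter(tokenize(question))
    scored = [
        (cosine(query, Counter(tokenize(passage))), source, passage)
        for source, passage in KNOWLEDGE_BASE
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(source, passage) for score, source, passage in scored[:k] if score > 0]


def build_prompt(question, passages):
    """Assemble a prompt that asks the model to answer from the cited passages only."""
    context = "\n".join(f"[{source}] {passage}" for source, passage in passages)
    return (
        "Answer the question using only the sources below and cite their labels.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    question = "How do I register for value added tax?"
    print(build_prompt(question, retrieve(question)))
```

The resulting prompt would then be passed to the language model. Because the knowledge base lives outside the model, it can be updated without retraining, and keeping the source labels in the prompt is what allows an answer to be traced back to the passages it relied on.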

For example, INSAIT is working on the development of a BgGPT-based chatbot for Bulgaria's National Revenue Agency, which uses the LLM to assist citizens and businesses with tax inquiries. INSAIT provided not only the LLM but also the technical support needed to integrate the model into the agency's systems. As Mr. Petrov, INSAIT’s Executive Director, explained, this collaboration involved “fine-tuning based on frequently asked questions curated by the National Revenue Agency” and ensuring that the model could retrieve accurate and transparent information from various knowledge bases, such as the Bulgarian tax code and the agency’s website.

Besides its support for the National Revenue Agency, the BgGPT team is currently developing more than 10 other pilot projects with various organisations, exploring diverse applications of the model across different sectors. Furthermore, INSAIT is considering extending the development of BgGPT to other languages.

INSAIT's commitment to open-source technology, public accessibility and tailored technical support can have a profound impact on both public and private sector organisations, encouraging them to explore and develop new applications and fostering innovation across the Bulgarian ecosystem.

Expected benefits

The BgGPT model can offer several significant benefits for both public and private organisations:

One of the key advantages is cost efficiency. As Mr. Petrov explained, implementing BgGPT can significantly reduce operational costs compared to acquiring commercial alternatives. For instance, for certain tasks, using proprietary models might cost around €50,000, while adopting and tailoring the Bulgarian LLM could require only €300. The open-source nature of the model makes it highly affordable, which can be a significant benefit, especially for smaller organisations or public institutions with limited budgets.
Together with affordability, the open-source release and the technical support offered for implementing BgGPT can help organisations build the technical capacity needed to deploy and scale it effectively. This lowers the barrier for organisations with limited AI expertise and reduces their dependency on technology providers, which on a larger scale can promote skills development across the Bulgarian innovation ecosystem and public sector.
Additionally, the model is flexible and scalable, so it can be tailored to each organisation’s needs. According to Mr. Petrov, the model was designed to be adaptable and to be fine-tuned and customised for different use cases.
Another benefit is data privacy. Mr. Petrov and Mr. Pavlov highlighted that, by deploying BgGPT on local infrastructure, organisations can maintain complete control over their data. Mr. Petrov noted that this mitigates the risks associated with sharing sensitive information with third-party corporations, which is crucial for public institutions. He emphasised that “you are not dependent on a corporation... the data stays on your infrastructure”, which supports compliance with data protection regulations.
The solution also enhances transparency and trustworthiness in public services. For example, the implementation of BgGPT for the National Revenue Agency included a feature that retrieves accurate sources for the chatbot’s responses, allowing employees to easily verify the information. This not only helps to reduce errors, but also increases public trust in automated systems, as responses can be traced back to reliable sources such as legal documents or official guidelines.

Main challenges

While the model presents significant benefits, its implementation also faces some challenges:

One major obstacle is the limited pool of specialists with the deep technical expertise needed to manage and deploy advanced AI models. As Mr. Petrov noted, "this is not like classical software development", as working with LLMs requires highly skilled engineers, who are in short supply in Europe. This lack of skilled personnel can slow the adoption of AI-based tools, particularly in the public sector.
Another challenge is the high computational cost associated with training and fine-tuning large models like BgGPT. While the model is cost-efficient to deploy, the initial training required significant computational resources, and financial sustainability may need to be treated as a key consideration in the future.
Additionally, integrating AI systems into existing organisational processes can be difficult, particularly in public sector institutions with legacy IT systems and rigid IT structures. The technical support that INSAIT provides to organisations is crucial in this regard, but the institute risks receiving more requests than it can handle.