
Helping data scientists make sense of the DSA Transparency Database

‘Open source the only option’

Published on: 25/02/2025 Last update: 26/03/2025 News
Image (from the FOSDEM presentation): a graph with eight coloured lines, each showing the amount of moderation data submitted per day by a large online platform over a six-month period.

When the Italian data scientist Mr Enrico Ubaldi joined the European Commission’s Digital Services Act (DSA) Transparency Database team, he quickly realised that the terabytes of anonymised reports on content moderation decisions, submitted by online platforms such as Google, TikTok, Facebook, and AliExpress, were difficult to analyse at scale.

Being an open-source software engineer, he put together ‘dsa_tdb’ - a tool that helped him make sense of it all, by selecting, sorting and visualising the data. Other transparency researchers immediately showed interest, and the Commission has now made dsa_tdb available to the public, as an open source tool.

Speaking to code.europa.eu, Mr Ubaldi, joined by his fellow data scientists, Mr Lucas Verney and Ms Sophia Dietrich, cautions that the tools and services that they and others are building for the Transparency Database are a work in progress. “There are ongoing efforts to improve the quality of the reporting by all of the platforms, as well as our tools,” says Ms Dietrich, the team coordinator. “In the meantime, ‘dsa_tdb’ can provide researchers with an interactive tool, as one part of that effort,” she says.

The Digital Services Act or DSA, an EU regulation adopted in 2022, sets rules on illegal content, advertising, user moderation and content recommendation for online platforms and search engines. In September 2023, the Commission unveiled the ‘Transparency Database’, which aggregates the moderation reports submitted by the online platforms.

Scrutinising content moderation

The Commission makes the data publicly available. The goal is to increase transparency, and to make it possible for everyone to scrutinise the content moderation decisions, explains Ms Dietrich. This is where the data science team members’ programming skills come in. (Tangentially related, the software to create this enormous database is also publicly available as open source software.)

Each day, the database collects over 3 gigabytes of raw data: more than 50 million statements related to moderation. This needs a lot of preparation to get it ready for research. In addition, the search functionality provided on the website of the Transparency Database can only search the past six months, says Mr Ubaldi. “When I joined in October 2023, we really needed an easier way to access the database. That is why I started on dsa_tdb.” Following a first internal version built in early 2024, dsa_tdb was further improved by Mr Ubaldi and Ms Dietrich, and more recently by Mr Verney, who joined the team in September.
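To give a feel for the kind of preparation involved, here is a minimal sketch of one basic step: counting statements per platform per day, the quantity plotted in the image at the top of this article. It uses only the Python standard library and does not use dsa_tdb itself. The sample rows and the field names `platform` and `submitted_at` are hypothetical and are not taken from the actual Transparency Database schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows. The field names are illustrative assumptions,
# not the real Transparency Database schema.
SAMPLE_STATEMENTS = [
    {"platform": "Platform A", "submitted_at": "2024-10-01T08:15:00"},
    {"platform": "Platform A", "submitted_at": "2024-10-01T19:40:00"},
    {"platform": "Platform B", "submitted_at": "2024-10-01T11:05:00"},
    {"platform": "Platform A", "submitted_at": "2024-10-02T07:30:00"},
    {"platform": "Platform B", "submitted_at": "2024-10-02T23:59:00"},
    {"platform": "Platform B", "submitted_at": "2024-10-02T12:00:00"},
]


def count_per_platform_per_day(statements):
    """Count statements for each (day, platform) pair."""
    counts = Counter()
    for row in statements:
        # Reduce the timestamp to its calendar day, e.g. "2024-10-01".
        day = datetime.fromisoformat(row["submitted_at"]).date().isoformat()
        counts[(day, row["platform"])] += 1
    return counts


def format_counts(counts):
    """Return one text line per (day, platform) pair, sorted by day then platform."""
    return [f"{day}  {platform}  {n}" for (day, platform), n in sorted(counts.items())]


if __name__ == "__main__":
    for line in format_counts(count_per_platform_per_day(SAMPLE_STATEMENTS)):
        print(line)
```

At the real volume of tens of millions of statements per day, an in-memory loop like this would not be practical, which is part of the reason a dedicated tool is needed.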

The decision to make the tool public came after pivotal exchanges with transparency researchers and data scientists, who made it clear that it would significantly aid their research. Encouraged, the team unveiled dsa_tdb in October 2024 at one of the DSA research workshops, following four months of internal testing.

The dsa_tdb software is available as a downloadable Python package, or as a ready-made Docker container. There is also an online version for users to test the tool. To use dsa_tdb, users have three options: they can call it from other programs, use the command line interface, or use the built-in dashboard.

Streamlining research

In early February this year, the data scientists presented their work at FOSDEM, Europe’s largest gathering of developers of free and open source software, which takes place annually in Brussels, Belgium. “Our package tries to optimise and streamline analysis of the DSA database content,” said Mr Ubaldi. Tempering expectations, he said: “We cannot do miracles, we are now at more than 26 billion statements in the database, and analysing the data requires a powerful computer.”

For developers of free and open source software, FOSDEM is an important conference, Mr Verney tells code.europa.eu. “Presenting at this massive conference raises awareness, not just about the dsa_tdb tool, but also the database itself and the DSA regulation,” he says.

The data scientists continue to add improvements to dsa_tdb. In August the team switched to Spark as the tool for building database queries, which allowed them to add more ways to query the database directly. “We are now working on an update of the structure of the database, planned for this summer,” says Mr Verney. They are also improving plotting capabilities, following suggestions received via email from data scientists and other users.

The power of open source software

There is a flourishing community of researchers using the Transparency Database, says Mr Ubaldi, citing six academic papers so far. The team hopes scientists will use the dsa_tdb tool for their work. “It would be awesome to see it mentioned in their research,” says Mr Verney. They are also keen to see their work reused by the regulators in the Member States. “The tool can be used to track moderation on many kinds of platforms,” he says.

The data scientists are convinced of the power of open source. This type of software makes it easy to try new solutions, to modify, and to tailor them. “For our daily work, we rely almost entirely on open source,” says Mr Ubaldi. “As do the platforms that we analyse.”

“We’re making dsa_tdb available to help researchers,” concludes Mr Ubaldi. “In turn, their work will support the Commission in improving openness around content moderation, and enable everyone to scrutinise the platforms’ decisions.”

More information:

dsa-tdb on code.europa.eu
Presentation on dsa_tdb at FOSDEM
Slides on the dsa_tdb presentation at FOSDEM
Introduction to the Digital Services Act
About the DSA Transparency Database
Very Large Online Platforms and Very Large Online Search Engines under the DSA
DSA data access for researchers
DSA Transparency Database dashboard
Source code for the Transparency Database
 


Open Source Observatory (OSOR)

Open Source Software