Master theses

From Dough to 🥐 Data: Kneading data formats into CroissantML for better interoperability (Internship at VLIZ)

Keywords: AI, CroissantML, FAIR

Promotors: Pieter Colpaert, Julián Andrés Rojas Meléndez

Students: max 1

Problem

In the modern scientific landscape, vast amounts of data are collected across disciplines and stored in diverse formats. However, a key challenge remains: ensuring that this data is not only openly available but also structured in a way that makes it universally findable, understandable, and usable. For researchers, data scientists, and AI-driven systems, seamless access to well-structured metadata is essential for accelerating discovery and innovation.

This challenge is particularly relevant in the marine sciences, where large-scale environmental datasets, such as those in netCDF format, are crucial for understanding ocean processes, biodiversity, and climate change. However, the lack of a standardized metadata structure often limits data interoperability across projects and institutions.

To address this challenge, CroissantML has emerged as a promising metadata standard that enables structured, machine-readable descriptions of datasets. By adopting CroissantML, scientific data can be seamlessly integrated, making it more accessible to both humans and AI-driven systems. This aligns with the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles, ensuring that information is not just available but also practical for computational use.

Goal

In this topic, the focus is on transforming scientific metadata, much like how raw dough is carefully kneaded and shaped before baking. The goal is to develop methods to translate metadata from scientific data formats (mainly those in use in the marine domain), such as netCDF, into Linked Data using the CroissantML metadata standard. A second goal is to test how effectively machine learning systems can be trained on data described in this way. Key tasks will include:

  • Making a list of relevant data formats in the marine domain, and searching for existing projects that map them to CroissantML (netCDF and DwC-A, the Darwin Core Archive, are essential starting points).
  • For a selection of the above, going deeper by:
  • Analyzing the structure of those data formats to understand their metadata properties.
  • Identifying key attributes that need to be preserved and translated into CroissantML.
  • Designing a systematic approach for converting and augmenting existing metadata into the CroissantML format while maintaining interoperability.
  • Developing scripts or automation tools to facilitate and streamline the metadata transformation process.
  • Ensuring that datasets are machine-readable, interoperable, and optimized for AI-driven applications.
  • Testing and validating the transformation pipeline, ensuring metadata integrity and usability across different research domains.
  • Contributing to a standardized metadata framework, enabling easier sharing, linking, and cross-disciplinary analysis of scientific datasets.
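
To make the tasks above more concrete, the following is a minimal sketch of what such a transformation could look like. It is an illustration under stated assumptions, not a prescribed design: the input dictionary, the function name netcdf_to_croissant, the data type mapping, the example URL, and the media type "application/x-netcdf" are all hypothetical choices made here. It assumes the netCDF attributes have already been read into plain Python dictionaries (for example with a netCDF library), uses a reduced JSON-LD context rather than the full one from the Croissant specification, and leaves out required details such as file checksums and the link between each field and the file contents.

```python
import json

# Hypothetical input: metadata already extracted from a netCDF file and
# stored as plain dictionaries. The values are illustrative, not a real dataset.
NETCDF_METADATA = {
    "file_name": "sea_temperature.nc",
    "global_attributes": {
        "title": "Example sea surface temperature grid",
        "summary": "Illustrative dataset used only for this sketch.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
    },
    "variables": {
        "sst": {"dtype": "float32", "long_name": "sea surface temperature",
                "units": "degree_Celsius"},
        "time": {"dtype": "float64", "long_name": "time",
                 "units": "days since 1970-01-01"},
    },
}

# Assumed mapping from netCDF variable types to schema.org data types.
DTYPE_MAP = {"float32": "sc:Float", "float64": "sc:Float",
             "int32": "sc:Integer", "str": "sc:Text"}

# Reduced JSON-LD context (assumption: the full Croissant context has more terms).
CONTEXT = {
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "dct": "http://purl.org/dc/terms/",
    "conformsTo": "dct:conformsTo",
    "recordSet": "cr:recordSet",
    "field": "cr:field",
    "dataType": {"@id": "cr:dataType", "@type": "@vocab"},
}


def netcdf_to_croissant(meta, content_url):
    """Build a simplified Croissant-style JSON-LD description from netCDF metadata."""
    attrs = meta["global_attributes"]
    fields = []
    for name, var in meta["variables"].items():
        # Keep the unit in the description so it is not lost in translation.
        description = var.get("long_name", name)
        if "units" in var:
            description += f" [{var['units']}]"
        fields.append({
            "@type": "cr:Field",
            "@id": f"variables/{name}",
            "name": name,
            "description": description,
            # Unknown types fall back to text rather than failing.
            "dataType": DTYPE_MAP.get(var["dtype"], "sc:Text"),
        })
    return {
        "@context": CONTEXT,
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": attrs["title"],
        "description": attrs.get("summary", ""),
        "license": attrs.get("license", ""),
        "distribution": [{
            "@type": "cr:FileObject",
            "@id": meta["file_name"],
            "name": meta["file_name"],
            "contentUrl": content_url,
            "encodingFormat": "application/x-netcdf",
        }],
        "recordSet": [{
            "@type": "cr:RecordSet",
            "@id": "variables",
            "name": "variables",
            "field": fields,
        }],
    }


if __name__ == "__main__":
    # The URL is a placeholder, not a real download location.
    doc = netcdf_to_croissant(NETCDF_METADATA, "https://example.org/sea_temperature.nc")
    print(json.dumps(doc, indent=2))
```

A sketch like this only covers the descriptive part of the mapping. Deciding which netCDF attributes must be preserved, how dimensions and coordinate variables should be represented, and how the result is validated against the CroissantML specification are exactly the open questions of this thesis.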

This work will help "roll out" a more structured and interoperable approach to metadata, making scientific data more accessible and useful across domains.