Master theses

SHACL-driven dataset and service discovery for resilient data products

Promotors: Julián Andrés Rojas Meléndez

Main contact: Julián Andrés Rojas Meléndez

Problem

In our developer ecosystem, which spans the companies Inuits, RedPencil and Sirus, a central bottleneck is that developers still discover and integrate datasets and data services largely by hand, with endpoints hard-coded into applications. This makes integration expensive for data space service providers and brittle over time: when a source changes or disappears, the consuming workflow often breaks instead of automatically switching to an equivalent dataset or service. This proposal argues that extending catalog metadata (e.g., DCAT(-AP)-style portals) with machine-readable shape descriptions of dataset contents and service interfaces could enable automated source selection, and could make API ecosystems easier to maintain by supporting the discovery of replacement sources at runtime.

Goal

Design and implement a discovery mechanism that takes a machine-described information need (expressed as SHACL shapes) and automatically selects compatible datasets and/or services from a catalog enriched with such shapes. The thesis should deliver a concrete metadata extension proposal, a matching/scoring algorithm with a reference implementation, and an evaluation that measures correctness (does it find the right sources?) and performance (how quickly does it find them?). It should also deliver a small “resilience” demonstrator showing how a pipeline can recover by discovering an alternative source when the original becomes unavailable.
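
To make the intended matching and recovery behaviour more tangible, the following is a minimal sketch in Python, not a proposed design. It assumes a heavily simplified view of SHACL in which a shape is reduced to a list of property constraints (mirroring only `sh:path`, `sh:datatype` and `sh:minCount`), and it scores a catalog entry by the fraction of needed properties that its shape guarantees. All names (`PropertyConstraint`, `CatalogEntry`, `score`, `discover`) and all `ex:` identifiers are hypothetical and chosen for illustration. Real SHACL shape containment is considerably more involved, and working that out is part of the thesis.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyConstraint:
    """Simplified stand-in for a SHACL property shape (assumption)."""
    path: str           # property IRI, as in sh:path
    datatype: str       # datatype IRI, as in sh:datatype
    min_count: int = 0  # as in sh:minCount


@dataclass
class CatalogEntry:
    """Hypothetical catalog record: a source plus the shape it advertises."""
    iri: str
    shape: list[PropertyConstraint]
    available: bool = True


def score(need: list[PropertyConstraint], offered: list[PropertyConstraint]) -> float:
    """Return the fraction of needed properties that the offered shape guarantees."""
    if not need:
        return 1.0
    offered_by_path = {c.path: c for c in offered}
    satisfied = 0
    for wanted in need:
        candidate = offered_by_path.get(wanted.path)
        if (
            candidate is not None
            and candidate.datatype == wanted.datatype
            and candidate.min_count >= wanted.min_count
        ):
            satisfied += 1
    return satisfied / len(need)


def discover(
    need: list[PropertyConstraint],
    catalog: list[CatalogEntry],
    threshold: float = 1.0,
) -> list[CatalogEntry]:
    """Rank available entries by score and keep those at or above the threshold."""
    scored = [(score(need, e.shape), e) for e in catalog if e.available]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for s, entry in scored if s >= threshold]


# Information need: at least one decimal temperature and one timestamp.
need = [
    PropertyConstraint("ex:temperature", "xsd:decimal", 1),
    PropertyConstraint("ex:observedAt", "xsd:dateTime", 1),
]

catalog = [
    CatalogEntry("ex:sourceA", [
        PropertyConstraint("ex:temperature", "xsd:decimal", 1),
        PropertyConstraint("ex:observedAt", "xsd:dateTime", 1),
    ]),
    CatalogEntry("ex:sourceB", [
        PropertyConstraint("ex:temperature", "xsd:decimal", 1),
        PropertyConstraint("ex:observedAt", "xsd:dateTime", 1),
        PropertyConstraint("ex:humidity", "xsd:decimal", 0),
    ]),
    CatalogEntry("ex:sourceC", [
        PropertyConstraint("ex:temperature", "xsd:decimal", 1),
    ]),
]

print([e.iri for e in discover(need, catalog)])  # ['ex:sourceA', 'ex:sourceB']

# Resilience: the original source disappears, so discovery is simply re-run.
catalog[0].available = False
print([e.iri for e in discover(need, catalog)])  # ['ex:sourceB']
```

In this sketch, recovery is nothing more than re-running discovery over the entries that are still available. Lowering `threshold` would also admit partial matches such as `ex:sourceC`, which is one way the correctness evaluation could trade precision against recall.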