Master theses

A generic toolchain specification for interoperable semantic data pipelines across vendor frameworks

Promotor: Julián Andrés Rojas Meléndez

Main contact: Julián Andrés Rojas Meléndez

Problem

In our developer ecosystem, which spans the companies Inuits, RedPencil and Sirus, we found that even when pipeline components are open source, “plugging” them into another vendor’s framework still requires manual engineering effort: protocols, configuration, validation, and provenance are stitched together ad hoc, and the semantics of what a processor consumes and produces are not described in a uniform, machine-checkable way. This creates vendor lock-in and slows the adoption of new components. This proposal positions a SHACL-driven, shape-based pipeline description as a route toward interoperability: define processors and pipelines in a shared vocabulary, validate their compatibility before execution, and enable composition even when processors are implemented in different programming languages and communicate over different transport channels.

Goal

Produce an implementer-ready specification and a reference implementation for describing processors and pipelines in a framework-agnostic way, focusing on (1) declaring input and output data shapes, (2) validating pipeline correctness before execution, and (3) executing pipelines that combine processors written in different languages via interoperable “channels” (e.g., HTTP POST, WebSocket). The thesis should also deliver a conformance test suite and at least one demonstrator pipeline that composes processors from multiple ecosystems, plus a comparative evaluation against a state-of-the-art setup in terms of developer effort, runtime overhead, and failure modes.
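
To make focus points (1) and (2) more concrete, the following is a minimal sketch of a pre-run compatibility check, under strong simplifying assumptions. It does not use SHACL itself: a "shape" is reduced to a mapping from property names to expected datatypes, and the names Shape, Processor, check_pipeline, and the ex: properties are hypothetical, chosen only for illustration. An actual specification would express these declarations as SHACL shapes and rely on a SHACL validator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Shape:
    """Hypothetical, simplified stand-in for a SHACL node shape."""
    name: str
    required: Dict[str, str]  # property name -> expected datatype


@dataclass
class Processor:
    """Hypothetical processor description; `language` is metadata only."""
    name: str
    language: str
    consumes: Shape
    produces: Shape


def check_pipeline(steps: List[Processor]) -> List[str]:
    """Check each adjacent pair of processors before anything is executed.

    A downstream processor is compatible when every property it requires
    is produced upstream with the same datatype.
    """
    errors: List[str] = []
    for up, down in zip(steps, steps[1:]):
        for prop, datatype in down.consumes.required.items():
            produced = up.produces.required.get(prop)
            if produced is None:
                errors.append(f"{up.name} -> {down.name}: missing property '{prop}'")
            elif produced != datatype:
                errors.append(
                    f"{up.name} -> {down.name}: '{prop}' is {produced}, expected {datatype}"
                )
    return errors


# Illustrative shapes and processors (all assumed, not taken from any framework).
observation = Shape("ObservationShape", {"ex:value": "xsd:decimal", "ex:timestamp": "xsd:dateTime"})
value_only = Shape("ValueShape", {"ex:value": "xsd:decimal"})
located = Shape("LocatedShape", {"ex:value": "xsd:decimal", "ex:location": "xsd:string"})
no_input = Shape("EmptyShape", {})

reader = Processor("reader", "JavaScript", consumes=no_input, produces=observation)
averager = Processor("averager", "Python", consumes=value_only, produces=value_only)
mapper = Processor("mapper", "Java", consumes=located, produces=located)

print(check_pipeline([reader, averager]))  # compatible pair
print(check_pipeline([reader, mapper]))    # mapper needs ex:location, which reader does not produce
```

The check depends only on the declared shapes, not on the implementation language of each processor, which is the property that would allow components from different vendors to be composed and rejected or accepted before any data flows through a channel.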