Parallelizing Multi-View Knowledge Graph Generation over Heterogeneous Data Sources
Promotors: Ben De Meester
Main contact: Ben De Meester
Problem
One of the core promises of Semantic Web technologies is seamless data integration across systems and organisations. After 25 years of steady adoption — evidenced by its use at Google, Microsoft, Amazon, Meta, and IBM, as well as in the 200+ data standards managed by the Flemish Government at data.vlaanderen.be — the technology is mature. But a fundamental tension remains: different people and applications model the same real-world concepts differently, and there is no automatic way to reconcile those differences.
Consider a smart home system that turns on lights when it gets dark. One data source might express this as a weather event ("10 minutes before sunset"), another as a raw sensor reading ("outdoor illuminance below 500 lux"). Both describe the same physical reality, but in incompatible models. To make these systems interoperate, you need a way to transform the same underlying data into multiple different representations — simultaneously and efficiently.
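To make the tension concrete, here is a minimal sketch of the two views: one underlying sensor reading rendered in a weather-event model and in a raw-observation model. All field names and values are illustrative, not taken from any real data standard.

```python
from datetime import datetime

# Hypothetical sensor reading shared by both systems.
reading = {"sensor": "garden-1", "lux": 480, "time": datetime(2024, 6, 1, 21, 20)}

# Hypothetical sunset time for the same location and day.
sunset = datetime(2024, 6, 1, 21, 28)

def weather_view(reading, sunset):
    """View 1: model the moment as an offset relative to sunset."""
    offset = int((reading["time"] - sunset).total_seconds() // 60)
    return {"@type": "WeatherEvent", "event": "sunset", "offsetMinutes": offset}

def sensor_view(reading):
    """View 2: model the same moment as a raw illuminance observation."""
    return {
        "@type": "Observation",
        "observedProperty": "illuminance",
        "value": reading["lux"],
        "unit": "lux",
    }

print(weather_view(reading, sunset))  # offsetMinutes: -8 (8 minutes before sunset)
print(sensor_view(reading))           # value: 480, below the 500 lux threshold
```

Both views are derived from the same `reading`; neither is reducible to the other without extra context (the sunset time, the lux threshold), which is exactly why reconciling them cannot be automated away.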
This is the multi-view problem. With RML, the RDF Mapping Language developed at IDLab and now used by compliant engines worldwide, you can write separate mapping documents that each produce a different "view" of the same input data. But naively executing those mappings independently wastes resources: different RML documents over the same source will redundantly re-read, re-parse, and re-process large portions of identical input data. At scale — think streaming sensor data, large government datasets, or live Web APIs — this becomes a serious bottleneck.
Goal
In this thesis, you will design and implement a scalable ETL system that executes multiple RML mappings over shared input sources in an optimized, parallelized manner, starting from strong single-machine baselines (e.g., RMLMapper) before moving to distributed execution. The core insight driving your work is that independent mappings over the same data share structure that can be exploited: common logical source reads, overlapping triple map definitions, and shared transformation steps can be identified, deduplicated, and executed once rather than N times.
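The shared structure can be sketched in a few lines: if triples maps from independent mapping documents are grouped by their logical source, each source needs to be read and parsed only once. The dictionary representation below is a simplified stand-in for RML constructs, not the actual RML vocabulary.

```python
from collections import defaultdict

# Two independent "mapping documents", each a list of simplified triples maps.
# Field names ("source", "subject", ...) are illustrative stand-ins.
doc_a = [
    {"source": "sensors.csv", "subject": "ex:Obs{id}", "ref": "lux"},
]
doc_b = [
    {"source": "sensors.csv", "subject": "ex:Event{id}", "ref": "lux"},
    {"source": "weather.json", "subject": "ex:Sunset{day}", "ref": "time"},
]

def group_by_source(*docs):
    """Merge triples maps from all documents, keyed by logical source,
    so each shared source is read and parsed exactly once."""
    groups = defaultdict(list)
    for doc in docs:
        for triples_map in doc:
            groups[triples_map["source"]].append(triples_map)
    return groups

groups = group_by_source(doc_a, doc_b)
# "sensors.csv" now carries triples maps from both documents:
# one read feeds two views instead of being processed twice.
```

The same grouping idea extends to overlapping triple map definitions and shared transformation steps, which can likewise be deduplicated before execution.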
You will approach this from multiple complementary angles. First, static analysis of RML mapping documents to extract and merge common sub-mappings, reducing redundant computation before execution begins. Second, runtime pipeline optimization — reordering and fusing ETL steps so that shared operations are executed in a single pass over the data. Third, and most ambitiously, distributed execution strategies that partition the workload across a compute cluster, ensuring each worker is efficiently utilized when single-machine throughput is insufficient.
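The second angle, fusing shared operations into a single pass, can be illustrated with a toy example: a shared CSV source is read and parsed once, and every "view" consumes each record as it streams by. This is a sketch of the idea only; a real engine would operate on RML term maps rather than Python lambdas.

```python
import csv
import io

# Hypothetical shared CSV source.
csv_source = "id,lux\n1,480\n2,510\n"

# Two views fused into one pass: each function maps a record to triples.
views = [
    lambda r: [(f"ex:Obs{r['id']}", "sosa:hasSimpleResult", r["lux"])],
    lambda r: [(f"ex:Event{r['id']}", "ex:brightness", r["lux"])],
]

def fused_pass(source_text, views):
    """Read and parse the source once; every view sees the same record,
    instead of each view re-reading the whole source."""
    outputs = [[] for _ in views]
    for record in csv.DictReader(io.StringIO(source_text)):
        for out, view in zip(outputs, views):
            out.extend(view(record))
    return outputs

obs, events = fused_pass(csv_source, views)
# obs and events are two distinct Knowledge Graph views,
# produced from a single read of the input.
```

Naive execution would cost one full read and parse per mapping document; the fused pass amortizes that cost to a single scan regardless of how many views are produced, which is the gain the static analysis and pipeline optimization are meant to unlock.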
The result will be a system that can take a set of RML mappings and a shared data source and produce all required Knowledge Graph views faster and with fewer resources than naive parallel execution. You will evaluate your design against real-world RML workloads, with KROWN providing a benchmark baseline. IDLab's decade of hands-on ETL and RML experience is directly available to support you throughout.
Success is defined by measurable optimization gains on a representative subset of workflows, not by delivering a full distributed data platform.