Parallelizing Multi-View Knowledge Graph Generation over Heterogeneous Data Sources
Promotors: Ben De Meester
Main contact: Ben De Meester
Problem
One of the core promises of Semantic Web technologies is seamless data integration across systems and organisations. After 25 years of steady adoption — evidenced by its use at Google, Microsoft, Amazon, Meta, and IBM, as well as in the 200+ data standards managed by the Flemish Government at data.vlaanderen.be — the technology is mature. But a fundamental tension remains: different people and applications model the same real-world concepts differently, and there is no automatic way to reconcile those differences.
Consider a smart home system that turns on lights when it gets dark. One data source might express this as a weather event ("10 minutes before sunset"), another as a raw sensor reading ("outdoor illuminance below 500 lux"). Both describe the same physical reality, but in incompatible models. To make these systems interoperate, you need a way to transform the same underlying data into multiple different representations — simultaneously and efficiently.
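To make the tension concrete, here is a minimal sketch of the two views: one underlying sensor reading rendered in a weather-event model and in a raw-observation model. All field names and values are illustrative, not taken from any real data standard.

```python
from datetime import datetime

# Hypothetical sensor reading shared by both systems.
reading = {"sensor": "garden-1", "lux": 480, "time": datetime(2024, 6, 1, 21, 20)}

# Hypothetical sunset time for the same location and day.
sunset = datetime(2024, 6, 1, 21, 28)

def weather_view(reading, sunset):
    """View 1: model the moment as an offset relative to sunset."""
    offset = int((reading["time"] - sunset).total_seconds() // 60)
    return {"@type": "WeatherEvent", "event": "sunset", "offsetMinutes": offset}

def sensor_view(reading):
    """View 2: model the same moment as a raw illuminance observation."""
    return {
        "@type": "Observation",
        "observedProperty": "illuminance",
        "value": reading["lux"],
        "unit": "lux",
    }

print(weather_view(reading, sunset))  # offsetMinutes: -8 (8 minutes before sunset)
print(sensor_view(reading))           # value: 480, below the 500 lux threshold
```

Both views are derived from the same `reading`; neither is reducible to the other without extra context (the sunset time, the lux threshold), which is exactly why reconciling them cannot be automated away.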
This is the multi-view problem. With RML, the RDF Mapping Language developed at IDLab and now used by compliant engines worldwide, you can write separate mapping documents that each produce a different "view" of the same input data. But naively executing those mappings independently wastes resources: different RML documents over the same source will redundantly re-read, re-parse, and re-process large portions of identical input data. At scale — think streaming sensor data, large government datasets, or live Web APIs — this becomes a serious bottleneck.
Goal
In this thesis, you will design and implement a scalable ETL system that executes multiple RML mappings over shared input sources in an optimized, parallelized manner, starting from strong single-machine baselines (e.g., RMLMapper) before moving to distributed execution. The core insight driving your work is that independent mappings over the same data share structure that can be exploited: common logical source reads, overlapping triple map definitions, and shared transformation steps can be identified, deduplicated, and executed once rather than N times.
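The shared structure can be sketched in a few lines: if triples maps from independent mapping documents are grouped by their logical source, each source needs to be read and parsed only once. The dictionary representation below is a simplified stand-in for RML constructs, not the actual RML vocabulary.

```python
from collections import defaultdict

# Two independent "mapping documents", each a list of simplified triples maps.
# Field names ("source", "subject", ...) are illustrative stand-ins.
doc_a = [
    {"source": "sensors.csv", "subject": "ex:Obs{id}", "ref": "lux"},
]
doc_b = [
    {"source": "sensors.csv", "subject": "ex:Event{id}", "ref": "lux"},
    {"source": "weather.json", "subject": "ex:Sunset{day}", "ref": "time"},
]

def group_by_source(*docs):
    """Merge triples maps from all documents, keyed by logical source,
    so each shared source is read and parsed exactly once."""
    groups = defaultdict(list)
    for doc in docs:
        for triples_map in doc:
            groups[triples_map["source"]].append(triples_map)
    return groups

groups = group_by_source(doc_a, doc_b)
# "sensors.csv" now carries triples maps from both documents:
# one read feeds two views instead of being processed twice.
```

The same grouping idea extends to overlapping triple map definitions and shared transformation steps, which can likewise be deduplicated before execution.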
You will approach this from multiple complementary angles. First, static analysis of RML mapping documents to extract and merge common sub-mappings, reducing redundant computation before execution begins. Second, runtime pipeline optimization — reordering and fusing ETL steps so that shared operations are executed in a single pass over the data. Third, and most ambitiously, distributed execution strategies that partition the workload across a compute cluster, ensuring each worker is efficiently utilized when single-machine throughput is insufficient.
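The second angle, fusing shared operations into a single pass, can be illustrated with a toy example: a shared CSV source is read and parsed once, and every "view" consumes each record as it streams by. This is a sketch of the idea only; a real engine would operate on RML term maps rather than Python lambdas.

```python
import csv
import io

# Hypothetical shared CSV source.
csv_source = "id,lux\n1,480\n2,510\n"

# Two views fused into one pass: each function maps a record to triples.
views = [
    lambda r: [(f"ex:Obs{r['id']}", "sosa:hasSimpleResult", r["lux"])],
    lambda r: [(f"ex:Event{r['id']}", "ex:brightness", r["lux"])],
]

def fused_pass(source_text, views):
    """Read and parse the source once; every view sees the same record,
    instead of each view re-reading the whole source."""
    outputs = [[] for _ in views]
    for record in csv.DictReader(io.StringIO(source_text)):
        for out, view in zip(outputs, views):
            out.extend(view(record))
    return outputs

obs, events = fused_pass(csv_source, views)
# obs and events are two distinct Knowledge Graph views,
# produced from a single read of the input.
```

Naive execution would cost one full read and parse per mapping document; the fused pass amortizes that cost to a single scan regardless of how many views are produced, which is the gain the static analysis and pipeline optimization are meant to unlock.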
The result will be a system that can take a set of RML mappings and a shared data source and produce all required Knowledge Graph views faster and with fewer resources than naive parallel execution. You will evaluate your design against real-world RML workloads, with KROWN providing a benchmark baseline. IDLab's decade of hands-on ETL and RML experience is directly available to support you throughout.
Success is defined by measurable optimization gains on a representative subset of workflows, not by delivering a full distributed data platform.