Master theses

Parallelize data views over the entire Web

Keywords: RML, RDF, Semantic Web, Ontology, Performance

Promotors: Pieter Colpaert, Ben De Meester

Students: max 1

Problem

Semantic Web technologies promised to make it easier to integrate data from all over the Web. For over 25 years, slowly but steadily, their uptake has been increasing, as evidenced by their usage at Google, Microsoft, Amazon, IBM, and Meta, but also more locally by the 200+ data standards currently managed by the Flemish Government at https://data.vlaanderen.be/standaarden/.

There’s just one issue. People and applications all over the Web don’t seem to agree that easily.

As an example, think of your smart home, where your lamps are turned on when it’s dark outside: you could trigger this via weather data (“lights on 10 mins before sunset”), or via an outside sensor (“lights on when outside lux is below 500”). There is no way to automatically align these rules.

What we need is a scalable system to transform data into multiple models.

At KNoWS, we have over 10 years of experience in configuring Extract-Transform-Load (ETL) processes for small, big, local, remote, static, and streaming data in CSV, JSON, XML, etc. This culminated in the RDF Mapping Language (RML), with multiple compliant engines being built all over the world: https://rml.io/implementation-report/.

RML describes the mapping of heterogeneous data (data in different formats) into RDF. This provides a way to have a “view” of the input data sources based on the mappings described in an RML document. Thus, you can have multiple RML documents that map the same input data source and generate totally different “views”.

With multiple RML documents, one potential source of redundancy is that they can contain common, overlapping mapping descriptions, which makes it inefficient to simply execute them in parallel to generate multiple views.

Goal

In this thesis, your task is to design a scalable ETL system that transforms data not just into a single view (a single RML mapping), but into multiple parallel views (multiple, jointly optimized RML mappings).

You could just set up multiple processes in parallel, but this is a large waste of resources and might end up with the majority of the work being redundant.

As a guideline to achieve this parallel view data integration, you could tackle this task from 3 angles (there can be more):

  • Extract common mappings in the given RML documents so that they only need to be executed once (RML mapping optimization)
  • Reorder the ETL pipeline steps such that similar operations are executed in a more efficient manner (single machine execution optimization)
  • Distribute the workload of executing the RML mappings across the network such that all the workers of the computing cluster get their work allocated efficiently (distributed computing optimization)
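
To make the first angle more concrete, the following is a minimal sketch of the idea, not an actual RML engine. It assumes a heavily simplified, hypothetical rule representation (a tuple of source, subject template, predicate, and object field) in place of real RML triples maps. All names, URLs, and data in it are illustrative assumptions. Each distinct rule is executed once, and its output is shared by every view that declares it.

```python
from collections import defaultdict

# Hypothetical, simplified stand-in for one RML mapping rule:
# (source, subject_template, predicate, object_field)
Rule = tuple[str, str, str, str]


def plan_shared_execution(mappings: dict[str, set[Rule]]) -> dict[Rule, set[str]]:
    """Map every distinct rule to the set of views that need its output."""
    consumers: dict[Rule, set[str]] = defaultdict(set)
    for view, rules in mappings.items():
        for rule in rules:
            consumers[rule].add(view)
    return dict(consumers)


def execute(mappings: dict[str, set[Rule]], records: dict[str, list[dict]]):
    """Run each distinct rule once and fan its triples out to all consuming views."""
    plan = plan_shared_execution(mappings)
    views: dict[str, set[tuple[str, str, str]]] = {view: set() for view in mappings}
    for rule, targets in plan.items():
        source, subject_template, predicate, object_field = rule
        # Generate the triples for this rule exactly once.
        triples = {
            (subject_template.format(**row), predicate, str(row[object_field]))
            for row in records[source]
        }
        for view in targets:
            views[view] |= triples
    return views


# Illustrative input: one source, two views that share the "lux" rule.
records = {"sensors.csv": [{"id": "s1", "lux": 420, "room": "garden"}]}
subject = "http://example.org/sensor/{id}"
rule_lux = ("sensors.csv", subject, "http://example.org/lux", "lux")
rule_room = ("sensors.csv", subject, "http://example.org/room", "room")
rule_id = ("sensors.csv", subject, "http://example.org/id", "id")
mappings = {
    "view_a": {rule_lux, rule_room},
    "view_b": {rule_lux, rule_id},
}

plan = plan_shared_execution(mappings)
views = execute(mappings, records)
```

In this toy input, the two views declare four rules in total, but only three distinct rules are executed, because the shared "lux" rule runs once. Detecting overlap in real RML documents is harder than tuple equality, since equivalent mappings can be written in different ways; that is part of what the thesis would need to investigate.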