Semantic Web technologies have enabled data integration solutions for over 25 years, and slowly but steadily their uptake is increasing, as evidenced by their use at Google, Microsoft, Amazon, IBM, and Meta, but also more locally by the 200+ data standards currently managed by the Flemish Government at https://data.vlaanderen.be/standaarden/.
As this topic matures, the requirements for data integration engines increase, both in complexity *and* in performance.
At KNoWS, we have over 10 years of experience in configuring Extract-Transform-Load processes for small, big, local, remote, static, and streaming data in CSV, JSON, XML, etc. This culminated in the RDF Mapping Language (RML), with multiple compliant engines being built all over the world: https://rml.io/implementation-report/.
The goal of this thesis is simple, but not easy: make our data integration engine (https://github.com/RMLio/rmlmapper-java) as fast as you possibly can. There are multiple existing solutions that you can try to apply to our own engine, and there are established benchmarks to test improvements, but the world’s your oyster. It. just. needs. to. be. faster.
As a guideline to achieve this lightning-fast data integration engine, you could tackle this task from three angles (there can be more):
- Optimize the mapping document written in RML first (without the engine running it)
- Optimize the process of executing the mappings described by the RML document (the engine itself)
- Come up with an intermediate format that describes the mappings in the input RML document more efficiently (think of LLVM's intermediate representation for programming languages)
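To give a flavour of the first angle, here is a minimal sketch of what a static, engine-independent mapping optimization could look like. The dict-based "mapping" structure and the merge rule below are hypothetical simplifications for illustration only, not actual RML syntax or the rmlmapper-java internals: it merges triples maps that share the same logical source and subject template, so an engine would only need one pass over that source.

```python
# Toy illustration of mapping-document optimization (angle 1).
# The dict structure is a hypothetical stand-in for an RML mapping,
# chosen for readability; real RML is expressed in RDF (e.g. Turtle).

def optimize_mapping(triples_maps):
    """Merge triples maps with the same source and subject template,
    deduplicating their predicate-object pairs along the way."""
    merged = {}
    for tm in triples_maps:
        key = (tm["source"], tm["subject"])
        if key not in merged:
            merged[key] = {"source": tm["source"],
                           "subject": tm["subject"],
                           "predicate_objects": []}
        for po in tm["predicate_objects"]:
            # Skip duplicate predicate-object pairs, preserve order.
            if po not in merged[key]["predicate_objects"]:
                merged[key]["predicate_objects"].append(po)
    return list(merged.values())

# Two triples maps over the same CSV with the same subject template...
maps = [
    {"source": "people.csv", "subject": "http://ex.org/person/{id}",
     "predicate_objects": [("foaf:name", "{name}")]},
    {"source": "people.csv", "subject": "http://ex.org/person/{id}",
     "predicate_objects": [("foaf:name", "{name}"),
                           ("foaf:age", "{age}")]},
]

# ...collapse into a single map, i.e. a single pass over people.csv.
optimized = optimize_mapping(maps)
print(len(optimized))  # → 1
```

A real optimizer would of course operate on the RDF graph of the mapping document and also consider rewrites such as self-join elimination, but the principle is the same: shrink the work before the engine ever runs.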