Master theses

What is happening with my data? Increasing transparency of machine learning pipelines with Semantic Web technologies

Keywords: Semantic Web, Solid, GDPR, Machine Learning, Python, Data Provenance, Data Transparency, Data Visualization, Docker, JavaScript, Semantics

Promotors: Ben De Meester, Femke Ongenae

Students: max 1

Problem

Regaining Control Over Your Personal Data: A Semantic Approach

In today's data-driven world, personal data is continuously exchanged, processed, and stored by a variety of systems and organizations. This leads to a growing concern over the transparency of data usage and ownership. Individuals often lack the ability to fully understand where their data is going, how it is being processed, and what happens to it once it leaves their control.

The shift towards machine learning and data pipelines has made it even harder for users to track the flow of their data. These pipelines frequently span multiple heterogeneous frameworks and systems, which makes it difficult to give a clear, cohesive picture of how data is moved and manipulated. Furthermore, these systems are often opaque to non-technical users, so individuals can hardly exercise their privacy rights, such as those outlined in the General Data Protection Regulation (GDPR).

The Power of Semantics

The Semantic Web, though perhaps not widely recognized yet, represents a significant shift in the future of data management. Traditionally, data on the internet has been structured as hypertext documents and hyperlinks. However, the Semantic Web seeks to model data more accurately, reflecting entities such as people, places, ideas, and events, while establishing meaningful connections between these entities.

Enriching data with semantic meaning, in addition to raw values, has clear advantages. The evolution of search engine results is a common example. Historically, a search query for "Brussels" would yield a list of web pages containing the term. Now, searches provide structured and relevant information, such as the city's population and travel options. This shift has been enabled by the Semantic Web, which employs ontologies: formalized frameworks of rules that define concepts such as cities, residents, and transportation systems in a manner that computers can interpret.

Increasing Transparency with Semantics

A tool was developed that uses a semantic approach to make complex data processes easier to understand across different programming languages and frameworks. By using existing ontologies, data transformations and processes are described in terms of their purpose rather than their technical execution. For example, a function might be named differently in various programming languages, but its core task remains the same, such as summing values or calculating distances between points.

In this framework, each function within a pipeline is identified and described semantically, regardless of the underlying programming language. These descriptions highlight how individual components of a heterogeneous pipeline are interconnected and how data flows through each stage. This semantic representation not only increases clarity but also allows users to visualize the flow of their data and see how each step interacts with their information. Furthermore, by integrating provenance data with these semantic descriptions, it is possible to trace and record the entire history of data as it moves through different processes.
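The idea above can be illustrated with a minimal sketch. It is not the existing tool: the `http://example.org/` namespace, the `SEMANTIC_FUNCTION` mapping, the `ex:executes` property, and the `ProvenanceLog` class are all hypothetical names chosen for illustration. Only the `prov:Activity`, `prov:used`, and `prov:wasGeneratedBy` terms come from the W3C PROV-O vocabulary. The sketch maps functions from different languages to one shared semantic identifier, records each pipeline step as provenance triples, and walks those triples backwards to find every input that contributed to a result.

```python
from dataclasses import dataclass, field

EX = "http://example.org/"            # hypothetical namespace, for illustration only
PROV = "http://www.w3.org/ns/prov#"   # W3C PROV-O namespace

# Assumed mapping: implementations in different languages that perform the
# same task share one language-independent semantic identifier.
SEMANTIC_FUNCTION = {
    ("python", "math.dist"): EX + "euclideanDistance",
    ("python", "builtins.sum"): EX + "sum",
    ("javascript", "Array.prototype.reduce"): EX + "sum",
}


@dataclass
class ProvenanceLog:
    """Stores pipeline steps as (subject, predicate, object) triples."""
    triples: list = field(default_factory=list)
    steps: int = 0

    def record(self, language, implementation, inputs, output):
        """Describe one executed step by its purpose, inputs, and output."""
        self.steps += 1
        activity = f"{EX}activity/{self.steps}"
        function = SEMANTIC_FUNCTION[(language, implementation)]
        self.triples.append((activity, "rdf:type", PROV + "Activity"))
        # ex:executes is a hypothetical property linking a step to its function.
        self.triples.append((activity, EX + "executes", function))
        for entity in inputs:
            self.triples.append((activity, PROV + "used", entity))
        self.triples.append((output, PROV + "wasGeneratedBy", activity))
        return activity

    def lineage(self, entity):
        """Return every entity that directly or indirectly fed into `entity`."""
        generated_by = {s: o for s, p, o in self.triples
                        if p == PROV + "wasGeneratedBy"}
        used = {}
        for s, p, o in self.triples:
            if p == PROV + "used":
                used.setdefault(s, []).append(o)
        seen, stack = set(), [entity]
        while stack:
            activity = generated_by.get(stack.pop())
            for source in used.get(activity, []):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


# A two-step pipeline that crosses languages: Python computes distances,
# JavaScript sums them.
log = ProvenanceLog()
log.record("python", "math.dist",
           [EX + "data/locations"], EX + "data/distances")
log.record("javascript", "Array.prototype.reduce",
           [EX + "data/distances"], EX + "data/total")

for source in sorted(log.lineage(EX + "data/total")):
    print(source)
```

Because both `builtins.sum` and `Array.prototype.reduce` resolve to the same identifier, a visualization built on these triples could show "sum" for either step without exposing language-specific details. A real implementation would likely rely on an RDF library and established ontologies rather than plain tuples.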

Goal

Together with the student, we will explore how to build upon the existing semantic approach to track the flow of data across various stages of a pipeline. This research will focus on several key areas of development:

Refining Semantic Descriptions: Enhancing the quality of semantic descriptions will ensure that data transformations and their underlying processes are more clearly understood, irrespective of the tools or programming languages used.
Better Provenance Capture: We will aim to improve how data lineage is tracked, providing a more detailed record of where the data came from, how it was transformed, and what happened to it throughout the pipeline. This deeper level of tracking will enable greater trust and accountability.
Expanding Framework Support: We will explore how to integrate additional frameworks and tools into the semantic descriptions of data pipelines, allowing us to cover a broader range of systems and applications commonly used in engineering contexts.
Interactive Visualizations: Through interactive visualizations, we will help users easily understand how their data moves through each stage of a pipeline and what happens to it, making complex processes much clearer.
Improving Accessibility: The project will also focus on making the captured data and visualizations more user-friendly, so individuals can more effectively monitor and control their personal data. This, in turn, will make it easier for them to enforce their privacy rights, such as those outlined in GDPR.

This research will lead to the development of powerful tools that give individuals more control over their data, enabling them to understand, trust, and enforce their privacy rights more effectively. By providing clearer insights into data workflows and enhancing transparency, we will help bridge the gap between technical systems and everyday users.
