When querying Linked Data sources, discovering additional sources from within the data itself can significantly improve the recall of query results. This process, known as link traversal, requires deciding which links to follow in order to discover new sources, a decision governed by what we term reachability criteria. For example, one of these criteria follows every link that is found in a data source. Because each source can link to many other sources, this can cause the query engine to attempt to query the entire web. Solid, an initiative that aims to give individuals their own personal data vaults (pods), offers a more focused context for refining reachability criteria. A demonstration of this approach by Ruben Taelman et al. can be found here: https://comunica.github.io/Article-EDBT2024-SolidQueryDemo/ along with a video presentation: https://www.youtube.com/watch?v=4WHWgWWZ_aQ.
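To make the role of a reachability criterion concrete, the following is a minimal sketch of link traversal over a toy in-memory "web". It is not Comunica's API: the names `Web`, `ReachabilityCriterion`, `followAll`, `withinPod`, and `traverse`, as well as the example URLs, are assumptions made purely for illustration.

```typescript
// A toy "web": each document URL maps to the links found in that document.
type Web = Map<string, string[]>;

// A reachability criterion decides whether a link found in `from` is followed.
type ReachabilityCriterion = (from: string, link: string) => boolean;

// The criterion mentioned above: follow every link that is found.
const followAll: ReachabilityCriterion = () => true;

// Hypothetical, Solid-flavoured criterion: only follow links that stay inside one pod.
const withinPod = (podRoot: string): ReachabilityCriterion =>
  (_from, link) => link.startsWith(podRoot);

// Breadth-first traversal from the seed documents, guided by the criterion.
function traverse(web: Web, seeds: string[], criterion: ReachabilityCriterion): Set<string> {
  const discovered = new Set<string>(seeds);
  const queue: string[] = [...seeds];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const link of web.get(current) ?? []) {
      if (!discovered.has(link) && criterion(current, link)) {
        discovered.add(link);
        queue.push(link);
      }
    }
  }
  return discovered;
}

// Invented example data: Alice's pod links to Bob's, which links to Carol's.
const exampleWeb: Web = new Map<string, string[]>([
  ["https://alice.example/pod/profile", ["https://alice.example/pod/posts", "https://bob.example/pod/profile"]],
  ["https://alice.example/pod/posts", []],
  ["https://bob.example/pod/profile", ["https://carol.example/pod/profile"]],
  ["https://carol.example/pod/profile", []],
]);

const seed = ["https://alice.example/pod/profile"];
const everything = traverse(exampleWeb, seed, followAll);
const alicePodOnly = traverse(exampleWeb, seed, withinPod("https://alice.example/pod/"));
```

With `followAll`, the traversal keeps going for as long as new links appear, whereas the pod-restricted criterion stops at the boundary of Alice's pod.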
In our research group, we are investigating incremental query techniques, which allow for more performant maintenance of query results. Instead of reevaluating the query when the data changes, an incremental query engine uses internal state to calculate the changes in the query result from the changes in the data. In essence, incremental query engines make sure that the results of a query stay up to date with the data. The incremental query engine we are developing is called Incremunica (https://github.com/maartyman/incremunica). It is an extension of the non-incremental query engine Comunica (https://github.com/comunica/comunica), whose link traversal variant (https://github.com/comunica/comunica-feature-link-traversal) is used in the video mentioned above.
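The idea of computing result changes from data changes can be sketched with a small incremental join. This is an illustrative toy, not Incremunica's implementation or API: the types `Delta`, `Knows`, `Name`, and `Result`, the class `IncrementalJoin`, and the helper `update` are all hypothetical. The sketch assumes that every removal refers to an item that was added earlier.

```typescript
// A change to the data or to the result: an item was added or removed.
interface Delta<T> { kind: "add" | "remove"; item: T }

type Knows = { person: string; friend: string };
type Name = { person: string; name: string };
type Result = { person: string; friendName: string };

// Apply a delta to an index bucket (structural equality via JSON, enough for a sketch).
function update<T>(index: Map<string, T[]>, key: string, delta: Delta<T>): void {
  const bucket = index.get(key) ?? [];
  if (delta.kind === "add") {
    bucket.push(delta.item);
  } else {
    const i = bucket.findIndex(x => JSON.stringify(x) === JSON.stringify(delta.item));
    if (i >= 0) bucket.splice(i, 1);
  }
  index.set(key, bucket);
}

// Maintains "names of the friends of each person" without reevaluating the join.
class IncrementalJoin {
  // Internal state: both inputs indexed on the join key (the friend's identifier).
  private knowsByFriend = new Map<string, Knows[]>();
  private namesByPerson = new Map<string, Name[]>();

  // A change on the "knows" side is joined against the stored names only.
  changeKnows(delta: Delta<Knows>): Delta<Result>[] {
    update(this.knowsByFriend, delta.item.friend, delta);
    const matches = this.namesByPerson.get(delta.item.friend) ?? [];
    return matches.map(n => ({
      kind: delta.kind,
      item: { person: delta.item.person, friendName: n.name },
    }));
  }

  // A change on the "name" side is joined against the stored "knows" facts only.
  changeName(delta: Delta<Name>): Delta<Result>[] {
    update(this.namesByPerson, delta.item.person, delta);
    const matches = this.knowsByFriend.get(delta.item.person) ?? [];
    return matches.map(k => ({
      kind: delta.kind,
      item: { person: k.person, friendName: delta.item.name },
    }));
  }
}
```

Each call returns only the additions and removals of the query result, so the cost of an update depends on the number of matching stored items rather than on the size of the full dataset.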
In this thesis, our aim is to investigate the integration of incremental query engines and link traversal techniques, enabling the incremental query engine to dynamically incorporate newly discovered sources and remove outdated ones in response to changes in the underlying data. Achieving this requires the implementation of suitable reachability criteria. Given the vastness of the web and the potential interconnectedness of data sources, it is impractical to maintain query results that span the entire web. Therefore, we must define and implement reachability criteria that guide the traversal of links in a manner that is both effective and manageable. Moreover, we must develop new algorithms to efficiently calculate changes in the query result in a link traversal setting, because the deletion of one source can cause a subset of the other sources in use to become outdated: any source that was only reachable through the deleted source is no longer reachable.
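The effect of deleting a source can be illustrated with a deliberately naive sketch that recomputes the reachable set from the seeds and compares it with the previous one. This is only one possible baseline, not a proposed algorithm; the names `LinkGraph`, `reachable`, and `sourcesToDrop`, and the single-letter source names, are assumptions for illustration.

```typescript
// Each source maps to the sources it links to.
type LinkGraph = Map<string, string[]>;

// All sources reachable from the seeds by following links.
function reachable(graph: LinkGraph, seeds: string[]): Set<string> {
  const seen = new Set<string>(seeds);
  const stack: string[] = [...seeds];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    for (const next of graph.get(current) ?? []) {
      if (!seen.has(next)) {
        seen.add(next);
        stack.push(next);
      }
    }
  }
  return seen;
}

// Naive approach: recompute reachability without the deleted source and
// report every source that was reachable before but no longer is.
function sourcesToDrop(graph: LinkGraph, seeds: string[], deleted: string): string[] {
  const before = reachable(graph, seeds);
  const pruned: LinkGraph = new Map<string, string[]>();
  graph.forEach((links, source) => {
    if (source !== deleted) {
      pruned.set(source, links.filter(link => link !== deleted));
    }
  });
  const after = reachable(pruned, seeds.filter(seed => seed !== deleted));
  return Array.from(before).filter(source => !after.has(source)).sort();
}

// Invented example: "d" is linked from both "b" and "c"; "e" only from "b".
const exampleGraph: LinkGraph = new Map<string, string[]>([
  ["a", ["b", "c"]],
  ["b", ["d", "e"]],
  ["c", ["d"]],
  ["d", []],
  ["e", []],
]);

const dropped = sourcesToDrop(exampleGraph, ["a"], "b");
```

In this example, deleting `b` also makes `e` outdated, while `d` stays in use because `c` still links to it. Recomputing everything from the seeds is simple but costs a full traversal per deletion, which is the kind of maintenance cost that more refined algorithms would aim to avoid.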
The goal of this thesis is to explore the field of incremental link traversal. The student can focus on one or more of the challenges below, according to their interests:
- Investigate a naive implementation of incremental link traversal
- Investigate optimizations of incremental link traversal to increase maintenance performance
- Investigate the reachability criteria in an incremental setting
- Investigate a Solid use case with incremental link traversal
- Investigate an interface for partial query results (partial traversal)
- Investigate query restarts to decrease maintenance costs
- Etc.