Many applications need to query data from multiple sources. In the biosciences, for example, one database may hold information about a protein while another holds complementary information about it. In mapping applications, location data may be queried from one database and contextual information about that location retrieved from another. Semantic graph databases, such as RDF graphs exposed through SPARQL endpoints, are well suited to this kind of richly interlinked information, because different parties can each contribute additional information about the same resource. The nodes of such graphs can have deep and complex connections, so mechanisms like property paths, which are regex-like expressions for navigating the edges of a graph, are valuable. The following query uses property paths to look for descendants of Marie Curie:
SELECT ?person ?personLabel ?grandparent ?grandparentLabel
WHERE {
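# Find people whose grandparent is Marie Curie (Q7186) or one of her descendants.
# wdt:P22 = father, wdt:P25 = mother, wdt:P40 = child.
# Parentheses are required because "/" binds more tightly than "|".
?person (wdt:P22|wdt:P25)/(wdt:P22|wdt:P25) ?grandparent .
wd:Q7186 (wdt:P40)* ?grandparent .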
# Assumes the Wikidata Query Service: its label service binds ?personLabel and ?grandparentLabel.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
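To make the semantics of the three path operators used here concrete (alternative `|`, sequence `/`, and zero-or-more `*`), the following is a minimal Python sketch that evaluates them over a toy in-memory edge list. The node names (`root`, `kid`, ...), the predicate names, and the helper functions are hypothetical and chosen only for illustration; this is not how a SPARQL engine is implemented, and the data is not taken from Wikidata.

```python
# Toy edge list of (subject, predicate, object) triples; all names are hypothetical.
EDGES = [
    ("root", "child", "kid"),
    ("kid", "child", "grandkid"),
    ("grandkid", "child", "greatgrandkid"),
    ("kid", "mother", "root"),
    ("grandkid", "father", "kid"),
    ("greatgrandkid", "mother", "grandkid"),
]

# Index the edges as (subject, predicate) -> set of objects.
GRAPH = {}
for s, p, o in EDGES:
    GRAPH.setdefault((s, p), set()).add(o)


def step(nodes, pred):
    """Follow a single predicate from every node in `nodes`."""
    return {o for n in nodes for o in GRAPH.get((n, pred), ())}


def alt(nodes, preds):
    """Alternative path p1|p2|...: union of the single-predicate steps."""
    return set().union(*(step(nodes, p) for p in preds))


def star(nodes, pred):
    """Zero-or-more path p*: the start nodes plus everything reachable via `pred`."""
    seen = set(nodes)
    frontier = set(nodes)
    while frontier:
        frontier = step(frontier, pred) - seen
        seen |= frontier
    return seen


def grandparents(person):
    """Sequence path (father|mother)/(father|mother): two alternative steps in a row."""
    parents = alt({person}, ["father", "mother"])
    return alt(parents, ["father", "mother"])


# Counterpart of: root child* ?grandparent
descendants_or_self = star({"root"}, "child")

# Counterpart of: ?person (father|mother)/(father|mother) ?grandparent
people = {s for s, _, _ in EDGES}
matches = {p for p in people if grandparents(p) & descendants_or_self}
```

In this toy graph, `matches` contains only `grandkid` and `greatgrandkid`: `kid` is excluded because the sequence path requires two parent steps, and `root` has no recorded parents.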
Querying multiple SPARQL endpoints within a single query is called federated querying. A federated engine must decompose the query into smaller subqueries and decide which endpoints can answer each of them; the latter step is known as source selection. Currently, little research addresses source selection for queries containing property paths. This gap becomes critical when the nodes a query must retrieve reside in one database while the edges traversed by a property path span several others. In such cases, a query optimizer could distribute different segments of the property path across databases according to where the data is located, rather than naively routing the smallest possible query fragment to each endpoint.
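The idea of distributing path segments by data location can be sketched as follows, under strong simplifying assumptions: the path is a plain sequence of predicates, and each endpoint advertises the set of predicates it holds. The endpoint names, the predicate names, and the `group_segments` helper are hypothetical. Consecutive steps are merged into one segment only when a single endpoint is the sole source for all of them, since otherwise joins across endpoints could be missed. Alternatives and closures such as `*`, whose edges may be spread over several endpoints, would need additional handling that this sketch does not attempt.

```python
# Hypothetical capability index: endpoint name -> predicates it advertises.
ENDPOINTS = {
    "A": {"mother", "father"},
    "B": {"birthPlace", "locatedIn"},
    "C": {"locatedIn"},
}


def sources_for(pred, endpoints):
    """Source selection for one path step: endpoints that advertise `pred`."""
    return sorted(name for name, preds in endpoints.items() if pred in preds)


def group_segments(path, endpoints):
    """Split a sequence path into segments, each paired with its candidate endpoints.

    A step joins the previous segment only if exactly one endpoint can answer it
    and that same endpoint is the sole source of the previous segment.
    """
    segments = []
    for pred in path:
        sources = sources_for(pred, endpoints)
        if not sources:
            raise ValueError(f"no endpoint advertises predicate {pred!r}")
        if segments and len(sources) == 1 and segments[-1][1] == sources:
            segments[-1][0].append(pred)
        else:
            segments.append(([pred], sources))
    return segments


# Sequence path mother/father/birthPlace/locatedIn
PATH = ["mother", "father", "birthPlace", "locatedIn"]

# Naive plan: one fragment per step.
naive_plan = [([pred], sources_for(pred, ENDPOINTS)) for pred in PATH]

# Grouped plan: steps answered exclusively by the same endpoint travel together.
grouped_plan = group_segments(PATH, ENDPOINTS)
```

Here the naive plan issues four fragments, while the grouped plan sends `mother/father` to endpoint `A` as a single segment and keeps `locatedIn` separate because both `B` and `C` may hold matching edges.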