Master theses

ARCANE: Optimized information retrieval from media archives through knowledge-driven agentic AI

Promotors: Femke Ongenae

Main contact: Femke Ongenae

Problem

In today’s fast-paced media landscape, editorial teams face an overwhelming challenge: navigating millions of constantly updating media items (articles, social media chatter, etc.) to produce timely, culturally relevant, and fact-based content. Worldwide, 100K–200K+ news articles are published daily and 500 hours of video are uploaded to YouTube every minute, in addition to all the chatter on social media about current events. Editorial teams must search and synthesize across these millions of fast-updating items, while also contending with the fact that 44% of people aged 18-24 rely on social media for their news and 15% of under-25s already use AI tools for news. To stay relevant, media outlets need to raise the bar on speed and context, with correctness and contextualization of reporting as the ultimate differentiator.

Agentic interfaces and LLMs can support this task. However, LLMs often produce incorrect output (hallucinations) when they need to handle:

(1) up-to-date info (e.g. new media / events),

(2) specialized domain knowledge (i.e. niche domains or interests), or

(3) context-sensitive interpretation, e.g. LLMs often hallucinate when asked to answer questions that pertain to specific EU culture or identity.

Moreover, the produced content must be based on facts, yet current LLMs often struggle to provide sources for their answers.

Goal

RAG (retrieval-augmented generation) offers a solution. RAG matches the user query against a specified knowledge base (e.g. a media database) and includes the best matches in the prompt used to generate the answer for the user. This enables fast incorporation of up-to-date and specialized domain knowledge and increases reliability through proper curation of the knowledge base. Because the limited context size of the prompt restricts how many matches can be included, smart selection of the matches is of utmost importance. However, as current RAG systems are mainly based on textual matching of the knowledge base content against the query, they struggle to select accurately due to (1) a limited understanding of the (cultural) context of the source, and (2) an inability to take into account long-range connections between different sources in the knowledge base, where two sources on their own might not be a good match but their combined content is the optimal match for the query. It has been shown that, with insufficient context, AI systems fabricate answers 40–60% of the time instead of admitting uncertainty, and that 55% of the time the information provided to the AI is insufficient to answer correctly.
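The retrieval step described above can be illustrated with a minimal sketch. It assumes a purely textual matcher (bag-of-words cosine similarity) and a word count as a stand-in for the prompt's context limit; the knowledge base entries, the function names, and the prompt wording are all hypothetical, and the call to an actual LLM is left out.

```python
import math
import re
from collections import Counter

# Hypothetical media items; a real knowledge base would be a media database.
KNOWLEDGE_BASE = [
    {"id": "a1", "text": "The city council approved the new tram line budget on Monday."},
    {"id": "a2", "text": "Local football club wins the regional cup final."},
    {"id": "a3", "text": "Residents protest against the tram line route through the park."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, knowledge_base, max_words):
    """Rank items by textual similarity and keep the best ones that fit the word budget."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(item["text"]))), item) for item in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for score, item in scored:
        size = len(item["text"].split())
        # Skip items with no overlap at all, and items that would exceed the budget.
        if score > 0 and used + size <= max_words:
            selected.append(item)
            used += size
    return selected


def build_prompt(query, matches):
    """Place the selected matches, tagged with their ids, in front of the question."""
    sources = "\n".join(f"[{m['id']}] {m['text']}" for m in matches)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{sources}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    question = "What happened with the tram line?"
    print(build_prompt(question, retrieve(question, KNOWLEDGE_BASE, max_words=25)))
```

Note that the unrelated item `a2` still receives a non-zero score because it shares the word "the" with the query; it is excluded here only by the word budget. This is a small instance of the weakness of purely textual matching mentioned above.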

Therefore, in this research, we aim to investigate graph-based RAG (GraphRAG) methods. In these methods, the source content is represented as a Knowledge Graph that contains not only the source content itself, but also its (cultural) context (e.g. origin, author, type) and links to other content (e.g. related news, people, events).

This requires research into two topics:

(a) How to construct, in an automated manner, a Knowledge Graph from the media data that is expressive enough to support GraphRAG, e.g. by relying on Wikidata?

(b) Which GraphRAG method is ideal for performing question answering over media data in a reliable manner?
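To make the Knowledge Graph representation described above more concrete, the following is a minimal sketch and not a specific GraphRAG method from the literature. It assumes the graph is stored as (subject, predicate, object) triples, that the entities in the user query have already been linked to graph nodes, and that retrieval is a simple breadth-first expansion; all node names, predicates, and function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical triples: each article carries context (type, author, origin)
# and links to other content (events, people).
TRIPLES = [
    ("article:1", "type", "news_article"),
    ("article:1", "author", "person:reporter_a"),
    ("article:1", "origin", "outlet:regional_daily"),
    ("article:1", "mentions", "event:tram_vote"),
    ("article:2", "type", "opinion_piece"),
    ("article:2", "origin", "outlet:city_blog"),
    ("article:2", "mentions", "person:mayor"),
    ("event:tram_vote", "involves", "person:mayor"),
]


def build_graph(triples):
    """Index triples as an undirected adjacency map: node -> set of (predicate, neighbour)."""
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        graph[subject].add((predicate, obj))
        graph[obj].add((predicate, subject))  # edge direction is dropped in this sketch
    return graph


def expand(graph, seeds, max_hops):
    """Breadth-first expansion: map every node within max_hops of a seed to its hop distance."""
    distance = {seed: 0 for seed in seeds if seed in graph}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for _, neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def retrieve_items(graph, seeds, max_hops=2):
    """Return (article, hop distance, context edges) for reachable articles, nearest first."""
    distance = expand(graph, seeds, max_hops)
    ranked = sorted((d, n) for n, d in distance.items() if n.startswith("article:"))
    return [(node, hops, sorted(graph[node])) for hops, node in ranked]


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    for node, hops, context in retrieve_items(graph, ["event:tram_vote"], max_hops=2):
        print(node, hops, context)
```

In this toy graph, `article:2` never mentions the tram vote, but it is reached in two hops through the shared `person:mayor` node. This is the kind of long-range connection between sources that a purely textual matcher would miss, and the context edges returned with each article could be included in the prompt alongside its text.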