This blog is part of “AI for Data, Data for AI”, a series aiming to unwrap, explain and foster the intersection of artificial intelligence and data. This post is the fourth installment of the series—for further reading, here are the first, second, and third installments.
In the previous post, we explored how artificial intelligence (AI) can enhance data discoverability through techniques such as semantic search, making it easier to find relevant information. However, addressing the challenges of discoverability goes beyond upgrading the system alone; it requires undertaking two key activities in parallel: optimizing the discovery mechanism (the system) and enriching the metadata itself.
Metadata acts as the link between AI-driven systems and the underlying data, and its quality directly impacts the effectiveness of search and retrieval and, more importantly, the understandability of data. This is where AI, specifically large language models (LLMs), comes into play. LLMs’ generative and (pseudo-)reasoning capabilities can be used to automate and improve metadata augmentation, making data more discoverable, usable, and understandable.
In this post, we show how AI can streamline the process of generating, refining, and enriching metadata. We outline a flexible agentic framework that can be adapted to various use cases, producing enhanced metadata that improves the searchability, relevance, and understandability of data. This metadata augmentation process strengthens both discoverability and usability, ensuring that users can easily find and understand the information they need.
Metadata plays a critical role in making data understandable and discoverable, but ensuring that metadata is consistently rich and high-quality is no easy task. Metadata curation is often tedious and requires specialized expertise, typically provided by data curators. Manual curation, however, doesn’t scale well, creating a bottleneck in ensuring that all data is accompanied by high-quality metadata. This challenge is particularly pressing as the volume and complexity of data continue to grow, making it increasingly difficult to maintain metadata quality manually at scale.
With the advent of AI, we now have opportunities to streamline the data curation process by automating some of the more tedious tasks, such as content generation and extraction from existing resources. The goal is to leverage data curators’ expertise for tasks that require human judgment, such as evaluating the quality and relevance of information, rather than having them spend time on the difficult task of generating content from scratch or repetitive tasks like copying content across sources.
Assessing the correctness or quality of information is typically much easier than generating it. Assessment often involves tasks like recognition or selection, which rely on passive cognitive processes. In contrast, generating new content demands active engagement, drawing on deeper mental faculties such as recall and elaboration. This significant difference in cognitive load makes generating information more challenging. AI can alleviate these challenges by allowing curators to focus on higher-level decision-making, ensuring data integrity and utility, while automating both labor-intensive and cognitively demanding tasks, like generating content from scratch.
AI agents have garnered significant attention for their ability to expand what generative AI applications can accomplish across a range of use cases. An AI agent typically leverages an LLM that is guided to behave in a specific way. Since LLMs are trained to follow instructions, each AI agent is programmed with distinct instructions dictating how it should process and act upon the information it receives, and multiple agents can work together to accomplish a task. This opens new possibilities, allowing for novel, emergent outputs that may not be achievable with a single LLM performing the task alone.
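To make this concrete, here is a minimal sketch of an agent as an LLM call wrapped around a fixed instruction set. The `call_llm` helper is a hypothetical stand-in for whatever chat-completion API you use; nothing here reflects the original implementation behind this post.

```python
# Minimal sketch: an AI agent is an LLM guided by a fixed instruction set.
def call_llm(system_prompt: str, user_message: str) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError("Plug in your LLM provider here.")

class Agent:
    def __init__(self, instructions: str):
        self.instructions = instructions  # dictates how this agent behaves

    def run(self, message: str) -> str:
        return call_llm(self.instructions, message)
```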
In the context of development data, AI agents have the potential to revolutionize data curation by automating complex workflows, improving metadata quality, and providing unprecedented support to data curators. For instance, these agents can scale the entire data curation process, reducing the cognitive burden on curators by handling time-consuming tasks such as metadata enrichment, classification, and content extraction. By doing so, AI agents not only enhance the efficiency of data management but also free up human experts to focus on higher-level decision-making, ensuring the quality and relevance of the curated data.
Figure 1. An AI agentic framework for metadata augmentation using a “proponent agent” and a “quality judge agent”.
Figure 1 illustrates an agentic framework that can be adapted to various metadata augmentation tasks. The framework consists of two primary agents: a proponent agent and a quality judge agent. These agents are highly customizable through specific instruction sets tailored to the task at hand. For instance, in the context of improving an indicator definition, the quality judge agent can be given a rubric or a set of guidelines to “quantify” its judgment, providing structured feedback on the quality of the proposed definition. The proponent agent, on the other hand, can then refine the definition based on the feedback from the judge. The rubric can also be shared with the proponent agent to allow it to optimize its revisions accordingly. An example rubric, as shown in Figure 2, outlines how the quality of an indicator definition can be quantified and assessed.
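As an illustration, the two agents could be instantiated from the sketch above with instruction sets along these lines; the prompt wording below is an assumption, not the actual prompts used in the framework.

```python
# Illustrative instruction sets for the two agents (not the actual prompts).
proponent = Agent(
    "You improve indicator metadata. Given a definition, a scoring rubric, "
    "and the judge's feedback, propose an improved definition."
)
quality_judge = Agent(
    "You assess indicator metadata. Score the given definition against the "
    "rubric on a scale of 1 to 10 and explain your score."
)
```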
Figure 2. Example rubric that can be provided to agents to align their objectives. This rubric is designed to quantify the quality of an indicator definition.
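Since the rubric itself appears as an image, here is a hypothetical sketch of how such a rubric might be encoded for the agents; the criteria are assumptions inferred from the judge’s feedback quoted later in this post.

```python
# Hypothetical rubric structure; criteria inferred from the judge's feedback.
DEFINITION_RUBRIC = {
    "scale": "1 (poor) to 10 (excellent)",
    "criteria": [
        "Clarity: the definition is unambiguous and easy to understand",
        "Conciseness: the definition avoids unnecessary detail",
        "Contextual relevance: scope, units, and base year are specified",
        "Completeness: the activities or sources measured are enumerated",
    ],
}
```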
Indicator definitions provide clarity regarding the nature of the indicators, helping avoid confusion and ensuring the data is not misinterpreted or misused. This information must therefore maintain a high standard of quality.
We now show an example of how AI, using the agentic framework we propose, can improve indicator definitions, making them more comprehensive and understandable. We take the “Methane emissions (% change from 1990)” indicator from the World Development Indicators (WDI). The indicator’s metadata includes an accompanying definition, shown in Figure 3.
Figure 3. Subset of the original metadata for the “Methane emissions (% change from 1990)” indicator from the World Development Indicators (WDI).
The objective is to assess whether the current definition can be further improved. To do this, we pass the information to the quality judge agent for an initial assessment. The agent rates the current definition as excellent (9) with this explanation: “The definition is clear, concise, and contextually relevant, providing a specific focus on human activities contributing to methane emissions and referencing a base year for comparison”. However, since the score is not yet the maximum, we ask the proponent agent to take the current definition and the judge agent’s feedback and propose an improved version. The output is shown in Figure 4, which includes the judge agent’s assessment of the proposed definition. This time, the quality judge agent finds the proposed definition better than the original, giving it a score of 10.
Figure 4. Augmented metadata for the “Methane emissions (% change from 1990)” indicator, showing the proponent agent’s proposed improvement to the definition and the accompanying assessment by the quality judge agent. The new definition is scored higher than the original definition.
Once the top quality score, as judged by the quality judge agent, has been achieved, the generated definition can be presented to the data curator for final quality assessment. If found factually and technically correct, it can be used to update the definition in the metadata. It may also already serve as the basis for generating embeddings for semantic search, since it adds a semantically rich description of the indicator, elaborating the various human activities that may contribute to methane emissions.
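The judge-then-refine flow just described can be sketched as a simple loop, reusing the hypothetical agents and rubric from the sketches above; `parse_score` is a hypothetical helper that extracts the numeric score from the judge’s reply, and the iteration cap is illustrative.

```python
import json

def augment_definition(definition: str, max_rounds: int = 3) -> str:
    """Iteratively refine a definition until the judge awards the top score."""
    rubric = json.dumps(DEFINITION_RUBRIC)
    for _ in range(max_rounds):
        verdict = quality_judge.run(f"Rubric: {rubric}\nDefinition: {definition}")
        score = parse_score(verdict)  # hypothetical helper: pull out the 1-10 score
        if score == 10:  # top score reached; hand off to the human curator
            break
        definition = proponent.run(
            f"Rubric: {rubric}\nDefinition: {definition}\nFeedback: {verdict}"
        )
    return definition
```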
The development relevance of an indicator is metadata that outlines how the indicator can be applied across a wide range of socio-economic topics, offering insights into its potential for addressing key development issues. However, generating this metadata often demands extensive domain knowledge, as an indicator can have numerous applications across various contexts. This requires data curators to be well-versed in the different socio-economic dimensions where the indicator might be relevant. Fortunately, LLMs, trained on vast corpora of data, can capture and distill this knowledge. This capability can be leveraged to meet the demand for generating development relevance metadata more efficiently, reducing the cognitive burden on curators and ensuring broader applicability of indicators.
As can be seen in Figure 3, the development relevance metadata for the methane emissions indicator is not available. We test whether the same agentic framework can be leveraged to generate this information.
Figure 5. Development relevance generated by the AI agentic framework for the “Methane emissions (% change from 1990)” WDI indicator.
We pass the indicator name and the definition as input to the proponent agent, asking it to synthesize development relevance metadata fit for the indicator. The synthesized metadata is then passed to the quality judge agent for assessment. Figure 5 illustrates the output of the system: a comprehensive development relevance statement, which the judge agent assesses to be excellent with a score of 10. Data curators can then review this generated metadata and consider it for inclusion in the indicator’s actual metadata.
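Under the same assumptions as the earlier sketches, this use case only swaps the prompts; the wording below is illustrative.

```python
# Same framework, different task: synthesize a missing metadata field.
name = "Methane emissions (% change from 1990)"
definition = "..."  # the indicator definition from the metadata
relevance = proponent.run(
    f"Write the development relevance for this indicator.\n"
    f"Name: {name}\nDefinition: {definition}"
)
verdict = quality_judge.run(
    f"Assess the quality of this development relevance text:\n{relevance}"
)
```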
In addition to the agentic framework, a more straightforward application of LLMs for metadata augmentation is their ability to extract information from text. LLMs can automate tasks such as identifying keywords and key phrases from textual resources and, increasingly, from images. Furthermore, their generative capabilities can be leveraged to produce related keywords or suggest alternative terms based on the input text. This combination of extraction and generation can be applied across various downstream tasks, improving both the discoverability and the comprehensibility of data and ultimately enhancing its reusability (Figure 6).
Figure 6. LLM-extracted and generated keywords from indicator name, definition, and methodology of the “Access to clean fuels and technologies for cooking, rural (% rural population)” indicator in the World Development Indicators (WDI).
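As a hedged sketch, this extraction-plus-generation step can be a single instructed LLM call rather than an agent loop; the prompt wording and the `methodology` placeholder are illustrative assumptions.

```python
# Single instructed LLM call: extraction plus generation, no agent loop needed.
keyword_prompt = (
    "Extract the keywords and key phrases from the indicator metadata below, "
    "then suggest related or alternative search terms. Return a JSON list."
)
methodology = "..."  # the indicator's methodology text from its metadata
keywords = call_llm(keyword_prompt, f"{name}\n{definition}\n{methodology}")
```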
Using AI in development data opens innovative ways to enhance how we manage and curate information. By automating complex tasks, AI allows us to work more efficiently and make data more accessible and meaningful.
In an upcoming post, we will explore how adopting metadata standards can ensure the interoperability and scalability of AI solutions, making it easier to build capacity and integrate systems seamlessly across different platforms.
Source: blogs.worldbank.org