Despite its immense potential, text analysis at scale presents significant challenges. This column benchmarks several state-of-the-art large language models against incentivised human coders in performing complex text analysis tasks. The results indicate that large language models consistently outperform outsourced human coders across a broad range of tasks, and thus provide economists with a cost-effective and accessible solution for advanced text analysis.
Historically, economists’ data analysis skills have centred on structured, tabular data. However, the rapid expansion of digitisation has positioned text data as a valuable resource for studying phenomena that traditional quantitative methods often struggle to address (Gentzkow et al. 2019). For instance, text analysis has enabled researchers to explore a wide range of topics, including analysing central bank communications and policy announcements for macroeconomic insights (e.g. Demirel 2012), studying firms’ inflation expectations (e.g. Thwaites 2022), investigating emotional contagion in social media (e.g. Kramer et al. 2014), examining gender stereotypes in movies (e.g. Gálvez et al. 2018), and assessing the impact of media coverage on political outcomes (e.g. Caprini 2023) and stock market behaviour (e.g. Dougal et al. 2012).
Despite its immense potential, text analysis at scale presents significant challenges (Barberá et al. 2021). As Ash and Hansen (2023) note, economists have mostly relied on three approaches to tackle these challenges: (1) manual coding by outsourced human coders, (2) dictionary-based methods, and (3) supervised machine learning models. Each of these, however, has notable limitations. Outsourced manual coding is costly, time-intensive, and often relies on coders without domain-specific expertise. Dictionary-based methods fail to capture contextual nuances, leading to inaccuracies. Meanwhile, supervised machine learning requires considerable technical skill and large, labelled datasets – resources that are not always readily available (Gilardi et al. 2023, Rathje et al. 2024).
Generative large language models (LLMs) present a promising alternative for large-scale text analysis. Unlike traditional supervised learning methods, current LLMs can tackle complex text analysis tasks without requiring task-specific training, effectively serving as ‘zero-shot learners’ (Kojima et al. 2022). In a recent paper (Bermejo et al. 2024a), we benchmark several state-of-the-art LLMs against incentivised human coders in performing complex text analysis tasks. The results reveal that modern LLMs provide economists with a cost-effective and accessible solution for advanced text analysis, significantly reducing the need for programming expertise or extensive labelled datasets.
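In practice, zero-shot coding amounts to sending each article to the model with the coding task phrased as an instruction. The Python sketch below illustrates this workflow; the model name, prompt wording, and example task are illustrative assumptions, not the exact setup used in the paper.

```python
# Minimal sketch of zero-shot text coding via an LLM API.
# The task and prompt below are illustrative, not the paper's actual ones.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def code_article(article_text: str) -> str:
    """Ask the model to answer one coding task for a single news article."""
    prompt = (
        "You will read a Spanish news article about a fiscal consolidation "
        "programme. Answer YES or NO: does the article criticise the "
        "programme?\n\n"
        f"Article:\n{article_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,  # deterministic answers aid reproducibility
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```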
The study examines a corpus of 210 Spanish news articles covering a nationwide fiscal consolidation program that impacted over 3,000 municipalities (see Bermejo et al. 2024b). This corpus is particularly suitable for testing contextual understanding, as the articles present complex political and economic narratives requiring in-depth knowledge of local government structures, political actors, and policy implications. Moreover, the articles frequently include intricate discussions on fiscal policies, political critiques, and institutional relationships, which would be difficult to analyse through simple keyword matching or surface-level reading.
Five tasks of increasing complexity, each requiring progressively deeper contextual analysis, were evaluated across all news articles under different coding strategies. The tasks are as follows:
These tasks were completed following three distinct coding strategies:
Figure 1 illustrates the performance of outsourced human coders and LLMs across all tasks. The final panel (‘All correct’) shows the proportion of news articles where the different coders successfully completed all five tasks.
Figure 1 Overall performance, across tasks and coding strategies
Visual inspection of Figure 1 reveals that all LLMs outperform outsourced coders across all tasks. While GPT-3.5-turbo (the oldest and least advanced LLM tested) surpasses human coders, it falls behind the other models. Among the models compared, Claude 3.5 Sonnet and GPT-4-turbo (the most advanced) achieve the highest overall scores. This result suggests that as LLMs continue to grow more powerful, the performance gap between them and outsourced human coders will likely widen.
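The underlying comparison is straightforward to reproduce. The sketch below computes per-task accuracy and the ‘All correct’ share against the gold-standard labels; the file and column names are hypothetical.

```python
# Sketch of the evaluation behind Figure 1: per-task accuracy and the share
# of articles where all five tasks are answered correctly ('All correct').
# File and column names are hypothetical.
import pandas as pd

# One row per article: gold_t1..gold_t5 hold the gold-standard answers,
# llm_t1..llm_t5 the model's answers to the five tasks.
df = pd.read_csv("coded_articles.csv")

tasks = [f"t{i}" for i in range(1, 6)]
correct = pd.DataFrame({t: df[f"llm_{t}"] == df[f"gold_{t}"] for t in tasks})

per_task_accuracy = correct.mean()              # share correct, per task
all_correct_share = correct.all(axis=1).mean()  # the 'All correct' panel

print(per_task_accuracy)
print(f"All five tasks correct: {all_correct_share:.1%}")
```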
The performance advantage of LLMs holds even when task difficulty is taken into account. Figure 2 shows that state-of-the-art LLMs typically outperform outsourced human coders on the more challenging tasks, where a task is deemed difficult if at least two authors initially disagreed on the correct answer while creating the gold-standard labels (one way to operationalise this flag is sketched after Figure 2).
Figure 2 Performance by article difficulty, across tasks and coding strategies
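One plausible reconstruction of this difficulty flag, assuming the authors’ initial labels are stored in long format with hypothetical column names, marks an article-task pair as difficult whenever the initial labels were not unanimous:

```python
# Hypothetical reconstruction of the difficulty flag: an article-task pair is
# 'difficult' if the authors' initial labels were not unanimous, i.e. at
# least two authors gave different answers. Names are assumed for illustration.
import pandas as pd

labels = pd.read_csv("author_labels.csv")  # columns: article_id, task, author, label

difficult = (
    labels.groupby(["article_id", "task"])["label"]
    .nunique()        # number of distinct initial answers per article-task pair
    .gt(1)            # >1 distinct answer => at least two authors disagreed
    .rename("difficult")
    .reset_index()
)
```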
The cost advantages of LLMs are significant. Running all tasks across the entire corpus cost just $0.20 with GPT-3.5-turbo, $3.46 with GPT-4, $8.53 with Claude 3 Opus, and $2.28 with Claude 3.5 Sonnet. In each case, the complete set of answers was delivered within minutes. In contrast, the outsourced human coding approach required substantial investment: designing the online questionnaire, recruiting and managing 146 participants, and coordinating the entire data collection process, all of which incurred significant time and logistical costs. Collecting data from all participants took about 98 days. Beyond cost and time savings, LLMs also provide operational simplicity through straightforward API calls, removing the need for advanced programming expertise or human-labelled training data.
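Such cost figures can be sanity-checked with simple token arithmetic: total cost is roughly the number of API calls times tokens per call times the per-token price. The sketch below runs this back-of-the-envelope calculation; the token counts and prices are illustrative assumptions, since pricing varies by model and changes over time.

```python
# Back-of-the-envelope API cost estimate for coding the full corpus.
# All numbers are illustrative assumptions, not the paper's figures.
N_ARTICLES = 210           # size of the corpus
AVG_INPUT_TOKENS = 2_000   # assumed prompt length per article (all five tasks at once)
AVG_OUTPUT_TOKENS = 200    # assumed length of the structured answers

PRICE_IN_PER_1K = 0.0005   # assumed $ per 1K input tokens (GPT-3.5-turbo era)
PRICE_OUT_PER_1K = 0.0015  # assumed $ per 1K output tokens

cost = N_ARTICLES * (
    AVG_INPUT_TOKENS / 1_000 * PRICE_IN_PER_1K
    + AVG_OUTPUT_TOKENS / 1_000 * PRICE_OUT_PER_1K
)
print(f"Estimated corpus cost: ${cost:.2f}")  # about $0.27 under these assumptions
```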
Our study highlights the growing potential of modern generative LLMs as powerful, cost-effective tools for large-scale text analysis. The results demonstrate that LLMs consistently outperform outsourced human coders across a broad range of tasks. These findings suggest that natural language processing technology has reached a point where researchers and practitioners – regardless of technical expertise – can readily incorporate advanced text analysis methods into their work. Furthermore, as newer generations of LLMs continue to evolve, the performance gap between human coders and these models is likely to widen, making LLMs an increasingly valuable resource for economists.