Full Paper
Assessing the Effectiveness of LLMs (Large Language Models) for Extracting Topics and Themes in Survey Responses
Ying-Hsang Liu, Xin Yang, Junzhi Jia
Abstract
Artificial intelligence (AI) has the potential to automate metadata tasks, such as identifying key topics and analyzing recurring themes in text. Topic extraction focuses on recognizing dominant subjects, whereas theme extraction examines patterns of meaning within the text. This study evaluated DeepSeek R1 (8b), DeepSeek R1 (14b), and Gemma3 (12b) on extracting topics and themes from 50 qualitative survey comments. Using standard information retrieval (IR) methods and metrics, we found that Gemma3 (12b) consistently outperformed the DeepSeek models. Topic detection was handled with reasonable effectiveness (global F1 of 0.31 for both DeepSeek R1 (8b) and Gemma3 (12b)). Theme detection was substantially more challenging, particularly for the DeepSeek models (global F1 scores of 0.02 and 0.08), while Gemma3 (12b) achieved an F1 of 0.26. Considerable document-level variability was also observed. Standard IR metrics can be applied to assess AI performance in metadata tasks, but achieving accuracy comparable to human experts in abstract thematic analysis remains a significant challenge. Developing AI systems that better capture the subtleties of abstract meaning still requires human oversight, since these capabilities are critical for supporting complex analytical tasks.
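The abstract reports "global" F1 scores; the exact evaluation pipeline is described in the full paper. As an illustration only, a micro-averaged (global) F1 over per-document topic or theme matches could be computed along the following lines. The label sets and matching-by-exact-string approach here are assumptions for the sketch, not the authors' code or data.

```python
# Minimal sketch (not the authors' code): one way to compute a "global"
# (micro-averaged) F1 by pooling counts across all documents.

gold = [
    {"workload", "remote work"},    # document 1: human-assigned labels (illustrative)
    {"training", "communication"},  # document 2
]
predicted = [
    {"workload", "training"},       # model output for document 1 (illustrative)
    {"communication"},              # model output for document 2
]

tp = fp = fn = 0
for g, p in zip(gold, predicted):
    tp += len(g & p)   # labels both assigned and predicted
    fp += len(p - g)   # predicted but not in the gold set
    fn += len(g - p)   # gold labels the model missed

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"micro P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Pooling counts across documents before computing precision and recall is what distinguishes a global (micro) F1 from an average of per-document F1 scores, which is one way to read the document-level variability noted in the abstract.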
Author information
Ying-Hsang Liu
Professorship of Predictive Analytics, Chemnitz University of Technology, DE
Xin Yang
School of Information Resources Management, Renmin University of China, CN
Junzhi Jia
School of Information Resources Management, Renmin University of China, CN
- Location: University of Barcelona, Barcelona, Spain
- Dates: October 22-25, 2025