TY - JOUR
T1 - Frankenstein, thematic analysis and generative artificial intelligence: Quality appraisal methods and considerations for qualitative research
AU - Jowsey, Tanisha
AU - Stapleton, Peta
AU - Campbell, Shawna
AU - Davidson, Alexandra
AU - McGillivray, Cher
AU - Maugeri, Isabella
AU - Lee, Megan
AU - Keogh, Justin
N1 - Publisher Copyright:
© 2025 Jowsey et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2025/9/5
Y1 - 2025/9/5
N2 - Objective: To determine the accuracy and efficiency of using generative artificial intelligence (GenAI) to undertake thematic analysis. Introduction: With the increasing use of GenAI in data analysis, testing the reliability and suitability of GenAI for qualitative data analysis is needed. We propose a method for researchers to assess the reliability of GenAI outputs using deidentified qualitative datasets. Methods: We searched three databases (United Kingdom Data Service, Figshare, and Google Scholar) and five journals (PlosOne, Social Science and Medicine, Qualitative Inquiry, Qualitative Research, Sociology Health Review) to identify studies on health-related topics in which humans undertook thematic analysis and published both their analysis in a peer-reviewed journal and the associated dataset. We prompted a closed-system GenAI (Microsoft Copilot) to undertake thematic analysis of these datasets and compared the GenAI outputs with the human outputs. Measures included time (GenAI only), accuracy, overlap with human analysis, and reliability of selected data and quotes. Results: Five studies met our inclusion criteria. The themes identified by human researchers and Copilot showed minimal overlap, with human researchers often using discursive thematic analyses (40%) and Copilot focusing on thematic analysis (100%). Copilot’s outputs often included fabricated quotes (58%, SD = 45%), and none of the Copilot outputs provided participant spread by theme. Additionally, Copilot’s outputs primarily drew themes and quotes from the first 2-3 pages of textual data rather than from the entire dataset. Human researchers provided broader representation and accurate quotes (79% of quotes were correct, SD = 27%). Conclusions: Based on these results, we cannot recommend the current version of Copilot for undertaking thematic analyses. This study raises concerns about the validity of both human-generated and GenAI-generated qualitative data analysis and reporting.
AB - Objective: To determine the accuracy and efficiency of using generative artificial intelligence (GenAI) to undertake thematic analysis. Introduction: With the increasing use of GenAI in data analysis, testing the reliability and suitability of GenAI for qualitative data analysis is needed. We propose a method for researchers to assess the reliability of GenAI outputs using deidentified qualitative datasets. Methods: We searched three databases (United Kingdom Data Service, Figshare, and Google Scholar) and five journals (PlosOne, Social Science and Medicine, Qualitative Inquiry, Qualitative Research, Sociology Health Review) to identify studies on health-related topics in which humans undertook thematic analysis and published both their analysis in a peer-reviewed journal and the associated dataset. We prompted a closed-system GenAI (Microsoft Copilot) to undertake thematic analysis of these datasets and compared the GenAI outputs with the human outputs. Measures included time (GenAI only), accuracy, overlap with human analysis, and reliability of selected data and quotes. Results: Five studies met our inclusion criteria. The themes identified by human researchers and Copilot showed minimal overlap, with human researchers often using discursive thematic analyses (40%) and Copilot focusing on thematic analysis (100%). Copilot’s outputs often included fabricated quotes (58%, SD = 45%), and none of the Copilot outputs provided participant spread by theme. Additionally, Copilot’s outputs primarily drew themes and quotes from the first 2-3 pages of textual data rather than from the entire dataset. Human researchers provided broader representation and accurate quotes (79% of quotes were correct, SD = 27%). Conclusions: Based on these results, we cannot recommend the current version of Copilot for undertaking thematic analyses. This study raises concerns about the validity of both human-generated and GenAI-generated qualitative data analysis and reporting.
UR - http://www.scopus.com/inward/record.url?scp=105015065754&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0330217
DO - 10.1371/journal.pone.0330217
M3 - Article
C2 - 40911617
SN - 1932-6203
VL - 20
SP - 1
EP - 13
JO - PLoS One
JF - PLoS One
IS - 9
M1 - e0330217
ER -