Collaborating with large language models in literature screening for a systematic review of college students’ GenAI literacy
DOI: https://doi.org/10.47989/ir31iConf64265

Keywords: Systematic literature review, Screening, LLM, Generative AI, Literacy

Abstract
Introduction. Literature screening is among the most time- and labour-intensive phases of systematic literature reviews (SLRs). Although large language models (LLMs) have been explored for screening, prior work has focused on medical and environmental domains and mainly benchmarked LLMs against human coders, offering limited guidance on collaborative integration in SLRs.
Method. This study evaluated 12 GPT model–prompt configurations in an SLR of 1,616 publications. Two human coders screened a 10% sample (n = 162) to create a gold standard for model comparison. Performance was assessed using balanced accuracy, recall for inclusion, time efficiency, and cost efficiency. Disagreements between humans and models were analysed.
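As a rough illustration of the evaluation metrics named above (not the authors' code), the sketch below computes balanced accuracy and recall for the "include" class from paired gold-standard and model decisions; the label encoding and example values are assumptions.

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Gold-standard labels from the human coders (1 = include, 0 = exclude)
# and one model configuration's decisions -- illustrative values only.
human = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
model = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

# Balanced accuracy: mean recall over both classes, robust to the
# include/exclude imbalance typical of screening samples.
bal_acc = balanced_accuracy_score(human, model)

# Recall for inclusion: share of truly relevant records the model retained.
recall_include = recall_score(human, model, pos_label=1)

print(f"balanced accuracy = {bal_acc:.3f}, recall (include) = {recall_include:.3f}")
```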
Analysis. Bootstrap tests compared performance across configurations. Open coding identified error types in human–model disagreements. Group discussion resolved discrepancies.
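A minimal sketch of one way to run such a bootstrap comparison between two configurations, assuming per-record decisions are stored as arrays aligned with the gold standard; the resampling scheme, iteration count, and interval are illustrative and not necessarily the authors' exact test.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def bootstrap_diff(gold, cfg_a, cfg_b, n_boot=10_000, seed=0):
    """Paired bootstrap over screened records: resample record indices with
    replacement and recompute the balanced-accuracy difference each time."""
    rng = np.random.default_rng(seed)
    gold, cfg_a, cfg_b = map(np.asarray, (gold, cfg_a, cfg_b))
    n = len(gold)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)
        diffs[i] = (balanced_accuracy_score(gold[idx], cfg_a[idx])
                    - balanced_accuracy_score(gold[idx], cfg_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), (lo, hi)

# Example: compare two configurations on the screened sample.
# mean_diff, ci = bootstrap_diff(gold_labels, decisions_a, decisions_b)
```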
Results. GPT-5 Zeroshot and GPT-4o-mini Fewshot achieved the highest performance (accuracy = 0.990 and 0.946; recall = 1.000 and 0.933). GPT-4o-mini was faster and cheaper but more prone to overly rigid rule application. Error analysis identified 10 mismatches, leading to two corrections of human miscoding.
Conclusion. LLM-assisted screening can reduce workload, improve efficiency, and correct human errors in SLRs. Practical guidelines for prompt design and confidence thresholds can position LLMs as collaborative tools in SLRs.
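To make the "model–prompt configuration" idea concrete, here is a hedged zero-shot screening call with a self-reported confidence field, written against the OpenAI Python client; the model name, inclusion criteria, and JSON response format are placeholders, not the study's actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = """Include the record only if it is an empirical study of college
students' generative-AI literacy; otherwise exclude it."""  # placeholder criteria

def screen(title: str, abstract: str, model: str = "gpt-4o-mini") -> str:
    """Zero-shot title/abstract screening: ask for a decision plus a confidence
    score so that low-confidence records can be routed to human coders."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are screening records for a systematic review."},
            {"role": "user",
             "content": (
                 f"{CRITERIA}\n\nTitle: {title}\nAbstract: {abstract}\n\n"
                 'Reply as JSON: {"decision": "include" or "exclude", "confidence": 0-1}'
             )},
        ],
    )
    return response.choices[0].message.content
```

A few-shot configuration would differ only in adding worked include/exclude examples to the messages before the target record.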
Copyright (c) 2026 Wonchan Choi, Joyce Lee, Besiki Stvilia, Yan Zhang, Hyerin Bak

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
