Collaborating with large language models in literature screening for a systematic review of college students’ GenAI literacy

Authors

  • Wonchan Choi University of Wisconsin-Milwaukee
  • Joyce Lee University of Wisconsin-Milwaukee
  • Besiki Stvilia Florida State University
  • Yan Zhang University of Texas at Austin
  • Hyerin Bak University of Wisconsin-Milwaukee

DOI:

https://doi.org/10.47989/ir31iConf64265

Keywords:

Systematic literature review, Screening, LLM, Generative AI, Literacy

Abstract

Introduction. Literature screening is among the most time- and labour-intensive phases of systematic literature reviews (SLRs). Although large language models (LLMs) have been explored for screening, prior work has focused on medical and environmental domains and mainly benchmarked LLMs against human coders, offering limited guidance on collaborative integration in SLRs.

Method. This study evaluated 12 GPT model–prompt configurations in an SLR of 1,616 publications. Two human coders screened a 10% sample (n = 162) to create a gold standard for model comparison. Performance was assessed using balanced accuracy, recall for inclusion, time efficiency, and cost efficiency. Disagreements between humans and models were analysed.
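
The two headline metrics can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the label convention (1 = include, 0 = exclude) and the function names are assumptions.

```python
def recall_for_inclusion(gold, pred):
    """Share of gold-standard inclusions the model also flagged (sensitivity)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return tp / (tp + fn)

def balanced_accuracy(gold, pred):
    """Mean of recall on the include class and recall on the exclude class."""
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    specificity = tn / (tn + fp)
    return (recall_for_inclusion(gold, pred) + specificity) / 2
```

Balanced accuracy is the apt choice here because screening samples are usually dominated by excluded records, so plain accuracy would reward a model that simply excludes everything.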

Analysis. Bootstrap tests compared performance across configurations. Open coding identified error types in human–model disagreements. Group discussion resolved discrepancies.
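
A bootstrap comparison of two configurations can be sketched as paired resampling of the screened items. This is a generic illustration of the technique, not the authors' procedure; the accuracy metric, interval width, and parameter names are assumptions.

```python
import random

def accuracy(gold, pred):
    """Fraction of items where the model's decision matches the gold standard."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def paired_bootstrap_ci(gold, pred_a, pred_b, n_boot=2000, seed=0):
    """95% percentile CI for accuracy(A) - accuracy(B).

    Resamples items with replacement, keeping each item's gold label and
    both model decisions paired, so the interval reflects the difference
    between configurations rather than item-level variation alone.
    """
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        diffs.append(accuracy(g, [pred_a[i] for i in idx])
                     - accuracy(g, [pred_b[i] for i in idx]))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

If the interval excludes zero, the difference between the two configurations is unlikely to be a resampling artefact.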

Results. GPT-5 Zeroshot and GPT-4o-mini Fewshot achieved the highest performance (accuracy = 0.990 and 0.946; recall = 1.000 and 0.933). GPT-4o-mini was faster and cheaper but more prone to overly rigid rule application. Error analysis identified 10 mismatches, leading to two corrections of human miscoding.

Conclusion. LLM-assisted screening can reduce workload, improve efficiency, and correct human errors in SLRs. Practical guidelines for prompt design and confidence thresholds can position LLMs as collaborative tools in SLRs.

References

Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence, and kappa. Journal of Clinical Epidemiology, 46(5), 423–429. https://doi.org/10.1016/0895-4356(93)90018-V

Chibwe, K., Mantilla-Calderon, D., & Ling, F. (2025). Evaluating GPT models for automated literature screening in wastewater-based epidemiology. ACS Environmental Au, 5(1), 61–68. https://doi.org/10.1021/acsenvironau.4c00042

Ghossein, J., Hryciw, B. N., Ramsay, T., & Kyeremanteng, K. (2025). The AI reviewer: Evaluating AI’s role in citation screening for streamlined systematic reviews. JMIR Formative Research, 9, e58366. https://doi.org/10.2196/58366

Guimarães, N. S., Ferreira, A. J. F., Ribeiro Silva, R. D. C., De Paula, A. A., Lisboa, C. S., Magno, L., Ichiara, M. Y., & Barreto, M. L. (2022). Deduplicating records in systematic reviews: There are free, accurate automated ways to do so. Journal of Clinical Epidemiology, 152, 110–115. https://doi.org/10.1016/j.jclinepi.2022.10.009

Guo, E., Gupta, M., Deng, J., Park, Y.-J., Paget, M., & Naugler, C. (2024). Automated paper screening for clinical reviews using large language models: Data analysis study. Journal of Medical Internet Research, 26(1), e48996. https://doi.org/10.2196/48996

Issaiy, M., Ghanaati, H., Kolahi, S., Shakiba, M., Jalali, A. H., Zarei, D., Kazemian, S., Avanaki, M. A., & Firouznia, K. (2024). Methodological insights into ChatGPT’s screening performance in systematic reviews. BMC Medical Research Methodology, 24, 78. https://doi.org/10.1186/s12874-024-02203-8

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310

Matsui, K., Utsumi, T., Aoki, Y., Maruki, T., Takeshima, M., & Takaesu, Y. (2024). Human-comparable sensitivity of large language models in identifying eligible studies through title and abstract screening: 3-layer strategy using GPT-3.5 and GPT-4 for systematic reviews. Journal of Medical Internet Research, 26, e52758. https://doi.org/10.2196/52758

Nykvist, B., Macura, B., Xylia, M., & Olsson, E. (2025). Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis. Environmental Evidence, 14(1), 7. https://doi.org/10.1186/s13750-025-00360-x

Oami, T., Okada, Y., & Nakada, T. (2024). Performance of a large language model in screening citations. JAMA Network Open, 7(7), e2420496. https://doi.org/10.1001/jamanetworkopen.2024.20496

Syriani, E., David, I., & Kumar, G. (2024). Screening articles for systematic reviews with ChatGPT. Journal of Computer Languages, 80, 101287. https://doi.org/10.1016/j.cola.2024.101287

Van Dijk, S. H. B., Brusse-Keizer, M. G. J., Bucsán, C. C., Van Der Palen, J., Doggen, C. J. M., & Lenferink, A. (2023). Artificial intelligence in systematic reviews: Promising when appropriately used. BMJ Open, 13(7), e072254. https://doi.org/10.1136/bmjopen-2023-072254

Zuo, C., Yang, X., Errickson, J., Li, J., Hong, Y., & Wang, R. (2025). AI-assisted evidence screening method for systematic reviews in environmental research: Integrating ChatGPT with domain knowledge. Environmental Evidence, 14, 5. https://doi.org/10.1186/s13750-025-00358-5

Published

2026-03-20

How to Cite

Choi, W., Lee, J., Stvilia, B., Zhang, Y., & Bak, H. (2026). Collaborating with large language models in literature screening for a systematic review of college students’ GenAI literacy. Information Research: An International Electronic Journal, 31(iConf), 1502–1514. https://doi.org/10.47989/ir31iConf64265

Issue

Section

Conference proceedings
