A comparison of human and ChatGPT classification performance on complex social media data

Authors

  • Breanna E. Green, Cornell University
  • Ashley L. Shea, Cornell University
  • Pengfei Zhao, Cornell University
  • Drew B. Margolin, Cornell University

DOI:

https://doi.org/10.47989/ir31iConf64141

Keywords:

Generative artificial intelligence, Large language models, ChatGPT, Prosocial objection tactics

Abstract

Introduction. Generative artificial intelligence tools, such as ChatGPT, are an increasingly utilised resource among computational social scientists. Nevertheless, ChatGPT's performance on complex tasks, such as classifying and annotating datasets that contain nuanced language, remains insufficiently understood.

Method. In this paper, we measure the performance of GPT-4 on one such task and compare the results to those of human annotators. Given the rapid advancement of large language models, we examine ChatGPT versions 3.5, 4, and 4o. We employ a dataset of human-annotated comments from YouTube and X, craft four prompt styles as input, and evaluate precision, recall, and F1 scores.
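
As a rough illustration of the evaluation step described above, the sketch below scores a model's predicted labels against human annotations with scikit-learn. This is not the authors' pipeline: the label set and the two example arrays are hypothetical placeholders.

```python
# Minimal sketch of the evaluation step: scoring model labels against
# human annotations with precision, recall, and F1. The label set and
# both arrays are invented placeholders, not the paper's data.
from sklearn.metrics import classification_report, precision_recall_fscore_support

LABELS = ["objection", "no_objection"]  # placeholder label set

human_labels = ["objection", "no_objection", "objection", "objection"]
gpt_labels = ["objection", "objection", "no_objection", "objection"]

precision, recall, f1, _ = precision_recall_fscore_support(
    human_labels, gpt_labels, labels=LABELS, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Per-label breakdown, useful when some categories are much rarer than others
print(classification_report(human_labels, gpt_labels, labels=LABELS, zero_division=0))
```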

Analysis. Quantitative and qualitative evaluations of the results demonstrate that, while including label definitions in prompts may improve performance, GPT-4 overall has difficulty classifying nuanced language.
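
To make the "label definitions in prompts" idea concrete, here is a hedged sketch of a definition-augmented, zero-shot classification call using the official `openai` Python client. The model name, labels, definition text, and `classify` helper are illustrative stand-ins, not the prompt styles evaluated in the paper.

```python
# Illustrative sketch of a definition-augmented classification prompt.
# The definitions and label set are invented stand-ins for the study's
# actual prompt styles; requires the official `openai` package and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

DEFINITIONS = """\
objection: the comment pushes back on a claim made in the post or thread.
no_objection: the comment does not push back on any claim."""

def classify(comment: str) -> str:
    """Ask the model for a single label, given the label definitions."""
    response = client.chat.completions.create(
        model="gpt-4",  # version under test; swap in "gpt-3.5-turbo" or "gpt-4o"
        messages=[
            {"role": "system",
             "content": f"Classify the comment using these definitions:\n"
                        f"{DEFINITIONS}\nAnswer with exactly one label."},
            {"role": "user", "content": comment},
        ],
        temperature=0,  # make labels as repeatable as the API allows
    )
    return response.choices[0].message.content.strip()

print(classify("That's not what the article says at all."))
```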

Results. Qualitative analysis reveals four specific findings: 1) cultural euphemisms are too nuanced for GPT-4 to understand, 2) interpreting the kind of ‘internet speak’ found on social media platforms is a challenge, 3) GPT-4 falters in determining who or what is the target of a directed attack (e.g., the content or the user), and 4) the rationale GPT-4 provides is logically inconsistent.

Conclusion. Our results suggest that the use of ChatGPT in classification tasks involving nuanced language should be undertaken with prudence.


Published

2026-03-20

How to Cite

Green, B. E., Shea, A. L., Zhao, P., & Margolin, D. B. (2026). A comparison of human and ChatGPT classification performance on complex social media data. Information Research: An International Electronic Journal, 31(iConf), 1515–1533. https://doi.org/10.47989/ir31iConf64141

Section

Conference proceedings
