Limitations in cultural context: systematic biases of LLMs in Chinese sentiment analysis
DOI: https://doi.org/10.47989/ir31iConf64278

Keywords: LLMs, bias, sentiment polarity, sentiment intensity

Abstract
Introduction. Large language models (LLMs) are increasingly applied in Chinese sentiment analysis; however, their ability to interpret sentiment in culturally specific contexts remains uncertain. This study systematically evaluates the potential biases of LLM-based Chinese sentiment analysis, providing an empirical basis for model optimisation and responsible application.
Method. A human-AI comparison experiment was conducted at both word and sentence levels. Two hundred Chinese words from four categories and 225 sentences were used as materials. Judgment data on sentiment polarity, intensity, valence, and arousal were collected from native Chinese speakers and representative Chinese and international LLMs. Data analysis was performed using Chi-square tests and Kruskal–Wallis tests.
Results. In sentiment polarity judgment, LLMs are highly consistent with human raters and outperform a traditional sentiment lexicon. On continuous dimensions (intensity, valence, arousal), however, LLMs generally show an exaggeration bias, most pronounced for sensory-perceptual words and ironic sentences. This bias likely stems from the models' disembodied cognition, exaggerated emotional language in training data, and technical characteristics such as the attention mechanism.
Conclusion. This study reveals a systematic limitation of LLMs in Chinese sentiment analysis, characterised by accurate classification but exaggerated quantitative evaluations. Consequently, in practical applications, outputs related to sentiment intensity should be interpreted with caution.
License
Copyright (c) 2026 Ruili Geng, Jiwen Zhang, Ruixian Yang, Mingzhe Quan, Xiang Zheng, Yishuai Xu

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
