Limitations in cultural context: systematic biases of LLMs in Chinese sentiment analysis
DOI: https://doi.org/10.47989/ir31iConf64278

Keywords: LLMs, bias, sentiment polarity, sentiment intensity

Abstract
Introduction. Large language models (LLMs) are increasingly applied in Chinese sentiment analysis; however, their ability to interpret sentiment in culturally specific contexts remains uncertain. This study systematically evaluates the potential biases of LLM-based Chinese sentiment analysis, providing an empirical basis for model optimisation and responsible application.
Method. A human-AI comparison experiment was conducted at both word and sentence levels. Two hundred Chinese words from four categories and 225 sentences were used as materials. Judgment data on sentiment polarity, intensity, valence, and arousal were collected from native Chinese speakers and representative Chinese and international LLMs. Data analysis was performed using Chi-square tests and Kruskal–Wallis tests.
Results. In sentiment polarity judgment, LLMs are highly consistent with human raters and outperform a traditional sentiment lexicon. On continuous dimensions (intensity, valence, arousal), however, LLMs generally show an exaggeration bias, most pronounced for sensory-perceptual words and ironic sentences. This bias likely stems from the models' disembodied cognition, exaggerated emotional language in training data, and technical characteristics such as the attention mechanism.
Conclusion. This study reveals a systematic limitation of LLMs in Chinese sentiment analysis, characterised by accurate classification but exaggerated quantitative evaluations. Consequently, in practical applications, outputs related to sentiment intensity should be interpreted with caution.
License
Copyright (c) 2026 Ruili Geng, Jiwen Zhang, Ruixian Yang, Mingzhe Quan, Xiang Zheng, Yishuai Xu

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
