Unveiling moral development in generative AI chatbots
DOI: https://doi.org/10.47989/ir31iConf64283
Keywords: Moral development, Generative AI chatbots, AI ethics
Abstract
Introduction. Guided by Kohlberg’s theory, this paper aims to investigate the moral development levels of generative AI (GAI) chatbots.
Method. The Defining Issues Test Version 2 (DIT-2) was applied to assess the reasoning stages of four GAI chatbots, namely, Claude 4, Claude 4.1, ChatGPT 4o, and ChatGPT 5.
Analysis. A total of 240 data points (6 indices × 10 runs × 4 chatbots) were analysed using the Coefficient of Variation (CV), Welch’s ANOVA, and Games-Howell test.
Results. The results showed that Claude 4 was the most consistent in responding to moral dilemmas, whereas ChatGPT 4o was the least consistent. Compared with Claude 4.1 and ChatGPT 4o, Claude 4 and ChatGPT 5 exhibited similarly higher levels of postconventional reasoning and moral differentiation.
Conclusion(s). This paper advances the literature on AI ethics by shifting the focus from outcome-oriented evaluations to developmental levels of reasoning. Additionally, it extends Kohlberg’s theory of moral development into the domain of GAI. Practically, this study helps users understand the moral reasoning of the latest chatbots for more informed use. It also guides developers in improving models toward greater transparency and ethical alignment.
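The analysis pipeline described above (per-index consistency via the Coefficient of Variation, then Welch’s ANOVA across chatbots with unequal variances) can be sketched as follows. This is an illustrative reconstruction, not the authors’ code: the function names and the example data are hypothetical, and Welch’s ANOVA is implemented directly from its standard formula since it is not built into NumPy.

```python
import numpy as np

def coefficient_of_variation(x):
    """CV = sample standard deviation / mean; lower CV means more
    consistent scores across repeated runs."""
    x = np.asarray(x, dtype=float)
    return np.std(x, ddof=1) / np.mean(x)

def welch_anova(*groups):
    """Welch's ANOVA for comparing group means under unequal variances.

    Returns (F statistic, numerator df, denominator df). Each group is
    one chatbot's scores on a given DIT-2 index across repeated runs.
    """
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                       # precision weights (inverse variance x n)
    big_w = w.sum()
    grand = (w * m).sum() / big_w   # weighted grand mean
    numerator = (w * (m - grand) ** 2).sum() / (k - 1)
    tmp = (((1 - w / big_w) ** 2) / (n - 1)).sum()
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * tmp)
    return numerator / denominator, df1, df2

# Hypothetical example: 10 runs of one DIT-2 index for two chatbots.
chatbot_a = [42.1, 43.0, 41.8, 42.5, 42.9, 42.2, 42.7, 41.9, 42.4, 42.6]
chatbot_b = [35.0, 39.5, 31.2, 40.8, 33.6, 38.1, 30.9, 41.0, 34.4, 37.7]

cv_a = coefficient_of_variation(chatbot_a)
cv_b = coefficient_of_variation(chatbot_b)
f_stat, df1, df2 = welch_anova(chatbot_a, chatbot_b)
```

A post hoc Games-Howell test (available, for instance, in the `pingouin` package as `pairwise_gameshowell`) would then identify which specific chatbot pairs differ, since it likewise does not assume equal variances.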
References
Auger, G. A., & Gee, C. (2016). Developing Moral Maturity: An Evaluation of the Media Ethics Course Using the DIT-2. Journalism & Mass Communication Educator, 71(2), 146–162. https://doi.org/10.1177/1077695815584460
Bajpai, S., Sameer, A., & Fatima, R. (2024). Insights into Moral Reasoning Capabilities of AI: A Comparative Study between Humans and Large Language Models. In Review. https://doi.org/10.21203/rs.3.rs-5336157/v1
Behar-Horenstein, L. S., & Tolentino, L. A. (2019). Exploring Dental Student Performance in Moral Reasoning Using the Defining Issues Test 2. Journal of Dental Education, 83(1), 72–78. https://doi.org/10.21815/JDE.019.009
Carpendale, J. I. M. (2000). Kohlberg and Piaget on Stages and Moral Reasoning. Developmental Review, 20(2), 181–205. https://doi.org/10.1006/drev.1999.0500
ChatGPT — Release Notes. (2025, September 15). OpenAI Help Center. https://help.openai.com/en/articles/6825453-chatgpt-release-notes
Chen, C., Gong, X., Liu, Z., Jiang, W., Goh, S. Q., & Lam, K.-Y. (2025). Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations (No. arXiv:2408.12935). arXiv. https://doi.org/10.48550/arXiv.2408.12935
Chua, A. Y. K., Chen, M., Kan, M., & Seoh, W. (2025). Digital prejudices: An analysis of gender, racial and religious biases in generative AI chatbots. Internet Research, 1–27. https://doi.org/10.1108/INTR-10-2024-1536
Claude Opus 4.1. (2025, August 6). Anthropic. https://www.anthropic.com/news/claude-opus-4-1
Feng, S. (2025). Group interaction patterns in generative AI-supported collaborative problem solving: Network analysis of the interactions among students and a GAI chatbot. British Journal of Educational Technology, 56(5), 2125–2145. https://doi.org/10.1111/bjet.13611
Garcia, B., Qian, C., & Palminteri, S. (2024). The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making (No. arXiv:2410.07304). arXiv. https://doi.org/10.48550/arXiv.2410.07304
Goel, A., Schwartz, D., & Qi, Y. (2025). Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency (No. arXiv:2508.14314). arXiv. https://doi.org/10.48550/arXiv.2508.14314
Gungordu, N., Nabizadehchianeh, G., O’Connor, E., Ma, W., & Walker, D. I. (2024). Moral reasoning development: Norms for Defining Issue Test-2 (DIT2). Ethics & Behavior, 34(4), 246–263. https://doi.org/10.1080/10508422.2023.2206573
Huang, Y., & Huang, H. (2025). Exploring the Effect of Attachment on Technology Addiction to Generative AI Chatbots: A Structural Equation Modeling Analysis. International Journal of Human–Computer Interaction, 41(15), 9440–9449. https://doi.org/10.1080/10447318.2024.2426029
Kim, D.-K., & Ming, H. (2025). Assessing output reliability and similarity of large language models in software development: A comparative case study approach. Information and Software Technology, 185, 107787. https://doi.org/10.1016/j.infsof.2025.107787
Kohlberg, L., & Hersh, R. H. (1977). Moral development: A review of the theory. Theory Into Practice, 16(2), 53–59. https://doi.org/10.1080/00405847709542675
Kruegel, S., Ostermaier, A., & Uhl, M. (2025). ChatGPT’s advice drives moral judgments with or without justification (No. arXiv:2501.01897). arXiv. https://doi.org/10.48550/arXiv.2501.01897
Krügel, S., Ostermaier, A., & Uhl, M. (2023). The moral authority of ChatGPT (No. arXiv:2301.07098). arXiv. https://doi.org/10.48550/arXiv.2301.07098
Lim, B., Seth, I., Maxwell, M., Cuomo, R., Ross, R. J., & Rozen, W. M. (2025). Evaluating the Efficacy of Large Language Models in Generating Medical Documentation: A Comparative Study of ChatGPT-4, ChatGPT-4o, and Claude. Aesthetic Plastic Surgery. https://doi.org/10.1007/s00266-025-04842-8
McGrath, C., Farazouli, A., & Cerratto-Pargman, T. (2025). Generative AI chatbots in higher education: A review of an emerging research area. Higher Education, 89(6), 1533–1549. https://doi.org/10.1007/s10734-024-01288-w
Novis-Deutsch, N., Elyoseph, T., & Elyoseph, Z. (2025). How much of a pluralist is ChatGPT? A comparative study of value pluralism in generative AI chatbots. AI & SOCIETY. https://doi.org/10.1007/s00146-025-02450-3
Nunner‐Winkler, G. (2007). Development of moral motivation from childhood to early adulthood. Journal of Moral Education, 36(4), 399–414. https://doi.org/10.1080/03057240701687970
Reed, G. F., Lynn, F., & Meade, B. D. (2002). Use of Coefficient of Variation in Assessing Variability of Quantitative Assays. Clinical and Vaccine Immunology, 9(6), 1235–1239. https://doi.org/10.1128/CDLI.9.6.1235-1239.2002
Rest, J. R., Narvaez, D., Thoma, S. J., & Bebeau, M. J. (1999). DIT2: Devising and testing a revised instrument of moral judgment. Journal of Educational Psychology, 91(4), 644–659. https://doi.org/10.1037/0022-0663.91.4.644
Sachdeva, S., Singh, P., & Medin, D. (2011). Culture and the quest for universal principles in moral reasoning. International Journal of Psychology, 46(3), 161–176. https://doi.org/10.1080/00207594.2011.568486
Shapiro, D., Li, W., Delaflor, M., & Toxtli, C. (2023). Conceptual Framework for Autonomous Cognitive Entities. https://doi.org/10.13140/RG.2.2.14161.30569
Shingala, M. C., & Rajyaguru, D. A. (2015). Comparison of Post Hoc Tests for Unequal Variance. 2(5).
Takemoto, K. (2024). The moral machine experiment on large language models. Royal Society Open Science, 11(2), 231393. https://doi.org/10.1098/rsos.231393
Thoma, S. J. (2006). Research on the Defining Issues Test. In Handbook of Moral Development. Psychology Press.
Wang, Y., Liang, L., Li, R., Wang, Y., & Hao, C. (2024). Comparison of the Performance of ChatGPT, Claude and Bard in Support of Myopia Prevention and Control. Journal of Multidisciplinary Healthcare, 17, 3917–3929. https://doi.org/10.2147/JMDH.S473680
Zhang, Z., Chen, Z., & Xu, L. (2022). Artificial intelligence and moral dilemmas: Perception of ethical decision-making in AI. Journal of Experimental Social Psychology, 101, 104327. https://doi.org/10.1016/j.jesp.2022.104327
License
Copyright (c) 2026 Jiayu Han, Alton Y.K. Chua

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
