Unveiling moral development in generative AI chatbots
DOI: https://doi.org/10.47989/ir31iConf64283
Keywords: Moral development, Generative AI chatbots, AI ethics
Abstract
Introduction. Guided by Kohlberg’s theory, this paper aims to investigate the moral development levels of generative AI (GAI) chatbots.
Method. The Defining Issues Test Version 2 (DIT-2) was applied to assess the reasoning stages of four GAI chatbots, namely, Claude 4, Claude 4.1, ChatGPT 4o, and ChatGPT 5.
Analysis. A total of 240 data points (6 indices × 10 runs × 4 chatbots) were analysed using the Coefficient of Variation (CV), Welch’s ANOVA, and Games-Howell test.
Results. The results showed that Claude 4 was the most consistent in responding to moral dilemmas, whereas ChatGPT 4o was the least consistent. Compared with Claude 4.1 and ChatGPT 4o, Claude 4 and ChatGPT 5 exhibited similarly higher levels of postconventional reasoning and moral differentiation.
Conclusion(s). This paper advances the literature on AI ethics by shifting the focus from outcome-oriented evaluations to developmental levels of reasoning. Additionally, it extends Kohlberg’s theory of moral development into the domain of GAI. Practically, this study helps users understand the moral reasoning of the latest chatbots for more informed use. It also guides developers in improving models toward greater transparency and ethical alignment.
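The analysis pipeline described above (per-index consistency via the Coefficient of Variation, then Welch’s ANOVA across chatbots with unequal variances) can be sketched as follows. This is an illustrative reconstruction, not the authors’ code: the function names and the example data are hypothetical, and Welch’s ANOVA is implemented directly from its standard formula since it is not built into NumPy.

```python
import numpy as np

def coefficient_of_variation(x):
    """CV = sample standard deviation / mean; lower CV means more
    consistent scores across repeated runs."""
    x = np.asarray(x, dtype=float)
    return np.std(x, ddof=1) / np.mean(x)

def welch_anova(*groups):
    """Welch's ANOVA for comparing group means under unequal variances.

    Returns (F statistic, numerator df, denominator df). Each group is
    one chatbot's scores on a given DIT-2 index across repeated runs.
    """
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                       # precision weights (inverse variance x n)
    big_w = w.sum()
    grand = (w * m).sum() / big_w   # weighted grand mean
    numerator = (w * (m - grand) ** 2).sum() / (k - 1)
    tmp = (((1 - w / big_w) ** 2) / (n - 1)).sum()
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * tmp
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * tmp)
    return numerator / denominator, df1, df2

# Hypothetical example: 10 runs of one DIT-2 index for two chatbots.
chatbot_a = [42.1, 43.0, 41.8, 42.5, 42.9, 42.2, 42.7, 41.9, 42.4, 42.6]
chatbot_b = [35.0, 39.5, 31.2, 40.8, 33.6, 38.1, 30.9, 41.0, 34.4, 37.7]

cv_a = coefficient_of_variation(chatbot_a)
cv_b = coefficient_of_variation(chatbot_b)
f_stat, df1, df2 = welch_anova(chatbot_a, chatbot_b)
```

A post hoc Games-Howell test (available, for instance, in the `pingouin` package as `pairwise_gameshowell`) would then identify which specific chatbot pairs differ, since it likewise does not assume equal variances.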
References
Auger, G. A., & Gee, C. (2016). Developing Moral Maturity: An Evaluation of the Media Ethics Course Using the DIT-2. Journalism & Mass Communication Educator, 71(2), 146–162. https://doi.org/10.1177/1077695815584460
Bajpai, S., Sameer, A., & Fatima, R. (2024). Insights into Moral Reasoning Capabilities of AI: A Comparative Study between Humans and Large Language Models. In Review. https://doi.org/10.21203/rs.3.rs-5336157/v1
Behar-Horenstein, L. S., & Tolentino, L. A. (2019). Exploring Dental Student Performance in Moral Reasoning Using the Defining Issues Test 2. Journal of Dental Education, 83(1), 72–78. https://doi.org/10.21815/JDE.019.009
Carpendale, J. I. M. (2000). Kohlberg and Piaget on Stages and Moral Reasoning. Developmental Review, 20(2), 181–205. https://doi.org/10.1006/drev.1999.0500
ChatGPT — Release Notes. (2025, September 15). OpenAI Help Center. https://help.openai.com/en/articles/6825453-chatgpt-release-notes
Chen, C., Gong, X., Liu, Z., Jiang, W., Goh, S. Q., & Lam, K.-Y. (2025). Trustworthy, Responsible, and Safe AI: A Comprehensive Architectural Framework for AI Safety with Challenges and Mitigations (No. arXiv:2408.12935). arXiv. https://doi.org/10.48550/arXiv.2408.12935
Chua, A. Y. K., Chen, M., Kan, M., & Seoh, W. (2025). Digital prejudices: An analysis of gender, racial and religious biases in generative AI chatbots. Internet Research, 1–27. https://doi.org/10.1108/INTR-10-2024-1536
Claude Opus 4.1. (2025, August 6). Anthropic. https://www.anthropic.com/news/claude-opus-4-1
Feng, S. (2025). Group interaction patterns in generative AI-supported collaborative problem solving: Network analysis of the interactions among students and a GAI chatbot. British Journal of Educational Technology, 56(5), 2125–2145. https://doi.org/10.1111/bjet.13611
Garcia, B., Qian, C., & Palminteri, S. (2024). The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making (No. arXiv:2410.07304). arXiv. https://doi.org/10.48550/arXiv.2410.07304
Goel, A., Schwartz, D., & Qi, Y. (2025). Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency (No. arXiv:2508.14314). arXiv. https://doi.org/10.48550/arXiv.2508.14314
Gungordu, N., Nabizadehchianeh, G., O’Connor, E., Ma, W., & Walker, D. I. (2024). Moral reasoning development: Norms for Defining Issue Test-2 (DIT2). Ethics & Behavior, 34(4), 246–263. https://doi.org/10.1080/10508422.2023.2206573
Huang, Y., & Huang, H. (2025). Exploring the Effect of Attachment on Technology Addiction to Generative AI Chatbots: A Structural Equation Modeling Analysis. International Journal of Human–Computer Interaction, 41(15), 9440–9449. https://doi.org/10.1080/10447318.2024.2426029
Kim, D.-K., & Ming, H. (2025). Assessing output reliability and similarity of large language models in software development: A comparative case study approach. Information and Software Technology, 185, 107787. https://doi.org/10.1016/j.infsof.2025.107787
Kohlberg, L., & Hersh, R. H. (1977). Moral development: A review of the theory. Theory Into Practice, 16(2), 53–59. https://doi.org/10.1080/00405847709542675
Kruegel, S., Ostermaier, A., & Uhl, M. (2025). ChatGPT’s advice drives moral judgments with or without justification (No. arXiv:2501.01897). arXiv. https://doi.org/10.48550/arXiv.2501.01897
Krügel, S., Ostermaier, A., & Uhl, M. (2023). The moral authority of ChatGPT (No. arXiv:2301.07098). arXiv. https://doi.org/10.48550/arXiv.2301.07098
Lim, B., Seth, I., Maxwell, M., Cuomo, R., Ross, R. J., & Rozen, W. M. (2025). Evaluating the Efficacy of Large Language Models in Generating Medical Documentation: A Comparative Study of ChatGPT-4, ChatGPT-4o, and Claude. Aesthetic Plastic Surgery. https://doi.org/10.1007/s00266-025-04842-8
McGrath, C., Farazouli, A., & Cerratto-Pargman, T. (2025). Generative AI chatbots in higher education: A review of an emerging research area. Higher Education, 89(6), 1533–1549. https://doi.org/10.1007/s10734-024-01288-w
Novis-Deutsch, N., Elyoseph, T., & Elyoseph, Z. (2025). How much of a pluralist is ChatGPT? A comparative study of value pluralism in generative AI chatbots. AI & SOCIETY. https://doi.org/10.1007/s00146-025-02450-3
Nunner‐Winkler, G. (2007). Development of moral motivation from childhood to early adulthood. Journal of Moral Education, 36(4), 399–414. https://doi.org/10.1080/03057240701687970
Reed, G. F., Lynn, F., & Meade, B. D. (2002). Use of Coefficient of Variation in Assessing Variability of Quantitative Assays. Clinical and Vaccine Immunology, 9(6), 1235–1239. https://doi.org/10.1128/CDLI.9.6.1235-1239.2002
Rest, J. R., Narvaez, D., Thoma, S. J., & Bebeau, M. J. (1999). DIT2: Devising and testing a revised instrument of moral judgment. Journal of Educational Psychology, 91(4), 644–659. https://doi.org/10.1037/0022-0663.91.4.644
Sachdeva, S., Singh, P., & Medin, D. (2011). Culture and the quest for universal principles in moral reasoning. International Journal of Psychology, 46(3), 161–176. https://doi.org/10.1080/00207594.2011.568486
Shapiro, D., Li, W., Delaflor, M., & Toxtli, C. (2023). Conceptual Framework for Autonomous Cognitive Entities. https://doi.org/10.13140/RG.2.2.14161.30569
Shingala, M. C., & Rajyaguru, D. A. (2015). Comparison of Post Hoc Tests for Unequal Variance. 2(5).
Takemoto, K. (2024). The moral machine experiment on large language models. Royal Society Open Science, 11(2), 231393. https://doi.org/10.1098/rsos.231393
Thoma, S. J. (2006). Research on the Defining Issues Test. In Handbook of Moral Development. Psychology Press.
Wang, Y., Liang, L., Li, R., Wang, Y., & Hao, C. (2024). Comparison of the Performance of ChatGPT, Claude and Bard in Support of Myopia Prevention and Control. Journal of Multidisciplinary Healthcare, 17, 3917–3929. https://doi.org/10.2147/JMDH.S473680
Zhang, Z., Chen, Z., & Xu, L. (2022). Artificial intelligence and moral dilemmas: Perception of ethical decision-making in AI. Journal of Experimental Social Psychology, 101, 104327. https://doi.org/10.1016/j.jesp.2022.104327
License
Copyright (c) 2026 Jiayu Han, Alton Y.K. Chua

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
