<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">IR</journal-id>
<journal-title-group>
<journal-title>Information Research</journal-title>
</journal-title-group>
<issn pub-type="epub">1368-1613</issn>
<publisher>
<publisher-name>University of Bor&#x00E5;s</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">ir30iConf47146</article-id>
<article-id pub-id-type="doi">10.47989/ir30iConf47146</article-id>
<article-categories>
<subj-group xml:lang="en">
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Collaborative human-AI risk annotation: co-annotating online incivility with CHAIRA</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Park</surname><given-names>Jinkyung Katie</given-names></name>
<xref ref-type="aff" rid="aff0001"/></contrib>
<contrib contrib-type="author"><name><surname>Ellezhuthil</surname><given-names>Rahul Dev</given-names></name>
<xref ref-type="aff" rid="aff0002"/></contrib>
<contrib contrib-type="author"><name><surname>Wisniewski</surname><given-names>Pamela</given-names></name>
<xref ref-type="aff" rid="aff0003"/></contrib>
<contrib contrib-type="author"><name><surname>Singh</surname><given-names>Vivek</given-names></name>
<xref ref-type="aff" rid="aff0004"/></contrib>
<aff id="aff0001"><bold>Jinkyung Katie Park</bold> is an Assistant Professor in the School of Computing at Clemson University, Clemson, USA. She received her Ph.D. from Rutgers University, and her research focuses on Human-Computer Interaction to promote the online safety of vulnerable populations. She can be contacted at <email xlink:href="jinkyup@clemson.edu">jinkyup@clemson.edu</email></aff>
<aff id="aff0002"><bold>Rahul Dev Ellezhuthil</bold> is a Data Scientist who received a master&#x2019;s degree in Computer Science from Rutgers University. He can be contacted at <email xlink:href="rahul.e.dev@gmail.com">rahul.e.dev@gmail.com</email></aff>
<aff id="aff0003"><bold>Pamela Wisniewski</bold> is an Associate Professor in the Department of Computer Science at Vanderbilt University, Nashville, USA. Her work lies at the intersection of Human-Computer Interaction, Social Computing, and Privacy. She can be contacted at <email xlink:href="pamela.wisniewski@vanderbilt.edu">pamela.wisniewski@vanderbilt.edu</email></aff>
<aff id="aff0004"><bold>Vivek Singh</bold> is an Associate Professor in the School of Communication and Information at Rutgers University, New Brunswick, USA. He designs AI systems that are responsive to human values and needs. He can be contacted at <email xlink:href="vivek.k.singh@rutgers.edu">vivek.k.singh@rutgers.edu</email></aff>
</contrib-group>
<pub-date pub-type="epub"><day>06</day><month>05</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>30</volume>
<issue>i</issue>
<fpage>992</fpage>
<lpage>1008</lpage>
<permissions>
<copyright-year>2025</copyright-year>
<copyright-holder>&#x00A9; 2025 The Author(s).</copyright-holder>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract xml:lang="en">
<title>Abstract</title>
<p><bold>Introduction.</bold> Collaborative human-AI annotation is a promising approach for various tasks with large-scale and complex data. Tools and methods to support effective human-AI collaboration for data annotation are an important direction for research. In this paper, we present <italic>CHAIRA</italic>: a <italic>C</italic>ollaborative <italic>H</italic>uman-<italic>AI R</italic>isk <italic>A</italic>nnotation tool that enables human and AI agents to collaboratively annotate online incivility.</p>
<p><bold>Method.</bold> We leveraged Large Language Models (LLMs) to facilitate the interaction between human and AI annotators and examine four different prompting strategies. The developed CHAIRA system combines multiple prompting approaches with human-AI collaboration for online incivility data annotation.</p>
<p><bold>Analysis.</bold> We evaluated CHAIRA on 457 user comments with ground truth labels based on the inter-rater agreement between human and AI coders.</p>
<p><bold>Results.</bold> We found that the most collaborative prompt supported a high level of agreement between a human agent and AI, comparable to that of two human coders. While the AI missed some implicit incivility that human coders easily identified, it also spotted politically nuanced incivility that human coders overlooked.</p>
<p><bold>Conclusions.</bold> Our study reveals the benefits and challenges of using AI agents for incivility annotation and provides design implications and best practices for human-AI collaboration in subjective data annotation.</p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Online incivility refers to features of discussion that convey an unnecessarily disrespectful tone toward the discussion forum, its participants, or the topic (<xref rid="R4" ref-type="bibr">Coe et al., 2014</xref>). Considering its adverse effects (<xref rid="R6" ref-type="bibr">Gervais, 2015</xref>; <xref rid="R8" ref-type="bibr">Han et al., 2018</xref>), it is important to identify and moderate incivil comments on social media platforms as well as to understand the nature and characteristics of incivility (<xref rid="R4" ref-type="bibr">Coe et al., 2014</xref>; <xref rid="R11" ref-type="bibr">Jhaver et al., 2018</xref>; <xref rid="R20" ref-type="bibr">Matias, 2019</xref>; <xref rid="R26" ref-type="bibr">Oz et al., 2018</xref>), both of which require the annotation of digital trace data. This annotation process often includes crowdsourced workers (<xref rid="R9" ref-type="bibr">Hosseinmardi et al., 2015</xref>), a team of researchers and research assistants (<xref rid="R28" ref-type="bibr">Park et al., 2023</xref>b; <xref rid="R33" ref-type="bibr">Singh et al., 2017</xref>), and/or domain experts (<xref rid="R27" ref-type="bibr">Park et al., 2023a</xref>). It involves an intensive and collaborative process of training, consensus-building, and quality control among multiple coders; therefore, it can be costly and time-consuming, while still yielding uneven levels of inter-coder agreement (<xref rid="R4" ref-type="bibr">Coe et al., 2014</xref>; <xref rid="R30" ref-type="bibr">Rains et al., 2017</xref>). There is a growing need for innovative and efficient methods to support human coders in annotating large corpora of online data, which can have a significant methodological impact on information science research.</p>
<p>In this study, we explore the use of Large Language Model (LLM)-based Conversational Agents (CAs) as AI-based co-coders for annotating online incivility data. We focus on LLM-based CAs because they have shown promising results in text annotation tasks due to their accuracy and adaptability (<xref rid="R1" ref-type="bibr">Amin et al., 2023</xref>; <xref rid="R8" ref-type="bibr">Huang et al., 2023</xref>; <xref rid="R15" ref-type="bibr">Kuzman &#x0026; Ljube&#x0161;i&#x0107;, 2023</xref>; <xref rid="R17" ref-type="bibr">Liu et al., 2023</xref>; Zhang et al., 2022a; <xref rid="R42" ref-type="bibr">Zhang et al., 2024</xref>). Moreover, LLMs can be adapted through finetuning or prompting for specific domains (<xref rid="R34" ref-type="bibr">Song et al., 2024</xref>), setting a new standard for what is achievable in natural language tasks. However, there are challenges in the use of LLM-based CAs in annotating textual data for more contextualized constructs (<xref rid="R1" ref-type="bibr">Amin et al., 2023</xref>; <xref rid="R8" ref-type="bibr">Huang et al., 2023</xref>), indicating the need for human-AI collaboration on subjective and contextualized annotation tasks.</p>
<p>In this paper, we present &#x201C;CHAIRA: a Collaborative Human-AI Risk Annotation tool&#x201D; that enables human and AI agents to co-annotate online incivility. We share early results from the design and implementation of CHAIRA, which we developed to interact with human coders and provide suggestions and explanations for annotating online incivility. Using 457 user comments with ground truth labels (e.g., civil vs. incivil), we experiment with four types of prompting methods with different levels of information exchange between the human coder and CHAIRA. Using 10% of the data (<italic>n</italic> = 50), we established inter-rater agreement between the human coder and CHAIRA to observe how different types of prompting methods impact data annotation results. We analysed the conversation log between human coders and CHAIRA to gain qualitative insights into how the quality of annotations changes with different prompting approaches. As such, we use a mixed methods approach to address the following research questions:</p>
<p><bold>RQ1:</bold> how do different types of prompting methods influence the inter-rater reliability of human-AI collaborative data annotation results?</p>
<p><bold>RQ2:</bold> how do different types of prompting methods influence the quality and rationale for human-AI collaborative data annotation results?</p>
<p>By answering the above research questions, we address the overarching question: &#x2018;<italic>What is the optimal prompting approach and best practices to make the performance of human-AI collaboration similar to that of human-human collaboration?&#x2019;</italic> We found that CHAIRA&#x2019;s performance in terms of inter-coder agreement with human coders improved with more detailed prompts. The most advanced approach, the two-stage few-shot chain-of-thought prompt, nearly matched the agreement levels seen between two human coders reported in previous studies. While the AI agent missed some implicit incivility that human coders easily identified, it also spotted politically nuanced incivility that human coders overlooked. Our work provides design insights and best practices for human-AI collaboration in subjective data annotation tasks. It introduces a novel system for human-AI collaboration and applies different prompt engineering approaches to optimize incivility annotation. These findings are applicable beyond online incivility scenarios, offering a path for scalable annotation in subjective or low-resource settings. As such, our work contributes to the iConference community by empirically demonstrating the potential of human-AI collaboration in the context of subjective digital trace data annotation. Particularly, we contribute to the iConference community&#x2019;s focus on addressing multifaceted dimensions of AI to foster a deeper understanding of their benefits, challenges, and broader implications.</p>
</sec>
<sec id="sec2">
<title>Related work</title>
<sec id="sec2_1">
<title>Conversational agents as annotators</title>
<p>Conversational Agents (CAs) are systems enabled with the ability to interact with users using natural human dialogue (<xref rid="R31" ref-type="bibr">Rheu et al., 2021</xref>). After the recent release of various Large Language Model (LLM)-based Conversational Agents (CAs) (e.g., ChatGPT (<xref rid="R3" ref-type="bibr">OpenAI, 2022</xref>)), research communities are increasingly experimenting with data annotation tasks such as annotating political stance and sentiment of textual data (<xref rid="R1" ref-type="bibr">Amin et al., 2023</xref>; <xref rid="R15" ref-type="bibr">Kuzman &#x0026; Ljube&#x0161;i&#x0107;, 2023</xref>; <xref rid="R17" ref-type="bibr">Liu et al., 2023</xref>; <xref rid="R18" ref-type="bibr">Zhang et al., 2022</xref>a). Emerging literature suggests that LLM-based CAs can be useful for text classification tasks (e.g., <xref rid="R1" ref-type="bibr">Amin et al., 2023</xref>; <xref rid="R8" ref-type="bibr">Huang et al., 2023</xref>; <xref rid="R15" ref-type="bibr">Kuzman &#x0026; Ljube&#x0161;i&#x0107;, 2023</xref>; <xref rid="R17" ref-type="bibr">Liu et al., 2023</xref>; <xref rid="R18" ref-type="bibr">Zhang et al., 2022</xref>a; <xref rid="R42" ref-type="bibr">Zhang et al., 2024</xref>). For instance, Zhang et al. (2022) showed that ChatGPT was able to annotate the political stance of tweets with an average accuracy above 70%. Moreover, LLMs can be adapted through finetuning or prompting for specific domains (<xref rid="R34" ref-type="bibr">Song et al., 2024</xref>), setting a new standard and expectations for what is achievable in natural language tasks. With proper fine-tuning, LLMs are known to even outperform crowdsourced annotators (<xref rid="R7" ref-type="bibr">Gilardi et al., 2023</xref>). As such, advances in LLMs such as GPT-4 showed a promising opportunity for data annotation at scale due to their ability to automate annotation tasks (<xref rid="R18" ref-type="bibr">Zhang et al., 2022</xref>b).
However, there are challenges in the use of LLM-based CAs in annotating textual data for more contextualized constructs. For instance, Amin et al. (<xref rid="R1" ref-type="bibr">2023</xref>) showed that ChatGPT&#x2019;s accuracy for subjective tasks such as big five personality and suicide ideation classification was lower than that of baseline machine learning methods. As such, early empirical research demonstrated the limitations of automating subjective and contextualized annotation tasks entirely, indicating the need for human-AI collaboration on such tasks.</p>
</sec>
<sec id="sec2_2">
<title>Human-AI collaboration on annotation</title>
<p>As LLM-based conversational agents have shown the ability to interact with humans and work with examples in various domains (<xref rid="R13" ref-type="bibr">Kim et al., 2022</xref>; <xref rid="R16" ref-type="bibr">Lai et al., 2022</xref>; <xref rid="R19" ref-type="bibr">Mackeprang et al., 2019</xref>; <xref rid="R36" ref-type="bibr">Tang et al., 2024</xref>), researchers are exploring the potential of human-AI collaboration on various tasks such as online content moderation (<xref rid="R16" ref-type="bibr">Lai et al., 2022</xref>), thematic analysis of qualitative data (<xref rid="R12" ref-type="bibr">Jiang et al., 2021</xref>; <xref rid="R42" ref-type="bibr">Zhang et al., 2024</xref>), disease prevention (<xref rid="R18" ref-type="bibr">Lu &#x0026; Peng, 2024</xref>), and crowdsourcing (<xref rid="R35" ref-type="bibr">Tamura et al., 2024</xref>). For instance, Zhang et al. (<xref rid="R39" ref-type="bibr">2024</xref>) explored the potential of LLM-based CAs as collaborative tools for qualitative data analysis and highlighted their efficiency in reducing the time and labour such analysis requires. Yet, their performance in collaborative co-annotation exercises for online risk, where different facets of co-annotation are important, is understudied. This gap is pertinent because co-annotation tasks need to support an interactive discussion to help generate a rationale for the various decisions, particularly in the context of highly contextualized online risk behaviour (Clay-Warner, 2003), which can entail disagreement even among human coders. Yet, disagreement must not lead to capitulation; instead, it should inspire better methods of automated analysis. Therefore, the combination of manual and automated content analysis is suggested as the gold standard for identifying subjective concepts such as online risk (<xref rid="R5" ref-type="bibr">Esau, 2022</xref>). In recent work, Wang et al. (<xref rid="R39" ref-type="bibr">2024</xref>) designed a multi-step human-LLM collaborative framework for data annotation tasks (e.g., natural language interface, stance detection, and hate speech detection) and found that when LLMs are incorrect in complex or domain-specific tasks, human annotations without any LLM assistance were the most accurate. Therefore, they suggested an iterative process of human-AI collaboration in data annotation, with feedback from the human annotators used to improve the quality of LLM annotations.</p>
<p>In this work, we explore the potential of using LLM-based CAs to assist human coders in annotating subjective, nuanced online conversations. We expect that LLM-based CAs can support high-quality annotations with explanations when provided with proper instructions and examples. This approach could enhance scalability and help capture nuances that human coders might miss due to cognitive limits. We experimented with four prompting methods to assess their impact on annotation results and how co-annotation can improve subjective data annotation. By examining agreements and disagreements between human coders and LLM-based CAs, we explored how their strengths and weaknesses can complement each other.</p>
</sec>
</sec>
<sec id="sec3">
<title>Methods</title>
<sec id="sec3_1">
<title>Design and system implementation of CHAIRA</title>
<p>CHAIRA is an online annotation tool that integrates an LLM-based conversational agent to support human-AI collaboration on online risk data annotation. Below, we describe how we designed and implemented CHAIRA in detail.</p>
<sec id="sec3_1_1">
<title>System implementation and dataset</title>
<p>We leveraged GPT-3.5 Turbo (the underlying model for OpenAI&#x2019;s ChatGPT) as the LLM of choice due to its popularity and ease of use. We developed a custom annotation interface on top of OpenAI&#x2019;s API to support human-AI co-annotation. The interface was developed in React and deployed on AWS. We used S3 buckets to store the dataset and AWS Lambda to evaluate the dataset. We used the dataset collected in prior work in which researchers explored the effectiveness of embedding positive background images on online discussion forums in reducing online incivility (<xref rid="R29" ref-type="bibr">Park &#x0026; Singh, 2022</xref>). The data comprised 457 comments collected from 105 users who participated in an online experiment; the comments were annotated for online incivility by two human coders. In the prior work (<xref rid="R29" ref-type="bibr">Park &#x0026; Singh, 2022</xref>), researchers designed a codebook to annotate the 457 user comments as civil vs. incivil and worked with a human coder (i.e., research assistant) to establish the reliability of the incivility coding scheme. After the researcher and the coder had multiple training sessions, 10% of the comments (<italic>n</italic> = 45) were coded to establish interrater reliability. The reported interrater reliability scores in prior work were 0.88 (percent agreement) and 0.76 (Cohen&#x2019;s Kappa). Once the interrater reliability was established, the researcher coded the rest of the data. In the final dataset, 55% were reported to be civil cases (<italic>n</italic> = 250) while 45% were incivil cases (<italic>n</italic> = 207) (<xref rid="R29" ref-type="bibr">Park &#x0026; Singh, 2022</xref>).</p>
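<p>To make this pipeline concrete, the core annotation call can be sketched as a small function. This is an illustrative sketch rather than the deployed implementation: the <italic>send_message</italic> callable stands in for a thin wrapper around the OpenAI chat API (in our deployment, invoked via AWS Lambda), and the label-parsing heuristic is a simplifying assumption.</p>

```python
def annotate_comment(send_message, prompt, comment):
    """Ask the LLM to label one user comment for incivility.

    send_message: any callable that takes the full prompt string and returns
    the model's free-text reply (assumed to wrap a chat-completion API call).
    Returns the parsed label plus the raw reply, which the interface can show
    as the 'AI Labeller' rationale.
    """
    reply = send_message(f"{prompt}\n\nText: {comment}")
    # Check 'incivil' first: 'civil' is a substring of 'incivil'
    label = "incivil" if "incivil" in reply.lower() else "civil"
    return label, reply
```

<p>Decoupling the model call from the label parsing keeps the annotation logic testable without network access.</p>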
<p>In this work, we split the labelled dataset (457 user comments) into training (5%), validation (10%), and test (85%) datasets using a stratified random sampling approach. Consistent with common practices in human-human collaborative coding, less than 5% of the data (20 user comments) were allocated to the training dataset to be used as examples and initial instructions (similar to training sessions for human-human coding). Then each prompt was evaluated on the 50 samples (approx. 10% of the data) allocated to the validation dataset to assess the inter-rater reliability between the human coder and CHAIRA. The rest of the comments (387 comments) were allocated to the test dataset to evaluate human-AI agreement on final online risk data annotation results.</p>
<p>Following the common human-human coding practices, we split our training, validation, and test datasets to be independent of each other. For instance, samples from either the validation or test split were not mixed with the training split. In addition, only comments from the training set can be added as examples in the prompts to fine-tune CHAIRA. Similarly, samples from the validation set can be used to interact with CHAIRA but cannot be added as examples in a prompt. Samples from the test split cannot be loaded into the prompt.</p>
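<p>A minimal version of this stratified split can be sketched as follows. The field names and rounding behaviour are illustrative assumptions; in our study, the exact split sizes (20/50/387) were fixed in advance rather than derived from rounding alone.</p>

```python
import random
from collections import defaultdict

def stratified_split(comments, train_frac=0.05, val_frac=0.10, seed=42):
    """Split labelled comments into train/validation/test sets while
    preserving the civil/incivil label ratio within each split."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for comment in comments:
        by_label[comment["label"]].append(comment)
    train, val, test = [], [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_train = round(len(items) * train_frac)
        n_val = round(len(items) * val_frac)
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

<p>Because sampling is performed per label, the civil/incivil proportions of the full dataset carry over to each split, which is the property the stratified approach above is meant to guarantee.</p>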
</sec>
<sec id="sec3_1_2">
<title>Design of web interface</title>
<p>The web interface of CHAIRA (see <xref ref-type="fig" rid="F1">Figure 1</xref>) provides an overview of the layout for a human coder to interact with CHAIRA and design prompts. The left side of the interface shows a list of different prompts created for online risk annotation tasks in black labels. Once the human coder clicks a specific prompt, the label of the prompt turns blue to indicate that it is an active prompt. The right side of the interface shows the chosen prompt, comment data, conversation log between the human coder and AI agent, and inter-rater agreement. Below, we zoom into the major components of the CHAIRA interface to describe how each component was designed to facilitate human-AI collaboration on online risk data annotation tasks.</p>
<fig id="F1">
<label>Figure 1.</label>
<caption><p>Overview of CHAIRA web interface. A list of designed prompts is shown on the left side, while the prompt, conversation log, and evaluation metrics/results are shown on the right side</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c83-fig1.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p><bold>Creating prompts</bold>: On the left side of the interface, the human coders can create new prompts to interact with an AI agent by clicking &#x2018;Add Prompt&#x2019; (<xref ref-type="fig" rid="F2">Figure 2</xref>). After clicking &#x2018;Add Prompt&#x2019;, a new text box appears where human coders can add the name of a certain prompt in the &#x2018;Prompt label&#x2019; box and add the content of the prompt in the &#x2018;Prompt text&#x2019; box to create a new prompt. Once new prompts are created, the human coders can add sample comments to test with the new prompts. A double arrow icon (on the left side of the red box in <xref ref-type="fig" rid="F2">Figure 2</xref>) helps human coders to randomly sample comments from the training data. The user comments are added within the threads under each prompt with bullet points. Beyond the labelled dataset (457 user comments), the human coders can manually add new comments to label incivility using the same prompt by clicking a plus icon in the middle of the red box in <xref ref-type="fig" rid="F2">Figure 2</xref>. Yet, these comments are not included when evaluating the prompt for inter-rater agreement with a human coder. Human coders can create copies of existing prompts by clicking the double square icon on the left in the red box in <xref ref-type="fig" rid="F2">Figure 2</xref>. We chose icons for the above three features as there is limited space allocated in the prompt label (<xref ref-type="fig" rid="F1">Figure 1</xref>). The &#x2018;Export&#x2019; feature helps human coders download the prompt and conversation log data in a JSON file format, while the &#x2018;Import&#x2019; feature does the opposite, uploading JSON files to the interface.</p>
<fig id="F2">
<label>Figure 2.</label>
<caption><p>Features to create and manage prompts to interact with CHAIRA</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c83-fig2.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p><bold>Evaluating inter-rater agreement</bold>: Inter-rater agreement between the human coder and the AI agent can be assessed using the &#x2018;Evaluate&#x2019; buttons (<xref ref-type="fig" rid="F3">Figure 3</xref>). To report inter-rater agreement between the human coder and CHAIRA, we used percent agreement and Cohen&#x2019;s Kappa, following the practices in the literature on qualitative content analysis (<xref rid="R21" ref-type="bibr">McDonald et al., 2019</xref>; <xref rid="R37" ref-type="bibr">Tinsley &#x0026; Weiss, 2000</xref>). &#x2018;Add Training Data&#x2019; loads all 20 user comment data from the training dataset to be used as examples and initial instructions. Once the human coders click the button, the interface creates a thread under the prompt to display all 20 user comment data (see the left side of <xref ref-type="fig" rid="F1">Figure 1</xref>). &#x2018;Add Validation Data&#x2019; loads all 50 samples allocated in the validation dataset to establish the inter-rater agreement between the human coders and CHAIRA. After looking at the evaluation results, the human coder can edit the prompt by clicking the &#x2018;Edit Prompt&#x2019; button. The interface can evaluate multiple prompts at the same time, which supports its scalability.</p>
<fig id="F3">
<label>Figure 3.</label>
<caption><p>Features to assess inter-rater reliability between the human coders and CHAIRA</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c83-fig3.jpg"><alt-text>none</alt-text></graphic>
</fig>
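<p>For reference, the two agreement metrics behind the &#x2018;Evaluate&#x2019; feature can be computed in a few lines. The sketch below assumes the two coders&#x2019; labels are held in plain Python lists of equal length.</p>

```python
def percent_agreement(labels_a, labels_b):
    """Observed agreement: the share of items both coders labelled the same."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    # Expected chance agreement from each coder's marginal label frequencies
    categories = set(labels_a) | set(labels_b)
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    return (p_o - p_e) / (1 - p_e)
```

<p>Percent agreement is easy to interpret but can be inflated when one label dominates; Kappa discounts the agreement expected by chance, which is why both metrics are reported.</p>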
<p><bold>Human-AI interaction</bold>: Once the interface loads user comment data from the training set, the human coders can have interactive conversations with the AI agent on the right side of the interface. Once the human coder clicks a certain user comment under the prompt, the interface shows the prompt, comment data, and response from CHAIRA (<xref ref-type="fig" rid="F4">Figure 4</xref>).</p>
<fig id="F4">
<label>Figure 4.</label>
<caption><p>Features to support interactive communication between the human coders and CHAIRA</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c83-fig4.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>The &#x2018;Type&#x2019; icon on the top left indicates an incivility label annotated by the human coder in prior work. The &#x2018;Split&#x2019; icon indicates to which dataset the user comment data belongs. As seen in <xref ref-type="fig" rid="F4">Figure 4</xref>, &#x2018;incivil&#x2019; and &#x2018;train&#x2019; mean that the user comment data came from the training dataset and was annotated as incivil by the human coder. Right below these icons, the text in the prompt is shown as brown text with the label &#x2018;Prompt&#x2019; on a yellow background. User comment data to annotate for incivility comes next as blue text with the label &#x2018;Data&#x2019; on a light-blue background. Then the incivility labels and the rationales for the decision generated by CHAIRA follow as green text with the label &#x2018;AI Labeller&#x2019; on a light-green background. The three components above are automatically generated when evaluating the inter-rater reliability for each prompt.</p>
<p>After reviewing the initial response generated by CHAIRA, the human coders can start a conversation by adding queries in the textbox and clicking the &#x2018;Add&#x2019; button next to the textbox. Then, when the human coders click the &#x2018;Generate&#x2019; button, CHAIRA generates responses to the given queries. The queries asked by the human coders appear with the label &#x2018;Human Labeller&#x2019; on a light-blue background, while the answers generated by CHAIRA appear with the label &#x2018;AI Labeller&#x2019; on a light-green background (<xref ref-type="fig" rid="F4">Figure 4</xref>). We designed text generated by CHAIRA to appear as green text on light-green backgrounds and text submitted by the human coders to appear as blue text on light-blue backgrounds, helping human coders distinguish the text generated by the two parties.</p>
<p>Through these interactive conversations, additional instructions and examples are exchanged between the two parties. Once the human coder decides that a reasonable agreement has been achieved, the conversation log between the human coder and CHAIRA can be added as a prompt by clicking &#x2018;Add To Prompt&#x2019; (the two-stage prompt in the next section). Following the common practices in human-human collaborative coding, in which approximately 5-10% of the data is used for training and consensus building, we allowed only conversations around the 20 user comments in the training dataset to be added as prompts using the &#x2018;Add To Prompt&#x2019; feature. Finally, the &#x2018;Edit Data&#x2019; feature helps the human coders edit the user comment data.</p>
</sec>
</sec>
<sec id="sec3_2">
<title>Prompt engineering approaches</title>
<p>We conducted experiments with four different prompt engineering approaches: zero-shot, definition, few-shot, and two-stage few-shot Chain-of-Thought (CoT). These four approaches provide varying levels of interaction between the human coder and the LLM-based agent. We used the same coding scheme as applied in the previous literature (<xref rid="R29" ref-type="bibr">Park &#x0026; Singh, 2022</xref>) to design the prompts. In our prompts, incivility was defined as &#x2018;the feature of discussion that conveys an unnecessarily disrespectful tone toward the discussion forum, its participants, or its topic&#x2019; (<xref rid="R4" ref-type="bibr">Coe et al., 2014</xref>), with six different categories: name-calling, aspersion, lying, vulgarity, pejorative for speech, and others (<xref ref-type="table" rid="T1">Table 1</xref>). The reported inter-rater agreement between the two human coders was 0.88 (percent agreement) and 0.76 (Cohen&#x2019;s Kappa) (<xref rid="R29" ref-type="bibr">Park &#x0026; Singh, 2022</xref>).</p>
<table-wrap id="T1">
<label>Table 1.</label>
<caption><p>Definition and examples of types of incivility (<xref rid="R29" ref-type="bibr">Park &#x0026; Singh, 2022</xref>)</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Category</bold></th>
<th align="center" valign="top"><bold>Description</bold></th>
<th align="center" valign="top"><bold>Example</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top">Name-calling</td>
<td align="center" valign="top">Mean-spirited or disparaging words directed at a person or group of people</td>
<td align="center" valign="top"><italic>&#x2018;At least the morons in the state capital no longer have control of this process!&#x2019;</italic></td>
</tr>
<tr>
<td align="center" valign="top">Aspersion</td>
<td align="center" valign="top">Mean-spirited or disparaging words directed at an idea, plan, policy, or behaviour. An aspersion may be both explicit and implicit</td>
<td align="center" valign="top"><italic>&#x2018;It beckons the memories of Trump&#x2019;s silly border wall, and the incredible waste of resources that was&#x2019;</italic></td>
</tr>
<tr>
<td align="center" valign="top">Lying</td>
<td align="center" valign="top">Stating or implying that an idea, plan, policy, or public figure was disingenuous</td>
<td align="center" valign="top"><italic>&#x2018;Government is wrong, is corrupt, is lying, is deceiving the people, and is violating our constitution&#x2019;</italic></td>
</tr>
<tr>
<td align="center" valign="top">Vulgarity</td>
<td align="center" valign="top">Using profanity or language that would not be considered proper in professional discourse</td>
<td align="center" valign="top"><italic>&#x2018;Am I possibly the only person here who thinks this shit is funny as hell?&#x2019;</italic></td>
</tr>
<tr>
<td align="center" valign="top">Pejorative for speech</td>
<td align="center" valign="top">Disparaging remark about the way in which a person communicates</td>
<td align="center" valign="top"><italic>&#x2018;Quit crying over the spilled milk of&#x2019;</italic></td>
</tr>
<tr>
<td align="center" valign="top">Others</td>
<td align="center" valign="top">All comments that may be deemed incivil, but do not fall into any of the previous categories of incivility</td>
<td align="center" valign="top"><italic>&#x2018;Hahahahahahahahahahaha,, really crack me open this one&#x2019;</italic></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>Zero-shot prompting</bold>: In zero-shot prompting, the model is only given a simple instruction describing the task. This method is considered convenient and has the potential for robustness (<xref rid="R2" ref-type="bibr">Brown et al., 2020</xref>). The instruction used in the zero-shot prompt is as follows: &#x2018;Classify the text into &#x2018;civil&#x2019; or &#x2018;incivil&#x2019; and explain why&#x2019;.</p>
<p><bold>Definition prompting</bold>: In definition prompting, along with the instruction of classifying a comment as &#x2018;civil&#x2019; or &#x2018;incivil,&#x2019; we provided the definition of incivility and brief descriptions of six categories of incivility (see <xref ref-type="table" rid="T1">Table 1</xref>).</p>
<p><bold>Few-shot prompting</bold>: With few-shot prompting, models are given a few demonstrations of the task (<xref rid="R2" ref-type="bibr">Brown et al., 2020</xref>), in our case, examples of incivility. With this approach, we provided the model with the definition of incivility, descriptions of the six categories of incivility, examples of the six categories of incivility (<xref ref-type="table" rid="T1">Table 1</xref>), and the instructions for the task.</p>
<p><bold>Two-stage few-shot chain-of-thought</bold>: Finally, we used a two-stage few-shot chain-of-thought (CoT) approach, which combines few-shot prompting with chain-of-thought reasoning. CoT prompting (<xref rid="R40" ref-type="bibr">Wei et al., 2022</xref>) modifies the answers in few-shot examples into step-by-step answers by adding an instruction such as &#x2018;Let&#x2019;s think step by step&#x2019; to the original prompt to elicit reasoning in LLMs (<xref rid="R14" ref-type="bibr">Kojima et al., 2022</xref>; <xref rid="R40" ref-type="bibr">Wei et al., 2022</xref>). The first prompt performs reasoning extraction, where we used CoT to elicit reasoning from the LLM. The second prompt consists of the first prompt and the answers generated from the first prompt.</p>
<p><xref ref-type="fig" rid="F5">Figure 5</xref> shows a visualization of the two-stage few-shot CoT approach applied. In the first prompt, we provided the same instructions as in few-shot prompting and added a final line that says, &#x2018;Let&#x2019;s work this out in a step-by-step way to be sure we have the right answer&#x2019;, as suggested by Zhou et al. (2022). Next, we reviewed the responses generated by CHAIRA from the first prompt and performed an error analysis. We chose one example of a false positive case (i.e., CHAIRA output = civil, human ground truth = incivil) and prompted the model to recognize what it was missing in its answers (i.e., implicit aspersion). Once CHAIRA generated responses that matched the human responses, we added the conversation log (<xref ref-type="fig" rid="F6">Figure 6</xref> in Appendix A) to the prompt. A summary of the four prompting approaches we used in this case study is presented in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<fig id="F5">
<label>Figure 5.</label>
<caption><p>Pipeline of two-stage chain of thought prompting. Human feedback on phase 1 and CA responses are prepended to the input for phase 2 sent to CA</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c83-fig5.jpg"><alt-text>none</alt-text></graphic>
</fig>
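<p>To make the four prompting strategies concrete, their assembly can be sketched as follows. This is an illustrative reconstruction, not the study&#x2019;s actual code: the instruction and the step-by-step suffix are quoted from the text above, while DEFINITIONS, EXAMPLES, and CONVERSATION_LOG are placeholders standing in for the Table 1 material and the logged human-AI exchange.</p>

```python
# Illustrative reconstruction of the four prompting strategies.
# INSTRUCTION and COT_SUFFIX are quoted from the paper; DEFINITIONS,
# EXAMPLES, and CONVERSATION_LOG are placeholders for the Table 1
# material and the logged human-AI exchange (hypothetical content).

INSTRUCTION = "Classify the text into 'civil' or 'incivil' and explain why."
DEFINITIONS = "Definition of incivility and its six categories (from Table 1)."
EXAMPLES = "One example comment per incivility category (from Table 1)."
CONVERSATION_LOG = "Logged human-CHAIRA exchange about implicit aspersion."
COT_SUFFIX = ("Let's work this out in a step-by-step way "
              "to be sure we have the right answer.")

def build_prompt(strategy: str, text: str) -> str:
    """Assemble the prompt for one of the four strategies."""
    parts = [INSTRUCTION]
    if strategy in ("definition", "few-shot", "two-stage-cot"):
        parts.append(DEFINITIONS)
    if strategy in ("few-shot", "two-stage-cot"):
        parts.append(EXAMPLES)
    if strategy == "two-stage-cot":
        # second stage: the CoT suffix elicits reasoning, and the logged
        # conversation from the first stage is prepended to the request
        parts.append(COT_SUFFIX)
        parts.append(CONVERSATION_LOG)
    parts.append(f"Text: {text}")
    return "\n\n".join(parts)
```

<p>In the actual two-stage procedure, the step-by-step suffix belongs to the first stage and the conversation log is added only in the second stage; the single function above folds both additions together for brevity.</p>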
</sec>
</sec>
<sec id="sec4">
<title>Results</title>
<p>Our methodology, system implementation, and prompting strategies showed that practical systems for human-AI collaboration in online risk annotation are feasible. To investigate RQ1, we measured the inter-coder agreement between the human coder and the AI agent. The inter-rater agreement increased with the amount of detail given in the prompts. Two-stage CoT yielded the highest performance (Cohen&#x2019;s Kappa = 0.71), yet still lower than the baseline agreement between two human coders (Cohen&#x2019;s Kappa = 0.76) (<xref rid="R29" ref-type="bibr">Park &#x0026; Singh, 2022</xref>) (see <xref ref-type="table" rid="T2">Table 2</xref>).</p>
<table-wrap id="T2">
<label>Table 2.</label>
<caption><p>Summary and performance of the four prompting approaches</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"></th>
<th align="center" valign="top" colspan="4"><bold>Prompt</bold></th>
<th align="center" valign="top" colspan="2"><bold>Performance</bold></th>
</tr>
<tr>
<th align="center" valign="top"></th>
<th align="center" valign="top"><bold>Instruction</bold></th>
<th align="center" valign="top"><bold>Definition</bold></th>
<th align="center" valign="top"><bold>Example</bold></th>
<th align="center" valign="top"><bold>Conversation log</bold></th>
<th align="center" valign="top"><bold>Percent agreement</bold></th>
<th align="center" valign="top"><bold>Cohen&#x2019;s Kappa</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top">Zero-shot</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top"></td>
<td align="center" valign="top"></td>
<td align="center" valign="top"></td>
<td align="center" valign="top">0.66</td>
<td align="center" valign="top">0.26</td>
</tr>
<tr>
<td align="center" valign="top">Definition</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top"></td>
<td align="center" valign="top"></td>
<td align="center" valign="top">0.72</td>
<td align="center" valign="top">0.48</td>
</tr>
<tr>
<td align="center" valign="top">Few-shot</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top"></td>
<td align="center" valign="top">0.78</td>
<td align="center" valign="top">0.54</td>
</tr>
<tr>
<td align="center" valign="top">Two-stage CoT</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">0.86</td>
<td align="center" valign="top">0.71</td>
</tr>
<tr>
<td align="center" valign="top">Baseline</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">X</td>
<td align="center" valign="top">0.88</td>
<td align="center" valign="top">0.76</td>
</tr>
</tbody>
</table>
</table-wrap>
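<p>For reference, the two agreement measures reported above can be computed from a pair of annotator label lists as follows; this is a minimal sketch, and the label lists in it are illustrative, not the study data.</p>

```python
# Percent agreement and Cohen's Kappa between two annotators, computed
# from scratch with the standard formulas; the label lists below are
# illustrative examples, not the study's annotations.
from collections import Counter

def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    n = len(a)
    po = percent_agreement(a, b)          # observed agreement
    ca, cb = Counter(a), Counter(b)
    # expected chance agreement from each annotator's label distribution
    pe = sum((ca[label] / n) * (cb[label] / n) for label in set(a) | set(b))
    return (po - pe) / (1 - pe)

human = ["incivil", "civil", "incivil", "civil"]
chaira = ["incivil", "civil", "civil", "civil"]
# here po = 0.75, pe = 0.5, so kappa = 0.5
```
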
<p>To further understand the impact of different prompting methods on the annotation quality and rationale (RQ2), we discuss some of the common themes observed in the logs of interaction between the human coder and CHAIRA.</p>
<sec id="sec4_1">
<title>CHAIRA performed better with more human-AI interaction in the prompts, particularly for annotating implicit incivility</title>
<p>Overall, CHAIRA did a better job at annotating explicit incivility and explaining the reasoning behind its annotations as more information was provided in the prompts. For instance, even with the zero-shot prompting approach, CHAIRA pointed to the exact incivil expressions and recognized that such expressions reflect a negative attitude towards immigrants. In addition, without any information about incivility, CHAIRA was able to note the use of more nuanced incivility, such as sarcasm, when nuanced expressions were combined with explicit personal attacks and insults (examples in <xref ref-type="fig" rid="F7">Figure 7</xref> in Appendix).</p>
<p>At the same time, we observed some common issues across all the prompting approaches: CHAIRA did not recognize implicit and nuanced incivil expressions in the texts, even with the information &#x2018;an aspersion may be both explicit and implicit&#x2019; given in the prompt. Therefore, when designing the two-stage few-shot CoT prompt, we reminded CHAIRA through interactive conversation that aspersion can be implicit and nuanced. Only after the interactive conversation between the human coder and CHAIRA about implicit aspersion was added to the prompt (i.e., the two-stage few-shot prompting approach) did CHAIRA start to distinguish implicit and nuanced yet incivil expressions (examples in <xref ref-type="fig" rid="F8">Figure 8</xref> in Appendix), which were frequent in our dataset.</p>
</sec>
<sec id="sec4_2">
<title>The output label remained the same, but the reasons changed with different prompts</title>
<p>We observed that while the output label remained the same, the reasons changed with more information in the prompts. For instance, with the zero-shot prompt, CHAIRA mainly focused on the use of language and tone of the text, whereas with the definition prompt, CHAIRA considered whether the text fell under any of the six categories of incivility (<xref ref-type="fig" rid="F7">Figure 7</xref>). The rationales provided were similar for the definition and few-shot prompting approaches. A similar trend was observed for incivil cases, where CHAIRA provided specific reasons by pointing to the specific category (i.e., lying) and the context of why the text falls under that category with the definition and few-shot prompting approaches, while providing more generic reasons (i.e., the use of personal attacks or offensive language) with the zero-shot approach.</p>
<p>Overall, CHAIRA could provide human coders with helpful context on US politics, immigration policy debates, and incivil expressions that may be overlooked. For example, the text &#x2018;<italic>And around Nancy&#x2019;s wall on Capitol Hill. Make that wall to keep them in</italic>&#x2019; was annotated as civil by a human coder, as it did not contain explicit or implicit incivility. However, with zero-shot and definition prompts, CHAIRA recognized &#x2018;them&#x2019; as immigrants, and the comment about building a wall to &#x2018;keep them in&#x2019; was interpreted as disrespectful. Using the two-stage few-shot CoT approach, CHAIRA&#x2019;s rationale became even more detailed: &#x2018;The comment mocks Nancy Pelosi and implies hostility toward immigrants.&#x2019; The human coder missed that &#x2018;Nancy&#x2019; referred to a political figure and failed to recognize the disparaging tone. Although another coder might have annotated it differently, the above example shows that human coders have limits in awareness and cognitive capacity, and AI can complement these limitations. As such, with the highest level of human-AI interaction (two-stage CoT), CHAIRA effectively discerned both the tone and the target of incivility, providing crucial context in annotating political incivility.</p>
</sec>
<sec id="sec4_3">
<title>Yet, sometimes, CHAIRA did not fully understand the information in the prompts and text to annotate</title>
<p>We observed some cases where CHAIRA, unlike human coders, did not accurately pick up the textual information given in the prompts. For instance, CHAIRA lacked an understanding of the description and example of the &#x2018;Pejorative for Speech&#x2019; category. We found responses in which CHAIRA mistakenly understood pejorative for speech as the use of a sarcastic tone in the text, as opposed to its actual meaning: a disparaging remark about how a person communicates. Similarly, CHAIRA sometimes lacked an understanding of the given text compared to human coders, particularly for short texts. For instance, with the definition prompting approach, CHAIRA struggled to annotate a short text such as &#x2018;What is your solution?&#x2019; and hence annotated it as &#x2018;unclear.&#x2019; With the few-shot prompt, CHAIRA annotated the same text as incivil because the text contains examples of name-calling and aspersions. However, the incivil expressions referred to in this response came from the examples given in the prompt (i.e., the instructions), not from the text to annotate. Overall, CHAIRA sometimes could not understand the information in the prompts or the texts to annotate and hence was unable to annotate the text appropriately.</p>
<p><bold>Implications for building human-AI collaborative annotation systems</bold>: The above results yield implications for the future design of human-AI collaborative annotation systems. Below, we discuss the design implications for using AI-based CAs to best support the co-annotation of online risk data.</p>
</sec>
<sec id="sec4_4">
<title>Reasoning and domain knowledge provided by CAs are valuable resources for co-annotation workflows</title>
<p>We observed that CHAIRA was good at providing reasons for its annotation results. Therefore, the benefit of CHAIRA lies in the interactive nature of the annotation process, which provides partial explainability, one of the important aspects of human-centred AI-based systems (<xref rid="R22" ref-type="bibr">Minh et al., 2022</xref>; <xref rid="R38" ref-type="bibr">Vilone &#x0026; Longo, 2021</xref>). With the most sophisticated prompting approach we had (two-stage few-shot CoT), CHAIRA provided the human coder with broad knowledge and context about the given text and convinced the human coder to change their mind in multiple instances. Therefore, one of the strengths of LLM-based CAs is their ability to provide relevant information from training on (presumably) the entirety of online data, as opposed to human coding, which requires extensive training or domain-specific knowledge. This shows the potential strengths and value of human-AI co-annotation, particularly in risk scenarios that require domain-specific knowledge. In addition, the initial reasoning and domain knowledge provided by CAs can further inform human coders on how to design better CA models to support the co-annotation of online risk.</p>
</sec>
<sec id="sec4_5">
<title>Two-way interaction between human coders and CAs is a key to good co-annotation results</title>
<p>A major benefit of co-coding with CHAIRA was its ability to scale the data annotation with a high degree of inter-rater agreement. To achieve this, we carefully reviewed the incivility labels where CHAIRA and the human coder disagreed and had two-way conversations with CHAIRA to further understand its reasoning. This two-way interaction in the co-annotation process was useful because, despite access to a large corpus of knowledge, CHAIRA also tended to make some simple mistakes that were quite easy for human coders to spot. For instance, during the interaction with CHAIRA, we realized that CHAIRA could miss the previous conversation, and hence we needed to remind it about our conversation in further annotation tasks, particularly about nuanced incivility (e.g., implicit aspersion). Therefore, we added the instruction &#x2018;<italic>keep implicit incivility in mind</italic>&#x2019; to our two-stage CoT prompt. Sometimes, we pointed to the exact incivil expressions in the text that contained implicit aspersion to reinforce the concept (e.g., don&#x2019;t you think the expression &#x2018;he&#x2019;s hoping to stir up the same frenzy and ride that wave&#x2019; could be implicitly incivil?). CHAIRA then re-evaluated the text and corrected its answers. As such, interactive communication between the human coders and the AI agent is one of the key elements in improving the risk annotations generated by the AI agent.</p>
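<p>A minimal sketch of how such a running conversation log might be maintained, so that corrective feedback persists across annotation requests; the message roles follow common chat-API conventions, and the content strings paraphrase the exchange described above rather than reproduce the study&#x2019;s actual prompts.</p>

```python
# Sketch of a running human-AI conversation log, so corrective feedback
# (e.g. a reminder about implicit aspersion) persists across annotation
# requests. Roles follow common chat-API conventions; the strings
# paraphrase the exchange described in the text and are illustrative.

def make_session(system_prompt: str) -> list:
    return [{"role": "system", "content": system_prompt}]

def add_turn(log: list, role: str, content: str) -> list:
    log.append({"role": role, "content": content})
    return log

log = make_session("Classify the text into 'civil' or 'incivil' and "
                   "explain why. Keep implicit incivility in mind.")
add_turn(log, "user", "Don't you think the expression 'he's hoping to stir "
                      "up the same frenzy and ride that wave' could be "
                      "implicitly incivil?")
add_turn(log, "assistant", "Re-evaluating: yes, the text contains an "
                           "implicit aspersion.")
# `log` would be prepended to each subsequent annotation request.
```
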
</sec>
<sec id="sec4_6">
<title>Providing clear examples in carefully designed prompts considering how LLMs process human language is important</title>
<p>Providing clear examples of risk cases and descriptions of risk types is crucial when designing human-AI co-annotator models. In our few-shot prompt, we explained that aspersion can be both implicit and explicit, yet CHAIRA failed to recognize implicit aspersion until we guided it through two-way communication. This could be due to confusion caused by slight differences between the risk descriptions and examples provided. For instance, CHAIRA may have interpreted the explicit nature of the &#x2018;silly border wall&#x2019; in the aspersion example and missed the implicit aspect described. Therefore, selecting the right examples and crafting clear descriptions of constructs is critical when working with LLM-based CAs to annotate subjective concepts like online risk. In addition, since CAs generate responses by tokenizing input (OpenAI, 2024b), even minor textual changes such as punctuation can affect performance. This can limit the CA&#x2019;s ability to understand text containing abbreviations or spelling variations, which are common in online risk data (<xref rid="R32" ref-type="bibr">Sadeque et al., 2019</xref>). Therefore, designing prompts with careful consideration of how LLMs process natural language is essential to building effective collaborative systems for annotating contextualized online risk data.</p>
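<p>As a toy illustration of this sensitivity, even a simple regex tokenizer shows how one punctuation mark changes the entire token sequence; note this is purely illustrative, since OpenAI models use a BPE tokenizer whose behaviour differs from the regex split below.</p>

```python
# Toy illustration: minor surface changes (here, an apostrophe) alter the
# token sequence. A simple regex tokenizer stands in for the BPE tokenizer
# that LLMs actually use, so this shows the general effect only.
import re

def toy_tokenize(text: str) -> list:
    # words vs. single non-word, non-space characters
    return re.findall(r"\w+|[^\w\s]", text)

a = toy_tokenize("Don't you think?")  # ['Don', "'", 't', 'you', 'think', '?']
b = toy_tokenize("Dont you think?")   # ['Dont', 'you', 'think', '?']
assert a != b  # removing one apostrophe changes the token sequence
```
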
</sec>
<sec id="sec4_7">
<title>Limitations and future directions</title>
<p>In consonance with data ethics, we used de-identified data to develop our annotation tool. We used OpenAI API in the backend as its security policy stipulates that data submitted through the API is not used to train OpenAI models (OpenAI, 2024a). However, future work should also consider building LLM-based CAs with private servers so that the training dataset is not shared via the web. Another limitation of this collaborative annotation is variations in LLM responses. Finally, we acknowledge that inductive approaches (e.g., thematic analysis, grounded theory approach), important approaches to building patterns and themes in qualitative work, were not explored in this study. Future work can explore the potential of LLM-based CAs for more inductive analysis that requires an in-depth understanding of the subtleties and complexities of qualitative data. Moving forward, we aim to experiment with our approaches with diverse types of online risk data at scale to gain deeper insights into collaborative annotation between human and LLM-based conversational agent systems.</p>
</sec>
</sec>
<sec id="sec5">
<title>Conclusion</title>
<p>In this study, we built systems to support human-AI collaborative data annotation tasks and explored the potential benefits and challenges of human-AI collaborative annotation of highly subjective and contextualized online incivility data. The AI missed some implicit risks that human coders easily spotted; conversely, it spotted politically nuanced incivility that human coders overlooked. The design implications and best practices derived from this work can serve as a stepping stone for future research considering similar methods. Our work suggests a path toward combining the relative strengths of humans and AI for scalable data annotation, especially in sensitive or low-resource settings.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="R1"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Amin</surname><given-names>M. M.</given-names></name><name><surname>Cambria</surname><given-names>E.</given-names></name><name><surname>Schuller</surname><given-names>B. W.</given-names></name></person-group><year>2023</year><article-title>Will affective computing emerge from foundation models and general artificial intelligence? A first evaluation of ChatGPT</article-title><source>IEEE Intelligent Systems</source><volume>38</volume><issue>2</issue><fpage>15</fpage><lpage>23</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/MIS.2023.3254179">https://doi.org/10.1109/MIS.2023.3254179</ext-link></element-citation></ref>
<ref id="R2"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Brown</surname><given-names>T.</given-names></name><name><surname>Mann</surname><given-names>B.</given-names></name><name><surname>Ryder</surname><given-names>N.</given-names></name><name><surname>Subbiah</surname><given-names>M.</given-names></name><name><surname>Kaplan</surname><given-names>J. D.</given-names></name><name><surname>Dhariwal</surname><given-names>P.</given-names></name><name><surname>Amodei</surname><given-names>D.</given-names></name></person-group><year>2020</year><article-title>Language models are few-shot learners</article-title><source>Advances in neural information processing systems</source><volume>33</volume><fpage>1877</fpage><lpage>1901</lpage></element-citation></ref>
<ref id="R3"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Clay-Warner</surname><given-names>J.</given-names></name></person-group><year>2003</year><article-title>The context of sexual violence: Situational predictors of self-protective actions</article-title><source>Violence and victims</source><volume>18</volume><issue>5</issue><fpage>543</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1891/088667003780928099">https://doi.org/10.1891/088667003780928099</ext-link></element-citation></ref>
<ref id="R4"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Coe</surname><given-names>K.</given-names></name><name><surname>Kenski</surname><given-names>K.</given-names></name><name><surname>Rains</surname><given-names>S. A.</given-names></name></person-group><year>2014</year><article-title>Online and uncivil? Patterns and determinants of incivility in newspaper website comments</article-title><source>Journal of communication</source><volume>64</volume><issue>4</issue><fpage>658</fpage><lpage>679</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1111/jcom.12104">https://doi.org/10.1111/jcom.12104</ext-link></element-citation></ref>
<ref id="R5"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Esau</surname><given-names>K.</given-names></name></person-group><year>2022</year><chapter-title>Content analysis in the research field of incivility and hate speech in online communication</chapter-title><source>Standardisierte Inhaltsanalyse in der Kommunikationswissenschaft&#x2013; Standardized Content Analysis in Communication Research: Ein Handbuch-A Handbook</source><fpage>451</fpage><lpage>461</lpage><publisher-loc>Wiesbaden</publisher-loc><publisher-name>Springer Fachmedien Wiesbaden</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-658-36179-2_38">https://doi.org/10.1007/978-3-658-36179-2_38</ext-link></element-citation></ref>
<ref id="R6"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gervais</surname><given-names>B. T.</given-names></name></person-group><year>2015</year><article-title>Incivility online: Affective and behavioural reactions to uncivil political posts in a web-based experiment</article-title><source>Journal of Information Technology &#x0026; Politics</source><volume>12</volume><issue>2</issue><fpage>167</fpage><lpage>185</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1080/19331681.2014.997416">https://doi.org/10.1080/19331681.2014.997416</ext-link></element-citation></ref>
<ref id="R7"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gilardi</surname><given-names>F.</given-names></name><name><surname>Alizadeh</surname><given-names>M.</given-names></name><name><surname>Kubli</surname><given-names>M.</given-names></name></person-group><year>2023</year><article-title>ChatGPT outperforms crowd workers for text-annotation tasks</article-title><source>Proceedings of the National Academy of Sciences</source><volume>120</volume><issue>30</issue><fpage>e2305016120</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1073/pnas.2305016120">https://doi.org/10.1073/pnas.2305016120</ext-link></element-citation></ref>
<ref id="R8"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Han</surname><given-names>S. H.</given-names></name><name><surname>Brazeal</surname><given-names>L. M.</given-names></name><name><surname>Pennington</surname><given-names>N.</given-names></name></person-group><year>2018</year><article-title>Is civility contagious? Examining the impact of modelling in online political discussions</article-title><source>Social Media+ Society</source><volume>4</volume><issue>3</issue><fpage>2056305118793404</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1177/2056305118793404">https://doi.org/10.1177/2056305118793404</ext-link></element-citation></ref>
<ref id="R9"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hosseinmardi</surname><given-names>H.</given-names></name><name><surname>Mattson</surname><given-names>S. A.</given-names></name><name><surname>Ibn Rafiq</surname><given-names>R.</given-names></name><name><surname>Han</surname><given-names>R.</given-names></name><name><surname>Lv</surname><given-names>Q.</given-names></name><name><surname>Mishra</surname><given-names>S.</given-names></name></person-group><year>2015</year><article-title>Analysing labelled cyberbullying incidents on the Instagram social network</article-title><source>Social Informatics: 7th International Conference</source><comment>SocInfo 2015, Beijing, China, December 9-12, 2015, Proceedings</comment> <volume>7</volume><fpage>49</fpage><lpage>66</lpage><comment>Springer International Publishing</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-27433-1_4">https://doi.org/10.1007/978-3-319-27433-1_4</ext-link></element-citation></ref>
<ref id="R10"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Huang</surname><given-names>F.</given-names></name><name><surname>Kwak</surname><given-names>H.</given-names></name><name><surname>An</surname><given-names>J.</given-names></name></person-group><year>2023</year><comment>April</comment><article-title>Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech</article-title><source>Companion proceedings of the ACM web conference 2023</source><fpage>294</fpage><lpage>297</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3543873.3587368">https://doi.org/10.1145/3543873.3587368</ext-link></element-citation></ref>
<ref id="R11"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jhaver</surname><given-names>S.</given-names></name><name><surname>Ghoshal</surname><given-names>S.</given-names></name><name><surname>Bruckman</surname><given-names>A.</given-names></name><name><surname>Gilbert</surname><given-names>E.</given-names></name></person-group><year>2018</year><article-title>Online harassment and content moderation: The case of blocklists</article-title><source>ACM Transactions on Computer-Human Interaction (TOCHI)</source><volume>25</volume><issue>2</issue><fpage>1</fpage><lpage>33</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3185593">https://doi.org/10.1145/3185593</ext-link></element-citation></ref>
<ref id="R12"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname><given-names>J. A.</given-names></name><name><surname>Wade</surname><given-names>K.</given-names></name><name><surname>Fiesler</surname><given-names>C.</given-names></name><name><surname>Brubaker</surname><given-names>J. R.</given-names></name></person-group><year>2021</year><article-title>Supporting serendipity: Opportunities and challenges for Human-AI Collaboration in qualitative analysis</article-title><source>Proceedings of the ACM on Human-Computer Interaction</source><volume>5</volume><issue>CSCW1</issue><fpage>1</fpage><lpage>23</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3449168">https://doi.org/10.1145/3449168</ext-link></element-citation></ref>
<ref id="R13"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Kim</surname><given-names>T. S.</given-names></name><name><surname>Choi</surname><given-names>D.</given-names></name><name><surname>Choi</surname><given-names>Y.</given-names></name><name><surname>Kim</surname><given-names>J.</given-names></name></person-group><year>2022</year><comment>April</comment><article-title>Stylette: Styling the web with natural language</article-title><source>Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems</source><fpage>1</fpage><lpage>17</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3491102.3501931">https://doi.org/10.1145/3491102.3501931</ext-link></element-citation></ref>
<ref id="R14"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kojima</surname><given-names>T.</given-names></name><name><surname>Gu</surname><given-names>S. S.</given-names></name><name><surname>Reid</surname><given-names>M.</given-names></name><name><surname>Matsuo</surname><given-names>Y.</given-names></name><name><surname>Iwasawa</surname><given-names>Y.</given-names></name></person-group><year>2022</year><article-title>Large language models are zero&#x2013;shot reasoners</article-title><source>Advances in neural information processing systems</source><volume>35</volume><fpage>22199</fpage><lpage>22213</lpage></element-citation></ref>
<ref id="R15"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Kuzman</surname><given-names>T.</given-names></name><name><surname>Ljube&#x0161;i&#x0107;</surname><given-names>N.</given-names></name></person-group><year>2023</year><article-title>Automatic genre identification: a survey</article-title><source>Lang Resources &#x0026; Evaluation</source><fpage>1</fpage><lpage>34</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s10579-023-09695-8">https://doi.org/10.1007/s10579-023-09695-8</ext-link></element-citation></ref>
<ref id="R16"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Lai</surname><given-names>V.</given-names></name><name><surname>Carton</surname><given-names>S.</given-names></name><name><surname>Bhatnagar</surname><given-names>R.</given-names></name><name><surname>Liao</surname><given-names>Q. V.</given-names></name><name><surname>Zhang</surname><given-names>Y.</given-names></name><name><surname>Tan</surname><given-names>C.</given-names></name></person-group><year>2022</year><comment>April</comment><article-title>Human-AI collaboration via conditional delegation: A case study of content moderation</article-title><source>Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems</source><fpage>1</fpage><lpage>18</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3491102.3501999">https://doi.org/10.1145/3491102.3501999</ext-link></element-citation></ref>
<ref id="R17"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>Y.</given-names></name><name><surname>Han</surname><given-names>T.</given-names></name><name><surname>Ma</surname><given-names>S.</given-names></name><name><surname>Zhang</surname><given-names>J.</given-names></name><name><surname>Yang</surname><given-names>Y.</given-names></name><name><surname>Tian</surname><given-names>J.</given-names></name><name><surname>Ge</surname><given-names>B.</given-names></name></person-group><year>2023</year><article-title>Summary of ChatGPT-related research and perspective towards the future of large language models</article-title><source>Meta-Radiology</source><fpage>100017</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.metrad.2023.100017">https://doi.org/10.1016/j.metrad.2023.100017</ext-link></element-citation></ref>
<ref id="R18"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Lu</surname><given-names>Q.</given-names></name><name><surname>Peng</surname><given-names>X.</given-names></name></person-group><year>2024</year><comment>April</comment><chapter-title>Differences in Knowledge Adoption Among Task Types in Human-AI Collaboration Under the Chronic Disease Prevention Scenario</chapter-title><source>International Conference on Information</source><fpage>213</fpage><lpage>231</lpage><publisher-loc>Cham</publisher-loc><publisher-name>Springer Nature Switzerland</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-031-57867-0_16">https://doi.org/10.1007/978-3-031-57867-0_16</ext-link></element-citation></ref>
<ref id="R19"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Mackeprang</surname><given-names>M.</given-names></name><name><surname>M&#x00FC;ller-Birn</surname><given-names>C.</given-names></name><name><surname>Stauss</surname><given-names>M. T.</given-names></name></person-group><year>2019</year><article-title>Discovering the sweet spot of human&#x2013;computer configurations: A case study in information extraction</article-title><source>Proceedings of the ACM on Human-Computer Interaction</source><volume>3</volume><issue>CSCW</issue><fpage>1</fpage><lpage>30</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3359297">https://doi.org/10.1145/3359297</ext-link></element-citation></ref>
<ref id="R20"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Matias</surname><given-names>J. N.</given-names></name></person-group><year>2019</year><article-title>Preventing harassment and increasing group participation through social norms in 2,190 online science discussions</article-title><source>Proceedings of the National Academy of Sciences</source><volume>116</volume><issue>20</issue><fpage>9785</fpage><lpage>9789</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1073/pnas.1813486116">https://doi.org/10.1073/pnas.1813486116</ext-link></element-citation></ref>
<ref id="R21"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>McDonald</surname><given-names>N.</given-names></name><name><surname>Schoenebeck</surname><given-names>S.</given-names></name><name><surname>Forte</surname><given-names>A.</given-names></name></person-group><year>2019</year><article-title>Reliability and inter-rater reliability in qualitative research: Norms and guidelines for CSCW and HCI practice</article-title><source>Proceedings of the ACM on human-computer interaction</source><volume>3</volume><issue>CSCW</issue><fpage>1</fpage><lpage>23</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3359174">https://doi.org/10.1145/3359174</ext-link></element-citation></ref>
<ref id="R22"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Minh</surname><given-names>D.</given-names></name><name><surname>Wang</surname><given-names>H. X.</given-names></name><name><surname>Li</surname><given-names>Y. F.</given-names></name><name><surname>Nguyen</surname><given-names>T. N.</given-names></name></person-group><year>2022</year><article-title>Explainable artificial intelligence: a comprehensive review</article-title><source>Artificial Intelligence Review</source><fpage>1</fpage><lpage>66</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s10462-021-10088-y">https://doi.org/10.1007/s10462-021-10088-y</ext-link></element-citation></ref>
<ref id="R23"><element-citation publication-type="other"><person-group person-group-type="author"><collab>OpenAI</collab></person-group><year>2022</year><source>Introducing ChatGPT</source><ext-link ext-link-type="uri" xlink:href="https://openai.com/index/chatgpt/">https://openai.com/index/chatgpt/</ext-link></element-citation></ref>
<ref id="R24"><element-citation publication-type="other"><person-group person-group-type="author"><collab>OpenAI</collab></person-group><year>2024</year><comment>a</comment><source>Security &#x0026; privacy</source><ext-link ext-link-type="uri" xlink:href="https://openai.com/security-and-privacy/">https://openai.com/security-and-privacy/</ext-link></element-citation></ref>
<ref id="R25"><element-citation publication-type="other"><person-group person-group-type="author"><collab>OpenAI</collab></person-group><year>2024</year><comment>b</comment><source>What are tokens and how to count them?</source> <ext-link ext-link-type="uri" xlink:href="https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them">https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them</ext-link></element-citation></ref>
<ref id="R26"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Oz</surname><given-names>M.</given-names></name><name><surname>Zheng</surname><given-names>P.</given-names></name><name><surname>Chen</surname><given-names>G. M.</given-names></name></person-group><year>2018</year><article-title>Twitter versus Facebook: Comparing incivility, impoliteness, and deliberative attributes</article-title><source>New Media &#x0026; Society</source><volume>20</volume><issue>9</issue><fpage>3400</fpage><lpage>3419</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1177/1461444817749516">https://doi.org/10.1177/1461444817749516</ext-link></element-citation></ref>
<ref id="R27"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Park</surname><given-names>J.</given-names></name><name><surname>Ellezhuthil</surname><given-names>R. D.</given-names></name><name><surname>Isaac</surname><given-names>J.</given-names></name><name><surname>Mergerson</surname><given-names>C.</given-names></name><name><surname>Feldman</surname><given-names>L.</given-names></name><name><surname>Singh</surname><given-names>V.</given-names></name></person-group><year>2023</year><comment>a</comment><article-title>Misinformation detection algorithms and fairness across political ideologies: The impact of article level labelling</article-title><source>Proceedings of the 15th ACM Web Science Conference 2023</source><fpage>107</fpage><lpage>116</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3578503.3583617">https://doi.org/10.1145/3578503.3583617</ext-link></element-citation></ref>
<ref id="R28"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Park</surname><given-names>J.</given-names></name><name><surname>Gracie</surname><given-names>J.</given-names></name><name><surname>Alsoubai</surname><given-names>A.</given-names></name><name><surname>Stringhini</surname><given-names>G.</given-names></name><name><surname>Singh</surname><given-names>V.</given-names></name><name><surname>Wisniewski</surname><given-names>P.</given-names></name></person-group><year>2023</year><comment>April</comment><article-title>Towards automated detection of risky images shared by youth on social media</article-title><source>Companion Proceedings of the ACM Web Conference 2023</source><fpage>1348</fpage><lpage>1357</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3543873.3587607">https://doi.org/10.1145/3543873.3587607</ext-link></element-citation></ref>
<ref id="R29"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Park</surname><given-names>J.</given-names></name><name><surname>Singh</surname><given-names>V. K.</given-names></name></person-group><year>2022</year><article-title>How Background Images Impact Online Incivility</article-title><source>Proceedings of the ACM on Human-Computer Interaction</source><volume>6</volume><issue>CSCW2</issue><fpage>1</fpage><lpage>23</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3555545">https://doi.org/10.1145/3555545</ext-link></element-citation></ref>
<ref id="R30"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Rains</surname><given-names>S. A.</given-names></name><name><surname>Kenski</surname><given-names>K.</given-names></name><name><surname>Coe</surname><given-names>K.</given-names></name><name><surname>Harwood</surname><given-names>J.</given-names></name></person-group><year>2017</year><article-title>Incivility and political identity on the Internet: Intergroup factors as predictors of incivility in discussions of news online</article-title><source>Journal of Computer-Mediated Communication</source><volume>22</volume><issue>4</issue><fpage>163</fpage><lpage>178</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1111/jcc4.12191">https://doi.org/10.1111/jcc4.12191</ext-link></element-citation></ref>
<ref id="R31"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Rheu</surname><given-names>M.</given-names></name><name><surname>Shin</surname><given-names>J. Y.</given-names></name><name><surname>Peng</surname><given-names>W.</given-names></name><name><surname>Huh-Yoo</surname><given-names>J.</given-names></name></person-group><year>2021</year><article-title>Systematic review: Trust-building factors and implications for conversational agent design</article-title><source>International Journal of Human&#x2013;Computer Interaction</source><volume>37</volume><issue>1</issue><fpage>81</fpage><lpage>96</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1080/10447318.2020.1807710">https://doi.org/10.1080/10447318.2020.1807710</ext-link></element-citation></ref>
<ref id="R32"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Sadeque</surname><given-names>F.</given-names></name><name><surname>Rains</surname><given-names>S.</given-names></name><name><surname>Shmargad</surname><given-names>Y.</given-names></name><name><surname>Kenski</surname><given-names>K.</given-names></name><name><surname>Coe</surname><given-names>K.</given-names></name><name><surname>Bethard</surname><given-names>S.</given-names></name></person-group><year>2019</year><comment>June</comment><article-title>Incivility detection in online comments</article-title><source>Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)</source><fpage>283</fpage><lpage>291</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/S19-1031">https://doi.org/10.18653/v1/S19-1031</ext-link></element-citation></ref>
<ref id="R33"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Singh</surname><given-names>V. K.</given-names></name><name><surname>Ghosh</surname><given-names>S.</given-names></name><name><surname>Jose</surname><given-names>C.</given-names></name></person-group><year>2017</year><comment>May</comment><article-title>Toward multimodal cyberbullying detection</article-title><source>Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems</source><fpage>2090</fpage><lpage>2099</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3027063.3053169">https://doi.org/10.1145/3027063.3053169</ext-link></element-citation></ref>
<ref id="R34"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Song</surname><given-names>F.</given-names></name><name><surname>Yu</surname><given-names>B.</given-names></name><name><surname>Li</surname><given-names>M.</given-names></name><name><surname>Yu</surname><given-names>H.</given-names></name><name><surname>Huang</surname><given-names>F.</given-names></name><name><surname>Li</surname><given-names>Y.</given-names></name><name><surname>Wang</surname><given-names>H.</given-names></name></person-group><year>2024</year><comment>March</comment><article-title>Preference ranking optimization for human alignment</article-title><source>Proceedings of the AAAI Conference on Artificial Intelligence</source><volume>38</volume><issue>17</issue><fpage>18990</fpage><lpage>18998</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1609/aaai.v38i17.29865">https://doi.org/10.1609/aaai.v38i17.29865</ext-link></element-citation></ref>
<ref id="R35"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Tamura</surname><given-names>T.</given-names></name><name><surname>Ito</surname><given-names>H.</given-names></name><name><surname>Oyama</surname><given-names>S.</given-names></name><name><surname>Morishima</surname><given-names>A.</given-names></name></person-group><year>2024</year><comment>April</comment><chapter-title>Influence of AI&#x2019;s Uncertainty in the Dawid-Skene Aggregation for Human-AI Crowdsourcing</chapter-title><source>International Conference on Information</source><fpage>232</fpage><lpage>247</lpage><publisher-loc>Cham</publisher-loc><publisher-name>Springer Nature Switzerland</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-031-57867-0_17">https://doi.org/10.1007/978-3-031-57867-0_17</ext-link></element-citation></ref>
<ref id="R36"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Tang</surname><given-names>Y.</given-names></name><name><surname>Chang</surname><given-names>C. M.</given-names></name><name><surname>Yang</surname><given-names>X.</given-names></name></person-group><year>2024</year><comment>March</comment><article-title>PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs</article-title><source>Proceedings of the 29th International Conference on Intelligent User Interfaces</source><fpage>419</fpage><lpage>430</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3640543.3645174">https://doi.org/10.1145/3640543.3645174</ext-link></element-citation></ref>
<ref id="R37"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Tinsley</surname><given-names>H. E.</given-names></name><name><surname>Weiss</surname><given-names>D. J.</given-names></name></person-group><year>2000</year><chapter-title>Interrater reliability and agreement</chapter-title><source>Handbook of applied multivariate statistics and mathematical modelling</source><fpage>95</fpage><lpage>124</lpage><publisher-name>Academic Press</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/B978-012691360-6/50005-7">https://doi.org/10.1016/B978-012691360-6/50005-7</ext-link></element-citation></ref>
<ref id="R38"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Vilone</surname><given-names>G.</given-names></name><name><surname>Longo</surname><given-names>L.</given-names></name></person-group><year>2021</year><article-title>Notions of explainability and evaluation approaches for explainable artificial intelligence</article-title><source>Information Fusion</source><volume>76</volume><fpage>89</fpage><lpage>106</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.inffus.2021.05.009">https://doi.org/10.1016/j.inffus.2021.05.009</ext-link></element-citation></ref>
<ref id="R39"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Wang</surname><given-names>X.</given-names></name><name><surname>Kim</surname><given-names>H.</given-names></name><name><surname>Rahman</surname><given-names>S.</given-names></name><name><surname>Mitra</surname><given-names>K.</given-names></name><name><surname>Miao</surname><given-names>Z.</given-names></name></person-group><year>2024</year><comment>May</comment><article-title>Human-LLM collaborative annotation through effective verification of LLM labels</article-title><source>Proceedings of the CHI Conference on Human Factors in Computing Systems</source><fpage>1</fpage><lpage>21</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3613904.3641960">https://doi.org/10.1145/3613904.3641960</ext-link></element-citation></ref>
<ref id="R40"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname><given-names>J.</given-names></name><name><surname>Wang</surname><given-names>X.</given-names></name><name><surname>Schuurmans</surname><given-names>D.</given-names></name><name><surname>Bosma</surname><given-names>M.</given-names></name><name><surname>Xia</surname><given-names>F.</given-names></name><name><surname>Chi</surname><given-names>E.</given-names></name><name><surname>Zhou</surname><given-names>D.</given-names></name></person-group><year>2022</year><article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title><source>Advances in Neural Information Processing Systems</source><volume>35</volume><fpage>24824</fpage><lpage>24837</lpage></element-citation></ref>
<ref id="R41"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Zhang</surname><given-names>B.</given-names></name><name><surname>Ding</surname><given-names>D.</given-names></name><name><surname>Jing</surname><given-names>L.</given-names></name><name><surname>Dai</surname><given-names>G.</given-names></name><name><surname>Yin</surname><given-names>N.</given-names></name></person-group><year>2022</year><comment>a</comment><article-title>How would stance detection techniques evolve after the launch of ChatGPT?</article-title><source>arXiv preprint arXiv:2212.14548</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2212.14548">https://doi.org/10.48550/arXiv.2212.14548</ext-link></element-citation></ref>
<ref id="R42"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Zhang</surname><given-names>H.</given-names></name><name><surname>Wu</surname><given-names>C.</given-names></name><name><surname>Xie</surname><given-names>J.</given-names></name><name><surname>Rubino</surname><given-names>F.</given-names></name><name><surname>Graver</surname><given-names>S.</given-names></name><name><surname>Kim</surname><given-names>C.</given-names></name><name><surname>Cai</surname><given-names>J.</given-names></name></person-group><year>2024</year><article-title>When Qualitative Research Meets Large Language Model: Exploring the Potential of QualiGPT as a Tool for Qualitative Coding</article-title><source>arXiv preprint arXiv:2407.14925</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2407.14925">https://doi.org/10.48550/arXiv.2407.14925</ext-link></element-citation></ref>
<ref id="R43"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Zhang</surname><given-names>Z.</given-names></name><name><surname>Zhang</surname><given-names>A.</given-names></name><name><surname>Li</surname><given-names>M.</given-names></name><name><surname>Smola</surname><given-names>A.</given-names></name></person-group><year>2022</year><comment>b</comment><article-title>Automatic chain of thought prompting in large language models</article-title><source>arXiv preprint arXiv:2210.03493</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2210.03493">https://doi.org/10.48550/arXiv.2210.03493</ext-link></element-citation></ref>
<ref id="R44"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Zhou</surname><given-names>Y.</given-names></name><name><surname>Muresanu</surname><given-names>A. I.</given-names></name><name><surname>Han</surname><given-names>Z.</given-names></name><name><surname>Paster</surname><given-names>K.</given-names></name><name><surname>Pitis</surname><given-names>S.</given-names></name><name><surname>Chan</surname><given-names>H.</given-names></name><name><surname>Ba</surname><given-names>J.</given-names></name></person-group><year>2022</year><article-title>Large language models are human-level prompt engineers</article-title><source>arXiv preprint arXiv:2211.01910</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2211.01910">https://doi.org/10.48550/arXiv.2211.01910</ext-link></element-citation></ref>
</ref-list>
<app-group>
<app id="app1">
<title>Appendix</title>
<fig id="F6">
<label>Figure 6.</label>
<caption><p>Conversation log that was added to the two-stage few-shot CoT prompt</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c83-fig6.jpg"><alt-text>none</alt-text></graphic>
</fig>
<fig id="F7">
<label>Figure 7.</label>
<caption><p>Comparison between the responses for the uncivil case generated with zero-shot vs. definition prompting approaches</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c83-fig7.jpg"><alt-text>none</alt-text></graphic>
</fig>
<fig id="F8">
<label>Figure 8.</label>
<caption><p>Comparison among responses for the implicitly uncivil case with different prompting approaches</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c83-fig8.jpg"><alt-text>none</alt-text></graphic>
</fig>
</app>
</app-group>
</back>
</article>