<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">IR</journal-id>
<journal-title-group>
<journal-title>Information Research</journal-title>
</journal-title-group>
<issn pub-type="epub">1368-1613</issn>
<publisher>
<publisher-name>University of Bor&#x00E5;s</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">ir30iConf47581</article-id>
<article-id pub-id-type="doi">10.47989/ir30iConf47581</article-id>
<article-categories>
<subj-group xml:lang="en">
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Taking disagreements into consideration: human annotation variability in privacy policy analysis</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Wang</surname><given-names>Tian</given-names></name>
<xref ref-type="aff" rid="aff0001"/></contrib>
<contrib contrib-type="author"><name><surname>Ma</surname><given-names>Yuanye</given-names></name>
<xref ref-type="aff" rid="aff0002"/></contrib>
<contrib contrib-type="author"><name><surname>Blake</surname><given-names>Catherine</given-names></name>
<xref ref-type="aff" rid="aff0003"/></contrib>
<contrib contrib-type="author"><name><surname>Bashir</surname><given-names>Masooda</given-names></name>
<xref ref-type="aff" rid="aff0004"/></contrib>
<contrib contrib-type="author"><name><surname>Wang</surname><given-names>Ryan</given-names></name>
<xref ref-type="aff" rid="aff0005"/></contrib>
<aff id="aff0001"><bold>Tian Wang</bold> is a Postdoctoral Associate at the CyLab Security and Privacy Institute of Carnegie Mellon University. She received her Ph.D. from the University of Illinois Urbana-Champaign, and her research interests are in mobile and app security and privacy. She can be contacted at <email xlink:href="tianwan2@andrew.cmu.edu">tianwan2@andrew.cmu.edu</email></aff>
<aff id="aff0002"><bold>Yuanye Ma</bold> is a Senior Research Associate at the University of Illinois Discovery Partners Institute. She received her Ph.D. from the University of North Carolina at Chapel Hill, and her research interests are in user-centered privacy, natural language processing, and information ethics. She can be contacted at <email xlink:href="yuanyem@uillinois.edu">yuanyem@uillinois.edu</email></aff>
<aff id="aff0003"><bold>Catherine Blake</bold> is Professor and Associate Dean for Academic Affairs in the School of Information Sciences at the University of Illinois Urbana-Champaign. She received her Ph.D. from the University of California, Irvine. Her research interests are in biomedical informatics, natural language processing, evidence-based discovery, learning health systems, socio-technical systems, data analytics, and literature-based discovery. She can be contacted at <email xlink:href="clblake@illinois.edu">clblake@illinois.edu</email></aff>
<aff id="aff0004"><bold>Masooda Bashir</bold> is an Associate Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. She received her Ph.D. from Purdue University, and her research interests are in the interface of information technology, human psychology, and society, especially how privacy, security, and trust intersect with information systems from a psychological point of view. She can be contacted at <email xlink:href="mnb@illinois.edu">mnb@illinois.edu</email></aff>
<aff id="aff0005"><bold>Ryan Wang</bold> is a PhD student in the School of Information Sciences at the University of Illinois Urbana-Champaign. He is interested in natural language processing, machine learning, and bioinformatics. He can be reached at <email xlink:href="hywang3@illinois.edu">hywang3@illinois.edu</email>.</aff>
</contrib-group>
<pub-date pub-type="epub"><day>06</day><month>05</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>30</volume>
<issue>i</issue>
<fpage>81</fpage>
<lpage>92</lpage>
<permissions>
<copyright-year>2025</copyright-year>
<copyright-holder>&#x00A9; 2025 The Author(s).</copyright-holder>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract xml:lang="en">
<title>Abstract</title>
<p><bold>Introduction.</bold> Privacy policies inform users about data practices but are often complex and difficult to interpret. Human annotation plays a key role in understanding privacy policies, yet annotation disagreements highlight the complexity of these texts. Traditional machine learning models prioritize consensus, overlooking annotation variability and its impact on accuracy.</p>
<p><bold>Method.</bold> This study examines how annotation disagreements affect machine learning performance using the OPP-115 corpus. It compares majority vote and union methods with alternative strategies to assess their impact on policy classification.</p>
<p><bold>Analysis.</bold> The study evaluates whether increasing annotator consensus improves model effectiveness and if disagreement-aware approaches yield more reliable results.</p>
<p><bold>Results.</bold> Higher agreement levels improve model performance across most categories. Complete agreement yields the best F1 scores, especially for First Party Collection/Use and Third Party Sharing/Collection. Annotation disagreements significantly affect classification outcomes, underscoring the need to understand and account for them.</p>
<p><bold>Conclusion.</bold> Ignoring annotation disagreements can misrepresent model accuracy. This study proposes new evaluation strategies that account for annotation variability, offering a more realistic approach to privacy policy analysis. Future work should explore the causes of annotation disagreements to improve machine learning transparency and reliability.</p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Privacy policies are crucial for digital privacy, explaining how businesses handle personal data to help users make informed decisions. However, the FTC and other studies find most are <italic>&#x2018;incomprehensible&#x2019;,</italic> with users rarely reading or understanding them (<xref rid="R8" ref-type="bibr">Braun, 2024</xref>; <xref rid="R13" ref-type="bibr">Hallgren, 2012</xref>; <xref rid="R33" ref-type="bibr">Tang et al., 2021</xref>). Technical language and inconsistent terminology further complicate comprehension (<xref rid="R4" ref-type="bibr">Azhagusundari &#x0026; Thanamani, 2013</xref>), and ambiguities lead to varying interpretations of terms (<xref rid="R25" ref-type="bibr">Pedregosa et al., 2011</xref>). Users&#x2019; understanding also depends on their education and cultural background (<xref rid="R18" ref-type="bibr">Hossin &#x0026; Sulaiman, 2015</xref>; <xref rid="R19" ref-type="bibr">Krippendorff, 2018</xref>).</p>
<p>Researchers have used natural language processing (NLP) to improve the readability of privacy policies (Li et al., 2022; <xref rid="R35" ref-type="bibr">Wilson et al., 2016</xref>). Efforts include summarising lengthy policies (<xref rid="R21" ref-type="bibr">LingPipe Alias-i., 2008</xref>), using topic models to extract data practices (<xref rid="R15" ref-type="bibr">Harkous et al., 2018</xref>), classifying content (<xref rid="R12" ref-type="bibr">Grosman et al., 2020</xref>), identifying specific information (<xref rid="R11" ref-type="bibr">Gray, 2011</xref>; <xref rid="R30" ref-type="bibr">Mysore Sathyendra et al., 2017</xref>) and checking compliance automatically (<xref rid="R22" ref-type="bibr">Liu et al., 2016</xref>). Using machine learning (ML) and NLP to improve privacy policy readability offers benefits, such as processing large amounts of data far faster than human annotators. However, human annotation, often used to create gold-standard datasets, is time-consuming and relies on inter-rater reliability, leaving disagreements unexplored. ML models, which optimise probability-based classification, prioritise data quantity over annotation variety (<xref rid="R27" ref-type="bibr">Plank, 2022</xref>). This overlooks data inconsistencies and diverse interpretations. A study found that even privacy experts rarely agree on policy interpretations, suggesting automated tools may struggle to interpret policies accurately, just as users do (<xref rid="R8" ref-type="bibr">Braun, 2024</xref>).</p>
<p>Human annotation and interpretation of policy and legal texts are often inconsistent and prone to disagreement. For instance, annotators assigned three different labels&#x2014;Other, User Choice/Control, and First Party Collection/Use&#x2014;to the same text, while another sentence received two distinct labels: first and third party, and User Choice/Control. Similarly, texts (4)-(7) were inconsistently labelled as First Party Collection/Use by some annotators but not by others. This variation stems from the deliberate ambiguity and complexity of such texts, reflecting their inherently controversial and evolving nature.</p>
<p><italic><underline>Texts to be annotated/labelled:</underline></italic></p>
<list list-type="order">
<list-item><p>If you do not wish to share your PIN, you always have the option to not provide the information or use the MediaNews Websites that require it.</p></list-item>
<list-item><p>By use of our websites and games that have dynamic in-game advertising, you signify your assent to SCEA&#x2019;s privacy policy.</p></list-item>
<list-item><p>You may register or enhance your profile by linking your Facebook or Google accounts on NYTimes.com.</p></list-item>
<list-item><p>Sharing Your Information with Other Companies</p></list-item>
<list-item><p>You can delete cookies using your browser settings.</p></list-item>
<list-item><p>What Choices Do I Have?</p></list-item>
<list-item><p>You can visit our Web Sites without sharing personally identifiable information.</p></list-item>
</list>
<p><italic><underline>Annotation scheme/label choices:</underline></italic></p>
<p>OPP-115&#x2019;s annotation scheme consists of ten data practice categories:</p>
<list list-type="order">
<list-item><p><italic>First party collection/use</italic>: how and why a service provider collects user information.</p></list-item>
<list-item><p><italic>Third party sharing/collection</italic>: how user information may be shared with or collected by third parties.</p></list-item>
<list-item><p><italic>User choice/control</italic>: choices and control options available to users.</p></list-item>
<list-item><p><italic>User access, edit, &#x0026; deletion</italic>: if and how users may access, edit, or delete their information.</p></list-item>
<list-item><p><italic>Data retention</italic>: how long user information is stored.</p></list-item>
<list-item><p><italic>Data security</italic>: how user information is protected.</p></list-item>
<list-item><p><italic>Policy change</italic>: if and how users will be informed about changes to the privacy policy.</p></list-item>
<list-item><p><italic>Do not track</italic>: if and how Do Not Track signals for online tracking and advertising are honoured.</p></list-item>
<list-item><p><italic>International and specific audiences</italic>: practices that pertain only to a specific group of users (e.g., children, Europeans, or California residents).</p></list-item>
<list-item><p><italic>Other</italic>: additional sub-labels for introductory or general text, contact information, and practices not covered by the other categories.</p></list-item>
</list>
<p>We argue that revealing variations in understanding privacy policies is an underexplored area of research. This gap partly arises from the bias in using machine learning as the primary method and from the tendency to treat privacy as a uniform concept, overlooking cultural, educational, and gender differences. To address this, we conduct an empirical study using the OPP-115 corpus, which includes annotations from three annotators. We expand the characterization from three (individual, pairwise, and complete agreement) to seven: three individual, three pairwise, and one gold standard with full agreement. Unlike majority voting, pairwise agreement requires two specific annotators to agree. Complete agreement, while yielding fewer instances in the target class, may offer higher-quality annotations. Our study provides tangible measures for understanding how annotation variations impact machine learning performance. Rather than simply optimising metrics, we aim to reflect human disagreements through realistic reporting. We highlight two approaches to constructing gold standards&#x2014;one ignoring disagreement and one considering it&#x2014;without advocating for either, acknowledging that practical constraints often dictate these choices. We emphasise that disagreements in interpreting privacy statements are common, even among experts, and recognizing these differences is essential for user-centred privacy research.</p>
</sec>
<sec id="sec2">
<title>Related work</title>
<p>High-quality labelled data are essential for supervised machine learning. Some argue that multiple annotators reduce human bias in evaluation (<xref rid="R3" ref-type="bibr">Artstein, 2017</xref>), but the number of annotators needed for high-quality annotation is unclear and often limited by budget (<xref rid="R30" ref-type="bibr">Mysore Sathyendra et al., 2017</xref>). Inter-rater reliability (IRR), or annotator agreement, is commonly used to measure annotation quality (<xref rid="R23" ref-type="bibr">Moallem, 2018</xref>). Typically, studies use either text annotated by any rater (union) or by the majority (majority vote), ignoring genuine annotator differences.</p>
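<p>To make the two typical strategies concrete, the following minimal sketch contrasts them on invented labels, where each sentence carries one binary label per annotator for a single category (all names and values here are illustrative, not drawn from OPP-115).</p>

```python
# Illustrative sketch: building "union" and "majority vote" gold standards
# from three hypothetical annotators' binary labels for one category.
# All sentence IDs and labels are invented for illustration.

annotations = {
    "sent_1": [1, 1, 0],  # labels from annotators A, B, C
    "sent_2": [1, 0, 0],
    "sent_3": [1, 1, 1],
}

# union: positive if any annotator applied the label
union = {s: int(any(labels)) for s, labels in annotations.items()}
# majority vote: positive if at least two of three annotators applied it
majority = {s: int(sum(labels) >= 2) for s, labels in annotations.items()}

print(union)     # sent_2 counts as positive: one annotator suffices
print(majority)  # sent_2 drops out: two annotators must agree
```

The difference on `sent_2` is precisely the disagreement that both strategies discard from the final corpus.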
<p>The assumption that annotators are interchangeable is also flawed, as differences exist between annotator populations and individuals (<xref rid="R17" ref-type="bibr">Hershcovich et al., 2022</xref>). Agreement studies show that variation can arise from heterogeneous data, complex labels, and annotator differences. Global agreement coefficients may mask these variations, while more detailed studies provide insights (<xref rid="R32" ref-type="bibr">Stevens et al., 2020</xref>).</p>
<p>A review of legal machine learning datasets found that disagreement during annotation is typically removed, with final corpora containing only <italic>&#x2018;gold standard&#x2019;</italic> annotations. Common strategies like majority vote, forced agreement, expert review, or arbitration do not account for disagreements (<xref rid="R28" ref-type="bibr">Prabhakaran et al., 2021</xref>). Issues of interpretation and non-transparency in reporting machine learning results, especially with legal documents, have also been flagged (<xref rid="R27" ref-type="bibr">Plank, 2022</xref>). Accurate predictions can build trust, but reproducibility depends on dataset validity and the discovery process (<xref rid="R16" ref-type="bibr">Herbert et al., 2023</xref>). Thus, documenting data preparation processes and their impact on model performance is crucial for trust and transparency in machine learning results (<xref rid="R5" ref-type="bibr">Bai et al., 2021</xref>).</p>
</sec>
<sec id="sec3">
<title>Method</title>
<p>Towards this goal of using machine learning to provide privacy policy statement analysis that documents and considers annotation disagreements, our study provides ways to demonstrate how to directly measure the impact of human (dis)agreement on machine learning model performance, by responding to the following research questions:</p>
<speech><speaker>RQ1:</speaker><p>to what extent does reaching consensus amongst annotators impact the classification performance of traditional machine learning and deep learning models, respectively?</p></speech>
<speech><speaker>RQ2:</speaker><p>how do some alternative strategies used to create gold standards compare with the typical union and majority vote strategy?</p></speech>
<speech><speaker>RQ3:</speaker><p>what metrics can be used to better manage the trade-off between more annotated texts and more annotations for the same text?</p></speech>
<p>To address these research questions, we conducted experiments using the OPP-115 corpus (<xref rid="R2" ref-type="bibr">Anaraky et al., 2019</xref>) with annotated text prepared through majority vote, union methods, and alternative strategies (individual, pairwise, and complete agreement). We tested how these strategies impact model performance using two traditional supervised learning algorithms&#x2014;support vector machines (SVM) and Na&#x00EF;ve Bayes (NB)&#x2014;and two deep learning models, bidirectional long short-term memory (BiLSTM) and bidirectional encoder representations from transformers (BERT).</p>
<sec id="sec3_1">
<title>Dataset</title>
<p>The OPP-115 Corpus comprises 13,209 sentences, and the number of sentences in each target class is highly imbalanced (as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>). For example, there are only 6, 39, and 63 sentences with complete agreement in the Do Not Track, Data Retention, and Policy Change categories respectively. The OPP-115 corpus was prepared by ten independent annotators who coded privacy policy text segments using several predefined categories. Each privacy segment was annotated by three independent annotators. We arranged the unique annotator IDs in each statement in ascending order and replaced them with annotators A, B, and C respectively. Under our proposed alternative strategies, gold standards were built from individual annotators (A, B, or C), from pairs of annotators that reached consensus (A and B, A and C, or B and C), and from complete agreement (A, B, and C all agree on the annotations). We also replicated the strategies used in the original work, where a text is deemed relevant when at least two of the three annotators agree (majority vote) or when any of the three annotators applies the label (union).</p>
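<p>The nine gold-standard constructions described above (three individual, three pairwise, one complete agreement, plus the replicated majority vote and union strategies) can be sketched as follows. The binary labels are invented for illustration, and pairwise and complete agreement are interpreted here as every annotator in the group applying the label.</p>

```python
# Hedged sketch of the alternative and typical gold standards for one
# annotation category. rows holds (A, B, C) binary labels per sentence;
# the values are invented for illustration.
rows = [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 0, 1)]

def gold(rule):
    """Apply a labelling rule to every sentence's (A, B, C) labels."""
    return [int(rule(a, b, c)) for a, b, c in rows]

standards = {
    # three individual gold standards
    "A": gold(lambda a, b, c: a),
    "B": gold(lambda a, b, c: b),
    "C": gold(lambda a, b, c: c),
    # three pairwise standards: both members of the pair applied the label
    "A&B": gold(lambda a, b, c: a and b),
    "A&C": gold(lambda a, b, c: a and c),
    "B&C": gold(lambda a, b, c: b and c),
    # complete agreement among all three annotators
    "A&B&C": gold(lambda a, b, c: a and b and c),
    # the two typical strategies, replicated for comparison
    "majority": gold(lambda a, b, c: a + b + c >= 2),
    "union": gold(lambda a, b, c: a + b + c >= 1),
}
```

As the level of required consensus rises from union to majority to complete agreement, the number of positive instances shrinks, which is the trade-off the experiments below quantify.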
<fig id="F1">
<label>Figure 1.</label>
<caption><p>Number of sentences in each gold standard</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c8-fig1.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>In the original OPP-115 data, annotations were applied to text segments at the paragraph level (<xref rid="R2" ref-type="bibr">Anaraky et al., 2019</xref>). In contrast, the unit of analysis in our experiments is a sentence, so both the original text and the annotations were converted into sentences using version 4.1.2 of LingPipe (<xref rid="R1" ref-type="bibr">Alabduljabbar et al., 2021</xref>), and the index position of each sentence was maintained and subsequently aligned with the index position of the manual annotations. Sentences were pre-processed using the NLTK Python package (<xref rid="R28" ref-type="bibr">Prabhakaran et al., 2021</xref>) by converting words to lowercase and removing punctuation and stop words. Terms appearing infrequently (in fewer than 5 sentences) or very frequently (in more than 95% of sentences) were removed because their presence would contribute little to classification performance (Amos et al., 2021). Annotation categories are not mutually exclusive; a sentence can be annotated as belonging to multiple categories. Lastly, we framed the problem as a binary text classification task for each of the annotation categories.</p>
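<p>A minimal sketch of this preprocessing pipeline follows. The study used the NLTK package; to keep the snippet self-contained, a tiny hand-picked stop-word list stands in for NLTK's, while the frequency thresholds (5 sentences, 95%) follow the text above.</p>

```python
import string
from collections import Counter

# Stand-in stop-word list; the study used NLTK's English stop words.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "we", "your"}

def tokenize(sentence):
    # lowercase, strip punctuation, drop stop words
    table = str.maketrans("", "", string.punctuation)
    return [w for w in sentence.lower().translate(table).split()
            if w not in STOP_WORDS]

def filter_terms(sentences, min_df=5, max_df_ratio=0.95):
    """Keep terms appearing in at least min_df sentences and in at most
    max_df_ratio of all sentences, mirroring the thresholds above."""
    tokenized = [set(tokenize(s)) for s in sentences]
    doc_freq = Counter(t for toks in tokenized for t in toks)
    n = len(sentences)
    return {t for t, c in doc_freq.items() if max_df_ratio * n >= c >= min_df}
```

For example, `tokenize("We collect your data.")` yields `["collect", "data"]`, and `filter_terms` then prunes the resulting vocabulary by document frequency before the sentence-term matrix is built.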
</sec>
<sec id="sec3_2">
<title>Text classification</title>
<p>The classification experiments used two algorithm families: traditional models (SVM, NB) and deep learning models (BiLSTM, BERT). Ten-fold cross-validation evaluated each model, splitting the dataset into ten equal parts, with nine used for training and one for testing. The data was stratified to maintain a balanced proportion of positive and negative labels in each fold. Model performance was measured using standard metrics: precision, recall, F1, and accuracy (<xref rid="R14" ref-type="bibr">Hamdani et al., 2021</xref>). For traditional models, feature selection was crucial. We used version 1.0.2 of Scikit-learn (<xref rid="R10" ref-type="bibr">Gordon et al., 2022</xref>) and entropy-based selection, calculating information gain to choose the top 2,000 features (Amos et al., 2021; <xref rid="R25" ref-type="bibr">Pedregosa et al., 2011</xref>). These features were then used to construct the sentence-term matrix for the test set. TF-IDF was considered but not used because it does not account for the target class distribution (<xref rid="R31" ref-type="bibr">Srinath et al., 2021</xref>).</p>
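<p>The entropy-based information gain used for feature selection can be illustrated as follows; this is a generic sketch of the standard information gain formula for binary term features, not the study's exact implementation.</p>

```python
import math

def entropy(pos, neg):
    """Shannon entropy (in bits) of a binary class distribution."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(labels, term_present):
    """Information gain of a binary term feature with respect to
    binary class labels: class entropy minus conditional entropy."""
    n = len(labels)
    base = entropy(sum(labels), n - sum(labels))
    with_term = [y for y, t in zip(labels, term_present) if t]
    without = [y for y, t in zip(labels, term_present) if not t]
    cond = 0.0
    for subset in (with_term, without):
        if subset:
            cond += (len(subset) / n) * entropy(
                sum(subset), len(subset) - sum(subset))
    return base - cond

# A perfectly predictive term has gain equal to the class entropy:
print(information_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

Ranking the vocabulary by this score and keeping the top 2,000 terms gives the feature set described above; unlike TF-IDF, the score depends directly on the target class distribution.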
</sec>
</sec>
<sec id="sec4">
<title>Results</title>
<p>We report our findings regarding how much annotator agreement impacts the performance of automated approaches.</p>
<sec id="sec4_1">
<title>Alternative methods: independent, pairwise, complete agreement</title>
<p>We found that increasing the level of agreement from independent to pairwise to complete improved F1, accuracy, precision, and recall across nearly all categories and classifiers (<xref ref-type="fig" rid="F2">Figure 2</xref>). Complete agreement yielded the best F1 scores for several categories, including First Party Collection/Use and Third Party Sharing/Collection. For Third Party Sharing/Collection, precision improved by 5% and recall by 4%. First Party Collection/Use saw a 7% improvement in precision and 8% in recall. The <italic>&#x2018;Other&#x2019;</italic> category showed significant gains, with F1 improving from 0.82 to 0.91.</p>
<p>In some categories, like User Choice/Control and International and Specific Audiences, pairwise agreement performed as well as complete agreement, with recall and accuracy reaching 0.97. Categories with fewer examples, such as Data Retention, had high metrics but raised concerns about generalizability.</p>
<p>Certain classifiers handled data inconsistencies better. For Third Party Sharing/Collection, precision improved from 0.85 (independent) to 0.91 (complete agreement) across all classifiers. The difference in model performance was minor (within 0.02), but the impact of different gold standards was more pronounced, ranging from 0.03 to 0.06 (<xref ref-type="fig" rid="F2">Figures 2</xref> and <xref ref-type="fig" rid="F3">3</xref>). Third Party Sharing/Collection also showed notable improvements in F1, precision, recall, and accuracy, with precision and F1 for First Party Collection/Use improving by at least 0.07.</p>
<fig id="F2">
<label>Figure 2.</label>
<caption><p>Average model performance for increasing levels of consensus (independent, pairwise, and complete agreement) and increasing levels of disagreement (majority vote, union) gold standards *(Ac=accuracy, F1, Pr=precision, and Re=recall)</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c8-fig2.jpg"><alt-text>none</alt-text></graphic>
</fig>
<fig id="F3">
<label>Figure 3.</label>
<caption><p>Average model performance for complete and union gold standards</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c8-fig3.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
<sec id="sec4_2">
<title>Typical gold standard methods: union and complete</title>
<p>In general, we also found that increasing the level of agreement improves model performance: the majority vote results outperform the union across all metrics, all classifiers, and all categories. <xref ref-type="fig" rid="F4">Figure 4</xref> shows the original Fleiss&#x2019; Kappa statistic against the F1 score for the complete and union gold standards. The Kappa values in the OPP-115 collection range from moderate to very good, so it is possible that larger variations might show a correlation with F1, but these results suggest that: (a) Kappa is not a good substitute for F1 scores produced using different gold standards; and (b) the difference between the standard metrics produced from the complete and union experiments might provide a more realistic way to convey the impact of disagreement.</p>
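<p>For reference, Fleiss' Kappa for a binary category with three raters per sentence, as in OPP-115, can be computed as in the following self-contained sketch; the rating counts below are invented for illustration.</p>

```python
def fleiss_kappa(tables):
    """Fleiss' Kappa. tables holds per-item category counts, e.g.
    [2, 1] means 2 raters chose 'positive' and 1 chose 'negative'.
    Assumes the same number of raters per item and imperfect chance
    agreement (so the denominator is non-zero)."""
    n_items = len(tables)
    n_raters = sum(tables[0])
    n_cats = len(tables[0])
    # mean per-item observed agreement
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in tables
    ) / n_items
    # chance agreement from marginal category proportions
    p_e = sum(
        (sum(row[j] for row in tables) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)

# three sentences, three raters each: unanimous, unanimous, split 2-1
print(round(fleiss_kappa([[3, 0], [0, 3], [2, 1]]), 3))  # 0.55
```

A single corpus-level Kappa of this kind summarises agreement, but, as the results above show, it does not predict the F1 gap between the complete and union gold standards.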
<fig id="F4">
<label>Figure 4.</label>
<caption><p>Fleiss Kappa versus F1-Score for complete and union gold standards</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c8-fig4.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
</sec>
<sec id="sec5">
<title>Discussion</title>
<p>Common methods for reporting machine learning results often focus on the model and its tuning parameters, neglecting variations in human-labelled training data. In privacy documentation, diverse terminology and interpretations make it essential to consider human discrepancies in machine learning analysis. Traditional metrics don&#x2019;t always reflect the Fleiss Kappa statistic or the impact of human consensus. Based on our research, we recommend:</p>
<list list-type="order">
<list-item><p><bold>Use multiple annotators:</bold> for at least a subset of the corpus, multiple annotators should be employed to capture human variability. Reducing the number of categories per annotator could enhance annotation quality and efficiency.</p></list-item>
<list-item><p><bold>Iterative annotation:</bold> apply an iterative approach, directing annotators to categories with less agreement. This can optimise human effort and maintain annotation quality, using complete and union gold standards as metrics.</p></list-item>
<list-item><p><bold>Differentiate gold standards:</bold> use the difference between complete and union gold standards to assess how well machine learning results align with human judgments. This can provide insight into the model&#x2019;s accuracy in mirroring human interpretations.</p></list-item>
<list-item><p><bold>Avoid aggregation:</bold> report results separately for each category rather than aggregating them. Specific categories may have varying expectations, and detailed performance information is crucial for understanding model effectiveness in real-world applications.</p></list-item>
</list>
</sec>
<sec id="sec6">
<title>Conclusion</title>
<p>Privacy policy statements are vital for regulatory compliance and user data decisions, yet many are unreadable and often ignored. Machine learning could help by automating information extraction, but current reporting practices that don&#x2019;t align with human judgement undermine trust. Unlike previous methods that use Fleiss&#x2019; Kappa to measure disagreement, we propose an approach that uses independent, pairwise, and complete agreement in gold standards. We acknowledge that our study used only the OPP-115 corpus, which may limit its generalizability. Given this limitation, our preliminary results show that higher agreement improves precision, recall, F1, and accuracy, while more disagreement reduces these metrics.</p>
<p>Disagreements in privacy statement interpretation are more complex than fact-based tasks, and inter-rater reliability alone may not suffice to measure model performance. Traditional metrics like Cohen&#x2019;s or Fleiss Kappa are inadequate for skewed data. We suggest using precision, recall, F1, and accuracy to evaluate how different gold standards affect performance, which is crucial given the evolving nature of privacy content.</p>
<p>With new collections of privacy statements surpassing a million entries (<xref rid="R6" ref-type="bibr">Bannihatti Kumar et al., 2020</xref>; <xref rid="R34" ref-type="bibr">Thorleiksd&#x00F3;ttir et al., 2022</xref>), investing in annotation adjudication is urgent. Quality assurance involves decisions such as storing multiple annotators&#x2019; data and measuring their agreement (<xref rid="R24" ref-type="bibr">Mousavi Nejad et al., 2020</xref>). Our study highlights the need for multiple annotators on subsets of texts to assess the impact of human judgement on metrics. This may conflict with current practices aimed at maximising annotated data but is essential for realistic metric representation. An iterative approach can help allocate resources effectively, and text classification results should be reported by category rather than aggregated.</p>
<p>Our study is the first to explicitly raise the question of disagreements in annotating privacy policy documents and to highlight the value and significance of studying such disagreements. How exactly disagreement in understanding and/or interpreting privacy policies can be leveraged remains unexplored, and future research needs to establish where and why disagreements occur in privacy policy interpretation, as disagreements can arise for entirely different reasons and hence require different treatments or solutions. For example, disagreement may originate from a lack of knowledge, linguistic ambiguity, or underlying differences in preference, each of which would require a completely different solution.</p>
<p>When marking up raw text, annotators need the flexibility to decide the appropriate text boundaries that capture the target category. Before the initial annotations can be used to construct a classifier, the unit of analysis, such as a paragraph, sentence (used in this analysis) or some other predefined <italic>&#x2018;span&#x2019;</italic>, must be established. This choice impacts the predictive performance of any model constructed; more work is needed to establish what span is optimal for a given task and to quantify the impact of this decision. We have introduced a new performance metric &#x2013; the difference between complete and union gold standards &#x2013; that directly measures the impact of human agreement using the same metrics that are commonly used to evaluate an automated system. However, situated empirical user studies are needed to establish whether this new metric is successful in making machine-learning models more transparent.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>We would like to thank all the reviewers for the feedback on this paper. There is no funding for this research to report.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="R1"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Alabduljabbar</surname><given-names>A.</given-names></name><name><surname>Abusnaina</surname><given-names>A.</given-names></name><name><surname>Meteriz-Yildiran</surname><given-names>&#x00DC;.</given-names></name><name><surname>Mohaisen</surname><given-names>D.</given-names></name></person-group> <year>(2021)</year> <article-title>TLDR: Deep Learning-Based Automated Privacy Policy Annotation with Key Policy Highlights</article-title><source>Proceedings of the 20th Workshop on Workshop on Privacy in the Electronic Society</source><fpage>103</fpage><lpage>118</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3463676.3485608">https://doi.org/10.1145/3463676.3485608</ext-link></element-citation></ref>
<ref id="R2"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Anaraky</surname><given-names>R. G.</given-names></name><name><surname>Cherry</surname><given-names>D.</given-names></name><name><surname>Jarrell</surname><given-names>M.</given-names></name><name><surname>Knijnenburg</surname><given-names>B.</given-names></name></person-group> <year>(2019)</year> <article-title>Testing a comic-based privacy policy</article-title><source>The 15th Symp. on Usable Privacy and Security</source></element-citation></ref>
<ref id="R3"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Artstein</surname><given-names>R.</given-names></name></person-group> <year>(2017)</year> <article-title>Inter-annotator agreement</article-title><source>Handbook of linguistic annotation</source><fpage>297</fpage><lpage>313</lpage></element-citation></ref>
<ref id="R4"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Azhagusundari</surname><given-names>B.</given-names></name><name><surname>Thanamani</surname><given-names>A. S.</given-names></name></person-group> <year>(2013)</year> <article-title>Feature selection based on information gain</article-title><source>International Journal of Innovative Technology and Exploring Engineering (IJITEE)</source><volume>2</volume><issue>2</issue><fpage>18</fpage><lpage>21</lpage></element-citation></ref>
<ref id="R5"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Bai</surname><given-names>F.</given-names></name><name><surname>Ritter</surname><given-names>A.</given-names></name><name><surname>Xu</surname><given-names>W.</given-names></name></person-group> <year>(2021)</year> <article-title>Pre-train or annotate? domain adaptation with a constrained budget</article-title><source>arXiv preprint arXiv:2109.04711</source></element-citation></ref>
<ref id="R6"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Bannihatti Kumar</surname><given-names>V.</given-names></name><name><surname>Iyengar</surname><given-names>R.</given-names></name><name><surname>Nisal</surname><given-names>N.</given-names></name><name><surname>Feng</surname><given-names>Y.</given-names></name><name><surname>Habib</surname><given-names>H.</given-names></name><name><surname>Story</surname><given-names>P.</given-names></name><name><surname>Cherivirala</surname><given-names>S.</given-names></name><name><surname>Hagan</surname><given-names>M.</given-names></name><name><surname>Cranor</surname><given-names>L.</given-names></name><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2020)</year> <article-title>Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text</article-title><source>Proceedings of The Web Conference 2020</source><fpage>1943</fpage><lpage>1954</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3366423.3380262">https://doi.org/10.1145/3366423.3380262</ext-link></element-citation></ref>
<ref id="R7"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Bird</surname><given-names>S.</given-names></name><name><surname>Klein</surname><given-names>E.</given-names></name><name><surname>Loper</surname><given-names>E.</given-names></name></person-group> <year>(2009)</year> <source>Natural language processing with Python: analysing text with the natural language toolkit</source><comment>" O&#x2019;Reilly Media, Inc."</comment></element-citation></ref>
<ref id="R8"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Braun</surname><given-names>D.</given-names></name></person-group> <year>(2024)</year> <article-title>I beg to differ: How disagreement is handled in the annotation of legal machine learning data sets</article-title><source>Artificial Intelligence and Law</source><volume>32</volume><issue>3</issue><fpage>839</fpage><lpage>862</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s10506-023-09369-4">https://doi.org/10.1007/s10506-023-09369-4</ext-link></element-citation></ref>
<ref id="R9"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>R.</given-names></name><name><surname>Fang</surname><given-names>F.</given-names></name><name><surname>Norton</surname><given-names>T.</given-names></name><name><surname>McDonald</surname><given-names>A. M.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2021)</year> <article-title>Fighting the Fog: Evaluating the Clarity of Privacy Disclosures in the Age of CCPA</article-title><source>Proceedings of the 20th Workshop on Workshop on Privacy in the Electronic Society</source><fpage>73</fpage><lpage>102</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3463676.3485601">https://doi.org/10.1145/3463676.3485601</ext-link></element-citation></ref>
<ref id="R10"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Gordon</surname><given-names>M. L.</given-names></name><name><surname>Lam</surname><given-names>M. S.</given-names></name><name><surname>Park</surname><given-names>J. S.</given-names></name><name><surname>Patel</surname><given-names>K.</given-names></name><name><surname>Hancock</surname><given-names>J.</given-names></name><name><surname>Hashimoto</surname><given-names>T.</given-names></name><name><surname>Bernstein</surname><given-names>M. S.</given-names></name></person-group> <year>(2022)</year> <comment>April</comment><article-title>Jury learning: Integrating dissenting voices into machine learning models</article-title><source>Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems</source><fpage>1</fpage><lpage>19</lpage></element-citation></ref>
<ref id="R11"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Gray</surname><given-names>R. M.</given-names></name></person-group> <year>(2011)</year> <source>Entropy and information theory</source><publisher-name>Springer Science &#x0026; Business Media</publisher-name></element-citation></ref>
<ref id="R12"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Grosman</surname><given-names>J. S.</given-names></name><name><surname>Furtado</surname><given-names>P. H. T.</given-names></name><name><surname>Rodrigues</surname><given-names>A. M. B.</given-names></name><name><surname>Schardong</surname><given-names>G. G.</given-names></name><name><surname>Barbosa</surname><given-names>S. D. J.</given-names></name><name><surname>Lopes</surname><given-names>H. C. V.</given-names></name></person-group> <year>(2020)</year> <article-title>Eras: Improving the quality control in the annotation process for Natural Language Processing tasks</article-title><source>Information Systems</source><volume>93</volume><fpage>101553</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.is.2020.101553">https://doi.org/10.1016/j.is.2020.101553</ext-link></element-citation></ref>
<ref id="R13"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hallgren</surname><given-names>K. A.</given-names></name></person-group> <year>(2012)</year> <article-title>Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial</article-title><source>Tutorials in Quantitative Methods for Psychology</source><volume>8</volume><issue>1</issue><fpage>23</fpage><lpage>34</lpage></element-citation></ref>
<ref id="R14"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Hamdani</surname><given-names>R. E.</given-names></name><name><surname>Mustapha</surname><given-names>M.</given-names></name><name><surname>Amariles</surname><given-names>D. R.</given-names></name><name><surname>Troussel</surname><given-names>A.</given-names></name><name><surname>Mee&#x00F9;s</surname><given-names>S.</given-names></name><name><surname>Krasnashchok</surname><given-names>K.</given-names></name></person-group> <year>(2021)</year> <article-title>A combined rule-based and machine learning approach for automated GDPR compliance checking</article-title><source>Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law</source><fpage>40</fpage><lpage>49</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3462757.3466081">https://doi.org/10.1145/3462757.3466081</ext-link></element-citation></ref>
<ref id="R15"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Harkous</surname><given-names>H.</given-names></name><name><surname>Fawaz</surname><given-names>K.</given-names></name><name><surname>Lebret</surname><given-names>R.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Shin</surname><given-names>K. G.</given-names></name><name><surname>Aberer</surname><given-names>K.</given-names></name></person-group> <year>(2018)</year> <article-title>Polisis: Automated analysis and presentation of privacy policies using deep learning</article-title><source>Proceedings of the 27th USENIX Conference on Security Symposium</source><fpage>531</fpage><lpage>548</lpage></element-citation></ref>
<ref id="R16"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Herbert</surname><given-names>F.</given-names></name><name><surname>Becker</surname><given-names>S.</given-names></name><name><surname>Schaewitz</surname><given-names>L.</given-names></name><name><surname>Hielscher</surname><given-names>J.</given-names></name><name><surname>Kowalewski</surname><given-names>M.</given-names></name><name><surname>Sasse</surname><given-names>A.</given-names></name><name><surname>Acar</surname><given-names>Y.</given-names></name><name><surname>D&#x00FC;rmuth</surname><given-names>M.</given-names></name></person-group> <year>(2023)</year> <article-title>A World Full of Privacy and Security (Mis)conceptions? Findings of a Representative Survey in 12 Countries</article-title><source>Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems</source><fpage>1</fpage><lpage>23</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3544548.3581410">https://doi.org/10.1145/3544548.3581410</ext-link></element-citation></ref>
<ref id="R17"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Hershcovich</surname><given-names>D.</given-names></name><name><surname>Frank</surname><given-names>S.</given-names></name><name><surname>Lent</surname><given-names>H.</given-names></name><name><surname>de Lhoneux</surname><given-names>M.</given-names></name><name><surname>Abdou</surname><given-names>M.</given-names></name><name><surname>Brandl</surname><given-names>S.</given-names></name><name><surname>Bugliarello</surname><given-names>E.</given-names></name><name><surname>Cabello Piqueras</surname><given-names>L.</given-names></name><name><surname>Chalkidis</surname><given-names>I.</given-names></name><name><surname>Cui</surname><given-names>R.</given-names></name><name><surname>Fierro</surname><given-names>C.</given-names></name><name><surname>Margatina</surname><given-names>K.</given-names></name><name><surname>Rust</surname><given-names>P.</given-names></name><name><surname>S&#x00F8;gaard</surname><given-names>A.</given-names></name></person-group> <year>(2022)</year> <article-title>Challenges and Strategies in Cross-Cultural NLP</article-title><person-group person-group-type="editor"><name><surname>Muresan</surname><given-names>S.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Nakov</surname><given-names>P.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Villavicencio</surname><given-names>A.</given-names></name></person-group><source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source><fpage>6997</fpage><lpage>7013</lpage><comment>Association for Computational Linguistics</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/2022.acl-long.482">https://doi.org/10.18653/v1/2022.acl-long.482</ext-link></element-citation></ref>
<ref id="R18"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hossin</surname><given-names>M.</given-names></name><name><surname>Sulaiman</surname><given-names>M. N.</given-names></name></person-group> <year>(2015)</year> <article-title>A review on evaluation metrics for data classification evaluations</article-title><source>International journal of data mining &#x0026; knowledge management process</source><volume>5</volume><issue>2</issue><fpage>1</fpage></element-citation></ref>
<ref id="R19"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Krippendorff</surname><given-names>K.</given-names></name></person-group> <year>(2018)</year> <source>Content analysis: An introduction to its methodology</source><publisher-name>Sage publications</publisher-name></element-citation></ref>
<ref id="R20"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>J.</given-names></name><name><surname>Liu</surname><given-names>L.</given-names></name><name><surname>Le</surname><given-names>T. D.</given-names></name><name><surname>Liu</surname><given-names>J.</given-names></name></person-group> <year>(2020)</year> <article-title>Accurate data-driven prediction does not mean high reproducibility</article-title><source>Nature Machine Intelligence</source><volume>2</volume><issue>1</issue><fpage>13</fpage><lpage>15</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s42256-019-0140-2">https://doi.org/10.1038/s42256-019-0140-2</ext-link></element-citation></ref>
<ref id="R21"><element-citation publication-type="other"><person-group person-group-type="author"><collab>LingPipe Alias-i</collab></person-group> <year>(2008)</year> <comment>4.1. 0. URL</comment><ext-link ext-link-type="uri" xlink:href="http://alias-i.com/lingpipe">http://alias-i.com/lingpipe</ext-link><comment>(2008)</comment></element-citation></ref>
<ref id="R22"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>F.</given-names></name><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2016)</year> <article-title>Analyzing vocabulary intersections of expert annotations and topic models for data practices in privacy policies</article-title><source>2016 AAAI Fall Symposium Series</source></element-citation></ref>
<ref id="R23"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Moallem</surname><given-names>A.</given-names></name></person-group> <year>(2018)</year> <chapter-title>Do You Really Trust &#x201C;Privacy Policy&#x201D; or &#x201C;Terms of Use&#x201D; Agreements Without Reading Them?</chapter-title><person-group person-group-type="editor"><name><surname>Nicholson</surname><given-names>D.</given-names></name></person-group><source>Advances in Human Factors in Cybersecurity</source><fpage>290</fpage><lpage>295</lpage><publisher-name>Springer International Publishing</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-60585-2_27">https://doi.org/10.1007/978-3-319-60585-2_27</ext-link></element-citation></ref>
<ref id="R24"><element-citation publication-type="book"><person-group person-group-type="author">Mousavi <name><surname>Nejad</surname><given-names>N.</given-names></name><name><surname>Jabat</surname><given-names>P.</given-names></name><name><surname>Nedelchev</surname><given-names>R.</given-names></name><name><surname>Scerri</surname><given-names>S.</given-names></name><name><surname>Graux</surname><given-names>D.</given-names></name></person-group> <year>(2020)</year> <article-title>Establishing a Strong Baseline for Privacy Policy Classification</article-title><person-group person-group-type="editor"><name><surname>H&#x00F6;lbl</surname><given-names>M.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Rannenberg</surname><given-names>K.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Welzer</surname><given-names>T.</given-names></name></person-group><source>ICT Systems Security and Privacy Protection</source><fpage>370</fpage><lpage>383</lpage><publisher-name>Springer International Publishing</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-030-58201-2_25">https://doi.org/10.1007/978-3-030-58201-2_25</ext-link></element-citation></ref>
<ref id="R25"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pedregosa</surname><given-names>F.</given-names></name><name><surname>Varoquaux</surname><given-names>G.</given-names></name><name><surname>Gramfort</surname><given-names>A.</given-names></name><name><surname>Michel</surname><given-names>V.</given-names></name><name><surname>Thirion</surname><given-names>B.</given-names></name><name><surname>Grisel</surname><given-names>O.</given-names></name><name><surname>Duchesnay</surname><given-names>E.</given-names></name></person-group> <year>(2011)</year> <article-title>Scikit-learn: Machine learning in python journal of machine learning research</article-title><source>Journal of machine learning research</source><volume>12</volume><fpage>2825</fpage><lpage>2830</lpage></element-citation></ref>
<ref id="R26"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pepperberg</surname><given-names>I. M.</given-names></name></person-group> <year>(1988)</year> <article-title>An interactive modeling technique for acquisition of communication skills: Separation of &#x201C;labeling&#x201D; and &#x201C;requesting&#x201D; in a psittacine subject</article-title><source>Applied Psycholinguistics</source><volume>9</volume><issue>1</issue><fpage>59</fpage><lpage>76</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1017/S014271640000045X">https://doi.org/10.1017/S014271640000045X</ext-link></element-citation></ref>
<ref id="R27"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Plank</surname><given-names>B.</given-names></name></person-group> <year>(2022)</year> <source>The &#x201C;Problem&#x201D; of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation</source><comment>(arXiv:2211.02570). arXiv</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2211.02570">https://doi.org/10.48550/arXiv.2211.02570</ext-link></element-citation></ref>
<ref id="R28"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Prabhakaran</surname><given-names>V.</given-names></name><name><surname>Davani</surname><given-names>A. M.</given-names></name><name><surname>D&#x00ED;az</surname><given-names>M.</given-names></name></person-group> <year>(2021)</year> <source>On Releasing Annotator-Level Labels and Information in Datasets</source><comment>(arXiv:2110.05699). arXiv</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2110.05699">https://doi.org/10.48550/arXiv.2110.05699</ext-link></element-citation></ref>
<ref id="R29"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Reidenberg</surname><given-names>J. R.</given-names></name><name><surname>Breaux</surname><given-names>T.</given-names></name><name><surname>Cranor</surname><given-names>L. F.</given-names></name><name><surname>French</surname><given-names>B.</given-names></name><name><surname>Grannis</surname><given-names>A.</given-names></name><name><surname>Graves</surname><given-names>J.</given-names></name><name><surname>Liu</surname><given-names>F.</given-names></name><name><surname>McDonald</surname><given-names>A.</given-names></name><name><surname>Norton</surname><given-names>T.</given-names></name><name><surname>Ramanath</surname><given-names>R.</given-names></name><name><surname>Russell</surname><given-names>N. C.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name></person-group> <year>(2014)</year> <source>Disagreeable Privacy Policies: Mismatches between Meaning and Users&#x2019; Understanding</source><comment>(SSRN Scholarly Paper 2418297)</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.2139/ssrn.2418297">https://doi.org/10.2139/ssrn.2418297</ext-link></element-citation></ref>
<ref id="R30"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Mysore Sathyendra</surname><given-names>K.</given-names></name><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Zimmeck</surname><given-names>S.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2017)</year> <article-title>Identifying the Provision of Choices in Privacy Policy Text</article-title><person-group person-group-type="editor"><name><surname>Palmer</surname><given-names>M.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Hwa</surname><given-names>R.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Riedel</surname><given-names>S.</given-names></name></person-group><source>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source><fpage>2774</fpage><lpage>2779</lpage><comment>Association for Computational Linguistics</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/D17-1294">https://doi.org/10.18653/v1/D17-1294</ext-link></element-citation></ref>
<ref id="R31"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Srinath</surname><given-names>M.</given-names></name><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Giles</surname><given-names>C. L.</given-names></name></person-group> <year>(2021)</year> <article-title>Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies</article-title><source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source><fpage>6829</fpage><lpage>6839</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/2021.acl-long.532">https://doi.org/10.18653/v1/2021.acl-long.532</ext-link></element-citation></ref>
<ref id="R32"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Stevens</surname><given-names>L. M.</given-names></name><name><surname>Mortazavi</surname><given-names>B. J.</given-names></name><name><surname>Deo</surname><given-names>R. C.</given-names></name><name><surname>Curtis</surname><given-names>L.</given-names></name><name><surname>Kao</surname><given-names>D. P.</given-names></name></person-group> <year>(2020)</year> <article-title>Recommendations for Reporting Machine Learning Analyses in Clinical Research</article-title><source>Circulation. Cardiovascular Quality and Outcomes</source><volume>13</volume><issue>10</issue><fpage>e006556</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1161/CIRCOUTCOMES.120.006556">https://doi.org/10.1161/CIRCOUTCOMES.120.006556</ext-link></element-citation></ref>
<ref id="R33"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Tang</surname><given-names>J.</given-names></name><name><surname>Shoemaker</surname><given-names>H.</given-names></name><name><surname>Lerner</surname><given-names>A.</given-names></name><name><surname>Birrell</surname><given-names>E.</given-names></name></person-group> <year>(2021)</year> <article-title>Defining privacy: How users interpret technical terms in privacy policies</article-title><source>Proceedings on Privacy Enhancing Technologies</source></element-citation></ref>
<ref id="R34"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Thorleiksd&#x00F3;ttir</surname><given-names>T.</given-names></name><name><surname>Renggli</surname><given-names>C.</given-names></name><name><surname>Hollenstein</surname><given-names>N.</given-names></name><name><surname>Zhang</surname><given-names>C.</given-names></name></person-group> <year>(2022)</year> <article-title>Dynamic Human Evaluation for Relative Model Comparisons</article-title><person-group person-group-type="editor"><name><surname>Calzolari</surname><given-names>N.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>B&#x00E9;chet</surname><given-names>F.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Blache</surname><given-names>P.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Choukri</surname><given-names>K.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Cieri</surname><given-names>C.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Declerck</surname><given-names>T.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Goggi</surname><given-names>S.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Isahara</surname><given-names>H.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Maegaard</surname><given-names>B.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Mariani</surname><given-names>J.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Mazo</surname><given-names>H.</given-names></name></person-group><person-group 
person-group-type="editor"><name><surname>Odijk</surname><given-names>J.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Piperidis</surname><given-names>S.</given-names></name></person-group><source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source><fpage>5946</fpage><lpage>5955</lpage><comment>European Language Resources Association</comment><ext-link ext-link-type="uri" xlink:href="https://aclanthology.org/2022.lrec-1.639">https://aclanthology.org/2022.lrec-1.639</ext-link></element-citation></ref>
<ref id="R35"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Dara</surname><given-names>A. A.</given-names></name><name><surname>Liu</surname><given-names>F.</given-names></name><name><surname>Cherivirala</surname><given-names>S.</given-names></name><name><surname>Giovanni Leon</surname><given-names>P.</given-names></name><name><surname>Schaarup Andersen</surname><given-names>M.</given-names></name><name><surname>Zimmeck</surname><given-names>S.</given-names></name><name><surname>Sathyendra</surname><given-names>K. M.</given-names></name><name><surname>Russell</surname><given-names>N. C.</given-names></name><name><surname>Norton</surname><given-names>T. B.</given-names></name><name><surname>Hovy</surname><given-names>E.</given-names></name><name><surname>Reidenberg</surname><given-names>J.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2016)</year> <article-title>The Creation and Analysis of a Website Privacy Policy Corpus</article-title><person-group person-group-type="editor"><name><surname>Erk</surname><given-names>K.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Smith</surname><given-names>N. A.</given-names></name></person-group><source>Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source><fpage>1330</fpage><lpage>1340</lpage><comment>Association for Computational Linguistics</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/P16-1126">https://doi.org/10.18653/v1/P16-1126</ext-link></element-citation></ref>
<ref id="R36"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname><given-names>Y.</given-names></name></person-group><person-group person-group-type="author"><name><surname>Pedersen</surname><given-names>J. O.</given-names></name></person-group> <year>(1997)</year> <comment>July</comment><article-title>A comparative study on feature selection in text categorization</article-title><source>Icml</source><volume>97</volume><issue>412-420</issue><fpage>35</fpage></element-citation></ref>
<ref id="R37"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zaeem</surname><given-names>R. N.</given-names></name><name><surname>German</surname><given-names>R. L.</given-names></name><name><surname>Barber</surname><given-names>K. S.</given-names></name></person-group> <year>(2018)</year> <article-title>PrivacyCheck: Automatic Summarization of Privacy Policies Using Data Mining</article-title><source>ACM Trans. Internet Technol.</source><volume>18</volume><issue>4</issue><comment>53:1-53:18</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3127519">https://doi.org/10.1145/3127519</ext-link></element-citation></ref>
</ref-list>
</back>
</article>