<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">IR</journal-id>
<journal-title-group>
<journal-title>Information Research</journal-title>
</journal-title-group>
<issn pub-type="epub">1368-1613</issn>
<publisher>
<publisher-name>University of Bor&#x00E5;s</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">ir30iConf47581</article-id>
<article-id pub-id-type="doi">10.47989/ir30iConf47581</article-id>
<article-categories>
<subj-group xml:lang="en">
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Taking disagreements into consideration: human annotation variability in privacy policy analysis</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Wang</surname><given-names>Tian</given-names></name>
<xref ref-type="aff" rid="aff0001"/></contrib>
<contrib contrib-type="author"><name><surname>Ma</surname><given-names>Yuanye</given-names></name>
<xref ref-type="aff" rid="aff0002"/></contrib>
<contrib contrib-type="author"><name><surname>Blake</surname><given-names>Catherine</given-names></name>
<xref ref-type="aff" rid="aff0003"/></contrib>
<contrib contrib-type="author"><name><surname>Bashir</surname><given-names>Masooda</given-names></name>
<xref ref-type="aff" rid="aff0004"/></contrib>
<contrib contrib-type="author"><name><surname>Wang</surname><given-names>Ryan</given-names></name>
<xref ref-type="aff" rid="aff0005"/></contrib>
<aff id="aff0001"><bold>Tian Wang</bold> is a Postdoctoral Associate at the CyLab Security and Privacy Institute of Carnegie Mellon University. She received her Ph.D. from the University of Illinois Urbana-Champaign, and her research interests are in mobile and app security and privacy. She can be contacted at <email xlink:href="tianwan2@andrew.cmu.edu">tianwan2@andrew.cmu.edu</email></aff>
<aff id="aff0002"><bold>Yuanye Ma</bold> is a Senior Research Associate at the University of Illinois Discovery Partners Institute. She received her Ph.D. from the University of North Carolina at Chapel Hill, and her research interests are in user-centered privacy, natural language processing, and information ethics. She can be contacted at <email xlink:href="yuanyem@uillinois.edu">yuanyem@uillinois.edu</email></aff>
<aff id="aff0003"><bold>Catherine Blake</bold> is Professor and Associate Dean for Academic Affairs in the School of Information Sciences at the University of Illinois Urbana-Champaign. She received her Ph.D. from the University of California, Irvine. Her research interests are in biomedical informatics, natural language processing, evidence-based discovery, learning health systems, socio-technical systems, data analytics, and literature-based discovery. She can be contacted at <email xlink:href="clblake@illinois.edu">clblake@illinois.edu</email></aff>
<aff id="aff0004"><bold>Masooda Bashir</bold> is an Associate Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign. She received her Ph.D. from Purdue University, and her research interests are in the interface of information technology, human psychology, and society, especially how privacy, security, and trust intersect with information systems from a psychological point of view. She can be contacted at <email xlink:href="mnb@illinois.edu">mnb@illinois.edu</email></aff>
<aff id="aff0005"><bold>Ryan Wang</bold> is a PhD student in the School of Information Sciences at the University of Illinois Urbana-Champaign. He is interested in natural language processing, machine learning, and bioinformatics. He can be reached at <email xlink:href="hywang3@illinois.edu">hywang3@illinois.edu</email>.</aff>
</contrib-group>
<pub-date pub-type="epub"><day>06</day><month>05</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>30</volume>
<issue>i</issue>
<fpage>81</fpage>
<lpage>92</lpage>
<permissions>
<copyright-year>2025</copyright-year>
<copyright-holder>&#x00A9; 2025 The Author(s).</copyright-holder>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract xml:lang="en">
<title>Abstract</title>
<p><bold>Introduction.</bold> Privacy policies inform users about data practices but are often complex and difficult to interpret. Human annotation plays a key role in understanding privacy policies, yet annotation disagreements highlight the complexity of these texts. Traditional machine learning models prioritize consensus, overlooking annotation variability and its impact on accuracy.</p>
<p><bold>Method.</bold> This study examines how annotation disagreements affect machine learning performance using the OPP-115 corpus. It compares majority vote and union methods with alternative strategies to assess their impact on policy classification.</p>
<p><bold>Analysis.</bold> The study evaluates whether increasing annotator consensus improves model effectiveness and if disagreement-aware approaches yield more reliable results.</p>
<p><bold>Results.</bold> Higher agreement levels improve model performance across most categories. Complete agreement yields the best F1 scores, especially for First Party Collection/Use and Third Party Sharing/Collection. Annotation disagreements significantly affect classification outcomes, underscoring the need to understand and account for them.</p>
<p><bold>Conclusion.</bold> Ignoring annotation disagreements can misrepresent model accuracy. This study proposes new evaluation strategies that account for annotation variability, offering a more realistic approach to privacy policy analysis. Future work should explore the causes of annotation disagreements to improve machine learning transparency and reliability.</p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Privacy policies are crucial for digital privacy, explaining how businesses handle personal data to help users make informed decisions. However, the FTC and other studies find most are <italic>&#x2018;incomprehensible&#x2019;,</italic> with users rarely reading or understanding them (<xref rid="R8" ref-type="bibr">Braun, 2024</xref>; <xref rid="R13" ref-type="bibr">Hallgren, 2012</xref>; <xref rid="R33" ref-type="bibr">Tang et al., 2021</xref>). Technical language and inconsistent terminology further complicate comprehension (<xref rid="R4" ref-type="bibr">Azhagusundari &#x0026; Thanamani, 2013</xref>), and ambiguities lead to varying interpretations of terms (<xref rid="R25" ref-type="bibr">Pedregosa et al., 2011</xref>). Users&#x2019; understanding also depends on their education and cultural background (<xref rid="R18" ref-type="bibr">Hossin &#x0026; Sulaiman, 2015</xref>; <xref rid="R19" ref-type="bibr">Krippendorff, 2018</xref>).</p>
<p>Researchers have used natural language processing (NLP) to improve the readability of privacy policies (Li et al., 2022; <xref rid="R35" ref-type="bibr">Wilson et al., 2016</xref>). Efforts include summarising lengthy policies (<xref rid="R21" ref-type="bibr">LingPipe Alias-i., 2008</xref>), using topic models to extract data practices (<xref rid="R15" ref-type="bibr">Harkous et al., 2018</xref>), classifying content (<xref rid="R12" ref-type="bibr">Grosman et al., 2020</xref>), identifying specific information (<xref rid="R11" ref-type="bibr">Gray, 2011</xref>; <xref rid="R30" ref-type="bibr">Mysore Sathyendra et al., 2017</xref>) and checking compliance automatically (<xref rid="R22" ref-type="bibr">Liu et al., 2016</xref>). Using machine learning (ML) and NLP to improve privacy policy readability offers benefits, such as processing large amounts of data far faster than human annotators. However, human annotation, often used to create gold-standard datasets, is time-consuming and relies on inter-rater reliability, leaving disagreements unexplored. ML models, which optimise probability-based classification, prioritise data quantity over annotation variety (<xref rid="R27" ref-type="bibr">Plank, 2022</xref>). This overlooks data inconsistencies and diverse interpretations. A study found that even privacy experts rarely agree on policy interpretations, suggesting automated tools may struggle to interpret policies accurately, just as users do (<xref rid="R8" ref-type="bibr">Braun, 2024</xref>).</p>
<p>Human annotation and interpretation of policy and legal texts are often inconsistent and prone to disagreement. For instance, annotators assigned three different labels&#x2014;Other, User Choice/Control, and First Party Collection/Use&#x2014;to the same text, while another sentence received two distinct labels: first and third party, and User Choice/Control. Similarly, texts (4)-(7) were inconsistently labelled as First Party Collection/Use by some annotators but not by others. This variation stems from the deliberate ambiguity and complexity of such texts, reflecting their inherently controversial and evolving nature.</p>
<p><italic><underline>Texts to be annotated/labelled:</underline></italic></p>
<list list-type="order">
<list-item><p>If you do not wish to share your PIN, you always have the option to not provide the information or use the MediaNews Websites that require it.</p></list-item>
<list-item><p>By use of our websites and games that have dynamic in-game advertising, you signify your assent to SCEA&#x2019;s privacy policy.</p></list-item>
<list-item><p>You may register or enhance your profile by linking your Facebook or Google accounts on NYTimes.com.</p></list-item>
<list-item><p>Sharing Your Information with Other Companies</p></list-item>
<list-item><p>You can delete cookies using your browser settings.</p></list-item>
<list-item><p>What Choices Do I Have?</p></list-item>
<list-item><p>You can visit our Web Sites without sharing personally identifiable information.</p></list-item>
</list>
<p><italic><underline>Annotation scheme/label choices:</underline></italic></p>
<p>OPP-115&#x2019;s annotation scheme consists of ten data practice categories:</p>
<list list-type="order">
<list-item><p><italic>First party collection/use</italic>: how and why a service provider collects user information.</p></list-item>
<list-item><p><italic>Third party sharing/collection</italic>: how user information may be shared with or collected by third parties.</p></list-item>
<list-item><p><italic>User choice/control</italic>: choices and control options available to users.</p></list-item>
<list-item><p><italic>User access, edit, &#x0026; deletion</italic>: if and how users may access, edit, or delete their information.</p></list-item>
<list-item><p><italic>Data retention</italic>: how long user information is stored.</p></list-item>
<list-item><p><italic>Data security</italic>: how user information is protected.</p></list-item>
<list-item><p><italic>Policy change</italic>: if and how users will be informed about changes to the privacy policy.</p></list-item>
<list-item><p><italic>Do not track</italic>: if and how Do Not Track signals for online tracking and advertising are honoured.</p></list-item>
<list-item><p><italic>International and specific audiences</italic>: practices that pertain only to a specific group of users (e.g., children, Europeans, or California residents).</p></list-item>
<list-item><p><italic>Other</italic>: additional sub-labels for introductory or general text, contact information, and practices not covered by the other categories.</p></list-item>
</list>
<p>We argue that revealing variations in understanding privacy policies is an underexplored area of research. This gap partly arises from the bias in using machine learning as the primary method and from the tendency to treat privacy as a uniform concept, overlooking cultural, educational, and gender differences. To address this, we conduct an empirical study using the OPP-115 corpus, which includes annotations from three annotators. We expand the characterization from three (individual, pairwise, and complete agreement) to seven: three individual, three pairwise, and one gold standard with full agreement. Unlike majority voting, pairwise agreement requires two specific annotators to agree. Complete agreement, while yielding fewer instances in the target class, may offer higher-quality annotations. Our study provides tangible measures for understanding how annotation variations impact machine learning performance. Rather than simply optimising metrics, we aim to reflect human disagreements through realistic reporting. We highlight two approaches to constructing gold standards&#x2014;one ignoring disagreement and one considering it&#x2014;without advocating for either, acknowledging that practical constraints often dictate these choices. We emphasise that disagreements in interpreting privacy statements are common, even among experts, and recognizing these differences is essential for user-centred privacy research.</p>
</sec>
<sec id="sec2">
<title>Related work</title>
<p>High-quality labelled data are essential for supervised machine learning. Some argue that multiple annotators reduce human bias in evaluation (<xref rid="R3" ref-type="bibr">Artstein, 2017</xref>), but the number of annotators needed for high-quality annotation is unclear and often limited by budget (<xref rid="R30" ref-type="bibr">Mysore Sathyendra et al., 2017</xref>). Inter-rater reliability (IRR), or annotator agreement, is commonly used to measure annotation quality (<xref rid="R23" ref-type="bibr">Moallem, 2018</xref>). Typically, studies use either text annotated by any rater (union) or by the majority (majority vote), ignoring genuine annotator differences.</p>
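<p>To make the two typical strategies concrete, the following minimal sketch contrasts them on invented labels, where each sentence carries one binary label per annotator for a single category (all names and values here are illustrative, not drawn from OPP-115).</p>

```python
# Illustrative sketch: building "union" and "majority vote" gold standards
# from three hypothetical annotators' binary labels for one category.
# All sentence IDs and labels are invented for illustration.

annotations = {
    "sent_1": [1, 1, 0],  # labels from annotators A, B, C
    "sent_2": [1, 0, 0],
    "sent_3": [1, 1, 1],
}

# union: positive if any annotator applied the label
union = {s: int(any(labels)) for s, labels in annotations.items()}
# majority vote: positive if at least two of three annotators applied it
majority = {s: int(sum(labels) >= 2) for s, labels in annotations.items()}

print(union)     # sent_2 counts as positive: one annotator suffices
print(majority)  # sent_2 drops out: two annotators must agree
```

The difference on `sent_2` is precisely the disagreement that both strategies discard from the final corpus.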
<p>The assumption that annotators are interchangeable is also flawed, as differences exist between annotator populations and individuals (<xref rid="R17" ref-type="bibr">Hershcovich et al., 2022</xref>). Agreement studies show that variation can arise from heterogeneous data, complex labels, and annotator differences. Global agreement coefficients may mask these variations, while more detailed studies provide insights (<xref rid="R32" ref-type="bibr">Stevens et al., 2020</xref>).</p>
<p>A review of legal machine learning datasets found that disagreement during annotation is typically removed, with final corpora containing only <italic>&#x2018;gold standard&#x2019;</italic> annotations. Common strategies like majority vote, forced agreement, expert review, or arbitration do not account for disagreements (<xref rid="R28" ref-type="bibr">Prabhakaran et al., 2021</xref>). Issues of interpretation and non-transparency in reporting machine learning results, especially with legal documents, have also been flagged (<xref rid="R27" ref-type="bibr">Plank, 2022</xref>). Accurate predictions can build trust, but reproducibility depends on dataset validity and the discovery process (<xref rid="R16" ref-type="bibr">Herbert et al., 2023</xref>). Thus, documenting data preparation processes and their impact on model performance is crucial for trust and transparency in machine learning results (<xref rid="R5" ref-type="bibr">Bai et al., 2021</xref>).</p>
</sec>
<sec id="sec3">
<title>Method</title>
<p>Towards this goal of using machine learning to provide privacy policy statement analysis that documents and considers annotation disagreements, our study provides ways to demonstrate how to directly measure the impact of human (dis)agreement on machine learning model performance, by responding to the following research questions:</p>
<speech><speaker>RQ1:</speaker><p>to what extent does reaching consensus amongst annotators impact the classification performance of traditional machine learning and deep learning models, respectively?</p></speech>
<speech><speaker>RQ2:</speaker><p>how do some alternative strategies used to create gold standards compare with the typical union and majority vote strategy?</p></speech>
<speech><speaker>RQ3:</speaker><p>what metrics can be used to better manage the trade-off between more annotated texts and more annotations for the same text?</p></speech>
<p>To address these research questions, we conducted experiments using the OPP-115 corpus (<xref rid="R2" ref-type="bibr">Anaraky et al., 2019</xref>) with annotated text prepared through majority vote, union methods, and alternative strategies (individual, pairwise, and complete agreement). We tested how these strategies impact model performance using two traditional supervised learning algorithms&#x2014;support vector machines (SVM) and Na&#x00EF;ve Bayes (NB)&#x2014;and two deep learning models, bidirectional long short-term memory (BiLSTM) and bidirectional encoder representations from transformers (BERT).</p>
<sec id="sec3_1">
<title>Dataset</title>
<p>The OPP-115 Corpus comprises 13,209 sentences, and the number of sentences in each target class is highly imbalanced (as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>). For example, there are only 6, 39, and 63 sentences with complete agreement in the Do Not Track, Data Retention, and Policy Change categories respectively. The OPP-115 corpus was prepared by ten independent annotators who coded privacy policy text segments using several predefined categories. Each privacy segment was annotated by three independent annotators. We arranged the unique annotator IDs in each statement in ascending order and replaced them with annotators A, B, and C respectively. Under our proposed alternative strategies, gold standards were built from individual annotators (A, B, or C), from pairs of annotators that reached consensus (A and B, A and C, or B and C), and from complete agreement (A, B, and C all agree on the annotations). We also replicated the strategies used in the original work, where a text is deemed relevant when at least two of the three annotators agree (majority vote) or when any of the three annotators applies the label (union).</p>
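<p>The nine gold-standard constructions described above (three individual, three pairwise, one complete agreement, plus the replicated majority vote and union strategies) can be sketched as follows. The binary labels are invented for illustration, and pairwise and complete agreement are interpreted here as every annotator in the group applying the label.</p>

```python
# Hedged sketch of the alternative and typical gold standards for one
# annotation category. rows holds (A, B, C) binary labels per sentence;
# the values are invented for illustration.
rows = [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 0, 1)]

def gold(rule):
    """Apply a labelling rule to every sentence's (A, B, C) labels."""
    return [int(rule(a, b, c)) for a, b, c in rows]

standards = {
    # three individual gold standards
    "A": gold(lambda a, b, c: a),
    "B": gold(lambda a, b, c: b),
    "C": gold(lambda a, b, c: c),
    # three pairwise standards: both members of the pair applied the label
    "A&B": gold(lambda a, b, c: a and b),
    "A&C": gold(lambda a, b, c: a and c),
    "B&C": gold(lambda a, b, c: b and c),
    # complete agreement among all three annotators
    "A&B&C": gold(lambda a, b, c: a and b and c),
    # the two typical strategies, replicated for comparison
    "majority": gold(lambda a, b, c: a + b + c >= 2),
    "union": gold(lambda a, b, c: a + b + c >= 1),
}
```

As the level of required consensus rises from union to majority to complete agreement, the number of positive instances shrinks, which is the trade-off the experiments below quantify.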
<fig id="F1">
<label>Figure 1.</label>
<caption><p>Number of sentences in each gold standard</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c8-fig1.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>In the original OPP-115 data, annotations were applied to text segments at the paragraph level (<xref rid="R2" ref-type="bibr">Anaraky et al., 2019</xref>). In contrast, the unit of analysis in our experiments is a sentence, so both the original text and the annotations were converted into sentences using version 4.1.2 of LingPipe (<xref rid="R1" ref-type="bibr">Alabduljabbar et al., 2021</xref>), and the index position of each sentence was maintained and subsequently aligned with the index position of the manual annotations. Sentences were pre-processed using the NLTK Python package (<xref rid="R28" ref-type="bibr">Prabhakaran et al., 2021</xref>) by converting words to lowercase and removing punctuation and stop words. Terms appearing infrequently (in fewer than 5 sentences) or very frequently (in more than 95% of sentences) were removed because their presence would contribute little to classification performance (Amos et al., 2021). Annotation categories are not mutually exclusive; a sentence can be annotated as belonging to multiple categories. Lastly, we framed the problem as a binary text classification task for each of the annotation categories.</p>
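<p>A minimal sketch of this preprocessing pipeline follows. The study used the NLTK package; to keep the snippet self-contained, a tiny hand-picked stop-word list stands in for NLTK's, while the frequency thresholds (5 sentences, 95%) follow the text above.</p>

```python
import string
from collections import Counter

# Stand-in stop-word list; the study used NLTK's English stop words.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "we", "your"}

def tokenize(sentence):
    # lowercase, strip punctuation, drop stop words
    table = str.maketrans("", "", string.punctuation)
    return [w for w in sentence.lower().translate(table).split()
            if w not in STOP_WORDS]

def filter_terms(sentences, min_df=5, max_df_ratio=0.95):
    """Keep terms appearing in at least min_df sentences and in at most
    max_df_ratio of all sentences, mirroring the thresholds above."""
    tokenized = [set(tokenize(s)) for s in sentences]
    doc_freq = Counter(t for toks in tokenized for t in toks)
    n = len(sentences)
    return {t for t, c in doc_freq.items() if max_df_ratio * n >= c >= min_df}
```

For example, `tokenize("We collect your data.")` yields `["collect", "data"]`, and `filter_terms` then prunes the resulting vocabulary by document frequency before the sentence-term matrix is built.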
</sec>
<sec id="sec3_2">
<title>Text classification</title>
<p>The classification experiments used two algorithm families: traditional models (SVM, NB) and deep learning models (BiLSTM, BERT). Ten-fold cross-validation evaluated each model, splitting the dataset into ten equal parts, with nine used for training and one for testing. The data was stratified to maintain a balanced proportion of positive and negative labels in each fold. Model performance was measured using standard metrics: precision, recall, F1, and accuracy (<xref rid="R14" ref-type="bibr">Hamdani et al., 2021</xref>). For traditional models, feature selection was crucial. We used version 1.0.2 of Scikit-learn (<xref rid="R10" ref-type="bibr">Gordon et al., 2022</xref>) and entropy-based selection, calculating information gain to choose the top 2,000 features (Amos et al., 2021; <xref rid="R25" ref-type="bibr">Pedregosa et al., 2011</xref>). These features were then used to construct the sentence-term matrix for the test set. TF-IDF was considered but not used because it does not account for the target class distribution (<xref rid="R31" ref-type="bibr">Srinath et al., 2021</xref>).</p>
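<p>The entropy-based information gain used for feature selection can be illustrated as follows; this is a generic sketch of the standard information gain formula for binary term features, not the study's exact implementation.</p>

```python
import math

def entropy(pos, neg):
    """Shannon entropy (in bits) of a binary class distribution."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(labels, term_present):
    """Information gain of a binary term feature with respect to
    binary class labels: class entropy minus conditional entropy."""
    n = len(labels)
    base = entropy(sum(labels), n - sum(labels))
    with_term = [y for y, t in zip(labels, term_present) if t]
    without = [y for y, t in zip(labels, term_present) if not t]
    cond = 0.0
    for subset in (with_term, without):
        if subset:
            cond += (len(subset) / n) * entropy(
                sum(subset), len(subset) - sum(subset))
    return base - cond

# A perfectly predictive term has gain equal to the class entropy:
print(information_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

Ranking the vocabulary by this score and keeping the top 2,000 terms gives the feature set described above; unlike TF-IDF, the score depends directly on the target class distribution.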
</sec>
</sec>
<sec id="sec4">
<title>Results</title>
<p>We report our findings regarding how much annotator agreement impacts the performance of automated approaches.</p>
<sec id="sec4_1">
<title>Alternative methods: independent, pairwise, complete agreement</title>
<p>We found that increasing the level of agreement from independent to pairwise to complete improved F1, accuracy, precision, and recall across nearly all categories and classifiers (<xref ref-type="fig" rid="F2">Figure 2</xref>). Complete agreement yielded the best F1 scores for several categories, including First Party Collection/Use and Third Party Sharing/Collection. For Third Party Sharing/Collection, precision improved by 5% and recall by 4%. First Party Collection/Use saw a 7% improvement in precision and 8% in recall. The <italic>&#x2018;Other&#x2019;</italic> category showed significant gains, with F1 improving from 0.82 to 0.91.</p>
<p>In some categories, like User Choice/Control and International and Specific Audiences, pairwise agreement performed as well as complete agreement, with recall and accuracy reaching 0.97. Categories with fewer examples, such as Data Retention, had high metrics but raised concerns about generalizability.</p>
<p>Certain classifiers handled data inconsistencies better. For Third Party Sharing/Collection, precision improved from 0.85 (independent) to 0.91 (complete agreement) across all classifiers. The difference in model performance was minor (within 0.02), but the impact of different gold standards was more pronounced, ranging from 0.03 to 0.06 (<xref ref-type="fig" rid="F2">Figures 2</xref> and <xref ref-type="fig" rid="F3">3</xref>). Third Party Sharing/Collection also showed notable improvements in F1, precision, recall, and accuracy, with precision and F1 for First Party Collection/Use improving by at least 0.07.</p>
<fig id="F2">
<label>Figure 2.</label>
<caption><p>Average model performance for increasing levels of consensus (independent, pairwise, and complete agreement) and increasing levels of disagreement (majority vote, union) gold standards *(Ac=accuracy, F1, Pr=precision, and Re=recall)</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c8-fig2.jpg"><alt-text>none</alt-text></graphic>
</fig>
<fig id="F3">
<label>Figure 3.</label>
<caption><p>Average model performance for complete and union gold standards</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c8-fig3.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
<sec id="sec4_2">
<title>Typical gold standard methods: union and complete</title>
<p>In general, we also found that increasing the level of agreement improves model performance: the majority vote results outperform the union across all metrics, all classifiers, and all categories. <xref ref-type="fig" rid="F4">Figure 4</xref> shows the original Fleiss&#x2019; Kappa statistic against the F1 score for the complete and union gold standards. The Kappa values in the OPP-115 collection range from moderate to very good, so it is possible that larger variations might show a correlation with F1, but these results suggest that: (a) Kappa is not a good substitute for F1 scores produced using different gold standards; and (b) the difference between the standard metrics produced from the complete and union experiments might provide a more realistic way to convey the impact of disagreement.</p>
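<p>For reference, Fleiss' Kappa for a binary category with three raters per sentence, as in OPP-115, can be computed as in the following self-contained sketch; the rating counts below are invented for illustration.</p>

```python
def fleiss_kappa(tables):
    """Fleiss' Kappa. tables holds per-item category counts, e.g.
    [2, 1] means 2 raters chose 'positive' and 1 chose 'negative'.
    Assumes the same number of raters per item and imperfect chance
    agreement (so the denominator is non-zero)."""
    n_items = len(tables)
    n_raters = sum(tables[0])
    n_cats = len(tables[0])
    # mean per-item observed agreement
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in tables
    ) / n_items
    # chance agreement from marginal category proportions
    p_e = sum(
        (sum(row[j] for row in tables) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)

# three sentences, three raters each: unanimous, unanimous, split 2-1
print(round(fleiss_kappa([[3, 0], [0, 3], [2, 1]]), 3))  # 0.55
```

A single corpus-level Kappa of this kind summarises agreement, but, as the results above show, it does not predict the F1 gap between the complete and union gold standards.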
<fig id="F4">
<label>Figure 4.</label>
<caption><p>Fleiss Kappa versus F1-Score for complete and union gold standards</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c8-fig4.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
</sec>
<sec id="sec5">
<title>Discussion</title>
<p>Common methods for reporting machine learning results often focus on the model and its tuning parameters, neglecting variations in human-labelled training data. In privacy documentation, diverse terminology and interpretations make it essential to consider human discrepancies in machine learning analysis. Traditional metrics don&#x2019;t always reflect the Fleiss Kappa statistic or the impact of human consensus. Based on our research, we recommend:</p>
<list list-type="order">
<list-item><p><bold>Use multiple annotators:</bold> for at least a subset of the corpus, multiple annotators should be employed to capture human variability. Reducing the number of categories per annotator could enhance annotation quality and efficiency.</p></list-item>
<list-item><p><bold>Iterative annotation:</bold> apply an iterative approach, directing annotators to categories with less agreement. This can optimise human effort and maintain annotation quality, using complete and union gold standards as metrics.</p></list-item>
<list-item><p><bold>Differentiate gold standards:</bold> use the difference between complete and union gold standards to assess how well machine learning results align with human judgments. This can provide insight into the model&#x2019;s accuracy in mirroring human interpretations.</p></list-item>
<list-item><p><bold>Avoid aggregation:</bold> report results separately for each category rather than aggregating them. Specific categories may have varying expectations, and detailed performance information is crucial for understanding model effectiveness in real-world applications.</p></list-item>
</list>
</sec>
<sec id="sec6">
<title>Conclusion</title>
<p>Privacy policy statements are vital for regulatory compliance and user data decisions, yet many are unreadable and often ignored. Machine learning could help by automating information extraction, but current reporting practices that don&#x2019;t align with human judgement undermine trust. Unlike previous methods that use Fleiss&#x2019; Kappa to measure disagreement, we propose an approach that uses independent, pairwise, and complete agreement in gold standards. We acknowledge that our study used only the OPP-115 corpus, which may limit its generalizability. Given this limitation, our preliminary results show that higher agreement improves precision, recall, F1, and accuracy, while more disagreement reduces these metrics.</p>
<p>Disagreements in privacy statement interpretation are more complex than fact-based tasks, and inter-rater reliability alone may not suffice to measure model performance. Traditional metrics like Cohen&#x2019;s or Fleiss Kappa are inadequate for skewed data. We suggest using precision, recall, F1, and accuracy to evaluate how different gold standards affect performance, which is crucial given the evolving nature of privacy content.</p>
<p>With new collections of privacy statements surpassing a million entries (<xref rid="R6" ref-type="bibr">Bannihatti Kumar et al., 2020</xref>; <xref rid="R34" ref-type="bibr">Thorleiksd&#x00F3;ttir et al., 2022</xref>), investing in annotation adjudication is urgent. Quality assurance involves decisions such as storing multiple annotators&#x2019; data and measuring their agreement (<xref rid="R24" ref-type="bibr">Mousavi Nejad et al., 2020</xref>). Our study highlights the need for multiple annotators on subsets of texts to assess the impact of human judgement on metrics. This may conflict with current practices aimed at maximising annotated data but is essential for realistic metric representation. An iterative approach can help allocate resources effectively, and text classification results should be reported by category rather than aggregated.</p>
<p>Our study is the first to explicitly raise the question of disagreements in annotating privacy policy documents and to highlight the value and significance of studying such disagreements. How exactly disagreement in understanding and/or interpreting privacy policies can be leveraged remains unexplored, and future research needs to establish where and why disagreements occur in privacy policy interpretation, as disagreements can arise for entirely different reasons and hence require different treatments or solutions. For example, disagreement may originate from a lack of knowledge, linguistic ambiguity, or underlying differences in preference, each of which would require a completely different solution.</p>
<p>When marking up raw text, annotators need the flexibility to decide the appropriate text boundaries that capture the target category. Before the initial annotations can be used to construct a classifier, the unit of analysis, such as a paragraph, sentence (used in this analysis) or some other predefined <italic>&#x2018;span&#x2019;</italic>, must be established. This choice impacts the predictive performance of any model constructed; more work is needed to establish what span is optimal for a given task and to quantify the impact of this decision. We have introduced a new performance metric &#x2013; the difference between complete and union gold standards &#x2013; that directly measures the impact of human agreement using the same metrics that are commonly used to evaluate an automated system. However, situated empirical user studies are needed to establish whether this new metric is successful in making machine-learning models more transparent.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>We would like to thank all the reviewers for the feedback on this paper. There is no funding for this research to report.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="R1"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Alabduljabbar</surname><given-names>A.</given-names></name><name><surname>Abusnaina</surname><given-names>A.</given-names></name><name><surname>Meteriz-Yildiran</surname><given-names>&#x00DC;.</given-names></name><name><surname>Mohaisen</surname><given-names>D.</given-names></name></person-group> <year>(2021)</year> <article-title>TLDR: Deep Learning-Based Automated Privacy Policy Annotation with Key Policy Highlights</article-title><source>Proceedings of the 20th Workshop on Workshop on Privacy in the Electronic Society</source><fpage>103</fpage><lpage>118</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3463676.3485608">https://doi.org/10.1145/3463676.3485608</ext-link></element-citation></ref>
<ref id="R2"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Anaraky</surname><given-names>R. G.</given-names></name><name><surname>Cherry</surname><given-names>D.</given-names></name><name><surname>Jarrell</surname><given-names>M.</given-names></name><name><surname>Knijnenburg</surname><given-names>B.</given-names></name></person-group> <year>(2019)</year> <article-title>Testing a comic-based privacy policy</article-title><source>The 15th Symp. on Usable Privacy and Security</source></element-citation></ref>
<ref id="R3"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Artstein</surname><given-names>R.</given-names></name></person-group> <year>(2017)</year> <article-title>Inter-annotator agreement</article-title><source>Handbook of linguistic annotation</source><fpage>297</fpage><lpage>313</lpage></element-citation></ref>
<ref id="R4"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Azhagusundari</surname><given-names>B.</given-names></name><name><surname>Thanamani</surname><given-names>A. S.</given-names></name></person-group> <year>(2013)</year> <article-title>Feature selection based on information gain</article-title><source>International Journal of Innovative Technology and Exploring Engineering (IJITEE)</source><volume>2</volume><issue>2</issue><fpage>18</fpage><lpage>21</lpage></element-citation></ref>
<ref id="R5"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Bai</surname><given-names>F.</given-names></name><name><surname>Ritter</surname><given-names>A.</given-names></name><name><surname>Xu</surname><given-names>W.</given-names></name></person-group> <year>(2021)</year> <article-title>Pre-train or annotate? domain adaptation with a constrained budget</article-title><source>arXiv preprint arXiv:2109.04711</source></element-citation></ref>
<ref id="R6"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Bannihatti Kumar</surname><given-names>V.</given-names></name><name><surname>Iyengar</surname><given-names>R.</given-names></name><name><surname>Nisal</surname><given-names>N.</given-names></name><name><surname>Feng</surname><given-names>Y.</given-names></name><name><surname>Habib</surname><given-names>H.</given-names></name><name><surname>Story</surname><given-names>P.</given-names></name><name><surname>Cherivirala</surname><given-names>S.</given-names></name><name><surname>Hagan</surname><given-names>M.</given-names></name><name><surname>Cranor</surname><given-names>L.</given-names></name><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2020)</year> <article-title>Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text</article-title><source>Proceedings of The Web Conference 2020</source><fpage>1943</fpage><lpage>1954</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3366423.3380262">https://doi.org/10.1145/3366423.3380262</ext-link></element-citation></ref>
<ref id="R7"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Bird</surname><given-names>S.</given-names></name><name><surname>Klein</surname><given-names>E.</given-names></name><name><surname>Loper</surname><given-names>E.</given-names></name></person-group> <year>(2009)</year> <source>Natural language processing with Python: analysing text with the natural language toolkit</source><comment>" O&#x2019;Reilly Media, Inc."</comment></element-citation></ref>
<ref id="R8"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Braun</surname><given-names>D.</given-names></name></person-group> <year>(2024)</year> <article-title>I beg to differ: How disagreement is handled in the annotation of legal machine learning data sets</article-title><source>Artificial Intelligence and Law</source><volume>32</volume><issue>3</issue><fpage>839</fpage><lpage>862</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s10506-023-09369-4">https://doi.org/10.1007/s10506-023-09369-4</ext-link></element-citation></ref>
<ref id="R9"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>R.</given-names></name><name><surname>Fang</surname><given-names>F.</given-names></name><name><surname>Norton</surname><given-names>T.</given-names></name><name><surname>McDonald</surname><given-names>A. M.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2021)</year> <article-title>Fighting the Fog: Evaluating the Clarity of Privacy Disclosures in the Age of CCPA</article-title><source>Proceedings of the 20th Workshop on Workshop on Privacy in the Electronic Society</source><fpage>73</fpage><lpage>102</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3463676.3485601">https://doi.org/10.1145/3463676.3485601</ext-link></element-citation></ref>
<ref id="R10"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Gordon</surname><given-names>M. L.</given-names></name><name><surname>Lam</surname><given-names>M. S.</given-names></name><name><surname>Park</surname><given-names>J. S.</given-names></name><name><surname>Patel</surname><given-names>K.</given-names></name><name><surname>Hancock</surname><given-names>J.</given-names></name><name><surname>Hashimoto</surname><given-names>T.</given-names></name><name><surname>Bernstein</surname><given-names>M. S.</given-names></name></person-group> <year>(2022)</year> <comment>April</comment><article-title>Jury learning: Integrating dissenting voices into machine learning models</article-title><source>Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems</source><fpage>1</fpage><lpage>19</lpage></element-citation></ref>
<ref id="R11"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Gray</surname><given-names>R. M.</given-names></name></person-group> <year>(2011)</year> <source>Entropy and information theory</source><publisher-name>Springer Science &#x0026; Business Media</publisher-name></element-citation></ref>
<ref id="R12"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Grosman</surname><given-names>J. S.</given-names></name><name><surname>Furtado</surname><given-names>P. H. T.</given-names></name><name><surname>Rodrigues</surname><given-names>A. M. B.</given-names></name><name><surname>Schardong</surname><given-names>G. G.</given-names></name><name><surname>Barbosa</surname><given-names>S. D. J.</given-names></name><name><surname>Lopes</surname><given-names>H. C. V.</given-names></name></person-group> <year>(2020)</year> <article-title>Eras: Improving the quality control in the annotation process for Natural Language Processing tasks</article-title><source>Information Systems</source><volume>93</volume><fpage>101553</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.is.2020.101553">https://doi.org/10.1016/j.is.2020.101553</ext-link></element-citation></ref>
<ref id="R13"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hallgren</surname><given-names>K. A.</given-names></name></person-group> <year>(2012)</year> <article-title>Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial</article-title><source>Tutorials in Quantitative Methods for Psychology</source><volume>8</volume><issue>1</issue><fpage>23</fpage><lpage>34</lpage></element-citation></ref>
<ref id="R14"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Hamdani</surname><given-names>R. E.</given-names></name><name><surname>Mustapha</surname><given-names>M.</given-names></name><name><surname>Amariles</surname><given-names>D. R.</given-names></name><name><surname>Troussel</surname><given-names>A.</given-names></name><name><surname>Mee&#x00F9;s</surname><given-names>S.</given-names></name><name><surname>Krasnashchok</surname><given-names>K.</given-names></name></person-group> <year>(2021)</year> <article-title>A combined rule-based and machine learning approach for automated GDPR compliance checking</article-title><source>Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law</source><fpage>40</fpage><lpage>49</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3462757.3466081">https://doi.org/10.1145/3462757.3466081</ext-link></element-citation></ref>
<ref id="R15"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Harkous</surname><given-names>H.</given-names></name><name><surname>Fawaz</surname><given-names>K.</given-names></name><name><surname>Lebret</surname><given-names>R.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Shin</surname><given-names>K. G.</given-names></name><name><surname>Aberer</surname><given-names>K.</given-names></name></person-group> <year>(2018)</year> <article-title>Polisis: Automated analysis and presentation of privacy policies using deep learning</article-title><source>Proceedings of the 27th USENIX Conference on Security Symposium</source><fpage>531</fpage><lpage>548</lpage></element-citation></ref>
<ref id="R16"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Herbert</surname><given-names>F.</given-names></name><name><surname>Becker</surname><given-names>S.</given-names></name><name><surname>Schaewitz</surname><given-names>L.</given-names></name><name><surname>Hielscher</surname><given-names>J.</given-names></name><name><surname>Kowalewski</surname><given-names>M.</given-names></name><name><surname>Sasse</surname><given-names>A.</given-names></name><name><surname>Acar</surname><given-names>Y.</given-names></name><name><surname>D&#x00FC;rmuth</surname><given-names>M.</given-names></name></person-group> <year>(2023)</year> <article-title>A World Full of Privacy and Security (Mis)conceptions? Findings of a Representative Survey in 12 Countries</article-title><source>Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems</source><fpage>1</fpage><lpage>23</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3544548.3581410">https://doi.org/10.1145/3544548.3581410</ext-link></element-citation></ref>
<ref id="R17"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Hershcovich</surname><given-names>D.</given-names></name><name><surname>Frank</surname><given-names>S.</given-names></name><name><surname>Lent</surname><given-names>H.</given-names></name><name><surname>de Lhoneux</surname><given-names>M.</given-names></name><name><surname>Abdou</surname><given-names>M.</given-names></name><name><surname>Brandl</surname><given-names>S.</given-names></name><name><surname>Bugliarello</surname><given-names>E.</given-names></name><name><surname>Cabello Piqueras</surname><given-names>L.</given-names></name><name><surname>Chalkidis</surname><given-names>I.</given-names></name><name><surname>Cui</surname><given-names>R.</given-names></name><name><surname>Fierro</surname><given-names>C.</given-names></name><name><surname>Margatina</surname><given-names>K.</given-names></name><name><surname>Rust</surname><given-names>P.</given-names></name><name><surname>S&#x00F8;gaard</surname><given-names>A.</given-names></name></person-group> <year>(2022)</year> <article-title>Challenges and Strategies in Cross-Cultural NLP</article-title><person-group person-group-type="editor"><name><surname>Muresan</surname><given-names>S.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Nakov</surname><given-names>P.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Villavicencio</surname><given-names>A.</given-names></name></person-group><source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source><fpage>6997</fpage><lpage>7013</lpage><comment>Association for Computational Linguistics</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/2022.acl-long.482">https://doi.org/10.18653/v1/2022.acl-long.482</ext-link></element-citation></ref>
<ref id="R18"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hossin</surname><given-names>M.</given-names></name><name><surname>Sulaiman</surname><given-names>M. N.</given-names></name></person-group> <year>(2015)</year> <article-title>A review on evaluation metrics for data classification evaluations</article-title><source>International journal of data mining &#x0026; knowledge management process</source><volume>5</volume><issue>2</issue><fpage>1</fpage></element-citation></ref>
<ref id="R19"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Krippendorff</surname><given-names>K.</given-names></name></person-group> <year>(2018)</year> <source>Content analysis: An introduction to its methodology</source><publisher-name>Sage publications</publisher-name></element-citation></ref>
<ref id="R20"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>J.</given-names></name><name><surname>Liu</surname><given-names>L.</given-names></name><name><surname>Le</surname><given-names>T. D.</given-names></name><name><surname>Liu</surname><given-names>J.</given-names></name></person-group> <year>(2020)</year> <article-title>Accurate data-driven prediction does not mean high reproducibility</article-title><source>Nature Machine Intelligence</source><volume>2</volume><issue>1</issue><fpage>13</fpage><lpage>15</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s42256-019-0140-2">https://doi.org/10.1038/s42256-019-0140-2</ext-link></element-citation></ref>
<ref id="R21"><element-citation publication-type="other"><person-group person-group-type="author"><collab>LingPipe Alias-i</collab></person-group> <year>(2008)</year> <comment>4.1. 0. URL</comment><ext-link ext-link-type="uri" xlink:href="http://alias-i.com/lingpipe">http://alias-i.com/lingpipe</ext-link><comment>(2008)</comment></element-citation></ref>
<ref id="R22"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>F.</given-names></name><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2016)</year> <article-title>Analyzing vocabulary intersections of expert annotations and topic models for data practices in privacy policies</article-title><source>2016 AAAI Fall Symposium Series</source></element-citation></ref>
<ref id="R23"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Moallem</surname><given-names>A.</given-names></name></person-group> <year>(2018)</year> <chapter-title>Do You Really Trust &#x201C;Privacy Policy&#x201D; or &#x201C;Terms of Use&#x201D; Agreements Without Reading Them?</chapter-title><person-group person-group-type="editor"><name><surname>Nicholson</surname><given-names>D.</given-names></name></person-group><source>Advances in Human Factors in Cybersecurity</source><fpage>290</fpage><lpage>295</lpage><publisher-name>Springer International Publishing</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-60585-2_27">https://doi.org/10.1007/978-3-319-60585-2_27</ext-link></element-citation></ref>
<ref id="R24"><element-citation publication-type="book"><person-group person-group-type="author">Mousavi <name><surname>Nejad</surname><given-names>N.</given-names></name><name><surname>Jabat</surname><given-names>P.</given-names></name><name><surname>Nedelchev</surname><given-names>R.</given-names></name><name><surname>Scerri</surname><given-names>S.</given-names></name><name><surname>Graux</surname><given-names>D.</given-names></name></person-group> <year>(2020)</year> <article-title>Establishing a Strong Baseline for Privacy Policy Classification</article-title><person-group person-group-type="editor"><name><surname>H&#x00F6;lbl</surname><given-names>M.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Rannenberg</surname><given-names>K.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Welzer</surname><given-names>T.</given-names></name></person-group><source>ICT Systems Security and Privacy Protection</source><fpage>370</fpage><lpage>383</lpage><publisher-name>Springer International Publishing</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-030-58201-2_25">https://doi.org/10.1007/978-3-030-58201-2_25</ext-link></element-citation></ref>
<ref id="R25"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pedregosa</surname><given-names>F.</given-names></name><name><surname>Varoquaux</surname><given-names>G.</given-names></name><name><surname>Gramfort</surname><given-names>A.</given-names></name><name><surname>Michel</surname><given-names>V.</given-names></name><name><surname>Thirion</surname><given-names>B.</given-names></name><name><surname>Grisel</surname><given-names>O.</given-names></name><name><surname>Duchesnay</surname><given-names>E.</given-names></name></person-group> <year>(2011)</year> <article-title>Scikit-learn: Machine learning in python journal of machine learning research</article-title><source>Journal of machine learning research</source><volume>12</volume><fpage>2825</fpage><lpage>2830</lpage></element-citation></ref>
<ref id="R26"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pepperberg</surname><given-names>I. M.</given-names></name></person-group> <year>(1988)</year> <article-title>An interactive modeling technique for acquisition of communication skills: Separation of &#x201C;labeling&#x201D; and &#x201C;requesting&#x201D; in a psittacine subject</article-title><source>Applied Psycholinguistics</source><volume>9</volume><issue>1</issue><fpage>59</fpage><lpage>76</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1017/S014271640000045X">https://doi.org/10.1017/S014271640000045X</ext-link></element-citation></ref>
<ref id="R27"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Plank</surname><given-names>B.</given-names></name></person-group> <year>(2022)</year> <source>The &#x201C;Problem&#x201D; of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation</source><comment>(arXiv:2211.02570). arXiv</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2211.02570">https://doi.org/10.48550/arXiv.2211.02570</ext-link></element-citation></ref>
<ref id="R28"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Prabhakaran</surname><given-names>V.</given-names></name><name><surname>Davani</surname><given-names>A. M.</given-names></name><name><surname>D&#x00ED;az</surname><given-names>M.</given-names></name></person-group> <year>(2021)</year> <source>On Releasing Annotator-Level Labels and Information in Datasets</source><comment>(arXiv:2110.05699). arXiv</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2110.05699">https://doi.org/10.48550/arXiv.2110.05699</ext-link></element-citation></ref>
<ref id="R29"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Reidenberg</surname><given-names>J. R.</given-names></name><name><surname>Breaux</surname><given-names>T.</given-names></name><name><surname>Cranor</surname><given-names>L. F.</given-names></name><name><surname>French</surname><given-names>B.</given-names></name><name><surname>Grannis</surname><given-names>A.</given-names></name><name><surname>Graves</surname><given-names>J.</given-names></name><name><surname>Liu</surname><given-names>F.</given-names></name><name><surname>McDonald</surname><given-names>A.</given-names></name><name><surname>Norton</surname><given-names>T.</given-names></name><name><surname>Ramanath</surname><given-names>R.</given-names></name><name><surname>Russell</surname><given-names>N. C.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name></person-group> <year>(2014)</year> <source>Disagreeable Privacy Policies: Mismatches between Meaning and Users&#x2019; Understanding</source><comment>(SSRN Scholarly Paper 2418297)</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.2139/ssrn.2418297">https://doi.org/10.2139/ssrn.2418297</ext-link></element-citation></ref>
<ref id="R30"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Mysore Sathyendra</surname><given-names>K.</given-names></name><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Zimmeck</surname><given-names>S.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2017)</year> <article-title>Identifying the Provision of Choices in Privacy Policy Text</article-title><person-group person-group-type="editor"><name><surname>Palmer</surname><given-names>M.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Hwa</surname><given-names>R.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Riedel</surname><given-names>S.</given-names></name></person-group><source>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source><fpage>2774</fpage><lpage>2779</lpage><comment>Association for Computational Linguistics</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/D17-1294">https://doi.org/10.18653/v1/D17-1294</ext-link></element-citation></ref>
<ref id="R31"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Srinath</surname><given-names>M.</given-names></name><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Giles</surname><given-names>C. L.</given-names></name></person-group> <year>(2021)</year> <article-title>Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies</article-title><source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source><fpage>6829</fpage><lpage>6839</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/2021.acl-long.532">https://doi.org/10.18653/v1/2021.acl-long.532</ext-link></element-citation></ref>
<ref id="R32"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Stevens</surname><given-names>L. M.</given-names></name><name><surname>Mortazavi</surname><given-names>B. J.</given-names></name><name><surname>Deo</surname><given-names>R. C.</given-names></name><name><surname>Curtis</surname><given-names>L.</given-names></name><name><surname>Kao</surname><given-names>D. P.</given-names></name></person-group> <year>(2020)</year> <article-title>Recommendations for Reporting Machine Learning Analyses in Clinical Research</article-title><source>Circulation. Cardiovascular Quality and Outcomes</source><volume>13</volume><issue>10</issue><fpage>e006556</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1161/CIRCOUTCOMES.120.006556">https://doi.org/10.1161/CIRCOUTCOMES.120.006556</ext-link></element-citation></ref>
<ref id="R33"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Tang</surname><given-names>J.</given-names></name><name><surname>Shoemaker</surname><given-names>H.</given-names></name><name><surname>Lerner</surname><given-names>A.</given-names></name><name><surname>Birrell</surname><given-names>E.</given-names></name></person-group> <year>(2021)</year> <article-title>Defining privacy: How users interpret technical terms in privacy policies</article-title><source>Proceedings on Privacy Enhancing Technologies</source></element-citation></ref>
<ref id="R34"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Thorleiksd&#x00F3;ttir</surname><given-names>T.</given-names></name><name><surname>Renggli</surname><given-names>C.</given-names></name><name><surname>Hollenstein</surname><given-names>N.</given-names></name><name><surname>Zhang</surname><given-names>C.</given-names></name></person-group> <year>(2022)</year> <article-title>Dynamic Human Evaluation for Relative Model Comparisons</article-title><person-group person-group-type="editor"><name><surname>Calzolari</surname><given-names>N.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>B&#x00E9;chet</surname><given-names>F.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Blache</surname><given-names>P.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Choukri</surname><given-names>K.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Cieri</surname><given-names>C.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Declerck</surname><given-names>T.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Goggi</surname><given-names>S.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Isahara</surname><given-names>H.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Maegaard</surname><given-names>B.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Mariani</surname><given-names>J.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Mazo</surname><given-names>H.</given-names></name></person-group><person-group 
person-group-type="editor"><name><surname>Odijk</surname><given-names>J.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Piperidis</surname><given-names>S.</given-names></name></person-group><source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source><fpage>5946</fpage><lpage>5955</lpage><comment>European Language Resources Association</comment><ext-link ext-link-type="uri" xlink:href="https://aclanthology.org/2022.lrec-1.639">https://aclanthology.org/2022.lrec-1.639</ext-link></element-citation></ref>
<ref id="R35"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Wilson</surname><given-names>S.</given-names></name><name><surname>Schaub</surname><given-names>F.</given-names></name><name><surname>Dara</surname><given-names>A. A.</given-names></name><name><surname>Liu</surname><given-names>F.</given-names></name><name><surname>Cherivirala</surname><given-names>S.</given-names></name><name><surname>Giovanni Leon</surname><given-names>P.</given-names></name><name><surname>Schaarup Andersen</surname><given-names>M.</given-names></name><name><surname>Zimmeck</surname><given-names>S.</given-names></name><name><surname>Sathyendra</surname><given-names>K. M.</given-names></name><name><surname>Russell</surname><given-names>N. C.</given-names></name><name><surname>Norton</surname><given-names>T. B.</given-names></name><name><surname>Hovy</surname><given-names>E.</given-names></name><name><surname>Reidenberg</surname><given-names>J.</given-names></name><name><surname>Sadeh</surname><given-names>N.</given-names></name></person-group> <year>(2016)</year> <article-title>The Creation and Analysis of a Website Privacy Policy Corpus</article-title><person-group person-group-type="editor"><name><surname>Erk</surname><given-names>K.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Smith</surname><given-names>N. A.</given-names></name></person-group><source>Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source><fpage>1330</fpage><lpage>1340</lpage><comment>Association for Computational Linguistics</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.18653/v1/P16-1126">https://doi.org/10.18653/v1/P16-1126</ext-link></element-citation></ref>
<ref id="R36"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname><given-names>Y.</given-names></name></person-group><person-group person-group-type="author"><name><surname>Pedersen</surname><given-names>J. O.</given-names></name></person-group> <year>(1997)</year> <comment>July</comment><article-title>A comparative study on feature selection in text categorization</article-title><source>Icml</source><volume>97</volume><issue>412-420</issue><fpage>35</fpage></element-citation></ref>
<ref id="R37"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zaeem</surname><given-names>R. N.</given-names></name><name><surname>German</surname><given-names>R. L.</given-names></name><name><surname>Barber</surname><given-names>K. S.</given-names></name></person-group> <year>(2018)</year> <article-title>PrivacyCheck: Automatic Summarization of Privacy Policies Using Data Mining</article-title><source>ACM Trans. Internet Technol.</source><volume>18</volume><issue>4</issue><comment>53:1-53:18</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3127519">https://doi.org/10.1145/3127519</ext-link></element-citation></ref>
</ref-list>
</back>
</article>