<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">IR</journal-id>
<journal-title-group>
<journal-title>Information Research</journal-title>
</journal-title-group>
<issn pub-type="epub">1368-1613</issn>
<publisher>
<publisher-name>University of Bor&#x00E5;s</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">ir30iConf46918</article-id>
<article-id pub-id-type="doi">10.47989/ir30iConf46918</article-id>
<article-categories>
<subj-group xml:lang="en">
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Are data papers cited as research data? Preliminary analysis on interdisciplinary data paper citations</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Li</surname><given-names>Kai</given-names></name>
<xref ref-type="aff" rid="aff0001"/></contrib>
<contrib contrib-type="author"><name><surname>Huang</surname><given-names>Pao-Pei</given-names></name>
<xref ref-type="aff" rid="aff0002"/></contrib>
<contrib contrib-type="author"><name><surname>Jeng</surname><given-names>Wei</given-names></name>
<xref ref-type="aff" rid="aff0003"/></contrib>
<aff id="aff0001"><bold>Kai Li</bold> is an Assistant Professor in the School of Information Sciences, University of Tennessee, Knoxville, USA. He received his Ph.D. from the Department of Information Science at Drexel University, and his research interests are scholarly communication and quantitative science studies. He can be contacted at <email xlink:href="kli16@utk.edu">kli16@utk.edu</email>.</aff>
<aff id="aff0002"><bold>Pao-Pei Huang</bold> is a Ph.D. student in the School of Information and Library Science, University of North Carolina, Chapel Hill, USA. Her research interests are open science practices, scholarly communication, and social informatics. She can be contacted at <email xlink:href="paopei@unc.edu">paopei@unc.edu</email>.</aff>
<aff id="aff0003"><bold>Wei Jeng</bold> is an Associate Professor at the Department of Library and Information Science, National Taiwan University, Taiwan, and Director of the Talent Empowerment Center at the National Institute of Cyber Security, Taiwan. She received her Ph.D. from the School of Computing and Information at the University of Pittsburgh, and her research interests are open science practices and research data infrastructure. She can be contacted at <email xlink:href="wjeng@ntu.edu.tw">wjeng@ntu.edu.tw</email>.</aff>
</contrib-group>
<pub-date pub-type="epub"><day>06</day><month>05</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>30</volume>
<issue>i</issue>
<fpage>1225</fpage>
<lpage>1233</lpage>
<permissions>
<copyright-year>2025</copyright-year>
<copyright-holder>&#x00A9; 2025 The Author(s).</copyright-holder>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract xml:lang="en">
<title>Abstract</title>
<p><bold>Introduction.</bold> Research data sharing and reuse have become increasingly important in modern science, and data papers represent a new academic publication genre aimed at enhancing the visibility, sharing, and reuse of research data. However, whether citations to data papers reflect actual data reuse remains largely unexplored. This paper presents preliminary findings from a project designed to address this gap.</p>
<p><bold>Method.</bold> We conducted a content analysis, manually annotating 437 citation sentences from 309 research articles that reference 50 data papers published in <italic>Data in Brief</italic>, a leading academic journal that publishes only data papers. The data papers were sampled from five knowledge domains based on a paper-level classification system.</p>
<p><bold>Results.</bold> Our results show that most citations to all selected data papers (89%) are unrelated to the research data being described in the paper, instead focusing on the research findings or methodologies. This suggests that data papers are being cited similarly to traditional research articles, despite their unique purpose and content.</p>
<p><bold>Conclusion.</bold> These findings raise questions about the effectiveness of data papers as representations of research data within the scholarly communication system, as well as their utility in quantitative studies on data reuse.</p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Data papers represent a unique academic genre that primarily serves to document and contextualize research datasets within the broader research landscape (<xref rid="R1" ref-type="bibr">Chavan &#x0026; Penev, 2011</xref>). As Chavan and Penev highlighted, data papers embody the core properties of data publication: availability, documentation, citability, and verification (<xref rid="R1" ref-type="bibr">Chavan &#x0026; Penev, 2011</xref>). They provide comprehensive dataset descriptions, link to the data&#x2019;s location, undergo peer review, and, crucially, offer a mechanism for citation credit. Unlike traditional research articles that focus on presenting findings and interpretations, data papers shift the spotlight onto the data itself, offering comprehensive metadata and methodological details. This approach aligns with the growing emphasis on data sharing and open science principles in academia (<xref rid="R7" ref-type="bibr">Kratz &#x0026; Strasser, 2014</xref>). Given its strong relevance to the data-driven research paradigm, this new academic genre has gained traction in recent years, with the number of published data papers surpassing ten thousand and continuing to grow. Some universities and researchers now consider data paper submission a standard practice in data publication (<xref rid="R14" ref-type="bibr">Sch&#x00F6;pfel et al., 2020</xref>; <xref rid="R20" ref-type="bibr">Thelwall, 2020</xref>).</p>
<p>The value of data papers extends beyond promoting transparency and reproducibility; they provide an approach for acknowledging the crucial contribution of data to the research process (<xref rid="R4" ref-type="bibr">Gorgolewski et al., 2013</xref>). In addition, by offering a citable entity for datasets, data papers also create a pathway for crediting data uses, thereby incentivizing data sharing, research reproducibility, and enhancing the perceived value of research data within the scholarly ecosystem. This addresses a significant challenge identified by Tenopir and colleagues (<xref rid="R19" ref-type="bibr">Tenopir et al., 2011</xref>), who noted that the lack of robust reward mechanisms often impedes data sharing activities.</p>
<p>However, despite the increasing prevalence of data journals (such as <italic>Data in Brief</italic> and <italic>Scientific Data</italic>) and data papers, studies indicate that data citation and reuse remain lower than anticipated and that citations to data papers may not equate to data reuse (<xref rid="R6" ref-type="bibr">Jiao &#x0026; Darch, 2020</xref>; <xref rid="R17" ref-type="bibr">Stuart, 2017</xref>; <xref rid="R20" ref-type="bibr">Thelwall, 2020</xref>). These gaps remain significant challenges for the existing infrastructure that supports data sharing and reuse. They also suggest the presence of factors influencing the impact and integration of data papers in the research system that are not well understood in the literature. Understanding how data papers are used and cited, particularly the factors behind the citation behavior, is crucial for assessing their effectiveness in promoting data sharing and reuse.</p>
<p>Therefore, our study aims to explore the citation practices surrounding data papers, explicitly focusing on their citation contexts: whether a data paper is cited as research data. This investigation is guided by the overall research interest: What is the role and impact of data papers in scholarly communication, especially from the perspective of the intention of their citations? More specifically, we aim to address:</p>
<p>RQ1: How are data papers cited in research articles, and for what purposes?</p>
<p>RQ2: How does the context of data paper citations differ between disciplines and citation year gaps?</p>
<p>Through this preliminary analysis, we seek to contribute to the ongoing discussion about the roles of data papers in scholarly communication and their relationship to research data. The findings of this study will have implications for data citation practices, data paper guidelines, and the broader understanding of how data is used and valued across disciplines. As we navigate the evolving landscape of scholarly communication, such insights are crucial for optimizing the impact of data papers and fostering a more data-centric research ecosystem.</p>
</sec>
<sec id="sec2">
<title>Methods</title>
<p>We took our sample from <italic>Data in Brief</italic>, a journal devoted exclusively to data papers that Elsevier launched in 2014. As one of the few journals that publish only data papers, it has been frequently analysed to understand how research data is published and the relationship between data publication and scientific research (<xref rid="R2" ref-type="bibr">Chen et al., 2022</xref>; <xref rid="R3" ref-type="bibr">Fu et al., 2023</xref>; <xref rid="R9" ref-type="bibr">Li &#x0026; Jiao, 2022</xref>; <xref rid="R20" ref-type="bibr">Thelwall, 2020</xref>).</p>
<p>We retrieved the metadata and citations of all data papers in the journal from the Dimensions database (https://www.dimensions.ai/). Dimensions is a well-established scholarly database that contains more than 146 million publications and has been heavily used in scientometrics and science of science studies (<xref rid="R5" ref-type="bibr">Herzog et al., 2020</xref>; <xref rid="R11" ref-type="bibr">Mart&#x00ED;n-Mart&#x00ED;n et al., 2021</xref>). We considered only data papers with at least 10 citations indexed in the Dimensions database, so that each paper has enough citations to analyse.</p>
<p>We obtained the domain information of all data papers from Dimensions, which uses the Australian and New Zealand Standard Research Classification (ANZSRC). The classification contains 22 broad divisions (FoR2) and 159 detailed groups within these divisions (FoR4). Dimensions assigns it at the paper level using a machine learning algorithm (https://plus.dimensions.ai/support/solutions/articles/23000018820-which-research-categories-and-classification-schemes-are-available-in-dimensions), which can categorize the discipline of data papers published in the same journal more accurately than a journal-level classification.</p>
<p>We mapped the FoR2 categories into five major knowledge domains for the next step of the analysis: biomedical (<italic>Bio</italic>), environmental science (<italic>Env</italic>), physical science (<italic>Phy</italic>), social science and humanities (<italic>Soc</italic>), and technologies and engineering (<italic>Tech</italic>). The mapping table, as well as the number of publications in each FoR2 category, is available in our supplementary materials on Zenodo (<xref rid="R10" ref-type="bibr">Li et al., 2024</xref>). For each of these five domains, we randomly selected 10 data papers for our final sample. For each data paper, we then randomly selected 10 publications that cite the paper to analyse in the next step.</p>
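<p>The two-stage random selection described above could be sketched as follows. This is only an illustration under stated assumptions: the function name <monospace>draw_sample</monospace>, the seed, and all identifiers in the toy data are hypothetical and do not come from the study&#x2019;s materials.</p>
<preformat preformat-type="code">
```python
import random

# The five knowledge domains used in the paper.
DOMAINS = ["Bio", "Env", "Phy", "Soc", "Tech"]


def draw_sample(papers_by_domain, citing_by_paper, per_domain=10, per_paper=10, seed=0):
    """Pick per_domain data papers per domain, then per_paper citing works each."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the draw is repeatable
    sample = {}
    for domain in DOMAINS:
        chosen = rng.sample(sorted(papers_by_domain[domain]), per_domain)
        for paper in chosen:
            # Eligible papers have at least 10 citations, so the draw is valid.
            sample[paper] = rng.sample(sorted(citing_by_paper[paper]), per_paper)
    return sample


# Hypothetical toy data: 30 eligible papers per domain, 15 citing works per paper.
papers_by_domain = {d: [f"{d}-{i}" for i in range(30)] for d in DOMAINS}
citing_by_paper = {
    p: [f"{p}-cite-{j}" for j in range(15)]
    for papers in papers_by_domain.values()
    for p in papers
}
sample = draw_sample(papers_by_domain, citing_by_paper)
print(len(sample), sum(len(v) for v in sample.values()))
```
</preformat>
<p>Sorting the identifiers before sampling makes the draw independent of input order, which helps when the same sample needs to be regenerated later.</p>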
<p>We manually examined the citing publications from the last step and filtered out (1) any publication that is not a research paper, such as review papers, data papers, and corrections; (2) any publication that is not published in English; and (3) any publication that does not actually cite the target data paper, despite the citation information supplied by the Dimensions database. Some of these issues stem from incomplete or inaccurate metadata in the Dimensions database, which is a major issue in the research infrastructure that supports research data. After this step, 309 research papers remained in our final sample.</p>
<p>For every publication, we manually collected the sentences in which the data paper was cited. There are 437 sentences in total, given that a paper may be cited multiple times in the same citing publication. Two coders independently annotated every sentence using a scheme that we developed on another 50 randomly selected sentences, drawn before the final sample. The scheme distinguishes sentences that describe or mention the data described in the data paper (i.e., data description, data use, and data background) from those that are not related to the data at all (i.e., research concept or background, research method, and research finding). Beyond the definitions presented in <xref ref-type="table" rid="T1">Table 1</xref>, examples of these categories can be accessed in our supplementary materials shared on Zenodo (<xref rid="R10" ref-type="bibr">Li et al., 2024</xref>).</p>
<table-wrap id="T1">
<label>Table 1.</label>
<caption><p>Scheme for manual annotation of all citation sentences</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top">Category</th>
<th align="center" valign="top">Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top"><bold>Data-related</bold></td>
<td align="left" valign="top"></td>
</tr>
<tr>
<td align="left" valign="top">Data Description (DP)</td>
<td align="left" valign="top">The sentence describes the dataset(s) per se, without indicating actual usage or how it was collected.</td>
</tr>
<tr>
<td align="left" valign="top">Data Use (DU)</td>
<td align="left" valign="top">The sentence states that the dataset is used in the research, including as a baseline.</td>
</tr>
<tr>
<td align="left" valign="top">Data Background (DB)</td>
<td align="left" valign="top">The sentence provides contextual information about the dataset, without describing its specific contents or usage.</td>
</tr>
<tr>
<td align="left" valign="top"><bold>Not data-related</bold></td>
<td align="left" valign="top"></td>
</tr>
<tr>
<td align="left" valign="top">Research Concept or Background (NDC)</td>
<td align="left" valign="top">The sentence references the research topic or theoretical concepts of the cited data papers.</td>
</tr>
<tr>
<td align="left" valign="top">Research Method (NDM)</td>
<td align="left" valign="top">The sentence references the research methods, including but not limited to process, tools, techniques, used in the cited data paper.</td>
</tr>
<tr>
<td align="left" valign="top">Research Finding (NDF)</td>
<td align="left" valign="top">The sentence references the research findings, including results and their implications, derived from data in the cited data paper.</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We measured the inter-coder reliability of the annotation using Cohen&#x2019;s kappa as implemented in the <italic>&#x2018;psych&#x2019;</italic> package of R (<xref rid="R13" ref-type="bibr">Revelle, 2024</xref>). The unweighted Cohen&#x2019;s kappa value is 0.73, which suggests substantial agreement (<xref rid="R12" ref-type="bibr">McHugh, 2012</xref>). We revisited the sentences on which the two coders disagreed.</p>
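<p>For readers unfamiliar with the statistic, unweighted Cohen&#x2019;s kappa compares the observed agreement between two coders with the agreement expected by chance from their label frequencies. The sketch below is a plain Python re-implementation for illustration only; it is not the R <italic>&#x2018;psych&#x2019;</italic> routine used in the study, and the ten annotations are hypothetical.</p>
<preformat preformat-type="code">
```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations of ten sentences, using the codes from Table 1.
coder_1 = ["NDF", "NDF", "NDM", "DU", "NDC", "NDF", "NDM", "DU", "NDF", "DP"]
coder_2 = ["NDF", "NDM", "NDM", "DU", "NDC", "NDF", "NDF", "DU", "NDF", "DB"]
print(round(cohen_kappa(coder_1, coder_2), 3))
```
</preformat>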
</sec>
<sec id="sec3">
<title>Results</title>
<sec id="sec3_1">
<title>How are data papers cited in research articles, and for what purposes?</title>
<p>Our results show that the majority of data papers are NOT cited as research data in research articles. <xref ref-type="table" rid="T2">Table 2</xref> shows the distribution of all citation sentences across the six categories. Among all 437 citation sentences, only 48 (11.0%) belong to the data-related categories. The largest category is <italic>Research finding (NDF)</italic>, even though, by its original definition, a data paper is not supposed to present research design and findings (<xref rid="R1" ref-type="bibr">Chavan &#x0026; Penev, 2011</xref>). There is also a sizeable share of citations to data papers (29.7%) that concern research methods (i.e., <italic>NDM</italic>). Even though data papers are not cited as research data in these cases, this citation context is still relatively close to research data. Overall, we find a diverse spectrum of citation practices around data papers that deviate strongly from the original purposes of publishing data papers.</p>
<table-wrap id="T2">
<label>Table 2.</label>
<caption><p>Distribution of all citation sentences across the six categories</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top">Category</th>
<th align="center" valign="top"># Sentences</th>
<th align="center" valign="top">Share of Sentences</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top"><bold>Data-related</bold></td>
<td align="center" valign="top"><bold>48</bold></td>
<td align="center" valign="top"><bold>11.0%</bold></td>
</tr>
<tr>
<td align="left" valign="top">Data Use (DU)</td>
<td align="center" valign="top">27</td>
<td align="center" valign="top">6.2%</td>
</tr>
<tr>
<td align="left" valign="top">Data Description (DP)</td>
<td align="center" valign="top">14</td>
<td align="center" valign="top">3.2%</td>
</tr>
<tr>
<td align="left" valign="top">Data Background (DB)</td>
<td align="center" valign="top">7</td>
<td align="center" valign="top">1.6%</td>
</tr>
<tr>
<td align="left" valign="top"><bold>Not data-related</bold></td>
<td align="center" valign="top"><bold>389</bold></td>
<td align="center" valign="top"><bold>89.0%</bold></td>
</tr>
<tr>
<td align="left" valign="top">Research Finding (NDF)</td>
<td align="center" valign="top">207</td>
<td align="center" valign="top">47.4%</td>
</tr>
<tr>
<td align="left" valign="top">Research Method (NDM)</td>
<td align="center" valign="top">130</td>
<td align="center" valign="top">29.7%</td>
</tr>
<tr>
<td align="left" valign="top">Research Concept or Background (NDC)</td>
<td align="center" valign="top">52</td>
<td align="center" valign="top">11.9%</td>
</tr>
<tr>
<td align="left" valign="top"><bold>All categories</bold></td>
<td align="center" valign="top"><bold>437</bold></td>
<td align="center" valign="top"><bold>100%</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
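<p>The group totals and shares in <xref ref-type="table" rid="T2">Table 2</xref> follow from the six category counts. The short sketch below recomputes them as an arithmetic check; the counts are copied from the six category rows of the table, and the variable and function names are illustrative.</p>
<preformat preformat-type="code">
```python
# Sentence counts per category, copied from the six category rows of Table 2.
counts = {
    "DU": 27, "DP": 14, "DB": 7,         # data-related
    "NDF": 207, "NDM": 130, "NDC": 52,   # not data-related
}
DATA_RELATED = {"DU", "DP", "DB"}

total = sum(counts.values())
data_total = sum(n for code, n in counts.items() if code in DATA_RELATED)
non_data_total = total - data_total


def share(n, whole):
    """Percentage rounded to one decimal place, as reported in the table."""
    return round(100 * n / whole, 1)


print(total, data_total, non_data_total)
print(share(data_total, total), share(non_data_total, total))
for code, n in counts.items():
    print(code, share(n, total))
```
</preformat>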
<p>At the paper level, these sentence counts correspond to 16 data papers with at least one data-related citation. Of these 16 papers, only four are cited primarily as research data (i.e., more than 50% of all citation sentences a paper received are data-related). The reasons for such differences between data papers are beyond the scope of this preliminary research, but they are an interesting question for the next step of our investigation.</p>
<p>We specifically examined how citation contexts are related to the paper sections in which a data paper is cited, given the strong relationship between these variables in citation analysis (<xref rid="R18" ref-type="bibr">Tahamtan &#x0026; Bornmann, 2019</xref>). We classified all paper sections into the following four categories: Introduction (including literature review), Methods, RDC (i.e., Results, Discussion, and Conclusion), and Others (i.e., appendix and acknowledgement). The key statistics are presented in <xref ref-type="table" rid="T3">Table 3</xref>. Method citations are most prevalent in the Methods section, where data citations are also far more common than in the Introduction or RDC sections. In addition, even though only a few data papers are cited in the appendix and acknowledgement, these sections are also quite strongly connected to the data-related citation contexts.</p>
<table-wrap id="T3">
<label>Table 3.</label>
<caption><p>Data citation index by paper section (% Data citation is the share of all data-related sentences among all sentences; % Method citation is the share of all research method sentences among all non-data sentences.)</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top">Section</th>
<th align="center" valign="top"># Sentences</th>
<th align="center" valign="top">% Data citation</th>
<th align="center" valign="top">% Method citation</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Introduction</td>
<td align="center" valign="top">149</td>
<td align="center" valign="top">3.4%</td>
<td align="center" valign="top">21.5%</td>
</tr>
<tr>
<td align="left" valign="top">Methods</td>
<td align="center" valign="top">106</td>
<td align="center" valign="top">23.6%</td>
<td align="center" valign="top">92.6%</td>
</tr>
<tr>
<td align="left" valign="top">RDC</td>
<td align="center" valign="top">158</td>
<td align="center" valign="top">4.4%</td>
<td align="center" valign="top">15.2%</td>
</tr>
<tr>
<td align="left" valign="top">Others</td>
<td align="center" valign="top">24</td>
<td align="center" valign="top">45.8%</td>
<td align="center" valign="top">7.7%</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="sec3_2">
<title>Key factors behind the citation behavior: discipline and the citing-cited year gap</title>
<p>We examined two key factors behind the citation contexts of data papers (i.e., discipline and citation year gap), which are reported in this section.</p>
<sec id="sec3_2_1">
<title>Discipline of data papers</title>
<p>We analysed the relationship between citation contexts and the discipline of data papers based on the ANZSRC classification system. <xref ref-type="table" rid="T4">Table 4</xref> presents the results, which show a largely similar share of data citations to data papers from all domains. The higher share of data citations in the category of Soc (social sciences and humanities) is largely attributable to the single paper <italic>&#x2018;Residential electric vehicle charging datasets from apartment buildings&#x2019;</italic> (<xref rid="R16" ref-type="bibr">S&#x00F8;rensen et al., 2021</xref>), for which all citations are related to data. We believe the categorization of this paper is highly debatable, a point that is further highlighted in our discussion.</p>
<table-wrap id="T4">
<label>Table 4.</label>
<caption><p>Data citation context by the discipline of data papers</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top">Discipline</th>
<th align="center" valign="top"># Sentences</th>
<th align="center" valign="top">% Data citation</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Soc</td>
<td align="center" valign="top">75</td>
<td align="center" valign="top">14.7%</td>
</tr>
<tr>
<td align="left" valign="top">Phy</td>
<td align="center" valign="top">76</td>
<td align="center" valign="top">11.8%</td>
</tr>
<tr>
<td align="left" valign="top">Bio</td>
<td align="center" valign="top">72</td>
<td align="center" valign="top">11.1%</td>
</tr>
<tr>
<td align="left" valign="top">Tech</td>
<td align="center" valign="top">101</td>
<td align="center" valign="top">9.9%</td>
</tr>
<tr>
<td align="left" valign="top">Env</td>
<td align="center" valign="top">113</td>
<td align="center" valign="top">8.8%</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="sec3_2_2">
<title>Year gap of citing paper</title>
<p>We further investigated how the year gap between data papers and citing papers could be related to the citation contexts. The results in <xref ref-type="table" rid="T5">Table 5</xref> suggest that a data paper becomes more likely to be cited as research data as the year gap widens, although the increase is not monotonic. However, even more than five years after the publication of the data paper, the share of data citations is still only 20%. Our findings resemble the citation contexts of method papers and method objects (<xref rid="R8" ref-type="bibr">Li, 2021</xref>; <xref rid="R15" ref-type="bibr">Small, 2018</xref>), where older research instruments are more likely to be cited as instruments, possibly because of their demonstrated validity.</p>
<table-wrap id="T5">
<label>Table 5.</label>
<caption><p>Data citation context by citation year gap between citing and cited documents</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top">Year gap</th>
<th align="center" valign="top"># Sentences</th>
<th align="center" valign="top">% Data citation</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Year 0-1</td>
<td align="center" valign="top">119</td>
<td align="center" valign="top">5.0%</td>
</tr>
<tr>
<td align="left" valign="top">Year 2-3</td>
<td align="center" valign="top">145</td>
<td align="center" valign="top">15.2%</td>
</tr>
<tr>
<td align="left" valign="top">Year 4-5</td>
<td align="center" valign="top">121</td>
<td align="center" valign="top">9.1%</td>
</tr>
<tr>
<td align="left" valign="top">Year > 5</td>
<td align="center" valign="top">45</td>
<td align="center" valign="top">20.0%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To further understand how the two variables are statistically related to the outcome, we constructed a logistic regression model with the above variables and the publication year of the data papers as <italic><bold>independent variables</bold></italic> and whether a sentence is data-related or not as the <italic><bold>categorical dependent variable</bold>.</italic> The summary of the regression model is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, with the y-axis showing how much each category contributes to the likelihood of a sentence being a data-related citation. It shows that, compared with the domain of biomedical science (the baseline, which is not shown in the graph), all other domains are at least equally likely to have data-related citations, although none of them is significantly more likely to do so. Similarly, compared with 2016, data papers published in 2017 and 2021 are significantly less likely to be cited as data, which is likely due to the smaller number of data points in these two years. We also did not find the year gap between citing and cited papers to be a significant factor for data citation.</p>
<fig id="F1">
<label>Figure 1.</label>
<caption><p>Summary of the statistical model</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c102-fig1.jpg"><alt-text>none</alt-text></graphic>
</fig>
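<p>To illustrate how a coefficient is read against a baseline category, consider a deliberately simplified case with a single dummy predictor, where the logistic regression coefficient equals the log odds ratio between the two groups. The sketch below is not the multi-predictor model reported above. The cell counts are an assumption, reconstructed from the rounded shares in <xref ref-type="table" rid="T5">Table 5</xref> (6 of 119 sentences at a gap of 0-1 years and 9 of 45 at a gap of more than five years), and treating the 0-1 year group as the baseline is likewise an illustrative choice.</p>
<preformat preformat-type="code">
```python
import math

# Counts reconstructed from rounded percentages in Table 5 (an assumption).
baseline = {"data": 6, "total": 119}   # year gap 0-1, used here as the baseline
late = {"data": 9, "total": 45}        # year gap of more than five years


def odds(group):
    """Odds that a sentence in the group is a data-related citation."""
    return group["data"] / (group["total"] - group["data"])


def log_odds_ratio(group, reference):
    """Coefficient of the group dummy in a one-predictor logistic regression."""
    return math.log(odds(group) / odds(reference))


coef = log_odds_ratio(late, baseline)
# Large-sample standard error of a log odds ratio from the four cell counts.
cells = (
    late["data"], late["total"] - late["data"],
    baseline["data"], baseline["total"] - baseline["data"],
)
se = math.sqrt(sum(1 / n for n in cells))
print(round(coef, 2), round(math.exp(coef), 2), round(se, 2))
```
</preformat>
<p>A raw two-group comparison of this kind does not adjust for discipline or publication year, so it can differ from the corresponding estimate in a model that includes those variables.</p>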
</sec>
</sec>
</sec>
<sec id="sec4">
<title>Discussion and conclusion</title>
<p>This short paper presents preliminary findings from our project to understand the citation contexts of data papers in English-language research articles. We examined the general patterns of whether data papers are cited as research data, an important argument to support the usefulness of this new genre, with special focus on some key factors behind the citation behavior, such as the discipline of the data paper and citation time gap. Our results reconfirm existing findings that data papers can be cited for various reasons beyond just those related to research data (<xref rid="R6" ref-type="bibr">Jiao &#x0026; Darch, 2020</xref>; <xref rid="R20" ref-type="bibr">Thelwall, 2020</xref>). However, the fact that the majority of data papers are not cited as research data raises a critical concern about whether data papers can help to give the proper credit (i.e., credit of creating and publishing research data) to the authors.</p>
<p>Additionally, building upon existing efforts, our paper strives to offer a more comprehensive empirical investigation on factors behind the data paper citation behavior. We find that the general pattern stays valid across knowledge domains and does not change significantly as the year gap between the citing and cited documents widens.</p>
<p>Beyond our major findings, our research design and process also reveal some key issues in the infrastructure that supports data publication. One critical issue is the identification of the discipline of data papers. While many data papers are published in multidisciplinary data journals, such as <italic>Data in Brief</italic> and <italic>Scientific Data</italic>, the journal-level classification system in many scholarly databases (such as the Web of Science) is inadequate for determining the discipline of such publications. Even when using a novel paper-level classification, we find cases where the classification may not be accurate, which could have a strong impact on future quantitative studies on this topic. We argue that establishing a more robust and accurate system to evaluate the discipline of data papers and research data is a critical next step, and that this system should consider not just paper metadata but also the authors&#x2019; affiliations and other attributes of the research datasets to achieve better performance.</p>
<p>In the next step of the project, we will expand the presented investigation by using a larger sample size as well as considering more factors to explain the citation behavior of data papers, such as the discipline of the citing paper and whether the citation is from the data paper authors themselves (i.e., self-citation). We expect that these extra factors will have strong impacts on how a data paper is understood and discussed in citing publications. In addition, we will also use advanced statistical models to understand more factors behind the citation behavior of data papers, which will contribute to the literature of citation context analysis.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="R1"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chavan</surname><given-names>V.</given-names></name><name><surname>Penev</surname><given-names>L.</given-names></name></person-group><year>2011</year><article-title>The data paper: A mechanism to incentivize data publishing in biodiversity science</article-title><source>BMC Bioinformatics</source><volume>12</volume><fpage>1</fpage><lpage>12</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1186/1471-2105-12-S15-S2">https://doi.org/10.1186/1471-2105-12-S15-S2</ext-link></element-citation></ref>
<ref id="R2"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>P.Y.</given-names></name><name><surname>Li</surname><given-names>K.</given-names></name><name><surname>Jiao</surname><given-names>C.</given-names></name></person-group><year>2022</year><article-title>A preliminary analysis of geography of collaboration in data papers by S&#x0026;T capacity index</article-title><source>Proceedings of the Association for Information Science and Technology</source><volume>59</volume><issue>1</issue><fpage>642</fpage><lpage>644</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1002/pra2.676">https://doi.org/10.1002/pra2.676</ext-link></element-citation></ref>
<ref id="R3"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Fu</surname><given-names>J.</given-names></name><name><surname>Tian</surname><given-names>L.</given-names></name><name><surname>Zhang</surname><given-names>C.</given-names></name><name><surname>Li</surname><given-names>J.</given-names></name></person-group><year>2023</year><article-title>Opening research data contributes to the citations of related research articles: Evidence from Data in Brief</article-title><source>Learned Publishing</source><volume>36</volume><issue>3</issue><fpage>426</fpage><lpage>438</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1002/leap.1551">https://doi.org/10.1002/leap.1551</ext-link></element-citation></ref>
<ref id="R4"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gorgolewski</surname><given-names>K.J.</given-names></name><name><surname>Margulies</surname><given-names>D.S.</given-names></name><name><surname>Milham</surname><given-names>M.P.</given-names></name></person-group><year>2013</year><article-title>Making data sharing count: A publication-based solution</article-title><source>Frontiers in Neuroscience</source><volume>7</volume><fpage>9</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fnins.2013.00009">https://doi.org/10.3389/fnins.2013.00009</ext-link></element-citation></ref>
<ref id="R5"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Herzog</surname><given-names>C.</given-names></name><name><surname>Hook</surname><given-names>D.</given-names></name><name><surname>Konkiel</surname><given-names>S.</given-names></name></person-group><year>2020</year><article-title>Dimensions: Bringing down barriers between scientometricians and data</article-title><source>Quantitative Science Studies</source><volume>1</volume><issue>1</issue><fpage>387</fpage><lpage>395</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1162/qss_a_00020">https://doi.org/10.1162/qss_a_00020</ext-link></element-citation></ref>
<ref id="R6"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jiao</surname><given-names>C.</given-names></name><name><surname>Darch</surname><given-names>P.T.</given-names></name></person-group><year>2020</year><article-title>The role of the data paper in scholarly communication</article-title><source>Proceedings of the Association for Information Science and Technology</source><volume>57</volume><issue>1</issue><fpage>e316</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1002/pra2.316">https://doi.org/10.1002/pra2.316</ext-link></element-citation></ref>
<ref id="R7"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kratz</surname><given-names>J.</given-names></name><name><surname>Strasser</surname><given-names>C.</given-names></name></person-group><year>2014</year><article-title>Data publication consensus and controversies</article-title><source>F1000Research</source><volume>3</volume><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.12688/f1000research.3979.3">https://doi.org/10.12688/f1000research.3979.3</ext-link></element-citation></ref>
<ref id="R8"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>K.</given-names></name></person-group><year>2021</year><article-title>The re-instrumentalization of the Diagnostic and Statistical Manual of Mental Disorders (DSM) in psychological publications: A citation context analysis</article-title><source>Quantitative Science Studies</source><volume>2</volume><issue>2</issue><fpage>678</fpage><lpage>697</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1162/qss_a_00124">https://doi.org/10.1162/qss_a_00124</ext-link></element-citation></ref>
<ref id="R9"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>K.</given-names></name><name><surname>Jiao</surname><given-names>C.</given-names></name></person-group><year>2022</year><article-title>The data paper as a sociolinguistic epistemic object: A content analysis on the rhetorical moves used in data paper abstracts</article-title><source>Journal of the Association for Information Science and Technology</source><volume>73</volume><issue>6</issue><fpage>834</fpage><lpage>846</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1002/asi.24585">https://doi.org/10.1002/asi.24585</ext-link></element-citation></ref>
<ref id="R10"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Li</surname><given-names>K.</given-names></name><name><surname>Huang</surname><given-names>P.P.</given-names></name><name><surname>Jeng</surname><given-names>W.</given-names></name></person-group><year>2024</year><source>Dataset for "Are data papers cited as research data? Preliminary analysis on interdisciplinary data paper citations" [Data set]</source><publisher-name>Zenodo</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.13763303">https://doi.org/10.5281/zenodo.13763303</ext-link></element-citation></ref>
<ref id="R11"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Mart&#x00ED;n-Mart&#x00ED;n</surname><given-names>A.</given-names></name><name><surname>Thelwall</surname><given-names>M.</given-names></name><name><surname>Orduna-Malea</surname><given-names>E.</given-names></name><name><surname>Delgado L&#x00F3;pez-C&#x00F3;zar</surname><given-names>E.</given-names></name></person-group><year>2021</year><article-title>Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations&#x2019; COCI: a multidisciplinary comparison of coverage via citations</article-title><source>Scientometrics</source><volume>126</volume><issue>1</issue><fpage>871</fpage><lpage>906</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s11192-020-03690-4">https://doi.org/10.1007/s11192-020-03690-4</ext-link></element-citation></ref>
<ref id="R12"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>McHugh</surname><given-names>M.L.</given-names></name></person-group><year>2012</year><article-title>Interrater reliability: the kappa statistic</article-title><source>Biochemia Medica</source><volume>22</volume><issue>3</issue><fpage>276</fpage><lpage>282</lpage></element-citation></ref>
<ref id="R13"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Revelle</surname><given-names>W.</given-names></name></person-group><year>2024</year><article-title>Package &#x2018;psych&#x2019; version 2.4.3. The comprehensive R archive network</article-title><ext-link ext-link-type="uri" xlink:href="https://CRAN.R-project.org/package=psych">https://CRAN.R-project.org/package=psych</ext-link></element-citation></ref>
<ref id="R14"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sch&#x00F6;pfel</surname><given-names>J.</given-names></name><name><surname>Farace</surname><given-names>D.</given-names></name><name><surname>Prost</surname><given-names>H.</given-names></name><name><surname>Zane</surname><given-names>A.</given-names></name></person-group><year>2020</year><article-title>Data papers as a new form of knowledge organization in the field of research data</article-title><source>Knowledge Organization</source><volume>46</volume><issue>8</issue><fpage>622</fpage><lpage>638</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5771/0943-7444-2019-8-622">https://doi.org/10.5771/0943-7444-2019-8-622</ext-link></element-citation></ref>
<ref id="R15"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Small</surname><given-names>H.</given-names></name></person-group><year>2018</year><article-title>Characterizing highly cited method and non-method papers using citation contexts: The role of uncertainty</article-title><source>Journal of Informetrics</source><volume>12</volume><issue>2</issue><fpage>461</fpage><lpage>480</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.joi.2018.03.007">https://doi.org/10.1016/j.joi.2018.03.007</ext-link></element-citation></ref>
<ref id="R16"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>S&#x00F8;rensen</surname><given-names>&#x00C5;.L.</given-names></name><name><surname>Lindberg</surname><given-names>K.B.</given-names></name><name><surname>Sartori</surname><given-names>I.</given-names></name><name><surname>Andresen</surname><given-names>I.</given-names></name></person-group><year>2021</year><article-title>Residential electric vehicle charging datasets from apartment buildings</article-title><source>Data in Brief</source><volume>36</volume><elocation-id>107105</elocation-id><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.dib.2021.107105">https://doi.org/10.1016/j.dib.2021.107105</ext-link></element-citation></ref>
<ref id="R17"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Stuart</surname><given-names>D.</given-names></name></person-group><year>2017</year><article-title>Data bibliometrics: Metrics before norms</article-title><source>Online Information Review</source><volume>41</volume><issue>3</issue><fpage>428</fpage><lpage>435</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1108/OIR-01-2017-0008">https://doi.org/10.1108/OIR-01-2017-0008</ext-link></element-citation></ref>
<ref id="R18"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tahamtan</surname><given-names>I.</given-names></name><name><surname>Bornmann</surname><given-names>L.</given-names></name></person-group><year>2019</year><article-title>What do citation counts measure? An updated review of studies on citations in scientific documents published between 2006 and 2018</article-title><source>Scientometrics</source><volume>121</volume><issue>3</issue><fpage>1635</fpage><lpage>1684</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s11192-019-03243-4">https://doi.org/10.1007/s11192-019-03243-4</ext-link></element-citation></ref>
<ref id="R19"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tenopir</surname><given-names>C.</given-names></name><name><surname>Allard</surname><given-names>S.</given-names></name><name><surname>Douglass</surname><given-names>K.</given-names></name><name><surname>Aydinoglu</surname><given-names>A.U.</given-names></name><name><surname>Wu</surname><given-names>L.</given-names></name><name><surname>Read</surname><given-names>E.</given-names></name><name><surname>Manoff</surname><given-names>M.</given-names></name><name><surname>Frame</surname><given-names>M.</given-names></name></person-group><year>2011</year><article-title>Data sharing by scientists: Practices and perceptions</article-title><source>PLoS ONE</source><volume>6</volume><issue>6</issue><fpage>e21101</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1371/journal.pone.0021101">https://doi.org/10.1371/journal.pone.0021101</ext-link></element-citation></ref>
<ref id="R20"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Thelwall</surname><given-names>M.</given-names></name></person-group><year>2020</year><article-title>Data in Brief: Can a mega-journal for data be useful?</article-title><source>Scientometrics</source><volume>124</volume><issue>1</issue><fpage>697</fpage><lpage>709</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s11192-020-03437-1">https://doi.org/10.1007/s11192-020-03437-1</ext-link></element-citation></ref>
</ref-list>
</back>
</article>