<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">IR</journal-id>
<journal-title-group>
<journal-title>Information Research</journal-title>
</journal-title-group>
<issn pub-type="epub">1368-1613</issn>
<publisher>
<publisher-name>University of Bor&#x00E5;s</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">ir30iConf47263</article-id>
<article-id pub-id-type="doi">10.47989/ir30iConf47263</article-id>
<article-categories>
<subj-group xml:lang="en">
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>An analysis of poet demographic and thematic diversity in a poetry collection for inclusive AI</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Choi</surname><given-names>Kahyun</given-names></name>
<xref ref-type="aff" rid="aff0001"/></contrib>
<contrib contrib-type="author"><name><surname>Kang</surname><given-names>Gyuri</given-names></name>
<xref ref-type="aff" rid="aff0002"/></contrib>
<aff id="aff0001"><bold>Kahyun Choi</bold> is an Assistant Professor in the School of Information Sciences at the University of Illinois at Urbana-Champaign. She earned her Ph.D. from the School of Information Sciences at the University of Illinois at Urbana-Champaign. Kahyun Choi&#x2019;s research interests involve the application of computational methods and machine learning algorithms to various modalities, including audio and text. She can be contacted at <email xlink:href="kahyun@illinois.edu">kahyun@illinois.edu</email></aff>
<aff id="aff0002"><bold>Gyuri Kang</bold> is a PhD student in Information Science at Indiana University Bloomington. Her research interests include digital environmental humanities, cultural analytics, and NLP. She can be contacted at <email xlink:href="gyukang@iu.edu">gyukang@iu.edu</email></aff>
</contrib-group>
<pub-date pub-type="epub"><day>06</day><month>05</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>30</volume>
<issue>i</issue>
<fpage>610</fpage>
<lpage>617</lpage>
<permissions>
<copyright-year>2025</copyright-year>
<copyright-holder>&#x00A9; 2025 The Author(s).</copyright-holder>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract xml:lang="en">
<title>Abstract</title>
<p><bold>Introduction.</bold> AI technologies, such as theme classification and named entity recognition, enhance digital library accessibility. However, they may introduce biases if training datasets lack adequate representation. For instance, prior AI models for poetry classification overlooked dataset diversity, raising concerns about representation. To address this issue, this study assesses the dataset representation and examines potential issues in AI model design for poetry collections.</p>
<p><bold>Method.</bold> We annotated and published the race and ethnicity of poets in an American poetry collection curated by <italic>poets.org</italic>, which was recently used to train a poetry theme classification system. We then examined the diversity of the collection using these annotations.</p>
<p><bold>Analysis.</bold> We compared the racial/ethnic composition of the collection to U.S. Census data and conducted group-exclusive top word analysis, popular theme analysis, and entropy-based analysis of theme distribution diversity to evaluate linguistic and thematic diversity.</p>
<p><bold>Results.</bold> Our findings indicate that most underrepresented groups are well represented in the collection, except for Latino/a/x American poets. Furthermore, we found that poems from underrepresented groups increase the collection&#x2019;s linguistic and thematic diversity.</p>
<p><bold>Conclusions.</bold> To design responsible AI that embraces diversity, it is essential to assess dataset representation and support non-standard English and diverse themes beyond those popular with the general population. </p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Artificial intelligence (AI) has unlocked the potential for enhanced recommendation and search services within online library collections. However, careless use of AI can introduce pitfalls, such as biases embedded in the collections upon which AI models are based, leading to biased model outcomes (<xref rid="R4" ref-type="bibr">Cordell, 2020</xref>). Indeed, collections often do not represent the general population accurately, causing models to discriminate against marginalized groups (<xref rid="R5" ref-type="bibr">D&#x2019;Ignazio &#x0026; Klein, 2023</xref>). For instance, the recent large language model, GPT-3, demonstrated biases against Muslims and other marginalized groups (<xref rid="R2" ref-type="bibr">Bommasani et al., 2021</xref>). Thus, investing extra effort and attention in data collection for AI models, with an emphasis on equity and representation, is crucial (<xref rid="R10" ref-type="bibr">Jo &#x0026; Gebru, 2020</xref>). Moreover, it is critical to assess the equity and diversity of pre-existing collections before use.</p>
<p>Addressing this challenge becomes urgent in the poetry domain as readership rises and the application of AI in poetry analysis expands. A National Endowment for the Arts survey revealed a 76% increase in US poetry readership from 2012 to 2017 (<xref rid="R9" ref-type="bibr">Iyengar et al., 2018</xref>). A follow-up survey in 2022 shows similarly high engagement, with 11.5% of US adults engaging with poetry through reading or listening (<xref rid="R8" ref-type="bibr">Iyengar, 2023</xref>). Alongside this growth, various AI systems for improving the accessibility of poetry collections, such as poetry theme classification, have been developed (<xref rid="R16" ref-type="bibr">Rakshit et al., 2015</xref>; <xref rid="R14" ref-type="bibr">Lou et al., 2015</xref>; <xref rid="R12" ref-type="bibr">Kaur &#x0026; Saini, 2017</xref>; <xref rid="R15" ref-type="bibr">Navarro-Colorado, 2018</xref>; <xref rid="R3" ref-type="bibr">Choi, 2023</xref>). Yet, these studies have not examined whether the collections were diverse enough to accurately represent the general population. Thus, it remains uncertain whether they adequately account for poems by poets from underrepresented groups.</p>
<p>As part of the <italic>&#x2018;unbiased AI for poetry analysis: toward equitable and diverse digital libraries&#x2019;</italic> project funded by the Institute of Museum and Library Services (IMLS), our study addresses this oversight by assessing <italic>poets.org&#x2019;s</italic> curated poem collection, with a focus on the race and ethnicity of the poets. We selected this collection because it was used to train a theme classification system in one of the most recent AI systems for poetry (<xref rid="R3" ref-type="bibr">Choi, 2023</xref>). Specifically, we compared the U.S. Census population data on race and ethnicity with those of our poem collection to determine whether the collection accurately reflects the general population. Furthermore, we analysed prevalent words and themes in poems written by these groups and investigated how the works of underrepresented groups contribute to the word and theme diversity of the poem collection. We also identified potential issues with AI systems that are developed mostly from poems by dominant groups, while works from underrepresented groups are disregarded as outliers. While our primary focus is on this poetry collection, we suggest that our methodology for assessing the demographic representativeness of a collection can be applied to other literary genres.</p>
</sec>
<sec id="sec2">
<title>Diversity analysis</title>
<sec id="sec2_1">
<title>Data collection and pre-processing</title>
<p>We utilized the same collection of poems retrieved in October 2022 that Choi (<xref rid="R3" ref-type="bibr">2023</xref>) used to train their theme classification system. The poems are from <italic>poets.org</italic>, which is managed by the Academy of American Poets, a nonprofit charitable organization dedicated to fostering American poets and poetry. This collection features a wide variety of poems, each accompanied by several descriptive tags. We excluded audio-only entries, duplicates, and excessively short poems, resulting in 9,445 works. As our project focuses on modern and contemporary American poetry, we selected 8,912 poems published after 1890, covering the era from Emily Dickinson and Walt Whitman to 2022. The list of poets, categorized by race and ethnicity, is available at <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.6084/m9.figshare.25572459.v1">https://doi.org/10.6084/m9.figshare.25572459.v1</ext-link>.</p>
<p>We identified the racial and ethnic groups of poets using the <italic>&#x2018;occasions&#x2019;</italic> field in the poets.org collection, especially when the occasion is relevant to specific racial and ethnic groups. To align with US Census racial and ethnic categories, we selected the following occasions: &#x2018;Asian/Pacific American Heritage Month&#x2019; for Asian/Pacific Americans (APA), &#x2018;Black History Month&#x2019; for African Americans (AA), &#x2018;Hispanic Heritage Month&#x2019; for Latino/a/x Americans (LXA), and &#x2018;Native American Heritage Month&#x2019; for Native Americans (NA). To better follow US Census categories, we distinguished between Asians and Pacific Islanders by manually identifying poems by Pacific Islanders through a review of the descriptions of the poems. Therefore, in this paper, APA-AA represents Asian Americans without Pacific Islanders, while APA-PA denotes Pacific Islanders exclusively. In our study, &#x2018;Others&#x2019; or &#x2018;General&#x2019; poems represent works by poets who are not associated with the specific occasions we use to identify the underrepresented groups. Additionally, we reassigned 329 poets to specific underrepresented groups after reviewing their biography pages on poets.org; however, these biographies are not detailed enough to distinguish the mixed-race identities recognized by the US Census. Upon determining the racial and ethnic categories of the poets, we organized their poems into the corresponding categories. However, we acknowledge that our grouping strategy may overlook some poems by underrepresented groups, despite its comprehensiveness.</p>
<fig id="F1">
<label>Figure 1.</label>
<caption><p>Representation and trends of poet groups in the United States: a) cumulative count of poems; b) yearly ratio of poems by underrepresented groups and general poets since 2000; c) composition of poems by group; d) U.S. population composition</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c51-fig1.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
<sec id="sec2_2">
<title>Representation analysis of poems by groups</title>
<p>We examined the cumulative percentage of poems written by poets from underrepresented groups, by other poets, and both combined over time (see <xref ref-type="fig" rid="F1">Figure 1-a</xref>). The year indicates either the date of original publication or the date when <italic>poets.org</italic> made the work available on their website. In the collection, there is an increase until the mid-1920s, followed by a plateau until the early 2000s, and then a sharp upward trend. This pattern can be understood in the context of copyright laws; poems published before 1929 are copyright-free and can be published without permission, while those released after 1929 can only be published on the web with permission. Since its launch in 1996, poets.org has published approximately 80% of poems that are copyright-protected with permission. The scarcity of poems between the mid-1920s and early 2000s may be attributed to copyright issues or to the original publication year being overridden by the year of publication on poets.org. Over the last two decades, we can observe rapid growth in both general poems and those by underrepresented groups; notably, the latter have grown sharply in the last decade, with their proportion reaching about half in recent years (see <xref ref-type="fig" rid="F1">Figure 1-b</xref>), indicating <italic>poets.org&#x2019;s</italic> commitment to equity and diversity.</p>
<p>Furthermore, we compared the composition of poems by individual groups with U.S. population demographics (<xref rid="R18" ref-type="bibr">U.S. Census Bureau, 2022</xref>) to evaluate how well poems from underrepresented groups are represented in the dataset. Figures 1-c and 1-d show that most underrepresented groups have higher representation in the collection than in the U.S. population, except for LXA: while the LXA population in the U.S. accounts for 19.1%, their representation in the collection is only 4.5%. This gap might be due to the prevalence of Spanish in the U.S.: 13% of Americans speak Spanish at home (<xref rid="R6" ref-type="bibr">Dietrich &#x0026; Hernandez, 2022</xref>), and it remains the most popular second language (<xref rid="R1" ref-type="bibr">American Academy of Arts &#x0026; Sciences, 2016</xref>). Thus, a substantial portion of the American population may prefer reading poetry in Spanish. However, since English is the primary language of <italic>poets.org</italic>, most works are translated into English, which may affect the representation of LXA poets in the collection. Nonetheless, increasing the number of LXA poems, whether in the original language or translated, would further enhance the collection&#x2019;s already strong diversity and equity. This enhancement is also essential for reducing potential biases in AI models for poetry caused by data imbalance.</p>
</sec>
<sec id="sec2_3">
<title>Groupwise exclusive top word analysis</title>
<p><xref ref-type="fig" rid="F2">Figure 2</xref> shows the top 10 words exclusive to each group, illustrating the word diversity and cultural depth each group contributes. AA poems feature colloquial language reflective of African American Vernacular English (AAVE) (<xref rid="R13" ref-type="bibr">Khera, 2021</xref>), with terms such as &#x2018;lawd,&#x2019; &#x2018;hyeah,&#x2019; &#x2018;souf,&#x2019; &#x2018;lovah,&#x2019; and &#x2018;whah,&#x2019; which mean &#x2018;Lord,&#x2019; &#x2018;hear,&#x2019; &#x2018;south,&#x2019; &#x2018;lover,&#x2019; and &#x2018;where,&#x2019; respectively. APA-AA poems are rich in cultural references and names common in Asian contexts, such as &#x2018;Shiratama&#x2019; (a Japanese dessert), &#x2018;Chang&#x2019; (a family name in Korea or China), &#x2018;lola&#x2019; (grandmother in Tagalog), as well as names such as &#x2018;Acequia.&#x2019; APA-PA poems prominently feature words from indigenous languages, including Hawaiian Pidgin (Roberts, 1995), such as &#x2018;nalani,&#x2019; &#x2018;kai,&#x2019; &#x2018;huki,&#x2019; and &#x2018;olelo,&#x2019; meaning &#x2018;the heavens,&#x2019; &#x2018;sea,&#x2019; &#x2018;to pull,&#x2019; and &#x2018;language,&#x2019; respectively, along with names of places where the poets live, such as &#x2018;guam&#x2019; and &#x2018;hawai.&#x2019; LXA poems contain many Spanish words, including &#x2018;tata,&#x2019; &#x2018;une,&#x2019; &#x2018;alabanza,&#x2019; and &#x2018;templo,&#x2019; which translate to &#x2018;grandfather,&#x2019; &#x2018;article,&#x2019; &#x2018;praise,&#x2019; and &#x2018;temple,&#x2019; respectively. Finally, NA poems incorporate historically significant words that describe their tribes or towns, such as &#x2018;Spavinaw,&#x2019; &#x2018;Shawnee,&#x2019; and &#x2018;Anishinaabeg,&#x2019; as well as terms like &#x2018;chieftain&#x2019; and &#x2018;clans&#x2019; to depict their societal organization.
The inclusion of non-standard English and foreign words underscores the diversity that these groups bring to American poetry. However, this diversity also highlights the potential limitations of linguistic analysis tools, including named entity recognition, when applied to American poetry, especially if they were not trained on non-standard English varieties such as Hawaiian Pidgin or African American Vernacular English, or on foreign terms and languages.</p>
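<p>As an illustration only, a group-exclusive top word analysis of this kind could be sketched as follows. The function name <monospace>exclusive_top_words</monospace>, the assumption that poems are already tokenized, and the ranking of exclusive words by raw frequency are our assumptions for this sketch rather than a description of the exact procedure used in the study; the tokens are made up.</p>
<preformat>
```python
from collections import Counter


def exclusive_top_words(group_tokens, k=10):
    """Return, per group, the k most frequent words found in no other group.

    group_tokens maps a group label to a flat list of tokens
    (hypothetical input format, assumed to be tokenized already).
    """
    counts = {group: Counter(tokens) for group, tokens in group_tokens.items()}
    # For every word, count how many groups use it at least once.
    group_presence = Counter(word for c in counts.values() for word in c)
    result = {}
    for group, c in counts.items():
        # Keep only the words that appear in exactly one group (this one).
        exclusive = Counter(
            {word: n for word, n in c.items() if group_presence[word] == 1}
        )
        result[group] = [word for word, _ in exclusive.most_common(k)]
    return result


# Made-up toy tokens, purely illustrative (not data from the collection).
toy = {
    "group_a": ["sea", "sea", "moon", "river"],
    "group_b": ["moon", "stone", "stone", "stone", "field"],
}
top = exclusive_top_words(toy, k=2)
print(top)
```
</preformat>
<p>In this toy input, the word shared by both groups is dropped, and each group keeps only its own most frequent remaining words.</p>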
<fig id="F2">
<label>Figure 2.</label>
<caption><p>Word clusters of individual groups</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c51-fig2.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
<sec id="sec2_4">
<title>Groupwise popular theme analysis</title>
<p>To examine the similarities and differences among groups, we first identified each group&#x2019;s top 10 most popular themes. To further assess the thematic diversity that each group contributes to the collection in comparison with general poems, we then subtracted the theme proportions found in the general poems from those in each specific group, ranked these differences, and selected the top 10 themes that were most distinctively prevalent in each group. <xref ref-type="table" rid="T1">Table 1</xref> presents these themes and their corresponding percentages: the numbers in parentheses in the first five columns represent the proportion of each theme within a group, while the numbers in the last four columns show the difference from the corresponding proportions in general poems.</p>
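<p>The subtraction and ranking step can be sketched as below. The function name <monospace>relatively_popular_themes</monospace>, the treatment of themes missing from the general poems as having a proportion of zero, and the rounding to one decimal place are assumptions made for this sketch; the four input percentages are the nature and body values reported for the NA and General columns of Table 1.</p>
<preformat>
```python
def relatively_popular_themes(group_props, general_props, k=10):
    """Rank themes by (group proportion - general proportion), largest first.

    Both arguments map theme names to percentages. Themes missing from
    general_props are treated as 0.0 (an assumption of this sketch).
    """
    diffs = {
        theme: round(prop - general_props.get(theme, 0.0), 1)
        for theme, prop in group_props.items()
    }
    ranked = sorted(diffs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Percentages for two themes, taken from the NA and General columns of Table 1.
na_props = {"nature": 22.1, "body": 13.4}
general_props = {"nature": 9.6, "body": 7.9}

ranked = relatively_popular_themes(na_props, general_props)
print(ranked)
```
</preformat>
<p>For these two themes the differences are 12.5 and 5.5 percentage points, matching the values listed for NA in the relatively popular columns of the table.</p>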
<p>The dominant themes in general poems, such as nature, love, death, body, and existential, illustrate a broad range of human emotions and experiences. Each underrepresented group has its distinct set of most popular themes, which only partially overlap with those of general poems, indicating their unique thematic focuses. Specifically, the NA group has only two overlapping themes and LXA has three, while AA and APA each have six. Furthermore, we examined the relatively popular themes in comparison to the general poems to understand the unique thematic focuses within individual groups. Themes such as America, ancestry, body, identity, and family are prevalent in all four groups or in three of the four. History, immigration, migration, and social justice are also prominent, each appearing in two groups. These common themes reveal how the groups&#x2019; experiences as underrepresented groups in America shape their poetry. Unique themes present in only one group include beauty, death, and slavery in AA; fathers, mothers, and politics in APA; memories and violence in LXA; and earth, environment, landscapes, language, and nature in NA. These highlight the unique cultural and historical characteristics of each group. In particular, NA poetry stands out with the most distinctive themes, reflecting their unique history as the original inhabitants prior to European arrival, and their strong connection to nature and the environment. The distinct sets of popular themes among underrepresented groups raise concerns regarding current theme classification systems. As theme sets are typically selected based on average theme popularity, as shown in Choi (<xref rid="R3" ref-type="bibr">2023</xref>), themes that are popular only within underrepresented groups are often overlooked. To create more equitable AI systems for poetry, we suggest developing multiple sets of classifiers: a generalized set and additional sets tailored to the unique themes of underrepresented groups.</p>
<table-wrap id="T1">
<label>Table 1.</label>
<caption><p>Top 10 most popular themes per group and relatively popular themes per group against general</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top" rowspan="2"></th>
<th align="center" valign="top" colspan="5">Top 10 Most Popular Themes</th>
<th align="center" valign="top" colspan="4">Top 10 Most Relatively Popular Themes</th>
</tr>
<tr>
<th align="center" valign="top">General</th>
<th align="center" valign="top">AA</th>
<th align="center" valign="top">APA</th>
<th align="center" valign="top">LXA</th>
<th align="center" valign="top">NA</th>
<th align="center" valign="top">AA</th>
<th align="center" valign="top">APA</th>
<th align="center" valign="top">LXA</th>
<th align="center" valign="top">NA</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">1</td>
<td align="center" valign="top">nature (9.6)</td>
<td align="center" valign="top">identity (15.2)</td>
<td align="center" valign="top">body (11.6)</td>
<td align="center" valign="top">body (14.3)</td>
<td align="center" valign="top">nature (22.1)</td>
<td align="center" valign="top">identity (11.0)</td>
<td align="center" valign="top">identity (6.8)</td>
<td align="center" valign="top">identity (9.4)</td>
<td align="center" valign="top">ancestry (15.9)</td>
</tr>
<tr>
<td align="left" valign="top">2</td>
<td align="center" valign="top">love (8.8)</td>
<td align="center" valign="top">body (12.3)</td>
<td align="center" valign="top">identity (10.9)</td>
<td align="center" valign="top">identity (13.6)</td>
<td align="center" valign="top">ancestry (17.8)</td>
<td align="center" valign="top">america (9.0)</td>
<td align="center" valign="top">family (6.0)</td>
<td align="center" valign="top">immigration (8.0)</td>
<td align="center" valign="top">nature (12.5)</td>
</tr>
<tr>
<td align="left" valign="top">3</td>
<td align="center" valign="top">death (8.4)</td>
<td align="center" valign="top">america (12.1)</td>
<td align="center" valign="top">family (9.3)</td>
<td align="center" valign="top">death (10.8)</td>
<td align="center" valign="top">body (13.4)</td>
<td align="center" valign="top">ancestry (7.6)</td>
<td align="center" valign="top">ancestry (5.9)</td>
<td align="center" valign="top">family (7.5)</td>
<td align="center" valign="top">environment (9.7)</td>
</tr>
<tr>
<td align="left" valign="top">4</td>
<td align="center" valign="top">body (7.9)</td>
<td align="center" valign="top">death (11.9)</td>
<td align="center" valign="top">death (9.0)</td>
<td align="center" valign="top">family (10.8)</td>
<td align="center" valign="top">landscapes (11.9)</td>
<td align="center" valign="top">history (5.6)</td>
<td align="center" valign="top">immigration (5.4)</td>
<td align="center" valign="top">america (7.0)</td>
<td align="center" valign="top">earth (8.6)</td>
</tr>
<tr>
<td align="left" valign="top">5</td>
<td align="center" valign="top">existential (7.6)</td>
<td align="center" valign="top">love (10.8)</td>
<td align="center" valign="top">nature (8.9)</td>
<td align="center" valign="top">america (10.1)</td>
<td align="center" valign="top">environment (11.9)</td>
<td align="center" valign="top">social justice (5.4)</td>
<td align="center" valign="top">america (4.1)</td>
<td align="center" valign="top">ancestry (6.4)</td>
<td align="center" valign="top">america (8.4)</td>
</tr>
<tr>
<td align="left" valign="top">6</td>
<td align="center" valign="top">self (6.2)</td>
<td align="center" valign="top">ancestry (9.4)</td>
<td align="center" valign="top">self (8.6)</td>
<td align="center" valign="top">immigration (8.5)</td>
<td align="center" valign="top">america (11.5)</td>
<td align="center" valign="top">body (4.4)</td>
<td align="center" valign="top">body (3.7)</td>
<td align="center" valign="top">body (6.4)</td>
<td align="center" valign="top">landscapes (7.7)</td>
</tr>
<tr>
<td align="left" valign="top">7</td>
<td align="center" valign="top">beauty (5.6)</td>
<td align="center" valign="top">self (9.1)</td>
<td align="center" valign="top">existential (7.7)</td>
<td align="center" valign="top">ancestry (8.3)</td>
<td align="center" valign="top">history (10.7)</td>
<td align="center" valign="top">death (3.5)</td>
<td align="center" valign="top">mothers (3.7)</td>
<td align="center" valign="top">social justice (4.6)</td>
<td align="center" valign="top">family (7.4)</td>
</tr>
<tr>
<td align="left" valign="top">8</td>
<td align="center" valign="top">animals (5.3)</td>
<td align="center" valign="top">nature (9.2)</td>
<td align="center" valign="top">ancestry (7.7)</td>
<td align="center" valign="top">memories (7.8)</td>
<td align="center" valign="top">family (10.7)</td>
<td align="center" valign="top">beauty (3.3)</td>
<td align="center" valign="top">migration (3.1)</td>
<td align="center" valign="top">violence (4.5)</td>
<td align="center" valign="top">language (7.3)</td>
</tr>
<tr>
<td align="left" valign="top">9</td>
<td align="center" valign="top">loss (5.2)</td>
<td align="center" valign="top">beauty (9.0)</td>
<td align="center" valign="top">america (7.2)</td>
<td align="center" valign="top">violence (7.3)</td>
<td align="center" valign="top">earth (10.7)</td>
<td align="center" valign="top">slavery (3.3)</td>
<td align="center" valign="top">politics (3.1)</td>
<td align="center" valign="top">memories (3.8)</td>
<td align="center" valign="top">history (7.2)</td>
</tr>
<tr>
<td align="left" valign="top">10</td>
<td align="center" valign="top">writing (4.5)</td>
<td align="center" valign="top">history (9.0)</td>
<td align="center" valign="top">animals (6.7)</td>
<td align="center" valign="top">loss (7.0)</td>
<td align="center" valign="top">language (10.3)</td>
<td align="center" valign="top">hope (3.1)</td>
<td align="center" valign="top">fathers (3.1)</td>
<td align="center" valign="top">migration (3.5)</td>
<td align="center" valign="top">body (5.5)</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="sec2_5">
<title>Entropy-based analysis of theme distribution diversity</title>
<p>We explore the theme diversity within each individual group of poets to assess their contribution to the collection&#x2019;s overall diversity. For this analysis, we use entropy, which has been widely used to assess the diversity of systems and environments based on the richness and evenness of values (<xref rid="R11" ref-type="bibr">Jost, 2006</xref>). Entropy is defined as <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mrow><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mi>log</mml:mi><mml:mi>p</mml:mi><mml:mfenced><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mstyle></mml:mrow></mml:math></inline-formula>, where <italic>H</italic> is the entropy, <italic>n</italic> is the number of themes, <italic>p(x<sub>i</sub>)</italic> is the proportion of the <italic>i</italic>-th theme within the poem group, and the summation is across all themes. A higher entropy suggests a greater number of associated themes per poem (richness) and/or a more equitable distribution among them (evenness).</p>
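<p>To make the definition concrete, the following minimal sketch computes the entropy of a theme distribution from theme counts. The function name <monospace>theme_entropy</monospace>, the normalization of counts so that proportions sum to one, and the base-2 logarithm are assumptions of this sketch, since the text does not state them; the counts are made up, so the output is not meant to reproduce the scores reported for the collection.</p>
<preformat>
```python
import math


def theme_entropy(theme_counts):
    """Entropy H = -sum(p_i * log2(p_i)) over the themes of one poem group.

    theme_counts maps theme names to non-negative counts. Counts are
    normalized to proportions summing to 1 (an assumption of this sketch).
    """
    total = sum(theme_counts.values())
    entropy = 0.0
    for count in theme_counts.values():
        if count == 0:
            continue  # a theme with zero proportion contributes nothing
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Made-up counts: same number of themes (richness), different evenness.
even_counts = {"theme_a": 5, "theme_b": 5, "theme_c": 5, "theme_d": 5}
skewed_counts = {"theme_a": 17, "theme_b": 1, "theme_c": 1, "theme_d": 1}

even_entropy = theme_entropy(even_counts)      # 2.0 bits: four equally likely themes
skewed_entropy = theme_entropy(skewed_counts)  # lower, one theme dominates
print(even_entropy, skewed_entropy)
```
</preformat>
<p>With the number of themes held fixed, the evenly distributed group scores higher than the skewed one, which is the evenness effect described above; adding more themes with non-zero proportions would raise the score through richness.</p>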
<p><xref ref-type="fig" rid="F3">Figure 3</xref> displays the cumulative distribution functions (CDFs) of all groups, which represent the evenness of the theme distribution. The x-axis represents the proportion of themes within each group, while the y-axis represents the cumulative probability. A curve that rises quickly to 1 indicates low evenness, as this suggests that a few themes dominate; the corresponding histogram would show a steep decline. Conversely, a curve that rises slowly and stops at a larger theme proportion indicates high evenness, suggesting a more uniform distribution of themes; the corresponding histogram would decrease more gradually. Besides the CDFs, the associated entropy scores and the average number of themes associated with each group, as a measure of richness, are presented in the legend of the figure. Because APA-PA is associated with fewer than half of the themes, it cannot be compared fairly with the other groups. We therefore merged it back into the APA category in this graph instead of distinguishing it from Asian Americans.</p>
<fig id="F3">
<label>Figure 3.</label>
<caption><p>Cumulative distribution function (CDF) of theme proportions across groups, with corresponding entropy scores and average number of associated themes per group</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c51-fig3.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>The analysis of entropy scores shows a consistent trend: underrepresented groups achieve higher scores than the General category, with scores of 10.06 for General, 11.27 for APA, and 11.64 for LXA. Furthermore, incorporating the underrepresented groups&#x2019; poems into the general collection increases the overall entropy by 0.47. To assess the contribution of richness to the entropy scores, we examined the average number of associated themes in each group. Compared to the General category&#x2019;s 2.69, poems from underrepresented groups tend to have more themes annotated, ranging between 0.77 and 1.24 more themes, which suggests that <italic>poets.org</italic> takes extra care in annotating these groups. While there is a correlation between entropy scores and thematic richness, the increase is also attributable to the evenness observed in the CDFs in <xref ref-type="fig" rid="F3">Figure 3</xref>. The CDFs of the underrepresented groups rise more gradually and stop at a larger theme proportion compared to the CDF for general poems. Similarly, although the effect is subtle, the CDF for all poems also ascends more slowly and stops at a slightly higher theme proportion, indicating greater evenness contributed by underrepresented groups. Overall, our findings indicate that poems by underrepresented groups contribute to the collection&#x2019;s higher thematic diversity, benefiting from <italic>poets.org&#x2019;s</italic> more thorough annotation of these poems and their more even distribution of theme proportions.</p>
</sec>
</sec>
<sec id="sec3">
<title>Conclusion</title>
<p>This paper evaluates <italic>poets.org&#x2019;s</italic> poetry collection for racial and ethnic representation, addressing a gap in previous AI and NLP studies on poetry, which did not assess their collections and thus potentially introduced biases against underrepresented groups. The collection generally reflects the demographics of the US population, with most categories having higher representation, except for LXA poems. The groupwise word and theme analyses show the diversity these groups bring to the collection, stemming from their culture and history. However, AI systems need to accommodate non-standard English and foreign terms to be more inclusive. Moreover, tailored theme sets for each group, rather than reliance on those for the average, could ensure more equitable and inclusive AI systems for poetry. This study focuses on racial and ethnic diversity, one of many facets of diversity. In future work, we will further investigate other aspects, such as gender and disability.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgments</title>
<p>This work was supported by RE-252382-OLS-22 from the Institute of Museum and Library Services (IMLS).</p>
</ack>
<ref-list>
<title>References</title>
<ref id="R1"><element-citation publication-type="book"><person-group person-group-type="author"><collab>American Academy of Arts &#x0026; Sciences</collab></person-group><year>2016</year><source>The state of languages in the U.S.: A statistical portrait</source><publisher-loc>Cambridge, MA</publisher-loc><publisher-name>Commission on Language Learning, American Academy of Arts &#x0026; Sciences</publisher-name></element-citation></ref>
<ref id="R2"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Bommasani</surname><given-names>R.</given-names></name><name><surname>Hudson</surname><given-names>D. A.</given-names></name><name><surname>Adeli</surname><given-names>E.</given-names></name><name><surname>Altman</surname><given-names>R.</given-names></name><name><surname>Arora</surname><given-names>S.</given-names></name><name><surname>von Arx</surname><given-names>S.</given-names></name><name><surname>Liang</surname><given-names>P.</given-names></name></person-group><year>2021</year><article-title>On the opportunities and risks of foundation models</article-title><source>arXiv preprint arXiv:2108.07258</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2108.07258">https://doi.org/10.48550/arXiv.2108.07258</ext-link></element-citation></ref>
<ref id="R3"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Choi</surname><given-names>K.</given-names></name></person-group><year>2023</year><article-title>Computational thematic analysis of poetry via bimodal large language models</article-title><source>Proceedings of the Association for Information Science and Technology</source><volume>60</volume><issue>1</issue><fpage>538</fpage><lpage>542</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1002/pra2.812">https://doi.org/10.1002/pra2.812</ext-link></element-citation></ref>
<ref id="R4"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Cordell</surname><given-names>R.</given-names></name></person-group><year>2020</year><source>Machine learning and libraries: a report on the state of the field</source><publisher-name>Library of Congress</publisher-name></element-citation></ref>
<ref id="R5"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>D&#x2019;Ignazio</surname><given-names>C.</given-names></name><name><surname>Klein</surname><given-names>L. F.</given-names></name></person-group><year>2023</year><source>Data feminism</source><publisher-name>MIT Press</publisher-name></element-citation></ref>
<ref id="R6"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Dietrich</surname><given-names>S.</given-names></name><name><surname>Hernandez</surname><given-names>E.</given-names></name></person-group><year>2022</year><article-title>Language use in the United States: 2019</article-title><source>American Community Survey Reports</source><comment>Retrieved from</comment><ext-link ext-link-type="uri" xlink:href="https://www.census.gov/content/dam/Census/library/publications/2022/acs/acs-50.pdf">https://www.census.gov/content/dam/Census/library/publications/2022/acs/acs-50.pdf</ext-link></element-citation></ref>
<ref id="R7"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Drager</surname><given-names>K.</given-names></name></person-group><year>2012</year><article-title>Pidgin and Hawai&#x2018;i English: an overview</article-title><source>International Journal of Language, Translation and Intercultural Communication</source><volume>1</volume><fpage>61</fpage><lpage>73</lpage></element-citation></ref>
<ref id="R8"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Iyengar</surname><given-names>S.</given-names></name></person-group><year>2023</year><article-title>New Survey Reports Size of Poetry&#x2019;s Audience</article-title><source>Streaming Included</source><comment>Accessed April 7, 2024</comment></element-citation></ref>
<ref id="R9"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Iyengar</surname><given-names>S.</given-names></name><name><surname>Nichols</surname><given-names>B.</given-names></name><name><surname>Shaffer</surname><given-names>P. M.</given-names></name><name><surname>Menzer</surname><given-names>M.</given-names></name><name><surname>Grantham</surname><given-names>E.</given-names></name><name><surname>Santoro</surname><given-names>H.</given-names></name><name><surname>Moyseowicz</surname><given-names>A.</given-names></name><name><surname>Hall</surname><given-names>E.</given-names></name></person-group><year>2018</year><article-title>US trends in arts attendance and literary reading: 2002&#x2013;2017</article-title></element-citation></ref>
<ref id="R10"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Jo</surname><given-names>E. S.</given-names></name><name><surname>Gebru</surname><given-names>T.</given-names></name></person-group><year>2020</year><comment>January</comment><article-title>Lessons from archives: Strategies for collecting sociocultural data in machine learning</article-title><source>Proceedings of the 2020 conference on fairness, accountability, and transparency</source><fpage>306</fpage><lpage>316</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3351095.3372829">https://doi.org/10.1145/3351095.3372829</ext-link></element-citation></ref>
<ref id="R11"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jost</surname><given-names>L.</given-names></name></person-group><year>2006</year><article-title>Entropy and diversity</article-title><source>Oikos</source><volume>113</volume><issue>2</issue><fpage>363</fpage><lpage>375</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1111/j.2006.0030-1299.14714.x">https://doi.org/10.1111/j.2006.0030-1299.14714.x</ext-link></element-citation></ref>
<ref id="R12"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Kaur</surname><given-names>J.</given-names></name><name><surname>Saini</surname><given-names>J. R.</given-names></name></person-group><year>2017</year><comment>February</comment><article-title>Punjabi poetry classification: the test of 10 machine learning algorithms</article-title><source>Proceedings of the 9th international conference on machine learning and computing</source><fpage>1</fpage><lpage>5</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3055635.3056589">https://doi.org/10.1145/3055635.3056589</ext-link></element-citation></ref>
<ref id="R13"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Khera</surname><given-names>T.</given-names></name></person-group><year>2021</year><article-title>What Makes African American Vernacular English Distinct and Complex</article-title><source>Dictionary.com</source><comment>February 21</comment></element-citation></ref>
<ref id="R14"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lou</surname><given-names>A.</given-names></name><name><surname>Inkpen</surname><given-names>D.</given-names></name><name><surname>Tanasescu</surname><given-names>C.</given-names></name></person-group><year>2015</year><article-title>Multilabel subject-based classification of poetry</article-title><source>Nature</source><volume>2218</volume><fpage>30</fpage><lpage>37</lpage></element-citation></ref>
<ref id="R15"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Navarro-Colorado</surname><given-names>B.</given-names></name></person-group><year>2018</year><article-title>On poetic topic modeling: extracting themes and motifs from a corpus of Spanish poetry</article-title><source>Frontiers in Digital Humanities</source><volume>5</volume><fpage>15</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fdigh.2018.00015">https://doi.org/10.3389/fdigh.2018.00015</ext-link></element-citation></ref>
<ref id="R16"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Rakshit</surname><given-names>G.</given-names></name><name><surname>Ghosh</surname><given-names>A.</given-names></name><name><surname>Bhattacharyya</surname><given-names>P.</given-names></name><name><surname>Haffari</surname><given-names>G.</given-names></name></person-group><year>2015</year><comment>December</comment><article-title>Automated analysis of Bangla poetry for classification and poet identification</article-title><source>Proceedings of the 12th international conference on natural language processing</source><fpage>247</fpage><lpage>253</lpage></element-citation></ref>
<ref id="R17"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Stevenson</surname><given-names>D.</given-names></name></person-group><year>2021</year><article-title>Application of Shannon Entropy Metrics to Cultural Diversity and Language Evolution</article-title><source>Academia Letters</source><volume>2</volume></element-citation></ref>
<ref id="R18"><element-citation publication-type="other"><person-group person-group-type="author"><collab>U.S. Census Bureau</collab></person-group><year>2022</year><article-title>Race and Hispanic origin</article-title><comment>Retrieved from</comment><ext-link ext-link-type="uri" xlink:href="https://www.census.gov/quickfacts/fact/table/US/PST045222">https://www.census.gov/quickfacts/fact/table/US/PST045222</ext-link></element-citation></ref>
</ref-list>
</back>
</article>