<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">IR</journal-id>
<journal-title-group>
<journal-title>Information Research</journal-title>
</journal-title-group>
<issn pub-type="epub">1368-1613</issn>
<publisher>
<publisher-name>University of Bor&#x00E5;s</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">ir30iConf47575</article-id>
<article-id pub-id-type="doi">10.47989/ir30iConf47575</article-id>
<article-categories>
<subj-group xml:lang="en">
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>The role of ontologies in machine learning: a case study of gene ontology</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Liu</surname><given-names>Qiaoyi</given-names></name>
<xref ref-type="aff" rid="aff0001"/></contrib>
<contrib contrib-type="author"><name><surname>Qin</surname><given-names>Jian</given-names></name>
<xref ref-type="aff" rid="aff0002"/></contrib>
<aff id="aff0001"><bold>Qiaoyi Liu</bold> is a Ph.D. student in Information Science and Technology at Syracuse University. Her research interests are in knowledge organization (KO) and science of science (SoS). Especially, she studies biological knowledge representation and construction of ontologies guided by classification theories and semantic measurements. She is interested in knowledgebases exploited by ML models and trustworthy LLMs to generate knowledge for bioinformatics and computational biology research. She can be contacted at <email xlink:href="qliu11@syr.edu">qliu11@syr.edu</email></aff>
<aff id="aff0002"><bold>Jian Qin</bold> is Professor of the iSchool at Syracuse University. She conducts research in metadata, knowledge modelling and representation, ontologies, research collaboration networks, research impact assessment, and data curation. Jian Qin directs a Metadata Lab, a research group focusing on big metadata analytics and knowledge modelling. Her research has received funding from US NSF, NIH, IMLS, among others. She publishes widely with more than 100 journal and conference papers in the field of information science, scientometrics, knowledge organization, and metadata and been invited to give keynotes, lectures, and presentations at conferences and institutions inside and outside of the U.S. She is the co-author of the book Metadata and co-editor for several special journal issues on knowledge discovery in databases and knowledge representation. She received the 2020 Frederick G. Kilgour Award for Research in Library and Information Technology. Jian Qin holds a Ph.D. from University of Illinois at Urbana-Champaign. She can be contacted at <email xlink:href="jqin@syr.edu">jqin@syr.edu</email></aff>
</contrib-group>
<pub-date pub-type="epub"><day>06</day><month>05</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>30</volume>
<issue>i</issue>
<fpage>108</fpage>
<lpage>122</lpage>
<permissions>
<copyright-year>2025</copyright-year>
<copyright-holder>&#x00A9; 2025 The Author(s).</copyright-holder>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract xml:lang="en">
<title>Abstract</title>
<p><bold>Introduction.</bold> Ontologies, as knowledgebases, have been widely applied in computational biology, where they are incorporated into ML models for purposes such as identifying disease-gene associations.</p>
<p><bold>Method.</bold> We conduct a case study using gene ontology (GO) annotation data and three ML models to replicate the prediction of genes that cause autism spectrum disorder (ASD).</p>
<p><bold>Analysis.</bold> Data were collected from GO and the Simons Foundation Autism Research Initiative (SFARI). The semantic similarities between the GO terms annotating gene products were calculated.</p>
<p><bold>Results.</bold> The best-performing model reached an AUC of .85, which suggests that GO annotation data can support reasonably accurate ASD disease-gene prediction. We also stress the importance of constructing knowledgebases that can adapt to LLMs and the role of LIS professionals in curating community knowledge for interoperability and reuse.</p>
<p><bold>Conclusion.</bold> Biomedical ontologies play a crucial role in the discovery of biomedical knowledge. Knowledge organization and computer science domains require more communication and synchronization in the face of emerging AI and ML technologies.</p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Ontologies as knowledge representation and organization systems have proliferated over the last thirty years, particularly in the biomedical field. As of this writing, BioPortal, the world&#x2019;s most comprehensive repository of biomedical ontologies, has registered 1,145 ontologies that contain a total of 15.6 million classes, 36,286 properties, and 99.5 million mappings (<xref rid="R20" ref-type="bibr">National Center for Biomedical Ontology, 2024</xref>). Researchers have used the vast amount of structured data resulting from these ontologies to explore disease-gene relations and to predict genes that carry a high risk for specific diseases such as Alzheimer&#x2019;s disease (<xref rid="R2" ref-type="bibr">Asif et al., 2018</xref>; <xref rid="R34" ref-type="bibr">Yang et al., 2021</xref>; <xref rid="R7" ref-type="bibr">Binder et al., 2022</xref>). The fast growth of biomedical data, together with parallel advances in machine learning (ML) algorithms and computational tools, has created the conditions necessary for scientists to respond to urgent problems such as genomic identification for disease research (<xref rid="R23" ref-type="bibr">Pi&#x00F1;ero et al., 2020</xref>; <xref rid="R16" ref-type="bibr">Krishnan et al., 2016</xref>), protein structure analysis (<xref rid="R28" ref-type="bibr">Radivojac et al., 2008</xref>), and phylogenetic inference (<xref rid="R3" ref-type="bibr">Ata et al., 2021</xref>).</p>
<p>The well-structured and actively curated data in knowledge organization systems provide the quality and structure desired by ML model developers, especially in the shift from model-centric artificial intelligence (AI) to data-centric AI (<xref rid="R21" ref-type="bibr">Ng, 2024</xref>). However, the information science, computer science, and related scientific communities have lacked communication and shared understanding about the role of knowledge organization systems (KOS), the work involved in building them, and how KOS may be transformed into knowledgebases for AI modelling and applications. The work involved in building KOS includes not only defining subject terms and mapping controlled terms to free-text keywords but, more importantly, defining semantic relations between entities/concepts, bridging raw and machine-readable data, and curating and organizing structured and interoperable data through human intervention and/or automatic methods. This type of knowledge work is essential in bringing scattered, natural language information into systematic representations of knowledge, i.e., data with interpretations (<xref rid="R14" ref-type="bibr">Haendel et al., 2018</xref>), that machines can process and even use for automated reasoning. Although ontologies and knowledgebases are being constructed across disciplinary fields, the theoretical and methodological aspects of these knowledge organization and representation systems have been largely confined to separate communities, which hinders the communication and sharing of research across the communities that study and build KOS.</p>
<p>The purpose of this paper is to elucidate the role of KOS as a trustworthy data source for AI/ML modelling and applications. We start by reviewing current developments in ontologies and knowledgebases, highlighting how knowledgebase structure and design can improve its function in an ML research workflow. Using a case study of gene ontology (GO), we illustrate the role that GO plays in building ML prediction models. In this case study, GO annotation data and supervised ML models are used to calculate the functional similarities of autism spectrum disorder (ASD) genes. The methods and results of the case study are presented in Section 3, followed by Sections 4 and 5, where we reflect on the practices and factors contributing to the quality of GO annotation data. We also discuss future directions for KOS research on processing multiple data formats (e.g., images) and data sources (e.g., disease ontologies and clinical databases) to accommodate AI/ML models.</p>
<sec id="sec1_1">
<title>Evolving knowledge organization systems</title>
<p>There are four main types of KOS. Ordered by level of sophistication, the simplest type is term lists, such as glossaries and dictionaries. The second type is metadata-like models, including gazetteers, directories, and authority files. Classification and categorization go to the next level of sophistication because of the embedded relations between concepts or classes in classification/categorization schemes, taxonomies, and subject headings. The relationship models are the most sophisticated among the four KOS types. Ontologies and semantic networks are the two members of the relationship model, which possess all traits of a KOS that one can hope for: explicitly representing concepts/entities with unambiguous terms while embedding relations between concepts/entities (<xref rid="R37" ref-type="bibr">Zeng, 2008</xref>). In the sense of representing a conceptual system via a logical theory, an ontology consists of an annotated and indexed set of formal propositions or assertions about things, a collection of assertions that is called a theory in logic (<xref rid="R13" ref-type="bibr">Guarino &#x0026; Giaretta, 1995</xref>). This special property of ontologies closely fits the knowledge representation ideals of AI, which emphasize sufficiently precise notation and the adequacy and expressiveness of representation schemes (<xref rid="R6" ref-type="bibr">Bench-Capon, 1990</xref>).</p>
<p>Ontologies sometimes are also called knowledgebases because they are essentially a collection of symbolic structures representing the world based on our cognition (<xref rid="R17" ref-type="bibr">Levesque &#x0026; Lakemeyer, 2022</xref>). Such representation is semantically rich with not only unambiguous vocabularies for entities and individuals in the entities, but also explicit relations between the entities that go beyond hierarchical and associative relations that are the only available relation types in many traditional KOS. Not all ontologies, however, can be called knowledgebases. For example, Schema.org is an ontology developed in collaboration among Google, Microsoft, Yahoo, and Yandex to be used for representing web content as structured data (Google et al., 2024). The purpose of Schema.org is to provide a standard representation scheme for classes of entities and properties these entities possess. Since it is just a representation scheme, it does not contain data (i.e., individual members of classes). Therefore, it is not a knowledgebase and cannot be used for modelling or reasoning purposes.</p>
<p>While many traditional KOS are designed neither for problem-solving nor as a data source for ML modelling, many ontologies in the biomedical domain have broken with this tradition and evolved into knowledgebases. For instance, gene ontology (GO) (<xref rid="R1" ref-type="bibr">Ashburner et al., 2000</xref>) and disease ontology (DO) (<xref rid="R5" ref-type="bibr">Bello et al., 2018</xref>) use axioms to model knowledge in order to define the semantic interpretation of the presented entities, rules, and class constraints and to present multiple relations between concepts. The knowledge organization and AI communities, two research fields that were once separate and communicated little, are now drawing closer and converging through advances in semantic web technologies and data science (<xref rid="R25" ref-type="bibr">Qin, 2020</xref>). This trend marks a change in knowledge organization practices. More importantly, it signals that knowledge organization is moving towards a more interoperable, practical, and heterogeneous identity that goes beyond the simple purpose of storing knowledge.</p>
<p>One application of biomedical ontologies is in disease-gene association discovery and identification. Advanced genome sequencing technologies have accelerated the exploration of genomic variations (<xref rid="R23" ref-type="bibr">Pi&#x00F1;ero et al., 2020</xref>) and the detection of genetic markers (<xref rid="R9" ref-type="bibr">Chang et al., 2024</xref>), generating vast amounts of data. These data are preserved and organized into KOS by biocurators who apply knowledge organization theories and practices, which creates semantically rich data sources to which ML algorithms can be applied to analyse larger and more complex data sets (<xref rid="R18" ref-type="bibr">Libbrecht &#x0026; Noble, 2015</xref>). Complex diseases with a strong genetic influence often have multiple aetiologies with the involvement of possibly hundreds of different genes. Supervised ML methods can trace hidden relationships among disease-causing genes in existing datasets to discriminate unknown disease genes from non-disease genes (<xref rid="R2" ref-type="bibr">Asif et al., 2018</xref>). This advancement plays a crucial role in disease diagnosis and forms the basis for clinical decision-making (<xref rid="R9" ref-type="bibr">Chang et al., 2024</xref>). This approach was once difficult to pursue because of the lack of: (a) structured KOS with semantically rich relationships between properties; and (b) powerful computational hardware with feasible models to analyse large heterogeneous data sets. Problem (a) has been greatly alleviated by making data FAIR (findable, accessible, interoperable, reusable) (<xref rid="R33" ref-type="bibr">Wilkinson et al., 2016</xref>) and by encoding languages (e.g., XML and JSON) for representing machine-readable knowledge and reflecting reality. Problem (b) is now being addressed by the numerous supervised and unsupervised ML models that are developed and applied in areas such as genetics and molecular science.</p>
</sec>
</sec>
<sec id="sec2">
<title>Case study: using GO annotation and ML models to identify autism spectrum disorder (ASD)</title>
<sec id="sec2_1">
<title>Case selection</title>
<sec id="sec2_1_1">
<title>Gene ontology (GO)</title>
<p>Fast-developing nucleotide sequencing techniques and gene expression analysis have pushed the biological community to establish a knowledge resource for the resulting massive data. Unlike in other STEM research fields, biological knowledge is explained less by mathematical equations and more through natural language (<xref rid="R22" ref-type="bibr">Pesquita et al., 2009</xref>). Gene ontology (GO, http://geneontology.org), constructed by the gene ontology consortium, has been crowned the &#x2018;GOld mine&#x2019;. It provides &#x201C;a comprehensive, structured, computer-accessible representation of gene function for genes from any cellular organism or virus.&#x201D; As of 2023, GO contained 43,303 biological terms as annotations to gene products, linked together by 88,099 relationships (<xref rid="R30" ref-type="bibr">The Gene Consortium, 2023</xref>). The GO knowledgebase represents a standardized controlled vocabulary that defines various components of molecular biology shared amongst life forms (<xref rid="R35" ref-type="bibr">Yousef et al., 2021</xref>). It has become a critical component of life science research, supporting the analysis of large-scale genomics data and biological systems (<xref rid="R10" ref-type="bibr">Duck et al., 2016</xref>), and is broadly used in research, clinical diagnosis, and industry. Over the years, many ML models and equations have been developed specifically for processing and using GO data. Considering its significance to biological research and the established computational techniques, we selected GO annotation data as the primary data source for our case study.</p>
</sec>
<sec id="sec2_1_2">
<title>GO construction and GO annotations</title>
<p>Bio-entities described in GO can be considered as knowledge units, or &#x2018;things&#x2019;, such as gene products. The entities in GO are structured as a directed acyclic graph (DAG) in which GO terms/annotations are represented as nodes and relationships between terms are represented as edges that follow certain directions and never form a closed loop (<xref rid="R2" ref-type="bibr">Asif et al., 2018</xref>). Each ontology term (called &#x2018;class&#x2019; in the field of ontology) represents a functional characteristic that can be attributed to a gene product (<xref rid="R30" ref-type="bibr">The Gene Consortium, 2023</xref>). Terms representing the &#x2018;things&#x2019; are organized into three categories: molecular function (MF), biological process (BP), and cellular component (CC). Each term is described by five required elements: a unique ID, term name, aspect (which category it belongs to), definition, and relationships to other terms. GO links the terms by using a set of triple statements, most commonly &#x2018;<italic>is_a</italic>&#x2019; or &#x2018;<italic>part_of</italic>&#x2019;, which stand for the class-subclass relationship and the part-whole relationship, respectively (see <xref ref-type="table" rid="T1">table 1</xref>). A GO annotation is an association between a specific gene product and a GO term and should be interpreted as a statement that the gene product possesses the functional characteristic represented by the GO term. Each GO annotation covers only one characteristic of the gene product. Therefore, a gene product can have multiple GO annotations. GO annotations, drawn from 173,000 scientific papers, are continually added to the knowledgebase. All annotations are supported by an evidence code, which describes the type of evidence, and a reference that lists a persistent identifier for tracing the source. For quality control, annotations are regularly reviewed, edited, or removed by biocurators or the GO user community (<xref rid="R30" ref-type="bibr">The Gene Consortium, 2023</xref>).</p>
<table-wrap id="T1">
<label>Table 1.</label>
<caption><p>Main term relations used in GO</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top"><bold>Relation</bold></th>
<th align="center" valign="top"><bold>Description</bold></th>
<th align="center" valign="top"><bold>Example</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top"><italic>is_a</italic></td>
<td align="left" valign="top">The basic structure of GO. If we say A <italic>is_a</italic> B, we mean that entity A is a subtype of entity B.</td>
<td align="left" valign="top">Mitotic cell cycle <italic>is_a</italic> cell cycle, or lyase activity <italic>is_a</italic> catalytic activity.</td>
</tr>
<tr>
<td align="left" valign="top"><italic>part_of</italic></td>
<td align="left" valign="top">The <italic>part of</italic> relation is used to represent part-whole relation. A <italic>part of</italic> relation would only be added between A and B if B is <bold>necessarily</bold> <italic>part of</italic> A: wherever B exists, it is as <italic>part of</italic> A, and the presence of the B implies the presence of A.</td>
<td align="left" valign="top">If a gene product X is annotated as located in the inner mitochondrial membrane and the ontology records a <italic>part of</italic> relation between inner mitochondrial membrane and mitochondrion, we can safely conclude that X is located in a mitochondrion.</td>
</tr>
<tr>
<td align="left" valign="top"><italic>has_part</italic></td>
<td align="left" valign="top">The logical complement to the <italic>part of</italic> relation is <italic>has part</italic>, which represents a part-whole relationship from the perspective of the parent. As with <italic>part of</italic>, the GO relation <italic>has part</italic> is only used in cases where A always has B as a part, i.e., where A necessarily <italic>has part</italic> B. If A exists, B will always exist; however, if B exists, we cannot say for certain that A exists. i.e., all A have part B; some B part of A.</td>
<td align="left" valign="top">A receptor tyrosine kinase activity <italic>has part</italic> ATP hydrolysis activity. However, it would not then be correct to group all annotations to kinase activity under ATPase activity.</td>
</tr>
<tr>
<td align="left" valign="top"><italic>regulates</italic></td>
<td align="left" valign="top">A relation that describes a case in which one process directly affects the manifestation of another process or quality, i.e., the former <italic>regulates</italic> the latter. The target of the regulation may be another process, e.g., regulation of a pathway or an enzymatic reaction, or it may be a quality, such as cell size or pH. Analogously to <italic>part of</italic>, this relation is used specifically to mean necessarily <italic>regulates</italic>: if both A and B are present, B always <italic>regulates</italic> A, but A may not always be regulated by B, i.e., all B <italic>regulate</italic> A; some A are <italic>regulated by</italic> B.</td>
<td align="left" valign="top">If gene product X is annotated as involved in a process that <italic>regulates</italic> glycolysis, it would not be correct to conclude that X participates in glycolysis.</td>
</tr>
</tbody>
</table>
</table-wrap>
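<p>To make the inference in the <italic>part_of</italic> example of table 1 concrete, the following minimal Python sketch propagates a direct annotation up a toy graph. It is an illustration under stated assumptions, not GO software: the dictionary <monospace>PARENTS</monospace> and the functions <monospace>ancestors</monospace> and <monospace>propagate</monospace> are hypothetical names, the graph holds only a handful of terms from table 1, and the links to the root terms are simplified.</p>
<preformat preformat-type="code"><![CDATA[
```python
# Toy fragment of a GO-like DAG. Edges point from child to parent and
# carry the relation type. Term names come from table 1; the direct
# links to the root terms are simplifying assumptions.
PARENTS = {
    "inner mitochondrial membrane": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "cellular component")],
    "mitotic cell cycle": [("is_a", "cell cycle")],
    "cell cycle": [("is_a", "biological process")],
}


def ancestors(term, relations=("is_a", "part_of")):
    """Return every term reachable from `term` via the given relations."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in PARENTS.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations):
    """Expand direct annotations with all is_a / part_of ancestors."""
    expanded = {}
    for gene_product, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term)
        expanded[gene_product] = full
    return expanded


# Hypothetical gene product X with one direct annotation.
direct = {"X": {"inner mitochondrial membrane"}}
print(sorted(propagate(direct)["X"]))
```
]]></preformat>
<p>Only <italic>is_a</italic> and <italic>part_of</italic> are followed in this sketch, because table 1 indicates that annotations should not be grouped across <italic>has_part</italic> or <italic>regulates</italic>.</p>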
</sec>
<sec id="sec2_1_3">
<title>Identify disease-gene associations in ASD</title>
<p>Disease-gene association prediction has been in the spotlight of bioinformatics research for over a decade. Scholars and doctors are eager to identify gene mutations, which are the primary causes of genetic diseases. Many large-scale genetic studies have provided candidate genes that may cause diseases. However, traditional disease-gene association analysis is labor-intensive and difficult to carry out because of genetic heterogeneity. Disease-causing genes were identified by statistical geneticists, who conducted linkage analysis and association studies on candidate genes ranked by their likelihood of being involved in a specific disease, i.e., by gene prioritization algorithms (<xref rid="R28" ref-type="bibr">Radivojac et al., 2008</xref>). With more heterogeneous data, it is no longer possible to conduct this type of analysis solely by human labor.</p>
<p>Nowadays, computational approaches can speed up this process and are capable of handling more complex diseases, such as autism spectrum disorder (ASD). ASD is a type of neurodevelopmental syndrome that affects about one in 100 people worldwide (<xref rid="R11" ref-type="bibr">Geschwind &#x0026; State, 2015</xref>). Strong evidence indicates that the causes include both genetic and environmental factors (<xref rid="R15" ref-type="bibr">Kim &#x0026; Leventhal, 2015</xref>). Diseases like ASD tend to present a highly heterogeneous genotype, which makes biological markers difficult to identify. Although ML methods can be used to identify these markers, their performance depends heavily on the size and quality of the available data (<xref rid="R2" ref-type="bibr">Asif et al., 2018</xref>). GO annotation data can therefore provide consistent and clean data for ML models to learn from, thus improving their general performance. Our purpose is not to develop a new model or equation, but to replicate this process and discuss data quality and ontology construction from an LIS perspective, outside of these two communities. We hope to highlight the significance of structured knowledgebases for computational biological research and their applicability to LLMs.</p>
</sec>
</sec>
<sec id="sec2_2">
<title>Functional similarity vs. semantic similarity</title>
<p>One major issue in identifying disease-gene associations is how to determine whether an unknown gene can cause outcomes similar to those of a known disease-associated gene. Biologists have found that functionally similar genes tend to contribute to similar phenotypes. For instance, etiologically relevant genes disrupted by genetic variants in ASD patients tend to aggregate in specific biological processes (<xref rid="R31" ref-type="bibr">Voineagu &#x0026; Eapen, 2013</xref>). This means that disease-causing genes and disease-candidate genes may belong to the same tree path in the GO DAG structure (<xref rid="R2" ref-type="bibr">Asif et al., 2018</xref>). Gene products that share highly overlapping GO terms may have higher functional similarities. If a gene product is on the same branch as gene products annotated with disease-related GO terms, the gene may have a higher probability of being a disease-causing gene. Thus, the comparison of gene products&#x2019; functional similarities is transformed into the comparison of the semantic similarities of GO terms. Adopting ontology annotation data provides a means to compare entities on aspects that would otherwise not be comparable.</p>
<p>Typically there are two ways to compare terms in graph-structured ontologies such as GO: edge-based or node-based (<xref rid="R22" ref-type="bibr">Pesquita et al., 2009</xref>). Edge-based approaches are based on counting the number of edges in the graph path between two terms (<xref rid="R27" ref-type="bibr">Rada et al., 1989</xref>). This can be problematic for biological data because the approach is based on two conditions: (i) nodes and edges in the biological ontology are uniformly distributed, and (ii) edges at the same level in the ontology correspond to the same semantic distance between terms. However, biological knowledge can rarely meet these two conditions where terms at the same tree structure level share the same scale and weight.</p>
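<p>As a rough illustration of the edge-based idea, the sketch below counts the edges on the shortest path between two terms in a toy graph. The graph follows the simplified hierarchy shown in <xref ref-type="fig" rid="F1">figure 1</xref>; the names <monospace>PARENTS</monospace> and <monospace>edge_distance</monospace> are hypothetical, and the sketch is not taken from any published implementation.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import deque

# Child -> parents, following the simplified hierarchy of figure 1.
PARENTS = {
    "GO:0002839": ["GO:0002418"],
    "GO:0002418": ["GO:0002347", "GO:0006955"],
    "GO:0002347": ["GO:0008150"],
    "GO:0006955": ["GO:0008150"],
    "GO:0008150": [],
}


def edge_distance(term_a, term_b):
    """Number of edges on the shortest undirected path between two terms."""
    # Build an undirected adjacency map from the child -> parent links.
    neighbours = {term: set(parents) for term, parents in PARENTS.items()}
    for child, parents in PARENTS.items():
        for parent in parents:
            neighbours[parent].add(child)

    # Breadth-first search visits terms in order of increasing distance.
    queue = deque([(term_a, 0)])
    seen = {term_a}
    while queue:
        term, distance = queue.popleft()
        if term == term_b:
            return distance
        for nxt in neighbours[term]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None  # the two terms are not connected


# Two general sibling terms and a specific term with its grandparent
# receive the same score, although their semantic closeness may differ.
print(edge_distance("GO:0002347", "GO:0006955"))
print(edge_distance("GO:0002839", "GO:0006955"))
```
]]></preformat>
<p>Both calls return the same edge count, which is the kind of outcome that follows from treating every edge as an equal semantic step.</p>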
<p>Node-based approaches are more commonly accepted in the biological domain, and several statistical measures and equations have been developed. Resnik proposed Information Content (IC) to quantify the informativeness of a concept <italic>c</italic> as the negative log likelihood (<xref rid="R29" ref-type="bibr">Resnik, 1999</xref>), where <italic>p</italic>(<italic>c</italic>) is the probability of encountering an instance of <italic>c</italic>:</p>
<disp-formula><label>(1)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mrow><mml:mi>I</mml:mi><mml:mi>C</mml:mi><mml:mfenced><mml:mi>c</mml:mi></mml:mfenced><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>log</mml:mi><mml:mi>p</mml:mi><mml:mfenced><mml:mi>c</mml:mi></mml:mfenced></mml:mrow></mml:math></disp-formula>
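<p>A small worked sketch may help to show how this quantity behaves. It assumes that the probability of a term is estimated as the share of annotations made to the term or to any of its descendants; the annotation counts are invented for illustration, and the names <monospace>DIRECT_COUNTS</monospace>, <monospace>term_probability</monospace>, and <monospace>information_content</monospace> are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math

# Child -> parents, following the simplified hierarchy of figure 1.
PARENTS = {
    "GO:0002839": ["GO:0002418"],
    "GO:0002418": ["GO:0002347", "GO:0006955"],
    "GO:0002347": ["GO:0008150"],
    "GO:0006955": ["GO:0008150"],
    "GO:0008150": [],
}

# Invented numbers of direct annotations per term (illustration only).
DIRECT_COUNTS = {
    "GO:0002839": 2,
    "GO:0002418": 3,
    "GO:0002347": 5,
    "GO:0006955": 10,
    "GO:0008150": 0,
}


def ancestors_or_self(term):
    """Return the term together with all of its ancestors."""
    found = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def term_probability(term):
    """Share of annotations made to `term` or to one of its descendants."""
    total = sum(DIRECT_COUNTS.values())
    count = sum(
        n for other, n in DIRECT_COUNTS.items()
        if term in ancestors_or_self(other)
    )
    return count / total


def information_content(term):
    """Negative log of the term probability."""
    return -math.log(term_probability(term))


for go_id in PARENTS:
    print(go_id, round(information_content(go_id), 3))
```
]]></preformat>
<p>Under these assumptions the root term has an information content of zero, and more specific terms score higher because fewer annotations reach them.</p>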
<p>However, a drawback of Resnik&#x2019;s method is that it ignores the information contained in the structure of the ontology, concentrating only on the information content of a term derived from corpus statistics. For biological ontologies, the specificity of a GO term is usually determined by its location in the GO graph. A GO term&#x2019;s semantics (biological meanings) are inherited from all its ancestor terms (<xref rid="R32" ref-type="bibr">Wang et al., 2007</xref>). For instance, as shown in <xref ref-type="fig" rid="F1">figure 1</xref>, <italic>GO:0002839</italic> positive regulation of immune response to tumour cell is a child term of <italic>GO:0002418</italic> immune response to tumour cell. The latter is a child term of both <italic>GO:0002347</italic> response to tumour cell and <italic>GO:0006955</italic> immune response, both of which are child terms of <italic>GO:0008150</italic> biological process. Because a GO term aggregates all its ancestor terms, <italic>GO:0002839</italic> positive regulation of immune response to tumour cell should possess the characteristics of both <italic>GO:0002347</italic> response to tumour cell and <italic>GO:0006955</italic> immune response. Moreover, biomedical ontologies usually have varying edge lengths (edges at the same level convey different semantic distances), varying depths (terms at the same level have different levels of detail), and varying node densities (some areas of the ontology have a greater density of terms than others) (<xref rid="R22" ref-type="bibr">Pesquita et al., 2009</xref>).</p>
<p>Wang et al. proposed a metric designed specifically to encode the semantics of a biological term by aggregating the semantic contributions of all its ancestor terms, including the term itself, in the GO graph. They argue that Resnik&#x2019;s IC approach is more applicable to knowledge in natural language, such as bird and crane or forest and graveyard, and is not the ideal option for biological knowledge in ontologies (<xref rid="R32" ref-type="bibr">Wang et al., 2007</xref>). IC neglects the logic that two GO terms that share the same parent near the root of the ontology (i.e., more general terms) should have a larger semantic difference than two terms that share the same parent far from the root, because the latter are more specific terms. GO is constructed in such a way that if a child GO term describes a gene product, then all its parent terms must also apply to that gene product (<xref rid="R32" ref-type="bibr">Wang et al., 2007</xref>). Similarly, DO contains logical definitions (axioms) to describe relevant disease drivers, constructed with specific relational ontology (RO) terms, to create a restriction between a DO term and another open biological and biomedical ontology (OBO) Foundry ontology term. Thus, DO has the ability to &#x2018;infer&#x2019; the child terms originating from a parent term based on a known parent-child relationship (<xref rid="R26" ref-type="bibr">Qin &#x0026; Liu, 2024</xref>). Therefore, it is reasonable to aggregate the biological meanings of all of a GO term&#x2019;s ancestor terms when determining its semantic similarity with another GO term.</p>
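<p>The aggregation can be sketched in a few lines of Python. The sketch reflects our reading of the measure in Wang et al. (2007): a term contributes 1 to its own semantics, each ancestor contributes the largest product of edge weights along any path from the term, and two terms are compared through the contributions of their shared ancestors. The weight of 0.8 for <italic>is_a</italic> edges, the toy graph based on figure 1, and the names <monospace>s_values</monospace> and <monospace>wang_similarity</monospace> are assumptions made for illustration; this is not the GOSemSim implementation.</p>
<preformat preformat-type="code"><![CDATA[
```python
IS_A = 0.8  # assumed semantic contribution factor for is_a edges

# Child -> [(parent, edge weight)], following the simplified figure 1.
PARENTS = {
    "GO:0002839": [("GO:0002418", IS_A)],
    "GO:0002418": [("GO:0002347", IS_A), ("GO:0006955", IS_A)],
    "GO:0002347": [("GO:0008150", IS_A)],
    "GO:0006955": [("GO:0008150", IS_A)],
    "GO:0008150": [],
}


def s_values(term):
    """Semantic contribution of `term` and of each ancestor to `term`."""
    values = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent, weight in PARENTS[child]:
            candidate = weight * values[child]
            # Keep the largest contribution over all paths to the parent.
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                frontier.append(parent)
    return values


def wang_similarity(term_a, term_b):
    """Shared-ancestor contributions divided by the total contributions."""
    sa, sb = s_values(term_a), s_values(term_b)
    shared = set(sa) & set(sb)
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


# Two general siblings near the root versus a specific parent-child pair.
print(round(wang_similarity("GO:0002347", "GO:0006955"), 3))
print(round(wang_similarity("GO:0002839", "GO:0002418"), 3))
```
]]></preformat>
<p>In this toy graph the two general sibling terms score lower than the specific parent-child pair, which is the behaviour the paragraph above describes.</p>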
<fig id="F1">
<label>Figure 1.</label>
<caption><p>A graphic view of <italic>GO:0002839</italic> positive regulation of immune response to tumour cell and all its ancestor terms in GO</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c10-fig1.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>Next, the issue is transforming the semantic similarity of GO terms into the functional similarity of gene products. Since one gene product may have more than one GO annotation, we must consider the contributions from the semantically similar terms that annotate the genes separately (<xref rid="R32" ref-type="bibr">Wang et al., 2007</xref>). Wang et al. first define the semantic similarity between one GO term <italic>go</italic> and a set of GO terms GO = {<italic>go<sub>1</sub></italic>, <italic>go<sub>2</sub></italic>, &#x2026;, <italic>go<sub>k</sub></italic>} as the maximum semantic similarity between <italic>go</italic> and any term in the set:</p>
<disp-formula><label>(1)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mrow><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mfenced><mml:mrow><mml:mi>g</mml:mi><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:mfenced><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mi>max</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munder><mml:mfenced><mml:mrow><mml:msub><mml:mi>S</mml:mi><mml:mrow><mml:mi>G</mml:mi><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mfenced><mml:mrow><mml:mi>g</mml:mi><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>g</mml:mi><mml:msub><mml:mi>o</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:mrow></mml:math></disp-formula>
<p>Then, given two gene products G1 and G2 annotated by sets of GO terms: GO1 = {<italic>go<sub>11</sub></italic>, <italic>go<sub>12</sub></italic>, &#x2026;, <italic>go<sub>1m</sub></italic>} and GO2 = {<italic>go<sub>21</sub></italic>, <italic>go<sub>22</sub></italic>, &#x2026;, <italic>go<sub>2n</sub></italic>} respectively, the functional similarity of G1 and G2 can be defined as:</p>
<disp-formula><label>(2)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mrow><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mfenced><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfenced><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mfenced><mml:mrow><mml:mi>g</mml:mi><mml:msub><mml:mi>o</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:msub><mml:mi>o</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfenced><mml:mo>+</mml:mo><mml:msub><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x2264;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mfenced><mml:mrow><mml:mi>g</mml:mi><mml:msub><mml:mi>o</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>G</mml:mi><mml:msub><mml:mi>o</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mfenced></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>+</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
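<p>Equations (1) and (2) translate almost directly into code. The sketch below is illustrative only: the term identifiers and the pairwise similarity values in the lookup table are hypothetical, and any term-level measure (such as Wang&#x2019;s) could be passed in as the <italic>term_sim</italic> argument.</p>
<preformat>
```python
def term_to_set_similarity(term, term_set, term_sim):
    """Equation (1): best match of one GO term against a set of GO terms."""
    return max(term_sim(term, other) for other in term_set)


def gene_similarity(terms_1, terms_2, term_sim):
    """Equation (2): best matches in both directions, averaged over m + n terms."""
    forward = sum(term_to_set_similarity(t, terms_2, term_sim) for t in terms_1)
    backward = sum(term_to_set_similarity(t, terms_1, term_sim) for t in terms_2)
    return (forward + backward) / (len(terms_1) + len(terms_2))


# Hypothetical pairwise term similarities, used only to exercise the functions.
TOY_SIM = {
    frozenset(["go_a", "go_b"]): 0.5,
    frozenset(["go_a", "go_c"]): 0.2,
    frozenset(["go_b", "go_c"]): 0.4,
}


def toy_term_sim(x, y):
    """Symmetric lookup; a term is fully similar to itself."""
    if x == y:
        return 1.0
    return TOY_SIM[frozenset([x, y])]


if __name__ == "__main__":
    g1_terms = ["go_a", "go_b"]
    g2_terms = ["go_b", "go_c"]
    print(gene_similarity(g1_terms, g2_terms, toy_term_sim))
```
</preformat>
<p>Because both directions are summed before dividing by m + n, the resulting gene-level score is symmetric in the two gene products.</p>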
<p>In this study, we applied Wang&#x2019;s approach to calculate the functional similarity of gene products in order to predict candidate genes associated with ASD. GO annotation data were collected using the <italic>org.Hs.eg.db</italic> R package (<xref rid="R8" ref-type="bibr">Carlson et al., 2019</xref>). Only biological process terms were selected (N = 28,140). Next, we mapped all the GO terms to their associated gene products (N = 157,247). Since our aim is to investigate the construction and quality of GO as a knowledge ontology in biomedical research, we replicated only part of the workflow by Asif et al., using three of the ML models: support-vector machines (SVM), random forest (RF), and gradient boosting (GB). We randomly selected 1,000 gene products from the mapping as the train set for the models. For the test set, we randomly selected 20 candidate gene products from the Simons Foundation Autism Research Initiative (SFARI, https://gene.sfari.org/) gene database (N = 1,176), of which 15 are categorized by SFARI as high confidence disease genes (HD) and 5 as low confidence disease genes (LD). This categorization serves as the reference against which the results of the ML classification models are compared. <xref ref-type="table" rid="T2">Table 2</xref> shows the 20 test gene products and their categories. Before the semantic similarity calculation, we removed SFARI&#x2019;s categorization from the test set so that the models had to predict it, and then assessed their overall performance by comparing their predictions with SFARI&#x2019;s classification. The semantic similarity measures between GO terms were implemented using the <italic>GOSemSim</italic> R package (<xref rid="R36" ref-type="bibr">Yu et al., 2010</xref>).</p>
</sec>
</sec>
<sec id="sec3">
<title>Results</title>
<p>As shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, we used the train set gene products and their GO annotation data to compute the 1,000&#x00D7;1,000 similarity matrix. Next, we computed the 1,000&#x00D7;20 similarity matrix between the train set and the test set (see <xref ref-type="fig" rid="F3">Figure 3</xref>). Based on their functional similarity, we identified which of the 20 test set gene products may be disease-associated genes. Of the three models we used, the SVC-based classifier trained and tested on Wang&#x2019;s semantic similarity matrix outperformed the other classifiers, with an AUC of 0.85 (see <xref ref-type="table" rid="T3">Table 3</xref>). The difference between the RF and GB AUC values was minor, indicating the independence of the methodology from the semantic measure. A recall of 0.25 is low. However, given the highly imbalanced dataset (most genes in the train set are non-ASD-associated and only a very small number are ASD-associated) and the small size of the train set, we believe this result is expected and sufficient to show that GO annotation data can be used in ML models for gene-disease association identification. Furthermore, the overall performance of ML models depends crucially on the quality of GO annotation data and on the structural relationships between GO terms. Here we demonstrated only one way to calculate the semantic similarity between GO terms. The results can vary depending on which approach one applies.</p>
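<p>The two similarity matrices can be thought of as feature tables for the classifiers. The following sketch shows one way to build them; it is not the code used in this study. The gene and term names are hypothetical, a simple Jaccard overlap stands in for Wang&#x2019;s measure to keep the example short, and placing test genes on the rows of the second matrix is an assumption about orientation.</p>
<preformat>
```python
def jaccard(terms_a, terms_b):
    """Stand-in gene similarity (set overlap); the study uses Wang's measure."""
    a, b = set(terms_a), set(terms_b)
    return len(a.intersection(b)) / len(a.union(b))


def similarity_matrix(row_genes, col_genes, annotations, sim=jaccard):
    """One row per gene in row_genes, one column per gene in col_genes."""
    return [
        [sim(annotations[r], annotations[c]) for c in col_genes]
        for r in row_genes
    ]


# Hypothetical annotations: gene product mapped to its set of GO terms.
ANNOTATIONS = {
    "g1": {"t1", "t2"},
    "g2": {"t2", "t3"},
    "g3": {"t4"},
    "g4": {"t1", "t2", "t3"},
}

train_genes = ["g1", "g2", "g3"]
test_genes = ["g4"]

# Train-by-train matrix: the features used to fit a classifier.
train_features = similarity_matrix(train_genes, train_genes, ANNOTATIONS)
# Test-by-train matrix: each test gene described by its similarity to every train gene.
test_features = similarity_matrix(test_genes, train_genes, ANNOTATIONS)

if __name__ == "__main__":
    for gene, row in zip(train_genes, train_features):
        print(gene, [round(v, 2) for v in row])
    for gene, row in zip(test_genes, test_features):
        print(gene, [round(v, 2) for v in row])
```
</preformat>
<p>Describing every gene by its similarity to the same set of train genes keeps the feature columns identical for training and prediction, which is what allows a model fitted on the first matrix to score rows of the second.</p>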
<table-wrap id="T2">
<label>Table 2.</label>
<caption><p>Selected test set gene products and ASD categories (N = 20)</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Gene product</bold></th>
<th align="center" valign="top"><bold>Candidate gene category by SFARI (ASD causing is 1, non-ASD causing is 0)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top">SNTG2</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">BIRC6</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">CNTN4</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">ADORA2A</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">LZTR1</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">WWOX</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">HYDIN</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">RBFOX1</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">TRPM1</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">FAN1</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">CMPK1</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">STIL</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">TAL1</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">BCL9</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">OR4A47</td>
<td align="center" valign="top">1</td>
</tr>
<tr>
<td align="center" valign="top">CARD16</td>
<td align="center" valign="top">0</td>
</tr>
<tr>
<td align="center" valign="top">CASP1</td>
<td align="center" valign="top">0</td>
</tr>
<tr>
<td align="center" valign="top">PCDH17</td>
<td align="center" valign="top">0</td>
</tr>
<tr>
<td align="center" valign="top">P2RX6</td>
<td align="center" valign="top">0</td>
</tr>
<tr>
<td align="center" valign="top">AHR</td>
<td align="center" valign="top">0</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T3">
<label>Table 3.</label>
<caption><p>The performance of classifiers trained and tested on the Wang semantic similarity matrix. The Area Under the Curve (AUC) evaluation metric was used to estimate and compare the performance of the classifiers</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Classifiers</bold></th>
<th align="center" valign="top"><bold>AUC</bold></th>
<th align="center" valign="top"><bold>Recall</bold></th>
<th align="center" valign="top"><bold>F1 Score</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top">SVC</td>
<td align="center" valign="top">0.85</td>
<td align="center" valign="top">0.25</td>
<td align="center" valign="top">0.10</td>
</tr>
<tr>
<td align="center" valign="top">RF</td>
<td align="center" valign="top">0.43</td>
<td align="center" valign="top">0.25</td>
<td align="center" valign="top">0.10</td>
</tr>
<tr>
<td align="center" valign="top">GB</td>
<td align="center" valign="top">0.50</td>
<td align="center" valign="top">0.25</td>
<td align="center" valign="top">0.10</td>
</tr>
</tbody>
</table>
</table-wrap>
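<p>For readers less familiar with the metrics in Table 3, the sketch below computes AUC, recall, and F1 from first principles. The labels, scores, and the 0.6 decision threshold are hypothetical toy values and are unrelated to the results reported in this study; the function names are our own.</p>
<preformat>
```python
def auc(labels, scores):
    """Chance that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form equals the area under the
    ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def recall_and_f1(labels, predictions):
    """Recall and F1 for the positive class from hard 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


if __name__ == "__main__":
    # Hypothetical toy data: 1 marks a disease-associated gene.
    labels = [1, 1, 1, 0, 0]
    scores = [0.9, 0.4, 0.3, 0.5, 0.2]
    threshold = 0.6  # assumed cut-off for turning scores into classes
    predictions = [1 if s >= threshold else 0 for s in scores]
    print("AUC:", auc(labels, scores))
    print("recall, F1:", recall_and_f1(labels, predictions))
```
</preformat>
<p>AUC is computed from the ranking of scores, whereas recall and F1 depend on the chosen threshold, so a classifier can rank genes reasonably well and still recover few positives at a given cut-off.</p>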
<fig id="F2">
<label>Figure 2.</label>
<caption><p>Similarity matrix of the first 20 gene products from train set</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c10-fig2.jpg"><alt-text>none</alt-text></graphic>
</fig>
<fig id="F3">
<label>Figure 3.</label>
<caption><p>Similarity matrix of the first 20 gene products from train set and all 20 gene products from test set</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c10-fig3.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>One major issue that reduces the performance of ML models is class imbalance, where one class has far fewer samples than the others. For example, in clinical or disease-related cases, there is inevitably less data from treatment groups than from the normal (control) group (<xref rid="R19" ref-type="bibr">Min et al., 2016</xref>). One of the common solutions to this problem is data pre-processing, which is largely conducted by data curators guided by schemas and controlled vocabularies. However, the high proportion of duplicate or near-duplicate samples in biological sequencing data is a serious problem during ML model training that tends to be overlooked (<xref rid="R4" ref-type="bibr">Auslander et al., 2021</xref>). Careful data processing is needed to ensure that the train and test sets are independent. When using GO annotation data to quantify the similarity between biological terms, an important issue is that some annotations in GO are inferred from other sources on the basis of similarity. Using these annotations for semantic similarity calculation in fact introduces data circularity (<xref rid="R22" ref-type="bibr">Pesquita et al., 2009</xref>).</p>
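<p>One way to reduce this circularity is to drop annotations whose evidence code indicates that they were inferred electronically or from similarity before any semantic similarity is computed. The sketch below illustrates the idea; it was not part of the workflow in this study. The annotation records are hypothetical, and the choice of which GO evidence codes to exclude is an assumption that would need to be justified for a given analysis.</p>
<preformat>
```python
# GO evidence codes treated here as electronically or similarity-inferred.
# Which codes to exclude is an assumption of this sketch.
SIMILARITY_DERIVED = {"IEA", "ISS", "ISO", "ISA", "ISM"}

# Hypothetical annotation records: (gene product, GO term, evidence code).
ANNOTATIONS = [
    ("geneA", "GO:x1", "IDA"),
    ("geneA", "GO:x2", "ISS"),
    ("geneB", "GO:x3", "IMP"),
    ("geneB", "GO:x1", "IEA"),
    ("geneC", "GO:x2", "IEA"),
]


def filter_annotations(records, excluded=SIMILARITY_DERIVED):
    """Group GO terms by gene, keeping only annotations with allowed evidence."""
    kept = {}
    for gene, term, evidence in records:
        if evidence not in excluded:
            kept.setdefault(gene, set()).add(term)
    return kept


if __name__ == "__main__":
    filtered = filter_annotations(ANNOTATIONS)
    for gene in sorted(filtered):
        print(gene, sorted(filtered[gene]))
    # Genes left without any annotation (geneC here) drop out of the mapping.
    dropped = {gene for gene, _, _ in ANNOTATIONS} - set(filtered)
    print("genes without remaining annotations:", sorted(dropped))
```
</preformat>
<p>The trade-off is coverage: filtering can leave some gene products with no annotations at all, so the number of genes lost should be reported alongside any results obtained this way.</p>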
</sec>
<sec id="sec4">
<title>Discussion</title>
<p>Conventional LIS approaches to knowledge representation, such as hierarchical and faceted classification, integrate human knowledge into a systematic arrangement in which concepts and structures tend to be abstract and epistemologically oriented. Ontologies, on the contrary, disintegrate parts of knowledge into a structure focused on problem solving and are more pragmatic and application oriented. The disintegrative approach places the entity itself, in our case the gene product, in a less important position and focuses instead on all aspects related to it, i.e., GO terms and annotations (<xref rid="R24" ref-type="bibr">Qin, 2002</xref>). As a result, ontologies are more suitable as conceptual frameworks for specific problems. In this case study, GO annotations can be used in various statistical measurements of semantic similarity, which in turn represent the entity they describe, i.e., the gene product. The application of the GO knowledgebase has transcended its initial purpose, which was to represent and organize the phenomena of a knowledge domain.</p>
<p>As deep learning models and LLMs are more frequently applied to bioinformatics and biomedical research, we must reconsider the schemas and frameworks for building knowledge organization systems so that they can support more complex computational approaches. There are already ways to use LLMs in knowledge graph engineering and ontology construction in place of human-conducted natural language processing (Kommineni et al., 2024). We will likely witness an evolutionary change in knowledge organization and representation in the fast-developing AI era. While computer scientists focus on the pragmatic side of KO and KR by exploiting LLMs and prompt engineering to improve the accuracy, scalability, and depth of the knowledge captured, the value of theoretical advancements should not be overlooked. The lack of communication between the AI and KOS communities may hinder the applicability of ontologies as knowledge resources. With the domain shifting towards generative AI, more work is necessary to refine the &#x2018;core&#x2019; and paradigm of this interdisciplinary field. This may seem trivial to application-driven science, but the purpose of a <italic>&#x2018;step back&#x2019;</italic> from the actual phenomena is to find a broader characterization that encompasses the instances at hand (Lyytinen et al., 2004). The paradigmatic similarities in KR between KO and AI offer not only theoretical foundations but also practical opportunities for KO to contribute its unique value to knowledge representation (<xref rid="R25" ref-type="bibr">Qin, 2020</xref>).</p>
<p>Fundamentally, KOSs must be prepared to handle various types of knowledge generated by AI models. Measurements of semantic similarity should be updated to accommodate new entities and relationships as new terminologies, or even new knowledge domains, emerge. Guidelines for misinformation detection are necessary if ontologies use LLM-generated knowledge. Questions about the trustworthiness of AI can also affect the reliability of ontologies (Kaur et al., 2023), especially when the construction of LLMs and algorithms is mostly hidden in &#x2018;black boxes&#x2019;. For medical and health data, the fairness of AI that uses knowledgebase data can raise ethical issues. Patient privacy and confidentiality are necessary factors in deciding whether we are entrusted to use such data in ML models, on the condition that the data are not shared for other purposes. Applications like disease-gene association identification are closely intertwined with clinical decision-making, which can have a direct impact on medical practice and on communication between physicians and patients (L&#x00F6;tsch et al., 2022). A new ethical framework may be in order to balance the needs of society and future patients with legitimate expectations of privacy (<xref rid="R14" ref-type="bibr">Haendel et al., 2018</xref>), especially with the involvement of AI models.</p>
</sec>
<sec id="sec5">
<title>Conclusion</title>
<p>This paper uses GO as a case for training ML models that serve one type of computational biological research: disease-gene association identification. We calculated the functional similarity of gene products, represented by the semantic similarity of their GO annotations, and trained three supervised ML models: SVM, RF, and GB. By experimenting with this workflow, we evaluated the role of ontologies as knowledgebases for large-scale biomedical data research. Applying theories of knowledge organization and representation, we argue that the knowledge organization and computer science domains require more communication and synchronization in the face of emerging AI and LLM technologies in order to accommodate AI-generated knowledge and policies. We conclude that ontologies have played a crucial role in the discovery of biomedical knowledge and in clinical decision-making by providing meaningful, structured, and reliable data.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>We thank the anonymous reviewers for their comments on this paper.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="R1"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ashburner</surname><given-names>M.</given-names></name><name><surname>Ball</surname><given-names>C. A.</given-names></name><name><surname>Blake</surname><given-names>J. A.</given-names></name><name><surname>Botstein</surname><given-names>D.</given-names></name><name><surname>Butler</surname><given-names>H.</given-names></name><name><surname>Cherry</surname><given-names>J. M.</given-names></name><name><surname>Davis</surname><given-names>A. P.</given-names></name><name><surname>Dolinski</surname><given-names>K.</given-names></name><name><surname>Dwight</surname><given-names>S. S.</given-names></name><name><surname>Eppig</surname><given-names>J. T.</given-names></name><name><surname>Harris</surname><given-names>M. A.</given-names></name><name><surname>Hill</surname><given-names>D. P.</given-names></name><name><surname>Issel-Tarver</surname><given-names>L.</given-names></name><name><surname>Kasarskis</surname><given-names>A.</given-names></name><name><surname>Lewis</surname><given-names>S.</given-names></name><name><surname>Matese</surname><given-names>J. C.</given-names></name><name><surname>Richardson</surname><given-names>J. E.</given-names></name><name><surname>Ringwald</surname><given-names>M.</given-names></name><name><surname>Rubin</surname><given-names>G. M.</given-names></name><name><surname>Sherlock</surname><given-names>G.</given-names></name></person-group> <year>(2000)</year> <article-title>Gene Ontology: Tool for the unification of biology</article-title><source>Nature Genetics</source><volume>25</volume><issue>1</issue><fpage>25</fpage><lpage>29</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/75556">https://doi.org/10.1038/75556</ext-link></element-citation></ref>
<ref id="R2"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Asif</surname><given-names>M.</given-names></name><name><surname>Martiniano</surname><given-names>H. F. M. C. M.</given-names></name><name><surname>Vicente</surname><given-names>A. M.</given-names></name><name><surname>Couto</surname><given-names>F. M.</given-names></name></person-group> <year>(2018)</year> <article-title>Identifying disease genes using machine learning and gene functional similarities, assessed through Gene Ontology</article-title><source>PLOS ONE</source><volume>13</volume><issue>12</issue><fpage>e0208626</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1371/journal.pone.0208626">https://doi.org/10.1371/journal.pone.0208626</ext-link></element-citation></ref>
<ref id="R3"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ata</surname><given-names>S. K.</given-names></name><name><surname>Wu</surname><given-names>M.</given-names></name><name><surname>Fang</surname><given-names>Y.</given-names></name><name><surname>Ou-Yang</surname><given-names>L.</given-names></name><name><surname>Kwoh</surname><given-names>C. K.</given-names></name><name><surname>Li</surname><given-names>X.-L.</given-names></name></person-group> <year>(2021)</year> <article-title>Recent advances in network&#x2013;based methods for disease gene prediction</article-title><source>Briefings in Bioinformatics</source><volume>22</volume><issue>4</issue><fpage>bbaa303</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bib/bbaa303">https://doi.org/10.1093/bib/bbaa303</ext-link></element-citation></ref>
<ref id="R4"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Auslander</surname><given-names>N.</given-names></name><name><surname>Gussow</surname><given-names>A. B.</given-names></name><name><surname>Koonin</surname><given-names>E. V.</given-names></name></person-group> <year>(2021)</year> <article-title>Incorporating Machine Learning into Established Bioinformatics Frameworks</article-title><source>International Journal of Molecular Sciences</source><volume>22</volume><issue>6</issue><comment>Article 6</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/ijms22062903">https://doi.org/10.3390/ijms22062903</ext-link></element-citation></ref>
<ref id="R5"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Bello</surname><given-names>S. M.</given-names></name><name><surname>Shimoyama</surname><given-names>M.</given-names></name><name><surname>Mitraka</surname><given-names>E.</given-names></name><name><surname>Laulederkind</surname><given-names>S. J. F.</given-names></name><name><surname>Smith</surname><given-names>C. L.</given-names></name><name><surname>Eppig</surname><given-names>J. T.</given-names></name><name><surname>Schriml</surname><given-names>L. M.</given-names></name></person-group> <year>(2018)</year> <article-title>Disease Ontology: Improving and unifying disease annotations across species</article-title><source>Disease Models &#x0026; Mechanisms, dmm.032839</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1242/dmm.032839">https://doi.org/10.1242/dmm.032839</ext-link></element-citation></ref>
<ref id="R6"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Bench-Capon</surname><given-names>T. J. M.</given-names></name></person-group> <year>(1990)</year> <source>Knowledge representation: An approach to artificial intelligence</source><publisher-name>Academic Press</publisher-name></element-citation></ref>
<ref id="R7"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Binder</surname><given-names>J.</given-names></name><name><surname>Ursu</surname><given-names>O.</given-names></name><name><surname>Bologa</surname><given-names>C.</given-names></name><name><surname>Jiang</surname><given-names>S.</given-names></name><name><surname>Maphis</surname><given-names>N.</given-names></name><name><surname>Dadras</surname><given-names>S.</given-names></name><name><surname>Chisholm</surname><given-names>D.</given-names></name><name><surname>Weick</surname><given-names>J.</given-names></name><name><surname>Myers</surname><given-names>O.</given-names></name><name><surname>Kumar</surname><given-names>P.</given-names></name><name><surname>Yang</surname><given-names>J. J.</given-names></name><name><surname>Bhaskar</surname><given-names>K.</given-names></name><name><surname>Oprea</surname><given-names>T. I.</given-names></name></person-group> <year>(2022)</year> <article-title>Machine learning prediction and tau-based screening identifies potential Alzheimer&#x2019;s disease genes relevant to immunity</article-title><source>Communications Biology</source><volume>5</volume><issue>1</issue><fpage>125</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s42003-022-03068-7">https://doi.org/10.1038/s42003-022-03068-7</ext-link></element-citation></ref>
<ref id="R8"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Carlson</surname><given-names>M.</given-names></name><name><surname>Falcon</surname><given-names>S.</given-names></name><name><surname>Pages</surname><given-names>H.</given-names></name><name><surname>Li</surname><given-names>N.</given-names></name></person-group> <year>(2019)</year> <article-title>org.Hs.eg.db: Genome wide annotation for Human</article-title><source>R Package Version</source><volume>3</volume><issue>2</issue><fpage>3</fpage></element-citation></ref>
<ref id="R9"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Chang</surname><given-names>J.</given-names></name><name><surname>Wang</surname><given-names>S.</given-names></name><name><surname>Ling</surname><given-names>C.</given-names></name><name><surname>Qin</surname><given-names>Z.</given-names></name><name><surname>Zhao</surname><given-names>L.</given-names></name></person-group> <year>(2024)</year> <source>Gene-associated Disease Discovery Powered by Large Language Models (No. arXiv:2401.09490)</source><comment>arXiv</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2401.09490">https://doi.org/10.48550/arXiv.2401.09490</ext-link></element-citation></ref>
<ref id="R10"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Duck</surname><given-names>G.</given-names></name><name><surname>Nenadic</surname><given-names>G.</given-names></name><name><surname>Filannino</surname><given-names>M.</given-names></name><name><surname>Brass</surname><given-names>A.</given-names></name><name><surname>Robertson</surname><given-names>D. L.</given-names></name><name><surname>Stevens</surname><given-names>R.</given-names></name></person-group> <year>(2016)</year> <article-title>A Survey of Bioinformatics Database and Software Usage through Mining the Literature</article-title><source>PLOS ONE</source><volume>11</volume><issue>6</issue><fpage>e0157989</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1371/journal.pone.0157989">https://doi.org/10.1371/journal.pone.0157989</ext-link></element-citation></ref>
<ref id="R11"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Geschwind</surname><given-names>D. H.</given-names></name><name><surname>State</surname><given-names>M. W.</given-names></name></person-group> <year>(2015)</year> <article-title>Gene hunting in autism spectrum disorder: On the path to precision medicine</article-title><source>The Lancet Neurology</source><volume>14</volume><issue>11</issue><fpage>1109</fpage><lpage>1120</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/S1474-4422(15)00044-7">https://doi.org/10.1016/S1474-4422(15)00044-7</ext-link></element-citation></ref>
<ref id="R12"><element-citation publication-type="other"><person-group person-group-type="author"><collab>Google, Microsoft, Yahoo, &#x0026; Yandex</collab></person-group> <year>(2024)</year> <article-title>Schema.org [Organization]</article-title><source>Schema.Org</source><ext-link ext-link-type="uri" xlink:href="https://schema.org/">https://schema.org/</ext-link></element-citation></ref>
<ref id="R13"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Guarino</surname><given-names>N.</given-names></name><name><surname>Giaretta</surname><given-names>P.</given-names></name></person-group> <year>(1995)</year> <article-title>Ontologies and knowledge bases: Towards a terminological clarification</article-title><source>Towards Very Large Knowledge Bases</source><fpage>25</fpage><lpage>32</lpage><publisher-name>IOS Press</publisher-name><ext-link ext-link-type="uri" xlink:href="https://www.loa.istc.cnr.it/old/Papers/KBKS95.pdf">https://www.loa.istc.cnr.it/old/Papers/KBKS95.pdf</ext-link></element-citation></ref>
<ref id="R14"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Haendel</surname><given-names>M. A.</given-names></name><name><surname>Chute</surname><given-names>C. G.</given-names></name><name><surname>Robinson</surname><given-names>P. N.</given-names></name></person-group> <year>(2018)</year> <article-title>Classification, Ontology, and Precision Medicine</article-title><source>The New England Journal of Medicine</source><volume>379</volume><issue>15</issue><fpage>1452</fpage><lpage>1462</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1056/NEJMra1615014">https://doi.org/10.1056/NEJMra1615014</ext-link></element-citation></ref>
<ref id="R15"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname><given-names>Y. S.</given-names></name><name><surname>Leventhal</surname><given-names>B. L.</given-names></name></person-group> <year>(2015)</year> <article-title>Genetic Epidemiology and Insights into Interactive Genetic and Environmental Effects in Autism Spectrum Disorders</article-title><source>Biological Psychiatry</source><volume>77</volume><issue>1</issue><fpage>66</fpage><lpage>74</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.biopsych.2014.11.001">https://doi.org/10.1016/j.biopsych.2014.11.001</ext-link></element-citation></ref>
<ref id="R16"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Krishnan</surname><given-names>A.</given-names></name><name><surname>Zhang</surname><given-names>R.</given-names></name><name><surname>Yao</surname><given-names>V.</given-names></name><name><surname>Theesfeld</surname><given-names>C. L.</given-names></name><name><surname>Wong</surname><given-names>A. K.</given-names></name><name><surname>Tadych</surname><given-names>A.</given-names></name><name><surname>Volfovsky</surname><given-names>N.</given-names></name><name><surname>Packer</surname><given-names>A.</given-names></name><name><surname>Lash</surname><given-names>A.</given-names></name><name><surname>Troyanskaya</surname><given-names>O. G.</given-names></name></person-group> <year>(2016)</year> <article-title>Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder</article-title><source>Nature Neuroscience</source><volume>19</volume><issue>11</issue><fpage>1454</fpage><lpage>1462</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nn.4353">https://doi.org/10.1038/nn.4353</ext-link></element-citation></ref>
<ref id="R17"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Levesque</surname><given-names>H. J.</given-names></name><name><surname>Lakemeyer</surname><given-names>G.</given-names></name></person-group> <year>(2022)</year> <source>The logic of knowledge bases (Second edition)</source><publisher-name>College Publications</publisher-name></element-citation></ref>
<ref id="R18"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Libbrecht</surname><given-names>M. W.</given-names></name><name><surname>Noble</surname><given-names>W. S.</given-names></name></person-group> <year>(2015)</year> <article-title>Machine learning applications in genetics and genomics</article-title><source>Nature Reviews Genetics</source><volume>16</volume><issue>6</issue><fpage>321</fpage><lpage>332</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/nrg3920">https://doi.org/10.1038/nrg3920</ext-link></element-citation></ref>
<ref id="R19"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Min</surname><given-names>S.</given-names></name><name><surname>Lee</surname><given-names>B.</given-names></name><name><surname>Yoon</surname><given-names>S.</given-names></name></person-group> <year>(2016)</year> <article-title>Deep learning in bioinformatics</article-title><source>Briefings in Bioinformatics, bbw068</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bib/bbw068">https://doi.org/10.1093/bib/bbw068</ext-link></element-citation></ref>
<ref id="R20"><element-citation publication-type="other"><person-group person-group-type="author"><collab>National Center for Biomedical Ontology</collab></person-group> <year>(2024)</year> <article-title>Welcome to BioPortal, the world&#x2019;s most comprehensive repository of biomedical ontologies. [Organization]</article-title><source>BioPortal</source><ext-link ext-link-type="uri" xlink:href="https://bioportal.bioontology.org/">https://bioportal.bioontology.org/</ext-link></element-citation></ref>
<ref id="R21"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Ng</surname><given-names>A.</given-names></name></person-group><comment>Director</comment> <year>(2024)</year> <comment>July 4</comment><article-title>A Chat with Andrew on MLOps: From Model-Centric to Data-Centric AI [Video recording]</article-title><ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=06-AZXmwHjo">https://www.youtube.com/watch?v=06-AZXmwHjo</ext-link></element-citation></ref>
<ref id="R22"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pesquita</surname><given-names>C.</given-names></name><name><surname>Faria</surname><given-names>D.</given-names></name><name><surname>Falc&#x00E3;o</surname><given-names>A. O.</given-names></name><name><surname>Lord</surname><given-names>P.</given-names></name><name><surname>Couto</surname><given-names>F. M.</given-names></name></person-group> <year>(2009)</year> <article-title>Semantic Similarity in Biomedical Ontologies</article-title><source>PLOS Computational Biology</source><volume>5</volume><issue>7</issue><fpage>e1000443</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1371/journal.pcbi.1000443">https://doi.org/10.1371/journal.pcbi.1000443</ext-link></element-citation></ref>
<ref id="R23"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pi&#x00F1;ero</surname><given-names>J.</given-names></name><name><surname>Ram&#x00ED;rez-Anguita</surname><given-names>J. M.</given-names></name><name><surname>Sa&#x00FC;ch-Pitarch</surname><given-names>J.</given-names></name><name><surname>Ronzano</surname><given-names>F.</given-names></name><name><surname>Centeno</surname><given-names>E.</given-names></name><name><surname>Sanz</surname><given-names>F.</given-names></name><name><surname>Furlong</surname><given-names>L. I.</given-names></name></person-group> <year>(2020)</year> <article-title>The DisGeNET knowledge platform for disease genomics: 2019 update</article-title><source>Nucleic Acids Research</source><volume>48</volume><issue>D1</issue><fpage>D845</fpage><lpage>D855</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/nar/gkz1021">https://doi.org/10.1093/nar/gkz1021</ext-link></element-citation></ref>
<ref id="R24"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Qin</surname><given-names>J.</given-names></name></person-group> <year>(2002)</year> <article-title>Evolving Paradigms of Knowledge Representation and Organization: A Comparative Study of Classification, XML/DTD, and Ontology</article-title><source>Advances in Knowledge Organization</source><volume>8</volume><fpage>465</fpage><lpage>471</lpage></element-citation></ref>
<ref id="R25"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Qin</surname><given-names>J.</given-names></name></person-group> <year>(2020)</year> <article-title>Knowledge Organization and Representation under the AI Lens</article-title><source>Journal of Data and Information Science</source><volume>5</volume><issue>1</issue><fpage>3</fpage><lpage>17</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.2478/jdis-2020-0002">https://doi.org/10.2478/jdis-2020-0002</ext-link></element-citation></ref>
<ref id="R26"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Qin</surname><given-names>J.</given-names></name><name><surname>Liu</surname><given-names>Q.</given-names></name></person-group> <year>(2024)</year> <article-title>Organizing Knowledge in Knowledgebases: A Case Study</article-title><source>Knowledge Organization for Resilience in Times of Crisis: Challenges and Opportunities</source><fpage>393</fpage><lpage>400</lpage></element-citation></ref>
<ref id="R27"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Rada</surname><given-names>R.</given-names></name><name><surname>Mili</surname><given-names>H.</given-names></name><name><surname>Bicknell</surname><given-names>E.</given-names></name><name><surname>Blettner</surname><given-names>M.</given-names></name></person-group> <year>(1989)</year> <article-title>Development and application of a metric on semantic nets</article-title><source>IEEE Transactions on Systems, Man, and Cybernetics</source><volume>19</volume><issue>1</issue><fpage>17</fpage><lpage>30</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/21.24528">https://doi.org/10.1109/21.24528</ext-link></element-citation></ref>
<ref id="R28"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Radivojac</surname><given-names>P.</given-names></name><name><surname>Peng</surname><given-names>K.</given-names></name><name><surname>Clark</surname><given-names>W. T.</given-names></name><name><surname>Peters</surname><given-names>B. J.</given-names></name><name><surname>Mohan</surname><given-names>A.</given-names></name><name><surname>Boyle</surname><given-names>S. M.</given-names></name><name><surname>Mooney</surname><given-names>S. D.</given-names></name></person-group> <year>(2008)</year> <article-title>An integrated approach to inferring gene&#x2013;disease associations in humans</article-title><source>Proteins: Structure, Function, and Bioinformatics</source><volume>72</volume><issue>3</issue><fpage>1030</fpage><lpage>1037</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1002/prot.21989">https://doi.org/10.1002/prot.21989</ext-link></element-citation></ref>
<ref id="R29"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Resnik</surname><given-names>P.</given-names></name></person-group> <year>(1999)</year> <article-title>Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language</article-title><source>Journal of Artificial Intelligence Research</source><volume>11</volume><fpage>95</fpage><lpage>130</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1613/jair.514">https://doi.org/10.1613/jair.514</ext-link></element-citation></ref>
<ref id="R30"><element-citation publication-type="journal"><person-group person-group-type="author"><collab>The Gene Ontology Consortium</collab></person-group> <year>(2023)</year> <article-title>The Gene Ontology knowledgebase in 2023</article-title><source>Genetics</source><volume>224</volume><issue>1</issue><fpage>1</fpage><lpage>14</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/genetics/iyad031">https://doi.org/10.1093/genetics/iyad031</ext-link></element-citation></ref>
<ref id="R31"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Voineagu</surname><given-names>I.</given-names></name><name><surname>Eapen</surname><given-names>V.</given-names></name></person-group> <year>(2013)</year> <article-title>Converging Pathways in Autism Spectrum Disorders: Interplay between Synaptic Dysfunction and Immune Responses</article-title><source>Frontiers in Human Neuroscience</source><volume>7</volume><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fnhum.2013.00738">https://doi.org/10.3389/fnhum.2013.00738</ext-link></element-citation></ref>
<ref id="R32"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname><given-names>J. Z.</given-names></name><name><surname>Du</surname><given-names>Z.</given-names></name><name><surname>Payattakool</surname><given-names>R.</given-names></name><name><surname>Yu</surname><given-names>P. S.</given-names></name><name><surname>Chen</surname><given-names>C.-F.</given-names></name></person-group> <year>(2007)</year> <article-title>A new method to measure the semantic similarity of GO terms</article-title><source>Bioinformatics</source><volume>23</volume><issue>10</issue><fpage>1274</fpage><lpage>1281</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bioinformatics/btm087">https://doi.org/10.1093/bioinformatics/btm087</ext-link></element-citation></ref>
<ref id="R33"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wilkinson</surname><given-names>M. D.</given-names></name><name><surname>Dumontier</surname><given-names>M.</given-names></name><name><surname>Aalbersberg</surname><given-names>Ij. J.</given-names></name><name><surname>Appleton</surname><given-names>G.</given-names></name><name><surname>Axton</surname><given-names>M.</given-names></name><name><surname>Baak</surname><given-names>A.</given-names></name><name><surname>Blomberg</surname><given-names>N.</given-names></name><name><surname>Boiten</surname><given-names>J.-W.</given-names></name><name><surname>da Silva Santos</surname><given-names>L. B.</given-names></name><name><surname>Bourne</surname><given-names>P. E.</given-names></name><name><surname>Bouwman</surname><given-names>J.</given-names></name><name><surname>Brookes</surname><given-names>A. J.</given-names></name><name><surname>Clark</surname><given-names>T.</given-names></name><name><surname>Crosas</surname><given-names>M.</given-names></name><name><surname>Dillo</surname><given-names>I.</given-names></name><name><surname>Dumon</surname><given-names>O.</given-names></name><name><surname>Edmunds</surname><given-names>S.</given-names></name><name><surname>Evelo</surname><given-names>C. T.</given-names></name><name><surname>Finkers</surname><given-names>R.</given-names></name><name><surname>Mons</surname><given-names>B.</given-names></name></person-group> <year>(2016)</year> <article-title>The FAIR Guiding Principles for scientific data management and stewardship</article-title><source>Scientific Data</source><volume>3</volume><fpage>160018</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/sdata.2016.18">https://doi.org/10.1038/sdata.2016.18</ext-link></element-citation></ref>
<ref id="R34"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname><given-names>K.</given-names></name><name><surname>Lu</surname><given-names>K.</given-names></name><name><surname>Wu</surname><given-names>Y.</given-names></name><name><surname>Yu</surname><given-names>J.</given-names></name><name><surname>Liu</surname><given-names>B.</given-names></name><name><surname>Zhao</surname><given-names>Y.</given-names></name><name><surname>Chen</surname><given-names>J.</given-names></name><name><surname>Zhou</surname><given-names>X.</given-names></name></person-group> <year>(2021)</year> <article-title>A network-based machine&#x2013;learning framework to identify both functional modules and disease genes</article-title><source>Human Genetics</source><volume>140</volume><issue>6</issue><fpage>897</fpage><lpage>913</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s00439-020-02253-0">https://doi.org/10.1007/s00439-020-02253-0</ext-link></element-citation></ref>
<ref id="R35"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Yousef</surname><given-names>M.</given-names></name><name><surname>Say&#x0131;c&#x0131;</surname><given-names>A.</given-names></name><name><surname>Bakir-Gungor</surname><given-names>B.</given-names></name></person-group> <year>(2021)</year> <article-title>Integrating Gene Ontology Based Grouping and Ranking into the Machine Learning Algorithm for Gene Expression Data Analysis</article-title><person-group person-group-type="editor"><name><surname>Kotsis</surname><given-names>G.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Tjoa</surname><given-names>A. M.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Khalil</surname><given-names>I.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Moser</surname><given-names>B.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Mashkoor</surname><given-names>A.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Sametinger</surname><given-names>J.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Fensel</surname><given-names>A.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Martinez-Gil</surname><given-names>J.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Fischer</surname><given-names>L.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Czech</surname><given-names>G.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Sobieczky</surname><given-names>F.</given-names></name></person-group><person-group 
person-group-type="editor"><name><surname>Khan</surname><given-names>S.</given-names></name></person-group><source>Database and Expert Systems Applications&#x2014;DEXA 2021 Workshops</source><fpage>205</fpage><lpage>214</lpage><publisher-name>Springer International Publishing</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-030-87101-7_20">https://doi.org/10.1007/978-3-030-87101-7_20</ext-link></element-citation></ref>
<ref id="R36"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>G.</given-names></name><name><surname>Li</surname><given-names>F.</given-names></name><name><surname>Qin</surname><given-names>Y.</given-names></name><name><surname>Bo</surname><given-names>X.</given-names></name><name><surname>Wu</surname><given-names>Y.</given-names></name><name><surname>Wang</surname><given-names>S.</given-names></name></person-group> <year>(2010)</year> <article-title>GOSemSim: An R package for measuring semantic similarity among GO terms and gene products</article-title><source>Bioinformatics</source><volume>26</volume><issue>7</issue><fpage>976</fpage><lpage>978</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1093/bioinformatics/btq064">https://doi.org/10.1093/bioinformatics/btq064</ext-link></element-citation></ref>
<ref id="R37"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zeng</surname><given-names>M. L.</given-names></name></person-group> <year>(2008)</year> <article-title>Knowledge Organization Systems (KOS)</article-title><source>Knowledge Organization</source><volume>35</volume><issue>2&#x2013;3</issue><fpage>160</fpage><lpage>182</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5771/0943-7444-2008-2-3-160">https://doi.org/10.5771/0943-7444-2008-2-3-160</ext-link></element-citation></ref>
</ref-list>
</back>
</article>