<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">IR</journal-id>
<journal-title-group>
<journal-title>Information Research</journal-title>
</journal-title-group>
<issn pub-type="epub">1368-1613</issn>
<publisher>
<publisher-name>University of Bor&#x00E5;s</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">ir30iConf47077</article-id>
<article-id pub-id-type="doi">10.47989/ir30iConf47077</article-id>
<article-categories>
<subj-group xml:lang="en">
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>On the robustness of cover version identification models: a study using cover versions from YouTube</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Hachmeier</surname><given-names>Simon</given-names></name>
<xref ref-type="aff" rid="aff0001"/></contrib>
<contrib contrib-type="author"><name><surname>J&#x00E4;schke</surname><given-names>Robert</given-names></name>
<xref ref-type="aff" rid="aff0002"/></contrib>
<aff id="aff0001"><bold>Simon Hachmeier</bold> is a Ph.D. student at the Berlin School of Library and Information Science at the Humboldt-Universit&#x00E4;t zu Berlin. He received his M.Sc. in Information Systems from the University of Innsbruck. Simon&#x2019;s research interest is cover version identification on the web. He can be contacted at <email xlink:href="simon.hachmeier@hu-berlin.de">simon.hachmeier@hu-berlin.de</email></aff>
<aff id="aff0002"><bold>Robert J&#x00E4;schke</bold> is a Professor at the Berlin School of Library and Information Science at the Humboldt-Universit&#x00E4;t zu Berlin. He received his Ph.D. in Computer Science from the University of Kassel. Robert&#x2019;s research interests are web science and digital humanities. He can be contacted at <email xlink:href="robert.jaeschke@hu-berlin.de">robert.jaeschke@hu-berlin.de</email></aff>
</contrib-group>
<pub-date pub-type="epub"><day>06</day><month>05</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>30</volume>
<issue>i</issue>
<fpage>1103</fpage>
<lpage>1122</lpage>
<permissions>
<copyright-year>2025</copyright-year>
<copyright-holder>&#x00A9; 2025 The Author(s).</copyright-holder>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract xml:lang="en">
<title>Abstract</title>
<p><bold>Introduction.</bold> Recent advances in cover version identification have shown great success. However, models are usually tested on a fixed set of datasets which rely on the online cover version database SecondHandSongs. It is unclear how well models perform on cover versions from online video platforms, which may exhibit unexpected alterations.</p>
<p><bold>Method.</bold> We annotate a subset of versions from YouTube sampled by a multi-modal uncertainty sampling approach and evaluate state-of-the-art cover version identification models.</p>
<p><bold>Results.</bold> We find that existing models achieve significantly lower ranking performance on our dataset compared to a community dataset. We additionally measure the performance for different types of versions (e.g., instrumental versions) and find several types that are particularly hard to rank. Lastly, we provide a taxonomy of alterations in cover versions on the web.</p>
<p><bold>Conclusions.</bold> Our findings suggest that research in cover version identification should depend less on SecondHandSongs and instead draw on more diverse datasets.</p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>In the context of western popular music, a cover version is a derivative of an original performance of a musical work. Artists perform versions to convey their subjective interpretations of musical works, which is a long-standing practice in musical culture. Usually, different versions of the same work share similar changes of individual notes (melody) or groups of notes (harmony) over time (<xref rid="R34" ref-type="bibr">Yesiler et al., 2021</xref>).</p>
<p>The research field of version identification (VI) deals with the automatic detection of cover versions in music collections. Recent approaches in VI aim to encode versions into representations retaining only relevant information in the context of cover versions (<xref rid="R10" ref-type="bibr">Du et al., 2021</xref>, <xref rid="R8" ref-type="bibr">2022</xref>, <xref rid="R9" ref-type="bibr">2023</xref>; <xref rid="R19" ref-type="bibr">Hu et al., 2022</xref>; <xref rid="R17" ref-type="bibr">Liu et al., 2023</xref>; <xref rid="R35" ref-type="bibr">Yesiler et al., 2020a</xref>, 2020b). For instance, Abrassart and Doras (<xref rid="R1" ref-type="bibr">2022</xref>) report that melody, harmony, and lyrics are generally more relevant than rhythm. However, the actual relevance of each characteristic is non-trivial to predict and might strongly vary for different musical pieces. In contrast, characteristics irrelevant in the VI context are usually well agreed upon, such as the tempo or the key/scale.</p>
<p>Online video platforms feature various application scenarios for VI such as copyright infringement detection and music recommendation. Hence, the robustness of methods against noise and variance on the platform is important. One key peculiarity of VI in online videos is the alignment problem. In VI, this was addressed by summarization of musical content along the time axis including pooling mechanisms (<xref rid="R10" ref-type="bibr">Du et al., 2021</xref>, <xref rid="R8" ref-type="bibr">2022</xref>; <xref rid="R35" ref-type="bibr">Yesiler et al., 2020a</xref>; <xref rid="R38" ref-type="bibr">Yu et al., 2020</xref>) and more recently by the matching of smaller chunks of the pairs (<xref rid="R9" ref-type="bibr">Du et al., 2023</xref>; <xref rid="R17" ref-type="bibr">Liu et al., 2023</xref>). Since YouTube is a collection of videos rather than versions (except for YouTube&#x2019;s proprietary music streaming service <italic>YouTube Music</italic>), the relationship between videos and versions is a many-to-many relationship. This makes the alignment problem in online videos particularly challenging. For instance, a video might contain multiple versions (e.g., concert recordings) or only chunks of a version (e.g., guitar solo covers or tutorials (<xref rid="R18" ref-type="bibr">Hanson, 2018</xref>)). Additionally, videos might include noise such as commentary (e.g., people reacting to music (<xref rid="R27" ref-type="bibr">McDaniel, 2021</xref>)). Besides the alignment problem, other challenges might arise for VI in online videos such as the absence of the main melody (e.g., karaoke or instrumental versions (<xref rid="R2" ref-type="bibr">Agrawal &#x0026; Sureka, 2013</xref>; <xref rid="R32" ref-type="bibr">Smith et al., 2017</xref>)), low fidelity in amateur recordings and versions occurring only in the background as accompaniment (<xref rid="R25" ref-type="bibr">Martet, 2016</xref>).</p>
<p>VI research has made great progress in recent years, mainly measured in metrics from MIREX (<ext-link ext-link-type="uri" xlink:href="https://www.music-ir.org/mirex/wiki/2021:Audio_Cover_Song_Identification">https://www.music-ir.org/mirex/wiki/2021:Audio_Cover_Song_Identification</ext-link>) and reported on community datasets like SHS100K-Test (<xref rid="R33" ref-type="bibr">Xu et al., 2018</xref>) and Da-Tacos (<xref rid="R37" ref-type="bibr">Yesiler et al., 2019</xref>). However, both of these datasets are based on the platform SecondHandSongs (SHS) (<ext-link ext-link-type="uri" xlink:href="https://secondhandsongs.com/">https://secondhandsongs.com/</ext-link>) curated by a community of volunteers (<ext-link ext-link-type="uri" xlink:href="https://secondhandsongs.com/page/About">https://secondhandsongs.com/page/About</ext-link>), which makes present cover version collections subject to the selection policies of the platform. For example, <italic>web covers</italic> are considered an individual category of versions characterized by being released non-commercially (<ext-link ext-link-type="uri" xlink:href="https://secondhandsongs.com/page/Guidelines/Entities/WebCover">https://secondhandsongs.com/page/Guidelines/Entities/WebCover</ext-link>). At the same time, they appear to be less relevant for collaborators, since the number of web covers is usually much lower than for commercially released covers, as can be seen for the example &#x2018;Enter Sandman&#x2019; by Metallica (<ext-link ext-link-type="uri" xlink:href="https://secondhandsongs.com/work/6616">https://secondhandsongs.com/work/6616</ext-link>). What is more, due to a technical limitation of the application interface of SHS, the created datasets do not actually contain web covers. This raises the question of whether VI models trained and evaluated on data from SHS consider all relevant characteristics of versions and motivates our first research question:
<list list-type="bullet">
<list-item><p><bold>RQ1:</bold> Do cover version datasets based on the platform SecondHandSongs represent the distributions of cover versions and their characteristics on YouTube?</p></list-item>
</list></p>
<p>We assume that there exists a subset of versions with specific characteristics on YouTube which are relevant in the context of VI but not found on the platform SHS: out-of-distribution data. Consequently, recent VI models are neither trained nor evaluated on data with regard to these characteristics. We therefore propose our second research question:
<list list-type="bullet">
<list-item><p><bold>RQ2:</bold> Which characteristics of versions drive the uncertainty of existing VI models?</p></list-item>
</list></p>
<p>In this paper, we aim to explore the success and the challenges of VI on out-of-distribution data. Rather than relying on the cover version collection SecondHandSongs, we leverage the richness of creativity of the YouTube community. Applying a multi-modal uncertainty sampling approach, we identify the most uncertain version candidates. Subsequently, we obtain human annotations by workers on the crowdsourcing platform Mechanical Turk (MTurk). Lastly, two music experts curate a subset of the dataset and provide annotations of uncertainties in the problem context, together with a taxonomy of these.</p>
<p>In summary, the main contributions are:
<list list-type="bullet">
<list-item><p>we provide a benchmark dataset SHS-YT (<ext-link ext-link-type="uri" xlink:href="https://github.com/progsi/SHS-YT">https://github.com/progsi/SHS-YT</ext-link>) created with a multi-modal uncertainty sampling approach followed by human annotations. It includes labels on an ordinal scale to reflect the complexity of VI on online video platforms (e.g., videos without musical content and identical audio tracks).</p></list-item>
<list-item><p>two experts curate the provided dataset to gather insights into uncertainties in the VI context of online video platforms. We also provide a taxonomy extending an existing one (<xref rid="R34" ref-type="bibr">Yesiler et al., 2021</xref>).</p></list-item>
<list-item><p>our benchmarks show that even the current state-of-the-art model under-performs on our proposed dataset. Additionally, we identify challenging alterations such as the isolation of single instruments or the vocal track which would be better addressed in the field of query-by-humming. This uncovers potential boundaries of cover version definitions.</p></list-item>
</list></p>
</sec>
<sec id="sec2">
<title>Related work</title>
<sec id="sec2_1">
<title>Version identification datasets</title>
<p>VI datasets are composed of versions which are grouped by musical works. During training, VI models are optimized to encode audio representations of versions of the same work as similar and versions of different works as dissimilar. In the evaluation scenario, each version represents a query at a time and the remaining versions are ranked based on the musical similarity computed by the VI model. The resulting <italic>N</italic> rankings for a dataset with <italic>N</italic> versions serve as the input to compute retrieval metrics such as the mean average precision (MAP).</p>
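<p>The evaluation protocol above can be sketched as follows. This is a minimal illustration with toy data, not the authors&#x2019; evaluation code; the similarity matrix and work labels are invented for the example.</p>

```python
def mean_average_precision(sim, works):
    """MAP over N rankings: each version serves as query once, the remaining
    versions are ranked by descending similarity, and a result counts as
    relevant if it belongs to the same work as the query."""
    n = len(works)
    ap_scores = []
    for q in range(n):
        # rank all other versions by descending similarity to the query
        ranked = sorted((c for c in range(n) if c != q),
                        key=lambda c: -sim[q][c])
        hits, precisions = 0, []
        for rank, c in enumerate(ranked, start=1):
            if works[c] == works[q]:
                hits += 1
                precisions.append(hits / rank)
        if precisions:
            ap_scores.append(sum(precisions) / len(precisions))
    return sum(ap_scores) / len(ap_scores)

# toy example: four versions of two works, perfectly separated
sim = [[1.0, 0.9, 0.2, 0.1],
       [0.9, 1.0, 0.3, 0.2],
       [0.2, 0.3, 1.0, 0.8],
       [0.1, 0.2, 0.8, 1.0]]
works = [0, 0, 1, 1]
print(mean_average_precision(sim, works))  # perfect ranking -> 1.0
```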
<p>In <xref ref-type="table" rid="T1">Table 1</xref>, we provide an overview of the most popular datasets in VI which are used for benchmarking, alongside the datasets used in this paper. Recent VI approaches (<xref rid="R10" ref-type="bibr">Du et al., 2021</xref>, <xref rid="R8" ref-type="bibr">2022</xref>, <xref rid="R9" ref-type="bibr">2023</xref>; <xref rid="R19" ref-type="bibr">Hu et al., 2022</xref>; <xref rid="R17" ref-type="bibr">Liu et al., 2023</xref>; <xref rid="R35" ref-type="bibr">Yesiler et al., 2020a</xref>, 2020b) achieve MAP scores up to 0.96 on YouTubeCovers (<xref rid="R31" ref-type="bibr">Silva et al., 2015</xref>) and Covers80 (<xref rid="R11" ref-type="bibr">Ellis, 2011</xref>). The results on the larger datasets SHS100K-Test (<xref rid="R33" ref-type="bibr">Xu et al., 2018</xref>) and the Da-Tacos benchmark subset (<xref rid="R37" ref-type="bibr">Yesiler et al., 2019</xref>) are lower; for instance, CoverHunter (<xref rid="R17" ref-type="bibr">Liu et al., 2023</xref>) achieves the highest MAP but still does not surpass 0.90.</p>
<table-wrap id="T1">
<label>Table 1.</label>
<caption><p>Popular VI benchmark datasets; the seed dataset and our annotated datasets are shown in bold</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Dataset</bold></th>
<th align="center" valign="top"><bold>Works</bold></th>
<th align="center" valign="top"><bold>Versions</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Covers80</td>
<td align="right" valign="top">80</td>
<td align="right" valign="top">160</td>
</tr>
<tr>
<td align="left" valign="top">Da-Tacos</td>
<td align="right" valign="top">1,000</td>
<td align="right" valign="top">15,000</td>
</tr>
<tr>
<td align="left" valign="top">Discogs-VI-YT-Test</td>
<td align="right" valign="top">9,878</td>
<td align="right" valign="top">116,547</td>
</tr>
<tr>
<td align="left" valign="top">SHS100K-Test</td>
<td align="right" valign="top">1,692</td>
<td align="right" valign="top">10,547</td>
</tr>
<tr>
<td align="left" valign="top">YouTubeCovers</td>
<td align="right" valign="top">50</td>
<td align="right" valign="top">350</td>
</tr>
<tr>
<td align="left" valign="top"><bold>SHS-SEED</bold></td>
<td align="right" valign="top">100</td>
<td align="right" valign="top">2,404</td>
</tr>
<tr>
<td align="left" valign="top"><bold>SHS-YT</bold></td>
<td align="right" valign="top">100</td>
<td align="right" valign="top">900</td>
</tr>
<tr>
<td align="left" valign="top"><bold>SHS-YT+2Q</bold></td>
<td align="right" valign="top">100</td>
<td align="right" valign="top">1,092</td>
</tr>
<tr>
<td align="left" valign="top"><bold>SHS-YT+AllQ</bold></td>
<td align="right" valign="top">100</td>
<td align="right" valign="top">3,289</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>All of these datasets except Covers80 use SHS as a data source. The same holds for the respective training sets of VI models: SHS100K-Train and the training subset of Da-Tacos (<xref rid="R35" ref-type="bibr">Yesiler et al., 2020a</xref>), which were used to train the recent VI models. Consequently, versions in the dataset can be found on YouTube, but are only included if they are manually collected by the SHS community. The question remains whether the distribution of variance of versions on YouTube is appropriately represented in existing benchmark datasets. A newer dataset, namely Discogs-VI-YT, is based on Discogs (<ext-link ext-link-type="uri" xlink:href="https://www.discogs.com/">https://www.discogs.com/</ext-link>) rather than SHS. It is currently the largest dataset in VI. Since it is rather new, there are no benchmarks of the state-of-the-art VI models on it yet.</p>
</sec>
<sec id="sec2_2">
<title>Music on YouTube</title>
<p>Various studies address the richness and diversity of versions on YouTube and corresponding classes. In <xref ref-type="table" rid="T2">Table 2</xref> we distinguish between four classes of versions and provide some examples found in existing research. <xref rid="R23" ref-type="bibr">Liikkanen and Salovaara (2015)</xref> state that music is the most popular content type on YouTube. The results were derived from data about YouTube search trends, the most popular videos, and channels. The authors established twelve subclasses of versions segmented into three main classes: official (uploaded by copyright owners), user-appropriated (uploaded by fans) and derivative (e.g., cover versions). While the first two classes are expected to contain highly similar audio, the third class rather relies on music fans and hobby musicians. It includes stronger changes in musical characteristics; for instance, covers on instruments, parodies, or remixes.</p>
<p>The category of user-appropriated versions is also discussed by <xref rid="R25" ref-type="bibr">Martet (2016)</xref>. The author adds a new perspective on versions, including videos which contain versions rather as an accompaniment (e.g., for movie trailers).</p>
<table-wrap id="T2">
<label>Table 2.</label>
<caption><p>Classes and examples of versions on YouTube.</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top"><bold>Class</bold></th>
<th align="center" valign="top"><bold>Examples</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">Official</td>
<td align="center" valign="top">Official Music Video, Professional Live Video</td>
</tr>
<tr>
<td align="left" valign="top">User-Appropriated</td>
<td align="center" valign="top">Lyric Video, Slideshow</td>
</tr>
<tr>
<td align="left" valign="top">Cover</td>
<td align="center" valign="top">Guitar Cover, Parody, Karaoke Version</td>
</tr>
<tr>
<td align="left" valign="top">Other</td>
<td align="center" valign="top">Tutorial, Reaction Video</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>From an application-driven perspective, studies have implemented pipelines to cope with copyright infringement detection (<xref rid="R2" ref-type="bibr">Agrawal &#x0026; Sureka, 2013</xref>) and music retrieval (<xref rid="R21" ref-type="bibr">B. Li &#x0026; Kumar, 2019</xref>; <xref rid="R32" ref-type="bibr">Smith et al., 2017</xref>) on YouTube. <xref rid="R32" ref-type="bibr">Smith et al. (2017)</xref> propose an approach processing audio, text, and video features to predict a version class. Similar to <xref rid="R3" ref-type="bibr">Airoldi et al. (2016)</xref> as well as <xref rid="R23" ref-type="bibr">Liikkanen and Salovaara (2015)</xref>, the authors established classes like remixes and tutorials beside official music videos and live performances. Another approach to model classes of versions on YouTube derived clusters of categories of versions by a network analysis (<xref rid="R3" ref-type="bibr">Airoldi et al., 2016</xref>). The analysis revealed clusters corresponding to musical genres and situational contexts (e.g., covers and tutorials).</p>
<p>While the classes of versions in all of these studies might be relevant for VI research, their consideration in the field is rather limited. <xref rid="R34" ref-type="bibr">Yesiler et al. (2021)</xref> construct a taxonomy where they also mention some classes and the corresponding alterations of musical characteristics. To the best of our knowledge, no existing benchmarks of VI models investigated the impact of the mentioned alterations on model robustness.</p>
<fig id="F1">
<label>Figure 1.</label>
<caption><p>Dataset creation</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c92-fig1.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
</sec>
<sec id="sec3">
<title>Data creation</title>
<p>We here describe the steps of the creation process of the dataset SHS-YT as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<p>We aim to evaluate the performance of state-of-the-art VI models on out-of-distribution data. We select YouTube as a rich source for a diverse set of versions, since there are no constraints for uploaders as opposed to the policies on SHS.</p>
<p>To cover a representative subset of western popular music, we select the widely used SHS100K-Test as a seed dataset. In particular, we choose the first 100 works from its test subset (<ext-link ext-link-type="uri" xlink:href="https://github.com/NovaFrost/SHS100K/blob/master/SHS100K-TEST">https://github.com/NovaFrost/SHS100K/blob/master/SHS100K-TEST</ext-link>). These works are represented by 2,859 versions of which we successfully retrieved 2,397. We denote this dataset with 2,015 unique performers as <italic>SHS-SEED</italic>.</p>
<sec id="sec3_1">
<title>Candidate retrieval</title>
<p>The goal of the candidate retrieval step is to obtain a set of candidate versions to be included in our dataset. We apply the approach of <xref rid="R17" ref-type="bibr">Hachmeier et al. (2022)</xref> to formulate multiple text queries per work in SHS-SEED. We utilize the strings for performer and title of the first version of each work to formulate queries and additionally formulate new queries using YouTube search suggestions (the list of queries per work can be found in our repository). On average, we formulate 44 text queries per work, resulting in 4,365 text queries. We retrieve metadata for the top 50 videos per query (<ext-link ext-link-type="uri" xlink:href="https://pypi.org/project/youtube-search-python">https://pypi.org/project/youtube-search-python</ext-link>) and drop videos with a length of 10 minutes or more. We denote the resulting collection of 94,358 videos as <italic>YT-CRAWL</italic>. We download the audio files for all videos with a sampling rate of 22,050 Hz (<ext-link ext-link-type="uri" xlink:href="https://github.com/yt-dlp">https://github.com/yt-dlp</ext-link>) and extract CREMA features which we use in the next step (<ext-link ext-link-type="uri" xlink:href="https://github.com/bmcfee/crema">https://github.com/bmcfee/crema</ext-link>).</p>
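<p>A rough sketch of the query formulation and duration filtering described above; the function names and the <italic>duration</italic> field are our assumptions for illustration, not the actual implementation.</p>

```python
def build_queries(performer, title, suggestions=()):
    # base query from performer and title of the first version of a work,
    # extended with queries derived from YouTube search suggestions
    return [f"{performer} {title}", *suggestions]

def filter_videos(videos, max_seconds=600):
    # drop videos with a length of 10 minutes or more
    # (each video is assumed to be a metadata dict with 'duration' in seconds)
    return [v for v in videos if v["duration"] < max_seconds]

queries = build_queries("Metallica", "Enter Sandman",
                        ["metallica enter sandman cover"])
kept = filter_videos([{"duration": 312}, {"duration": 3600}])
```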
</sec>
<sec id="sec3_2">
<title>Uncertainty sampling</title>
<p>In the uncertainty sampling step, we aim to reduce the number of versions to a smaller subset for two reasons: first, we are limited in annotation capacity; second, we aim to focus on out-of-distribution data, i.e., versions with characteristics not commonly found on SHS. We leverage the modalities of audio (CREMA features) and text (YouTube metadata). For both domains, we use models based on deep learning as proxies. In theory, only the audio information is necessary to determine whether two versions are associated with each other. However, we use the text-based proxy to systematically find candidates where the VI proxy over- or underestimates the musical similarity.</p>
<sec id="sec3_2_1">
<title>Modality proxies</title>
<p>We use the pre-trained model Re-MOVE (<xref rid="R36" ref-type="bibr">Yesiler et al., 2020b</xref>) as a proxy in the audio/music domain, which was one of the best-performing approaches for VI at the time of dataset creation. The model processes CREMA features, which represent harmonic and melodic progressions, and encodes these into 256-dimensional embeddings. The cosine similarity of a pair of embeddings represents their musical similarity. For the text domain, we use the entity matching model Ditto (<xref rid="R17" ref-type="bibr">Y. Li et al., 2020</xref>). The model is based on BERT (<xref rid="R7" ref-type="bibr">Devlin et al., 2018</xref>), encodes pairs of textual entities into BERT embeddings, and predicts a binary matching confidence. From the SHS100K-Train dataset we create a train, validation, and test set with a ratio of 3:1:1 as proposed by <xref rid="R17" ref-type="bibr">Y. Li et al. (2020)</xref>, with each containing positive and negative pairs of YouTube videos in a 1:4 ratio. We gather the negative pairs by randomly sampling videos from another randomly selected work. We use all of the proposed data augmentation techniques and the best performing language model (RoBERTa) as described by <xref rid="R17" ref-type="bibr">Y. Li et al. (2020)</xref>. We apply the best model checkpoint evaluated on the test set after 50 epochs for our matching task.</p>
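<p>The negative-pair sampling for the Ditto training data can be sketched as follows; this is a simplified illustration, and the data layout (a mapping from work ids to lists of video ids) is our assumption.</p>

```python
import random

def sample_negative(videos_by_work, work_id, rng=random):
    """Form a negative pair by picking a video from another randomly
    selected work (videos_by_work maps work ids to video id lists)."""
    other_works = [w for w in videos_by_work if w != work_id]
    other = rng.choice(other_works)
    return rng.choice(videos_by_work[other])
```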
<p>Since the inclusion of YouTube descriptions yielded inferior results (F1 score of 0.27 against 0.95), we solely process YouTube titles and channel names.</p>
</sec>
<sec id="sec3_2_2">
<title>Similarity and matching confidence aggregation</title>
<p>For each candidate in YT-CRAWL we compute a similarity and matching confidence with the proxy models. Since each of the works is represented by multiple versions in SHS-SEED (24 on average), we must aggregate the pairwise similarities and model confidences. For a work <italic>i</italic>, a set of query versions from SHS-SEED <italic>Q<sub>i</sub></italic> and a candidate version <italic>c<sub>ij</sub></italic> from YT-CRAWL, we compute the musical similarity <italic>S<sub>m</sub>(c<sub>ij</sub>)</italic> as the arithmetic mean of the cosine similarities of the Re-MOVE outputs of all pairs <italic>(c<sub>ij</sub>,q)</italic> for <italic>q</italic> &#x2208; <italic>Q<sub>i</sub></italic>. In a preliminary experiment on the validation dataset of SHS100K we validated the aggregation by the arithmetic mean as opposed to aggregation by maximum. We further compute the textual similarity <italic>S<sub>t</sub>(c<sub>ij</sub>)</italic> for the same pairs as the maximum matching confidence based on Ditto. The motivation is that candidates with non-matching metadata among the queries should not impact the matching decision as long as at least one query in <italic>Q<sub>i</sub></italic> matches. This is especially relevant in cases with translated version titles. For instance, the version title &#x2018;Tiempo de Verano&#x2019; (Spanish for &#x2018;Time of the Summer&#x2019; or &#x2018;Summertime&#x2019;), which is potentially a substring within a YouTube title, might match the version title &#x2018;Summertime&#x2019; with rather low confidence. Based on the aggregated values <italic>S<sub>m</sub>(c<sub>ij</sub>)</italic> and <italic>S<sub>t</sub>(c<sub>ij</sub>)</italic> for all candidates in YT-CRAWL, we conduct uncertainty sampling with two approaches: disagreement sampling and mutual uncertainty.</p>
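<p>The aggregation of the two proxy outputs can be sketched as follows; a minimal illustration with toy values, assuming the pairwise proxy outputs are precomputed.</p>

```python
from statistics import mean

def aggregate(audio_sims, text_confs):
    """For one candidate c_ij against all queries q in Q_i:
    S_m(c_ij) = arithmetic mean of the Re-MOVE cosine similarities,
    S_t(c_ij) = maximum of the Ditto matching confidences."""
    return mean(audio_sims), max(text_confs)

# one matching title among the queries suffices for a high S_t,
# e.g. when some queries carry translated version titles
s_m, s_t = aggregate([0.8, 0.6, 0.7], [0.10, 0.95, 0.20])
```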
</sec>
<sec id="sec3_2_3">
<title>Disagreement sampling</title>
<p>We establish two disagreement groups: <italic>DisagrAudio</italic> denotes the candidates where the musical similarity is high in contrast to the textual similarity and <italic>DisagrText</italic> represents the contrary case. We measure the disagreement as the absolute difference as shown in <xref ref-type="table" rid="T3">Table 3</xref> and select the three candidates with the highest disagreement for both disagreement groups per work.</p>
<table-wrap id="T3">
<label>Table 3.</label>
<caption><p>Uncertainty groups and their constraints. We sample the top three results returned by each ranking function.</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top"><bold>Group</bold></th>
<th align="center" valign="top"><bold>Ranking Function</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">DisagrAudio</td>
<td align="center" valign="top"><italic>S<sub>m</sub>(c<sub>ij</sub>) - S<sub>t</sub>(c<sub>ij</sub>)</italic> if <italic>S<sub>m</sub>(c<sub>ij</sub>) > S<sub>t</sub>(c<sub>ij</sub>)</italic></td>
</tr>
<tr>
<td align="left" valign="top">DisagrText</td>
<td align="center" valign="top"><italic>S<sub>t</sub>(c<sub>ij</sub>) - S<sub>m</sub>(c<sub>ij</sub>)</italic> if <italic>S<sub>t</sub>(c<sub>ij</sub>) > S<sub>m</sub>(c<sub>ij</sub>)</italic></td>
</tr>
<tr>
<td align="left" valign="top">DisagrUnc</td>
<td align="center" valign="top">-||<italic>S(c<sub>ij</sub>) - S<sup>*</sup>(C<sub>i</sub>)</italic>||<sub>2</sub></td>
</tr>
</tbody>
</table>
</table-wrap>
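<p>The ranking functions of Table 3 for the two disagreement groups can be sketched as follows; the names and the candidate tuple layout are ours, and the aggregated scores are assumed to be precomputed.</p>

```python
def disagreement_groups(candidates, k=3):
    """candidates: list of (candidate_id, s_m, s_t) tuples for one work.
    Returns the top-k candidates per disagreement group, ranked by the
    difference between musical and textual similarity."""
    audio = [(s_m - s_t, cid) for cid, s_m, s_t in candidates if s_m > s_t]
    text = [(s_t - s_m, cid) for cid, s_m, s_t in candidates if s_t > s_m]
    top = lambda group: [cid for _, cid in sorted(group, reverse=True)[:k]]
    return {"DisagrAudio": top(audio), "DisagrText": top(text)}

groups = disagreement_groups([("a", 0.9, 0.1), ("b", 0.2, 0.8),
                              ("c", 0.6, 0.5), ("d", 0.1, 0.9)])
```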
</sec>
<sec id="sec3_2_4">
<title>Mutual uncertainty</title>
<p>We denote the mutual uncertainty group by <italic>DisagrUnc</italic>, containing the top three candidates with the highest mutual uncertainty. Works with fewer than three candidates for DisagrAudio are filled with samples from this group as well. As shown in <xref ref-type="table" rid="T3">Table 3</xref>, we compute the mutual uncertainty as the negative Euclidean distance between the two-dimensional vector <italic>S(c<sub>ij</sub>)=[S<sub>m</sub>(c<sub>ij</sub>),S<sub>t</sub>(c<sub>ij</sub>)]<sup>T</sup></italic> and the vector <italic>S<sup>*</sup>(C<sub>i</sub>)</italic>, representing the center of uncertainty based on all candidates for the work <italic>C<sub>i</sub></italic>, defined as follows:</p>
<p>(1) <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:msup><mml:mi>S</mml:mi><mml:mo>*</mml:mo></mml:msup><mml:mfenced><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>=</mml:mo><mml:msup><mml:mfenced open="[" close="]"><mml:mrow><mml:msubsup><mml:mi>S</mml:mi><mml:mi>m</mml:mi><mml:mo>*</mml:mo></mml:msubsup><mml:mfenced><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:msubsup><mml:mi>S</mml:mi><mml:mi>t</mml:mi><mml:mo>*</mml:mo></mml:msubsup><mml:mfenced><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced><mml:mi>T</mml:mi></mml:msup></mml:math></inline-formula></p>
<p>with</p>
<p>(2) <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:msubsup><mml:mi>S</mml:mi><mml:mi>&#x03B8;</mml:mi><mml:mo>*</mml:mo></mml:msubsup><mml:mfenced><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mfenced><mml:mrow><mml:msubsup><mml:mi>S</mml:mi><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>min</mml:mi></mml:mrow></mml:msubsup><mml:mfenced><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>+</mml:mo><mml:msubsup><mml:mi>S</mml:mi><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>max</mml:mi></mml:mrow></mml:msubsup><mml:mfenced><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:math></inline-formula></p>
<p>where &#x03B8; &#x2208; {<italic>m,t</italic>} and <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:msubsup><mml:mi>S</mml:mi><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>min</mml:mi></mml:mrow></mml:msubsup><mml:mfenced><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mtext>&#x2009;</mml:mtext><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mtext>&#x2009;</mml:mtext><mml:msubsup><mml:mi>S</mml:mi><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>max</mml:mi></mml:mrow></mml:msubsup><mml:mfenced><mml:mrow><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:math></inline-formula> return the minimum and maximum of the cosine similarities or matching confidences for all the candidates in <italic>C<sub>i</sub></italic>, respectively. In the following, we describe our annotation process of the resulting nine candidates per work.</p>
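<p>A sketch of the mutual uncertainty computation following Equations 1 and 2; the toy values and function names are ours.</p>

```python
import math

def center_of_uncertainty(scores):
    """Midpoint of minimum and maximum per dimension over all candidate
    score pairs (s_m, s_t) of one work (Equation 2)."""
    s_m = [m for m, _ in scores]
    s_t = [t for _, t in scores]
    return ((min(s_m) + max(s_m)) / 2, (min(s_t) + max(s_t)) / 2)

def mutual_uncertainty(score, center):
    """Negative Euclidean distance to the center of uncertainty: the closer
    a candidate is to the center, the higher (less negative) the value."""
    return -math.dist(score, center)

center = center_of_uncertainty([(0.0, 0.2), (1.0, 0.8), (0.5, 0.5)])
```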
</sec>
</sec>
<sec id="sec3_3">
<title>Annotation</title>
<p>We impose an ordinal scale of classes and obtain annotations from workers on Amazon&#x2019;s Mechanical Turk (MTurk) and from in-house experts.</p>
<sec id="sec3_3_1">
<title>Relevance classes</title>
<p>Prior VI datasets solely consider the membership of a version to a work (binary label). Hence, each item in the dataset is expected to contain music. Further, the versions of the same work are expected to differ in aspects such as tempo or timbre (<xref rid="R34" ref-type="bibr">Yesiler et al., 2021</xref>). Neither is guaranteed for our retrieved candidates from YouTube, since videos are not even guaranteed to contain music. We construct four classes on an ordinal scale with respect to the relevance to the query version:
<list list-type="bullet">
<list-item><p><italic>NoMusic</italic>: candidate version does not contain music and is not relevant.</p></list-item>
<list-item><p><italic>NonVersion</italic>: candidate version contains music but is not derived from the same work as the query version and is not relevant.</p></list-item>
<list-item><p><italic>Version</italic>: candidate version is derived from the same work as the query version and is therefore relevant.</p></list-item>
<list-item><p><italic>Match</italic>: candidate version includes (parts of) the exact same audio as the original version it is derived from (<italic>user-appropriated</italic> videos). The version is relevant.</p></list-item>
</list></p>
<p>We represent each work <italic>i</italic> by a query version, a random version from SHS-SEED. The goal of the annotation step is to gather annotations about the relevance between <italic>i</italic> and each candidate in the set. We denote the resulting set of 900 annotated versions as <italic>SHS-YT</italic>.</p>
</sec>
<sec id="sec3_3_2">
<title>Crowdsourcing</title>
<p>We publish one human intelligence task (HIT) on MTurk per work with instructions and examples as shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. Each HIT contains the query version, the nine candidates and a quality-check candidate with a known answer <italic>Version</italic> or <italic>NonVersion</italic> based on the works and versions in SHS-SEED. To simplify the task, we explicitly instruct that excerpts are sufficient (e.g., a medley is a <italic>Version</italic> if it contains an excerpt of the query).</p>
<p>The interface and manual presented to the workers can be found in our published dataset.</p>
<fig id="F2">
<label>Figure 2.</label>
<caption><p>Our instructions and examples to workers as presented on MTurk. Please note that the examples on the right are cropped to fit</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c92-fig2.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>We measured the average time effort per annotation pair at 90 seconds and thus expect 15 minutes per HIT. We pay a reward of 3.2 US dollars per HIT, corresponding to our domestic minimum wage and compensating our estimated time effort in consideration of the average exchange rate between our currency and the US dollar at annotation time.</p>
<p>We collect assignments from up to five workers per HIT. Following best practices to achieve annotation quality (<xref rid="R15" ref-type="bibr">Ghosh et al., 2019</xref>; <xref rid="R26" ref-type="bibr">Matherly, 2018</xref>; <xref rid="R28" ref-type="bibr">Mellis &#x0026; Bickel, 2020</xref>; <xref rid="R30" ref-type="bibr">Peer et al., 2013</xref>), we only permit workers with more than 100 approved HITs and an approval rate of at least 99%. We reject assignments where workers fail the quality check or complete the assignment in less than ten seconds. In some cases, we accept assignments with failed quality checks because workers provide proper justifications; we do not include these assignments in our dataset. The final worker labels are obtained by majority voting: a minimum of three equal labels determines the decision for the label. Candidates which remain without a final label due to high variance in label responses are curated by the experts in the next step.</p>
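<p>The majority-voting rule can be sketched in a few lines of Python (an illustrative sketch, not the authors&#x2019; implementation): a candidate receives a final label only if at least three of the up-to-five worker labels agree, and is otherwise passed on to expert curation.</p>

```python
from collections import Counter

def majority_label(worker_labels, min_votes=3):
    """Return the final label if at least `min_votes` workers agree,
    otherwise None (the candidate is forwarded to expert curation)."""
    if not worker_labels:
        return None
    label, count = Counter(worker_labels).most_common(1)[0]
    return label if count >= min_votes else None

# Three equal labels decide the outcome ...
assert majority_label(["Version", "Version", "Version", "NonVersion"]) == "Version"
# ... while high variance leaves the candidate undecided
assert majority_label(["Version", "NonVersion", "Match"]) is None
```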
</sec>
<sec id="sec3_3_3">
<title>Curation</title>
<p>We employ two music experts for curation of the annotated dataset (both have 15 years of musical experience on harmonic instruments). The experts&#x2019; task is to check the workers&#x2019; relevance labels for correctness, to decide on a relevance label in undecided cases and to annotate the most prominent reason that makes a candidate difficult to annotate (uncertainty class). In cases of uncertainty, the experts discuss their decisions. Ultimately, experts and authors agreed to include boundary cases (e.g., remixes) as versions as well.</p>
<p>The first expert curates candidates labeled with <italic>NoMusic</italic> and 167 candidates with failed majority votes due to ties or a shortage of worker assignments (because of failed assignment quality checks). Based on the reasons for uncertainty collected by the first expert, we formulate uncertainty classes and distinguish between uncertainties related to the version itself (e.g., <italic>Song: Instrumental</italic>) and uncertainties related to the version in the context of its occurrence in an online video (e.g., <italic>Video: With Non-Music</italic>). Some uncertainty classes only apply to one relevance class; for example, <italic>Song: Same Artist</italic> only applies if the candidate is a <italic>NonVersion</italic>. We provide full documentation in our published repository.</p>
<p>The second expert utilizes the uncertainty classes directly and curates all candidates labeled with <italic>Version</italic> and the 96 most similar candidates labeled with <italic>NonVersion</italic> (measured in mean Cosine similarity per benchmarking model as explained in the previous section) for error analysis. New uncertainty classes are collected and iteratively formulated, resulting in a total of 19 uncertainty classes. Based on these classes derived from observed examples, we construct a taxonomy of alterations.</p>
</sec>
</sec>
<sec id="sec4">
<title>Dataset analysis</title>
<sec id="sec4_1">
<title>Overview</title>
<p>We present the distributions of numerical YouTube attributes of SHS-YT in <xref ref-type="fig" rid="F3">Figure 3</xref>. We observe a strong peak in duration around 3.5 minutes and in uploading dates between 2020 and 2022.</p>
<fig id="F3">
<label>Figure 3.</label>
<caption><p>Gaussian kernel density estimates for duration (left) and uploading date (right) of the videos in the SHS-YT dataset. The bandwidth parameter is estimated by Scott&#x2019;s method</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c92-fig3.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>In <xref ref-type="table" rid="T4">Table 4</xref> we show counts per annotation class and sampling group. The dataset mostly contains versions of other works than their respective query versions, but also 197 versions of the same works. <italic>NoMusic</italic> versions mostly occurred in the DisagrText group, which is expected, since a modeled musical similarity by Re-MOVE is rather unlikely in the absence of actual music. Similarly, all 4 <italic>Match</italic> versions occur in the DisagrAudio sampling group. SHS-YT contains 5 versions which are also contained in Da-Tacos; all are labeled with <italic>NonVersion</italic>. Regarding SHS100K, SHS-YT contains 5 versions from the test subset (but from other works than in SHS-SEED), 2 from the validation subset and 13 candidates from the training subset. All of these candidates but one are annotated as <italic>NonVersion</italic>.</p>
<table-wrap id="T4">
<label>Table 4.</label>
<caption><p>Counts of annotated candidates per relevance class and sampling group</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top"></th>
<th align="center" valign="top"><bold>Match</bold></th>
<th align="center" valign="top"><bold>Version</bold></th>
<th align="center" valign="top"><bold>NonVersion</bold></th>
<th align="center" valign="top"><bold>NoMusic</bold></th>
<th align="center" valign="top"><bold>&#x03A3;</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">DisagrAudio</td>
<td align="center" valign="top">4</td>
<td align="center" valign="top">89</td>
<td align="center" valign="top">200</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">293</td>
</tr>
<tr>
<td align="left" valign="top">DisagrText</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">82</td>
<td align="center" valign="top">142</td>
<td align="center" valign="top">76</td>
<td align="center" valign="top">300</td>
</tr>
<tr>
<td align="left" valign="top">DisagrUnc</td>
<td align="center" valign="top">0</td>
<td align="center" valign="top">26</td>
<td align="center" valign="top">280</td>
<td align="center" valign="top">1</td>
<td align="center" valign="top">307</td>
</tr>
<tr>
<td align="left" valign="top">&#x03A3;</td>
<td align="center" valign="top">4</td>
<td align="center" valign="top">197</td>
<td align="center" valign="top">622</td>
<td align="center" valign="top">77</td>
<td align="center" valign="top">900</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In <xref ref-type="fig" rid="F4">Figure 4</xref> we show the relative amounts of uncertainty classes, excluding the placeholder for the 104 versions that are non-ambiguous according to the experts.</p>
<p>Non-musical content is the most represented uncertainty for <italic>Versions</italic> with 14% (<italic>n</italic>=77), followed by vocal-only. For <italic>NonVersions</italic>, the most frequent uncertainty is musical similarity between versions (<italic>Song: Similar</italic>) at 12%, followed by <italic>NonVersions</italic> from the same artist as the query version at 11%.</p>
<fig id="F4">
<label>Figure 4.</label>
<caption><p>Relative proportions of the annotated uncertainty classes</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c92-fig4.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
<sec id="sec4_2">
<title>Annotation quality</title>
<p>Comparing the aggregated worker labels with expert labels for our 513 curated versions results in a Kendall&#x2019;s Tau of 0.81, indicating a strong positive association. However, the agreement among workers measured in Krippendorff&#x2019;s Alpha is just 0.43. This moderate level of inter-rater agreement might be partly due to the similarity of the VI task to the audio music similarity task, which is generally associated with limited agreement, as discussed in previous studies (<xref rid="R6" ref-type="bibr">Daikoku et al., 2020</xref>; <xref rid="R14" ref-type="bibr">Flexer et al., 2021</xref>; <xref rid="R12" ref-type="bibr">Flexer &#x0026; Grill, 2016</xref>; <xref rid="R13" ref-type="bibr">Flexer &#x0026; Lallai, 2019</xref>; <xref rid="R20" ref-type="bibr">Jones et al., 2007</xref>). Looking at the annotated uncertainty classes for candidates that were falsely labeled according to the expert (<italic>n</italic>=84) or did not achieve a majority vote (<italic>n</italic>=167) uncovers some potential issues for workers. Especially versions which combine non-musical and musical content seem to confuse workers (<italic>n</italic>=51). We found examples from &#x2018;The Voice&#x2019; (<ext-link ext-link-type="uri" xlink:href="https://youtu.be/aii62acsp_E">https://youtu.be/aii62acsp_E</ext-link>) and a movie scene from &#x2018;Cocktail&#x2019; (<ext-link ext-link-type="uri" xlink:href="https://youtu.be/EFuBvEt84OI">https://youtu.be/EFuBvEt84OI</ext-link>).</p>
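<p>For readers who want to reproduce such agreement figures from raw worker labels, the following is a minimal sketch of Krippendorff&#x2019;s Alpha for nominal data (a simplified illustration of the standard coincidence-matrix formulation; the ordinal variant, which weights disagreements by their distance on the relevance scale, would match our classes more closely):</p>

```python
from collections import defaultdict
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha (nominal data). `units` is a list of label
    lists, one list per annotated candidate; units with fewer than two
    labels are not pairable and are skipped."""
    o = defaultdict(float)                    # coincidence matrix o[c, k]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):  # ordered pairs of labels
            o[(c, k)] += 1.0 / (m - 1)
    n_c = defaultdict(float)                  # per-class marginals
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    d_o = sum(v for (c, k), v in o.items() if c != k)  # observed disagreement
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e
```

Perfect agreement yields 1.0, chance-level agreement 0.0.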
</sec>
</sec>
<sec id="sec5">
<title>Benchmark</title>
<p>In this section, we conduct a benchmark on our proposed dataset with the goal of gathering insights about VI performance on out-of-distribution data. Since VI is a matching problem, we require relevant versions for all works in the dataset, which is not the case for SHS-YT. Hence, we include versions from SHS-SEED. We construct two benchmark datasets derived from SHS-YT, which we also show in <xref ref-type="table" rid="T1">Table 1</xref>. For both datasets, we exclude the versions which are included in the training and validation datasets of SHS100K:
<list list-type="bullet">
<list-item><p><italic>SHS-YT+2Q:</italic> SHS-YT with the query versions used for human annotation and one additional work from SHS-SEED. We select the version with the lowest version identifier, which either is the original version or at least an earlier version derived from the original. This dataset includes at least two relevant versions per work. In this dataset of 1,092 versions, our annotated versions account for around 82%. The 312 versions labeled as irrelevant (<italic>NonVersion</italic> and <italic>NoMusic</italic>) account for 29% of the dataset.</p></list-item>
<list-item><p><italic>SHS-YT+AllQ:</italic> Our proposed dataset with all versions from SHS-SEED, resulting in 3,289 versions. Here, our annotated versions account for around 27% of all versions. The 312 versions labeled as irrelevant account for 9% of the dataset.</p></list-item>
</list></p>
<p>Besides the modality proxies described before, we evaluate two other VI models and a fuzzy matching baseline. CQTNet (<xref rid="R38" ref-type="bibr">Yu et al., 2020</xref>) is a VI model consisting mainly of convolutional neural networks. It processes constant-Q transform spectrograms (CQTs). The current state-of-the-art model is CoverHunter (<xref rid="R17" ref-type="bibr">Liu et al., 2023</xref>), which also processes CQTs but includes a conformer backbone (<xref rid="R16" ref-type="bibr">Gulati et al., 2020</xref>) and an attention mechanism (<xref rid="R29" ref-type="bibr">Okabe et al., 2018</xref>). The model is trained with a coarse-to-fine training scheme to address the alignment problem.</p>
<p>Both models are trained and validated on SHS100K. We use the pre-trained models provided by the authors. In contrast to the models we benchmark, the approach by Abrassart and Doras (<xref rid="R1" ref-type="bibr">Abrassart &#x0026; Doras, 2022</xref>), LyraCNet (<xref rid="R19" ref-type="bibr">Hu et al., 2022</xref>) and the ByteCover models (<xref rid="R10" ref-type="bibr">Du et al., 2021</xref>, <xref rid="R8" ref-type="bibr">2022</xref>, <xref rid="R9" ref-type="bibr">2023</xref>) are not publicly available (<ext-link ext-link-type="uri" xlink:href="https://github.com/Orfium/bytecover">https://github.com/Orfium/bytecover</ext-link>).</p>
<sec id="sec5_1">
<title>Overall performance</title>
<p>First, we evaluate the performance of the models as in traditional VI research and consider only the binary label (relevant or not).</p>
<p>We report two evaluation metrics suggested by MIREX (<ext-link ext-link-type="uri" xlink:href="https://www.music-ir.org/mirex/wiki/2021:Audio_Cover_Song_Identification">https://www.music-ir.org/mirex/wiki/2021:Audio_Cover_Song_Identification</ext-link>): mean average precision (MAP) and mean rank of the first relevant item (MR1). Since precision over the first 10 items is not a fair metric for works with fewer than 10 relevant items, we omit it in our evaluation.</p>
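<p>Both metrics are straightforward to compute from binary relevance judgements. The sketch below is our own illustration of the standard definitions (not MIREX code) and takes one ranked 0/1 relevance list per query:</p>

```python
def average_precision(ranked_relevance):
    """Average precision for one query, given its ranked list of
    binary relevance judgements (1 = relevant, 0 = irrelevant)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """MAP: average precision, averaged over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

def mean_rank_first(rankings):
    """MR1: mean rank of the first relevant item over all queries
    (assumes every query has at least one relevant item)."""
    firsts = [next(rank for rank, rel in enumerate(r, start=1) if rel)
              for r in rankings]
    return sum(firsts) / len(firsts)
```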
<table-wrap id="T5">
<label>Table 5.</label>
<caption><p>Benchmark results of the VI models, the entity resolution model Ditto, and Fuzzy, the token set ratio from rapidfuzz (<xref rid="R5" ref-type="bibr">Bachmann, 2021</xref>)</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top" rowspan="2"></th>
<th align="left" valign="top"></th>
<th align="center" valign="top" colspan="2"><bold>SHS-YT+2Q</bold></th>
<th align="center" valign="top" colspan="2"><bold>SHS-YT+AllQ</bold></th>
<th align="center" valign="top" colspan="2"><bold>SHS100K-Test</bold></th>
<th align="center" valign="top" colspan="2"><bold>Da-Tacos</bold></th>
</tr>
<tr>
<th align="left" valign="top"><bold>Model</bold></th>
<th align="center" valign="top"><bold>MAP</bold></th>
<th align="center" valign="top"><bold>MR1</bold></th>
<th align="center" valign="top"><bold>MAP</bold></th>
<th align="center" valign="top"><bold>MR1</bold></th>
<th align="center" valign="top"><bold>MAP</bold></th>
<th align="center" valign="top"><bold>MR1</bold></th>
<th align="center" valign="top"><bold>MAP</bold></th>
<th align="center" valign="top"><bold>MR1</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top" rowspan="3">Audio</td>
<td align="center" valign="top">CoverHunter</td>
<td align="center" valign="top">0.52</td>
<td align="center" valign="top">44.5</td>
<td align="center" valign="top">0.83</td>
<td align="center" valign="top">8.1</td>
<td align="center" valign="top">0.86</td>
<td align="center" valign="top">11.9</td>
<td align="center" valign="top">0.85</td>
<td align="center" valign="top">12.2</td>
</tr>
<tr>
<td align="center" valign="top">CQTNet</td>
<td align="center" valign="top">0.50</td>
<td align="center" valign="top">35.8</td>
<td align="center" valign="top">0.72</td>
<td align="center" valign="top">12.4</td>
<td align="center" valign="top">0.66</td>
<td align="center" valign="top">54.9</td>
<td align="center" valign="top">0.74</td>
<td align="center" valign="top">10.7</td>
</tr>
<tr>
<td align="center" valign="top">Re-MOVE</td>
<td align="center" valign="top">0.40</td>
<td align="center" valign="top">86.9</td>
<td align="center" valign="top">0.56</td>
<td align="center" valign="top">18.5</td>
<td align="center" valign="top">0.53</td>
<td align="center" valign="top">38.0</td>
<td align="center" valign="top">0.52</td>
<td align="center" valign="top">38.0</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="2">Text</td>
<td align="center" valign="top">Ditto</td>
<td align="center" valign="top">0.39</td>
<td align="center" valign="top">73.78</td>
<td align="center" valign="top">0.78</td>
<td align="center" valign="top">18.5</td>
<td align="center" valign="top">-</td>
<td align="center" valign="top">-</td>
<td align="center" valign="top">-</td>
<td align="center" valign="top">-</td>
</tr>
<tr>
<td align="center" valign="top">Fuzzy</td>
<td align="center" valign="top">0.24</td>
<td align="center" valign="top">101.3</td>
<td align="center" valign="top">0.46</td>
<td align="center" valign="top">14.3</td>
<td align="center" valign="top">-</td>
<td align="center" valign="top">-</td>
<td align="center" valign="top">-</td>
<td align="center" valign="top">-</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In <xref ref-type="table" rid="T5">Table 5</xref> we report the respective results on our benchmark datasets, SHS100K-Test and Da-Tacos. Please note that we exclude Discogs-VI-YT (<xref rid="R4" ref-type="bibr">Araz et al., 2024</xref>), since it was published after our experiments. Furthermore, both evaluation metrics are sensitive to dataset size, and the size differences are not negligible (see <xref ref-type="table" rid="T1">Table 1</xref>). Smaller datasets usually promote a higher MAP, yet even though SHS-YT+2Q is smaller than the other datasets, we observe a rather strong performance drop in MAP between -34% (CoverHunter) and -13% (Re-MOVE). The performance drop is less apparent for CoverHunter on SHS-YT+AllQ, and the performance even increases compared to SHS100K-Test for the other VI models. While this is likely due to the larger number of versions from SHS-SEED, we further look into the pairwise Cosine similarities for different pairwise relationships in the following section.</p>
<p>A closing remark on the overall evaluation is the potential influence of sampling bias on the performance of Re-MOVE and Ditto, since these models were used as modality proxies during dataset creation.</p>
</sec>
<sec id="sec5_2">
<title>Distributions of Cosine similarities</title>
<p>To support a more well-grounded verdict about the difference between the distribution of versions in SHS-YT and versions on SHS, and hence in datasets like SHS100K and Da-Tacos, we investigate the Cosine similarities of pairs of versions. A version from SHS-SEED can be considered a baseline version (<italic>SHS-Version</italic>). Our RQ1 aims to uncover whether existing VI models treat two <italic>SHS-Versions</italic> of the same work as more similar than an <italic>SHS-Version</italic> and a version from SHS-YT of the same work (<italic>YT-Version</italic>). Similarly, the question arises whether <italic>NonVersions</italic> from SHS-YT (<italic>YT-NonVersion</italic>) are more similar than other <italic>NonVersions</italic> from SHS-SEED (<italic>SHS-NonVersion</italic>): the former are versions in the same YouTube result sets (e.g., of the same artists) and the latter are random other versions.</p>
<p>In <xref ref-type="table" rid="T6">Table 6</xref> we show statistics about the respective Cosine similarity distributions of SHS-Versions compared to other types of versions based on the relevance class. We observe that the similarities among SHS-Versions are significantly higher than their similarities to <italic>YT-Versions</italic>. Also, similarities of SHS-Versions of different works are significantly lower than their similarities to <italic>YT-NonVersions</italic>, albeit with a lower effect size. Both observations are likely a reason for less consistent rankings based on the tested VI models and hence the lower MAP scores observed in the previous section. Additionally, these insights substantiate an answer to RQ1: there exist in fact different distributions of versions on SHS and YouTube.</p>
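<p>The statistics in this section rest on pairwise Cosine similarities between model embeddings and on a two-sample t-test over the resulting similarity groups. A minimal, dependency-free sketch of both building blocks follows (illustrative only; the embedding values are hypothetical, and a p-value would additionally require the t-distribution CDF, e.g. via scipy.stats.ttest_ind):</p>

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def pairwise_similarities(group_a, group_b):
    """All Cosine similarities between the embeddings of two groups,
    e.g. SHS-Versions vs. YT-Versions of the same work."""
    return [cosine(u, v) for u in group_a for v in group_b]

def welch_t(x, y):
    """Welch's t statistic comparing the means of two similarity samples."""
    vx, vy = stdev(x) ** 2, stdev(y) ** 2
    return (mean(x) - mean(y)) / math.sqrt(vx / len(x) + vy / len(y))
```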
<table-wrap id="T6">
<label>Table 6.</label>
<caption><p>Arithmetic means and standard deviations of Cosine similarities between the <italic>SHS-Versions</italic> and the respective other versions. The prefix YT- indicates that the version is from SHS-YT and SHS- indicates that it is from SHS-SEED. Bold formatting indicates that means are statistically significantly different as measured with the Two-Sample-t-Test at p &#x003C; 0.01</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top"><bold>Relevance Class</bold></th>
<th align="center" valign="top"><bold>CoverHunter</bold></th>
<th align="center" valign="top"><bold>CQTNet</bold></th>
<th align="center" valign="top"><bold>Re-MOVE</bold></th>
<th align="center" valign="top"><bold>Support</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top"><italic>SHS-Version</italic></td>
<td align="center" valign="top">0.88 &#x00B1; 0.07</td>
<td align="center" valign="top">0.61 &#x00B1; 0.13</td>
<td align="center" valign="top">0.62 &#x00B1; 0.16</td>
<td align="center" valign="top">96,502</td>
</tr>
<tr>
<td align="left" valign="top"><italic>YT-Match</italic></td>
<td align="center" valign="top">0.87 &#x00B1; 0.08</td>
<td align="center" valign="top">0.61 &#x00B1; 0.17</td>
<td align="center" valign="top">0.66 &#x00B1; 0.19</td>
<td align="center" valign="top">44</td>
</tr>
<tr>
<td align="left" valign="top"><italic>YT-Version</italic></td>
<td align="center" valign="top"><bold>0.80 &#x00B1; 0.10</bold></td>
<td align="center" valign="top"><bold>0.48 &#x00B1; 0.19</bold></td>
<td align="center" valign="top"><bold>0.45 &#x00B1; 0.24</bold></td>
<td align="center" valign="top">5,021</td>
</tr>
<tr>
<td align="left" valign="top"><italic>SHS-NonVersion</italic></td>
<td align="center" valign="top">0.68 &#x00B1; 0.04</td>
<td align="center" valign="top">0.33 &#x00B1; 0.08</td>
<td align="center" valign="top">0.36 &#x00B1; 0.09</td>
<td align="center" valign="top">5,637,128</td>
</tr>
<tr>
<td align="left" valign="top"><italic>YT-NonVersion</italic></td>
<td align="center" valign="top"><bold>0.72 &#x00B1; 0.05</bold></td>
<td align="center" valign="top"><bold>0.37 &#x00B1; 0.09</bold></td>
<td align="center" valign="top"><bold>0.41 &#x00B1; 0.14</bold></td>
<td align="center" valign="top">14,305</td>
</tr>
<tr>
<td align="left" valign="top"><italic>YT-NoMusic</italic></td>
<td align="center" valign="top">0.68 &#x00B1; 0.07</td>
<td align="center" valign="top">0.23 &#x00B1; 0.05</td>
<td align="center" valign="top">0.22 &#x00B1; 0.05</td>
<td align="center" valign="top">1,810</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T7">
<label>Table 7.</label>
<caption><p>Arithmetic means and standard deviations of Cosine similarities between versions in SHS-SEED and a version from SHS-YT, grouped by the uncertainty class. Bold formatting indicates that means are statistically significantly different as measured with the Two-Sample-t-Test at p &#x003C; 0.01</p></caption>
<table>
<thead>
<tr>
<th align="left" valign="top"></th>
<th align="center" valign="top"><bold>Uncertainty Class</bold></th>
<th align="center" valign="top"><bold>CoverHunter</bold></th>
<th align="center" valign="top"><bold>CQTNet</bold></th>
<th align="center" valign="top"><bold>Re-MOVE</bold></th>
<th align="center" valign="top"><bold>Support</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top"></td>
<td align="center" valign="top"><italic>SHS-Version</italic></td>
<td align="center" valign="top">0.88 &#x00B1; 0.07</td>
<td align="center" valign="top">0.61 &#x00B1; 0.13</td>
<td align="center" valign="top">0.62 &#x00B1; 0.16</td>
<td align="center" valign="top">96,502</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="11">YT-Version</td>
<td align="center" valign="top"><italic>Version: Difficult Cover</italic></td>
<td align="center" valign="top"><bold>0.82 &#x00B1; 0.11</bold></td>
<td align="center" valign="top"><bold>0.55 &#x00B1; 0.17</bold></td>
<td align="center" valign="top"><bold>0.55 &#x00B1; 0.20</bold></td>
<td align="center" valign="top">293</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Drum-Only</italic></td>
<td align="center" valign="top"><bold>0.72 &#x00B1; 0.05</bold></td>
<td align="center" valign="top"><bold>0.28 &#x00B1; 0.07</bold></td>
<td align="center" valign="top"><bold>0.23 &#x00B1; 0.06</bold></td>
<td align="center" valign="top">321</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Instrumental</italic></td>
<td align="center" valign="top"><bold>0.68 &#x00B1; 0.12</bold></td>
<td align="center" valign="top"><bold>0.38 &#x00B1; 0.24</bold></td>
<td align="center" valign="top"><bold>0.38 &#x00B1; 0.28</bold></td>
<td align="center" valign="top">364</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Mashup/Remix</italic></td>
<td align="center" valign="top"><bold>0.76 &#x00B1; 0.07</bold></td>
<td align="center" valign="top"><bold>0.44 &#x00B1; 0.12</bold></td>
<td align="center" valign="top"><bold>0.41 &#x00B1; 0.21</bold></td>
<td align="center" valign="top">518</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Medley</italic></td>
<td align="center" valign="top"><bold>0.72 &#x00B1; 0.03</bold></td>
<td align="center" valign="top"><bold>0.32 &#x00B1; 0.09</bold></td>
<td align="center" valign="top"><bold>0.25 &#x00B1; 0.06</bold></td>
<td align="center" valign="top">86</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Single Instrument</italic></td>
<td align="center" valign="top"><bold>0.80 &#x00B1; 0.05</bold></td>
<td align="center" valign="top">0.68 &#x00B1; 0.13</td>
<td align="center" valign="top"><bold>0.46 &#x00B1; 0.10</bold></td>
<td align="center" valign="top">195</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Slowed/Sped-up</italic></td>
<td align="center" valign="top">0.87 &#x00B1; 0.05</td>
<td align="center" valign="top"><bold>0.54 &#x00B1; 0.14</bold></td>
<td align="center" valign="top"><bold>0.43 &#x00B1; 0.24</bold></td>
<td align="center" valign="top">63</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Vocal-Only</italic></td>
<td align="center" valign="top"><bold>0.77 &#x00B1; 0.04</bold></td>
<td align="center" valign="top"><bold>0.38 &#x00B1; 0.09</bold></td>
<td align="center" valign="top"><bold>0.23 &#x00B1; 0.07</bold></td>
<td align="center" valign="top">718</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Video: Low Fidelity</italic></td>
<td align="center" valign="top">0.86 &#x00B1; 0.09</td>
<td align="center" valign="top"><bold>0.57 &#x00B1; 0.16</bold></td>
<td align="center" valign="top"><bold>0.49 &#x00B1; 0.29</bold></td>
<td align="center" valign="top">292</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Video: Multiple Versions</italic></td>
<td align="center" valign="top"><bold>0.79 &#x00B1; 0.09</bold></td>
<td align="center" valign="top"><bold>0.49 &#x00B1; 0.17</bold></td>
<td align="center" valign="top"><bold>0.52 &#x00B1; 0.21</bold></td>
<td align="center" valign="top">343</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Video: With Non-Music</italic></td>
<td align="center" valign="top"><bold>0.81 &#x00B1; 0.10</bold></td>
<td align="center" valign="top"><bold>0.48 &#x00B1; 0.18</bold></td>
<td align="center" valign="top"><bold>0.50 &#x00B1; 0.23</bold></td>
<td align="center" valign="top">1,027</td>
</tr>
<tr>
<td align="left" valign="top"></td>
<td align="center" valign="top"><italic>SHS-NonVersion</italic></td>
<td align="center" valign="top">0.68 &#x00B1; 0.04</td>
<td align="center" valign="top">0.33 &#x00B1; 0.08</td>
<td align="center" valign="top">0.36 &#x00B1; 0.09</td>
<td align="center" valign="top">5,637,128</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="5">YT-NonVersion</td>
<td align="center" valign="top"><italic>Version: Mashup/Remix</italic></td>
<td align="center" valign="top"><bold>0.78 &#x00B1; 0.04</bold></td>
<td align="center" valign="top"><bold>0.42 &#x00B1; 0.07</bold></td>
<td align="center" valign="top">0.33 &#x00B1; 0.14</td>
<td align="center" valign="top">53</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Same Artist</italic></td>
<td align="center" valign="top"><bold>0.76 &#x00B1; 0.04</bold></td>
<td align="center" valign="top"><bold>0.45 &#x00B1; 0.08</bold></td>
<td align="center" valign="top"><bold>0.51 &#x00B1; 0.11</bold></td>
<td align="center" valign="top">862</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Same Genre</italic></td>
<td align="center" valign="top"><bold>0.75 &#x00B1; 0.05</bold></td>
<td align="center" valign="top"><bold>0.40 &#x00B1; 0.09</bold></td>
<td align="center" valign="top"><bold>0.53 &#x00B1; 0.13</bold></td>
<td align="center" valign="top">169</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Version: Similar Version</italic></td>
<td align="center" valign="top"><bold>0.76 &#x00B1; 0.06</bold></td>
<td align="center" valign="top"><bold>0.45 &#x00B1; 0.08</bold></td>
<td align="center" valign="top"><bold>0.51 &#x00B1; 0.11</bold></td>
<td align="center" valign="top">1,069</td>
</tr>
<tr>
<td align="center" valign="top"><italic>Video: Multiple Versions</italic></td>
<td align="center" valign="top"><bold>0.71 &#x00B1; 0.05</bold></td>
<td align="center" valign="top"><bold>0.39 &#x00B1; 0.08</bold></td>
<td align="center" valign="top"><bold>0.40 &#x00B1; 0.14</bold></td>
<td align="center" valign="top">102</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To address RQ2, we investigate the differences in Cosine similarities for subsets of relevance classes grouped by their corresponding uncertainty classes in <xref ref-type="table" rid="T7">Table 7</xref>. Our imposed ordinal relevance classes also allow for an analysis of similarities when dealing with highly similar versions (<italic>YT-Match</italic>) and versions without music (<italic>YT-NoMusic</italic>). Interestingly, the similarities of YT-Matches are neither significantly higher nor lower than the similarities to other <italic>SHS-Versions</italic>. Regarding <italic>NoMusic</italic> versions, we can also see rather high similarities, which indicates a lack of robustness of VI models. Almost all the YT-Versions are significantly less similar compared to SHS-Versions (<italic>p</italic> &#x003C; 0.01). The most challenging classes for all the models appear to be drum-only versions, instrumental versions, and medleys. While the latter is rather attributed to an alignment problem, the other two are most likely affected by the absence of the main melody and partly the harmony. Vocal-only versions, which most likely only contain the main melody, appear to be hard for CQTNet and Re-MOVE and less so for CoverHunter. Difficulties of VI models for YT-NonVersions appear to arise from versions being of the same artist or genre, or from versions being similar by chance (<italic>Version: Similar Version</italic> and <italic>Version: Mashup/Remix</italic>).</p>
<p>In <xref ref-type="fig" rid="F5">Figure 5</xref>, we further investigate the mean similarities by CoverHunter for different relevant versions. The difficulty of drum-only versions is confirmed. We can also see that versions referring to multiple versions or including non-music noise impact the similarity. In the next section, we provide some examples of versions on YouTube which appear to be very challenging.</p>
<fig id="F5">
<label>Figure 5.</label>
<caption><p>Mean Cosine similarities of CoverHunter embeddings between YT-Versions per respective uncertainty class</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c92-fig5.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
<sec id="sec5_3">
<title>Error analysis</title>
<p>We examine the reasons for uncertainties more profoundly. First, we look at versions labelled <italic>Non-Version</italic> which are more similar than random other versions. We found that songs of the same genre are generally more similar, for instance bossa nova and blues. In theory, VI models are not optimized to model genres per se. However, musical characteristics such as chord progressions (e.g., the blues scheme) or rhythm (e.g., the bossa nova beat) seem to be hard to disentangle from VI representations. Similarly, musical characteristics appear to correlate for versions by the same artists (e.g., Lady Gaga, Backstreet Boys, AC/DC). In some cases, however, versions appear to be similar simply because of similar chord progressions (e.g., &#x2018;Ultraviolence&#x2019; (<ext-link ext-link-type="uri" xlink:href="https://youtu.be/ZFWC4SiZBao">https://youtu.be/ZFWC4SiZBao</ext-link>) and &#x2018;Radioactive&#x2019; (<ext-link ext-link-type="uri" xlink:href="https://youtu.be/E5sVhFnrlTw">https://youtu.be/E5sVhFnrlTw</ext-link>)). Interestingly, we also found <italic>NoMusic</italic> versions with high similarity to Versions according to CoverHunter (<ext-link ext-link-type="uri" xlink:href="https://youtu.be/PG6iJmbnOTY">https://youtu.be/PG6iJmbnOTY</ext-link> and <ext-link ext-link-type="uri" xlink:href="https://youtu.be/svQD6mGDPXc">https://youtu.be/svQD6mGDPXc</ext-link>). We assume that this is due to the matching of mute or low-energy sections in these versions with mute parts of SHS-Versions after the alignment module.</p>
<p>Investigating some YT-Versions which appear to be difficult to detect, we found that vocal-only can refer either to versions with an isolated vocal stem obtained by sound source separation (<ext-link ext-link-type="uri" xlink:href="https://youtu.be/cixhJpyTWko">https://youtu.be/cixhJpyTWko</ext-link>) or to self-recorded vocal-only versions (<ext-link ext-link-type="uri" xlink:href="https://youtu.be/24AKYyNusvs">https://youtu.be/24AKYyNusvs</ext-link>).</p>
</sec>
</sec>
</sec>
<sec id="sec6">
<title>Discussion and implications</title>
<p>We summarize the findings gathered from our created dataset SHS-YT and the respective benchmarks. Regarding RQ1, we confirmed a significant difference between some of the versions on YouTube and the ones included in SHS-based datasets. Based on our ordinal relevance labels, we derive that the difficulty arises especially from relevant versions which are difficult to detect (false negatives) rather than from irrelevant versions (false positives). However, some aspects, such as the similarity of songs within genres, by the same artists, or with similar chord progressions, seem to lead to overestimated similarity.</p>
<p>Looking at our dataset with annotated uncertainty classes reveals that drum-only videos, as well as instrumental versions, are rather challenging. Since the former include neither melody nor harmony, these cases can be considered boundary cases. This raises the question of how a cover version is defined, a question for musicology and perhaps even of a philosophical nature. Beside these rather song-specific uncertainty classes, there are also observable difficulties arising from the alignment problem. While this is a general problem in VI, extreme cases such as medleys, multiple versions in a video, and videos with versions and non-musical content still appear to be difficult for existing models.</p>
<p>To improve VI models in the future, one solution is to rely on broader datasets in terms of data sources, for instance by utilizing YouTube metadata to train weakly-supervised models. However, we propose another solution based on our observations. In <xref ref-type="fig" rid="F6">Figure 6</xref>, we propose our taxonomy of cover versions in online videos. In the context of VI, the musical characteristics discussed by <xref rid="R34" ref-type="bibr">Yesiler et al. (2021)</xref> (<italic>Song</italic> node) are one key component to model cover version relationships. Researchers are well aware of the importance of alterations in these characteristics and address them by augmentation techniques such as pitch and tempo variations. In the context of VI on YouTube (<italic>Video</italic> node), there are additional challenges which arise from the context of online videos. Our observations provided examples of versions with low fidelity and versions which occur in the background with foreground noise. We believe that both of these alterations can be well addressed by incorporating audio fingerprinting and noise mixing. We also found that isolated stems (e.g., drum-only, vocal-only versions) are particularly challenging. This problem points to the related music information retrieval task of query-by-humming, where audio representations rely on single stems (usually the singing voice). In VI, an integration of sound source separation into augmentation techniques could further benefit model performance. Alternatively, rather than extracting the features in an end-to-end fashion using CQT spectrograms, one could extract features for melody, harmony, and rhythm separately like <xref rid="R1" ref-type="bibr">Abrassart &#x0026; Doras (2022)</xref>.</p>
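The noise-mixing augmentation suggested above could, assuming raw waveform input, be sketched as follows. The function `mix_noise` and its SNR parameterization are illustrative assumptions, not part of any cited model.

```python
import numpy as np

def mix_noise(signal, noise, snr_db):
    """Mix background noise into a signal at a target signal-to-noise ratio (in dB).

    The noise is scaled by a factor k so that
    10 * log10(signal_power / (k**2 * noise_power)) == snr_db.
    """
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    k = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + k * noise

# Example: mix Gaussian noise into a sine tone at 10 dB SNR.
rng = np.random.default_rng(1)
tone = np.sin(np.linspace(0, 100, 16000))
noise = rng.normal(size=16000)
augmented = mix_noise(tone, noise, 10.0)
```

Training batches could then contain both clean and augmented views of the same version, exposing the model to the foreground-noise conditions observed on YouTube.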
<p>Lastly, the alignment problem which we have mentioned appears to be particularly pronounced on online video platforms. Not only can a version be represented by just a section (<italic>Chunked</italic>), it can also appear alongside multiple other versions or non-music noise.</p>
<p>The application of sliding time windows, possibly with different sizes, followed by a maximum aggregation can address this problem. However, this might in turn increase the risk of false negatives and the computational load. We further propose that synthesizing data by concatenating different versions and non-musical noise, such as commentary, can help make VI models more robust in these cases.</p>
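The sliding-window strategy with maximum aggregation could be sketched roughly as below. Here `embed` stands in for an arbitrary VI embedding model, and the window and hop sizes (in frames) are hypothetical choices, not values from any evaluated system.

```python
import numpy as np

def sliding_windows(features, win, hop):
    """Split a (time, dim) feature sequence into overlapping windows of length `win`."""
    n = len(features)
    starts = range(0, max(n - win, 0) + 1, hop)
    return [features[s:s + win] for s in starts]

def max_window_similarity(embed, query_feats, ref_feats, win=100, hop=50):
    """Embed each window of the reference and keep the best Cosine match to the query.

    `embed` is any function mapping a (time, dim) feature block to an embedding vector.
    """
    q = embed(query_feats)
    best = -1.0
    for w in sliding_windows(ref_feats, win, hop):
        r = embed(w)
        sim = float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))
        best = max(best, sim)
    return best
```

If the reference video contains a matching chunk anywhere, one of the windows will cover it and the maximum recovers the match; the cost is one embedding pass per window, which is the computational overhead noted above.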
<fig id="F6">
<label>Figure 6.</label>
<caption><p>Taxonomy of cover versions in online videos</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c92-fig6.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
<sec id="sec7">
<title>Conclusion and limitations</title>
<p>In this paper we proposed SHS-YT, a new benchmark dataset for VI. Created with a multi-modal uncertainty sampling approach and annotated with uncertainty classes by crowdworkers and experts, this dataset provides novel insights into the robustness of VI models. Lastly, we want to highlight some limitations of our study.</p>
<p>To the best of our knowledge, this is the first study which evaluates VI approaches with regard to different alterations among versions, focusing on the most prominent uncertainties. Nevertheless, these classes might be partly subjective and cannot be fully isolated from other effects which might occur for certain pairs of versions. Since YouTube is a dynamic online video platform, we cannot guarantee the presence of our videos on the platform in the future. In our repository, we provide all the URLs investigated. Due to copyright issues, we cannot provide the raw audio but only the extracted CQT and CREMA features. This paper focused on cover versions in the context of western popular music. We are well aware that other genres might incorporate other characteristics which make this study less applicable. In future studies, the consideration of other genres with different characteristics could provide an even broader overview of musical reinterpretations.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>We thank the music experts for supporting the annotation processes.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="R1"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Abrassart</surname><given-names>M.</given-names></name><name><surname>Doras</surname><given-names>G.</given-names></name></person-group><year>2022</year><article-title>And what if two musical versions don&#x2019;t share melody, harmony, rhythm, or lyrics?</article-title> <source>International Society for Music Information Retrieval Conference</source></element-citation></ref>
<ref id="R2"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Agrawal</surname><given-names>S.</given-names></name><name><surname>Sureka</surname><given-names>A.</given-names></name></person-group><year>2013</year><chapter-title>Copyright Infringement Detection of Music Videos on YouTube by Mining Video and Uploader Meta-data</chapter-title><person-group person-group-type="editor"><name><surname>Bhatnagar</surname><given-names>V.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Srinivasa</surname><given-names>S.</given-names></name></person-group><source>Big Data Analytics</source><fpage>48</fpage><lpage>67</lpage><publisher-name>Springer International Publishing</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-03689-2_4">https://doi.org/10.1007/978-3-319-03689-2_4</ext-link></element-citation></ref>
<ref id="R3"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Airoldi</surname><given-names>M.</given-names></name><name><surname>Beraldo</surname><given-names>D.</given-names></name><name><surname>Gandini</surname><given-names>A.</given-names></name></person-group><year>2016</year><article-title>Follow the algorithm: An exploratory investigation of music on YouTube</article-title><source>Poetics</source><volume>57</volume><fpage>1</fpage><lpage>13</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.poetic.2016.05.001">https://doi.org/10.1016/j.poetic.2016.05.001</ext-link></element-citation></ref>
<ref id="R4"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Araz</surname><given-names>R.O.</given-names></name><name><surname>Serra</surname><given-names>X.</given-names></name><name><surname>Bogdanov</surname><given-names>D.</given-names></name></person-group><year>2024</year><article-title>Discogs-VI: A musical version identification dataset based on public editorial metadata</article-title><source>Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR)</source></element-citation></ref>
<ref id="R5"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bachmann</surname><given-names>M.</given-names></name></person-group><year>2021</year><source>Maxbachmann/RapidFuzz: Release 1.8. 0</source><volume>10</volume><comment>[Computer software]</comment></element-citation></ref>
<ref id="R6"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Daikoku</surname><given-names>H.</given-names></name><name><surname>Ding</surname><given-names>S.</given-names></name><name><surname>Sanne</surname><given-names>U.S.</given-names></name><name><surname>Benetos</surname><given-names>E.</given-names></name><name><surname>Wood</surname><given-names>A.L.</given-names></name><name><surname>Fujii</surname><given-names>S.</given-names></name><name><surname>Savage</surname><given-names>P.E.</given-names></name></person-group><year>2020</year><article-title>Human and automated judgements of musical similarity in a global sample</article-title><source>PsyArXiv Preprint</source></element-citation></ref>
<ref id="R7"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Devlin</surname><given-names>J.</given-names></name><name><surname>Chang</surname><given-names>M.-W.</given-names></name><name><surname>Lee</surname><given-names>K.</given-names></name><name><surname>Toutanova</surname><given-names>K.</given-names></name></person-group><year>2018</year><article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title><comment>arXiv Preprint arXiv:1810.04805</comment></element-citation></ref>
<ref id="R8"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Du</surname><given-names>X.</given-names></name><name><surname>Chen</surname><given-names>K.</given-names></name><name><surname>Wang</surname><given-names>Z.</given-names></name><name><surname>Zhu</surname><given-names>B.</given-names></name><name><surname>Ma</surname><given-names>Z.</given-names></name></person-group><year>2022</year><article-title>Bytecover2: Towards Dimensionality Reduction of Latent Embedding for Efficient Cover Song Identification</article-title><source>ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source><fpage>616</fpage><lpage>620</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/ICASSP43922.2022.9747630">https://doi.org/10.1109/ICASSP43922.2022.9747630</ext-link></element-citation></ref>
<ref id="R9"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Du</surname><given-names>X.</given-names></name><name><surname>Wang</surname><given-names>Z.</given-names></name><name><surname>Liang</surname><given-names>X.</given-names></name><name><surname>Liang</surname><given-names>H.</given-names></name><name><surname>Zhu</surname><given-names>B.</given-names></name><name><surname>Ma</surname><given-names>Z.</given-names></name></person-group><year>2023</year><article-title>Bytecover3: Accurate cover song identification on short queries</article-title><source>ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source><fpage>1</fpage><lpage>5</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/ICASSP49357.2023.10095389">https://doi.org/10.1109/ICASSP49357.2023.10095389</ext-link></element-citation></ref>
<ref id="R10"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Du</surname><given-names>X.</given-names></name><name><surname>Yu</surname><given-names>Z.</given-names></name><name><surname>Zhu</surname><given-names>B.</given-names></name><name><surname>Chen</surname><given-names>X.</given-names></name><name><surname>Ma</surname><given-names>Z.</given-names></name></person-group><year>2021</year><article-title>ByteCover: Cover Song Identification via Multi&#x00AC;Loss Training (arXiv:2010.14022)</article-title><source>arXiv</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2010.14022">https://doi.org/10.48550/arXiv.2010.14022</ext-link></element-citation></ref>
<ref id="R11"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Ellis</surname><given-names>D.P. W.</given-names></name></person-group><year>2011</year><article-title>The &#x201C;covers80&#x201D; cover song data set</article-title><ext-link ext-link-type="uri" xlink:href="http://labrosa.ee.columbia.edu/projects/coversongs/covers80">http://labrosa.ee.columbia.edu/projects/coversongs/covers80</ext-link></element-citation></ref>
<ref id="R12"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Flexer</surname><given-names>A.</given-names></name><name><surname>Grill</surname><given-names>T.</given-names></name></person-group><year>2016</year><article-title>The Problem of Limited Inter-rater Agreement in Modelling Music Similarity</article-title><source>Journal of New Music Research</source><volume>45</volume><issue>3</issue><fpage>239</fpage><lpage>251</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1080/09298215.2016.1200631">https://doi.org/10.1080/09298215.2016.1200631</ext-link></element-citation></ref>
<ref id="R13"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Flexer</surname><given-names>A.</given-names></name><name><surname>Lallai</surname><given-names>T.</given-names></name></person-group><year>2019</year><article-title>Can we increase inter-and intra-rater agreement in modeling general music similarity?</article-title> <source>Conference of International Society for Music Information Retrieval (ISMIR)</source><fpage>494</fpage><lpage>500</lpage></element-citation></ref>
<ref id="R14"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Flexer</surname><given-names>A.</given-names></name><name><surname>Lallai</surname><given-names>T.</given-names></name><name><surname>Ra&#x0161;l</surname><given-names>K.</given-names></name></person-group><year>2021</year><article-title>On evaluation of inter- and intra-rater agreement in music recommendation</article-title><source>Transactions of the International Society for Music Information Retrieval</source><volume>4</volume><issue>1</issue><fpage>182</fpage><lpage>194</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5334/tismir.107">https://doi.org/10.5334/tismir.107</ext-link></element-citation></ref>
<ref id="R15"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Ghosh</surname><given-names>S.</given-names></name><name><surname>Sperling</surname><given-names>R.</given-names></name><name><surname>Hooper</surname><given-names>S.</given-names></name></person-group><year>2019</year><source>Using Amazon MTurk for research in academia: A beginner&#x2019;s guide for using Qualtrics, detecting VPN/proxy, limiting countries using geolocation &#x0026; other tips</source><publisher-name>SSRN</publisher-name><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.2139/ssrn.3455722">https://doi.org/10.2139/ssrn.3455722</ext-link></element-citation></ref>
<ref id="R16"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Gulati</surname><given-names>A.</given-names></name><name><surname>Qin</surname><given-names>J.</given-names></name><name><surname>Chiu</surname><given-names>C.-C.</given-names></name><name><surname>Parmar</surname><given-names>N.</given-names></name><name><surname>Zhang</surname><given-names>Y.</given-names></name><name><surname>Yu</surname><given-names>J.</given-names></name><name><surname>Han</surname><given-names>W.</given-names></name><name><surname>Wang</surname><given-names>S.</given-names></name><name><surname>Zhang</surname><given-names>Z.</given-names></name><name><surname>Wu</surname><given-names>Y.</given-names></name></person-group><year>2020</year><article-title>Conformer: Convolution-augmented transformer for speech recognition</article-title><comment>arXiv Preprint arXiv:2005.08100</comment></element-citation></ref>
<ref id="R17"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Hachmeier</surname><given-names>S.</given-names></name><name><surname>J&#x00E4;schke</surname><given-names>R.</given-names></name><name><surname>Saadatdoorabi</surname><given-names>H.</given-names></name></person-group><year>2022</year><article-title>Music Version Retrieval from YouTube: How to Formulate Effective Search Queries?</article-title><source>LWDA</source><fpage>213</fpage><lpage>226</lpage></element-citation></ref>
<ref id="R18"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hanson</surname><given-names>J.</given-names></name></person-group><year>2018</year><article-title>Assessing the educational value of YouTube videos for beginning instrumental music</article-title><source>Contributions to Music Education</source><volume>43</volume><fpage>137</fpage><lpage>158</lpage></element-citation></ref>
<ref id="R19"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Hu</surname><given-names>S.</given-names></name><name><surname>Zhang</surname><given-names>B.</given-names></name><name><surname>Lu</surname><given-names>J.</given-names></name><name><surname>Jiang</surname><given-names>Y.</given-names></name><name><surname>Wang</surname><given-names>W.</given-names></name><name><surname>Kong</surname><given-names>L.</given-names></name><name><surname>Zhao</surname><given-names>W.</given-names></name><name><surname>Jiang</surname><given-names>T.</given-names></name></person-group><year>2022</year><article-title>WideResNet with Joint Representation Learning and Data Augmentation for Cover Song Identification</article-title><source>Interspeech 2022</source><fpage>4187</fpage><lpage>4191</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.21437/Interspeech.2022-10600">https://doi.org/10.21437/Interspeech.2022-10600</ext-link></element-citation></ref>
<ref id="R20"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Jones</surname><given-names>M.C.</given-names></name><name><surname>Downie</surname><given-names>J.S.</given-names></name><name><surname>Ehmann</surname><given-names>A.F.</given-names></name></person-group><year>2007</year><article-title>Human similarity judgments: Implications for the design of formal evaluations</article-title><source>International Society for Music Information Retrieval Conference (ISMIR)</source></element-citation></ref>
<ref id="R21"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Li</surname><given-names>B.</given-names></name><name><surname>Kumar</surname><given-names>A.</given-names></name></person-group><year>2019</year><article-title>Query by video: Cross-modal music retrieval</article-title><source>International Society for Music Information Retrieval Conference (ISMIR)</source></element-citation></ref>
<ref id="R22"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Li</surname><given-names>Y.</given-names></name><name><surname>Li</surname><given-names>J.</given-names></name><name><surname>Suhara</surname><given-names>Y.</given-names></name><name><surname>Doan</surname><given-names>A.</given-names></name><name><surname>Tan</surname><given-names>W.-C.</given-names></name></person-group><year>2020</year><article-title>Deep entity matching with pre-trained language models</article-title><source>arXiv Preprint arXiv:2004.00584</source></element-citation></ref>
<ref id="R23"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Liikkanen</surname><given-names>L.A.</given-names></name><name><surname>Salovaara</surname><given-names>A.</given-names></name></person-group><year>2015</year><article-title>Music on YouTube: User engagement with traditional, user-appropriated, and derivative videos</article-title><source>Computers in Human Behavior</source><volume>50</volume><fpage>108</fpage><lpage>124</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.chb.2015.01.067">https://doi.org/10.1016/j.chb.2015.01.067</ext-link></element-citation></ref>
<ref id="R24"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>F.</given-names></name><name><surname>Tuo</surname><given-names>D.</given-names></name><name><surname>Xu</surname><given-names>Y.</given-names></name><name><surname>Han</surname><given-names>X.</given-names></name></person-group><year>2023</year><article-title>CoverHunter: Cover song identification with refined attention and alignments</article-title><source>2023 IEEE International Conference on Multimedia and Expo (ICME)</source><fpage>1080</fpage><lpage>1085</lpage></element-citation></ref>
<ref id="R25"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Martet</surname><given-names>S.</given-names></name></person-group><year>2016</year><article-title>The circulation of user-appropriated music content on YouTube</article-title><source>YouTube and Music</source><volume>22</volume><issue>4</issue><fpage>169</fpage></element-citation></ref>
<ref id="R26"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Matherly</surname><given-names>T.</given-names></name></person-group><year>2018</year><article-title>A Panel for Lemons? Positivity bias, reputation systems and data quality on MTurk</article-title><source>European Journal of Marketing</source><volume>53</volume><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1108/EJM-07-2017-0491">https://doi.org/10.1108/EJM-07-2017-0491</ext-link></element-citation></ref>
<ref id="R27"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>McDaniel</surname><given-names>B.</given-names></name></person-group><year>2021</year><article-title>Popular music reaction videos: Reactivity, creator labor, and the performance of listening online</article-title><source>New Media &#x0026; Society</source><volume>23</volume><issue>6</issue><fpage>1624</fpage><lpage>1641</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1177/1461444820918549">https://doi.org/10.1177/1461444820918549</ext-link></element-citation></ref>
<ref id="R28"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Mellis</surname><given-names>A.</given-names></name><name><surname>Bickel</surname><given-names>W.</given-names></name></person-group><year>2020</year><article-title>Mechanical turk data collection in addiction research: Utility, concerns, and best practices</article-title><source>Addiction (Abingdon, England)</source><volume>115</volume><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1111/add.15032">https://doi.org/10.1111/add.15032</ext-link></element-citation></ref>
<ref id="R29"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Okabe</surname><given-names>K.</given-names></name><name><surname>Koshinaka</surname><given-names>T.</given-names></name><name><surname>Shinoda</surname><given-names>K.</given-names></name></person-group><year>2018</year><article-title>Attentive statistics pooling for deep speaker embedding</article-title><source>arXiv Preprint arXiv:1803.10963</source></element-citation></ref>
<ref id="R30"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Peer</surname><given-names>E.</given-names></name><name><surname>Vosgerau</surname><given-names>J.</given-names></name><name><surname>Acquisti</surname><given-names>A.</given-names></name></person-group><year>2013</year><article-title>Reputation as a sufficient condition for data quality on Amazon Mechanical Turk</article-title><source>Behavior Research Methods</source><volume>46</volume><fpage>1023</fpage><lpage>1031</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3758/s13428-013-0434-y">https://doi.org/10.3758/s13428-013-0434-y</ext-link></element-citation></ref>
<ref id="R31"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Silva</surname><given-names>D.F.</given-names></name><name><surname>de Souza</surname><given-names>V.M.</given-names></name><name><surname>Batista</surname><given-names>G.E.</given-names></name></person-group><year>2015</year><article-title>Music shapelets for fast cover song recognition</article-title><source>ISMIR</source><fpage>441</fpage><lpage>447</lpage></element-citation></ref>
<ref id="R32"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Smith</surname><given-names>J.B. L.</given-names></name><name><surname>Hamasaki</surname><given-names>M.</given-names></name><name><surname>Goto</surname><given-names>M.</given-names></name></person-group><year>2017</year><article-title>Classifying derivative works with search, text, audio, and video features</article-title><source>International Conference on Multimedia and Expo (ICME)</source><fpage>1422</fpage><lpage>1427</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/ICME.2017.8019444">https://doi.org/10.1109/ICME.2017.8019444</ext-link></element-citation></ref>
<ref id="R33"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Xu</surname><given-names>X.</given-names></name><name><surname>Chen</surname><given-names>X.</given-names></name><name><surname>Yang</surname><given-names>D.</given-names></name></person-group><year>2018</year><article-title>Key-invariant convolutional neural network toward efficient cover song identification</article-title><source>2018 IEEE International Conference on Multimedia and Expo (ICME)</source><fpage>1</fpage><lpage>6</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/ICME.2018.8486531">https://doi.org/10.1109/ICME.2018.8486531</ext-link></element-citation></ref>
<ref id="R34"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yesiler</surname><given-names>F.</given-names></name><name><surname>Doras</surname><given-names>G.</given-names></name><name><surname>Bittner</surname><given-names>R.M.</given-names></name><name><surname>Tralie</surname><given-names>C.J.</given-names></name><name><surname>Serr&#x00E0;</surname><given-names>J.</given-names></name></person-group><year>2021</year><article-title>Audio-based Musical Version Identification: Elements and Challenges</article-title><source>IEEE Signal Processing Magazine</source><volume>38</volume><issue>6</issue><fpage>115</fpage><lpage>136</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/MSP.2021.3105941">https://doi.org/10.1109/MSP.2021.3105941</ext-link></element-citation></ref>
<ref id="R35"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Yesiler</surname><given-names>F.</given-names></name><name><surname>Serr&#x00E0;</surname><given-names>J.</given-names></name><name><surname>G&#x00F3;mez</surname><given-names>E.</given-names></name></person-group><year>2020</year><comment>a</comment><article-title>Accurate and scalable version identification using musically motivated embeddings</article-title><source>International Conference on Acoustics, Speech, and Signal Processing (ICASSP)</source><fpage>21</fpage><lpage>25</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/ICASSP40776.2020.9053793">https://doi.org/10.1109/ICASSP40776.2020.9053793</ext-link></element-citation></ref>
<ref id="R36"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Yesiler</surname><given-names>F.</given-names></name><name><surname>Serr&#x00E0;</surname><given-names>J.</given-names></name><name><surname>G&#x00F3;mez</surname><given-names>E.</given-names></name></person-group><year>2020</year><comment>b</comment><article-title>Less is more: Faster and better music version identification with embedding distillation (arXiv:2010.03284)</article-title><comment>arXiv</comment><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2010.03284">https://doi.org/10.48550/arXiv.2010.03284</ext-link></element-citation></ref>
<ref id="R37"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Yesiler</surname><given-names>F.</given-names></name><name><surname>Tralie</surname><given-names>C.</given-names></name><name><surname>Correya</surname><given-names>A.</given-names></name><name><surname>Silva</surname><given-names>D.F.</given-names></name><name><surname>Tovstogan</surname><given-names>P.</given-names></name><name><surname>G&#x00F3;mez</surname><given-names>E.</given-names></name><name><surname>Serra</surname><given-names>X.</given-names></name></person-group><year>2019</year><article-title>Da- TACOS: a dataset for cover song identification and understanding</article-title><source>Proc. of the 20th Int. Soc. for Music Information Retrieval Conf. (ISMIR)</source><fpage>327</fpage><lpage>334</lpage></element-citation></ref>
<ref id="R38"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>Z.</given-names></name><name><surname>Xu</surname><given-names>X.</given-names></name><name><surname>Chen</surname><given-names>X.</given-names></name><name><surname>Yang</surname><given-names>D.</given-names></name></person-group><year>2020</year><article-title>Learning a representation for cover song identification using convolutional neural network</article-title><source>International Conference on Acoustics, Speech, and Signal Processing (ICASSP)</source><fpage>541</fpage><lpage>545</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/ICASSP40776.2020.9053839">https://doi.org/10.1109/ICASSP40776.2020.9053839</ext-link></element-citation></ref>
</ref-list>
</back>
</article>