<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">IR</journal-id>
<journal-title-group>
<journal-title>Information Research</journal-title>
</journal-title-group>
<issn pub-type="epub">1368-1613</issn>
<publisher>
<publisher-name>University of Bor&#x00E5;s</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">ir30iConf47140</article-id>
<article-id pub-id-type="doi">10.47989/ir30iConf47140</article-id>
<article-categories>
<subj-group xml:lang="en">
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Inconsistency-driven approach for human-in-the-loop entity matching</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Ito</surname><given-names>Hiroyoshi</given-names></name>
<xref ref-type="aff" rid="aff0001"/></contrib>
<contrib contrib-type="author"><name><surname>Koizumi</surname><given-names>Takahiro</given-names></name>
<xref ref-type="aff" rid="aff0002"/></contrib>
<contrib contrib-type="author"><name><surname>Yoshimoto</surname><given-names>Ryuji</given-names></name>
<xref ref-type="aff" rid="aff0003"/></contrib>
<contrib contrib-type="author"><name><surname>Fukushima</surname><given-names>Yukihiro</given-names></name>
<xref ref-type="aff" rid="aff0004"/></contrib>
<contrib contrib-type="author"><name><surname>Harada</surname><given-names>Takashi</given-names></name>
<xref ref-type="aff" rid="aff0005"/></contrib>
<contrib contrib-type="author"><name><surname>Morishima</surname><given-names>Atsuyuki</given-names></name>
<xref ref-type="aff" rid="aff0006"/></contrib>
<aff id="aff0001"><bold>Hiroyoshi Ito</bold> is an Assistant Professor at Institute of Library, Information and Media Science, University of Tsukuba. He can be contacted at: <email xlink:href="ito@slis.tsukuba.ac.jp">ito@slis.tsukuba.ac.jp</email></aff>
<aff id="aff0002"><bold>Takahiro Koizumi</bold> is a master&#x2019;s student at Graduate School of Comprehensive Human Sciences, University of Tsukuba. He can be contacted at: <email xlink:href="takahiro.koizumi.2022b@gmail.com">takahiro.koizumi.2022b@gmail.com</email></aff>
<aff id="aff0003"><bold>Ryuji Yoshimoto</bold> is an Engineer at CARLIL Inc. He can be contacted at: <email xlink:href="ryuuji@calil.jp">ryuuji@calil.jp</email></aff>
<aff id="aff0004"><bold>Yukihiro Fukushima</bold> is an Associate Professor at Faculty of Letters, Keio University. He can be contacted at: <email xlink:href="fukusima-y@keio.jp">fukusima-y@keio.jp</email></aff>
<aff id="aff0005"><bold>Takashi Harada</bold> is a Professor at Center for License and Qualification, Doshisha University. He can be contacted at: <email xlink:href="ushi@slis.doshisha.ac.jp">ushi@slis.doshisha.ac.jp</email></aff>
<aff id="aff0006"><bold>Atsuyuki Morishima</bold> is a Professor at Institute of Library, Information and Media Science, University of Tsukuba. He can be contacted at: <email xlink:href="mori@slis.tsukuba.ac.jp">mori@slis.tsukuba.ac.jp</email></aff>
</contrib-group>
<pub-date pub-type="epub"><day>06</day><month>05</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>30</volume>
<issue>i</issue>
<fpage>824</fpage>
<lpage>835</lpage>
<permissions>
<copyright-year>2025</copyright-year>
<copyright-holder>&#x00A9; 2025 The Author(s).</copyright-holder>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract xml:lang="en">
<title>Abstract</title>
<p><bold>Introduction.</bold> Entity matching is a fundamental operation in a wide range of information management applications, and a tremendous number of methods have been proposed to address the problem. Human-in-the-loop entity matching is a human-AI collaborative approach that is effective when the data for entity matching is incomplete or requires domain knowledge. A typical human-in-the-loop approach is to allow a machine-learning-based (ML-based) matcher to ask humans to match entities when it cannot match them with high confidence. However, ML-based matchers cannot avoid the unknown-unknown problem, i.e., they can resolve the entities incorrectly with high confidence.</p>
<p><bold>Method.</bold> This paper presents an inconsistency-based method to deal with this problem. The method asks humans to resolve the entities when it finds inconsistency in the transitivity property behind entity matching. For example, if a matcher returns a positive result for only two of the three pairs among three entities, the result is inconsistent.</p>
<p><bold>Analysis.</bold> This paper shows an implementation of our idea using a similarity-based blocking method and Bayesian inference, and explains the results of an extensive set of experiments that reveal how and when the method is effective.</p>
<p><bold>Results.</bold> The results showed that the inconsistency-based sampling selects very different entity pairs compared to other sampling strategies and that a simple hybrid strategy performs well in many practical situations.</p>
<p><bold>Conclusion.</bold> The results indicate that our approach complements any existing matcher that can cause the unknown-unknown problem in entity matching.</p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>Entity matching is a fundamental operation for objects in information management applications, such as bibliographic records, names (<xref rid="R6" ref-type="bibr">Cohen, et al., 2003</xref>), entities that appear in ontology (Xu, et al., 2008), texts and other data collections (<xref rid="R12" ref-type="bibr">Jaro, 1989</xref>). Therefore, a tremendous number of methods have been proposed to address the problem (<xref rid="R4" ref-type="bibr">Christophides, et al., 2020</xref>), (<xref rid="R18" ref-type="bibr">Mudgal, et al., 2018</xref>), (<xref rid="R12" ref-type="bibr">Jaro, 1989</xref>), and implemented services are often available (<xref rid="R11" ref-type="bibr">Govind, 2018</xref>). Machine-learning-based matchers (Eraheem, et al., 2014), (<xref rid="R17" ref-type="bibr">Li, et al., 2020</xref>), (<xref rid="R33" ref-type="bibr">Yao, et al., 2022</xref>) are widely used for many other problems in digital libraries (<xref rid="R19" ref-type="bibr">Nielsen, 2018</xref>).</p>
<p>Human-in-the-loop entity matching (<xref rid="R22" ref-type="bibr">Osawa, et al., 2021</xref>), (<xref rid="R10" ref-type="bibr">Gokhale, et al., 2014</xref>), (Das, et al., 2017) is a human-AI collaborative approach to the problem and is known to be effective when the data set is incomplete or the matching requires domain knowledge (<xref rid="R28" ref-type="bibr">Trabelsi, et al., 2022</xref>). A typical human-in-the-loop ML matching approach is to allow an ML matcher to ask humans to match entities when it cannot match them with high confidence.</p>
<p>However, ML-based matchers cannot avoid the <italic>unknown-unknown problem</italic>, i.e., they can match the entities incorrectly with high confidence (Chung, et al., 2019). This is an inherent weakness of the ML-based matchers, because typical human-in-the-loop approaches choose a pair when the matching result is uncertain, or multiple matchers disagree on the result (<xref rid="R26" ref-type="bibr">Settles, 2010</xref>), but such procedures do not guarantee that the matching decisions on the remaining pairs are correct.</p>
<p>This paper presents an inconsistency-driven method that addresses this problem and can be used with <italic>any</italic> matcher that outputs matching probabilities for a pair of entities. The method chooses the entity pairs it asks humans to match by inconsistency-based &#x201C;sampling&#x201D;: when it finds inconsistency in the equivalence relation behind entity matching, it picks the pairs of entities that cause the inconsistency and asks humans to fix the matches. For example, if a matcher returns a negative result only for a particular pair (e.g., &#x1D460;1 and &#x1D460;3) among three entities &#x1D460;1, &#x1D460;2 and &#x1D460;3 (Fig. 1), the result is inconsistent because it violates the transitive law of the equivalence relation.</p>
<fig id="F1">
<label>Figure 1.</label>
<caption><p>Inconsistency-driven human-in-the-loop entity matching: if the ML matcher outputs the results that are inconsistent with each other, humans correct the result, and the feedback is given to the matcher to improve the results</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c85-fig1.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>We implemented the inconsistency-driven sampling in a simple framework combining Bayesian inference and a similarity-based blocking method, to highlight the effects of the sampling. In the framework, we assume that we have a certain amount of training data with known labels in advance for the blocking and Bayesian inference. Then, we conducted an extensive set of experiments with different accuracies of matchers, expecting that the result reveals how and when the method is effective. Therefore, we focus on the quality improvement relative to the performance of the original matcher, measured in terms of f1 values (we use the f1 value instead of accuracy because the numbers of matched and unmatched pairs are imbalanced).</p>
<p>Our research questions are as follows: (RQ1) What are the characteristics of the inconsistency-driven approach as an entity matching strategy? (RQ2) In what situations is the approach effective?</p>
<p>The contributions of this paper are twofold: First, we show a principled framework for an inconsistency-driven approach for human-in-the-loop entity matching. Our approach complements any existing matcher that can cause the unknown-unknown problem. This paper shows an implementation of our approach in a simple framework to examine the effects of the approach.</p>
<p>Second, we show the result of an extensive set of experiments with three real-world datasets with different characteristics. We then conducted a detailed analysis of the results. Consequently, we observed the effects of our approach in different situations, revealing how and when the approach is effective.</p>
<p>Note that our research question is not about the performance of a particular matcher. Our findings are summarized as follows.</p>
<list list-type="order">
<list-item><p>The inconsistency-driven sampling chooses data items that are completely different from those chosen by uncertainty-based and random sampling.</p></list-item>
<list-item><p>The inconsistency-driven sampling has advantages even if human answers are not completely correct, as long as the performance of the matcher is relatively high. This is because, under that condition, the inconsistency-based check works well in finding incorrect responses from humans.</p></list-item>
<list-item><p>A simple hybrid strategy broadens the sweet spot of the inconsistency-driven approach; it lowers the f1 value threshold for which inconsistency-driven sampling is effective.</p></list-item>
</list>
</sec>
<sec id="sec2">
<title>Related works</title>
<p>In this section, we introduce the studies related to this research and describe their methods and the position of this research.</p>
<sec id="sec2_1">
<title>Rule-based or ML-based entity matching</title>
<p>Many solutions have been proposed for the entity matching problem. The classic example is the rule-based approach, which clusters entities that contain full or partial matches of attribute values or identical tokens obtained by segmenting the values into words (<xref rid="R12" ref-type="bibr">Jaro, 1989</xref>), (<xref rid="R6" ref-type="bibr">Cohen, et al., 2003</xref>), (<xref rid="R2" ref-type="bibr">Benjelloun, et al., 2009</xref>). Recently, machine learning approaches have been proposed. They use random forests (<xref rid="R10" ref-type="bibr">Gokhale, et al., 2014</xref>), (Das, et al., 2017), and metric learning (<xref rid="R24" ref-type="bibr">Peeters, et al., 2022</xref>), (<xref rid="R22" ref-type="bibr">Osawa, et al., 2021</xref>) to judge whether an entity pair is a match by calculating its similarity. These methods are simple and inexpensive but are vulnerable to orthographical variants of data.</p>
</sec>
<sec id="sec2_2">
<title>Human-in-the-loop entity matching</title>
<p>Several studies have pointed out the limitations of entity matching using only computers without human intervention. (<xref rid="R27" ref-type="bibr">Takashi et al. 2019</xref>) identified low-precision results based on similarity using Okapi BM25 (<xref rid="R25" ref-type="bibr">Robertson, et al., 1995</xref>) and proposed a crowdsourcing-based method that allows for human interaction. (Das, et al., 2017) proposed the framework Falcon. This method assumes that there is no training data, and a small portion of the target data is labelled by a human to prepare the training data. Other methods using human-in-the-loop have been proposed (<xref rid="R11" ref-type="bibr">Li, 2017</xref>), (<xref rid="R10" ref-type="bibr">Gokhale, et al., 2014</xref>), (<xref rid="R22" ref-type="bibr">Osawa, et al., 2021</xref>), (Eraheem, et al., 2014). All of these methods use humans in much the same way: to generate or modify data in order to improve the quality of the results or to train the computer. In this study, we propose a human task from the perspective of obtaining efficient training data.</p>
</sec>
<sec id="sec2_3">
<title>Methods using transitivity law and domain knowledge</title>
<p>(<xref rid="R35" ref-type="bibr">Zhu et al., 2020</xref>) proposed a method using the transitivity of equivalence classes. Because the matching is done by crowdsourcing and requires a huge monetary cost, the method tries to find two matched pairs that share the same entity so that it can infer another match between the other two entities. Our work differs in how it uses the transitivity of equivalence: we use it to detect errors in AI decisions, assuming that AI is the main matcher. Other methods (<xref rid="R17" ref-type="bibr">Li, et al., 2020</xref>), (Trabelsi, et al., 2022) use domain knowledge to improve accuracy in entity matching. Their approaches are completely different from ours, and the two are complementary.</p>
</sec>
<sec id="sec2_4">
<title>Problem definition</title>
<p>In this section, we define the problem. <xref ref-type="table" rid="T1">Table 1</xref> shows definitions of symbols in this paper.</p>
<p>We define a tuple of entities as &#x1D465;=(&#x1D460;<sub>&#x1D456;</sub>,&#x1D460;<sub>&#x1D457;</sub>), the set of all possible tuples of entities as &#x1D4B3;={(&#x1D460;<sub>&#x1D456;</sub>,&#x1D460;<sub>&#x1D457;</sub>)&#x2208;&#x1D4AE;<sup>2</sup>&#x2223;&#x1D456;&#x003C;&#x1D457;}, the set of labels as &#x1D4B4;={&#x1D440;&#x1D44E;&#x1D461;&#x1D450;&#x210E;,&#x1D448;&#x1D45B;&#x1D45A;&#x1D44E;&#x1D461;&#x1D450;&#x210E;}, and the set of all true data as <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mrow><mml:msup><mml:mi>U</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mfenced close="}" open="{"><mml:mrow><mml:mfenced><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mo>&#x2282;</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>Y</mml:mi></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow/><mml:mfenced close="|" open="|"><mml:mi>S</mml:mi></mml:mfenced></mml:msub><mml:msub><mml:mi>C</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula>. The set <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mi>m</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mfenced close="}" open="{"><mml:mrow><mml:mfenced><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:msubsup><mml:mo>&#x2286;</mml:mo><mml:msup><mml:mi>U</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is the labelled training set with &#x1D45A; samples, where &#x1D45A; is given and &#x1D45A;&#x226A;&#x1D45B; (we assume that humans give answers through a platform such as crowdsourcing). Our main goal is to design a query strategy &#x1D4AC;: &#x1D4B0;<sup>&#x1D4C3;</sup>&#x2192;&#x2112;<sup>&#x1D4C2;</sup> to train an entity matching model (matcher) &#x1D453;&#x2208;&#x2131;, &#x1D453;:&#x1D4B3;&#x2192;&#x1D4B4;. The optimization problem can be expressed as follows:</p>
<disp-formula><label>(1)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mrow><mml:munder><mml:mrow><mml:mi>arg</mml:mi><mml:mi>max</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>L</mml:mi><mml:mi>m</mml:mi></mml:msup><mml:mo>&#x2286;</mml:mo><mml:msup><mml:mi>U</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:munder><mml:mfrac><mml:mn>1</mml:mn><mml:mi>n</mml:mi></mml:mfrac><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mfenced><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow></mml:mfenced><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>U</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:munder><mml:mrow><mml:mi>&#x03B4;</mml:mi><mml:mfenced><mml:mrow><mml:mi>f</mml:mi><mml:mfenced><mml:mi>x</mml:mi></mml:mfenced><mml:mo>=</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2228;</mml:mo><mml:mfenced><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow></mml:mfenced><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>L</mml:mi><mml:mi>m</mml:mi></mml:msup></mml:mrow></mml:mfenced></mml:mrow></mml:mstyle></mml:mrow></mml:math></disp-formula>
<p>where &#x03B4; is the indicator function. The intuition of Eq. (1) is that, for every pair of entities &#x1D465; in a dataset, it is beneficial either for the matcher to answer the matching for &#x1D465; correctly or for the dataset &#x2112;<sup>&#x1D4C2;</sup> to contain the correct label for &#x1D465;. We assume the model returns a probability of &#x1D440;&#x1D44E;&#x1D461;&#x1D450;&#x210E; or &#x1D448;&#x1D45B;&#x1D45A;&#x1D44E;&#x1D461;&#x1D450;&#x210E;, and we denote the probability of &#x1D440;&#x1D44E;&#x1D461;&#x1D450;&#x210E; that the matching model gives to a tuple (&#x1D460;<sub>&#x1D456;</sub>,&#x1D460;<sub>&#x1D457;</sub>) as &#x1D443;(&#x1D440;&#x1D44E;&#x1D461;&#x1D450;&#x210E;|&#x1D460;<sub>&#x1D456;</sub>,&#x1D460;<sub>&#x1D457;</sub>).</p>
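<p>As an illustration only, the objective in Eq. (1) can be sketched in a few lines of Python. The function name, the toy pairs, and the toy matcher below are hypothetical and not part of the original framework; the set of labelled pairs is assumed to carry correct labels.</p>
<preformat preformat-type="code"><![CDATA[
```python
def objective(true_labels, predict, labelled):
    """Eq. (1): fraction of pairs that are either predicted correctly
    by the matcher or covered by the labelled set (assumed correct)."""
    hits = 0
    for pair, label in true_labels.items():
        if predict(pair) == label or pair in labelled:
            hits += 1
    return hits / len(true_labels)


# Hypothetical toy data: three entity pairs that all truly match.
truth = {
    ("s1", "s2"): "Match",
    ("s2", "s3"): "Match",
    ("s1", "s3"): "Match",
}


def toy_matcher(pair):
    """A toy matcher that wrongly rejects the pair (s1, s3)."""
    return "Unmatch" if pair == ("s1", "s3") else "Match"


score_without_labels = objective(truth, toy_matcher, labelled=set())
score_with_label = objective(truth, toy_matcher, labelled={("s1", "s3")})
```
]]></preformat>
<p>In this toy setting, asking a human about the single pair that the matcher gets wrong raises the objective, which is the effect a good query strategy aims for.</p>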
<table-wrap id="T1">
<label>Table 1.</label>
<caption><p>Symbol definitions</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Symbol</bold></th>
<th align="center" valign="top"><bold>Definition</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top"><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:mfenced close="}" open="{"><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:math></inline-formula></td>
<td align="center" valign="top">All entities in the dataset.</td>
</tr>
<tr>
<td align="center" valign="top"><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula></td>
<td align="center" valign="top">An entity.</td>
</tr>
<tr>
<td align="center" valign="top"><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mi>y</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>Y</mml:mi><mml:mo>=</mml:mo><mml:mfenced close="}" open="{"><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>U</mml:mi><mml:mi>n</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:mfenced></mml:math></inline-formula></td>
<td align="center" valign="top">Labels.</td>
</tr>
<tr>
<td align="center" valign="top"><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mi>x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>X</mml:mi><mml:mo>=</mml:mo><mml:mfenced close="}" open="{"><mml:mrow><mml:mfenced><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>&#x2208;</mml:mo><mml:mi>S</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>S</mml:mi><mml:mo>&#x007C;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x003C;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:mfenced></mml:math></inline-formula></td>
<td align="center" valign="top">Possible tuples.</td>
</tr>
<tr>
<td align="center" valign="top"><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:msup><mml:mi>U</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2282;</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>Y</mml:mi></mml:math></inline-formula></td>
<td align="center" valign="top">A set of all possible pairs and their labels.</td>
</tr>
<tr>
<td align="center" valign="top"><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:msup><mml:mi>L</mml:mi><mml:mi>m</mml:mi></mml:msup><mml:mo>&#x2282;</mml:mo><mml:msup><mml:mi>U</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:math></inline-formula></td>
<td align="center" valign="top">Labeled entity tuples for retraining.</td>
</tr>
<tr>
<td align="center" valign="top"><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mi>Q</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mi>U</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mi>L</mml:mi><mml:mi>m</mml:mi></mml:msup></mml:math></inline-formula></td>
<td align="center" valign="top">A sampling strategy for labeling.</td>
</tr>
<tr>
<td align="center" valign="top"><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mi>f</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>F</mml:mi></mml:math></inline-formula></td>
<td align="center" valign="top">Model (e.g., Bayesian inference).</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="sec2_5">
<title>Inconsistency-driven sampling</title>
<sec id="sec2_5_1">
<title>Basic idea</title>
<p>Inconsistency-driven sampling selects the entity pairs that humans are asked to match. Intuitively, with the current labelled set &#x2112;<sup>&#x1D4C2;&#x2032;</sup> we compute &#x1D443;(&#x1D440;&#x1D44E;&#x1D461;&#x1D450;&#x210E;|&#x1D460;<sub>&#x1D456;</sub>,&#x1D460;<sub>&#x1D457;</sub>) for all pairs in &#x1D4B0;<sup>&#x1D4C3;</sup>&#x2212;&#x2112;<sup>&#x1D4C2;&#x2032;</sup>, find pairs that cause inconsistency, choose &#x1D45A; pairs, and re-train the parameters of the matching model.</p>
<p>To find the inconsistent pairs from the inference result by the current matching model, we consider pairs in a triple of the entities. We call the matching of the three pairs inconsistent if they constitute impossible patterns under the transitivity of equivalence relation (such as "positive," "positive," and "negative," as shown in <xref ref-type="fig" rid="F2">Fig. 2</xref>).</p>
<p>Using Bayesian inference, we can calculate the probability that each pair is a match. Given a triple &#x0394;<sub>&#x1D456;&#x1D457;&#x1D458;</sub>=(&#x1D460;<sub>&#x1D456;</sub>,&#x1D460;<sub>&#x1D457;</sub>,&#x1D460;<sub>&#x1D458;</sub>), the probability that these three pairs cause inconsistency can be calculated as in Eq. (2).</p>
<disp-formula><label>(2)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mrow><mml:mi>I</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>y</mml:mi><mml:mfenced><mml:mrow><mml:msub><mml:mi>&#x0394;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfenced><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mfenced><mml:mrow><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:mfenced><mml:mo>&#x2208;</mml:mo><mml:mfenced close="}" open="{"><mml:mrow><mml:mfenced><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:mfenced><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:mfenced><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:mfenced></mml:mrow></mml:mfenced></mml:mrow></mml:munder><mml:mrow><mml:mi>P</mml:mi><mml:mfenced><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>a</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>b</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mi>P</mml:mi><mml:mfenced><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>b</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mi>P</mml:mi><mml:mfenced><mml:mrow><mml:mi>U</mml:mi><mml:mi>n</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>a</mml:mi></mml:msub></mml:mrow></mml:mfenced></mml:mrow></mml:mstyle></mml:mrow></mml:math></disp-formula>
<p>In this study, when <italic>Inconsistency</italic>(&#x0394;<sub>&#x1D456;&#x1D457;&#x1D458;</sub>) is higher, we assume the matcher is more confused, and we prioritize the triple when asking humans to correct the results.</p>
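<p>The score in Eq. (2) can be written out directly. The following Python sketch is illustrative: the function name and the example probabilities are hypothetical, and it assumes that the probability of an unmatch is one minus the probability of a match for the same pair.</p>
<preformat preformat-type="code"><![CDATA[
```python
def inconsistency(p_ij, p_jk, p_ki):
    """Eq. (2) for one triple of entities.

    Each argument is P(Match) for one pair of the triple:
    (s_i, s_j), (s_j, s_k) and (s_k, s_i).
    Assumption: P(Unmatch) = 1 - P(Match) for the same pair.
    """
    return (
        p_ij * p_jk * (1 - p_ki)    # (a, b, c) = (i, j, k)
        + p_jk * p_ki * (1 - p_ij)  # (a, b, c) = (j, k, i)
        + p_ki * p_ij * (1 - p_jk)  # (a, b, c) = (k, i, j)
    )


# Hypothetical probabilities for illustration.
all_match = inconsistency(1.0, 1.0, 1.0)           # consistent triple
confident_violation = inconsistency(0.9, 0.9, 0.1)  # two matches, one unmatch
```
]]></preformat>
<p>A triple whose three pairs are all confidently matched scores zero, whereas a triple with two confident matches and one confident unmatch scores high even though no single pair looks uncertain on its own.</p>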
<fig id="F2">
<label>Figure 2.</label>
<caption><p>Inconsistency among the matching results for three entities</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c85-fig2.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
</sec>
<sec id="sec2_6">
<title>Inconsistency-driven sampling algorithm</title>
<p><bold>Algorithm 1</bold> implements the idea of inconsistency-driven sampling. It takes a set of entity pairs with match probabilities D&#x0302;={((&#x1D460;<sub>&#x1D456;</sub>,&#x1D460;<sub>&#x1D457;</sub>),&#x1D45D;)}&#x2282;&#x1D4B3;&#x00D7;[0,1] and outputs &#x1D45A; sample pairs taken from the given set, where the matching probability is &#x1D45D;=&#x1D443;(&#x1D440;&#x1D44E;&#x1D461;&#x1D450;&#x210E;|&#x1D460;<sub>&#x1D456;</sub>,&#x1D460;<sub>&#x1D457;</sub>). Note that the given set of entity pairs forms a graph, and that the triples that may cause inconsistency in the graph are triangles. We define the set of triangles as</p>
<disp-formula><label>(3)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mrow><mml:mi>&#x03C4;</mml:mi><mml:mo>=</mml:mo><mml:mfenced close="}" open="{"><mml:mrow><mml:msub><mml:mi>&#x0394;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mn>3</mml:mn></mml:msup><mml:mo>&#x007C;</mml:mo><mml:mfenced><mml:mrow><mml:mfenced><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:mfenced><mml:mrow><mml:mfenced><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:mfenced><mml:mrow><mml:mfenced><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mn>3</mml:mn></mml:msub></mml:mrow></mml:mfenced><mml:mo>&#x2208;</mml:mo><mml:mover accent='true'><mml:mi>D</mml:mi><mml:mo>&#x2322;</mml:mo></mml:mover></mml:mrow></mml:mfenced></mml:mrow></mml:math></disp-formula>
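<p>The flow of inconsistency-driven sampling is as follows: First, we apply the blocking method to the set of entities, then sort the triangles in descending order of <italic>Inconsistency</italic>(&#x0394;<sub>&#x1D456;&#x1D457;&#x1D458;</sub>), and lastly, take the &#x1D45A;/3 triangles with the highest <italic>Inconsistency</italic>(&#x0394;<sub>&#x1D456;&#x1D457;&#x1D458;</sub>) to sample pairs of entities.</p>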
<fig id="A1">
<label>Algorithm 1.</label>
<caption><p>Inconsistency-driven sampling</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c85-A1.jpg"><alt-text>none</alt-text></graphic>
</fig>
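<p>The sampling flow described in this section can be sketched as follows. This is a brute-force illustration under stated assumptions, not a transcription of Algorithm 1: the function names and the toy probabilities are hypothetical, the input is assumed to be the pairs that survive blocking, triangles are enumerated naively, and pairs shared by several triangles are returned only once.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations


def inconsistency(p_ij, p_jk, p_ki):
    """Eq. (2), assuming P(Unmatch) = 1 - P(Match) for each pair."""
    return (
        p_ij * p_jk * (1 - p_ki)
        + p_jk * p_ki * (1 - p_ij)
        + p_ki * p_ij * (1 - p_jk)
    )


def inconsistency_sampling(pair_probs, m):
    """Pick pairs from the m // 3 most inconsistent triangles.

    pair_probs maps an ordered pair (a, b), with a sorted before b,
    to P(Match | a, b) for the pairs kept after blocking.
    """
    entities = sorted({e for pair in pair_probs for e in pair})
    scored = []
    for i, j, k in combinations(entities, 3):
        edges = [(i, j), (j, k), (i, k)]
        # Only triples whose three pairs are all present form a triangle.
        if all(edge in pair_probs for edge in edges):
            score = inconsistency(*(pair_probs[e] for e in edges))
            scored.append((score, edges))
    # Most inconsistent triangles first.
    scored.sort(key=lambda item: item[0], reverse=True)
    sampled = []
    for _, edges in scored[: m // 3]:
        for edge in edges:
            if edge not in sampled:  # a pair can belong to several triangles
                sampled.append(edge)
    return sampled


# Hypothetical match probabilities for four entities a, b, c, d.
pair_probs = {
    ("a", "b"): 0.9,
    ("b", "c"): 0.9,
    ("a", "c"): 0.1,
    ("b", "d"): 0.05,
    ("c", "d"): 0.05,
}
sampled = inconsistency_sampling(pair_probs, m=3)
```
]]></preformat>
<p>With these toy values the triangle over a, b and c has the highest score, so its three pairs are the ones sent to humans. A practical implementation would enumerate triangles from the blocked graph rather than from all entity triples.</p>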
</sec>
<sec id="sec2_7">
<title>Experiments</title>
<p>We conducted an extensive set of experiments to address our research questions and evaluated the impact of sampling strategies on the f1 value of matchers on datasets from different domains.</p>
<p>Figure 3 gives an overview of the experiment workflow. First, we apply a blocking technique to each of the three datasets for the experiment and generate training and evaluation datasets. Then, we execute human-in-the-loop entity matching iterations with each of the five sampling strategies. We explain the steps one by one.</p>
</sec>
<sec id="sec2_8">
<title>Datasets and human settings</title>
<sec id="sec2_8_1">
<title>Datasets</title>
<p>We used three datasets taken from different domains (<xref ref-type="table" rid="T2">Table 2</xref>): <bold>Persons</bold> (K&#x00F6;pcke, et al., 2010), <bold>Bibliorecords</bold> (a private dataset supplied by public libraries in Japan), and <bold>Music</bold> (K&#x00F6;pcke, et al., 2010). Each dataset consists of a set of entities with an attribute that stores cluster labels; if two entities have the same cluster label, they are matched entities, i.e., they represent the same real-world entity. <bold>Persons</bold> and <bold>Bibliorecords</bold> are relatively clean datasets, while <bold>Music</bold> has many missing values and contains dirty attributes, such as an album name stored in the (music) title attribute.</p>
<table-wrap id="T2">
<label>Table 2.</label>
<caption><p>Datasets from three different domains</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Domain (Language)</bold></th>
<th align="center" valign="top"><bold>Attributes</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top"><bold>Persons</bold> (K&#x00F6;pcke, et al., 2010) (English)</td>
<td align="center" valign="top">name, surname, suburb, postcode</td>
</tr>
<tr>
<td align="center" valign="top"><bold>Bibliorecords</bold> (Japanese)</td>
<td align="center" valign="top">title, author, publisher, date</td>
</tr>
<tr>
<td align="center" valign="top"><bold>Music</bold> (K&#x00F6;pcke, et al., 2010) (English)</td>
<td align="center" valign="top">artist, title, album, year, length</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="sec3">
<title>Human settings</title>
<p>Humans inevitably make mistakes. To examine the impact of labelling errors on each sampling strategy systematically, we took a simulation-based approach: we implemented agents that act as human labellers and give labels with a given accuracy. We examined three accuracy levels for human labels: 100%, 95%, and 90%. Equivalently, 0%, 5%, and 10% of the labels are noisy, which corresponds to the setting of active learning with noisy labels (<xref rid="R31" ref-type="bibr">Wu, et al., 2022</xref>), (<xref rid="R34" ref-type="bibr">Younesian, et al., 2021</xref>).</p>
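<p>A simulated labeller of this kind can be sketched as follows. The function name <monospace>simulated_label</monospace> and the use of a seeded <monospace>random.Random</monospace> instance are assumptions made for illustration; the only property taken from the text is that the agent returns the correct label with a given accuracy.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random


def simulated_label(true_label, accuracy, rng):
    """Return the true label with probability `accuracy`, otherwise the flipped label."""
    if rng.random() < accuracy:
        return true_label
    return not true_label


# Example: a 90%-accurate agent labelling 10,000 truly matched pairs.
rng = random.Random(0)
labels = [simulated_label(True, 0.9, rng) for _ in range(10_000)]
observed_accuracy = sum(labels) / len(labels)
```
]]></preformat>
<p>With an accuracy of 1.0 the agent behaves as an oracle, which corresponds to the 100% case.</p>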
<sec id="sec3_1">
<title>Blocking and data preparation</title>
<p>For each of the three original datasets, we generated two datasets for use in the blocking phase and the human-in-the-loop entity matching iterations:</p>
<list list-type="bullet">
<list-item><p>the training dataset for the blocking phase and the Bayesian inference, and</p></list-item>
<list-item><p>the evaluation dataset for evaluating sampling strategies to update the Bayesian inference.</p></list-item>
</list>
<p>The two datasets are disjoint from each other and were constructed as follows.</p>
<p><bold>Training dataset</bold> &#x1D437;<sup>&#x1D461;</sup>: For each original dataset, we randomly chose 15,000 positive pairs (the two cluster labels are the same) and 15,000 negative pairs (the two cluster labels differ) from all pairs of entities in the original dataset. Each entity pair and its label form a triple contained in &#x1D437;<sup>&#x1D461;</sup>.</p>
<p><bold>Evaluation dataset</bold> &#x1D437;<sup>&#x1D452;</sup>: First, we randomly selected clusters from each original dataset until the selected clusters contained about 2000 entities in total. Then, we generated a set of labelled entity pairs from the selected clusters.</p>
<p>Finally, we constructed the evaluation dataset by applying a standard blocking technique based on metric learning to all pairs of the selected entities. The metric was learned from the training dataset, and the blocking kept only those pairs of entities whose learned distance was below a threshold.</p>
<p><xref ref-type="table" rid="T3">Table 3</xref> shows the statistics of the evaluation datasets constructed this way. Note that the blocking does not work in favour of the inconsistency-driven sampling because the blocking is done based on the metric learning result, and it removed some of the potentially matched pairs that can affect the inconsistency-based sampling (i.e., the recall of the matches is less than 1).</p>
<table-wrap id="T3">
<label>Table 3.</label>
<caption><p>Statistics of the evaluation datasets after blocking</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Domain (Language)</bold></th>
<th align="center" valign="top"><bold>#Entities</bold></th>
<th align="center" valign="top"><bold>#Pairs</bold></th>
<th align="center" valign="top"><bold>#Matches</bold></th>
<th align="center" valign="top"><bold>Recall of matches of the blocking</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top"><bold>Persons</bold></td>
<td align="center" valign="top">2001</td>
<td align="center" valign="top">48,835</td>
<td align="center" valign="top">1379</td>
<td align="center" valign="top">0.848</td>
</tr>
<tr>
<td align="center" valign="top"><bold>Bibliorecords</bold></td>
<td align="center" valign="top">2001</td>
<td align="center" valign="top">66,911</td>
<td align="center" valign="top">1594</td>
<td align="center" valign="top">0.840</td>
</tr>
<tr>
<td align="center" valign="top"><bold>Music</bold></td>
<td align="center" valign="top">2002</td>
<td align="center" valign="top">196,414</td>
<td align="center" valign="top">1613</td>
<td align="center" valign="top">0.739</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="sec3_2">
<title>Human-in-the-loop entity matching iterations</title>
<p>Algorithm 2 gives the concrete steps of the human-in-the-loop entity matching iterations in the experiment workflow. First, we fit the Bayesian inference with the training dataset &#x1D437;<sup>&#x1D461;</sup> and apply it to the evaluation dataset &#x1D437;<sup>&#x1D452;</sup>. Then, in each iteration, the workflow chooses samples, obtains human labels for them, and uses the result to update the inference. We chose Bayesian inference because it is a simple matcher that satisfies the requirement of our inconsistency-driven sampling: it outputs a matching probability for a pair of entities. Note that our research question does not concern the performance of a particular matcher, so we can use this Bayesian inference matcher without loss of generality.</p>
<fig id="A2">
<label>Algorithm 2.</label>
<caption><p>Human-in-the-loop entity matching iterations for the experiment</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c85-A2.jpg"><alt-text>none</alt-text></graphic>
</fig>
<sec id="sec3_2_1">
<title>Matcher implementation</title>
<p>The workflow implements a Bayesian inference-based matcher. The model infers the matching by considering the similarity measures between the given two entities. Our model requires the four similarities, which are the basic similarity measures for texts shown in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap id="T4">
<label>Table 4.</label>
<caption><p>Similarity measures used by the matcher</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"></th>
<th align="center" valign="top"><bold>Indicator</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top">1</td>
<td align="center" valign="top">FastText vector (Bojanowski, et al., 2017)</td>
</tr>
<tr>
<td align="center" valign="top">2</td>
<td align="center" valign="top">Jaro-Winkler (Winkler, et al., 1990)</td>
</tr>
<tr>
<td align="center" valign="top">3</td>
<td align="center" valign="top">Levenshtein (Levenshtein, et al., 1966)</td>
</tr>
<tr>
<td align="center" valign="top">4</td>
<td align="center" valign="top">Gestalt Pattern Matching (Virtanen, et al., 2020)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For each similarity measure, the model uses class-conditional probability density functions for matched and unmatched pairs. The probabilities obtained from these functions are combined as in Eq. (4) to estimate the matching probability of a pair.</p>
<disp-formula><label>(4)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mrow><mml:mi>P</mml:mi><mml:mfenced><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mrow><mml:mi>p</mml:mi><mml:mfenced><mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mfenced><mml:mi>l</mml:mi></mml:mfenced></mml:mrow></mml:msubsup><mml:mo>&#x007C;</mml:mo><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:mfenced><mml:mi>P</mml:mi><mml:mfenced><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:mfenced></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:mo>&#x2211;</mml:mo> <mml:mrow><mml:msub><mml:mrow></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mfenced close="}" open="{"><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>U</mml:mi><mml:mi>n</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:mfenced></mml:mrow></mml:msub><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mrow><mml:mi>p</mml:mi><mml:mfenced><mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mfenced><mml:mi>l</mml:mi></mml:mfenced></mml:mrow></mml:msubsup><mml:mo>&#x007C;</mml:mo><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:mfenced><mml:mi>P</mml:mi><mml:mfenced><mml:mrow><mml:mi>C</mml:mi><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:mfenced></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>Note that <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline"><mml:mrow><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mfenced><mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mfenced><mml:mn>1</mml:mn></mml:mfenced></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mfenced><mml:mi>n</mml:mi></mml:mfenced></mml:mrow></mml:msubsup></mml:mrow></mml:mfenced></mml:mrow><mml:mi>T</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is the vector of similarity values between entities &#x1D460;<sub>&#x1D456;</sub> and &#x1D460;<sub>&#x1D457;</sub>, with one component per similarity measure. The parameters of the densities &#x1D45D; are obtained by fitting a probability density function based on a Gaussian mixture distribution (Dempster, et al., 1977), (<xref rid="R23" ref-type="bibr">Pedregosa, et al., 2011</xref>) to the training data (<xref ref-type="fig" rid="F4">Fig. 4</xref>).</p>
<fig id="F4">
<label>Figure 4.</label>
<caption><p>Fitting probability density function</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c85-fig4.jpg"><alt-text>none</alt-text></graphic>
</fig>
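<p>To make Eq. (4) concrete, the following sketch evaluates the posterior for a similarity vector. It is only an illustration under stated assumptions: each class-conditional density is a single Gaussian with made-up parameters standing in for the fitted Gaussian mixtures, two similarity measures are used instead of four, and the names <monospace>gaussian_pdf</monospace> and <monospace>match_probability</monospace> are hypothetical. The priors are the ones reported in the experimental settings (0.1 and 0.9).</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
from functools import partial


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution; a stand-in for a fitted mixture density."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def match_probability(z, densities, priors):
    """Eq. (4): posterior probability of Match given the similarity vector z.

    densities[cls][l] is the density p(z^(l) | cls) and priors[cls] is P(cls).
    """
    joint = {}
    for cls in ("Match", "Unmatch"):
        likelihood = math.prod(pdf(value) for pdf, value in zip(densities[cls], z))
        joint[cls] = likelihood * priors[cls]
    return joint["Match"] / (joint["Match"] + joint["Unmatch"])


# Illustrative parameters only (assumption): matched pairs have similarities near 0.9.
densities = {
    "Match": [partial(gaussian_pdf, mean=0.9, std=0.1)] * 2,
    "Unmatch": [partial(gaussian_pdf, mean=0.3, std=0.2)] * 2,
}
priors = {"Match": 0.1, "Unmatch": 0.9}

p_similar = match_probability((0.9, 0.9), densities, priors)
p_dissimilar = match_probability((0.3, 0.3), densities, priors)
```
]]></preformat>
<p>Despite the low prior for Match, two high similarity values are enough to push the posterior close to 1 in this toy setting, while two low values push it close to 0.</p>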
</sec>
<sec id="sec3_2_2">
<title>Sampling strategies</title>
<p>We used the following five sampling strategies. Note that uncertainty sampling and query-by-committee sampling are <italic>model-based</italic> strategies, while diversity sampling and random sampling are <italic>model-free</italic> strategies.</p>
<list list-type="order">
<list-item><p><bold>Inconsistency-driven sampling</bold>. This is our proposed method, which chooses pairs whose estimated labels lead to an inconsistency.</p></list-item>
<list-item><p><bold>Uncertainty sampling (model-based)</bold> (<xref rid="R26" ref-type="bibr">Settles, 2010</xref>). Using the Bayesian inference, we compute a confidence value for each pair and choose the pairs that are not clearly positive or negative. Specifically, we compute the priority in Eq. (5), which is highest when the match probability is 0.5, and choose the pairs with the highest priority, i.e., the least confidence.
<disp-formula><label>(5)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mrow><mml:mi>Pr</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mfenced close="|" open="|"><mml:mrow><mml:mi>P</mml:mi><mml:mfenced><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mo>&#x007C;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mfenced><mml:mo>&#x2212;</mml:mo><mml:mn>0.5</mml:mn></mml:mrow></mml:mfenced></mml:mrow></mml:math></disp-formula></p></list-item>
<list-item><p><bold>Query-by-committee sampling (model-based)</bold> (<xref rid="R26" ref-type="bibr">Settles, 2010</xref>). Query-by-committee chooses pairs by aggregating the results of multiple indicators. In this experiment, we used two indicators, a positive score (Eq. (6)) and a negative score (Eq. (7)), and sampled pairs on which the two scores are antagonistic. Specifically, we sample pairs in increasing order of the value of Eq. (8).
<disp-formula><label>(6)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mrow><mml:mi>p</mml:mi><mml:mfenced><mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mfenced><mml:mi>l</mml:mi></mml:mfenced></mml:mrow></mml:msubsup><mml:mo>&#x007C;</mml:mo><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:mfenced><mml:mi>P</mml:mi><mml:mfenced><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:mfenced></mml:mrow></mml:mstyle></mml:math></disp-formula>
<disp-formula><label>(7)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x220F;</mml:mo><mml:mrow><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mrow><mml:mi>p</mml:mi><mml:mfenced><mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mfenced><mml:mi>l</mml:mi></mml:mfenced></mml:mrow></mml:msubsup><mml:mo>&#x007C;</mml:mo><mml:mi>U</mml:mi><mml:mi>n</mml:mi><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:mfenced><mml:mi>P</mml:mi><mml:mfenced><mml:mrow><mml:mi>U</mml:mi><mml:mi>n</mml:mi><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:mfenced></mml:mrow></mml:mstyle></mml:math></disp-formula>
<disp-formula><label>(8)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mi>Pr</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mfenced><mml:mrow><mml:mi>P</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mo>+</mml:mo><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mfenced></mml:math></disp-formula></p></list-item>
<list-item><p><bold>Diversity sampling (model-free)</bold> (O&#x2019;Neill, et al., 2017). This method selects pairs that are distant from those already labelled by humans. The distance is calculated using the FastText embedding (Bojanowski, et al., 2017). In the first iteration, &#x2112;<sup><italic>m</italic></sup> = &#x2205;, so random sampling is used in that iteration only.</p></list-item>
<list-item><p><bold>Random sampling (model-free)</bold>. This method randomly selects the candidate pairs. We used the random function from Python&#x2019;s random module.</p></list-item>
</list>
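<p>As an illustration of the uncertainty sampling strategy, the sketch below ranks candidate pairs by the priority of Eq. (5), which is largest for pairs whose match probability is closest to 0.5. The function names and the dictionary <monospace>probs</monospace> (pair to match probability) are assumptions made for this example.</p>
<preformat preformat-type="code"><![CDATA[
```python
def uncertainty_priority(p_match):
    """Eq. (5): priority is 1 at p = 0.5 and falls to 0.5 at p = 0 or p = 1."""
    return 1 - abs(p_match - 0.5)


def uncertainty_sample(probs, m):
    """Return the m pairs with the highest priority, i.e. the least confident ones."""
    ranked = sorted(probs, key=lambda pair: uncertainty_priority(probs[pair]), reverse=True)
    return ranked[:m]


probs = {("a", "b"): 0.95, ("a", "c"): 0.52, ("b", "c"): 0.10, ("c", "d"): 0.40}
batch = uncertainty_sample(probs, 2)
```
]]></preformat>
<p>In this toy example the batch contains the pairs with probabilities 0.52 and 0.40, while the confidently matched and confidently unmatched pairs are left out.</p>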
</sec>
</sec>
<sec id="sec3_3">
<title>Other settings</title>
<p><bold>Prior distribution for Bayesian inference</bold>. The prior distribution was set to &#x1D443;(&#x1D440;&#x1D44E;&#x1D461;&#x1D450;&#x210E;) = 0.1 and &#x1D443;(&#x1D448;&#x1D45B;&#x1D45A;&#x1D44E;&#x1D461;&#x1D450;&#x210E;) = 0.9.</p>
<p><bold>Batch sampling</bold>. We adopted a batch sampling scheme to reduce the number of inference updates: in each iteration, we choose <italic>m</italic> samples (instead of one) and obtain their human labels before updating the inference. We set <italic>m</italic> = 300. Inconsistency-driven sampling needs care in the batch setting because the same pair may receive more than one human label if it appears in several inconsistent triangles. We resolved such duplicates by majority vote.</p>
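<p>The duplicate resolution can be sketched as follows. The function name and the input format (a list of pair and label tuples) are assumptions for illustration, and the handling of ties, which goes to the label seen first here, is a choice of this sketch rather than something stated in the text.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict


def resolve_duplicates(labelled_pairs):
    """Collapse repeated human labels for the same pair by majority vote."""
    votes = defaultdict(list)
    for pair, label in labelled_pairs:
        votes[pair].append(label)
    resolved = {}
    for pair, labels in votes.items():
        # most_common keeps first-seen order among equal counts (tie rule is an assumption).
        resolved[pair] = Counter(labels).most_common(1)[0][0]
    return resolved


batch_labels = [
    (("a", "b"), True),
    (("a", "b"), True),
    (("a", "b"), False),
    (("b", "c"), False),
]
final_labels = resolve_duplicates(batch_labels)
```
]]></preformat>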
<p><bold>Languages and libraries</bold>. The algorithms were implemented in Python 3, using TensorFlow (<xref rid="R1" ref-type="bibr">Abadi et al., 2015</xref>) for constructing the metric learner, CuPy (<xref rid="R20" ref-type="bibr">Okuta et al., 2017</xref>) and Faiss (<xref rid="R11" ref-type="bibr">Johnson et al., 2019</xref>) for blocking and indexing, and SciPy (<xref rid="R29" ref-type="bibr">Virtanen, et al., 2020</xref>) and scikit-learn (<xref rid="R23" ref-type="bibr">Pedregosa, et al., 2011</xref>) for fitting and evaluating the probability density functions used in the Bayesian inference.</p>
</sec>
</sec>
<sec id="sec4">
<title>Results</title>
<sec id="sec4_1">
<title>Difference of sampling distributions</title>
<p><xref ref-type="fig" rid="F5">Fig. 5</xref> shows how the five sampling strategies differ from each other in terms of the samples they choose. The horizontal axis represents the match probability determined by the inference: the closer the match probability is to 1, the more likely the pair is to be matched, and the closer it is to 0, the less likely. The vertical axis represents the number of pairs chosen by the sampling strategy (on a log scale). A higher orange intensity means that the pairs were chosen in earlier iterations.</p>
<p>The result clearly shows that the strategies differ markedly in the samples they choose. Inconsistency-driven sampling tends to choose pairs that are highly likely or highly unlikely to be positive, while uncertainty sampling constantly chooses pairs in the middle. Query-by-committee, diversity, and random sampling choose pairs from a wide range.</p>
<fig id="F5">
<label>Figure 5.</label>
<caption><p>Analysis: Distribution of match probability for each sampling strategy</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c85-fig5.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
<sec id="sec4_2">
<title>Effects on matcher performance</title>
<p><xref ref-type="fig" rid="F6">Fig. 6</xref> shows the effects of each sampling strategy on the matcher performance over the iterations. The X-axis is the number of iterations. The Y-axis is the F1 value of the output of the Bayesian inference combined with the human labels, measured on the evaluation dataset after each iteration. Each line represents a sampling strategy.</p>
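<p>One possible reading of this measure is sketched below: a pair takes its human label when one is available and the thresholded inference output otherwise, and F1 is computed against the ground truth. The override rule, the threshold of 0.5, and all names are assumptions of this sketch.</p>
<preformat preformat-type="code"><![CDATA[
```python
def f1_with_human_labels(probs, human_labels, truth, threshold=0.5):
    """F1 of final labels: human label if present, else thresholded match probability."""
    tp = fp = fn = 0
    for pair, is_match in truth.items():
        predicted = human_labels.get(pair, probs[pair] >= threshold)
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif is_match:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = {"p1": True, "p2": True, "p3": False, "p4": False}
probs = {"p1": 0.9, "p2": 0.2, "p3": 0.7, "p4": 0.1}
f1_before = f1_with_human_labels(probs, {}, truth)
f1_after = f1_with_human_labels(probs, {"p2": True, "p3": False}, truth)
```
]]></preformat>
<p>In the toy data, correct human labels for the two misclassified pairs raise the F1 value from 0.5 to 1.0; with noisy labels, an incorrect human label would instead lower it.</p>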
<p>The purpose of the figure is to identify the conditions under which each sampling strategy works best. Note that the performance of the matchers (to be used with the sampling strategies) differs greatly among <bold>Persons</bold>, <bold>Bibliorecords</bold>, and <bold>Music</bold> (very high, moderate, and very low, respectively). The result suggests the following. First, inconsistency-driven sampling is effective when the F1 value of the matcher is high (i.e., higher than 0.8), while uncertainty sampling performs the best for lower-quality matchers. Second, when the F1 value is high, the performance of inconsistency-driven sampling is stable, while the other sampling strategies are directly affected when the accuracy of human labels becomes lower. These results are reasonable for two reasons. If the F1 value of the matcher is not high, it is difficult to identify inconsistencies correctly. In addition, the inconsistency-driven approach can identify inconsistencies even in newly obtained human labels, while the other strategies treat human inputs as oracles even when they are incorrect.</p>
<fig id="F6">
<label>Figure 6.</label>
<caption><p>Experiment: Comparison of F1 value for different sampling strategies</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c85-fig6.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
<sec id="sec4_3">
<title>Hybrid strategy</title>
<p>The results so far showed that (1) inconsistency-driven sampling works better when the inference F1 value is high, while uncertainty sampling works better otherwise, and that (2) the pairs chosen by the two strategies are completely different from each other. They suggest that it is worth considering a hybrid strategy. Since it is difficult to set a threshold for switching strategies, we considered a simple hybrid strategy that alternates between uncertainty sampling and inconsistency-driven sampling from one iteration to the next.</p>
<p><xref ref-type="fig" rid="F7">Fig. 7</xref> shows the results of applying the hybrid strategy to the three datasets with 90%-accurate human labels. The result shows that the hybrid strategy performs well when the inference F1 value is moderate or higher, which covers many practical situations. On the other hand, uncertainty sampling still works better when the inference has an extremely low F1 value.</p>
<fig id="F7">
<label>Figure 7.</label>
<caption><p>Experiment: Comparison of F1 value for the hybrid strategy and the other sampling strategies with 90%-accurate human labels</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images\c85-fig7.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
</sec>
<sec id="sec5">
<title>Conclusion</title>
<p>This paper presented an inconsistency-driven sampling strategy to deal with the <italic>unknown-unknown problem</italic> in active learning for entity matching. The method asks humans to resolve entity pairs when an inconsistency is found in the transitivity property. This paper implemented a human-in-the-loop entity matching framework that combines this sampling strategy with a similarity-based blocking method and Bayesian inference. It also reported the results of an extensive set of experiments that reveal how and when the method is effective. The results showed that inconsistency-driven sampling selects very different entity pairs compared to other sampling strategies and that a simple hybrid strategy performs well in many practical situations. Future work includes studying the interaction between the sampling strategy and matcher implementations. For example, the sampling result may suggest switching to other types of matchers.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>This work was supported in part by JST CREST (JPMJCR22M2) and Grants-in-Aid for Scientific Research (22H00508, 21H03552).</p>
</ack>
<ref-list>
<title>References</title>
<ref id="R1"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Abadi</surname><given-names>M.</given-names></name><name><surname>Agarwal</surname><given-names>A.</given-names></name><name><surname>Barham</surname><given-names>P.</given-names></name><name><surname>Brevdo</surname><given-names>E.</given-names></name><name><surname>Chen</surname><given-names>Z.</given-names></name><name><surname>Citro</surname><given-names>C.</given-names></name><name><surname>Corrado</surname><given-names>G.S.</given-names></name><name><surname>Davis</surname><given-names>A.</given-names></name><name><surname>Dean</surname><given-names>J.</given-names></name><name><surname>Devin</surname><given-names>M.</given-names></name><name><surname>Ghemawat</surname><given-names>S.</given-names></name><name><surname>Goodfellow</surname><given-names>I.</given-names></name><name><surname>Harp</surname><given-names>A.</given-names></name><name><surname>Irving</surname><given-names>G.</given-names></name><name><surname>Isard</surname><given-names>M.</given-names></name><name><surname>Jia</surname><given-names>Y.</given-names></name><name><surname>Jozefowicz</surname><given-names>R.</given-names></name><name><surname>Kaiser</surname><given-names>L.</given-names></name><name><surname>Kudlur</surname><given-names>M.</given-names></name><name><surname>Levenberg</surname><given-names>J.</given-names></name><name><surname>Man&#x00E9;</surname><given-names>D.</given-names></name><name><surname>Monga</surname><given-names>R.</given-names></name><name><surname>Moore</surname><given-names>S.</given-names></name><name><surname>Murray</surname><given-names>D.</given-names></name><name><surname>Olah</surname><given-names>C.</given-names></name><name><surname>Schuster</surname><given-names>M.</given-names></name><name><surname>Shlens</surname><given-names>J.</given-names></name><name><surname>Steiner</surname><given-names>B.</given-names></name><name><surname>Sutskever</surname><given-names>I.</given-names></name><name><surname>Talwar</surname><given-names>K.</given-names></name><name><surname>Tucker</surname><given-names>P.</given-names></name><name><surname>Vanhoucke</surname><given-names>V.</given-names></name><name><surname>Vasudevan</surname><given-names>V.</given-names></name><name><surname>Vi&#x00E9;gas</surname><given-names>F.</given-names></name><name><surname>Vinyals</surname><given-names>O.</given-names></name><name><surname>Warden</surname><given-names>P.</given-names></name><name><surname>Wattenberg</surname><given-names>M.</given-names></name><name><surname>Wicke</surname><given-names>M.</given-names></name><name><surname>Yu</surname><given-names>Y.</given-names></name><name><surname>Zheng</surname><given-names>X.</given-names></name></person-group><article-title>TensorFlow: Large-scale machine learning on heterogeneous systems</article-title><year>2015</year><ext-link ext-link-type="uri" xlink:href="http://tensorflow.org/">http://tensorflow.org/</ext-link><comment>software available from tensorflow.org</comment></element-citation></ref>
<ref id="R2"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Benjelloun</surname><given-names>O.</given-names></name><name><surname>Garcia-Molina</surname><given-names>H.</given-names></name><name><surname>Menestrina</surname><given-names>D.</given-names></name><name><surname>Su</surname><given-names>Q.</given-names></name><name><surname>Whang</surname><given-names>S.E.</given-names></name><name><surname>Widom</surname><given-names>J.</given-names></name></person-group><year>2009</year><article-title>Swoosh: a generic approach to entity resolution</article-title><source>The VLDB Journal</source> <volume>18</volume><fpage>255</fpage><lpage>276</lpage></element-citation></ref>
<ref id="R3"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Bojanowski</surname><given-names>P.</given-names></name><name><surname>Grave</surname><given-names>E.</given-names></name><name><surname>Joulin</surname><given-names>A.</given-names></name><name><surname>Mikolov</surname><given-names>T.</given-names></name></person-group><year>2017</year><article-title>Enriching word vectors with subword information</article-title></element-citation></ref>
<ref id="R4"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Christophides</surname><given-names>V.</given-names></name><name><surname>Efthymiou</surname><given-names>V.</given-names></name><name><surname>Palpanas</surname><given-names>T.</given-names></name><name><surname>Papadakis</surname><given-names>G.</given-names></name><name><surname>Stefanidis</surname><given-names>K.</given-names></name></person-group><year>2020</year><article-title>An overview of end-to-end entity resolution for big data</article-title><source>ACM Comput. Surv</source><volume>53</volume><issue>6</issue><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3418896">https://doi.org/10.1145/3418896</ext-link></element-citation></ref>
<ref id="R5"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Chung</surname><given-names>Y.</given-names></name><name><surname>Haas</surname><given-names>P.J.</given-names></name><name><surname>Upfal</surname><given-names>E.</given-names></name><name><surname>Kraska</surname><given-names>T.</given-names></name></person-group><year>2019</year><article-title>Unknown examples &#x0026; machine learning model generalization</article-title></element-citation></ref>
<ref id="R6"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Cohen</surname><given-names>W.W.</given-names></name><name><surname>Ravikumar</surname><given-names>P.</given-names></name><name><surname>Fienberg</surname><given-names>S.E.</given-names></name><etal/></person-group><year>2003</year><article-title>A comparison of string distance metrics for name-matching tasks</article-title><source>IIWeb</source><volume>3</volume><fpage>73</fpage><lpage>78</lpage></element-citation></ref>
<ref id="R7"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Das</surname><given-names>S.</given-names></name><name><surname>G.C.</surname><given-names>P.S.</given-names></name><name><surname>Doan</surname><given-names>A.</given-names></name><name><surname>Naughton</surname><given-names>J.F.</given-names></name><name><surname>Krishnan</surname><given-names>G.</given-names></name><name><surname>Deep</surname><given-names>R.</given-names></name><name><surname>Arcaute</surname><given-names>E.</given-names></name><name><surname>Raghavendra</surname><given-names>V.</given-names></name><name><surname>Park</surname><given-names>Y.</given-names></name></person-group><year>2017</year><article-title>Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services</article-title><source>Proceedings of the 2017 ACM International Conference on Management of Data</source><fpage>1431</fpage><lpage>1446</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3035918.3035960">https://doi.org/10.1145/3035918.3035960</ext-link></element-citation></ref>
<ref id="R8"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Dempster</surname><given-names>A.P.</given-names></name><name><surname>Laird</surname><given-names>N.M.</given-names></name><name><surname>Rubin</surname><given-names>D.B.</given-names></name></person-group><year>1977</year><article-title>Maximum likelihood from incomplete data via the EM algorithm</article-title><source>Journal of the Royal Statistical Society: Series B (Methodological)</source> <volume>39</volume><issue>1</issue><fpage>1</fpage><lpage>22</lpage></element-citation></ref>
<ref id="R9"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ebraheem</surname><given-names>M.</given-names></name><name><surname>Thirumuruganathan</surname><given-names>S.</given-names></name><name><surname>Joty</surname><given-names>S.</given-names></name><name><surname>Ouzzani</surname><given-names>M.</given-names></name><name><surname>Tang</surname><given-names>N.</given-names></name></person-group><year>2018</year><article-title>Distributed representations of tuples for entity resolution</article-title><source>Proc. VLDB Endow</source><volume>11</volume><issue>11</issue><fpage>1454</fpage><lpage>1467</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.14778/3236187.3236198">https://doi.org/10.14778/3236187.3236198</ext-link></element-citation></ref>
<ref id="R10"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Gokhale</surname><given-names>C.</given-names></name><name><surname>Das</surname><given-names>S.</given-names></name><name><surname>Doan</surname><given-names>A.</given-names></name><name><surname>Naughton</surname><given-names>J.F.</given-names></name><name><surname>Rampalli</surname><given-names>N.</given-names></name><name><surname>Shavlik</surname><given-names>J.</given-names></name><name><surname>Zhu</surname><given-names>X.</given-names></name></person-group><year>2014</year><article-title>Corleone: Hands-off crowdsourcing for entity matching</article-title><source>Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data</source><fpage>601</fpage><lpage>612</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/2588555.2588576">https://doi.org/10.1145/2588555.2588576</ext-link></element-citation></ref>
<ref id="R11"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Govind</surname><given-names>Y.</given-names></name><name><surname>Paulson</surname><given-names>E.</given-names></name><name><surname>Nagarajan</surname><given-names>P.</given-names></name><name><surname>G.C.</surname><given-names>P.S.</given-names></name><name><surname>Doan</surname><given-names>A.</given-names></name><name><surname>Park</surname><given-names>Y.</given-names></name><name><surname>Fung</surname><given-names>G.M.</given-names></name><name><surname>Conathan</surname><given-names>D.</given-names></name><name><surname>Carter</surname><given-names>M.</given-names></name><name><surname>Sun</surname><given-names>M.</given-names></name></person-group><year>2018</year><article-title>Cloudmatcher: A hands-off cloud/crowd service for entity matching</article-title><source>Proceedings of the VLDB Endowment</source><volume>11</volume><issue>12</issue><fpage>2042</fpage><lpage>2045</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.14778/3229863.3236255">https://doi.org/10.14778/3229863.3236255</ext-link></element-citation></ref>
<ref id="R12"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jaro</surname><given-names>M.A.</given-names></name></person-group><year>1989</year><article-title>Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida</article-title><source>Journal of the American Statistical Association</source> <volume>84</volume><issue>406</issue><fpage>414</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.2307/2289924">https://doi.org/10.2307/2289924</ext-link></element-citation></ref>
<ref id="R13"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Johnson</surname><given-names>J.</given-names></name><name><surname>Douze</surname><given-names>M.</given-names></name><name><surname>J&#x00E9;gou</surname><given-names>H.</given-names></name></person-group><year>2019</year><article-title>Billion-scale similarity search with GPUs</article-title><source>IEEE Transactions on Big Data</source> <volume>7</volume><issue>3</issue><fpage>535</fpage><lpage>547</lpage></element-citation></ref>
<ref id="R14"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>K&#x00F6;pcke</surname><given-names>H.</given-names></name><name><surname>Thor</surname><given-names>A.</given-names></name><name><surname>Rahm</surname><given-names>E.</given-names></name></person-group><year>2010</year><article-title>Evaluation of entity resolution approaches on real-world match problems</article-title><source>Proceedings of the VLDB Endowment</source><volume>3</volume><issue>1-2</issue><fpage>484</fpage><lpage>493</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.14778/1920841.1920904">https://doi.org/10.14778/1920841.1920904</ext-link></element-citation></ref>
<ref id="R15"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Levenshtein</surname><given-names>V.I.</given-names></name></person-group><year>1966</year><article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title><source>Soviet Physics Doklady</source><volume>10</volume><fpage>707</fpage><lpage>710</lpage></element-citation></ref>
<ref id="R16"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>G.</given-names></name></person-group><year>2017</year><article-title>Human-in-the-loop data integration</article-title><source>Proceedings of the VLDB Endowment</source> <volume>10</volume><issue>12</issue><fpage>2006</fpage><lpage>2017</lpage></element-citation></ref>
<ref id="R17"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>Y.</given-names></name><name><surname>Li</surname><given-names>J.</given-names></name><name><surname>Suhara</surname><given-names>Y.</given-names></name><name><surname>Doan</surname><given-names>A.</given-names></name><name><surname>Tan</surname><given-names>W.C.</given-names></name></person-group><year>2020</year><article-title>Deep entity matching with pre-trained language models</article-title><source>Proceedings of the VLDB Endowment</source> <volume>14</volume><issue>1</issue><fpage>50</fpage><lpage>60</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.14778/3421424.3421431">https://doi.org/10.14778/3421424.3421431</ext-link></element-citation></ref>
<ref id="R18"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Mudgal</surname><given-names>S.</given-names></name><name><surname>Li</surname><given-names>H.</given-names></name><name><surname>Rekatsinas</surname><given-names>T.</given-names></name><name><surname>Doan</surname><given-names>A.</given-names></name><name><surname>Park</surname><given-names>Y.</given-names></name><name><surname>Krishnan</surname><given-names>G.</given-names></name><name><surname>Deep</surname><given-names>R.</given-names></name><name><surname>Arcaute</surname><given-names>E.</given-names></name><name><surname>Raghavendra</surname><given-names>V.</given-names></name></person-group><year>2018</year><article-title>Deep learning for entity matching: A design space exploration</article-title><source>Proceedings of the 2018 International Conference on Management of Data</source><fpage>19</fpage><lpage>34</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3183713.3196926">https://doi.org/10.1145/3183713.3196926</ext-link></element-citation></ref>
<ref id="R19"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Nielsen</surname><given-names>R.D.</given-names></name></person-group><year>2018</year><article-title>Introduction to machine learning for digital library applications</article-title><source>Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries</source><fpage>421</fpage><lpage>422</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3197026.3201780">https://doi.org/10.1145/3197026.3201780</ext-link></element-citation></ref>
<ref id="R20"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Okuta</surname><given-names>R.</given-names></name><name><surname>Unno</surname><given-names>Y.</given-names></name><name><surname>Nishino</surname><given-names>D.</given-names></name><name><surname>Hido</surname><given-names>S.</given-names></name><name><surname>Loomis</surname><given-names>C.</given-names></name></person-group><year>2017</year><article-title>CuPy: A NumPy-compatible library for NVIDIA GPU calculations</article-title><source>Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS)</source><ext-link ext-link-type="uri" xlink:href="http://learningsys.org/nips17/assets/papers/paper_16.pdf">http://learningsys.org/nips17/assets/papers/paper_16.pdf</ext-link></element-citation></ref>
<ref id="R21"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>O&#x2019;Neill</surname><given-names>J.</given-names></name><name><surname>Delany</surname><given-names>S.</given-names></name><name><surname>MacNamee</surname><given-names>B.</given-names></name></person-group><year>2017</year><article-title>Model-Free and Model-Based Active Learning for Regression</article-title><volume>513</volume><fpage>375</fpage><lpage>386</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/978-3-319-46562-3_24">https://doi.org/10.1007/978-3-319-46562-3_24</ext-link></element-citation></ref>
<ref id="R22"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Osawa</surname><given-names>N.</given-names></name><name><surname>Ito</surname><given-names>H.</given-names></name><name><surname>Fukushima</surname><given-names>Y.</given-names></name><name><surname>Harada</surname><given-names>T.</given-names></name><name><surname>Morishima</surname><given-names>A.</given-names></name></person-group><year>2021</year><article-title>Bubble: A quality-aware human-in-the-loop entity matching framework</article-title><source>The 5th IEEE Workshop on Human-in-the-loop Methods and Future of Work in Big-Data (IEEE HMData2021)</source><fpage>3557</fpage><lpage>3565</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/BigData52589.2021.9672002">https://doi.org/10.1109/BigData52589.2021.9672002</ext-link></element-citation></ref>
<ref id="R23"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Pedregosa</surname><given-names>F.</given-names></name><name><surname>Varoquaux</surname><given-names>G.</given-names></name><name><surname>Gramfort</surname><given-names>A.</given-names></name><name><surname>Michel</surname><given-names>V.</given-names></name><name><surname>Thirion</surname><given-names>B.</given-names></name><name><surname>Grisel</surname><given-names>O.</given-names></name><name><surname>Blondel</surname><given-names>M.</given-names></name><name><surname>Prettenhofer</surname><given-names>P.</given-names></name><name><surname>Weiss</surname><given-names>R.</given-names></name><name><surname>Dubourg</surname><given-names>V.</given-names></name><name><surname>Vanderplas</surname><given-names>J.</given-names></name><name><surname>Passos</surname><given-names>A.</given-names></name><name><surname>Cournapeau</surname><given-names>D.</given-names></name><name><surname>Brucher</surname><given-names>M.</given-names></name><name><surname>Perrot</surname><given-names>M.</given-names></name><name><surname>Duchesnay</surname><given-names>E.</given-names></name></person-group><year>2011</year><article-title>Scikit-learn: Machine learning in Python</article-title><source>Journal of Machine Learning Research</source> <volume>12</volume><fpage>2825</fpage><lpage>2830</lpage></element-citation></ref>
<ref id="R24"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Peeters</surname><given-names>R.</given-names></name><name><surname>Bizer</surname><given-names>C.</given-names></name></person-group><year>2022</year><article-title>Supervised contrastive learning for product matching</article-title><source>Companion Proceedings of the Web Conference 2022</source><fpage>248</fpage><lpage>251</lpage> <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3487553.3524254">https://doi.org/10.1145/3487553.3524254</ext-link></element-citation></ref>
<ref id="R25"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Robertson</surname><given-names>S.</given-names></name><name><surname>Walker</surname><given-names>S.</given-names></name><name><surname>Jones</surname><given-names>S.</given-names></name><name><surname>Hancock-Beaulieu</surname><given-names>M.M.</given-names></name><name><surname>Gatford</surname><given-names>M.</given-names></name></person-group><year>1995</year><article-title>Okapi at TREC-3</article-title><ext-link ext-link-type="uri" xlink:href="https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/">https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/</ext-link></element-citation></ref>
<ref id="R26"><element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Settles</surname><given-names>B.</given-names></name></person-group><year>2010</year><chapter-title>Active learning literature survey</chapter-title><source>Active Learning Literature Survey</source><publisher-name>University of Wisconsin-Madison</publisher-name><ext-link ext-link-type="uri" xlink:href="https://minds.wisconsin.edu/bitstream/handle/1793/60660/TR1648.pdf">https://minds.wisconsin.edu/bitstream/handle/1793/60660/TR1648.pdf</ext-link></element-citation></ref>
<ref id="R27"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Takashi</surname><given-names>H.</given-names></name><name><surname>Yukihiro</surname><given-names>F.</given-names></name><name><surname>Sho</surname><given-names>S.</given-names></name><name><surname>Misato</surname><given-names>T.</given-names></name><name><surname>Ryuji</surname><given-names>Y.</given-names></name><name><surname>Atsuyuki</surname><given-names>M.</given-names></name></person-group><year>2019</year><article-title>Advancement of bibliographic identification using a crowdsourcing system</article-title><source>Proceedings of the 9th Asia-Pacific Conference on Library &#x0026; Information Education and Practice (A-LIEP 2019)</source><fpage>71</fpage><lpage>82</lpage></element-citation></ref>
<ref id="R28"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Trabelsi</surname><given-names>M.</given-names></name><name><surname>Heflin</surname><given-names>J.</given-names></name><name><surname>Cao</surname><given-names>J.</given-names></name></person-group><year>2022</year><article-title>Dame: Domain adaptation for matching entities</article-title></element-citation></ref>
<ref id="R29"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Virtanen</surname><given-names>P.</given-names></name><name><surname>Gommers</surname><given-names>R.</given-names></name><name><surname>Oliphant</surname><given-names>T.E.</given-names></name><name><surname>Haberland</surname><given-names>M.</given-names></name><name><surname>Reddy</surname><given-names>T.</given-names></name><name><surname>Cournapeau</surname><given-names>D.</given-names></name><name><surname>Burovski</surname><given-names>E.</given-names></name><name><surname>Peterson</surname><given-names>P.</given-names></name><name><surname>Weckesser</surname><given-names>W.</given-names></name><name><surname>Bright</surname><given-names>J.</given-names></name><name><surname>van der Walt</surname><given-names>S.J.</given-names></name><name><surname>Brett</surname><given-names>M.</given-names></name><name><surname>Wilson</surname><given-names>J.</given-names></name><name><surname>Millman</surname><given-names>K.J.</given-names></name><name><surname>Mayorov</surname><given-names>N.</given-names></name><name><surname>Nelson</surname><given-names>A.R.J.</given-names></name><name><surname>Jones</surname><given-names>E.</given-names></name><name><surname>Kern</surname><given-names>R.</given-names></name><name><surname>Larson</surname><given-names>E.</given-names></name><name><surname>Carey</surname><given-names>C.J.</given-names></name><name><surname>Polat</surname><given-names>&#x0130;.</given-names></name><name><surname>Feng</surname><given-names>Y.</given-names></name><name><surname>Moore</surname><given-names>E.W.</given-names></name><name><surname>VanderPlas</surname><given-names>J.</given-names></name><name><surname>Laxalde</surname><given-names>D.</given-names></name><name><surname>Perktold</surname><given-names>J.</given-names></name><name><surname>Cimrman</surname><given-names>R.</given-names></name><name><surname>Henriksen</surname><given-names>I.</given-names></name><name><surname>Quintero</surname><given-names>E.A.</given-names></name><name><surname>Harris</surname><given-names>C.R.</given-names></name><name><surname>Archibald</surname><given-names>A.M.</given-names></name><name><surname>Ribeiro</surname><given-names>A.H.</given-names></name><name><surname>Pedregosa</surname><given-names>F.</given-names></name><name><surname>van Mulbregt</surname><given-names>P.</given-names></name><collab>SciPy 1.0 Contributors</collab></person-group><year>2020</year><article-title>SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python</article-title><source>Nature Methods</source> <volume>17</volume><fpage>261</fpage><lpage>272</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1038/s41592-019-0686-2">https://doi.org/10.1038/s41592-019-0686-2</ext-link></element-citation></ref>
<ref id="R30"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Winkler</surname><given-names>W.</given-names></name></person-group><year>1990</year><article-title>String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage</article-title><source>Proceedings of the Section on Survey Research Methods</source><fpage>354</fpage><lpage>359</lpage></element-citation></ref>
<ref id="R31"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname><given-names>M.</given-names></name><name><surname>Li</surname><given-names>C.</given-names></name><name><surname>Yao</surname><given-names>Z.</given-names></name></person-group><year>2022</year><article-title>Deep active learning for computer vision tasks: Methodologies, applications, and challenges</article-title><source>Applied Sciences</source> <volume>12</volume><issue>16</issue><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3390/app12168103">https://doi.org/10.3390/app12168103</ext-link></element-citation></ref>
<ref id="R32"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname><given-names>X.</given-names></name><name><surname>Zhang</surname><given-names>F.</given-names></name><name><surname>Niu</surname><given-names>Z.</given-names></name></person-group><year>2008</year><article-title>An ontology-based query system for digital libraries</article-title><source>IEEE Pacific-Asia Workshop on Computational Intelligence and Industrial Application</source><volume>1</volume><fpage>222</fpage><lpage>226</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/PACIIA.2008.360">https://doi.org/10.1109/PACIIA.2008.360</ext-link></element-citation></ref>
<ref id="R33"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Yao</surname><given-names>D.</given-names></name><name><surname>Gu</surname><given-names>Y.</given-names></name><name><surname>Cong</surname><given-names>G.</given-names></name><name><surname>Jin</surname><given-names>H.</given-names></name><name><surname>Lv</surname><given-names>X.</given-names></name></person-group><year>2022</year><article-title>Entity resolution with hierarchical graph attention networks</article-title><source>Proceedings of the 2022 International Conference on Management of Data</source><fpage>429</fpage><lpage>442</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1145/3514221.3517872">https://doi.org/10.1145/3514221.3517872</ext-link></element-citation></ref>
<ref id="R34"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Younesian</surname><given-names>T.</given-names></name><name><surname>Zhao</surname><given-names>Z.</given-names></name><name><surname>Ghiassi</surname><given-names>A.</given-names></name><name><surname>Birke</surname><given-names>R.</given-names></name><name><surname>Chen</surname><given-names>L.Y.</given-names></name></person-group><year>2021</year><article-title>QActor: Active learning on noisy labels</article-title><person-group person-group-type="editor"><name><surname>Balasubramanian</surname><given-names>V.N.</given-names></name></person-group><person-group person-group-type="editor"><name><surname>Tsang</surname><given-names>I.</given-names></name></person-group><source>Proceedings of The 13th Asian Conference on Machine Learning. Proceedings of Machine Learning Research</source><volume>157</volume><fpage>548</fpage><lpage>563</lpage><ext-link ext-link-type="uri" xlink:href="https://proceedings.mlr.press/v157/younesian21a.html">https://proceedings.mlr.press/v157/younesian21a.html</ext-link></element-citation></ref>
<ref id="R35"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Zhu</surname><given-names>Y.</given-names></name><name><surname>Liu</surname><given-names>H.</given-names></name><name><surname>Wu</surname><given-names>Z.</given-names></name><name><surname>Du</surname><given-names>Y.</given-names></name></person-group><year>2020</year><article-title>Relation-aware neighborhood matching model for entity alignment</article-title><ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2012.08128">https://arxiv.org/abs/2012.08128</ext-link></element-citation></ref>
</ref-list>
</back>
</article>