Learning from unknown-unknowns: inconsistency-driven sampling for improving LLM entity matching

Kota Okayama; Hiroyoshi Ito; Atsuyuki Morishima

doi:10.47989/ir31iConf64127

Authors

Kota Okayama, University of Tsukuba, https://orcid.org/0009-0001-2660-3251
Hiroyoshi Ito, University of Tsukuba, https://orcid.org/0000-0002-3265-7029
Atsuyuki Morishima, University of Tsukuba, https://orcid.org/0000-0003-4606-9065

DOI:

https://doi.org/10.47989/ir31iConf64127

Keywords:

Digital literacy, Academic library, Digital literacy education, Theoretical research

Abstract

Introduction. Large language models (LLMs) perform well in entity matching, but the ‘unknown-unknown’ problem, in which a model makes incorrect predictions with high confidence, remains a significant challenge. This research focuses on how the problem manifests as logical inconsistencies across multiple matching decisions, such as violations of transitivity (e.g., A=B and B=C, but A≠C).

Method. ‘Inconsistent triangles’, in which the model's decisions on three entities violate transitivity, were detected and scored according to their degree of contradiction. Pairs with higher inconsistency scores were prioritised for annotation, and the resulting labelled data were fed back to the model through fine-tuning or few-shot learning.
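The detection and prioritisation steps can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the score used here (the product of the model's confidences in the three conflicting decisions) is an assumption chosen for illustration. The input format and the names `inconsistent_triangles` and `rank_pairs` are likewise hypothetical.

```python
from itertools import combinations


def inconsistent_triangles(prob, threshold=0.5):
    """Yield (triangle, score) for each triple whose decisions violate transitivity.

    `prob` maps frozenset({a, b}) to a match probability in [0, 1]
    (a hypothetical input format, not taken from the paper).
    """
    entities = sorted({e for pair in prob for e in pair})
    for a, b, c in combinations(entities, 3):
        edges = [frozenset(p) for p in ((a, b), (b, c), (a, c))]
        if not all(e in prob for e in edges):
            continue  # only triples with all three pairs scored can be checked
        matches = [e for e in edges if prob[e] >= threshold]
        if len(matches) != 2:
            continue  # 0, 1, or 3 matches do not violate transitivity
        (non_match,) = [e for e in edges if prob[e] < threshold]
        # Assumed score: high when the model is confident in all three
        # mutually contradictory decisions.
        score = prob[matches[0]] * prob[matches[1]] * (1.0 - prob[non_match])
        yield (a, b, c), score


def rank_pairs(prob, threshold=0.5):
    """Rank pairs by the highest score of any inconsistent triangle they belong to."""
    pair_score = {}
    for (a, b, c), score in inconsistent_triangles(prob, threshold):
        for pair in (frozenset((a, b)), frozenset((b, c)), frozenset((a, c))):
            pair_score[pair] = max(pair_score.get(pair, 0.0), score)
    return sorted(pair_score.items(), key=lambda kv: kv[1], reverse=True)


# Toy predictions: A=B and B=C are predicted as matches, A=C as a non-match.
prob = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("A", "C")): 0.1,
    frozenset(("A", "D")): 0.2,
    frozenset(("B", "D")): 0.3,
    frozenset(("C", "D")): 0.1,
}
ranked = rank_pairs(prob)
```

In this toy input only the triangle A, B, C is inconsistent, so its three pairs are the ones ranked for annotation, each with a score of about 0.648. Pairs involving D appear in no inconsistent triangle and are not ranked.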

Analysis. The proposed method was evaluated on multiple datasets, including Japanese and English data. Its performance was compared against existing baseline methods, such as uncertainty sampling and random sampling, using the pairwise F1 score as the primary evaluation metric.
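For reference, pairwise F1 is commonly computed over the set of entity pairs predicted as matches against the set of gold matching pairs. The sketch below follows that common definition; the abstract does not spell out the exact computation, so treat the details as an assumption. The function name `pairwise_f1` is hypothetical.

```python
def pairwise_f1(predicted, gold):
    """Harmonic mean of precision and recall over unordered matching pairs."""
    # Treat (a, b) and (b, a) as the same pair.
    predicted = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in gold}
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers empty prediction or gold sets
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: two of three predicted pairs are correct, two gold pairs are missed.
predicted_pairs = [("A", "B"), ("B", "C"), ("A", "D")]
gold_pairs = [("B", "A"), ("B", "C"), ("C", "D"), ("A", "C")]
f1 = pairwise_f1(predicted_pairs, gold_pairs)
```

Here precision is 2/3 and recall is 2/4, giving an F1 of 4/7.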

Results. In the experiments, the proposed inconsistency-driven sampling strategy matched or outperformed the existing methods on all datasets.

Conclusion. By using inconsistency to actively select training data, our approach improves learning efficiency, achieving better entity matching performance under the same annotation budget.

