Hybrid weak supervision model with manual and pseudo labels: a novel approach for scientific text mining
DOI:
https://doi.org/10.47989/ir31iConf64192
Keywords:
Hybrid weak supervision, Entity recognition, Large language model, Scientific text mining
Abstract
Introduction. Although deep learning has advanced scientific text mining, the high cost of manual annotation remains a significant bottleneck. To address this, we propose a hybrid weak supervision model (HWSM) that balances manual annotation costs, computing resources, and model performance.
Method. HWSM integrates a manually annotated subset with LLM-generated pseudo-labels and trains a deep learning model on this hybrid corpus, leveraging both human expertise and large-scale automated labeling in entity recognition tasks.
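As a rough illustration of the hybrid corpus described in the Method, the sketch below mixes a small manually annotated set with LLM pseudo-labeled sentences at a chosen ratio. It is not the authors' implementation: the function name `build_hybrid_corpus`, the BIO-tagged sentence format, and the reading of the ratio as "share of the final corpus that is pseudo-labeled" are all assumptions made for this example.

```python
import random
from typing import List, Tuple

# Assumed data format: a sentence is a list of (token, BIO tag) pairs.
Sentence = List[Tuple[str, str]]


def build_hybrid_corpus(
    manual: List[Sentence],
    pseudo: List[Sentence],
    pseudo_ratio: float,
    seed: int = 0,
) -> List[Sentence]:
    """Keep every manual sentence and add sampled pseudo-labeled sentences
    so that they make up roughly `pseudo_ratio` of the returned corpus."""
    if not 0.0 <= pseudo_ratio < 1.0:
        raise ValueError("pseudo_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_pseudo / (n_manual + n_pseudo) = pseudo_ratio for n_pseudo.
    n_pseudo = round(len(manual) * pseudo_ratio / (1.0 - pseudo_ratio))
    # Never request more pseudo-labeled sentences than are available.
    n_pseudo = min(n_pseudo, len(pseudo))
    corpus = manual + rng.sample(pseudo, n_pseudo)
    rng.shuffle(corpus)  # interleave the two label sources before training
    return corpus


# Hypothetical toy data: 20 manual and 200 pseudo-labeled sentences.
manual_set = [[("survey", "B-METHOD")] for _ in range(20)]
pseudo_set = [[("interview", "B-METHOD")] for _ in range(200)]
hybrid = build_hybrid_corpus(manual_set, pseudo_set, pseudo_ratio=0.8)
```

The resulting list could then be passed to any sequence-labeling trainer; the choice of downstream model is left open here.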
Analysis. The model was evaluated against supervised deep learning and few-shot LLM baselines on 5,000 LIS papers, measuring precision, recall, F1, and cost-performance trade-offs.
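For readers unfamiliar with how precision, recall, and F1 are usually computed for entity recognition, the following is a minimal sketch of exact-match, entity-level scoring over BIO tags. The abstract does not state the matching scheme used in the paper, so exact span matching, the BIO format, and the entity type names `METHOD` and `DATASET` are assumptions for illustration only.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one sentence's BIO tags into a set of entity spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_prf(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 with exact span matching."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction finds one of two gold entities.
gold_tags = [["B-METHOD", "I-METHOD", "O", "B-DATASET"]]
pred_tags = [["B-METHOD", "I-METHOD", "O", "O"]]
precision, recall, f1 = entity_prf(gold_tags, pred_tags)
```

In this toy case the single predicted entity is correct, so precision is 1.0 while recall is 0.5, because one of the two gold entities is missed.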
Results. HWSM achieves a competitive F1-score (within 0.01 of GPT-4.1) while significantly reducing labeling and inference costs. Optimal performance is observed when incorporating 80% pseudo-labeled data.
Conclusion. HWSM provides a robust, cost-effective solution for large-scale scientific entity recognition. It is particularly suited for scenarios with limited labeling budgets, offering a practical alternative to resource-intensive deep learning or expensive LLM-only approaches.
License
Copyright (c) 2026 Shengli Deng, Rongrong Xiang, Qiuyu Zhu, Fang Chen

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
