Hybrid weak supervision model with manual and pseudo labels: a novel approach for scientific text mining

Authors

Deng, S., Xiang, R., Zhu, Q., & Chen, F.

DOI:

https://doi.org/10.47989/ir31iConf64192

Keywords:

Hybrid weak supervision, Entity recognition, Large language model, Scientific text mining

Abstract

Introduction. Although deep learning has advanced scientific text mining, the high cost of manual annotation remains a significant bottleneck. To address this, we propose a hybrid weak supervision model (HWSM) that balances manual annotation costs, computing resources, and model performance.

Method. HWSM integrates a manually annotated subset with LLM-generated pseudo-labels and trains a deep learning model on this hybrid corpus, leveraging both human expertise and large-scale automated labeling in entity recognition tasks.
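The corpus-mixing step described in the Method can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_hybrid_corpus`, the `pseudo_ratio` parameter, the BIO-tagged example format, and the choice to keep every manual example are all assumptions made for the sketch. The pseudo-labeled examples are assumed to have been produced by an LLM beforehand.

```python
import random
from typing import List, Tuple

# One training example: (tokens, BIO tags). This schema is an assumption.
Example = Tuple[List[str], List[str]]
# A corpus entry also records where the labels came from.
Tagged = Tuple[List[str], List[str], str]


def build_hybrid_corpus(
    manual: List[Example],
    pseudo: List[Example],
    pseudo_ratio: float,
    seed: int = 0,
) -> List[Tagged]:
    """Combine all manual examples with a random sample of pseudo-labeled
    examples so that pseudo-labeled data make up roughly `pseudo_ratio`
    of the resulting corpus (hypothetical definition of the ratio)."""
    if not 0.0 <= pseudo_ratio < 1.0:
        raise ValueError("pseudo_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_pseudo / (n_manual + n_pseudo) = pseudo_ratio for n_pseudo.
    wanted = round(pseudo_ratio * len(manual) / (1.0 - pseudo_ratio))
    # Never request more pseudo-labeled examples than are available.
    n_pseudo = min(wanted, len(pseudo))
    sampled = rng.sample(pseudo, n_pseudo)
    corpus: List[Tagged] = [(t, g, "manual") for t, g in manual]
    corpus += [(t, g, "pseudo") for t, g in sampled]
    rng.shuffle(corpus)  # interleave the two label sources before training
    return corpus


if __name__ == "__main__":
    manual = [(["We", "use", "BERT"], ["O", "O", "B-METHOD"])] * 2
    pseudo = [(["LDA", "was", "applied"], ["B-METHOD", "O", "O"])] * 10
    corpus = build_hybrid_corpus(manual, pseudo, pseudo_ratio=0.8)
    print(len(corpus), sum(1 for _, _, src in corpus if src == "pseudo"))
```

Keeping the label source alongside each example makes it possible to weight or filter pseudo-labels later, which is one common design choice in weak supervision; the abstract does not say whether HWSM does this.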

Analysis. The model was evaluated against supervised deep learning and few-shot LLM baselines on 5,000 library and information science (LIS) papers, measuring precision, recall, F1, and cost-performance trade-offs.
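For readers unfamiliar with the metrics named in the Analysis, the sketch below computes entity-level precision, recall, and F1 from BIO tag sequences. It assumes exact-match scoring (a predicted entity counts only if its boundaries and type both match the gold entity), which is a common convention but is not confirmed by the abstract. The entity types `METHOD` and `TOOL` and the helper names are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of entity spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An I- tag with no open span of the same type is ignored here;
        # other conventions treat it as the start of a new entity.
    return spans


def precision_recall_f1(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged, exact-match entity-level precision, recall, and F1."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["B-METHOD", "I-METHOD", "O", "B-TOOL"]]
    pred = [["B-METHOD", "I-METHOD", "O", "O"]]
    print(precision_recall_f1(gold, pred))
```

In this toy case the prediction finds one of the two gold entities and makes no false positives, so precision is 1.0, recall is 0.5, and F1 is 2/3.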

Results. HWSM achieves a competitive F1-score (within 0.01 of GPT-4.1) while significantly reducing labeling and inference costs. Optimal performance is observed when incorporating 80% pseudo-labeled data.

Conclusion. HWSM provides a robust, cost-effective solution for large-scale scientific entity recognition. It is particularly suited for scenarios with limited labeling budgets, offering a practical alternative to resource-intensive deep learning or expensive LLM-only approaches.

Published

2026-03-20

How to Cite

Deng, S., Xiang, R., Zhu, Q., & Chen, F. (2026). Hybrid weak supervision model with manual and pseudo labels: a novel approach for scientific text mining. Information Research an International Electronic Journal, 31(iConf), 992–1001. https://doi.org/10.47989/ir31iConf64192

Section

Conference proceedings
