Citation count prediction based on Google Scholar profiles and Clarivate’s journal citation reports

Authors

DOI:

https://doi.org/10.47989/ir31141005

Keywords:

Artificial intelligence, regression, scientometrics, citation count, Google Scholar, citation

Abstract

Introduction. Citation count prediction (CCP) models are vital for assessing research impact, yet existing approaches suffer from critical limitations. Prior studies often rely on restricted datasets (e.g., journal metrics alone) or fail to account for the multidimensional factors influencing citations, leading to suboptimal accuracy.

Method. We propose an accurate CCP regression model for the Computer Science and Electrical Engineering disciplines, built on twenty-three novel features extracted from public data in Google Scholar profiles and the annual Journal Citation Reports (JCR). The features are split into four datasets: the author information database (AI DB), the journal information database (JI DB), the paper information database (PI DB), and the combined author, paper, and journal information database (APJ DB).

Analysis. Our evaluation employed Mean Absolute Error (MAE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), and the Coefficient of Determination (R²) to assess model performance. Dimensionality reduction techniques like Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) were also applied, and their effect on CCP was assessed.
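As an illustration of the four evaluation metrics named above, the following minimal sketch computes them in plain Python. The function name `regression_metrics`, the choice to skip zero-valued targets when computing MAPE, and the sample citation counts are assumptions made for this example; they are not taken from the study.

```python
from math import fsum


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, MAPE (in percent), and R2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = fsum(abs(e) for e in errors) / n
    mse = fsum(e * e for e in errors) / n

    # MAPE is undefined for a true value of zero; skipping such items is an
    # assumption of this sketch, not a rule stated in the abstract.
    nonzero = [(t, e) for t, e in zip(y_true, errors) if t != 0]
    if nonzero:
        mape = 100.0 * fsum(abs(e / t) for t, e in nonzero) / len(nonzero)
    else:
        mape = float("nan")

    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_true = fsum(y_true) / n
    ss_res = fsum(e * e for e in errors)
    ss_tot = fsum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"MAE": mae, "MSE": mse, "MAPE": mape, "R2": r2}


# Made-up citation counts, for illustration only
actual = [10, 20, 40, 50]
predicted = [12, 18, 44, 46]
print(regression_metrics(actual, predicted))
```

Lower MAE, MSE, and MAPE indicate smaller prediction errors, while R² closer to 1 indicates that the model explains more of the variance in citation counts.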

Results. We identified that paper-level features (PI DB) were significantly more predictive than author or journal attributes, resolving a key debate in CCP research.

Conclusions. This study enhances CCP research by introducing scalable, publicly available features, demonstrating the superiority of paper-level attributes through empirical evidence, and identifying Nu-SVR as the most effective algorithm for accurate and interpretable citation prediction, supporting researchers, institutions, and policymakers in assessing research impact.
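To make the Nu-SVR approach concrete, here is a minimal sketch that fits scikit-learn's `NuSVR` to synthetic data. It is not the authors' pipeline: the random feature matrix, the hyperparameters (`nu=0.5`, `C=10.0`, RBF kernel), the standardization step, and the 80/20 split are all assumptions chosen for illustration. Only the feature count of twenty-three mirrors the abstract.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVR

# Synthetic stand-in for a feature table: 300 "papers" with 23 features each.
# Real author, journal, and paper features are NOT reproduced here.
rng = np.random.default_rng(0)
n_papers, n_features = 300, 23
X = rng.normal(size=(n_papers, n_features))
weights = rng.normal(size=n_features)
y = X @ weights + 0.1 * rng.normal(size=n_papers)  # hypothetical target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Support vector regressors are sensitive to feature scale, so the features
# are standardized before fitting; the hyperparameters are illustrative.
model = make_pipeline(StandardScaler(), NuSVR(nu=0.5, C=10.0, kernel="rbf"))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.3f}  R2={r2:.3f}")
```

In Nu-SVR, the `nu` parameter bounds the fraction of training errors from above and the fraction of support vectors from below, which replaces the epsilon tube width that must be set by hand in epsilon-SVR.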

Published

2026-01-15

How to Cite

Bahaghighat, M., Akbari, L., Ghasemi, M., & Xin, Q. (2026). Citation count prediction based on Google Scholar profiles and Clarivate’s journal citation reports. Information Research an International Electronic Journal, 31(1), 46–69. https://doi.org/10.47989/ir31141005

Section

Peer-reviewed papers
