A benchmark for evaluating crisis information generation capabilities in LLMs

Authors

  • Ruilian Han Center for Studies of Information Resources, Wuhan University, China; School of Information Management, Wuhan University, China
  • Lu An Center for Studies of Information Resources, Wuhan University, China; School of Information Management, Wuhan University, China
  • Wei Zhou School of Information Management, Wuhan University, China
  • Gang Li Center for Studies of Information Resources, Wuhan University, China; School of Information Management, Wuhan University, China

DOI:

https://doi.org/10.47989/ir30iConf47518

Keywords:

LLMs, Crisis informatics, LLMs evaluation, Information generation, Evaluation benchmark

Abstract

Introduction. Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study aims to develop a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information.

Method. CIEeval, an evaluation dataset, was constructed through steps such as information extraction and prompt generation. CIEeval covers 26 types of crises across sub-domains including water disasters, environmental pollution, and others, comprising a total of 4.8k data entries.

Analysis. Eight LLMs applicable to the Chinese context were selected for evaluation based on multidimensional criteria. A combination of manual and machine scoring methods was utilized. This approach ensured a comprehensive understanding of each model's performance.

Results. The manual and machine scores showed a significant correlation. Under this combined scoring scheme, Claude 3.5 Sonnet performed best, particularly excelling in complex scenarios such as natural and accident disasters. In contrast, while scoring slightly lower overall, Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0 showed strong performance in specific crisis types.

Conclusion. The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information management.

Published

2025-03-11

How to Cite

Han, R., An, L., Zhou, W., & Li, G. (2025). A benchmark for evaluating crisis information generation capabilities in LLMs. Information Research: An International Electronic Journal, 30(iConf), 240–248. https://doi.org/10.47989/ir30iConf47518

Section

Peer-reviewed papers
