A benchmark for evaluating crisis information generation capabilities in LLMs
DOI: https://doi.org/10.47989/ir30iConf47518

Keywords: LLMs, Crisis informatics, LLMs evaluation, Information generation, Evaluation benchmark

Abstract
Introduction. Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study aims to develop a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information.
Method. CIEeval, an evaluation dataset, was constructed through steps including information extraction and prompt generation. It covers 26 types of crises across sub-domains such as water disasters and environmental pollution, comprising a total of 4.8k data entries.
Analysis. Eight LLMs applicable to the Chinese context were selected for evaluation based on multidimensional criteria. A combination of manual and machine scoring was used to ensure a comprehensive understanding of each model's performance.
Results. The manual and machine scores showed significant correlation. Under this scoring method, Claude 3.5 Sonnet performed the best, particularly excelling in complex scenarios such as natural and accident disasters. In contrast, Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0, while scoring slightly lower overall, showed strong performance in specific crisis types.
Conclusion. The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information management.
License
Copyright (c) 2025 Ruilian Han, Lu An, Wei Zhou, Gang Li

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.