A benchmark for evaluating crisis information generation capabilities in LLMs
DOI: https://doi.org/10.47989/ir30iConf47518

Keywords: LLMs, Crisis informatics, LLMs evaluation, Information generation, Evaluation benchmark

Abstract
Introduction. Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study aims to develop a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information.
Method. CIEeval, an evaluation dataset, was constructed through steps including information extraction and prompt generation. It covers 26 types of crises across sub-domains such as water disasters and environmental pollution, comprising a total of 4.8k data entries.
Analysis. Eight LLMs applicable to the Chinese context were selected for evaluation based on multidimensional criteria. A combination of manual and machine scoring was used to ensure a comprehensive understanding of each model's performance.
Results. The manual and machine scores showed significant correlation. Under this scoring method, Claude 3.5 Sonnet performed the best, particularly excelling in complex scenarios such as natural and accident disasters. In contrast, Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0, while scoring slightly lower overall, showed strong performance in specific crisis types.
Conclusion. The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information management.
License
Copyright (c) 2025 Ruilian Han, Lu An, Wei Zhou, Gang Li

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.