<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" article-type="research-article" xml:lang="en">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">IR</journal-id>
<journal-title-group>
<journal-title>Information Research</journal-title>
</journal-title-group>
<issn pub-type="epub">1368-1613</issn>
<publisher>
<publisher-name>University of Bor&#x00E5;s</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">ir30iConf47518</article-id>
<article-id pub-id-type="doi">10.47989/ir30iConf47518</article-id>
<article-categories>
<subj-group xml:lang="en">
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A benchmark for evaluating crisis information generation capabilities in LLMs</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Han</surname><given-names>Ruilian</given-names></name>
<xref ref-type="aff" rid="aff0001"/></contrib>
<contrib contrib-type="author"><name><surname>An</surname><given-names>Lu</given-names></name>
<xref ref-type="aff" rid="aff0002"/></contrib>
<contrib contrib-type="author"><name><surname>Zhou</surname><given-names>Wei</given-names></name>
<xref ref-type="aff" rid="aff0003"/></contrib>
<contrib contrib-type="author"><name><surname>Li</surname><given-names>Gang</given-names></name>
<xref ref-type="aff" rid="aff0004"/></contrib>
<aff id="aff0001"><bold>Ruilian Han</bold> is a PhD student at School of Information Management, Wuhan University, China. Her research focuses on social media data analysis. She can be contacted at rlhan_1127@163.com</aff>
<aff id="aff0002"><bold>Lu An</bold> is a professor at School of Information Management, Wuhan University, China. Her research focuses on crisis informatics. She can be contacted at anlu97@163.com</aff>
<aff id="aff0003"><bold>Wei Zhou</bold> is a PhD student at School of Information Management, Wuhan University, China. Her research focuses on risk identification. She can be contacted at 664880781@qq.com</aff>
<aff id="aff0004"><bold>Gang Li</bold> is a professor at School of Information Management, Wuhan University, China. His research focuses on information resource management. He can be contacted at ligang@whu.edu.cn</aff>
</contrib-group>
<pub-date pub-type="epub"><day>06</day><month>05</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>30</volume>
<issue>i</issue>
<fpage>240</fpage>
<lpage>248</lpage>
<permissions>
<copyright-year>2025</copyright-year>
<copyright-holder>&#x00A9; 2025 The Author(s).</copyright-holder>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by-nc/4.0/">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by-nc/4.0/">http://creativecommons.org/licenses/by-nc/4.0/</ext-link>), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<abstract xml:lang="en">
<title>Abstract</title>
<p><bold>Introduction.</bold> Large language models (LLMs) have become increasingly significant in crisis information management due to their advanced natural language processing capabilities. This study aims to develop a comprehensive evaluation benchmark to assess the effectiveness of LLMs in generating crisis information.</p>
<p><bold>Method.</bold> CIEval, an evaluation dataset, was constructed through steps such as information extraction and prompt generation. CIEval covers 26 types of crises across sub-domains including water disasters, environmental pollution, and others, comprising a total of 4.8k data entries.</p>
<p><bold>Analysis.</bold> Eight LLMs applicable to the Chinese context were selected for evaluation based on multidimensional criteria. A combination of manual and machine scoring methods was utilized. This approach ensured a comprehensive understanding of each model&#x2019;s performance.</p>
<p><bold>Results.</bold> The manual and machine scores showed a significant correlation. Under this scoring method, Claude 3.5 Sonnet performed the best, particularly excelling in complex scenarios such as natural and accident disasters. In contrast, while scoring slightly lower overall, Chinese models such as ERNIE 4.0 Turbo and iFlytek Spark V4.0 showed strong performance in specific crisis scenarios.</p>
<p><bold>Conclusion.</bold> The evaluation benchmark identifies the best-performing LLM for crisis information generation (Claude 3.5 Sonnet) and provides valuable insights for optimizing and applying LLMs in crisis information generation.</p>
</abstract>
</article-meta>
</front>
<body>
<sec id="sec1">
<title>Introduction</title>
<p>In the context of rapid technological development, large language models (LLMs) have become a key focus in natural language processing due to their application of deep learning and Transformer architecture. LLMs efficiently encode and decode language, demonstrating significant advancements in machine translation, text generation, and dialogue systems (<xref rid="R5" ref-type="bibr">Guler et al., 2024</xref>).</p>
<p>Crisis information (CI) is increasingly important in crisis response. Traditional manual processes are inefficient and struggle to handle large-scale and frequent events. LLMs, with their powerful language generation capabilities, can automatically produce high-quality crisis information, improving efficiency and accuracy. Additionally, LLMs handle multilingual and multimodal information, facilitating cross-regional and cross-domain information sharing. However, the information generated by different LLMs varies greatly in quality and may exhibit over-interpretation or thematic bias, both of which can affect the effectiveness of decision-making.</p>
<p>This study aims to establish benchmarks to comprehensively evaluate LLMs&#x2019; performance in CI generation across scenarios like natural disasters, accidents, and public health incidents. By assessing objectivity, completeness, and reasonableness, this research provides insights into LLMs&#x2019; strengths and limitations, offering guidance for model optimization and practical applications in CI.</p>
</sec>
<sec id="sec2">
<title>Literature review</title>
<p>In recent years, several specialized benchmark datasets have been developed to assess the performance of LLMs, such as GLUE (<xref rid="R11" ref-type="bibr">Wang et al., 2019</xref>), SuperGLUE (<xref rid="R8" ref-type="bibr">Sarlin et al., 2020</xref>), and SoEval (<xref rid="R6" ref-type="bibr">Liu et al., 2024</xref>). These benchmarks cover tasks like reading comprehension, sentiment analysis, and structured output, becoming essential tools for evaluating LLMs. More complex benchmarks like BIG-Bench Hard have been introduced recently, facilitating the evaluation of LLMs&#x2019; cross-task and cross-domain generalization capabilities (<xref rid="R9" ref-type="bibr">Suzgun et al., 2022</xref>). Evaluation benchmarks in the Chinese context are also emerging, such as CLUE (<xref rid="R13" ref-type="bibr">Xu et al., 2020</xref>), which has become a widely used evaluation tool in various industries in China.</p>
<p>In the information science field, LLMs are widely applied in automating information analysis (<xref rid="R4" ref-type="bibr">Giannakopoulos et al., 2023</xref>), processing, and public opinion monitoring. As LLMs are increasingly used, researchers have established evaluation frameworks for information compilation and report generation. For example, Thelwall (<xref rid="R10" ref-type="bibr">2024</xref>) assessed LLMs&#x2019; effectiveness in scientific information evaluation, and explored ways to enhance their capabilities through prompt engineering and external tools.</p>
<p>For evaluation methods, while accuracy remains an important measure of LLMs&#x2019; objectivity, automated metrics like BLEU (<xref rid="R2" ref-type="bibr">Evtikhiev et al., 2023</xref>) and Bipol (<xref rid="R1" ref-type="bibr">Alkhaled et al., 2023</xref>) have limitations in addressing the feasibility of generated content for open-ended questions. Therefore, manual evaluation remains essential, especially in text generation and translation tasks (<xref rid="R14" ref-type="bibr">Xu et al., 2023</xref>).</p>
<p>Currently, there is a lack of specialized LLM benchmarks for crisis information generation. Crisis information, as a core element for responding to and solving crises, requires the support of the latest technologies. Therefore, this paper proposes constructing CIEval, an LLM evaluation benchmark for CI generation, combining manual and machine evaluations to identify the best LLMs and promote their application in this field.</p>
</sec>
<sec id="sec3">
<title>Generation of CIEval dataset</title>
<sec id="sec3_1">
<title>Subjects selection</title>
<p>The crisis mechanism aims to efficiently respond to sudden crises and ensure social stability and public safety. According to the definition of <italic>the Emergency Response Law</italic> (General Office of the State Council, 2024), crises include natural disasters, accident disasters, public health events, and social security events. To comprehensively and accurately evaluate the ability of LLMs in CI generation, the CIEval dataset we constructed covers 16 segmented subjects under the four major categories of crises mentioned above, ensuring its representativeness and wide applicability. In addition, to further enhance the completeness of the dataset, we also focus on a specific type of social security event, cyber security, and include 10 event types with a high frequency of occurrence in this field as supplementary categories. The specific event classification is shown in Table 1.</p>
<table-wrap id="T1">
<label>Table 1.</label>
<caption><p>Subjects of CIEval</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Categories</bold></th>
<th align="center" valign="top"><bold>Subjects</bold></th>
<th align="center" valign="top"><bold>Categories</bold></th>
<th align="center" valign="top" colspan="2"><bold>Subjects</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top" rowspan="5">Natural disasters</td>
<td align="center" valign="top">Drought and water disasters</td>
<td align="center" valign="top" rowspan="12">Social security events</td>
<td align="center" valign="top" colspan="2">Economic security incidents</td>
</tr>
<tr>
<td align="center" valign="top">Meteorological disasters</td>
<td align="center" valign="top" colspan="2">Foreign emergencies</td>
</tr>
<tr>
<td align="center" valign="top">Earthquake</td>
<td align="center" valign="top" rowspan="10">Cyber security events</td>
<td align="center" valign="top">Botnet</td>
</tr>
<tr>
<td align="center" valign="top">Geologic disasters</td>
<td align="center" valign="top">Data leakage</td>
</tr>
<tr>
<td align="center" valign="top">Forest and grassland fires</td>
<td align="center" valign="top">Phishing emails</td>
</tr>
<tr>
<td align="center" valign="top" rowspan="5">Accident disasters</td>
<td align="center" valign="top">Enterprise safety accidents</td>
<td align="center" valign="top">Vulnerability exploitation</td>
</tr>
<tr>
<td align="center" valign="top">Transportation accidents</td>
<td align="center" valign="top">DDoS</td>
</tr>
<tr>
<td align="center" valign="top">Public facility accidents</td>
<td align="center" valign="top">APT</td>
</tr>
<tr>
<td align="center" valign="top">Environmental pollution</td>
<td align="center" valign="top">Tampering</td>
</tr>
<tr>
<td align="center" valign="top">Ecological damage incident</td>
<td align="center" valign="top">Worms</td>
</tr>
<tr>
<td align="center" valign="top" rowspan="4">Public health events</td>
<td align="center" valign="top">Infectious disease outbreaks</td>
<td align="center" valign="top">Mining</td>
</tr>
<tr>
<td align="center" valign="top">Congregative unknown disease</td>
<td align="center" valign="top">Ransomware</td>
</tr>
<tr>
<td align="center" valign="top">Food safety</td>
<td align="center" valign="top"></td>
<td align="center" valign="top"></td>
<td align="center" valign="top"></td>
</tr>
<tr>
<td align="center" valign="top">Animal outbreaks</td>
<td align="center" valign="top"></td>
<td align="center" valign="top"></td>
<td align="center" valign="top"></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="sec3_2">
<title>Data collection and pre-processing</title>
<p>The primary sources of the raw data are post-disaster recovery plans and accident investigation reports publicly released by emergency management departments. Data related to cyber security events comes from typical case collections issued by cyber security companies like QAX (https://en.qianxin.com/).</p>
<p>For the raw data, a refined content extraction strategy was implemented to remove non-event-related elements such as headers and hyperlinks, retaining and integrating only the text and image content crucial for event analysis, which is stored in structured documents. In cases where a document covered multiple events, detailed manual intervention was employed to carefully split the content. This process ensures that each independent event is accurately mapped to a unique corresponding document, facilitating subsequent dataset construction. Each final document includes an overview, losses, policy measures, and other information about the event.</p>
</sec>
<sec id="sec3_3">
<title>Dataset generation method</title>
<sec id="sec3_3_1">
<title>Dataset generation</title>
<p>Advanced models like GPT-4.0 play a crucial role in dataset construction (<xref rid="R6" ref-type="bibr">Liu et al., 2024</xref>). Based on this, we utilized the generative capabilities of LLMs to build an evaluation dataset, significantly shortening the construction period and reducing manual workload, providing abundant test resources for assessing LLM performance in specific crisis information needs.</p>
<p>Prompt engineering is a core element of this process, focused on designing precise prompts that guide LLMs to produce the expected outputs. Prompts (<italic>P</italic>) typically consist of instructions (<italic>I</italic>) and inputs (<italic>In</italic>); the instructions specify the task goals, setting a framework for model responses, while the inputs provide specific context or examples. <italic>I</italic> and <italic>In</italic> jointly influence the quality of the model&#x2019;s output, as represented by Eq. (1).</p>
<disp-formula><label>(1)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="block"><mml:mrow><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mfenced><mml:mrow><mml:mi>I</mml:mi><mml:mo>,</mml:mo><mml:mi>I</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:mfenced></mml:mrow></mml:math></disp-formula>
<p>Here, <italic>I</italic> and <italic>In</italic> are combined through function <italic>f</italic> to form the LLM prompt (<italic>P</italic>). In this paper, <italic>f</italic> is the internal mechanism of LLMs, implemented by GPT-4.0. The method of using LLMs to generate datasets is illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
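<p>As an illustration of Eq. (1), the sketch below combines an instruction <italic>I</italic> and event information <italic>In</italic> with a simple template standing in for <italic>f</italic>; note that in this study <italic>f</italic> is the internal mechanism of GPT-4.0 itself, not a fixed template.</p>

```python
def build_prompt(instruction: str, event_info: dict) -> str:
    """Combine instruction I and input In into a prompt P (Eq. 1).

    A plain string template stands in for f here; in the paper, f is
    realized by GPT-4.0 rather than by a fixed template.
    """
    context = "; ".join(f"{k}: {v}" for k, v in event_info.items())
    return (
        f"Event overview: '{context}' "
        f"Based on the above description, please answer the following: {instruction}"
    )

prompt = build_prompt(
    "How to coordinate multi-departmental cooperation to ensure "
    "the basic living security of residents in disaster areas?",
    {"Time": "June 22 to July 4, 2024", "Location": "Jiangxi Province"},
)
```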
<fig id="F1">
<label>Figure 1.</label>
<caption><p>Overview of CIEval generation methods</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c20-fig1.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>Specifically, the dataset construction process we adopted follows a systematic pipeline approach, detailed as follows:</p>
<list list-type="order">
<list-item><p><italic>Event information extraction</italic>: pre-processed documents are input into GPT-4.0, which, using its powerful analytical capabilities, automatically extracts key event information from the documents. This information includes, but is not limited to, occurrence time, disaster type, and economic losses, i.e., <italic>In</italic>.</p></list-item>
<list-item><p><italic>Targeted question generation</italic>: based on the response measures, policy protection suggestions, and other information in the input document, GPT-4.0 randomly generates a series of crisis-related questions, i.e., <italic>I</italic>.</p></list-item>
<list-item><p><italic>Prompt construction</italic>: in this step, GPT-4.0 is used to combine <italic>In</italic> and <italic>I</italic> to form a complete <italic>P</italic> dataset for assessing the CI generation capability of LLMs.</p></list-item>
<list-item><p><italic>Manual review</italic>: the automatically generated prompts undergo rigorous manual review to ensure clarity, fluency, and high relevance to the crisis information tasks. This process further minimizes the potential influence of any biases introduced by GPT-4.0, thereby enhancing the dataset quality and ensuring the accuracy and validity of the evaluation results.</p></list-item>
</list>
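<p>The four steps above can be sketched as a minimal pipeline; <monospace>ask_gpt4</monospace> is a hypothetical stub standing in for calls to GPT-4.0, and the manual-review step is represented only by a flag on each entry.</p>

```python
# Hypothetical sketch of the four-step CIEval construction pipeline.
# `ask_gpt4` is a stub in place of a real GPT-4.0 API call, so the
# control flow can be illustrated without an API key.
def ask_gpt4(system: str, text: str) -> str:
    return f"[GPT-4.0 output for: {system}]"

def build_entry(document: str) -> dict:
    # Step 1: extract key event information (In)
    info = ask_gpt4("Extract key event information (time, type, losses).", document)
    # Step 2: generate a targeted crisis-related question (I)
    question = ask_gpt4("Generate a crisis-related question from the measures.", document)
    # Step 3: combine In and I into a complete prompt (P)
    prompt = ask_gpt4("Combine the event information and question into one prompt.",
                      info + "\n" + question)
    # Step 4: flag for manual review before inclusion in the dataset
    return {"info": info, "question": question, "prompt": prompt, "reviewed": False}

entry = build_entry("Post-disaster recovery plan text ...")
```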
</sec>
<sec id="sec3_4">
<title>Avoiding data contamination</title>
<p>In constructing the dataset, this study focuses on avoiding data contamination to ensure quality. Given that large-scale crises, especially those that generate widespread social response online, often become publicly accessible data resources through subsequent investigation reports, there is a potential risk of these being included in LLM training datasets. Considering that the latest training data of the models to be evaluated extends to April 2024, this paper uses data from April 2024 onwards as the original dataset for CIEval, in order to reduce data contamination.</p>
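<p>The cutoff rule can be expressed as a simple date filter; the <monospace>CUTOFF</monospace> constant and the sample dates below are illustrative.</p>

```python
from datetime import date

# Latest training-data date of the evaluated models (per the study)
CUTOFF = date(2024, 4, 1)

def is_uncontaminated(event_date: date) -> bool:
    """Keep only events from April 2024 onward to reduce data contamination."""
    return event_date >= CUTOFF

# Illustrative candidate events: one predates the cutoff and is dropped
events = [date(2024, 3, 15), date(2024, 4, 26), date(2024, 6, 22)]
kept = [d for d in events if is_uncontaminated(d)]
```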
</sec>
<sec id="sec3_5">
<title>Dataset description</title>
<p>CIEval is a comprehensive dataset designed to thoroughly evaluate the performance of LLMs in CI generation. The dataset includes four major categories of authoritative-defined crisis types and additionally incorporates cyber security events, which have a broad impact in modern society, to fully reflect realistic and diverse CI needs. It covers a wide range of event types, including but not limited to meteorological disasters, earthquakes, transportation accidents, food safety, and data leakage, providing a thorough test of the models&#x2019; performance in handling complex and variable CI tasks. Table 2 shows some of the contents of this dataset. CIEval contains a total of 4,820 evaluation questions, offering a comprehensive benchmark framework to assess the overall capabilities of LLMs in CI generation tasks, ultimately enhancing their practicality and effectiveness in real-world crisis scenarios.</p>
<table-wrap id="T2">
<label>Table 2.</label>
<caption><p>Partial content of CIEval dataset</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Categories</bold></th>
<th align="center" valign="top"><bold>Questions</bold></th>
<th align="center" valign="top"><bold>Examples</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top">Natural disasters</td>
<td align="center" valign="top">1500</td>
<td align="center" valign="top"><bold>Event overview:</bold> &#x2018;Time: June 22 to July 4, 2024; Location: Jiangxi Province; Affected Areas: 65 counties (cities, districts) in 9 prefecture level cities including Nanchang and Jiujiang; Number of Affected Persons: 1.565 million people...&#x2019; <bold>Based on the above description, please answer the following:</bold> How to coordinate multi departmental cooperation to ensure the basic living security of residents in disaster areas?</td>
</tr>
<tr>
<td align="center" valign="top">Accident disasters</td>
<td align="center" valign="top">2000</td>
<td align="center" valign="top"><bold>Event overview</bold>: &#x2018;Time: April 26, 2024; Location: Xia County Expressway in Yuncheng, Shanxi Province; Event Type: Vehicle Fire Caused by Traffic Accident...&#x2019; <bold>Based on the above description, please answer the following:</bold> How to handle the automatic locking system of damaged vehicles during crisis rescue?</td>
</tr>
<tr>
<td align="center" valign="top">Public health events</td>
<td align="center" valign="top">550</td>
<td align="center" valign="top"><bold>Event overview:</bold> &#x2018;5.27 Xinzheng Elementary School Canteen Food Mold Incident&#x2019; <bold>Based on the above description, please answer the following:</bold> How to effectively establish and manage a parental supervision mechanism in the school environment to ensure food safety?</td>
</tr>
<tr>
<td align="center" valign="top">Social security events</td>
<td align="center" valign="top">770</td>
<td align="center" valign="top"><bold>Event overview:</bold> &#x2018;8.19 Philippine Coast Guard Ship Collides with Chinese Ship Incident&#x2019; <bold>Based on the above description, please answer the following:</bold> How to quickly take measures to prevent the situation from escalating when a similar collision event occurs?</td>
</tr>
<tr>
<td align="center" valign="top">Cyber security events</td>
<td align="center" valign="top">310</td>
<td align="center" valign="top"><bold>Event overview:</bold> &#x2018;The terminal computer was infected with ransomware through phishing emails&#x2019; <bold>Based on the above description, please answer the following:</bold> What role does terminal security control software play in preventing ransomware?</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="sec4">
<title>Benchmarking experiment</title>
<sec id="sec4_1">
<title>Models</title>
<p>To comprehensively understand the applicability and CI generation capabilities of LLMs in the CI domain within the Chinese context, this study selected four of the latest models developed by Chinese companies and four internationally renowned models suitable for the Chinese context. These models vary in scale and structure, performing differently across various datasets and tasks. Specific information about the models is shown in Table 3.</p>
<table-wrap id="T3">
<label>Table 3.</label>
<caption><p>Description of models</p></caption>
<table>
<thead>
<tr>
<th align="center" valign="top"><bold>Models</bold></th>
<th align="center" valign="top"><bold>Developer</bold></th>
<th align="center" valign="top"><bold>Size</bold></th>
<th align="center" valign="top"><bold>Access</bold></th>
<th align="center" valign="top"><bold>Source</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" valign="top">GPT-4o</td>
<td align="center" valign="top">OpenAI</td>
<td align="center" valign="top">Undisclosed</td>
<td align="center" valign="top">API</td>
<td align="center" valign="top">https://chat.openai.com/</td>
</tr>
<tr>
<td align="center" valign="top">Claude 3.5 Sonnet</td>
<td align="center" valign="top">Anthropic</td>
<td align="center" valign="top">Undisclosed</td>
<td align="center" valign="top">API</td>
<td align="center" valign="top">https://claude.ai</td>
</tr>
<tr>
<td align="center" valign="top">Llama-3.1-405B</td>
<td align="center" valign="top">Meta</td>
<td align="center" valign="top">405B</td>
<td align="center" valign="top">Weights</td>
<td align="center" valign="top">https://github.com/meta-llama/llama3</td>
</tr>
<tr>
<td align="center" valign="top">Gemini-1.5-Pro</td>
<td align="center" valign="top">Google</td>
<td align="center" valign="top">Undisclosed</td>
<td align="center" valign="top">API</td>
<td align="center" valign="top">https://deepmind.google/technologies/gemini/</td>
</tr>
<tr>
<td align="center" valign="top">ERNIE 4.0 Turbo</td>
<td align="center" valign="top">Baidu</td>
<td align="center" valign="top">Undisclosed</td>
<td align="center" valign="top">Official website</td>
<td align="center" valign="top">https://yiyan.baidu.com</td>
</tr>
<tr>
<td align="center" valign="top">GLM-130B</td>
<td align="center" valign="top">Tsinghua</td>
<td align="center" valign="top">130B</td>
<td align="center" valign="top">Weights</td>
<td align="center" valign="top">https://github.com/THUDM/GLM-130B</td>
</tr>
<tr>
<td align="center" valign="top">iFlytek Spark V4.0</td>
<td align="center" valign="top">iFlytek</td>
<td align="center" valign="top">Undisclosed</td>
<td align="center" valign="top">Official website</td>
<td align="center" valign="top">https://xinghuo.xfyun.cn</td>
</tr>
<tr>
<td align="center" valign="top">Qwen-72B</td>
<td align="center" valign="top">Alibaba</td>
<td align="center" valign="top">72B</td>
<td align="center" valign="top">Official website</td>
<td align="center" valign="top">https://tongyi.aliyun.com/qianwen</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>GPT-4o is OpenAI&#x2019;s latest and fastest flagship model, a version of GPT-4.0 optimized for tasks such as real-time inference and generating audio, images, and text. In addition, according to Anthropic&#x2019;s release report, Claude 3.5 Sonnet surpassed GPT-4o in multiple areas, including graduate-level reasoning, undergraduate-level knowledge, and coding ability.</p>
<p>ERNIE 4.0 Turbo, GLM-130B, iFlytek Spark V4.0 and Qwen-72B are currently outstanding Chinese LLMs. iFlytek Spark V4.0 was fully benchmarked against GPT-4-Turbo during development and, according to testing, has surpassed GPT-4-Turbo in text generation, logical reasoning, and other aspects. All non-open-source models evaluated are their premium versions.</p>
</sec>
<sec id="sec4_2">
<title>Evaluation methods</title>
<p>This study uses a combined evaluation method of manual scoring and machine scoring to enhance the efficiency of the evaluation process and ensure the comprehensiveness and accuracy of the results.</p>
<sec id="sec4_2_1">
<title>Manual scoring</title>
<p>Based on <xref rid="R12" ref-type="bibr">Wang et al. (2024)</xref>, we established evaluation indicators for the CI generation capabilities of LLMs, covering content quality, expression quality, feasibility, and effectiveness, starting from the quality of the CI itself. First, to obtain professional evaluation opinions, this study invited three PhD students with relevant experience in the CI field as evaluators. The evaluators rated the CI generated by each model on these indicators using a five-level scale [extremely high, high, medium, low, extremely low]. Second, the weights of the indicators were determined from the evaluators&#x2019; experience using the fuzzy decision-making trial and evaluation laboratory (DEMATEL) method. Finally, the interactive multi-criteria decision-making method based on triangular intuitionistic fuzzy numbers was used to obtain the global dominance of each evaluated model, which constitutes its CI generation capability score (<xref rid="R7" ref-type="bibr">Qin et al., 2017</xref>).</p>
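<p>As a simplified stand-in for the fuzzy multi-criteria aggregation, the sketch below maps the five ordinal levels to [0,1] and combines them with indicator weights; the weights shown are illustrative, not the study&#x2019;s fuzzy DEMATEL-derived values, and the linear form only approximates the triangular intuitionistic fuzzy method.</p>

```python
# Map the five ordinal rating levels to the [0, 1] interval
LEVELS = {"extremely high": 1.0, "high": 0.75, "medium": 0.5,
          "low": 0.25, "extremely low": 0.0}

# Hypothetical indicator weights (the paper derives these via fuzzy DEMATEL)
WEIGHTS = {"content quality": 0.3, "expression quality": 0.2,
           "feasibility": 0.25, "effectiveness": 0.25}

def manual_score(ratings: dict) -> float:
    """Weighted aggregation of one evaluator's ordinal ratings."""
    return sum(WEIGHTS[k] * LEVELS[v] for k, v in ratings.items())

score = manual_score({"content quality": "high",
                      "expression quality": "medium",
                      "feasibility": "high",
                      "effectiveness": "extremely high"})
```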
</sec>
<sec id="sec4_2_2">
<title>Machine scoring</title>
<p>Given the large scale of CIEval, relying solely on manual scoring would require a significant amount of manpower and resources. To this end, this study further explores the possibility of machine scoring. We select GPT-4.0, which performs well in multiple benchmarks (<xref rid="R14" ref-type="bibr">Xu et al., 2023</xref>), and Claude 3.5 Sonnet, which the developer claims to be superior to GPT-4o in all aspects, as <italic>&#x2018;machine scoring experts.&#x2019;</italic></p>
<p>Similar to the final results obtained through manual scoring, the machine scoring range is set to [0,1]. To ensure the rationality of machine scoring, we conducted validation experiments before the formal evaluation, comparing the consistency of GPT-4.0 and Claude 3.5 Sonnet scores with manual scores to verify their reliability as machine scoring experts. The scores of the better-performing model then replace manual scoring, improving evaluation efficiency and reducing labour costs.</p>
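<p>A minimal sketch of the machine-scoring step, assuming a hypothetical <monospace>call_model</monospace> callable standing in for a GPT-4.0 or Claude 3.5 Sonnet API and an illustrative rubric prompt:</p>

```python
def judge_response(question: str, answer: str, call_model) -> float:
    """Ask a 'machine scoring expert' to rate an answer on [0, 1].

    `call_model` is a placeholder for an API call to GPT-4.0 or
    Claude 3.5 Sonnet; the rubric wording below is illustrative only.
    """
    rubric = (
        "Rate the following crisis-information answer on content quality, "
        "expression quality, feasibility, and effectiveness. "
        "Reply with a single number between 0 and 1.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    raw = call_model(rubric)
    score = float(raw.strip())
    return min(max(score, 0.0), 1.0)  # clamp to the [0, 1] scoring range

# Stubbed model reply, so the flow runs without an API key
score = judge_response("How to prevent ransomware?",
                       "Deploy terminal security control software.",
                       lambda prompt: "0.85")
```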
</sec>
</sec>
</sec>
<sec id="sec5">
<title>Results and discussion</title>
<sec id="sec5_1">
<title>Comparison between manual scoring and machine scoring</title>
<p>This study randomly selected 80 entries from the CIEval dataset as samples for manual scoring and machine scoring, and conducted correlation tests between the manual scores and the Claude 3.5 Sonnet/GPT-4.0 scores, respectively. Kendall&#x2019;s test results showed that the correlation coefficient between the GPT-4.0 scores and the manual scores was 0.397, which was highly significant (p&#x003C;0.01). The Claude 3.5 Sonnet scores were significantly correlated with the manual scores at the p&#x003C;0.05 level. To further validate the rationality of utilizing GPT-4.0 for scoring, a Spearman test was conducted. The results indicated a correlation coefficient of 0.514 between the GPT-4.0 scores and the manual scores, further supporting the feasibility and reliability of replacing manual scoring with GPT-4.0. Therefore, this study selects GPT-4.0 as the main tool for evaluating the CI generation capabilities of LLMs.</p>
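<p>The Spearman consistency check can be reproduced with a small self-contained implementation; the score pairs below are illustrative, not the study&#x2019;s data.</p>

```python
# Minimal re-implementation of the Spearman rank correlation used to
# validate machine scoring against manual scoring.
def rank(values):
    """Return 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

manual  = [0.9, 0.6, 0.8, 0.4, 0.7]   # illustrative manual scores
machine = [0.85, 0.55, 0.9, 0.5, 0.65]  # illustrative machine scores
rho = spearman(manual, machine)
```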
<sec id="sec5_1_1">
<title>The crisis information generation capabilities of LLMs</title>
<p>This section provides a performance evaluation of several LLMs based on the CIEval dataset. Overall, Chinese LLMs slightly underperform compared to leading international models such as the GPT and Claude series (<xref ref-type="fig" rid="F2">Figure 2(a)</xref>). This performance gap can be attributed to the global advantages in data resources and technological development enjoyed by international models.</p>
<fig id="F2">
<label>Figure 2.</label>
<caption><p>Score of LLMs&#x2019; CI generation capability</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c20-fig2.jpg"><alt-text>none</alt-text></graphic>
</fig>
<p>As shown in <xref ref-type="fig" rid="F2">Figure 2(b)</xref>, iFlytek Spark V4.0 excels at generating relevant information for natural disasters, achieving high scores due to its deep understanding of local meteorological data. In the context of accident disasters, Claude 3.5 Sonnet and GPT-4o outperform the others with scores of 0.96 and 0.63, respectively. While models like ERNIE 4.0 Turbo perform well in specific scenarios such as transportation accidents (<xref ref-type="fig" rid="F3">Figure 3</xref>), others, like Llama-3.1-405B, fall short in complex enterprise safety incidents, producing less actionable information. For public health events, Claude 3.5 Sonnet leads in scenarios like infectious disease outbreaks. In the social security category, Claude 3.5 Sonnet excels, achieving a score of 0.98 due to its strong situational analysis. Other models, such as Qwen-72B and ERNIE 4.0 Turbo, achieve high scores in specific incidents like mining and ransomware.</p>
<fig id="F3">
<label>Figure 3.</label>
<caption><p>Score of LLMs&#x2019; CI generation capability in segmented scenarios</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="images/c20-fig3.jpg"><alt-text>none</alt-text></graphic>
</fig>
</sec>
</sec>
</sec>
<sec id="sec6">
<title>Conclusion</title>
<p>This study focuses on the potential of LLMs in the field of CI generation and proposes a scientifically reasonable evaluation benchmark, CIEval. CIEval is constructed through a series of processes including key event information extraction, question generation, prompt construction, and manual review. The dataset covers 26 types of crises related to natural disasters, accident disasters, public health events, and social security events, with a total of 4,820 data entries. In the experimental phase, we evaluated the CI generation capability of LLMs such as Claude 3.5 Sonnet using the CIEval dataset. At the same time, we validated the feasibility of GPT-4.0 as a machine scoring expert. This benchmark aims to provide a reference for the future optimization and application of LLMs in CI generation, helping to promote their application and development in practical emergency management.</p>
</sec>
</body>
<back>
<ack>
<title>Acknowledgements</title>
<p>This study is supported by the National Social Science Foundation of China (Grant No. 23&#x0026;ZD230).</p>
</ack>
<ref-list>
<title>References</title>
<ref id="R1"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Alkhaled</surname><given-names>L.</given-names></name><name><surname>Adewumi</surname><given-names>T.</given-names></name><name><surname>Sabry</surname><given-names>S. S.</given-names></name></person-group> <year>(2023)</year> <article-title>Bipol: A novel multi-axes bias evaluation metric with explainability for NLP</article-title><source>Natural Language Processing Journal</source><volume>4</volume><fpage>100030</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.nlp.2023.100030">https://doi.org/10.1016/j.nlp.2023.100030</ext-link></element-citation></ref>
<ref id="R2"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Evtikhiev</surname><given-names>M.</given-names></name><name><surname>Bogomolov</surname><given-names>E.</given-names></name><name><surname>Sokolov</surname><given-names>Y.</given-names></name><name><surname>Bryksin</surname><given-names>T.</given-names></name></person-group> <year>(2023)</year> <article-title>Out of the BLEU: How should we assess quality of the code generation models?</article-title><source>Journal of Systems and Software</source><volume>203</volume><fpage>111741</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.jss.2023.111741">https://doi.org/10.1016/j.jss.2023.111741</ext-link></element-citation></ref>
<ref id="R3"><element-citation publication-type="other"><person-group person-group-type="author"><collab>General Office of the State Council, P.</collab></person-group> <year>(2024)</year> <article-title>Emergency Response Law</article-title><comment>People&#x2019;s Publishing House</comment></element-citation></ref>
<ref id="R4"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Giannakopoulos</surname><given-names>K.</given-names></name><name><surname>Kavadella</surname><given-names>A.</given-names></name><name><surname>Salim</surname><given-names>A. A.</given-names></name><name><surname>Stamatopoulos</surname><given-names>V.</given-names></name><name><surname>Kaklamanos</surname><given-names>E. G.</given-names></name></person-group> <year>(2023)</year> <article-title>Evaluation of the performance of generative AI large language models ChatGPT, Google Bard, and Microsoft Bing Chat in supporting evidence-based dentistry: Comparative mixed methods study</article-title><source>Journal of Medical Internet Research</source><volume>25</volume><fpage>e51580</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.2196/51580">https://doi.org/10.2196/51580</ext-link></element-citation></ref>
<ref id="R5"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Guler</surname><given-names>N.</given-names></name><name><surname>Kirshner</surname><given-names>S.N.</given-names></name></person-group> <year>(2024)</year> <article-title>A literature review of artificial intelligence research in business and management using machine learning and ChatGPT</article-title><source>Data and Information Management</source><volume>8</volume><issue>3</issue><fpage>1</fpage><lpage>25</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.dim.2024.100076">https://doi.org/10.1016/j.dim.2024.100076</ext-link></element-citation></ref>
<ref id="R6"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>Y.</given-names></name><name><surname>Li</surname><given-names>D.</given-names></name><name><surname>Wang</surname><given-names>K.</given-names></name><name><surname>Xiong</surname><given-names>Z.</given-names></name><name><surname>Shi</surname><given-names>F.</given-names></name><name><surname>Wang</surname><given-names>J.</given-names></name><name><surname>Li</surname><given-names>B.</given-names></name><name><surname>Hang</surname><given-names>B.</given-names></name></person-group> <year>(2024)</year> <article-title>Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs</article-title><source>Information Processing &#x0026; Management</source><volume>61</volume><issue>5</issue><fpage>103809</fpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.ipm.2024.103809">https://doi.org/10.1016/j.ipm.2024.103809</ext-link></element-citation></ref>
<ref id="R7"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Qin</surname><given-names>Q.</given-names></name><name><surname>Liang</surname><given-names>F.</given-names></name><name><surname>Li</surname><given-names>L.</given-names></name><name><surname>Chen</surname><given-names>Y.-W.</given-names></name><name><surname>Yu</surname><given-names>G.-F.</given-names></name></person-group> <year>(2017)</year> <article-title>A TODIM-based multi-criteria group decision making with triangular intuitionistic fuzzy numbers</article-title><source>Applied Soft Computing</source><volume>55</volume><fpage>93</fpage><lpage>107</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.asoc.2017.01.041">https://doi.org/10.1016/j.asoc.2017.01.041</ext-link></element-citation></ref>
<ref id="R8"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Sarlin</surname><given-names>P.-E.</given-names></name><name><surname>DeTone</surname><given-names>D.</given-names></name><name><surname>Malisiewicz</surname><given-names>T.</given-names></name><name><surname>Rabinovich</surname><given-names>A.</given-names></name></person-group> <year>(2020)</year> <article-title>SuperGlue: Learning feature matching with graph neural networks</article-title><source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source><fpage>4938</fpage><lpage>4947</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1109/CVPR42600.2020.00499">https://doi.org/10.1109/CVPR42600.2020.00499</ext-link></element-citation></ref>
<ref id="R9"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Suzgun</surname><given-names>M.</given-names></name><name><surname>Scales</surname><given-names>N.</given-names></name><name><surname>Sch&#x00E4;rli</surname><given-names>N.</given-names></name><name><surname>Gehrmann</surname><given-names>S.</given-names></name><name><surname>Tay</surname><given-names>Y.</given-names></name><name><surname>Chung</surname><given-names>H. W.</given-names></name><name><surname>Chowdhery</surname><given-names>A.</given-names></name><name><surname>Le</surname><given-names>Q. V.</given-names></name><name><surname>Chi</surname><given-names>E. H.</given-names></name><name><surname>Zhou</surname><given-names>D.</given-names></name><name><surname>Wei</surname><given-names>J.</given-names></name></person-group> <year>(2022)</year> <article-title>Challenging BIG-Bench tasks and whether chain-of-thought can solve them (arXiv:2210.09261)</article-title><source>arXiv</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2210.09261">https://doi.org/10.48550/arXiv.2210.09261</ext-link></element-citation></ref>
<ref id="R10"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Thelwall</surname><given-names>M.</given-names></name></person-group> <year>(2024)</year> <article-title>Can ChatGPT evaluate research quality?</article-title><source>Journal of Data and Information Science</source><volume>9</volume><issue>2</issue><fpage>1</fpage><lpage>21</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.2478/jdis-2024-0013">https://doi.org/10.2478/jdis-2024-0013</ext-link></element-citation></ref>
<ref id="R11"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Wang</surname><given-names>A.</given-names></name><name><surname>Singh</surname><given-names>A.</given-names></name><name><surname>Michael</surname><given-names>J.</given-names></name><name><surname>Hill</surname><given-names>F.</given-names></name><name><surname>Levy</surname><given-names>O.</given-names></name><name><surname>Bowman</surname><given-names>S. R.</given-names></name></person-group> <year>(2019)</year> <article-title>GLUE: A multi-task benchmark and analysis platform for natural language understanding (arXiv:1804.07461)</article-title><source>arXiv</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.1804.07461">https://doi.org/10.48550/arXiv.1804.07461</ext-link></element-citation></ref>
<ref id="R12"><element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname><given-names>J.</given-names></name><name><surname>Liu</surname><given-names>Y.</given-names></name><name><surname>Li</surname><given-names>P.</given-names></name><name><surname>Lin</surname><given-names>Z.</given-names></name><name><surname>Sindakis</surname><given-names>S.</given-names></name><name><surname>Aggarwal</surname><given-names>S.</given-names></name></person-group> <year>(2024)</year> <article-title>Overview of data quality: Examining the dimensions, antecedents, and impacts of data quality</article-title><source>Journal of the Knowledge Economy</source><volume>15</volume><issue>1</issue><fpage>1159</fpage><lpage>1178</lpage><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1007/s13132-022-01096-6">https://doi.org/10.1007/s13132-022-01096-6</ext-link></element-citation></ref>
<ref id="R13"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Xu</surname><given-names>L.</given-names></name><name><surname>Hu</surname><given-names>H.</given-names></name><name><surname>Zhang</surname><given-names>X.</given-names></name><name><surname>Li</surname><given-names>L.</given-names></name><name><surname>Cao</surname><given-names>C.</given-names></name><name><surname>Li</surname><given-names>Y.</given-names></name><name><surname>Xu</surname><given-names>Y.</given-names></name><name><surname>Sun</surname><given-names>K.</given-names></name><name><surname>Yu</surname><given-names>D.</given-names></name><name><surname>Yu</surname><given-names>C.</given-names></name><name><surname>Tian</surname><given-names>Y.</given-names></name><name><surname>Dong</surname><given-names>Q.</given-names></name><name><surname>Liu</surname><given-names>W.</given-names></name><name><surname>Shi</surname><given-names>B.</given-names></name><name><surname>Cui</surname><given-names>Y.</given-names></name><name><surname>Li</surname><given-names>J.</given-names></name><name><surname>Zeng</surname><given-names>J.</given-names></name><name><surname>Wang</surname><given-names>R.</given-names></name><name><surname>Xie</surname><given-names>W.</given-names></name><name><surname>Lan</surname><given-names>Z.</given-names></name></person-group> <year>(2020)</year> <article-title>CLUE: A Chinese language understanding evaluation benchmark (arXiv:2004.05986)</article-title><source>arXiv</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2004.05986">https://doi.org/10.48550/arXiv.2004.05986</ext-link></element-citation></ref>
<ref id="R14"><element-citation publication-type="other"><person-group person-group-type="author"><name><surname>Xu</surname><given-names>L.</given-names></name><name><surname>Li</surname><given-names>A.</given-names></name><name><surname>Zhu</surname><given-names>L.</given-names></name><name><surname>Xue</surname><given-names>H.</given-names></name><name><surname>Zhu</surname><given-names>C.</given-names></name><name><surname>Zhao</surname><given-names>K.</given-names></name><name><surname>He</surname><given-names>H.</given-names></name><name><surname>Zhang</surname><given-names>X.</given-names></name><name><surname>Kang</surname><given-names>Q.</given-names></name><name><surname>Lan</surname><given-names>Z.</given-names></name></person-group> <year>(2023)</year> <article-title>SuperCLUE: A comprehensive Chinese large language model benchmark (arXiv:2307.15020)</article-title><source>arXiv</source><ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2307.15020">https://doi.org/10.48550/arXiv.2307.15020</ext-link></element-citation></ref>
</ref-list>
</back>
</article>