Abstract
Reconstructing high-resolution climate data from historical documents is hindered by subjectivity and a lack of standardization. This study develops and validates a framework that addresses both problems. We first construct a historical weather classification lexicon using optimized natural language processing (NLP) techniques. Through semantic clustering and dynamic expansion, the lexicon captures the linguistic diversity with which weather events are described across regions and intensity levels. Building on this lexicon, we propose a multi-dimensional index system to quantify historical weather grades. The system comprises five indicators: weather intensity, agricultural impact, economic impact, social impact, and population casualties. The weight of each indicator is assigned by combining the entropy method with expert judgment. To validate the approach, we extracted low-temperature weather records from historical documents of Guangdong and Hebei provinces in China. The overall trend of low-temperature weather in these two provinces is consistent with existing research on climate change during the Qing Dynasty. Moreover, the provincial trend maps reveal not only synchronous change patterns but also significant regional differences. A Random Forest model was employed to validate the index, achieving a classification accuracy of 94.0%, with area under the curve (AUC) scores exceeding 0.98 for low-grade events. This data-driven methodology offers a replicable and scalable tool for converting qualitative historical narratives into high-resolution quantitative climate data, thereby enhancing our understanding of past climate variability and its societal impacts.
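As an illustration of the weighting step, the following is a minimal Python sketch of the standard entropy weight method applied to a matrix of indicator scores. It is not the authors' implementation. The function names `entropy_weights` and `blend_weights`, the blending parameter `alpha`, and the sample scores and expert weights are all hypothetical. The abstract does not state how expert judgment is combined with the entropy weights, so the linear blend shown here is an assumption.

```python
import numpy as np

def entropy_weights(X):
    """Entropy weight method: rows are records, columns are indicators."""
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    # Min-max normalise each indicator to [0, 1].
    span = X.max(axis=0) - X.min(axis=0)
    Z = (X - X.min(axis=0)) / np.where(span == 0, 1.0, span)
    # Convert each column to proportions; a constant column becomes uniform,
    # which gives it maximal entropy and therefore zero weight.
    col_sum = Z.sum(axis=0)
    P = np.where(col_sum == 0, 1.0 / n, Z / np.where(col_sum == 0, 1.0, col_sum))
    # Entropy per indicator, treating 0 * ln(0) as 0.
    plogp = P * np.log(np.where(P > 0, P, 1.0))
    E = -plogp.sum(axis=0) / np.log(n)
    # Degree of divergence: lower entropy means more discriminating power.
    D = np.clip(1.0 - E, 0.0, None)
    if D.sum() == 0:
        return np.full(X.shape[1], 1.0 / X.shape[1])
    return D / D.sum()

def blend_weights(w_entropy, w_expert, alpha=0.5):
    """Assumed linear blend of data-driven and expert weights (hypothetical)."""
    w_expert = np.asarray(w_expert, dtype=float)
    w_expert = w_expert / w_expert.sum()
    w = alpha * np.asarray(w_entropy) + (1.0 - alpha) * w_expert
    return w / w.sum()

# Hypothetical scores for four records on the five indicators:
# intensity, agricultural, economic, social, casualties.
scores = np.array([
    [3, 2, 1, 1, 0],
    [5, 4, 3, 2, 1],
    [2, 1, 1, 0, 0],
    [4, 4, 2, 3, 2],
])
expert = [0.30, 0.25, 0.15, 0.15, 0.15]  # illustrative values only
weights = blend_weights(entropy_weights(scores), expert, alpha=0.5)
print(weights.round(3))
```

With `alpha=1.0` the result is purely data-driven and with `alpha=0.0` it is purely expert-driven. Intermediate values trade one against the other, which is one plausible reading of "combined" in the text above.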