Abstract
Symbolic regression (SR) has regained research prominence as deep learning advances accelerate the search for analytical models from observational data. However, the vast search space often prevents existing algorithms from yielding complex analytical expressions. We present SR-LLM, an SR framework that integrates retrieval-augmented generation based on large language models (LLMs) to achieve incremental learning. Specifically, our framework leverages accumulated prior knowledge and past exploration results stored in external knowledge bases, retrieving the information most relevant to the current regression task. It first composes this prior information into small symbolic groups with the assistance of the LLMs, and then uses deep reinforcement learning to combine these groups into complex yet interpretable analytic expressions that humans can more easily understand. This capacity for efficient knowledge utilization allows our framework to integrate all previous human experience and exploration results, effectively learning by standing on the shoulders of giants. To validate the proposed method, we not only test the framework on popular symbolic regression benchmarks but also extend it to a domain where the explicit optimal model remains controversial: how to analytically describe human car-following behavior from observed vehicle trajectories. Experiments confirm that our method outperforms existing approaches on standard benchmarks, successfully rediscovers well-known traditional car-following models, and discovers new models from empirical trajectory data, achieving both goodness of fit and interpretability.