Safe reinforcement learning under temporal logic with reward design and quantum action selection


Abstract

This paper proposes an advanced reinforcement learning (RL) method that incorporates reward shaping, safety value functions, and a quantum action selection algorithm. The method is model-free and can synthesize a finite policy that maximizes the probability of satisfying a complex task. Although RL is a promising approach, it suffers from unsafe traps and sparse rewards, which make it impractical for real-world problems. To improve safety during training, we introduce a notion of safety values, which yields a model-based adaptive scenario through online updates of the transition probabilities. On the other hand, high-level complex tasks are usually formulated in formal languages such as Linear Temporal Logic (LTL). Another novelty of this work is the use of an Embedded Limit-Deterministic Generalized Büchi Automaton (E-LDGBA) to represent an LTL formula; the resulting deterministic policy generalizes tasks over both infinite and finite horizons. We design an automaton-based reward, and the theoretical analysis shows that an agent following the optimal policy accomplishes the task specification with maximum probability. Furthermore, a reward-shaping process is developed that avoids sparse rewards and enforces RL convergence while keeping the optimal policies invariant. In addition, inspired by quantum computing, we propose a quantum action selection algorithm to replace the conventional ε-greedy algorithm for balancing exploration and exploitation during learning. Simulations demonstrate that the proposed framework achieves good performance, dramatically reducing visits to unsafe states while converging to optimal policies.
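The abstract does not detail the quantum action selection algorithm, so the following is only an illustrative sketch of a common quantum-inspired scheme, not the authors' method: each action carries an amplitude, actions are sampled with probability equal to the squared amplitude (the Born rule), and a Grover-style inversion-about-the-mean step amplifies an action that proved valuable. The function names (`select_action`, `amplify`) and the specific update rule are assumptions for illustration.

```python
import numpy as np

def select_action(amplitudes, rng):
    # Sample an action with probability equal to its squared amplitude
    # (Born rule), simulated classically.
    probs = amplitudes ** 2
    probs /= probs.sum()
    return rng.choice(len(amplitudes), p=probs)

def amplify(amplitudes, action, k=1):
    # Grover-like update: flip the sign of the target action's amplitude
    # (oracle), then invert all amplitudes about their mean (diffusion).
    # Each iteration increases the target's selection probability.
    for _ in range(k):
        a = amplitudes.copy()
        a[action] = -a[action]              # oracle: mark the chosen action
        a = 2 * a.mean() - a                # diffusion: inversion about the mean
        amplitudes = a / np.linalg.norm(a)  # renormalize to a unit state
    return amplitudes

# With four actions and uniform amplitudes, one Grover iteration drives
# nearly all probability mass onto the amplified action.
rng = np.random.default_rng(0)
amps = np.ones(4) / 2.0
amps = amplify(amps, 0)
```

Compared with ε-greedy, such a scheme makes exploration pressure decay smoothly as an action's amplitude grows, rather than switching between a fixed random rate and pure greediness.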
