CONTACT
Autonomous Systems Laboratory
Mechanical Engineering
5th Engineering Building Room 810
Web. https://sites.google.com/site/aslunist/
Deep deterministic policy gradient
Minjae Jung
May. 19, 2020
DQN to DDPG: DQN overview
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Q-learning → DQN (2015), which adds:
1. replay buffer
2. deep neural network
3. target network
DQN to DDPG: DQN algorithm
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
• DQN achieves human-level performance on many Atari games
• Off-policy training: the replay buffer breaks the correlation between samples collected by the agent
• High-dimensional observations: a deep neural network extracts features from high-dimensional input
• Learning stability: the target network keeps the training process stable
[Figure: DQN block diagram — the agent acts with $a_t = \arg\max_a Q(s_t, a; \theta)$, the environment returns $(r_t, s_{t+1})$, and the transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay buffer; sampled transitions feed the DQN loss, which compares $Q(s_t, a_t; \theta)$ against the target network's $\max_a Q(s_{t+1}, a; \theta')$, and the Q network weights are periodically copied to the target network.]
Q-learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\,]$

DQN: $Q^{\pi}_{\theta}(s_t, a_t) \leftarrow Q^{\pi}_{\theta}(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q^{\pi}_{\theta'}(s_{t+1}, a) - Q^{\pi}_{\theta}(s_t, a_t)\,]$

Policy ($\pi$): $a_t = \arg\max_a Q^{\pi}_{\theta}(s_t, a)$

Notation — $s_t$: state, $a_t$: action, $r_t$: reward, $Q(s_t, a_t)$: reward-to-go
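A minimal sketch of the DQN update above, assuming a hypothetical `QNet` module and a replay-buffer minibatch already converted to tensors; the target network supplies $\max_a Q(s_{t+1}, a; \theta')$ and is periodically copied from the online network, exactly as in the diagram.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Hypothetical Q network: maps a state to one Q value per discrete action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        return self.net(s)

q_net, target_net = QNet(4, 2), QNet(4, 2)          # example dims (CartPole-like)
target_net.load_state_dict(q_net.state_dict())      # "copy" step from the diagram
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(s, a, r, s_next, done):
    """One gradient step on the DQN loss for a sampled minibatch (s, a, r, s', done)."""
    with torch.no_grad():                            # target uses the frozen network θ'
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t; θ)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```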
DQN to DDPG: Limitation of DQN (discrete action spaces)
• Discrete action spaces
- DQN can only handle discrete and low-dimensional action spaces
- As the action dimension grows, the number of discrete actions (output nodes) grows exponentially
- i.e., $k$ discretization levels per dimension with $n$ action dimensions → $k^n$ joint actions (a quick sketch after this list illustrates the blow-up)
• DQN cannot be straightforwardly applied to continuous domains. Why? Both steps below require a maximization over the whole action set, which is intractable when actions are continuous:
1. Policy ($\pi$): $a_t = \arg\max_a Q^{\pi}_{\theta}(s_t, a)$
2. Update: $Q^{\pi}_{\theta}(s_t, a_t) \leftarrow Q^{\pi}_{\theta}(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q^{\pi}_{\theta'}(s_{t+1}, a) - Q^{\pi}_{\theta}(s_t, a_t)\,]$
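A quick illustration of the $k^n$ blow-up, using hypothetical numbers (5 discretization levels per dimension, a 7-dimensional action such as a 7-DOF arm): the DQN output layer would need one node per joint action.

```python
from itertools import product

levels_per_dim = 5          # k: discretization levels per action dimension
action_dims = 7             # n: action dimensions (e.g. a 7-DOF arm)

# Joint discrete actions a DQN output layer would have to cover: k ** n nodes.
joint_actions = product(range(levels_per_dim), repeat=action_dims)
print(sum(1 for _ in joint_actions))   # 5 ** 7 = 78125 output nodes
```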
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DDPG: DQN with Policy gradient methods
Q-learning → DQN, which adds: 1. replay buffer, 2. deep neural network, 3. target network
Policy gradient (REINFORCE) → Actor-critic → DPG (continuous action spaces)
DQN + DPG → DDPG
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
Policy gradient: The goal of Reinforcement learning
• Trajectory distribution: $p_\theta(\tau) = p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$
• Goal of reinforcement learning: $\theta^{*} = \arg\max_\theta \, E_{\tau \sim p_\theta(\tau)}\!\left[\sum_t r(s_t, a_t)\right]$, i.e. maximize the objective $J(\theta)$
• Policy ($\pi_\theta$): stochastic policy with weights $\theta$
[Figure: agent–world loop — the policy $\pi_\theta(a_t \mid s_t)$ selects the action $a_t$, the world (model $p(s_{t+1} \mid s_t, a_t)$) returns the reward $r_t$ and next state $s_{t+1}$; Markov decision process chain $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$ with transition probabilities $p(s_{t+1} \mid s_t, a_t)$.]
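A minimal sketch of the objective $J(\theta) = E_{\tau \sim p_\theta(\tau)}\!\left[\sum_t r(s_t, a_t)\right]$, assuming a Gymnasium-style `env` and a hypothetical `policy(s)` callable that returns an action; the expectation is approximated by averaging returns over sampled trajectories.

```python
def rollout_return(env, policy, max_steps=200):
    """Sample one trajectory τ ~ p_θ(τ) and return Σ_t r(s_t, a_t)."""
    s, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        a = policy(s)                                    # a_t ~ π_θ(a_t | s_t)
        s, r, terminated, truncated, _ = env.step(a)     # world: p(s_{t+1} | s_t, a_t), r_t
        total += r
        if terminated or truncated:
            break
    return total

def estimate_objective(env, policy, n_episodes=20):
    """Monte Carlo estimate of J(θ) = E_τ[Σ_t r(s_t, a_t)]."""
    return sum(rollout_return(env, policy) for _ in range(n_episodes)) / n_episodes
```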
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: 𝑎 𝑡 ~ 𝜋 𝜃(𝑎 𝑡|𝑠𝑡)
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
[Figure: for a state $s_t$, the policy network outputs a probability distribution $\pi_\theta(a_t \mid s_t)$ over the discrete actions, e.g. (0.1, 0.1, 0.2, 0.2, 0.4).]
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t \mid s_t)$

$J(\theta) = E_{\tau \sim p_\theta(\tau)}\!\left[\sum_t r(s_t, a_t)\right]$

$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\!\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$

$\nabla_\theta J(\theta) \approx \dfrac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)$, where $N$ is the number of episodes

$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$ ($\theta$: weights of the actor network, $\alpha$: learning rate)

Problem: the agent must experience several full episodes before each update →
1. slow training process
2. high gradient variance

[Figure: from the initial state, $N$ episodes are rolled out, yielding returns $r_1, r_2, \dots, r_N$ that are averaged in the gradient estimate.]

Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
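A minimal sketch of the Monte Carlo estimator above, assuming a hypothetical `policy_net` that outputs logits over discrete actions and a list of sampled episodes; the surrogate loss is built so that its gradient equals $-\frac{1}{N}\sum_i \big(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big)\big(\sum_t r_t\big)$.

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, episodes):
    """episodes: list of N trajectories, each a list of (state, action, reward) tuples."""
    losses = []
    for traj in episodes:
        states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _ in traj])
        actions = torch.tensor([a for _, a, _ in traj])
        episode_return = sum(r for _, _, r in traj)          # Σ_t r(s_t, a_t)
        log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
        # surrogate loss whose gradient is -(Σ_t ∇_θ log π_θ)(Σ_t r) for this episode
        losses.append(-log_probs.sum() * episode_return)
    loss = torch.stack(losses).mean()                        # average over the N episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```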
Policy gradient: Actor-critic
• Actor ($\pi_\theta(a_t \mid s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic
• Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's actions; it replaces the Monte Carlo return $\sum_t r(s_t, a_t)$ in the gradient estimate, lowering the gradient variance and removing the need to wait for full episodes
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t \mid s_t)$ $i$ times
2. Update $Q_\phi(s_t, a_t)$ on the sampled data
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_\phi(s_t, a_t)$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
($\phi$: weights of the critic network; a minimal sketch of these steps follows below)
[Figure: actor–critic loop — the actor $\pi_\theta(a_t \mid s_t)$ acts in the environment, transitions $(s_t, a_t, r_t, s_{t+1})$ collected over $0 \sim i$ steps update the critic $Q_\phi(s_t, a_t)$, which in turn supplies $\nabla J(\theta)$ to update the actor.]
Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
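A minimal sketch of steps 1–4 above, assuming hypothetical `actor` (logits over discrete actions) and `critic` (one Q value per action) networks, their optimizers, and a batch of $i$ sampled transitions; the critic is fit to a one-step TD target and the actor follows $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_\phi(s_t, a_t)$.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

gamma = 0.99

def actor_critic_update(actor, critic, actor_opt, critic_opt, batch):
    """batch: tensors (s, a, r, s_next, done) sampled i times from π_θ."""
    s, a, r, s_next, done = batch

    # 2. update critic Q_φ(s_t, a_t) toward a one-step TD target
    with torch.no_grad():
        a_next = Categorical(logits=actor(s_next)).sample()
        td_target = r + gamma * (1 - done) * critic(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 3-4. actor ascends Σ_i ∇_θ log π_θ(a_t|s_t) Q_φ(s_t, a_t)
    #      (pre-update Q values are reused, detached from the critic's graph)
    log_prob = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * q_sa.detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```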
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
• Stochastic policy $\pi_\theta(a_t \mid s_t)$: needs 10 output nodes to cover 5 discretized values in each of the 2 action dimensions $(a_x, a_y)$
• Deterministic policy $\mu_\theta(s_t)$: only 2 output nodes $(a_x, a_y)$ are needed
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\mu_\theta(s)$ $i$ times
2. Update $Q_\phi(s_t, a_t)$ on the samples
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q_\phi(s_t, a)\big|_{a = \mu_\theta(s_t)}$ (see the sketch below)
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Trajectory distribution (stochastic policy): $p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$

Trajectory distribution (deterministic policy): $p_\theta(s_1, s_2, \dots, s_T) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, \mu_\theta(s_t))$

Objective: $J(\theta) = E_{s,a \sim p_\theta(\tau)}[\,Q_\phi(s_t, a_t)\,] \;\rightarrow\; J(\theta) = E_{s \sim p_\theta(\tau)}[\,Q_\phi(s, \mu_\theta(s))\,]$

Critic loss (TD error): $L = r_t + \gamma Q_\phi(s_{t+1}, \mu_\theta(s_{t+1})) - Q_\phi(s_t, a_t)$
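A minimal sketch of the actor update in step 3 above, assuming hypothetical continuous-action `actor` ($\mu_\theta$) and `critic` ($Q_\phi(s, a)$, taking the state–action pair) networks; the chain rule $\nabla_\theta \mu_\theta(s)\, \nabla_a Q_\phi(s, a)|_{a=\mu_\theta(s)}$ is obtained by backpropagating through $Q_\phi(s, \mu_\theta(s))$ with respect to $\theta$ only.

```python
import torch

def dpg_actor_update(actor, critic, actor_opt, states):
    """Ascend ∇_θ J(θ) ≈ Σ_i ∇_θ μ_θ(s) ∇_a Q_φ(s, a)|_{a=μ_θ(s)}."""
    actions = actor(states)                       # a = μ_θ(s), differentiable in θ
    actor_loss = -critic(states, actions).mean()  # maximize Q_φ(s, μ_θ(s))
    actor_opt.zero_grad()
    actor_loss.backward()                         # gradient flows through a into θ
    actor_opt.step()                              # actor_opt only holds θ, so φ is untouched
```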
DDPG: DQN + DPG
• DQN branch (Q-learning → DQN): + off-policy training via the replay buffer, + stable updates via the target network, + high-dimensional observation spaces; − discrete action spaces only
• Policy-gradient branch (REINFORCE → Actor-critic → DPG): + continuous action spaces, + lower variance thanks to the critic; − high variance (REINFORCE), − no replay buffer (sample correlation), − no target network (unstable), − low-dimensional observation spaces
• DDPG = DQN + DPG: an actor-critic with a deterministic policy, trained off-policy with a replay buffer and target networks
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DDPG: algorithm(1/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
• Policy: DQN's $a = \arg\max_a Q^{\pi}_{\theta}(s, a)$ is replaced by the deterministic actor $a = \mu_\theta(s)$
• Exploration: add noise to the deterministic action, $\mu'(s) = \mu_\theta(s) + \mathcal{N}$ (white Gaussian noise)
• Soft target update: $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$ with $\tau \ll 1$, so the target network is constrained to change slowly, which stabilizes the training process
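A minimal sketch of the two tricks above, assuming `net`/`target_net` are parameter-compatible modules and an `actor` with bounded outputs; exploration adds zero-mean Gaussian noise to the deterministic action (the slide's choice of noise), and the target weights track the online weights with $\tau \ll 1$.

```python
import torch

def soft_update(target_net, net, tau=0.005):
    """θ' ← τ θ + (1 − τ) θ', applied parameter-wise after every training step."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)

def exploration_action(actor, state, noise_std=0.1, low=-1.0, high=1.0):
    """μ'(s) = μ_θ(s) + N, with N white Gaussian noise, clipped to the action bounds."""
    with torch.no_grad():
        a = actor(state)
        a = a + noise_std * torch.randn_like(a)
    return a.clamp(low, high)
```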
DDPG: algorithm(2/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
• Four networks: policy $\mu_\theta$, target policy $\mu_{\theta'}$, critic $Q_\phi$, target critic $Q_{\phi'}$
[Figure: DDPG training loop — the actor selects $a_t = \mu_\theta(s_t) + \mathcal{N}$, the environment returns $(r_t, s_{t+1})$, and the transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay buffer; minibatches of $i$ transitions are sampled to update the critic with loss $L(\phi)$, using the target networks' $\mu_{\theta'}(s_{t+1})$ and $Q_{\phi'}$; the actor is updated with $\nabla J(\theta)$ through $Q_\phi(s_t, \mu_\theta(s_t))$; and the target networks are soft-updated ($\theta'$, $\phi'$).]
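A minimal sketch of one DDPG training iteration from the diagram above, assuming hypothetical `actor`, `critic`, `target_actor`, `target_critic` networks with matching tensor shapes, their optimizers, a `replay_buffer.sample()` returning tensors, and the `soft_update` helper from the previous slide; the critic regresses onto $r_t + \gamma Q_{\phi'}(s_{t+1}, \mu_{\theta'}(s_{t+1}))$, the actor is updated through $Q_\phi(s, \mu_\theta(s))$, and both targets are soft-updated.

```python
import torch
import torch.nn.functional as F

gamma, tau = 0.99, 0.005

def ddpg_train_step(actor, critic, target_actor, target_critic,
                    actor_opt, critic_opt, replay_buffer, batch_size=64):
    # sample an i-sized minibatch (s_t, a_t, r_t, s_{t+1}, done) from the replay buffer
    s, a, r, s_next, done = replay_buffer.sample(batch_size)

    # critic update: regress Q_φ(s_t, a_t) onto r_t + γ Q_φ'(s_{t+1}, μ_θ'(s_{t+1}))
    with torch.no_grad():
        target_q = r + gamma * (1 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor update: ascend ∇_θ J(θ) through Q_φ(s, μ_θ(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # soft-update both target networks: θ' ← τθ + (1−τ)θ', φ' ← τφ + (1−τ)φ'
    soft_update(target_actor, actor, tau)
    soft_update(target_critic, critic, tau)
```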
DDPG example: landing on a moving platform
Rodriguez-Ramos, Alejandro, et al. "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform." Journal of Intelligent & Robotic Systems 93.1-2 (2019): 351-366.
DDPG example: long-range robotic navigation
Faust, Aleksandra, et al. "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
• DDPG is used as the local planner for long-range navigation
DDPG example: multi agent DDPG (MADDPG)
Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems. 2017.
Conclusion & Future work
• DQN cannot handle continuous action spaces directly
• DDPG handles continuous action spaces via the policy gradient method and an actor-critic architecture
• MADDPG extends DDPG to multi-agent RL
• Future work: use DDPG for continuous-action decision-making problems, e.g. navigation and obstacle avoidance
Appendix: Objective gradient derivation
Appendix: DPG objective
Appendix: DDPG algorithm
