CONTACT
Autonomous Systems Laboratory
Mechanical Engineering
5th Engineering Building Room 810
Web. https://sites.google.com/site/aslunist/
Deep deterministic policy gradient
Minjae Jung
May. 19, 2020
DQN to DDPG: DQN overview
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Q-learning → DQN (2015), which adds:
1. replay buffer
2. deep neural network
3. target network
DQN to DDPG: DQN algorithm
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
• DQN achieves human-level performance on many Atari games
• Off-policy training: the replay buffer breaks the correlation between samples collected by the agent
• High-dimensional observations: a deep neural network extracts features from high-dimensional input
• Learning stability: the target network keeps the training process stable
[Figure: DQN block diagram — the agent acts with $a_t = \arg\max_a Q(s_t, a; \theta)$, the environment returns $(r_t, s_{t+1})$, and the transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay buffer; sampled transitions feed the DQN loss, which compares $Q(s_t, a_t; \theta)$ against the target network's $\max_a Q(s_{t+1}, a; \theta')$, and the Q network weights are periodically copied to the target network.]
Q-learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\,]$

DQN: $Q^{\pi}_{\theta}(s_t, a_t) \leftarrow Q^{\pi}_{\theta}(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q^{\pi}_{\theta'}(s_{t+1}, a) - Q^{\pi}_{\theta}(s_t, a_t)\,]$

Policy ($\pi$): $a_t = \arg\max_a Q^{\pi}_{\theta}(s_t, a)$

Notation — $s_t$: state, $a_t$: action, $r_t$: reward, $Q(s_t, a_t)$: reward-to-go
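A minimal sketch of the DQN update above, assuming a hypothetical `QNet` module and a replay-buffer minibatch already converted to tensors; the target network supplies $\max_a Q(s_{t+1}, a; \theta')$ and is periodically copied from the online network, exactly as in the diagram.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Hypothetical Q network: maps a state to one Q value per discrete action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        return self.net(s)

q_net, target_net = QNet(4, 2), QNet(4, 2)          # example dims (CartPole-like)
target_net.load_state_dict(q_net.state_dict())      # "copy" step from the diagram
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(s, a, r, s_next, done):
    """One gradient step on the DQN loss for a sampled minibatch (s, a, r, s', done)."""
    with torch.no_grad():                            # target uses the frozen network θ'
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t; θ)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```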
DQN to DDPG: Limitation of DQN (discrete action spaces)
• Discrete action spaces
- DQN can only handle discrete and low-dimensional action spaces
- As the action dimension grows, the number of discrete actions (output nodes) grows exponentially
- i.e., $k$ discretization levels per dimension with $n$ action dimensions → $k^n$ joint actions (a quick sketch after this list illustrates the blow-up)
• DQN cannot be straightforwardly applied to continuous domains. Why? Both steps below require a maximization over the whole action set, which is intractable when actions are continuous:
1. Policy ($\pi$): $a_t = \arg\max_a Q^{\pi}_{\theta}(s_t, a)$
2. Update: $Q^{\pi}_{\theta}(s_t, a_t) \leftarrow Q^{\pi}_{\theta}(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma \max_a Q^{\pi}_{\theta'}(s_{t+1}, a) - Q^{\pi}_{\theta}(s_t, a_t)\,]$
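A quick illustration of the $k^n$ blow-up, using hypothetical numbers (5 discretization levels per dimension, a 7-dimensional action such as a 7-DOF arm): the DQN output layer would need one node per joint action.

```python
from itertools import product

levels_per_dim = 5          # k: discretization levels per action dimension
action_dims = 7             # n: action dimensions (e.g. a 7-DOF arm)

# Joint discrete actions a DQN output layer would have to cover: k ** n nodes.
joint_actions = product(range(levels_per_dim), repeat=action_dims)
print(sum(1 for _ in joint_actions))   # 5 ** 7 = 78125 output nodes
```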
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DDPG: DQN with Policy gradient methods
Q-learning → DQN, which adds: 1. replay buffer, 2. deep neural network, 3. target network
Policy gradient (REINFORCE) → Actor-critic → DPG (continuous action spaces)
DQN + DPG → DDPG
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
Policy gradient: The goal of Reinforcement learning
• Trajectory distribution: $p_\theta(\tau) = p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$
• Goal of reinforcement learning: $\theta^{*} = \arg\max_\theta \, E_{\tau \sim p_\theta(\tau)}\!\left[\sum_t r(s_t, a_t)\right]$, i.e. maximize the objective $J(\theta)$
• Policy ($\pi_\theta$): stochastic policy with weights $\theta$
[Figure: agent–world loop — the policy $\pi_\theta(a_t \mid s_t)$ selects the action $a_t$, the world (model $p(s_{t+1} \mid s_t, a_t)$) returns the reward $r_t$ and next state $s_{t+1}$; Markov decision process chain $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} \cdots$ with transition probabilities $p(s_{t+1} \mid s_t, a_t)$.]
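A minimal sketch of the objective $J(\theta) = E_{\tau \sim p_\theta(\tau)}\!\left[\sum_t r(s_t, a_t)\right]$, assuming a Gymnasium-style `env` and a hypothetical `policy(s)` callable that returns an action; the expectation is approximated by averaging returns over sampled trajectories.

```python
def rollout_return(env, policy, max_steps=200):
    """Sample one trajectory τ ~ p_θ(τ) and return Σ_t r(s_t, a_t)."""
    s, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        a = policy(s)                                    # a_t ~ π_θ(a_t | s_t)
        s, r, terminated, truncated, _ = env.step(a)     # world: p(s_{t+1} | s_t, a_t), r_t
        total += r
        if terminated or truncated:
            break
    return total

def estimate_objective(env, policy, n_episodes=20):
    """Monte Carlo estimate of J(θ) = E_τ[Σ_t r(s_t, a_t)]."""
    return sum(rollout_return(env, policy) for _ in range(n_episodes)) / n_episodes
```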
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: 𝑎 𝑡 ~ 𝜋 𝜃(𝑎 𝑡|𝑠𝑡)
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
[Figure: for a state $s_t$, the policy network outputs a probability distribution $\pi_\theta(a_t \mid s_t)$ over the discrete actions, e.g. (0.1, 0.1, 0.2, 0.2, 0.4).]
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t \mid s_t)$

$J(\theta) = E_{\tau \sim p_\theta(\tau)}\!\left[\sum_t r(s_t, a_t)\right]$

$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\!\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$

$\nabla_\theta J(\theta) \approx \dfrac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)$, where $N$ is the number of episodes

$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$ ($\theta$: weights of the actor network, $\alpha$: learning rate)

Problem: the agent must experience several full episodes before each update →
1. slow training process
2. high gradient variance

[Figure: from the initial state, $N$ episodes are rolled out, yielding returns $r_1, r_2, \dots, r_N$ that are averaged in the gradient estimate.]

Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
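A minimal sketch of the Monte Carlo estimator above, assuming a hypothetical `policy_net` that outputs logits over discrete actions and a list of sampled episodes; the surrogate loss is built so that its gradient equals $-\frac{1}{N}\sum_i \big(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big)\big(\sum_t r_t\big)$.

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, episodes):
    """episodes: list of N trajectories, each a list of (state, action, reward) tuples."""
    losses = []
    for traj in episodes:
        states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _ in traj])
        actions = torch.tensor([a for _, a, _ in traj])
        episode_return = sum(r for _, _, r in traj)          # Σ_t r(s_t, a_t)
        log_probs = Categorical(logits=policy_net(states)).log_prob(actions)
        # surrogate loss whose gradient is -(Σ_t ∇_θ log π_θ)(Σ_t r) for this episode
        losses.append(-log_probs.sum() * episode_return)
    loss = torch.stack(losses).mean()                        # average over the N episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```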
Policy gradient: Actor-critic
• Actor ($\pi_\theta(a_t \mid s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic
• Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's actions; it replaces the Monte Carlo return $\sum_t r(s_t, a_t)$ in the gradient estimate, lowering the gradient variance and removing the need to wait for full episodes
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t \mid s_t)$ $i$ times
2. Update $Q_\phi(s_t, a_t)$ on the sampled data
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_\phi(s_t, a_t)$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
($\phi$: weights of the critic network; a minimal sketch of these steps follows below)
[Figure: actor–critic loop — the actor $\pi_\theta(a_t \mid s_t)$ acts in the environment, transitions $(s_t, a_t, r_t, s_{t+1})$ collected over $0 \sim i$ steps update the critic $Q_\phi(s_t, a_t)$, which in turn supplies $\nabla J(\theta)$ to update the actor.]
Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
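A minimal sketch of steps 1–4 above, assuming hypothetical `actor` (logits over discrete actions) and `critic` (one Q value per action) networks, their optimizers, and a batch of $i$ sampled transitions; the critic is fit to a one-step TD target and the actor follows $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_\phi(s_t, a_t)$.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

gamma = 0.99

def actor_critic_update(actor, critic, actor_opt, critic_opt, batch):
    """batch: tensors (s, a, r, s_next, done) sampled i times from π_θ."""
    s, a, r, s_next, done = batch

    # 2. update critic Q_φ(s_t, a_t) toward a one-step TD target
    with torch.no_grad():
        a_next = Categorical(logits=actor(s_next)).sample()
        td_target = r + gamma * (1 - done) * critic(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
    q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic_loss = F.mse_loss(q_sa, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # 3-4. actor ascends Σ_i ∇_θ log π_θ(a_t|s_t) Q_φ(s_t, a_t)
    #      (pre-update Q values are reused, detached from the critic's graph)
    log_prob = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * q_sa.detach()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```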
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
• Stochastic policy $\pi_\theta(a_t \mid s_t)$: needs 10 output nodes to cover 5 discretized values in each of the 2 action dimensions $(a_x, a_y)$
• Deterministic policy $\mu_\theta(s_t)$: only 2 output nodes $(a_x, a_y)$ are needed
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\mu_\theta(s)$ $i$ times
2. Update $Q_\phi(s_t, a_t)$ on the samples
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q_\phi(s_t, a)\big|_{a = \mu_\theta(s_t)}$ (see the sketch below)
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Trajectory distribution (stochastic policy): $p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$

Trajectory distribution (deterministic policy): $p_\theta(s_1, s_2, \dots, s_T) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, \mu_\theta(s_t))$

Objective: $J(\theta) = E_{s,a \sim p_\theta(\tau)}[\,Q_\phi(s_t, a_t)\,] \;\rightarrow\; J(\theta) = E_{s \sim p_\theta(\tau)}[\,Q_\phi(s, \mu_\theta(s))\,]$

Critic loss (TD error): $L = r_t + \gamma Q_\phi(s_{t+1}, \mu_\theta(s_{t+1})) - Q_\phi(s_t, a_t)$
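A minimal sketch of the actor update in step 3 above, assuming hypothetical continuous-action `actor` ($\mu_\theta$) and `critic` ($Q_\phi(s, a)$, taking the state–action pair) networks; the chain rule $\nabla_\theta \mu_\theta(s)\, \nabla_a Q_\phi(s, a)|_{a=\mu_\theta(s)}$ is obtained by backpropagating through $Q_\phi(s, \mu_\theta(s))$ with respect to $\theta$ only.

```python
import torch

def dpg_actor_update(actor, critic, actor_opt, states):
    """Ascend ∇_θ J(θ) ≈ Σ_i ∇_θ μ_θ(s) ∇_a Q_φ(s, a)|_{a=μ_θ(s)}."""
    actions = actor(states)                       # a = μ_θ(s), differentiable in θ
    actor_loss = -critic(states, actions).mean()  # maximize Q_φ(s, μ_θ(s))
    actor_opt.zero_grad()
    actor_loss.backward()                         # gradient flows through a into θ
    actor_opt.step()                              # actor_opt only holds θ, so φ is untouched
```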
DDPG: DQN + DPG
• DQN branch (Q-learning → DQN): + off-policy training via the replay buffer, + stable updates via the target network, + high-dimensional observation spaces; − discrete action spaces only
• Policy-gradient branch (REINFORCE → Actor-critic → DPG): + continuous action spaces, + lower variance thanks to the critic; − high variance (REINFORCE), − no replay buffer (sample correlation), − no target network (unstable), − low-dimensional observation spaces
• DDPG = DQN + DPG: an actor-critic with a deterministic policy, trained off-policy with a replay buffer and target networks
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DDPG: algorithm(1/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
• Policy: DQN's $a = \arg\max_a Q^{\pi}_{\theta}(s, a)$ is replaced by the deterministic actor $a = \mu_\theta(s)$
• Exploration: add noise to the deterministic action, $\mu'(s) = \mu_\theta(s) + \mathcal{N}$ (white Gaussian noise)
• Soft target update: $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$ with $\tau \ll 1$, so the target network is constrained to change slowly, which stabilizes the training process
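A minimal sketch of the two tricks above, assuming `net`/`target_net` are parameter-compatible modules and an `actor` with bounded outputs; exploration adds zero-mean Gaussian noise to the deterministic action (the slide's choice of noise), and the target weights track the online weights with $\tau \ll 1$.

```python
import torch

def soft_update(target_net, net, tau=0.005):
    """θ' ← τ θ + (1 − τ) θ', applied parameter-wise after every training step."""
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)

def exploration_action(actor, state, noise_std=0.1, low=-1.0, high=1.0):
    """μ'(s) = μ_θ(s) + N, with N white Gaussian noise, clipped to the action bounds."""
    with torch.no_grad():
        a = actor(state)
        a = a + noise_std * torch.randn_like(a)
    return a.clamp(low, high)
```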
DDPG: algorithm(2/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
• Four networks: policy $\mu_\theta$, target policy $\mu_{\theta'}$, critic $Q_\phi$, target critic $Q_{\phi'}$
[Figure: DDPG training loop — the actor selects $a_t = \mu_\theta(s_t) + \mathcal{N}$, the environment returns $(r_t, s_{t+1})$, and the transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the replay buffer; minibatches of $i$ transitions are sampled to update the critic with loss $L(\phi)$, using the target networks' $\mu_{\theta'}(s_{t+1})$ and $Q_{\phi'}$; the actor is updated with $\nabla J(\theta)$ through $Q_\phi(s_t, \mu_\theta(s_t))$; and the target networks are soft-updated ($\theta'$, $\phi'$).]
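A minimal sketch of one DDPG training iteration from the diagram above, assuming hypothetical `actor`, `critic`, `target_actor`, `target_critic` networks with matching tensor shapes, their optimizers, a `replay_buffer.sample()` returning tensors, and the `soft_update` helper from the previous slide; the critic regresses onto $r_t + \gamma Q_{\phi'}(s_{t+1}, \mu_{\theta'}(s_{t+1}))$, the actor is updated through $Q_\phi(s, \mu_\theta(s))$, and both targets are soft-updated.

```python
import torch
import torch.nn.functional as F

gamma, tau = 0.99, 0.005

def ddpg_train_step(actor, critic, target_actor, target_critic,
                    actor_opt, critic_opt, replay_buffer, batch_size=64):
    # sample an i-sized minibatch (s_t, a_t, r_t, s_{t+1}, done) from the replay buffer
    s, a, r, s_next, done = replay_buffer.sample(batch_size)

    # critic update: regress Q_φ(s_t, a_t) onto r_t + γ Q_φ'(s_{t+1}, μ_θ'(s_{t+1}))
    with torch.no_grad():
        target_q = r + gamma * (1 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor update: ascend ∇_θ J(θ) through Q_φ(s, μ_θ(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # soft-update both target networks: θ' ← τθ + (1−τ)θ', φ' ← τφ + (1−τ)φ'
    soft_update(target_actor, actor, tau)
    soft_update(target_critic, critic, tau)
```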
DDPG example: landing on a moving platform
Rodriguez-Ramos, Alejandro, et al. "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform." Journal of Intelligent & Robotic Systems 93.1-2 (2019): 351-366.
DDPG example: long-range robotic navigation
Faust, Aleksandra, et al. "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
• DDPG is used as the local planner for long-range navigation
DDPG example: multi agent DDPG (MADDPG)
Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems. 2017.
Conclusion & Future work
• DQN cannot handle continuous action spaces directly
• DDPG handles continuous action spaces via the policy gradient method and an actor-critic architecture
• MADDPG extends DDPG to multi-agent RL
• Future work: use DDPG for continuous-action decision-making problems, e.g. navigation and obstacle avoidance
Appendix: Objective gradient derivation
Appendix: DPG objective
Appendix: DDPG algorithm
