Reinforcement Learning: A Beginner's Tutorial
By: Omar Enayet
(Presentation Version)
The Problem
Agent-Environment Interface
Environment Model
Goals & Rewards
Returns
Credit-Assignment Problem
Markov Decision Process
An MDP is defined by <S, A, p, r, γ>
S – set of states of the environment
A(s) – set of actions possible in state s
p(s′ | s, a) – probability of transitioning from s to s′ when executing a
r(s, a) – expected reward when executing a in s
γ – discount rate for expected reward
Assumption: discrete time t = 0, 1, 2, . . .
[Figure: agent-environment interaction trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, . . .]
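The quantities named on this slide are usually formalized as below (Sutton & Barto conventions); the exact symbols are an assumption, since the slide only names them. The return G_t is the discounted sum of future rewards referred to on the "Returns" slide.

```latex
% Standard formalization of the MDP quantities named above (notation assumed,
% following Sutton & Barto conventions).
\begin{align*}
  p(s' \mid s, a) &= \Pr\{\, S_{t+1} = s' \mid S_t = s,\ A_t = a \,\} \\
  r(s, a)         &= \mathbb{E}\,[\, R_{t+1} \mid S_t = s,\ A_t = a \,] \\
  G_t             &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
                   = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
                   \qquad 0 \le \gamma \le 1
\end{align*}
```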
Value Functions
Value Functions
Value Functions
Optimal Value Functions
Exploration-Exploitation Problem
Policies
Elementary Solution Methods
Dynamic Programming
Perfect Model
Bootstrapping
Generalized Policy Iteration
Efficiency of DP
Monte-Carlo Methods
Episodic Return
Advantages over DP
No Model
Simulation OR part of Model
Focus on small subset of states
Less Harmed by Violations of the Markov Property
First-Visit VS Every-Visit
On-Policy VS Off-Policy
Action-value instead of State-value
Temporal-Difference Learning
Advantages of TD Learning
SARSA (On-Policy)
Q-Learning (Off-Policy)
Actor-Critic Methods (On-Policy)
R-Learning (Off-Policy): Average Expected Reward per Time-Step
Eligibility Traces


Editor's Notes

  • #5 By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions. Given a state and an action, a model produces a prediction of the resultant next state and next reward. If the model is stochastic, then there are several possible next states and next rewards, each with some probability of occurring. Some models produce a description of all possibilities and their probabilities; these we call distribution models. Other models produce just one of the possibilities, sampled according to the probabilities; these we call sample models. For example, consider modeling the sum of a dozen dice. A distribution model would produce all possible sums and their probabilities of occurring, whereas a sample model would produce an individual sum drawn according to this probability distribution. (A small code sketch of this dice example follows these notes.)
  • #8 Credit assignment problem: How do you distribute credit for success among the many decisions that may have been involved in producing it?
  • #14 One of the challenges that arise in reinforcement learning and not in other kinds of learning is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades. (An ε-greedy action-selection sketch, one simple way to balance the two, also follows these notes.)
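For the distribution-model versus sample-model distinction in note #5, here is a minimal Python sketch of the dozen-dice example. The function names and structure are illustrative assumptions, not taken from the tutorial.

```python
import random
from collections import Counter

# Illustrative sketch (not from the slides): the dozen-dice example as a
# sample model versus a distribution model.

def sample_model(num_dice=12):
    """Sample model: produce one possible sum, drawn according to the
    underlying probabilities, by simply rolling the dice."""
    return sum(random.randint(1, 6) for _ in range(num_dice))

def distribution_model(num_dice=12):
    """Distribution model: produce every possible sum together with its
    probability, built by repeatedly convolving the single-die distribution."""
    dist = {0: 1.0}
    for _ in range(num_dice):
        nxt = Counter()
        for total, p in dist.items():
            for face in range(1, 7):
                nxt[total + face] += p / 6.0
        dist = dict(nxt)
    return dist

if __name__ == "__main__":
    print("one sampled sum:", sample_model())
    probs = distribution_model()
    print("P(sum = 42) = %.4f" % probs[42])
```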
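For the exploration-exploitation trade-off in note #14, below is a minimal sketch of ε-greedy action selection, one common way to balance the two. The ε value, the Q-table layout, and all names here are assumptions chosen for illustration.

```python
import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon, explore by picking a random action;
    otherwise exploit by picking the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: q_values.get(a, 0.0))        # exploit

# Example: value estimates for three actions in some state (made-up numbers).
q = {"left": 0.2, "right": 0.5, "stay": 0.1}
print("selected action:", epsilon_greedy(q, list(q.keys())))
```

With ε = 0.1 the agent mostly picks "right" (the current best estimate) but still tries the other actions about 10% of the time, which is what keeps its estimates improving.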