Reinforcement Learning ⇒ Dynamic Programming ⇒
Markov Decision Process
Subject: Machine Learning
Dr. Varun Kumar
Outline
1 Introduction to Reinforcement Learning
2 Application of Reinforcement Learning
3 Approach for Studying Reinforcement Learning
4 Basics of Dynamic Programming
5 Markov Decision Process
6 References
Introduction to reinforcement learning:
Key Features
1 There is no supervisor to guide the learning process.
2 Instead of a supervisor, there is a critic that evaluates the end outcome.
3 If the outcome is meaningful, the whole process is rewarded; otherwise, the whole process is penalized.
4 This learning process is based on reward and penalty.
5 The critic converts the primary reinforcement signal into a heuristic reinforcement signal.
6 Primary reinforcement signal → the signal observed from the environment.
7 Heuristic reinforcement signal → a higher-quality signal.
Difference between critic and supervisor
Let a complex system be described as follows.
Note
⇒ The critic does not provide a step-by-step solution.
⇒ The critic does not provide any method, training data, suitable learning system, or logical operation for making the necessary correction if the output does not reach the expected value.
⇒ It comments only on the end output, whereas a supervisor helps in many ways.
Block diagram of reinforcement learning
(Figure: block diagram of the reinforcement learning system.)
Aim of reinforcement learning
⇒ To minimize the cost-to-go function.
⇒ Cost-to-go function → the expectation of the cumulative cost of actions taken over a sequence of steps, rather than the immediate cost (see the sketch below).
⇒ Learning system: it discovers several actions and feeds them back to the environment.
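As a concrete illustration (a minimal sketch, not from the slides), the cost-to-go function can be estimated by Monte-Carlo rollout: average the cumulative cost over many simulated trajectories. The environment model `toy_step`, the discount factor, and the horizon are all assumptions introduced here for illustration; the discount keeps the cumulative cost bounded.

```python
import random

def estimate_cost_to_go(step, start_state, horizon=50, gamma=0.95, n_rollouts=1000):
    """Monte-Carlo estimate of the cost-to-go: the expected cumulative
    (discounted) cost over a sequence of steps, not just the immediate cost.
    `step(state)` is a hypothetical environment model returning (next_state, cost)."""
    total = 0.0
    for _ in range(n_rollouts):
        state, ret, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            state, cost = step(state)
            ret += discount * cost      # accumulate the discounted cost
            discount *= gamma
        total += ret
    return total / n_rollouts           # sample mean approximates the expectation

# Toy two-state environment: the action choice is folded into the random step.
def toy_step(state):
    next_state = random.choice([0, 1])
    return next_state, (1.0 if state == 1 else 0.0)   # being in state 1 costs 1

print(estimate_cost_to_go(toy_step, start_state=0))
```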
Application of reinforcement learning
Major application area
♦ Game theory.
♦ Simulation-based optimization.
♦ Operational research.
♦ Control theory.
♦ Swarm intelligence.
♦ Multi-agent systems.
♦ Information theory.
Note:
⇒ Reinforcement learning is also called approximate dynamic programming.
Approach for studying reinforcement learning
Classical approach: Learning takes place through a process of reward
and penalty with the goal of achieving highly skilled behavior.
Modern approach:
⇒ Based on a mathematical framework, such as dynamic programming.
⇒ It decides on the course of action by considering possible future stages without actually experiencing them.
⇒ It emphasizes planning.
⇒ It is a credit assignment problem.
⇒ Credit or blame for the overall outcome must be apportioned among the interacting decisions.
Dynamic programming
Basics
⇒ How can an agent/decision maker/learning system improve its long-term performance in a stochastic environment?
⇒ Attaining improved long-term performance without disrupting short-term performance.
Markov decision process (MDP)
Markov decision process (MDP):
Key features of MDP
♦ The environment is modeled through a probabilistic framework; some known probability mass function (pmf) may serve as the basis for the modeling.
♦ It consists of a finite set of discrete states.
♦ States do not carry any past statistics.
♦ Through a well-defined pmf, a set of discrete sample data is created.
♦ For each environmental state, there is a finite set of possible actions that may be taken by the agent.
♦ Every time the agent takes an action, a certain cost is incurred.
♦ States are observed, actions are taken, and costs are incurred at discrete times (a minimal sketch of these ingredients follows this list).
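To make these ingredients concrete, here is a minimal sketch of a finite MDP in Python. It is not taken from the slides; all state names, actions, probabilities, and costs are made up for illustration.

```python
# A minimal finite-MDP description (illustrative values only).
# transition[(state, action)] is a pmf over successor states;
# cost[(state, action)] is the cost incurred by taking that action.
mdp = {
    "states": [0, 1],
    "actions": {0: ["stay", "go"], 1: ["stay", "go"]},
    "transition": {
        (0, "stay"): {0: 0.9, 1: 0.1},
        (0, "go"):   {0: 0.2, 1: 0.8},
        (1, "stay"): {0: 0.1, 1: 0.9},
        (1, "go"):   {0: 0.7, 1: 0.3},
    },
    "cost": {(0, "stay"): 0.0, (0, "go"): 1.0,
             (1, "stay"): 2.0, (1, "go"): 1.0},
}

# Each transition pmf must be well defined, i.e. sum to one.
for pmf in mdp["transition"].values():
    assert abs(sum(pmf.values()) - 1.0) < 1e-9
```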
Continued–
An MDP operates in a stochastic environment; it is essentially a random process.
The decision (action) is a time-dependent random variable.
Mathematical description:
⇒ $S_i$ is the $i$th state at sample instant $n$.
⇒ $S_j$ is the next state at sample instant $n+1$.
⇒ $p_{ij}$ is known as the transition probability, $\forall\; 1 \le i \le K$ and $1 \le j \le K$:
$$p_{ij}(A_i) = P(X_{n+1} = S_j \mid X_n = S_i,\ A_n = A_i)$$
⇒ $A_i$ is the $i$th action taken by the agent at sample instant $n$ (a sketch of this sampling rule follows).
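A one-function sketch of this transition rule: given the current state $S_i$ and the chosen action $A_i$, the next state is drawn from the pmf $p_{ij}(A_i)$. The `mdp` dictionary is the hypothetical one sketched above.

```python
import random

def sample_next_state(mdp, state, action):
    """Draw X_{n+1} from p_ij(A_i) = P(X_{n+1} = S_j | X_n = S_i, A_n = A_i)."""
    pmf = mdp["transition"][(state, action)]
    states, probs = zip(*pmf.items())
    return random.choices(states, weights=probs, k=1)[0]

print(sample_next_state(mdp, 0, "go"))   # one simulated step from state 0
```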
Markov chain rule
Markov chain rule
Markov chain rule is based on the partition theorem.
Statement of the partition theorem: let $B_1, \dots, B_m$ form a partition of $\Omega$; then for any event $A$,
$$P(A) = \sum_{i=1}^{m} P(A \cap B_i) = \sum_{i=1}^{m} P(A \mid B_i)\, P(B_i)$$
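A quick numeric check of the partition theorem, with made-up probabilities for a two-block partition of $\Omega$:

```python
# Partition theorem on a toy example (numbers are made up).
P_B = [0.3, 0.7]            # P(B_1), P(B_2): a partition of Omega
P_A_given_B = [0.5, 0.2]    # P(A|B_1), P(A|B_2)

# P(A) = sum_i P(A|B_i) * P(B_i)
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
print(P_A)                  # 0.5*0.3 + 0.2*0.7 = 0.29
```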
Markov property
1 The basic property of a Markov chain is that only the most recent
point in the trajectory affects what happens next.
$$P(X_{n+1} \mid X_n, X_{n-1}, \dots, X_0) = P(X_{n+1} \mid X_n)$$
2 Transition matrix or stochastic matrix:
$$P = \begin{pmatrix}
p_{11} & p_{12} & \cdots & p_{1K} \\
p_{21} & p_{22} & \cdots & p_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
p_{K1} & p_{K2} & \cdots & p_{KK}
\end{pmatrix}$$
⇒ Each row sums to unity → $\sum_j p_{ij} = 1$.
⇒ That is, $p_{11} + p_{12} + \cdots + p_{1K} = 1$, but the column sum $p_{11} + p_{21} + \cdots + p_{K1}$ need not equal 1 (the sketch below checks this numerically).
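The row-sum condition is easy to verify numerically. A sketch with a made-up 3-state transition matrix (numbers are illustrative) shows that every row is a pmf while the columns are unconstrained:

```python
import numpy as np

# A made-up 3-state stochastic matrix: entry P[i, j] = p_ij.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

print(P.sum(axis=1))   # rows:    [1.  1.  1. ]  -> each row sums to unity
print(P.sum(axis=0))   # columns: [1.  1.3 0.7]  -> columns need not sum to one
```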
Continued–
3 $n$-step transition probability:
Statement: let $X_0, X_1, X_2, \dots$ be a Markov chain with state space $S = \{1, 2, \dots, N\}$. Recall that the elements of the transition matrix $P$ are defined as
$$p_{ij} = P(X_1 = j \mid X_0 = i) = P(X_{n+1} = j \mid X_n = i) \quad \text{for any } n.$$
⇒ $p_{ij}$ is the probability of making a transition from state $i$ to state $j$ in a single step.
Q What is the probability of making a transition from state $i$ to state $j$ over two steps? In other words, what is $P(X_2 = j \mid X_0 = i)$?
Ans $p_{ij}^{(2)} = (P^2)_{ij}$, the $(i, j)$ entry of $P^2$ (see the sketch below).
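In matrix form, the two-step probabilities are exactly the entries of $P^2$, which follows by partitioning over the intermediate state. A short check, reusing the made-up matrix `P` from the previous sketch:

```python
# Two-step transition probabilities: P(X_2 = j | X_0 = i) = (P @ P)[i, j].
P2 = P @ P
print(P2[0, 2])        # from state 0 to state 2 in two steps

# Cross-check by summing over the intermediate state k: sum_k p_{0k} * p_{k2}
print(sum(P[0, k] * P[k, 2] for k in range(3)))   # same value: 0.23
```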
References
E. Alpaydin, Introduction to Machine Learning. MIT Press, 2020.
J. Grus, Data Science from Scratch: First Principles with Python. O'Reilly Media, 2019.
T. M. Mitchell, The Discipline of Machine Learning. Carnegie Mellon University, School of Computer Science, 2006, vol. 9.
S. Haykin, Neural Networks and Learning Machines, 3/E. Pearson Education India, 2010.
