markovDP: Discrete-Time Markov Decision Processes (MDPs)

Introduction

The package markovDP (Hahsler 2024a) provides the infrastructure to work with discrete-time Markov Decision Processes (MDPs) (Bellman 1957; Howard 1960) in R. The focus is on convenience in formulating MDPs, the support of sparse representations (using sparse matrices, lists and data.frames) and the visualization of results. Some key components are implemented in C++ to speed up computation. The package also provides implementations of several popular solution methods, including value iteration, policy iteration, and tabular reinforcement learning algorithms such as Q-learning and SARSA.

The implementations closely follow the descriptions in (Russell and Norvig 2020) and (Sutton and Barto 2018). The state space is represented explicitly, so only problems with small to medium state spaces can be handled. The package is intended for working with simplified models, for teaching how the different methods work, and for experimenting with new algorithmic ideas.

markovDP provides an alternative implementation to the existing R package MDPtoolbox (Chades et al. 2017). The main difference is that markovDP has a stronger focus on visualizing models and resulting policies. It is also designed with compatibility with the package pomdp for Partially Observable Markov Decision Processes (POMDPs) (Hahsler 2024b) in mind. We also implement some reinforcement learning algorithms (e.g., Q-learning) to solve MDP models. Here, the MDP defines the simulated environment the agent interacts with. Reinforcement learning algorithm implementations can also be found in the R package ReinforcementLearning (Proellochs and Feuerriegel 2020), which focuses on model-free learning from pre-defined observations or from interaction with an environment.

In this document, we will give a very brief introduction to the concept of MDPs, describe the features of the R package markovDP, and illustrate the usage with a toy example.

Markov Decision Processes

A Markov decision process (MDP) (Bellman 1957; Howard 1960) is a discrete-time stochastic control process. In each time step, an agent can perform an action which affects the system (i.e., may cause the system state to change). The agent’s goal is to maximize the expected future rewards, which depend on the sequence of future system states and the agent’s actions. This is achieved by finding an optimal policy that guides the agent’s actions.

The MDP framework is general enough to model a variety of real-world sequential decision-making problems. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general.

Definition

A discrete-time MDP can formally be described by

  • 𝒮 = {s1, s2, …, sn} is the set of fully observable states.

  • 𝒜 = {a1, a2, …, am} is the set of actions.

  • 𝒫 is the set of conditional transition probabilities, where p(s′|s, a) is the probability of the state transition s → s′ conditioned on the taken action a.

  • r is the reward function, where r(s, a, s′) is the immediate reward received when transitioning from state s to s′ while taking action a. Rt is a random variable for the immediate reward at time t.

  • γ ∈ [0, 1] is the discount factor.

At each time step t, the environment is in some known state s ∈ 𝒮. The agent chooses an action a ∈ 𝒜, which causes the environment to transition to state s′ ∈ 𝒮 with probability p(s′ ∣ s, a), and the agent receives the immediate reward r(s, a, s′) as a realization of the random variable Rt. This process repeats for each time step t. The goal is for the agent to choose actions that maximize the expected sum of discounted future immediate rewards. We first consider the infinite time horizon case. The optimal action for an MDP depends only on the current state and can be specified as a deterministic policy π* which gives, for each state, the optimal action π*(s). This leads to the following optimization problem.

$$\pi_* = \arg\max_\pi \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t R_t \,\Big|\, \pi\right].$$

For a finite time horizon, only the expectation over the sum up to the time horizon T is used, and the optimal policy also depends on the time step t and on the start state S0, which is specified as a distribution over the states.
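
One way to write the finite-horizon objective in the same notation (a sketch of the standard formulation) is

$$\pi_* = \arg\max_\pi \mathbb{E}\left[\sum_{t=1}^{T} \gamma^t R_t \,\Big|\, \pi, S_0\right],$$

where the policy π = (π1, π2, …, πT) now contains one decision rule per time step.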

Notation Used in the Package

The notation used in the package largely follows (Sutton and Barto 2018).

Notation              Variables/Functions in Code         Description
--------------------  ----------------------------------  -----------------------------------------------------------------------
𝒮                     states, S()                         the set of states
s, s′                 s, state, start.state, end.state    a state
𝒜                     actions, A()                        the set of actions
a                     a, action                           an action
t                     t                                   discrete time step
T                     T, horizon                          final time step (length) of an episode, horizon
p(s′|s, a)            transition_matrix(), p              probability of transitioning from state s to state s′ by taking action a
r(s, a, s′)           reward_matrix(), reward, r          expected immediate reward on the transition from s to s′ under action a
γ                     discount                            discount factor
St, At, Rt                                                random variables for state, action and reward at time t
s0, S0                start, start_vector()               single start state, start distribution over states
π                     vector pi, policy                   a deterministic policy with an action for each state
π(s)                                                      prescribed action for state s in a deterministic policy
π(a|s)                                                    probability of action a in state s in a stochastic policy
v, vπ(s), v*(s)                                           a state value vector; the value of state s under policy π or under the optimal policy
V                     vector V                            the tabular estimate of vπ or v*
qπ(s, a), q*(s, a)                                        value of taking action a in state s under policy π or under the optimal policy
Q                     matrix Q                            the tabular estimate of qπ(s, a) or q*(s, a)
Bπ                    bellman_operator()                  Bellman operator
B*                    bellman_update()                    Bellman update
BE                    vector BE                           Bellman error vector (state value difference due to a Bellman update)
VE                    value_error()                       state value difference between two policies
δt                    delta                               temporal difference error
Δ                     delta                               maximum absolute Bellman error used in value iteration

Package Functionality

library("markovDP")

Solving an MDP problem with the markovDP package consists of two steps:

  1. Define an MDP problem using the function MDP(), and
  2. solve the problem using solve_MDP().

Defining an MDP Problem

The MDP() function has the following arguments, each of which corresponds to one of the elements of an MDP.

str(args(MDP))
#> function (states, actions, transition_prob, reward, discount = 0.9, horizon = Inf, 
#>     start = "uniform", info = NULL, name = NA)

where

  • states defines the set of states.

  • actions defines the set of actions.

  • transition_prob defines the conditional transition probabilities p(s′ ∣ s, a),

  • reward specifies the reward function with entries for r(s, a, s′),

  • discount is the discount factor in the range [0, 1],

  • horizon is the problem horizon as the number of periods to consider.

  • start defines in what state the problem starts. It is specified as a probability distribution over the states.

While specifying the discount factor and the sets of states and actions in code is straightforward, some arguments can be specified in different ways. The initial state start can be specified as

  • A vector of n probabilities that add up to 1, where n is the number of states.

    start <- c(0.5, 0.3, 0.2)
  • The string "uniform" for a uniform distribution over all states.

    start <- "uniform"
  • A vector of integer indices specifying a subset as start states. The initial probability is uniform over these states. For example, start only in state 3 or start in states 1 and 3:

    start <- 3
    start <- c(1, 3)
  • A vector of strings specifying a subset as equally likely start states.

    start <- "state3"
    start <- c("state1", "state3")
  • A vector of strings where the first element is "-" and the remaining elements name the states to exclude from the uniform initial probability distribution.

    start <- c("-", "state2")

The transition probabilities (transition_prob) and reward function (reward) can be specified in several ways:

  • As a data.frame created using rbind() and the helper functions T_() and R_(). This is the preferred and most sparse representation for most problems.
  • A named list of matrices representing the transition probabilities or rewards. Each list element corresponds to an action.
  • A function that takes the model as the first argument, followed by the same arguments as T_() or R_(), and returns the probability or reward.

More details can be found in the manual page for MDP().
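
As a small illustration, a two-state problem could be specified in the data.frame format roughly as follows. This is a sketch: the argument names assumed here for T_() (action, start.state, end.state, probability) and R_() (action, start.state, end.state, value) should be checked against the manual page.

# transition model: "stay" keeps the current state, "move" switches states
trans <- rbind(
  T_(action = "stay", start.state = "s1", end.state = "s1", probability = 1),
  T_(action = "stay", start.state = "s2", end.state = "s2", probability = 1),
  T_(action = "move", start.state = "s1", end.state = "s2", probability = 1),
  T_(action = "move", start.state = "s2", end.state = "s1", probability = 1)
)

# reward model: the default row matches all transitions, entering s2 pays +1
rew <- rbind(
  R_(                     value =  0),
  R_(end.state = "s2",    value = +1)
)

tiny <- MDP(states = c("s1", "s2"), actions = c("stay", "move"),
            transition_prob = trans, reward = rew,
            discount = 0.9, name = "Tiny Example")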

Solving an MDP

MDP problems are solved with the function solve_MDP() with the following arguments.

str(args(solve_MDP))
#> function (model, method = "value_iteration", ...)

The model argument is an MDP problem created using the MDP() function. The method argument specifies what algorithm the solver should use. Available methods include "value_iteration", "policy_iteration", "q_learning", "sarsa", "expected_sarsa" and several more.
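
For example, a model m created with MDP() could also be solved with policy iteration instead of the default value iteration (a minimal sketch; method-specific control parameters are passed via ... and are documented in the manual page for solve_MDP()):

# solve with policy iteration instead of the default value iteration
sol_pi <- solve_MDP(m, method = "policy_iteration")

Reinforcement learning methods like "q_learning" typically need additional method-specific parameters (e.g., the number of simulated episodes) passed through the ... argument.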

Toy Example: Stuart Russell’s 4x3 Maze Gridworld MDP

We will demonstrate how to use the package with the 4x3 Maze Gridworld described in Chapter 17 of the textbook “Artificial Intelligence: A Modern Approach” (AIMA) (Russell and Norvig 2020). The simple maze has the following layout:

    1234           Transition model:
   ######             .8 (action direction)
  1#   +#              ^
  2# # -#              |
  3#S   #         .1 <-|-> .1
   ######

We represent the maze states as a gridworld matrix with 3 rows and 4 columns. The states are labeled s(row, col) representing the position in the matrix. The # (state s(2,2)) in the middle of the maze is an obstruction and not reachable. Rewards are associated with transitions. The default reward (penalty) is -0.04 for each action taken. The start state marked with S is s(3,1). Transitioning to + (state s(1,4)) gives a reward of +1.0, and transitioning to - (state s(2,4)) gives a reward of -1.0. Both of these states are absorbing (i.e., terminal) states.

Actions are movements (up, right, down, left). The actions are unreliable with a 0.8 chance to move in the intended direction and a 0.1 chance each to instead move in one of the two perpendicular directions. This means that the maze has a stochastic transition model.

Specifying the Stochastic Maze

library("markovDP")

After loading the library, we create the states using a gridworld helper function. We first look at the state layout of a 3 × 4 maze.

gw_matrix(gw_init(dim = c(3, 4)))
#>      [,1]     [,2]     [,3]     [,4]    
#> [1,] "s(1,1)" "s(1,2)" "s(1,3)" "s(1,4)"
#> [2,] "s(2,1)" "s(2,2)" "s(2,3)" "s(2,4)"
#> [3,] "s(3,1)" "s(3,2)" "s(3,3)" "s(3,4)"

Then we initialize the grid world with the start state, the goal state, the absorbing states, and the wall as a blocked state.

gw <- gw_init(dim = c(3, 4),
        start = "s(3,1)",
        goal = "s(1,4)",
        absorbing_states = c("s(1,4)", "s(2,4)"),
        blocked_states = "s(2,2)",
        state_labels = list(
            "s(3,1)" = "Start",
            "s(2,4)" = "-1",
            "s(1,4)" = "Goal: +1")
)
gw_matrix(gw)
#>      [,1]     [,2]     [,3]     [,4]    
#> [1,] "s(1,1)" "s(1,2)" "s(1,3)" "s(1,4)"
#> [2,] "s(2,1)" NA       "s(2,3)" "s(2,4)"
#> [3,] "s(3,1)" "s(3,2)" "s(3,3)" "s(3,4)"

We see that the blocked state was automatically excluded from the state space.

The initialization function also stores a transition function for deterministic movement in gw$transition_prob, but for this problem we need to define our own stochastic transition function. The following function returns a probability vector for a given action in a given start state. The values are the probabilities of reaching each end state.

P <- function(model, action, start.state) {
  action <- match.arg(action, choices = A(model))
  
  P <- structure(numeric(length(S(model))), names = S(model))
  
  # absorbing states
  if (start.state %in% model$info$absorbing_states) {
    P[start.state] <- 1
    return(P)
  }
  
  if (action %in% c("up", "down")) {
    error_direction <- c("right", "left")
  } else {
    error_direction <- c("up", "down")
  }
  
  rc <- gw_s2rc(start.state)
  delta <- list(
    up = c(-1, 0),
    down = c(+1, 0),
    right = c(0, +1),
    left = c(0, -1)
  )
  
  # there are 3 directions. For blocked directions, stay in place
  # 1) action works .8
  rc_new <- gw_rc2s(rc + delta[[action]])
  if (rc_new %in% S(model))
    P[rc_new] <- .8
  else
    P[start.state] <- .8
  
  # 2) off to the right .1
  rc_new <- gw_rc2s(rc + delta[[error_direction[1]]])
  if (rc_new %in% S(model))
    P[rc_new] <- .1
  else
    P[start.state] <-  P[start.state] + .1
  
  # 3) off to the left .1
  rc_new <- gw_rc2s(rc + delta[[error_direction[2]]])
  if (rc_new %in% S(model))
    P[rc_new] <- .1
  else
    P[start.state] <-  P[start.state] + .1
  
  P
}

Next, we specify the reward. The most convenient way is a table. Every time a reward is calculated, the last matching row is used. We set the default reward for any transition to -0.04, which is used when no more specific row matches the start or end state. Then we set the reward for the two terminal states to +1 and -1, respectively, plus the -0.04 cost of transitioning into the state. Finally, we make sure that staying in the absorbing states does not add to the reward.

R <- rbind(
  R_(                         value = -0.04),
  R_(end.state = "s(2,4)",    value = -1 - 0.04),
  R_(end.state = "s(1,4)",    value = +1 - 0.04),
  R_(start.state = "s(2,4)",  value = 0),
  R_(start.state = "s(1,4)",  value = 0)
)

Now, we can create the complete MDP problem.

Maze <- MDP(
  name = "Stuart Russell's 3x4 Maze",
  discount = 1,
  horizon = Inf,
  states = gw$states,
  actions = gw$actions,
  start = "s(3,1)",
  transition_prob = P,
  reward = R,
  info = gw$info
)

Maze
#> MDP, list - Stuart Russell's 3x4 Maze
#>   Discount factor: 1
#>   Horizon: Inf epochs
#>   Size: 11 states / 4 actions
#>   Start: s(3,1)
#> 
#>   List components: 'name', 'discount', 'horizon', 'states', 'actions',
#>     'start', 'transition_prob', 'reward', 'info'
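
Before solving, we can call our transition function directly with the finished model as a quick sanity check (this only uses the P() function defined above):

p_up <- P(Maze, "up", "s(3,1)")
p_up
sum(p_up)  # the probabilities must sum to 1

From s(3,1), the action up should reach s(2,1) with probability 0.8 and slip right to s(3,2) with probability 0.1; the slip to the left hits the maze border, so the remaining 0.1 stays in s(3,1).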

Solving the Maze

We can solve the problem with the default solver method, which is value iteration.

sol <- solve_MDP(Maze)
sol
#> MDP, list - Stuart Russell's 3x4 Maze
#>   Discount factor: 1
#>   Horizon: Inf epochs
#>   Size: 11 states / 4 actions
#>   Start: s(3,1)
#>   Solved:
#>     Method: 'value_iteration'
#>     Solution converged: TRUE
#> 
#>   List components: 'name', 'discount', 'horizon', 'states', 'actions',
#>     'start', 'transition_prob', 'reward', 'info', 'solution'

The output is an object of class MDP which contains the solution as an additional list component. It indicates that the algorithm has converged to a stable solution. The detailed solution information is stored in $solution.

sol$solution
#> $method
#> [1] "value_iteration"
#> 
#> $policy
#> $policy[[1]]
#>    states         V action
#> 1  s(1,1) 0.8115564  right
#> 2  s(2,1) 0.7615521     up
#> 3  s(3,1) 0.7052527     up
#> 4  s(1,2) 0.8678082  right
#> 5  s(3,2) 0.6551492   left
#> 6  s(1,3) 0.9178082  right
#> 7  s(2,3) 0.6602740     up
#> 8  s(3,3) 0.6110843   left
#> 9  s(1,4) 0.0000000     up
#> 10 s(2,4) 0.0000000     up
#> 11 s(3,4) 0.3872455   left
#> 
#> 
#> $converged
#> [1] TRUE
#> 
#> $delta
#> [1] 0.0006911852
#> 
#> $iterations
#> [1] 19

The policy can be accessed using the policy() function.

policy(sol)
#>    states         V action
#> 1  s(1,1) 0.8115564  right
#> 2  s(2,1) 0.7615521     up
#> 3  s(3,1) 0.7052527     up
#> 4  s(1,2) 0.8678082  right
#> 5  s(3,2) 0.6551492   left
#> 6  s(1,3) 0.9178082  right
#> 7  s(2,3) 0.6602740     up
#> 8  s(3,3) 0.6110843   left
#> 9  s(1,4) 0.0000000     up
#> 10 s(2,4) 0.0000000     up
#> 11 s(3,4) 0.3872455   left

The policy shows, for each state, the state value (i.e., the value function) and the prescribed action. We can visualize the value function.

plot_value_function(sol)
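
Since the model is a gridworld, the policy can also be displayed in the grid layout. This is a sketch assuming that gw_matrix() supports a what argument to show the prescribed actions instead of the state names:

gw_matrix(sol, what = "actions")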

Additional Functions

The package provides several functions to work with models and policies. In the following, we will organize them into categories. More details about each function, its parameters, and examples can be found in the manual pages.

Access to Model Components

Components of the model are stored in a list and can be accessed directly.

str(sol, max.level = 2)
#> List of 10
#>  $ name           : chr "Stuart Russell's 3x4 Maze"
#>  $ discount       : num 1
#>  $ horizon        : num Inf
#>  $ states         : chr [1:11] "s(1,1)" "s(2,1)" "s(3,1)" "s(1,2)" ...
#>  $ actions        : chr [1:4] "up" "right" "down" "left"
#>  $ start          : chr "s(3,1)"
#>  $ transition_prob:List of 4
#>   ..$ up   : num [1:11, 1:11] 0.9 0.8 0 0.1 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ right: num [1:11, 1:11] 0.1 0.1 0 0 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ down : num [1:11, 1:11] 0.1 0 0 0.1 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ left : num [1:11, 1:11] 0.9 0.1 0 0.8 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>  $ reward         :List of 4
#>   ..$ up   : num [1:11, 1:11] -0.04 -0.04 0 -0.04 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ right: num [1:11, 1:11] -0.04 -0.04 0 0 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ down : num [1:11, 1:11] -0.04 0 0 -0.04 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>   ..$ left : num [1:11, 1:11] -0.04 -0.04 0 -0.04 0 0 0 0 0 0 ...
#>   .. ..- attr(*, "dimnames")=List of 2
#>  $ info           :List of 6
#>   ..$ gridworld       : logi TRUE
#>   ..$ dim             : num [1:2] 3 4
#>   ..$ start           : chr "s(3,1)"
#>   ..$ goal            : chr "s(1,4)"
#>   ..$ absorbing_states: chr [1:2] "s(1,4)" "s(2,4)"
#>   ..$ state_labels    :List of 3
#>  $ solution       :List of 5
#>   ..$ method    : chr "value_iteration"
#>   ..$ policy    :List of 1
#>   ..$ converged : logi TRUE
#>   ..$ delta     : num 0.000691
#>   ..$ iterations: int 19
#>  - attr(*, "class")= chr [1:2] "MDP" "list"

The special list element "solution" is only available when the model already contains a policy.

To access components more easily, several accessor functions are available:

  • S() the set of states.
  • A() the set of actions.
  • actions() finds the available actions for a state.
  • reward_matrix() accesses the reward structure.
  • start_vector() accesses the initial state probabilities.
  • transition_matrix() accesses the transition probabilities.
  • transition_graph() converts the transition matrix into a graph for visualization.
  • policy() extracts the policy from a solved model.

These functions can efficiently retrieve individual values and convert the components into sparse and dense representations.
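
A few accessor calls applied to the solved maze are sketched below (the action argument name for transition_matrix() is an assumption):

S(sol)                                 # the set of states
A(sol)                                 # the set of actions
start_vector(sol)                      # initial state probabilities
transition_matrix(sol, action = "up")  # transition matrix for one action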

Value Function

  • value_function() extracts the value function from a solved MDP.
  • q_values() calculates (approximate) Q-values for a given model and value function.
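
For example, Q-values for the solved maze can be computed and turned back into a greedy policy (a sketch; whether q_values() defaults to the value function stored in the solution is an assumption):

Q <- q_values(sol)  # assumed to use the value function stored in the solution
greedy_policy(Q)    # should reproduce the policy found by value iteration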

Policy

Policies in this package are deterministic policies with one prescribed action per state. They are represented as a data.frame with columns for:

  • state: The state.
  • V: The state’s value (discounted expected utility U) if the policy is followed.
  • action: The prescribed action.

Policies are typically created using solve_MDP() and then stored in the "solution" component of the returned model. Policies can also be created by the following functions:

  • random_policy() creates a random policy.
  • manual_policy() specifies a policy data.frame manually.
  • greedy_policy() generates a greedy policy using Q-values.

A policy can be added to a model for use in other functions using add_policy().

The action prescribed by a model with a policy can be calculated using action(). From a matrix with Q-values, greedy_action() can be used to find the best action.

The value function for a policy applied to a model can be estimated using policy_evaluation().
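
As a sketch (the argument order assumed for policy_evaluation() is model first, then the policy), a random policy for the maze can be created and evaluated like this:

pi_random <- random_policy(Maze)
policy_evaluation(Maze, pi_random)   # state values under the random policy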

Evaluation

MDP policies can be evaluated using:

  • reward() calculates the expected reward of a policy.
  • regret() calculates the regret of a policy relative to a benchmark policy.
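
A sketch of both evaluations for the maze example is shown below; the argument orders and the benchmark argument name for regret() are assumptions:

reward(sol)   # expected reward of the solved policy from the start state

# regret of a random policy relative to the solved policy
sol_random <- add_policy(Maze, random_policy(Maze))
regret(sol_random, benchmark = sol)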

Sampling

Trajectories through an MDP are created using sample_MDP(). The outcome of a single action can be calculated with act().
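
A short sketch of both functions for the solved maze (the argument names n and horizon for sample_MDP() and the argument order for act() are assumptions):

# simulate 10 episodes of at most 100 steps under the solved policy
sim <- sample_MDP(sol, n = 10, horizon = 100)

# outcome (end state and reward) of taking a single action in a state
act(sol, "s(3,1)", "up")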

Acknowledgments

Development of this package was supported in part by the National Institute of Standards and Technology (NIST) under grant number 60NANB17D180.

References

Bellman, Richard. 1957. “A Markovian Decision Process.” Indiana University Mathematics Journal 6: 679–84. https://www.jstor.org/stable/24900506.
Chades, Iadine, Guillaume Chapron, Marie-Josee Cros, Frederick Garcia, and Regis Sabbadin. 2017. MDPtoolbox: Markov Decision Processes Toolbox. https://doi.org/10.32614/CRAN.package.MDPtoolbox.
Hahsler, Michael. 2024a. markovDP: Infrastructure for Discrete-Time Markov Decision Processes (MDP). https://github.com/mhahsler/markovDP.
———. 2024b. Pomdp: Infrastructure for Partially Observable Markov Decision Processes (POMDP). https://doi.org/10.32614/CRAN.package.pomdp.
Howard, R. A. 1960. Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press.
Li, Lihong, and Michael Littman. 2008. “Prioritized Sweeping Converges to the Optimal Value Function.” DCS-TR-631. Rutgers University. https://doi.org/10.7282/T3TX3JSX.
Manne, Alan. 1960. “On the Job-Shop Scheduling Problem.” Operations Research 8 (2): 219–23. https://doi.org/10.1287/opre.8.2.219.
Moore, Andrew, and C. G. Atkeson. 1993. “Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time.” Machine Learning 13 (1): 103–30. https://doi.org/10.1007/BF00993104.
Proellochs, Nicolas, and Stefan Feuerriegel. 2020. ReinforcementLearning: Model-Free Reinforcement Learning. https://CRAN.R-project.org/package=ReinforcementLearning.
Puterman, Martin L., and Moon Chirl Shin. 1978. “Modified Policy Iteration Algorithms for Discounted Markov Decision Problems.” Management Science 24: 1127–37. https://doi.org/10.1287/mnsc.24.11.1127.
Rummery, G., and Mahesan Niranjan. 1994. “On-Line Q-Learning Using Connectionist Systems.” Techreport CUED/F-INFENG/TR 166. Cambridge University Engineering Department.
Russell, Stuart J., and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach (4th Edition). Pearson. http://aima.cs.berkeley.edu/.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. Second. The MIT Press. http://incompleteideas.net/book/the-book-2nd.html.
Watkins, Christopher J. C. H., and Peter Dayan. 1992. “Q-Learning.” Machine Learning 8 (3): 279–92. https://doi.org/10.1007/BF00992698.