---
title: "markovDP: Discrete-Time Markov Decision Processes (MDPs)"
author: "Michael Hahsler"
bibliography: references.bib
link-citations: yes
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{markovDP: Discrete-Time Markov Decision Processes (MDPs)}
  %\VignetteEngine{knitr::rmarkdown}
output: rmarkdown::html_vignette
editor_options:
  markdown:
    wrap: 72
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

The package **markovDP** [@CRAN_markovDP] provides the infrastructure to work
with discrete-time Markov Decision Processes (MDPs) [@Bellman1957; @Howard1960]
in R. The focus is on convenience in formulating MDPs, the support of sparse
representations (using sparse matrices, lists and data.frames), and the
visualization of results. Some key components are implemented in C++ to speed
up computation. The package provides the following popular solving procedures:

- **Dynamic Programming**
    - Value Iteration [@Bellman1957]
    - Modified Policy Iteration [@Howard1960; @Puterman1978]
    - Prioritized Sweeping [@Moore1993; @Li2008]
- **Linear Programming**
    - Primal Formulation [@Manne1960]
- **Monte Carlo Control**
    - Monte Carlo Control with Exploring Starts [@Sutton1998]
    - On-policy Monte Carlo Control [@Sutton1998]
    - Off-policy Monte Carlo Control [@Sutton1998]
- **Temporal Differencing**
    - Q-Learning [@Watkins1992]
    - Sarsa [@Rummery1994]
    - Expected Sarsa [@Sutton1998]
- **Sampling**
    - Random-sample one-step tabular Q-planning [@Sutton1998]

The implementations closely follow the descriptions in [@Russell2020] and
[@Sutton1998]. The implementations represent the state space explicitly, so
only problems with small to medium state spaces can be handled. The package is
intended for working with simplified models, for teaching how the different
methods work, and for experimenting with new algorithmic ideas.

**markovDP** provides an alternative implementation to the existing R package
**MDPToolbox** [@CRAN_MDPtoolbox]. The main difference is that **markovDP** has
a stronger focus on visualizing models and resulting policies. It is also
designed with compatibility with the package **pomdp** for Partially Observable
Markov Decision Processes [@CRAN_pomdp] in mind. We also implement some
reinforcement learning algorithms (e.g., Q-learning) to solve MDP models. Here
the MDP defines the simulated environment the agent interacts with.
Reinforcement learning algorithm implementations can also be found in the R
package **ReinforcementLearning** [@Proellochs2020], which focuses on
model-free learning from pre-defined observations or from interaction with an
environment.

In this document, we give a brief introduction to the concept of MDPs, describe
the features of the R package **markovDP**, and illustrate the usage with a toy
example.

# Markov Decision Processes

A Markov decision process (MDP) [@Bellman1957; @Howard1960] is a discrete-time
stochastic control process. In each time step, an agent can perform an action
which affects the system (i.e., may cause the system state to change). The
agent's goal is to maximize its expected future rewards, which depend on the
sequence of system states and the agent's actions. This goal is achieved by
finding the optimal policy that guides the agent's actions. The MDP framework
is general enough to model a variety of real-world sequential decision-making
problems. Applications include robot navigation problems, machine maintenance,
and planning under uncertainty in general.
## Definition

A discrete-time MDP can formally be described by

- $\mathcal{S} = \{s_1, s_2, \dots, s_n\}$ is the set of fully observable
  states.
- $\mathcal{A} = \{a_1, a_2, \dots, a_m\}$ is the set of actions.
- $\mathcal{P}$ is the set of conditional transition probabilities.
  $p(s'|s,a)$ is the probability for the state transition
  $s \rightarrow s'$ conditioned on the taken action $a$.
- $\mathcal{R}$ is the set of all possible rewards. The reward is defined by a
  reward function $r(s, a, s')$ returning the immediate reward received when
  transitioning from state $s$ to $s'$ while using action $a$. $R_t$ is a
  random variable for the immediate reward at time $t$.
- $\gamma \in [0, 1]$ is the discount factor.

At each time step $t$, the environment is in some known state
$s \in \mathcal{S}$. The agent chooses an action $a \in \mathcal{A}$, which
causes the environment to transition to state $s' \in \mathcal{S}$ with
probability $p(s' \mid s,a)$, and the agent receives the immediate reward
$r(s,a,s')$ as a realization of the random variable $R_t$. This process
repeats for each time step $t$. The goal is for the agent to choose actions
that maximize the expected sum of discounted future immediate rewards. We
first consider the infinite time horizon case. The optimal action for an MDP
depends only on the current state and can be specified as a deterministic
policy $\pi_*$ which gives for each state the optimal action $\pi_*(s)$. This
leads to the following optimization problem.

$$\pi_* = \arg\max_\pi \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\middle|\, \pi\right].$$

For a finite time horizon, the expectation is only taken over the sum up to
the time horizon $T$, the optimal policy also depends on the time step $t$,
and the start state $S_0$ is specified as a distribution over the states.
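The dynamic programming methods listed in the introduction build on the
Bellman optimality equation, which expresses the optimal value of a state in
terms of the optimal values of its successor states:

$$v_*(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} p(s' \mid s, a) \left[ r(s, a, s') + \gamma\, v_*(s') \right].$$

Value iteration, for example, repeatedly applies the right-hand side as an
update to a value estimate (the Bellman update $B_*$ in the notation table
below) until the largest absolute change $\Delta$ falls below a small
threshold.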
## Notation Used in the Package

The notation used in the package largely follows [@Sutton1998].

| Notation | Variables/Functions in Code | Description |
|------|------|------|
| $\mathcal{S}$ | `states`, `S()` | the set of states |
| $s, s'$ | `s`, `state`, `start.state`, `end.state` | a state |
| $\mathcal{A}$ | `actions`, `A()` | the set of actions |
| $a$ | `a`, `action` | an action |
| $t$ | `t` | discrete time step |
| $T$ | `T`, `horizon` | final time step (length) of an episode, horizon |
| $p(s' \mid s, a)$ | `transition_matrix()`, `p` | probability of transition from state $s$ to state $s'$ by taking action $a$ |
| $r(s,a,s')$ | `reward_matrix()`, `reward`, `r` | expected immediate reward on transition from $s$ to $s'$ under action $a$ |
| $\gamma$ | `discount` | discount factor |
| $S_t, A_t, R_t$ | | random variables for state, action and reward at time $t$ |
| $s_0, S_0$ | `start`, `start_vector()` | single start state, start distribution over states |
| $\pi$ | vector `pi`, `policy` | a deterministic policy with an action for each state |
| $\pi(s)$ | | prescribed action for state $s$ in a deterministic policy |
| $v_\pi(s), v_*(s)$ | | value of state $s$ under policy $\pi$ or under the optimal policy |
| $V$ | vector `V` | the tabular estimate of $v_\pi(s)$ or $v_*(s)$ |
| $q_\pi(s,a), q_*(s,a)$ | | value of taking action $a$ in state $s$ under policy $\pi$ or the optimal policy |
| $Q$ | matrix `Q` | the tabular estimate of $q_\pi(s,a)$ or $q_*(s,a)$ |
| $B_\pi$ | `bellman_operator()` | Bellman operator |
| $B_*$ | `bellman_update()` | Bellman update |
| $\text{BE}$ | vector `BE` | Bellman error vector (state value difference due to a Bellman update) |
| $\text{VE}$ | `value_error()` | state value difference between two policies |
| $\delta_t$ | `delta` | temporal difference error |
| $\Delta$ | `delta` | max. absolute Bellman error used in value iteration |

# Package Functionality

```{r setup}
library("markovDP")
```

Solving an MDP problem with the **markovDP** package consists of two steps:

1. Define an MDP problem using the function `MDP()`, and
2. solve the problem using `solve_MDP()`.

## Defining an MDP Problem

The `MDP()` function has the following arguments, each corresponding to one of
the elements of an MDP.

```{r}
str(args(MDP))
```

where

- `states` defines the set of states.
- `actions` defines the set of actions.
- `transition_prob` defines the conditional transition probabilities
  $p(s' \mid s,a)$.
- `reward` specifies the reward function with entries for $r(s, a, s')$.
- `discount` is the discount factor in the range $[0,1]$.
- `horizon` is the problem horizon as the number of periods to consider.
- `start` defines in what state the problem starts. It is specified as a
  probability distribution over the states.

While specifying the discount factor and the sets of states and actions in
code is straightforward, some arguments can be specified in different ways.
The initial state `start` can be specified as

- A vector of $n$ probabilities that add up to 1, where $n$ is the number of
  states.

```{r, eval = FALSE}
start <- c(0.5, 0.3, 0.2)
```

- The string `"uniform"` for a uniform distribution over all states.

```{r, eval = FALSE}
start <- "uniform"
```

- A vector of integer indices specifying a subset as start states. The initial
  probability is uniform over these states. For example, start only in state 3
  or start in state 1 and 3:

```{r, eval = FALSE}
start <- 3
start <- c(1, 3)
```

- A vector of strings specifying a subset as equally likely start states.

```{r, eval = FALSE}
start <- "state3"
start <- c("state1", "state3")
```

- A vector of strings starting with `"-"` specifying which states to exclude
  from the uniform initial probability distribution.

```{r, eval = FALSE}
start <- c("-", "state2")
```

The transition probabilities (`transition_prob`) and the reward function
(`reward`) can be specified in several ways:

- As a `data.frame` created using `rbind()` and the helper functions `T_()`
  and `R_()`. This is the preferred and most sparse representation for most
  problems (see the sketch below).
- As a named list of matrices representing the transition probabilities or
  rewards. Each list element corresponds to one action.
- As a function with the model as the first argument, followed by the same
  arguments as `T_()` or `R_()`, that returns the probability or reward.

More details can be found in the manual page for `MDP()`.
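To make the `data.frame` representation concrete, here is a minimal sketch for
a small two-state problem. It assumes that `T_()` takes `action`,
`start.state`, `end.state`, and `probability` arguments, mirroring the `R_()`
calls used in the maze example below; see the manual pages for the exact
signatures.

```{r, eval = FALSE}
# hedged sketch: transitions and rewards specified as data.frames
trans <- rbind(
  T_(action = "go", start.state = "s1", end.state = "s2", probability = 0.9),
  T_(action = "go", start.state = "s1", end.state = "s1", probability = 0.1),
  T_(action = "go", start.state = "s2", end.state = "s2", probability = 1)
)

rew <- rbind(
  R_(value = -1),                  # default reward for every transition
  R_(end.state = "s2", value = 10) # overrides the default when s2 is reached
)
```

Arguments that are left unspecified act as wildcards that match any action or
state (this is how the default reward row works in the maze example below),
which is what keeps this representation sparse.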
## Solving an MDP

MDP problems are solved with the function `solve_MDP()` with the following
arguments.

```{r}
str(args(solve_MDP))
```

The `model` argument is an MDP problem created using the `MDP()` function. The
`method` argument specifies which algorithm the solver should use. Available
methods include `"value_iteration"`, `"policy_iteration"`, `"q_learning"`,
`"sarsa"`, `"expected_sarsa"`, and several more.

# Toy Example: Stuart Russell's 4x3 Maze Gridworld MDP

We will demonstrate how to use the package with the 4x3 Maze Gridworld
described in Chapter 17 of the textbook "Artificial Intelligence: A Modern
Approach" (AIMA) [@Russell2020]. The simple maze has the following layout:

```
    1234           Transition model:
   ######             .8 (action direction)
  1#   +#              ^
  2# # -#              |
  3#S   #         .1 <-|-> .1
   ######
```

We represent the maze states as a gridworld matrix with 3 rows and 4 columns.
The states are labeled `s(row, col)` representing the position in the matrix.
The `#` (state `s(2,2)`) in the middle of the maze is an obstruction and not
reachable. Rewards are associated with transitions. The default reward
(penalty) is -0.04 for each action taken. The start state marked with `S` is
`s(3,1)`. Transitioning to `+` (state `s(1,4)`) gives a reward of +1.0, and
transitioning to `-` (state `s(2,4)`) gives a reward of -1.0. Both of these
states are absorbing (i.e., terminal) states.

Actions are movements (up, right, down, left). The actions are unreliable with
a 0.8 chance to move in the intended direction and a 0.1 chance to instead
move in each of the two perpendicular directions. This means that the maze has
a stochastic transition model.

## Specifying the Stochastic Maze

```{r}
library("markovDP")
```

After loading the library, we create the states using a gridworld helper
function. We first look at the state layout of a $3 \times 4$ maze.

```{r }
gw_matrix(gw_init(dim = c(3, 4)))
```

Then we initialize the gridworld with the start state, the goal state, the
absorbing states, and the wall as an unreachable state.

```{r }
gw <- gw_init(dim = c(3, 4),
              start = "s(3,1)",
              goal = "s(1,4)",
              absorbing_states = c("s(1,4)", "s(2,4)"),
              unreachable_states = "s(2,2)",
              state_labels = list(
                "s(3,1)" = "Start",
                "s(2,4)" = "-1",
                "s(1,4)" = "Goal: +1")
)
gw_matrix(gw)
```

We see that the unreachable state was automatically excluded from the state
space.

The initialization function also creates a transition function for
deterministic movement in the gridworld in `gw$transition_prob`, but we need
to create our own stochastic transition function.
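The transition function defined next relies on the gridworld helpers
`gw_s2rc()` and `gw_rc2s()` to convert between state labels and row/column
coordinates. The following chunk is a small sketch of how we assume these
helpers behave, based on how they are used in the transition function (not
run):

```{r, eval = FALSE}
# assumed behavior of the coordinate conversion helpers
gw_s2rc("s(3,1)")   # state label -> row/column vector, expected c(3, 1)
gw_rc2s(c(3, 1))    # row/column vector -> state label, expected "s(3,1)"
```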
The following function returns a probability vector for a given action in a
given start state. The values are the probabilities of reaching each end
state.

```{r }
P <- function(model, action, start.state) {
  action <- match.arg(action, choices = A(model))

  P <- structure(numeric(length(S(model))), names = S(model))

  # absorbing states transition to themselves with probability 1
  if (start.state %in% model$info$absorbing_states) {
    P[start.state] <- 1
    return(P)
  }

  if (action %in% c("up", "down")) {
    error_direction <- c("right", "left")
  } else {
    error_direction <- c("up", "down")
  }

  rc <- gw_s2rc(start.state)
  delta <- list(
    up = c(-1, 0),
    down = c(+1, 0),
    right = c(0, +1),
    left = c(0, -1)
  )

  # there are 3 possible directions. For blocked directions, stay in place
  # 1) intended direction (probability .8)
  rc_new <- gw_rc2s(rc + delta[[action]])
  if (rc_new %in% S(model)) {
    P[rc_new] <- .8
  } else {
    P[start.state] <- .8
  }

  # 2) slip to the first perpendicular direction (probability .1)
  rc_new <- gw_rc2s(rc + delta[[error_direction[1]]])
  if (rc_new %in% S(model)) {
    P[rc_new] <- .1
  } else {
    P[start.state] <- P[start.state] + .1
  }

  # 3) slip to the second perpendicular direction (probability .1)
  rc_new <- gw_rc2s(rc + delta[[error_direction[2]]])
  if (rc_new %in% S(model)) {
    P[rc_new] <- .1
  } else {
    P[start.state] <- P[start.state] + .1
  }

  P
}
```

Next, we specify the rewards. The most convenient way is a table. Every time a
reward is calculated, the last matching row will be used. We set the default
reward for any transition to -0.04, which is used if no other row matches the
start or end state. Then we set the reward for reaching the terminal states to
+1 and -1, respectively, combined with the -0.04 transition cost. Finally, we
make sure that staying in the absorbing states does not add to the reward.

```{r }
R <- rbind(
  R_(                        value = -0.04),
  R_(end.state = "s(2,4)",   value = -1 - 0.04),
  R_(end.state = "s(1,4)",   value = +1 - 0.04),
  R_(start.state = "s(2,4)", value = 0),
  R_(start.state = "s(1,4)", value = 0)
)
```

Now we can create the complete MDP problem.

```{r }
Maze <- MDP(
  name = "Stuart Russell's 3x4 Maze",
  discount = 1,
  horizon = Inf,
  states = gw$states,
  actions = gw$actions,
  start = "s(3,1)",
  transition_prob = P,
  reward = R,
  info = gw$info
)
Maze
```

## Solving the Maze

We can solve the problem with the default solver method, which is value
iteration.

```{r}
sol <- solve_MDP(Maze)
sol
```

The output is an object of class MDP which contains the solution as an
additional list component. It indicates that the algorithm has converged to a
stable solution. The detailed solution information is stored in `$solution`.

```{r}
sol$solution
```

The policy can be accessed using the `policy()` function.

```{r}
policy(sol)
```

The policy shows for each state its value (i.e., the value function) and the
prescribed action. We can visualize the value function.

```{r}
plot_value_function(sol)
```

# Additional Functions

The package provides several functions to work with models and policies. In
the following, we organize them into categories. More details about each
function, its parameters, and examples can be found in the manual pages.

## Access to Model Components

Components of the model are stored in a list and can be accessed directly.

```{r }
str(sol, max.level = 2)
```

The special list element `"solution"` is only available when the model already
contains a policy. To access components more easily, several accessor
functions are available:

- `S()` returns the set of states.
- `A()` returns the set of actions.
- `actions()` finds the available actions for a state.
- `reward_matrix()` accesses the reward structure.
- `start_vector()` accesses the initial state probabilities.
- `transition_matrix()` accesses the transition probabilities.
- `transition_graph()` converts the transition matrix into a graph for
  visualization.
- `policy()` extracts the policy for a solved model.

These functions can efficiently retrieve individual values and convert the
components into sparse and dense representations.
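As a quick, hedged sketch of how these accessors are typically called on the
solved maze model (the `action` argument for `transition_matrix()` is an
assumption; see the manual pages for the exact signatures):

```{r, eval = FALSE}
# not run: a few accessor calls on the solved maze model
S(sol)                                  # the set of states
A(sol)                                  # the set of actions
start_vector(sol)                       # initial state probabilities
transition_matrix(sol, action = "up")   # transition probabilities for one action
reward_matrix(sol)                      # the reward structure
policy(sol)                             # the policy stored in the solution
```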
## Value Function

- `value_function()` extracts the value function from a solved MDP.
- `q_values()` calculates (approximate) Q-values for a given model and value
  function.

## Policy

Policies in this package are deterministic policies with one prescribed action
per state. They are represented as a data.frame with columns for:

- `state`: The state.
- `V`: The state's value (discounted expected utility U) if the policy is
  followed.
- `action`: The prescribed action.

Policies are typically created using `solve_MDP()` and are then stored in the
`"solution"` component of the returned model. Policies can also be created by
the following functions:

- `random_policy()` creates a random policy.
- `manual_policy()` specifies a policy data.frame manually.
- `greedy_policy()` generates a greedy policy using Q-values.

A policy can be added to a model for use in other functions using
`add_policy()`. The action prescribed by a model with a policy can be
calculated using `action()`. From a matrix with Q-values, `greedy_action()`
can be used to find the best action. The value function for a policy applied
to a model can be estimated using `policy_evaluation()`.

## Evaluation

MDP policies can be evaluated using:

- `reward()` calculates the expected reward of a policy.
- `regret()` calculates the regret of a policy relative to a benchmark policy.

## Sampling

Trajectories through an MDP are created using `sample_MDP()`. The outcome of
single actions can be calculated by `act()`.

## Acknowledgments

Development of this package was supported in part by the National Institute of
Standards and Technology (NIST) under grant number
[60NANB17D180](https://www.nist.gov/ctl/pscr/safe-net-integrated-connected-vehicle-computing-platform).

# References