Title:  Infrastructure for DiscreteTime Markov Decision Processes (MDP) 

Description:  Provides the infrastructure to work with Markov Decision Processes (MDPs) in R. The focus is on convenience in formulating MDPs, the support of sparse representations (using sparse matrices, lists and data.frames) and visualization of results. Some key components are implemented in C++ to speed up computation. Several popular solvers are implemented. 
Authors:  Michael Hahsler [aut, cph, cre] 
Maintainer:  Michael Hahsler <[email protected]> 
License:  GPL (>=3) 
Version:  0.99.0 
Built:  20241010 20:15:58 UTC 
Source:  https://github.com/mhahsler/markovDP 
Functions to provide uniform access to different parts of the MDP problem description.
normalize_MDP( model, transition_prob = TRUE, reward = TRUE, start = FALSE, sparse = NULL, precompute_absorbing = TRUE, precompute_unreachable = TRUE, check_and_fix = FALSE, progress = TRUE ) start_vector(model, start = NULL, sparse = NULL) reward_matrix( model, action = NULL, start.state = NULL, end.state = NULL, ..., sparse = NULL, simplify = FALSE ) transition_matrix( model, action = NULL, start.state = NULL, end.state = NULL, ..., sparse = NULL, simplify = FALSE, trans_keyword = TRUE )
normalize_MDP( model, transition_prob = TRUE, reward = TRUE, start = FALSE, sparse = NULL, precompute_absorbing = TRUE, precompute_unreachable = TRUE, check_and_fix = FALSE, progress = TRUE ) start_vector(model, start = NULL, sparse = NULL) reward_matrix( model, action = NULL, start.state = NULL, end.state = NULL, ..., sparse = NULL, simplify = FALSE ) transition_matrix( model, action = NULL, start.state = NULL, end.state = NULL, ..., sparse = NULL, simplify = FALSE, trans_keyword = TRUE )
model 
A MDP object. 
transition_prob 
logical; convert the transition probabilities into a list of matrices. 
reward 
logical; convert the reward model into a list of matrices. 
start 
a start state description (see MDP). If 
sparse 
logical; use sparse matrix representation? 
precompute_absorbing 
logical; should absorbing states be precalculated? 
precompute_unreachable 
logical; should unreachable states be precalculated? 
check_and_fix 
logical; checks the structure of the problem description. 
progress 
logical; show a progress bar with estimated time for completion. 
action 
name or index of an action. 
start.state , end.state

name or index of the state. 
... 
further arguments are passed on. 
simplify 
logical; try to simplify action lists into a vector or matrix? 
trans_keyword 
logical; translate keywords like "uniform" into matrices. 
Several parts of the MDP description can be defined in different ways. In particular,
the fields transition_prob
, reward
, and start
can be defined using matrices, data frames,
keywords, or functions. See MDP for details.
The functions provided here, provide unified access to the data in these fields
to make writing code easier.
$T(s's,a)$
transition_matrix()
accesses the transition model. The complete model
is a list with one element for each action. Each element contains a states x states matrix
with $s$
(start.state
) as rows and $s'$
(end.state
) as columns.
Matrices with a density below 50% can be requested in sparse format
(as a Matrix::dgCMatrix).
$R(s,s',a)$
reward_matrix()
accesses the reward model.
The preferred representation is a data.frame with the
columns action
, start.state
, end.state
,
and value
. This is a sparse representation.
The dense representation is a list of lists of matrices.
The list levels are $a$
(action
) and $s$
(start.state
).
The matrices are column vectors with rows representing $s'$
(end.state
).
The accessor converts the column vectors automatically into matrices with
start states as rows and end states as columns. This conversion can be suppressed
by calling reward_matrix(..., state_matrix = FALSE)
Note that the reward structure cannot be efficiently stored using a standard sparse matrix
since there might be a fixed cost for each action
resulting in no entries with 0.
start_vector()
translates the start state description into a probability vector.
Different components can be specified in various ways. It is often
necessary to convert each component into a specific form (e.g., a
dense matrix) to save time during access.
Convert the Complete MDP Description into a consistent form
normalize_MDP()
converts all components of the MDP description
into a consistent form and
returns a new MDP definition where transition_prob
,
reward
, and start
are normalized. This includes the internal
representation (dense, sparse, as a data.frame) and
also, states
, and actions
are ordered as given in the problem
definition to make safe access using numerical indices possible. Normalized
MDP descriptions can be
used in custom code that expects consistently a certain format.
The default behavior of sparse = NULL
uses parse matrices for large models
where the dense transition model would need more
than options("MDP_SPARSE_LIMIT")
(the default is about 100 MB which
can be changed using
options()
). Smaller models use faster dense
matrices.
A list or a list of lists of matrices.
Michael Hahsler
Other MDP:
MDP()
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
data("Maze") gw_matrix(Maze) # here is the internal structure of the Maze object str(Maze) # List of A transition matrices. One per action in the from start.states x end.states Maze$transition_prob transition_matrix(Maze) transition_matrix(Maze, action = "up", sparse = FALSE) transition_matrix(Maze, action = "up", start.state = "s(3,1)", end.state = "s(2,1)" ) # List of list of reward matrices. 1st level is action and second level is the # start state in the form of a column vector with elements for end states. Maze$reward reward_matrix(Maze) reward_matrix(Maze, sparse = TRUE) reward_matrix(Maze, action = "up", start.state = "s(3,1)", end.state = "s(2,1)" ) # Translate the initial start probability vector Maze$start start_vector(Maze, sparse = FALSE) start_vector(Maze, sparse = "states") start_vector(Maze, sparse = "index") # Normalize the whole model using sparse representation Maze_norm < normalize_MDP(Maze, sparse = TRUE) str(Maze_norm)
data("Maze") gw_matrix(Maze) # here is the internal structure of the Maze object str(Maze) # List of A transition matrices. One per action in the from start.states x end.states Maze$transition_prob transition_matrix(Maze) transition_matrix(Maze, action = "up", sparse = FALSE) transition_matrix(Maze, action = "up", start.state = "s(3,1)", end.state = "s(2,1)" ) # List of list of reward matrices. 1st level is action and second level is the # start state in the form of a column vector with elements for end states. Maze$reward reward_matrix(Maze) reward_matrix(Maze, sparse = TRUE) reward_matrix(Maze, action = "up", start.state = "s(3,1)", end.state = "s(2,1)" ) # Translate the initial start probability vector Maze$start start_vector(Maze, sparse = FALSE) start_vector(Maze, sparse = "states") start_vector(Maze, sparse = "index") # Normalize the whole model using sparse representation Maze_norm < normalize_MDP(Maze, sparse = TRUE) str(Maze_norm)
Performs an action in a state and returns the new state and reward.
act(model, state, action = NULL, ...)
act(model, state, action = NULL, ...)
model 
an MDP model. 
state 
the current state. 
action 
the chosen action. If the action is not specified ( 
... 
if action is unspecified, then the additional parameters are
passed on to 
a names list with the old_state
, the action
,
the next reward
and state
.
Michael Hahsler
Other MDP:
MDP()
,
accessors
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
data(Maze) act(Maze, "s(1,3)", "right") # solve the maze and then ask for actions using the policy sol < solve_MDP(Maze) act(sol, "s(1,3)") # make the policy in sol epsilonsoft and ask 10 times for the action replicate(10, act(sol, "s(1,3)", epsilon = .2))
data(Maze) act(Maze, "s(1,3)", "right") # solve the maze and then ask for actions using the policy sol < solve_MDP(Maze) act(sol, "s(1,3)") # make the policy in sol epsilonsoft and ask 10 times for the action replicate(10, act(sol, "s(1,3)", epsilon = .2))
Returns an action given a deterministic policy. The policy can be made epsilonsoft.
action(model, ...) ## S3 method for class 'MDP' action(model, state, epsilon = 0, epoch = 1, ...)
action(model, ...) ## S3 method for class 'MDP' action(model, state, epsilon = 0, epoch = 1, ...)
model 
a solved MDP. 
... 
further parameters are passed on. 
state 
the state. 
epsilon 
make the policy epsilon soft. 
epoch 
what epoch of the policy should be used. Use 1 for converged policies. 
The name of the optimal action as a factor.
Michael Hahsler
Other policy:
add_policy()
,
bellman_update()
,
policy()
,
policy_evaluation()
,
q_values()
,
reward()
,
value_function()
data("Maze") Maze sol < solve_MDP(Maze) policy(sol) action(sol, state = "s(1,3)") ## choose from an epsilonsoft policy table(replicate(100, action(sol, state = "s(1,3)", epsilon = 0.1)))
data("Maze") Maze sol < solve_MDP(Maze) policy(sol) action(sol, state = "s(1,3)") ## choose from an epsilonsoft policy table(replicate(100, action(sol, state = "s(1,3)", epsilon = 0.1)))
Add a policy to an MDP problem description allows the user to test policies on modified problem descriptions or to test manually created policies.
add_policy(model, policy)
add_policy(model, policy)
model 
a MDP model description. 
policy 
a policy data.frame. 
The new policy needs to be a data.frame with one row for each state in the order the states are defined in the model. The only required column is
action
: the action prescribed in the state corresponding to the row.
Optional columns are
state
: the state names in the order of the states in the model.
The needed names can be obtained by from the $states
element of the model.
U
: with the utility given by the value function for the state.
The model description with the added policy.
Michael Hahsler
Other MDP:
MDP()
,
accessors
,
act()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
Other policy:
action()
,
bellman_update()
,
policy()
,
policy_evaluation()
,
q_values()
,
reward()
,
value_function()
data(Maze) sol < solve_MDP(Maze) sol policy(sol) reward(sol) # Add a random policy random_pol < random_policy(Maze) random_pol sol_random < add_policy(Maze, random_pol) policy(sol_random) reward(sol_random)
data(Maze) sol < solve_MDP(Maze) sol policy(sol) reward(sol) # Add a random policy random_pol < random_policy(Maze) random_pol sol_random < add_policy(Maze, random_pol) policy(sol_random) reward(sol_random)
Determine the set of actions available in a state.
available_actions(model, state, neg_inf_reward = TRUE, stay_in_place = TRUE)
available_actions(model, state, neg_inf_reward = TRUE, stay_in_place = TRUE)
model 
a MDP object. 
state 
a character vector of length one specifying the state. 
neg_inf_reward 
logical; consider an action that produced 
stay_in_place 
logical; consider an action that results in the same state with a probability of 1 as unavailable. Note that this will mean that absorbing states have no available action! 
Unavailable actions are modeled as actions that have an immediate
reward of Inf
in the reward function.
a character vector with the available actions.
a vector with the available actions.
Michael Hahsler
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
data(DynaMaze) gw_plot(DynaMaze) # The the following actions are always available: DynaMaze$actions # only right and down is unavailable for s(1,1) because they # make the agent stay in place. available_actions(DynaMaze, state = "s(1,1)") # An action that leaves the grid currently is allowed but does not do # anything. act(DynaMaze, "s(1,1)", "up")
data(DynaMaze) gw_plot(DynaMaze) # The the following actions are always available: DynaMaze$actions # only right and down is unavailable for s(1,1) because they # make the agent stay in place. available_actions(DynaMaze, state = "s(1,1)") # An action that leaves the grid currently is allowed but does not do # anything. act(DynaMaze, "s(1,1)", "up")
Update the value function with a Bellman update.
bellman_update(model, U)
bellman_update(model, U)
model 
an MDP problem specification. 
U 
a vector with value function representing the state utilities (expected sum of discounted rewards from that point on). A single 0 can be used as a shorthand for a value function with all 0s. 
The Bellman updates a value function given the model defining
$T$
, $\gamma$
and $R$
by applying the Bellman equation as an update rule for each state:
$U_{k+1}(s) \leftarrow \max_{a \in A(s)} \sum_{s'} T(s'  s,a) [R(s,a) + \gamma U_k(s')]$
a list with the updated state value vector U and the taken actions pi.
Michael Hahsler
Sutton, R. S., Barto, A. G. (2020). Reinforcement Learning: An Introduction. Second edition. The MIT Press.
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
Other policy:
action()
,
add_policy()
,
policy()
,
policy_evaluation()
,
q_values()
,
reward()
,
value_function()
data(Maze) Maze # single Bellman update from a all 0 value function bellman_update(Maze, U = 0) # perform simple value iteration for 10 iterations U < list(U = 0) for (i in seq(10)) U < bellman_update(Maze, U$U) U
data(Maze) Maze # single Bellman update from a all 0 value function bellman_update(Maze, U = 0) # perform simple value iteration for 10 iterations U < list(U = 0) for (i in seq(10)) U < bellman_update(Maze, U$U) U
The cliff walking gridworld MDP example from Chapter 6 of the textbook "Reinforcement Learning: An Introduction."
An object of class MDP.
The cliff walking gridworld has the following layout:
The gridworld is represented as a 4 x 12 matrix of states.
The states are labeled with their x and y coordinates.
The start state is in the bottom left corner.
Each action has a reward of 1, falling off the cliff has a reward of 100 and
returns the agent back to the start. The episode is finished once the agent
reaches the absorbing goal state in the bottom right corner.
No discounting is used (i.e., $\gamma = 1$
).
Richard S. Sutton and Andrew G. Barto (2018). Reinforcement Learning: An Introduction Second Edition, MIT Press, Cambridge, MA.
Other MDP_examples:
DynaMaze
,
MDP()
,
Maze
,
Windy_gridworld
Other gridworld:
DynaMaze
,
Maze
,
Windy_gridworld
,
gridworld
data(Cliff_walking) Cliff_walking gw_matrix(Cliff_walking) gw_matrix(Cliff_walking, what = "labels") # The Goal is an absorbing state absorbing_states(Cliff_walking, sparse = "states") # visualize the transition graph gw_plot_transition_graph(Cliff_walking) # solve using different methods sol < solve_MDP(Cliff_walking) sol policy(sol) gw_plot(sol)
data(Cliff_walking) Cliff_walking gw_matrix(Cliff_walking) gw_matrix(Cliff_walking, what = "labels") # The Goal is an absorbing state absorbing_states(Cliff_walking, sparse = "states") # visualize the transition graph gw_plot_transition_graph(Cliff_walking) # solve using different methods sol < solve_MDP(Cliff_walking) sol policy(sol) gw_plot(sol)
Default discrete and continuous colors used in
the package markovDP
.
colors_discrete(n, col = NULL) colors_continuous(val, col = NULL)
colors_discrete(n, col = NULL) colors_continuous(val, col = NULL)
n 
number of states. 
col 
custom color palette. 
val 
a vector with values to be translated to colors. 
colors_discrete()
returns a color palette and
colors_continuous()
returns the colors associated with the supplied values.
colors_discrete(5) colors_continuous(runif(10))
colors_discrete(5) colors_continuous(runif(10))
The Dyna Maze from Chapter 8 of the textbook "Reinforcement Learning: An Introduction."
An object of class MDP.
The simple 6x9 maze with a few walls.
Richard S. Sutton and Andrew G. Barto (2018). Reinforcement Learning: An Introduction Second Edition, MIT Press, Cambridge, MA.
Other MDP_examples:
Cliff_walking
,
MDP()
,
Maze
,
Windy_gridworld
Other gridworld:
Cliff_walking
,
Maze
,
Windy_gridworld
,
gridworld
Other MDP_examples:
Cliff_walking
,
MDP()
,
Maze
,
Windy_gridworld
Other gridworld:
Cliff_walking
,
Maze
,
Windy_gridworld
,
gridworld
data(DynaMaze) DynaMaze gw_matrix(DynaMaze) gw_matrix(DynaMaze, what = "labels") gw_plot_transition_graph(DynaMaze)
data(DynaMaze) DynaMaze gw_matrix(DynaMaze) gw_matrix(DynaMaze, what = "labels") gw_plot_transition_graph(DynaMaze)
Finds the reachable state space from a transition model function
that takes the arguments model
, action
, start.state
and
returns a named vector with the probabilities for the resulting end.states
(typically only the ones with a probability greater than 1).
find_reachable_states( transition_function, start_state, actions, model = NULL, horizon = Inf, progress = TRUE )
find_reachable_states( transition_function, start_state, actions, model = NULL, horizon = Inf, progress = TRUE )
transition_function 
a transition function (see details for requirements). 
start_state 
labels of the start states. 
actions 
a vector with the available actions. 
model 
if needed, the model passed on to the transition model function. 
horizon 
only return states reachable in the given horizon. 
progress 
logical; show a progress bar? 
The function performs a (depthlimited) depthfirst traversal of the search state space and returns a vector with the names of all encountered states. This vector can be used as the states for creating a MDP model.
a character vector with all reachable states.
Michael Hahsler
# define a MDP for TicTacToe # state description: matrix with the characters _, x, and o # can be converted into a label of 9 characters # set of actions A < as.character(1:9) # helper functions ttt_empty_board < function() matrix('_', ncol = 3, nrow = 3) ttt_state2label < function(state) paste(state, collapse = '') ttt_label2state < function(label) matrix(strsplit(label, "")[[1]], nrow = 3, ncol = 3) ttt_available_actions < function(state) { if (length(state) == 1L) state < ttt_label2state(state) which(state == "_") } ttt_result < function(state, player, action) { if (length(state) == 1L) state < ttt_label2state(state) if (state[action] != "_") stop("Illegal action.") state[action] < player state } ttt_terminal < function(state) { if (length(state) == 1L) state < ttt_label2state(state) # Check the board for a win and return one of # 'x', 'o', 'd' (draw), or 'n' (for next move) win_possibilities < rbind(state, t(state), diag(state), diag(t(state))) wins < apply(win_possibilities, MARGIN = 1, FUN = function(x) { if (x[1] != '_' && length(unique(x)) == 1) x[1] else '_' }) if (any(wins == 'x')) return('x') if (any(wins == 'o')) return('o') # Check for draw if (sum(state == '_') < 1) return('d') return('n') } ttt_other < function(player) if (player == 'x') 'o' else 'x' # define the transition function T < function(model, action, start.state) { action < as.integer(action) # absorbing states if (start.state %in% c('win', 'loss', 'draw')) { return(structure(1, names = start.state)) } # avoid illegal action by making them a loss if (!(action %in% ttt_available_actions(ttt_label2state(start.state)))) { return(structure(1, names = "loss")) } # make x's move next_state < ttt_result(ttt_label2state(start.state), 'x', action) # ttt_terminal? term < ttt_terminal(next_state) if (term == 'x') { return(structure(1, names = "win")) } else if (term == 'o') { return(structure(1, names = "loss")) } else if (term == 'd') { return(structure(1, names = "draw")) } # it is o's turn actions_of_o < ttt_available_actions(next_state) possible_end_states < lapply( actions_of_o, FUN = function(a) ttt_result(next_state, 'o', a) ) # fix ttt_terminal states term < sapply(possible_end_states, ttt_terminal) possible_end_states < sapply(possible_end_states, ttt_state2label) possible_end_states[term == 'x'] < 'win' possible_end_states[term == 'o'] < 'loss' possible_end_states[term == 'd'] < 'draw' possible_end_states < unique(possible_end_states) return(structure(rep( 1 / length(possible_end_states), length(possible_end_states) ), names = possible_end_states)) } # define the reward R < rbind( R_( value = 0), R_(end.state = 'win', value = +1), R_(end.state = 'loss', value = 1), R_(end.state = 'draw', value = +.5), # Note: there is no more reward once the agent is in a terminal state R_(start.state = 'win', value = 0), R_(start.state = 'loss', value = 0), R_(start.state = 'draw', value = 0) ) # start state start < "_________" # find the reachable state space S < find_reachable_states(T, start_state = start, actions = A) head(S) tictactoe < MDP(S, A, T, R, discount = 1, start = start, name = "TicTacToe") tictactoe
# define a MDP for TicTacToe # state description: matrix with the characters _, x, and o # can be converted into a label of 9 characters # set of actions A < as.character(1:9) # helper functions ttt_empty_board < function() matrix('_', ncol = 3, nrow = 3) ttt_state2label < function(state) paste(state, collapse = '') ttt_label2state < function(label) matrix(strsplit(label, "")[[1]], nrow = 3, ncol = 3) ttt_available_actions < function(state) { if (length(state) == 1L) state < ttt_label2state(state) which(state == "_") } ttt_result < function(state, player, action) { if (length(state) == 1L) state < ttt_label2state(state) if (state[action] != "_") stop("Illegal action.") state[action] < player state } ttt_terminal < function(state) { if (length(state) == 1L) state < ttt_label2state(state) # Check the board for a win and return one of # 'x', 'o', 'd' (draw), or 'n' (for next move) win_possibilities < rbind(state, t(state), diag(state), diag(t(state))) wins < apply(win_possibilities, MARGIN = 1, FUN = function(x) { if (x[1] != '_' && length(unique(x)) == 1) x[1] else '_' }) if (any(wins == 'x')) return('x') if (any(wins == 'o')) return('o') # Check for draw if (sum(state == '_') < 1) return('d') return('n') } ttt_other < function(player) if (player == 'x') 'o' else 'x' # define the transition function T < function(model, action, start.state) { action < as.integer(action) # absorbing states if (start.state %in% c('win', 'loss', 'draw')) { return(structure(1, names = start.state)) } # avoid illegal action by making them a loss if (!(action %in% ttt_available_actions(ttt_label2state(start.state)))) { return(structure(1, names = "loss")) } # make x's move next_state < ttt_result(ttt_label2state(start.state), 'x', action) # ttt_terminal? term < ttt_terminal(next_state) if (term == 'x') { return(structure(1, names = "win")) } else if (term == 'o') { return(structure(1, names = "loss")) } else if (term == 'd') { return(structure(1, names = "draw")) } # it is o's turn actions_of_o < ttt_available_actions(next_state) possible_end_states < lapply( actions_of_o, FUN = function(a) ttt_result(next_state, 'o', a) ) # fix ttt_terminal states term < sapply(possible_end_states, ttt_terminal) possible_end_states < sapply(possible_end_states, ttt_state2label) possible_end_states[term == 'x'] < 'win' possible_end_states[term == 'o'] < 'loss' possible_end_states[term == 'd'] < 'draw' possible_end_states < unique(possible_end_states) return(structure(rep( 1 / length(possible_end_states), length(possible_end_states) ), names = possible_end_states)) } # define the reward R < rbind( R_( value = 0), R_(end.state = 'win', value = +1), R_(end.state = 'loss', value = 1), R_(end.state = 'draw', value = +.5), # Note: there is no more reward once the agent is in a terminal state R_(start.state = 'win', value = 0), R_(start.state = 'loss', value = 0), R_(start.state = 'draw', value = 0) ) # start state start < "_________" # find the reachable state space S < find_reachable_states(T, start_state = start, actions = A) head(S) tictactoe < MDP(S, A, T, R, discount = 1, start = start, name = "TicTacToe") tictactoe
Helper functions for gridworld MDPs to convert between state names and gridworld positions, and for visualizing policies.
gw_init( dim, actions = c("up", "right", "down", "left"), start = NULL, goal = NULL, absorbing_states = NULL, unreachable_states = NULL, remove_unreachable_states = TRUE, state_labels = list() ) gw_maze_MDP( dim, start, goal, walls = NULL, actions = c("up", "right", "down", "left"), goal_reward = 100, step_cost = 1, restart = FALSE, discount = 1, horizon = Inf, info = NULL, normalize = FALSE, name = NA ) gw_random_maze( dim, wall_prob = 0.2, start = NULL, goal = NULL, normalize = FALSE ) gw_path(model, start = NULL, goal = NULL, horizon = NULL) gw_matrix(model, epoch = 1L, what = "states") gw_plot( model, epoch = 1L, actions = "character", states = TRUE, index = FALSE, labels = TRUE, impossible_actions = FALSE, main = NULL, cex = 1, offset = 0.5, lines = TRUE, col = hcl.colors(100, "YlOrRd", rev = TRUE), unreachable_col = "gray20", ... ) gw_plot_transition_graph( x, hide_unreachable_states = TRUE, remove.loops = TRUE, vertex.color = "gray", vertex.shape = "square", vertex.size = 10, vertex.label = NA, edge.arrow.size = 0.3, margin = 0.2, main = NULL, ... ) gw_animate(model, method, n, zlim = NULL, ...) gw_read_maze(file, discount = 1, restart = FALSE, name = "Maze") gw_s2rc(s) gw_rc2s(rc) gw_transition_prob(model, action, start.state) gw_transition_prob_end_state(model, action, start.state, end.state)
gw_init( dim, actions = c("up", "right", "down", "left"), start = NULL, goal = NULL, absorbing_states = NULL, unreachable_states = NULL, remove_unreachable_states = TRUE, state_labels = list() ) gw_maze_MDP( dim, start, goal, walls = NULL, actions = c("up", "right", "down", "left"), goal_reward = 100, step_cost = 1, restart = FALSE, discount = 1, horizon = Inf, info = NULL, normalize = FALSE, name = NA ) gw_random_maze( dim, wall_prob = 0.2, start = NULL, goal = NULL, normalize = FALSE ) gw_path(model, start = NULL, goal = NULL, horizon = NULL) gw_matrix(model, epoch = 1L, what = "states") gw_plot( model, epoch = 1L, actions = "character", states = TRUE, index = FALSE, labels = TRUE, impossible_actions = FALSE, main = NULL, cex = 1, offset = 0.5, lines = TRUE, col = hcl.colors(100, "YlOrRd", rev = TRUE), unreachable_col = "gray20", ... ) gw_plot_transition_graph( x, hide_unreachable_states = TRUE, remove.loops = TRUE, vertex.color = "gray", vertex.shape = "square", vertex.size = 10, vertex.label = NA, edge.arrow.size = 0.3, margin = 0.2, main = NULL, ... ) gw_animate(model, method, n, zlim = NULL, ...) gw_read_maze(file, discount = 1, restart = FALSE, name = "Maze") gw_s2rc(s) gw_rc2s(rc) gw_transition_prob(model, action, start.state) gw_transition_prob_end_state(model, action, start.state, end.state)
dim 
vector of length two with the x and y extent of the gridworld. 
actions 
how to show actions. Options are:
simple 
start , goal

start and goal states. If 
absorbing_states 
a vector with state labels for absorbing states. 
unreachable_states 
a vector with state labels for unreachable states. These states will be excluded. 
remove_unreachable_states 
logical; remove unreachable states from the state space? 
state_labels 
a list with labels for states. The element names need to be state names. 
walls 
a vector with state labels for walls. Walls will become unreachable states. 
goal_reward 
reward to transition to the goal state. 
step_cost 
cost of each action that does not lead to the goal state. 
restart 
logical; if 
discount , horizon

MDP discount factor, and horizon. 
info 
A list with additional information. Has to contain the gridworld
dimensions as element 
normalize 
logical; should the description be normalized for
faster access using 
name 
a string to identify the MDP problem. 
wall_prob 
probability to make a tile a wall. 
model , x

a solved gridworld MDP. 
epoch 
epoch for unconverged finitehorizon solutions. 
what 
What should be returned in the matrix. Options are:

states 
logical; show state names. 
index 
logical; show the state indices. 
labels 
logical; show state labels. 
impossible_actions 
logical; show the value and the action for absorbing or unreachable states. 
main 
a main title for the plot. Defaults to the name of the problem. 
cex 
expansion factor for the action. 
offset 
move the state labels out of the way (in fractions of a character width). 
lines 
logical; draw lines to separate states. 
col 
a colors for the utility values. 
unreachable_col 
a color used for unreachable states. Use 
... 
further arguments are passed on to 
hide_unreachable_states 
logical; do not show unreachable states. 
remove.loops 
logical; do not show transitions from a state back to itself. 
vertex.color , vertex.shape , vertex.size , vertex.label , edge.arrow.size

see 
margin 
a single number specifying the margin of the plot. Can be used if the graph does not fit inside the plotting area. 
method 
an MDP solution method for 
n 
number of iterations to animate. 
zlim 
limits for visualizing the state value. 
file 
filename for a maze text file. 
s 
a state label or a vector of labels. 
rc 
a vector of length two with the row and column coordinate of a state in the gridworld matrix. A matrix with one state per row can be also supplied. 
action , start.state , end.state

parameters for the transition function. 
Gridworlds are implemented with state names s(row,col)
, where
row
and col
are locations in the matrix representing the gridworld.
The actions are "up"
, "right"
, "down"
, and "left"
.
gw_init()
initializes a new gridworld creating a matrix
of states with the given dimensions. Other action names
can be specified, but they must have the same effects in the same order
as above. Unreachable states (walls) and absorbing state can be defined.
This information can be used to build a custom gridworld MDP.
Several helper functions are provided to use states, look at the state layout, and plot policies on the gridworld.
gw_maze_MDP()
helps to easily define mazelike gridworld MDPs.
By default, the goal state is absorbing, but with restart = TRUE
, the
agent restarts the problem at the start state every time it reaches the goal
and receives the reward. Note that this implies that the goal state itself
becomes unreachable.
gw_path()
checks if a solved gridworld has a policy that
leads from the start to the goal. Note this function currently samples only a single path which is
an issue with stochastic transitions!
gw_matrix()
returns different information
(state names, values, actions, etc.) as a matrix. Note that some gridworlds
have unreachable states removed. These states will be represented in the
matrix as NA
.
gw_plot()
plots a gridworld.
gw_plot_transition_graph()
plots the transition graph
using the gridworld matrix as the layout.
gw_animate()
applies algorithms from solve_MDP()
iteration
by iteration and visualized the state utilities. This helps to understand
how the algorithms work.
gw_read_maze()
reads a maze in text format from a file
and converts it into a gridworld MDP.
gw_s2rc()
and gw_rc2s
help with converting from
state names to xycoordinates and vice versa.
gw_transition_prob()
and gw_transition_prob3()
provide the standard transition functions for a gridworld.
gw_maze_MDP()
returns an MDP object.
gw_path()
returns a list with the elements "path"
,
"reward"
and "solved"
.
gw_animate()
returns the final solution invisibly.
Other gridworld:
Cliff_walking
,
DynaMaze
,
Maze
,
Windy_gridworld
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
# Defines states, actions and a transition model for a standard gridworld gw < gw_init( dim = c(7, 7), unreachable_states = c("s(2,2)", "s(7,3)", "s(3,6)"), absorbing_states = "s(4,4)", state_labels = list("s(4,4)" = "Black Hole") ) str(gw) # display the state labels in the gridworld (states not represented in the # model are shown as NA) gw_matrix(gw) gw_matrix(gw, what = "label") gw_matrix(gw, what = "absorbing") gw_matrix(gw, what = "unreachable") # a transition function for regular moves in the gridworld is provided gw_transition_prob(gw, "right", "s(1,1)") gw_transition_prob_end_state(gw, "right", "s(1,1)", "s(1,2)") # convert between state names and row/column indices gw_s2rc("s(1,1)") gw_rc2s(c(1, 1)) # The information in gw can be used to build a custom MDP. # We modify the standard transition function so there is a 50% chance that # you will get sucked into the black hole from the adjacent squares. trans_black_hole < function(model, action, start.state, end.state) { # states around the black hole if (start.state %in% c( "s(3,3)", "s(3,4)", "s(3,5)", "s(4,3)", "s(4,5)", "s(5,3)", "s(5,4)", "s(5,5)" )) { if (end.state == "s(4,4)") { return(.5 + gw_transition_prob_end_state(model, action, start.state, end.state) * .5) } else { return(gw_transition_prob_end_state(model, action, start.state, end.state) * .5) } } # use the standard gridworld movement gw_transition_prob_end_state(model, action, start.state, end.state) } black_hole < MDP( states = gw$states, actions = gw$actions, transition_prob = trans_black_hole, reward = rbind(R_( value = +1), R_(end.state = "s(4,4)", value = 100), R_(start.state = "s(4,4)", value = 0) ), info = gw$info, name = "Black hole" ) black_hole black_hole < normalize_MDP(black_hole) gw_plot_transition_graph(black_hole) # solve the problem sol < solve_MDP(black_hole, error = 1) gw_matrix(sol, what = "values") gw_plot(sol) # the optimal policy is to fly around, but avoid the black hole. # Build a Maze: The Dyna Maze from Chapter 8 in the RL book DynaMaze < gw_maze_MDP( dim = c(6, 9), start = "s(3,1)", goal = "s(1,9)", walls = c( "s(2,3)", "s(3,3)", "s(4,3)", "s(5,6)", "s(1,8)", "s(2,8)", "s(3,8)" ), restart = TRUE, discount = 0.95, name = "Dyna Maze", ) DynaMaze gw_matrix(DynaMaze) gw_matrix(DynaMaze, what = "labels") gw_plot_transition_graph(DynaMaze) # Note that the problems resets if the goal state would be reached. sol < solve_MDP(DynaMaze, method = "lp") gw_matrix(sol, what = "values") gw_matrix(sol, what = "actions") gw_plot(sol, states = TRUE) # check if we found a solution gw_path(sol) # Read a maze from a text file # (X are walls, S is the start and G is the goal) # some examples are installed with the package maze_dir < system.file("mazes", package = "markovDP") dir(maze_dir) file.show(file.path(maze_dir, "small_maze.txt")) maze < gw_read_maze(file.path(maze_dir, "small_maze.txt")) maze gw_matrix(maze, what = "label") gw_plot(maze) # Prioritized sweeping is especially effective for larger mazes. sol < solve_MDP(maze, method = "prioritized_sweeping") sol gw_plot(sol) gw_path(sol) # A maze can also be created directly from a character vector maze < gw_read_maze( textConnection(c("XXXXXX", "XS GX", "XXXXXX"))) gw_plot(maze) # Create a small random maze rand_maze < gw_random_maze(dim = c(5, 5)) gw_plot(rand_maze)
# Defines states, actions and a transition model for a standard gridworld gw < gw_init( dim = c(7, 7), unreachable_states = c("s(2,2)", "s(7,3)", "s(3,6)"), absorbing_states = "s(4,4)", state_labels = list("s(4,4)" = "Black Hole") ) str(gw) # display the state labels in the gridworld (states not represented in the # model are shown as NA) gw_matrix(gw) gw_matrix(gw, what = "label") gw_matrix(gw, what = "absorbing") gw_matrix(gw, what = "unreachable") # a transition function for regular moves in the gridworld is provided gw_transition_prob(gw, "right", "s(1,1)") gw_transition_prob_end_state(gw, "right", "s(1,1)", "s(1,2)") # convert between state names and row/column indices gw_s2rc("s(1,1)") gw_rc2s(c(1, 1)) # The information in gw can be used to build a custom MDP. # We modify the standard transition function so there is a 50% chance that # you will get sucked into the black hole from the adjacent squares. trans_black_hole < function(model, action, start.state, end.state) { # states around the black hole if (start.state %in% c( "s(3,3)", "s(3,4)", "s(3,5)", "s(4,3)", "s(4,5)", "s(5,3)", "s(5,4)", "s(5,5)" )) { if (end.state == "s(4,4)") { return(.5 + gw_transition_prob_end_state(model, action, start.state, end.state) * .5) } else { return(gw_transition_prob_end_state(model, action, start.state, end.state) * .5) } } # use the standard gridworld movement gw_transition_prob_end_state(model, action, start.state, end.state) } black_hole < MDP( states = gw$states, actions = gw$actions, transition_prob = trans_black_hole, reward = rbind(R_( value = +1), R_(end.state = "s(4,4)", value = 100), R_(start.state = "s(4,4)", value = 0) ), info = gw$info, name = "Black hole" ) black_hole black_hole < normalize_MDP(black_hole) gw_plot_transition_graph(black_hole) # solve the problem sol < solve_MDP(black_hole, error = 1) gw_matrix(sol, what = "values") gw_plot(sol) # the optimal policy is to fly around, but avoid the black hole. # Build a Maze: The Dyna Maze from Chapter 8 in the RL book DynaMaze < gw_maze_MDP( dim = c(6, 9), start = "s(3,1)", goal = "s(1,9)", walls = c( "s(2,3)", "s(3,3)", "s(4,3)", "s(5,6)", "s(1,8)", "s(2,8)", "s(3,8)" ), restart = TRUE, discount = 0.95, name = "Dyna Maze", ) DynaMaze gw_matrix(DynaMaze) gw_matrix(DynaMaze, what = "labels") gw_plot_transition_graph(DynaMaze) # Note that the problems resets if the goal state would be reached. sol < solve_MDP(DynaMaze, method = "lp") gw_matrix(sol, what = "values") gw_matrix(sol, what = "actions") gw_plot(sol, states = TRUE) # check if we found a solution gw_path(sol) # Read a maze from a text file # (X are walls, S is the start and G is the goal) # some examples are installed with the package maze_dir < system.file("mazes", package = "markovDP") dir(maze_dir) file.show(file.path(maze_dir, "small_maze.txt")) maze < gw_read_maze(file.path(maze_dir, "small_maze.txt")) maze gw_matrix(maze, what = "label") gw_plot(maze) # Prioritized sweeping is especially effective for larger mazes. sol < solve_MDP(maze, method = "prioritized_sweeping") sol gw_plot(sol) gw_path(sol) # A maze can also be created directly from a character vector maze < gw_read_maze( textConnection(c("XXXXXX", "XS GX", "XXXXXX"))) gw_plot(maze) # Create a small random maze rand_maze < gw_random_maze(dim = c(5, 5)) gw_plot(rand_maze)
The 4x3 maze is described in Chapter 17 of the textbook "Artificial Intelligence: A Modern Approach" (AIMA).
An object of class MDP.
The simple maze has the following layout:
1234 Transition model: ###### .8 (action direction) 1# +# ^ 2# # #  3#S # .1 <> .1 ######
We represent the maze states as a gridworld matrix with 3 rows and
4 columns. The states are labeled s(row, col)
representing the position in
the matrix.
The # (state s(2,2)
) in the middle of the maze is an obstruction and not reachable.
Rewards are associated with transitions. The default reward (penalty) is 0.04.
The start state marked with S
is s(3,1)
.
Transitioning to +
(state s(1,4)
) gives a reward of +1.0,
transitioning to 
(state s_(2,4)
)
has a reward of 1.0. Both these states are absorbing
(i.e., terminal) states.
Actions are movements (up
, right
, down
, left
). The actions are
unreliable with a .8 chance
to move in the correct direction and a 0.1 chance to instead to move in a
perpendicular direction leading to a stochastic transition model.
Note that the problem has reachable terminal states which leads to a proper policy
(that is guaranteed to reach a terminal state). This means that the solution also
converges without discounting (discount = 1
).
Russell,9 S. J. and Norvig, P. (2020). Artificial Intelligence: A modern approach. 4rd ed.
Other MDP_examples:
Cliff_walking
,
DynaMaze
,
MDP()
,
Windy_gridworld
Other gridworld:
Cliff_walking
,
DynaMaze
,
Windy_gridworld
,
gridworld
# The problem can be loaded using data(Maze). # Here is the complete problem definition. # We first look at the state layout gw_matrix(gw_init(dim = c(3, 4))) # the wall at s(2,2) is unreachable gw < gw_init(dim = c(3, 4), start = "s(3,1)", goal = "s(1,4)", absorbing_states = c("s(1,4)", "s(2,4)"), unreachable_states = "s(2,2)", state_labels = list( "s(3,1)" = "Start", "s(2,4)" = "1", "s(1,4)" = "Goal: +1") ) gw_matrix(gw) gw_matrix(gw, what = "index") gw_matrix(gw, what = "labels") # gw_init has created the following information str(gw) # the transition function is stochastic so we cannot use the standard # gridworld gw$transition_prob() function and have to replace it T < function(model, action, start.state) { action < match.arg(action, choices = model$actions) P < structure(numeric(length(model$states)), names = model$states) # absorbing states if (start.state %in% model$info$absorbing_states) { P[start.state] < 1 return(P) } if (action %in% c("up", "down")) { error_direction < c("right", "left") } else { error_direction < c("up", "down") } rc < gw_s2rc(start.state) delta < list( up = c(1, 0), down = c(+1, 0), right = c(0, +1), left = c(0, 1) ) # there are 3 directions. For blocked directions, stay in place # 1) action works .8 rc_new < gw_rc2s(rc + delta[[action]]) if (rc_new %in% model$states) P[rc_new] < .8 else P[start.state] < .8 # 2) off to the right .1 rc_new < gw_rc2s(rc + delta[[error_direction[1]]]) if (rc_new %in% model$states) P[rc_new] < .1 else P[start.state] < P[start.state] + .1 # 3) off to the left .1 rc_new < gw_rc2s(rc + delta[[error_direction[2]]]) if (rc_new %in% model$states) P[rc_new] < .1 else P[start.state] < P[start.state] + .1 P } T(gw, "up", "s(3,1)") R < rbind( R_( value = 0.04), R_(end.state = "s(2,4)", value = 1), R_(end.state = "s(1,4)", value = +1), R_(start.state = "s(2,4)", value = 0), R_(start.state = "s(1,4)", value = 0) ) Maze < MDP( name = "Stuart Russell's 3x4 Maze", discount = 1, horizon = Inf, states = gw$states, actions = gw$actions, start = "s(3,1)", transition_prob = T, reward = R, info = gw$info ) Maze str(Maze) gw_matrix(Maze) gw_matrix(Maze, what = "labels") gw_plot(Maze) # find absorbing (terminal) states absorbing_states(Maze) maze_solved < solve_MDP(Maze) policy(maze_solved) gw_matrix(maze_solved, what = "values") gw_matrix(maze_solved, what = "actions") gw_plot(maze_solved)
# The problem can be loaded using data(Maze). # Here is the complete problem definition. # We first look at the state layout gw_matrix(gw_init(dim = c(3, 4))) # the wall at s(2,2) is unreachable gw < gw_init(dim = c(3, 4), start = "s(3,1)", goal = "s(1,4)", absorbing_states = c("s(1,4)", "s(2,4)"), unreachable_states = "s(2,2)", state_labels = list( "s(3,1)" = "Start", "s(2,4)" = "1", "s(1,4)" = "Goal: +1") ) gw_matrix(gw) gw_matrix(gw, what = "index") gw_matrix(gw, what = "labels") # gw_init has created the following information str(gw) # the transition function is stochastic so we cannot use the standard # gridworld gw$transition_prob() function and have to replace it T < function(model, action, start.state) { action < match.arg(action, choices = model$actions) P < structure(numeric(length(model$states)), names = model$states) # absorbing states if (start.state %in% model$info$absorbing_states) { P[start.state] < 1 return(P) } if (action %in% c("up", "down")) { error_direction < c("right", "left") } else { error_direction < c("up", "down") } rc < gw_s2rc(start.state) delta < list( up = c(1, 0), down = c(+1, 0), right = c(0, +1), left = c(0, 1) ) # there are 3 directions. For blocked directions, stay in place # 1) action works .8 rc_new < gw_rc2s(rc + delta[[action]]) if (rc_new %in% model$states) P[rc_new] < .8 else P[start.state] < .8 # 2) off to the right .1 rc_new < gw_rc2s(rc + delta[[error_direction[1]]]) if (rc_new %in% model$states) P[rc_new] < .1 else P[start.state] < P[start.state] + .1 # 3) off to the left .1 rc_new < gw_rc2s(rc + delta[[error_direction[2]]]) if (rc_new %in% model$states) P[rc_new] < .1 else P[start.state] < P[start.state] + .1 P } T(gw, "up", "s(3,1)") R < rbind( R_( value = 0.04), R_(end.state = "s(2,4)", value = 1), R_(end.state = "s(1,4)", value = +1), R_(start.state = "s(2,4)", value = 0), R_(start.state = "s(1,4)", value = 0) ) Maze < MDP( name = "Stuart Russell's 3x4 Maze", discount = 1, horizon = Inf, states = gw$states, actions = gw$actions, start = "s(3,1)", transition_prob = T, reward = R, info = gw$info ) Maze str(Maze) gw_matrix(Maze) gw_matrix(Maze, what = "labels") gw_plot(Maze) # find absorbing (terminal) states absorbing_states(Maze) maze_solved < solve_MDP(Maze) policy(maze_solved) gw_matrix(maze_solved, what = "values") gw_matrix(maze_solved, what = "actions") gw_plot(maze_solved)
Defines all the elements of a discretetime finite statespace MDP problem.
MDP( states, actions, transition_prob, reward, discount = 0.9, horizon = Inf, start = "uniform", info = NULL, name = NA ) is_solved_MDP(model, stop = FALSE) is_converged_MDP(model) T_(action = NA, start.state = NA, end.state = NA, probability) R_(action = NA, start.state = NA, end.state = NA, value) epoch_to_episode(model, epoch)
MDP( states, actions, transition_prob, reward, discount = 0.9, horizon = Inf, start = "uniform", info = NULL, name = NA ) is_solved_MDP(model, stop = FALSE) is_converged_MDP(model) T_(action = NA, start.state = NA, end.state = NA, probability) R_(action = NA, start.state = NA, end.state = NA, value) epoch_to_episode(model, epoch)
states 
a character vector specifying the names of the states. 
actions 
a character vector specifying the names of the available actions. 
transition_prob 
Specifies the transition probabilities between states. 
reward 
Specifies the rewards dependent on action and states. 
discount 
numeric; discount rate between 0 and 1. 
horizon 
numeric; Number of epochs. 
start 
Specifies in which state the MDP starts. 
info 
A list with additional information. 
name 
a string to identify the MDP problem. 
model 
an 
stop 
logical; stop with an error. 
action 
action as a action label or integer. The value 
start.state , end.state

state as a state label or an integer. The value 
probability , value

Values used in the helper functions 
epoch 
integer; an epoch that should be converted to the corresponding episode in a timedependent MDP. 
Markov decision processes (MDPs) are discretetime stochastic control
process. We implement here MDPs with a finite state space.
MDP()
defines all the element of an MDP problem including the discount rate, the
set of states, the set of actions,the transition
probabilities, and the rewards.
In the following we use the following notation. The MDP is a 5duple:
$(S,A,T,R, \gamma)$
.
$S$
is the set of states; $A$
is the set of actions; $T$
are the conditional transition probabilities
between states; $R$
is the reward function; and
$\gamma$
is the discount factor. We will use lower case letters to
represent a member of a set, e.g., $s$
is a specific state. To refer to
the size of a set we will use cardinality, e.g., the number of actions is
$A$
.
$S, s, s'$
: 'states', start.state', 'end.state'
$A, a$
: 'actions', 'action'
State names and actions can be specified as strings or index numbers
(e.g., start.state
can be specified as the index of the state in states
).
For the specification as data.frames below, NA
can be used to mean
any start.state
, end.state
or action
.
$T(s'  s, a)$
Transition probability to transition to state $s'$
from given state $s$
and action $a$
. The transition probabilities can be
specified in the following ways:
A data.frame with columns exactly like the arguments of T_()
.
You can use rbind()
with helper function T_()
to create this data
frame. Probabilities can be specified multiple times and the definition that
appears last in the data.frame will take affect.
A named list of matrices, one for each action. Each matrix is square with
rows representing start states $s$
and columns representing end states $s'$
.
Instead of a matrix, also the strings 'identity'
or 'uniform'
can be specified.
A function with the following arguments:
A function with the argument list model
, action
, start.state
, end.state
which returns a single transition probability.
A function with the argument list model
, action
, start.state
which returns a transition probability vector for all end states. This vector can
be dense, a Matrix::sparseVector or a named vector only containing
the nonzero probabilities named by the corresponding end state.
The arguments action
, start.state
, and end.state
will be always
called with the state names as a character vectors of length 1.
$R(a, s, s')$
The reward function can be specified in the following ways:
A data frame with columns named exactly like the arguments of R_()
.
You can use rbind()
with helper function R_()
to create this data frame. Rewards can be specified
multiple times and the definition that
appears last in the data.frame will take affect.
A named list of matrices, one for each action. Each matrix is square with
rows representing start states $s$
and columns representing end states $s'$
.
A function following the same rules as for transition probabilities.
To avoid overflow problems with rewards, reward values should stay well within the
range of
[1e10, +1e10]
. Inf
can be used as the reward for unavailable actions and
will be translated into a large negative reward for solvers that only support
finite reward values.
The start state of the agent can be a single state or a distribution over the states.
The start state definition is used as the default when the reward is calculated by reward()
and for sampling with sample_MDP()
.
Options to specify the start state are:
A string specifying the name of a single starting state.
An integer in the range $1$
to $n$
to specify the index of a single starting state.
The string "uniform"
where the start state is chosen using a uniform distribution over all states.
A probability distribution over the states. That is, a vector
of $S$
probabilities, that add up to $1$
.
The default state state is a uniform distribution over all states.
See accessors.
The function returns an object of class MDP which is list with
the model specification. solve_MDP()
reads the object and adds a list element called
'solution'
.
Michael Hahsler
Other MDP:
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
Other MDP_examples:
Cliff_walking
,
DynaMaze
,
Maze
,
Windy_gridworld
# simple MDP example # # states: s1 s2 s3 s4 # transitions: forward moves > and backward moves < # start: s1 # reward: s1, s2, s4 = 0 and s3 = 1 car < MDP( states = c("s1", "s2", "s3", "s4"), actions = c("forward", "back", "stop"), transition < list( forward = rbind(c(0, 1, 0, 0), c(0, 0, 1, 0), c(0, 0, 0, 1), c(0, 0, 0, 1)), back = rbind(c(1, 0, 0, 0), c(1, 0, 0, 0), c(0, 1, 0, 0), c(0, 0, 1, 0)), stop = "identity" ), reward = rbind( R_(value = 0), R_(end.state = "s3", value = 1) ), discount = 0.9, start = "s1", name = "Simple Car MDP" ) car # internal representation str(car) # accessing elements start_vector(car, sparse = "states") transition_matrix(car) transition_matrix(car, sparse = TRUE) reward_matrix(car) reward_matrix(car, sparse = TRUE) sol < solve_MDP(car) policy(sol)
# simple MDP example # # states: s1 s2 s3 s4 # transitions: forward moves > and backward moves < # start: s1 # reward: s1, s2, s4 = 0 and s3 = 1 car < MDP( states = c("s1", "s2", "s3", "s4"), actions = c("forward", "back", "stop"), transition < list( forward = rbind(c(0, 1, 0, 0), c(0, 0, 1, 0), c(0, 0, 0, 1), c(0, 0, 0, 1)), back = rbind(c(1, 0, 0, 0), c(1, 0, 0, 0), c(0, 1, 0, 0), c(0, 0, 1, 0)), stop = "identity" ), reward = rbind( R_(value = 0), R_(end.state = "s3", value = 1) ), discount = 0.9, start = "s1", name = "Simple Car MDP" ) car # internal representation str(car) # accessing elements start_vector(car, sparse = "states") transition_matrix(car) transition_matrix(car, sparse = TRUE) reward_matrix(car) reward_matrix(car, sparse = TRUE) sol < solve_MDP(car) policy(sol)
Extracts the policy from a solved model or create a policy. All policies are deterministic.
policy(model, epoch = NULL, drop = TRUE) random_policy( model, prob = NULL, estimate_U = FALSE, only_available_actions = FALSE, ... ) manual_policy(model, actions, U = NULL, estimate_U = FALSE)
policy(model, epoch = NULL, drop = TRUE) random_policy( model, prob = NULL, estimate_U = FALSE, only_available_actions = FALSE, ... ) manual_policy(model, actions, U = NULL, estimate_U = FALSE)
model 
A solved MDP object. 
epoch 
return the policy of the given epoch. 
drop 
logical; drop the list for converged, epochindependent policies. 
prob 
probability vector for random actions for 
estimate_U 
logical; estimate the value function
using 
only_available_actions 
logical; only sample from available actions?
(see 
... 
is passed on to 
actions 
a vector with the action (either the action label or the numeric id) for each state. 
U 
a vector with the value function for the policy. If 
For an MDP, the deterministic policy is a data.frame with columns for:
state
: The state.
U
: The state's value (discounted expected utility U) if the policy
is followed.
action
: The prescribed action.
For unconverged, finitehorizon problems, the solution is a policy for each epoch. This is returned as a list of data.frames.
A data.frame containing the policy. If drop = FALSE
then the policy is returned
as a list with the policy for each epoch.
Michael Hahsler
Other policy:
action()
,
add_policy()
,
bellman_update()
,
policy_evaluation()
,
q_values()
,
reward()
,
value_function()
data("Maze") sol < solve_MDP(Maze) sol ## policy with value function and optimal action. policy(sol) plot_value_function(sol) gw_plot(sol) ## create a random policy pi_random < random_policy(Maze, estimate_U = TRUE) pi_random gw_plot(add_policy(Maze, pi_random)) ## create a manual policy (go up and in some squares to the right) acts < rep("up", times = length(Maze$states)) names(acts) < Maze$states acts[c("s(1,1)", "s(1,2)", "s(1,3)")] < "right" pi_manual < manual_policy(Maze, acts) pi_manual gw_plot(add_policy(Maze, pi_manual)) ## Finite horizon (we use incremental pruning because grid does not converge) sol < solve_MDP(model = Maze, horizon = 3) sol policy(sol) gw_plot(sol)
data("Maze") sol < solve_MDP(Maze) sol ## policy with value function and optimal action. policy(sol) plot_value_function(sol) gw_plot(sol) ## create a random policy pi_random < random_policy(Maze, estimate_U = TRUE) pi_random gw_plot(add_policy(Maze, pi_random)) ## create a manual policy (go up and in some squares to the right) acts < rep("up", times = length(Maze$states)) names(acts) < Maze$states acts[c("s(1,1)", "s(1,2)", "s(1,3)")] < "right" pi_manual < manual_policy(Maze, acts) pi_manual gw_plot(add_policy(Maze, pi_manual)) ## Finite horizon (we use incremental pruning because grid does not converge) sol < solve_MDP(model = Maze, horizon = 3) sol policy(sol) gw_plot(sol)
Estimate the value function for a policy applied to a model by repeatedly applying the Bellman operator.
policy_evaluation( model, pi, U = NULL, k_backups = 1000L, theta = 0.001, matrix = TRUE, progress = TRUE, verbose = FALSE ) bellman_operator(model, pi, U)
policy_evaluation( model, pi, U = NULL, k_backups = 1000L, theta = 0.001, matrix = TRUE, progress = TRUE, verbose = FALSE ) bellman_operator(model, pi, U)
model 
an MDP problem specification. 
pi 
a policy as a data.frame with at least columns for states and action. 
U 
a vector with value function representing the state utilities
(expected sum of discounted rewards from that point on).
If 
k_backups 
number of look ahead steps used for approximate policy evaluation
used by the policy iteration method. Set k_backups to 
theta 
stop when the largest change in a state value is less
than 
matrix 
logical; if 
progress 
logical; show a progress bar with estimated time for completion. 
verbose 
logical; should progress and approximation errors be printed. 
The Bellman operator updates a value function given the model defining
$T$
, $\gamma$
and $R$
, and a policy
$\pi$
by applying the Bellman equation as an update rule for each state:
$U_{k+1}(s) =\sum_a \pi_{as} \sum_{s'} T(s'  s,a) [R(s,a) + \gamma U_k(s')]$
A policy can be evaluated by applying the Bellman operator till convergence.
In each iteration, all states are updated. In this implementation updating is
stopped afterk_backups
iterations or after the
largest update
$U_{k+1}  U_k_\infty < \theta.$
a vector with (approximate) state values (U).
Michael Hahsler
Sutton, R. S., Barto, A. G. (2020). Reinforcement Learning: An Introduction. Second edition. The MIT Press.
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
Other policy:
action()
,
add_policy()
,
bellman_update()
,
policy()
,
q_values()
,
reward()
,
value_function()
data(Maze) Maze # create several policies: # 1. optimal policy using value iteration maze_solved < solve_MDP(Maze, method = "value_iteration") pi_opt < policy(maze_solved) pi_opt # 2. a manual policy (go up and in some squares to the right) acts < rep("up", times = length(Maze$states)) names(acts) < Maze$states acts[c("s(1,1)", "s(1,2)", "s(1,3)")] < "right" pi_manual < manual_policy(Maze, acts) pi_manual # 3. a random policy set.seed(1234) pi_random < random_policy(Maze) pi_random # 4. an improved policy based on one policy evaluation and # policy improvement step. u < policy_evaluation(Maze, pi_random) q < q_values(Maze, U = u) pi_greedy < greedy_policy(q) pi_greedy #' compare the approx. value functions for the policies (we restrict #' the number of backups for the random policy since it may not converge) rbind( random = policy_evaluation(Maze, pi_random, k_backups = 100), manual = policy_evaluation(Maze, pi_manual), greedy = policy_evaluation(Maze, pi_greedy), optimal = policy_evaluation(Maze, pi_opt) )
data(Maze) Maze # create several policies: # 1. optimal policy using value iteration maze_solved < solve_MDP(Maze, method = "value_iteration") pi_opt < policy(maze_solved) pi_opt # 2. a manual policy (go up and in some squares to the right) acts < rep("up", times = length(Maze$states)) names(acts) < Maze$states acts[c("s(1,1)", "s(1,2)", "s(1,3)")] < "right" pi_manual < manual_policy(Maze, acts) pi_manual # 3. a random policy set.seed(1234) pi_random < random_policy(Maze) pi_random # 4. an improved policy based on one policy evaluation and # policy improvement step. u < policy_evaluation(Maze, pi_random) q < q_values(Maze, U = u) pi_greedy < greedy_policy(q) pi_greedy #' compare the approx. value functions for the policies (we restrict #' the number of backups for the random policy since it may not converge) rbind( random = policy_evaluation(Maze, pi_random, k_backups = 100), manual = policy_evaluation(Maze, pi_manual), greedy = policy_evaluation(Maze, pi_greedy), optimal = policy_evaluation(Maze, pi_opt) )
Implementation several functions useful to deal with Qvalues for MDPs which maps a state/action pair to a utility value.
q_values(model, U = NULL) greedy_action(Q, s, epsilon = 0, prob = FALSE) greedy_policy(Q)
q_values(model, U = NULL) greedy_action(Q, s, epsilon = 0, prob = FALSE) greedy_policy(Q)
model 
an MDP problem specification. 
U 
a vector with value function representing the state utilities
(expected sum of discounted rewards from that point on).
If 
Q 
an action value function with Qvalues as a state by action matrix. 
s 
a state. 
epsilon 
an 
prob 
logical; return a probability distribution over the actions. 
Implemented functions are:
q_values()
calculates (approximates)
Qvalues for a given model and value function using the Bellman
optimality equation:
$q(s,a) = \sum_{s'} T(s's,a) [R(s,a) + \gamma U(s')]$
Qvalues are calculated if $U = U^*$
, the optimal value function
otherwise we get an approximation.
Qvalues can be used as the input for several other functions.
greedy_action()
returns the action with the largest Qvalue given a
state.
greedy_policy()
generates a greedy policy using Qvalues.
q_values()
returns a state by action matrix specifying the Qfunction,
i.e., the action value for executing each action in each state. The Qvalues
are calculated from the value function (U) and the transition model.
greedy_action()
returns the action with the highest qvalue
for state s
. If prob = TRUE
, then a vector with
the probability for each action is returned.
greedy_policy()
returns the greedy policy given Q
.
Michael Hahsler
Sutton, R. S., Barto, A. G. (2020). Reinforcement Learning: An Introduction. Second edition. The MIT Press.
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
Other policy:
action()
,
add_policy()
,
bellman_update()
,
policy()
,
policy_evaluation()
,
reward()
,
value_function()
data(Maze) Maze # create a random policy and calculate qvalues pi_random < random_policy(Maze) u < policy_evaluation(Maze, pi_random) q < q_values(Maze, U = u) # get the greedy policy form the qvalues pi_greedy < greedy_policy(q) pi_greedy gw_plot(add_policy(Maze, pi_greedy), main = "Maze: Greedy Policy") greedy_action(q, "s(3,1)", epsilon = 0, prob = FALSE) greedy_action(q, "s(3,1)", epsilon = 0, prob = TRUE) greedy_action(q, "s(3,1)", epsilon = .1, prob = TRUE)
data(Maze) Maze # create a random policy and calculate qvalues pi_random < random_policy(Maze) u < policy_evaluation(Maze, pi_random) q < q_values(Maze, U = u) # get the greedy policy form the qvalues pi_greedy < greedy_policy(q) pi_greedy gw_plot(add_policy(Maze, pi_greedy), main = "Maze: Greedy Policy") greedy_action(q, "s(3,1)", epsilon = 0, prob = FALSE) greedy_action(q, "s(3,1)", epsilon = 0, prob = TRUE) greedy_action(q, "s(3,1)", epsilon = .1, prob = TRUE)
Calculates the regret and related measures for a policy relative to a benchmark policy.
regret(policy, benchmark, start = NULL, run_policy_eval = TRUE, ...) action_discrepancy(policy, benchmark, proportion = FALSE) mean_value_error(policy, benchmark, square = TRUE)
regret(policy, benchmark, start = NULL, run_policy_eval = TRUE, ...) action_discrepancy(policy, benchmark, proportion = FALSE) mean_value_error(policy, benchmark, square = TRUE)
policy 
a solved MDP containing the policy to calculate the regret for. 
benchmark 
a solved MDP with the (optimal) policy. Regret is calculated relative to this policy. 
start 
start state distribution. If NULL then the start state of the 
... 
further arguments are passed on to 
proportion 
logical; should the action discrepancy be reported as a proportion of states with a different action. 
square 
logical; should the value error be squared? 
policy_eval 
logical; run policy evaluation to reestimate state values. 
Regret is defined as $V^{\pi^*}(s_0)  V^{\pi}(s_0)$
with $V^\pi$
representing the expected longterm
state value (represented by the value function) given the policy $\pi$
and the start
state $s_0$
.
Note that for regret, usually the optimal policy $\pi^*$
is used as the benchmark.
Since the optimal policy may not be known, regret relative to the best known policy can be used.
Regret is only valid with converged value functions. This means that either the
solver has converged, or the value function was estimated for the policy using
converged policy_evaluation()
.
The action discrepancy is the number of states in the policy for which the prescribed action in the policy differs. This implementation calculates the Q matrix for the benchmark to make sure that actions with the same Qvalue are considers as correct.
The mean value error is the sum of the squared differences in state values between the two solutions.
regret()
returns the regret as a difference of expected longterm rewards.
action_discrepancy()
returns the number or proportion of diverging actions.
mean_value_error()
returns the mean squared or absolute difference in the
value function.
Michael Hahsler
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
data(Maze) sol_optimal < solve_MDP(Maze) policy(sol_optimal) # a manual policy (go up and in some squares to the right) acts < rep("up", times = length(Maze$states)) names(acts) < Maze$states acts[c("s(1,1)", "s(1,2)", "s(1,3)")] < "right" sol_manual < add_policy(Maze, manual_policy(Maze, acts, estimate_U = TRUE)) policy(sol_manual) regret(sol_manual, benchmark = sol_optimal) action_discrepancy(sol_manual, benchmark = sol_optimal) mean_value_error(sol_manual, benchmark = sol_optimal)
data(Maze) sol_optimal < solve_MDP(Maze) policy(sol_optimal) # a manual policy (go up and in some squares to the right) acts < rep("up", times = length(Maze$states)) names(acts) < Maze$states acts[c("s(1,1)", "s(1,2)", "s(1,3)")] < "right" sol_manual < add_policy(Maze, manual_policy(Maze, acts, estimate_U = TRUE)) policy(sol_manual) regret(sol_manual, benchmark = sol_optimal) action_discrepancy(sol_manual, benchmark = sol_optimal) mean_value_error(sol_manual, benchmark = sol_optimal)
This function calculates the expected total reward for an MDP policy given a start state (distribution). The value is calculated using the value function stored in the MDP solution.
reward(model, ...) ## S3 method for class 'MDP' reward(model, start = NULL, epoch = 1L, ...)
reward(model, ...) ## S3 method for class 'MDP' reward(model, start = NULL, epoch = 1L, ...)
model 
a solved MDP object. 
... 
further arguments are passed on. 
start 
specification of the current state (see argument start in MDP for details). By default the start state defined in the model as start is used. Multiple states can be specified as rows in a matrix. 
epoch 
epoch for a finitehorizon solutions. 
The reward is typically calculated using the value function
of the solution. If these are not available, then sample_MDP()
is
used instead with a warning.
reward()
returns a vector of reward values, one for each belief if a matrix is specified.
state 
start state to calculate the reward for. if 
Michael Hahsler
Other policy:
action()
,
add_policy()
,
bellman_update()
,
policy()
,
policy_evaluation()
,
q_values()
,
value_function()
data("Maze") Maze gw_matrix(Maze) sol < solve_MDP(Maze) policy(sol) # reward for the start state s(3,1) specified in the model reward(sol) # reward for starting next to the goal at s(1,3) reward(sol, start = "s(1,3)") # expected reward when we start from a random state reward(sol, start = "uniform")
data("Maze") Maze gw_matrix(Maze) sol < solve_MDP(Maze) policy(sol) # reward for the start state s(3,1) specified in the model reward(sol) # reward for starting next to the goal at s(1,3) reward(sol, start = "s(1,3)") # expected reward when we start from a random state reward(sol, start = "uniform")
Rounds a vector such that the sum of 1 is preserved. Rounds a matrix such that each row sum up to 1. One entry is adjusted after rounding such that the rounding error is the smallest.
round_stochastic(x, digits = 7)
round_stochastic(x, digits = 7)
x 
a stochastic vector or a rowstochastic matrix. 
digits 
number of digits for rounding. 
The rounded vector or matrix.
# regular rounding would not sum up to 1 x < c(0.333, 0.334, 0.333) round_stochastic(x) round_stochastic(x, digits = 2) round_stochastic(x, digits = 1) round_stochastic(x, digits = 0) # round a stochastic matrix m < matrix(runif(15), ncol = 3) m < sweep(m, 1, rowSums(m), "/") m round_stochastic(m, digits = 2) round_stochastic(m, digits = 1) round_stochastic(m, digits = 0)
# regular rounding would not sum up to 1 x < c(0.333, 0.334, 0.333) round_stochastic(x) round_stochastic(x, digits = 2) round_stochastic(x, digits = 1) round_stochastic(x, digits = 0) # round a stochastic matrix m < matrix(runif(15), ncol = 3) m < sweep(m, 1, rowSums(m), "/") m round_stochastic(m, digits = 2) round_stochastic(m, digits = 1) round_stochastic(m, digits = 0)
Sample trajectories through an MDP. The start state for each trajectory is randomly chosen using the specified belief. The belief is used to choose actions from an epsilongreedy policy and then update the state.
sample_MDP( model, n = 100, start = NULL, horizon = NULL, epsilon = NULL, exploring_starts = FALSE, delta_horizon = 0.001, return_trajectories = FALSE, engine = NULL, progress = TRUE, verbose = FALSE, ... )
sample_MDP( model, n = 100, start = NULL, horizon = NULL, epsilon = NULL, exploring_starts = FALSE, delta_horizon = 0.001, return_trajectories = FALSE, engine = NULL, progress = TRUE, verbose = FALSE, ... )
model 
an MDP model. 
n 
number of trajectories. 
start 
probability distribution over the states for choosing the starting states for the trajectories. Defaults to "uniform". 
horizon 
epochs end once an absorbing state is reached or after
the maximal number of epochs specified via 
epsilon 
the probability of random actions for using an epsilongreedy policy. Default for solved models is 0 and for unsolved model 1. 
exploring_starts 
logical; randomly sample a start/action combination to start the episode from. 
delta_horizon 
precision used to determine the horizon for infinitehorizon problems. 
return_trajectories 
logical; return the complete trajectories. 
engine 

progress 
show a progress bar? 
verbose 
report used parameters 
... 
further arguments are ignored. 
The default is a
faster C++ implementation (engine = 'cpp'
).
A native R implementation is available (engine = 'r'
).
Both implementations support parallel execution using the package
foreach. To enable parallel execution, a parallel backend like
doparallel needs to be available needs to be registered (see
doParallel::registerDoParallel()
).
Note that small samples are slower using parallelization. Therefore, C++ simulations
with n * horizon less than 100,000 are always executed using a single worker.
A list with elements:
avg_reward
: The average discounted reward.
reward
: Reward for each trajectory.
action_cnt
: Action counts.
state_cnt
: State counts.
trajectories
: A data.frame with the trajectories. Each row
contains the episode
id, the time
step, the state s
,
the chosen action a
,
the reward r
, and the next state s_prime
. Trajectories are
only returned for return_trajectories = TRUE
.
Michael Hahsler
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
# enable parallel simulation # doParallel::registerDoParallel() data(Maze) # solve the MDP for 5 epochs and no discounting sol < solve_MDP(Maze, discount = 1) sol # U in the policy is and estimate of the utility of being in a state when using the optimal policy. policy(sol) gw_matrix(sol, what = "action") ## Example 1: simulate 100 trajectories following the policy, # only the final belief state is returned sim < sample_MDP(sol, n = 100, horizon = 10, verbose = TRUE) sim # Note that all simulations start at s_1 and that the simulated avg. reward # is therefore an estimate to the U value for the start state s_1. policy(sol)[1, ] # Calculate proportion of actions taken in the simulation round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2) # reward distribution hist(sim$reward) ## Example 2: simulate starting following a uniform distribution over all # states and return all trajectories sim < sample_MDP(sol, n = 100, start = "uniform", horizon = 10, return_trajectories = TRUE ) head(sim$trajectories) # how often was each state visited? table(sim$trajectories$s)
# enable parallel simulation # doParallel::registerDoParallel() data(Maze) # solve the MDP for 5 epochs and no discounting sol < solve_MDP(Maze, discount = 1) sol # U in the policy is and estimate of the utility of being in a state when using the optimal policy. policy(sol) gw_matrix(sol, what = "action") ## Example 1: simulate 100 trajectories following the policy, # only the final belief state is returned sim < sample_MDP(sol, n = 100, horizon = 10, verbose = TRUE) sim # Note that all simulations start at s_1 and that the simulated avg. reward # is therefore an estimate to the U value for the start state s_1. policy(sol)[1, ] # Calculate proportion of actions taken in the simulation round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2) # reward distribution hist(sim$reward) ## Example 2: simulate starting following a uniform distribution over all # states and return all trajectories sim < sample_MDP(sol, n = 100, start = "uniform", horizon = 10, return_trajectories = TRUE ) head(sim$trajectories) # how often was each state visited? table(sim$trajectories$s)
Implementation of value iteration, modified policy iteration and other methods based on reinforcement learning techniques to solve finite state space MDPs.
solve_MDP(model, method = "value_iteration", ...) solve_MDP_DP( model, method = "value_iteration", horizon = NULL, discount = NULL, iter_max = 1000L, error = 0.001, k_backups = 10L, U = NULL, matrix = TRUE, continue = FALSE, progress = TRUE, verbose = FALSE, ... ) solve_MDP_LP( model, method = "lp", horizon = NULL, discount = NULL, verbose = FALSE, ... ) solve_MDP_MC( model, method = "MC_exploring_starts", horizon = NULL, discount = NULL, n = 100, ..., Q = NULL, epsilon = 0.2, first_visit = TRUE, matrix = TRUE, continue = FALSE, progress = TRUE, verbose = FALSE ) solve_MDP_TD( model, method = "q_learning", horizon = NULL, discount = NULL, alpha = 0.5, epsilon = 0.2, n = 1000, Q = NULL, matrix = TRUE, continue = FALSE, progress = TRUE, verbose = FALSE ) solve_MDP_sampling( model, method = "q_planning", horizon = NULL, discount = NULL, alpha = function(t, n) 1/n, n = 1000, Q = NULL, continue = FALSE, progress = TRUE, verbose = FALSE )
solve_MDP(model, method = "value_iteration", ...) solve_MDP_DP( model, method = "value_iteration", horizon = NULL, discount = NULL, iter_max = 1000L, error = 0.001, k_backups = 10L, U = NULL, matrix = TRUE, continue = FALSE, progress = TRUE, verbose = FALSE, ... ) solve_MDP_LP( model, method = "lp", horizon = NULL, discount = NULL, verbose = FALSE, ... ) solve_MDP_MC( model, method = "MC_exploring_starts", horizon = NULL, discount = NULL, n = 100, ..., Q = NULL, epsilon = 0.2, first_visit = TRUE, matrix = TRUE, continue = FALSE, progress = TRUE, verbose = FALSE ) solve_MDP_TD( model, method = "q_learning", horizon = NULL, discount = NULL, alpha = 0.5, epsilon = 0.2, n = 1000, Q = NULL, matrix = TRUE, continue = FALSE, progress = TRUE, verbose = FALSE ) solve_MDP_sampling( model, method = "q_planning", horizon = NULL, discount = NULL, alpha = function(t, n) 1/n, n = 1000, Q = NULL, continue = FALSE, progress = TRUE, verbose = FALSE )
model 
an MDP problem specification. 
method 
string; one of the following solution methods:

... 
further parameters are passed on to the solver function. 
horizon 
an integer with the number of epochs for problems with a
finite planning horizon. If set to 
discount 
discount factor in range 
iter_max 
maximum number of iterations allowed to converge. If the maximum is reached then the nonconverged solution is returned with a warning. 
error 
value iteration: maximum error allowed in the utility of any state (i.e., the maximum policy loss) used as the termination criterion. 
k_backups 
policy iteration: number of look ahead steps used for approximate policy evaluation used by the policy iteration method. 
U 
a vector with initial utilities used for each state. If

matrix 
logical; if 
continue 
logical; Continue with an unconverged solution specified in 
progress 
logical; show a progress bar with estimated time for completion. 
verbose 
logical, if set to 
n 
number of episodes used for learning. 
Q 
a action value matrix. 
epsilon 
used for 
first_visit 
if 
alpha 
step size in 
Several solvers are available. Note that some solvers are only implemented for finitehorizon problems.
Implemented are the following dynamic programming methods (following Russell and Norvig, 2010):
Modified Policy Iteration (Howard 1960; Puterman and Shin 1978) starts with a random policy and iteratively performs a sequence of
approximate policy evaluation (estimate the value function for the
current policy using k_backups
and function policy_evaluation()
, and
policy improvement (calculate a greedy policy given the value function). The algorithm stops when it converges to a stable policy (i.e., no changes between two iterations).
Value Iteration (Bellman 1957) starts with
an arbitrary value function (by default all 0s) and iteratively
updates the value function for each state using the Bellman equation.
The iterations
are terminated either after iter_max
iterations or when the solution converges.
Approximate convergence is achieved
for discounted problems (with $\gamma < 1$
)
when the maximal value function change for any state $\delta$
is
$\delta \le error (1\gamma) / \gamma$
. It can be shown that this means
that no state value is more than
$error$
from the value in the optimal value function. For undiscounted
problems, we use $\delta \le error$
.
A greedy policy is calculated from the final value function. Value iteration can be seen as policy iteration with truncated policy evaluation.
Prioritized Sweeping (Moore and Atkeson, 1993; Andre et al., 1997; Li and Littman, 2008) approximate the optimal value function by iteratively adjusting one state at a time. The state to be updated is chosen depending on its priority which reflects how much a state value may change given the most recently updated other states that can be directly reached via an action. This update order often lead to faster convergence compared to sweeping the whole state state in regular value iteration.
We implement the two priority update strategies described as PS and GenPS by Li and Littman.
PS (Moore and Atkeson, 1993) updates the priority of a state $H(s)$
using:
$\forall{s \in S}: H_{t+1}(s) \leftarrow \begin{cases}
\max(H_{t}(s), \Delta_t \max_{a \in A}(T(s_ts,a)) \text{ for } s \ne s_{t+1} \\
\Delta_t \max_{a \in A}(T(s_ts,a) \text{ for } s = s_{t+1}
\end{cases}$
where $\Delta_t = V_{t+1}(s_t)  V_t(s_t) = E(s_t; V_{t+1})$
, i.e.,
the Bellman error for the updated state.
GenPS (Andre et al., 1997) updates all state priorities using their current Bellman error:
$\forall{s \in S}: H_{t+1}(s) \leftarrow E(s; V_{t+1})$
where $E(s; V_{t+1}) = \max_{a \in A} \left[R(s,a) + \gamma \sum_{s \in S} T(s's,a) V(s')\right]  V(s)$
.
is a state's Bellman error.
The update method can be chosen using the additional parameter H_update
as the character string "PS_random"
, "PS_error"
or "GenPS"
.
The default is H_update = "GenPS"
. For PS, random means that the
priority vector is initialized with random values (larger than 0),
and error means they are initialized with the Bellman error as in
GenPS. However, this requires one complete sweep over all states.
This implementation stops updating when the largest priority values
over all states is less than the specified error
.
Since the logarithm does not sweep through the whole state space for each
iteration, iter_max
is converted into an equivalent number of state updates
n = \text{iter_max} S
.
Note that policies converge earlier than value functions.
The following linear programming formulation (Manne 1960) is implemented. For the optimal value function, the Bellman equation holds:
$V^*(s) = \max_{a \in A}\sum_{s' \in S} T(s, a, s') [ R(s, a, s') + \gamma V^*(s')]\; \forall a\in A, s \in S$
We can find the optimal value function by solving the following linear program:
$\text{min} \sum_{s\in S} V(s)$
subject to
$V(s) \ge \sum_{s' \in S} T(s, a, s') [R(s, a, s') + \gamma V(s')],\; \forall a\in A, s \in S$
Note:
The discounting factor has to be strictly less than 1.
Additional parameters to to solve_MDP
are passed on to lpSolve::lp()
.
We use the solver in
lpSolve::lp()
which requires all decision variables (state values) to be nonnegative.
To ensure this, for negative rewards, all rewards as shifted so the
smallest reward is
0. This transformation does not change the optimal policy.
Monte Carlo control simulates a whole episode using the current behavior policy and then updates the target policy before simulating the next episode. Implemented are the following temporal difference control methods described in Sutton and Barto (2020).
Monte Carlo Control with exploring Starts uses the same greedy policy for behavior and target (onpolicy). To make sure all states/action pairs are explored, it uses exploring starts meaning that new episodes are started at a randomly chosen state using a randomly chooses action.
Onpolicy Monte Carlo Control uses for behavior and as the target policy an epsilongreedy policy.
Offpolicy Monte Carlo Control uses for behavior an arbitrary policy (we use an epsilongreedy policy) and learns a greedy policy using importance sampling.
Implemented are several temporal difference control methods
described in Sutton and Barto (2020).
Note that the MDP transition and reward models are used
for these reinforcement learning methods only to sample from
the environment.
The algorithms use a step size parameter $\alpha$
(learning rate) for the
updates and the exploration parameter $\epsilon$
for
the $\epsilon$
greedy behavior policy.
If the model has absorbing states to terminate episodes, then no maximal episode length
(horizon
) needs to
be specified. To make sure that the algorithm does finish in a reasonable amount of time,
episodes are stopped after 1000 actions (with a warning). For models without absorbing states,
the episode length has to be specified via horizon
.
QLearning (Watkins and Dayan 1992) is an offpolicy temporal difference method that uses
an $\epsilon$
greedy behavior policy and learns a greedy target
policy.
Sarsa (Rummery and Niranjan 1994) is an onpolicy method that follows and learns
an $\epsilon$
greedy policy. The final $\epsilon$
greedy policy
is converted into a greedy policy.
Expected Sarsa (R. S. Sutton and Barto 2018). We implement an onpolicy version that uses
the expected value under the current policy for the update.
It moves deterministically in the same direction as Sarsa would
move in expectation. Because it uses the expectation, we can
set the step size $\alpha$
to large values and 1 is common.
A simple, not very effective, planning method proposed by Sutton and Barto (2020) in Chapter 8.
Randomsample onestep tabular Qplanning randomly selects a state/action pair and samples the resulting reward and next state from the model. This information is used to update a single Qtable value.
solve_MDP()
returns an object of class MDP which is a list with the
model specifications (model
), the solution (solution
).
The solution is a list with the elements:
policy
a list representing the policy graph. The list only has one
element for converged solutions.
converged
did the algorithm converge (NA
) for finitehorizon problems.
delta
final $\delta$
(value iteration and infinitehorizon only)
iterations
number of iterations to convergence (infinitehorizon only)
Michael Hahsler
Andre, D., Friedman, N., and Parr, R. 1997. "Generalized prioritized sweeping." In Advances in Neural Information Processing Systems 10, pp. 10011007. NeurIPS Proceedings
Bellman, Richard. 1957. "A Markovian Decision Process." Indiana University Mathematics Journal 6: 67984. https://www.jstor.org/stable/24900506.
Howard, R. A. 1960. "Dynamic Programming and Markov Processes." Cambridge, MA: MIT Press.
Li, Lihong, and Michael Littman. 2008. "Prioritized Sweeping Converges to the Optimal Value Function." DCSTR631. Rutgers University. doi:10.7282/T3TX3JSX
Manne, Alan. 1960. "On the JobShop Scheduling Problem." Operations Research 8 (2): 21923. doi:10.1287/opre.8.2.219.
Moore, Andrew, and C. G. Atkeson. 1993. "Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time." Machine Learning 13 (1): 103–30. doi:10.1007/BF00993104.
Puterman, Martin L., and Moon Chirl Shin. 1978. "Modified Policy Iteration Algorithms for Discounted Markov Decision Problems." Management Science 24: 112737. doi:10.1287/mnsc.24.11.1127.
Rummery, G., and Mahesan Niranjan. 1994. "OnLine QLearning Using Connectionist Systems." Techreport CUED/FINFENG/TR 166. Cambridge University Engineering Department.
Russell, Stuart J., and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach (4th Edition). Pearson. http://aima.cs.berkeley.edu/.
Sutton, R. 1988. "Learning to Predict by the Method of Temporal Differences." Machine Learning 3: 944. https://link.springer.com/article/10.1007/BF00115009.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. Second. The MIT Press. http://incompleteideas.net/book/thebook2nd.html.
Watkins, Christopher J. C. H., and Peter Dayan. 1992. "QLearning." Machine Learning 8 (3): 27992. doi:10.1007/BF00992698.
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
transition_graph()
,
unreachable_and_absorbing
,
value_function()
data(Maze) Maze # use value iteration maze_solved < solve_MDP(Maze, method = "value_iteration") maze_solved policy(maze_solved) # plot the value function U plot_value_function(maze_solved) # Gridworld solutions can be visualized gw_plot(maze_solved) # Use linear programming maze_solved < solve_MDP(Maze, method = "lp") maze_solved policy(maze_solved) # use prioritized sweeping (which is known to be fast for mazes) maze_solved < solve_MDP(Maze, method = "prioritized_sweeping") policy(maze_solved) # finite horizon maze_solved < solve_MDP(Maze, method = "value_iteration", horizon = 3) policy(maze_solved) gw_plot(maze_solved, epoch = 1) gw_plot(maze_solved, epoch = 2) gw_plot(maze_solved, epoch = 3) # create a random policy where action n is very likely and approximate # the value function. We change the discount factor to .9 for this. Maze_discounted < Maze Maze_discounted$discount < .9 pi < random_policy(Maze_discounted, prob = c(n = .7, e = .1, s = .1, w = 0.1) ) pi # compare the utility function for the random policy with the function for the optimal # policy found by the solver. maze_solved < solve_MDP(Maze) policy_evaluation(Maze, pi, k_backup = 100) policy_evaluation(Maze, policy(maze_solved), k_backup = 100) # Note that the solver already calculates the utility function and returns it with the policy policy(maze_solved) # Learn a Policy using QLearning maze_learned < solve_MDP(Maze, method = "q_learning", n = 100, horizon = 100) maze_learned maze_learned$solution policy(maze_learned) plot_value_function(maze_learned) gw_plot(maze_learned)
data(Maze) Maze # use value iteration maze_solved < solve_MDP(Maze, method = "value_iteration") maze_solved policy(maze_solved) # plot the value function U plot_value_function(maze_solved) # Gridworld solutions can be visualized gw_plot(maze_solved) # Use linear programming maze_solved < solve_MDP(Maze, method = "lp") maze_solved policy(maze_solved) # use prioritized sweeping (which is known to be fast for mazes) maze_solved < solve_MDP(Maze, method = "prioritized_sweeping") policy(maze_solved) # finite horizon maze_solved < solve_MDP(Maze, method = "value_iteration", horizon = 3) policy(maze_solved) gw_plot(maze_solved, epoch = 1) gw_plot(maze_solved, epoch = 2) gw_plot(maze_solved, epoch = 3) # create a random policy where action n is very likely and approximate # the value function. We change the discount factor to .9 for this. Maze_discounted < Maze Maze_discounted$discount < .9 pi < random_policy(Maze_discounted, prob = c(n = .7, e = .1, s = .1, w = 0.1) ) pi # compare the utility function for the random policy with the function for the optimal # policy found by the solver. maze_solved < solve_MDP(Maze) policy_evaluation(Maze, pi, k_backup = 100) policy_evaluation(Maze, policy(maze_solved), k_backup = 100) # Note that the solver already calculates the utility function and returns it with the policy policy(maze_solved) # Learn a Policy using QLearning maze_learned < solve_MDP(Maze, method = "q_learning", n = 100, horizon = 100) maze_learned maze_learned$solution policy(maze_learned) plot_value_function(maze_learned) gw_plot(maze_learned)
Returns the transition model as an igraph object.
transition_graph( x, action = NULL, state_col = NULL, simplify_transitions = TRUE, remove_unavailable_actions = TRUE ) plot_transition_graph( x, action = NULL, state_col = NULL, simplify_transitions = TRUE, main = NULL, ... ) curve_multiple_directed(graph, start = 0.3)
transition_graph( x, action = NULL, state_col = NULL, simplify_transitions = TRUE, remove_unavailable_actions = TRUE ) plot_transition_graph( x, action = NULL, state_col = NULL, simplify_transitions = TRUE, main = NULL, ... ) curve_multiple_directed(graph, start = 0.3)
x 
object of class MDP. 
action 
the name or id of an action or a set of actions. By default the transition model for all actions is returned. 
state_col 
colors used to represent the states. 
simplify_transitions 
logical; combine parallel transition arcs into a single arc. 
remove_unavailable_actions 
logical; don't show arrows for unavailable actions. 
main 
a main title for the plot. 
... 
further arguments are passed on to 
graph 
The input graph. 
start 
The curvature at the two extreme edges. 
The transition model of an MDP is a Markov Chain. This function extracts the transition model as an igraph object.
returns the transition model as an igraph object.
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
unreachable_and_absorbing
,
value_function()
data("Maze") g < transition_graph(Maze) g plot_transition_graph(Maze) plot_transition_graph(Maze, vertex.size = 20, edge.label.cex = .1, edge.arrow.size = .5, margin = .5 ) ## Plot using the igraph library library(igraph) plot(g) # plot with a different layout plot(g, layout = igraph::layout_with_sugiyama, vertex.size = 20, edge.label.cex = .6 ) ## Use visNetwork (if installed) if (require(visNetwork)) { g_vn < toVisNetworkData(g) nodes < g_vn$nodes edges < g_vn$edges visNetwork(nodes, edges) %>% visNodes(physics = FALSE) %>% visEdges(smooth = list(type = "curvedCW", roundness = .6), arrows = "to") }
data("Maze") g < transition_graph(Maze) g plot_transition_graph(Maze) plot_transition_graph(Maze, vertex.size = 20, edge.label.cex = .1, edge.arrow.size = .5, margin = .5 ) ## Plot using the igraph library library(igraph) plot(g) # plot with a different layout plot(g, layout = igraph::layout_with_sugiyama, vertex.size = 20, edge.label.cex = .6 ) ## Use visNetwork (if installed) if (require(visNetwork)) { g_vn < toVisNetworkData(g) nodes < g_vn$nodes edges < g_vn$edges visNetwork(nodes, edges) %>% visNodes(physics = FALSE) %>% visEdges(smooth = list(type = "curvedCW", roundness = .6), arrows = "to") }
Find unreachable and absorbing states using the transition model.
unreachable_states( model, horizon = Inf, sparse = "states", use_precomputed = TRUE, progress = TRUE, ... ) absorbing_states( model, state = NULL, sparse = "states", use_precomputed = TRUE, ... ) remove_unreachable_states(model, use_precomputed = TRUE, ...)
unreachable_states( model, horizon = Inf, sparse = "states", use_precomputed = TRUE, progress = TRUE, ... ) absorbing_states( model, state = NULL, sparse = "states", use_precomputed = TRUE, ... ) remove_unreachable_states(model, use_precomputed = TRUE, ...)
model 
a MDP object. 
horizon 
only states that can be reached within the horizon are reachable. 
sparse 
logical; return a sparse logical vector? 
use_precomputed 
logical; should precomputed values in the MDP be used? 
progress 
logical; show a progress bar? 
... 
further arguments are passed on. 
state 
logical; check a single state. This can be much faster if the model contains a transition model implemented as a function. 
The function unreachable_states()
checks if states
cannot be reached from any other state. It performs a depthfirst
search which can be slow.
The search breaks cycles to avoid an infinite loop.
The search depth can be restricted using
horizon
.
The function remove_unreachable_states()
simplifies a model by
removing unreachable states from the model description.
The function absorbing_states()
checks if a state or a set of states are
absorbing (terminal states). A state is absorbing
if there is for all actions a probability of 1 for staying in the state.
unreachable_states()
returns a
logical vector indicating the unreachable states.
absorbing_states()
returns a logical vector indicating
if the states are absorbing (terminal).
remove_unreachable_states()
returns a model with all
unreachable states removed.
Michael Hahsler
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
data(Maze) gw_matrix(Maze) gw_matrix(Maze, what = "labels") # 1 and +1 are absorbing states absorbing_states(Maze) absorbing_states(Maze, sparse = FALSE) absorbing_states(Maze, sparse = "states") # all states in the model are reachable unreachable_states(Maze) unreachable_states(Maze, sparse = FALSE) unreachable_states(Maze, sparse = "states")
data(Maze) gw_matrix(Maze) gw_matrix(Maze, what = "labels") # 1 and +1 are absorbing states absorbing_states(Maze) absorbing_states(Maze, sparse = FALSE) absorbing_states(Maze, sparse = "states") # all states in the model are reachable unreachable_states(Maze) unreachable_states(Maze, sparse = FALSE) unreachable_states(Maze, sparse = "states")
Extracts the value function from a solved MDP.
value_function(model, drop = TRUE) plot_value_function( model, epoch = 1, legend = TRUE, col = NULL, ylab = "Value", las = 3, main = NULL, ... )
value_function(model, drop = TRUE) plot_value_function( model, epoch = 1, legend = TRUE, col = NULL, ylab = "Value", las = 3, main = NULL, ... )
model 
a solved MDP. 
drop 
logical; drop the list for converged, epochindependent value functions. 
epoch 
epoch for finite time horizon solutions. 
legend 
logical; show legend. 
col , ylab , las

are passed on to 
main 
a main title for the plot. Defaults to the name of the problem. 
... 
further arguments are passed on to 
the function as a numeric vector with one value for each state.
Michael Hahsler
Other policy:
action()
,
add_policy()
,
bellman_update()
,
policy()
,
policy_evaluation()
,
q_values()
,
reward()
Other MDP:
MDP()
,
accessors
,
act()
,
add_policy()
,
available_actions()
,
bellman_update()
,
gridworld
,
policy_evaluation()
,
q_values()
,
regret()
,
sample_MDP()
,
solve_MDP()
,
transition_graph()
,
unreachable_and_absorbing
data("Maze") sol < solve_MDP(Maze) sol value_function(sol) plot_value_function(sol) ## finitehorizon problem sol < solve_MDP(Maze, horizon = 3) policy(sol) value_function(sol) plot_value_function(sol, epoch = 1) plot_value_function(sol, epoch = 2) plot_value_function(sol, epoch = 3) # For a gridworld we can also plot is like this gw_plot(sol, epoch = 1) gw_plot(sol, epoch = 2) gw_plot(sol, epoch = 3)
data("Maze") sol < solve_MDP(Maze) sol value_function(sol) plot_value_function(sol) ## finitehorizon problem sol < solve_MDP(Maze, horizon = 3) policy(sol) value_function(sol) plot_value_function(sol, epoch = 1) plot_value_function(sol, epoch = 2) plot_value_function(sol, epoch = 3) # For a gridworld we can also plot is like this gw_plot(sol, epoch = 1) gw_plot(sol, epoch = 2) gw_plot(sol, epoch = 3)
The Windy gridworld MDP example from Chapter 6 of the textbook "Reinforcement Learning: An Introduction."
An object of class MDP.
The gridworld has the following layout:
The grid world is represented as a 7 x 10 matrix of states. In the middle region the next states are shifted upward by wind (the strength in number of squares is given below each column). For example, if the agent is one cell to the right of the goal, then the action left takes the agent to the cell just above the goal.
No discounting is used (i.e., $\gamma = 1$
).
Richard S. Sutton and Andrew G. Barto (2018). Reinforcement Learning: An Introduction Second Edition, MIT Press, Cambridge, MA.
Other MDP_examples:
Cliff_walking
,
DynaMaze
,
MDP()
,
Maze
Other gridworld:
Cliff_walking
,
DynaMaze
,
Maze
,
gridworld
data(Windy_gridworld) Windy_gridworld gw_matrix(Windy_gridworld) gw_matrix(Windy_gridworld, what = "labels") gw_plot(Windy_gridworld) # The Goal is an absorbing state absorbing_states(Windy_gridworld, sparse = "states") # visualize the transition graph gw_plot_transition_graph(Windy_gridworld) # solve using value iteration sol < solve_MDP(Windy_gridworld) sol policy(sol) gw_plot(sol)
data(Windy_gridworld) Windy_gridworld gw_matrix(Windy_gridworld) gw_matrix(Windy_gridworld, what = "labels") gw_plot(Windy_gridworld) # The Goal is an absorbing state absorbing_states(Windy_gridworld, sparse = "states") # visualize the transition graph gw_plot_transition_graph(Windy_gridworld) # solve using value iteration sol < solve_MDP(Windy_gridworld) sol policy(sol) gw_plot(sol)