Title: | Infrastructure for Partially Observable Markov Decision Processes (POMDP) |
---|---|
Description: | Provides the infrastructure to define and analyze the solutions of Partially Observable Markov Decision Process (POMDP) models. Interfaces for various exact and approximate solution algorithms are available including value iteration, point-based value iteration and SARSOP. Smallwood and Sondik (1973) <doi:10.1287/opre.21.5.1071>. |
Authors: | Michael Hahsler [aut, cph, cre] , Hossein Kamalzadeh [ctb] |
Maintainer: | Michael Hahsler <[email protected]> |
License: | GPL (>=3) |
Version: | 1.2.4 |
Built: | 2025-01-04 06:30:15 UTC |
Source: | https://github.com/mhahsler/pomdp |
Functions to provide uniform access to different parts of the POMDP/MDP problem description.
start_vector(x) normalize_POMDP( x, sparse = TRUE, trans_start = FALSE, trans_function = TRUE, trans_keyword = FALSE ) normalize_MDP( x, sparse = TRUE, trans_start = FALSE, trans_function = TRUE, trans_keyword = FALSE ) reward_matrix( x, action = NULL, start.state = NULL, end.state = NULL, observation = NULL, episode = NULL, epoch = NULL, sparse = FALSE ) reward_val( x, action, start.state, end.state = NULL, observation = NULL, episode = NULL, epoch = NULL ) transition_matrix( x, action = NULL, start.state = NULL, end.state = NULL, episode = NULL, epoch = NULL, sparse = FALSE, trans_keyword = TRUE ) transition_val(x, action, start.state, end.state, episode = NULL, epoch = NULL) observation_matrix( x, action = NULL, end.state = NULL, observation = NULL, episode = NULL, epoch = NULL, sparse = FALSE, trans_keyword = TRUE ) observation_val( x, action, end.state, observation, episode = NULL, epoch = NULL )
start_vector(x) normalize_POMDP( x, sparse = TRUE, trans_start = FALSE, trans_function = TRUE, trans_keyword = FALSE ) normalize_MDP( x, sparse = TRUE, trans_start = FALSE, trans_function = TRUE, trans_keyword = FALSE ) reward_matrix( x, action = NULL, start.state = NULL, end.state = NULL, observation = NULL, episode = NULL, epoch = NULL, sparse = FALSE ) reward_val( x, action, start.state, end.state = NULL, observation = NULL, episode = NULL, epoch = NULL ) transition_matrix( x, action = NULL, start.state = NULL, end.state = NULL, episode = NULL, epoch = NULL, sparse = FALSE, trans_keyword = TRUE ) transition_val(x, action, start.state, end.state, episode = NULL, epoch = NULL) observation_matrix( x, action = NULL, end.state = NULL, observation = NULL, episode = NULL, epoch = NULL, sparse = FALSE, trans_keyword = TRUE ) observation_val( x, action, end.state, observation, episode = NULL, epoch = NULL )
x |
|
sparse |
logical; use sparse matrices when the density is below 50% and keeps data.frame representation
for the reward field. |
trans_start |
logical; expand the start to a probability vector? |
trans_function |
logical; convert functions into matrices? |
trans_keyword |
logical; convert distribution keywords (uniform and identity)
in |
action |
name or index of an action. |
start.state , end.state
|
name or index of the state. |
observation |
name or index of observation. |
episode , epoch
|
Episode or epoch used for time-dependent POMDPs. Epochs are internally converted to the episode using the model horizon. |
Several parts of the POMDP/MDP description can be defined in different ways. In particular,
the fields transition_prob
, observation_prob
, reward
, and start
can be defined using matrices, data frames,
keywords, or functions. See POMDP for details. The functions provided here, provide unified access to the data in these fields
to make writing code easier.
transition_matrix()
accesses the transition model. The complete model
is a list with one element for each action. Each element contains a states x states matrix
with (
start.state
) as rows and (
end.state
) as columns.
Matrices with a density below 50% can be requested in sparse format
(as a Matrix::dgCMatrix).
observation_matrix()
accesses the observation model. The complete model is a
list with one element for each action. Each element contains a states x observations matrix
with (
start.state
) as rows and (
observation
) as columns.
Matrices with a density below 50% can be requested in sparse format
(as a Matrix::dgCMatrix)
reward_matrix()
accesses the reward model.
The preferred representation is a data.frame with the
columns action
, start.state
, end.state
,
observation
, and value
. This is a sparse representation.
The dense representation is a list of lists of matrices.
The list levels are (
action
) and (
start.state
).
The matrices have rows representing (
end.state
)
and columns representing (
observations
).
The reward structure cannot be efficiently stored using a standard sparse matrix
since there might be a fixed cost for each action
resulting in no entries with 0.
start_vector()
translates the initial probability vector description into a numeric vector.
normalize_POMDP()
returns a new POMDP definition where transition_prob
,
observations_prob
, reward
, and start
are normalized.
Also, states
, actions
, and observations
are ordered as given in the problem
definition to make safe access using numerical indices possible. Normalized POMDP descriptions can be
used in custom code that expects consistently a certain format.
A list or a list of lists of matrices.
Michael Hahsler
Other POMDP:
MDP2POMDP
,
POMDP()
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
actions()
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
data("Tiger") # List of |A| transition matrices. One per action in the from start.states x end.states Tiger$transition_prob transition_matrix(Tiger) transition_val(Tiger, action = "listen", start.state = "tiger-left", end.state = "tiger-left") # List of |A| observation matrices. One per action in the from states x observations Tiger$observation_prob observation_matrix(Tiger) observation_val(Tiger, action = "listen", end.state = "tiger-left", observation = "tiger-left") # List of list of reward matrices. 1st level is action and second level is the # start state in the form end state x observation Tiger$reward reward_matrix(Tiger) reward_matrix(Tiger, sparse = TRUE) reward_matrix(Tiger, action = "open-right", start.state = "tiger-left", end.state = "tiger-left", observation = "tiger-left") # Translate the initial belief vector Tiger$start start_vector(Tiger) # Normalize the whole model Tiger_norm <- normalize_POMDP(Tiger) Tiger_norm$transition_prob ## Visualize transition matrix for action 'open-left' plot_transition_graph(Tiger) ## Use a function for the Tiger transition model trans <- function(action, end.state, start.state) { ## listen has an identity matrix if (action == 'listen') if (end.state == start.state) return(1) else return(0) # other actions have a uniform distribution return(1/2) } Tiger$transition_prob <- trans # transition_matrix evaluates the function transition_matrix(Tiger)
data("Tiger") # List of |A| transition matrices. One per action in the from start.states x end.states Tiger$transition_prob transition_matrix(Tiger) transition_val(Tiger, action = "listen", start.state = "tiger-left", end.state = "tiger-left") # List of |A| observation matrices. One per action in the from states x observations Tiger$observation_prob observation_matrix(Tiger) observation_val(Tiger, action = "listen", end.state = "tiger-left", observation = "tiger-left") # List of list of reward matrices. 1st level is action and second level is the # start state in the form end state x observation Tiger$reward reward_matrix(Tiger) reward_matrix(Tiger, sparse = TRUE) reward_matrix(Tiger, action = "open-right", start.state = "tiger-left", end.state = "tiger-left", observation = "tiger-left") # Translate the initial belief vector Tiger$start start_vector(Tiger) # Normalize the whole model Tiger_norm <- normalize_POMDP(Tiger) Tiger_norm$transition_prob ## Visualize transition matrix for action 'open-left' plot_transition_graph(Tiger) ## Use a function for the Tiger transition model trans <- function(action, end.state, start.state) { ## listen has an identity matrix if (action == 'listen') if (end.state == start.state) return(1) else return(0) # other actions have a uniform distribution return(1/2) } Tiger$transition_prob <- trans # transition_matrix evaluates the function transition_matrix(Tiger)
Determine the set of actions available in a state.
actions(x, state)
actions(x, state)
x |
a |
state |
a character vector of length one specifying the state. |
Unavailable actions are modeled here a actions that have an immediate
reward of -Inf
in the reward function.
a character vector with the available actions.
a vector with the available actions.
Michael Hahsler
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
data(RussianTiger) # The normal actions are "listen", "open-left", and "open-right". # In the state "done" only the action "nothing" is available. actions(RussianTiger, state = "tiger-left") actions(RussianTiger, state = "tiger-right") actions(RussianTiger, state = "done")
data(RussianTiger) # The normal actions are "listen", "open-left", and "open-right". # In the state "done" only the action "nothing" is available. actions(RussianTiger, state = "tiger-left") actions(RussianTiger, state = "tiger-right") actions(RussianTiger, state = "done")
Add a policy to a POMDP problem description allows the user to test policies on modified problem descriptions or to test manually created policies.
add_policy(model, policy)
add_policy(model, policy)
model |
a POMDP or MDP model description. |
policy |
a policy data.frame. |
The model description with the added policy.
Michael Hahsler
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
actions()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
data(Tiger) sol <- solve_POMDP(Tiger) sol # Example 1: Use the solution policy on a changed POMDP problem # where listening is perfect and simulate the expected reward perfect_Tiger <- Tiger perfect_Tiger$observation_prob <- list( listen = diag(1, length(perfect_Tiger$states), length(perfect_Tiger$observations)), `open-left` = "uniform", `open-right` = "uniform" ) sol_perfect <- add_policy(perfect_Tiger, sol) sol_perfect simulate_POMDP(sol_perfect, n = 1000)$avg_reward # Example 2: Handcraft a policy and apply it to the Tiger problem # original policy policy(sol) plot_value_function(sol) plot_belief_space(sol) # create a policy manually where the agent opens a door at a believe of # roughly 2/3 (note the alpha vectors do not represent # a valid value function) p <- list( data.frame( `tiger-left` = c(1, 0, -2), `tiger-right` = c(-2, 0, 1), action = c("open-right", "listen", "open-left"), check.names = FALSE )) p custom_sol <- add_policy(Tiger, p) custom_sol policy(custom_sol) plot_value_function(custom_sol) plot_belief_space(custom_sol) simulate_POMDP(custom_sol, n = 1000)$avg_reward
data(Tiger) sol <- solve_POMDP(Tiger) sol # Example 1: Use the solution policy on a changed POMDP problem # where listening is perfect and simulate the expected reward perfect_Tiger <- Tiger perfect_Tiger$observation_prob <- list( listen = diag(1, length(perfect_Tiger$states), length(perfect_Tiger$observations)), `open-left` = "uniform", `open-right` = "uniform" ) sol_perfect <- add_policy(perfect_Tiger, sol) sol_perfect simulate_POMDP(sol_perfect, n = 1000)$avg_reward # Example 2: Handcraft a policy and apply it to the Tiger problem # original policy policy(sol) plot_value_function(sol) plot_belief_space(sol) # create a policy manually where the agent opens a door at a believe of # roughly 2/3 (note the alpha vectors do not represent # a valid value function) p <- list( data.frame( `tiger-left` = c(1, 0, -2), `tiger-right` = c(-2, 0, 1), action = c("open-right", "listen", "open-left"), check.names = FALSE )) p custom_sol <- add_policy(Tiger, p) custom_sol policy(custom_sol) plot_value_function(custom_sol) plot_belief_space(custom_sol) simulate_POMDP(custom_sol, n = 1000)$avg_reward
The cliff walking gridworld MDP example from Chapter 6 of the textbook "Reinforcement Learning: An Introduction."
An object of class MDP.
The cliff walking gridworld has the following layout:
The gridworld is represented as a 4 x 12 matrix of states.
The states are labeled with their x and y coordinates.
The start state is in the bottom left corner.
Each action has a reward of -1, falling off the cliff has a reward of -100 and
returns the agent back to the start. The episode is finished once the agent
reaches the absorbing goal state in the bottom right corner.
No discounting is used (i.e., ).
Richard S. Sutton and Andrew G. Barto (2018). Reinforcement Learning: An Introduction Second Edition, MIT Press, Cambridge, MA.
Other MDP_examples:
DynaMaze
,
MDP()
,
Maze
,
Windy_gridworld
Other gridworld:
DynaMaze
,
Maze
,
Windy_gridworld
,
gridworld
data(Cliff_walking) Cliff_walking gridworld_matrix(Cliff_walking) gridworld_matrix(Cliff_walking, what = "labels") # The Goal is an absorbing state which(absorbing_states(Cliff_walking)) # visualize the transition graph gridworld_plot_transition_graph(Cliff_walking) # solve using different methods sol <- solve_MDP(Cliff_walking) sol policy(sol) gridworld_plot_policy(sol) sol <- solve_MDP(Cliff_walking, method = "q_learning", N = 100) sol policy(sol) gridworld_plot_policy(sol) sol <- solve_MDP(Cliff_walking, method = "sarsa", N = 100) sol policy(sol) gridworld_plot_policy(sol) sol <- solve_MDP(Cliff_walking, method = "expected_sarsa", N = 100, alpha = 1) policy(sol) gridworld_plot_policy(sol)
data(Cliff_walking) Cliff_walking gridworld_matrix(Cliff_walking) gridworld_matrix(Cliff_walking, what = "labels") # The Goal is an absorbing state which(absorbing_states(Cliff_walking)) # visualize the transition graph gridworld_plot_transition_graph(Cliff_walking) # solve using different methods sol <- solve_MDP(Cliff_walking) sol policy(sol) gridworld_plot_policy(sol) sol <- solve_MDP(Cliff_walking, method = "q_learning", N = 100) sol policy(sol) gridworld_plot_policy(sol) sol <- solve_MDP(Cliff_walking, method = "sarsa", N = 100) sol policy(sol) gridworld_plot_policy(sol) sol <- solve_MDP(Cliff_walking, method = "expected_sarsa", N = 100, alpha = 1) policy(sol) gridworld_plot_policy(sol)
Default discrete and continuous colors used in pomdp for states (nodes), beliefs and values.
colors_discrete(n, col = NULL) colors_continuous(val, col = NULL)
colors_discrete(n, col = NULL) colors_continuous(val, col = NULL)
n |
number of states. |
col |
custom color palette. |
val |
a vector with values to be translated to colors. |
colors_discrete()
returns a color palette and
colors_continuous()
returns the colors associated with the supplied values.
colors_discrete(5) colors_continuous(runif(10))
colors_discrete(5) colors_continuous(runif(10))
The Dyna Maze from Chapter 8 of the textbook "Reinforcement Learning: An Introduction."
An object of class MDP.
The simple 6x9 maze with a few walls.
Richard S. Sutton and Andrew G. Barto (2018). Reinforcement Learning: An Introduction Second Edition, MIT Press, Cambridge, MA.
Other MDP_examples:
Cliff_walking
,
MDP()
,
Maze
,
Windy_gridworld
Other gridworld:
Cliff_walking
,
Maze
,
Windy_gridworld
,
gridworld
Other MDP_examples:
Cliff_walking
,
MDP()
,
Maze
,
Windy_gridworld
Other gridworld:
Cliff_walking
,
Maze
,
Windy_gridworld
,
gridworld
data(DynaMaze) DynaMaze gridworld_matrix(DynaMaze) gridworld_matrix(DynaMaze, what = "labels") gridworld_plot_transition_graph(DynaMaze)
data(DynaMaze) DynaMaze gridworld_matrix(DynaMaze) gridworld_matrix(DynaMaze, what = "labels") gridworld_plot_transition_graph(DynaMaze)
Estimate a belief for each alpha vector (segment of the value function) which represents a node in the policy graph.
estimate_belief_for_nodes( x, method = "auto", belief = NULL, verbose = FALSE, ... )
estimate_belief_for_nodes( x, method = "auto", belief = NULL, verbose = FALSE, ... )
x |
object of class POMDP containing a solved and converged POMDP problem. |
method |
character string specifying the estimation method. Methods include
|
belief |
start belief used for method trajectories. |
verbose |
logical; show which method is used. |
... |
parameters are passed on to |
estimate_belief_for_nodes()
can estimate the belief in several ways:
Use belief points explored by the solver. Some solvers return explored belief points. We assign the belief points to the nodes and average each nodes belief.
Follow trajectories (breadth first) till all policy graph nodes have been visited and
return the encountered belief. This implementation returns the first (i.e., shallowest) belief point
that is encountered is used and no averaging is performed. parameter n
can be used to
limit the number of nodes searched.
Sample a large set of possible belief points, assigning them to the nodes and then averaging
the belief over the points assigned to each node. This will return a central belief for the node.
Additional parameters like method
and the sample size n
are passed on to sample_belief_space()
.
If no belief point is generated for a segment, then a
warning is produced. In this case, the number of sampled points can be increased.
Notes:
Each method may return a different answer. The only thing that is guaranteed is that the returned belief falls in the range where the value function segment is maximal.
If some nodes not belief points are sampled, or the node is not reachable from the initial belief,
then a vector with all NaN
s will be returned with a warning.
returns a list with matrices with a belief for each policy graph node. The list elements are the epochs and converged solutions only have a single element.
Other policy:
optimal_action()
,
plot_belief_space()
,
plot_policy_graph()
,
policy()
,
policy_graph()
,
projection()
,
reward()
,
solve_POMDP()
,
solve_SARSOP()
,
value_function()
data("Tiger") # Infinite horizon case with converged solution sol <- solve_POMDP(model = Tiger, method = "grid") sol # default method auto uses the belief points used in the algorithm (if available). estimate_belief_for_nodes(sol, verbose = TRUE) # use belief points obtained from trajectories estimate_belief_for_nodes(sol, method = "trajectories", verbose = TRUE) # use a random uniform sample estimate_belief_for_nodes(sol, method = "random", verbose = TRUE) # Finite horizon example with three epochs. sol <- solve_POMDP(model = Tiger, horizon = 3) sol estimate_belief_for_nodes(sol)
data("Tiger") # Infinite horizon case with converged solution sol <- solve_POMDP(model = Tiger, method = "grid") sol # default method auto uses the belief points used in the algorithm (if available). estimate_belief_for_nodes(sol, verbose = TRUE) # use belief points obtained from trajectories estimate_belief_for_nodes(sol, method = "trajectories", verbose = TRUE) # use a random uniform sample estimate_belief_for_nodes(sol, method = "random", verbose = TRUE) # Finite horizon example with three epochs. sol <- solve_POMDP(model = Tiger, horizon = 3) sol estimate_belief_for_nodes(sol)
Helper functions for gridworld MDPs to convert between state names and gridworld positions, and for visualizing policies.
gridworld_init( dim, action_labels = c("up", "right", "down", "left"), unreachable_states = NULL, absorbing_states = NULL, labels = NULL ) gridworld_maze_MDP( dim, start, goal, walls = NULL, action_labels = c("up", "right", "down", "left"), goal_reward = 1, step_cost = 0, restart = FALSE, discount = 0.9, horizon = Inf, info = NULL, name = NA ) gridworld_s2rc(s) gridworld_rc2s(rc) gridworld_matrix(model, epoch = 1L, what = "states") gridworld_plot_policy( model, epoch = 1L, actions = "character", states = FALSE, labels = TRUE, absorbing_state_action = FALSE, main = NULL, cex = 1, offset = 0.5, lines = TRUE, ... ) gridworld_plot_transition_graph( x, hide_unreachable_states = TRUE, remove.loops = TRUE, vertex.color = "gray", vertex.shape = "square", vertex.size = 10, vertex.label = NA, edge.arrow.size = 0.3, margin = 0.2, main = NULL, ... ) gridworld_animate(x, method, n, zlim = NULL, ...)
gridworld_init( dim, action_labels = c("up", "right", "down", "left"), unreachable_states = NULL, absorbing_states = NULL, labels = NULL ) gridworld_maze_MDP( dim, start, goal, walls = NULL, action_labels = c("up", "right", "down", "left"), goal_reward = 1, step_cost = 0, restart = FALSE, discount = 0.9, horizon = Inf, info = NULL, name = NA ) gridworld_s2rc(s) gridworld_rc2s(rc) gridworld_matrix(model, epoch = 1L, what = "states") gridworld_plot_policy( model, epoch = 1L, actions = "character", states = FALSE, labels = TRUE, absorbing_state_action = FALSE, main = NULL, cex = 1, offset = 0.5, lines = TRUE, ... ) gridworld_plot_transition_graph( x, hide_unreachable_states = TRUE, remove.loops = TRUE, vertex.color = "gray", vertex.shape = "square", vertex.size = 10, vertex.label = NA, edge.arrow.size = 0.3, margin = 0.2, main = NULL, ... ) gridworld_animate(x, method, n, zlim = NULL, ...)
dim |
vector of length two with the x and y extent of the gridworld. |
action_labels |
vector with four action labels that move the agent up, right, down, and left. |
unreachable_states |
a vector with state labels for unreachable states. These states will be excluded. |
absorbing_states |
a vector with state labels for absorbing states. |
labels |
logical; show state labels. |
start , goal
|
labels for the start state and the goal state. |
walls |
a vector with state labels for walls. Walls will become unreachable states. |
goal_reward |
reward to transition to the goal state. |
step_cost |
cost of each action that does not lead to the goal state. |
restart |
logical; if |
discount , horizon
|
MDP discount factor, and horizon. |
info |
A list with additional information. Has to contain the gridworld
dimensions as element |
name |
a string to identify the MDP problem. |
s |
a state label. |
rc |
a vector of length two with the row and column coordinate of a state in the gridworld matrix. |
model , x
|
a solved gridworld MDP. |
epoch |
epoch for unconverged finite-horizon solutions. |
what |
What should be returned in the matrix. Options are:
|
actions |
how to show actions. Options are:
simple |
states |
logical; show state names. |
absorbing_state_action |
logical; show the value and the action for absorbing states. |
main |
a main title for the plot. Defaults to the name of the problem. |
cex |
expansion factor for the action. |
offset |
move the state labels out of the way (in fractions of a character width). |
lines |
logical; draw lines to separate states. |
... |
further arguments are passed on to |
hide_unreachable_states |
logical; do not show unreachable states. |
remove.loops |
logical; do not show transitions from a state back to itself. |
vertex.color , vertex.shape , vertex.size , vertex.label , edge.arrow.size
|
see |
margin |
a single number specifying the margin of the plot. Can be used if the graph does not fit inside the plotting area. |
method |
a MDP solution method for |
n |
number of iterations to animate. |
zlim |
limits for visualizing the state value. |
Gridworlds are implemented with state names s(row,col)
, where
row
and col
are locations in the matrix representing the gridworld.
The actions are "up"
, "right"
, "down"
, and "left"
.
gridworld_init()
initializes a new gridworld creating a matrix
of states with the given dimensions. Other action names
can be specified, but they must have the same effects in the same order
as above. Unreachable states (walls) and absorbing state can be defined.
This information can be used to build a custom gridworld MDP.
Several helper functions are provided to use states, look at the state layout, and plot policies on the gridworld.
gridworld_maze_MDP()
helps to easily define maze-like gridworld MDPs.
By default, the goal state is absorbing, but with restart = TRUE
, the
agent restarts the problem at the start state every time it reaches the goal
and receives the reward. Note that this implies that the goal state itself
becomes unreachable.
gridworld_animate()
applies algorithms from solve_MDP()
iteration
by iteration and visualized the state utilities. This helps to understand
how the algorithms work.
Other gridworld:
Cliff_walking
,
DynaMaze
,
Maze
,
Windy_gridworld
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
actions()
,
add_policy()
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
# Defines states, actions and a transition model for a standard gridworld gw <- gridworld_init(dim = c(7,7), unreachable_states = c("s(2,2)", "s(7,3)", "s(3,6)"), absorbing_states = "s(4,4)", labels = list("s(4,4)" = "Black Hole") ) gw$states gw$actions gw$info # display the state labels in the gridworld gridworld_matrix(gw) gridworld_matrix(gw, what = "label") gridworld_matrix(gw, what = "reachable") gridworld_matrix(gw, what = "absorbing") # a transition function for regular moves in the gridworld is provided gw$transition_prob("right", "s(1,1)", "s(1,2)") gw$transition_prob("right", "s(2,1)", "s(2,2)") ### we cannot move into an unreachable state gw$transition_prob("right", "s(2,1)", "s(2,1)") ### but the agent stays in place # convert between state names and row/column indices gridworld_s2rc("s(1,1)") gridworld_rc2s(c(1,1)) # The information in gw can be used to build a custom MDP. # We modify the standard transition function so there is a 50% chance that # you will get sucked into the black hole from the adjacent squares. trans_black_hole <- function(action = NA, start.state = NA, end.state = NA) { # ignore the action next to the black hole if (start.state %in% c("s(3,3)", "s(3,4)", "s(3,5)", "s(4,3)", "s(4,5)", "s(5,3)", "s(5,4)", "s(5,5)")) { if(end.state == "s(4,4)") return(.5) else return(gw$transition_prob(action, start.state, end.state) * .5) } # use the standard gridworld movement gw$transition_prob(action, start.state, end.state) } black_hole <- MDP(states = gw$states, actions = gw$actions, transition_prob = trans_black_hole, reward = rbind(R_(value = +1), R_(end.state = "s(4,4)", value = -100)), info = gw$info, name = "Black hole" ) black_hole gridworld_plot_transition_graph(black_hole) # solve the problem sol <- solve_MDP(black_hole) gridworld_matrix(sol, what = "values") gridworld_plot_policy(sol) # the optimal policy is to fly around, but avoid the black hole. # Build a Maze: The Dyna Maze from Chapter 8 in the RL book DynaMaze <- gridworld_maze_MDP( dim = c(6,9), start = "s(3,1)", goal = "s(1,9)", walls = c("s(2,3)", "s(3,3)", "s(4,3)", "s(5,6)", "s(1,8)", "s(2,8)", "s(3,8)"), restart = TRUE, discount = 0.95, name = "Dyna Maze", ) DynaMaze gridworld_matrix(DynaMaze) gridworld_matrix(DynaMaze, what = "labels") gridworld_plot_transition_graph(DynaMaze) # Note that the problems resets if the goal state would be reached. sol <- solve_MDP(DynaMaze) gridworld_matrix(sol, what = "values") gridworld_matrix(sol, what = "actions") gridworld_plot_policy(sol) gridworld_plot_policy(sol, actions = "label", cex = 1, states = FALSE) # visualize the first 3 iterations of value iteration gridworld_animate(DynaMaze, method = "value", n = 3)
# Defines states, actions and a transition model for a standard gridworld gw <- gridworld_init(dim = c(7,7), unreachable_states = c("s(2,2)", "s(7,3)", "s(3,6)"), absorbing_states = "s(4,4)", labels = list("s(4,4)" = "Black Hole") ) gw$states gw$actions gw$info # display the state labels in the gridworld gridworld_matrix(gw) gridworld_matrix(gw, what = "label") gridworld_matrix(gw, what = "reachable") gridworld_matrix(gw, what = "absorbing") # a transition function for regular moves in the gridworld is provided gw$transition_prob("right", "s(1,1)", "s(1,2)") gw$transition_prob("right", "s(2,1)", "s(2,2)") ### we cannot move into an unreachable state gw$transition_prob("right", "s(2,1)", "s(2,1)") ### but the agent stays in place # convert between state names and row/column indices gridworld_s2rc("s(1,1)") gridworld_rc2s(c(1,1)) # The information in gw can be used to build a custom MDP. # We modify the standard transition function so there is a 50% chance that # you will get sucked into the black hole from the adjacent squares. trans_black_hole <- function(action = NA, start.state = NA, end.state = NA) { # ignore the action next to the black hole if (start.state %in% c("s(3,3)", "s(3,4)", "s(3,5)", "s(4,3)", "s(4,5)", "s(5,3)", "s(5,4)", "s(5,5)")) { if(end.state == "s(4,4)") return(.5) else return(gw$transition_prob(action, start.state, end.state) * .5) } # use the standard gridworld movement gw$transition_prob(action, start.state, end.state) } black_hole <- MDP(states = gw$states, actions = gw$actions, transition_prob = trans_black_hole, reward = rbind(R_(value = +1), R_(end.state = "s(4,4)", value = -100)), info = gw$info, name = "Black hole" ) black_hole gridworld_plot_transition_graph(black_hole) # solve the problem sol <- solve_MDP(black_hole) gridworld_matrix(sol, what = "values") gridworld_plot_policy(sol) # the optimal policy is to fly around, but avoid the black hole. # Build a Maze: The Dyna Maze from Chapter 8 in the RL book DynaMaze <- gridworld_maze_MDP( dim = c(6,9), start = "s(3,1)", goal = "s(1,9)", walls = c("s(2,3)", "s(3,3)", "s(4,3)", "s(5,6)", "s(1,8)", "s(2,8)", "s(3,8)"), restart = TRUE, discount = 0.95, name = "Dyna Maze", ) DynaMaze gridworld_matrix(DynaMaze) gridworld_matrix(DynaMaze, what = "labels") gridworld_plot_transition_graph(DynaMaze) # Note that the problems resets if the goal state would be reached. sol <- solve_MDP(DynaMaze) gridworld_matrix(sol, what = "values") gridworld_matrix(sol, what = "actions") gridworld_plot_policy(sol) gridworld_plot_policy(sol, actions = "label", cex = 1, states = FALSE) # visualize the first 3 iterations of value iteration gridworld_animate(DynaMaze, method = "value", n = 3)
The 4x3 maze is described in Chapter 17 of the textbook "Artificial Intelligence: A Modern Approach" (AIMA).
An object of class MDP.
The simple maze has the following layout:
1234 Transition model: ###### .8 (action direction) 1# +# ^ 2# # -# | 3#S # .1 <-|-> .1 ######
We represent the maze states as a gridworld matrix with 3 rows and
4 columns. The states are labeled s(row, col)
representing the position in
the matrix.
The # (state s(2,2)
) in the middle of the maze is an obstruction and not reachable.
Rewards are associated with transitions. The default reward (penalty) is -0.04.
The start state marked with S
is s(3,1)
.
Transitioning to +
(state s(1,4)
) gives a reward of +1.0,
transitioning to -
(state s_(2,4)
)
has a reward of -1.0. Both these states are absorbing
(i.e., terminal) states.
Actions are movements (up
, right
, down
, left
). The actions are
unreliable with a .8 chance
to move in the correct direction and a 0.1 chance to instead to move in a
perpendicular direction leading to a stochastic transition model.
Note that the problem has reachable terminal states which leads to a proper policy
(that is guaranteed to reach a terminal state). This means that the solution also
converges without discounting (discount = 1
).
Russell,9 S. J. and Norvig, P. (2020). Artificial Intelligence: A modern approach. 4rd ed.
Other MDP_examples:
Cliff_walking
,
DynaMaze
,
MDP()
,
Windy_gridworld
Other gridworld:
Cliff_walking
,
DynaMaze
,
Windy_gridworld
,
gridworld
# The problem can be loaded using data(Maze). # Here is the complete problem definition: gw <- gridworld_init(dim = c(3, 4), unreachable_states = c("s(2,2)")) gridworld_matrix(gw) # the transition function is stochastic so we cannot use the standard # gridworld gw$transition_prob() function T <- function(action, start.state, end.state) { action <- match.arg(action, choices = gw$actions) # absorbing states if (start.state %in% c('s(1,4)', 's(2,4)')) { if (start.state == end.state) return(1) else return(0) } # actions are stochastic so we cannot use gw$trans_prob if(action %in% c("up", "down")) error_direction <- c("right", "left") else error_direction <- c("up", "down") rc <- gridworld_s2rc(start.state) delta <- list(up = c(-1, 0), down = c(+1, 0), right = c(0, +1), left = c(0, -1)) P <- matrix(0, nrow = 3, ncol = 4) add_prob <- function(P, rc, a, value) { new_rc <- rc + delta[[a]] if (!(gridworld_rc2s(new_rc) %in% gw$states)) new_rc <- rc P[new_rc[1], new_rc[2]] <- P[new_rc[1], new_rc[2]] + value P } P <- add_prob(P, rc, action, .8) P <- add_prob(P, rc, error_direction[1], .1) P <- add_prob(P, rc, error_direction[2], .1) P[rbind(gridworld_s2rc(end.state))] } T("up", "s(3,1)", "s(2,1)") R <- rbind( R_(end.state = NA, value = -0.04), R_(end.state = 's(2,4)', value = -1), R_(end.state = 's(1,4)', value = +1), R_(start.state = 's(2,4)', value = 0), R_(start.state = 's(1,4)', value = 0) ) Maze <- MDP( name = "Stuart Russell's 3x4 Maze", discount = 1, horizon = Inf, states = gw$states, actions = gw$actions, start = "s(3,1)", transition_prob = T, reward = R, info = list(gridworld_dim = c(3, 4), gridworld_labels = list( "s(3,1)" = "Start", "s(2,4)" = "-1", "s(1,4)" = "Goal: +1" ) ) ) Maze str(Maze) gridworld_matrix(Maze) gridworld_matrix(Maze, what = "labels") # find absorbing (terminal) states which(absorbing_states(Maze)) maze_solved <- solve_MDP(Maze) policy(maze_solved) gridworld_matrix(maze_solved, what = "values") gridworld_matrix(maze_solved, what = "actions") gridworld_plot_policy(maze_solved)
# The problem can be loaded using data(Maze). # Here is the complete problem definition: gw <- gridworld_init(dim = c(3, 4), unreachable_states = c("s(2,2)")) gridworld_matrix(gw) # the transition function is stochastic so we cannot use the standard # gridworld gw$transition_prob() function T <- function(action, start.state, end.state) { action <- match.arg(action, choices = gw$actions) # absorbing states if (start.state %in% c('s(1,4)', 's(2,4)')) { if (start.state == end.state) return(1) else return(0) } # actions are stochastic so we cannot use gw$trans_prob if(action %in% c("up", "down")) error_direction <- c("right", "left") else error_direction <- c("up", "down") rc <- gridworld_s2rc(start.state) delta <- list(up = c(-1, 0), down = c(+1, 0), right = c(0, +1), left = c(0, -1)) P <- matrix(0, nrow = 3, ncol = 4) add_prob <- function(P, rc, a, value) { new_rc <- rc + delta[[a]] if (!(gridworld_rc2s(new_rc) %in% gw$states)) new_rc <- rc P[new_rc[1], new_rc[2]] <- P[new_rc[1], new_rc[2]] + value P } P <- add_prob(P, rc, action, .8) P <- add_prob(P, rc, error_direction[1], .1) P <- add_prob(P, rc, error_direction[2], .1) P[rbind(gridworld_s2rc(end.state))] } T("up", "s(3,1)", "s(2,1)") R <- rbind( R_(end.state = NA, value = -0.04), R_(end.state = 's(2,4)', value = -1), R_(end.state = 's(1,4)', value = +1), R_(start.state = 's(2,4)', value = 0), R_(start.state = 's(1,4)', value = 0) ) Maze <- MDP( name = "Stuart Russell's 3x4 Maze", discount = 1, horizon = Inf, states = gw$states, actions = gw$actions, start = "s(3,1)", transition_prob = T, reward = R, info = list(gridworld_dim = c(3, 4), gridworld_labels = list( "s(3,1)" = "Start", "s(2,4)" = "-1", "s(1,4)" = "Goal: +1" ) ) ) Maze str(Maze) gridworld_matrix(Maze) gridworld_matrix(Maze, what = "labels") # find absorbing (terminal) states which(absorbing_states(Maze)) maze_solved <- solve_MDP(Maze) policy(maze_solved) gridworld_matrix(maze_solved, what = "values") gridworld_matrix(maze_solved, what = "actions") gridworld_plot_policy(maze_solved)
Defines all the elements of a finite state-space MDP problem.
MDP( states, actions, transition_prob, reward, discount = 0.9, horizon = Inf, start = "uniform", info = NULL, name = NA ) is_solved_MDP(x, stop = FALSE)
MDP( states, actions, transition_prob, reward, discount = 0.9, horizon = Inf, start = "uniform", info = NULL, name = NA ) is_solved_MDP(x, stop = FALSE)
states |
a character vector specifying the names of the states. |
actions |
a character vector specifying the names of the available actions. |
transition_prob |
Specifies the transition probabilities between states. |
reward |
Specifies the rewards dependent on action, states and observations. |
discount |
numeric; discount rate between 0 and 1. |
horizon |
numeric; Number of epochs. |
start |
Specifies in which state the MDP starts. |
info |
A list with additional information. |
name |
a string to identify the MDP problem. |
x |
a |
stop |
logical; stop with an error. |
Markov decision processes (MDPs) are discrete-time stochastic control
process with completely observable states. We implement here
MDPs with a finite state space. similar to POMDP
models, but without the observation model. The 'observations'
column in
the the reward specification is always missing.
make_partially_observable()
reformulates an MDP as a POMDP by adding an observation
model with one observation per state
that reveals the current state. This is achieved by adding identity
observation probability matrices.
More details on specifying the model components can be found in the documentation for POMDP.
The function returns an object of class MDP which is list with
the model specification. solve_MDP()
reads the object and adds a list element called
'solution'
.
Michael Hahsler
Other MDP:
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
actions()
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
Other MDP_examples:
Cliff_walking
,
DynaMaze
,
Maze
,
Windy_gridworld
# Michael's Sleepy Tiger Problem is like the POMDP Tiger problem, but # has completely observable states because the tiger is sleeping in front # of the door. This makes the problem an MDP. STiger <- MDP( name = "Michael's Sleepy Tiger Problem", discount = .9, states = c("tiger-left" , "tiger-right"), actions = c("open-left", "open-right", "do-nothing"), start = "uniform", # opening a door resets the problem transition_prob = list( "open-left" = "uniform", "open-right" = "uniform", "do-nothing" = "identity"), # the reward helper R_() expects: action, start.state, end.state, observation, value reward = rbind( R_("open-left", "tiger-left", v = -100), R_("open-left", "tiger-right", v = 10), R_("open-right", "tiger-left", v = 10), R_("open-right", "tiger-right", v = -100), R_("do-nothing", v = 0) ) ) STiger sol <- solve_MDP(STiger) sol policy(sol) plot_value_function(sol) # convert the MDP into a POMDP and solve STiger_POMDP <- make_partially_observable(STiger) sol2 <- solve_POMDP(STiger_POMDP) sol2 policy(sol2) plot_value_function(sol2, ylim = c(80, 120))
# Michael's Sleepy Tiger Problem is like the POMDP Tiger problem, but # has completely observable states because the tiger is sleeping in front # of the door. This makes the problem an MDP. STiger <- MDP( name = "Michael's Sleepy Tiger Problem", discount = .9, states = c("tiger-left" , "tiger-right"), actions = c("open-left", "open-right", "do-nothing"), start = "uniform", # opening a door resets the problem transition_prob = list( "open-left" = "uniform", "open-right" = "uniform", "do-nothing" = "identity"), # the reward helper R_() expects: action, start.state, end.state, observation, value reward = rbind( R_("open-left", "tiger-left", v = -100), R_("open-left", "tiger-right", v = 10), R_("open-right", "tiger-left", v = 10), R_("open-right", "tiger-right", v = -100), R_("do-nothing", v = 0) ) ) STiger sol <- solve_MDP(STiger) sol policy(sol) plot_value_function(sol) # convert the MDP into a POMDP and solve STiger_POMDP <- make_partially_observable(STiger) sol2 <- solve_POMDP(STiger_POMDP) sol2 policy(sol2) plot_value_function(sol2, ylim = c(80, 120))
Implementation several functions useful to deal with MDP policies.
q_values_MDP(model, U = NULL) MDP_policy_evaluation( pi, model, U = NULL, k_backups = 1000, theta = 0.001, verbose = FALSE ) greedy_MDP_action(s, Q, epsilon = 0, prob = FALSE) random_MDP_policy(model, prob = NULL) manual_MDP_policy(model, actions) greedy_MDP_policy(Q)
q_values_MDP(model, U = NULL) MDP_policy_evaluation( pi, model, U = NULL, k_backups = 1000, theta = 0.001, verbose = FALSE ) greedy_MDP_action(s, Q, epsilon = 0, prob = FALSE) random_MDP_policy(model, prob = NULL) manual_MDP_policy(model, actions) greedy_MDP_policy(Q)
model |
an MDP problem specification. |
U |
a vector with value function representing the state utilities
(expected sum of discounted rewards from that point on).
If |
pi |
a policy as a data.frame with at least columns for states and action. |
k_backups |
number of look ahead steps used for approximate policy evaluation
used by the policy iteration method. Set k_backups to |
theta |
stop when the largest change in a state value is less
than |
verbose |
logical; should progress and approximation errors be printed. |
s |
a state. |
Q |
an action value function with Q-values as a state by action matrix. |
epsilon |
an |
prob |
probability vector for random actions for |
actions |
a vector with the action (either the action label or the numeric id) for each state. |
Implemented functions are:
q_values_MDP()
calculates (approximates)
Q-values for a given model using the Bellman
optimality equation:
Q-values can be used as the input for several other functions.
MDP_policy_evaluation()
evaluates a policy for a model and returns
(approximate) state values by applying the Bellman equation as an update
rule for each state and iteration
:
In each iteration, all states are updated. Updating is stopped after
k_backups
iterations or after the
largest update .
greedy_MDP_action()
returns the action with the largest Q-value given a
state.
random_MDP_policy()
, manual_MDP_policy()
, and greedy_MDP_policy()
generates different policies. These policies can be added to a problem
using add_policy()
.
q_values_MDP()
returns a state by action matrix specifying the Q-function,
i.e., the action value for executing each action in each state. The Q-values
are calculated from the value function (U) and the transition model.
MDP_policy_evaluation()
returns a vector with (approximate)
state values (U).
greedy_MDP_action()
returns the action with the highest q-value
for state s
. If prob = TRUE
, then a vector with
the probability for each action is returned.
random_MDP_policy()
returns a data.frame with the columns state and action to define a policy.
manual_MDP_policy()
returns a data.frame with the columns state and action to define a policy.
greedy_MDP_policy()
returns the greedy policy given Q
.
Michael Hahsler
Sutton, R. S., Barto, A. G. (2020). Reinforcement Learning: An Introduction. Second edition. The MIT Press.
Other MDP:
MDP()
,
MDP2POMDP
,
accessors
,
actions()
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
data(Maze) Maze # create several policies: # 1. optimal policy using value iteration maze_solved <- solve_MDP(Maze, method = "value_iteration") maze_solved pi_opt <- policy(maze_solved) pi_opt gridworld_plot_policy(add_policy(Maze, pi_opt), main = "Optimal Policy") # 2. a manual policy (go up and in some squares to the right) acts <- rep("up", times = length(Maze$states)) names(acts) <- Maze$states acts[c("s(1,1)", "s(1,2)", "s(1,3)")] <- "right" pi_manual <- manual_MDP_policy(Maze, acts) pi_manual gridworld_plot_policy(add_policy(Maze, pi_manual), main = "Manual Policy") # 3. a random policy set.seed(1234) pi_random <- random_MDP_policy(Maze) pi_random gridworld_plot_policy(add_policy(Maze, pi_random), main = "Random Policy") # 4. an improved policy based on one policy evaluation and # policy improvement step. u <- MDP_policy_evaluation(pi_random, Maze) q <- q_values_MDP(Maze, U = u) pi_greedy <- greedy_MDP_policy(q) pi_greedy gridworld_plot_policy(add_policy(Maze, pi_greedy), main = "Greedy Policy") #' compare the approx. value functions for the policies (we restrict #' the number of backups for the random policy since it may not converge) rbind( random = MDP_policy_evaluation(pi_random, Maze, k_backups = 100), manual = MDP_policy_evaluation(pi_manual, Maze), greedy = MDP_policy_evaluation(pi_greedy, Maze), optimal = MDP_policy_evaluation(pi_opt, Maze) ) # For many functions, we first add the policy to the problem description # to create a "solved" MDP maze_random <- add_policy(Maze, pi_random) maze_random # plotting plot_value_function(maze_random) gridworld_plot_policy(maze_random) # compare to a benchmark regret(maze_random, benchmark = maze_solved) # calculate greedy actions for state 1 q <- q_values_MDP(maze_random) q greedy_MDP_action(1, q, epsilon = 0, prob = FALSE) greedy_MDP_action(1, q, epsilon = 0, prob = TRUE) greedy_MDP_action(1, q, epsilon = .1, prob = TRUE)
data(Maze) Maze # create several policies: # 1. optimal policy using value iteration maze_solved <- solve_MDP(Maze, method = "value_iteration") maze_solved pi_opt <- policy(maze_solved) pi_opt gridworld_plot_policy(add_policy(Maze, pi_opt), main = "Optimal Policy") # 2. a manual policy (go up and in some squares to the right) acts <- rep("up", times = length(Maze$states)) names(acts) <- Maze$states acts[c("s(1,1)", "s(1,2)", "s(1,3)")] <- "right" pi_manual <- manual_MDP_policy(Maze, acts) pi_manual gridworld_plot_policy(add_policy(Maze, pi_manual), main = "Manual Policy") # 3. a random policy set.seed(1234) pi_random <- random_MDP_policy(Maze) pi_random gridworld_plot_policy(add_policy(Maze, pi_random), main = "Random Policy") # 4. an improved policy based on one policy evaluation and # policy improvement step. u <- MDP_policy_evaluation(pi_random, Maze) q <- q_values_MDP(Maze, U = u) pi_greedy <- greedy_MDP_policy(q) pi_greedy gridworld_plot_policy(add_policy(Maze, pi_greedy), main = "Greedy Policy") #' compare the approx. value functions for the policies (we restrict #' the number of backups for the random policy since it may not converge) rbind( random = MDP_policy_evaluation(pi_random, Maze, k_backups = 100), manual = MDP_policy_evaluation(pi_manual, Maze), greedy = MDP_policy_evaluation(pi_greedy, Maze), optimal = MDP_policy_evaluation(pi_opt, Maze) ) # For many functions, we first add the policy to the problem description # to create a "solved" MDP maze_random <- add_policy(Maze, pi_random) maze_random # plotting plot_value_function(maze_random) gridworld_plot_policy(maze_random) # compare to a benchmark regret(maze_random, benchmark = maze_solved) # calculate greedy actions for state 1 q <- q_values_MDP(maze_random) q greedy_MDP_action(1, q, epsilon = 0, prob = FALSE) greedy_MDP_action(1, q, epsilon = 0, prob = TRUE) greedy_MDP_action(1, q, epsilon = .1, prob = TRUE)
Convert a MDP into POMDP by adding an observation model or a POMDP into a MDP by making the states observable.
make_partially_observable(x, observations = NULL, observation_prob = NULL) make_fully_observable(x)
make_partially_observable(x, observations = NULL, observation_prob = NULL) make_fully_observable(x)
x |
a |
observations |
a character vector specifying the names of the available observations. |
observation_prob |
Specifies the observation probabilities (see POMDP for details). |
make_partially_observable()
adds an observation model to an MDP. If no observations and
observation probabilities are provided, then an observation for each state is created
with identity observation matrices. This means we have a fully observable model
encoded as a POMDP.
make_fully_observable()
removes the observation model from a POMDP and returns
an MDP.
a MDP
or a POMDP
object.
Michael Hahsler
Other MDP:
MDP()
,
MDP_policy_functions
,
accessors
,
actions()
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
Other POMDP:
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
# Turn the Maze MDP into a partially observable problem. # Here each state has an observation, so it is still a fully observable problem # encoded as a POMDP. data("Maze") Maze Maze_POMDP <- make_partially_observable(Maze) Maze_POMDP sol <- solve_POMDP(Maze_POMDP) policy(sol) simulate_POMDP(sol, n = 1, horizon = 100, return_trajectories = TRUE)$trajectories # Make the Tiger POMDP fully observable data("Tiger") Tiger Tiger_MDP <- make_fully_observable(Tiger) Tiger_MDP sol <- solve_MDP(Tiger_MDP) policy(sol) # The result is not exciting since we can observe where the tiger is!
# Turn the Maze MDP into a partially observable problem. # Here each state has an observation, so it is still a fully observable problem # encoded as a POMDP. data("Maze") Maze Maze_POMDP <- make_partially_observable(Maze) Maze_POMDP sol <- solve_POMDP(Maze_POMDP) policy(sol) simulate_POMDP(sol, n = 1, horizon = 100, return_trajectories = TRUE)$trajectories # Make the Tiger POMDP fully observable data("Tiger") Tiger Tiger_MDP <- make_fully_observable(Tiger) Tiger_MDP sol <- solve_MDP(Tiger_MDP) policy(sol) # The result is not exciting since we can observe where the tiger is!
Determines the optimal action for a policy (solved POMDP) for a given belief at a given epoch.
optimal_action(model, belief = NULL, epoch = 1)
optimal_action(model, belief = NULL, epoch = 1)
model |
a solved POMDP. |
belief |
The belief (probability distribution over the states) as a
vector or a matrix with multiple belief states as rows. If |
epoch |
what epoch of the policy should be used. Use 1 for converged policies. |
The name of the optimal action.
Michael Hahsler
Other policy:
estimate_belief_for_nodes()
,
plot_belief_space()
,
plot_policy_graph()
,
policy()
,
policy_graph()
,
projection()
,
reward()
,
solve_POMDP()
,
solve_SARSOP()
,
value_function()
data("Tiger") Tiger sol <- solve_POMDP(model = Tiger) # these are the states sol$states # belief that tiger is to the left optimal_action(sol, c(1, 0)) optimal_action(sol, "tiger-left") # belief that tiger is to the right optimal_action(sol, c(0, 1)) optimal_action(sol, "tiger-right") # belief is 50/50 optimal_action(sol, c(.5, .5)) optimal_action(sol, "uniform") # the POMDP is converged, so all epoch give the same result. optimal_action(sol, "tiger-right", epoch = 10)
data("Tiger") Tiger sol <- solve_POMDP(model = Tiger) # these are the states sol$states # belief that tiger is to the left optimal_action(sol, c(1, 0)) optimal_action(sol, "tiger-left") # belief that tiger is to the right optimal_action(sol, c(0, 1)) optimal_action(sol, "tiger-right") # belief is 50/50 optimal_action(sol, c(.5, .5)) optimal_action(sol, "uniform") # the POMDP is converged, so all epoch give the same result. optimal_action(sol, "tiger-right", epoch = 10)
Plots the optimal action, the node in the policy graph or the reward for a given set of belief points on a line (2D) or on a ternary plot (3D). If no points are given, points are sampled using a regular arrangement or randomly from the (projected) belief space.
plot_belief_space( model, projection = NULL, epoch = 1, sample = "regular", n = 100, what = c("action", "pg_node", "reward"), legend = TRUE, pch = 20, col = NULL, jitter = 0, oneD = TRUE, ... )
plot_belief_space( model, projection = NULL, epoch = 1, sample = "regular", n = 100, what = c("action", "pg_node", "reward"), legend = TRUE, pch = 20, col = NULL, jitter = 0, oneD = TRUE, ... )
model |
a solved POMDP. |
projection |
Sample in a projected belief space. See |
epoch |
display this epoch. |
sample |
a matrix with belief points as rows or a character string
specifying the |
n |
number of points sampled. |
what |
what to plot. |
legend |
logical; add a legend? If the legend is covered by the plot then you need to increase the plotting region of the plotting device. |
pch |
plotting symbols. |
col |
plotting colors. |
jitter |
jitter amount for 2D belief spaces (good values are between 0 and 1, while using |
oneD |
plot projections on two states in one dimension. |
... |
additional arguments are passed on to |
Returns invisibly the sampled points.
Michael Hahsler
Other policy:
estimate_belief_for_nodes()
,
optimal_action()
,
plot_policy_graph()
,
policy()
,
policy_graph()
,
projection()
,
reward()
,
solve_POMDP()
,
solve_SARSOP()
,
value_function()
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
# two-state POMDP data("Tiger") sol <- solve_POMDP(Tiger) plot_belief_space(sol) plot_belief_space(sol, oneD = FALSE) plot_belief_space(sol, n = 10) plot_belief_space(sol, n = 100, sample = "random") # plot the belief points used by the grid-based solver plot_belief_space(sol, sample = sol $solution$belief_points_solver) # plot different measures plot_belief_space(sol, what = "pg_node") plot_belief_space(sol, what = "reward") # three-state POMDP # Note: If the plotting region is too small then the legend might run into the plot data("Three_doors") sol <- solve_POMDP(Three_doors) sol # plotting needs the suggested package Ternary if ("Ternary" %in% installed.packages()) { plot_belief_space(sol) plot_belief_space(sol, n = 10000) plot_belief_space(sol, what = "reward", sample = "random", n = 1000) plot_belief_space(sol, what = "pg_node", n = 10000) # holding tiger-left constant at .5 follows this line in the ternary plot Ternary::TernaryLines(list(c(.5, 0, .5), c(.5, .5, 0)), col = "black", lty = 2) # we can plot the projection for this line plot_belief_space(sol, what = "pg_node", n = 1000, projection = c("tiger-left" = .5)) # plot the belief points used by the grid-based solver plot_belief_space(sol, sample = sol$solution$belief_points_solver, what = "pg_node") # plot the belief points obtained using simulated trajectories with an epsilon-greedy policy. # Note that we only use n = 50 to save time. plot_belief_space(sol, sample = simulate_POMDP(sol, n = 50, horizon = 100, epsilon = 0.1, return_beliefs = TRUE)$belief_states) } # plot a 3-state belief space using ggtern (ggplot2) ## Not run: library(ggtern) samp <- sample_belief_space(sol, n = 1000) df <- cbind(as.data.frame(samp), reward_node_action(sol, belief = samp)) df$pg_node <- factor(df$pg_node) ggtern(df, aes(x = `tiger-left`, y = `tiger-center`, z = `tiger-right`)) + geom_point(aes(color = pg_node), size = 2) ggtern(df, aes(x = `tiger-left`, y = `tiger-center`, z = `tiger-right`)) + geom_point(aes(color = action), size = 2) ggtern(df, aes(x = `tiger-left`, y = `tiger-center`, z = `tiger-right`)) + geom_point(aes(color = reward), size = 2) ## End(Not run)
# two-state POMDP data("Tiger") sol <- solve_POMDP(Tiger) plot_belief_space(sol) plot_belief_space(sol, oneD = FALSE) plot_belief_space(sol, n = 10) plot_belief_space(sol, n = 100, sample = "random") # plot the belief points used by the grid-based solver plot_belief_space(sol, sample = sol $solution$belief_points_solver) # plot different measures plot_belief_space(sol, what = "pg_node") plot_belief_space(sol, what = "reward") # three-state POMDP # Note: If the plotting region is too small then the legend might run into the plot data("Three_doors") sol <- solve_POMDP(Three_doors) sol # plotting needs the suggested package Ternary if ("Ternary" %in% installed.packages()) { plot_belief_space(sol) plot_belief_space(sol, n = 10000) plot_belief_space(sol, what = "reward", sample = "random", n = 1000) plot_belief_space(sol, what = "pg_node", n = 10000) # holding tiger-left constant at .5 follows this line in the ternary plot Ternary::TernaryLines(list(c(.5, 0, .5), c(.5, .5, 0)), col = "black", lty = 2) # we can plot the projection for this line plot_belief_space(sol, what = "pg_node", n = 1000, projection = c("tiger-left" = .5)) # plot the belief points used by the grid-based solver plot_belief_space(sol, sample = sol$solution$belief_points_solver, what = "pg_node") # plot the belief points obtained using simulated trajectories with an epsilon-greedy policy. # Note that we only use n = 50 to save time. plot_belief_space(sol, sample = simulate_POMDP(sol, n = 50, horizon = 100, epsilon = 0.1, return_beliefs = TRUE)$belief_states) } # plot a 3-state belief space using ggtern (ggplot2) ## Not run: library(ggtern) samp <- sample_belief_space(sol, n = 1000) df <- cbind(as.data.frame(samp), reward_node_action(sol, belief = samp)) df$pg_node <- factor(df$pg_node) ggtern(df, aes(x = `tiger-left`, y = `tiger-center`, z = `tiger-right`)) + geom_point(aes(color = pg_node), size = 2) ggtern(df, aes(x = `tiger-left`, y = `tiger-center`, z = `tiger-right`)) + geom_point(aes(color = action), size = 2) ggtern(df, aes(x = `tiger-left`, y = `tiger-center`, z = `tiger-right`)) + geom_point(aes(color = reward), size = 2) ## End(Not run)
The function plots the POMDP policy graph for converged POMDP solution and the policy tree for a finite-horizon solution.
plot_policy_graph( x, belief = NULL, engine = c("igraph", "visNetwork"), show_belief = TRUE, state_col = NULL, legend = TRUE, simplify_observations = TRUE, remove_unreachable_nodes = TRUE, ... ) curve_multiple_directed(graph, start = 0.3)
plot_policy_graph( x, belief = NULL, engine = c("igraph", "visNetwork"), show_belief = TRUE, state_col = NULL, legend = TRUE, simplify_observations = TRUE, remove_unreachable_nodes = TRUE, ... ) curve_multiple_directed(graph, start = 0.3)
x |
object of class POMDP containing a solved and converged POMDP problem. |
belief |
the initial belief is used to mark the initial belief state in the
graph of a converged solution and to identify the root node in a policy graph for a finite-horizon solution.
If |
engine |
The plotting engine to be used. |
show_belief |
logical; show estimated belief proportions as a pie chart or color in each node? |
state_col |
colors used to represent the belief over states in each node. Only used if |
legend |
logical; display a legend for colors used belief proportions? |
simplify_observations |
combine parallel observation arcs into a single arc. |
remove_unreachable_nodes |
logical; remove nodes that are not reachable from the start state? Currently only implemented for policy trees for unconverged finite-time horizon POMDPs. |
... |
parameters are passed on to |
graph |
The input graph. |
start |
The curvature at the two extreme edges. |
The policy graph returned by policy_graph()
can be directly plotted. plot_policy_graph()
uses policy_graph()
to get the policy graph and produces an
improved visualization (a legend, tree layout for finite-horizon solutions, better edge curving, etc.).
It also offers an interactive visualization using visNetwork::visIgraph()
.
Each policy graph node is represented by an alpha vector specifying a hyper plane segment. The convex hull of the set of hyperplanes represents the the value function. The policy specifies for each node an optimal action which is printed together with the node ID inside the node. The arcs are labeled with observations. Infinite-horizon converged solutions from a single policy graph. For finite-horizon solution a policy tree is produced. The levels of the tree and the first number in the node label represent the epochs.
For better visualization, we provide a few features:
Show Belief, belief color and legend: A pie chart (or the color) in each node can be used
represent an example of the belief that the agent has if it is in this node.
This can help with interpreting the policy graph. The belief is obtained by calling
estimate_belief_for_nodes()
.
Simplify observations: In some cases, two observations can lead to the same node resulting in two parallel edges. These edges can be collapsed into one labels with the observations.
Remove unreachable nodes: Many algorithms produce unused policy graph nodes which can be filtered to produce a smaller tree structure of actually used nodes. Non-converged policies depend on the initial belief and if an initial belief is specified, then different nodes will be filtered and the tree will look different.
These improvements can be disabled using parameters.
curve_multiple_directed()
is a helper function for plotting igraph graphs similar to igraph::curve_multiple()
but
it also adds curvature to parallel edges that point in opposite directions.
returns invisibly what the plotting engine returns.
Other policy:
estimate_belief_for_nodes()
,
optimal_action()
,
plot_belief_space()
,
policy()
,
policy_graph()
,
projection()
,
reward()
,
solve_POMDP()
,
solve_SARSOP()
,
value_function()
data("Tiger") ### Policy graphs for converged solutions sol <- solve_POMDP(model = Tiger) sol policy_graph(sol) ## visualization plot_policy_graph(sol) ## use a different graph layout (circle and manual; needs igraph) library("igraph") plot_policy_graph(sol, layout = layout.circle) plot_policy_graph(sol, layout = rbind(c(1,1), c(1,-1), c(0,0), c(-1,-1), c(-1,1)), margin = .2) plot_policy_graph(sol, layout = rbind(c(1,0), c(.5,0), c(0,0), c(-.5,0), c(-1,0)), rescale = FALSE, vertex.size = 15, edge.curved = 2, main = "Tiger Problem") ## hide labels, beliefs and legend plot_policy_graph(sol, show_belief = FALSE, edge.label = NA, vertex.label = NA, legend = FALSE) ## custom larger vertex labels (A, B, ...) plot_policy_graph(sol, vertex.label = LETTERS[1:nrow(policy(sol))], vertex.size = 60, vertex.label.cex = 2, edge.label.cex = .7, vertex.label.color = "white") ## plotting the igraph object directly pg <- policy_graph(sol, show_belief = TRUE, simplify_observations = TRUE, remove_unreachable_nodes = TRUE) ## (e.g., using a tree layout) plot(pg, layout = layout_as_tree(pg, root = 3, mode = "out")) ## change labels (abbreviate observations and use only actions to label the vertices) plot(pg, edge.label = abbreviate(E(pg)$label), vertex.label = V(pg)$action, vertex.size = 20) ## use action to color vertices (requires a graph without a belief pie chart) ## and color edges to represent observations. pg <- policy_graph(sol, show_belief = FALSE, simplify_observations = TRUE, remove_unreachable_nodes = TRUE) plot(pg, vertex.label = NA, vertex.color = factor(V(pg)$action), vertex.size = 20, edge.color = factor(E(pg)$observation), edge.curved = .1 ) acts <- levels(factor(V(pg)$action)) legend("topright", legend = acts, title = "action", col = igraph::categorical_pal(length(acts)), pch = 15) obs <- levels(factor(E(pg)$observation)) legend("bottomright", legend = obs, title = "observation", col = igraph::categorical_pal(length(obs)), lty = 1) ## plot interactive graphs using the visNetwork library. ## Note: the pie chart representation is not available, but colors are used instead. plot_policy_graph(sol, engine = "visNetwork") ## add smooth edges and a layout (note, engine can be abbreviated) plot_policy_graph(sol, engine = "visNetwork", layout = "layout_in_circle", smooth = TRUE) ### Policy trees for finite-horizon solutions sol <- solve_POMDP(model = Tiger, horizon = 4, method = "incprune") policy_graph(sol) plot_policy_graph(sol) # Note: the first number in the node id is the epoch. # plot the policy tree for an initial belief of 90% that the tiger is to the left plot_policy_graph(sol, belief = c(0.9, 0.1)) # Plotting a larger graph (see ? igraph.plotting for plotting options) sol <- solve_POMDP(model = Tiger, horizon = 10, method = "incprune") plot_policy_graph(sol, edge.arrow.size = .1, vertex.label.cex = .5, edge.label.cex = .5) plot_policy_graph(sol, engine = "visNetwork")
data("Tiger") ### Policy graphs for converged solutions sol <- solve_POMDP(model = Tiger) sol policy_graph(sol) ## visualization plot_policy_graph(sol) ## use a different graph layout (circle and manual; needs igraph) library("igraph") plot_policy_graph(sol, layout = layout.circle) plot_policy_graph(sol, layout = rbind(c(1,1), c(1,-1), c(0,0), c(-1,-1), c(-1,1)), margin = .2) plot_policy_graph(sol, layout = rbind(c(1,0), c(.5,0), c(0,0), c(-.5,0), c(-1,0)), rescale = FALSE, vertex.size = 15, edge.curved = 2, main = "Tiger Problem") ## hide labels, beliefs and legend plot_policy_graph(sol, show_belief = FALSE, edge.label = NA, vertex.label = NA, legend = FALSE) ## custom larger vertex labels (A, B, ...) plot_policy_graph(sol, vertex.label = LETTERS[1:nrow(policy(sol))], vertex.size = 60, vertex.label.cex = 2, edge.label.cex = .7, vertex.label.color = "white") ## plotting the igraph object directly pg <- policy_graph(sol, show_belief = TRUE, simplify_observations = TRUE, remove_unreachable_nodes = TRUE) ## (e.g., using a tree layout) plot(pg, layout = layout_as_tree(pg, root = 3, mode = "out")) ## change labels (abbreviate observations and use only actions to label the vertices) plot(pg, edge.label = abbreviate(E(pg)$label), vertex.label = V(pg)$action, vertex.size = 20) ## use action to color vertices (requires a graph without a belief pie chart) ## and color edges to represent observations. pg <- policy_graph(sol, show_belief = FALSE, simplify_observations = TRUE, remove_unreachable_nodes = TRUE) plot(pg, vertex.label = NA, vertex.color = factor(V(pg)$action), vertex.size = 20, edge.color = factor(E(pg)$observation), edge.curved = .1 ) acts <- levels(factor(V(pg)$action)) legend("topright", legend = acts, title = "action", col = igraph::categorical_pal(length(acts)), pch = 15) obs <- levels(factor(E(pg)$observation)) legend("bottomright", legend = obs, title = "observation", col = igraph::categorical_pal(length(obs)), lty = 1) ## plot interactive graphs using the visNetwork library. ## Note: the pie chart representation is not available, but colors are used instead. plot_policy_graph(sol, engine = "visNetwork") ## add smooth edges and a layout (note, engine can be abbreviated) plot_policy_graph(sol, engine = "visNetwork", layout = "layout_in_circle", smooth = TRUE) ### Policy trees for finite-horizon solutions sol <- solve_POMDP(model = Tiger, horizon = 4, method = "incprune") policy_graph(sol) plot_policy_graph(sol) # Note: the first number in the node id is the epoch. # plot the policy tree for an initial belief of 90% that the tiger is to the left plot_policy_graph(sol, belief = c(0.9, 0.1)) # Plotting a larger graph (see ? igraph.plotting for plotting options) sol <- solve_POMDP(model = Tiger, horizon = 10, method = "incprune") plot_policy_graph(sol, edge.arrow.size = .1, vertex.label.cex = .5, edge.label.cex = .5) plot_policy_graph(sol, engine = "visNetwork")
Extracts the policy from a solved POMDP/MDP.
policy(x, drop = TRUE)
policy(x, drop = TRUE)
x |
|
drop |
logical; drop the list for converged, epoch-independent policies. |
A list (one entry per epoch) with the optimal policy. For converged, infinite-horizon problems solutions, a list with only the converged solution is produced. For a POMDP, the policy is a data.frame consisting of:
Part 1: The alpha vectors for the belief states (defines also the utility of the belief). The columns have the names of states.
Part 2: The last column named action
contains the prescribed action.
For an MDP, the policy is a data.frame with columns for:
state
: The state.
U
: The state's value (discounted expected utility U) if the policy
is followed
action
: The prescribed action.
A list with the policy for each epoch. Converged policies
have only one element. If drop = TRUE
then the policy is returned
without a list.
Michael Hahsler
Other policy:
estimate_belief_for_nodes()
,
optimal_action()
,
plot_belief_space()
,
plot_policy_graph()
,
policy_graph()
,
projection()
,
reward()
,
solve_POMDP()
,
solve_SARSOP()
,
value_function()
data("Tiger") # Infinite horizon sol <- solve_POMDP(model = Tiger) sol # policy with value function, optimal action and transitions for observations. policy(sol) plot_value_function(sol) # Finite horizon (we use incremental pruning because grid does not converge) sol <- solve_POMDP(model = Tiger, method = "incprune", horizon = 3, discount = 1) sol policy(sol) # Note: We see that it is initially better to listen till we make # a decision in the final epoch. # MDP policy data(Maze) sol <- solve_MDP(Maze) policy(sol)
data("Tiger") # Infinite horizon sol <- solve_POMDP(model = Tiger) sol # policy with value function, optimal action and transitions for observations. policy(sol) plot_value_function(sol) # Finite horizon (we use incremental pruning because grid does not converge) sol <- solve_POMDP(model = Tiger, method = "incprune", horizon = 3, discount = 1) sol policy(sol) # Note: We see that it is initially better to listen till we make # a decision in the final epoch. # MDP policy data(Maze) sol <- solve_MDP(Maze) policy(sol)
The function creates a POMDP policy graph for converged POMDP solution and the policy tree for a finite-horizon solution. The graph is represented as an igraph object.
policy_graph( x, belief = NULL, show_belief = FALSE, state_col = NULL, simplify_observations = FALSE, remove_unreachable_nodes = FALSE, ... )
policy_graph( x, belief = NULL, show_belief = FALSE, state_col = NULL, simplify_observations = FALSE, remove_unreachable_nodes = FALSE, ... )
x |
object of class POMDP containing a solved and converged POMDP problem. |
belief |
the initial belief is used to mark the initial belief state in the
grave of a converged solution and to identify the root node in a policy graph for a finite-horizon solution.
If |
show_belief |
logical; show estimated belief proportions as a pie chart or color in each node? |
state_col |
colors used to represent the belief over the states in each node. Only used if |
simplify_observations |
combine parallel observation arcs into a single arc. |
remove_unreachable_nodes |
logical; remove nodes that are not reachable from the start state? Currently only implemented for policy trees for unconverged finite-time horizon POMDPs. |
... |
parameters are passed on to |
Each policy graph node is represented by an alpha vector specifying a hyper plane segment. The convex hull of the set of hyperplanes represents the the value function. The policy specifies for each node an optimal action which is printed together with the node ID inside the node. The arcs are labeled with observations. Infinite-horizon converged solutions from a single policy graph. For finite-horizon solution a policy tree is produced. The levels of the tree and the first number in the node label represent the epochs.
The parameters show_belief
, remove_unreachable_nodes
, and simplify_observations
are
used by plot_policy_graph()
(see there for details) to reduce clutter and make the visualization more readable.
These options are disabled by default for policy_graph()
.
returns the policy graph as an igraph object.
Other policy:
estimate_belief_for_nodes()
,
optimal_action()
,
plot_belief_space()
,
plot_policy_graph()
,
policy()
,
projection()
,
reward()
,
solve_POMDP()
,
solve_SARSOP()
,
value_function()
data("Tiger") ### Policy graphs for converged solutions sol <- solve_POMDP(model = Tiger) sol policy_graph(sol) ## visualization plot_policy_graph(sol) ### Policy trees for finite-horizon solutions sol <- solve_POMDP(model = Tiger, horizon = 4, method = "incprune") policy_graph(sol) plot_policy_graph(sol) # Note: the first number in the node id is the epoch.
data("Tiger") ### Policy graphs for converged solutions sol <- solve_POMDP(model = Tiger) sol policy_graph(sol) ## visualization plot_policy_graph(sol) ### Policy trees for finite-horizon solutions sol <- solve_POMDP(model = Tiger, horizon = 4, method = "incprune") policy_graph(sol) plot_policy_graph(sol) # Note: the first number in the node id is the epoch.
Defines all the elements of a POMDP problem including the discount rate, the set of states, the set of actions, the set of observations, the transition probabilities, the observation probabilities, and rewards.
POMDP( states, actions, observations, transition_prob, observation_prob, reward, discount = 0.9, horizon = Inf, terminal_values = NULL, start = "uniform", info = NULL, name = NA ) is_solved_POMDP(x, stop = FALSE, message = "") is_timedependent_POMDP(x) epoch_to_episode(x, epoch) is_converged_POMDP(x, stop = FALSE, message = "") O_(action = NA, end.state = NA, observation = NA, probability) T_(action = NA, start.state = NA, end.state = NA, probability) R_(action = NA, start.state = NA, end.state = NA, observation = NA, value)
POMDP( states, actions, observations, transition_prob, observation_prob, reward, discount = 0.9, horizon = Inf, terminal_values = NULL, start = "uniform", info = NULL, name = NA ) is_solved_POMDP(x, stop = FALSE, message = "") is_timedependent_POMDP(x) epoch_to_episode(x, epoch) is_converged_POMDP(x, stop = FALSE, message = "") O_(action = NA, end.state = NA, observation = NA, probability) T_(action = NA, start.state = NA, end.state = NA, probability) R_(action = NA, start.state = NA, end.state = NA, observation = NA, value)
states |
a character vector specifying the names of the states. Note that state names have to start with a letter. |
actions |
a character vector specifying the names of the available actions. Note that action names have to start with a letter. |
observations |
a character vector specifying the names of the observations. Note that observation names have to start with a letter. |
transition_prob |
Specifies action-dependent transition probabilities between states. See Details section. |
observation_prob |
Specifies the probability that an action/state combination produces an observation. See Details section. |
reward |
Specifies the rewards structure dependent on action, states and observations. See Details section. |
discount |
numeric; discount factor between 0 and 1. |
horizon |
numeric; Number of epochs. |
terminal_values |
a vector with the terminal values for each state or a
matrix specifying the terminal rewards via a terminal value function (e.g.,
the alpha component produced by |
start |
Specifies the initial belief state of the agent. A vector with the
probability for each state is supplied. Also the string |
info |
A list with additional information. |
name |
a string to identify the POMDP problem. |
x |
a POMDP. |
stop |
logical; stop with an error. |
message |
a error message to be displayed displayed |
epoch |
integer; an epoch that should be converted to the corresponding episode in a time-dependent POMDP. |
action , start.state , end.state , observation , probability , value
|
Values
used in the helper functions |
In the following we use the following notation. The POMDP is a 7-duple:
.
is the set of states;
is the set of actions;
are the conditional transition probabilities
between states;
is the reward function;
is the set of
observations;
are the conditional observation probabilities; and
is the discount factor. We will use lower case letters to
represent a member of a set, e.g.,
is a specific state. To refer to
the size of a set we will use cardinality, e.g., the number of actions is
.
Note that the observation model is in the literature
often also denoted by the letter .
Names used for mathematical symbols in code
:
'states', start.state', 'end.state'
:
'actions', 'action'
:
'observations', 'observation'
State names, actions and observations can be specified as strings or index numbers
(e.g., start.state
can be specified as the index of the state in states
).
For the specification as data.frames below, NA
can be used to mean
any start.state
, end.state
, action
or observation
. Note that some POMDP solvers and the POMDP
file format use '*'
for this purpose.
The specification below map to the format used by pomdp-solve (see http://www.pomdp.org).
Specification of transition probabilities:
Transition probability to transition to state from given state
and action
. The transition probabilities can be
specified in the following ways:
A data.frame with columns exactly like the arguments of T_()
.
You can use rbind()
with helper function T_()
to create this data
frame. Probabilities can be specified multiple times and the definition that
appears last in the data.frame will take affect.
A named list of matrices, one for each action. Each matrix is square with
rows representing start states and columns representing end states
.
Instead of a matrix, also the strings
'identity'
or 'uniform'
can be specified.
A function with the same arguments are T_()
, but no default values
that returns the transition probability.
Specification of observation probabilities:
The POMDP specifies the probability for each observation given an
action
and that the system transitioned to the end state
. These probabilities can be specified in the
following ways:
A data frame with columns named exactly like the arguments of O_()
.
You can use rbind()
with helper function O_()
to create this data frame. Probabilities can be
specified multiple times and the definition that
appears last in the data.frame will take affect.
A named list of matrices, one for each action. Each matrix has
rows representing end states and columns representing an observation
.
Instead of a matrix, also the string
'uniform'
can be specified.
A function with the same arguments are O_()
, but no default values
that returns the observation probability.
Specification of the reward function:
The reward function can be specified in the following ways:
A data frame with columns named exactly like the arguments of R_()
.
You can use rbind()
with helper function R_()
to create this data frame. Rewards can be specified
multiple times and the definition that
appears last in the data.frame will take affect.
A list of lists. The list levels are 'action'
and 'start.state'
. The list elements
are matrices with
rows representing end states and columns representing an observation
.
A function with the same arguments are R_()
, but no default values
that returns the reward.
To avoid overflow problems with rewards, reward values should stay well within the
range of
[-1e10, +1e10]
. -Inf
can be used as the reward for unavailable actions and
will be translated into a large negative reward for solvers that only support
finite reward values.
Start Belief
The initial belief state of the agent is a distribution over the states. It is used to calculate the
total expected cumulative reward printed with the solved model. The function reward()
can be
used to calculate rewards for any belief.
Some methods use this belief to decide which belief states to explore (e.g., the finite grid method).
Options to specify the start belief state are:
A probability distribution over the states. That is, a vector
of probabilities, that add up to
.
The string "uniform"
for a uniform
distribution over all states.
An integer in the range to
to specify the index of a single starting state.
A string specifying the name of a single starting state.
The default initial belief is a uniform distribution over all states.
Convergence
A infinite-horizon POMDP needs to converge to provide a valid value function and policy.
A finite-horizon POMDP may also converging to a infinite horizon solution if the horizon is long enough.
Time-dependent POMDPs
Time dependence of transition probabilities, observation probabilities and
reward structure can be modeled by considering a set of episodes
representing epoch with the same settings. The length of each episode is
specified as a vector for horizon
, where the length is the number of
episodes and each value is the length of the episode in epochs. Transition
probabilities, observation probabilities and/or reward structure can contain
a list with the values for each episode. The helper function epoch_to_episode()
converts
an epoch to the episode it belongs to.
The function returns an object of class POMDP which is list of the model specification.
solve_POMDP()
reads the object and adds a list element named
'solution'
.
Hossein Kamalzadeh, Michael Hahsler
pomdp-solve website: http://www.pomdp.org
Other POMDP:
MDP2POMDP
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
Other POMDP_examples:
POMDP_example_files
,
RussianTiger
,
Tiger
## Defining the Tiger Problem (it is also available via data(Tiger), see ? Tiger) Tiger <- POMDP( name = "Tiger Problem", discount = 0.75, states = c("tiger-left" , "tiger-right"), actions = c("listen", "open-left", "open-right"), observations = c("tiger-left", "tiger-right"), start = "uniform", transition_prob = list( "listen" = "identity", "open-left" = "uniform", "open-right" = "uniform" ), observation_prob = list( "listen" = rbind(c(0.85, 0.15), c(0.15, 0.85)), "open-left" = "uniform", "open-right" = "uniform" ), # the reward helper expects: action, start.state, end.state, observation, value # missing arguments default to NA which matches any value (often denoted as * in POMDPs). reward = rbind( R_("listen", v = -1), R_("open-left", "tiger-left", v = -100), R_("open-left", "tiger-right", v = 10), R_("open-right", "tiger-left", v = 10), R_("open-right", "tiger-right", v = -100) ) ) Tiger ### Defining the Tiger problem using functions trans_f <- function(action, start.state, end.state) { if(action == 'listen') if(end.state == start.state) return(1) else return(0) return(1/2) ### all other actions have a uniform distribution } obs_f <- function(action, end.state, observation) { if(action == 'listen') if(end.state == observation) return(0.85) else return(0.15) return(1/2) } rew_f <- function(action, start.state, end.state, observation) { if(action == 'listen') return(-1) if(action == 'open-left' && start.state == 'tiger-left') return(-100) if(action == 'open-left' && start.state == 'tiger-right') return(10) if(action == 'open-right' && start.state == 'tiger-left') return(10) if(action == 'open-right' && start.state == 'tiger-right') return(-100) stop('Not possible') } Tiger_func <- POMDP( name = "Tiger Problem", discount = 0.75, states = c("tiger-left" , "tiger-right"), actions = c("listen", "open-left", "open-right"), observations = c("tiger-left", "tiger-right"), start = "uniform", transition_prob = trans_f, observation_prob = obs_f, reward = rew_f ) Tiger_func # Defining a Time-dependent version of the Tiger Problem called Scared Tiger # The tiger reacts normally for 3 epochs (goes randomly two one # of the two doors when a door was opened). After 3 epochs he gets # scared and when a door is opened then he always goes to the other door. # specify the horizon for each of the two different episodes Tiger_time_dependent <- Tiger Tiger_time_dependent$name <- "Scared Tiger Problem" Tiger_time_dependent$horizon <- c(normal_tiger = 3, scared_tiger = 3) Tiger_time_dependent$transition_prob <- list( normal_tiger = list( "listen" = "identity", "open-left" = "uniform", "open-right" = "uniform"), scared_tiger = list( "listen" = "identity", "open-left" = rbind(c(0, 1), c(0, 1)), "open-right" = rbind(c(1, 0), c(1, 0)) ) )
## Defining the Tiger Problem (it is also available via data(Tiger), see ? Tiger) Tiger <- POMDP( name = "Tiger Problem", discount = 0.75, states = c("tiger-left" , "tiger-right"), actions = c("listen", "open-left", "open-right"), observations = c("tiger-left", "tiger-right"), start = "uniform", transition_prob = list( "listen" = "identity", "open-left" = "uniform", "open-right" = "uniform" ), observation_prob = list( "listen" = rbind(c(0.85, 0.15), c(0.15, 0.85)), "open-left" = "uniform", "open-right" = "uniform" ), # the reward helper expects: action, start.state, end.state, observation, value # missing arguments default to NA which matches any value (often denoted as * in POMDPs). reward = rbind( R_("listen", v = -1), R_("open-left", "tiger-left", v = -100), R_("open-left", "tiger-right", v = 10), R_("open-right", "tiger-left", v = 10), R_("open-right", "tiger-right", v = -100) ) ) Tiger ### Defining the Tiger problem using functions trans_f <- function(action, start.state, end.state) { if(action == 'listen') if(end.state == start.state) return(1) else return(0) return(1/2) ### all other actions have a uniform distribution } obs_f <- function(action, end.state, observation) { if(action == 'listen') if(end.state == observation) return(0.85) else return(0.15) return(1/2) } rew_f <- function(action, start.state, end.state, observation) { if(action == 'listen') return(-1) if(action == 'open-left' && start.state == 'tiger-left') return(-100) if(action == 'open-left' && start.state == 'tiger-right') return(10) if(action == 'open-right' && start.state == 'tiger-left') return(10) if(action == 'open-right' && start.state == 'tiger-right') return(-100) stop('Not possible') } Tiger_func <- POMDP( name = "Tiger Problem", discount = 0.75, states = c("tiger-left" , "tiger-right"), actions = c("listen", "open-left", "open-right"), observations = c("tiger-left", "tiger-right"), start = "uniform", transition_prob = trans_f, observation_prob = obs_f, reward = rew_f ) Tiger_func # Defining a Time-dependent version of the Tiger Problem called Scared Tiger # The tiger reacts normally for 3 epochs (goes randomly two one # of the two doors when a door was opened). After 3 epochs he gets # scared and when a door is opened then he always goes to the other door. # specify the horizon for each of the two different episodes Tiger_time_dependent <- Tiger Tiger_time_dependent$name <- "Scared Tiger Problem" Tiger_time_dependent$horizon <- c(normal_tiger = 3, scared_tiger = 3) Tiger_time_dependent$transition_prob <- list( normal_tiger = list( "listen" = "identity", "open-left" = "uniform", "open-right" = "uniform"), scared_tiger = list( "listen" = "identity", "open-left" = rbind(c(0, 1), c(0, 1)), "open-right" = rbind(c(1, 0), c(1, 0)) ) )
Some POMDP example files are shipped with the package.
Currently, the following POMDP example files are available:
"light_maze.POMDP"
: a simple maze introduced in Littman (2009).
"shuttle_95.POMDP"
: Transport goods between two space
stations (Chrisman, 1992).
"tiger_aaai.POMDP"
: Tiger Problem introduced in Cassandra et al (1994).
More files can be found at https://www.pomdp.org/examples/
Anthony R. Cassandra, Leslie P Kaelbling, and Michael L. Littman (1994). Acting Optimally in Partially Observable Stochastic Domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1023-1028.
Lonnie Chrisman (1992), Reinforcement Learning with Perceptual Aliasing: The Proceedings of the AAAI Conference on Artificial Intelligence, 10, AAAI-92.
Michael L. Littman (2009), A tutorial on partially observable Markov decision processes, Journal of Mathematical Psychology, Volume 53, Issue 3, June 2009, Pages 119-125. doi:10.1016/j.jmp.2009.01.005
Other POMDP_examples:
POMDP()
,
RussianTiger
,
Tiger
dir(system.file("examples/", package = "pomdp")) model <- read_POMDP(system.file("examples/light_maze.POMDP", package = "pomdp")) model
dir(system.file("examples/", package = "pomdp")) model <- read_POMDP(system.file("examples/light_maze.POMDP", package = "pomdp")) model
High dimensional belief spaces can be projected to lower dimension. This is useful for visualization and
to analyze the belief space and value functions. This definition is used by functions like plot_belief_space()
,
plot_value_function()
, and sample_belief_space()
.
projection(x = NULL, model)
projection(x = NULL, model)
x |
specification of the projection (see Details section). |
model |
a POMDP. |
The belief space is $n-1$ dimensional, were $n$ is the number of states. Note: it is n-1 dimensional since the probabilities
need to add up to 1. A projection fixes the belief value for a set of states. For example, for a 4-state POMDP
(s1, s2, s3, s4), we can project the belief space on s1 and s2 by holding s3 and s4 constant
which is represented by the vector c(s1 = NA, s2 = NA, s3 = 0, s4 = .1)
. We use NA
to represent that the values are not
fixed and the value that the other dimensions are held constant.
We provide several ways to specify a projection:
A vector with values for all dimensions. NA
s are used for the dimension projected on. This is the canonical form
used in this package. Example: c(NA, NA, 0, .1)
A named vector with just the dimensions held constant. Example: c(s3 = 0, s4 = .1)
A vector of state names to project on. All other dimensions are held constant at 0. Example: c("s1", "s2")
A vector with indices of the states to project on. All other dimensions are held constant at 0. Example: c(1, 2)
a canonical description of the projection.
Michael Hahsler
Other policy:
estimate_belief_for_nodes()
,
optimal_action()
,
plot_belief_space()
,
plot_policy_graph()
,
policy()
,
policy_graph()
,
reward()
,
solve_POMDP()
,
solve_SARSOP()
,
value_function()
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
model <- POMDP( states = 4, actions = 2, observations = 2, transition_prob = list("identity","identity"), observation_prob = list("uniform","uniform"), reward = rbind(R_(value = 1)) ) projection(NULL, model = model) projection(1:2, model = model) projection(c("s2", "s3"), model = model) projection(c(1,4), model = model) projection(c(s2 = .4, s3 = .2), model = model) projection(c(s1 = .1, s2 = NA, s3 = NA, s4 = .3), model = model)
model <- POMDP( states = 4, actions = 2, observations = 2, transition_prob = list("identity","identity"), observation_prob = list("uniform","uniform"), reward = rbind(R_(value = 1)) ) projection(NULL, model = model) projection(1:2, model = model) projection(c("s2", "s3"), model = model) projection(c(1,4), model = model) projection(c(s2 = .4, s3 = .2), model = model) projection(c(s1 = .1, s2 = NA, s3 = NA, s4 = .3), model = model)
Find reachable and absorbing states in the transition model.
reachable_states(x, states = NULL) absorbing_states(x, states = NULL) remove_unreachable_states(x)
reachable_states(x, states = NULL) absorbing_states(x, states = NULL) remove_unreachable_states(x)
x |
a |
states |
a character vector specifying the names of the states to be
checked. |
The function reachable_states()
checks if states
are reachable using the transition model.
The function absorbing_states()
checks if a state or a set of states are
absorbing (terminal states) with a zero reward (or -Inf
for unavailable actions).
If no states are specified (states = NULL
), then all model states are
checked. This information can be used in simulations to end an episode.
The function remove_unreachable_states()
simplifies a model by
removing unreachable states.
reachable_states()
returns a logical vector indicating
if the states are reachable.
absorbing_states()
returns a logical vector indicating
if the states are absorbing (terminal).
the model with all unreachable states removed
Michael Hahsler
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
actions()
,
add_policy()
,
gridworld
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
data(Maze) gridworld_matrix(Maze, what = "label") # the states marked with +1 and -1 are absorbing absorbing_states(Maze) which(absorbing_states(Maze)) # all states in the model are reachable reachable_states(Maze) which(!reachable_states(Maze))
data(Maze) gridworld_matrix(Maze, what = "label") # the states marked with +1 and -1 are absorbing absorbing_states(Maze) which(absorbing_states(Maze)) # all states in the model are reachable reachable_states(Maze) which(!reachable_states(Maze))
Calculates the regret of a policy relative to a benchmark policy.
regret(policy, benchmark, start = NULL)
regret(policy, benchmark, start = NULL)
policy |
a solved POMDP containing the policy to calculate the regret for. |
benchmark |
a solved POMDP with the (optimal) policy. Regret is calculated relative to this policy. |
start |
the used start (belief) state. If NULL then the start (belief) state of the |
Regret is defined as with
representing the expected long-term
state value (represented by the value function) given the policy
and the start
state
. For POMDPs the start state is the start belief
.
Note that for regret usually the optimal policy is used as the benchmark.
Since the optimal policy may not be known, regret relative to the best known policy can be used.
the regret as a difference of expected long-term rewards.
Michael Hahsler
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
actions()
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
,
value_function()
data(Tiger) sol_optimal <- solve_POMDP(Tiger) sol_optimal # perform exact value iteration for 10 epochs sol_quick <- solve_POMDP(Tiger, method = "enum", horizon = 10) sol_quick regret(sol_quick, benchmark = sol_optimal)
data(Tiger) sol_optimal <- solve_POMDP(Tiger) sol_optimal # perform exact value iteration for 10 epochs sol_quick <- solve_POMDP(Tiger, method = "enum", horizon = 10) sol_quick regret(sol_quick, benchmark = sol_optimal)
This function calculates the expected total reward for a POMDP solution
given a starting belief state. The value is calculated using the value function stored
in the POMDP solution. In addition, the policy graph node that represents the belief state
and the optimal action can also be returned using reward_node_action()
.
reward(x, belief = NULL, epoch = 1, ...) reward_node_action(x, belief = NULL, epoch = 1, ...)
reward(x, belief = NULL, epoch = 1, ...) reward_node_action(x, belief = NULL, epoch = 1, ...)
x |
a solved POMDP object. |
belief |
specification of the current belief state (see argument start in POMDP for details). By default the belief state defined in the model as start is used. Multiple belief states can be specified as rows in a matrix. |
epoch |
return reward for this epoch. Use 1 for converged policies. |
... |
further arguments are passed on. |
The reward is typically calculated using the value function (alpha vectors)
of the solution. If these are not available, then simulate_POMDP()
is
used instead with a warning.
reward()
returns a vector of reward values, one for each belief if a matrix is specified.
reward_node_action()
returns a list with the components
belief_state |
the belief state specified in |
reward |
the total expected reward given a belief and epoch. |
pg_node |
the policy node that represents the belief state. |
action |
the optimal action. |
Michael Hahsler
Other policy:
estimate_belief_for_nodes()
,
optimal_action()
,
plot_belief_space()
,
plot_policy_graph()
,
policy()
,
policy_graph()
,
projection()
,
solve_POMDP()
,
solve_SARSOP()
,
value_function()
data("Tiger") sol <- solve_POMDP(model = Tiger) # if no start is specified, a uniform belief is used. reward(sol) # we have additional information that makes us believe that the tiger # is more likely to the left. reward(sol, belief = c(0.85, 0.15)) # we start with strong evidence that the tiger is to the left. reward(sol, belief = "tiger-left") # Note that in this case, the total discounted expected reward is greater # than 10 since the tiger problem resets and another game staring with # a uniform belief is played which produces additional reward. # return reward, the initial node in the policy graph and the optimal action for # two beliefs. reward_node_action(sol, belief = rbind(c(.5, .5), c(.9, .1))) # manually combining reward with belief space sampling to show the value function # (color signifies the optimal action) samp <- sample_belief_space(sol, n = 200) rew <- reward_node_action(sol, belief = samp) plot(rew$belief[,"tiger-right"], rew$reward, col = rew$action, ylim = c(0, 15)) legend(x = "top", legend = levels(rew$action), title = "action", col = 1:3, pch = 1) # this is the piecewise linear value function from the solution plot_value_function(sol, ylim = c(0, 10))
data("Tiger") sol <- solve_POMDP(model = Tiger) # if no start is specified, a uniform belief is used. reward(sol) # we have additional information that makes us believe that the tiger # is more likely to the left. reward(sol, belief = c(0.85, 0.15)) # we start with strong evidence that the tiger is to the left. reward(sol, belief = "tiger-left") # Note that in this case, the total discounted expected reward is greater # than 10 since the tiger problem resets and another game staring with # a uniform belief is played which produces additional reward. # return reward, the initial node in the policy graph and the optimal action for # two beliefs. reward_node_action(sol, belief = rbind(c(.5, .5), c(.9, .1))) # manually combining reward with belief space sampling to show the value function # (color signifies the optimal action) samp <- sample_belief_space(sol, n = 200) rew <- reward_node_action(sol, belief = samp) plot(rew$belief[,"tiger-right"], rew$reward, col = rew$action, ylim = c(0, 15)) legend(x = "top", legend = levels(rew$action), title = "action", col = 1:3, pch = 1) # this is the piecewise linear value function from the solution plot_value_function(sol, ylim = c(0, 10))
Rounds a vector such that the sum of 1 is preserved. Rounds a matrix such that each row sum up to 1. One entry is adjusted after rounding such that the rounding error is the smallest.
round_stochastic(x, digits = 7)
round_stochastic(x, digits = 7)
x |
a stochastic vector or a row-stochastic matrix. |
digits |
number of digits for rounding. |
The rounded vector or matrix.
# regular rounding would not sum up to 1 x <- c(0.333, 0.334, 0.333) round_stochastic(x) round_stochastic(x, digits = 2) round_stochastic(x, digits = 1) round_stochastic(x, digits = 0) # round a stochastic matrix m <- matrix(runif(15), ncol = 3) m <- sweep(m, 1, rowSums(m), "/") m round_stochastic(m, digits = 2) round_stochastic(m, digits = 1) round_stochastic(m, digits = 0)
# regular rounding would not sum up to 1 x <- c(0.333, 0.334, 0.333) round_stochastic(x) round_stochastic(x, digits = 2) round_stochastic(x, digits = 1) round_stochastic(x, digits = 0) # round a stochastic matrix m <- matrix(runif(15), ncol = 3) m <- sweep(m, 1, rowSums(m), "/") m round_stochastic(m, digits = 2) round_stochastic(m, digits = 1) round_stochastic(m, digits = 0)
This is a variation of the Tiger Problem introduced in Cassandra et al (1994) with an absorbing state after a door is opened.
An object of class POMDP.
The original Tiger problem is available as Tiger. The original problem is
an infinite-horizon problem, where when the agent opens a door then the
problem starts over. The infinite-horizon problem can be solved if
a discount factor is used.
The Russian Tiger problem uses no discounting, but instead
adds an absorbing state done
which is reached
after the agent opens a door. It adds the action nothing
to indicate
that the agent does nothing. The nothing
action is only available in the
state done
indicated by a reward of -Inf
from all after states. A new
observation done
is only emitted by the state done
. Also, the Russian
tiger inflicts more pain with a negative reward of -1000.
Other POMDP_examples:
POMDP()
,
POMDP_example_files
,
Tiger
data("RussianTiger") RussianTiger # states, actions, and observations RussianTiger$states RussianTiger$actions RussianTiger$observations # reward (-Inf indicates unavailable actions) RussianTiger$reward sapply(RussianTiger$states, FUN = function(s) actions(RussianTiger, s)) plot_transition_graph(RussianTiger, vertex.size = 30, edge.arrow.size = .3, margin = .5) # absorbing states absorbing_states(RussianTiger) # solve the problem. sol <- solve_POMDP(RussianTiger) policy(sol) plot_policy_graph(sol)
data("RussianTiger") RussianTiger # states, actions, and observations RussianTiger$states RussianTiger$actions RussianTiger$observations # reward (-Inf indicates unavailable actions) RussianTiger$reward sapply(RussianTiger$states, FUN = function(s) actions(RussianTiger, s)) plot_transition_graph(RussianTiger, vertex.size = 30, edge.arrow.size = .3, margin = .5) # absorbing states absorbing_states(RussianTiger) # solve the problem. sol <- solve_POMDP(RussianTiger) policy(sol) plot_policy_graph(sol)
Sample points from belief space using a several sampling strategies.
sample_belief_space(model, projection = NULL, n = 1000, method = "random", ...)
sample_belief_space(model, projection = NULL, n = 1000, method = "random", ...)
model |
a unsolved or solved POMDP. |
projection |
Sample in a projected belief space. See |
n |
size of the sample. For trajectories, it is the number of trajectories. |
method |
character string specifying the sampling strategy. Available
are |
... |
for the trajectory method, further arguments are passed on to |
The purpose of sampling from the belief space is to provide good coverage or to sample belief points that are more likely to be encountered (see trajectory method). The following sampling methods are available:
'random'
samples uniformly sample from the projected belief space using
the method described by Luc Devroye (1986). Sampling is be done in parallel
after a foreach backend is registered.
'regular'
samples points using a
regularly spaced grid. This method is only available for projections on 2 or
3 states.
"trajectories"
returns the belief states encountered in n
trajectories of length horizon
starting at the
model's initial belief. Thus it returns n
x horizon
belief states and will contain duplicates.
Projection is not supported for trajectories. Additional
arguments can include the simulation horizon
and the start belief
which are passed on to simulate_POMDP()
.
Returns a matrix. Each row is a sample from the belief space.
Michael Hahsler
Luc Devroye, Non-Uniform Random Variate Generation, Springer Verlag, 1986.
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
data("Tiger") # random sampling can be done in parallel after registering a backend. # doparallel::registerDoParallel() sample_belief_space(Tiger, n = 5) sample_belief_space(Tiger, n = 5, method = "regular") sample_belief_space(Tiger, n = 1, horizon = 5, method = "trajectories") # sample, determine the optimal action and calculate the expected reward for a solved POMDP # Note: check.names = FALSE is used to preserve the `-` for the state names in the dataframe. sol <- solve_POMDP(Tiger) samp <- sample_belief_space(sol, n = 5, method = "regular") data.frame(samp, action = optimal_action(sol, belief = samp), reward = reward(sol, belief = samp), check.names = FALSE) # sample from a 3 state problem data(Three_doors) Three_doors sample_belief_space(Three_doors, n = 5) sample_belief_space(Three_doors, n = 5, projection = c(`tiger-left` = .1)) if ("Ternary" %in% installed.packages()) { sample_belief_space(Three_doors, n = 9, method = "regular") sample_belief_space(Three_doors, n = 9, method = "regular", projection = c(`tiger-left` = .1)) } sample_belief_space(Three_doors, n = 1, horizon = 5, method = "trajectories")
data("Tiger") # random sampling can be done in parallel after registering a backend. # doparallel::registerDoParallel() sample_belief_space(Tiger, n = 5) sample_belief_space(Tiger, n = 5, method = "regular") sample_belief_space(Tiger, n = 1, horizon = 5, method = "trajectories") # sample, determine the optimal action and calculate the expected reward for a solved POMDP # Note: check.names = FALSE is used to preserve the `-` for the state names in the dataframe. sol <- solve_POMDP(Tiger) samp <- sample_belief_space(sol, n = 5, method = "regular") data.frame(samp, action = optimal_action(sol, belief = samp), reward = reward(sol, belief = samp), check.names = FALSE) # sample from a 3 state problem data(Three_doors) Three_doors sample_belief_space(Three_doors, n = 5) sample_belief_space(Three_doors, n = 5, projection = c(`tiger-left` = .1)) if ("Ternary" %in% installed.packages()) { sample_belief_space(Three_doors, n = 9, method = "regular") sample_belief_space(Three_doors, n = 9, method = "regular", projection = c(`tiger-left` = .1)) } sample_belief_space(Three_doors, n = 1, horizon = 5, method = "trajectories")
Simulate trajectories through a MDP. The start state for each trajectory is randomly chosen using the specified belief. The belief is used to choose actions from an epsilon-greedy policy and then update the state.
simulate_MDP( model, n = 100, start = NULL, horizon = NULL, epsilon = NULL, delta_horizon = 0.001, return_trajectories = FALSE, engine = "cpp", verbose = FALSE, ... )
simulate_MDP( model, n = 100, start = NULL, horizon = NULL, epsilon = NULL, delta_horizon = 0.001, return_trajectories = FALSE, engine = "cpp", verbose = FALSE, ... )
model |
a MDP model. |
n |
number of trajectories. |
start |
probability distribution over the states for choosing the starting states for the trajectories. Defaults to "uniform". |
horizon |
epochs end once an absorbing state is reached or after
the maximal number of epochs specified via |
epsilon |
the probability of random actions for using an epsilon-greedy policy. Default for solved models is 0 and for unsolved model 1. |
delta_horizon |
precision used to determine the horizon for infinite-horizon problems. |
return_trajectories |
logical; return the complete trajectories. |
engine |
|
verbose |
report used parameters. |
... |
further arguments are ignored. |
A native R implementation is available (engine = 'r'
) and the default is a
faster C++ implementation (engine = 'cpp'
).
Both implementations support parallel execution using the package
foreach. To enable parallel execution, a parallel backend like
doparallel needs to be available needs to be registered (see
doParallel::registerDoParallel()
).
Note that small simulations are slower using parallelization. Therefore, C++ simulations
with n * horizon less than 100,000 are always executed using a single worker.
A list with elements:
avg_reward
: The average discounted reward.
reward
: Reward for each trajectory.
action_cnt
: Action counts.
state_cnt
: State counts.
trajectories
: A data.frame with the trajectories. Each row
contains the episode
id, the time
step, the state s
,
the chosen action a
,
the reward r
, and the next state s_prime
. Trajectories are
only returned for return_trajectories = TRUE
.
Michael Hahsler
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
actions()
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
solve_MDP()
,
transition_graph()
,
value_function()
# enable parallel simulation # doParallel::registerDoParallel() data(Maze) # solve the POMDP for 5 epochs and no discounting sol <- solve_MDP(Maze, discount = 1) sol # U in the policy is and estimate of the utility of being in a state when using the optimal policy. policy(sol) gridworld_matrix(sol, what = "action") ## Example 1: simulate 100 trajectories following the policy, # only the final belief state is returned sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE) sim # Note that all simulations start at s_1 and that the simulated avg. reward # is therefore an estimate to the U value for the start state s_1. policy(sol)[1,] # Calculate proportion of actions taken in the simulation round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2) # reward distribution hist(sim$reward) ## Example 2: simulate starting following a uniform distribution over all # states and return all trajectories sim <- simulate_MDP(sol, n = 100, start = "uniform", horizon = 10, return_trajectories = TRUE) head(sim$trajectories) # how often was each state visited? table(sim$trajectories$s)
# enable parallel simulation # doParallel::registerDoParallel() data(Maze) # solve the POMDP for 5 epochs and no discounting sol <- solve_MDP(Maze, discount = 1) sol # U in the policy is and estimate of the utility of being in a state when using the optimal policy. policy(sol) gridworld_matrix(sol, what = "action") ## Example 1: simulate 100 trajectories following the policy, # only the final belief state is returned sim <- simulate_MDP(sol, n = 100, horizon = 10, verbose = TRUE) sim # Note that all simulations start at s_1 and that the simulated avg. reward # is therefore an estimate to the U value for the start state s_1. policy(sol)[1,] # Calculate proportion of actions taken in the simulation round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2) # reward distribution hist(sim$reward) ## Example 2: simulate starting following a uniform distribution over all # states and return all trajectories sim <- simulate_MDP(sol, n = 100, start = "uniform", horizon = 10, return_trajectories = TRUE) head(sim$trajectories) # how often was each state visited? table(sim$trajectories$s)
Simulate trajectories through a POMDP. The start state for each trajectory is randomly chosen using the specified belief. The belief is used to choose actions from the the epsilon-greedy policy and then updated using observations.
simulate_POMDP( model, n = 1000, belief = NULL, horizon = NULL, epsilon = NULL, delta_horizon = 0.001, digits = 7L, return_beliefs = FALSE, return_trajectories = FALSE, engine = "cpp", verbose = FALSE, ... )
simulate_POMDP( model, n = 1000, belief = NULL, horizon = NULL, epsilon = NULL, delta_horizon = 0.001, digits = 7L, return_beliefs = FALSE, return_trajectories = FALSE, engine = "cpp", verbose = FALSE, ... )
model |
a POMDP model. |
n |
number of trajectories. |
belief |
probability distribution over the states for choosing the starting states for the trajectories. Defaults to the start belief state specified in the model or "uniform". |
horizon |
number of epochs for the simulation. If |
epsilon |
the probability of random actions for using an epsilon-greedy policy. Default for solved models is 0 and for unsolved model 1. |
delta_horizon |
precision used to determine the horizon for infinite-horizon problems. |
digits |
round probabilities for belief points. |
return_beliefs |
logical; Return all visited belief states? This requires n x horizon memory. |
return_trajectories |
logical; Return the simulated trajectories as a data.frame? |
engine |
|
verbose |
report used parameters. |
... |
further arguments are ignored. |
Simulates n
trajectories.
If no simulation horizon is specified, the horizon of finite-horizon problems
is used. For infinite-horizon problems with , the simulation
horizon
is chosen such that
the worst-case error is no more than
. That is
where is the largest possible absolute reward value used as a
perpetuity starting after
.
A native R implementation (engine = 'r'
) and a faster C++ implementation
(engine = 'cpp'
) are available. Currently, only the R implementation supports
multi-episode problems.
Both implementations support the simulation of trajectories in parallel using the package
foreach. To enable parallel execution, a parallel backend like
doparallel needs to be registered (see
doParallel::registerDoParallel()
).
Note that small simulations are slower using parallelization. C++ simulations
with n * horizon
less than 100,000 are always executed using a single worker.
A list with elements:
avg_reward
: The average discounted reward.
action_cnt
: Action counts.
state_cnt
: State counts.
reward
: Reward for each trajectory.
belief_states
: A matrix with belief states as rows.
trajectories
: A data.frame with the episode
id, time
, the state of the
simulation (simulation_state
), the id of the used alpha vector given the current belief
(see belief_states
above), the action a
and the reward r
.
Michael Hahsler
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
data(Tiger) # solve the POMDP for 5 epochs and no discounting sol <- solve_POMDP(Tiger, horizon = 5, discount = 1, method = "enum") sol policy(sol) # uncomment the following line to register a parallel backend for simulation # (needs package doparallel installed) # doParallel::registerDoParallel() # foreach::getDoParWorkers() ## Example 1: simulate 100 trajectories sim <- simulate_POMDP(sol, n = 100, verbose = TRUE) sim # calculate the percentage that each action is used in the simulation round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2) # reward distribution hist(sim$reward) ## Example 2: look at the belief states and the trajectories starting with # an initial start belief. sim <- simulate_POMDP(sol, n = 100, belief = c(.5, .5), return_beliefs = TRUE, return_trajectories = TRUE) head(sim$belief_states) head(sim$trajectories) # plot with added density (the x-axis is the probability of the second belief state) plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6)) lines(density(sim$belief_states[, 2], bw = .02)); axis(2); title(ylab = "Density") ## Example 3: simulate trajectories for an unsolved POMDP which uses an epsilon of 1 # (i.e., all actions are randomized). The simulation horizon for the # infinite-horizon Tiger problem is calculated using delta_horizon. sim <- simulate_POMDP(Tiger, return_beliefs = TRUE, verbose = TRUE) sim$avg_reward hist(sim$reward, breaks = 20) plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6)) lines(density(sim$belief_states[, 1], bw = .05)); axis(2); title(ylab = "Density")
data(Tiger) # solve the POMDP for 5 epochs and no discounting sol <- solve_POMDP(Tiger, horizon = 5, discount = 1, method = "enum") sol policy(sol) # uncomment the following line to register a parallel backend for simulation # (needs package doparallel installed) # doParallel::registerDoParallel() # foreach::getDoParWorkers() ## Example 1: simulate 100 trajectories sim <- simulate_POMDP(sol, n = 100, verbose = TRUE) sim # calculate the percentage that each action is used in the simulation round_stochastic(sim$action_cnt / sum(sim$action_cnt), 2) # reward distribution hist(sim$reward) ## Example 2: look at the belief states and the trajectories starting with # an initial start belief. sim <- simulate_POMDP(sol, n = 100, belief = c(.5, .5), return_beliefs = TRUE, return_trajectories = TRUE) head(sim$belief_states) head(sim$trajectories) # plot with added density (the x-axis is the probability of the second belief state) plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6)) lines(density(sim$belief_states[, 2], bw = .02)); axis(2); title(ylab = "Density") ## Example 3: simulate trajectories for an unsolved POMDP which uses an epsilon of 1 # (i.e., all actions are randomized). The simulation horizon for the # infinite-horizon Tiger problem is calculated using delta_horizon. sim <- simulate_POMDP(Tiger, return_beliefs = TRUE, verbose = TRUE) sim$avg_reward hist(sim$reward, breaks = 20) plot_belief_space(sol, sample = sim$belief_states, jitter = 2, ylim = c(0, 6)) lines(density(sim$belief_states[, 1], bw = .05)); axis(2); title(ylab = "Density")
Implementation of value iteration, modified policy iteration and other methods based on reinforcement learning techniques to solve finite state space MDPs.
solve_MDP(model, method = "value", ...) solve_MDP_DP( model, method = "value_iteration", horizon = NULL, discount = NULL, N_max = 1000, error = 0.01, k_backups = 10, U = NULL, verbose = FALSE ) solve_MDP_TD( model, method = "q_learning", horizon = NULL, discount = NULL, alpha = 0.5, epsilon = 0.1, N = 100, U = NULL, verbose = FALSE )
solve_MDP(model, method = "value", ...) solve_MDP_DP( model, method = "value_iteration", horizon = NULL, discount = NULL, N_max = 1000, error = 0.01, k_backups = 10, U = NULL, verbose = FALSE ) solve_MDP_TD( model, method = "q_learning", horizon = NULL, discount = NULL, alpha = 0.5, epsilon = 0.1, N = 100, U = NULL, verbose = FALSE )
model |
an MDP problem specification. |
method |
string; one of the following solution methods: |
... |
further parameters are passed on to the solver function. |
horizon |
an integer with the number of epochs for problems with a
finite planning horizon. If set to |
discount |
discount factor in range |
N_max |
maximum number of iterations allowed to converge. If the maximum is reached then the non-converged solution is returned with a warning. |
error |
value iteration: maximum error allowed in the utility of any state (i.e., the maximum policy loss) used as the termination criterion. |
k_backups |
policy iteration: number of look ahead steps used for approximate policy evaluation used by the policy iteration method. |
U |
a vector with initial utilities used for each state. If
|
verbose |
logical, if set to |
alpha |
step size in |
epsilon |
used for |
N |
number of episodes used for learning. |
Implemented are the following dynamic programming methods (following Russell and Norvig, 2010):
Modified Policy Iteration starts with a random policy and iteratively performs a sequence of
approximate policy evaluation (estimate the value function for the
current policy using k_backups
and function MDP_policy_evaluation()
), and
policy improvement (calculate a greedy policy given the value function). The algorithm stops when it converges to a stable policy (i.e., no changes between two iterations).
Value Iteration starts with
an arbitrary value function (by default all 0s) and iteratively
updates the value function for each state using the Bellman equation.
The iterations
are terminated either after N_max
iterations or when the solution converges.
Approximate convergence is achieved
for discounted problems (with )
when the maximal value function change for any state
is
. It can be shown that this means
that no state value is more than
from the value in the optimal value function. For undiscounted
problems, we use
.
The greedy policy is calculated from the final value function. Value iteration can be seen as policy iteration with truncated policy evaluation.
Note that the policy converges earlier than the value function.
Implemented are the following temporal difference control methods
described in Sutton and Barto (2020).
Note that the MDP transition and reward models are only used to simulate
the environment for these reinforcement learning methods.
The algorithms use a step size parameter (learning rate) for the
updates and the exploration parameter
for
the
-greedy policy.
If the model has absorbing states to terminate episodes, then no maximal episode length
(horizon
) needs to
be specified. To make sure that the algorithm does finish in a reasonable amount of time,
episodes are stopped after 10,000 actions with a warning. For models without absorbing states,
a episode length has to be specified via horizon
.
Q-Learning is an off-policy temporal difference method that uses
an -greedy behavior policy and learns a greedy target
policy.
Sarsa is an on-policy method that follows and learns
an -greedy policy. The final
-greedy policy
is converted into a greedy policy.
Expected Sarsa: We implement an on-policy version that uses
the expected value under the current policy for the update.
It moves deterministically in the same direction as Sarsa
moves in expectation. Because it uses the expectation, we can
set the step size to large values and even 1.
solve_MDP()
returns an object of class POMDP which is a list with the
model specifications (model
), the solution (solution
).
The solution is a list with the elements:
policy
a list representing the policy graph. The list only has one element for converged solutions.
converged
did the algorithm converge (NA
) for finite-horizon problems.
delta
final (value iteration and infinite-horizon only)
iterations
number of iterations to convergence (infinite-horizon only)
Michael Hahsler
Russell, S., Norvig, P. (2021). Artificial Intelligence: A Modern Approach. Fourth edition. Prentice Hall.
Sutton, R. S., Barto, A. G. (2020). Reinforcement Learning: An Introduction. Second edition. The MIT Press.
Other solver:
solve_POMDP()
,
solve_SARSOP()
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
actions()
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
transition_graph()
,
value_function()
data(Maze) Maze # use value iteration maze_solved <- solve_MDP(Maze, method = "value_iteration") maze_solved policy(maze_solved) # plot the value function U plot_value_function(maze_solved) # Maze solutions can be visualized gridworld_plot_policy(maze_solved) # use modified policy iteration maze_solved <- solve_MDP(Maze, method = "policy_iteration") policy(maze_solved) # finite horizon maze_solved <- solve_MDP(Maze, method = "value_iteration", horizon = 3) policy(maze_solved) gridworld_plot_policy(maze_solved, epoch = 1) gridworld_plot_policy(maze_solved, epoch = 2) gridworld_plot_policy(maze_solved, epoch = 3) # create a random policy where action n is very likely and approximate # the value function. We change the discount factor to .9 for this. Maze_discounted <- Maze Maze_discounted$discount <- .9 pi <- random_MDP_policy(Maze_discounted, prob = c(n = .7, e = .1, s = .1, w = 0.1)) pi # compare the utility function for the random policy with the function for the optimal # policy found by the solver. maze_solved <- solve_MDP(Maze) MDP_policy_evaluation(pi, Maze, k_backup = 100) MDP_policy_evaluation(policy(maze_solved), Maze, k_backup = 100) # Note that the solver already calculates the utility function and returns it with the policy policy(maze_solved) # Learn a Policy using Q-Learning maze_learned <- solve_MDP(Maze, method = "q_learning", N = 100) maze_learned maze_learned$solution policy(maze_learned) plot_value_function(maze_learned) gridworld_plot_policy(maze_learned)
data(Maze) Maze # use value iteration maze_solved <- solve_MDP(Maze, method = "value_iteration") maze_solved policy(maze_solved) # plot the value function U plot_value_function(maze_solved) # Maze solutions can be visualized gridworld_plot_policy(maze_solved) # use modified policy iteration maze_solved <- solve_MDP(Maze, method = "policy_iteration") policy(maze_solved) # finite horizon maze_solved <- solve_MDP(Maze, method = "value_iteration", horizon = 3) policy(maze_solved) gridworld_plot_policy(maze_solved, epoch = 1) gridworld_plot_policy(maze_solved, epoch = 2) gridworld_plot_policy(maze_solved, epoch = 3) # create a random policy where action n is very likely and approximate # the value function. We change the discount factor to .9 for this. Maze_discounted <- Maze Maze_discounted$discount <- .9 pi <- random_MDP_policy(Maze_discounted, prob = c(n = .7, e = .1, s = .1, w = 0.1)) pi # compare the utility function for the random policy with the function for the optimal # policy found by the solver. maze_solved <- solve_MDP(Maze) MDP_policy_evaluation(pi, Maze, k_backup = 100) MDP_policy_evaluation(policy(maze_solved), Maze, k_backup = 100) # Note that the solver already calculates the utility function and returns it with the policy policy(maze_solved) # Learn a Policy using Q-Learning maze_learned <- solve_MDP(Maze, method = "q_learning", N = 100) maze_learned maze_learned$solution policy(maze_learned) plot_value_function(maze_learned) gridworld_plot_policy(maze_learned)
This function utilizes the C implementation of 'pomdp-solve' by Cassandra (2015) to solve problems that are formulated as partially observable Markov decision processes (POMDPs). The result is an optimal or approximately optimal policy.
solve_POMDP( model, horizon = NULL, discount = NULL, initial_belief = NULL, terminal_values = NULL, method = "grid", digits = 7, parameter = NULL, timeout = Inf, verbose = FALSE ) solve_POMDP_parameter()
solve_POMDP( model, horizon = NULL, discount = NULL, initial_belief = NULL, terminal_values = NULL, method = "grid", digits = 7, parameter = NULL, timeout = Inf, verbose = FALSE ) solve_POMDP_parameter()
model |
a POMDP problem specification created with |
horizon |
an integer with the number of epochs for problems with a
finite planning horizon. If set to |
discount |
discount factor in range |
initial_belief |
An initial belief vector. If |
terminal_values |
a vector with the terminal utility values for each state or a
matrix specifying the terminal rewards via a terminal value function (e.g.,
the alpha components produced by |
method |
string; one of the following solution methods: |
digits |
precision used when writing POMDP files (see
|
parameter |
a list with parameters passed on to the pomdp-solve program. |
timeout |
number of seconds for the solver to run. |
verbose |
logical, if set to |
solve_POMDP_parameter()
displays available solver parameter options.
Horizon: Infinite-horizon POMDPs (horizon = Inf
) converge to a
single policy graph. Finite-horizon POMDPs result in a policy tree of a
depth equal to the smaller of the horizon or the number of epochs to
convergence. The policy (and the associated value function) are stored in a
list by epoch. The policy for the first epoch is stored as the first
element. Horizon can also be used to limit the number of epochs used
for value iteration.
Precision: The POMDP solver uses various epsilon values to control
precision for comparing alpha vectors to check for convergence, and solving
LPs. Overall precision can be changed using
parameter = list(epsilon = 1e-3)
.
Methods: Several algorithms using exact value iteration are available:
Enumeration (Sondik 1971).
Two pass (Sondik 1971).
Witness (Littman, Cassandra, Kaelbling, 1996).
Incremental pruning (Zhang and Liu, 1996, Cassandra et al 1997).
In addition, the following approximate value iteration method is available:
Grid implements a variation of point-based value iteration to solve larger POMDPs (PBVI; see Pineau 2003) without dynamic belief set expansion.
Details can be found in (Cassandra, 2015).
Note on POMDP problem size: Finding optimal policies for POMDPs is known to be a prohibitively difficult problem because the belief space grows exponentially with the number of states. Therefore, exact algorithms can be only used for extremely small problems with only a few states. Typically, the researcher needs to simplify the problem description (fewer states, actions and observations) and choose an approximate algorithm with an acceptable level of approximation to make the problem tractable.
Note on method grid: The finite grid method implements a version of Point
Based Value Iteration (PBVI). The used belief points are created
using points that are reachable from the initial belief (start
) by
following all combinations of actions and observations. The default size of the grid is
by 10,000 and
can be set via parameter = list(fg_points = 100)
. Alternatively,
different strategies can be chosen to generate the belief points.
using the parameter fg_type
. In
this implementation, the user can also manually specify a grid of belief
points by providing a matrix with belief points as produced by
sample_belief_space()
as the parameter grid
.
To guarantee convergence in point-based (finite grid) value iteration, the
initial value function must be a lower bound on the optimal value function.
If all rewards are strictly non-negative, an initial value function with an
all-zero vector can be used, and results will be similar to other methods.
However, if the model contains negative rewards, lower bounds can be only
guaranteed by
using an initial value function vector with the values
.
In this case, the value function is guaranteed to converge to the true value
function in the infinite-horizon case, but
finite-horizon value functions may not converge.
solve_POMDP()
produces a warning in this case. The correct value function can be obtained
by using simulate_POMDP()
or switching to a different method.
Time-dependent POMDPs: Time dependence of transition probabilities, observation probabilities and reward structure can be modeled by considering a set of episodes representing epochs with the same settings. In the scared tiger example (see Examples section), the tiger has the normal behavior for the first three epochs (episode 1) and then becomes scared with different transition probabilities for the next three epochs (episode 2). The episodes can be solved in reverse order where the value function is used as the terminal values of the preceding episode. This can be done by specifying a vector of horizons (one horizon for each episode) and then lists with transition matrices, observation matrices, and rewards. If the horizon vector has names, then the lists also need to be named, otherwise they have to be in the same order (the numeric index is used). Only the time-varying matrices need to be specified. An example can be found in Example 4 in the Examples section. The procedure can also be done by calling the solver multiple times (see Example 5).
Policy:
Each policy is a data frame where each row representing a
policy graph node with an associated optimal action and a list of node IDs
to go to depending on the observation (specified as the column names). For
the finite-horizon case, the observation specific node IDs refer to nodes in
the next epoch creating a policy tree. Impossible observations have a
NA
as the next state.
Value function: The value function specifies the value of the value function (the expected reward) over the belief space. The dimensionality of the belief space is $n-1$ where $n$ is the number of states. The value function is stored as a matrix. Each row is associated with a node (row) in the policy graph and represents the coefficients (alpha or V vector) of a hyperplane. It contains one value per state which is the value for the belief state that has a probability of 1 for that state and 0s for all others.
All temporary solver files are stored in the directory returned by tempdir()
.
The solver returns an object of class POMDP which is a list with the
model specifications. Solved POMDPs also have an element called solution
which is a list, and the
solver output (solver_output
). The solution is a list that contains elements like:
method
used solver method.
solver_output
output of the solver program.
converged
did the solution converge?
initial_belief
used initial belief used.
total_expected_reward
total expected reward starting from the the initial belief.
pg
, initial_pg_node
the policy graph (see Details section).
alpha
value function as hyperplanes representing the nodes in the policy graph (see Details section).
belief_points_solver
optional; belief points used by the solver.
Hossein Kamalzadeh, Michael Hahsler
Cassandra, A. (2015). pomdp-solve: POMDP Solver Software, http://www.pomdp.org.
Sondik, E. (1971). The Optimal Control of Partially Observable Markov Processes. Ph.D. Dissertation, Stanford University.
Cassandra, A., Littman M.L., Zhang L. (1997). Incremental Pruning: A Simple, Fast, Exact Algorithm for Partially Observable Markov Decision Processes. UAI'97: Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence, August 1997, pp. 54-61.
Monahan, G. E. (1982). A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science 28(1):1-16.
Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. (1996). Efficient dynamic-programming updates in partially observable Markov decision processes. Technical Report CS-95-19, Brown University, Providence, RI.
Zhang, N. L., and Liu, W. (1996). Planning in stochastic domains: Problem characteristics and approximation. Technical Report HKUST-CS96-31, Department of Computer Science, Hong Kong University of Science and Technology.
Pineau J., Geoffrey J Gordon G.J., Thrun S.B. (2003). Point-based value iteration: an anytime algorithm for POMDPs. IJCAI'03: Proceedings of the 18th international joint conference on Artificial Intelligence. Pages 1025-1030.
Other policy:
estimate_belief_for_nodes()
,
optimal_action()
,
plot_belief_space()
,
plot_policy_graph()
,
policy()
,
policy_graph()
,
projection()
,
reward()
,
solve_SARSOP()
,
value_function()
Other solver:
solve_MDP()
,
solve_SARSOP()
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
# display available solver options which can be passed on to pomdp-solve as parameters. solve_POMDP_parameter() ################################################################ # Example 1: Solving the simple infinite-horizon Tiger problem data("Tiger") Tiger # look at the model as a list unclass(Tiger) # inspect an individual field of the model (e.g., the transition probabilities and the reward) Tiger$transition_prob Tiger$reward sol <- solve_POMDP(model = Tiger) sol # look at the solution sol$solution # policy (value function (alpha vectors), optimal action and observation dependent transitions) policy(sol) # plot the policy graph of the infinite-horizon POMDP plot_policy_graph(sol) # value function plot_value_function(sol, ylim = c(0,20)) ################################################################ # Example 2: Solve a problem specified as a POMDP file # using a grid of size 20 sol <- solve_POMDP("http://www.pomdp.org/examples/cheese.95.POMDP", method = "grid", parameter = list(fg_points = 20)) sol policy(sol) plot_policy_graph(sol) # Example 3: Solving a finite-horizon POMDP using the incremental # pruning method (without discounting) sol <- solve_POMDP(model = Tiger, horizon = 3, discount = 1, method = "incprune") sol # look at the policy tree policy(sol) plot_policy_graph(sol) # note: only open the door in epoch 3 if you get twice the same observation. # Expected reward starting for the models initial belief (uniform): # listen twice and then open the door or listen 3 times reward(sol) # Expected reward for listen twice (-2) and then open-left (-1 + (-1) + 10 = 8) reward(sol, belief = c(1,0)) # Expected reward for just opening the right door (10) reward(sol, belief = c(1,0), epoch = 3) # Expected reward for just opening the right door (0.5 * -100 + 0.95 * 10 = 4.5) reward(sol, belief = c(.95,.05), epoch = 3) ################################################################ # Example 3: Using terminal values (state-dependent utilities after the final epoch) # # Specify 1000 if the tiger is right after 3 (horizon) epochs sol <- solve_POMDP(model = Tiger, horizon = 3, discount = 1, method = "incprune", terminal_values = c(0, 1000)) sol policy(sol) # Note: The optimal strategy is to never open the left door. If we think the # Tiger is behind the right door, then we just wait for the final payout. If # we think the tiger might be behind the left door, then we open the right # door, are likely to get a small reward and the tiger has a chance of 50\% to # move behind the right door. The second episode is used to gather more # information for the more important # final action. ################################################################ # Example 4: Model time-dependent transition probabilities # The tiger reacts normally for 3 epochs (goes randomly two one # of the two doors when a door was opened). After 3 epochs he gets # scared and when a door is opened then he always goes to the other door. # specify the horizon for each of the two different episodes Tiger_time_dependent <- Tiger Tiger_time_dependent$name <- "Scared Tiger Problem" Tiger_time_dependent$horizon <- c(normal_tiger = 3, scared_tiger = 3) Tiger_time_dependent$transition_prob <- list( normal_tiger = list( "listen" = "identity", "open-left" = "uniform", "open-right" = "uniform"), scared_tiger = list( "listen" = "identity", "open-left" = rbind(c(0, 1), c(0, 1)), "open-right" = rbind(c(1, 0), c(1, 0)) ) ) # Tiger_time_dependent (a higher value for verbose will show more messages) sol <- solve_POMDP(model = Tiger_time_dependent, discount = 1, method = "incprune", verbose = 1) sol policy(sol) # note that the default method to estimate the belief for nodes is following a # trajectory which uses only the first belief reached for each node. Random sampling # can find a better estimate of the central belief of the segment (see nodes 4-1 to 6-3 # in the plots below). plot_policy_graph(sol) plot_policy_graph(sol, method = "random_sample") ################################################################ # Example 5: Alternative method to solve time-dependent POMDPs # 1) create the scared tiger model Tiger_scared <- Tiger Tiger_scared$transition_prob <- list( "listen" = "identity", "open-left" = rbind(c(0, 1), c(0, 1)), "open-right" = rbind(c(1, 0), c(1, 0)) ) # 2) Solve in reverse order. Scared tiger without terminal values first. sol_scared <- solve_POMDP(model = Tiger_scared, horizon = 3, discount = 1, method = "incprune") sol_scared policy(sol_scared) # 3) Solve the regular tiger with the value function of the scared tiger as terminal values sol <- solve_POMDP(model = Tiger, horizon = 3, discount = 1, method = "incprune", terminal_values = sol_scared$solution$alpha[[1]]) sol policy(sol) # Note: it is optimal to mostly listen till the Tiger gets in the scared mood. Only if # we are extremely sure in the first epoch, then opening a door is optimal. ################################################################ # Example 6: PBVI with a custom grid # Create a search grid by sampling from the belief space in # 10 regular intervals custom_grid <- sample_belief_space(Tiger, n = 10, method = "regular") head(custom_grid) # Visualize the search grid plot_belief_space(sol, sample = custom_grid) # Solve the POMDP using the grid for approximation sol <- solve_POMDP(Tiger, method = "grid", parameter = list(grid = custom_grid)) policy(sol) plot_policy_graph(sol) # note that plot_policy_graph() automatically remove nodes that are unreachable from the # initial node. This behavior can be switched off. plot_policy_graph(sol, remove_unreachable_nodes = FALSE)
# display available solver options which can be passed on to pomdp-solve as parameters. solve_POMDP_parameter() ################################################################ # Example 1: Solving the simple infinite-horizon Tiger problem data("Tiger") Tiger # look at the model as a list unclass(Tiger) # inspect an individual field of the model (e.g., the transition probabilities and the reward) Tiger$transition_prob Tiger$reward sol <- solve_POMDP(model = Tiger) sol # look at the solution sol$solution # policy (value function (alpha vectors), optimal action and observation dependent transitions) policy(sol) # plot the policy graph of the infinite-horizon POMDP plot_policy_graph(sol) # value function plot_value_function(sol, ylim = c(0,20)) ################################################################ # Example 2: Solve a problem specified as a POMDP file # using a grid of size 20 sol <- solve_POMDP("http://www.pomdp.org/examples/cheese.95.POMDP", method = "grid", parameter = list(fg_points = 20)) sol policy(sol) plot_policy_graph(sol) # Example 3: Solving a finite-horizon POMDP using the incremental # pruning method (without discounting) sol <- solve_POMDP(model = Tiger, horizon = 3, discount = 1, method = "incprune") sol # look at the policy tree policy(sol) plot_policy_graph(sol) # note: only open the door in epoch 3 if you get twice the same observation. # Expected reward starting for the models initial belief (uniform): # listen twice and then open the door or listen 3 times reward(sol) # Expected reward for listen twice (-2) and then open-left (-1 + (-1) + 10 = 8) reward(sol, belief = c(1,0)) # Expected reward for just opening the right door (10) reward(sol, belief = c(1,0), epoch = 3) # Expected reward for just opening the right door (0.5 * -100 + 0.95 * 10 = 4.5) reward(sol, belief = c(.95,.05), epoch = 3) ################################################################ # Example 3: Using terminal values (state-dependent utilities after the final epoch) # # Specify 1000 if the tiger is right after 3 (horizon) epochs sol <- solve_POMDP(model = Tiger, horizon = 3, discount = 1, method = "incprune", terminal_values = c(0, 1000)) sol policy(sol) # Note: The optimal strategy is to never open the left door. If we think the # Tiger is behind the right door, then we just wait for the final payout. If # we think the tiger might be behind the left door, then we open the right # door, are likely to get a small reward and the tiger has a chance of 50\% to # move behind the right door. The second episode is used to gather more # information for the more important # final action. ################################################################ # Example 4: Model time-dependent transition probabilities # The tiger reacts normally for 3 epochs (goes randomly two one # of the two doors when a door was opened). After 3 epochs he gets # scared and when a door is opened then he always goes to the other door. # specify the horizon for each of the two different episodes Tiger_time_dependent <- Tiger Tiger_time_dependent$name <- "Scared Tiger Problem" Tiger_time_dependent$horizon <- c(normal_tiger = 3, scared_tiger = 3) Tiger_time_dependent$transition_prob <- list( normal_tiger = list( "listen" = "identity", "open-left" = "uniform", "open-right" = "uniform"), scared_tiger = list( "listen" = "identity", "open-left" = rbind(c(0, 1), c(0, 1)), "open-right" = rbind(c(1, 0), c(1, 0)) ) ) # Tiger_time_dependent (a higher value for verbose will show more messages) sol <- solve_POMDP(model = Tiger_time_dependent, discount = 1, method = "incprune", verbose = 1) sol policy(sol) # note that the default method to estimate the belief for nodes is following a # trajectory which uses only the first belief reached for each node. Random sampling # can find a better estimate of the central belief of the segment (see nodes 4-1 to 6-3 # in the plots below). plot_policy_graph(sol) plot_policy_graph(sol, method = "random_sample") ################################################################ # Example 5: Alternative method to solve time-dependent POMDPs # 1) create the scared tiger model Tiger_scared <- Tiger Tiger_scared$transition_prob <- list( "listen" = "identity", "open-left" = rbind(c(0, 1), c(0, 1)), "open-right" = rbind(c(1, 0), c(1, 0)) ) # 2) Solve in reverse order. Scared tiger without terminal values first. sol_scared <- solve_POMDP(model = Tiger_scared, horizon = 3, discount = 1, method = "incprune") sol_scared policy(sol_scared) # 3) Solve the regular tiger with the value function of the scared tiger as terminal values sol <- solve_POMDP(model = Tiger, horizon = 3, discount = 1, method = "incprune", terminal_values = sol_scared$solution$alpha[[1]]) sol policy(sol) # Note: it is optimal to mostly listen till the Tiger gets in the scared mood. Only if # we are extremely sure in the first epoch, then opening a door is optimal. ################################################################ # Example 6: PBVI with a custom grid # Create a search grid by sampling from the belief space in # 10 regular intervals custom_grid <- sample_belief_space(Tiger, n = 10, method = "regular") head(custom_grid) # Visualize the search grid plot_belief_space(sol, sample = custom_grid) # Solve the POMDP using the grid for approximation sol <- solve_POMDP(Tiger, method = "grid", parameter = list(grid = custom_grid)) policy(sol) plot_policy_graph(sol) # note that plot_policy_graph() automatically remove nodes that are unreachable from the # initial node. This behavior can be switched off. plot_policy_graph(sol, remove_unreachable_nodes = FALSE)
This function uses the C++ implementation of the SARSOP algorithm by Kurniawati, Hsu and Lee (2008) interfaced in package sarsop to solve infinite horizon problems that are formulated as partially observable Markov decision processes (POMDPs). The result is an optimal or approximately optimal policy.
solve_SARSOP( model, horizon = Inf, discount = NULL, terminal_values = NULL, method = "sarsop", digits = 7, parameter = NULL, verbose = FALSE )
solve_SARSOP( model, horizon = Inf, discount = NULL, terminal_values = NULL, method = "sarsop", digits = 7, parameter = NULL, verbose = FALSE )
model |
a POMDP problem specification created with |
horizon |
SARSOP only supports |
discount |
discount factor in range |
terminal_values |
|
method |
string; there is only one method available called |
digits |
precision used when writing POMDP files (see
|
parameter |
a list with parameters passed on to
the function |
verbose |
logical, if set to |
The solver returns an object of class POMDP which is a list with the
model specifications ('model'
), the solution ('solution'
), and the
solver output ('solver_output'
).
Michael Hahsler
Carl Boettiger, Jeroen Ooms and Milad Memarzadeh (2020). sarsop: Approximate POMDP Planning Software. R package version 0.6.6. https://CRAN.R-project.org/package=sarsop
H. Kurniawati, D. Hsu, and W.S. Lee (2008). SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proc. Robotics: Science and Systems.
Other policy:
estimate_belief_for_nodes()
,
optimal_action()
,
plot_belief_space()
,
plot_policy_graph()
,
policy()
,
policy_graph()
,
projection()
,
reward()
,
solve_POMDP()
,
value_function()
Other solver:
solve_MDP()
,
solve_POMDP()
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
transition_graph()
,
update_belief()
,
value_function()
,
write_POMDP()
## Not run: # Solving the simple infinite-horizon Tiger problem with SARSOP # You need to install package "sarsop" data("Tiger") Tiger sol <- solve_SARSOP(model = Tiger) sol # look at solver output sol$solver_output # policy (value function (alpha vectors), optimal action and observation dependent transitions) policy(sol) # value function plot_value_function(sol, ylim = c(0,20)) # plot the policy graph plot_policy_graph(sol) # reward of the optimal policy reward(sol) # Solve a problem specified as a POMDP file. The timeout is set to 10 seconds. sol <- solve_SARSOP("http://www.pomdp.org/examples/cheese.95.POMDP", parameter = list(timeout = 10)) sol ## End(Not run)
## Not run: # Solving the simple infinite-horizon Tiger problem with SARSOP # You need to install package "sarsop" data("Tiger") Tiger sol <- solve_SARSOP(model = Tiger) sol # look at solver output sol$solver_output # policy (value function (alpha vectors), optimal action and observation dependent transitions) policy(sol) # value function plot_value_function(sol, ylim = c(0,20)) # plot the policy graph plot_policy_graph(sol) # reward of the optimal policy reward(sol) # Solve a problem specified as a POMDP file. The timeout is set to 10 seconds. sol <- solve_SARSOP("http://www.pomdp.org/examples/cheese.95.POMDP", parameter = list(timeout = 10)) sol ## End(Not run)
The model for the Tiger Problem introduces in Cassandra et al (1994).
An object of class POMDP.
The original Tiger problem was published in Cassandra et al (1994) as follows:
An agent is facing two closed doors and a tiger is put with equal
probability behind one of the two doors represented by the states
tiger-left
and tiger-right
, while treasure is put behind the other door.
The possible actions are listen
for tiger noises or opening a door (actions
open-left
and open-right
). Listening is neither free (the action has a
reward of -1) nor is it entirely accurate. There is a 15\
probability that the agent hears the tiger behind the left door while it is
actually behind the right door and vice versa. If the agent opens door with
the tiger, it will get hurt (a negative reward of -100), but if it opens the
door with the treasure, it will receive a positive reward of 10. After a door
is opened, the problem is reset(i.e., the tiger is randomly assigned to a
door with chance 50/50) and the the agent gets another try.
The three doors problem is an extension of the Tiger problem where the tiger
is behind one of three doors represented by three states (tiger-left
,
tiger-center
, and tiger-right
) and treasure is behind the other two
doors. There are also three open actions and three different observations for
listening.
Anthony R. Cassandra, Leslie P Kaelbling, and Michael L. Littman (1994). Acting Optimally in Partially Observable Stochastic Domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 1023-1028.
Other POMDP_examples:
POMDP()
,
POMDP_example_files
,
RussianTiger
data("Tiger") Tiger data("Three_doors") Three_doors
data("Tiger") Tiger data("Three_doors") Three_doors
Returns the transition model as an igraph object.
transition_graph( x, action = NULL, episode = NULL, epoch = NULL, state_col = NULL, simplify_transitions = TRUE, remove_unavailable_actions = TRUE ) plot_transition_graph( x, action = NULL, episode = NULL, epoch = NULL, state_col = NULL, simplify_transitions = TRUE, main = NULL, ... )
transition_graph( x, action = NULL, episode = NULL, epoch = NULL, state_col = NULL, simplify_transitions = TRUE, remove_unavailable_actions = TRUE ) plot_transition_graph( x, action = NULL, episode = NULL, epoch = NULL, state_col = NULL, simplify_transitions = TRUE, main = NULL, ... )
x |
|
action |
the name or id of an action or a set of actions. Bey default the transition model for all actions is returned. |
episode , epoch
|
Episode or epoch used for time-dependent POMDPs. Epochs are internally converted to the episode using the model horizon. |
state_col |
colors used to represent the states. |
simplify_transitions |
logical; combine parallel transition arcs into a single arc. |
remove_unavailable_actions |
logical; don't show arrows for unavailable actions. |
main |
a main title for the plot. |
... |
further arguments are passed on to |
The transition model of a POMDP/MDP is a Markov Chain. This function extracts the transition model as an igraph object.
returns the transition model as an igraph object.
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
update_belief()
,
value_function()
,
write_POMDP()
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
actions()
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
value_function()
data("Tiger") g <- transition_graph(Tiger) g plot_transition_graph(Tiger) plot_transition_graph(Tiger, vertex.size = 20, edge.label.cex = .5, edge.arrow.size = .5, margin = .5) plot_transition_graph(Tiger, vertex.size = 60, edge.label = NA, edge.arrow.size = .5, layout = rbind(c(-1,0), c(+1,0)), rescale = FALSE) ## Plot an individual graph for each actions and use a manual layout. for (a in Tiger$actions) { plot_transition_graph(Tiger, action = a, layout = rbind(c(-1,0), c(+1,0)), rescale = FALSE, main = paste("action:", a)) } ## Plot using the igraph library library(igraph) plot(g) # plot with a fixed layout and curved edges plot(g, layout = rbind(c(-1, 0), c(1, 0)), rescale = FALSE, edge.curved = curve_multiple_directed(g, .8), edge.loop.angle = -pi / 4, vertex.size = 60 ) ## Use visNetwork (if installed) if(require(visNetwork)) { g_vn <- toVisNetworkData(g) nodes <- g_vn$nodes edges <- g_vn$edges # add manual layout nodes$x <- c(-1, 1) * 200 nodes$y <- 0 visNetwork(nodes, edges) %>% visNodes(physics = FALSE) %>% visEdges(smooth = list(type = "curvedCW", roundness = .6), arrows = "to") }
data("Tiger") g <- transition_graph(Tiger) g plot_transition_graph(Tiger) plot_transition_graph(Tiger, vertex.size = 20, edge.label.cex = .5, edge.arrow.size = .5, margin = .5) plot_transition_graph(Tiger, vertex.size = 60, edge.label = NA, edge.arrow.size = .5, layout = rbind(c(-1,0), c(+1,0)), rescale = FALSE) ## Plot an individual graph for each actions and use a manual layout. for (a in Tiger$actions) { plot_transition_graph(Tiger, action = a, layout = rbind(c(-1,0), c(+1,0)), rescale = FALSE, main = paste("action:", a)) } ## Plot using the igraph library library(igraph) plot(g) # plot with a fixed layout and curved edges plot(g, layout = rbind(c(-1, 0), c(1, 0)), rescale = FALSE, edge.curved = curve_multiple_directed(g, .8), edge.loop.angle = -pi / 4, vertex.size = 60 ) ## Use visNetwork (if installed) if(require(visNetwork)) { g_vn <- toVisNetworkData(g) nodes <- g_vn$nodes edges <- g_vn$edges # add manual layout nodes$x <- c(-1, 1) * 200 nodes$y <- 0 visNetwork(nodes, edges) %>% visNodes(physics = FALSE) %>% visEdges(smooth = list(type = "curvedCW", roundness = .6), arrows = "to") }
Update the belief given a taken action and observation.
update_belief( model, belief = NULL, action = NULL, observation = NULL, episode = 1, digits = 7, drop = TRUE )
update_belief( model, belief = NULL, action = NULL, observation = NULL, episode = 1, digits = 7, drop = TRUE )
model |
a POMDP object. |
belief |
the current belief state. Defaults to the start belief state specified in the model or "uniform". |
action |
the taken action. Can also be a vector of multiple actions or, if missing, then all actions are evaluated. |
observation |
the received observation. Can also be a vector of multiple observations or, if missing, then all observations are evaluated. |
episode |
Use transition and observation matrices for the given episode for time-dependent POMDPs (see POMDP). |
digits |
round decimals. |
drop |
logical; drop the result to a vector if only a single belief state is returned. |
Update the belief state (
belief
) with an action and observation
using the update
defined so that
where normalizes the new belief state so the probabilities add up to one.
returns the updated belief state as a named vector.
If action
or observations
is a vector with multiple elements ot missing, then a matrix with all
resulting belief states is returned.
Michael Hahsler
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
value_function()
,
write_POMDP()
data(Tiger) update_belief(c(.5,.5), model = Tiger) update_belief(c(.5,.5), action = "listen", observation = "tiger-left", model = Tiger) update_belief(c(.15,.85), action = "listen", observation = "tiger-right", model = Tiger)
data(Tiger) update_belief(c(.5,.5), model = Tiger) update_belief(c(.5,.5), action = "listen", observation = "tiger-left", model = Tiger) update_belief(c(.15,.85), action = "listen", observation = "tiger-right", model = Tiger)
Extracts the value function from a solved model.
Extracts the alpha vectors describing the value function. This is similar to policy()
which in addition returns the
action prescribed by the solution.
value_function(model, drop = TRUE) plot_value_function( model, projection = NULL, epoch = 1, ylim = NULL, legend = TRUE, col = NULL, lwd = 1, lty = 1, ylab = "Value", ... )
value_function(model, drop = TRUE) plot_value_function( model, projection = NULL, epoch = 1, ylim = NULL, legend = TRUE, col = NULL, lwd = 1, lty = 1, ylab = "Value", ... )
model |
|
drop |
logical; drop the list for converged converged, epoch-independent value functions. |
projection |
Sample in a projected belief space. See |
epoch |
the value function of what epoch should be plotted? Use 1 for converged policies. |
ylim |
the y limits of the plot. |
legend |
logical; show the actions in the visualization? |
col |
potting colors. |
lwd |
line width. |
lty |
line type. |
ylab |
label for the y-axis. |
... |
additional arguments are passed on to |
Plots the value function of a POMDP solution as a line plot. The solution is
projected on two states (i.e., the belief for the other states is held
constant at zero). The value function can also be visualized using plot_belief_space()
.
the function as a matrix with alpha vectors as rows.
Michael Hahsler
Other policy:
estimate_belief_for_nodes()
,
optimal_action()
,
plot_belief_space()
,
plot_policy_graph()
,
policy()
,
policy_graph()
,
projection()
,
reward()
,
solve_POMDP()
,
solve_SARSOP()
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
write_POMDP()
Other MDP:
MDP()
,
MDP2POMDP
,
MDP_policy_functions
,
accessors
,
actions()
,
add_policy()
,
gridworld
,
reachable_and_absorbing
,
regret()
,
simulate_MDP()
,
solve_MDP()
,
transition_graph()
data("Tiger") sol <- solve_POMDP(Tiger) sol # value function for the converged solution value_function(sol) plot_value_function(sol, ylim = c(0,20)) ## finite-horizon problem sol <- solve_POMDP(model = Tiger, horizon = 3, discount = 1, method = "enum") sol # inspect the value function for all epochs value_function(sol) plot_value_function(sol, epoch = 1, ylim = c(-5, 25)) plot_value_function(sol, epoch = 2, ylim = c(-5, 25)) plot_value_function(sol, epoch = 3, ylim = c(-5, 25)) ## Not run: # using ggplot2 to plot the value function for epoch 3 library(ggplot2) pol <- policy(sol) ggplot(pol[[3]]) + geom_segment(aes(x = 0, y = `tiger-left`, xend = 1, yend = `tiger-right`, color = action)) + coord_cartesian(ylim = c(-5, 15)) + ylab("Value") + xlab("Belief space") ## End(Not run)
data("Tiger") sol <- solve_POMDP(Tiger) sol # value function for the converged solution value_function(sol) plot_value_function(sol, ylim = c(0,20)) ## finite-horizon problem sol <- solve_POMDP(model = Tiger, horizon = 3, discount = 1, method = "enum") sol # inspect the value function for all epochs value_function(sol) plot_value_function(sol, epoch = 1, ylim = c(-5, 25)) plot_value_function(sol, epoch = 2, ylim = c(-5, 25)) plot_value_function(sol, epoch = 3, ylim = c(-5, 25)) ## Not run: # using ggplot2 to plot the value function for epoch 3 library(ggplot2) pol <- policy(sol) ggplot(pol[[3]]) + geom_segment(aes(x = 0, y = `tiger-left`, xend = 1, yend = `tiger-right`, color = action)) + coord_cartesian(ylim = c(-5, 15)) + ylab("Value") + xlab("Belief space") ## End(Not run)
The Windy gridworld MDP example from Chapter 6 of the textbook "Reinforcement Learning: An Introduction."
An object of class MDP.
The gridworld has the following layout:
The grid world is represented as a 7 x 10 matrix of states. In the middle region the next states are shifted upward by wind (the strength in number of squares is given below each column). For example, if the agent is one cell to the right of the goal, then the action left takes the agent to the cell just above the goal.
No discounting is used (i.e., ).
Richard S. Sutton and Andrew G. Barto (2018). Reinforcement Learning: An Introduction Second Edition, MIT Press, Cambridge, MA.
Other MDP_examples:
Cliff_walking
,
DynaMaze
,
MDP()
,
Maze
Other gridworld:
Cliff_walking
,
DynaMaze
,
Maze
,
gridworld
data(Windy_gridworld) Windy_gridworld gridworld_matrix(Windy_gridworld) gridworld_matrix(Windy_gridworld, what = "labels") # The Goal is an absorbing state which(absorbing_states(Windy_gridworld)) # visualize the transition graph gridworld_plot_transition_graph(Windy_gridworld, vertex.size = 10, vertex.label = NA) # solve using value iteration sol <- solve_MDP(Windy_gridworld) sol policy(sol) gridworld_plot_policy(sol)
data(Windy_gridworld) Windy_gridworld gridworld_matrix(Windy_gridworld) gridworld_matrix(Windy_gridworld, what = "labels") # The Goal is an absorbing state which(absorbing_states(Windy_gridworld)) # visualize the transition graph gridworld_plot_transition_graph(Windy_gridworld, vertex.size = 10, vertex.label = NA) # solve using value iteration sol <- solve_MDP(Windy_gridworld) sol policy(sol) gridworld_plot_policy(sol)
Reads and write a POMDP file suitable for the pomdp-solve
program.
write_POMDP(x, file, digits = 7, labels = FALSE) read_POMDP(file, parse = TRUE, normalize = FALSE, verbose = FALSE)
write_POMDP(x, file, digits = 7, labels = FALSE) read_POMDP(file, parse = TRUE, normalize = FALSE, verbose = FALSE)
x |
an object of class POMDP. |
file |
a file name. |
digits |
precision for writing numbers (digits after the decimal point). |
labels |
logical; write original labels or use index numbers? Labels are
restricted to |
parse |
logical; try to parse the model maotrices. Solvers still work with unparsed matrices, but helpers for simulation are not available. |
normalize |
logical; should the description be normalized for faster access (see |
verbose |
logical; report parsed lines. This is useful for debugging a POMDP file. |
POMDP objects read from a POMDP file have an extra element called problem
which contains the original
POMDP specification. The original specification is directly used by external solvers. In addition, the file
is parsed using an experimental POMDP file parser. The parsed information can be used with auxiliary functions
in this package that use fields like the transition matrix, the observation matrix and the reward structure.
The range of useful rewards is restricted by the solver. Here the values are restricted to the range
[-1e10, 1e10]
.
Unavailable actions have a reward of -Inf
which is translated to -2 times the maximum
absolute reward value used in the model.
Notes: The parser for POMDP files is experimental. Please report problems here: https://github.com/mhahsler/pomdp/issues.
read_POMDP()
returns a POMDP object.
Hossein Kamalzadeh, Michael Hahsler
POMDP solver website: https://www.pomdp.org
Other POMDP:
MDP2POMDP
,
POMDP()
,
accessors
,
actions()
,
add_policy()
,
plot_belief_space()
,
projection()
,
reachable_and_absorbing
,
regret()
,
sample_belief_space()
,
simulate_POMDP()
,
solve_POMDP()
,
solve_SARSOP()
,
transition_graph()
,
update_belief()
,
value_function()
data(Tiger) ## show the POMDP file that would be written. write_POMDP(Tiger, file = stdout())
data(Tiger) ## show the POMDP file that would be written. write_POMDP(Tiger, file = stdout())