The package markovDP (Hahsler 2024) implements episodic semi-gradient Sarsa with linear state-action value function approximation following Sutton and Barto (2018). The construction of the state-action value function uses the approach described in Geramifard et al. (2013). First, we will use the state features directly, and then we will use a Fourier basis (see Konidaris, Osentoski, and Thomas (2011), "Value Function Approximation in Reinforcement Learning Using the Fourier Basis").
The state-action value function is approximated by

$$\hat{q}(s, a, w) = w^\top x(s, a),$$

where $w$ is a weight vector and $x(s, a)$ is a feature function that maps each state-action pair to a feature vector. The gradient of the state-action value function is

$$\nabla_w \hat{q}(s, a, w) = x(s, a).$$
For a small number of actions, we can follow the construction described by Geramifard et al. (2013), which uses a state feature function $\phi(s)$ to construct the complete state-action feature vector. Here, we also add an intercept term. The state-action feature vector has length $|A|(m + 1)$, where $m$ is the number of state features: for each action it contains an intercept component followed by one component per state feature. All components are set to zero and only the components for the chosen action are set to $(1, \phi(s))$, where $s$ is the current state. For example, for the state feature vector $\phi(s) = (\phi_1, \phi_2)$ and the first action out of three possible actions, the complete state-action feature vector is $x(s, a_1) = (1, \phi_1, \phi_2, 0, 0, 0, 0, 0, 0)$. The leading 1 is for the intercept and the trailing zeros represent the blocks of the two not chosen actions.
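To make this construction concrete, the following minimal R sketch builds such a feature vector. The helper make_state_action_features() is hypothetical and only illustrates the layout described above; it is not the code used by the package.

# Illustration only: one (intercept, state features) block per action,
# where only the block of the chosen action is non-zero.
make_state_action_features <- function(phi, action, n_actions) {
  block <- c(1, phi)                          # intercept followed by the state features
  x <- numeric(n_actions * length(block))     # all blocks start out as zeros
  idx <- (action - 1) * length(block) + seq_along(block)
  x[idx] <- block                             # activate the block of the chosen action
  x
}

# state features phi(s) = (2, 1) and the first of three actions
make_state_action_features(phi = c(2, 1), action = 1, n_actions = 3)
# returns 1 2 1 0 0 0 0 0 0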
This construction is implemented in add_linear_approx_Q_function().
The following helper functions for using approximation are available:

- approx_Q_value() calculates approximate Q-values given the weights in the model or specified weights.
- approx_greedy_action() uses approximate Q-values given the weights in the model or specified weights to find the greedy action for a state.
- approx_greedy_policy() calculates the greedy policy for the approximate Q-values given the weights in the model or specified weights.

The implementation follows the algorithm given in Sutton and Barto (2018).
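For these helpers, a usage sketch for a solved model (like sol created below) could look as follows; the call pattern and argument names are assumptions and may differ from the actual interface documented in the manual pages.

# greedy action for state s(1,1) under the learned weights (call pattern assumed)
approx_greedy_action(sol, s = "s(1,1)")

# approximate Q-value of taking action "up" in s(1,1) (call pattern assumed)
approx_Q_value(sol, s = "s(1,1)", a = "up")

# greedy policy implied by the learned weights
approx_greedy_policy(sol)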
We start by defining an MDP for a small maze without walls. The step cost is 1, the start is the top-left corner, and the goal (+100 reward) is the bottom-right corner. This is an ideal problem for a linear approximation of the Q-function using the x/y location as state features.
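The code defining the model m is not shown here. A sketch of how such a gridworld MDP could be created with the package's gridworld helpers is given below; the use of gw_maze_MDP() and its argument names are assumptions.

# Sketch only: a 5x5 maze without walls, start in the top-left corner and a
# +100 reward goal in the bottom-right corner (argument names are assumptions;
# the step cost of 1 is assumed to be the default).
m <- gw_maze_MDP(dim = c(5, 5),
                 start = "s(1,1)",
                 goal = "s(5,5)",
                 goal_reward = 100)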
We construct state features as the x/y coordinates in the gridworld.
state_features <- gw_s2rc(S(m))
state_features
#> [,1] [,2]
#> s(1,1) 1 1
#> s(2,1) 2 1
#> s(3,1) 3 1
#> s(4,1) 4 1
#> s(5,1) 5 1
#> s(1,2) 1 2
#> s(2,2) 2 2
#> s(3,2) 3 2
#> s(4,2) 4 2
#> s(5,2) 5 2
#> s(1,3) 1 3
#> s(2,3) 2 3
#> s(3,3) 3 3
#> s(4,3) 4 3
#> s(5,3) 5 3
#> s(1,4) 1 4
#> s(2,4) 2 4
#> s(3,4) 3 4
#> s(4,4) 4 4
#> s(5,4) 5 4
#> s(1,5) 1 5
#> s(2,5) 2 5
#> s(3,5) 3 5
#> s(4,5) 4 5
#> s(5,5) 5 5
We add the state features with the linear approximation function to the model. add_linear_approx_Q_function() constructs state-action features and adds them, together with an approximate Q-function and a gradient function, to the model. The state-action features are constructed as a vector with weights for an intercept and the state features for each action.

Below, we see that the initial weight vector has names organized by action. x0 represents the intercept, and x1 and x2 represent the x and y coordinate in the maze, respectively.
m <- add_linear_approx_Q_function(m, state_features)
m$approx_Q_function
#> $f
#> function (s, a, w)
#> sum(w * x(s, a))
#> <bytecode: 0x561c4068ab10>
#> <environment: 0x561c40689db8>
#>
#> $gradient
#> function (s, a, w)
#> x(s, a)
#> <bytecode: 0x561c4068a758>
#> <environment: 0x561c40689db8>
#>
#> $transformation
#> function (x)
#> {
#> x <- (x - min)/(max - min)
#> if (intercept)
#> x <- c(x0 = 1, x)
#> x
#> }
#> <bytecode: 0x561c4068e8a8>
#> <environment: 0x561c40693368>
#>
#> $w_init
#> up.x0 up.x1 up.x2 right.x0 right.x1 right.x2 down.x0 down.x1
#> 0 0 0 0 0 0 0 0
#> down.x2 left.x0 left.x1 left.x2
#> 0 0 0 0
Now, we can solve the model. We initially use a large epsilon for the ϵ-greedy behavior policy to make sure that the goal state can be found with the initially bad policy.
set.seed(2000)
sol <- solve_MDP_APPROX(m, horizon = 1000, n = 10,
alpha = 0.01, epsilon = .5)
gw_plot(sol)
We refine the policy using a reduced epsilon value.
sol <- solve_MDP_APPROX(sol, horizon = 1000, n = 100,
alpha = 0.01, epsilon = 0.05, verbose = FALSE, continue = TRUE)
gw_plot(sol)
gw_matrix(sol, what = "value")
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 35.14908 42.69130 50.23352 57.77574 65.31796
#> [2,] 41.78887 48.27324 55.81546 63.35768 70.89989
#> [3,] 50.10371 55.43508 61.39740 68.93961 76.48183
#> [4,] 58.41855 63.74991 69.08128 74.52155 82.06377
#> [5,] 66.73339 72.06475 77.39612 82.72749 88.05885
Here are the learned weights.
sol$solution$w
#> up.x0 up.x1 up.x2 right.x0 right.x1 right.x2 down.x0
#> 3.7322809 1.9757633 1.9199719 33.4740345 33.2593514 21.3254665 35.1490849
#> down.x1 down.x2 left.x0 left.x1 left.x2
#> 22.3277533 30.1688704 1.4939581 0.6477309 0.8547735
The approximate value function is continuous and can also be displayed using matrix shading and contours.
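For example, we can evaluate the learned linear Q-function on a fine grid of normalized coordinates and plot the greedy state values with base R graphics. This is only a sketch: it assumes the weight layout shown above (one x0/x1/x2 block per action) and features scaled to [0, 1] by the model's transformation.

# reshape the learned weights into one column per action (rows: x0, x1, x2)
w <- matrix(sol$solution$w, nrow = 3)
# greedy state value of the linear approximation at normalized coordinates (x, y)
v_hat <- function(x, y) max(t(w) %*% c(1, x, y))
grid <- seq(0, 1, length.out = 50)
V <- outer(grid, grid, Vectorize(v_hat))
image(grid, grid, V, xlab = "x (normalized)", ylab = "y (normalized)")
contour(grid, grid, V, add = TRUE)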
For the Maze gridworld, the wall and the -1 absorbing state make linear approximation using just the position more difficult.
Adding the linear approximation automatically translates state names of the format s(feature list).
Maze_approx <- add_linear_approx_Q_function(Maze)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100,
alpha = 0.01, epsilon = 0.3)
gw_plot(sol)
gw_matrix(sol, what = "value")
#> [,1] [,2] [,3] [,4]
#> [1,] 0.03754164 0.08520828 0.13287493 0.1805416
#> [2,] -0.12594533 NA -0.03061204 0.0170546
#> [3,] -0.17897172 -0.20984183 -0.19409901 -0.1464324
The linear approximation cannot deal with the center wall and the -1 absorbing state. Next, we try a polynomial basis transformation for the state features.
Maze_approx <- add_linear_approx_Q_function(Maze, transformation = transformation_polynomial_basis, order = 1)
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100, alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
#> [,1] [,2] [,3] [,4]
#> [1,] 0.5038624 0.59386501 0.683867656 0.77387030
#> [2,] 0.3694329 NA 0.064714316 0.02601866
#> [3,] 0.2785708 0.06514918 0.004172039 -0.05548804
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100, alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
#> [,1] [,2] [,3] [,4]
#> [1,] 0.5916602 0.7757432 0.8331777 0.7323140
#> [2,] 0.6842868 NA 0.6014451 0.4053169
#> [3,] 0.5211166 0.5322500 0.4328252 0.2775929
Next, we use a Fourier basis transformation of order 2.
Maze_approx <- add_linear_approx_Q_function(Maze, transformation = transformation_fourier_basis, order = 2)
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100, alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
#> [,1] [,2] [,3] [,4]
#> [1,] 0.8039007 0.8530152 0.9321483 0.9621668
#> [2,] 0.7203552 NA 0.8031814 0.6414500
#> [3,] 0.6009483 0.3073787 0.4425806 0.1031225
Next, we consider a larger maze consisting of two rooms connected by a small opening.
Maze <- gw_read_maze(
textConnection(c("XXXXXXXXXXXX",
"X X",
"X S X",
"X X",
"XXXXXXXXX X",
"X X",
"X G X",
"X X",
"XXXXXXXXXXXX"
)))
We use a Fourier basis of order 3.
Maze_approx <- add_linear_approx_Q_function(Maze, transformation = transformation_fourier_basis, order = 3)
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100, alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> [1,] NA NA NA NA NA NA NA
#> [2,] NA -37.00571 -32.305190 0.7019976 32.82877 44.55576 40.346856
#> [3,] NA -22.92956 4.260784 42.0101035 54.95916 36.48745 6.239592
#> [4,] NA 19.16871 45.407858 61.7536996 55.74079 35.47391 21.988292
#> [5,] NA NA NA NA NA NA NA
#> [6,] NA 54.32267 80.436803 86.4649675 74.28118 76.09669 79.364236
#> [7,] NA 94.02564 106.113042 98.0204334 109.25071 108.08643 95.368631
#> [8,] NA 118.65841 104.905130 98.0169676 89.02328 67.79762 53.606674
#> [9,] NA NA NA NA NA NA NA
#> [,8] [,9] [,10] [,11] [,12]
#> [1,] NA NA NA NA NA
#> [2,] 37.618667 47.66486 64.664955 73.04139 NA
#> [3,] -5.882173 14.15749 52.155198 78.33453 NA
#> [4,] 29.170064 50.71066 71.970919 77.88169 NA
#> [5,] NA NA 84.957491 59.54535 NA
#> [6,] 86.271720 89.11826 83.718684 68.88511 NA
#> [7,] 77.709483 61.66809 48.740141 35.88682 NA
#> [8,] 48.837345 26.32749 9.064637 10.57478 NA
#> [9,] NA NA NA NA NA