Solving MDPs with Linear Approximation

Introduction

The package markovDP (Hahsler 2024) implements episodic semi-gradient Sarsa with linear state-action value function approximation following Sutton and Barto (2018). The state-action feature vector construction uses the approach described in Geramifard et al. (2013). First, we will use the state features directly, and then we will use a Fourier basis (Konidaris, Osband, and Thomas 2011).

Linear Approximation

The state-action value function is approximated by

$$\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a),$$

where $\mathbf{w}$ is a weight vector and $\mathbf{x}(s, a)$ is a feature function that maps each state-action pair to a feature vector. The gradient of the state-action value function is

$$\nabla \hat{q}(s, a, \mathbf{w}) = \mathbf{x}(s, a).$$
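
In code, the approximation is an inner product and its gradient is simply the feature vector. A minimal sketch in R, assuming a feature function x(s, a) is supplied (the model created below stores exactly this pair of functions):

# linear approximation: inner product of weights and state-action features
q_hat <- function(s, a, w, x) sum(w * x(s, a))

# the gradient with respect to w is the feature vector itself
q_hat_gradient <- function(s, a, w, x) x(s, a)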

State-action Feature Vector Construction

For a small number of actions, we can follow the construction described by Geramifard et al. (2013), which uses a state feature function $\boldsymbol{\phi}(s)$ to construct the complete state-action feature vector $\mathbf{x}(s, a)$. Here, we also add an intercept term. The state-action feature vector has length $(m + 1)\,|A|$, where $m$ is the number of state features and $|A|$ is the number of actions. It contains one block per action, each consisting of an intercept and the state features. All blocks are set to zero and only the block of the chosen action is set to $(1, \boldsymbol{\phi}(s))$, where $s$ is the current state. For example, for the state feature vector $\boldsymbol{\phi}(s) = (x_1, x_2)$ and the first action $a_1$ out of three possible actions, the complete state-action feature vector is $\mathbf{x}(s, a_1) = (1, x_1, x_2, 0, 0, 0, 0, 0, 0)$. The leading 1 is for the intercept and the zeros represent the two not chosen actions.

This construction is implemented in add_linear_approx_Q_function().
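
The construction itself takes only a few lines of base R. The following sketch is purely illustrative (make_state_action_features() is not a package function):

# build the state-action feature vector: one block of length
# length(phi) + 1 per action; only the block of the chosen action
# holds the intercept followed by the state features
make_state_action_features <- function(phi, action, actions) {
  block_len <- length(phi) + 1L
  x <- numeric(block_len * length(actions))
  offset <- (match(action, actions) - 1L) * block_len
  x[offset + seq_len(block_len)] <- c(1, phi)
  x
}

# state features (1, 1) with action "up" out of four actions
make_state_action_features(c(1, 1), "up", c("up", "right", "down", "left"))
#>  [1] 1 1 1 0 0 0 0 0 0 0 0 0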

Helper Functions

The following helper functions for using approximation are available (see the usage sketch after this list):

  • approx_Q_value() calculates approximate Q values given the weights in the model or specified weights.
  • approx_greedy_action() uses approximate Q values given the weights in the model or specified weights to find the greedy action for a state.
  • approx_greedy_policy() calculates the greedy policy for the approximate Q values given the weights in the model or specified weights.
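
The calls below are only a usage sketch: the argument names and order are assumptions, so consult the manual pages for the exact signatures. It presumes a solved model such as sol from Example 1 below.

# assumed call forms (see ?approx_Q_value etc. for the exact interface)
approx_Q_value(sol, "s(1,1)", "up")    # Q value of one state-action pair
approx_greedy_action(sol, "s(1,1)")    # greedy action for a single state
approx_greedy_policy(sol)              # greedy policy over all states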

Episodic Semi-gradient Sarsa

The implementation follows the algorithm given in Sutton and Barto (2018).
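
Following Sutton and Barto (2018), after each transition $(S, A, R, S', A')$ generated by the ϵ-greedy behavior policy, the weights are updated with the semi-gradient Sarsa rule

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R + \gamma\, \hat{q}(S', A', \mathbf{w}) - \hat{q}(S, A, \mathbf{w}) \right] \nabla \hat{q}(S, A, \mathbf{w}),$$

where $\alpha$ is the learning rate and the target reduces to $R$ when $S'$ is terminal.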

Example 1: A maze without walls

The step cost is 1. The start is top-left and the goal (+100 reward) is bottom-right. This is the ideal problem for a linear approximation of the Q-function using the x/y location as state features.

We start with defining an MDP for a small maze without walls.

library(markovDP)
m <- gw_maze_MDP(c(5, 5), start = "s(1,1)", goal = "s(5,5)")

We construct state features as the x/y coordinates in the gridworld.

state_features <- gw_s2rc(S(m))
state_features
#>        [,1] [,2]
#> s(1,1)    1    1
#> s(2,1)    2    1
#> s(3,1)    3    1
#> s(4,1)    4    1
#> s(5,1)    5    1
#> s(1,2)    1    2
#> s(2,2)    2    2
#> s(3,2)    3    2
#> s(4,2)    4    2
#> s(5,2)    5    2
#> s(1,3)    1    3
#> s(2,3)    2    3
#> s(3,3)    3    3
#> s(4,3)    4    3
#> s(5,3)    5    3
#> s(1,4)    1    4
#> s(2,4)    2    4
#> s(3,4)    3    4
#> s(4,4)    4    4
#> s(5,4)    5    4
#> s(1,5)    1    5
#> s(2,5)    2    5
#> s(3,5)    3    5
#> s(4,5)    4    5
#> s(5,5)    5    5

We add the linear approximation with these state features to the model. add_linear_approx_Q_function() constructs state-action features and adds them, together with an approximate Q function and a gradient function, to the model. The state-action features are constructed as a vector with an intercept and the state features for each action.
Below, we see that the initial weight vector has names organized by action. x0 represents the intercept, and x1 and x2 represent the x and y coordinates in the maze, respectively.

m <- add_linear_approx_Q_function(m, state_features)
m$approx_Q_function
#> $f
#> function (s, a, w) 
#> sum(w * x(s, a))
#> <bytecode: 0x561c4068ab10>
#> <environment: 0x561c40689db8>
#> 
#> $gradient
#> function (s, a, w) 
#> x(s, a)
#> <bytecode: 0x561c4068a758>
#> <environment: 0x561c40689db8>
#> 
#> $transformation
#> function (x) 
#> {
#>     x <- (x - min)/(max - min)
#>     if (intercept) 
#>         x <- c(x0 = 1, x)
#>     x
#> }
#> <bytecode: 0x561c4068e8a8>
#> <environment: 0x561c40693368>
#> 
#> $w_init
#>    up.x0    up.x1    up.x2 right.x0 right.x1 right.x2  down.x0  down.x1 
#>        0        0        0        0        0        0        0        0 
#>  down.x2  left.x0  left.x1  left.x2 
#>        0        0        0        0

Now, we can solve the model. We initially use a large ϵ for the ϵ-greedy behavior policy to make sure that the goal state can still be found while the policy is poor.

set.seed(2000)
sol <- solve_MDP_APPROX(m, horizon = 1000, n = 10,
                     alpha = 0.01, epsilon = .5)
gw_plot(sol)

We refine the policy using a reduced epsilon value.

sol <- solve_MDP_APPROX(sol, horizon = 1000, n = 100,
          alpha = 0.01, epsilon = 0.05, verbose = FALSE, continue = TRUE)
gw_plot(sol)

gw_matrix(sol, what = "value")
#>          [,1]     [,2]     [,3]     [,4]     [,5]
#> [1,] 35.14908 42.69130 50.23352 57.77574 65.31796
#> [2,] 41.78887 48.27324 55.81546 63.35768 70.89989
#> [3,] 50.10371 55.43508 61.39740 68.93961 76.48183
#> [4,] 58.41855 63.74991 69.08128 74.52155 82.06377
#> [5,] 66.73339 72.06475 77.39612 82.72749 88.05885

Here are the learned weights.

sol$solution$w
#>      up.x0      up.x1      up.x2   right.x0   right.x1   right.x2    down.x0 
#>  3.7322809  1.9757633  1.9199719 33.4740345 33.2593514 21.3254665 35.1490849 
#>    down.x1    down.x2    left.x0    left.x1    left.x2 
#> 22.3277533 30.1688704  1.4939581  0.6477309  0.8547735
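
These weights can be checked against the value matrix above. The transformation shown earlier rescales the coordinates to [0, 1], so for the start state s(1,1) the normalized features are (0, 0) and the Q value of each action reduces to its intercept weight:

w <- sol$solution$w

# at s(1,1) the normalized features are (0, 0), so the Q value of each
# action is just its intercept weight
q_s11 <- w[c("up.x0", "right.x0", "down.x0", "left.x0")]

# the state value is the maximum over actions; it matches the entry
# for s(1,1) in the value matrix above (35.14908)
max(q_s11)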

The approximate value function is continuous and can also be displayed using matrix shading and contours.

approx_V_plot(sol)

Example 2: Stuart Russell’s 3x4 Maze using Linear Approximation

data(Maze)
gw_plot(Maze)

Linear Basis

The wall and the -1 absorbing state make linear approximation using just the position more difficult.

Adding the linear approximation automatically translates state names of the format s(feature list) into state features.
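
Equivalently, the row/column features could be supplied explicitly with the same calls used in Example 1 (a sketch, assuming the Maze states follow the same s(row,col) naming as the gridworld above):

# pass the x/y coordinates explicitly instead of relying on the
# automatic translation of state names
Maze_approx <- add_linear_approx_Q_function(Maze, gw_s2rc(S(Maze)))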

Maze_approx <- add_linear_approx_Q_function(Maze)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100,
                     alpha = 0.01, epsilon = 0.3)
gw_plot(sol)

gw_matrix(sol, what = "value")
#>             [,1]        [,2]        [,3]       [,4]
#> [1,]  0.03754164  0.08520828  0.13287493  0.1805416
#> [2,] -0.12594533          NA -0.03061204  0.0170546
#> [3,] -0.17897172 -0.20984183 -0.19409901 -0.1464324

The linear approximation cannot deal with the center wall and the -1 absorbing state.

Order-1 Polynomial Basis

Maze_approx <- add_linear_approx_Q_function(Maze, transformation = transformation_polynomial_basis, order = 1)
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100, alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
#>           [,1]       [,2]        [,3]        [,4]
#> [1,] 0.5038624 0.59386501 0.683867656  0.77387030
#> [2,] 0.3694329         NA 0.064714316  0.02601866
#> [3,] 0.2785708 0.06514918 0.004172039 -0.05548804
gw_plot(sol)

approx_V_plot(sol)

Radial Basis Function

Maze_approx <- add_linear_approx_Q_function(Maze, transformation = transformation_RBF_basis, n = 5)
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100, alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 0.5916602 0.7757432 0.8331777 0.7323140
#> [2,] 0.6842868        NA 0.6014451 0.4053169
#> [3,] 0.5211166 0.5322500 0.4328252 0.2775929
gw_plot(sol)

approx_V_plot(sol)

Order-2 Fourier Basis

Maze_approx <- add_linear_approx_Q_function(Maze, transformation = transformation_fourier_basis, order = 2)
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100, alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 0.8039007 0.8530152 0.9321483 0.9621668
#> [2,] 0.7203552        NA 0.8031814 0.6414500
#> [3,] 0.6009483 0.3073787 0.4425806 0.1031225
gw_plot(sol)

approx_V_plot(sol)

Example 3: Wall Maze

Maze <- gw_read_maze(
  textConnection(c("XXXXXXXXXXXX",
                   "X          X",
                   "X S        X",
                   "X          X",
                   "XXXXXXXXX  X",
                   "X          X",
                   "X G        X",
                   "X          X",
                   "XXXXXXXXXXXX"
                   )))

We use a Fourier basis.
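
For reference, the Fourier basis of order $n$ described by Konidaris, Osband, and Thomas (2011) maps the state features, rescaled to $[0, 1]^d$, to the cosine features

$$\phi_i(\mathbf{x}) = \cos\!\left(\pi\, \mathbf{c}^{(i)} \cdot \mathbf{x}\right), \qquad \mathbf{c}^{(i)} \in \{0, 1, \ldots, n\}^d,$$

so, assuming transformation_fourier_basis follows this construction, the order-3 basis over the two maze coordinates yields $4^2 = 16$ cosine features per action.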

Maze_approx <- add_linear_approx_Q_function(Maze, transformation = transformation_fourier_basis, order = 3)
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100, alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
#>       [,1]      [,2]       [,3]       [,4]      [,5]      [,6]      [,7]
#>  [1,]   NA        NA         NA         NA        NA        NA        NA
#>  [2,]   NA -37.00571 -32.305190  0.7019976  32.82877  44.55576 40.346856
#>  [3,]   NA -22.92956   4.260784 42.0101035  54.95916  36.48745  6.239592
#>  [4,]   NA  19.16871  45.407858 61.7536996  55.74079  35.47391 21.988292
#>  [5,]   NA        NA         NA         NA        NA        NA        NA
#>  [6,]   NA  54.32267  80.436803 86.4649675  74.28118  76.09669 79.364236
#>  [7,]   NA  94.02564 106.113042 98.0204334 109.25071 108.08643 95.368631
#>  [8,]   NA 118.65841 104.905130 98.0169676  89.02328  67.79762 53.606674
#>  [9,]   NA        NA         NA         NA        NA        NA        NA
#>            [,8]     [,9]     [,10]    [,11] [,12]
#>  [1,]        NA       NA        NA       NA    NA
#>  [2,] 37.618667 47.66486 64.664955 73.04139    NA
#>  [3,] -5.882173 14.15749 52.155198 78.33453    NA
#>  [4,] 29.170064 50.71066 71.970919 77.88169    NA
#>  [5,]        NA       NA 84.957491 59.54535    NA
#>  [6,] 86.271720 89.11826 83.718684 68.88511    NA
#>  [7,] 77.709483 61.66809 48.740141 35.88682    NA
#>  [8,] 48.837345 26.32749  9.064637 10.57478    NA
#>  [9,]        NA       NA        NA       NA    NA
gw_plot(sol)

approx_V_plot(sol)

References

Geramifard, Alborz, Thomas J. Walsh, Stefanie Tellex, Girish Chowdhary, Nicholas Roy, and Jonathan P. How. 2013. “A Tutorial on Linear Function Approximators for Dynamic Programming and Reinforcement Learning.” Foundations and Trends in Machine Learning 6 (4): 375–451. https://doi.org/10.1561/2200000042.
Hahsler, Michael. 2024. markovDP: Infrastructure for Discrete-Time Markov Decision Processes (MDP). https://github.com/mhahsler/markovDP.
Konidaris, George, Sarah Osband, and Philip Thomas. 2011. “Value Function Approximation in Reinforcement Learning Using the Fourier Basis.” Proceedings of the AAAI Conference on Artificial Intelligence 25: 380–85. https://doi.org/10.1609/aaai.v25i1.7903.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. Second edition. The MIT Press. http://incompleteideas.net/book/the-book-2nd.html.