--- title: "Solving MDPs with Linear Approximation" author: "Michael Hahsler" bibliography: references.bib link-citations: yes vignette: > %\VignetteEncoding{UTF-8} %\VignetteIndexEntry{Solving MDPs with Linear Approximation} %\VignetteEngine{knitr::rmarkdown} output: rmarkdown::html_vignette editor_options: markdown: wrap: 72 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction The package **markovDP** [@CRAN_markovDP] implements episodic semi-gradient Sarsa with linear state-value function approximation following @Sutton1998. The state-action value construction uses the approach described in @Geramifard2013. First, we will use the state features directly and then we will use a Fourier basis (see @Konidaris2011). ## Linear Approximation The state-action value function is approximated by \deqn{\hat{q}(s,a) = \boldsymbol{w}^\top\phi(s,a),} where \eqn{\boldsymbol{w} \in \mathbb{R}^n} is a weight vector and \eqn{\phi: S \times A \rightarrow \mathbb{R}^n} is a feature function that maps each state-action pair to a feature vector. The gradient of the state-action function is \deqn{\nabla \hat{q}(s,a,\boldsymbol{w}) = \phi(s,a).} ## State-action Feature Vector Construction For a small number of actions, we can follow the construction described by @Geramifard2013 which uses a state feature function \eqn{\phi: S \rightarrow \mathbb{R}^{m}} to construct the complete state-action feature vector. Here, we also add an intercept term. The state-action feature vector has length \eqn{1 + |A| \times m}. It has the intercept and then one component for each action. All these components are set to zero and only the active action component is set to \eqn{\phi(s)}, where \eqn{s} is the current state. For example, for the state feature vector \eqn{\phi(s) = (3,4)} and action \eqn{a=2} out of three possible actions \eqn{A = \{1, 2, 3\}}, the complete state-action feature vector is \eqn{\phi(s,a) = (1,0,0,3,4,0,0)}. The leading 1 is for the intercept and the zeros represent the two not chosen actions. This construction is implemented in `add_linear_approx_Q_function()`. ## Helper Functions The following helper functions for using approximation are available: * `approx_Q_value()` calculates approximate Q values given the weights in the model or specified weights. * `approx_greedy_action()` uses approximate Q values given the weights in the model or specified weights to find the the greedy action for a state. * `approx_greedy_policy()` calculates the greedy-policy for the approximate Q values given the weights in the model or specified weights. ## Episodic Semi-gradient Sarsa The implementation follows the algorithm given in @Sutton1998. ## Example 1: A maze without walls The step cost is 1. The start is top-left and the goal (+100 reward) is bottom-right. This is the ideal problem for a linear approximation of the Q-function using the x/y location as state features. We start with defining an MDP for a small maze without walls. ```{r } library(markovDP) m <- gw_maze_MDP(c(5, 5), start = "s(1,1)", goal = "s(5,5)") ``` We construct state features as the x/y coordinates in the gridworld. ```{r } state_features <- gw_s2rc(S(m)) state_features ``` We add the state features with the linear approximation function to the model. `add_linear_approx_Q_function()` constructs state-action features and adds them with an approximate Q function and a gradient function to the model. 
The corresponding weight vector has one weight for the intercept and
one for each state feature of every action. Below, we see that the
initial weight vector has names organized by action: `x0` represents
the intercept, and `x1` and `x2` represent the x and y coordinates in
the maze, respectively.

```{r}
m <- add_linear_approx_Q_function(m, state_features)
m$approx_Q_function
```

Now we can solve the model. We initially use a large epsilon for the
$\epsilon$-greedy behavior policy to make sure that the goal state can
be found with the initially poor policy.

```{r, fig.asp = 1}
set.seed(2000)
sol <- solve_MDP_APPROX(m, horizon = 1000, n = 10,
                        alpha = 0.01, epsilon = .5)
gw_plot(sol)
```

We refine the policy using a reduced epsilon value.

```{r, fig.asp = 1}
sol <- solve_MDP_APPROX(sol, horizon = 1000, n = 100,
                        alpha = 0.01, epsilon = 0.05,
                        verbose = FALSE, continue = TRUE)
gw_plot(sol)
```

```{r}
gw_matrix(sol, what = "value")
```

Here are the learned weights.

```{r}
sol$solution$w
```

The approximate value function is continuous and can also be displayed
using matrix shading and contour lines.

```{r, fig.asp = 1}
approx_V_plot(sol)
```

## Example 2: Stuart Russell's 3x4 Maze using Linear Approximation

```{r}
data(Maze)
gw_plot(Maze)
```

### Linear Basis

The wall and the -1 absorbing state make linear approximation using
just the position more difficult. When the linear approximation is
added without explicit state features, state names of the format
`s(feature list)` are automatically translated into state features.

```{r, fig.asp = 3/4}
Maze_approx <- add_linear_approx_Q_function(Maze)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100,
                        alpha = 0.01, epsilon = 0.3)
gw_plot(sol)
```

```{r}
gw_matrix(sol, what = "value")
```

The linear approximation cannot deal with the center wall and the -1
absorbing state.

### Order-1 Polynomial Basis

The polynomial basis adds products of the state features (for order 1,
the interaction term $xy$), which gives the approximation more
flexibility than the purely linear basis.

```{r}
Maze_approx <- add_linear_approx_Q_function(Maze,
  transformation = transformation_polynomial_basis, order = 1)
```

```{r}
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100,
                        alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
```

```{r, fig.asp = 3/4}
gw_plot(sol)
```

```{r, fig.asp = 3/4}
approx_V_plot(sol)
```

### Radial Basis Function

Radial basis functions represent a state by its proximity to a set of
center points spread over the state space.

```{r}
Maze_approx <- add_linear_approx_Q_function(Maze,
  transformation = transformation_RBF_basis, n = 5)
```

```{r}
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100,
                        alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
```

```{r, fig.asp = 3/4}
gw_plot(sol)
```

```{r, fig.asp = 3/4}
approx_V_plot(sol)
```

### Order-2 Fourier Basis

The Fourier basis represents a state using cosine features (see
@Konidaris2011 and the sketch at the end of Example 3).

```{r}
Maze_approx <- add_linear_approx_Q_function(Maze,
  transformation = transformation_fourier_basis, order = 2)
```

```{r}
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100,
                        alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
```

```{r, fig.asp = 3/4}
gw_plot(sol)
```

```{r, fig.asp = 3/4}
approx_V_plot(sol)
```

## Example 3: Wall Maze

```{r}
Maze <- gw_read_maze(
  textConnection(c("XXXXXXXXXXXX",
                   "X          X",
                   "X S        X",
                   "X          X",
                   "XXXXXXXXX  X",
                   "X          X",
                   "X        G X",
                   "X          X",
                   "XXXXXXXXXXXX"
  )))
```

We use an order-3 Fourier basis.

```{r}
Maze_approx <- add_linear_approx_Q_function(Maze,
  transformation = transformation_fourier_basis, order = 3)
```

```{r}
set.seed(2000)
sol <- solve_MDP_APPROX(Maze_approx, horizon = 100, n = 100,
                        alpha = 0.1, epsilon = .1)
gw_matrix(sol, what = "value")
```

```{r, fig.asp = 3/4}
gw_plot(sol)
```

```{r, fig.asp = 3/4}
approx_V_plot(sol)
```
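For reference, the Fourier basis used in the last examples maps the
state features, rescaled to the interval $[0, 1]$, to cosine features
of the form $\cos(\pi\, \boldsymbol{c}^\top \boldsymbol{s})$ with
integer coefficient vectors $\boldsymbol{c}$ [@Konidaris2011]. The
following minimal sketch illustrates the idea for a 2-dimensional
state; it is independent of the package and is not the implementation
used by `transformation_fourier_basis`.

```{r}
# Illustrative sketch of an order-n Fourier basis for a 2-dimensional
# state following Konidaris et al. (2011). This is not the package's
# transformation_fourier_basis() implementation.
fourier_basis_2d <- function(s, order) {
  # s is assumed to be rescaled to [0, 1]^2
  C <- as.matrix(expand.grid(0:order, 0:order))  # all coefficient vectors
  cos(pi * as.vector(C %*% s))
}

# an order-2 basis yields (order + 1)^2 = 9 features
fourier_basis_2d(c(0.2, 0.8), order = 2)
```

Since all $(order + 1)^d$ coefficient combinations are used, the number
of features grows quickly with the order and the dimensionality of the
state features.

# References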