--- title: "Solving Tic-Tac-Toe as a MDP" author: "Michael Hahsler" bibliography: references.bib link-citations: yes vignette: > %\VignetteIndexEntry{Solving Tic-Tac-Toe as a MDP} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: rmarkdown::html_vignette --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = TRUE ) ``` ```{r setup} library(markovDP) ``` ## Introduction In this vignette we show how the R package **markovDP** [@CRAN_markovDP] can be used to define and solve the simple Tix-Tac-Toe game. We define the game Tic-Tac-Toe from the perspective of player `x` as a simple MDP, and then try to solve the problem using different dynamic programming and reinforcement learning techniques. ## Defining Tic-Tac-Toe as an MDP We implement the game form the perspective of player `x`, where the other player is part of the environment. We represent the board as a 3-by-3 matrix. ### Actions The board consists of nine positions. The player `x` can mark any of the nine positions with an `x` as long as they are empty. We will number the actions from 1 through 9 using the index in the matrix (note that R indexed by column). We get the following layout for actions. ```{r } matrix(1:9, ncol = 3, nrow = 3) ``` The set of actions of the MDP is therefore: ```{r } A <- as.character(1:9) A ``` ### State Space We use a characters in the board matrix to represent the players (`x` and `o`). For an empty place we use the underscore symbol `_`. As the label for a state, we use a string of all the places of the board in index order. We implement several helper functions. ```{r} ttt_empty_board <- function() matrix('_', ncol = 3, nrow = 3) ttt_state2label <- function(state) paste(state, collapse = '') ttt_label2state <- function(label) matrix(strsplit(label, "")[[1]], nrow = 3, ncol = 3) ttt_available_actions <- function(state) { if (length(state) == 1L) state <- ttt_label2state(state) which(state == "_") } ttt_result <- function(state, player, action) { if (length(state) == 1L) state <- ttt_label2state(state) if (state[action] != "_") stop("Illegal action.") state[action] <- player state } ``` Next, we perform a sequence of actions to see how a game would progress using the result function. ```{r} b <- ttt_empty_board() b <- ttt_result(b, 'x', 1) b <- ttt_result(b, 'o', 4) b <- ttt_result(b, 'x', 2) b <- ttt_result(b, 'o', 5) b <- ttt_result(b, 'x', 3) b ttt_state2label(b) ttt_available_actions(b) ``` We need to determine terminal states by checking if the game is over. ```{r } ttt_terminal <- function(state) { if (length(state) == 1L) state <- ttt_label2state(state) # Check the board for a win and return one of # 'x', 'o', 'd' (draw), or 'n' (for next move) win_possibilities <- rbind(state, t(state), diag(state), diag(t(state))) wins <- apply(win_possibilities, MARGIN = 1, FUN = function(x) { if (x[1] != '_' && length(unique(x)) == 1) x[1] else '_' }) if (any(wins == 'x')) return('x') if (any(wins == 'o')) return('o') # Check for draw if (sum(state == '_') < 1) { return('d') } return('n') } ``` ```{r } b ``` ```{r } ttt_terminal(b) ``` ## Transition Model To enumerate the complete reachable state space to define the MDP, we need to know the start state (an empty board) and define the transition model. The model is a function that, given an action and a start state, returns a named vector with all end states that have a non-zero probability. This is convenient, since we do not know the complete and potentially large state space at this point. 
We also add the special absorbing states `'win'`, `'loss'`, and `'draw'`, as well as a state `'illegal'` that is entered when an unavailable action is chosen. During one interaction with the environment, player `x` performs an action following the learned policy. We implement `o` as part of the stochastic transition model of the environment and return a uniform probability distribution over all actions that `o` has available.

```{r }
start <- ttt_state2label(ttt_empty_board())
start

P <- function(model, action, start.state) {
  action <- as.integer(action)

  # absorbing states
  if (start.state %in% c('win', 'loss', 'draw', 'illegal')) {
    return(structure(1, names = start.state))
  }

  # unavailable actions lead to the very expensive illegal state
  if (!(action %in% ttt_available_actions(start.state))) {
    return(structure(1, names = "illegal"))
  }

  # make x's move
  next_state <- ttt_result(start.state, 'x', action)

  # terminal?
  term <- ttt_terminal(next_state)
  if (term == 'x') {
    return(structure(1, names = "win"))
  } else if (term == 'o') {
    return(structure(1, names = "loss"))
  } else if (term == 'd') {
    return(structure(1, names = "draw"))
  }

  # it is o's turn, try all available actions
  actions_of_o <- ttt_available_actions(next_state)
  possible_end_states <- lapply(
    actions_of_o,
    FUN = function(a) ttt_result(next_state, 'o', a)
  )

  # replace terminal boards with the special states
  term <- sapply(possible_end_states, ttt_terminal)
  possible_end_states <- sapply(possible_end_states, ttt_state2label)
  possible_end_states[term == 'x'] <- 'win'
  possible_end_states[term == 'o'] <- 'loss'
  possible_end_states[term == 'd'] <- 'draw'
  possible_end_states <- unique(possible_end_states)

  return(structure(rep(1 / length(possible_end_states),
                       length(possible_end_states)),
                   names = possible_end_states))
}
```

We use `reachable_states()` to find all reachable states using depth-first search. We add the special states `'win'`, `'loss'`, `'draw'`, and `'illegal'`, where `'illegal'` is only reachable if an unavailable action is used.

```{r }
S <- union(c('win', 'loss', 'draw', 'illegal'),
           reachable_states(P, start_state = start, actions = A))
head(S)
```

The total number of reachable states is:

```{r }
length(S)
```

## Reward Function

The player only receives a reward at the end of the game. The default reward is 0; ending the game yields +1 for a `win`, -1 for a `loss`, and 0 for a `draw`. Ending up in the `illegal` state is penalized with `-Inf` so that an optimal policy never chooses an unavailable action.

```{r }
R <- rbind(
  R_(value = 0),
  R_(end.state = 'win',     value = +1),
  R_(end.state = 'loss',    value = -1),
  R_(end.state = 'draw',    value = 0),
  R_(end.state = 'illegal', value = -Inf),

  # Note: there is no more reward once the agent is in a terminal state
  R_(start.state = 'win',     value = 0),
  R_(start.state = 'loss',    value = 0),
  R_(start.state = 'draw',    value = 0),
  R_(start.state = 'illegal', value = 0)
)
```

## Constructing the MDP

```{r }
tictactoe <- MDP(S, A, P, R,
  discount = 1,
  start = start,
  name = "TicTacToe"
)
tictactoe
```

We mark the four special states as absorbing and normalize the MDP for quicker access.

```{r }
tictactoe$absorbing_states <- c("draw", "loss", "win", "illegal")
n <- normalize_MDP(tictactoe, sparse = TRUE)
```
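As a quick, optional sanity check (this snippet is an illustration added here, not part of the model definition), we can call the transition function directly and verify that it returns a probability distribution that sums to one, here for marking the center of the empty board:

```{r }
# Optional check: transition probabilities for action 5 (the center) from
# the empty board. The result is a named vector with one entry for each
# possible reply by player o, and the probabilities should sum to 1.
p5 <- P(tictactoe, action = "5", start.state = start)
p5
sum(p5)
```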
## Define Test Boards

We define a few test boards to compare how different policies decide. The board names encode the expected outcome and the position(s) a good policy should mark (e.g., for `win_1`, marking position 1 wins the game).

```{r }
test_policy <- function(sol) {
  boards <- list(
    empty      = rbind(c("_", "_", "_"),
                       c("_", "_", "_"),
                       c("_", "_", "_")),
    win_4_or_5 = rbind(c("x", "_", "x"),
                       c("o", "_", "o"),
                       c("_", "_", "_")),
    win_1      = rbind(c("_", "x", "x"),
                       c("_", "_", "o"),
                       c("o", "o", "x")),
    draw_6     = rbind(c("_", "o", "_"),
                       c("x", "o", "x"),
                       c("_", "_", "_")),
    loss       = rbind(c("o", "o", "_"),
                       c("x", "o", "x"),
                       c("x", "_", "_"))
  )

  pol <- policy(sol)
  res <- do.call(rbind, lapply(boards, FUN = function(b)
    pol[pol$state == ttt_state2label(b), ]))
  res
}
```

## Solving the MDP

### Linear Programming

```{r }
sol <- solve_MDP(n, method = "LP:LP")
test_policy(sol)
```

### Value Iteration

```{r }
sol <- solve_MDP(n, method = "DP:VI")
test_policy(sol)
```

The expected state values are for the optimal policy playing against an opponent that chooses its actions uniformly at random.

### Policy Iteration

```{r }
sol <- solve_MDP(n, method = "DP:PI")
test_policy(sol)
```

### Prioritized Sweeping (GenPS)

```{r }
sol <- solve_MDP(n, method = "DP:GenPS")
test_policy(sol)
```

### Q-Learning

```{r }
sol <- solve_MDP(n, method = "TD:q_learning", n = 10000, horizon = 9, Q = 1)
test_policy(sol)
```

### Expected Sarsa

```{r }
sol <- solve_MDP(n, method = "TD:expected_sarsa", n = 50000, alpha = 1, horizon = 9)
test_policy(sol)
```

# References