---
layout: post
title: "Dynamic Programming for Super Six"
tags: [rstats, python, dynamic programming, reinforcement learning, game theory]
comments: yes
editor_options:
  chunk_output_type: console
bibliography: /Users/hoehle/Literature/Bibtex/jabref.bib
---

```{r,include=FALSE,echo=FALSE,message=FALSE}
## If default fig.path, then set it.
if (knitr::opts_chunk$get("fig.path") == "figure/") {
  knitr::opts_knit$set(base.dir = '/Users/hoehle/Sandbox/Blog/')
  knitr::opts_chunk$set(fig.path = "figure/source/2023-03-20-dynprog/")
}
fullFigPath <- paste0(knitr::opts_knit$get("base.dir"), knitr::opts_chunk$get("fig.path"))
filePath <- file.path("", "Users", "hoehle", "Sandbox", "Blog", "figure", "source", "2023-03-20-dynprog")

knitr::opts_chunk$set(echo = TRUE, fig.width = 8, fig.height = 3.5,
                      fig.cap = '', fig.align = 'center', dpi = 72 * 2)
options(width = 150, scipen = 1e3)
```

```{r, echo=FALSE}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(reticulate))
reticulate::use_virtualenv("r-reticulate")

# Non-CRAN packages
# devtools::install_github("hadley/emo")

## Configuration
options(knitr.table.format = "html")
theme_set(theme_minimal())
# If there are more than n rows in the tibble, print only the first m rows.
options(tibble.print_max = 10, tibble.print_min = 5)

# Fix seed value for the post
set.seed(123)
```

## Abstract:

We use dynamic programming to find the optimal strategy for playing Super Six when there are 2 players. To test the cross-language capabilities of Rmarkdown, we solve the task by embedding our Python implementation of the value iteration algorithm using the reticulate R package.
The optimal action for state $s$ is thus the action maximizing the sum term in Step 2a, i.e. $\pi(s) = \operatorname{argmax}_{a\in \mathcal{A}(s)} \sum_{s'}P(s'| s,a) \left[ R(s,a,s') + U(s') \right]$. One either computes this once at the end of the algorithm or adds this book-keeping step to Step 2a of the value iteration algorithm. Technically, $\pi\approx \pi^*$, i.e. we only obtain an estimate of the optimal strategy, because the algorithm is iterative and has to be stopped after finitely many iterations. However, one can show that with sufficiently many iterations $\pi$ converges towards $\pi^*$.
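The update in Step 2a and the book-keeping step above can be sketched in a few lines of generic Python. The tiny MDP at the bottom is a hypothetical toy example (a single decision state `"s0"` with a safe and a risky action), not the Super Six model itself:

```python
# Minimal sketch of value iteration with greedy policy extraction.
# actions(s) lists the legal actions in s; P[s][a] is a list of
# (s_next, prob) pairs; R(s, a, s_next) is the transition reward.

def value_iteration(states, actions, P, R, eps=1e-10):
    U = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions(s):  # terminal state: utility stays 0
                continue
            best = max(
                sum(p * (R(s, a, s2) + U[s2]) for s2, p in P[s][a])
                for a in actions(s)
            )
            delta = max(delta, abs(best - U[s]))
            U[s] = best
        if delta < eps:
            break
    # Book-keeping: pi(s) = argmax_a sum_s' P(s'|s,a) [R(s,a,s') + U(s')]
    pi = {s: max(actions(s),
                 key=lambda a, s=s: sum(p * (R(s, a, s2) + U[s2])
                                        for s2, p in P[s][a]))
          for s in states if actions(s)}
    return U, pi


# Toy MDP: "stop" pays 1 for sure, "go" pays 3 with probability 1/2.
states = ["s0", "win", "lose"]
actions = lambda s: ["stop", "go"] if s == "s0" else []
P = {"s0": {"stop": [("lose", 1.0)], "go": [("win", 0.5), ("lose", 0.5)]}}
R = lambda s, a, s2: 1.0 if a == "stop" else (3.0 if s2 == "win" else 0.0)

U, pi = value_iteration(states, actions, P, R)
# U["s0"] == 1.5 and pi["s0"] == "go": the risky action is optimal here.
```

The Super Six solver in the Appendix follows the same pattern, only with the game's state space and transition probabilities plugged in.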
## Results
Applying the value iteration Python code for $n=7$ (see the code in the Appendix) yields the optimal strategy. Furthermore, we also obtain the probability of winning from each state. We show the result for all states with at least 3 sticks in the lid; with fewer than 3 sticks the decision is always to continue throwing.
In the output, the column `strategy` shows whether to continue throwing the die (`TRUE`) or not (`FALSE`); the column `value` shows the expected utility $U(s)$ and `prob` shows the probability of winning the game, given that the opponent also follows the optimal strategy. Note that for states with $l=1$ no choice is possible, i.e. one has to continue no matter what; for these states the strategy column is therefore always `TRUE`.
```{python VALUEITER, file=file.path(filePath, "value_iteration.py"), echo=FALSE}
# reticulate::py_run_file(file.path(filePath, "value_iteration.py"))
```
```{r, echo=FALSE}
py$s_best %>% filter(i>=3)
```
One can, e.g., compare the results with the numbers in Fig. 4 of @jehn2021. The optimal strategy for a two-player game is thus to continue as long as at most three pits are filled. If four pits are filled, one should only continue in the 4/1/1 situation. If five pits are filled, one should always stop (if possible). The figure below illustrates the strategy for all $l=0$ states graphically: the y-axis gives $i$, whereas the label of each box is $j/k$ followed by the probability (rounded to 2 decimals) of winning from this state:
```{r optim_strategy, fig.width=10, fig.height=2, echo=FALSE}
size_x <- 12
strategy_df <- py$s_best %>% filter(l == 0)
strategy_df <- strategy_df %>% mutate(
  x = (size_x * j + k) * size_x - 4 * size_x * j^1.25,
  y = i,
  label = sprintf("%d/%d\n%.2f", j, k, prob)
)
par(mar = c(0, 5, 0, 0))
plot(c(0, 0), xlim = c(100, 450), ylim = c(0, 6), type = "n",
     xlab = "", ylab = "Occupied pits in lid", axes = FALSE)
for (i in 1:nrow(strategy_df)) {
  one_s <- strategy_df[i, ]
  rect(xleft = one_s[["x"]], ybottom = one_s[["y"]],
       xright = one_s[["x"]] + size_x, ytop = one_s[["y"]] + 0.8,
       col = if_else(one_s[["strategy"]], "darkolivegreen3", "salmon2"))
  text(one_s[["x"]] + size_x / 2, one_s[["y"]] + 0.4, one_s[["label"]],
       cex = 0.5, adj = c(0.5, 0.5))
}
axis(2, at = 0:5 + 0.5, labels = 0:5)
legend(x = "topright", fill = c("darkolivegreen3", "salmon2"),
       legend = c("continue", "stop"), title = "Action:", cex = 1)
```
### Probability to Win as Start-Player
One surprising finding from the simulations in [How to Win a Game (or More) of Super Six](https://mhoehle.github.io/blog/2023/03/13/super6.html) was that the start player has a big winning advantage. We use the above algorithm to analytically calculate the probability that the starting player wins the game in a situation where each player initially has 4 sticks.
To compute the winning probability we need to keep in mind that in round 1 there is the additional restriction that only one throw of the die is allowed for each player. As a consequence, we need to treat the outcome of round 1 separately. There are four possible states that Player 1 can start from in round 2:
1. Both players throw a 6 ($p=\frac{1}{36}$) in round 1; round 2 continues from state $0/3/3/1$
2. Exactly one of the players (but not both!) throws a six ($p=\frac{10}{36}$); round 2 continues from state $1/3/3/1$
3. Both sticks from round 1 end up in the lid ($p=\frac{5}{6}\cdot\frac{4}{6}=\frac{20}{36}$); round 2 continues from state $2/3/3/1$
4. Player 1 throws anything but a 6 and Player 2 throws the same number ($p=\frac{5}{6}\cdot\frac{1}{6}=\frac{5}{36}$); round 2 continues from state $0/3/5/1$
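As a quick sanity check, the four round-1 probabilities can be verified with exact rational arithmetic (a sketch only; the variable names are ours and the game logic itself is not modelled):

```python
from fractions import Fraction

# Each player throws once in round 1; a 6 leaves the lid untouched,
# any other number occupies (or matches) a pit in the lid.
six = Fraction(1, 6)

p_both_six = six * six                    # 1. both players throw a 6
p_one_six  = 2 * six * (1 - six)          # 2. exactly one player throws a 6
p_two_pits = (1 - six) * Fraction(4, 6)   # 3. two different non-6 numbers
p_same_pit = (1 - six) * Fraction(1, 6)   # 4. Player 2 matches Player 1

# The four cases are exhaustive and mutually exclusive:
assert p_both_six + p_one_six + p_two_pits + p_same_pit == 1
print(p_both_six, p_one_six, p_two_pits, p_same_pit)  # 1/36 5/18 5/9 5/36
```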
We can thus calculate the probability to win as start player in a 4-stick game (assuming both players follow the optimal winning strategy) as
$$
\frac{1}{36} p_{\text{win}}^{\pi^*}(0/3/3/1) + \frac{10}{36} p_{\text{win}}^{\pi^*}(1/3/3/1) + \frac{20}{36} p_{\text{win}}^{\pi^*}(2/3/3/1) + \frac{5}{36} p_{\text{win}}^{\pi^*}(0/3/5/1).
$$
We compute this in R with reticulate calls to the Python function calculating the winning probability for a given state:
```{r p_beginner_wins, echo=TRUE}
# States and their probabilities after round 1
round2 <- tibble(
  state = list(c(0, 3, 3, 1), c(1, 3, 3, 1), c(2, 3, 3, 1), c(0, 3, 5, 1)),
  prob = c(1, 10, 20, 5) / 36
)

# Compute the probability to win from round 2 onwards for a given state
p_win <- function(s) {
  py$optimal_strategy(as.integer(sum(s))) %>%
    filter(i == s[1] & j == s[2] & k == s[3] & l == s[4]) %>%
    pull(prob)
}

# Compute the winning probability for each state and take the weighted
# sum to get the overall probability to win as start player
(res <- round2 %>%
  rowwise() %>%
  mutate(p_win = p_win(state)) %>%
  ungroup() %>%
  summarise(p_win = sum(prob * p_win)))
```
This corroborates the previous finding that the start player has a `r scales::percent(res$p_win)` probability of winning the game when both players play the optimal strategy.
## Discussion
Dynamic programming is an important foundation for reinforcement learning. We can use it to find optimal strategies in fully observable stochastic environments. For environments with reasonably sized state spaces and a known stochastic model for the transitions between states, dynamic programming is a direct method to obtain exact solutions. No complex approximate solution methods, such as those discussed in Part 2 of @sutton_burto2020, are needed.
## Appendix: Python code
The Python code for value iteration can be found as [`value_iteration.py`](`r paste0("{{ site.baseurl }}/",knitr::opts_chunk$get("fig.path"), "value_iteration.py")`).
```{python, file=file.path(filePath, "value_iteration.py"), eval=FALSE}
```