---
layout: post
title: "Dynamic Programming for Super Six"
tags: [rstats, python, dynamic programming, reinforcement learning, game theory]
comments: yes
editor_options:
  chunk_output_type: console
bibliography: /Users/hoehle/Literature/Bibtex/jabref.bib
---

```{r,include=FALSE,echo=FALSE,message=FALSE}
## If default fig.path, then set it.
if (knitr::opts_chunk$get("fig.path") == "figure/") {
  knitr::opts_knit$set(base.dir = '/Users/hoehle/Sandbox/Blog/')
  knitr::opts_chunk$set(fig.path = "figure/source/2023-03-20-dynprog/")
}
fullFigPath <- paste0(knitr::opts_knit$get("base.dir"), knitr::opts_chunk$get("fig.path"))
filePath <- file.path("", "Users", "hoehle", "Sandbox", "Blog", "figure", "source", "2023-03-20-dynprog")

knitr::opts_chunk$set(echo = FALSE, fig.width=8, fig.height=3.5, fig.cap='', fig.align='center', dpi=72*2, echo=TRUE)#, global.par = TRUE)
options(width=150, scipen=1e3)
```

```{r, echo=FALSE}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(reticulate))
reticulate::use_virtualenv("r-reticulate")

# Non CRAN packages
# devtools::install_github("hadley/emo")

## Configuration
options(knitr.table.format = "html")
theme_set(theme_minimal())
# if there are more than n rows in the tibble, print only the first m rows.
options(tibble.print_max = 10, tibble.print_min = 5)

# Fix seed value for the post
set.seed(123)
```

## Abstract:

We use dynamic programming to find the optimal strategy for playing Super Six when there are 2 players. To test the cross-language capabilities of Rmarkdown we solve the task by embedding our Python implementation of the value iteration algorithm using the reticulate R-package.
```{r,results='asis',echo=FALSE}
cat(paste0("![]({{ site.baseurl }}/",knitr::opts_chunk$get("fig.path"),"optim_strategy-1.png"),")")
```

This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/). The [R-markdown source code](`r str_c("https://raw.githubusercontent.com/mhoehle/hoehleatsu.github.io/master/_source/",current_input())`) of this blog is available under a [GNU General Public License (GPL v3)](https://www.gnu.org/licenses/gpl-3.0.html) license from GitHub.

## Introduction

The present post finds the optimal game strategy for the dice game Super Six with two players by using dynamic programming to solve the corresponding Markov Decision Process (MDP). We use the `reticulate` R package to run our Python implementation of the value iteration algorithm within the Rmarkdown document. For more details about Super Six, see the previous blog post [*How to Win a Game (or More) of Super Six*](https://mhoehle.github.io/blog/2023/03/13/super6.html); Appendix 1 of that post discussed optimal strategies. The present post is a natural successor, because it shows the details of calculating optimal strategies for two-player games by value iteration [@russell_norvig2020; @sutton_burto2020]. This provides a clearer way to arrive at the strategy than, e.g., @jehn2021 or the C++ program by [Nuno Hultberg](https://github.com/mhoehle/mhoehle.github.io/blob/main/blog/figure/source/2023-03-13-super6/hultberg2.cpp).

## Notation

In mathematical notation the current state $s$ corresponds to the $i/j/k$ situation, with $0\leq i\leq 5$, $j\geq 0$ and $k\geq 0$. Here, $i$ denotes the number of occupied pits in the lid, $j$ the number of sticks of the player about to move (Player 1, say) and $k$ the number of sticks of the opponent (Player 2). In order to distinguish between the first turn of a player in a round (where the dice has to be thrown) and the subsequent turns, where the player can decide whether to throw the dice or stop, we add a fourth component $l\in \{0,1\}$ to the state, which is a binary indicator of whether the move is a forced move (1) or not (0). The action space for a state $i/j/k/l$ with $l=1$ is $\mathcal{A}(s) = \{\texttt{throw}\}$, whereas for $l=0$ it is $\mathcal{A}(s) = \{\texttt{throw},\texttt{stop}\}$. If $j=0$ while $k>0$ then Player 1 won. If $k=0$ while $j>0$ then Player 2 won. In order to handle this *episodic task* [@sutton_burto2020] we introduce a terminal state, which provides no additional future reward.

### State Transition

The state transition probabilities of the Markov decision process are given as follows. For each throw of the dice from an $i/j/k/l$ state there are three cases to consider:

1. you roll a six (probability $\frac{1}{6}$) and move to the $i/(j-1)/k/0$ state,
2. (if $0\leq i< 5$) you hit a free spot (probability $\frac{5-i}{6}$) and move to the $(i+1)/(j-1)/k/0$ state,
3. (if $0 < i \leq 5$) you hit an occupied spot (probability $\frac{i}{6}$): you have to take the stick from the lid and add it to your pile, your turn ends, and the game continues from the opponent's perspective in the $(i-1)/k/(j+1)/1$ state.

### Reward

The reward obtained when arriving in state $s'=(i,j,k,l)$ is

$$
R(s'=(i,j,k,l)) = \left\{
\begin{array}{ll}
1 & \text{if $j=0$ and $k>0$} \\
-1 & \text{if $k=0$ and $j>0$} \\
0 & \text{otherwise}
\end{array}
\right.
$$

Side note: Since linear transformations of the reward function leave the optimal strategy unchanged [Sect. 17.1.3, @russell_norvig2020], it is equally possible to work with the rewards $R^*(s'=(i,j,k,l)) = \frac{1}{2}R(s'=(i,j,k,l)) + \frac{1}{2}$. This would imply the value 1 for winning and 0 for losing and translates the expected reward computations directly into computations of expected winning probabilities, which is what @jehn2021 computed.

### Bellman equations

The expected utility obtained from following strategy $\pi$ is defined as
$$
U^{\pi}(s) = E\left[ \sum_{t=0}^\infty R(S_t, \pi(S_t), S_{t+1})\right].
$$
The terminal state with future reward 0 ensures that $U^\pi(s)<\infty$. No discounting of the rewards is thus used in our approach.
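
Before turning to the optimization, it may help to see the transition and reward structure written out in code. The following is a minimal Python sketch of one possible encoding; it is *not* the `value_iteration.py` implementation used later (see the Appendix), and the helper names `reward`, `actions` and `transitions` are made up for illustration. States are encoded as tuples `(i, j, k, l)`, and each successor carries a `sign` marking whether it is valued from the opponent's perspective — the convention behind the minus sign in the expansion of the Bellman equation below.

```{python, eval=FALSE}
def reward(state):
    """Reward R(s') when arriving in state s' = (i, j, k, l)."""
    i, j, k, l = state
    if j == 0 and k > 0:
        return 1    # the player the state is seen from has placed all sticks: win
    if k == 0 and j > 0:
        return -1   # the opponent has placed all sticks: loss
    return 0

def actions(state):
    """Available actions: a forced move (l == 1) only allows throwing."""
    return ["throw"] if state[3] == 1 else ["throw", "stop"]

def transitions(state, action):
    """Successor distribution as (probability, sign, next_state) triples.

    sign = -1 marks successors whose state (and hence value) is expressed
    from the opponent's perspective.
    """
    i, j, k, l = state
    if action == "stop":
        # Assumption: stopping voluntarily just passes the turn; no sticks move.
        return [(1.0, -1, (i, k, j, 1))]
    out = [(1/6, +1, (i, j - 1, k, 0))]                       # roll a six: stick vanishes into the hole
    if i < 5:
        out.append(((5 - i) / 6, +1, (i + 1, j - 1, k, 0)))   # free pit: stick goes into the lid
    if i > 0:
        out.append((i / 6, -1, (i - 1, k, j + 1, 1)))         # occupied pit: take the stick, turn ends
    return out
```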
Our aim is now to find a strategy $\pi^*(s)$ which optimizes this expected utility, i.e.
$$
\pi_s^* = \operatorname{argmax}_{\pi} U^{\pi}(s).
$$
For state $s$ we thus choose the action that maximizes the reward for the next step plus the expected utility of the subsequent states:
$$
\pi^*(s) = \operatorname{argmax}_{a\in \mathcal{A}(s)} \sum_{s'}P(s'| s,a) \left[ R(s,a,s') + U(s') \right].
$$
The corresponding expression for the utility of a state $s$ is also called the **Bellman equation**:
\begin{align*}
U(s) &=\max_{a\in \mathcal{A}(s)} \sum_{s'}P(s'| s,a) \left[ R(s,a,s') + U(s') \right].
\end{align*}
Hence, for a state $i/j/k/0$ the action $\texttt{throw}$ (i.e. continue throwing) leads to the following expansion of the sum term in the above equation:
$$
\begin{align*}
& \underbrace{\frac{1}{6} [R(i/j-1/k/0) + U(i/j-1/k/0)]}_{\text{stick into hole}} \\
&+ \underbrace{\frac{5-i}{6} [R(i+1/j-1/k/0) + U(i+1/j-1/k/0)]}_{\text{put stick into lid}}\\
&- \underbrace{\frac{i}{6} [R(i-1/k/j+1/1) + U(i-1/k/j+1/1)]}_{\text{take stick from lid}}
\end{align*}
$$
where the minus sign of the last term accounts for the fact that this successor state is expressed from the opponent's perspective. The action $\texttt{stop}$ simply passes the turn to the opponent, so its value is $-\left[R(i/k/j/1) + U(i/k/j/1)\right]$.

## Value Iteration

We use **value iteration** [Sect. 17.2.1, @russell_norvig2020; Sect. 4.4, @sutton_burto2020] to solve the MDP. Let $U$ and $U'$ be collections which contain the value function estimate $U(s)$ for each $s$ in the state space. We initialize $U(s)=U'(s)=0$ for all states and proceed by the following algorithm:

----

**Algorithm 1**: Value iteration

----

**repeat**

1. $U \gets U'; \quad \delta \gets 0$
2. **For all** states $s$ **do**
    a. $U'(s) \gets \max_{a\in \mathcal{A}(s)} \sum_{s'}P(s'| s,a) \left[ R(s,a,s') + U(s') \right]$
    b. **if** $|U'(s)-U(s)| > \delta$ **then** $\delta \gets |U'(s)-U(s)|$

**until** $\delta \leq \epsilon$

----
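
As an illustration, the following is a minimal, generic Python sketch of Algorithm 1 for this game, including the book-keeping of the greedy strategy. It builds on the hypothetical helpers `actions`, `transitions` and `reward` sketched above and should not be confused with the `value_iteration.py` script that produces the results below (see the Appendix).

```{python, eval=FALSE}
from itertools import product

def all_states(n_total):
    """A (slightly generous) enumeration: every state (i, j, k, l) with at most
    n_total sticks in play. Including unreachable states does no harm here."""
    return [(i, j, k, l)
            for i, j, k, l in product(range(6), range(n_total + 1),
                                      range(n_total + 1), range(2))
            if i + j + k <= n_total]

def value_iteration(states, actions, transitions, reward, eps=1e-9):
    """Algorithm 1: return the value function U and a greedy strategy pi."""
    U = {s: 0.0 for s in states}
    while True:
        delta, U_new, pi = 0.0, {}, {}
        for s in states:
            i, j, k, l = s
            if j == 0 or k == 0:              # terminal state: no future reward
                U_new[s], pi[s] = 0.0, None
                continue
            best_value, best_action = None, None
            for a in actions(s):              # 'throw' always, 'stop' only if l == 0
                q = sum(p * sign * (reward(s2) + U[s2])
                        for p, sign, s2 in transitions(s, a))
                if best_value is None or q > best_value:
                    best_value, best_action = q, a
            U_new[s], pi[s] = best_value, best_action
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta <= eps:
            return U, pi
```

A call such as `value_iteration(all_states(8), actions, transitions, reward)` should then return both the (approximate) value function and the corresponding greedy strategy for all states with at most 8 sticks in play.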

The optimal action for state $s$ is thus the action maximizing the sum term in Step 2a, i.e. $\pi(s) = \operatorname{argmax}_{a\in \mathcal{A}(s)} \sum_{s'}P(s'| s,a) \left[ R(s,a,s') + U(s') \right]$. Either one computes this at the end of the algorithm or one adds this book-keeping step as part of Step 2a in the value iteration algorithm. Technically, $\pi\approx \pi^*$, i.e. we only obtain an estimate of the optimal strategy, because of the iterative nature of the algorithm. However, one can show that with sufficiently many iterations we converge towards $\pi^*$.

## Results

Applying the value iteration Python code for $n=7$ (see code in the Appendix) gives the optimal strategy. Furthermore, we also obtain the probabilities to win from each state. We show the result for all states with at least 3 sticks in the lid; with fewer than 3 sticks in the lid the decision is always to continue throwing. In the output, the column `strategy` shows whether to continue throwing the dice (`TRUE`) or not (`FALSE`); the column `value` shows the expected value $U(s)$ and `prob` shows the expected probability to win the game, given that the opponent also follows the same optimal strategy. Note that for states with $l=1$ no choice is possible, i.e. one has to continue no matter what. For these states the strategy column is always `TRUE`.

```{python, eval=FALSE, echo=FALSE}
from pathlib import Path
print(Path.cwd())
```

```{python VALUEITER, file=file.path(filePath, "value_iteration.py"), echo=FALSE}
# reticulate::py_run_file(file.path(filePath, "value_iteration.py"))
```

```{r, echo=FALSE}
py$s_best %>% filter(i>=3)
```

One can, e.g., compare the results with the numbers in Fig. 4 of @jehn2021. The optimal strategy for a two-player game is thus to continue as long as at most three pits are filled. If four pits are filled, one would only continue if the situation is 4/1/1. If 5 pits are filled, one should always stop (if possible). The figure below illustrates the strategy for all $l=0$ states graphically: the y-axis gives $i$, whereas the label of each box is $j/k$ followed by the expected probability (with just 2 decimals) to win from this state:

```{r optim_strategy, width=10, height=2, echo=FALSE}
size_x <- 12
pi <- py$s_best %>% filter(l==0)
pi <- pi %>% mutate(x=(size_x*j+k)*size_x - 4*size_x*j^1.25,
                    y=i,
                    label=sprintf("%d/%d\n%.2f", j, k, prob))
par(mar=c(0,5,0,0))
plot(c(0,0), xlim=c(100,450), ylim=c(0,6), type="n", xlab="", ylab="Occupied pits in lid", axes=FALSE)
for (i in 1:nrow(pi)) {
  one_s <- pi[i,]
  rect(xleft=one_s[["x"]], ybottom=one_s[["y"]],
       xright=one_s[["x"]]+(size_x), one_s[["y"]]+0.8,
       col=if_else(one_s[["strategy"]], "darkolivegreen3", "salmon2"))
  text(one_s[["x"]]+size_x/2, one_s[["y"]]+0.4, one_s[["label"]], cex=0.5, adj=c(0.5,0.5))
}
axis(2, at=0:5+0.5, labels=0:5)
legend(x="topright", fill=c("darkolivegreen3", "salmon2"),
       legend=c("continue", "stop"), title="Action:", cex=1)
```

### Probability to Win as Start-Player

One surprising finding from the simulations in [How to Win a Game (or More) of Super Six](https://mhoehle.github.io/blog/2023/03/13/super6.html) was that the start player had a big winning advantage. We use the above algorithm to analytically calculate the probability that the starting player wins the game in a situation where each player initially has 4 sticks. To compute the winning probability we need to keep in mind that in round 1 we have the additional restriction that only one throw of the dice is allowed for each player.
As a consequence, we need to treat the outcome of round 1 separately. There are four possible states that Player 1 starts from in round 2:

1. Both players throw a 6 in round 1 ($p=\frac{1}{36}$); round 2 continues from state $0/3/3/1$.
2. One of the players (but not both!) throws a six ($p=\frac{10}{36}$); round 2 continues from state $1/3/3/1$.
3. Both sticks from round 1 end up in the lid ($p=\frac{5}{6}\cdot\frac{4}{6}=\frac{20}{36}$); round 2 continues from state $2/3/3/1$.
4. Player 1 throws anything but a 6 and Player 2 throws the same number ($p=\frac{5}{6}\cdot\frac{1}{6}=\frac{5}{36}$); round 2 continues from state $0/3/5/1$.

We can thus calculate the probability to win as start player in a 4-stick game (assuming both players follow the optimal winning strategy) as
$$
\frac{1}{36} p_{\text{win}}^{\pi^*}(0/3/3/1) + \frac{10}{36} p_{\text{win}}^{\pi^*}(1/3/3/1) + \frac{20}{36} p_{\text{win}}^{\pi^*}(2/3/3/1) + \frac{5}{36} p_{\text{win}}^{\pi^*}(0/3/5/1).
$$
We compute this in R with reticulate calls to the Python function calculating the winning probability for a given state:

```{r p_beginner_wins, echo=TRUE}
# States and their probabilities after round 1
round2 <- tibble(
  state = list(c(0,3,3,1), c(1,3,3,1), c(2,3,3,1), c(0,3,5,1)),
  prob  = c(1, 10, 20, 5) / 36
)

# Compute the probability to win from round 2 on for a given state
p_win <- function(s) {
  py$optimal_strategy(as.integer(sum(s))) %>%
    filter(i == s[1] & j == s[2] & k == s[3] & l == s[4]) %>%
    pull(prob)
}

# Compute winning probabilities for each state and do the weighted
# sum to get the overall probability to win as start player
(res <- round2 %>% rowwise() %>%
   mutate(p_win = p_win(state)) %>%
   ungroup() %>%
   #unnest_wider(col=state, names_sep=".") %>%
   summarise(p_win = sum(prob*p_win)))
```

This corroborates the previous finding that the start player has a `r scales::percent(res %>% as.numeric)` probability to win the game when both players play the optimal strategy.

## Discussion

Dynamic programming is an important foundation for reinforcement learning. We can use it to find optimal strategies in fully observable stochastic environments. For environments with reasonably sized state spaces and a known stochastic model for the transitions between states, dynamic programming is a direct method to obtain exact solutions. No complex approximate solutions, as e.g. discussed in Part 2 of @sutton_burto2020, are needed.

## Appendix: Python code

The Python code for value iteration can be found as [`value_iteration.py`](`r paste0("{{ site.baseurl }}/",knitr::opts_chunk$get("fig.path"), "value_iteration.py")`).

```{python}
<<VALUEITER>>
```

## References