{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Toward $\\textit{à la carte}$ reprogramming: a least action path approach\n",
    "The least action path (LAP) principle, first proposed as early as 1744 by Maupertuis $\\citb{terrall}$ and famously advocated by Feynman with his reformulation of quantum mechanics via the path integral of the classical Hamilton action $\\citb{fey65}$, has previously been used in predicting the optimal transition path of cell fate transition for simplistic and designed systems $\\citb{Qiu2012-yt, Wang2014-zc, Wang8257}$ We reason that with  the reconstructed continuous, differentiable vector field, we can extend the LAP approach to real datasets in transcriptomic space to computationally explore optimal paths for differentiation and reprogramming (dedifferentiation and transdifferentiation), which then helps identify key transcription factors whose expression levels vary strongest along these paths. \n",
    "\n",
    "The transcriptomic vector field encodes dynamical information of pathways connecting different cell types. For the development, the developmental paths, connecting progenitors and stable cell types, such as HSCs and megakaryocytes (\\textbf{Fig. 7B}), are characterized by vector field streamlines, where cells need to overcome little to no dynamical barrier. On the contrary, the reversed process, dedifferentiation, requires cells to migrate against the streamline, overcome the developmental barrier to eventually become progenitors and reacquire multipotency. Stable cell types are attractors that are separated by attractor barriers (\\textbf{Box 1}, \\textbf{Fig. 1}), and during transdifferentiation processes, cells from one stable attractor overcome these barriers and transverse into another stable attractor. Overcoming the dedifferentiation/transdifferentiation barriers driven by stochasticity are \\textit{rare transitions}, as evidenced by extremely low experimental reprogramming efficiency, as low as 0.001–0.01\\% \\citb{Merkl2013}. This is why reprogramming factors are generally needed, which reshape the developmental landscape in favor of the reprogramming transitions. In the context of cell state transitions, there are two seemingly similar but fundamentally different concepts worth additional clarification:\n",
    "\n",
    "\n",
    "- transition time: The expected waiting time for a cell to initiate and finish the transition between two states, regardless of the path it takes. This corresponds to the experimentally measured time for one cell type to commit into another.\n",
    "    \n",
    "- traversal time: The time the cell spends traveling along a specific path. Theoretically, this is the time for a single cell to complete the cell type conversion once the cell has decided on the commitment.\n",
    "\n",
    "\n",
    "Given the vector field function, $\\boldsymbol f$, optimal pathways of  cell fate conversion can be mathematically analyzed by least action paths (LAPs) \\citb{freidlin2012random, onsager1953, Maier1997}. The action is defined as:\n",
    "$\\begin{align*}\n",
    "    \\newcommand{\\trp}{\\mathsf{T}}\n",
    "    S_T(\\boldsymbol x) = \\frac{1}{2} \\int_{0}^{T}\\mathrm dt {\\Big(\\boldsymbol v(t) - \\boldsymbol f\\big(\\boldsymbol x(t)\\big)\\Big)}^\\trp \\boldsymbol{D}^{-1}\n",
    "    \\Big(\\boldsymbol v(t) - \\boldsymbol f\\big(\\boldsymbol x(t)\\big)\\Big),\n",
    "\\end{align*}$\n",
    "\n",
    "$\\boldsymbol x$ is a path and $\\boldsymbol v$ its tangential velocity (the path is parametrized by time $t$, so $\\boldsymbol v(t) = \\dot{\\boldsymbol x}(t)$). $\\boldsymbol{D}$ is the diffusion coefficient matrix accounting for the stochasticity of gene expression, and for simplicity here we assume it to be a constant. $T$ is the time needed for a cell to traverse the path. By this definition, a path that strictly follows a streamline of a vector field whose tangential velocity also equals the evaluated velocity of the vector field has zero action, whereas any deviation increases action. In other words, developmental processes are (mostly) a spontaneous process and driven by intrinsic cell states, whereas dedifferentiation requires external forces such as ectopic expression of exogenous TFs or specific chemical inductions. \n",
    "\n",
    "Computationally, given the starting and end cell states $\\boldsymbol x_0$ and $\\boldsymbol x_{n}$, such as HSCs and megakaryocytes (\\textbf{Fig. 7B}), and a specific traversal time $T$, the LAP can be found by discretizing the path as a sequence of points $P=\\{\\boldsymbol x_0, \\boldsymbol x_1, \\dots, \\boldsymbol x_n\\}$, which forms $n$ line segments. For each line segment, the discrete tangential velocity can be calculated as $\\boldsymbol v_k = (\\boldsymbol x_k-\\boldsymbol x_{k-1})/\\Delta t$, where $\\Delta t = T/n$. The action along the discrete path is defined as \\citb{Perez-Carrasco2016, Tang2017}:\n",
    "\\begin{align*} \n",
    "    S_T(P) = \\frac{1}{2D}\\sum_{k=1}^{n} \\Big(\\boldsymbol v_k - \\boldsymbol f(\\boldsymbol y_k)\\Big)^2\\Delta t ,\n",
    "\\end{align*}\n",
    "where $y_k$ are the middle points of the line segments, i.e., $\\boldsymbol y_k = (\\boldsymbol x_{k-1} + \\boldsymbol x_k)/2$. Given a traversal time $T$, the LAP is a path such that:\n",
    "\\begin{align*} \n",
    "    P^* = \\underset{P}{\\operatorname{argmin}}\\ S_T(P) = \\underset{P}{\\operatorname{argmin}}\\ \\frac{1}{2D}\\sum_{k=1}^{n} \\Big(\\boldsymbol v_k - \\boldsymbol f(\\boldsymbol y_k)\\Big)^2\\Delta t .\n",
    "\\end{align*}\n",
    "To obtain the global LAP, the optimal traversal time $T^*$ is determined as:\n",
    "\\begin{align*} \n",
    "    T^* = \\underset{T}{\\operatorname{argmin}}\\ S_T(P)\n",
    "\\end{align*}\n",
    "\n",
    "The algorithm discretizes the path as a sequence of points, $P=\\{\\boldsymbol x_0, \\boldsymbol x_1, \\dots, \\boldsymbol x_n\\}$, which forms $n$ line segments. For each line segment, the discrete tangential velocity can be calculated as $\\boldsymbol v_k=(\\boldsymbol x_k - \\boldsymbol x_{k-1})/\\Delta t$, where $\\Delta t$ is the time step for the cell to move from $\\boldsymbol x_{k-1}$. In addition to the deterministic vector field, we also assume a certain degree of stochasticity in the system:\n",
    "\\begin{align*}\n",
    "    \\dot{\\boldsymbol x} = \\boldsymbol f(\\boldsymbol x) + \\sigma \\boldsymbol\\eta(t),\n",
    "\\end{align*}\n",
    "where $\\boldsymbol\\eta(t)$ is a stochastic white noise and $\\boldsymbol\\sigma$ the size of it. The action $S$ along the discrete path is defined as (Perez-Carrasco et al., 2016):\n",
    "\\begin{align*}\n",
    "    S(P, \\Delta t) = \\frac{1}{2D}\\sum_{k=1}^{n}\\Big(\\boldsymbol v_k - \\boldsymbol f(\\boldsymbol y_k)\\Big)^2\\Delta t,\n",
    "\\end{align*}\n",
    "where $\\boldsymbol y_k$ are the middle points of the line segments, i.e., $\\boldsymbol y_k = (\\boldsymbol x_{k-1} + \\boldsymbol x_k)/2$. We have also assumed the diffusion matrix to be a constant $D$, such that $D=\\sigma^2/2$. It is intuitive that a path whose tangential velocities $\\boldsymbol v$ align with the vector field has smaller action than paths that do not. The LAP is a path such that:\n",
    "\\begin{align*}\n",
    "    P^* = \\underset{P, \\Delta t}{\\operatorname{argmin}} S(P, \\Delta t) = \\underset{P, \\Delta t}{\\operatorname{argmin}}\\frac{1}{2D}\\sum_{k=1}^{n}\\Big(\\boldsymbol v_k - \\boldsymbol f(\\boldsymbol y_k)\\Big)^2\\Delta t,\n",
    "\\end{align*}\n",
    "The algorithm for finding the LAP therefore consists of two steps:  \n",
    "Minimization of the action by varying the time step. The optimal time step  given a fixed path  is a simple univariate least square minimization, i.e.:\n",
    "$ \\begin{align*}\n",
    "        \\Delta t^* = \\underset{\\Delta t}{\\operatorname{argmin}}\\frac{1}{2D}\\sum_{k=1}^{n}\\Big(\\frac{\\boldsymbol x_k - \\boldsymbol x_{k-1}}{\\Delta t} - \\boldsymbol f(\\boldsymbol y_k)\\Big)^2\\Delta t,\n",
    "    \\end{align*}$\n",
    "Minimization of the action by varying the path without moving the starting and end points. The optimal path  given a fixed time step  is found by:\n",
    "$ \\begin{align*}\n",
    "        P^* = \\underset{\\{\\boldsymbol x_1, \\boldsymbol x_2, \\dots, \\boldsymbol x_{n-1}\\}}{\\operatorname{argmin}}\\frac{1}{2D}\\sum_{k=1}^{n}\\Big(\\frac{\\boldsymbol x_k - \\boldsymbol x_{k-1}}{\\Delta t} - \\boldsymbol f\\big(\\frac{\\boldsymbol x_{k-1} + \\boldsymbol x_k}{2}\\big)\\Big)^2\\Delta t,\n",
    "    \\end{align*}$\n",
    "For a $d$-dimensional vector field, the number of variables in the above optimization problem is $d\\times n$. To mitigate the computational cost, the Jacobian of the action w.r.t. the path (more specifically, the a-th component of the $k$-th point) is analytically computed:\n",
    "$\\begin{align*}    \\frac{\\partial{S}}{\\partial{x_k^a}} =& \\frac{1}{D}\\Big(v_k^a - v_{k+1}^a + f^a(\\boldsymbol y_{k+1}) - f^a(\\boldsymbol y_k)\\Big)\\\\\n",
    "    &-\\frac{1}{2D}\\Big(\\big(\\boldsymbol v_{k+1} - \\boldsymbol f(\\boldsymbol x_{k+1})\\big) \\cdot \\frac{\\partial{f}}{\\partial{x^a}}\\Big|_{\\boldsymbol x_{k+1}} + \\big(\\boldsymbol v_k - \\boldsymbol f(\\boldsymbol x_k)\\big)\\cdot\\frac{\\partial f}{\\partial{x^a}}\\Big|_{\\boldsymbol x_k}\\Big)\n",
    "    \\end{align*}$\n",
    "Note that the partial derivative of the vector field  is the $a$-th row of the Jacobian of the vector field. With the analytical Jacobian, the computation efficiency of the LAP optimization improves tremendously, making the LAP calculation feasible to operate in high-dimensional space, such as the top 30 PCs.\t\n",
    "\n",
    "The LAP is found by iterating between the two steps, and empirically we found that the path converges in two or three iterations. By default, the LAP optimization is initialized with the interpolated shortest path on the kNN graph of cells.\n",
    "\n",
    "Notably, when LAPs are calculated in the PCA space, we can transform them back to the original gene expression space to predict the full transcriptomic kinetics along the optimal path, inspect waves of those kinetics along the path, and do so in absolute time units when the vector field used is based on tscRNA-seq.\n",
    "\n",
    "For rare transitions with $S_{T^*} \\gg 0$ (e.g., dedifferentiation and transdifferentiation), the transition rate (number of transitions per unit time) is proportional to the exponential of actions of all paths. The Freidlin–Wentzell theorem dictates that the LAP with the minimal traversal time (which will be referred to as the optimal path below) contributes the most to this transition rate \\citb{freidlin2012random, onsager1953, Maier1997, Aurell2002}:\n",
    "\\begin{align*}\n",
    "    R(A\\rightarrow B) \\approx C\\exp(-S_{T^*}),\n",
    "\\end{align*}\n",
    "where $A$ and $B$ are two cell types, $S_{T^*}$ the action of the optimal path, and $C$ a proportional factor. Furthermore, the transition time, or more specifically the mean first passage time (MFPT), is related to the transition rate:\n",
    "\\begin{align*}\n",
    "    \\mathrm{MFPT} = \\frac{1}{R(A\\rightarrow B)}\n",
    "\\end{align*}\n",
    "Therefore, the action of the optimal path predicts both the likelihood and transition time for such rare transitions. Again, most reprogramming experiments take a few weeks or months, depending on the exact initial and terminal cell states \\cite{takahashi2006induction}. \n",
    "\n",
    "For natural transitions between points that are connected by the vector field streamlines (e.g., from a repulsor to an adjacent attractor), the actions of LAPs, within a certain range of $T$, are all zero, because a path following the streamline downstream is a LAP with zero action. The above approximation that the LAP contributes the most to the transition rate no longer applies. Differentiation processes are often close to such natural transitions, and the action of a differentiation LAP cannot tell us any information on the transition rate. However, LAPs are still the most probable paths for cells to take, as they are optimized to follow the streamline of the vector field. The waiting time for the cell to initiate the transition is negligible in this case, so the transition time can be approximated by the traversal time of the LAP.\n",
    "\n",
    "In addition to the computation of transition time and traversal time, analyzing gene expression variations along LAPs provides essential information on regulatory genes, and their dynamics, during cell fate transitions. We calculate the mean squared displacement (MSD) for every gene $i$ along the optimal path:\n",
    "\\begin{align*}\n",
    "    \\mathrm{MSD}_i = \\sum_{t=0}^{T} \\big(y_i(t) - y_i(0)\\big)^2\n",
    "\\end{align*}\n",
    "Genes with large MSD are potentially genes that regulate the corresponding transitions.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{bibliography}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "149170a34b80921e9f067d63e9cd48cc3daa8835f3726c438cfa63a39a78a9cb"
  },
  "kernelspec": {
   "display_name": "Python 3.9.7 64-bit ('dynamo-dev': conda)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}