{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A tiny regex challenge solved without another regex\n",
    "\n",
    "This notebook presents a small challenge a friend of mine asked me (in Python).\n",
    "I'll write Python code valid for versions $\\geq$ 3.6, and to show of I use the [`typing`](https://docs.python.org/3/library/typing.html) module to have type hints."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:15:56.881685Z",
     "start_time": "2020-10-13T10:15:56.871310Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.6.9 (default, Jul 17 2020, 12:50:27) \n",
      "[GCC 8.4.0]\n"
     ]
    }
   ],
   "source": [
    "import sys\n",
    "print(sys.version)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:17:20.831024Z",
     "start_time": "2020-10-13T10:17:20.827758Z"
    }
   },
   "outputs": [],
   "source": [
    "from typing import List, Tuple\n",
    "\n",
    "Position = int\n",
    "Interval = Tuple[Position, Position]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction: the problem a friend of mine asked me"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:13:36.314530Z",
     "start_time": "2020-10-13T10:13:36.308045Z"
    }
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:17:31.967836Z",
     "start_time": "2020-10-13T10:17:31.948168Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def bad_events(pattern: str, string: str) -> List[Interval]:\n",
    "    # m.span(1) = (m.start(1), m.end(1))\n",
    "    return [m.span(1) for m in re.finditer(f\"(?=({pattern}))\", string)]\n",
    "\n",
    "pat = \"aca\"\n",
    "strng = \"acacavcacabacacbbacacazdbacaca\"\n",
    "\n",
    "# Do you know if there is a regex trick to obtain\n",
    "# [(0, 5), (7, 10), (11, 14), (17, 22), (25, 30)]\n",
    "# instead of\n",
    "bad_events(pat, strng)\n",
    "# [(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]\n",
    "# ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The solution I came up with\n",
    "\n",
    "Let's write a simple function that will read this list of intervals, and compress the ones that are not disjoint.\n",
    "\n",
    "For instance, when reading `[(0, 3), (2, 5)]`, the second interval is not disjoint from the first one, so both can be compressed to `(0, 5)`, which is disjoint from the next one `(7, 10)`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Let's test (in constant time wrt $n$ number of intervals) if two consecutive intervals are disjoint or not:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:29:15.270638Z",
     "start_time": "2020-10-13T10:29:15.256255Z"
    }
   },
   "outputs": [],
   "source": [
    "def are_not_disjoint(interval1: Interval, interval2: Interval) -> bool:\n",
    "    x1, y1 = interval1\n",
    "    assert x1 <= y1, f\"Error: interval = {intervals1} is not a valid interval.\"\n",
    "    x2, y2 = interval2\n",
    "    assert x2 <= y2, f\"Error: interval = {intervals2} is not a valid interval.\"\n",
    "    if x1 <= x2 <= y1 <= y2:   # interval1 finishes in interval2\n",
    "        return True\n",
    "    elif x2 <= x1 <= y2 <= y1: # interval2 finishes in interval1\n",
    "        return True\n",
    "    elif x1 <= x2 <= y2 <= y1: # interval2 is included in interval1\n",
    "        return True\n",
    "    elif x2 <= x1 <= y1 <= y2: # interval1 is included in interval2\n",
    "        return True\n",
    "    return False"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:29:34.500917Z",
     "start_time": "2020-10-13T10:29:34.491246Z"
    }
   },
   "outputs": [],
   "source": [
    "assert are_not_disjoint((0, 3), (2, 5))  # True\n",
    "assert not are_not_disjoint((0, 5), (7, 10))  # False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Let's compute the union of two consecutive intervals, if they are not disjoint: (again in constant time)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:25:47.218268Z",
     "start_time": "2020-10-13T10:25:47.209122Z"
    }
   },
   "outputs": [],
   "source": [
    "def union_intervals(interval1: Interval, interval2: Interval) -> bool:\n",
    "    x1, y1 = interval1\n",
    "    assert x1 <= y1, f\"Error: interval = {intervals1} is not a valid interval.\"\n",
    "    x2, y2 = interval2\n",
    "    assert x2 <= y2, f\"Error: interval = {intervals2} is not a valid interval.\"\n",
    "    return (min(x1, x2), max(y1, y2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:25:54.352949Z",
     "start_time": "2020-10-13T10:25:54.340977Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0, 5)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "union_intervals((0, 3), (2, 5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- And now we are reading to compress the list of intervals (in linear time):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:39:16.779225Z",
     "start_time": "2020-10-13T10:39:16.761417Z"
    }
   },
   "outputs": [],
   "source": [
    "def compress_intervals(intervals: List[Interval]) -> List[Interval]:\n",
    "    intervals_after_compression: List[Interval] = []\n",
    "    n = len(intervals)\n",
    "    assert n > 0\n",
    "    current_interval = intervals[0]   # eg (0, 3)\n",
    "    i = 1\n",
    "    # as long as we can read another interval in the list\n",
    "    while i < n:  # ==> O(n) as the inside of the loop is O(1)\n",
    "        next_interval = intervals[i]  # eg (2, 5)\n",
    "        if are_not_disjoint(current_interval, next_interval):\n",
    "            # eg (0, 3) and (2, 5) -> (0, 5)\n",
    "            current_interval = union_intervals(current_interval, next_interval)\n",
    "        else:\n",
    "            # eg (0, 5) and (7, 10) -> (0, 5) is added,\n",
    "            intervals_after_compression.append(current_interval)\n",
    "            # and current_interval = next_interval = (7, 10)\n",
    "            current_interval = next_interval\n",
    "        i += 1\n",
    "    # we add the last current interval if it was not yet added\n",
    "    if current_interval not in intervals_after_compression:\n",
    "        intervals_after_compression.append(current_interval)\n",
    "    return intervals_after_compression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:39:18.078802Z",
     "start_time": "2020-10-13T10:39:18.067726Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]\n"
     ]
    }
   ],
   "source": [
    "# Do you know if there is a regex trick to obtain\n",
    "# [(0, 5), (7, 10), (11, 14), (17, 22), (25, 30)]\n",
    "# instead of\n",
    "intervals = bad_events(pat, strng)\n",
    "print(intervals)\n",
    "# [(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]\n",
    "# ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:37:38.636376Z",
     "start_time": "2020-10-13T10:37:38.631541Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0, 5), (7, 10), (11, 14), (17, 22), (25, 30)]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compress_intervals(intervals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- So now we can write the requested function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:40:04.143256Z",
     "start_time": "2020-10-13T10:40:04.138945Z"
    }
   },
   "outputs": [],
   "source": [
    "def bad_events_compressed(pat: str, strng: str) -> List[Interval]:\n",
    "    intervals1 = bad_events(pat, strng)\n",
    "    intervals2 = compress_intervals(intervals1)\n",
    "    return intervals2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:41:19.652742Z",
     "start_time": "2020-10-13T10:41:19.644418Z"
    }
   },
   "outputs": [],
   "source": [
    "def test(pat: str, strng: str) -> None:\n",
    "    print(f\"For pattern {pat} and string {strng}, the bad events uncompressed are:\\n{bad_events(pat, strng)}\\nand the bad events compressed are:\\n{bad_events_compressed(pat, strng)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:41:22.726773Z",
     "start_time": "2020-10-13T10:41:22.716726Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "For pattern aca and string acacavcacabacacbbacacazdbacaca, the bad events uncompressed are:\n",
      "[(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]\n",
      "and the bad events compressed are:\n",
      "[(0, 5), (7, 10), (11, 14), (17, 22), (25, 30)]\n"
     ]
    }
   ],
   "source": [
    "test(pat, strng)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:41:41.238414Z",
     "start_time": "2020-10-13T10:41:41.227023Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "For pattern acab and string acabacabacabacacavcacabacacbbacacazdbacacaacacavcacabacacbbacacazdbacacaacabacab, the bad events uncompressed are:\n",
      "[(0, 4), (4, 8), (8, 12), (19, 23), (49, 53), (72, 76), (76, 80)]\n",
      "and the bad events compressed are:\n",
      "[(0, 12), (19, 23), (49, 53), (72, 80)]\n"
     ]
    }
   ],
   "source": [
    "test(\"acab\", \"acabacabacabacacavcacabacacbbacacazdbacacaacacavcacabacacbbacacazdbacacaacabacab\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-13T10:42:06.988200Z",
     "start_time": "2020-10-13T10:42:06.978483Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "For pattern merci and string mercimerciderienmercimerki, the bad events uncompressed are:\n",
      "[(0, 5), (5, 10), (16, 21)]\n",
      "and the bad events compressed are:\n",
      "[(0, 10), (16, 21)]\n"
     ]
    }
   ],
   "source": [
    "test(\"merci\", \"mercimerciderienmercimerki\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "It was fun!\n",
    "\n",
    "> See [GitHub.com/Naereen/notebooks](https://github.com/Naereen/notebooks/) for other notebooks!\n",
    "> This one and all the others I wrote are open-source [under the MIT License](https://lbesson.mit-license.org/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}