{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A tiny regex challenge solved without another regex\n", "\n", "This notebook presents a small challenge a friend of mine asked me (in Python).\n", "I'll write Python code valid for versions $\\geq$ 3.6, and to show of I use the [`typing`](https://docs.python.org/3/library/typing.html) module to have type hints." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:15:56.881685Z", "start_time": "2020-10-13T10:15:56.871310Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.6.9 (default, Jul 17 2020, 12:50:27) \n", "[GCC 8.4.0]\n" ] } ], "source": [ "import sys\n", "print(sys.version)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:17:20.831024Z", "start_time": "2020-10-13T10:17:20.827758Z" } }, "outputs": [], "source": [ "from typing import List, Tuple\n", "\n", "Position = int\n", "Interval = Tuple[Position, Position]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction: the problem a friend of mine asked me" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:13:36.314530Z", "start_time": "2020-10-13T10:13:36.308045Z" } }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:17:31.967836Z", "start_time": "2020-10-13T10:17:31.948168Z" } }, "outputs": [ { "data": { "text/plain": [ "[(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def bad_events(pattern: str, string: str) -> List[Interval]:\n", " # m.span(1) = (m.start(1), m.end(1))\n", " return [m.span(1) for m in re.finditer(f\"(?=({pattern}))\", string)]\n", "\n", "pat = \"aca\"\n", "strng = \"acacavcacabacacbbacacazdbacaca\"\n", "\n", "# Do you know if there is a regex trick to obtain\n", "# [(0, 5), (7, 10), (11, 14), (17, 22), (25, 30)]\n", "# instead of\n", "bad_events(pat, strng)\n", "# [(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]\n", "# ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The solution I came up with\n", "\n", "Let's write a simple function that will read this list of intervals, and compress the ones that are not disjoint.\n", "\n", "For instance, when reading `[(0, 3), (2, 5)]`, the second interval is not disjoint from the first one, so both can be compressed to `(0, 5)`, which is disjoint from the next one `(7, 10)`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Let's test (in constant time wrt $n$ number of intervals) if two consecutive intervals are disjoint or not:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:29:15.270638Z", "start_time": "2020-10-13T10:29:15.256255Z" } }, "outputs": [], "source": [ "def are_not_disjoint(interval1: Interval, interval2: Interval) -> bool:\n", " x1, y1 = interval1\n", " assert x1 <= y1, f\"Error: interval = {intervals1} is not a valid interval.\"\n", " x2, y2 = interval2\n", " assert x2 <= y2, f\"Error: interval = {intervals2} is not a valid interval.\"\n", " if x1 <= x2 <= y1 <= y2: # interval1 finishes in interval2\n", " return True\n", " elif x2 <= x1 <= y2 <= y1: # interval2 finishes in interval1\n", " return True\n", " elif x1 <= x2 <= y2 <= y1: # interval2 is included in interval1\n", " return True\n", " elif x2 <= x1 <= y1 <= y2: # interval1 is included in interval2\n", " return True\n", " return False" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:29:34.500917Z", "start_time": "2020-10-13T10:29:34.491246Z" } }, "outputs": [], "source": [ "assert are_not_disjoint((0, 3), (2, 5)) # True\n", "assert not are_not_disjoint((0, 5), (7, 10)) # False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Let's compute the union of two consecutive intervals, if they are not disjoint: (again in constant time)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:25:47.218268Z", "start_time": "2020-10-13T10:25:47.209122Z" } }, "outputs": [], "source": [ "def union_intervals(interval1: Interval, interval2: Interval) -> bool:\n", " x1, y1 = interval1\n", " assert x1 <= y1, f\"Error: interval = {intervals1} is not a valid interval.\"\n", " x2, y2 = interval2\n", " assert x2 <= y2, f\"Error: interval = {intervals2} is not a valid interval.\"\n", " return (min(x1, x2), max(y1, y2))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:25:54.352949Z", "start_time": "2020-10-13T10:25:54.340977Z" } }, "outputs": [ { "data": { "text/plain": [ "(0, 5)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "union_intervals((0, 3), (2, 5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- And now we are reading to compress the list of intervals (in linear time):" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:39:16.779225Z", "start_time": "2020-10-13T10:39:16.761417Z" } }, "outputs": [], "source": [ "def compress_intervals(intervals: List[Interval]) -> List[Interval]:\n", " intervals_after_compression: List[Interval] = []\n", " n = len(intervals)\n", " assert n > 0\n", " current_interval = intervals[0] # eg (0, 3)\n", " i = 1\n", " # as long as we can read another interval in the list\n", " while i < n: # ==> O(n) as the inside of the loop is O(1)\n", " next_interval = intervals[i] # eg (2, 5)\n", " if are_not_disjoint(current_interval, next_interval):\n", " # eg (0, 3) and (2, 5) -> (0, 5)\n", " current_interval = union_intervals(current_interval, next_interval)\n", " else:\n", " # eg (0, 5) and (7, 10) -> (0, 5) is added,\n", " intervals_after_compression.append(current_interval)\n", " # and current_interval = next_interval = (7, 10)\n", " current_interval = next_interval\n", " i += 1\n", " # we add the last current interval if it was not yet added\n", " if current_interval not in intervals_after_compression:\n", " intervals_after_compression.append(current_interval)\n", " return intervals_after_compression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Example:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:39:18.078802Z", "start_time": "2020-10-13T10:39:18.067726Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]\n" ] } ], "source": [ "# Do you know if there is a regex trick to obtain\n", "# [(0, 5), (7, 10), (11, 14), (17, 22), (25, 30)]\n", "# instead of\n", "intervals = bad_events(pat, strng)\n", "print(intervals)\n", "# [(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]\n", "# ?" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:37:38.636376Z", "start_time": "2020-10-13T10:37:38.631541Z" } }, "outputs": [ { "data": { "text/plain": [ "[(0, 5), (7, 10), (11, 14), (17, 22), (25, 30)]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compress_intervals(intervals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- So now we can write the requested function:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:40:04.143256Z", "start_time": "2020-10-13T10:40:04.138945Z" } }, "outputs": [], "source": [ "def bad_events_compressed(pat: str, strng: str) -> List[Interval]:\n", " intervals1 = bad_events(pat, strng)\n", " intervals2 = compress_intervals(intervals1)\n", " return intervals2" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:41:19.652742Z", "start_time": "2020-10-13T10:41:19.644418Z" } }, "outputs": [], "source": [ "def test(pat: str, strng: str) -> None:\n", " print(f\"For pattern {pat} and string {strng}, the bad events uncompressed are:\\n{bad_events(pat, strng)}\\nand the bad events compressed are:\\n{bad_events_compressed(pat, strng)}\")" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:41:22.726773Z", "start_time": "2020-10-13T10:41:22.716726Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "For pattern aca and string acacavcacabacacbbacacazdbacaca, the bad events uncompressed are:\n", "[(0, 3), (2, 5), (7, 10), (11, 14), (17, 20), (19, 22), (25, 28), (27, 30)]\n", "and the bad events compressed are:\n", "[(0, 5), (7, 10), (11, 14), (17, 22), (25, 30)]\n" ] } ], "source": [ "test(pat, strng)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other examples" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:41:41.238414Z", "start_time": "2020-10-13T10:41:41.227023Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "For pattern acab and string acabacabacabacacavcacabacacbbacacazdbacacaacacavcacabacacbbacacazdbacacaacabacab, the bad events uncompressed are:\n", "[(0, 4), (4, 8), (8, 12), (19, 23), (49, 53), (72, 76), (76, 80)]\n", "and the bad events compressed are:\n", "[(0, 12), (19, 23), (49, 53), (72, 80)]\n" ] } ], "source": [ "test(\"acab\", \"acabacabacabacacavcacabacacbbacacazdbacacaacacavcacabacacbbacacazdbacacaacabacab\")" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2020-10-13T10:42:06.988200Z", "start_time": "2020-10-13T10:42:06.978483Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "For pattern merci and string mercimerciderienmercimerki, the bad events uncompressed are:\n", "[(0, 5), (5, 10), (16, 21)]\n", "and the bad events compressed are:\n", "[(0, 10), (16, 21)]\n" ] } ], "source": [ "test(\"merci\", \"mercimerciderienmercimerki\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "It was fun!\n", "\n", "> See [GitHub.com/Naereen/notebooks](https://github.com/Naereen/notebooks/) for other notebooks!\n", "> This one and all the others I wrote are open-source [under the MIT License](https://lbesson.mit-license.org/)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }