1  Short study of the Lempel-Ziv complexity
1.1  Short definition
1.2  Python implementation
1.3  Tests (1/2)
1.4  Cython implementation
1.5  Numba implementation
1.6  Tests (2/2)
1.7  Benchmarks
1.8  Complexity ?
1.9  Conclusion
1.10  (Experimental) Julia implementation
1.11  Ending notes
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Short study of the Lempel-Ziv complexity\n", "\n", "In this short [Jupyter notebook](https://www.Jupyter.org/) aims at defining and explaining the [Lempel-Ziv complexity](https://en.wikipedia.org/wiki/Lempel-Ziv_complexity).\n", "\n", "[I](http://perso.crans.org/besson/) will give examples, and benchmarks of different implementations.\n", "\n", "- **Reference:** Abraham Lempel and Jacob Ziv, *« On the Complexity of Finite Sequences »*, IEEE Trans. on Information Theory, January 1976, p. 75–81, vol. 22, n°1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Short definition\n", "The Lempel-Ziv complexity is defined as the number of different substrings encountered as the stream is viewed from begining to the end.\n", "\n", "As an example:\n", "\n", "python\n", ">>> s = '1001111011000010'\n", ">>> lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010\n", "8\n", "\n", "\n", "Marking in the different substrings, this sequence $s$ has complexity $\\mathrm{Lempel}$-$\\mathrm{Ziv}(s) = 6$ because $s = 1001111011000010 = 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010$.\n", "\n", "- See the page https://en.wikipedia.org/wiki/Lempel-Ziv_complexity for more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other examples:\n", "\n", "python\n", ">>> lempel_ziv_complexity('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010\n", "7\n", ">>> lempel_ziv_complexity('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000\n", "9\n", ">>> lempel_ziv_complexity('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101\n", "10\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Python implementation" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [], "source": [ "def lempel_ziv_complexity(sequence):\n", " \"\"\"Lempel-Ziv complexity for a binary sequence, in simple Python code.\"\"\"\n", " sub_strings = set()\n", " n = len(sequence)\n", " ind = 0\n", " inc = 1\n", " # this while loop runs at most n times\n", " while True:\n", " if ind + inc > len(sequence):\n", " break\n", " # this can take some time, takes O(inc)\n", " sub_str = sequence[ind : ind + inc]\n", " # and this also, takes a O(log |size set|) in worst case\n", " # max value for inc = n / size set at the end\n", " # so worst case is that the set contains sub strings of the same size\n", " # and the worst loop takes a O(n / |S| * log(|S|))\n", " # ==> so if n/|S| is constant, it gives O(n log(n)) at the end\n", " # but if n/|S| = O(n) then it gives O(n^2)\n", " if sub_str in sub_strings:\n", " inc += 1\n", " else:\n", " sub_strings.add(sub_str)\n", " ind += inc\n", " inc = 1\n", " return len(sub_strings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Tests (1/2)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = '1001111011000010'\n", "lempel_ziv_complexity(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6.47 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n" ] } ], "source": [ "%timeit lempel_ziv_complexity(s)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lempel_ziv_complexity('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lempel_ziv_complexity('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lempel_ziv_complexity('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "8.82 µs ± 795 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n" ] } ], "source": [ "%timeit lempel_ziv_complexity('100111101100001000001010')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "def random_string(size, alphabet=\"ABCDEFGHIJKLMNOPQRSTUVWXYZ\"):\n", " return \"\".join(random.choices(alphabet, k=size))\n", "\n", "def random_binary_sequence(size):\n", " return random_string(size, alphabet=\"01\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'JYLRJBDBFGTBMLFKNYQMJXIZJKKEVONGVKUCHNSSJCYROTATJDACWKCLWDEULMZWSQHJFFCQGMRCINHRIOLMEWWEPTOUUECJWAAN'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'1110000010101011101000110010001100000011101110011010100101100110100010110110111000111000101100001010'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_string(100)\n", "random_binary_sequence(100)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "For random strings in A..Z...\n", " of sizes 10, Lempel-Ziv complexity runs in:\n", "7.64 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n", " of sizes 100, Lempel-Ziv complexity runs in:\n", "49.6 µs ± 6.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", " of sizes 1000, Lempel-Ziv complexity runs in:\n", "591 µs ± 78.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", " of sizes 10000, Lempel-Ziv complexity runs in:\n", "5.2 ms ± 770 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", " of sizes 100000, Lempel-Ziv complexity runs in:\n", "52.2 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "\n", "For random binary sequences...\n", " of sizes 10, Lempel-Ziv complexity runs in:\n", "6.04 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n", " of sizes 100, Lempel-Ziv complexity runs in:\n", "46.8 µs ± 3.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", " of sizes 1000, Lempel-Ziv complexity runs in:\n", "491 µs ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", " of sizes 10000, Lempel-Ziv complexity runs in:\n", "5.76 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", " of sizes 100000, Lempel-Ziv complexity runs in:\n", "65.3 ms ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "for (r, name) in zip(\n", " [random_string, random_binary_sequence],\n", " [\"random strings in A..Z\", \"random binary sequences\"]\n", " ):\n", " print(\"\\nFor {}...\".format(name))\n", " for n in [10, 100, 1000, 10000, 100000]:\n", " print(\" of sizes {}, Lempel-Ziv complexity runs in:\".format(n))\n", " %timeit lempel_ziv_complexity(r(n))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can start to see that the time complexity of this function seems to grow linearly as the size grows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Cython implementation\n", "As [this blog post](https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/) explains it, we can easily try to use [Cython](http://Cython.org/) in a notebook cell.\n", "\n", "> See [the Cython documentation](http://docs.cython.org/en/latest/src/quickstart/build.html#using-the-jupyter-notebook) for more information." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "%load_ext cython" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "%%cython\n", "import cython\n", "\n", "ctypedef unsigned int DTYPE_t\n", "\n", "@cython.boundscheck(False) # turn off bounds-checking for entire function, quicker but less safe\n", "def lempel_ziv_complexity_cython(str sequence not None):\n", " \"\"\"Lempel-Ziv complexity for a string, in simple Cython code (C extension).\"\"\"\n", " \n", " cdef set sub_strings = set()\n", " cdef str sub_str = \"\"\n", " cdef DTYPE_t n = len(sequence)\n", " cdef DTYPE_t ind = 0\n", " cdef DTYPE_t inc = 1\n", " while True:\n", " if ind + inc > len(sequence):\n", " break\n", " sub_str = sequence[ind : ind + inc]\n", " if sub_str in sub_strings:\n", " inc += 1\n", " else:\n", " sub_strings.add(sub_str)\n", " ind += inc\n", " inc = 1\n", " return len(sub_strings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let try it!" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = '1001111011000010'\n", "lempel_ziv_complexity_cython(s) # 1 / 0 / 01 / 11 / 10 / 110 / 00 / 010" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.97 µs ± 590 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n", "843 ns ± 38.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n" ] } ], "source": [ "%timeit lempel_ziv_complexity(s)\n", "%timeit lempel_ziv_complexity_cython(s)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lempel_ziv_complexity_cython('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lempel_ziv_complexity_cython('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lempel_ziv_complexity_cython('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for a test of the speed?" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "For random strings in A..Z...\n", " of sizes 10, Lempel-Ziv complexity in Cython runs in:\n", "3.9 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n", " of sizes 100, Lempel-Ziv complexity in Cython runs in:\n", "25.8 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", " of sizes 1000, Lempel-Ziv complexity in Cython runs in:\n", "276 µs ± 50.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", " of sizes 10000, Lempel-Ziv complexity in Cython runs in:\n", "2.43 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", " of sizes 100000, Lempel-Ziv complexity in Cython runs in:\n", "28 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n", "\n", "For random binary sequences...\n", " of sizes 10, Lempel-Ziv complexity in Cython runs in:\n", "4.06 µs ± 444 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n", " of sizes 100, Lempel-Ziv complexity in Cython runs in:\n", "29 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", " of sizes 1000, Lempel-Ziv complexity in Cython runs in:\n", "270 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", " of sizes 10000, Lempel-Ziv complexity in Cython runs in:\n", "2.74 ms ± 405 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", " of sizes 100000, Lempel-Ziv complexity in Cython runs in:\n", "30.9 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "for (r, name) in zip(\n", " [random_string, random_binary_sequence],\n", " [\"random strings in A..Z\", \"random binary sequences\"]\n", " ):\n", " print(\"\\nFor {}...\".format(name))\n", " for n in [10, 100, 1000, 10000, 100000]:\n", " print(\" of sizes {}, Lempel-Ziv complexity in Cython runs in:\".format(n))\n", " %timeit lempel_ziv_complexity_cython(r(n))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> $\\implies$ Yay! It seems faster indeed! but only x2 times faster..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Numba implementation\n", "As [this blog post](https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/) explains it, we can also try to use [Numba](http://Numba.PyData.org/) in a notebook cell." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "from numba import jit" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "@jit\n", "def lempel_ziv_complexity_numba(sequence : str) -> int:\n", " \"\"\"Lempel-Ziv complexity for a sequence, in Python code using numba.jit() for automatic speedup (hopefully).\"\"\"\n", "\n", " sub_strings = set()\n", " n : int= len(sequence)\n", "\n", " ind : int = 0\n", " inc : int = 1\n", " while True:\n", " if ind + inc > len(sequence):\n", " break\n", " sub_str : str = sequence[ind : ind + inc]\n", " if sub_str in sub_strings:\n", " inc += 1\n", " else:\n", " sub_strings.add(sub_str)\n", " ind += inc\n", " inc = 1\n", " return len(sub_strings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let try it!" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ ":1: NumbaWarning: \n", "Compilation is falling back to object mode WITH looplifting enabled because Function \"lempel_ziv_complexity_numba\" failed type inference due to: Internal error at .\n", "\n", "[1] During: resolving callee type: BoundFunction(set.add for set(undefined))\n", "[2] During: typing of call at (17)\n", "\n", "Enable logging at debug level for details.\n", "\n", "File \"\", line 17:\n", "def lempel_ziv_complexity_numba(sequence : str) -> int:\n", " \n", " else:\n", " sub_strings.add(sub_str)\n", " ^\n", "\n", " @jit\n", ":1: NumbaWarning: \n", "Compilation is falling back to object mode WITHOUT looplifting enabled because Function \"lempel_ziv_complexity_numba\" failed type inference due to: cannot determine Numba type of \n", "\n", "File \"\", line 9:\n", "def lempel_ziv_complexity_numba(sequence : str) -> int:\n", " \n", " ind : int = 0\n", " inc : int = 1\n", " ^\n", "\n", " @jit\n", "/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:178: NumbaWarning: Function \"lempel_ziv_complexity_numba\" was compiled in object mode without forceobj=True, but has lifted loops.\n", "\n", "File \"\", line 2:\n", "@jit\n", "def lempel_ziv_complexity_numba(sequence : str) -> int:\n", "^\n", "\n", " state.func_ir.loc))\n", "/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:187: NumbaDeprecationWarning: \n", "Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.\n", "\n", "For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit\n", "\n", "File \"\", line 2:\n", "@jit\n", "def lempel_ziv_complexity_numba(sequence : str) -> int:\n", "^\n", "\n", " warnings.warn(errors.NumbaDeprecationWarning(msg, state.func_ir.loc))\n", ":1: NumbaWarning: \n", "Compilation is falling back to object mode WITHOUT looplifting enabled because Function \"lempel_ziv_complexity_numba\" failed type inference due to: non-precise type pyobject\n", "[1] During: typing of argument at (9)\n", "\n", "File \"\", line 9:\n", "def lempel_ziv_complexity_numba(sequence : str) -> int:\n", " \n", " ind : int = 0\n", " inc : int = 1\n", " ^\n", "\n", " @jit\n", "/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:178: NumbaWarning: Function \"lempel_ziv_complexity_numba\" was compiled in object mode without forceobj=True.\n", "\n", "File \"\", line 9:\n", "def lempel_ziv_complexity_numba(sequence : str) -> int:\n", " \n", " ind : int = 0\n", " inc : int = 1\n", " ^\n", "\n", " state.func_ir.loc))\n", "/usr/local/lib/python3.7/dist-packages/numba/object_mode_passes.py:187: NumbaDeprecationWarning: \n", "Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.\n", "\n", "For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit\n", "\n", "File \"\", line 9:\n", "def lempel_ziv_complexity_numba(sequence : str) -> int:\n", " \n", " ind : int = 0\n", " inc : int = 1\n", " ^\n", "\n", " warnings.warn(errors.NumbaDeprecationWarning(msg, state.func_ir.loc))\n" ] }, { "data": { "text/plain": [ "8" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = '1001111011000010'\n", "lempel_ziv_complexity_numba(s) # 1 / 0 / 01 / 1110 / 1100 / 0010" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "43.7 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "%timeit lempel_ziv_complexity_numba(s)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lempel_ziv_complexity_numba('1010101010101010') # 1, 0, 10, 101, 01, 010, 1010" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lempel_ziv_complexity_numba('1001111011000010000010') # 1, 0, 01, 11, 10, 110, 00, 010, 000\n", " 9" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lempel_ziv_complexity_numba('100111101100001000001010') # 1, 0, 01, 11, 10, 110, 00, 010, 000, 0101" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "48.1 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "%timeit lempel_ziv_complexity_numba('100111101100001000001010')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> $\\implies$ Well... It doesn't seem that much faster from the naive Python code.\n", "> We specified the signature when calling [@numba.jit](http://numba.pydata.org/numba-doc/latest/user/jit.html), and used the more appropriate data structure (string is probably the smaller, numpy array are probably faster).\n", "> But even these tricks didn't help that much.\n", "\n", "> I tested, and without specifying the signature, the fastest approach is using string, compared to using lists or numpy arrays.\n", "> Note that the [@jit](http://numba.pydata.org/numba-doc/latest/user/jit.html)-powered function is compiled at runtime when first being called, so the signature used for the *first* call is determining the signature used by the compile function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Tests (2/2)\n", "\n", "To test more robustly, let us generate some (uniformly) random binary sequences." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "from numpy.random import binomial\n", "\n", "def bernoulli(p, size=1):\n", " \"\"\"One or more samples from a Bernoulli of probability p.\"\"\"\n", " return binomial(1, p, size)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1])" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bernoulli(0.5, 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's probably not optimal, but we can generate a string with:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'11110010000011111101'" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "''.join(str(i) for i in bernoulli(0.5, 20))" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "def random_binary_sequence(n, p=0.5):\n", " \"\"\"Uniform random binary sequence of size n, with rate of 0/1 being p.\"\"\"\n", " return ''.join(str(i) for i in bernoulli(p, n))" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'10111010011001100110111110001111100001100111111010'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'00000001000000001000000000001000100100000000000000'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'00011000000000101010110000010011000100001000100001'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'00000111111111000101000100100001100000101100000110'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'11111100111011110010110111111101110011111011111111'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'11111011111110111111111111111111111111111111111110'" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_binary_sequence(50)\n", "random_binary_sequence(50, p=0.1)\n", "random_binary_sequence(50, p=0.25)\n", "random_binary_sequence(50, p=0.5)\n", "random_binary_sequence(50, p=0.75)\n", "random_binary_sequence(50, p=0.9)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And so, this function can test to check that the three implementations (naive, Cython-powered, Numba-powered) always give the same result." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "def tests_3_functions(n, p=0.5, debug=True):\n", " s = random_binary_sequence(n, p=p)\n", " c1 = lempel_ziv_complexity(s)\n", " if debug:\n", " print(\"Sequence s = {} ==> complexity C = {}\".format(s, c1))\n", " c2 = lempel_ziv_complexity_cython(s)\n", " c3 = lempel_ziv_complexity_numba(s)\n", " assert c1 == c2 == c3, \"Error: the sequence {} gave different values of the Lempel-Ziv complexity from 3 functions ({}, {}, {})...\".format(s, c1, c2, c3)\n", " return c1" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sequence s = 11111 ==> complexity C = 2\n" ] }, { "data": { "text/plain": [ "2" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tests_3_functions(5)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sequence s = 11000110100011110101 ==> complexity C = 9\n" ] }, { "data": { "text/plain": [ "9" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tests_3_functions(20)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sequence s = 01000001001110111101111110011100101001011001111001 ==> complexity C = 17\n" ] }, { "data": { "text/plain": [ "17" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tests_3_functions(50)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sequence s = 10110110111101100010011010110001001001101100110000110110010100111001110100001110001111110111100011011011001010000001111011110000101101011110011010111000111111101110111110010100110011000111011101100101101101000000111001010100100010000010011011100111010101001100001110101100101000000111010110011001010000100001111110110010011111100100001000011010011001001010100001111011101101000001100001011101010101100101100011000101101010010101001010011011011010000010100101101000110100100000111001100100110011011101 ==> complexity C = 99\n" ] }, { "data": { "text/plain": [ "99" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tests_3_functions(500)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sequence s = 10100100011000100101101111111001010001101101010100011101100101101100101111100110010110001010010101010000011101110011101111110110010101001011010010101000000111001011001000000101001010111001110000100101001111100001100000010111001100000010100010011010110001111111001111001101001101111111001111111110010110101100110010111001110101101010111010011000010100000000000111010100110000001110010101001101000100100111101100110110111001101011110101110011000001001101100101010100101100101010000100111111110100011101111110001110110011011001110100101000101000110010101010010000100101111010001111000100001110001000101011110111010000000110110000111110101110110010000110110110010110101110100011100010011011100101010001011111101001100100000100001101011010000011010001101101001011001100010101101110011100011111111111001001111100110101011110011100000010010100001110010111011011111011000101111001011101000001100000000101000010010110101010000011100110100101100111110101011010101100111010000100011111000111110001011100010011111110000001010010010000001101101001111110010100000101101111110101110001110001110001000001001011010101000011101000000110001001110001110101110011100111000011001100111011011000111110001111011100001001110011101100001000010010100011001101001001110110101101010111001001001001011000001011001111010001000110111011101101001100100001100001100010001011011010001001000101010001011101011111001010011000010011000110111111100110010101100111010010111101110110111011101110001101001000010010001001111001111111101000101111010101110100011110100001101100000100100010101011001101001111110001111010111101001100101000100100011000100011100101001110110010100011100001011010110110111000100011000101101000010100100101100100001000110010000110100100011010011000100001101011011010111010010110100110011111101011010101011101111111101011111001011010111010100110111001101001101011011001100001110000001011110001010100110101000010100101111000110111001101110000000011010101100010111100111011111100111110001110101010001000101111111110011111100010110010111011110111110111011110001110001101101101111110111100110101000001101111110111100010001110111110001001100011100000010111100100101011001101101011010110110011010110100111001011011011110000000010001110000011111111100011011001001001100100001110011000111110010110100100001011100001001000101000101001001101001000000110001100100001101111101001100100101000011011011111101000110101101101010000110001111011110011111001110111001100111010101010011110111001110111000000110101001010010111101000100100100101001001000111010100010111101011110100100111100110101010011101110011101110110101101111000111001111001010100001111110000100000001011000110001110101000001001001010001001000101110011010100100000110010000011100111001011111001001110010001011100101011101101011101111001000101101101011011110100001101100001011001101011000101000000001000110011011100010011010000111010011100010100010010010101000111101110110101011110100000011011010100011010100101011001100100000000110101011000100111001110001101001101001001110111011100111001001001000101011011111110100100000010111011100000101001001101110100010111100000000010101010000010100110011111001000010000111101111100100001000110000111110011001011011111011101111111011010001101111100010010010010010101000001000011110000000000101110111000001110000000111111011111001000110000000001010000010010110111000100101100111111010011101101111101111111011101101011100000011101111111111001111001100101001011000111010000110001001010111010100011111100011001101011110000100000010001100001000011111111111010111100111001010110000001011111110101000000100101010001001001111011110010011110000001100100110011110110110001110100100001000100001010110011110000100001110111011101101110100001010110111001101101010000011001101001100001111011000100110000101110100011011001000001100100101001110001000010111101110000111100101100011011011100110100001110000010111111001100011010001101110010000011000001111011010010100110101010011011101000111000100011010110101111011011001111100010001010110001111001000001101111010010001001010110111010000011110110101000010011001100100100000111011110001011110111000101001011010100000111111011100100101001001001100000010110110001100101011110001011000000010110111010111000000000110110100010111000111011101001110011101001100011000000101001111000101011101000010111111010100001000100000000110101100010110000111000110001101001010010100111010101000111101000110010100110010001001110110101000010010101001011001100001000100000101110100011000010000100010101100111001010100010111111010110000001000011100110000110110110010111111011010110001011000011011111000001001100100110110001100100110111110000101011000100001000110110100000001101110001100010101001011111111110100010001001001101101101101011000110000011000100010011111000110011011011100000101001001110011010100001011001011110100110001011011010001100111001011001101111110101101011101000111010111000000001010001001011000001001000101001010011010001110000001000010001110011001001110110100011101100110101110001100101010000010001000101101 ==> complexity C = 654\n" ] }, { "data": { "text/plain": [ "654" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tests_3_functions(5000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Benchmarks\n", "\n", "On two example of strings (binary sequences), we can compare our three implementation." ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6.28 µs ± 295 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n", "1.22 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n", "50.1 µs ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "%timeit lempel_ziv_complexity('100111101100001000001010')\n", "%timeit lempel_ziv_complexity_cython('100111101100001000001010')\n", "%timeit lempel_ziv_complexity_numba('100111101100001000001010')" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "25.7 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "5.43 µs ± 442 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)\n", "84.4 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" ] } ], "source": [ "%timeit lempel_ziv_complexity('10011110110000100000101000100100101010010111111011001111111110101001010110101010')\n", "%timeit lempel_ziv_complexity_cython('10011110110000100000101000100100101010010111111011001111111110101001010110101010')\n", "%timeit lempel_ziv_complexity_numba('10011110110000100000101000100100101010010111111011001111111110101001010110101010')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let check the time used by all the three functions, for longer and longer sequences:" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "229 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", "148 µs ± 52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", "123 µs ± 6.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "205 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "288 µs ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", "819 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%timeit tests_3_functions(10, debug=False)\n", "%timeit tests_3_functions(20, debug=False)\n", "%timeit tests_3_functions(40, debug=False)\n", "%timeit tests_3_functions(80, debug=False)\n", "%timeit tests_3_functions(160, debug=False)\n", "%timeit tests_3_functions(320, debug=False)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "def test_cython(n):\n", " s = random_binary_sequence(n)\n", " c = lempel_ziv_complexity_cython(s)\n", " return c" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "23.9 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "40 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "53.4 µs ± 5.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "109 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "207 µs ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", "369 µs ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n" ] } ], "source": [ "%timeit test_cython(10)\n", "%timeit test_cython(20)\n", "%timeit test_cython(40)\n", "%timeit test_cython(80)\n", "%timeit test_cython(160)\n", "%timeit test_cython(320)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "679 µs ± 135 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n", "1.85 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "2.97 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "5.47 ms ± 494 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n" ] } ], "source": [ "%timeit test_cython(640)\n", "%timeit test_cython(1280)\n", "%timeit test_cython(2560)\n", "%timeit test_cython(5120)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10.7 ms ± 891 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n", "20.9 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit test_cython(10240)\n", "%timeit test_cython(20480)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Complexity ?\n", "$\\implies$ The function lempel_ziv_complexity_cython seems to be indeed (almost) linear in $n$, the length of the binary sequence $S$.\n", "\n", "But let check more precisely, as it could also have a complexity of $\\mathcal{O}(n \\log n)$." ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "sns.set(context=\"notebook\", style=\"darkgrid\", palette=\"hls\", font=\"sans-serif\", font_scale=1.4)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import timeit" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "sizes = np.array(np.trunc(np.logspace(1, 6, 30)), dtype=int)\n", "\n", "times = np.array([\n", " timeit.timeit(\n", " stmt=\"lempel_ziv_complexity_cython(random_string({}))\".format(n),\n", " globals=globals(),\n", " number=10,\n", " )\n", " for n in sizes\n", "])" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "